Tutorial 19: Understanding Advanced Analyses of Kinetic Networks (Markov State Models)
Preface:
This tutorial uses a very similar setup for creating a model system as Tutorial 13 does, so you should ideally complete that one first (at least the first 2 steps). In Tutorial 13, we demonstrated not only the use of advanced sampling techniques but also, for samplers relying on an unbiased Hamiltonian such as Progress Index-Guided Sampling (PIGS), attempts at reweighting the biased distributions by means of Markov state models. At the very end of Tutorial 11, which is also worth running beforehand, you computed a cut-based (pseudo-)free energy profile, which is one example of an advanced analysis of kinetic networks.
Step-by-Step:
Step 1 - Definition of a one-dimensional toy model
First create an empty directory that we will use as the lone working directory for this tutorial. As in Tutorial 13, we will use two particles with one completely frozen and the other restricted to move in a single dimension. With an arbitrary choice of monoatomic molecule, a sequence file is:
NA+
NA+
END
Write this file as "toy1D.in". Now let us create a basic key file named "tutorial19.key":
PARAMETERS <full path to folder>/campari/params/oplsaal.prm
FMCSC_SEQFILE toy1D.in
FMCSC_SC_IPP 0.0
FMCSC_SIZE 60.
FMCSC_ORIGIN 0. 0. 0.
FMCSC_SHAPE 2
FMCSC_BOUNDARY 4
This creates a droplet system with radius 60Å and turns off any nonbonded interactions. In the first step, we obtain a PDB file we will edit by hand. If you still have the file from Tutorial 13, you can just copy it. If not, execute the serial version of CAMPARI as follows (here and below, as in all other tutorials, you may have to add paths for your executables):
campari -k tutorial19.key > t19_generate.log
mv yourcalc_END.pdb toy1D.pdb
Now edit the coordinate section of the two HETATM records in the PDB file "toy1D.pdb" such that the first atom is placed on the y-axis at a positive y-value of 3.0 and the second atom exactly at the origin. The resulting lines should look like this (spaces might not be preserved exactly by the browser/html):
HETATM    1 NA   NA+     1       0.000   3.000   0.000  1.00  0.00
HETATM    2 NA   NA+     2       0.000   0.000   0.000  1.00  0.00
Note that the PDB file format is completely fixed, i.e., any column shift renders the file incorrect.
We here select the y-dimension as the one to use. We thus need to constrain the motion of the first particle in the other two dimensions and freeze the second particle entirely. This is possible by a specific input mode for custom constraints, which addresses constraints individually. The format, interpretations, and limitations are explained in detail elsewhere. To implement our example toy model, the file should look like this:
A
1 XZ
2 XYZ
Paste this into a file called "toy1D.frz". This concludes the definition of the system.
Step 2 - Setting up the potential and sampling the model
As before, we choose a stochastic dynamics sampler, i.e., a Langevin integrator. Add the following to the key-file:
FMCSC_DYNAMICS 3
FMCSC_CARTINT 2
FMCSC_TIMESTEP 0.015
FMCSC_NRSTEPS 10000000
FMCSC_DISABLE_ANALYSIS 1 # remember to never put this above NRSTEPS
FMCSC_FRICTION 3.
FMCSC_FRZFILE toy1D.frz
FMCSC_FRZREPORT 1
FMCSC_PDBFILE toy1D.pdb
This turns on stochastic dynamics in Cartesian space with an intermediate friction value of 3/ps at a time step of 15fs, run for 10 million steps (150ns). It also enables the desired constraints (including a report) and starts the simulation from a structure consistent with the dimensionality we want to achieve.
Our system has only a single degree of freedom, viz., the y-position of the 2nd atom, and we again use tabulated potentials based on interatomic distances for this. Remember that tabulated potentials are truncated by possible cutoffs, and we must disable them or explicitly adjust the range.
In this tutorial, we use 3 different potentials, each with 8 states of almost equivalent likelihood, but a major energetic barrier splitting the states into 2 groups. The potentials differ in where this major barrier is placed, and the number in the file name indicates after how many of the 8 states (counted from the left) the major barrier sits. Copy the files tutorial19_*.tpot into the current working directory and plot them to have a better picture of the implied landscape.
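For the plotting, any tool works; as a minimal sketch (assuming gnuplot is available and that the first two columns of the .tpot files hold the distance and the potential, respectively), the following overlays all of them:
gnuplot -p -e "plot for [f in system('ls tutorial19_*.tpot')] f using 1:2 with lines title f"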
As in Tutorial 13, we have to write the following map file to apply it to the interatomic distance:
1 1
1 2 1
Paste this to a file called "toy1D.map". The first line chooses input mode 1 and tells CAMPARI to expect exactly 1 request. The second line gives the atom indices in question (1 and 2) and the potential to use (by index, 1). Now append the key-file as follows:
FMCSC_SC_TABUL 0.75
FMCSC_TABCODEFILE toy1D.map
FMCSC_TABPOTFILE tutorial19_4.tpot
FMCSC_TABREPORT 1
FMCSC_PCCALC 1
FMCSC_INSTGPC 100
FMCSC_PCCODEFILE toy1D.map
FMCSC_EQUIL 0
FMCSC_CUTOFFMODE 1 # this disables cutoffs
We right away also begin to monitor our only degree of freedom (with no steps being discarded at the beginning). This information could be extracted from trajectory output, obtained via the Python module, or measured directly by monitoring the distance between the 2 atoms, which is what the distance analysis utility accomplishes. Here, we accumulate the histogram of the distance between the two particles at every step (FMCSC_PCCALC) and report its instantaneous value every 100 steps in an output file called GENERAL_DIS.dat. In order to obtain a histogram that we can compare directly with the exponential of the potential, which is how we assess the sampling here, a further modification is needed, which, as in Tutorial 13, relies on an additional input file:
T
1
Paste this to a file called "toy1D.agp". It defines that our only molecule type should be of type "solute", which is required for some analyses (including the one producing RBC_PC.dat) to become available. We can now append the key-file again:
FMCSC_ANGRPFILE toy1D.agp
This means that we will obtain the pair correlation function as the third column in output file RBC_PC.dat (bin centers in the first column). To recover the raw histogram, we need to multiply the volume element (second column) back in, e.g., using the tool "awk" (assuming you are on a Linux system).
campari -k tutorial19.key > t19_sampling.log
awk '{if ((NR > 1) && ($3 > 0)) print $1,-(310./0.75)*(8.3145/4184.)*(log($2*$3)-log(0.2))-2.723}' RBC_PC.dat > toy1D.hist
Note that the factors in the awk command are precalculated from the setup of the problem. A visual comparison of the second columns of "toy1D.hist" and "tutorial19_4.tpot" should reveal that the balance between the two halves is most likely not perfect, but that both major states are visited, that the substates within each major state are well equilibrated, and that the boundaries are never violated. You should also plot the output in GENERAL_DIS.dat once, which highlights the stepwise dynamics of the particle jumping between adjacent states.
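For a quick visual check (a sketch only, again assuming gnuplot; adjust the column choices if your files differ), you can overlay the reconstructed profile on the input potential and also look at the raw time series:
gnuplot -p -e "plot 'toy1D.hist' using 1:2 with lines title 'reconstructed', 'tutorial19_4.tpot' using 1:2 with lines title 'input'"
gnuplot -p -e "plot 'GENERAL_DIS.dat' using 0:2 with lines title 'instantaneous distance'"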
Step 3 - Manual coarse-graining and file-based clustering
To proceed further, we first discretize the information in GENERAL_DIS.dat by a simple, regular-space binning:
grep -v "#" GENERAL_DIS.dat > b4.dis
awk '{print int($2/1.0)+1}' b4.dis > b4.clu
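As a quick sanity check of the discretization (nothing more than a sketch), you can count how many snapshots fall into each bin; the counts should mirror the shape of the potential, with sparsely populated bins near the barriers:
sort -n b4.clu | uniq -c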
The binning above uses a bin width of 1.0Å, which we will evaluate below regarding its appropriateness. Now create a new key-file, which copies the system information from above but is meant for analysis, specifically for all the analyses associated with the structural clustering facility of CAMPARI. Name this key file "t19_analysis.key":
PARAMETERS <full path to folder>/campari/params/oplsaal.prm
FMCSC_SEQFILE toy1D.in
FMCSC_PDBFILE toy1D.pdb
FMCSC_SC_IPP 0.0
FMCSC_SIZE 60.
FMCSC_ORIGIN 0. 0. 0.
FMCSC_SHAPE 2
FMCSC_BOUNDARY 4
FMCSC_PDBANALYZE 1
FMCSC_NRSTEPS 100000
FMCSC_DISABLE_ANALYSIS 1 # remember to never put this above NRSTEPS
FMCSC_CCOLLECT 1
FMCSC_CMODE 7 # this selects file-based clustering with no knowledge of the original system retained
FMCSC_CLUFILE b4.clu
FMCSC_CRESORT 1 # makes sure cluster indices are always assigned consistently
FMCSC_CDISTANCE 7 # interatomic distance(s)
The setup of this type of run is a bit awkward in CAMPARI because not absolutely everything is disabled, as it is, for example, for the NetCDF data miner highlighted in Tutorial 14. This means that we continue to have to specify a valid system with a compliant PDB file (FMCSC_SEQFILE and FMCSC_PDBFILE) but also FMCSC_CDISTANCE even though we will not extract any geometric features. The second problem is specific to our particular system since the default value for FMCSC_CDISTANCE will throw an error with only two particles defined. By selecting FMCSC_CMODE to be 7, we instruct CAMPARI to read the coarse-grained trajectory in the file provided to FMCSC_CLUFILE and immediately skip out of its normal workflow and just perform graph-based analyses. Because geometrical information is not available, some features, for example certain augmentation strategies for the inferred transition matrix, are not available. These difficulties would be remedied (at cost) by also reading the data (corresponding to FMCSC_CMODE being 6), but we did not save trajectories here.
Now run this once:
campari -k t19_analysis.key > t19_temp.log
From the log-file you should find that we just recapitulated our coarse-graining but with renumbered cluster indices. CAMPARI will always sort clusters by decreasing size, which means that, for all new output files, you will have to map the new indices back to the original ones (we will get to that later). Because there is no information on geometry, the snapshots assigned as centroids will be the first ones encountered for any given cluster, which is a limitation.
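If you want to verify the size-based sorting explicitly (a simple cross-check), compare the bin populations, sorted by decreasing count, with the cluster sizes CAMPARI reports in the log file; the numbers should agree, only the indices differ:
sort -n b4.clu | uniq -c | sort -rg | head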
Step 4 - Understanding (+) and (-) committor probabilities
For the analyses below, we will need to define end states, which we want to be our leftmost and rightmost basins/minima. Here, it is sufficient to find exactly two snapshots corresponding to these, e.g., by choosing the ones closest to the minimum positions:
awk 'BEGIN{x1=1; d1=100.0; x2=1; d2=100.0} {if (($2-3.125)*($2-3.125) < d1) {d1 = ($2-3.125)*($2-3.125); x1=NR}; if (($2-46.875)*($2-46.875) < d2) {d2 = ($2-46.875)*($2-46.875); x2=NR}} END{print x1,x2}' b4.dis
Use the first number being printed as input for keyword FMCSC_INISYNSNAP (our initial state) and the second number as input for keyword FMCSC_ENDSYNSNAP (our end state). We will immediately request that CAMPARI compute the steady state of our transition matrix as well as the so-called committor probabilities. For every bin, the (+) committor measures the probability with which any trajectory will reach the end state before reaching the initial state while the (-) committor measures the (retroactive) probability that a given trajectory came from the initial rather than the end state. Naturally, the (+) committor is 1.0 for the end state and 0.0 for the initial state while the (-) committor is 0.0 for the end state and 1.0 for the initial state (see description of output files PFOLD_PLUS_xxx.dat and PFOLD_MINUS_xxx.dat for further details). Taken together, you should add the following keywords:
FMCSC_TMAT_MD 1 # for clarity: use only forward time
FMCSC_DOPFOLD 1
FMCSC_DOPFOLD_MINUS 1
FMCSC_EIGVAL_MD 2 # get eigenvalues with largest real component
FMCSC_NEIGV 1 # the first one is enough
FMCSC_DOEIGVECT 1 # for the selected eigenvalue, get the eigenvector, which here is the steady state
FMCSC_INISYNSNAP insert value (see above)
FMCSC_ENDSYNSNAP insert value (see above)
Now run again:
campari -k t19_analysis.key > t19_temp.log
If you briefly inspect the file PFOLD_PLUS_FWD.dat that should have been produced, you will most likely note that the values in the second column are either close to 0.0 or close to 1.0. This is a reflection of the effective two-state nature of the problem we created, and we can visualize this easily. To do so, we have to address the mapping problem mentioned before:
awk 'BEGIN{init=0;} {if (($2 == "CLUSTER") && ($3 == "SUMMARY")) {init=1} else if (NF <= 1) {init=0}; if ((init == 1) && (1*$1 == $1)) {print $1,$3}}' t19_temp.log > b4.lst
for i in $(awk '{print $2}' b4.lst); do echo $(head -n ${i} b4.clu | tail -n1); done > b4.remap
for pi in PLUS MINUS; do awk -v fn=b4.remap 'BEGIN {cnt=0; while (getline ln<fn > 0) {cnt=cnt+1; map[cnt] = 1*ln}} {printf("%6d %s\n",map[1*$1],$2)}' PFOLD_${pi}_FWD.dat > PFOLD_${pi}_FWD.remapped; done
The first command simply cuts the list of clusters from the log file, retaining only the centroid snapshot. The second command (very inefficiently) maps each cluster, via its centroid snapshot, to the original bin index. The third line replaces the CAMPARI cluster indices with the original ones. If you plot the two files in the same graph, you should see their complementarity immediately. The graph will also show that the committor probabilities are never close to 0.5 and vary systematically in the expected manner within the two larger states. In translation, this means that there is no intermediate but that the separation of time scales of crossing the different barrier heights is not so large that the smaller barriers can be ignored completely.
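To quantify the complementarity rather than just eyeballing it (a minimal sketch; the sum of the two committors need not be exactly 1.0 in general, but it should be close here), you can do:
paste PFOLD_PLUS_FWD.remapped PFOLD_MINUS_FWD.remapped | awk '{print $1,$2+$4}'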
By definition, even in the absence of an intermediate, there should be a point along the reaction coordinate where the chance of reaching the initial or end state first is exactly even. This is one of the definitions of what a transition state is. Clearly, if we interpolate the committor values, this point will be between clusters 25 and 26. The reason we find no such state is our discretization, which places a bin boundary exactly at the point of highest potential energy. To see the impact of this, redo all of the steps with a modified coarse-graining as follows:
awk '{print int(($2-0.5)/1.0)+1}' b4.dis > b4.clu2
And modify the existing keyword in t19_analysis.key:
FMCSC_CLUFILE b4.clu2
Now, run:
campari -k t19_analysis.key > t19_temp2.log
awk 'BEGIN{init=0;} {if (($2 == "CLUSTER") && ($3 == "SUMMARY")) {init=1} else if (NF <= 1) {init=0}; if ((init == 1) && (1*$1 == $1)) {print $1,$3}}' t19_temp2.log > b4.lst2
for i in $(awk '{print $2}' b4.lst2); do echo $(head -n ${i} b4.clu2 | tail -n1); done > b4.remap2
for pi in PLUS MINUS; do awk -v fn=b4.remap2 'BEGIN {cnt=0; while (getline ln<fn > 0) {cnt=cnt+1; map[cnt] = 1*ln}} {printf("%6d %s\n",map[1*$1],$2)}' PFOLD_${pi}_FWD.dat > PFOLD_${pi}_FWD.remapped2; done
Replotting these files reveals that there is now exactly one cluster that qualifies as a transition state. However, and this is very important to recognize, we have not actually changed the quality of the discretization. The bins are still too big, and the only reason that the committor probabilities of the central cluster are close to 0.5 is that we have chosen a bin that is symmetric around the barrier, which is not something a density-based clustering algorithm would be prone to do. While all of this is fairly obvious in one dimension, the problem with real applications in high-dimensional spaces is that it is absolutely not clear how much of a calculated committor value is the result of a hidden, heterogeneous distribution within a cluster. There are different ways to overcome this: the most general one is to choose a fine enough clustering resolution whenever the data allow it from a statistical point of view. As an exercise, you should redo the workflow one more time with the following coarse-grained trajectory:
awk '{print int($2/0.1)+1}' b4.dis > b4.clu3
Such tests are common to confirm the robustness of quantities extracted from Markov state models. A second test is to change the assumed lag time to something larger, for example 500, which is longer than the time it takes for the particle to move between substates but shorter than the time it takes to move between the two big states. To understand why this compresses the committor values of all states except the boundary ones, recognize that, as the lag time increases, the transition matrix converges toward the steady-state distribution in every row. This means that the starting point is no longer relevant for where the model predicts the system to be at the next iteration. In terms of committor probabilities, it means that we have lost the ability to recognize whether a trajectory will more likely end up in one state than in the other because that point is so far in the future that most correlation is lost. This is not particularly intuitive when looking at committor probabilities per se.
Step 5 - Making use of committor probabilities
Restore the settings to our first, full analysis run above and add the following keyword:
FMCSC_CMSMCFEP 10 # create a cut-based free energy profile based on both committors
The idea of a cut-based free energy profile is to use a meaningful order parameter (here, the committor probabilities) that can split the transition network into two halves: those states closer to the initial state(s), and those closer to the target state(s). From the inferred transition matrix, we can measure the flux across this two-state boundary, arguing that points of low flux are barriers (difficult to cross) while points of high flux are basins (easy to cross). In a perfect application, this should recover the true free energy profile with the main caveat being that points are obtained unevenly because the possible cutting surfaces in the network are a discrete, fixed set. Our one-dimensional toy model is as close to a perfect application as one can get. Run:
campari -k t19_analysis.key > t19_cfeps.log
This generates two new output files: PFOLD_PLUS_FWD_CFEP_0000000X.dat and the same for the (-) committor, and you should check the corresponding documentation to understand what they contain. If you plot both profiles, using column 3 (the integrated probability) on the x-axis and column 5 (the logarithmic transform of the cut value) on the y-axis, you will note that they are practically identical. This is expected since we have set up a symmetric problem. By turning on linear interpolation, you will probably be able to gauge that the profile does in fact resemble the potential, which here is equivalent to the true free energy (remember from above that FMCSC_SC_TABUL 0.75). One difference is that there can be no barrier to the left. This is because the plot interpolates between committor values and not the geometric dimensions of the problem. To show the latter, you would have to remap the clusters to the original index using the information in column 2 (due to the very specific nature of our setup, they will here correspond almost exactly to the information in the first column). A much more general alternative is to annotate the plot with geometric information on top as in this reference, which is slightly more complicated than when it is done for a progress index plot as in Tutorial 11. This is left as an exercise.
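As a plotting sketch (assuming gnuplot and that the numeric suffix of the produced files is 00000001; check the actual file names, in particular that of the (-) committor profile):
gnuplot -p -e "plot 'PFOLD_PLUS_FWD_CFEP_00000001.dat' using 3:5 with linespoints title '(+) CFEP', 'PFOLD_MINUS_FWD_CFEP_00000001.dat' using 3:5 with linespoints title '(-) CFEP'"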
What else can committor probabilities be used for? First and foremost, they are directly related to probability flux so that the reaction rate for the transition between the initial and target states is an analytical function of the committor probabilities. The rate is given as F / (τ Σ_i p_i^(-) π_i), where F is the net reactive flux, π is the vector of steady-state probabilities, which we get from output file EIGENS_FWD.dat, p^(-) refers to the (-) committor, and τ is the chosen lag time. We add the following keyword:
FMCSC_TPTMODE 1 # just request reporting the net flux
Followed by:
campari -k t19_analysis.key > t19_flux.log
awk '{for (k=3;k<=NF;k++) {print $k}}' EIGENS_FWD.dat > b4.ss
denom=$(paste b4.ss PFOLD_MINUS_FWD.dat | awk 'BEGIN{val=0.0} {val=val+$1*$3} END{print val}')
echo "Rate and characteristic time are: $(grep "forward-time reactive flux" t19_flux.log | awk -v dd=${denom} '{print $NF/dd,dd/$NF}')"
The result will be in units of steps^(-1) for the rate and steps for the time, where one step here corresponds to one snapshot in b4.clu, i.e., 100 dynamics steps or 1.5ps. How reasonable are these numbers? One cross-check we can do is to calculate the closely related mean first-passage time to go from our chosen source state (left-hand side) to the target state (right-hand side of our one-dimensional problem). It is formally defined as the average time that passes between last leaving the source state and first entering the target state. Add the following keyword:
FMCSC_MFPT_MATRIX 1 # this also requires LAPACK
And run again:
campari -k t19_analysis.key > t19_mfpt.log
This will produce the full matrix of mean first-passage times between all pairs of states. These are always asymmetric quantities because the probability to enter a state depends on that state's steady-state probability, but the probability to leave does not. The dedicated output file MFPT_MATRIX_fwr_00000001.dat has cluster indices in its first row and column (which should be redundant here). From the file "b4.remap" you should be able to identify the matrix element of interest and find that the value is comparable but not identical to the one computed from the flux (rate). If you reorder the matrix, it will highlight the expected blocky structure dominated by the main barrier.
Step 6 - Finding pathways
A Markov state model, by definition, is memoryless. This means that the probability to reach a new state depends only on the state the system is currently in, not on how it got there. Thus, every time the system is in a state where it has two or more options with similar probabilities, the family of paths the system is on will split into branches of similar probabilities. This makes it extremely unlikely that a Markov state model ever predicts a few dominant pathways unless it has very few states or unless almost all of the rows of the transition matrix have an off-diagonal element that is 1.0 or nearly 1.0. Another way of expressing the same property is that all pathways of appreciable length will have vanishing (joint) probability. This leads to a problem when asking how a system typically moves from state A to state B. For our one-dimensional toy model, this might appear to be a trivial question (it will move "from left to right"), but even this simple system exposes the problems with the general idea quite well.
First, we can ask what actual paths the system took during the original simulation. Run:
awk 'BEGIN{tot=0; tcnt=0;} {if ($1 <= 4) {st=NR;} else if (($1 >= 47) && (st != -1)) {print NR-st; tot=tot+NR-st; tcnt=tcnt+1; st=-1}} END{print tot/tcnt}' b4.clu
This just prints the lengths of all the productive transitions from the source to the target state. If our Markov state model is faithful to the data, we expect that it predicts pathways that have similar properties. This immediately points to one major problem: the space of possible pathways is almost always so large that even simple properties like the distribution of lengths cannot be tested statistically with the, in comparison, vanishingly small number of events observed in a simulation.
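If you would rather see the (sparse) distribution of these lengths than just their average, a small variation of the command above produces a crude histogram (the bin width of 250 snapshots is an arbitrary choice; adjust it to the average you found):
awk 'BEGIN{st=-1} {if ($1 <= 4) {st=NR} else if (($1 >= 47) && (st != -1)) {print NR-st; st=-1}}' b4.clu | awk '{b=int($1/250); cnt[b]++} END{for (k in cnt) {print (k+0.5)*250,cnt[k]}}' | sort -g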
To generate predicted pathways, we can use the synthetic trajectory generation tool. Add to the key-file:
FMCSC_SYNTRAJ_MD 1 # use target and source state
FMCSC_SYNTRAJOUT 0 # only record length statistics
FMCSC_NSYNTRAJS 500
FMCSC_NSYNSNAPS 50000 # capture the tail correctly
This will run 500 synthetic trajectories (random walkers) on the state network, each for a maximum of 50000 steps. The walker is initialized to start in the source state and terminated when it has reached the target state. Do:
campari -k t19_analysis.key
The relevant output will just be printed to the terminal near the end. Repeat this a few times to get a feel for how tight the distribution of lengths is for a network like this. You can also reduce FMCSC_NSYNTRAJS to the same number that was sampled in the simulation and compare the numbers from multiple runs then. At this level of variability, the simulation average will almost certainly be indistinguishable from the one for the synthetic trajectories, providing what is unfortunately only weak support for the Markov state model.
How can the space of possible pathways be reduced, and how can we make sure that we see pathways (or families thereof) that are, relatively speaking, particularly important? In transition path theory, the first step is to restrict ourselves to pathways that are systematically progressive. The net flux of probability for the reaction defined by the source and target states implies that, at the network level, every edge that is on any reactive pathway has only one direction where the flux is positive. If we restrict ourselves to traversing edges only in this direction, the resultant pathways will be streamlined by removing all unproductive detours. They can also be found in a systematic manner that orders them by the reactive flux they carry. Naturally, if the space of possible pathways of this kind is still very large, we cannot expect that the first few pathways describe the flow of the reaction comprehensively. For our one-dimensional system, this all appears unnecessary since we know that the committor values change systematically from left to right and that any productive pathway will also move systematically from left to right. This is a fundamental weakness of what a pathway in a Markov state model is: it is not an intellectual abstraction of the geometry of the problem (which is what it sounds like it should be) but a specific sequence of states that satisfies certain properties.
Change the existing value for FMCSC_TPTMODE and add:
FMCSC_TPTMODE 5 # find k shortest paths
FMCSC_TPTMAXPATHS 1000 # this sets k
Followed by:
campari -k t19_analysis.key > t19_eppstein.log
This will produce an output file called SHORTESTPATHS_FWD.dat. It contains the desired list of pathways in addition to some properties of the pathways. The data in column 5 are of particular interest: this is the net integrated flux. If you check the last line, and you kept the settings as suggested in this tutorial, the value will still be tiny, meaning that 1000 pathways are far from covering the (sub)set of pathways that we should consider. They are the ones carrying the most flux individually, but that distinction is marginal: you should find that the actual flux (column 4) differs by less than an order of magnitude across 1000 paths. To make the file more readable, remap the clusters:
awk -v fn=b4.remap 'BEGIN {cnt=0; while (getline ln<fn > 0) {cnt=cnt+1; map[cnt] = 1*ln}} {for (k=1;k<=NF;k++) {if ((k==2) || (k==3) || (k>5)) {printf("%d ",map[1*$k])} else {printf("%s ",$k)}}; printf("\n");}' SHORTESTPATHS_FWD.dat > Eppstein.remapped
If you inspect or plot them, you will find that the reason for the large diversity is that, at our settings for clustering resolution, saving frequency, and lag time, there are many connections spanning several states forward (and backward). If the graph were perfectly linear instead (every state has at most two connections), there would indeed be exactly one pathway. One counterintuitive property is apparent immediately. You might find that several pathways overshoot the target state in the penultimate step. This is geometrically misleading but a correct result in terms of the network and the definitions of committor probabilities and flux. It highlights that, for any Markov state model, the first and foremost concern should be understanding the consequences of the choices made to construct it.
The data in columns 2 and 3 give the so-called bottleneck edge. One would normally consider the edge with the smallest amount of reactive flux capacity as the bottleneck one. The problem with this definition is that the capacity is determined by all paths rather than a specific one. This approach will thus identify the edge that is "most unique" to a given path, which in general is poorly defined. Here, because our settings allow jumps across several states, the data-derived bottleneck would be any transition with low relative likelihood. Because capacity (net flux) is a function of a difference, the lowest capacity along a pathway can mean different things: in particular, a large gain in committor probability is often offset by a low (predicted) sampling weight and vice versa, leading to values that can be relatively homogeneous. A similar ambiguity is apparent when there is only a single path: all edges are equally unique to this path, the flux is the same throughout, and the bottleneck is formally undefined. This is why CAMPARI instead reports the edge with the lowest absolute sampling weight as the bottleneck edge in SHORTESTPATHS_FWD.dat. From the remapped file, you should find here that, especially for the paths carrying the most flux, the bottleneck is usually the edge spanning the central, larger barrier (between states 25 and 26).
To make sure that this analysis can give the expected results, let us adjust the coarse-graining and repeat the above procedure:
awk '{print int($2/6.25)+1}' b4.dis > b4_coarse.clu
This now uses a bin width that is equivalent to the size of a single (sub-)basin. Based on the results, we would still expect the occasional transition skipping an adjacent state but it should now be rare. Change the key-file:
FMCSC_CLUFILE b4_coarse.clu
And rerun:
campari -k t19_analysis.key > t19_coarse_eppstein.log
awk 'BEGIN{init=0;} {if (($2 == "CLUSTER") && ($3 == "SUMMARY")) {init=1} else if (NF <= 1) {init=0}; if ((init == 1) && (1*$1 == $1)) {print $1,$3}}' t19_coarse_eppstein.log > b4_coarse.lst
for i in $(awk '{print $2}' b4_coarse.lst); do echo $(head -n ${i} b4_coarse.clu | tail -n1); done > b4_coarse.remap
awk -v fn=b4_coarse.remap 'BEGIN {cnt=0; while (getline ln<fn > 0) {cnt=cnt+1; map[cnt] = 1*ln}} {for (k=1;k<=NF;k++) {if ((k==2) || (k==3) || (k>5)) {printf("%d ",map[1*$k])} else {printf("%s ",$k)}}; printf("\n");}' SHORTESTPATHS_FWD.dat > Eppstein_coarse.remapped
If you inspect the remapped output file, you should find that most of the flux is carried by the path that goes exactly in the order of states from 1 to 8 with the bottleneck between 4 and 5. The remaining paths skip 1 (or more) states but, despite being shorter, are clearly identifiable as less relevant. In these, the bottleneck edge will generally be the jump (or one of the jumps), not necessarily the edge spanning the central barrier. This still does not line up with a geometric interpretation of the problem, but at least it is now comprehensible enough. Here, it is possible to interpret a jump as a sort of entropic barrier, but applying this interpretation in general is not possible since this nature is not evident without a full knowledge of the (free) energy surface. Fundamentally, the required information will be contained in the geometric distance associated with an edge, coupled with a metric that maps to a constant rate of diffusion across phase space. There are strategies in use that exploit this, but they are not discussed further here.
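To make the role of the jumps explicit (a small sketch relying on the fact that the visited states are listed from column 6 onward in the remapped file), you can count, for every path, how many of its edges skip at least one state; this prints the path rank, the flux it carries (column 4), and the number of skipping edges, and the top path should have none:
awk '{skips=0; for (k=7;k<=NF;k++) {d=$k-$(k-1); if (d < 0) {d=-d}; if (d > 1) {skips=skips+1}}; print NR,$4,skips}' Eppstein_coarse.remapped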
Step 7 - Trying to find representatives of clusters of pathways
Since the Eppstein decomposition will realistically never converge unless an extreme coarse-graining is attempted, and since the strategies for correct coarse-graining of this type are manifold and difficult (unless the problem is a toy model as chosen here), one might ask the question whether pathways can be discovered that are actually representatives of families of pathways. The classic approach in the literature following Berezhkovskii et al. is to lump pathways by the flux capacity of the edge carrying the least amount of flux. This is the aforementioned "most unique" edge of the pathway. To do so, the flux capacity of that edge is removed from the entire path, and a new flux network is analyzed, proceeding iteratively until the total flux is accounted for. In doing so, a single pathway (that with the most reactive flux) becomes a representative for all pathways using that edge. It is generally very hard to understand the hidden diversity in the pathways, which is one of the fundamental drawbacks of the approach. A second major drawback is the aforementioned problem that the edge of least flux is not necessarily reflecting the geometric nature of the problem. To use the method of Berezhkovskii et al., change/append the key-file:
FMCSC_TPTMODE 2 # change existing keyword
FMCSC_TPTMAXPATHS 100000 # change existing keyword
FMCSC_TPTMAXFRACTION 1.0 # fraction of integrated flux to find
Followed by:
campari -k t19_analysis.key > t19_coarse_mode2.log
awk -v fn=b4_coarse.remap 'BEGIN {cnt=0; while (getline ln<fn > 0) {cnt=cnt+1; map[cnt] = 1*ln}} {for (k=1;k<=NF;k++) {if ((k==2) || (k==3) || (k>5)) {printf("%d ",map[1*$k])} else {printf("%s ",$k)}}; printf("\n");}' SHORTESTPATHS_FWD.dat > Mode2_coarse.remapped
If you compare the remapped output files side-by-side ("Mode2_coarse.remapped" and "Eppstein_coarse.remapped"), you should find that the first path is the same but now accounts for an even larger fraction of the flux. This is because it effectively represents all paths that share the edge with the lowest flux. From knowing the committor values and the transition matrix, it is possible to recapitulate which paths this encompasses, but the number can be combinatorially large in real examples. The problems start with the next iteration. Specifically, the flux network analyzed in subsequent steps has zero flux on the bottleneck edge. The main issue with interpretability comes from the fact that it also has close-to-zero flux on all other edges of the pathway because, in this example, the flux values do not vary strongly between edges on the same path. This means that the remaining flux is attributed to pathways that can be of marginal importance in the correct (Eppstein) result. These later pathways are characterized more by having comparatively low overlap in the edges traversed with the ones discovered before than by carrying comparatively high amounts of flux. In the concrete example, you should find that the pathways ranked 2 and onward in the Eppstein case are unlikely to be ranked near the top (if they exist at all) in the few remaining pathways in the Berezhkovskii approach.
CAMPARI offers different strategies to mitigate this, and two are discussed here. First, we can use a relatively simple modification of updating the flux network by removing carried flux rather than minimum capacity. This should, in theory, recover a result similar to Eppstein.
To use this method, change the key-file:
FMCSC_TPTMODE 4
FMCSC_TPTMAXFRACTION 0.999999 # fraction of integrated flux to find (using exactly 1.0 here can cause underflow)
Followed by:
campari -k t19_analysis.key > t19_coarse_mode4.log
awk -v fn=b4_coarse.remap 'BEGIN {cnt=0; while (getline ln<fn > 0) {cnt=cnt+1; map[cnt] = 1*ln}} {for (k=1;k<=NF;k++) {if ((k==2) || (k==3) || (k>5)) {printf("%d ",map[1*$k])} else {printf("%s ",$k)}}; printf("\n");}' SHORTESTPATHS_FWD.dat > Mode4_coarse.remapped
awk 'BEGIN{all=0.0;} {key=""; for (k=6;k<=NF;k++) {key=key "_" $k;}; if (val[key]) {val[key] = val[key] + $4;} else {val[key] = $4;}; all=all+$4;} END {for (k in val) {print val[k],all,k}}' Mode4_coarse.remapped | sort -rg | sed -e "s|_| |g" | awk 'BEGIN{run=0.0;}{run=run+$1; printf("%12.6g %12.6g",$1,0.999999*run/$2); for (k=3;k<=NF;k++) {printf(" %d",$k)}; printf("\n");}' > Mode4_coarse.recombined
In the two awk commands, we not only remap the clusters but subsequently recombine increments that are for the same path. It is a natural weakness of any decomposition approach that subtracts flux from the network without zeroing out any edge that the exact same path can be rediscovered. Since the carried flux (a path quantity) and the capacity (an edge quantity) are not the same (the former being much smaller in general), this also happens here. Comparing the three solutions obtained thus far for the coarse clustering, you should find that all or almost all of the paths in Eppstein also appear in "Mode4_coarse.recombined", that the flux for the top path is different in all 3 (the one in "Mode4_coarse.recombined" being somewhere in between the other two), and that the order of subsequent paths is heavily altered: Eppstein will favor long paths because it correctly recognizes the cumulative decrease in the likelihood of jumping states whereas short paths are further up in the other two. Evidently, the subtraction of the carried flux can overcome the path lumping (approximate Eppstein better) but with different flux contributions, which are hard to interpret, as for FMCSC_TPTMODE being 2.
The second approach, which can be combined with the first, is to update the flux network by subtracting only a fraction of the bottleneck capacity rather than all of it, as in Bacci et al. To use this method, change/append the key-file:
FMCSC_TPTMODE 3 # change existing keyword
FMCSC_TPTMAXPATHS 100000 # change existing keyword
FMCSC_TPTBTNFRACTION 0.5 # fraction of capacity to remove
Followed by:
campari -k t19_analysis.key > t19_coarse_mode3.log
awk -v fn=b4_coarse.remap 'BEGIN {cnt=0; while (getline ln<fn > 0) {cnt=cnt+1; map[cnt] = 1*ln}} {for (k=1;k<=NF;k++) {if ((k==2) || (k==3) || (k>5)) {printf("%d ",map[1*$k])} else {printf("%s ",$k)}}; printf("\n");}' SHORTESTPATHS_FWD.dat > Mode3_coarse.remapped
awk 'BEGIN{all=0.0;} {key=""; for (k=6;k<=NF;k++) {key=key "_" $k;}; if (val[key]) {val[key] = val[key] + $4;} else {val[key] = $4;}; all=all+$4;} END {for (k in val) {print val[k],all,k}}' Mode3_coarse.remapped | sort -rg | sed -e "s|_| |g" | awk 'BEGIN{run=0.0;}{run=run+$1; printf("%12.6g %12.6g",$1,0.999999*run/$2); for (k=3;k<=NF;k++) {printf(" %d",$k)}; printf("\n");}' > Mode3_coarse.recombined
You will most likely obtain results that are similar to the previous case, i.e., all or almost all of the Eppstein paths are present but with altered weights and with short pathways unduly favored. To our knowledge, there currently is no straightforward way to meaningfully and implicitly group pathways and retain authentic flux information, as the above illustrates. As explained above, an explicit clustering of pathways after running Eppstein is hindered by the fact that not enough pathways can be generated to capture a meaningful amount of the net flux. The methods defined above apply best in the case where the flux is carried in largely independent channels and the aim is to recognize these (few) independent channels. Take some time to explore the other options CAMPARI offers and, especially, to redo this for the original clustering (bin size of 1.0Å) to get a feel for the difficulty in interpreting these results.
Quickly run through the data generation and Eppstein analysis using the coarse clustering with the following change to the key-file:
FMCSC_TABPOTFILE tutorial19_6.tpot
Remember to change the snapshots for FMCSC_INISYNSNAP and FMCSC_ENDSYNSNAP as well as the names of all derived input files throughout. This potential, as is evident if you plot it, differs in that the barrier is closer to the right end. The expectation is therefore that the bottleneck node in the path analysis will simply shift to the right and that the two (super-)states no longer have equal weights. Only the first aspect will be evident from the committor and path analyses while the second will show, for example, in an asymmetry in the net rate and the mean first-passage times as you swap end and initial states.
Step 8 - Changing the model
In the final part, we utilize a toy model that has explicit pathways built into it. Specifically, we use a pair of φ- and ψ-angles in alanine dipeptide with a customized potential energy surface that you can find in this file. Visualize this file to see that there are two energetically favored states with 3 distinct pathways connecting them. Setting up a sampling run for this model is best done from scratch. Put a sequence file "tutorial19_2D.seq" as:
ACE
ALA
NME
END
And a constraint file "tutorial19_2D.frz" as:
M
1 RBC
Then populate a key-file (explanations embedded) called t19_2D.key:
PARAMETERS <full path to folder>/campari/examples/tutorial19/tutorial19.prm
FMCSC_CMAPDIR <full path to folder>/campari/examples/tutorial19 # CAMPARI expects to find the CMAP inputs here
FMCSC_SEQFILE tutorial19_2D.seq
FMCSC_SC_IPP 0.0
FMCSC_SC_BONDED_M 0.85 # turns on CMAP with the desired file (mapping through parameter file)
FMCSC_SIZE 100.
FMCSC_DYNAMICS 2
FMCSC_CARTINT 1 # now we are using internal coordinate space dynamics (otherwise not really 2D)
FMCSC_TIMESTEP 0.01 # 10fs
FMCSC_TMD_UNKMODE 0 # no extra degrees of freedom needed (would be uncoupled in any case)
FMCSC_TSTAT 2 # Andersen is the best choice here
FMCSC_NRSTEPS 11000000 # the system spends a lot of time in the boundary states
FMCSC_DISABLE_ANALYSIS 1
FMCSC_FRZFILE tutorial19_2D.frz
FMCSC_TOROUT 10 # to obtain instantaneous features
FMCSC_ANGCALC 1 # to get the sampled probability surface
FMCSC_EQUIL 100000 # to safely discard the initial search for the valleys
Followed by:
campari -k t19_2D.key > t19_2D_sampling.log
mv FYC.dat t19_2D.fyc; mv RAMACHANDRAN.dat t19_2D.rama
If you visualize the negative logarithm of the data in "t19_2D.rama", it should look quite similar to the potential. Note that, as described elsewhere, this file carries an extra row and column (for historical/technical reasons) that are empty because 360° is divisible by our implicit bin size of 10°. The visualization will most likely also reveal that our pseudo-particle occasionally escapes the favorable parts of the map and takes shortcuts. Here, this is naturally a function of the chosen temperature (indirectly through FMCSC_SC_BONDED_M). Based on the instantaneous data, we can create a coarse-grained trajectory analogously to before:
awk '{b1 = int(($3+180.0)/24.0)+1; b2 = int(($4+180.0)/24.0)+1; print (b1-1)*15+b2}' t19_2D.fyc | tail -n 1000000 > tutorial19_2D.clu
This applies a regular-space binning with a generous bin-size of 24° in both directions. From here on, the analysis will look very similar to steps 5-7 above since this analysis was agnostic to the geometric nature of the problem.
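Before analyzing, it can be worthwhile to check how many of the 15x15 = 225 possible bins are actually populated and how uneven the populations are (just a sanity check):
sort -n tutorial19_2D.clu | uniq -c | sort -rg | head
sort -n tutorial19_2D.clu | uniq | wc -l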
We can edit our existing key-file for analysis, t19_analysis.key:
FMCSC_NRSTEPS 1000000 # change existing keyword
FMCSC_CLUFILE tutorial19_2D.clu # change existing keyword
FMCSC_TPTMODE 5 # try different approaches, starting with Eppstein
FMCSC_TPTMAXPATHS 100000 # change existing keyword
FMCSC_INISYNSNAP change value to first instance of cluster 37 (see the helper commands below)
FMCSC_ENDSYNSNAP change value to first instance of cluster 154
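A minimal way to obtain the two snapshot indices requested above (assuming the cluster numbering produced by the binning command) is to find the first occurrence of each cluster in the coarse-grained trajectory:
awk '$1 == 37 {print NR; exit}' tutorial19_2D.clu
awk '$1 == 154 {print NR; exit}' tutorial19_2D.clu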
Running the analysis and remapping the decomposed pathways is straightforward:
campari -k t19_analysis.key > t19_2D.log
awk 'BEGIN{init=0;} {if (($2 == "CLUSTER") && ($3 == "SUMMARY")) {init=1} else if (NF <= 1) {init=0}; if ((init == 1) && (1*$1 == $1)) {print $1,$3}}' t19_2D.log > tutorial19_2D.lst
for i in $(awk '{print $2}' tutorial19_2D.lst); do echo $(head -n ${i} tutorial19_2D.clu | tail -n1); done > tutorial19_2D.remap
awk -v fn=tutorial19_2D.remap 'BEGIN {cnt=0; while (getline ln<fn > 0) {cnt=cnt+1; map[cnt] = 1*ln}} {for (k=1;k<=NF;k++) {if ((k==2) || (k==3) || (k>5)) {printf("%d ",map[1*$k])} else {printf("%s ",$k)}}; printf("\n");}' SHORTESTPATHS_FWD.dat > Decomposition_2D.remapped
awk -v wh=1 '{if (NR == wh) {for (k=6;k<=NF;k++) {b1=int(($k-1)/15); b2=$k-15*b1; print -180.0+(b1+0.5)*24.0,-180.0+(b2-0.5)*24.0}}}' Decomposition_2D.remapped > phipsi_pathway.dat
The last line, for our specific binning, transforms the remapped (original) cluster numbers back to the associated φ- and ψ-values. These can then be plotted, for example, over the image plot of the negative logarithm of the Ramachandran map in "t19_2D.rama". This example works relatively well in terms of accounting for a large amount of the flux in relatively few paths, even with Eppstein. Try to recapitulate the flux through the 3 intended channels (and the unintended one consisting of shortcuts) by finding a way to group pathways from Eppstein, by counting what happened in the trajectory, and, lastly, by seeing what other choices for FMCSC_TPTMODE give you.
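As a final plotting sketch (gnuplot assumed once more; properly overlaying the path on the map itself requires handling the extra row/column of "t19_2D.rama" and is not shown here):
gnuplot -p -e "set xrange [-180:180]; set yrange [-180:180]; plot 'phipsi_pathway.dat' using 1:2 with linespoints title 'top path'"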