Tutorial 13: The Creation of Toy Models, Energy Landscape Sculpting, and Progress Index-Guided Sampling


In this tutorial, we present a hands-on approach to the use of two advanced sampling techniques. The first is called energy landscape sculpting (ELS) within CAMPARI, and the second is "Progress Index-Guided Sampling" (or PIGS in short). For this, we construct a simple toy model, similar to the one used in the PIGS reference publication, which has the additional benefit of demonstrating the use of some of CAMPARI's less obvious customization options, predominantly custom constraints and tabulated potentials (Step 1). We first demonstrate sampling limitations when using short trajectories obtained with conventional sampling that all start from the same well-defined reference state (Step 2).
Energy landscape sculpting, which is a straightforward generalization of the accelerated molecular dynamics method of Hamelberg et al., achieves a sampling benefit by modifying the potential energy landscape, viz., by decreasing the absolute depths of minima and maxima of one or more individual terms. It is well-suited to scenarios where an individual energy term with well-defined bounds provides barriers that are hindering conformational transitions of interest, e.g., cis-trans isomerization of proline residues by means of torsional potentials. Conversely, the PIGS method is based on the idea that multiple identical copies of a system evolving in parallel can be directed toward as diverse areas of accessible phase space as possible if data from all copies are analyzed jointly. The analysis method of choice is the progress index introduced in Tutorial 11. This method has the ability to identify sampling basins of a system without overlap given a distance metric as input. Consequently, the most critical inputs for any PIGS simulation are keywords FMCSC_CDISTANCE and FMCSC_CFILE. The PIGS algorithm is memoryless across multiple stretches, which means that it does not penalize recurrence to the same places in general. Instead it primarily avoids multiple copies of the system occupying a given region of phase space. The PIGS method uses no energetic bias (unlike energy landscape sculpting or umbrella sampling on specific reaction coordinates, e.g., those measuring secondary structure content as in Tutorial 9). In PIGS, unlike in the replica exchange method, all copies propagate under identical conditions. This property means that PIGS simulations are particularly well-suited for the exploration of systems with many states separated by relatively small energetic barriers.
To keep the tutorial fast and tractable, we use a one-dimensional toy model to demonstrate the virtues of these two methods.


Step 1 - Definition of a one-dimensional toy model
First create an empty directory that we will use as the lone working directory for this tutorial. In order to create a one-dimensional system, we could consider a single particle in an external field. The only external fields CAMPARI currently supports are the soft-wall boundary conditions and the spatial density restraints. Only the latter are externally customizable but this is not very convenient as the entire technology rests on 3D spatial grids. Instead we are resorting here to tabulated potentials applied to interatomic distances. To emulate the one-dimensional case, we will use two particles with one completely frozen and the other restricted to move in a single dimension. With an arbitrary choice of monoatomic molecule, a sequence file is:


Write this file as "toy1D.in". Now let us create a basic key file named "tutorial13.key":

PARAMETERS <full path to folder>/campari/params/oplsaal.prm
FMCSC_ENSOUT 100000000
FMCSC_RSTOUT 100000000
FMCSC_ENOUT 100000000
FMCSC_ACCOUT 100000000
FMCSC_POLOUT 100000000

This creates a droplet system with radius 60Å and turns off any nonbonded interactions. The last 6 lines are to suppress unwanted output below. In the first step, we obtain a PDB file we will edit by hand. Execute the serial version of CAMPARI as follows (here and below, as in all other tutorials, you may have to add paths for your executables):

campari -k tutorial13.key > t13_generate.log
mv yourcalc_END.pdb toy1D.pdb

Now edit the coordinate section of the two HETATM records in the PDB file "toy1D.pdb" such that the first atom is placed on the y-axis at a positive y-value of 3.0 and the second atom exactly at the origin. The resulting lines should look like this:
HETATM 1 NA NA+ 1 0.000 3.000 0.000 1.00 0.00
HETATM 2 NA NA+ 2 0.000 0.000 0.000 1.00 0.00

Note that the PDB file format is completely fixed, i.e., any column shift renders the file incorrect. We here select the y-dimension as the one to use. We thus need to constrain the motion of the first particle in the other two dimensions and freeze the second particle entirely. This is possible by a specific input mode for custom constraints, which addresses constraints individually but does not work for Monte Carlo simulations (yet). The format, interpretations, and limitations are explained in detail elsewhere. To implement our example toy model, the file should look like this:

1 XZ

Paste this into a file called "toy1D.frz". This concludes the definition of the system.
Step 2 - Conventional sampling of the model
We are now in the position to set up the sampler for our system. While it is perfectly possible to run PIGS in conjunction with a Monte Carlo engine, we cannot, as mentioned before, achieve the restriction to a single dimension with custom constraints. Note that for the tutorial to be interesting, we would also have to restrict the Monte Carlo move set to local steps, which should become clear below. We thus choose a stochastic dynamics sampler, i.e., a Langevin integrator. Add the following to the key-file:


This turns on stochastic dynamics in Cartesian space with an intermediate friction value of 3/ps at a time step of 15fs run for 10000 steps (150ps). It also enables the desired constraints (including a report) and starts the simulation from a structure consistent with the dimensionality we want to achieve. Execute:

campari -k tutorial13.key > t13_test.log

Reading through the errors printed in the beginning to "t13_test.log" (before the summary portion), you should only find warnings about unsupported analysis features, which are enabled by default (and which we could adjust explicitly to avoid these warnings). More importantly, a short report should tell you that the constraint input file has most likely been interpreted correctly.
Our system has only a single degree of freedom, viz., the y-position of the 2nd atom, and we would like to bias its distribution using a custom potential. We use tabulated potentials based on interatomic distances for this. CAMPARI will interpolate the potential using cubic Hermite splines. The potential(s) must be read from file (see here for details) as does the file used to map potentials to specific interatomic distances (see here for details). The first derivatives can be read from file (see here for details) or inferred from the potential(s). In this tutorial, we use a potential with 8 states of almost equivalent likelihood, which is an extremely simple model of a system with a rugged free energy landscape without a main attractor basin. Copy the file tutorial13.tpot into the current working directory and plot it to have a better picture of the implied landscape. Importantly, at the default temperature, the large barriers at 0.0 and 50.0Å are practically impenetrable. This simplifies our analysis later because it will restrict the 1st particle to positive y-values (given the starting conformation we constructed). The 8 attractors are more or less evenly spaced (6.25Å) and the 1st particle is initially found in the leftmost one.
The input file for the potential has only a distance column (in Å) and a single column with the potential (in kcal/mol). This means that we have to write the following map file to apply it to the interatomic distance:

1 1
1 2 1

Paste this to a file called "toy1D.map". The first line chooses input mode 1 and tells CAMPARI to expect exactly 1 request. The second line gives the atom indices in question (1 and 2) and the potential to use (by index, 1). If you have not already done so, copy the file with the potential, tutorial13.tpot to the current working directory and then append the key-file as follows:

FMCSC_TABPOTFILE tutorial13.tpot

We right away also begin to monitor our only degree of freedom (with no steps being discarded at the beginning). This could be done be either post-processing trajectory output or by monitoring the distance between the 2 atoms. We choose the latter here. CAMPARI has a versatile utility for distance analysis that uses actually the same input functionality as the map file for the tabulated potentials (keyword FMCSC_PCCODEFILE). The above keywords request CAMPARI to collect values of the distance between the two particles every step (FMCSC_PCCALC) and to report the instantaneous values every 10 steps in an output file called GENERAL_DIS.dat (the spacing of 10 steps is actually obtained as the value of FMCSC_PCCALC·FMCSC_INSTGPC). In order to get a histogram that we can compared directly with the exponential of the potential, we require a further modifications, viz., we need to work around the fact the CAMPARI will normalize distance distributions by the 3D volume element (see details elsewhere), which we need to multiply back in due to the actual dimensionality being 1. To achieve this workaround, we need the volume element that is printed to output file RBC_PC.dat, which in turn requires an additional input file:


Paste this to a file called "toy1D.agp". It defines that our only molecule type should be of type "solute", which is required for some analyses (including the one producing RBC_PC.dat) to become available. Of course, in the simple case here, the histograms from our custom distance analysis and those in RBC_PC.dat are the same because we use monoatomic particles. We can now append the key-file again:


This means that we will obtain the pair correlation function as the third column in output file RBC_PC.dat (bin centers in the first column). To recover the raw histogram, we need to multiply the volume element (second column) back in, e.g., using the tool "awk" (assuming you are on a Linux system).

campari -k tutorial13.key > t13_serial.log
awk '{if ((NR > 1) && ($3 > 0)) print $1,-310.*(8.3145/4184.)*(log($2*$3)-log(0.2))-2.723}' RBC_PC.dat > toy1D.hist

Note that the factor of 2.723 is precalculated from the absolute scale of the potential in tutorial13.tpot and the bin spacing. The factor of log(0.2) is due to the bin spacing of RBC_PC.dat, which is controllable with keyword FMCSC_PCBINSIZE. A visual comparison of columns 2 in both "toy1D.hist" and tutorial13.tpot should reveal a noisy inferred potential that is quite unlikely to extend all the way to the largest distances (when the population histogram is zero, the potential diverges of course). This is a sampling problem, and initial state bias is retained. It is left as an exercise to change the simulation to be longer and reanalyze the data to demonstrate eventual convergence to the expected solution. You should also plot the output in GENERAL_DIS.dat once, which highlights the stepwise dynamics of the particle in jumping between adjacent states.
Step 3 - Improved sampling of the model
If this were not a toy system, the approach to create a single simulation that is long with respect to all relevant time scales is very likely unachievable. It is common instead to run multiple copies of the same system in parallel. It is also common that there is a reference state (e.g., the crystal structure in simulations of folded proteins) that all these copies start from, which we mimic here with the leftmost basin. We resort to the MPI averaging mode of CAMPARI to test this, which requires MPI enabled code (executable campari_mpi, see installation instructions). Add the following to the key-file (and make sure that FMCSC_NRSTEPS is 10000):


Now run (note that you should be able to do this on a Linux desktop even if the system has fewer than 32 cores):

mpirun -np 32 campari_mpi -k tutorial13.key > t13_parallel.log
awk '{if ((NR > 1) && ($3 > 0)) print $1,-310.*(8.3145/4184.)*(log($2*$3)-log(0.2))-2.723}' RBC_PC.dat > toy1D_MPIx32.hist

After completion, first spend some time to inspect the output in N_000.log, especially the part on parallelism. In MPI averaging mode, CAMPARI will automatically collect data from all replicas, the number of which is taken directly from the MPI environment ("-np 32") above. We thus have analyzed 32 times as much data as before. Comparing to the result from the serial run, the new data should show a smoother curve that, however, does not necessarily manage to extend the range further. This is clear if one considers that the answer is equivalent to an average across 32 separate executions of the serial case. This is an important result or caveat that often gets lost in simulations of more complex systems, i.e., naive parallelism can increase the number of states discovered (phase space coverage) only in those regions that are kinetically accessible from the starting state. It is left as another exercise to verify that an average across a large number of runs (whichever way produced) will eventually converge to a well-defined answer that yields a robust illustration of the remaining initial state bias given the choice for FMCSC_NRSTEPS. It is true that increasing the number of replicas (linearly) increases the chance of unlikely events occurring. This is a strategy of limited scope, however, as the "unlikely chance" can be arbitrarily small depending on system and settings. This is particularly troubling if (unlike here) the set of states with significant equilibrium populations is not known a priori.
There are many successful strategies to overcome sampling difficulties, and CAMPARI supports several of them. Here, it would be possible to use pretty much any of them, e.g., umbrella sampling via FMCSC_SC_DREST or replica exchange (see also Tutorial 3 and Tutorial 9). In this tutorial, we will employ two different strategies, one representative each of two broad classes of simulations. The first class are simulations that extract their sampling benefit from the adaptability/selection of the starting conformation such as PIGS (see above and below). The second class are simulations modifying the potential energy landscape, and the method termed originally "accelerated molecular dynamics" belongs to it. In CAMPARI, this is implemented in a generalized form. The basic idea is that by knowledge of the potential energy (or individual terms contributing to it), one can at all times define an analytical transformation that reduces the depths of minima (attractors) and maxima (barriers). A more detailed overview of the theory is provided elsewhere. Add the following to the key-file:


These keywords enable the method for a specific energy term, i.e., joint nonbonded energies. The code for this is 15. We request this term (which here is practically equivalent to the total energy, which means we could also use code 1) to be sculpted such that any values below 0.9kcal/mol are raised by a Δ that is a buffered difference between the actual and the threshold values (buffer parameter of 0.5kcal/mol). Clearly, prior knowledge of the range of energies is required. The last keyword instructs to CAMPARI to write a file with sampling weights for every 5th snapshot. These weights are theoretically (and here also practically) able to restore Boltzmann sampling with respect to the original (unsculpted) energy function from the biased distribution generated by the method. To simplify analysis, also add the following:

FMCSC_MPIAVG_XYZ 1 # for clarity (default)

Now run as before:

rm N_*.*; mpirun -np 32 campari_mpi -k tutorial13.key > t13_ELS_sim.log
awk '{if ((NR > 1) && ($3 > 0)) print $1,-310.*(8.3145/4184.)*(log($2*$3)-log(0.2))-2.723}' RBC_PC.dat > toy1D_ELSx32_biased.hist

Plotting the results thus far in the same graph should clearly highlight the flattened landscape created by the ELS method. The absence of significant barriers implies that it is now much more likely that one or more of these short runs reach the tall barrier at the "right", which is unaffected by our chosen sculpting. It is illustrative in this regard to plot the output in one or more of the files named N_???_GENERAL_DIS.dat. It is precisely this behavior that gave rise to the moniker "accelerated" in the method's title. While the raw distribution is obviously wrong, the weights in output files "N_000_ELS_WFRAMES.dat" and so on are suitable for correcting it. This reanalysis can not currently be done in parallel, so we concatenate the output from the previous run:

cat N_???_*traj.pdb > toy1D_ELS.pdb
cat N_???_ELS_WFRAMES.dat | awk '{print NR,$2;}' > toy1D_ELS.fwts

For analysis, we need a simple key-file as follows:

PARAMETERS <full path to folder>/campari/params/oplsaal.prm
FMCSC_NRSTEPS 64000 # needs to be adjusted if the number of replicas differed
FMCSC_ENSOUT 100000000
FMCSC_RSTOUT 100000000
FMCSC_ENOUT 100000000
FMCSC_ACCOUT 100000000
FMCSC_POLOUT 100000000

The key element here is supplying a file selecting frames for analysis and associating them with floating point weights for CAMPARI's built-in analysis routines. Note that the file here contains all frames in the concatenated pdb trajectory so that the frame selection aspect is not used. Paste the above into a file called "tutorial13_analysis.key", make the necessary adjustments, and run:

campari -k tutorial13_analysis.key > t13_ELS_analysis.log
awk '{if ((NR > 1) && ($3 > 0)) print $1,-310.*(8.3145/4184.)*(log($2*$3)-log(0.2))-2.723}' RBC_PC.dat > toy1D_ELSx32_reweighted.hist

If you now plot the reweighted data as well, it should be obvious that the sculpting method has indeed provided a sampling benefit compared to the naive parallel run with the same resource investment, i.e., the solution is quite close to the correct one, the "right end" is almost certainly reached, and the flatness is gone. Some initial state bias is retained (tilt toward the left). One attractive feature of the sculpting method is that it is relatively cheap and can be implemented at a high level. Unfortunately, in high-dimensional energy landscapes, it is often impossible to sculpt complete energy terms without creating extremely heterogeneous sampling weights. These imply that the statistical "size" of the reweighted solution will be, unlike in the present example, often negligible.
Step 4 - Application of PIGS to the model
In the last stage of this tutorial, we will apply the PIGS advanced sampling method to our 1D toy model. As mentioned above, PIGS is part of a class of advanced sampling methods that create the sampling benefit purely by how initial conditions for (generally short) trajectories are selected. The absence of modifications to the energy landscape means that the only bias present in the final data is initial state bias. In PIGS, a set of copies evolving in parallel is periodically analyzed by the progress index algorithm (see FMCSC_CMODE, option 4 and Tutorial 11) for details. This analysis yields a decision to terminate one ore more copies in favor of more interesting ones with the overall goal of eliminating sampling redundancy.
Setting up a PIGS simulation requires, as the major parameter, a choice of representation, which is trivial here as there is only one degree of freedom. Much of the setup is shared with that of structural clustering. First, modify the key-file to disable energy landscape sculpting:

# FMCSC_SCULPT 15 # when this keyword is not present, the method is disabled

Next, append the key-file as follows:

FMCSC_CDISTANCE 7 # interatomic distance(s)
FMCSC_MPI_PIGS 1 # enables the method
FMCSC_MPI_GOODPIGS 16 # needs to be adjusted if replica number differs

The first keyword defines the measure of conformational distance (normally, for a meaningful selection, an extra input file would be required). The next 2 + 2 keywords set parameters for the purely auxiliary clustering (not discussed further) and control the random search methodology for the approximate progress index, respectively. Finally, the last 4 keywords enable and control the actual PIGS method, which requires a choice of interval for attempting reseedings (FMCSC_REFREQ), a choice of number of copies to declare as "good" or protected in each reseeding attempt (FMCSC_MPI_GODDPIGS), which is often taken as half of the total number of copies, and a choice for how frequently to collect data from each copy (FMCSC_CCOLLECT). With these modifications, run:

rm N_*.*; mpirun -np 32 campari_mpi -k tutorial13.key > t13_PIGS_sim.log
awk '{if ((NR > 1) && ($3 > 0)) print $1,-310.*(8.3145/4184.)*(log($2*$3)-log(0.2))-2.723}' RBC_PC.dat > toy1D_PIGSx32_biased.hist
cat N_???_*traj.pdb > toy1D_PIGS.pdb
mv N_000_PIGSTRACE.dat toy1D_PIGS.trace

By comparing the PIGS-generated distribution to the reference solution in tutorial13.tpot, it should become clear that PIGS by itself approximates the correct distribution quite well. Repeated executions of the above commands should also demonstrate that it is very reliable in finding the rightmost states. In comparison to energy landscape sculpting, it emerges that PIGS creates no discernible overpopulation of barrier regions, which is despite the fact that there are only 8 well-defined states but 32 copies (meaning some overlap in the sampling domains of individual copies is inevitable). Because PIGS strives to minimize sampling redundancy, the system with 8 states of equivalent weights is of course a favorable scenario in terms of bias. In general, the initial state bias incurred by PIGS will be more significant. The PIGS results depend on the chosen parameters of course. While the metric of distance is fixed here, you should try to change the settings for the reseeding interval, the number of protected replicas, and the data collection frequency. For example, a PIGS run can approximate or emulate the naive parallelism from above by appropriate changes in all 3 parameters.
As we saw above and expected by definition, initial state bias cannot be removed by energetic (Boltzmann) reweighting. These types of biases are notoriously difficult to address, and one of the few potentially general strategies rests in Markov state (network) models. Without going into any details (see literature examples for details), we will apply this framework to the PIGS data set generated above. To create the model, calculate its equilibrium population (steady states), and print out snapshot weights, add the following to the file "tutorial13_analysis.key":

FMCSC_CMODE 5 # use tree-based clustering algorithm
FMCSC_CCOLLECT 1 # collect data from every snapshot
FMCSC_CDISTANCE 7 # interatomic distance(s)
FMCSC_TRACEFILE toy1D_PIGS.trace # the PIGS trace (read back in to supply correct connectivity for network)
FMCSC_REPLICAS 32 # number of replicas used in production
FMCSC_RE_TRAJTOTAL 2000 # the total number snapshots per copy (for trace reading)
FMCSC_RE_TRAJSKIP 0 # number of snapshots skipped initially per copy (for trace reading)
FMCSC_RE_TRAJOUT 5 # output interval for snapshots in steps (for trace reading)
FMCSC_CLAGT_MSM 10 # Markov model lag time in trajectory steps
FMCSC_CREWEIGHT 1 # compute equilibrium distribution (steady state) using iterative algorithm

In addition, edit two lines to the following:

FMCSC_PDBFILE toy1D_PIGS.pdb # (use the PIGS data)
# FMCSC_FRAMESFILE # (disabled)

The first 4 added keywords provide the metric and clustering settings as before (for the actual PIGS production). We choose a relatively coarse radius here (2.5Å), which is due to the requirement of obtaining a reasonably converged transition matrix. It means that the reweighting, which is at the resolution of network nodes (clusters), will have limited resolution as well. A PIGS simulation heuristically terminates some copies at well-defined points and reseeds them with configurations from other copies. This leads to a rerouting of snapshot connectivity in the trajectory files that needs to be corrected for when constructing a conformational space network (transition matrix) from the data. This is done by the next 4 keywords. Keyword CLAGT_MSM sets the assumed lag time (implies a sliding window interpretation of the data) for the transition network inference. It is important that the lag time is not too short (in which case the Markov assumption likely breaks down) nor too long (in which case sampling quality on the transition matrix deteriorates as most observed transitions are discarded). Lastly, keyword FMCSC_CREWEIGHT instructs CAMPARI to use an iterative algorithm to compute the Markov state model's steady state given the requested settings. This could also be done with linear algebra routines assuming CAMPARI was linked to the HSL library. For diagnosing the "validity" of the model, the literature is rich with suggestions, heuristic tests, and error analyses none of which are explored here. Run:

campari -k tutorial13_analysis.key > t13_MSM.log
mv yourcalc_FWD.fwts toy1D_PIGS.fwts

It is recommended to take some time to inspect the output in "t13_MSM.log", which documents the clustering results and the subsequent analysis. In particular, you should take note of the number of clusters (should be around 20), and the reported "network imbalance measure", which gives an indication of how far away from meeting the detailed balance condition the inferred network is. Detailed balance is a good indicator of sampling quality as it mandates that each and every pairwise transition that was observed in at least one direction is sampled such that tijπi is equal to tjiπj where t are transition matrix elements (conditional transition probabilities) and π steady state (equilibrium) probabilities. CAMPARI automatically writes a file with snapshot probabilities (which are the ratios of steady state populations and raw sampling of the clusters the respective snapshots belong to). It is identical in format to the ELS output file we used before.
To conclude the tutorial, we need to redo the distance analysis with the calculated weights per snapshot. Edit two lines in the analysis key-file as follows:

FMCSC_CCOLLECT 10000000 # disable this analysis

And finally run:

campari -k tutorial13_analysis.key > t13_PIGS_analysis.log
awk '{if ((NR > 1) && ($3 > 0)) print $1,-310.*(8.3145/4184.)*(log($2*$3)-log(0.2))-2.723}' RBC_PC.dat > toy1D_PIGSx32_reweighted.hist

A comparison of the reweighted and biased PIGS distributions should highlight that the Markov model steady state reweighting is able to recognize that the rightmost state(s) is/are undersampled in the original (biased) data set because of a bias of initial conditions originating from the choice of reference state. The correction may not be quantitatively accurate, however. It is important to realize that this procedure is very fragile. Here, we use the forward time matrix with a specific lag time and clustering resolution. It is left as an exercise to demonstrate that a very large number of alternative settings will not produce a meaningful correction. There are a number of (coupled) reasons for this, but most of them ultimately reduced to a sampling problem at the level of the transition matrix.
A final comparison of all the (at least) 6 answers obtained in this tutorial, viz. from a single replica, from a naive of use of 32 copies, from ELS with 32 copies (biased and Boltzmann-reweighted), and lastly from PIGS (biased and Markov model-reweighted), to the correct solution in should illustrate that both advanced sampling methods provide a benefit for this system given an investment of essentially identical resources, and that energetic biases are more easily removed than initial state biases. While the one-dimensional potential is a terrible model in many respects, it does highlight that the initial state matters, and that the ability to find new states depends on barrier heights, on random chance, and on "distance" from this initial state. Many simulations feature a reference state as an inevitable prerequisite, e.g., simulations of folded proteins. In such a scenario, it is thus extremely difficult to assert that states relevant at the simulations conditions have been sampled exhaustively and that their relative weights have converged.

Design downloaded from free website templates.