Tutorial 11: Structural Clustering and Related Methods in CAMPARI
Preface:
This tutorial can be viewed as a follow-up to Tutorial 10, as it covers the analysis of a trajectory with rather specific tasks in mind. Specifically, here we wish to run clustering and related algorithms on a trajectory of a toy system (n-butane) exhibiting clear and intuitive state partitioning.Step-by-Step:
Step 1 - Generating a suitable trajectory
The molecule under study, n-butane, is so small that it is feasible and convenient to generate a statistically meaningful trajectory within this tutorial. Since some of the algorithms tested here respond to kinetics, we will obtain a stochastic dynamics trajectory in torsional space via the following key-file:PARAMETERS <full path to folder>/campari/params/oplsaal.prm
FMCSC_SEQFILE nbu.in
FMCSC_BASENAME NBU
FMCSC_SC_IPP 0.0
FMCSC_SC_BONDED_T 1.0
FMCSC_DYNAMICS 3
FMCSC_FRICTION 3.0 # 1/ps
FMCSC_TIMESTEP 0.005 # ps
FMCSC_TEMP 400.0 # K
FMCSC_NRSTEPS 10000000
FMCSC_EQUIL 0
FMCSC_XYZOUT 100
FMCSC_XYZPDB 3
FMCSC_TOROUT 100
FMCSC_COVCALC 20000000 # disable
FMCSC_SAVCALC 20000000 # disable
FMCSC_POLCALC 20000000 # disable
FMCSC_RHCALC 20000000 # disable
FMCSC_INTCALC 20000000 # disable
FMCSC_POLOUT 20000000 # disable
FMCSC_ENSOUT 20000000 # disable
FMCSC_ENOUT 20000000 # disable
FMCSC_RSTOUT 20000000 # disable
After copying over the sequence file to an empty working directory, and after pasting the above into a key-file "simul.key", you should be able to run:
campari -k simul.key >& tutorial11_simul.log
mv NBU_traj.dcd NBU.dcd
mv FYC.dat NBU.fyc
This generates a dcd-trajectory of 50ns length with structures saved every 0.5ps (100000 snapshots). The degrees of freedom are the central torsion angle (C-C-C-C) and the two methyl spins (all-atom representation), which are all independent of one another on account of the Hamiltonian chosen (purely torsional potentials). The simulation itself should take no longer than a few minutes on a modern desktop machine. Aside from the trajectory ("NBU.dcd"), an auxiliary output file is produced listing the actual values for the three torsional angles for every trajectory snapshot in columns 3-5 (renamed to NBU.fyc).
Step 2 - Simple Clustering
Clustering, i.e., the grouping of similar objects (here, trajectory snapshots that represent molecular conformations) into sets, is a task that - without additional constraints - lacks universal criteria of optimality or appropriateness, and therefore requires considerable user input. In CAMPARI, clustering is performed as a post-processing operation on selected features of a trajectory extracted during an analysis run (this example) or during a simulation (feature extraction on-the-fly). This means that there are some possible memory limitations on very large data sets (CAMPARI does not yet support stream-like handling of trajectories within the clustering routines), e.g., when extracting 107 instances of 100 double precision numbers each from a trajectory, the memory requirement already has a lower bound of 8 · 109 bytes (8GB).The most basic keywords required for an analysis run with structural clustering here are as follows:
FMCSC_SEQFILE nbu.in
PARAMETERS <full path to folder>/campari/params/oplsaal.prm
FMCSC_PDBANALYZE 1
FMCSC_DCDFILE NBU.dcd
FMCSC_PDB_FORMAT 4
FMCSC_SC_IPP 0.0
FMCSC_NRSTEPS 100000
FMCSC_EQUIL 0
FMCSC_CCOLLECT 1 # use every snapshot
You could put the above in a new key-file, and it should run through fine. However, default algorithm, feature selection, and parameters are not appropriate for this system, which is why the results are most likely meaningless. Instead, we first add specific requests:
FMCSC_CMODE 5 # use tree-like algorithm
FMCSC_CDISTANCE 1 # use dihedral angles to represent each snapshot
FMCSC_CRADIUS 60.0 # threshold radius (in degrees) of clusters
FMCSC_BIRCHHEIGHT 8 # tree height
FMCSC_CMAXRAD 120.0 # coarsest threshold level in tree
Here, we select the tree-like algorithm of Vitalis and Caflisch via keyword FMCSC_CMODE. Details are provided in the corresponding documentation and of course in the reference publication. As a choice of representation ("feature selection"), we pick all dihedral angles of the system (the natural degrees of freedom for our sampling engine here). By supplying both FMCSC_CDISTANCE and an auxiliary input file for FMCSC_CFILE, it is possible to select subsets of features (for example, for n-butane one could cluster purely based on the central torsion angle). How this is done in detail is explained elsewhere. The last two options control parameters of the tree-based algorithm, i.e., its height (→ FMCSC_BIRCHHEIGHT) and a coarse threshold radius that should correspond roughly to the maximum pairwise distance observed between snapshots (→ FMCSC_CMAXRAD). Finally, the choice for FMCSC_CRADIUS sets a threshold radius parameter for the final clusters obtained (all algorithms in CAMPARI currently offer control over cluster size rather than cluster number). Here, the chosen value of 60.0° corresponds to the expected results for this system. Specifically, the torsional potentials partition every dihedral angle into three well-defined states spaced 120° apart. This means that we expect 33=27 well-defined states overall with 9 of those being more likely than the other 18 due to the central torsion angle energetically favoring the anti state over the two gauche states.
Before running the clustering, it is convenient to disable some analyses and outputs turned on by default:
FMCSC_XYZOUT 20000000 # disable
FMCSC_TOROUT 20000000 # disable
FMCSC_COVCALC 20000000 # disable
FMCSC_SAVCALC 20000000 # disable
FMCSC_POLCALC 20000000 # disable
FMCSC_RHCALC 20000000 # disable
FMCSC_INTCALC 20000000 # disable
FMCSC_POLOUT 20000000 # disable
FMCSC_ENSOUT 20000000 # disable
FMCSC_ENOUT 20000000 # disable
FMCSC_RSTOUT 20000000 # disable
After pasting all the above keywords into a key-file "clustering.key", you can run:
campari -k clustering.key >& tutorial11.log
This run should finish within seconds due to the computational efficiency of the algorithm we picked and due to the low dimensionality of representation. You should examine "tutorial11.log" in some detail, as it lists valuable information regarding the clustering. In particular, for each cluster, its number of snapshots, a putative central snapshot, and some information regarding size are provided. If our conjecture regarding the system is correct, and if the cluster settings are appropriate, the first 9 clusters should correspond to all possible states with the limitation that the central torsion angle be in anti. To check whether this is the case, we can utilize the output in NBU.fyc and tutorial11.log as follows:
export SHOW_N=9; SHOW_Np1=$(echo "${SHOW_N}+1" | bc); for i in `grep -F 'CLUSTER SUMMARY' -A ${SHOW_Np1} tutorial11.log | tail -n ${SHOW_N} | awk '{print $3}'`; do j=`echo -e "$i + 1" | bc`; head -n $j NBU.fyc | tail -n 1 | awk '{printf("Step %10i: C1C2C3C4: %10.3f C3C2C1H11: %10.3f C2C3C4H41: %10.3f\n",$1/10,$2,$3,$4)}'; done
This identifies the representative snapshots for the 9 largest clusters (clusters are ordered by decreasing size) and finds the corresponding values for the three torsion angles in NBU.fyc printing them in user-friendly fashion. The above command should (at least in most cases) indeed yield the expected result, i.e., values should be close to the energetic minima (-60°, 60° and 180°), and the "C1C2C3C4" values should all be close to 180° (note that torsion angles are defined over the interval [-180°180°], which implies that values close to either -180° or 180° are technically adjacent). The tree-based algorithm is particularly appropriate here, because it effectively chooses cluster centroids based on local data density (similar to what a hierarchical scheme would do). The partitioning inherent to the system should also be apparent in the cluster sizes, i.e., tutorial11.log should show relatively clearly that the first 9 clusters are significantly larger than the rest. You should consider investigating whether the mapping between largest clusters and expected states continues to hold even for those states where the central torsion is not in anti (it usually works for most if not all of the 27 largest clusters).
This concludes a first structural clustering on a toy system yielding interpretable results.
Step 3 - Additional Output
The analysis run in the previous step should have produced some additional output files, specifically a VMD visualization script, which will not be discussed further here, and two other files, viz., STRUCT_CLUSTERING.clu and STRUCT_CLUSTERING.graphml. The former is an annotated trajectory specifying for each snapshot the number of the cluster it belongs to (numbering as in tutorial11.log). In other words, it provides a coarse-grained mesostate trajectory from which a network model can be constructed. Such a network model or graph is indeed what is provided in the latter file, and you can load and visualize it like any other graph in .graphml format (using packages such as Gephi or Visone). Based on these output files, you could perform a Markov model analysis in CAMPARI: there is not only an option to directly read the coarse-grained trajectory back in (→ FMCSC_CLUFILE via option 6 or 7 for FMCSC_CMODE) but of course also to do it in one shot. You could, for example, compute committor probabilities (see FMCSC_DOPFOLD) between (groups of) reference states and subsequently use this information to construct a cut-based (pseudo-)free energy profile.Step 4 - Tuning Options and Choices for Clustering
Here we will use RMSD of the Cartesian coordinates of the carbon atoms to measure distance between snapshots. First, change the value for FMCSC_CDISTANCE to 5, which selects RMSD (pairwise alignment is turned on by default). Atom selections are passed to CAMPARI via an additional input file containing the atom indices. To do so, write the numbers 1, 2, 3, 4 to "carbons.lst" (one number per line), and add the following keyword:FMCSC_CDISTANCE 5 # RMSD of atomic coordinates (in Angstrom)
FMCSC_CFILE carbons.lst # atom selection for clustering
Note that atom numbers have to be given according to CAMPARI's numbering scheme, which can be found in "yourcalc_END.pdb" and might differ from the scheme used by other software. Keep that in mind when analyzing trajectories not generated with CAMPARI. Since we changed the distance function, adjust the threshold parameters as follows:
FMCSC_CRADIUS 0.02 # threshold radius (in Angstrom) of clusters
FMCSC_CMAXRAD 1.0 # coarsest threshold level in tree
You can now run
campari -k clustering.key >& tutorial11.log
and examine the resulting output. Note that the output files from Step 2 are overwritten. Because we selected the carbon atoms only for the RMSD computation, we do not expect to observe the same state partitioning as before. Specifically, the information on the methyl groups is lost (cluster centroid positions for these two torsion angles should approximately appear as if drawn randomly from the underlying distribution).
An additional clustering method offered by CAMPARI is hierarchical clustering. Copy "clustering.key" to "hierachicalClustering.key" and change the clustering specific keywords to
FMCSC_CMODE 3 # use hierarchical clustering
FMCSC_CDISTANCE 1 # use dihedral angles to represent each snapshot
FMCSC_CRADIUS 30.0 # threshold radius (in degrees) of clusters
FMCSC_CMAXRAD 30.0 # threshold radius (in degrees) for truncated leader algorithm used for preprocessing
FMCSC_CCUTOFF 60.0 # cutoff distance for neighbor list
FMCSC_CLINKAGE 3 # linkage criterion (3 = mean)
Do not forget to remove the entry for FMCSC_CFILE. In theory, hierarchical clustering requires all the N(N-1)/2 pairwise distances. To speed up the algorithm, the following heuristics are used by CAMPARI. First the data are preprocessed by leader clustering with cluster size set by FMCSC_CMAXRAD. Based on that, truncated neighbor lists are computed with a truncation cutoff set by FMCSC_CCUTOFF. This cutoff has to be at least twice as large as the target threshold radius for the clustering (→ FMCSC_CRADIUS). During the last stage of the algorithm the chosen linkage criterion is also a critical setting for efficiency (maximum linkage is very slow). Since the algorithm is not scalable, we subsample the trajectory and set:
FMCSC_CCOLLECT 10 # use only every 10th snapshot
You can now run:
campari -k hierarchicalClustering.key >& tutorial11_hierarchicalClustering.log
As with the tree-based algorithm discussed in Step 2, a cluster summary is written to the log file and the same additional output files are produced (STRUCT_CLUSTERING.vmd, STRUCT_CLUSTERING.clu, and STRUCT_CLUSTERING.graphml). There may be an additional output file (FRAMES_NBL.nc) that is a representation of the mutual neighbor relationships amongst snapshots (binary). You should perform the same tests as above to verify that the clustering is meaningful:
export SHOW_N=9; SHOW_Np1=$(echo "${SHOW_N}+1" | bc); for i in `grep -F 'CLUSTER SUMMARY' -A ${SHOW_Np1} tutorial11_hierarchicalClustering.log | tail -n ${SHOW_N} | awk '{print $3}'`; do j=`echo -e "10*$i + 1" | bc`; head -n $j NBU.fyc | tail -n 1 | awk '{printf("Step %10i: C1C2C3C4: %10.3f C3C2C1H11: %10.3f C2C3C4H41: %10.3f\n",$1/100,$2,$3,$4)}'; done
Note that the command is slightly modified compared to above. In particular, the altered setting for FMCSC_CCOLLECT must be reflected appropriately. If not, the centroid positions extracted from "NBU.fyc" will be incorrect. Even though data density is reduced, the state partitioning will usually be superior to the one obtained with the tree-based algorithm. Unfortunately, the lack of scalability of strict hierarchical clustering schemes allows them to be used only on small data sets.
Step 5 - Advanced Algorithms
Here we apply the progress index method of Blöchliger et al. to study the metastable states sampled in our simulation. We start with a new key-file from scratch. As for the case of simple clustering, the following basic keywords are required:FMCSC_SEQFILE nbu.in
PARAMETERS <full path to folder>/campari/params/oplsaal.prm
FMCSC_PDBANALYZE 1
FMCSC_DCDFILE NBU.dcd
FMCSC_PDB_FORMAT 4
FMCSC_SC_IPP 0.0
FMCSC_NRSTEPS 100000
FMCSC_EQUIL 0
FMCSC_CCOLLECT 1 # use every snapshot
Again, we disable some analyses and outputs turned on by default:
FMCSC_XYZOUT 20000000 # disable
FMCSC_TOROUT 20000000 # disable
FMCSC_COVCALC 20000000 # disable
FMCSC_SAVCALC 20000000 # disable
FMCSC_POLCALC 20000000 # disable
FMCSC_RHCALC 20000000 # disable
FMCSC_INTCALC 20000000 # disable
FMCSC_POLOUT 20000000 # disable
FMCSC_ENSOUT 20000000 # disable
FMCSC_ENOUT 20000000 # disable
FMCSC_RSTOUT 20000000 # disable
The progress index method has a number of auxiliary keywords, and we make the following choices:
FMCSC_CMODE 4 # selects the analysis method
FMCSC_CPROGINDMODE 2 # use fast approximate algorithm
FMCSC_CDISTANCE 1 # use dihedral angles to represent each snapshot
FMCSC_CRADIUS 60.0 # threshold radius (in degrees) of clusters
FMCSC_BIRCHHEIGHT 8 # tree height
FMCSC_BIRCHMULTI 7 # make sure clustering quality is consistent across all levels
FMCSC_CMAXRAD 120.0 # coarsest threshold level in tree
FMCSC_CPROGINDRMAX 1000 # number of search attempts
FMCSC_CPROGINDSTART 1 # start from the first snapshot in the trajectory (default)
FMCSC_CPROGRDEPTH 7 # keep looking into the atmost 7 lower levels of the tree until guesses are satisfied
FMCSC_NRTHREADS 4 # the workflow used here is nearly fully threads-parallelized
Here, we select the method of Blöchliger et al. via keyword FMCSC_CMODE. Details are provided in the corresponding documentation and in the reference publication. As in Step 2, we pick all the dihedral angles to represent the system. We use the fast approximate algorithm (→ FMCSC_CPROGINDMODE), which relies, to achieve scalability, on preorganizing the data via the tree-like algorithm of Vitalis and Caflisch discussed in Step 2. We use the same settings for this auxiliary clustering except that we request a consistent quality of clustering in all tree levels (via FMCSC_BIRCHMULTI). In addition, three progress index-specific keywords are specified. FMCSC_CPROGINDRMAX sets the number of search attempts per snapshot for the approximate algorithm (a higher value gives more accurate results), FMCSC_CPROGINDSTART specifies the starting snapshot for the resulting profile, and FMCSC_CPROGRDEPTH sets the maximum of levels of descend to in order to satisfy the requested number of search attempts per snapshot. Finally, we switch to the threads-aware executable (if you have it available, set FMCSC_NRTHREADS to an appropriate value) since the tree-based clustering and progress index both take advantage of it (see this preprint for details on the latter). After pasting all the above keywords into a key-file "oneshot.key", you can run:
campari_threads -k oneshot.key >& tutorial11_oneshot.log
The cost of the algorithm scales roughly linearly with the number of search attempts. Depending on whether you use threads or not, the generation of the approximate minimum spanning tree may take up to a few minutes. The resulting output is collected in the file PROGIDX_000000000001.dat and plotting the fourth column will give you the desired profile. You can use this R-script to plot the output with some structural annotation by running:
Rscript plotOneShot.R
mv oneShot.eps basic_PIX.eps
The plot saved to "basic_PIX.eps" should look similar to Fig. 2 in the reference paper for this method. In particular, there should be 27 distinct non-overlapping basins. The output is a combination of cut-based and structural annotations. You might note that there are occasional narrow groups of snapshots that appear to be structurally inhomogeneous. This is a problem discussed specifically for the n-butane example in more recent work by Cocina et al. Using an idea proposed in this preprint, it is possible to reassign these "noisy" snapshots to meaningful basins. Add this keyword:
FMCSC_CPROGMSTFOLD 15 # selects the number of leaves-folding operations
This requests CAMPARI to identify those snapshots corresponding to leaves in the spanning tree and lump them into their parent. This is repeated 14 additional times. Note that for systems of higher dimensionality you would not use such a large value. Now, run again and reanalyze the data:
campari -k oneshot.key >& tutorial11_oneshot_fold15.log
Rscript plotOneShot.R
mv oneShot.eps folded_PIX.eps
A comparison of the two plots should illustrate the effect of the folding technique although the effect might not be, due to the high quality of the partitioning in general, particularly pronounced for the settings here. The groups of "noisy" snapshots should have largely disappeared, and you might find that the snapshots corresponding to eclipsed values of dihedral angles (indicated by red dots) are distributed more evenly across progress index values.
Because the progress index aims to separate the basins of a system and arrange them without overlap while delineating them using kinetic annotations, it would be interesting to obtain a direct comparison to a cut-based pseudo free energy profile. For this, add and/or change the key-file as follows:
FMCSC_CPROGINDSTART -2 # consistently use largest cluster or representative thereof for reference in profiles
FMCSC_CMSMCFEP 1 # request MFPT-based cFEP
FMCSC_CRADIUS 20.0 # threshold radius (in degrees) of clusters
We improve the resolution of the clustering so that analyses based on the coarse-grained trajectory can still be meaningful. Now run CAMPARI for the final time:
campari -k oneshot.key >& tutorial11_final.log
This will create a progress index output file as before, but using a different starting snapshot (reflected in the name of the file). The relevant information can be obtained from log output. It will also create an output file called MFPT_FWD_CFEP_00000001.dat, and a plot of columns 3 against 5 would provide output comparable to the annotation we studied before. It is left as an exercise to modify or create plotting commands preparing an explicit comparison of the data (refer to publications on progress index and clustering for details).