Tutorial 10: Utilizing CAMPARI to Analyze Trajectory Data
Preface:
While all previous tutorials have covered simulation (or equivalent) tasks, one of CAMPARI's strong points is the ability to use a trajectory in a variety of binary and human-readable formats as input via keyword FMCSC_PDBANALYZE. In essence, propagating through the trajectory replaces actual sampling, and the functionality of the program is otherwise the same. There are a few peculiarities, however, that are most easily illustrated by an example.Step-by-Step:
Step 1 - Basic settings
Many of the keywords needed otherwise are no longer required in trajectory analysis mode and can be left at default values. We will therefore construct a key-file from scratch. A short trajectory distributed along with the code serves as an example although the tutorial will probably me more useful if it is modified to be able to work on a custom trajectory of actual interest.We start as follows:
PARAMETERS <full path to folder>/campari/params/oplsaal.prm
FMCSC_PDBANALYZE 1
FMCSC_SC_IPP 0.0
FMCSC_UNSAFE 1
FMCSC_UAMODEL 0
This turns off the only energy term that is turned on by default in order not to compute energies for every trajectory snapshot. A parameter file is still required, and it should support the system of interest as much as possible. The setting of FMCSC_UNSAFE ensures that certain setup errors that could be critical issues in simulations will not cause CAMPARI to abort the run. For this example we use a trajectory with all-atom representation (→ FMCSC_UAMODEL), but it is important to remember to change this setting if necessary.
Step 2 - Input files
Before CAMPARI can run, we of course need to specify which file(s) to use as input. In this particular case, the distributed example trajectory is in xtc-format. This along with knowledge of the specifications of the underlying simulation makes the following choices necessary:FMCSC_PDB_FORMAT 3 # 1 for pdb, 4 for dcd, 5 for NetCDF
FMCSC_XTCFILE tutorial10.xtc # you need to either copy the file or change the path
FMCSC_SEQFILE tutorial10.in # ditto
FMCSC_BOUNDARY 4
FMCSC_SHAPE 2
FMCSC_SIZE 40.0
FMCSC_ORIGIN 0. 0. 0.
FMCSC_NRSTEPS 1995
FMCSC_EQUIL 0
Since the binary trajectory was generated by CAMPARI itself, no template pdb file is needed. If this were not the case, keyword FMCSC_PDB_TEMPLATE allows you to supply such a template pdb file from which CAMPARI will infer the order of atoms in the binary trajectory. An example usage is demonstrated later in this tutorial. As can be gleaned from the sequence file, the calculation involved a peptide composed of proline and glycine along with three, identical small molecules. This is why it is important to specify the boundary conditions and system dimensions in the key-file (otherwise, analyses relying on intermolecular distances may give incorrect results). This information is (currently) not read from the trajectory data itself. Similarly, the number of snapshots (steps) in the trajectory file needs to be set explicitly via FMCSC_NRSTEPS. It is not generally safe to simply set this to a large number as CAMPARI will have to terminate unexpectedly upon failing to read in the next snapshot. While it should lead to a normal program termination, this can sometimes cause issues with the normalization of measured averages and distributions. Note that the referencing to step numbers will now exclusively refer to the trajectory file, which is why we choose not to discard any additional steps as equilibration in the above example.
Step 3 - Straightforward analysis requests
Now it is time to decide which of the built-in analysis features we wish to use (see elsewhere for a summary of output files and documentation of associated keywords). Most of the calculation frequency settings for analyses we are interested in will be set to 1. The system contains the capped polypeptide Ace-P5G5-Nme along with three small molecules (ethylmethylthioether). In a first effort, we want to generate Ramachandran maps for the peptide and information about residue-residue contacts both within the peptide and between the different molecules. We will also disable analyses that would otherwise clutter the directory with output files. Unlike in prior tutorials, we do not use the meta-keyword FMCSC_DISABLE_ANALYSIS here so that the list below can serve as an overview what other analyses could be performed.FMCSC_ANGCALC 1
FMCSC_ANGRES 6.0
FMCSC_CONTACTCALC 1
FMCSC_CLUSTERCALC 1
FMCSC_CONTACTOFF 0
FMCSC_CONTACTCOM 5.0
FMCSC_CONTACTMIN 3.5
FMCSC_INTCALC 10000000
FMCSC_SAVCALC 10000000
FMCSC_SEGCALC 10000000
FMCSC_DSSPCALC 10000000
FMCSC_DIPCALC 10000000
FMCSC_POLCALC 10000000
FMCSC_PCCALC 10000000
FMCSC_RHCALC 10000000
FMCSC_EMCALC 10000000
FMCSC_CCOLLECT 10000000
FMCSC_ENOUT 10000000
FMCSC_POLOUT 10000000
FMCSC_TOROUT 10000000
FMCSC_XYZOUT 10000000
For details regarding the individual keywords, please follow the links. After copying the tutorial files to a local directory, and pasting the above information to a file called "trajan.key", you should now be able to execute:
campari -k trajan.key >& log1
cp RAMACHANDRAN.dat all.rama
This run should finish in a second or less due to the small number of steps and limited analyses requested. Have a look (especially the 2D files will require using a suitable plotting software) at the produced output files; they should be CONTACTMAP.dat, CONTACT_HISTOGRAMS.dat, CLUSTERS.dat, MOLCLUSTERS.dat, COOCLUSTERS.dat, yourcalc_END.pdb, yourcalc_END.int, yourcalc_END.mol2, JCOUPLING.dat, and RAMACHANDRAN.dat. While the φ/ψ-analysis should be as expected, the output for the contact-based analysis is actually not what we wanted or expected. If you inspect MOLCLUSTERS.dat for example, you will note that data were only obtained for one of the four molecules in the system, i.e., the peptide, and are therefore quite uninformative. As can be gleaned from the documentation, the reason lies in the fact that by default CAMPARI will interpret small molecules (consisting of only a single "residue") as solvent-like molecules, and ignore them for certain classes of analyses. To change this behavior, we need to utilize the concept of analysis groups. Paste the following into a file called "angrp.in":
M
1
2
3
4
By picking different positive numbers for each molecule (mode "M" references molecules by number of occurrence in the sequence file), we instruct CAMPARI to regard all four molecules as independent solutes. It would also be possible to use mode "T" or to pick the same number for all small molecules, in which case those quantities for which averaging over molecules of the same type applies would be reported only as the average over the three individual molecules. Append the key-file with:
FMCSC_ANGRPFILE angrp.in
FMCSC_SEQREPORT 1 # so we get a report on what is treated as solute/solvent
Then rerun:
campari -k trajan.key >& log2
Note that this will overwrite the previously generated output files (the only exception to this general rule are trajectory files and, in simulation restarts, files like N_000_REXTRACE.dat or ELS_WFRAMES.dat). If you copy or move the old files beforehand, you can use "diff" or a similar utility to quickly verify that RAMACHANDRAN.dat is unchanged, but that the contact-based output files are indeed extended to include information about the small molecules as well.
Step 4 - Customizing analyses
Suppose that you are interested in further properties, say simple polymeric characteristics of the peptide. Maybe, due to the diblock copolymeric character of the peptide, it is even of interest to compute properties for different segments of each peptide. Sometimes it may also be desirable to process the trajectory with different software packages, and in such a case, it may be necessary to preprocess the data (for pruning see below; for alignment, see FMCSC_ALIGNCALC):FMCSC_POLCALC 1
FMCSC_RGBINSIZE 0.1
FMCSC_XYZOUT 1
FMCSC_XYZPDB 4
FMCSC_TRAJIDXFILE justpeptide.idx
The above turns on analysis of simple polymeric properties, and trajectory output in xtc-format. You should make sure to replace the entry for FMCSC_XYZOUT in the key-file. It generally works to append the key-file with updated entries as well, since in such a case CAMPARI will read the keyword two or more times and only remember the last one (furthest down in the key-file). However, for new users, this can lead to confusing "mistakes". Notably, the input file FMCSC_TRAJIDXFILE allows you to exactly specify which atoms to keep in the trajectory output. It uses simple atomic indices that correspond to CAMPARI's internal ordering, which can (almost always) be taken from the reference pdb file written out at the end. To create these files, it is advisable to use a combination of typical command line tools such as grep and awk. For example:
grep ATOM yourcalc_END.pdb | awk '{print $2}' > justpeptide.idx
Here, we take advantage of the small molecules using the pdb idenfitier "HETATM". Now rerun CAMPARI, and rename the produced trajectory.:
campari -k trajan.key >& log3
mv yourcalc_traj.xtc justpeptide.xtc
cp yourcalc_END.pdb justpeptide.pdb
Of course, the polymeric analyses requested had access to the full system, and you should check the produced output files, i.e., POLYAVG.dat, RGHIST.dat, PERSISTENCE.dat, DENSPROF.dat, TURNS_RES.dat, RETEHIST.dat, and RDHIST.dat, to make sure that they conform to our request of all being considered independent solute molecules (via FMCSC_ANGRPFILE). There is an important caveat, however, and that lies in the fact that not all analyses apply to all molecule types. In the case considered here, the small molecule will for instance not produce any data for end-to-end distances (in RETEHIST.dat and POLYAVG.dat). Feel free to verify that the resultant trajectory does indeed only retain information about the peptide:
vmd yourcalc_END.pdb justpeptide.xtc
For the sake of this tutorial, we are only going to use "justpeptide.xtc" in an auxiliary role. CAMPARI handles molecule-specific analyses well as is. Instead, we will attempt two more advanced tasks.
Step 4a - Radius of gyration of hydrophobic sidechains
The first task is to find out the radius of gyration distribution of the sidechains of the proline residues. Such subset radii of gyration of hydrophobic moieties are often evoked as putative order parameters in the analysis of peptide and protein conformational equilibria. Unfortunately, the analysis group facility does not allow you to split a molecule into different parts. We will therefore utilize a trick to get around this restriction. First, "justpeptide.xtc" is pruned further. Change the relevant entries in the key-file to:FMCSC_TRAJIDXFILE justprolylsidechains.idx
FMCSC_XTCFILE justpeptide.xtc
Next, remove the three "EMT" entries from the sequence file in the local directory and execute:
grep PRO yourcalc_END.pdb | grep -v " [N,C,O] " | grep -v " .A " | awk '{print $2}' > justprolylsidechains.idx
campari -k trajan.key >& log4
mv yourcalc_traj.xtc justprolylsidechains.xtc
mv yourcalc_END.pdb justprolylsidechains.pdb
It would of course also have been possible to retain the full sequence and trajectory files and do the same extraction from there. You can again use VMD to verify the sanity of the produced trajectory. The coordinates are now no longer recognizable as polypeptide residues due to the complete absence of the backbone. There are two paths to obtain the desired data. First, we could set FMCSC_POLOUT to 1, create a sequence file that pretends that the individual sidechains are small molecules, change the residue names in "justprolylsidechains.pdb", instruct CAMPARI to read both the pdb file (as a template) and "justprolylsidechains.idx", and get system-wide polymeric output in POLYMER.dat. This file would then hold the instantaneous radii of gyration for every snapshot, from which the desired distribution (histogram) could be computed. Alternatively (and this is done explicitly in this tutorial), we can create a sequence file that pretends that the system in the xtc-file is still a polymer:
PRN_N
PRX
PRX
PRX
PRC_C
END
Paste the above into a file called "justprolylsidechains.in". Next, edit "justprolylsidechains.pdb" to correspond to the sequence file in terms of residue names (i.e., replace all entries PRO with PRN/PRX/PRC using a text editor or a command line utility such as sed). We are pretending that CAMPARI has no internal representation of these residues, and therefore need to avoid using three-letter codes that correspond to any of the internally supported residues. Even though the residues all possess the same set of atoms, unsupported and terminal residues in polymers must be distinguished by name (and not just by the "_N" addition). This is because rules of the interaction models, etc. depend on whether a residue is terminal or not. Next, change/append the key-file as follows:
FMCSC_SEQFILE justprolylsidechains.in
FMCSC_XTCFILE justprolylsidechains.xtc
FMCSC_PDB_TEMPLATE justprolylsidechains.pdb
# FMCSC_TRAJIDXFILE justprolylsidechains.idx
The template pdb file is absolutely required, since this is where CAMPARI infers the necessary settings from (topology and some required parameters). Commenting out (or removing) the entry for FMCSC_TRAJIDXFILE is needed since the input no longer applies (it refers to the numbering in justpeptide.xtc). Execute:
campari -k trajan.key >& log5
First, have a look at yourcalc_END.int, as it will tell you which internal topology was inferred. The method outlined here may not always work as expected, and this is a good diagnosis tool. Consulting the log-file, "log5", you will find summary information on how CAMPARI coped with these unknown residues. Note that the handling of unsupported residues itself is explained in more detail in Tutorial 12. From the log-file it should be clear why FMCSC_UNSAFE was set to 1 (because the automatic inference of atom types leads to valence errors), and why certain analyses are no longer available and were automatically disabled (because the polymer is no longer of recognized type). After making sure that everything seems to have behaved as expected, you can now plot the data in RGHIST.dat or RDHIST.dat, or simply consult the mean in POLYAVG.dat. When using a strategy like the one just described, it is always necessary to keep in mind what kind of information is needed for the respective analyses. For example, the data in DENSPROF.dat require atomic masses that have to be guessed based on pdb-names, and in such a case it is worthwhile to double check the implied assignment. Pending any other relevant patches (see FMCSC_MPATCHFILE or FMCSC_BIOTYPEPATCHFILE), this information is also printed to yourcalc_END.int (the third column gives the guessed atom (LJ) type → elsewhere). In conjunction with the parameter file, this allows extraction of the desired information.
Step 4b - Angular correlation functions of subsequences
We start by reverting the key-file to the state before we began step 4a with one modification to FMCSC_TRAJIDXFILE:FMCSC_XTCFILE justpeptide.xtc
FMCSC_SEQFILE tutorial10.in
# FMCSC_PDB_TEMPLATE justprolylsidechains.pdb
FMCSC_TRAJIDXFILE justglycine.idx
Suppose we want to first analyze the angular correlation function of the glycine stretch. The goal is to create an artificial system that - unlike in the previous case - is still recognized by CAMPARI as a polypeptide. We want to pretend that the glycine residues are capped on both sides (the C-terminal NME already being in place). In this particular case, we could actually emulate a completely sane peptide chain by picking out and renaming atoms of the last proline residue to pretend it were ACE instead. CAMPARI does, however, automatically recognize certain atom names, so that we can achieve what we want with slightly less work:
echo -e "63\n64\n65\n66" > justglycine.idx
grep ATOM justpeptide.pdb | grep -v PRO | grep -v ACE | awk '{print $2}' >> justglycine.idx
rm yourcalc_traj.xtc
campari -k trajan.key >& log6
mv yourcalc_END.pdb justglycine.pdb
mv yourcalc_traj.xtc justglycine.xtc
Now paste the following into a new sequence input file "justgylcine.in":
PRX_N
GLY
GLY
GLY
GLY
GLY
NME
END
Next, in analogy to before, the residue name for the first residue in "justglycine.pdb" from PRO to PRX. The key-file has to be modified as follows (this should be fairly straightforward by now):
FMCSC_SEQFILE justglycine.in
FMCSC_XTCFILE justglycine.xtc
FMCSC_PDB_TEMPLATE justglycine.pdb
# FMCSC_TRAJIDXFILE justglycine.idx
We can then rerun CAMPARI:
campari -k trajan.key >& log7
cp PERSISTENCE.dat justglycine.pers
cp RAMACHANDRAN.dat justglycine.rama
As before, inspect the log-file and the remaining diagnostic output to verify that the artificial construct was interpreted as intended. In particular, "log7" should clearly spell out that residue "PRX" was identified as a bona fide polypeptide residue. If we had not emulated an N-terminal cap, the Ramachandran analysis for instance would not contain information about the first of the glycine residues (despite it being a nonterminal polypeptide residue in the original peptide). Note of course that this type of trickery is only required for certain analyses. Quantities that are per se reported at residue-level resolution (for example the data in TURNS_RES.dat or CONTACTMAP.dat) are either nonsensical or redundant to decompose in this way.
Naturally, the procedure to perform the same analysis for the proline stretch is highly analogous. We start by again reverting the key-file to the state before we began step 4a (with a single modification):
FMCSC_XTCFILE justpeptide.xtc
FMCSC_SEQFILE tutorial10.in
# FMCSC_PDB_TEMPLATE justglycine.pdb
FMCSC_TRAJIDXFILE justproline.idx
It continues by identifying those atoms of the first of the glycine residue that can reconstitute a proper C-terminal cap:
echo -e "77\n78\n79\n81" > justproline.idx
grep ATOM justpeptide.pdb | grep -v GLY | grep -v NME | awk '{print $2}' >> justproline.idx
rm yourcalc_traj.xtc
campari -k trajan.key >& log8
mv yourcalc_END.pdb justproline.pdb
mv yourcalc_traj.xtc justproline.xtc
Now paste the following into a new sequence input file "justproline.in":
ACE
PRO
PRO
PRO
PRO
PRO
GLX_C
END
Next, analogously to before change the residue name for the last residue in "justproline.pdb" from GLY to GLX, and modify again the key-file:
FMCSC_SEQFILE justproline.in
FMCSC_XTCFILE justproline.xtc
FMCSC_PDB_TEMPLATE justproline.pdb
# FMCSC_TRAJIDXFILE justproline.idx
Run again:
campari -k trajan.key >& log9
cp PERSISTENCE.dat justproline.pers
cp RAMACHANDRAN.dat justproline.rama
Now you can directly compare the results for the proline stretch to those for the glycine stretch. It may be pedagogically helpful to compare the two Ramachandran plots ("justproline.rama" and "justglycine.rama") to the one obtained in Step 1 ("all.rama"); in fact, the average of the two split maps should agree exactly (within rounding error) with the data in "all.rama". If not, this would indicate that something went wrong in the interpretation of the artificial constructs in Step 4b. Note that a similar reconstruction of the properties for the entire peptide is of course not possible for other quantities such as angular correlation function, density profile, radius of gyration histograms, etc.
The tools introduced in this tutorial should have familiarized you with the concepts underlying the usage of CAMPARI as a trajectory analysis tool and in particular should have demonstrated some more advanced techniques to get around restrictions imposed by the organization of representations of molecules within CAMPARI.