CAMPARI Tutorial 17

Tutorial 17: Utilizing CAMPARI to Process Trajectories from Other Software Like GROMACS

Preface:

This tutorial complements the general overview of trajectory analysis presented in Tutorial 10. Here, it is explored what steps are needed and available to analyze a trajectory in a binary format produced and converted directly by GROMACS and its suite of tools. The system covered in this tutorial is arbitrary except that it is meant to capture many of the problems that a user might encounter.

Step-by-Step:

Step 1 - Understanding the system

In the tutorial folder you can find exactly two files: an example trajectory in xtc-format and an associated pdb file. The trajectory was converted by GROMACS' "trjconv" tool from the original .trr-trajectory, and it contains all coordinates of the system (including explicit water molecules). The pdb-file was also extracted by "trjconv" from the .trr-trajectory. The GROMACS version was 2020.5 throughout.
To begin, we need to create the only custom, mandatory input file CAMPARI always needs: a sequence file. To do so, we use CAMPARI's own conversion tool. The pdb file no longer contains any substantial header information, which is normal for pdb files produced by simulation software or preparation tools. This means that the conversion script will simply assume that the sequence corresponds exactly to what it is in the file (for the function of the tool with experimental pdb files, see Tutorial 6 or Tutorial 16). Copy the two files from the example directory into an otherwise empty folder, then run:

<full path to folder>/campari/tools/convert_SEQRES_toseq.sh tutorial17.pdb t17.seq HID 0 1 T3P
The pdb file contains a few nonstandard names: the 5'-RNA terminal residues are called "RU5" and "RG5", respectively, and water molecules are called "SOL". These three names all have different meanings in the PDB database (specific ligands), which makes it difficult to find ways of safely autotranslating them. The conversion tools covers these cases, at least under the circumstances provided by this particular input file, but in general one should expect to have to make small modifications to the sequence file. The option "HID" is irrelevant (because all histidine residues are already explicit in the pdb file) as are the options "0 1" (because the structure is fully there and all residues are CAMPARI-supported). The option "T3P" is what achieves translation of "HOH" or "SOL" into "T3P" with the restriction that the latter must have standard atom names ("OW" in particular).
With the sequence in place, we also have to understand the geometry of the simulation container since many analyses rely on it. More importantly, the output coordinates may be broken across periodic boundaries, which might make any analysis impossible if the size and shape of the container are lost irretrievably. The header indicates only box vector lenghts and cell angles:

CRYST1 85.000 85.000 85.000 90.00 90.00 60.00 P 1 1
However, these are also only the dimensions of a single snapshot, and we must consider the fact that the volume might fluctuate, which is the case here. There is no obvious way of knowing this from the binary trajectory file, however. A unit cell that has cell angles of 90°, 90°, and 60° is not orthorhombic but corresponds to a hexagonal lattice. In simulation software, including CAMPARI, non-orthorhombic unit cells are generally represented as triclinic lattices with 3 box vectors. In principle, the specification above is insufficient because the absolute rotation is not apparent from the lengths of the box vectors. Unlike translation (change of origin), absolute rotation is not a mere formality in 3D periodic boundary conditions, which are in use here. That said, the common convention is to place the first two vectors in the xy-plane and to create a right-handed coordinate system.

Step 2 - Basic settings

As mentioned in Tutorial 10, trajectory analysis mode allows leaving many keywords at default values. We will therefore again construct a key-file from scratch. We start with the sequence file, parameter file, and the input specification:

PARAMETERS <full path to folder>/campari/params/abs4.2_charmm36.prm FMCSC_SEQFILE t17.seq FMCSC_PDBANALYZE 1 FMCSC_PDB_FORMAT 3 # 1 for pdb, 4 for dcd, 5 for NetCDF, 6 for PostgreSQL FMCSC_XTCFILE tutorial17.xtc FMCSC_XTCPREC 1000000.0 # actually overridden by XDR library on input but here to preserve precision later FMCSC_PDB_TEMPLATE tutorial17.pdb FMCSC_PDB_R_CONV 4 # AMBER, actually needed here FMCSC_DYNAMICS 2 # assume molecular dynamics-like scenario
For the parameter file, we simply go with the most general force field available in CAMPARI since we have (here) no interest in detailed energetic analyses. The setting for FMCSC_PDB_R_CONV corrects some naming convention issues for the pdb-file but, generally speaking, the vast majority of these is taken care of automatically. From the PDB-file, in particular the nucleic acid part, the file is closest to AMBER convention.
To express the container, we need to calculate what a box vector of 85Å at an angle of 60° means. The third angle, so the one between the first two box vectors, is the only one that is not orthogonal:

FMCSC_SHAPE 4 # triclinic FMCSC_BOUNDARY 1 # full, 3D periodic boundary conditions FMCSC_ORIGIN 0.0 0.0 0.0 # undefined in the available files FMCSC_BOXVECTOR1 85.0 0.0 0.0 # first along x-axis FMCSC_BOXVECTOR2 42.5 73.61216 0.0 # second in the xy-plane at angle of 60 deg. so that third points in positive z for a right-handed system FMCSC_BOXVECTOR3 0.0 0.0 85.0 # third along z-axis (orthogonal to both first and second)
The only other thing we specify at this point is a large number of steps (read-in will fail eventually) and that no energy terms should be computed or analyses be performed. In addition, we request a report to get a human-readable summary of what the script-generated sequence file actually means.

FMCSC_NRSTEPS 100000 # we will find out soon FMCSC_SC_IPP 0.0 # the only term enabled by default FMCSC_SEQREPORT 1 # to get some information about composition FMCSC_DISABLE_ANALYSIS 1 # disable analyses for now, make sure this is always below FMCSC_NRSTEPS
Put all these keywords into a file called "t17.key". Now run this once to arrive at a clear-cut error:

campari_threads -k t17.key
The complaint is about the sequence mismatch "SOL" vs. "T3P". As mentioned above, it is not safe to automatically translate names that might have other uses and are not present internally in CAMPARI (like "SOL"). The ambiguity of the water model in output files is simply a peculiarity of GROMACS, so we will remove it:

sed -i 's| SOL | T3P |g' tutorial17.pdb
This command simply bulk-replaces the residue name to the one we picked above for the sequence sequence file, i.e., "T3P". Now we can try again:

campari_threads -k t17.key
The run finishes albeit with an error: the read-in fails at step #21, which means that we should adjust the setting for FMCSC_NRSTEPS accordingly as the number of snapshots in "tutorial17.xtc" evidently is only 20. Scroll up to get to the report: there you can learn that we have four different polymer chains (two polypeptide chains and two polynucleotide chains) along with a relatively high concentration of salt and 17326 water molecules. The numbering of molecule types will become relevant later. Secondly, have a look in VMD at "yourcalc_END.pdb" in comparison to "tutorial17.pdb": while they are clearly different snapshots as expected, they look qualitatively similar (be sure not to load them as frames in VMD, though, since the atom orders differ).

Step 3 - Getting it right

One feature that stands out from the visualization is that the macromolecules appear to be split across the periodic boundary. In GROMACS (and other software of course, including CAMPARI), molecules can either be preserved (default in CAMPARI) or split as is the case here. To account for this, there is a dedicated keyword to add:

FMCSC_XYZ_FORCEBOX 1 # 1 means yes on input, no on output
This correction relies critically on having the correct box information at every snapshot. But how does one diagnose that this information is indeed correct? This is surprisingly difficult. Visual inspection is tedious and error-prone. Specific analyses like solution structure characterization or backbone dihedral angle statistics often cover only a subset of the problem. On the other hand, the potential energy is actually a relatively global indicator that can reliably identify errors of the type we might commit here. So we are going to add a standard gas-phase Hamiltonian including cutoffs:

FMCSC_INTERMODEL 2 # standard FF model FMCSC_ELECMODEL 1 # standard FF model FMCSC_SC_IPP 1.0 # change existing keyword FMCSC_SC_ATTLJ 1.0 # 6th power dispersion FMCSC_EPSRULE 2 FMCSC_SIGRULE 1 FMCSC_SC_POLAR 1.0 # Coulomb FMCSC_SC_BONDED_B 1.0 # bonded terms FMCSC_SC_BONDED_A 1.0 FMCSC_SC_BONDED_T 1.0 FMCSC_SC_BONDED_I 1.0 FMCSC_LREL_MD 3 # reaction-field correction FMCSC_CUTOFFMODE 3 # grid-based cutoffs FMCSC_NBCUTOFF 12.0 FMCSC_ELCUTOFF 12.0 FMCSC_GRIDDIM 10 10 10 FMCSC_GRIDMAXGPNB 1024 FMCSC_N2LOOP 0 # to save time FMCSC_ENOUT 1 # to obtain print-out of energies
In addition, you could adjust FMCSC_NRTHREADS to what is appropriate for the machine you are working on. Now run this updated key-file:

campari_threads -k t17.key cp ENERGY.dat en_nvt.dat
Inspect the energy file, ENERGY.dat. The information we want to focus on is in column #2 (total energy) as well as in columns 17-20 (bonded terms). The first row, corresponding to the first snapshots, shows a potential energy below -190000 kcal/mol whereas all the other ones are large positive numbers, mostly due to contributions from column #3 (excluded volume interactions). For systems that consist mostly of water, the average per-molecule energy is very roughly in the vicinity of -10 kcal/mol, so -190000 kcal/mol is a good match for 17326 molecules. From this we can conclude that nothing is obviously wrong about the first snapshot and that the "repairing" of molecules with FMCSC_XYZ_FORCEBOX worked as intended. This is corroborated by the bonded terms, especially the improper dihedrals (column #19), which are pure penalty terms and roughly 0.0 here. Conversely, both total energy (massively) and bonded terms (clearly) are off for all remaining snapshots. This is consistent with a small mismatch in container dimension. As mentioned above, the GROMACS simulation has fluctuating volumes (used the NPT ensemble), and we need to account for that:

FMCSC_ENSEMBLE 3 # assume NPT conditions (only trajectory analysis at the moment) FMCSC_PDB_AUXINFO 3 # maximal auxiliary information for PDBs
The second added keyword is a simple flag that instructs CAMPARI to include more information about the system in pdb files. This can sometimes be useful for visualization software (please refer to the documentation of FMCSC_PDB_AUXINFO for details). Now rerun this updated key-file:

campari_threads -k t17.key cp ENERGY.dat en_npt.dat
Reinspecting ENERGY.dat you should find that now all total energies are below -190000 kcal/mol indicating that we are now understanding the trajectory correctly. One thing that might be confusing is that the bonded terms still differ systematically between the first and the remaining snapshots. This is simply further proof that knowing whether something is correct is not as easy as it might seem. Here, the reason is that the first snapshot is the starting snapshot of the run that was prepared with fixed (and closer-to-ideal) covalent geometries. However, comparing the data in "en_nvt.dat" and "en_npt.dat" it is clear that most of the discrepancy in "en_nvt.dat" was due to the wrong assumption of a fixed volume.

Step 4 - Manipulating (rewriting) the input trajectory

With the first 3 steps completed, you could proceed to performing targeted built-in analyses. Most analyses will respect these particular conditions correctly (fluctuating volumes, triclinic box, broken molecules) but some caveats apply. In generalized triclinic systems, the treatment of periodic images, which affects all intermolecular distance calculations underlying analyses like solution structure characterization or contact patterns, is a particular problem and CAMPARI, differing from many other software packages, takes no shortcuts and makes no simplifying assumptions in this regard. Prominent caveats apply to the normalization of pair correlation functions (the so-called volume element) or the treatment of periodicity in spatial density analyses.
Here, we focus not on built-in analyses but on trajectory rewriting. As they are, the data in "tutorial17.xtc" are difficult for visualization purposes if we presume that the interest is on the macromolecular complex. To rectify this, we first rewrite the trajectory into a macromolecule-centric version that has the added benefit of visualizing the true unit cell of the system.

FMCSC_XYZ_REFMOL 3 # one of the RNA strands FMCSC_XYZOUT 1 # every step FMCSC_XYZPDB 4 # 1 is pdb, 3 is dcd, 5 is NetCDF, 6 is PostgreSQL FMCSC_BASENAME unitcell
If you want, you can also disable all the energy-related keywords as they have no bearing on what follows:

campari_threads -k t17.key vmd -e unitcell_VIS.vmd
This will rewrite the trajectory to unitcell_traj.xtc and open it in VMD. In VMD, you will have 21 snapshots (frames) open, the first being the template, which does not obey the specification for FMCSC_XYZ_REFMOL. You can delete it ("Molecule" → "Delete frames") or ignore it. The display of water obscures the macromolecules but highlights the shape of the true unit cell. This is the cell of nearest images that a point at the center would "see", and it does not correspond to the triclinic shape we specified but instead reveals the underlying hexagonal lattice. This is an important difference to orthorhombic unit cells. It essentially is the same principle as going from a triclinic lattice to the associated Wigner-Seitz cell (link-out to Wikipedia). Under "Representations" you find the third representation of type "VdW". Disable it (double-click) to see the RNA-protein complex. The simulation is based on PDB "5n8m" if you are curious about the system.
While the system barely moves in absolute terms in this short simulation, in the general case we might want to align it to some reference. We might also want to not include the solvent in the output. For the former, we use our reference snapshot but need to define an index file. For the latter, we make use of analysis groups:

echo -e "T\n1\n2\n3\n4\n5\n6\n-7" > t17.angrp grep "ATOM " unitcell_END.pdb | grep -F " P " | awk '{print substr($0,7,5)}' > t17.alix
The first line creates an analysis group file that tags all molecule types as solutes except water (the 7^th and last type). The second parses "unitcell_END.pdb" for the RNA's P-atoms and writes their CAMPARI-internal indices to a file called "t17.alix". It is important to use "unitcell_END.pdb" rather than "t17.pdb" due to atom order differences (aside: it is also important not to have FMCSC_PDB_NUCMODE set to 2 in the key-file for the same reason). Now we can add (or change existing) keywords as follows:

FMCSC_ANGRPFILE t17.angrp FMCSC_XYZ_SOLVENT 0 # suppresses structural output for solvent-tagged molecules FMCSC_ALIGNCALC -1 # every step and asking for instantaneous output to RMSD.dat FMCSC_ALIGNFILE t17.alix FMCSC_BASENAME aligned
If you tried to run this, it would seemingly work but the results in RMSD.dat would indicate a problem (they would be large values). The problem is that the template used for alignment ("tutorial17.pdb") still has the broken molecules in it, which gets propagated into the alignment set due to the fact that CAMPARI does not allow itself to correct the template (which would undermine part of the template function, e.g., with unsupported residues, see Tutorial 13). A second problem is that the alignment rotates the coordinates, which means that operations relying explicitly on knowing the periodic boundaries are no longer safe, specifically the use of FMCSC_XYZ_REFMOL to move coordinates into a reference frame. The (by far) easiest workaround for the first problem is to just use the rewritten trajectory. For the second one, we could disable FMCSC_XYZ_REFMOL but that does not address the general problem that the rotation creates a mismatch with box geometry, and we also would have to rely tacitly on the reference molecule being guessed correctly by CAMPARI (see FMCSC_ALIGNCALC). The better solution here is to no longer assume periodic boundary conditions or fluctuating volumes, so update keywords as follows:

FMCSC_XTCFILE unitcell_traj.xtc FMCSC_PDB_TEMPLATE unitcell_END.pdb FMCSC_SHAPE 2 # 2 is a sphere FMCSC_BOUNDARY 3 # residue-based soft-wall FMCSC_SIZE 100.0 # generous radius in A FMCSC_GRIDDIM 20 20 20 # Modify existing kw; needs to be increased as system is formally much larger FMCSC_ENSEMBLE 1 # if we kept NPT, the box coordinates read in would be misinterpreted badly
It should be emphasized that these modifications are not a good idea in general. It is only due to the wish to perform alignment that it is advantageous to mask the real system geometry and boundary conditions. It leads to loss of information. For example, if you kept the computation of energies and check the new data in ENERGY.dat, there is now not only a large boundary term (column 14, expected since the sphere is not placed well enough) but the energies without the boundary penalty are also not quite as favorable as before (ca. -180000 kcal/mol). This is due to the missing interactions with periodic images. Now run again:

campari_threads -k t17.key vmd -e aligned_VIS.vmd
You should find that the values in RMSD.dat are all below 1.5 Å and they get progressively smaller since the reference structure is the last one in the trajectory. The visualization reveals nothing fundamentally interesting except that the water molecules are no longer present as intended.
In the final step, we use a feature that allows us to not omit water molecules entirely but to keep some in a dynamic manner. This is again primarily a feature useful for visualization purposes and for preparing trajectory data for storage as a compromise between storing all or no solvent coordinates. It is nontrivial because (binary) trajectories and molecular viewers generally require a number and sequence of atoms that is exactly fixed. In CAMPARI, we just have to add one keyword (and we update the base name to avoid appending/overwriting files):

FMCSC_XYZ_PRUNE 3 5 40 6 20 7 2500 # first number is no. of types, after that every pair is type index and target number of mol.s FMCSC_BASENAME pruned
If you run this, an error is produced. FMCSC_XYZ_PRUNE is a keyword that is (partially) incompatible with the other ways of reducing information written to the structural output files (FMCSC_XYZ_SOLVENT and FMCSC_TRAJIDXFILE), so we also need to change FMCSC_XYZ_SOLVENT or change the tagging of the water molecules in "t17.angrp". The logic of FMCSC_XYZ_PRUNE is to, at every step, determine a fixed number of "closest" molecules of the selected types and only include those in the output. Here, we select 40 potassium ions, 20 chloride ions, and 2500 water molecules. The criterion for "closest" is based on a reference set of atoms that is either determined automatically or provided by the user. Here, we first choose the automatic version, which basically tries to use all macromolecules (heuristically). So change:

FMCSC_XYZ_SOLVENT 1 # re-enable water output
Then run:

campari_threads -k t17.key vmd -e pruned_VIS.vmd
Upon visualization of the trajectory you will find that you are essentially looking at the first 1-3 hydration "shells" of the complex. If you stored this trajectory, you would preserve a reasonable grasp of the hydration behavior (occupation, hydrogen bond statistics, etc.) and save, at the same time, approximately 80% in file size.
As mentioned, you can also use FMCSC_XYZ_PRUNE with a custom reference set. If you were, for example, interested in the hollow space between the (major groove of the) dsRNA and the two protein domains, you could put the numbers 3368 (N6 of A-186), 2557 (P of U-161), 3282 (P of A-183), and 2620 (P of U-163) into a file called "t17.refix". Then add/change:

FMCSC_TRAJIDXFILE t17.refix # if XYZ_PRUNE is used, this gives the reference set FMCSC_XYZ_PRUNE 3 5 10 6 5 7 200 # first number is no. of types, after that every pair is type index and target number of mol.s FMCSC_BASENAME focused
Now run for the final time:

campari_threads -k t17.key vmd -e focused_VIS.vmd
From the visualization, it is clear that you successfully isolated the part of the solvent that fills that hollow space. Whenever you work with flexible molecules, it must be kept in mind that reference frames can effectively change throughout a simulation, and this is generally an unsolvable problem.
This concludes this tutorial. You should now have a good idea how to process an input trajectory generated, for example, by GROMACS. It is always possible that there are naming convention issues encountered in step 1. In these cases, try to first rely on keyword FMCSC_PDB_R_CONV and on log-output before starting to edit files directly. If the input trajectory contains residues/molecules that are not supported by CAMPARI, the role of the template is more important (see Tutorial 10). If the input trajectory contains residues/molecules that are in an unsupported protonation or terminus state, they might have to be masked as unsupported. The possibly biggest obstacle you might encounter are unsupported residues where CAMPARI complains about the atom order, which might necessitate doing more artificial masking (like splitting a residue into two).