Tutorial 6: Introduction to Polynucleotides - Setting up and Running a Complex Simulation in CAMPARI
Preface:
In this tutorial we expand upon the simple monomer simulation and look at systems with multiple molecules and high charge content. CAMPARI provides wide support for simulating polynucleotides, and our task here will be to set up a simulation of a piece of double-stranded DNA (i.e., two chains) in a solvent composed of appropriate counterions, and an aqueous solution of a background electrolyte. Such a system is not suitable for simulation by Monte Carlo methods, and we will expand the basic control obtained in the previous tutorial over using molecular dynamics within CAMPARI. We want to simulate a folded form of DNA, and therefore this tutorial will also require the use of a starting structure.The tutorial in itself does not attempt to compute any ensemble averages that can be compared against other simulation or experimental data. It is therefore a purely technical tutorial whose outcome will consist in nothing but a stable simulation that can be visualized by appropriate visualization software such as VMD.
Step-by-Step:
Step 1 - Create a pdb-file of folded DNA in CAMPARI naming convention with hydrogens and ions added and crystal waters preservedIn the examples directory, we have the pdb-file of the Dickerson dodecamer. This structure was determined around 1980 by Dickerson and colleagues (reference) by X-ray crystallography. Copy this pdb file to the local working directory. We first require a sequence file with the appropriate DNA sequence and residue names. To get this, we use a tool delivered with CAMPARI that does precisely this.
<full path to CAMPARI directory>/tools/convert_SEQRES_toseq.sh tutorial6.pdb prep.seq HIE 0 0 T4E
This will parse the pdb-file and try to write a sequence file with the name "prep.seq" using a default choice of protonation at the Nε for histidines (not an issue here) and preserving crystal waters as the TIP4P model adjusted specifically for use with the particle mesh Ewald method (see below). There are no unsupported residues to consider and no missing loops to reconstruct (this is what the two zeroes refer to). It also prints a warning that the waters in the pdb-file must be renamed manually. Note that the tool has a few important limitations that are not covered by this example: single-residue missing "loops" in the coordinate information require editing the PDB file, the desire to keep some but not all resolved small molecules requires editing the PDB file, shortened (2- or 1-character) nonstandard nucleotide names are not handled well or at all, the correspondence between records of missing residues ("REMARK 465") and sequence numbering ("SEQRES") must be linear (often fails when sequence insertion codes are in use in missing segments), and names of CAMPARI-known but nonstandard amino acids are not automatically translated where this would be necessary.
In the preparation step, let us now rebuild the hydrogens (which are missing in this crystal structure) and add mobile ions (and electrolyte solution). To do this, we will need to know the final system dimensions. Here we choose a cubic, periodic box with 60Å side length, which is able to hold the dodecamer irrespective of tumbling and can also respect requirements for interaction cutoffs later on. Create a file called "prep.key" and paste the following keywords:
PARAMETERS <full path to CAMPARI directory>/params/amber03.prm
FMCSC_SEQFILE prep.seq
FMCSC_BASENAME PREP
FMCSC_BOUNDARY 1
FMCSC_SHAPE 1
FMCSC_SIZE 60.0 60.0 60.0
FMCSC_ORIGIN -10.0 -15.0 -20.0 # just so that the initial DNA placement is roughly central
FMCSC_PDB_READMODE 3
FMCSC_PDB_HMODE 2
FMCSC_PDB_R_CONV 4
FMCSC_PDBFILE tutorial6.pdb
FMCSC_PDB_OUTPUTSTRING "a6,i5,1x,a4,a1,a3,1x,a1,i4,a1,3x,3f18.10"
Note that the choice of parameter file is more or less arbitrary at this stage. Through use of FMCSC_PDB_R_CONV, we instruct CAMPARI to interpret the input pdb as a file written in AMBER-convention, which is closest to the pdb convention for nucleotides. In this particular case, they keyword is not actually needed, but it may be in other scenarios (most issues related to residue and atom naming are handled automatically and independently of this keyword). The last keyword, FMCSC_PDB_OUTPUTSTRING, requests CAMPARI to write a pdb file with nonstandard format (expanded columns for the coordinates). This is sometimes useful if we want to reuse pdb files for starting simulations to avoid inaccuracies in constrained degrees of freedom. Its downside is that these files are not processed correctly by molecular visualization programs.
In each strand the dodecamer has 11 phosphate bridges, and its total charge is therefore -22. To also create a background of ~0.5M monovalent salt, we add 87 sodium ions and 65 chloride ions. Append these to the sequence file, e.g.:
awk '{if ($1 == "END") {for (k=0;k<87;k++) {printf("NA+\n")} for (k=0;k<65;k++) {printf("CL-\n")} print "END"; exit} else {print}}' prep.seq > tmpseq
mv tmpseq prep.seq
We want to let CAMPARI's internal database and the supplemental randomizer take care of the placement of the missing bits (rather than doing this by an actual simulation). Thus, we set:
FMCSC_RANDOMIZE 1
FMCSC_RANDOMATTS 100
FMCSC_RANDOMTHRESH 0.1
FMCSC_RANDOMBURIAL 0.4 # this is the default
FMCSC_NRSTEPS 1
FMCSC_DISABLE_ANALYSIS 1
By using an extremely stringent threshold for the randomizer, we can ensure that the initial placement of ions is appropriate in steric terms. Another very important protection mechanism is provided (since version 5) by keyword FMCSC_RANDOMBURIAL, which specifically detects and utilizes how much a randomly added molecule is surrounded by species (usually biomacromolecules) that have been read in from file. If the degree of burial is larger than the burial threshold, a proposed random placement is rejected. While not likely when just adding ions, this procedure prevents monoatomic ions and water molecules from occupying internal cavities. It will be important later and is added now for clarity. Now we are able to run:
campari -k prep.key >& log1
This should take less than a second. Output will be very limited but should contain the file "PREP_START.pdb" which is the (extended format) pdb-file of the Dickerson dodecamer with added hydrogens, ions, and in CAMPARI naming convention. Inspect "log1": toward the end of the summary of the calculation you can find information on how many atoms were rebuilt, etc. The beginning contains a number of warnings relating to precisely these missing atoms. As mentioned elsewhere, CAMPARI differs fundamentally from other simulation software in that it maintains an internal knowledge of molecular topology for the systems it supports. For biomolecules, these are mostly derived from high-resolution crystallographic databases coupled to chemical knowledge and/or implied parameters from other force fields. This involves the ability to rebuild hydrogen atoms meaningfully (the only problematic cases for this procedure are those where a polar hydrogen atom is (partially) free to rotate).
Copy the file "PREP_START.pdb" to a duplicate file called "dna.pdb". The file "log1" contains an additional warning that we will take care of now, viz., there is a complaint that water molecules are not found at the end of sequence input. This has something to do with specialized loops for force evaluations later on. To address it, simply cut the ions section in "dna.pdb" (all residues with names "NA+" or "CL-" and paste it right before the first instance of a residue with name "T4E". Note that it does not matter to CAMPARI that residue numbers are now discontinuous (the only requirement is that they must change from residue to residue). Analogously, copy the file "prep.seq" to a file "tutorial6.seq" and perform the same cut-and-paste operation before proceeding to the next step.
Step 2 - Create a pdb-file of a partially randomized system
With the converted pdb-file of the piece of DNA with ions in hand, we will now, in a second step, generate a snapshot of the actual system of interest (including water at the correct density). For this purpose, modify existing entries in prep.key as follows:
FMCSC_SEQFILE tutorial6.seq
FMCSC_PDBFILE dna.pdb
FMCSC_NRSTEPS 100000
FMCSC_PDB_R_CONV 1
FMCSC_PDB_HMODE 1
We also need to add:
FMCSC_PDB_INPUTSTRING "a6,a5,1x,a4,a1,a3,1x,a1,i4,a1,3x,3f18.10"
This keyword above tells CAMPARI that it should expect "dna.pdb" to be in the format as specified. Note that the second field is always treated as a character variable on input and as an integer on output (this is the only difference to the setting for FMCSC_PDB_OUTPUTSTRING above). Our final system is meant to contain 6700 water molecules, and we need to edit the sequence file to reflect this. For example:
awk -v cnt=0 '{if ($1 == "T4E") {cnt=cnt+1} if ($1 == "END") {for (k=0;k<6700-cnt;k++) {printf("T4E\n")} print "END"; exit} else {print}}' tutorial6.seq > tmpseq
mv tmpseq tutorial6.seq
As for the ions, CAMPARI will automatically generate random starting positions for all the missing water molecules (and it could do so also for many missing parts of polymers, see FMCSC_RANDOMIZE for details). For a condensed phase system, the difficulty, as you may have seen already in Tutorial 5, is of course the density. The full sequence file contains 6700 water molecules, 87 sodium ions, 65 chloride ions, and the two DNA chains making for an overall density of approximately 1.02g·cm-3 with the chosen box dimensions. Note that our box is certainly too small to study counterion effects around a piece of DNA with appreciable accuracy, but the smaller size helps making this tutorial more tractable. Periodic boundary conditions as chosen here are by far the most common choice for simulations employing explicit solvent since they are thought to minimize boundary artifacts.
Thus far, we have not specified an interaction model, but note that excluded volume interactions are turned on by default. As can be gleaned from the setting for FMCSC_NRSTEPS above, we plan to perform some sort of preliminary simulation. To be specific, we wish to use Monte Carlo sampling in rigid-body space to solvate the structure in "dna.pdb", which already contains a number of water molecules and ions, with water but without too many steric clashes. Most molecular simulation packages will typically have a protocol to use preexisting water boxes to "soak" a macromolecule. Instead, here we will use CAMPARI to create our own custom system completely from scratch. This has its advantages as it is a general strategy that can be used for any solvent (including solvent mixtures of arbitrary complexity). To make sure that this preliminary Monte Carlo run proceeds at appreciable pace, add the following lines:
FMCSC_CUTOFFMODE 3
FMCSC_NBCUTOFF 10.0
FMCSC_GRIDDIM 10 10 10
FMCSC_GRIDMAXGPNB 1024
FMCSC_N2LOOP 0
FMCSC_CHECKFREQ 10000000
Here, we instruct CAMPARI to use grid-based interaction cutoffs which is common practice in most modern simulation software. We should also provide details of the move set so that it is better-suited to our task than the default settings:
FMCSC_RIGIDFREQ 1.0
FMCSC_RIGIDRDFREQ 0.1
FMCSC_TRANSSTEPSZ 0.3
FMCSC_ROTSTEPSZ 10.0
FMCSC_FRZREPORT 1
FMCSC_FRZFILE <full path to CAMPARI directory>/examples/tutorial6/tutorial6.frz
The MC move set is focused entirely on rigid-body motion. The step sizes are suitably small for a condensed phase system. In addition, we constrain the positions of the two DNA chains to prevent changes in their relative orientation from happening. The input file is very simple and its format is explained elsewhere.
Lastly, like in Tutorial 5, we should use the OpenMP-threaded version to speed things up:
FMCSC_NRTHREADS 4 # if you can allow it, set it to the number of physical or even logical cores in your system (hyperthreading can be beneficial)
FMCSC_THREADS_VERBOSE 1 # level of diagnostic output
FMCSC_THREADS_DLB_FREQ 500 # restarts DLB segments every X steps
FMCSC_THREADS_DLB_STOP 300 # stops DLB for each segment after X steps
FMCSC_THREADS_DLB_EXT 1 # accumulate load data over X steps for DLB adjustments
These dynamic load balancing settings are appropriate for the subsequent molecular dynamics we will perform. Note that we are still relying only on excluded volume for solvent placement, which will lead to a starting structure that has no knowledge about hydrogen bonds or ionic interactions. We can now run CAMPARI again as follows:
campari_threads -k prep.key >& log2 &
This should not take more than a few minutes. We obtain a pdb-file called "PREP_END.pdb" that you should rename to "MCsoaked.pdb".
Step 3 - Create the minimized pdb-file to start production simulations from
For the actual simulation and the prior "minimization", we will use a standard molecular dynamics protocol in Cartesian space using a thermostat to obtain the canonical (NVT) ensemble. This means that you should remove or comment the lines requesting explicit constraints on the DNA from prep.key (see above, keyword FMCSC_FRZFILE). After doing that, append the key-file by:
FMCSC_ENSEMBLE 1
FMCSC_DYNAMICS 2
FMCSC_TSTAT 4
FMCSC_CARTINT 2
The thermostat used throughout the rest of this tutorial is the velocity-rescaling thermostat of Bussi et al.. There is a small discussion of thermostats also in Tutorial 5 Previously, we operated in a rigid body space that satisfied desired constraints on bond lengths and angles automatically. Our choice for keyword FMCSC_CARTINT above means that we switch to Cartesian degrees of freedom (atomic positions), which requires us to impose constant bond lengths for all bonds for the DNA as holonomic constraints instead. In addition, the water model we use (TIP4P-Ewald) is parameterized as a rigid model which means that one additional constraint per molecule is needed (note that lone-pair particle is kept rigid by different means due to its lack of mass). In theory, we could use option 6 for keyword FMCSC_SHAKESET to provide an exact list of the constraints we want. If written properly, such a list would be parsed by CAMPARI, and during execution the coupled constraints on the DNA molecules would be addressed by our choice of constraint solver whereas the ones on water would be solved analytically by the so-called SETTLE (reference) procedure (CAMPARI will do this for all suitable constraint groups). Since such a list may be inconvenient to prepare, we use the following syntax:
FMCSC_SHAKEMETHOD 1
FMCSC_SHAKESET 3
FMCSC_SETTLEH2O 1
FMCSC_SHAKEFROM 2
The third keyword above is in fact implied as a default choice (added for clarity here), and instructs CAMPARI to override constraint choices via FMCSC_SHAKESET for all supported water models (TIP3P, TIP4P, TIP4P-Ew, TIP5P, or SPC) in order to make them completely rigid. Here, this means that the choice via FMCSC_SHAKESET 3 of just constraining bond lengths in water molecules is replaced by complete rigidification. The fourth keyword, FMCSC_SHAKEFROM, is important because it defines what to actually set the constraints to. Option 2 means that these values are taken from the minimum positions of bonded potentials (DNA) in "amber03.prm" or the published rigid geometries (water models). This implies that the values of the constraints found in "MCsoaked.pdb", which are different, must be adjusted (which happens before the simulation starts). The remaining controllable parameters are left as default values.
We are now finally ready to specify the details of the realistic interaction model we plan on using. Here, we try the 2003 revision of the AMBER force field, and can therefore paste a lot of keywords straight from the documentation:
FMCSC_UAMODEL 0
FMCSC_INTERMODEL 2
FMCSC_ELECMODEL 1
FMCSC_MODE_14 2
FMCSC_FUDGE_EL_14 0.833
FMCSC_FUDGE_ST_14 0.5
FMCSC_SC_ATTLJ 1.0
FMCSC_EPSRULE 2
FMCSC_SIGRULE 1
FMCSC_SC_POLAR 1.0
FMCSC_SC_BONDED_B 1.0
FMCSC_SC_BONDED_A 1.0
FMCSC_SC_BONDED_T 1.0
FMCSC_SC_BONDED_I 1.0
FMCSC_SC_EXTRA 0.0
FMCSC_IMPROPER_CONV 2
You are probably familiar with most of these terms by now (if not, click on the links). However, the interaction model needs to be tuned to limit its range such that the simulation can be reasonably efficient. Part of it is the cutoff scheme we already introduced bits of. However, truncation may be fine for true short-range interactions (inverse power potentials as in Lennard-Jones), but is inappropriate for Coulomb interactions decaying with r-1. In periodic systems, the infinite sum of Coulomb interactions within the "crystal" lattice can be approximated by a grid-based numerical implementation of the Ewald decomposition of this sum, commonly referred to as particle-mesh Ewald (PME) summation. This is what we are going to use:
FMCSC_LREL_MD 2
FMCSC_BSPLINE 8
FMCSC_EWFSPAC 2.0
FMCSC_EWFFTPLANNER 2 # this is the default
FMCSC_EWERFTOL 1.0e-9 # this tolerance is relevant only if "DISABLE_ERFTAB" was not used during compilation FMCSC_ELCUTOFF 10.0
The Fourier spacing (in Å) and the order of the B-spline approximation are the most common tunable parameters in PME. Note that when using an OpenMP-enabled CAMPARI executable, the threads parallelization of the FFTW library is used as well. Since the performance of this is not always easy to predict, keyword FMCSC_THREADS_TEST offers a means to do some testing. In dynamics, FMCSC_ELCUTOFF is the outer cutoff of the twin-range regime. Interactions in this regime are only recomputed every FMCSC_NBL_UP steps. For the PME method, it is most appropriate to use the same values for FMCSC_ELCUTOFF and FMCSC_NBCUTOFF due to the altered functional form. It is difficult to tune the parameter-dependent precision of the PME method exactly although it is usually relatively easy to find "safe" settings (where errors when comparing to a reference set of parameters are marginal). This tutorial will not discuss the many intricacies of the PME method. One should be clear, however, that any Ewald sum does imply that molecules "see" copies of themselves at large distance, which, unless there are many molecules of a type in the central cell, does not translate to a well-defined notion of concentration (since the relative orientation cannot change of images).
Lastly, our first run will be very short, and use a small time step since, as alluded to above, the structure we use is by no means equilibrated with respect to the Hamiltonian just defined. Change existing entries in prep.key as:
FMCSC_NRSTEPS 1000
FMCSC_PDBFILE MCsoaked.pdb
FMCSC_PDB_OUTPUTSTRING "a6,i5,1x,a4,a1,a3,1x,a1,i4,a1,3x,3f8.3" # or comment the keyword
Note that because our system is now fully flexible and we use FMCSC_SHAKEFROM to reinitialize constraints to the correct values, the extended pdb format is no longer required (the above setting is the standard one). And add or replace the following lines:
FMCSC_TIMESTEP 0.00005
FMCSC_TSTAT_TAU 0.0005
FMCSC_NBL_UP 1
FMCSC_UNSAFE 1 # to be able to try dynamics on clashy structure
FMCSC_CHECKFREQ 100 # change existing keyword
Because we expect a fast decrease in potential energy very quickly, we choose a very small time step (0.05fs) and a very short coupling time to the thermostat (0.5fs) along with neighbor list updates at every step. This will make the integration of equations of motion rather slow but is necessary to prevent the simulation from "blowing up" due to integrator instability. Because it would normally blow up, we have to set FMCSC_UNSAFE. There are other ways to deal with the issue of an unstable numerical integration. First, you could try to use the targeted relaxation procedure (see keyword FMCSC_TMD_RELAX). This is not too promising here since our clashes arise from the solvent, and the relaxation procedure works on individual degrees of freedom. Second, you could set FMCSC_THRESHOLD_TEMP to a low enough value. This keyword can help in rescuing a simulation on the verge of exploding but should not be used in production simulations. Here, the settings specified above should suffice. We can now run CAMPARI for the third time:
campari_threads -k prep.key >& log3
This again should only take a few minutes. Once complete, rename "PREP_END.pdb" one last time, this time to "equil.pdb" (which should be a standard pdb file that can be visualized using any molecular viewer). By inspecting "log3", you should confirm that the potential energy has dropped to a "large" negative number (very crudely, approaching an energy of -10kcal/mol per water molecule). Of course, the above run lasted only 50fs of actual simulation time, and was thus much to short to equilibrate other properties, e.g., the positions of ions. In case you wonder, note that there are several reasons why CAMPARI will not provide you with the raw speed that other software packages like GROMACS may. This will still be apparent in the actual production run that we move on to now.
Step 4 - Run the actual simulation and inspect the results
For the production run, we will adjust settings so that the simulation runs more efficiently. Change existing keywords as follows:
FMCSC_TIMESTEP 0.002
FMCSC_TSTAT_TAU 2.0
FMCSC_NBL_UP 5
FMCSC_NRSTEPS 250000
FMCSC_PDBFILE equil.pdb
FMCSC_PDB_INPUTSTRING "a6,a5,1x,a4,a1,a3,1x,a1,i4,a1,3x,3f8.3" # or comment this keyword
We will simulate for 250000 steps (500ps). We have also increased the update frequency for the neighbor list and adjusted time step and coupling constant of the thermostat. Note that in CAMPARI, the effective cutoff is almost always effectively larger than what is specified via the corresponding keywords. This is because neighbor relationships in CAMPARI are residue-based, not atom-based. This is both a nuisance (computationally more expensive) and necessary (Coulombic interactions should always be cut off in a group-consistent fashion). Of course, there are other protocols that can ensure the latter condition in a different manner. The only exception to this silent bloating of cutoffs are interactions that explicitly decay to zero at a given cutoff value such as the reaction-field methods for dealing with long-range electrostatics.
The only missing thing now are analysis specifications. Above, we have disabled all analyses by default (through keyword FMCSC_DISABLE_ANALYSIS, so you should enable whatever you are interested in collecting. Since the time needed per step is pretty high for this simulation, we do not expect analyses to become computationally limiting. Note that CAMPARI by default will define single-residue molecules as solvent, which excludes them from analysis. To avoid this, we need to resort to analysis group files. It is recommended to write such a file, "tutorial6.agp" that declares ions as solutes but leaves water as solvent. Then, append the key-file as follows:
FMCSC_XYZOUT 100
FMCSC_XYZMODE 2
FMCSC_XYZPDB 3
FMCSC_TOROUT 100
FMCSC_CONTACTCALC 50
FMCSC_PCCALC 50
FMCSC_EQUIL 50000
FMCSC_FLUSHTIME 1.0
These keywords enable several output files, including a contact map, site-site correlation functions based on centers of mass, and a trajectory file in dcd-format. For all of these, we request the first 100ps to be discarded as equilibration. Now execute:
campari_threads -k prep.key >& log4 &
Depending on the number of available threads, this will take quite some time to complete (hours or more). You can track progress from the output in THREADS.log. It is possible that the simulation crashes very early on, and this may indicate that the structure in "equil.pdb" still carries some minor clashes that compromise integrator stability. If this happens, or if an initial inspection of instantaneous temperatures in "log4" suggests to you that the system is running "too hot", you can repeat Step 3 form the beginning with a larger value for FMCSC_NRSTEPS. Alternatively, it should be a straightforward extension to simply repeat Step 3 with "equil.pdb" as input (but make sure that the input file, FMCSC_PDB_INPUTSTRING, and FMCSC_PDB_OUTPUTSTRING always remain consistent), This can also be done iteratively while gradually increasing the time step and the thermostat coupling time. In rare cases, a crash may also indicate that the input geometry of the DNA was altered due to bond angle or length violations. In this latter case, it may be required to increase the tolerance thresholds for this with the help of keywords FMCSC_PDB_TOLERANCE_B and FMCSC_PDB_TOLERANCE_A. All of the above issues can be diagnosed with the help of the initial output in "log4". This should be inspected right after the start of the simulation.
After the production simulation has finished, check the final output in "log4" to make sure everything went fine (look at the reported average temperature and so forth). As the next step, it is probably most informative to use VMD to have a look at the trajectory data (for this, it is important that the pdb files used as templates are in standard format):
vmd -e PREP_VIS.vmd
This should typically load the trajectory (irrespective of format) and create some common graphical representations. At first sight, you will mostly see water, but - assuming a certain familiarity with VMD - it should be straightforward to make the water invisible or to just visualize the solvent layer around DNA. Further things to try:
- Try to determine and plot RMSD of DNA over time (several ways)
- In addition to the ion-ion site-site correlation functions, try to obtain ion-phosphate ones using the trajectory analysis feature of CAMPARI
- Analyze sugar pucker distributions and (if applicable) isomerization frequencies using the output in FYC.dat that was created on-the-fly.