Tutorial 7: Advanced Simulation Methodology - Calculating the Free Energy of Solvation of Methanol in Water

Preface:

In this tutorial, we show how CAMPARI can be used to calculate the free energy of solvation of a net-neutral small molecule in aqueous solution (FES), specifically, the transfer free energy associated with the transfer from gas phase into dilute solution. Free energies are not typically observables that can be obtained from an individual simulation in a single state or phase. For the transfer from the gas phase to aqueous solution, we could theoretically attempt to measure the partition coefficient in a two phase simulation (akin to a vapor pressure measurement). This is generally impractical, however. Instead, the FES can be estimated by a computational trick: we scale the interactions between solute (small molecule) and solvent (water) between zero (as if the molecule were absent → gas-phase reference) and their normal strength (as if the molecule were fully present → dilute solution reference). In constant volume, constant temperature conditions, the two end states are connected via:

A(I→F) = AF - AI = -kbT·ln(<eβ·(UF-UI)>I)

Here, β-1=kbT, A denotes the Helmholtz free energy, and U the internal energy. This free energy perturbation (FEP) formula forms the basis for the more complex Bennett Acceptance Ratio method (BAR). It is generally taken advantage of microscopically by running a simulation at a single condition (for example I), but evaluating energies at both conditions F and I to accumulate the thermodynamic average in above equation. The convergence of this formula is very poor if the ensembles corresponding to the two conditions are very different. Therefore, the two end states of the solvation process are interpolated by a series of numerically appropriate, unphysical states. With free energy being a state function, the total free energy for the process can then be recovered as the sum of stepwise perturbations between sufficiently similar states:

A(I→F) = AF - AI = -kbT·Σi=1,N-1ln(<eβ·(Ui+1-Ui)>i)

By flipping states I and F, we can define a formally equivalent process that in practice often behaves somewhat differently (in terms of systematic errors and convergence properties). The reason lies in the directionality of the process (one is either "growing in" or "fading out" the solute molecule) coupled to the properties of the exponential average. The BAR method can be thought of as a self-consistent way to treat both directions at the same time, and is thus the superior estimator.
In summary, our tutorial will consist of a set of parallel but non-exchanging simulations able to determine the FES of methanol in water by scaling interactions according to a sufficiently fine schedule. It will cover the previously untouched field of free energy growth methods within CAMPARI and explain a few more details regarding multi-Hamiltonian simulations in parallel execution. Many details and algorithms will be touched upon briefly, and the reader is referred to the literature for appropriate background knowledge. This tutorial does aim to provide answers for why certain choices are necessary and others would be unacceptable. It naturally builds on both Tutorial5 and Tutorial6, which both deal with molecular dynamics simulations in liquid water.

Step-by-step:

Step 1 - Set up system (topologically) and parameters

Free energies of solvation will naturally depend upon the force field chosen to represent solute-solute, solute-solvent, and solvent-solvent interactions. Here, we settle on the OPLS-AA/L force field in conjunction with the TIP3P water model:

PARAMETERS <full path to CAMPARI directory>/params/oplsaal.prm

Take the reference settings for the Hamiltonian straight from the documentation:

FMCSC_UAMODEL 0
FMCSC_INTERMODEL 2
FMCSC_ELECMODEL 1
FMCSC_MODE_14 2
FMCSC_FUDGE_EL_14 0.5
FMCSC_FUDGE_ST_14 0.5
FMCSC_SC_IPP 1.0
FMCSC_SC_ATTLJ 1.0
FMCSC_EPSRULE 2
FMCSC_SIGRULE 2
FMCSC_SC_POLAR 1.0
FMCSC_SC_BONDED_B 1.0
FMCSC_SC_BONDED_A 1.0
FMCSC_SC_BONDED_T 1.0
FMCSC_SC_BONDED_I 1.0
FMCSC_SC_EXTRA 0.0
FMCSC_IMPROPER_CONV 2
FMCSC_TEMP 298.0

This is enough to ensure that we describe the interactions within the system according to the OPLS standard. In addition, we set the temperature to 298.0K. Next, we need to think about the system geometrically. In a condensed phase, boundary artifacts are most easily minimized by using periodic boundaries. Another concern are finite-size effects: i) we will want to use an appropriate cutoff scheme, which imposes certain minimal size requirements on our system, ii) numbers are more reliable and long-range corrections (see below for all of these) less relevant if the physical water bath is large enough, iii) our simulation will be performed in the canonical (NVT) ensemble and setting the right density requires some estimate of solute volume (which is different along the interpolation schedule) which would cause problems if the water bath does not possess a sufficiently large volume. Here, we choose a 32.0x32.0x32.0Å periodic, cubic box filled by a single methanol and 1084 water molecules:

FMCSC_BOUNDARY 1
FMCSC_SHAPE 1
FMCSC_SIZE 32.0 32.0 32.0
FMCSC_SEQFILE <full path to CAMPARI directory>/examples/tutorial7/tutorial7.seq

Feel free to experiment later on with these settings.

Step 2 - Set up interpolation conditions (in parallel):

So far, our key-file allows for the simulation of methanol in water with the solute fully interacting with solvent only. How do we enable simulations at different interpolated conditions?

FMCSC_GHOST 1
FMCSC_FEG_MODE 2
FMCSC_FEGFILE <full path to CAMPARI directory>/examples/tutorial7/tutorial7.ghost

These three keywords enable scaling of specific interactions between designated residues or molecules (file input). FMCSC_FEG_MODE controls what happens to non-bonded interactions between atoms that are both part of the designated molecule(s). For methanol, this is an issue we cannot entirely ignore since the polar hydrogen does interact with the aliphatic hydrogens, as they are separated by three bonds. We choose the mode corresponding to having those interactions be scaled as well. You will find in the literature that typically solute-solute interactions are not scaled which simplifies the corrections we have to apply. However, in such a case the framework for applying long-range electrostatic algorithms is not entirely consistent (for most users, it will be appropriate to ignore this detail).
For a simulation at a single condition, the following keywords control the details of solute-solvent interactions:

FMCSC_FEG_IPP 1.0
FMCSC_FEG_ATTLJ 0.55
FMCSC_FEG_POLAR 0.61
FMCSC_FEG_BONDED_B 1.0
FMCSC_FEG_BONDED_A 1.0
FMCSC_FEG_BONDED_T 1.0
FMCSC_FEG_BONDED_I 1.0
FMCSC_CBMODE 1
FMCSC_LJMODE 2

The first three keywords are what varies across different replicas in the parallel simulation in the end, and the values provided here are for illustration only (in all cases: setting to 0.0 will completely disable that component of the solute-solvent interactions, while unity corresponds to the fully interacting limit). The next four keywords are simply required to match the OPLS reference interaction model with respect to bonded interactions. The two remaining keywords determine how exactly interactions are scaled. For electrostatic interactions, linear scaling is appropriate, but for steric and dispersive terms a modified functional form is selected (a so-called "soft-core" potential). This entails additional parameters which we we will leave at default values (FMCSC_LJRAD, FMCSC_LJEXP, and FMCSC_LJSCEXP).
Previous tutorials showed how parallel runs are set up in which temperature varies across replicas. Here, we now require that parameters of the Hamiltonian controlling solute-solvent interactions vary across replicas.

FMCSC_REFILE <full path to CAMPARI directory>/examples/tutorial7/tutorial7.rex
FMCSC_REPLICAS 24
FMCSC_REDIM 3
FMCSC_REMC 1
FMCSC_REFREQ 20000000000

If you examine the replica input file, you will note a fairly specific schedule which interpolates between all three non-bonded interaction keywords being zero to all being unity. Cavity formation is usually the most critical step in "growing in" a solute so most the schedule is spent building up the value for FMCSC_FEG_IPP. Note how polar interactions are only allowed to be present if the solute is sterically fully present (causes numerical instabilities otherwise). There are not many rules for creating good schedules, but any calculation should use suitable metrics (described in the literature) to identify weak portions of the schedule and refine it accordingly (since systematic errors are incredibly difficult to diagnose in FES calculations). Here, we use a 24-point schedule and consequently have 24 replicas. We do not actually wish to use the replica-exchange functionality here, and therefore disable swaps by setting their interval parameter to something longer than the simulation itself.

Step 3 - Think about implementation / execution details that guarantee sufficient computational efficiency:

In order to actually perform any type of simulation, we have to deal withe the set of technicalities pertaining to cutoffs. Cutoffs are indispensable to almost any simulation by limiting the physical range of interactions in some (usually artificial) framework. Only then can large systems be simulated for any reasonable amount of steps or time. For the production runs, we will use plain cutoffs for steric and dispersive interactions, and the reaction-field method for electrostatic interactions. Given that the solute is neutral, this is appropriate:

FMCSC_CUTOFFMODE 3
FMCSC_NBCUTOFF 12.0
FMCSC_ELCUTOFF 12.0
FMCSC_LREL_MD 3
FMCSC_NBL_UP 5
FMCSC_GRIDDIM 8 8 8
FMCSC_GRIDMAXGPNB 512

For dynamics calculations, this corresponds to grid-based cutoffs with an update frequency for neighbor list every five steps, a single cutoff for all interactions at 12.0Å with reaction-field corrections applied to taper the electrostatic contributions to exactly zero at 12.0Å. Most settings here deal not with physical principles but with computational efficiency. However, for this type of application, some of them will have additional relevance. For instance, we choose the reaction-field method because it is compatible with free energy calculations of the type attempted here.

Step 4 - Create different random, somewhat equilibrated starting conformations for all conditions (in parallel):

Ultimately, for the condensed phase, we wish to use a dynamics method to sample the system. To create randomized structures, however, dynamics is inappropriate since it requires somewhat equilibrated structures to start with (the integrator will otherwise become unstable and the system will "explode"). Furthermore, we will want to restart the actual production simulation later on with each different replica using a different starting conformation. To achieve all this, we differ from the approaches used in Tutorial5 and Tutorial6, and instead resort to a bit of trickery that CAMPARI offers:

FMCSC_NRSTEPS 210000
FMCSC_EQUIL 500000
FMCSC_DISABLE_ANALYSIS 1
FMCSC_ENSEMBLE 1
FMCSC_DYNAMICS 5
FMCSC_TSTAT_TAU 0.1
FMCSC_TIMESTEP 0.001
FMCSC_CYCLE_MC_FIRST 200000 # together with two keywords below, ensures one MC segment of 200000 steps followed by one dynamics segment of 10000 steps
FMCSC_CYCLE_DYN_MIN 140000 # must not be larger than FMCSC_NRSTEPS
FMCSC_CYCLE_DYN_MAX 140000 # must not be larger than FMCSC_NRSTEPS
FMCSC_RSTOUT 210000
FMCSC_RANDOMIZE 1
FMCSC_RANDOMATTS 1000
FMCSC_RANDOMTHRESH 1.0 # the threshold is a mean residue-residue value
FMCSC_RIGIDFREQ 0.995

This set of keywords yields a hybrid Monte Carlo (MC) / molecular dynamics (MD) run in the canonical ensemble that starts with 2x105 MC steps from a truly random system followed by 104 MD steps. A restart file is written exactly once for all replicas at the end of the run. We define a small integration time step (1fs, to avoid integrator crashes in case the MC part leaves some sterically unfavorable contacts) and a small coupling time for the (default) thermostat. Note that the dynamics happen in rigid-body/torsional space as in Tutorial 5. Since the system is quite similar, we also use the same stability improvements here from the beginning:

FMCSC_TMD_INTEGRATOR 1
FMCSC_TMD_INT2UP 4

Note that during production runs it will be important that our integrator produces appropriate fluctuations for the potential energy in order for the analysis to be accurate (this is the reason why the use of the Berendsen thermostat would give rise to systematic errors here).
In this tutorial, we also want to explain the hybrid parallelization in CAMPARI briefly. As a simple rule (that applies to almost all functionalities), the MPI (distributed memory) layer handles communication between replicas (the explicit copies of a system used in methods like replica exchange or PIGS) whereas the OpenMP (shared memory) layer tries to speed up the propagation of each individual copy:

FMCSC_THREADS_VERBOSE 1 # level of diagnostic output
FMCSC_THREADS_DLB_FREQ 500 # restarts DLB segments every X steps
FMCSC_THREADS_DLB_STOP 300 # stops DLB for each segment after X steps
FMCSC_THREADS_DLB_EXT 1 # accumulate load data over X steps for DLB adjustments

To proceed, paste all keywords mentioned thus far in an appropriate file (say "fes.key"). Using such a hybrid MPI/OpenMP parallelization strategy on modern HPC (cluster) hardware is not always simple: individual compute nodes are partitioned into sockets (CPUs), physical cores per CPU, and hyperthreads per physical core (logical cores), and anywhere from 1 to nlc MPI processes can reasonably run on such a node (where nlc is the total number of logical cores). Thus, the fundamental question is how many cores to allocate per MPI process, which are then reserved for the OpenMP layer, and how to map them to the physical cores. Let us assume you have compute nodes with two sockets each holding a 4-core CPU. Per FMCSC_REPLICAS above, we require 24 replicas. If each runs on an entire CPU, we would need 12 nodes. A direct invocation of "mpirun" as part of a job submission script could look like this (OpenMPI syntax):

export CAMPDIR=<full path to CAMPARI bin-directory>
export OMP_NUM_THREADS=4
mpirun -np 24 -npernode 2 ${CAMPDIR}/campari_mpi_threads -k fes.key >& masterlog

This would be the minimal required settings and should work if the job scheduler (like SGE) granted the job 12 complete nodes. Theoretically, it may of course be necessary to guide MPI to the specific nodes you plan on using (this is rare for current generation schedulers, however). The command silently relies on mpirun distributing cores to MPI processes intelligently. This may not always work as desired, and thus the mapping may have to be done explicitly. Such an example with the srun command of the SLURM scheduler (which abstracts the mpirun command), could be as follows:

export CAMPDIR=<full path to CAMPARI bin-directory>
export SLURM_CPU_BIND=verbose
export OMP_NUM_THREADS=4
srun --unbuffered --ntasks-per-node 2 --cpus-per-task 4 --ntasks-per-core 1 -n 24 --cpu_bind=sockets ${CAMPDIR}/campari_mpi_threads -k fes.key >& masterlog

Again, this should work if the job scheduler (here, SLURM) granted the job 12 complete nodes. Note that the SLURM command is more explicit: it disables hyperthreading ("--ntasks-per-core 1") and it explicitly instructs the scheduler to limit the 4 threads of an MPI process to sockets wherever possible ("--cpu_bind=sockets"). This rule corresponds to a specific "mask" generated by srun. It may sometimes be necessary to create these masks from scratch, and there is source code for a program in the tools/ subdirectory to do this. In both examples, note the line "export OMP_NUM_THREADS=4", which is setting the standard environment variable for the number of OpenMP threads to 4. When using hybrid-parallel CAMPARI ("campari_mpi_threads"), the number of threads can no longer be set with keyword FMCSC_NRTHREADS. If only the MPI executable is available (see installation instructions) and/or there are not enough computing resources, you can resort to the simple invocation of MPI with one thread per MPI process, which has been discussed as part of Tutorial 3:

mpirun -np 24 campari_mpi -k fes.key

While it is theoretically possible to maintain more MPI processes than there are cores on a single machine, this overloads the machine, which 1) can sometimes crash the machine; and 2) means that MPI processes spend most of the time waiting, i.e., everything is slowed down. If you want to use this tutorial on a single computer (with multiple cores), simply adjust the value for FMCSC_REPLICAS to something reasonable - the only difference will be that in the end, the free energy change with some unphysical process is calculated and not the proper FES (since only the first so many entries in the replica input file are used). If CAMPARI complains about this (see N_000.log and so on ...), you may also have to adjust the value for FMCSC_REDIM Technically, the tutorial does not depend on the number of replicas.
In rare cases, CAMPARI may fail due to integrator instability. In this case, simply rerun the calculation (this behavior is due to the stochastic nature of the MC portion). The run should finish within an hour or so, but this may vary considerably depending on compilation options, hardware, etc. If successful, the latest structure for each condition is most accurately contained in the restart file which will be called N_0??_yourcalc_1.rst. Copy all of those files such that they drop the "_1", i.e., are called N_000_yourcalc.rst and so on.

Step 5 - Run the simulation:

The first thing we need to do is to instruct CAMPARI to use the restart files we created in the last step as input.

FMCSC_RESTART 1
FMCSC_RST_MC2MD 1

The second keyword above is a general instruction to use limited amounts of information from the restart file. We require some modification because we want to switch from an MC/MD hybrid calculation to an MD calculation (the alternative would be to remove two lines exclusive to the hybrid setup from each restart file, i.e., to fix the files by hand). Next, we need to think about the extent of the simulation. Here, we will simulate in straight rigid-body/torsional dynamics for 4ns with 1ns of equilibration using a 2fs integration time step. Hence, replace the entries for the corresponding keywords with the following:

FMCSC_NRSTEPS 2210000
FMCSC_EQUIL 710000
FMCSC_RSTOUT 50000
FMCSC_TIMESTEP 0.002
FMCSC_TSTAT_TAU 2.0
FMCSC_DYNAMICS 2

You may wonder why the simulation uses 2.21x106 instead of 2x106 steps. The reason is that the starting step will be read from the restart file which stopped at 2.1x105 at the end of the setup run. So to add a full 4ns to that, the number of steps will have to be adjusted accordingly. We also adjust the coupling time of the thermostat to a safe regime where suppression or alteration of fluctuations is quite unlikely (see reference).
The only thing missing are instructions to tell CAMPARI to produce the output required to be able to perform the necessary analyses. Part of this can be done automatically. In fact, CAMPARI estimates relative free energies between replicas via free energy perturbation as explained in the preface. Set:

FMCSC_REOLALL 0
FMCSC_REOLINST 5
FMCSC_REOLCALC 10

The above instructs CAMPARI to - every 10 integration steps - compute the energy for a structure evolved at a given condition in all neighboring Hamiltonians (where neighbor relationships are purely established by the order in which replicas are entered in the replica input file). It also instructs CAMPARI to print out instantaneous values for those "foreign energy" calculations every 50 steps.
Lastly, we only enable some diagnostic analyses:

FMCSC_ENSOUT 1000
FMCSC_CHECKFREQ 100

After applying these modifications/additions, you should now be able to execute the same command(s) as above, for example (for the direct MPI solution):

export CAMPDIR=<full path to CAMPARI bin-directory>
export OMP_NUM_THREADS=4
mpirun -np 24 -npernode 2 ${CAMPDIR}/campari_mpi_threads -k fes.key >& masterlog

This simulation will take a while to complete (you should check for the progress it made based on the output in the threads-diagnostic log-file or in standard log-output when using pure MPI).

Step 6 - Compute corrections:

Aside from systematic and stochastic errors introduced via insufficient sampling and/or a poor interpolation schedule, additional errors may arise because FES calculations are acutely sensitive to system fluctuations and details of the interaction model. This is why we use an integrator in conjunction with an appropriate thermostat here (Bussi et al.), which guarantees proper sampling of the canonical ensemble. Corrections, however, are needed because of the truncation of the interaction model (see section on cutoffs above). If we assume homogeneous solvent density, the truncation of Lennard-Jones interactions an be corrected for (the 12th repulsion contributes very little to this). This essentially corresponds to a definite integral (with the proper volume element) for each solute atom with each type of solvent atom (which for water is only the oxygen, as hydrogens maintain no steric or dispersive interactions). Care has to be taken in interpreting the truncation model correctly; in CAMPARI, cutoffs are implemented via buffered neighbor lists and (generally speaking) no second distance check is applied. This means that the true cutoff distance for methanol and water is actually 12.0Å plus the combined buffer distances. For our system, this correction comes out to be -0.062 kcal/mol.
The second type of correction was touched upon already as well. Given our choice for FMCSC_FEG_MODE, solute-solute interactions are integrated up along the interpolation schedule as well. They do not, however, form a component of the physical transfer process described and therefore must be subtracted. For this purpose, do the following:
  1. Create a sub-directory within your working directory with the name SELFCORR.
  2. Create a new sequence input file that just encodes a single methanol molecule (do not forget the "END").
  3. Copy the working key-file for your final run into the new subdirectory and edit it such that it points to the new sequence file.
  4. Further edit the key-file by removing the line with FMCSC_RESTART, and by setting the total number of steps to 2x106 and the number of equilibration steps to 0.
  5. In the new subdirectory, execute again "mpirun -np 24 campari_mpi -k fes.key".
This will perform an analogous simulation on a single methanol molecule in the gas-phase. After it is finished, switch back to your original working directory.

Step 7 - Analyze the results:

As mentioned before, CAMPARI performs free energy perturbation calculations automatically. In detail: In the language from above, both Ui+1 and Ui-1 are computed at condition i. This is then used to obtain the thermodynamic averages:

A(i→i+1) = -kbT·ln(<eβ·(Ui+1-Ui)>i)
A(i→i-1) = -kbT·ln(<eβ·(Ui-1-Ui)>i)

At the end of the simulation, these values are written to files called - for example for the very first replica - N_000_OVERLAP.dat. They immediately allow an estimation of the total free energy of the process via both forward and backward FEP (not including corrections):

/bin/bash
ffpr=`head -q -n1 N*OVERLAP.dat | awk -v k=0 '{k=k-$3; print k;}' | tail -n 1`
ffpf=`tail -q -n1 N*OVERLAP.dat | awk -v k=0 '{k=k+$3; print k;}' | tail -n 1`
ffpm=`echo "${ffpf} ${ffpr}" | awk '{print ($1+$2)/2.0}'`
echo -e "FEP forward: ${ffpf} kcal/mol ; FEP backward: ${ffpr} kcal/mol ; Average: ${ffpm} kcal/mol (corrections missing)"
exit

This assumes you run in a UNIX shell. Note that the sign for the reverse process is naturally swapped but has been adjusted in the above commands. Of course, you can also do this manually or any myriad number of alternative ways. As mentioned above, the BAR method will yield more rapidly convergent and more accurate results of the FES. Therefore, this tutorial provides a script for a free statistics package (R) that can compute the BAR estimates based on the generated data, specifically the files with instantaneous "foreign" energies. You use the script, you will need these files in the same folder layout as your run produced (i.e., with the subdirectory called "SELFCORR") and place the R-script in the folder with the data generated by the initial calculation in water. You can of course also edit it according to your needs (if you followed this tutorial exactly and operate in a UNIX environment, no changes should be necessary). In your working directory, start R, then:

doread <- TRUE
source("tutorial7.R")

This will print the final numbers to the screen (R-terminal). In most cases you will find that the forward BAR estimate lies between the FEP forward estimate and the negative of the FEP backward estimate (with the caveat that the numbers in N_000_OVERLAP.dat and so on rely on more data than we use for the BAR method, which is a result of keyword FMCSC_REOLINST). There is a lot more information to be analyzed, however, and we suggest - if you are interested in FES calculations per se - to extend the script (or write your own) to compute hysteresis errors (differences in forward and backward FEP estimates), plot forward and reverse work histograms (Crooks plots), and so on. As en exercise, it is suggested to compute the free energy of solvation at three different temperatures (say, spaced 10K apart). From these data, one can calculate a consistency check for enthalpy-entropy decompositions. Specifically, through either FEP or Bennett estimators, three separate free energy, enthalpy, and entropy estimates can be obtained. These data can then be compared to a simple fit of ΔA = ΔU - TΔS, which assumes a ΔU independent of T. Please consult our 2010 JPCB publication and references therein. Note that the integrator we use in this tutorial is more accurate than the one in the publication, and this means that the effective temperatures are slightly different.


Design downloaded from free website templates.