CAMPARI Tutorial 4

Tutorial 4: Testing Move Sets - How to Establish Global Balance in a Monte Carlo Move Set

Preface:

In conformational space sampling whether it be using Monte Carlo Sampling or Dynamics, one can define a "degree of freedom" as any system descriptor that changes during the sampling process. In Cartesian sampling the degrees of freedom are the X, Y, and Z positions of all atoms. In a sufficient space of internal coordinates (Z-matrix), the degrees for freedom for molecular systems can correspond to the bond lengths, bond angles and torsional angles of the structure. The transformation operation to convert between the two coordinate frames has to be well-defined (unique, reversible mapping). In multimolecular systems, intermolecular degrees of freedom could be expressed by pseudo-internal coordinates that, however, do not carry intuitive meaning. Therefore, it is just as well possible to define the reference frame for each molecule by using that molecule's first three atoms' coordinates. This implies six degrees of freedom (nine total minus one implied bond angle minus two implied bond lengths).

Monte Carlo sampling in CAMPARI happens explicitly in a rigid-body / torsional coordinate frame, i.e., an internal coordinate reference frame that used explicit constraints on bond lengths and bond angles. There are exceptions to the imposition of constraints, for example when using the simplified concerted approach of Ulmschneider and Jorgensen (see FMCSC_ANGCRFREQ), when sampling the puckering of 5-membered rings (see FMCSC_PKRFREQ and FMCSC_SUGARFREQ), and when using inexact chemical crosslinks (see FMCSC_CRLK_MODE). In general, however, the Monte Carlo degrees of freedom sampled in CAMPARI are the torsional angles of the backbone, the torsional angles of sidechains, and the rigid-body coordinates of all molecules (rotation and translation).

The simplest type of sampling move changes the value of only one degree of freedom in completely random (unbiased) fashion as an elementary step. CAMPARI also implements several advanced sampling techniques, such as concerted rotations, i.e. concerted (coupled) changes to several degrees of freedom at once in order to satisfy some external constraint. These complex moves usually serve to increase the efficiency of sampling certain regions of conformational space that are easily only accessible through correlated changes in several degrees of freedom at once (and not via multiple, independent perturbations of individual degrees of freedom). The domain of a single degree of freedom is the range of all possible values it can assume.

We can therefore define "ergodicity" for the move set as the requirement to - at least theoretically - be able to construct a chain of Monte Carlo moves, which allow the system to visit any point in phase space given the coordinate frame phase space is defined in. We can also define that "global balance" holds if each point in phase space is visited with equal likelihood in the absence of an energetic bias (Boltzmann weights are all equal).
In reality, ergodicity is usually easier to understand conceptually given a move set. Numerically, establishing ergodicity and global balance require the same basic methodology. Note that a move set consists of different move types, and that it is quite possible for individual move types to fail the "ergodicity" requirement if considered by themselves. Such a case exists for instance with exact concerted rotation moves for which the ends of the chain are strictly unable to move relative to one another (and which therefore have to combined with other move types to yield an ergodic move set).
Global balance is much trickier to assert conceptually, and often entails the use of elaborate corrections in the MC acceptance step. To circumvent having to perform any numerical tests of ergodicity and global balance, it is often formally proven that the move set obeys detailed balance, a sufficient but not necessary condition ensuring global balance. For this reason and the reason that such a formal proof does not constitute a test of an actual implementation, one can and should perform basic numerical tests as is done in this tutorial.

Frequently, it will only be possible to collect 1D-histograms for most of the degrees of freedom in the infinite temperature (maximum entropy) limit, i.e. in the absence of a Hamiltonian. When plotting those histograms, we expect them to be flat within some level of noise stemming from the finite length of the simulation. This does not address other parts of global balance such as the statistical independence of conceptually independent degrees of freedom. For example, a move set that randomly sets the φ/ψ-angles in a polypeptide to the same random value would appear balanced when analyzing 1D-histograms only. If we, however, create a 2D-histogram of φ/ψ-values it would become immediately clear that such a move is not ergodic, and will not - even when mixed with other moves - produce a globally balanced distribution. This example shows that ultimately it is the N_f-dimensional distribution function that is of interest to us (N_f being the number of degrees of freedom) that is of course impossible to analyze directly for any system of appreciable size.

In the following, we will guide you through the process of testing whether a given CAMPARI move set appears unbiased when considering low-dimensional projections. With explicit knowledge of the level of coupling introduced by the move set, this can often be enough to safely conclude that it is in fact globally balanced.

Step-by-Step:

Step 1 - Study and understand the components of the move set

Inspect the section titled "# Monte Carlo Sampler" in the key-file for this tutorial found in the corresponding subdirectory (example/tutorial4/). The move set in CAMPARI is a binary decision tree with the parameters for each binary decision being controllable through a single "frequency" keyword. Binary decisions toward the root of the tree will naturally effect the overall frequency of moves accessible in all downstream branches.

The primary classes of moves are controlled by the following keywords (ordered hierarchically starting at the root):
FMCSC_PARTICLEFLUCFREQ - Moves for the grand-canonical ensemble (sampling particle number as degree of freedom)
FMCSC_RIGIDFREQ - Moves for rigid-body coordinates of molecules
FMCSC_CHIFREQ - Moves for dihedral angles in branches (sidechains) of polymers
FMCSC_CRFREQ - Concerted rotation moves for polypeptides
FMCSC_OMEGAFREQ - ω-angle sampling for polypeptides
FMCSC_NUCFREQ - Sampling moves for nucleic acids
FMCSC_PKRFREQ - Sampling moves for pucker angles in polypeptides
FMCSC_OTHERFREQ - Sampling moves for single dihedral angles
φ/ψ-moves - The "drop-through" moves are polypeptide φ/ψ-moves

It is important to realize that some of these moves contain much more complex subbranches than others. For instance, ω-moves for polypeptides are only split into two variants - stepwise vs. fully randomizing (→ FMCSC_OMEGARDFREQ). Conversely, the nucleic acid branch contains several different moves (→ FMCSC_NUCCRFREQ or FMCSC_SUGARFREQ). This tutorial tests elements in all of the above branches with the exception of the branch pertaining to particle number fluctuations. If you inspect the key-file further, you may also note that:

Alternative implementations of φ/ψ-moves for polypeptides are not tested (→ FMCSC_COUPLE or FMCSC_PIVOTMODE).
Concerted rotation for polypeptides with bond angle variations is disabled (→ FMCSC_ANGCRFREQ).
No exclusive rotations (without translation) of rigid-bodies are attempted (→ FMCSC_ROTFREQ).
Moves operating on unsupported residues in CAMPARI (compare Tutorial 12) are excluded (→ FMCSC_OTHERUNKFREQ).
Titration pseudo-moves are omitted (→ FMCSC_PHFREQ).
Only a single setting for each dependent control keyword will be explored (e.g., FMCSC_NRCHI).

Some features are skipped because they do not qualify as proper MC moves, because usage is discouraged (quasi-obsolete features), or because they are incompatible with other more important features. Take the time to read up on the move types that are going to be in use as a conceptual understanding of the moves is indispensable in order to interpret the results. You should be familiar with all of the dependent keywords. If you plan to test a specific move set meant to be used in actual simulations, simply replace the appropriate settings.

If you inspect the section titled "# Hamiltonian" in the key-file, you will note that two energy terms are in fact enabled. These are bonded potentials. Bond length energies should remain rigorously constant throughout the simulation (this can be used as a test to the rigorous enforcement of the implied constraints). Bond angle terms are needed for two algorithms dedicated to the sampling of ring pucker states (→ FMCSC_PKRFREQ and FMCSC_SUGARFREQ) that fail numerically if extreme values for bond angles were allowed. Given an implicit set of constraints (here, constant bond lengths all around), it is not at all easy to define a meaningful expectation for the infinite temperature ensemble either. We will therefore sample ring pucker states given a (strictly local) Hamiltonian. The property of bonded potentials to only affect those degrees of freedom they act on directly is very important since it allows the remainder of the results to stay meaningful.

Step 2 - Run the simulation to obtain distribution (histogram) data on degrees of freedom

Copy all the files from the examples/tutorial4/ subdirectory into a writable working directory. Adjust any file paths as needed. Note that the sequence file in use corresponds to an arbitrary system meant to allow for testing most of the possible move types in a single simulation.

The actual calculation is run by:

campari -k MOVESETS.key >& movesets.log
Even though only extremely limited energy evaluations are needed, the calculation will need some time to complete due to the computational cost of exact concerted rotation moves (→ FMCSC_TORCRFREQ and FMCSC_NUCCRFREQ) and, to a lesser extent, the cost of data collection routines. That said, the run should not take more than a few hours at the specified length. While it is certainly possible to shorten runtime by reducing NRSTEPS, a proper result in statistical terms, even when using projections only, does in fact require a fairly lengthy simulation. If you have wider resources available, this calculation is perfectly suited for the simple MPI averaging mode that pools data across copies of the system, which are all being propagated under identical conditions.

Step 3 - Analyze distribution data

Once the simulation is finished, we can use the simulation output to collect evidence that the move set works properly. It is important to note that from the construction of the move set we know that certain degrees of freedom will always be sampled independently of one another. Therefore, we can also analyze them independently:

Rigid-body coordinates:
It is rather inconvenient to obtain the spatial distribution function with respect to the laboratory frame, so we will instead analyze the relative distribution functions by means of pair correlation functions g(r) obtained between the centers of mass of the individual molecules. For this, simply plot all columns in output file RBC_PC.dat from the third one up against the intermolecular distance in the first column. Apart from noise, they should all be flat and equal to unity everywhere. We do not test rigid-body rotation here (see below).
Sidechain dihedral (χ-)angles:
Freely rotatable χ-angles are sampled by at most two types of moves, viz., dedicated sidechain moves and single torsion pivot moves on native CAMPARI degrees of freedom. This implies that only very basic moves are applied to these dihedral angles. The first thing you can do is to open output file INTERNAL_COORDS.idx, and to search manually for the correct entries in the dihedrals section. Then, plot the appropriate columns in the final containing torsional histograms, viz., INTHISTS_DI.dat. All the corresponding histograms should be flat. To test for independence between concurrently sampled χ-angles, open output file FYC.dat, and identify two or more columns giving the values for χ-angles on the same residue. Given the chosen output mode for this file, you may require a Z-matrix file for reference. Construct 2D or higher histograms and visualize them; they again should be flat.
Pucker states:
Variables characterizing saliently the pucker state of each flexible ring in the system (including bond angles) would be reported in output file FYC.dat for the other output mode. In our case, only dihedral angles will be listed. Therefore, for bond angles, plot the data in INTHISTS_BA.dat to look at the 1D histograms directly. Use INTERNAL_COORDS.idx to identify the correct columns. Plot the implied probability densities along with the correct weights expected for an isolated (i.e., not coupled to other degrees of freedom by a set of constraints) bond angle of each type. These can be constructed as p(α)=exp[-U(α)/k_BT_sim]/(Q·δ_bin). Here, k_BT_sim is the thermal energy, U(α) is the bonded potential in use for a particular angle, Q is the appropriate normalization constant (partition function), and δ_bin is the bin size used to construct the simulation histogram. By having specified FMCSC_BONDREPORT to be true, the log file will contain a section listing explicitly which bond angle potential acts on which combination of atoms. With this number, consult the parameter file in use (abs3.2_charmm36.prm here) and the proper documentation to find parameters and functional form that will allow you to compute U(α) explicitly. This is only an approximate test meant to identify major flaws (quantitative matches cannot be expected). It should not be used for cases where the set of coupled bond angle parameters produces stress in conjunction with the fixed bond lengths (can be diagnosed by isolating the residue in question and evaluating the total bond angle energy in ENERGY.dat of a new simulation). For the particular case here, the analysis for proline is probably more meaningful than that for nucleotides. Note that potentials acting on dependent angle terms involving atoms bound covalently to ring atoms also influence the distributions. All the pucker degrees of freedom are coupled, and probing their independence is meaningless (although visualizing their level of coupling may be instructive). As one additional test, you can compare the dihedral angle statistics for two proline residues with opposite chirality and make sure they are mirror images of one another (for example for the C_i-1-N-C_α-C_β torsion).
Polypeptide φ/ψ-angles:
Visualize the 2D-histogram corresponding to the φ/ψ-distribution for the lysine residue in the second molecule in the sequence file. This residue is only sampled by regular pivot φ/ψ-moves. The output file should be called "RESRAMA_00013.dat". Next, do the same for the other residue for which we also requested residue-specific φ/ψ-maps (via keyword FMCSC_RAMARES). This residue is part of a longer polypeptide and therefore sampled by all types of polypeptide backbone moves that we enabled. All of the histograms should appear flat. Conversely, the overall φ/ψ-distribution cannot be flat because it includes the φ-angles of proline residues. Lastly, find entries corresponding to φ/ψ-angles by either of the two methods used above and (construct and) visualize the corresponding 1D histograms. Inspect those histograms corresponding to the first polypeptide with particular care as the biased move types applied to it are much more likely to introduce minor errors. Feel free to attempt constructing higher-dimensional distribution functions (the theoretical level of coupling for concerted rotation moves is too high, though) from the data in FYC.dat.
Nucleic acid backbone dihedral angles:
Proceed exactly as for φ/ψ-angles. The only exceptions are that you need to create all 2D (or higher) histograms yourself, and that the reference system using only pivot-type nucleic acid backbone moves is the free deoxycytidine monophosphate (CMP). The pivot moves come from two different classes, viz., single torsion pivot moves on native CAMPARI degrees of freedom and dedicated nucleic acid pivot moves.
Other dihedral angles:
Freely rotatable dihedral angles not covered thus far are sample by exactly one move type, i.e. single torsion pivot moves. These moves are very simple, and biases are highly unlikely. Nevertheless, proceed as before using the output in FYC.dat and INTHISTS_DI.dat to produce or inspect the relevant histograms. Use a Z-matrix file to identify the indices associated with these degrees of freedom. Since these degrees of freedom are accessed individually and not on a per-residue basis (as for most other moves), there is no theoretical foundation to assume correlations to appear.

If desired, the simulation can of course be modified to offer much more output that can be used to assess the sanity of move sets through alternative routes. Keep in mind, however, that more complex and indirect functions of degrees of freedom may help with analyses of coupling but may also mask errors due to projections.

The following aspects are not at all tested when following the instructions above:

Rigid-body rotation: Testing this is not trivial, but is crudely done by taking a molecule with non-spherical symmetry and applying only rotation moves to it. The resultant gyration tensor with respect to the static laboratory frame should then be isotropic.
Absolute spatial sampling with respect to the laboratory frame. This can be analyzed by performing spatial density analysis and subsequent visualization. Note that this tutorial uses 3D periodic boundary conditions. In restraining boundary conditions, there is an additional complication related to randomizing rigid-body moves (see also discussion on keyword FMCSC_RIGIDRDBUF).
Modified move sets that are conceptually identical but use (drastically) different choices for dependent parameters.
Additional move set options as mentioned previously.

Aside from these caveats, it should inspire confidence if all the performed tests suggest that the move set is in fact ergodic and globally balanced. This is an important test since it can be very tricky to diagnose/isolate a problem with the move set itself in actual production simulations.