CAMPARI Keywords
Full Keywords Index:
- Parameter File:
- Random Number Generator:
- Simulation Setup:
- Box Settings:
- Integrator Controls (MD/BD/LD/Minimization):
- Move Set Controls (MC):
- Files and Directories:
- Structure Input and Manipulation:
- Energy Terms:
- Cutoff Settings:
- Parallel Settings (MPI and OpenMP):
- Output and Analysis:
- NetCDF Data Mining:
The overall setup of simulations becomes more and more involved and complicated with increasing numbers of options offered by simulation software, and CAMPARI is no exception here. Not all settings are relevant in all circumstances (in fact, often very few are), and a complete understanding of all keywords is clearly not required to use subsets of CAMPARI's functionality. Since version 4, CAMPARI ships with an interactive tool to create key-file templates. This tool, written in Ruby, is not complete but will be able to create a solid starting point for an actual key-file for a simulation or analysis task. Importantly, it automatically embeds comments into the key-file, which serve a twofold purpose: first, as a miniature documentation for active keywords and, second, as a list of associated keywords which are left at default values (are commented out).
Users should keep the following points in mind when writing or editing key-files:
- Most keywords have default choices. In case of doubt, check parsekey.f90 to locate the variable associated with the selection, and then initial.f90, allocate.f90, and sometimes other files for default assignments.
- Not all keywords can be connected and arranged such that they group nicely. The documentation here groups keywords into a small number of sections, some of which end up being very large. This has both advantages and disadvantages.
- For navigation, it is highly recommended to a) search for terms within the page with the help of the browser (all keywords are described within a single html-page), b) follow the links that are provided everywhere.
- If an option is unclear, but easily testable, it is probably fastest to just try it out. If it is difficult to test, post a question on the SF forums. Every successful execution of CAMPARI prints a summary of the attempted calculation to log-output once the initialization and setup phase is complete.
- The understanding of many implemented, standard methodologies requires the corresponding literature. This is why a bibliography is provided.
- The fastest way to learn how to run basic simulations or perform trajectory analyses is to consult the various tutorials. Tutorials offer the chance to group information in a more natural workflow compared to the documentation here. They cannot explain all options in detail, though, and it is crucial to follow the links within the tutorial pages that point back to this and the other documentation pages.
Notes on Nomenclature and File Parsing:
All keywords used by CAMPARI are named FMCSC_* where the different possible strings for "*" are explained below. This means that in your key-file the correct keyword to use to specify the simulation temperature is FMCSC_TEMP and not just "TEMP". There are only two exceptions to this, viz. keywords PARAMETERS and RANDOMSEED. This has purely historical reasons (as does the ad libitum acronym "FMCSC").
The beginning of log output will print some information regarding the parsing of information in the key-file. Superfluous lines should be masked as comments using the hash character ("#"). Lines that are neither empty nor comments will be pointed out unless they correspond to the two exceptional keywords just mentioned or unless they begin with the canonical prefix "FMCSC_". The keyword parser operates hierarchically, meaning that some legitimate keywords will not be processed because the required base functionality has not been enabled (e.g., thermostat settings are not processed unless a gradient-based method is in use). This is done mostly to avoid needless warnings from popping up. All apparent keywords that have not been processed will be reported by the parser. However, the hierarchical dependence is not enforced stringently, which means that a keyword not reported in this list but appearing in the key-file does not automatically control a setting relevant to the attempted calculation. It is important to realize that the list of unprocessed keywords can also include misspelled ones. To make the detection of typos easier, it is recommended to comment out or remove unused keywords from the key-file.
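The prefix rules described above can be checked before a run is launched. The following is a minimal Python sketch and not CAMPARI's actual parser: the function name classify_keyfile_lines and the path in the example are hypothetical, and only the rules stated above are applied (comments start with "#", keywords begin with "FMCSC_" or are one of PARAMETERS and RANDOMSEED). It cannot tell whether a well-formed keyword is misspelled or whether it will actually be processed.

```python
EXCEPTIONS = {"PARAMETERS", "RANDOMSEED"}  # the two keywords without the FMCSC_ prefix


def classify_keyfile_lines(text):
    """Sort key-file lines into keywords and lines that would be pointed out.

    Returns (keywords, suspicious): keywords is a list of (line_no, keyword),
    suspicious is a list of (line_no, stripped_line).
    """
    keywords, suspicious = [], []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        stripped = raw.strip()
        if not stripped or stripped.startswith("#"):
            continue  # empty lines and comments are skipped
        first = stripped.split()[0]
        if first.startswith("FMCSC_") or first in EXCEPTIONS:
            keywords.append((line_no, first))
        else:
            suspicious.append((line_no, stripped))
    return keywords, suspicious


# Hypothetical key-file fragment; the parameter file path is a placeholder.
example = """\
# hypothetical key-file fragment
PARAMETERS /path/to/abs4.2_charmm36.prm
FMCSC_TEMP 298.0
TEMP 298.0
"""
keywords, suspicious = classify_keyfile_lines(example)
for line_no, line in suspicious:
    print(f"line {line_no}: not a comment and no FMCSC_ prefix: {line}")
```

In this fragment only the bare "TEMP" line is flagged, which mirrors the FMCSC_TEMP example given above.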
Finally, most read operations of simulation settings are prone to data type mismatch errors. Supplying a character value to a numerical setting will trigger a Fortran I/O error. The error message is usually informative yet the relevant position in the key-file is not reported. I/O in general (also for input files) may be made less error-sensitive in the future, but for now we apologize for this limitation.
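Because the position of a type mismatch is not reported, it can help to screen numerical settings before starting CAMPARI. The Python sketch below is a hedged illustration only: the function name find_non_numeric is hypothetical, the set of keywords expecting numbers has to be supplied by the user (CAMPARI does not export such a list as far as this page states), and the test via float() only approximates Fortran's reading rules.

```python
def find_non_numeric(text, numeric_keywords):
    """Return (line_no, keyword, value) for numeric keywords with unparsable values."""
    problems = []
    for line_no, raw in enumerate(text.splitlines(), start=1):
        content = raw.split("#", 1)[0].split()  # drop comments, then tokenize
        if len(content) < 2 or content[0] not in numeric_keywords:
            continue
        keyword, value = content[0], content[1]
        try:
            # Map Fortran-style double precision exponents (1.0d0) to Python syntax.
            float(value.lower().replace("d", "e"))
        except ValueError:
            problems.append((line_no, keyword, value))
    return problems


# FMCSC_TEMP is taken from the text; treating it as numeric is the only assumption here.
keyfile = "FMCSC_TEMP 298.0\nFMCSC_TEMP warm  # not a number\n"
problems = find_non_numeric(keyfile, {"FMCSC_TEMP"})
for line_no, keyword, value in problems:
    print(f"line {line_no}: {keyword} expects a number, found '{value}'")
```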
Parameter File Keywords:
This keyword allows the user to provide the location and name of the parameter file to be used for the simulation. The different files offered by default (shipped with CAMPARI) are listed below:
Custom Parameter Sets:
The parameter sets fmsmc*.prm are outdated and should be used with utmost caution. They contain no bonded parameters except dummy declarations and are therefore only suitable for torsional space calculations.
In general, the Lennard-Jones parameters for ions in the files starting with abs3.1 and abs3.2 require a cautionary note as they simply are those from Aqvist's work. They have not been specifically parameterized to work together with the ABSINTH continuum solvation model in case a full Hamiltonian is used (they merely have been shown to reside on the "safe" side). This is a matter of ongoing development. It may be more appropriate to use parameters for ions that feature harder cores and better congruence between σii parameters and actual contact distances such as those in the files starting with abs3.3, abs3.4, or abs4. Similarly, the original offsets on the free energies of solvation for charged moieties on polymers as found in the files starting with 3.2 are very large (-30 kcal/mol) and generally disallow salt bridge formation. The parameter revisions in the later models (abs3.4, abs4.2) have alleviated this somewhat by reducing the offsets to -15 kcal/mol.
These are basic parameters fit for
simulations in the excluded volume ensemble. As Lennard-Jones
parameters, they employ Hopfinger radii with generic (and generally
small) interaction parameters. They contain a reduced charge
set derived from the OPLS brand of force fields but are thoroughly
unsuitable for simulations with "complete" Hamiltonians if just
for the fact that they lack support in many places.
This file is identical to fmsmc.prm only
that pairwise LJ-terms (σij) for pairs involving
a polar atom and a polar hydrogen are specifically reduced. It also
lacks support for phosphorus.
This file is identical to fmsmc_exp.prm only
that LJ interaction parameters (εii) are raised
for polar heavy atoms (nitrogen and oxygen).
This file is identical to fmsmc_exp3.prm only
that LJ size parameters (σii) for common
atoms are bloated to approximately 107% which makes the parameter set
more OPLS-AA-like in terms of LJ parameters.
This file combines ABSINTH LJ parameters
with the full OPLS-AA/L charges including
the Kaminski et al. revision. OPLS-AA/L's bonded parameters are
only retained inasmuch as they are
required to maintain quasi-rigid geometries (i.e., bond length
and angle potentials, improper dihedral potentials,
and torsional potentials around bonds with hindered rotation). Comparison to the
reference parameter set may be useful. In
addition, the free energies of solvation are
reduced by ~30 kcal/mol for ionic groups on biomolecules. This is the
file used for most published work employing
the ABSINTH implicit solvation model thus far.
This file is identical to abs3.2_opls.prm
only that the free energies of solvation are
not artificially lowered by ~30 kcal/mol for ionic groups on biomolecules.
This parameter file is identical to abs4.2_charmm36.prm below
except that partial charges and required bonded parameters are taken from
the OPLS-AA/L force field. Note that this
parameter file has not been used in any published work as of late 2020.
This file is identical to abs4.2_opls.prm
only that the free energies of solvation for charged moieties on polymers are not lowered artificially by
-15 kcal/mol. As a result, ionic groups are likely to associate strongly in often artificial behavior
(such as sequestering mobile ions into condensates).
This file combines ABSINTH LJ parameters
with the full CHARMM charges from version 22 (polypeptides)
and 27 (polynucleotides), respectively. CHARMM's bonded parameters are
only retained inasmuch as they are
required to maintain quasi-rigid geometries (i.e., bond length
and angle potentials, improper dihedral potentials,
and torsional potentials around bonds with hindered rotation). Comparison to the
reference parameter set may be useful. In
addition, the free energies of solvation are
reduced by ~30 kcal/mol for ionic groups on biomolecules. In
conjunction with the ABSINTH implicit solvent model, CHARMM parameters
probably offer the best combination of simplicity (small enough dipole
groups) and completeness (support for both
nucleotides and peptides as well as most terminal groups and some small
This file is identical to abs3.2_charmm.prm
only that the free energies of solvation are
not artificially lowered by ~30 kcal/mol for ionic groups on biomolecules.
This file is identical to abs3.2_charmm.prm
only that the parent force field is CHARMM36 and not CHARMM22/27
(see the documentation for CHARMM36 below).
This file is identical to abs3.1_charmm.prm
only that the parent force field is CHARMM36 and not CHARMM22/27
(see the documentation for CHARMM36 below).
This file is the set of current reference parameters as published in the paper introducing
the small molecule extension of ABSINTH. Compared to abs3.2_charmm36.prm, it
features the introduction of new atom types, e.g., for describing halogen atoms in organic halides,
significant updates to Lennard-Jones parameters, and revisions to the free energy of solvation parameters
(following, for charged model compounds and ions, Kelly et al.), which includes
a reduction of the offsets on charged moieties in polymers from -30 kcal/mol to -15 kcal/mol. Partial
charges and bonded parameters are identical to abs3.2_charmm36.prm and taken from
the CHARMM36 reference as described below.
As of October 2020, there are no published works simulating
conformational equilibria of ordered and/or disordered polypeptides using abs4.2_charmm36.prm although
it has been used in unpublished work in these types of applications and found to be reasonably conservative compared to
abs3.2_charmm36.prm or abs3.2_opls.prm.
The applications in the reference publication are on predicting
small molecule interactions with folded (and largely constrained) proteins. This published work
is obtained with a choice for SCRMODEL of 1 (group-consistent screening),
and we recommend using it throughout with the abs4.* parameters. Note that this screening model
is theoretically superior to the purely atom-based screening of the original model
but does require the recognition of meaningful charge groups.
This file is identical to abs4.2_charmm36.prm
only that the free energies of solvation for charged moieties on polymers are not lowered artificially by
-15 kcal/mol. As a result, ionic groups are likely to associate strongly in often artificial behavior
(such as sequestering mobile ions into condensates).
This file is identical to abs3.1_charmm36.prm
only that some Lennard-Jones and free energy of solvation parameters have been adjusted. This file is
considered developmental and superseded by abs4.2_charmm36.prm.
As mentioned in the disclaimer above, the ion parameters in the "3.1" and "3.2" files
were somewhat of a weakness, and this is at least partially addressed here, both at the
level of reference free energies of solvation and at the level of Lennard-Jones parameters.
The problems these revisions try to address are related to the problems discussed, encountered, and
at least partially addressed in both Arnon et al. and
Choi et al., albeit by different means. Broadly speaking,
some of the Lennard-Jones size parameters of the "3.2" generation are a bit small leading to
overly compact assemblies and to too much local steric flexibility. The poor transferability
of parameters for inorganic ions is discussed in Mao and Pappu,
but the particular solutions presented there are not implemented in any of the standard CAMPARI parameter
This file is identical to abs3.3_charmm36.prm
only that the free energies of solvation are artificially lowered by ~15 kcal/mol for ionic groups on
biomolecules. This offset is only half as large as that in abs3.2_charmm36.prm,
which means that transient ionic interactions are more likely to occur. This lower offset
is possible in part because of the changes to Lennard-Jones parameters relative
to abs3.2_charmm36.prm.
This file combines ABSINTH LJ parameters
with the full AMBER charge set from the '94-revision
(Cornell et al.). AMBER charges are generally not well-suited
to be used in conjunction with the ABSINTH paradigm since the latter is
most meaningful for small dipole groups with local neutrality. AMBER
charges are determined by a more or less
unconstrained QM-fit and spread polarization across the (arbitrary)
unit of each residue (see FMCSC_ELECMODEL).
Bonded parameters are only retained inasmuch as they are
required to maintain quasi-rigid geometries (i.e., bond length
and angle potentials, improper dihedral potentials,
and torsional potentials around bonds with hindered rotation). Comparison to the
reference parameter set may be useful. In
addition, the free energies of solvation are
reduced by ~30 kcal/mol for ionic groups on biomolecules. When using AMBER charges
in conjunction with ABSINTH (keyword FMCSC_ELECMODEL is 2), it is recommended to evaluate nonzero values for the detection tolerance. The impact of this choice can be visualized using the output file DIPOLE_GROUPS.vmd. Please refer to the details provided for AMBER reference force fields below in order to obtain answers concerning AMBER-specific implementation details of force field parameters.
This file is identical to abs3.2_a94.prm
except that the free energies of solvation are
not artificially lowered by ~30 kcal/mol for ionic groups on biomolecules.
abs3.2_a99.prm, abs3.1_a99.prm, abs3.2_a03.prm, abs3.1_a03.prm:
These files are analogous to abs3.2_a94.prm
and abs3.1_a94.prm only that they incorporate AMBER
parameters of revisions '99 (Wang et al., abs3.2_a99.prm,
abs3.1_a99.prm) and '03 (Duan et al., abs3.2_a03.prm,
abs3.1_a03.prm), respectively.
This file combines ABSINTH LJ parameters
with full GROMOS53a6 charges. The only modifications are to terminal nucleic acid
residues to avoid polar atoms (like the hydroxyl H on the 5'-phosphate) carrying a zero charge
and to avoid fractional charges for these terminal residues.
Note that GROMOS53 is a united atom model and that aliphatic hydrogens
(which do exist here) therefore carry no charge.
This is inconsistent - at least compared to other force fields in
which aliphatic hydrogens almost
universally carry a small positive charge of less than 0.1e -
but might speed up simulations with
screened electrostatics interactions. Bonded parameters
are only retained inasmuch as they are
required to maintain quasi-rigid geometries (i.e., bond length
and angle potentials, improper dihedral potentials,
and torsional potentials around bonds with hindered rotation). Comparison to the
reference parameter set may be useful. In
addition, the free energies of solvation are
reduced by ~30 kcal/mol for ionic groups on biomolecules.
This file is identical to abs3.2_GR53a6.prm
except that the free energies of solvation are
not artificially lowered by ~30 kcal/mol for ionic groups on biomolecules.
abs3.2_GR53a5.prm and abs3.1_GR53a5.prm:
These files are analogous to abs3.2_GR53a6.prm and
abs3.1_GR53a6.prm only for
the a5-revision of the GROMOS53 charge set.
In order to employ these parameter files in a small molecule screen, it is necessary to have a map from the atom types in the files to the Sybyl atom types used in standard mol2-files. This map can be determined automatically, but the guessing procedure is crude. A better approach is to provide this map manually using keyword SYBYLLJMAP. The input file documentation explains this in detail, but, for convenience, a map suitable for 4.2 parameters is provided with CAMPARI (as abs4.2.ljmap in the "params" folder).
Some recommended settings controlling energy terms and the interaction model suitable for using any of these custom parameter files are listed below. Note that these are also the settings required to achieve an exact match with the original ABSINTH reference. Note as well that this implies several defaults that are not listed explicitly.
We do, however, strongly recommend replacing FMCSC_SC_EXTRA being unity with FMCSC_SC_BONDED_T set to unity since the above files will typically contain (unless otherwise noted) the required and "native" bonded potentials for each parent force field. This ensures better parameter coherence (the ones used for SC_EXTRA are taken from OPLSAA/L) and - more importantly - control over all torsional potentials (and bonded potentials in general) through the parameter file. If the system to be sampled contains proline residues, other flexible rings, or chemical crosslinks, it will also be necessary to set FMCSC_SC_BONDED_A, FMCSC_SC_BONDED_B, and FMCSC_SC_BONDED_I to 1.0 to avoid obtaining nonsensical results. Depending on the force field, even a simple extension like allowing the C-N dihedral angle in primary amides to be sampled (see OTHERFREQ and TMD_UNKMODE) might require improper dihedral angle potentials. Thus, it is recommended practice to always enable all bonded terms except CMAP corrections even if the enabled potentials end up being largely redundant. An updated set of choices compatible with the above parameter files is:
Note that the parameters specific to the solvation model are discussed elsewhere.
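As a hedged illustration of the bonded-term recommendation above, the following Python sketch emits key-file lines that switch off FMCSC_SC_EXTRA and set the bonded terms named in the text to 1.0. The helper name bonded_term_lines is hypothetical, the value 0.0 for FMCSC_SC_EXTRA is our reading of "replacing", and the output is not the complete set of recommended settings (the interaction model, cutoffs, CMAP handling, and solvation model parameters are not covered).

```python
def bonded_term_lines(flexible_rings_or_crosslinks=True):
    """Build key-file lines for the bonded-term recommendation (illustrative only)."""
    settings = {
        "FMCSC_SC_EXTRA": 0.0,     # assumption: "replaced" is read as switched off
        "FMCSC_SC_BONDED_T": 1.0,  # torsional potentials controlled via the parameter file
    }
    if flexible_rings_or_crosslinks:
        # The text asks for these with proline, other flexible rings, or crosslinks.
        for key in ("FMCSC_SC_BONDED_A", "FMCSC_SC_BONDED_B", "FMCSC_SC_BONDED_I"):
            settings[key] = 1.0
    return [f"{key} {value:.1f}" for key, value in settings.items()]


print("\n".join(bonded_term_lines()))
```

Since the text recommends always enabling all bonded terms except CMAP corrections, the default of the sketch includes the three additional terms even when they end up being largely redundant.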
Reference Parameter Sets:
The parameter sets below attempt to be as complete as possible for the biopolymer types supported by CAMPARI. In general, support for small molecules (which often use derived parameters) will often be limited (but can easily be added by the user). In addition, rare and generally poorly parameterized biopolymer constructs (such as zwitterionic amino acids or free nucleosides) may have incomplete parameter portings in particular of bonded parameters. If a perfect match of a certain parameter set paradigm cannot be achieved (against the reference implementation), this is stated explicitly.
oplsaal.prm (reference implementation: GROMACS 4.5.2)
This file provides full OPLS-AA/L
parameters, i.e., it includes the Kaminski et al. revision
of peptide torsions and sulphur parameters. Note that GROMACS 4.5.2 was used as the reference
implementation (and not BOSS or MCPRO).
Required settings for emulating reference standard:
GROM53a6.prm, GROM53a5.prm (reference implementation: GROMACS 4.0.5)
This file provides full
GROMOS53 parameters. Torsional
potentials for which
the same biotype is attached multiple times to an axis atom are only
supported by replacing the potential acting on just an arbitrary and
single one of those atoms in the GROMACS reference implementation with
proportionally reduced potentials acting on all of those atoms.
This should be chemically more correct but prevents exact matches of
torsional terms. The choice within GROMOS is motivated by computational
efficiency, but evaluation of torsional terms is not a time-critical
execution component in almost all present-day simulations (and trivially parallelizable).
Moreover, cap- and terminal residues may have been adjusted to use more
consistent parameters (terminal and cap residues are generally not specifically
parameterized in GROMOS from what we can tell, in particular for polynucleotides).
GROMOS uses a rather specific interaction model and represents
aliphatic CHn moieties
in united-atom representation. Note that revisions a5 and a6 only
differ in a few partial
charge parameters.
Required settings for emulating reference standard:
amber94.prm, amber99.prm, amber03.prm (reference implementation: AMBER port in GROMACS 4.5.2)
These files provide full AMBER parameters in
three different revisions which differ
mostly in their parameterization of torsional potentials for
polypeptides. Note that support for terminal
amino acid residues through the parameter file is marginal since AMBER's charge set is so
detailed that each atom in each
terminal residue would have to be an independent biotype. Normal
polypeptide caps are fully supported, however. To allow a more accurate emulation
of the AMBER standard for terminal polypeptide residues, the
charge patch functionality within CAMPARI can be used. We have tested this for
a few examples, and recovered 100% accurate matches to the AMBER standard that way. Keep in mind as well
that the parameterization of terminal polymer residues is often the "sloppiest" component in a
biomolecular force field since their impact on overall conformational equilibria is deemed small. Note
that we did not use the actual AMBER software in the porting.
Required settings for emulating reference standard (skipping eventual charge patches):
charmm.prm (reference implementations: CHARMM35b2 and CHARMM38b1)
This file provides access to simulation employing the full CHARMM parameters
as provided in parameter set 27 for polypeptides and polynucleotides.
CMAP corrections for
polypeptides are supported and included. Note that <ABSINTH_HOME>
should be the exact same
directory specified in the localization of the Makefile (see installation instructions). To simulate polynucleotides
with 5'-phosphate groups using 100% authentic CHARMM parameters for the
terminal phosphate, the
charge patch functionality within CAMPARI has to be used. The same applies to
the polarization on the hydrogen atoms on the NH2 groups in
guanine and cytosine (this is a much smaller effect, though; also compare FMCSC_AMIDEPOL).
Similarly, the use of the amidated (NH2) C-terminus in polypeptides requires use of the
biotype patch and other patch functionalities.
CAMPARI's port of CHARMM parameters generally offers the most complete support for the systems supported natively
by CAMPARI, e.g., for phosphorylated amino acid sidechains.
Required settings for emulating reference standard:
FMCSC_AMIDEPOL 0.01 # or -0.01
charmm36.prm (reference implementations: CHARMM38b1 and CHARMM39b1)
This file incorporates the various revisions of the CHARMM force field contained in parameter set 36.
All other comments made for parameter set 27 apply here as well.
Required settings for emulating reference standard:
FMCSC_AMIDEPOL 0.01 # or -0.01
In order to create a new parameter file, it is advisable to start
with "template.prm". For details on the paradigms underlying the
construction of a parameter file, consult the detailed documentation on this topic.
Random Number Generator Keywords:
This keyword allows the user to provide a specific seed for the PRNG. This is usually relevant in two contexts:
- Reproducibility:
Eliminate mismatches between different versions of the program (for example) by doing the stringent test that the results must be exactly the same if the PRNG is seeded with the same seed. Such tests may occasionally be hampered by a lack of precision in any input files and in particular by different compiler/architecture optimization levels.
Eliminate identical calculations if jobs are submitted simultaneously. Normally the PRNG uses a seed derived from system time, which can be identical if jobs are submitted exactly in parallel. Avoiding this behavior by specifying different values for RANDOMSEED is only adequate if the jobs are indeed submitted as individual, serial jobs. Conversely, in intrinsically parallel applications (MPI), CAMPARI uses the node number to vary the seed across different nodes unless RANDOMSEED is specified. This means that a provided value for RANDOMSEED will homogenize the PRNG across all replicas, which is almost always undesirable.
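For serial jobs submitted at the same moment, distinct seeds can be written into each key-file ahead of time. The Python sketch below shows one hedged way to generate the corresponding lines; the helper name seed_lines, the base seed, and the spacing between seeds are arbitrary illustrative choices, and the range of values CAMPARI accepts for RANDOMSEED is not checked here. For MPI runs, in line with the caveat above, the keyword would normally be left out.

```python
def seed_lines(n_jobs, base_seed=10000, stride=7919):
    """Return one 'RANDOMSEED <value>' key-file line per serial job.

    base_seed and stride are arbitrary; any set of distinct values
    serves the purpose described in the text.
    """
    if n_jobs < 1:
        raise ValueError("n_jobs must be positive")
    return [f"RANDOMSEED {base_seed + job * stride}" for job in range(n_jobs)]


for line in seed_lines(3):
    print(line)
```

Recording the seed used for each job also supports the reproducibility use case, since a run can later be repeated with the same value.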
Simulation Setup:
This keyword is a simple but very important switch. It allows the user to control whether non-polar hydrogens are going to be part of the system's topology or not. In particular in earlier simulation work, it was a common and convenient trick to improve simulation efficiency by uniting all atoms of a methyl or methylene group into a single, coarse-grained "united atom". Different force fields used or use different varieties of this trick. In the GROMOS line of force fields, for instance, all aliphatic hydrogen atoms are merged into the carbon atoms they are bonded to. Conversely, the CHARMM19 protein force field in addition eliminates non-polar hydrogens bound to sp2-hybridized carbon atoms in aromatic rings.
Unlike other simulation software, CAMPARI maintains a complete internal "knowledge" of biomolecular topology of those systems it allows the user to build from scratch. Therefore, choosing between all- or united-atom models is not simply a matter of parameter files (although it is possible to create inefficient united-atom variants of force fields by disabling all interaction parameters pertaining to the required hydrogens). Instead, the software itself requires knowledge of this choice.
Choices are:
- Use an all-atom model for those molecules represented explicitly.
- Use a united-atom model according to GROMOS convention, i.e., all aliphatic hydrogen atoms are merged into the carbon atoms they are linked to (this does include terminal aldehyde hydrogen atoms).
- Use a united-atom model according to CHARMM19 convention, i.e., all aliphatic and all aromatic hydrogens bound to carbon atoms are merged into the latter.
Outside of simulations using the GROMOS force field, this keyword is most useful when using CAMPARI to analyze trajectory data generated by other software using such a united-atom force field. Such a run would not tolerate atom number mismatches between the internal representation of the system and what is found in the binary trajectory files (mismatches are acceptable only if the input format is pdb → see below). Note that this keyword has no impact on systems involving residues not supported natively by CAMPARI (→ sequence input and PDB input).
This keyword is a simple but very important logical. It specifies whether the proposed simulation is a trajectory analysis run: in these, a pdb- (or xtc-, dcd-, NetCDF, PostgreSQL)-trajectory is read from file and analyzed with CAMPARI's internal analysis routines. The desired format is chosen with keyword PDB_FORMAT. All outputs and parameters are completely analogous to normal calculations. Essentially, the snapshot read-in replaces the sampling step. This means that low analysis frequencies will be desirable, since usually the number of snapshots will be relatively small compared to the number of simulation steps in a typical simulation. Note that - in particular for large systems (> 10^4 atoms) - the analysis run may be slowed down by:
- Certain time-consuming analyses scale poorly with the number of atoms (solution structure analyses, see for example PCCALC or CLUSTERCALC).
- At each step, the global system energy is calculated using - depending on the setting for DYNAMICS - either CAMPARI's energy (MC) or force (MD/LD) routines and making little to no simplifying assumptions. To ensure decent speed, this may require setting the system Hamiltonian to zero (see below) and/or using an efficient cutoff / neighbor-list routine (see CUTOFFMODE).
- Very large files in particular in pdb-format may cause memory shortages which slow down the machine entirely. In general binary trajectory files in conjunction with an optional template file are the preferred and much faster way of performing analysis runs.
When using an MPI executable of CAMPARI in parallel, it is also possible to perform trajectory analysis across many processors. This uses the replica exchange setup and is described in detail elsewhere. The four primary applications are simultaneous analyses of several trajectories, the unscrambling of replica exchange trajectories that are normally output continuously for a given condition, the post facto computation of energetic overlap distributions, and the evaluation of the PIGS heuristic for analysis purposes. Specific analysis routines (such as DSSP analysis) may be restricted to specific types of residues, and this may limit the utility of these routines for entities that are not natively supported by CAMPARI (see sequence input). In general, analysis runs on systems featuring unsupported residues should be relatively straightforward. This is true at least as long as no energetic analyses are required (which naturally entails the complex issue of parameterization).
Analysis runs can also utilize the shared memory (OpenMP) parallelization of CAMPARI. As is described elsewhere, this decomposes the workload for many time-intensive tasks that CAMPARI can perform. For analysis functionalities, the load per step for an individual analysis is often so low that it is not effective to let multiple threads operate on it. This is why most simple analysis functions are not parallelized per se but simply performed simultaneously. This is obviously ineffective if only a single such analysis is needed. Currently, the only exceptions are certain analyses related to structural clustering and the calculation of spatial density maps. Important analyses that can be time-consuming but are not parallelized per se are those controlled by CONTACTCALC, DSSPCALC, RHCALC, SAVCALC, and DIFFRCALC. As a general comment, it should be noted that CAMPARI will always spend some of its execution time dealing with coordinate operations. Depending on the chosen settings, there may also be a large contribution from evaluating energies at every step. While the former is never parallel in analysis runs and constitutes a hidden performance bottleneck (just as file I/O does), the latter takes full advantage of the parallelization offered in regular simulations. These considerations should be kept in mind when deciding whether to use the OpenMP code in an analysis setting. Generally, it will of course be more efficient to parallelize in snapshot space by letting the MPI version operate on separate pieces of a longer trajectory. As is generally the case, the MPI and OpenMP frameworks can also be used simultaneously in analysis runs with the standard hierarchy (each MPI process maintains a separate copy of the system, and the processing of each copy can be sped up by using more than one OpenMP thread per MPI process).
This simple logical keyword instructs CAMPARI to serialize a given calculation across a changing set of systems. These systems are composed of a fixed part (which can be nil) and an exchangeable part (which must be present). This exchangeable part has to be a single molecule consisting of exactly one residue, which is defined as a placeholder molecule at the end of sequence input (e.g., LIG_N_C). The various molecules substituted for the placeholder molecule are defined in a single input file in Sybyl mol2 format. The data structures are allocated only once, and the maximum number of atoms for any single molecule in the library should be specified by keyword MOL2MAXSIZE. The number of individual calculations performed will be (roughly, see below) equivalent to the number of molecules in the mol2 file. If this number is large and/or time control is needed, keyword MOL2MAXTIME can be used to abort the screen after a set time.
At the level of an individual calculation, there is little difference to a normal CAMPARI run, and the majority of keywords are understood and processed in exactly the same way. This means that it is possible to run any type of CAMPARI-supported calculation (except trajectory analysis) in an automatic throughput fashion. The differences to normal execution are as follows:
- The placeholder molecule per se is irrelevant, but any feature desired for the molecules in the mol2 file must also be enabled, at least formally, for the placeholder molecule. In some cases, CAMPARI will explicitly recognize that missing parameters can be ignored; in other cases, it may be necessary to supply a suitable patch to circumvent this issue. A pdb file (or at least template) must be present and specified even when there is no fixed part (receptor). This is despite the fact that absolutely no information regarding the placeholder is extracted from this file (this is a current technical limitation).
- The parameterization of the molecules in the mol2 file happens in different ways, most of which are automatic. For the Coulomb potential and the ABSINTH DMFI, the mol2 file itself is the source of information (see elsewhere). For bonded potentials (e.g., bond angle potentials), CAMPARI relies on the input geometry and a chemistry parser that will apply dihedral angle potentials in the spirit of the ABSINTH force field paradigm. The amount of guessing and assigning is controlled by keyword MOL2BONDMODE. Lennard-Jones parameters are guessed based on Sybyl atom types provided on input and the atom types in the parameter file. It is therefore important to use correct values for the nucleus number in the fifth column in the atom type section. The way the CAMPARI atom types are matched to Sybyl types is explained further in the context of keyword SYBYLLJMAP, which provides a facility to override the automatic mapping. Other required parameters for the ABSINTH model and exclusion rules are determined automatically in agreement with keyword settings (e.g., INTERMODEL). Some keywords exist to limit the volatility of inferred parameters, in particular with respect to the details of the specific conformation found in the input mol2 file, such as MOL2RESPECTBOND and MOL2CUTMODE.
- A general complication is introduced by system subset-specific adjustments, e.g., constraints. The problem is that different molecules in the mol2 file will have different numbers of degrees of freedom, different numbers of atoms, etc., which is why it is not generally feasible to control these things globally. Consequently, the majority of customizations apply to the fixed part of the system only, whereas the exchangeable part requires special keywords. Currently, these keywords are MOL2FRZMODE, MOL2CUTMODE, MOL2BONDMODE, MOL2RESPECTBOND, MOL2POLMODE, MOL2CLUMODE, MOL2DRESTMODE, and MOL2EMMODE. Keyword MOL2FOCUS is special in that it controls constraints for the fixed part in response to the reference molecule, which is the first screened molecule and either found in the main input file (as the first molecule) or in a separate input file.
- The reference molecule also allows the definition of a reference substructure to perform a more targeted search for the screened molecules. This means that chemical information can be used to align matching parts of the screened molecules with the reference substructure to enable an anchor- or tether-driven search procedure (see elsewhere and keyword MOL2ASSIMILAR). The alignment happens prior to the actual calculation. Keyword MOL2DRESTMODE is also responsive to this information and usually essential for maintaining the anchoring.
- Many output files will continue to be appended with information from different molecules (step numbers are reset), e.g. ENERGY.dat. The main output file of the screen is a mol2 file itself that has relevant conformers for the variable part of the system possibly in addition to a selected subset of atoms from the fixed part. What types of conformers are written to the output file is controlled by keyword MOL2OUTMODE. They can include the minimum energy and final conformers for each screened molecule, for example. Other conformers can be obtained by enabling structural clustering, which is adjustable for the small molecule screen using keywords MOL2CLUMODE and MOL2THRESH. This option can require a lot of memory if NRSTEPS is large, the set in MOL2AUXINDEX is large, and/or CCOLLECT is small. Energies associated with poses can be collected in a dedicated manner by means of keyword MOL2ENMODE, which includes cluster-based, averaged energies. Note that the base name of the calculation is the user-defined one for this file only. For other output, e.g., trajectory output, individual files may be generated per screened molecule, which is usually undesirable. In particular, keyword MOL2PDBMODE will create PDB output of the final conformations of the entire system for each molecule.
- While resembling a simulation with an unsupported residue in many ways, the setup routines in the small molecule screen will try harder to deal with the input. This implies that atoms are reordered to be able to construct a hierarchical Z matrix and guess dihedral angle degrees of freedom correctly. In addition, the mol2 file provides connectivity information, which is used.
- Most analyses are not supported, a modified version of structural clustering being the exception.
When using the shared memory parallelization of CAMPARI, the enhancements are restricted entirely to the sampling part that is part of CAMPARI's general engine. Conversely, the specific setup routines for the small molecules do not benefit from the shared memory parallelization at all (and this includes calculations of initial energies and forces). Since it is almost always the case that a small molecule screen tests many small molecules, it is consequently recommended to use a single thread per process and to parallelize manually (split the input files) or via the MPI code. A special execution mode for a small molecule screen is a scoring calculation. Here, many setup steps are skipped, there is no sampling, and the threads parallelization never plays a role (in this mode, the number of threads can only be 1).
If CAMPARI was compiled with MPI support (see installation instructions), a small molecule screen has access to three different execution modes. First, each molecule can be operated on redundantly by the chosen number of MPI processes in parallel (keyword MPIAVG is 1). This is useful when the sampling needs to be exhaustive, and it implies that data are pooled for structural clustering. Second, the input file is read by a master process, and the molecules are distributed dynamically to different MPI processes. In this mode (REMC is 1 and MOL2ISSPLIT is 0), the master process is often idle, and it may prove worthwhile to specify a number of replicas as N+1 where N is the formally available number of computing cores to run on. Here, every molecule is worked on by exactly one MPI process, and the results are completely independent. Third, each replica reads separate input files (REMC is 1 and MOL2ISSPLIT is 1). No pooling of results occurs, and no load balancing can be guaranteed. This mode is particularly useful for parallel scoring runs following a parallel screening run of the second type (which produces separate output files). Note that the two main approaches (pooling vs. no pooling) seldom yield equivalent results. If multiple processes operate on the same molecule, there is a putative benefit of different randomized starting positions that is never available in the second approach.
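The relation between the keyword settings and the resulting execution mode can be summarized in a short sketch. The function names and mode labels are hypothetical and serve only as a reading aid. The second helper illustrates the "N_xxx_" prefix rule described for split input files; the three-digit width is an assumption read off the "xxx" placeholder, and the path handling assumes POSIX-style paths.

```python
import posixpath


def screen_execution_mode(mpiavg, remc, mol2issplit):
    """Name the execution mode implied by three on/off settings (sketch)."""
    if mpiavg and remc:
        # This combination is not covered by the description above;
        # rejecting it is a choice of this sketch only.
        raise ValueError("MPIAVG and REMC both on: not described here")
    if mpiavg:
        return "redundant"    # all processes work on each molecule, data pooled
    if remc and not mol2issplit:
        return "distributed"  # master hands molecules to processes dynamically
    if remc and mol2issplit:
        return "split-input"  # each replica reads its own input file
    return "serial"           # no MPI-specific mode requested


def replica_input_name(base_path, replica, width=3):
    """Prepend 'N_xxx_' to the file name part of base_path.

    The directory part is kept as is. width=3 is an assumption.
    """
    directory, name = posixpath.split(base_path)
    return posixpath.join(directory, f"N_{replica:0{width}d}_{name}")


print(screen_execution_mode(False, True, True))
print(replica_input_name("/data/screen/library.mol2", 0))
```

With a hypothetical base name of /data/screen/library.mol2, the first replica would thus look for a file with the prefix inserted between the directory and the file name.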
If a small molecule screen was requested, and CAMPARI was compiled with MPI support (see installation instructions), this simple logical keyword (1 means "on") specifies that the name supplied as the main input file is to be interpreted as a base name. CAMPARI will systematically construct names for distinct input files from this base name by separating any possible file system path component and prepending the standard prefix "N_xxx_" where "xxx" is the number of individual replicas (starting with 0 and using left-padding with zeros). Note that this keyword is only relevant in MPI runs where REMC is 1.
MOL2VERBOSE
If a small molecule screen was requested, this keyword controls the amount of information written to log output. A value of 0 suppresses all log information except important warnings. Permissible values range from 0 to 4, and the default is 1. Note that the log output can grow very quickly in such a calculation. This keyword has no impact at all on the dedicated output control keywords MOL2OUTMODE and MOL2PDBMODE.
MOL2FILE
This keyword provides location and name of the main input file for a small molecule screen. CAMPARI expects a Sybyl mol2 file in standard format such as those obtained from public repositories. A file containing 1500 molecules will cause CAMPARI to perform the same calculation on a fixed system plus each of the molecules in this file for a total of 1500 runs. The only exceptions to this are possible in parallel execution modes. If keyword MPIAVG is turned on, the same calculation will be run 1500 times but each run is a parallel run pooling data from multiple replicas. Conversely, if keyword REMC is turned on and keyword MOL2ISSPLIT is false, the 1500 individual calculations will be distributed across multiple replicas. Lastly, if both REMC and MOL2ISSPLIT are turned on, CAMPARI expects separate input files for the different replicas (as it does, e.g., for keyword PDB_MPIMANY).
Additional information encoded in the mol2 file and used by CAMPARI includes partial charges (if the Coulomb potential is in use) and group assignments for the ABSINTH DMFI, which works in conjunction with a library input file. The group assignments are integer assignments in the substructure column with the integer referring to a specific solvation group in the library file. To be able to distinguish two identical solvation groups within the same molecule, the column with the substructure name should have running integers per molecule for each individual group. See elsewhere for further details.
CAMPARI supports some options in a small molecule screen that derive information on restraints, constraints, and/or alignment from a reference molecule. This reference molecule either has to be the first molecule in this input file, or it can be supplied as a separate file with the help of keyword MOL2_REFMOL. The latter is generally more convenient. Prominent keywords that stand in relation to the information extraction for the reference molecule are MOL2FOCUS, MOL2DRESTMODE, MOL2ASSIMILAR, and MOL2PRUNEMODE.
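As an illustration of how the input-related keywords might appear together in a key-file, the following sketch renders a few entries as text. It assumes the "FMCSC_" prefix used for key-file entries elsewhere in this documentation; the helper name, file names, and values are hypothetical, and the fragment is not a complete key-file (the keyword enabling the screen itself is omitted).

```python
def keyfile_lines(settings, prefix="FMCSC_"):
    """Render keyword/value pairs as key-file lines (one entry per line)."""
    return [f"{prefix}{key} {value}" for key, value in settings.items()]


# Hypothetical values for illustration only.
screen_inputs = {
    "MOL2FILE": "library.mol2",      # screened library
    "MOL2_REFMOL": "reference.mol2", # separate reference molecule
    "MOL2MAXSIZE": 120,              # largest molecule expected, in atoms
    "MOL2VERBOSE": 1,                # default log verbosity
}

print("\n".join(keyfile_lines(screen_inputs)))
```

Supplying the reference molecule through its own file, as done here, corresponds to the variant described above as generally more convenient.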
This keyword provides location and name of an auxiliary input file in Sybyl mol2 format. It is supposed to contain a single molecule that serves as a reference molecule. If there is a fixed part of the system (receptor), the coordinates of this molecule should correspond to a desired relative orientation (binding pose). Details are provided elsewhere. Prominent keywords that stand in relation to the information extraction for the reference molecule are MOL2FOCUS, MOL2DRESTMODE, MOL2ASSIMILAR, and MOL2PRUNEMODE.
MOL2SCOREONLY
If a small molecule screen was requested, this keyword allows selecting a special mode of operation that disables all sampling activities (i.e., dynamics and Monte Carlo propagators). As such, it is somewhat analogous to trajectory analysis mode, which is not independently supported for small molecule screens. In scoring mode, the energy of the system (separated by terms) is evaluated and reported (nonzero terms only). This happens in 3 ways: i) for the actual combination of the fixed part of the system (if any) and each small molecule in exactly the (absolute) conformation in which it is found in the mol2 input file; ii) for the fixed part of the system alone; iii) for the small molecule alone. This corresponds to a simple energy scoring analysis of a binding process of perfectly rigid molecules, and the corresponding difference (complex - fixed part - small molecule) is reported as well. It is generally not useful to enable this option in the absence of a fixed part or for meaningless coordinates of the small molecules. By default, the fixed part will have the same conformation throughout. This behavior can be altered by keyword MOL2_PDB_RELAXED, which allows the specification of a file with an additional conformation of the fixed part assumed to be the correct ligand-free state.
If the mol2 input file contains additional coordinates belonging to the fixed part of the system (see keyword MOL2AUXINPUT), the interpretation is slightly more complicated for more than one molecule. This is because, without specifying an input file for MOL2_PDB_RELAXED, the implied reference state of the receptor changes, which is not generally justifiable (because, physicochemically speaking, the ligand-free state cannot differ for different ligands). Having to recalculate the ligand-free receptor energies also increases the cost of the calculation somewhat.
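The difference reported in scoring mode (complex - fixed part - small molecule) can be written out term by term. The following is a minimal sketch with made-up numbers: the term names, the values, and the function name are hypothetical, and dropping zero-valued differences is a choice of the sketch rather than documented behavior.

```python
def rigid_binding_score(complex_terms, fixed_terms, ligand_terms):
    """Return (per-term differences, total) for complex - fixed - ligand.

    Each argument maps an energy term name to its value; missing terms
    are treated as zero. Differences that are exactly zero are omitted.
    """
    names = set(complex_terms) | set(fixed_terms) | set(ligand_terms)
    diff = {}
    for name in sorted(names):
        value = (complex_terms.get(name, 0.0)
                 - fixed_terms.get(name, 0.0)
                 - ligand_terms.get(name, 0.0))
        if value != 0.0:
            diff[name] = value
    return diff, sum(diff.values())


# Made-up energies for the three evaluations.
complex_e = {"LJ": -50.0, "COULOMB": -30.0, "TORSION": 4.5}
fixed_e = {"LJ": -40.0, "COULOMB": -20.0, "TORSION": 3.0}
ligand_e = {"LJ": -4.0, "COULOMB": -2.0, "TORSION": 1.5}

per_term, total = rigid_binding_score(complex_e, fixed_e, ligand_e)
print(per_term, total)
```

In this example, the purely intramolecular torsional term cancels exactly because both partners are treated as rigid, which is the assumption of complete rigidity inherent to this mode.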
Because sampling activities are disabled, this mode of operation can speed up the calculation by disabling parts of the initial setup (useful when working with large receptors) and by combining what are essentially 3 separate calculations into (ideally) a single one. It should reproduce the same values as a manual execution of these 3 calculations. Because some terms explicitly couple different molecules (e.g., the DSSP potential), not all energy terms are supported in this mode, and corresponding warnings and errors are produced. The resultant energies imply a complete neglect of entropy except for what may be captured by "effective" energies (e.g., from implicit solvent models). Compared to the physical process of interest, this generally includes conformational entropy differences of both receptor and small molecule as well as entropy differences due to solvent (including other solution species such as ions). The reported energy values also imply that the complex structure is unique, which is often a poor approximation (i.e., the single point value should really be an integral over a small but finite domain). Moreover, the aforementioned assumption of complete rigidity implies that the unbound and bound states of the ligand differ in their intrinsic state (strain), and this energetic contribution cannot be captured either.
Lastly, care has to be taken when mixing the values for different scoring runs or from sampling and scoring runs with each other, for example when sampling the relaxed state of the ligand separately. In this case, unjustifiable errors can creep in because the parameterization is somewhat conformation-dependent. The largest impact is usually on guessed torsional potentials (see MOL2BONDMODE, SC_BONDED_T, and, especially, PLANAR_TOLS). For instance, the starting ligand in an actual screen might have two aromatic rings connected by a rotatable bond planar relative to each other while in a docked conformation the same rings are out-of-plane, leading to differing assignments of torsional potentials. The second significant impact can stem from cutoffs: if keyword MOL2CUTMODE is 1 (or, to a lesser extent, 3), the choice of reference atom and the (buffered) radius are conformation-dependent. This matters in particular for Coulomb interactions with the detailed impact dependent on cutoff distance and the treatment of long-range interactions (see LREL_MC and LREL_MD). In addition, some atomic solvation parameters relevant for the ABSINTH DMFI and screened Coulomb interactions can be weakly conformation-dependent, but the impact of these is usually small. For circumventing some of these problems, keyword MOL2ENMODE is available to instruct CAMPARI to write (partial) scores already during the run. Of course, in the output of a rescoring run, the treatment is consistent for an individual rescored pose, but the issue persists as soon as different poses of the same molecules are compared to each other in terms of their scores.
If a small molecule screen was requested, this keyword lets the user specify the maximum size (in number of atoms, including dummy particles) for an individual molecule to expect in the screened library. This is required because some arrays are allocated only once at the beginning.
MOL2MAXTIME
If a small molecule screen was requested, this keyword lets the user specify the maximum time in hours for the (main part of the) screen. Once this time is exceeded, CAMPARI will perform a clean, premature termination. This functionality is primarily meant for cases where CAMPARI is run in a computing environment with limited queue times and using non-redundant MPI parallelism. In these cases, the distribution of molecules from the master process to the replicas leads to ambiguity in case of an unclean termination (forced cancellation) because, while the "furthest" molecule in the screened library is usually identifiable, this does not imply that all molecules "before" have been fully processed. This ambiguity does not exist to the same extent in a screen with redundant MPI parallelism, without MPI parallelism, or when distinct input files are requested for every MPI process.
To simplify restarting the screen, the unprocessed molecules are written to a new output file (or set of new output files), and this is reported in the main log output. In addition, upon starting the calculation again, the standard output files (poses in mol2 format and associated scores) are appended by default. Some redundant information might be present in these cases, e.g., from processing again the reference molecule, which is unavoidable, or from additional header lines being written.
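The logic of a time-limited screen with a clean termination can be sketched as follows. This is a generic illustration, not CAMPARI's implementation: the function name is hypothetical, and checking the clock only between molecules is an assumption of the sketch.

```python
import time


def run_screen(molecules, process, max_hours, clock=time.monotonic):
    """Process molecules in order until the time budget is exceeded.

    Returns (results, remaining), where remaining lists the molecules
    that were not processed and could be written out for a restart.
    The budget is checked before each molecule is started.
    """
    deadline = clock() + max_hours * 3600.0
    results = []
    for index, molecule in enumerate(molecules):
        if clock() >= deadline:
            return results, list(molecules[index:])
        results.append(process(molecule))
    return results, []


# Demonstration with a fake clock (seconds) instead of real waiting.
ticks = iter([0.0, 0.0, 1800.0, 4000.0])
done, left = run_screen(["a", "b", "c", "d"], str.upper, 1.0,
                        clock=lambda: next(ticks))
print(done, left)
```

The list of remaining molecules plays the role of the new output file(s) mentioned above, which can serve as input for the restarted screen.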
If a small molecule screen was requested, this keyword controls whether the input file(s) (the screened library and/or the reference molecule) is used to derive a list of permissible bonds per molecule and whether this list is parsed for CAMPARI-specific information on assigned bond types. The options are as follows:
0) The 3D geometry found in the input file is the exclusive source of information used to derive molecular topology (the set of covalent bonds) and to infer the types of chemical bonds.
1) The 3D geometry found in the input file is the exclusive source of information used to infer the types of chemical bonds. Conversely, the permissible covalent bonds are restricted to the set defined in the @<TRIPOS>BOND section of the input file(s) for each molecule.
2) The 3D geometry found in the input file is the exclusive source of information used to derive molecular topology (the set of covalent bonds). Conversely, if a custom @<TRIPOS>BOND section entry is found for a bond, that will be used to set the type of that chemical bond. If no such entries are present, this is the same as option 0.
3) The permissible covalent bonds are restricted to the set defined in the @<TRIPOS>BOND section of the input file(s) for each molecule. If a custom @<TRIPOS>BOND section entry is found for a bond, that will be used to set the type of that chemical bond. If no such entries are present, this is the same as option 1.
The primary application for choosing option 1 over option 0 (or 3 over 2) is for cases where the 3D conformers in the input file(s) are not free of steric clashes due to the presence of rotatable bonds / conformational flexibility. Using options 1 or 3 will prevent CAMPARI from inferring a bond on account of a very short interatomic distance caused by a clash. This naturally assumes that the @<TRIPOS>BOND section in the input file(s) is curated perfectly. Any error (missing bond, index mismatching, etc.) will lead to an unfixable problem from CAMPARI's point of view, i.e., it will result in the molecule being skipped. Note that this keyword cannot fix problems with the covalent geometry (bond lengths, bond angles, impropers, chirality) itself.
The primary application for choosing option 2 over 0 (or 3 over 1) is to increase the robustness of strain energies when working with the same molecule multiple times. This is common if the screened library already contains conformers of the same molecule or, more importantly, if docked poses are reanalyzed (for example in scoring mode). When using limited sampling in combination with tethering to a reference molecule, it is quite common for poses to be in violation of planarity favored by torsional potentials, and in modes 0 and 1 this might mean that the newly assigned potentials differ from the originally assigned ones.
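The four options combine two independent choices: where the set of covalent bonds comes from and where the bond types come from. The following sketch tabulates this reading of the option list and encodes the stated fallbacks when no custom bond entries are present. All names in the code are hypothetical.

```python
from typing import NamedTuple


class BondHandling(NamedTuple):
    topology_source: str   # where the set of covalent bonds comes from
    bond_type_source: str  # where the bond types come from


GEOMETRY = "3D geometry"
BOND_SECTION = "@<TRIPOS>BOND section"
CUSTOM = "custom @<TRIPOS>BOND entries"

BOND_OPTIONS = {
    0: BondHandling(GEOMETRY, GEOMETRY),
    1: BondHandling(BOND_SECTION, GEOMETRY),
    2: BondHandling(GEOMETRY, CUSTOM),
    3: BondHandling(BOND_SECTION, CUSTOM),
}


def effective_option(option, has_custom_entries):
    """Map an option to the one it behaves like for a given input file.

    Without custom bond type entries, option 2 is the same as option 0
    and option 3 is the same as option 1.
    """
    if option not in BOND_OPTIONS:
        raise ValueError(f"unknown option: {option}")
    if option >= 2 and not has_custom_entries:
        return option - 2
    return option


for opt, handling in BOND_OPTIONS.items():
    print(opt, handling.topology_source, "|", handling.bond_type_source)
```

Read this way, moving from option 0 to 1 (or 2 to 3) changes only the topology source, while moving from 0 to 2 (or 1 to 3) changes only the bond type source.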
If a small molecule screen was requested, and the reference molecule contains substructure information (see elsewhere for details on how to define and manually match substructures), CAMPARI can match this substructure in other, unannotated molecules using a heuristic procedure. The other molecules are defined by the screened library.
The heuristic procedure relies on a best-match algorithm that provides a quantitative measure of similarity/difference. Briefly, molecules are scanned, and substructures with identical base topologies are identified. This means that there has to be a set of atoms in the screened molecule that has (at least) the same interatomic connectivity, which implies that substructures larger than a screened molecule can never be matched. This is a rigorous requirement, and no tolerance exists for comparing just matching parts of substructures. To avoid such cases, the reference substructure simply has to be made smaller. For every unique set of atoms matching the target topology, a penalty score is calculated. Unique refers to the set of labeled atoms. This has consequences for substructures containing atoms that are not unique: such sets will be proposed several times by the algorithm but with different permutations. Since the growth is exponential, and the number of considered sets is (currently) restricted to 20, it is strongly recommended to not include more than a few of such atoms. Occasionally this apparent redundancy will be explicitly desired, for example when using a substructure that has a symmetry axis but this symmetry is broken by parts of the screened molecules outside of the matched substructure.
For every atom in a candidate set, the penalty score is an enumeration with contributions from different sources: elements are compared with a limited hierarchy (meaning that, for example, an O-S replacement is better than an O-N replacement, which in turn is better than an O-C or O-P replacement). Next, there is a penalty for the geometry an atom forms with its substituents, and a penalty is applied if the center is planar in one but not the other (a stiffer penalty is applied if it is linear in one but not the other). Finally, the identities of the substituents are compared: there are penalties for differences in valencies, atom identities, and numbers of substituents that are not terminal atoms. The maximum total penalty for a typical atom is roughly 10-15, the minimum is 0, and a good but inexact match should carry a penalty of around 1-5. Note that the penalties of neighboring atoms are linked to each other due to the contributions from substituents. As explained elsewhere, substructures can (necessarily) be defined to not include all atoms connected to selected atoms. This is particularly relevant for hydrogen atoms. If hydrogen atoms are included, the average penalty will generally decrease if a matching subgroup (like -CH3) is found. This is because two paired hydrogen atoms have almost no capacity for creating a nonzero penalty.
For a given set, the final difference score is computed as the average across all atoms in the substructures. Up to the aforementioned limit of 20, all matches with a difference score that is below the value of MOL2ASSIMILAR are at least initially processed. Once the matched substructure has been used to align the molecule, CAMPARI will calculate the RMSD value across the substructure to which an additional threshold is applied (specified by keyword MOL2RMSDTHRESH).
What are good strategies in choosing and matching a substructure? Generally speaking, there are probably five main strategies:
- Use a small, rigid, and restricted substructure (only heavy atoms) with a stringent value for MOL2ASSIMILAR (up to 1.0). This can give exact chemical matches in a highly focused manner (like a carboxylate group).
- Use an exhaustive but rigid substructure (including hydrogen atoms) with a stringent value for MOL2ASSIMILAR and a very stringent value for MOL2RMSDTHRESH (like 0.5Å). The latter is needed to avoid redundancy in the calculations since RMSD can be used to pick out the best of several matching but permuted sets.
- Use a larger and possibly flexible substructure (only heavy atoms) with an intermediate value for MOL2ASSIMILAR (like 1 to 2.5) and a very lenient value for MOL2RMSDTHRESH. This will reproduce comparable chemotypes but ignore the starting geometry. Note that there is usually no reason to try to produce exact matches using larger substructures although it is possible (smaller ones will generally suffice). One danger of including more and more atoms, even if they are heavy atoms, is the appearance of additional symmetry-related duplications or variants. Depending on the problem at hand, this might be both a feature and a caveat.
- Use a larger and possibly flexible substructure (only heavy atoms) with a lenient value for MOL2ASSIMILAR (like 2.5 to 5) and a stringent value for MOL2RMSDTHRESH (like 1.0Å). This will prioritize shape matching over chemical similarity. The biggest risk in this approach is that there are too many initial matches, which preclude relevant ones from even being evaluated (the best matches according to the chemical similarity are prioritized, however). This is where bigger substructures can be useful since they might be able to limit the space of available matches upfront.
- Use a small but maximally characteristic substructure with an intermediate or lenient value for MOL2ASSIMILAR and a lenient criterion for MOL2RMSDTHRESH. The goal of this would be to find weak but potentially interesting matches.
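The interplay of the two thresholds in the strategies above can be made concrete with a small sketch of the selection logic as described: average the per-atom penalties, keep sets below the (absolute) value of MOL2ASSIMILAR, prioritize by chemical similarity up to the limit of 20, and then apply the RMSD threshold. The function name, the input layout, and the use of precomputed RMSD values are assumptions; the penalties themselves and the special handling of molecules without any match are not modeled.

```python
from statistics import mean

MAX_SETS = 20  # limit on considered sets quoted in the text


def select_alignments(candidates, assimilar, rmsd_thresh, max_sets=MAX_SETS):
    """Filter candidate atom sets by difference score, then by RMSD.

    candidates: list of (label, per_atom_penalties, rmsd_after_alignment).
    Returns a list of (label, difference_score, rmsd) for retained sets.
    """
    threshold = abs(assimilar)  # the threshold is taken as an absolute value
    scored = [(mean(penalties), label, rmsd)
              for label, penalties, rmsd in candidates]
    # Keep sets below the threshold, best chemical matches first.
    passed = sorted(item for item in scored if item[0] < threshold)
    initial = passed[:max_sets]
    # In this sketch, the RMSD is assumed known for each aligned set.
    return [(label, score, rmsd)
            for score, label, rmsd in initial if rmsd <= rmsd_thresh]


# Hypothetical candidates: two good chemical matches, one poor one.
cands = [
    ("a", [0, 1, 2], 0.4),
    ("b", [4, 5, 6], 0.2),
    ("c", [1, 1, 1], 2.0),
]
print(select_alignments(cands, 2.5, 1.0))
```

Here, set "b" fails the chemical criterion despite its low RMSD, and set "c" passes the chemical criterion but fails the geometric one, which illustrates why the two thresholds are best chosen together.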
Because the tethering will frequently place parts of molecules such that they overlap with receptor atoms, it is not recommended to perform the actual search without, at least in the beginning, a Monte Carlo sampler. MC methods have the advantage that they are highly tolerant to the extreme energies resulting from nuclear overlap. A second caveat is that the tethering most likely makes sense only if the movement of the tether is restricted. This can, in rare cases, be achieved by keyword MOL2FRZMODE, but a much more general and useful solution is the use of automatic position restraints. To solve the issue with molecules clashing with the receptor, there are two main strategies:
First, keyword MOL2PRUNEMODE can be used to enable a filtering technique that performs a targeted relaxation of the same type as that enabled by TMD_RELAX and controlled by the same parameters. This has the advantage that MC sampling is used in a focused manner specifically to resolve conflicts. If the relaxation is deemed unsuccessful, the alignment in question is skipped immediately and the receptor coordinates are restored to what they were before the relaxation attempt. Conversely, if no relaxation was necessary or the relaxation was successful, the molecule is processed normally.
Second, it is possible to rely on the normal sampling engine to similar effect by choosing a hybrid or pure MC sampler. For the hybrid method, keyword CYCLE_MC_FIRST will be of particular interest. The goal is that the steric frustration is resolved through MC if possible. However, even then some molecules will inevitably end in a highly frustrated conformation that is unsuitable for gradient-based methods (extreme forces). Here, a "safe" termination will be desirable, and this can be controlled by keyword THRESHOLD_INCR. This will generally be much slower than the first method, it will provide many strained poses with poor scores, and it might cause a higher level of receptor deformation, which is harder to rescue for subsequent molecules.
To distinguish multiple alignments for the same molecule, each of which may produce several output poses, the names of the input molecules are appended with a string like "al0001" and so on. Lastly, in some applications, users may want to process molecules with unmatched substructures regardless. Since the requirements toward the sampler can differ dramatically, this is an intrinsically difficult proposition. Nevertheless, to achieve this behavior, a negative value for MOL2ASSIMILAR can be specified (the interpreted difference threshold is always taken as the absolute value).
If a small molecule screen was requested, and the reference molecule contains substructure information (see elsewhere for details on how to define and manually match substructures), CAMPARI can identify matches between similar substructures in the reference and the screened molecules. This keyword defines the tolerance for the geometric similarity of the atoms once they have been aligned according to this chemical matching algorithm, which is explained in detail for keyword MOL2ASSIMILAR. The value has to be provided in Å. The screening based on this RMSD threshold is purely for the alignment of the substructure and contains no information about steric compatibility with the receptor (see MOL2PRUNEMODE for further filtering options).
NRSTEPS
This keyword sets the total number of simulation steps including equilibration. A step is either a single propagation event by the chosen propagator or the advancement to the next trajectory snapshot in a trajectory analysis run. Currently, the only keyword to manipulate the user choice for NRSTEPS is a file with selected input frames in an analysis run. This can have consequences as other keywords, for example DISABLE_ANALYSIS, rely on correct values for NRSTEPS. The default is 100000, and this keyword should always be specified.
EQUIL
This keyword specifies the total number of equilibration steps. This implies that no analysis is performed as long as the current step number does not exceed this value. Note that this also means that no structural output (trajectory) is produced. Conversely, certain necessary diagnostics are provided irrespective of equilibration (see for example ENOUT or ACCOUT). The default differs: it is 10000 for a simulation run and 0 for an analysis run (see PDBANALYZE).
TEMP
This keyword sets the absolute (target) temperature in K (default of 298K).
ENSEMBLE
This crucial keyword determines which ensemble to simulate the system in. The options available are limited in that they depend strongly on the type of sampler (e.g., there is no NVE (microcanonical) ensemble if sampling is done via Monte Carlo → DYNAMICS).
The options are as follows:
1) NVT (Constant Particle Number, Constant Volume, Constant Temperature):
Always available, this is the
canonical ensemble and currently the only option available
for pure Monte Carlo runs.
2) NVE (Constant Particle Number, Constant Volume,
Constant Energy):
The microcanonical ensemble
(adiabatic conditions) is only supported (and possible) for
non-dissipative, i.e., Newtonian dynamics (see option 2 in DYNAMICS).
3) NPT (Constant Particle Number, Constant Pressure,
Constant Temperature):
The isothermal-isobaric ensemble is at the moment only supported as a compatibility
option for trajectory analysis runs
operating on trajectories that were obtained under NPT conditions, i.e.,
have fluctuating volumes encoded in the structural input files such as
those supplied to keyword FMCSC_XTCFILE or
Accounting for this is primarily an issue when computing distances
in periodic boundary conditions since a
mismatch between an assumed and an actual container size
would lead to errors.
5) μiVT (Constant Chemical Potential(s),
Constant Volume, Constant Temperature):
This requests the grand canonical ensemble where the number of
particles in the system is allowed to fluctuate. Subscript i
indicates that not all particle types may be subject to number
fluctuation (typical for example in the simulation of macromolecules
and a (co-)solvent atmosphere for
which only the small molecule would be treated in "grand" fashion). This
implies that technically incorrect hybrid ensembles are populated
(sometimes referred to as "partially grand" ensembles). The rigorous
grand canonical ensemble would require all particle types to be
permitted to fluctuate in number. Such partially grand ensembles are
not to be confused with the "semigrand" ensemble (see below).
Technically, the GC ensemble is realized in CAMPARI by allowing
molecules to transfer between a real and a shadow existence, the latter
also serving as the reference state. The discreteness of transitions
between shadow and real existence implies that currently the grand ensemble
is only available in pure Monte Carlo simulations.
Note that currently the reference
state is modeled in the infinite dilution limit (there are no
intermolecular interactions). This is consistent with the default
implementation choice (→ GRANDMODE), in
which the bath communicates with the system via an expected bulk concentration
and an excess chemical potential correcting for the interactions arising
from that finite bulk concentration.
6) ΔμiNtVT (Constant Chemical
Potential Difference(s), Constant Total Particle Number, Constant
Volume, Constant Temperature):
This requests the semigrand ensemble as originally formulated by Kofke
and Glandt (1988), in which particle types are allowed to fluctuate in
number under the constraint that the total particle number (Nt)
remains constant. Just like for the μiVT-ensemble, CAMPARI
allows the definition of partial semigrand ensembles in which - for
example - a bath of water and methanol solvating a macromolecule is
subjected to moves attempting to transmute methanol into water or vice
versa. Note that the number of real-world applications for which such an
ensemble is appropriate is very small. Technically, the constraints
to keep Nt fixed may improve acceptance rates in dense fluid
mixtures. For both options (5 and 6), please refer to the documentation
for the particle fluctuation file, specified using PARTICLEFLUCFILE, for
details. Note that the sanity of results obtained with any partial
grand or semigrand ensemble must be investigated with utmost care.
To be added or completed in the future:
3) NPT (Constant Particle Number, Constant Pressure, Constant Temperature):
May eventually be made available also for simulation runs.
4) NPE (Constant Particle Number, Constant Pressure,
May eventually be made available for Newtonian MD runs.
Note to developers: there is rudimentary support for running simulations in the NPT and NPE ensembles in CAMPARI right now but those branches are completely disabled and encompass only the strictly serial code.
If an ensemble is chosen that allows particle number fluctuations, this keyword acts as a simple logical that controls whether to write out a summary of the grand-canonical setup, i.e., which particle types are allowed to fluctuate in numbers, what the initial numbers (bulk concentrations) are, and what (excess) chemical potentials are associated with those.
GRANDMODE
If an ensemble is chosen that allows particle number fluctuations, this keyword acts to choose between two different implementation modes. In the first (choice 1), file input is used to provide CAMPARI with the initial numbers and absolute chemical potentials of fluctuating particle types. This is generally inconvenient for cases with realistic interaction potentials and/or multiple fluctuating particle types that require coupled chemical potentials (such as individual ionic species). The bulk concentrations are set implicitly by the chemical potentials. This formulation involves the "thermal volume" of particles, meaning that a monoatomic ideal gas will require a mass-dependent chemical potential. In the second option (choice 2, which is the default), the same file input is used to set the bulk concentration explicitly (based on the initial particle number provided), and the chemical potentials listed are merely the excess terms. This formulation involves no mass-dependent terms, is numerically more stable (accuracy of exponentials), and provides an easy reference limit for dilute solutions (zero excess chemical potential).
To illustrate the difference in implementation, consider the additional contribution to the acceptance probability (term $c_b$ in description of keyword MC_ACCEPT) of a particle insertion attempt:
Mode 1:
$c_b = e^{\beta\mu_{ideal}} \cdot e^{\beta\mu_{excess}} \cdot V \cdot (N+1)^{-1} \cdot \zeta^{-1}$
Here, $V$ is the system volume, $N$ is the current number of particles of the type to be inserted, $\mu_{ideal}$ and $\mu_{excess}$ are the components of the chemical potential, and $\zeta$ is the aforementioned thermal volume.
Mode 2:
$c_b = e^{\beta\mu_{excess}} \cdot \langle N \rangle \cdot (N+1)^{-1}$
This equation contains the expected bulk concentration as $\langle N \rangle$.
While numerically the two cases can be made equivalent, the latter contains a self-consistency check by being able to compare the measured $\langle N \rangle$ to the assumed $\langle N \rangle$ given the chosen $\mu_{excess}$. In the former, the assumed $\langle N \rangle$ is unknown, because the partitioning between $\mu_{ideal}$ and $\mu_{excess}$ is not explicit. For a single-component system (or a system with multiple independent components), the measured $\langle N \rangle$ can be used to derive the $\mu_{excess}$ that the simulation essentially corresponded to. With dependent components, however, this becomes very difficult to adjust. For general calibration strategies of excess chemical potentials and background, see references.
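The statement that the two modes can be made numerically equivalent can be checked with a few lines of Python. This is a minimal sketch and not CAMPARI code: the function names and all numerical values are illustrative assumptions. It evaluates both expressions for the insertion bias factor and uses the expected particle number that follows from equating them, i.e., the exponential of β times the ideal chemical potential, multiplied by V and divided by ζ.

```python
import math


def insertion_bias_mode1(beta, mu_ideal, mu_excess, volume, n_current, thermal_volume):
    """Mode 1: exp(beta*mu_ideal) * exp(beta*mu_excess) * V / ((N+1) * zeta)."""
    return (math.exp(beta * mu_ideal) * math.exp(beta * mu_excess)
            * volume / ((n_current + 1) * thermal_volume))


def insertion_bias_mode2(beta, mu_excess, n_expected, n_current):
    """Mode 2: exp(beta*mu_excess) * <N> / (N+1)."""
    return math.exp(beta * mu_excess) * n_expected / (n_current + 1)


def implied_expected_number(beta, mu_ideal, volume, thermal_volume):
    """<N> for which both modes give the same bias factor."""
    return math.exp(beta * mu_ideal) * volume / thermal_volume


# Illustrative numbers only (assumed, not CAMPARI defaults).
beta = 1.0 / 0.5922            # roughly 1/(kT) in mol/kcal near 298 K
mu_ideal, mu_excess = -3.0, -1.5
volume, zeta, n_now = 1.0e6, 2.0, 40

n_expected = implied_expected_number(beta, mu_ideal, volume, zeta)
cb_mode1 = insertion_bias_mode1(beta, mu_ideal, mu_excess, volume, n_now, zeta)
cb_mode2 = insertion_bias_mode2(beta, mu_excess, n_expected, n_now)
print(n_expected, cb_mode1, cb_mode2)
```

In mode 2 the expected number is an explicit input, which is what makes the comparison between assumed and measured particle numbers possible.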
This is one of the core keywords and specifies how to propagate (sample) the system, i.e., how to obtain a new conformation of the system given the current one. The system configuration usually involves both momenta and coordinates unless the sampler is momentum-free (e.g., Monte Carlo). Most propagation schemes are able to take advantage of the shared memory parallelization, but it is important to benchmark this routinely as the scalabilities differ. For example, the work load available for parallelization in an incremental energy evaluation in Monte Carlo is usually much smaller than that in a full force evaluation in dynamics. Options are as follows:
1) Pure Monte Carlo sampling (see keyword MC_ACCEPT and section on Monte Carlo move sets).
2) Molecular Dynamics:
Integration of Newton's equations
of motion either in internal or Cartesian coordinate space (see CARTINT). This is fully
supported. The internal coordinate space formulation is based upon a published algorithm. More details are found in
the documentation to keywords TMD_INTEGRATOR and
A simplified summary of the internal coordinate space variant is as follows:
- Dynamics are performed on internal degrees of freedom which are assumed to be independent (rigid body translation, rotation around the cardinal x, y, and z axes of the laboratory frame (static) centered at the center of mass of each molecule, torsional degrees of freedom).
- Dynamics for polymers vary along the chain (faster at the termini) as they should, but this does not happen in any fashion proven to comply rigorously with a specific dynamics. By altering the chain alignment mode, more exotic dynamics can be produced. This is because the building directions of any polymer chains represent an arbitrary choice in the method.
- By assuming a diagonal mass (inertia) matrix (viz., a block of the mass metric tensor), applicability of simple integrators is a given. In the absence of interaction-based forces, the goal is to preserve rotational kinetic energy (but not angular momentum) by considering the effective masses associated with various rotational degrees of freedom as time-dependent variables in a discrete integration scheme. This treatment is intrinsically consistent, and agreement with data obtained from Monte Carlo simulations has been shown (for select cases). CAMPARI provides a simple diagnostic of the impact of assuming a diagonal mass matrix by printing kinetic energies in both internal and Cartesian coordinates to log-output.
- Because the algorithm does not produce dynamics that obey Gauss' principle of least constraint or conserve angular momentum, integrator stability can be inferior to that for a case of identical constraints realized as holonomic constraints in Cartesian molecular dynamics. This effect cannot always be quantified since the holonomic constraints implied by the internal coordinate space treatment often become too highly coupled for linear solvers to converge (→ SHAKEMETHOD). Select cases with quickly varying masses highlight the effect, and the most significant example is probably rigid-body simulations of water (water has tiny rotational inertia and is a prototypical test case for rigid-body integrators). Quantification of relative integrator stabilities for such a case can be performed. The stability can be increased by altering atomic masses (mass redistribution in methyl or hydroxyl groups is a common technique) or by using an automatic approach that takes advantage of the formal independence of the degrees of freedom achieved by the diagonal inertia matrix.
- Subtle equipartition artifacts (i.e., some individual or collective degrees of freedom heating up at the expense of others because they are either more susceptible to integration error or weakly coupled to the rest of the system) can always occur. Effects differ between internal coordinate and Cartesian treatments. This is because dihedral angles will generally have a rather different level of energetic coupling and integration stability than the positional coordinates of an atom embedded in a polyatomic molecule.
Conversely, the integration of Newton's equations of motion for the Cartesian coordinates of all atoms
represents the more canonical approach to molecular dynamics. These
algorithms are conceptually much simpler, primarily because the mass
matrix is diagonal, leading to independent equations. Users are referred to standard literature on the topic.
The simplicity holds primarily for unconstrained simulations in the microcanonical ensemble.
In practice, additional procedures are needed in almost all cases, for example the enforcement of holonomic constraints through appropriate algorithms such
as SHAKE or LINCS. Most three-point water models are explicitly calibrated as rigid models, and it is therefore
necessary to maintain water geometry as a set of holonomic constraints throughout
a Cartesian dynamics simulation. Similarly, the desired switch to the canonical ensemble requires the action of
a thermostat. CAMPARI always uses the simple leapfrog integrator in Cartesian
molecular dynamics, which has excellent energy conservation properties due to error cancellation. This does not mean
that it is free of discretization errors, which increase with increasing time step.
The latter statement is of course true for any numerical integration of equations of motion.
3) Langevin Dynamics: Integration of the Langevin equation of
motion. This is supported via the impulse integrator due to Izaguirre
and Skeel (reference). With respect
to the torsional dynamics implementation, the same caveats apply as for
Newtonian dynamics. There is an additional limitation in that the only implementation supported
currently is an approximate scheme (corresponding to keywords
TMD_INTEGRATOR being 2 and
TMD_INT2UP being 0). This is because the
structure of the impulse integrator is more complex, thus allowing a straightforward
extension to our torsional dynamics only for the simplest case (research in progress). It
also means that the shared memory parallelization will not (yet) work
with this choice.
Note that all LD simulations work in the fluctuation-dissipation limit, which means that all degrees of freedom are automatically coupled to a heat bath, and which assumes an underlying continuum providing frequent collisions as the source of the stochastic term as well as the frictional damping. In addition, note that hydrodynamic interactions are neglected and that currently there is only a single, uniform frictional parameter for all degrees of freedom (see FRICTION). The latter is a major and non-obvious assumption in internal coordinate spaces featuring polymers with flexible dihedral angles. This is because it is not clear what the frictional drag incurred by rotations around molecular bonds is and what the results of ignoring communication between these drag effects are.
5) Mixed Monte Carlo and Newtonian (Molecular) Dynamics:
This hybrid method mixes MC with MD sampling
and assumes consistency of ensembles at all times. Since MC sampling only
supports the canonical ensemble at the moment, this means that Newtonian MD has
to be performed with a thermostat preserving the correct ensemble, e.g.,
the Andersen or Bussi et al. schemes.
Then, the entire trajectory should be treatable as a Markov chain and
analysis is performed as if the sampling engine were one of the two.
A potential caveat lies in velocity autocorrelation. The method is implemented such that segments of MC sampling alternate with MD segments. Upon switching from MC to MD, new velocities are assigned from the proper Boltzmann distribution. This may introduce some amount of noise. Aside from this particular concern, all independent concerns about both Monte Carlo and dynamics-based methods apply. It is up to the user to ensure that either sampler yields the required ensemble rigorously.
A particular concern lies with the selection of degrees of freedom. In general, it will be highly desirable for the set of sampled degrees of freedom to be exactly identical between the two samplers. This is not always possible, however, e.g., when sampling sugar pucker angles in MC, but not in dynamics. In these scenarios it will be desirable to use short segment lengths in order to improve the chances of convergence (in the given example, convergence is unlikely if long dynamics segments only "see" few frozen conformations of the sugar pucker states in the system). This issue is particularly difficult in mixed Cartesian/internal coordinate space simulations attainable by selecting a hybrid scheme here and 2 for CARTINT. Some improvement can be made by including geometric constraints in Cartesian space, but a rigorous match will generally be out of reach.
Technically, the simulation simply alternates between MC-based and dynamics-based segments whose minimum and maximum lengths are controllable by the user (→ keywords CYCLE_MC_FIRST, CYCLE_MC_MIN, CYCLE_MC_MAX, CYCLE_DYN_MIN, and CYCLE_DYN_MAX).
6) Minimization:
This uses the potential energy gradient to
steer the system to a near minimum through a
variety of techniques (see MINI_MODE).
Minimization is not a technique to sample phase
space in terms of a well-defined ensemble, and the closest approximation of its results is
probably that of a locally sampled constant-volume (NVT) condition at extremely low temperature.
In general, minimizers are adept at finding local but not global minima.
Note that these algorithms are still numerically discrete schemes, i.e.,
they employ finite step sizes. This means that
irrespective of any theoretical guarantees or expectations an algorithm offers,
results may not always be as straightforward. In addition, minimizers
are poor tools if the basic step sizes should be heterogeneous for different
degrees of freedom, e.g., for a dilute phase of Lennard-Jones atoms or clusters.
Note that minimizers are currently not compatible with multi-replica MPI calculations
(see MPIAVG and REMC) with
one exception: a small molecule screen in
distributed mode supports it.
7) Mixed Monte Carlo and Langevin Dynamics: This is analogous to 5) only that Newtonian
dynamics are replaced with Langevin dynamics (see 3). (example reference)
To be added in the future are:
4) Brownian Dynamics
Note that in all of the above methods relying on forces (options 2-7), it is very likely that optimized loops will be used (depending on settings for the Hamiltonian). These currently have the property of using few stack-allocated array variables that may become large if cutoff settings are very generous or if no cutoffs are in use. This may lead to unannotated segmentation faults (depending on compiler, architecture, and local settings). There are several workarounds (on Unix-systems, the shell command "ulimit" can for example be used to increase stack size for the local environment) some of which will be compiler-specific (for example to force the compiler to always allocate local arrays from the heap). Stack access is faster and therefore generally desirable in the speed-critical portions of the code.
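For the Cartesian variant of option 2, the text above notes that the leapfrog integrator conserves energy well while still carrying a discretization error that grows with the time step. The following Python sketch illustrates this for a one-dimensional harmonic oscillator. It is a toy model under stated assumptions (unit mass and force constant, hypothetical function name), not CAMPARI's implementation.

```python
def leapfrog_harmonic(x0, v0, dt, n_steps, k=1.0, m=1.0):
    """Leapfrog integration of a 1D harmonic oscillator.

    Velocities live at half steps. The on-step velocity used for the
    kinetic energy is the average of the two neighboring half-step values.
    Returns the list of total energies, one per step.
    """
    x = x0
    # Shift the initial velocity back by half a step: v(-dt/2) = v0 - a0*dt/2
    v_half = v0 - 0.5 * dt * (-k * x0 / m)
    energies = []
    for _ in range(n_steps):
        a = -k * x / m                  # acceleration at the current position
        v_next = v_half + a * dt        # advance velocity to t + dt/2
        v_on_step = 0.5 * (v_half + v_next)
        energies.append(0.5 * m * v_on_step ** 2 + 0.5 * k * x ** 2)
        x = x + v_next * dt             # advance position to t + dt
        v_half = v_next
    return energies


small_dt = leapfrog_harmonic(1.0, 0.0, 0.01, 10000)
large_dt = leapfrog_harmonic(1.0, 0.0, 0.10, 1000)
drift_small = max(abs(e - small_dt[0]) for e in small_dt)
drift_large = max(abs(e - large_dt[0]) for e in large_dt)
print(drift_small, drift_large)
```

The energy error stays bounded in both runs but is markedly larger for the larger time step, in line with the remark that error cancellation does not remove discretization errors.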
If the simulation uses (at least partially) Monte Carlo sampling, this very important keyword allows the user to choose between (currently) three different types of acceptance rules for MC moves that are as follows:
- The Metropolis criterion is used. A random number sampled uniformly over the unit interval is compared to the term $c_b \cdot e^{-\beta \Delta U}$. Here, $\Delta U$ is the difference in (effective) energy of the new vs. the original conformation ($U_{new} - U_{old}$), $\beta$ is the inverse temperature, and $c_b$ is a bias correction factor that is specific to the move type. If the random number is less than the term above, the move is accepted. Note that $c_b$ can encompass different types of bias. It is also important to keep in mind that some advanced move types may incorporate biasing terms during the picking of a new conformation (see TORCRMODE), and these then no longer show up in $c_b$. The Metropolis criterion has the advantage that it is rejection-free in the limit of no energetic or other biases. With a non-zero energy function in place, the distribution sampled from is the Boltzmann distribution.
- A Fermi criterion is used. A random number sampled uniformly over the unit interval is compared to the term $(1 + c_b^{-1} \cdot e^{\beta \Delta U})^{-1}$. If the random number is less than the term above, the move is accepted. The Fermi criterion's only advantage over the Metropolis criterion is that it defines an actual probability on the interval [0,1]. The downside is that the limiting acceptance rate is only 50%. However, the impact is much weaker if $\Delta U$ is relatively large on average (in absolute magnitude). The sampled distribution is again the Boltzmann distribution.
- A Wang-Landau / Metropolis criterion is used. A random number sampled uniformly over the unit interval is compared to the term $c_b \cdot e^{-\Delta \ln T}$ or to the term $c_b \cdot e^{-\beta \Delta U - \Delta \ln T}$ (see keyword WL_MODE). Here, $\Delta \ln T$ is the difference in the logarithms of the current and proposed estimates of the target distribution (e.g., the density of states), i.e., $\Delta \ln T = \ln T_{new} - \ln T_{old}$. The Wang-Landau algorithm is explained in detail elsewhere, but it should be pointed out that the sampled distribution is no longer the Boltzmann distribution (instead it is ill-defined, and the simulation results require snapshot-based reweighting), the simulation does not satisfy detailed balance (the estimate of the density of states changes continuously), and convergence/errors are much more difficult to assess (since the method is essentially an iteration and not an equilibrium sampling scheme). It is crucial to keep in mind that the standard Metropolis criterion is used while the simulation has not exceeded the number of equilibration steps. This is mostly to avoid range problems when starting from random initial configurations.
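The first two acceptance rules can be compared directly in a short Python sketch. The function names are hypothetical, the bias correction factor (c_b in the code) is assumed to be strictly positive, and the Wang-Landau variant is omitted because it requires a running estimate of the target distribution. The code works with the logarithm of the Metropolis term to avoid overflow for strongly favorable moves.

```python
import math
import random


def metropolis_probability(delta_u, beta, c_b=1.0):
    """Acceptance probability min(1, c_b * exp(-beta * dU)); requires c_b > 0."""
    log_term = math.log(c_b) - beta * delta_u
    return 1.0 if log_term >= 0.0 else math.exp(log_term)


def fermi_probability(delta_u, beta, c_b=1.0):
    """Acceptance probability 1 / (1 + exp(beta * dU) / c_b); requires c_b > 0."""
    log_term = math.log(c_b) - beta * delta_u
    if log_term >= 0.0:
        return 1.0 / (1.0 + math.exp(-log_term))
    t = math.exp(log_term)
    return t / (1.0 + t)


def accept(probability, rng=random):
    """Accept the move if a uniform random number is below the probability."""
    return rng.random() < probability


beta = 1.0
for delta_u in (-2.0, 0.0, 2.0):
    print(delta_u,
          metropolis_probability(delta_u, beta),
          fermi_probability(delta_u, beta))
```

At zero energy difference and without bias, the Metropolis rule always accepts while the Fermi rule accepts half of the attempts, which is the 50% limit mentioned above. Both rules satisfy the same ratio of forward to reverse acceptance probabilities, which is why both sample the Boltzmann distribution.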
If the attempted calculation is a gradient-based run in internal coordinate space, this keyword provides the option to use highly specialized Monte Carlo sampling to relax features of the starting geometry that could cause the integrator to become unstable. The keyword is interpreted as a threshold setting (a.u.) for the absolute, normalized force acting on internal degrees of freedom, and values from 10-100 are probably appropriate. If no forces exceed the threshold, no relaxation steps are performed.
The relaxation works hierarchically in three stages: First, dihedral angles experiencing large forces are identified. Those with rotation lists including only atoms that are all from the same residue are deemed eligible, and the residue is added to a temporary list for Monte Carlo sidechain sampling. CAMPARI then performs TMD_NRRELAX steps of sidechain moves per eligible sidechain in accordance with move parameters such as CHIRDFREQ and CHISTEPSZ. If the shared memory parallelization is in use, the actual Monte Carlo cycles utilize more than a single thread, but the global force calculations do not. This relaxation methodology is useful for dealing with systems with incomplete structural input (from the PDB, for example), where sidechains have been rebuilt blindly. Second, single-residue molecules experiencing strong rigid-body forces are identified, and a temporary list for rigid-body sampling using only rigid rotation moves is populated. CAMPARI then performs TMD_NRRELAX steps of these rigid rotation moves per molecule. This is useful for dealing with dense systems that were only incompletely equilibrated or (partially) randomized. In a last step, a temporary list for rigid-body sampling is again populated, this time using a looser threshold that is calculated as 10 times TMD_RELAX. CAMPARI then performs TMD_NRRELAX steps of joint rotation/translation moves per molecule on these molecules. This can lead to a complete displacement of molecules (see RIGIDRDFREQ).
The resultant "relaxed" starting structure is written to a dedicated output file.
Naturally, there is a point at which the starting geometry should simply be deemed inappropriate for starting a gradient-based calculation. Importantly, these custom MC steps observe the complete Hamiltonian in use as well as relevant cutoff settings. They do not, however, count toward the final output or any analysis tools irrespective of the type of parent calculations. Related keywords are RANDOMTHRESH, RANDOMATTS, RANDOMBURIAL, and CYCLE_MC_FIRST. Unlike those keywords, TMD_RELAX is particularly useful for large systems with few defects as it relies on the measured force explicitly. Of course, this functionality still overlaps with the approach of using two separate calculations where the Monte Carlo run is heavily customized using the preferential sampling utility. Lastly, note that this option is not available in Cartesian dynamics.
In a small molecule screen, the relaxation cycle is repeated before every molecule is screened. This can be a significant contribution to the cost of the calculation although normally few degrees of freedom should become eligible again.
If the attempted calculation is a gradient-based run in internal coordinate space, and an initial relaxation has been requested (→ TMD_RELAX), this keyword sets the number of Monte Carlo steps to be performed per stage and per eligible entity (residue or molecule). As described, there are three independent stages. The default value for TMD_NRRELAX is 20.
MOL2PRUNEMODE
If a small molecule screen was requested, and the reference molecule contains substructure information (see elsewhere for details on how to define and manually match substructures), CAMPARI offers an additional filtering technique to more quickly skip alignments that are unlikely to be sterically compatible. For this, steric interactions need to be turned on (keywords SC_IPP or SC_WCA), tethering and alignment must occur, and a (partially) gradient-based sampler must be in use in an internal coordinate space.
The technique is a modified version of the initial relaxation protocol enabled and controlled by keywords TMD_RELAX and TMD_NRRELAX. The main difference is that rigid-body coordinates are not considered as eligible for relaxation and that torsional degrees in the ligand are only considered if they do not change the alignment. There are 7 options available:
- The filtering is turned off. This is the default.
- If the normalized max/min gradients of the remaining degrees of freedom that are in violation of the threshold exceed this threshold 10000 or 100 times, respectively, the alignment in question is skipped.
- If the normalized max/min gradients of the remaining degrees of freedom that are in violation of the threshold exceed this threshold 100 or 10 times, respectively, the alignment in question is skipped.
- If the normalized max/min gradients of the remaining degrees of freedom that are in violation of the threshold exceed this threshold 20 or 5 times, respectively, the alignment in question is skipped.
- If the mean gradient across the remaining degrees of freedom that are in violation of the threshold exceeds this threshold 1000 times, the alignment in question is skipped.
- If the mean gradient across the remaining degrees of freedom that are in violation of the threshold exceeds this threshold 100 times, the alignment in question is skipped.
- If the mean gradient across the remaining degrees of freedom that are in violation of the threshold exceeds this threshold 10 times, the alignment in question is skipped.
This keyword allows the user to specify the uniform damping coefficient acting on all degrees of freedom. The value is interpreted to be in ps$^{-1}$. Currently, this is only relevant if DYNAMICS is set to either 3 or 7. In Langevin dynamics, the velocity damping through friction is given by $e^{-\gamma \cdot \delta t}$. Here, $\gamma$ is the damping coefficient, and $\delta t$ is the integration time step (see TIMESTEP). Note that in Cartesian dynamics (see CARTINT) each degree of freedom is an orthogonal direction of the Cartesian movement of each atom. Typically, Langevin dynamics integrators may make the friction on those degrees of freedom dependent on atom mass but CAMPARI does not support this at the moment since the hydrodynamic properties of individual atoms are poorly described in any case. Conversely, in torsional dynamics, the rigid-body and torsional degrees of freedom of each molecule are integrated and the friction is applied uniformly to all of those. This means that hydrodynamic properties are - again - ill-represented. Bias torques on account of variable effective masses for most dihedral angle degrees of freedom will continue to be in effect (see elsewhere).
When applying Stokes' law (which should be inapplicable when the diffusing object is strongly aspherical and/or of similar size compared to the molecules comprising the surrounding fluid) to the self-diffusion of water, the measured diffusion constant of around $2.3 \cdot 10^{-9}$ m$^{2}$s$^{-1}$ is roughly consistent through the Einstein-Stokes equation with the measured viscosity of about $8.9 \cdot 10^{-4}$ kg m$^{-1}$s$^{-1}$ (both at 25°C). By dividing by the mass, a damping constant of about 90 ps$^{-1}$ can be obtained from the Stokes approximation. When performing stochastic dynamics simulations of large, spherical rigid bodies, such a value may be appropriate. For molecular simulations, however, it is not.
First, in conjunction with typical time steps, the value is so large that the impulse integrator in use (→ DYNAMICS) can no longer sample the correct ensemble (it becomes overdamped implying temperature artifacts). Second, in a Cartesian treatment, unless one samples a monoatomic fluid of inert particles, the correlations between particles are so high that a treatment as independently diffusing spheres is not just inaccurate, but nonsensical in the absence of hydrodynamic interactions. Third, in internal coordinate spaces, the individual degrees of freedom hardly ever fit the Stokes approximation. Torsional and rigid-body rotational degrees of freedom would require a completely different model of friction. Furthermore, unlike in a Cartesian treatment, the degrees of freedom are not all similar to one another. The above means that the damping constant should be understood as an empirical parameter. Better control over values for individual degrees of freedom will be implemented in the future. It defaults to a value of 1.0 ps$^{-1}$ on par with the coupling times of thermostats in molecular dynamics (→ TSTAT_TAU).
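To get a feeling for the two damping values discussed above, the per-step velocity damping factor can be evaluated directly. This is a minimal Python sketch: the function name is hypothetical and the time step of 0.002 ps is an assumed, typical value rather than a CAMPARI default. It only evaluates the exponential damping term and does not reproduce the impulse integrator.

```python
import math


def velocity_damping_factor(gamma_per_ps, timestep_ps):
    """Fraction of the velocity retained after one step: exp(-gamma * dt)."""
    return math.exp(-gamma_per_ps * timestep_ps)


timestep_ps = 0.002  # assumed time step of 2 fs
for gamma in (1.0, 90.0):
    factor = velocity_damping_factor(gamma, timestep_ps)
    print(f"gamma = {gamma:5.1f} / ps -> retained velocity fraction {factor:.4f}")
```

With the default-sized coefficient, almost the entire velocity is retained per step, whereas the Stokes-derived value removes a substantial fraction in every single step.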
If a hybrid MC/M(B,L)D method is used (see DYNAMICS), this keyword controls the length of the first segment (in number of steps), which is always an MC segment. This is to ensure that hybrid runs can safely be started from poorly equilibrated (random) structures where forces are large and integrators quickly become unstable.
CYCLE_MC_MIN
If a hybrid MC/M(B,L)D method is used (see DYNAMICS), this keyword controls the minimum length of MC segments (in number of steps) with the exception of the first segment.
CYCLE_MC_MAX
If a hybrid MC/M(B,L)D method is used (see DYNAMICS), this keyword controls the maximum length of MC segments (in number of steps) with the exception of the first segment.
CYCLE_DYN_MIN
If a hybrid MC/M(B,L)D method is used (see DYNAMICS), this keyword controls the minimum length of dynamics-based segments (in number of steps). This should probably be significantly larger than the velocity autocorrelation time of the system.
CYCLE_DYN_MAX
If a hybrid MC/M(B,L)D method is used (see DYNAMICS), this keyword controls the maximum length of dynamics-based segments (in number of steps).
PH
This keyword sets the assumed simulation pH, which currently possesses significance for titration moves only (→ PHFREQ). This keyword may later be extended to represent the assumed (bath) pH in constant-pH simulations.
IONICSTR
This keyword sets an assumed simulation ionic strength in two different contexts. The first is to override the inference of ionic strength based on system and parameters when the generalized reaction-field correction is in use. The second is to override the inference of ionic strength based on system and parameters for estimated pKa computations by a special Markov chain technique. The unit is molar (M). In either case, despite the conceptually very different influence of the parameter on the model, ionic strength is used in an (over)simplified Debye-Hückel approach. Real systems of appreciable complexity are very unlikely to exhibit the required homogeneity at the observed length and time scales. The two uses of the keyword are mutually exclusive, and the Markov chain technique, which is used to estimate cross-influences between multiple ionizable sidechains on a polypeptide, is explained for keyword PHFREQ.
RESTART
This keyword is a simple logical indicating whether to restart a previously discontinued run. It tells the program to attempt to restart a simulation which was accidentally or intentionally terminated. The program writes out ASCII-files containing relevant information in comparatively high precision (see RSTOUT). This file (one for each node in MPI calculations) is called {basename}.rst (see elsewhere). If it is successfully read, the simulation is extended from the simulation step the file was last written for. Non-synchronous MPI runs are synchronized to the step number of the slowest node. Note that instantaneous output of the crashed run should be saved separately (i.e., moved to another directory) since, with the exception of running trajectory pdb/xtc/dcd-output, new files will replace the old ones. All non-instantaneous analysis of the crashed run is unfortunately lost. The simulation will then proceed starting effectively at that step, so the same key-file (with the exception of the RESTART-keyword itself of course) can be used. If it is past the equilibration step, on-the-fly analysis will begin immediately. Final output will reflect only the restarted portion of the run. The program will acknowledge in the log-file that it is restarting, and will post a warning message if the energies of the structures reported in the restart-file and re-computed by the program are inconsistent. Note that it is - rigorously speaking - only safe to restart the exact same calculation, since the information contained in the restart file will depend on the type of calculation performed. It will often be possible to start MC runs (see DYNAMICS) from a non-MC restart file, however. For the opposite and all other cases, consider using the auxiliary keyword RST_MC2MD.
It should be noted that these restarts are not fully deterministic, meaning that they deviate from what the original run would have produced had it continued for more steps (this is typically unknown of course). There are several reasons for this. First, no information about the state of the random number generator is preserved. This affects Monte Carlo and Langevin dynamics sampling, stochastic thermostats, and so on. Second, the information in the restart files is not printed to full double precision (this has historical reasons). This means that even a conceptually deterministic simulation will start to deviate after some number of steps (depending on the system). Third, if the shared memory parallelization is used, the balancing of load is initialized and reoptimized as it would be at the beginning of the simulations. This leads to a different sequence and blocking of computations, which subtly affects sums, for example. Fourth, as a related point, the order and grouping of compute tasks are architecture- and compiler-dependent. This means that code using different optimization levels or simply a different compiler is not the same at the machine level, and consequently the results are not the same either. Much more dramatic deviations are obtained by enabling aggressive optimization settings, for example those that reduce the precision of built-in mathematical functions. While the first two points could be avoided easily, the latter two are essentially insurmountable with present-day computers. One way to address this deficiency is to redefine numerical reproducibility as the matching of a reference result with finite accuracy, i.e., for suitably rounded results to be the same. The required level of rounding depends on the "depth" of the calculations, i.e., on how often inaccurate results are reused. This technique was used extensively in debugging the OpenMP parallel code across different architectures and compilers.
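The notion of reproducibility with finite accuracy can be illustrated with a minimal Python sketch. It is not taken from CAMPARI; the helper name `matches_reference` and the choice of a relative tolerance are assumptions made purely for illustration. The example shows that regrouping a floating-point sum changes the result at the level of the last bits, while a comparison at reduced precision still succeeds.

```python
import math

def matches_reference(value, reference, digits):
    """Return True if value agrees with reference to a relative
    tolerance of 10**(-digits) (hypothetical helper, not CAMPARI code)."""
    if reference == 0.0:
        return abs(value) < 10.0 ** (-digits)
    return math.isclose(value, reference, rel_tol=10.0 ** (-digits), abs_tol=0.0)

# The same three terms, summed with a different grouping, as could happen
# when load balancing or a compiler reorders a reduction.
sum_a = (0.1 + 0.2) + 0.3
sum_b = 0.1 + (0.2 + 0.3)

bitwise_identical = (sum_a == sum_b)
agree_to_12_digits = matches_reference(sum_a, sum_b, 12)
print(bitwise_identical, agree_to_12_digits)
```

In a long simulation, such differences are reused and amplified, which is why the number of digits that can be expected to match shrinks with the depth of the calculation.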
This is a rather specialized keyword meant for the specific case of (re)starting a dynamics run from a restart-file generated by an MC run. In this case, the restart file is shorter and only contains atomic positions, the Z-matrix, and whatever else is necessary. When set to 1, this keyword instructs the restart-file reader to assume the MC format even though the run is set to be a dynamics run (see DYNAMICS). Initial velocities are then generated from a Boltzmann distribution using the bath temperature (see TEMP). If this keyword is not set, an attempt to read mismatched restart files will crash the program (most likely in a segmentation fault). This is due to the assumed rigid formatting. The inverse procedure (reading a restart file generated by a dynamics run as the starting point for an MC run) is currently not supported. Note that the typical application for this is to use MC for equilibration of a system and to continue the run using a dynamics sampler. In single-CPU calculations, this simplifies the overall procedure and avoids using the generally low-precision pdb format as an intermediate step (although this can be adjusted with keyword PDB_OUTPUTSTRING). For some replica-exchange runs (see REMC), restart files are actually the only option which allows starting the individual nodes from individual, non-random conformations stored in an input file. The primary application for this keyword therefore probably lies in replica-exchange molecular dynamics runs which use replica-exchange Monte Carlo runs for equilibration purposes.
DYNREPORT
This minor keyword is a simple logical which ensures that in dynamics calculations with different temperature-coupling groups a summary is provided of the partitioning in that regard.
CHECKGRAD
This keyword is a simple logical which instructs CAMPARI to test the gradients for the current calculation given the Hamiltonian, system, and starting structure. It tests Cartesian gradients first, followed by the transformed gradients acting on the internal degrees of freedom (if settings allow that: see CARTINT). It is mostly intended for developers and creates at most two undocumented output files (NUM_GRAD_TEST_XYZ.dat and NUM_GRAD_TEST_INT.dat). The procedure works by numerically computing gradients using pure energy routines (finite differencing) and juxtaposing them with the analytical solution. It is slow and can sometimes be misleading or uninformative for the following reasons:
- For just a single molecule, rigid-body gradients are always net zero (outside of boundary contributions).
- The dynamics Hamiltonian must be identical to the MC Hamiltonian (in particular, a matched pair of settings for LREL_MC and LREL_MD like 3/4 should be set, keyword MCCUTMODE should be 2, and the two cutoffs (NBCUTOFF and ELCUTOFF) should be set to the same value). If any of these conditions is not met, the output might report errors that are (well) above the base level provided by the numerical approximation.
- For Cartesian gradients to be accurate, no strictly torsional space Hamiltonian terms should be used (see for example SC_ZSEC and SC_TOR). For those, Cartesian gradients are circumvented unless CARTINT is 2.
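The general idea of such a gradient test can be sketched in Python as follows. This is a toy illustration, not CAMPARI's implementation: the Lennard-Jones potential, the function names, the step size, and the use of central differences are all assumptions chosen for the example. The sketch also shows the first caveat in the list, namely that the net translational gradient of an isolated set of atoms sums to zero.

```python
import math

def lj_energy(coords, epsilon=1.0, sigma=1.0):
    """Total Lennard-Jones energy of a set of 3D points (toy potential)."""
    energy = 0.0
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            r = math.dist(coords[i], coords[j])
            sr6 = (sigma / r) ** 6
            energy += 4.0 * epsilon * (sr6 * sr6 - sr6)
    return energy

def lj_gradient(coords, epsilon=1.0, sigma=1.0):
    """Analytical Cartesian gradient of lj_energy."""
    grad = [[0.0, 0.0, 0.0] for _ in coords]
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = [coords[i][k] - coords[j][k] for k in range(3)]
            r2 = sum(x * x for x in d)
            sr6 = (sigma * sigma / r2) ** 3
            # (dE/dr)/r, so that multiplying by d[k] gives dE/dx_k
            pref = 4.0 * epsilon * (-12.0 * sr6 * sr6 + 6.0 * sr6) / r2
            for k in range(3):
                grad[i][k] += pref * d[k]
                grad[j][k] -= pref * d[k]
    return grad

def numerical_gradient(energy_fn, coords, step=1.0e-5):
    """Central finite differences using only energy evaluations."""
    grad = [[0.0, 0.0, 0.0] for _ in coords]
    work = [list(c) for c in coords]
    for i in range(len(work)):
        for k in range(3):
            original = work[i][k]
            work[i][k] = original + step
            e_plus = energy_fn(work)
            work[i][k] = original - step
            e_minus = energy_fn(work)
            work[i][k] = original
            grad[i][k] = (e_plus - e_minus) / (2.0 * step)
    return grad

def max_deviation(grad_a, grad_b):
    """Largest absolute component-wise difference between two gradients."""
    return max(abs(a - b) for ra, rb in zip(grad_a, grad_b) for a, b in zip(ra, rb))

coords = [[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [0.3, 1.1, 0.4]]
analytical = lj_gradient(coords)
numerical = numerical_gradient(lj_energy, coords)
worst = max_deviation(analytical, numerical)
net = [sum(row[k] for row in analytical) for k in range(3)]
print(worst, net)
```

The residual deviation reflects the truncation and round-off error of the finite-difference approximation, which is the "base level" against which genuine inconsistencies have to be judged.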
This keyword is a simple logical (default off) which allows selected fatal errors to be transformed into warnings (for example the simulation of systems which are not net-neutral). It should be used with caution (obviously) and the log-output should always be studied meticulously. In addition, enabling unsafe execution may skip some costly sanity checks, e.g., when reading in trajectories in pdb format.
CRLK_MODE
CAMPARI currently provides limited support in dealing with chemical crosslinks which either create one (or multiple) intramolecular loops, or link multiple molecules together. For force-based sampling in Cartesian space only (see CARTINT and DYNAMICS), this functionality matters exclusively for the following reasons:
- A chemical crosslink can be thought of as a branch in the main-chain. Such non-linear polymers violate CAMPARI's model of identifying topologically connected sequence neighbors purely based upon primary sequence. Therefore, non-bonded interactions have to be corrected if the two residues in question are crosslinked to each other (to comply with the settings provided via INTERMODEL and ELECMODEL). This is supported by CAMPARI independent of crosslink type (even though there currently are only disulfide linkages supported → sequence input).
- A single intermolecular crosslink essentially merges two molecules into a single one. However, CAMPARI continues to treat both chains as if they were independent molecules. This has a variety of reasons most of which pertain to the consistency of internal data representation and to the support of internal analysis routines. One area where this is tricky is for simulations in periodic boundary conditions (→ BOUNDARY), as shift vectors are generally applied only to intermolecular contacts. For two crosslinked molecules, this continues to be the case thereby allowing - given a poor simulation system setup - the theoretical possibility of one of the two crosslinked molecules to interact with parts of different images of the other molecule. Trajectory output may also appear confusing for the same reason.
- New bonded interactions are created which have to be correctly accounted for. In accordance with the previous point this implies that distance vectors have to be image-corrected in periodic boundary conditions even for those. For the crosslink to be actually established it is necessary that the parameter file offer support for the required bond length, angle, and dihedral terms. This is of course true for any topological interaction in a Cartesian treatment. Request a report to obtain more information at the beginning of the simulation.
- For random initial structures it will be necessary for the crosslink to be satisfied to allow stable integration of the equations of motion. This is elaborated upon elsewhere.
- If the ABSINTH implicit solvation model is used (→ SC_IMPSOLV), the crosslink usually modifies two solvation groups (one on each "side") to yield a single new unit. CAMPARI will typically split this group such that the solvation groups may remain associated with their "host residue".
- The crosslink is treated as restraints and the sampler is unaware of its explicit existence.
- The crosslink is treated as a set of (hard) constraints and the sampler is adjusted to preserve these constraints. This mode is currently under development and not yet supported.
The latter is the primary reason for supporting mode 2 in the future. Here, the move set will be explicitly adjusted to only allow moves which automatically satisfy the crosslink exactly. For torsional dynamics this option will be less useful as CAMPARI does not possess the capability to enforce high-level loop closure constraints in torsional space and consequently all residues within the loop region would have to be completely constrained for the crosslink to remain intact exactly.
This simple keyword lets the user provide the location and name of an optional input file that can be used to (re)set the assigned biotypes for specific atoms or groups of related atoms in the system. The corresponding biotype number has to be available (listed) within the parameter file in use. Biotypes are the most fundamental assignment for atoms within CAMPARI and can indirectly set many other properties such as charge, mass, etc. This is explained in detail elsewhere. However, there are parameters not affected by biotype assignment, specifically the default geometries and parameters derived from them. This means that it is generally impossible to, for example, mutate a molecule into a different molecule using such patches. Applications of this type may be more feasible for simulations in Cartesian space.
The main domains of application for biotype patches are twofold. First, they offer the fastest and most convenient route to include parameter support for atoms in residues not supported natively by CAMPARI (→ sequence input). Second, they allow the user to diversify a parameter file regarding natively supported residues, e.g., by maintaining multiple parameterizations for a small molecule or by including extra distinctions for atoms in terminal polymer residues. Biotype patches are applied first and may be largely overridden by successive application of other patches, e.g., atom type patches, charge patches, etc.
This simple keyword allows the user to provide the location and name of an optional input file that can be used to alter the masses of specific atoms in the system (in g/mol). Normally, masses are chosen for atoms based on the assigned atom types in the parameter file, and this behavior can be overridden by this keyword specifically for atomic mass. Note that this is different from changing the atom type of the atom itself, for which a dedicated patch facility is in place. Some more details are given elsewhere.
RPATCHFILE
Similar to keyword MPATCHFILE, this simple keyword allows the user to provide the location and name of an optional input file that can be used to alter specifically the radii of individual atoms in the system (in Å). By default, these radii are inferred either from the assigned atom types, i.e., computed from the Lennard-Jones size parameters, or they are overridden at the level of the parameter file by the "radius" specifications. Because the latter still operate at the resolution of assigned atom types, this keyword offers an atom-specific override facility. Note that there is a distinct hierarchy to this. Specifically, changing the radius via a patch does not change the atom type for that atom. It does, however, alter the default values of parameters that depend on radius, such as maximum SAV fractions or atomic volume reduction factors, which are then again patchable themselves. Furthermore, a radius patch overrides a radius inferred by applying a patch to the Lennard-Jones parameters of a specific atom. Details on the input are given elsewhere.
WATER3S_GEOM
If sequence input contains the residue type T3P, which by default is the classic TIP3P water model, this keyword allows the user to vary the covalent geometry in use. Because supporting all different flavors of water models as different residue types is impractical, this keyword is a workaround to alter those properties that are not encoded in the parameter file in use, specifically Lennard-Jones types and partial charges. The options are:
- Jorgensen's TIP3P (→ reference).
- SPC and SPC/E (→ reference).
- TIP3P-FB (→ reference).
- OPC3 (→ reference).
If sequence input contains the residue type T4P, which by default is the classic TIP4P water model, this keyword allows the user to vary the covalent geometry in use. Because supporting all different flavors of water models as different residue types is impractical, this keyword is a workaround to alter those properties that are not encoded in the parameter file in use, specifically Lennard-Jones types and partial charges. The options are:
- Jorgensen's TIP4P (→ reference).
- TIP4P/2005, which is reused in TIP4P-D (→ reference and reference).
- TIP4P-FB (→ reference).
- OPC (→ reference).
If sequence input contains the residue type T5P, which by default is the classic TIP5P water model, this keyword allows the user to vary the covalent geometry in use. In principle, this keyword alters those properties that are not encoded in the parameter file in use, specifically Lennard-Jones types and partial charges. However, at the moment, there is only a single option since all common TIP5P variants (TIP5P-Ewald and TIP5P/2018) use the same geometry. The keyword exists in analogy to WATER4S_GEOM and WATER3S_GEOM and might be extended in the future. The option is:
- Jorgensen's TIP5P (→ reference).
By specifying the Wang-Landau acceptance criterion for a (partial) Monte Carlo run, the WL method is enabled. This keyword defines the reaction coordinate of choice and the coupled pair to be iterated (see below). Suppose we have an augmented Hamiltonian as follows:
H = K + λE + X(Y)/β
Here, K and E are kinetic and potential energies, β is the inverse temperature, and X(Y) is an unknown function of a selected reaction coordinate. The factor λ can be either 0 or 1. Assuming that the Hamiltonian is separable, expected sampling weights from the Boltzmann distribution for the augmented Hamiltonian are:
w(Y1)/w(Y2) = (pλ(Y1)/ pλ(Y2)) exp[X(Y2)−X(Y1)]
Here, pλ(Y) is the expected probability (usually treated numerically as the integral over a finite interval, i.e., by binning). If λ is 1, it corresponds to the equilibrium (Boltzmann) probability for the original Hamiltonian. Conversely, if it is 0, pλ(Y) corresponds to the density of states (distribution as T→∞). If Y=E, pλ(E) can be written simply as pλ(E) = g(E) exp(−λβE), with g(E) being the density of (energy) states. This simple form is not available for other reaction coordinates. The Wang-Landau method's key ingredient is choosing X(Y) such that w(Yi)/w(Yj) = 1 ∀ i,j over an interval of interest. This statement is equivalent to the definition of a flat walk in the space of Y. A flat walk eliminates all barriers in the projected space of Y and should therefore be efficient at exploring phase space (see associated keywords for details on this). The main use of the flatness is as a diagnostic, however, and the Wang-Landau algorithm uses X(Y) and the apparent distribution in Y as a coupled pair to iteratively build up X(Y). If the apparent distribution becomes flat, confidence rises that X(Y) corresponds to the target distribution of interest. The target distribution is set by this keyword:
- The target distribution is ln g(E) (arbitrary offset). This is achieved by letting λ be zero and Y=E. This is also the implementation chosen in the original publication. Interest in the density of states comes from the fact that it (theoretically) enables reweighting of the flat-walk ensemble to any condition of interest. This is the default.
- The target distribution is ln p(Z) or ln p(Za,Zb) (arbitrary offset), where the Z are geometric reaction coordinates (→ WL_RC) restricted to specific molecules (→ WL_MOL). By letting λ be unity, the target distribution is actually the potential of mean force (PMF) for that (pair of) reaction coordinate(s). Unlike for umbrella sampling (see, e.g., Tutorial 9), it is obtained without further post-processing. This variant was introduced here. As stated, it is possible to estimate a two-dimensional target distribution.
- The target distribution is ln p(E) or ln p(E,Z) (arbitrary offset). This is achieved by letting λ be unity and Y=E. In comparison to the first option, this will oversample low likelihood states rather than low degeneracy states. It can be combined with a geometric reaction coordinate (Z) in a two-dimensional approach.
A few technical comments are necessary. First, the Wang-Landau acceptance criterion can be combined with a hybrid sampling technique. In such a case, the dynamics segments will propagate the system as usual, but will contribute in no way to the Wang-Landau histograms. They merely serve to evolve the system to find new states that may be hard to access given the Monte Carlo sampler. The MC segments will utilize the Wang-Landau criterion and increment the histograms. As a result, it may be possible that a dynamics segment starts in a high energy state. This may make the integrator unstable initially, and cause unforeseen crashes. Second, Wang-Landau sampling is also supported in parallel runs. For pure Monte Carlo simulations, the MPI averaging technique implies a parallel Wang-Landau implementation, i.e., an implementation in which the histograms are updated globally. Wang-Landau sampling is also supported in conjunction with the replica-exchange method, but here each replica is confined to its own iterative Wang-Landau procedure (since the Hamiltonians are most likely different).
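The iterative construction of ln g(E) (the first option above, with λ = 0 and Y = E) can be illustrated with a small, self-contained Python sketch. Everything in it is an assumption made for illustration and none of it is CAMPARI code: the toy system consists of independent two-state units whose "energy" is the number of units in the upper state, the histogram is incremented at every step, the stage is advanced whenever every bin has been visited at least once at a check, and ln f is simply halved.

```python
import math
import random

def wang_landau_toy(n_spins=4, check_interval=1000, lnf_start=1.0,
                    lnf_stop=1.0e-6, seed=1):
    """Estimate ln g(E) for n_spins independent two-state units, where
    E is the number of units in state 1. Exact answer: ln C(n_spins, E)."""
    rng = random.Random(seed)
    spins = [0] * n_spins
    energy = 0
    n_bins = n_spins + 1
    ln_g = [0.0] * n_bins      # running estimate of the target distribution
    visits = [0] * n_bins      # temporary histogram for the current stage
    lnf = lnf_start
    step = 0
    while lnf > lnf_stop:
        step += 1
        i = rng.randrange(n_spins)
        new_energy = energy + (1 - 2 * spins[i])
        # Wang-Landau acceptance: favor bins with a lower current estimate
        delta = ln_g[energy] - ln_g[new_energy]
        if delta >= 0.0 or rng.random() < math.exp(delta):
            spins[i] = 1 - spins[i]
            energy = new_energy
        # Increment the estimate and the histogram at the current bin
        ln_g[energy] += lnf
        visits[energy] += 1
        # Minimum-visitation check; on success, start the next stage
        if step % check_interval == 0 and min(visits) > 0:
            visits = [0] * n_bins
            lnf *= 0.5
    offset = ln_g[0]           # the estimate has an arbitrary offset
    return [x - offset for x in ln_g]

estimate = wang_landau_toy()
exact = [math.log(math.comb(4, e)) for e in range(5)]
print([round(x, 2) for x in estimate])
print([round(x, 2) for x in exact])
```

For this trivially small system the estimate should approach the exact binomial result up to statistical noise; for realistic systems, the choices of update frequency, visitation criterion, and range discussed for the associated keywords become decisive.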
If a Wang-Landau acceptance criterion is used for a (partial) Monte Carlo run, and if a molecular reaction coordinate was chosen as the histogram to consider (→ WL_MODE), this keyword allows the user to select the molecule that the reaction coordinate is computed on. The numbering of molecules follows the user-selected sequence in sequence input. Note that it is up to the user to ensure that the chosen reaction coordinate is defined and has a meaningful range for the chosen molecule (see WL_MAX, WL_EXTEND, and WL_BINSZ). If a two-dimensional variant with two geometric reaction coordinates is chosen, it is theoretically possible to supply two different molecules here. Note that the effective coupling is likely to be low in this scenario, which may lead to poor convergence properties in the 2D space. In conjunction with WL_MODE being 3, specification of a legal entry for WL_MOL will extend the WL estimation of ln p(E) to a two-dimensional case with an additional, geometric reaction coordinate (ln p(E,Z)). Note that this keyword is the only way to control the dimensionality for WL_MODE being either 2 or 3.
WL_RC
If a Wang-Landau acceptance criterion is used for a (partial) Monte Carlo run, and if a molecular reaction coordinate was chosen as the histogram (or as one or both axes of the 2D histogram) to consider (→ WL_MODE), this keyword allows the user to select among a few geometric reaction coordinates as follows:
- The molecule's radius of gyration is used (default). The range of this quantity is difficult to predict and depends on the constraints in the system. For example, in Cartesian space, it will be advisable to restrict the range of the histograms (→ WL_MAX and WL_EXTEND) to those values that do not coincide with steric overlap (low end) or stretching of bonds (high end).
- The molecule's mean α-content is used as defined for the global secondary structure biasing potential. The quantity always has finite range, but for small systems and typical settings, it exhibits sharp spikes connected by low likelihood regions that may challenge the discretization of the WL scheme.
- The molecule's mean β-content is used. See previous option for details and caveats.
This is one of the keywords that controls the convergence properties of a Wang-Landau run. The target distribution in question is accumulated as a histogram (always logarithmic), and this keyword sets the frequency (step interval) for updating it with the current value of the f parameter, i.e., the current increment size (equivalent to multiplication by f in the linear space). The accumulation of the target distribution begins only after the equilibration phase has passed. Naturally, a small setting here will quickly increment the histogram, which may accelerate convergence (in case the effective "mobility" of the system defined by system properties and sampling engine is good enough). However, a small setting may also interfere with convergence because it emphasizes the noise in initial estimates of the target distribution (in absolute magnitude), and this may make it harder to refine the guess upon reductions of the f parameter (see WL_HVMODE and WL_FREEZE). The default choice is 10 elementary steps. Note that if the parallel Wang-Landau implementation is used, the step number provided refers to the sampling amount for each individual node.
WL_HVMODE
This is one of the keywords that controls the convergence properties of a Wang-Landau run. It has been argued that the flatness of the accumulated histogram for the target distribution in question (usually tested via some maximum relative deviation criterion) is not generally useful as a criterion for considering a switch to the next stage of refinement (by lowering the f parameter), and can be replaced with a recurrence (minimum visitation) criterion (discussed for example in Zhou and Bhatt). This keyword selects between two different options for such a recurrence criterion. Option 2 requires each (relevant) bin to be visited at least once in every stage, whereas option 1 mandates that each bin be visited the nearest integer of 1/sqrt(f) times (at least once, though). In the parallel Wang-Landau implementation, the condition will always be checked against the combined data. If the condition is fulfilled, and if the number of post-equilibration Wang-Landau steps exceeds the buffer setting, ln f will be reduced (initial value set by keyword WL_F0) by a factor of 2. Note that the f parameter is implied to operate on a logarithmic scale (same as target distribution) of counts to avoid numerical issues with large numbers. The rule used here is equivalent to the square root update rule suggested in the original publication. Belardinelli and Pereyra suggest that the exponential update becomes inappropriate for small f and CAMPARI implements their suggestion to switch over to f ∝ 1/Nsteps, where Nsteps is the current number of WL steps having been executed. In the parallel Wang-Landau implementation, this implies the combined total of WL steps from all replicas.
This modified update rule is implemented irrespective of the fulfillment of the criterion defined by WL_HVMODE. It is useful to keep in mind that option 1 will initially lead to fewer reductions of the f parameter, which may be beneficial for establishing correctness, and at the same time may be harmful for the rate of convergence. Very-low-likelihood bins are an issue that often affects convergence adversely. In this context, it should be emphasized that the relevance of a bin toward defining flatness is partially controlled by keyword WL_FREEZE, which consequently serves two purposes, and partially controlled by the general range settings (WL_MAX, WL_EXTEND, and WL_BINSZ).
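The interplay of the visitation criterion and the two update rules can be sketched in Python as follows. This is a hedged reading of the description, not CAMPARI's code: the function names are hypothetical, the f parameter is taken on the logarithmic scale, option 2 is read as a minimum of one visit per bin, the switch to the 1/N rule is assumed to happen once the halved value of ln f would fall to or below 1/N, and the proportionality constant is assumed to be 1.

```python
import math

def required_visits(lnf, hvmode):
    """Minimum number of visits per relevant bin before a stage may end.
    hvmode 1: nearest integer of 1/sqrt(ln f), but at least 1.
    hvmode 2: a single visit per bin."""
    if hvmode == 2:
        return 1
    return max(1, round(1.0 / math.sqrt(lnf)))

def next_lnf(lnf, n_wl_steps, criterion_met, in_inverse_step_regime):
    """Return (new ln f, new regime flag).
    Exponential regime: halve ln f when the visitation criterion is met.
    Inverse-step regime: ln f = 1/N regardless of the criterion
    (proportionality constant of 1 is an assumption)."""
    if in_inverse_step_regime:
        return 1.0 / n_wl_steps, True
    if criterion_met:
        lnf *= 0.5
    if lnf <= 1.0 / n_wl_steps:
        return 1.0 / n_wl_steps, True
    return lnf, False

print(required_visits(0.01, 1))              # visits needed late in a run
print(next_lnf(1.0, 100, True, False))       # early stage: plain halving
print(next_lnf(1.0e-4, 12000, True, False))  # late stage: switch to 1/N
```

Under this reading, option 1 demands more visits per bin as ln f shrinks, which delays reductions relative to option 2 in the later exponential stages.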
If a Wang-Landau acceptance criterion is used for a (partial) Monte Carlo run, this keyword can be used to control the step interval at which the evaluation of the visitation criterion for the temporary histogram is performed. If the parallel Wang-Landau implementation is used, this coincides with the requirement to (at least temporarily) combine the data from all replicas and therefore imposes a communication requirement. Should a check return a positive result, the temporary histogram is added to the overall estimate, the temporary histogram is reset to zero, and the f parameter is altered as described elsewhere. In the parallel version, additional operations are performed to broadcast the new total (combined) histogram identically to all replicas. In case the criterion is not fulfilled, the temporary histogram(s) is (are) left unchanged.
The technical use of this keyword is twofold: first, to reduce communication requirements for the parallel implementation; second, to artificially delay the progression of the iteration. The latter can sometimes be useful for complex systems with strong degeneracy in the chosen reaction coordinate (also see WL_RC). Note that for the parallel code the step number provided refers to the sampling amount for each individual node.
If a Wang-Landau acceptance criterion is used for a (partial) Monte Carlo run, this keyword defines the starting value for the f parameter (logarithmic). The f parameter is meant to decay from some positive number to 0, which corresponds to multiplicative factors larger than 1 reducing to 1 in the linear space. The default is 1.0. The number of reductions of the f parameter by the exponential rule (see elsewhere) is printed to log output. Depending on the properties of the system and the resultant convergence rate, the rule may change as described for WL_HVMODE.
WL_MAX
If a Wang-Landau acceptance criterion is used for a (partial) Monte Carlo run, this keyword sets the (initial) upper bound (given as the bin center of the last bin) of the energy or reaction coordinate histogram (→ WL_MODE and WL_RC). At the beginning, 100 bins of equivalent size are created. Depending on the choice for WL_EXTEND, the histogram and its upper limit may be extended throughout the simulation. It is safe to extend the histogram to values that are impossible to realize for the system in question, since bins that are strictly empty do not meaningfully contribute to the algorithm (see WL_FREEZE). CAMPARI accepts two separate entries for any 2D histogram. Note that the choice for this keyword may be overwritten if a dedicated input file is used to set an initial guess for the target histogram (→ WL_GINITFILE). The maximum value that will not trigger a range exception or an automatic histogram extension is of course the value given here plus half the relevant bin size.
WL_BINSZ
If a Wang-Landau acceptance criterion is used for a (partial) Monte Carlo run, this keyword sets the fixed bin size for the energy or reaction coordinate histogram (→ WL_MODE and WL_RC). At the beginning, 100 bins are created. Depending on the choice for WL_EXTEND, the histogram and its lower and upper limits may be extended throughout the simulation. However, the bin size will remain fixed. CAMPARI accepts two separate entries for any 2D histogram. Note that the histogram bin size and the initial number of bins may be overwritten if a dedicated input file is used to set an initial guess for the target histogram (→ WL_GINITFILE).
WL_EXTEND
If a Wang-Landau acceptance criterion is used for a (partial) Monte Carlo run, this keyword controls whether the energy or geometric reaction coordinate histogram (→ WL_MODE) is allowed to grow in range during the simulation. Choices are as follows:
- The histogram is fixed. Note that any Wang-Landau simulation performed over a restricted interval bears the danger of generating incorrect results even after reweighting. For common interaction potentials and standard energy-based Wang-Landau sampling, this is particularly true for truncation of the energy histogram on the lower end.
- The histogram is allowed to grow only towards lower (more negative) values. This can be useful for energy histograms, where the initial energy range is not known.
- The histogram is allowed to grow in both directions. It is strongly recommended not to use this feature for energy histograms with a realistic interaction potential (since the energy is unbounded on the positive side, and memory exceptions / segmentation faults are likely). This option is meant primarily for histograms defined purely on geometric reaction coordinates (→ WL_MODE).
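The range logic described for the histogram keywords (upper bound given as the center of the last bin, 100 initial bins of fixed size, and optional growth) can be sketched in Python as below. The class and its string-valued `mode` argument ("fixed", "lower", "both") are hypothetical illustrations and do not correspond to CAMPARI's actual option codes or data structures.

```python
import math

class HistogramRange:
    """Toy model of a fixed-bin-size histogram range that may grow."""

    def __init__(self, upper_center, bin_size, n_bins=100, mode="fixed"):
        self.bin_size = bin_size
        self.n_bins = n_bins
        self.mode = mode  # hypothetical labels: "fixed", "lower", "both"
        # The upper bound is the center of the last bin, so the largest
        # admissible value lies half a bin size above it.
        upper_edge = upper_center + 0.5 * bin_size
        self.lower_edge = upper_edge - n_bins * bin_size

    @property
    def upper_edge(self):
        return self.lower_edge + self.n_bins * self.bin_size

    def locate(self, value):
        """Return the bin index for value, extending the range if the mode
        allows it, or None to signal a range exception."""
        if value < self.lower_edge:
            if self.mode not in ("lower", "both"):
                return None
            extra = math.ceil((self.lower_edge - value) / self.bin_size)
            self.lower_edge -= extra * self.bin_size
            self.n_bins += extra
        elif value >= self.upper_edge:
            if self.mode != "both":
                return None
            extra = math.floor((value - self.upper_edge) / self.bin_size) + 1
            self.n_bins += extra
        index = int((value - self.lower_edge) / self.bin_size)
        return min(index, self.n_bins - 1)

fixed = HistogramRange(upper_center=49.75, bin_size=0.5, mode="fixed")
print(fixed.locate(49.9), fixed.locate(50.0))   # last bin, then range exception
grow = HistogramRange(upper_center=49.75, bin_size=0.5, mode="lower")
print(grow.locate(-1.2), grow.n_bins)           # range grows toward lower values
```

Note that in this sketch growth toward lower values shifts all bin indices, and unbounded growth toward higher values illustrates why extending an energy histogram on the positive side can exhaust memory.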
If a Wang-Landau acceptance criterion is used for a (partial) Monte Carlo run, this keyword allows the user to replace the default initial guess for the (logarithmic) target distribution with a user-supplied one. The default guess is flat. Supplying a nonflat guess can be useful in several scenarios: i) ongoing refinement of a WL run; ii) cases where a more useful "zero order guess" is available, e.g., an exponentially growing function for a condensed phase system with inverse power potentials; iii) convergence tests. The details regarding the format of this input file are provided elsewhere.
WL_FREEZE
If a Wang-Landau acceptance criterion is used for a (partial) Monte Carlo run, this keyword controls whether the range of bins in the energy or reaction coordinate histogram (→ WL_MODE) that is considered for proceeding to the next iteration stage (updating the value of the f-parameter) is fixed after the first such update or not. The update procedure is described for keywords WL_HUFREQ, WL_HVMODE, and WL_FLATCHECK.Any positive integer specified here will prescribe a minimum number of preliminary simulation steps beyond equilibration that must be exceeded before an update of the f-parameter is considered. After such an update, the range of bins considered for the histograms is the continuous one (and it must be continuous on account of the update rule) currently populated. If during further simulation steps additional bins were to be visited, those moves are instead considered as range exceptions and are rejected (the summary statistics provided in log-output for range exceptions can therefore contain results from two different contributions → WL_EXTEND). Any negative number provided will specify by its absolute value the aforementioned minimum number of preliminary steps in identical fashion. However, in this case, CAMPARI is instructed to allow further bins to be added for consideration during later stages of the algorithm. Note that this violates the refinement idea behind the Wang-Landau scheme, and can lead to severe convergence problems due to the numerical mismatch created by the extra bin "missing out" on f-increments during early stages of the algorithm. It is therefore strongly recommended to choose a relatively large and positive number for this keyword (to ensure that appropriate coverage of the accessible range has been reached).
Note that if the parallel Wang-Landau implementation is used, the step number provided refers to the amount of sampling for each individual node.
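The sign convention described for WL_FREEZE can be summarized in a small sketch. This is only an illustration of the rules stated above, not CAMPARI code: the function names `interpret_wl_freeze` and `handle_visit` and the representation of the bin range as a `(lo, hi)` pair are assumptions made for this example, and the treatment of a value of zero is not covered by the text and is therefore rejected here.

```python
def interpret_wl_freeze(value):
    """Split a WL_FREEZE-style integer into (min_steps, range_is_frozen).

    Hypothetical helper for illustration only; it is not part of CAMPARI.
    """
    if value == 0:
        # The description only covers positive and negative values.
        raise ValueError("zero is not covered by the description")
    min_steps = abs(value)          # preliminary steps required before an f-update
    range_is_frozen = value > 0     # positive: bin range is fixed after first update
    return min_steps, range_is_frozen


def handle_visit(bin_index, lo, hi, range_is_frozen):
    """Return ((lo, hi), is_range_exception) for a bin visited after the first f-update."""
    if lo <= bin_index <= hi:
        return (lo, hi), False
    if range_is_frozen:
        # Move would leave the fixed range: counted as a range exception and rejected.
        return (lo, hi), True
    # Negative WL_FREEZE: the considered range is allowed to grow.
    return (min(lo, bin_index), max(hi, bin_index)), False


print(interpret_wl_freeze(50000))
print(handle_visit(12, 3, 10, True))
```

The second function makes the trade-off explicit: with a frozen range the histogram support never changes after the first update, whereas the extending variant admits bins that have missed earlier f-increments.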
If a Wang-Landau acceptance criterion is used for a (partial) Monte Carlo run, this simple logical allows the user to request debugging information regarding the Wang-Landau iterative algorithm. If turned on, CAMPARI will report in log-output the progression through the various updating stages and may - depending on settings - also write temporary output files for the relevant histograms.
Box Settings:
Every simulation has to occur within an explicitly or implicitly defined, finite volume. CAMPARI presently supports different ways of achieving such a finite volume, which are listed below. For constant volume ensembles (→ ENSEMBLE), the (formal) volume remains exactly constant throughout the simulation. This does not imply that volume remains a meaningful parameter under all circumstances, e.g., if phase separation occurs. For the type of boundary condition, there are currently three supported options and one quasi-obsolete mode:
- Periodic boundary conditions (PBC):
This is the most commonly used boundary condition in molecular simulations. Here, the generally polyhedral simulation cell is assumed to be replicated as a - theoretically infinite - periodic system around the central one (which constitutes the actual, physical simulation container). Partial periodicity is also possible with other walls implemented as restraints. This is theoretically applicable to many different containers including polyhedra but only supported for periodic cylinders at the moment (SHAPE is 3 and BOUNDARY is 1). The implementation is such that all distance calculations along periodic dimensions are amended by determining the smallest distance amongst those between a particle and any of the replicated images of another particle. This so-called minimum image convention implies that for normal pairwise interaction potentials (for example SC_IPP) a particle only interacts with at most one "version" of another particle, never two or more. The idea of PBC is borrowed from crystals in which the assumption of periodicity is justified given that the simulation volume can be chosen such that it coincides with the crystal's unit cell (or exact multiples thereof).
Conversely, in liquids there is no persistent long-range order (homogeneous density, no pair correlations), and the approximation of a system of thermodynamic size by infinite replication of a nanoscopic system is at least questionable. Given typical cutoff schemes, however, the contribution of longer-range interaction is often exactly zero unless explicit techniques are used enumerating the periodic sum (→ Ewald summation, which is the only feature for which CAMPARI currently calculates interactions beyond the minimum image convention). This means that the actual impact of PBC is often just to mimic a continuous environment for particles close to the edge of the physical simulation volume. Note that no real-space interaction cutoff should exceed half the shortest linear dimension (face-to-face distance) realizable in the simulation volume since otherwise it becomes possible for multiple images of the same particle to be within interaction distance. In conjunction with the minimum image convention cited above, this invariably leads to artefactual results (reference). Note that in CAMPARI the convention of using the nearest image operates at the molecule level, i.e., the general rule is that intramolecular distances always refer to atoms in the same image of a molecule. CAMPARI will occasionally warn users about cases where an image interaction would be within the cutoff distance, but these warnings are not part of all routines (for efficiency reasons). Enabling box-consistent trajectory output may help in diagnosing such issues independently.
- Hard-wall boundary condition (HWBC):
This option is obsolete and cannot be selected. It may be reactivated in the future to enable simulations in containers with hard, particle momentum-conserving (i.e., reflective) walls.
- Residue-based soft-wall boundary condition (RSWBC):
In simulations employing a continuum description of solvent, the resultant density is almost always low, in particular in the limit of simulating just a single macromolecule. In those cases, it may neither be meaningful nor beneficial to introduce additional replicas of the simulation cell. CAMPARI offers to define a system-volume via a soft-wall for such a scenario. Here, the simulated particles are prevented from leaving a simulation container (most often a spherical droplet) by an applied boundary potential modeled as follows.
Spherical case:
$E_{BND}^{Sphere} = \sum_i k_{BND} \cdot H(r_i - r_D) \cdot (r_i - r_D)^2$
Here, $r_i$ is the distance from a suitable reference point on residue $i$ to the simulation sphere's origin, $r_D$ is the sphere's radius, $k_{BND}$ is the force constant, and $H(x)$ is the Heaviside step function.
Rectangular box case:
$E_{BND}^{Box} = \sum_i \sum_{j=1}^{3} k_{BND} \cdot H(|d_{i,j}| - L_j/2) \cdot (|d_{i,j}| - L_j/2)^2$
Here, $d_{i,j}$ is the $j$th element of the distance vector of the reference point on residue $i$ to the center point of the box (note that by convention the lower left corner serves as origin of the box), and the $L_j$ are the side lengths.
Nonperiodic cylinder case:
$E_{BND}^{Cylinder} = \sum_i k_{BND} \cdot [ H(|d_{i,z}| - h) \cdot (|d_{i,z}| - h)^2 + H(r_{i,xy} - r_C) \cdot (r_{i,xy} - r_C)^2 ]$
Here, $d_{i,z}$ is the z-element of the distance vector of the reference point on residue $i$ to the middle of the cylinder (cylinder axis always aligns with z-axis), $r_{i,xy}$ is the distance of the same point from the cylinder axis in the xy-plane, and $h$ and $r_C$ are height and radius of the cylinder, respectively.
Partially periodic cylinder case:
$E_{BND}^{Periodic\ Cylinder} = \sum_i k_{BND} \cdot H(r_{i,xy} - r_C) \cdot (r_{i,xy} - r_C)^2$
The nomenclature is the same as for the nonperiodic cylinder. The partially periodic cylinder has a periodic boundary in the z-direction, and the corresponding term is thus missing from the boundary potential.
Triclinic case:
$E_{BND}^{Box} = \sum_i \sum_{j=1}^{3} k_{BND} \cdot H(|f_{i,j} - 0.5| - 0.5) \cdot (v_{i,j})^2$
Here, $f_{i,j}$ is the $j$th fractional unit cell coordinate of the reference point of residue $i$, and $v_{i,j}$ is the vector connecting the position of this reference point with its intersection point along the normal direction of face $j$ with the nearest face $j$. For a box vector parallel to a cardinal axis, this is the same as the solution for the rectangular box case. Acute angles between box vectors (see BOXVECTOR1 and related keywords) can easily create fairly narrow spaces that will lead to potentially severe artifacts. Generally speaking, the choice of such a container is difficult to justify if the interest is in the behavior and interactions of molecules. Note that, unlike distance-based operations and irrespective of the angles between box vectors, evaluating the boundary potential is not expensive even in triclinic cells since the computation of (origin-corrected) fractional coordinates and the vectors $v_{i,j}$ requires just a few added multiplications/additions.
In general, hard-wall boundaries may be approximated by letting $k_{BND} \to \infty$. This will deteriorate integrator stability in gradient-based simulations, however. Choosing a RSWBC means that the boundary penalty is imposed on the reference atom of each residue (for peptide residues this is always Cα). This can lead to potential boundary artifacts with parts of large residues sticking out of the sphere and hence being deprived of interactions with smaller residues. Additionally, it must be pointed out that soft-wall boundary conditions lead to somewhat ill-defined system volumes since the code assumes the fixed volume inside the boundary to be the system volume whereas realistically it should be slightly extended depending on temperature and stiffness. The latter is not easily computed, however, since 1) the purely kinetic (entropic) pressure may be altered by the presence of non-rigid molecules, and 2) the virial pressure is generally unaccounted for. Hence, an exact volume is only recovered in the limit of an infinitely stiff boundary (HWBC).
- Atom-based soft-wall boundary condition (ASWBC):
This option is analogous to the previous (RSWBC) option only that the boundary term is computed for each atom in the system rather than for the reference point on each residue in the system (formulas are not repeated). This will minimize artifacts of the aforementioned type, but it is also the most expensive droplet BC to compute. Because multiple atoms will contribute to the boundary penalty for each residue, it is generally recommended to use smaller force constants than for the RSWBC. This boundary condition is also underlying the compartmentalization potential, which is a set of additional inner, planar boundaries.
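The soft-wall boundary potentials given above for the spherical and rectangular cases can be evaluated with a few lines of code. The following is a minimal sketch and not CAMPARI's implementation: the function names and argument layout are assumptions, and `points` stands for whatever set of reference points the boundary acts on (one per residue for RSWBC, every atom for ASWBC). The box center is derived from the lower left corner, following the origin convention mentioned in the text.

```python
import math


def heaviside(x):
    """Step function H(x): 1 for positive arguments, 0 otherwise."""
    return 1.0 if x > 0.0 else 0.0


def sphere_boundary_energy(points, center, radius, k_bnd):
    """Sum of k_bnd * H(r - r_D) * (r - r_D)^2 over all points."""
    energy = 0.0
    for p in points:
        excess = math.dist(p, center) - radius
        energy += k_bnd * heaviside(excess) * excess ** 2
    return energy


def box_boundary_energy(points, lower_left, side_lengths, k_bnd):
    """Sum over points and dimensions of k_bnd * H(|d_j| - L_j/2) * (|d_j| - L_j/2)^2."""
    energy = 0.0
    for p in points:
        for x, x0, length in zip(p, lower_left, side_lengths):
            d = x - (x0 + 0.5 * length)      # distance to box center along this axis
            excess = abs(d) - 0.5 * length
            energy += k_bnd * heaviside(excess) * excess ** 2
    return energy


# One point inside and one point 2 Å outside a droplet of radius 10 Å.
pts = [(0.0, 0.0, 0.0), (0.0, 0.0, 12.0)]
print(sphere_boundary_energy(pts, (0.0, 0.0, 0.0), 10.0, 1.0))
```

Points inside the container contribute nothing, so the cost of the term scales with the number of reference points, which is why the atom-based variant is the more expensive one.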
This keyword lets the user specify the shape of the simulation container the system is enclosed in. The available choices for SHAPE depend on the boundary condition selected (→ BOUNDARY). At the moment, choices are as follows:
- Rectangular cuboid (= rectangular parallelepiped): This container is supported with both periodic boundary conditions (PBC) and soft walls. The dimensions are specified through keyword SIZE.
- Sphere: This container is only available with soft walls. The radius is specified through keyword SIZE.
- Cylinder: This container is available with soft walls or with partially PBC along the cylinder axis, which always aligns with the z-dimension, and atom-based soft walls elsewhere (BOUNDARY is 1). The radius and length are specified through keyword SIZE.
- Triclinic (= general parallelepiped): This container is supported with both periodic boundary conditions (PBC) and soft walls although not all features might be fully supported, especially in the periodic case. The dimensions are specified through the three keywords BOXVECTOR1, BOXVECTOR2, and BOXVECTOR3. Note that the term "triclinic" is used throughout to describe any type of parallelepiped, i.e., it here covers explicitly cases that in crystallography would not be called triclinic (like monoclinic).
In contrast to PBC, simulations in soft wall boundary conditions can use all available containers. Note that the geometry of a finite (nonperiodic) cuboid or cylinder is fundamentally mismatched with the radial (centrosymmetric) nature of most nonbonded interactions. Partial periodic boundary conditions are at the moment only supported for the cylinder as described above.
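For the rectangular cuboid and the partially periodic cylinder, the minimum image convention described under the periodic boundary condition amounts to wrapping each periodic component of a distance vector to its nearest image. The sketch below illustrates this for point particles; it is an assumption-laden illustration (function names and the `periodic` flag tuple are invented here) and does not reproduce CAMPARI's molecule-level image handling. The second function encodes the rule that no real-space cutoff should exceed half the shortest periodic dimension.

```python
import math


def minimum_image_distance(p, q, box_lengths, periodic=(True, True, True)):
    """Distance between p and the nearest image of q in a rectangular cell."""
    d2 = 0.0
    for a, b, length, is_periodic in zip(p, q, box_lengths, periodic):
        delta = b - a
        if is_periodic:
            # Shift by whole box lengths so that |delta| <= length / 2.
            delta -= length * round(delta / length)
        d2 += delta * delta
    return math.sqrt(d2)


def max_safe_cutoff(box_lengths, periodic=(True, True, True)):
    """Half the shortest periodic dimension; None if no dimension is periodic."""
    lengths = [l for l, is_periodic in zip(box_lengths, periodic) if is_periodic]
    return 0.5 * min(lengths) if lengths else None


# Two particles near opposite faces of a 10 Å cube are 2 Å apart through the boundary.
print(minimum_image_distance((1.0, 1.0, 1.0), (9.0, 1.0, 1.0), (10.0, 10.0, 10.0)))
print(max_safe_cutoff((30.0, 40.0, 50.0)))
```

Passing `periodic=(False, False, True)` mimics a cylinder that is periodic only along z, in which case the x and y components are left unwrapped.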
This keyword lets the user set the origin of the simulation system as a vector of three elements (x, y, and z). The reference point depends on the container's shape and is its origin for a sphere, its lower left corner for any parallelepiped, and the center of its central circular cross section for a cylinder. Note that for simulations started from "scratch" (no structural input), this keyword is mostly irrelevant. It is also theoretically meaningless for full 3D periodic boundary conditions (unless there are external fields or effectively external fields present) but might still be useful for convenience reasons even in that case (e.g. to simplify display of trajectory output). There are two potentially serious implications to consider, though:
- Structural output may be compromised if values are used that are far away from zero. This is because binary trajectory files and in particular the strictly formatted PDB-files have finite representation widths and fixed units (Å or nm), so that large coordinate values may not be representable (for PDB files, format adjustments to nonstandard formats are available, see PDB_OUTPUTSTRING and PDB_INPUTSTRING). It is therefore recommended to adjust this keyword such that the minimum and maximum values for Cartesian coordinates (largest dimension) are either symmetric around the origin of the coordinate system or strictly positive but with minimal values.
- If structural input is used, it is strongly recommended to match the settings for ORIGIN to that implied in whatever structural input is provided. In soft-wall boundary conditions (see BOUNDARY), it may otherwise occur that parts of the system overlap with the ill-placed boundary and that their internal arrangement is destroyed or that the simulation explodes during the first few steps of simulation. Similarly, potentials like the compartmentalization potential, position restraints, or spatial density restraints effectively describe potentials in absolute space, and in these cases it is always good practice to consciously define the origin and match it to expectation/input.
This keyword allows the user to define the size of the simulation container unless the container is triclinic (SHAPE is 4). In the triclinic case, this keyword is superseded by the three keywords BOXVECTOR1, BOXVECTOR2, and BOXVECTOR3. For the remaining container shapes, SIZE takes on alternative meanings. If the system volume is a rectangular cuboid (SHAPE is 1), a vector of three floating-point numbers is read in that specifies the three side lengths of the box in the x, y, and z-directions, respectively. If numbers are missing, the first is assumed to be the size in the x-direction, and this value will be reused for all missing specifications. If the system volume is spherical (SHAPE is 2), just one real number is needed that specifies the sphere's radius. Finally, if the system volume is cylindrical (SHAPE is 3), two floating-point values are read and assumed to be the radius and height of the cylinder, respectively. No missing values are tolerated in this case. Note that highly asymmetric boxes and very short, partially periodic cylinders can place very stringent requirements on cutoffs since it is generally the shortest dimension (face-to-face-across distance) that matters. All values are to be provided in Å.
BOXVECTOR1
If a triclinic simulation container has been chosen (SHAPE is 4), this keyword specifies, in Å, the first box vector (x y z as three blank-separated floating-point values). The designation as "first" is generally irrelevant, and there is no enforced relationship with the cardinal axes of the system. Note that it is a common convention, however, to have the first box vector be parallel to the x-axis, the second box vector to be in the xy-plane, and the third box vector oriented such that a right-handed coordinate system is formed. There is a convenience keyword (→ BOXROTATE) that enforces this arrangement for trajectory output irrespective of how exactly the box vectors are defined.
If d is meant to be the distance between opposite faces, and this distance is the same for all three sets of faces, then there are some particular geometries to consider: as mentioned above, the "minimum image cell" can have different shapes, and truncated octahedra and rhombic dodecahedra are of interest. Set the following to achieve these in a manner that creates a maximally symmetric (sphere-like) minimum image cell, which is consistent with the aforementioned convention:
- All: d 0 0
If a triclinic simulation container has been chosen (SHAPE is 4), this keyword specifies, in Å, the second box vector (x y z as three blank-separated floating-point values). The designation as "second" is generally irrelevant, and there is no enforced relationship with the cardinal axes of the system. Note that it is a common convention, however, to have the first box vector be parallel to the x-axis, the second box vector to be in the xy-plane, and the third box vector oriented such that a right-handed coordinate system is formed.
To achieve specific, efficient geometries, use these (where d is the (uniform) distance between opposite faces):
- Rhombic dodecahedron: 0 d 0
- Truncated octahedron: $d/3$ $2\sqrt{2}\,d/3$ $0$
- Hexagonal prism: $-d/2$ $\sqrt{3}\,d/2$ $0$
- Cube: 0 d 0
If a triclinic simulation container has been chosen (SHAPE is 4), this keyword specifies, in Å, the third box vector (x y z as three blank-separated floating-point values). The designation as "third" is generally irrelevant, and there is no enforced relationship with the cardinal axes of the system. Note that it is a common convention, however, to have the first box vector be parallel to the x-axis, the second box vector to be in the xy-plane, and the third box vector oriented such that a right-handed coordinate system is formed.
To achieve specific, efficient geometries, use these (where d is the (uniform) distance between opposite faces):
- Rhombic dodecahedron: $d/2$ $d/2$ $\sqrt{2}\,d/2$
- Truncated octahedron: $-d/3$ $\sqrt{2}\,d/3$ $\sqrt{6}\,d/3$
- Hexagonal prism: 0 0 d
- Cube: 0 0 d
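As a cross-check of the box vectors listed for BOXVECTOR1, BOXVECTOR2, and BOXVECTOR3, the sketch below assembles them for a given d and computes the length of each vector and the cell volume. It assumes that the fractional exponents in the listings denote square roots (for example, 2 to the power 1/2 is the square root of 2); the function names and geometry labels are invented for this illustration. For all four geometries each box vector has length d, which is the spacing between neighboring periodic images, while the volumes differ.

```python
import math


def box_vectors(geometry, d):
    """Return the three box vectors for one of the listed geometries."""
    s2, s3, s6 = math.sqrt(2.0), math.sqrt(3.0), math.sqrt(6.0)
    first = (d, 0.0, 0.0)  # "All: d 0 0"
    table = {
        "rhombic_dodecahedron": ((0.0, d, 0.0), (d / 2, d / 2, s2 * d / 2)),
        "truncated_octahedron": ((d / 3, 2 * s2 * d / 3, 0.0), (-d / 3, s2 * d / 3, s6 * d / 3)),
        "hexagonal_prism": ((-d / 2, s3 * d / 2, 0.0), (0.0, 0.0, d)),
        "cube": ((0.0, d, 0.0), (0.0, 0.0, d)),
    }
    second, third = table[geometry]
    return first, second, third


def cell_volume(a, b, c):
    """Volume of the parallelepiped spanned by a, b, c (absolute triple product)."""
    cross = (b[1] * c[2] - b[2] * c[1], b[2] * c[0] - b[0] * c[2], b[0] * c[1] - b[1] * c[0])
    return abs(a[0] * cross[0] + a[1] * cross[1] + a[2] * cross[2])


d = 60.0
for name in ("cube", "hexagonal_prism", "truncated_octahedron", "rhombic_dodecahedron"):
    vecs = box_vectors(name, d)
    lengths = [round(math.hypot(*v), 6) for v in vecs]
    print(name, lengths, round(cell_volume(*vecs) / d ** 3, 4))
```

The volume ratio relative to the cube is the quantity of practical interest: at the same image spacing d, the non-cubic cells require less volume, and hence fewer solvent particles.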
This keyword sets the harmonic force constant both for the residue-based and the atom-based SWBCs (see BOUNDARY) and for the compartmentalization potential. It is to be provided in units of kcal·mol⁻¹·Å⁻² and corresponds to parameter $k_{BND}$ in the equations above. It is currently not possible to disable the evaluation of this potential by setting SOFTWALL to zero. Both very small and very big values can be detrimental to a simulation by producing an ill-defined volume (small values) and creating large forces (big values), respectively.
As mentioned, SOFTWALL also serves as the (completely analogous) force constant for inner system boundaries added by means of the compartmentalization potential. Note that in simulations or trajectory analysis runs assuming constant pressure (fluctuating volume) conditions, all boundaries controlled by SOFTWALL will effectively move.
Integrator Controls (MD/BD/LD/Minimization):
If any dynamics-based method (including hybrid methods) is used, this keyword lets the user set the integration time step for the integrator in units of ps.
CARTINT
This keyword determines - at a very fundamental level - the choice of degrees of freedom that CAMPARI shall sample. The "native" CAMPARI degrees of freedom are the rigid-body coordinates of all molecules and a subset of internal coordinates (almost exclusively freely rotatable dihedral angles). This option is the default and specified by choosing 1 for this keyword. Alternatively, the Cartesian positions of all atoms in the system may serve as the underlying degrees of freedom as is commonly the case in molecular dynamics calculations (option 2). There are several very important limitations and considerations that are mentioned throughout the documentation and reiterated here.
- CAMPARI does not support the direct sampling of Cartesian degrees of freedom in Monte Carlo simulations. This applies to the MC portion of hybrid simulations as well. While it is trivial to design and implement simple move sets doing precisely that, their efficiency is negligible due to the large amount of motional correlation present between an atom and its immediate molecular environment.
- Internal space simulations do not require the full amount of bonded interaction parameters that are typically part of molecular mechanics force fields, specifically no bond length terms, and typically no or very few improper dihedral and bond angle terms (→ PARAMETERS).
- For freely rotatable dihedral angles, there is a distinction between those deemed important vs. those deemed unimportant. Details are listed in the documentation for providing sequence input. These choices generally pertain to methyl groups and/or to bonds describing electronically hindered rotations with identical groups. The resultant sets of degrees of freedom are not always entirely consistent (e.g., between polypeptide sidechains and their respective small molecule model compounds). Related keywords are OTHERFREQ (MC) and TMD_UNKMODE (dynamics).
- While unsupported residues pose no problems in the setup of Cartesian coordinates, internal coordinate space simulations need to infer which dihedral angles are rotatable from the input topology. This happens automatically and is described elsewhere. For eligible dihedral angles not identified with standard polypeptide or polynucleotide backbone angles, relevant keywords are again OTHERFREQ (MC) and TMD_UNKMODE (dynamics).
- The choice of degrees of freedom in internal coordinate space simulations can be customized rather flexibly by introducing additional constraints (see corresponding input file). For MC simulations, the preferential sampling utility offers an additional level of control.
- Conversely, algorithms to enforce holonomic constraints in Cartesian space simulations are often limited to weakly coupled constraints (see SHAKEMETHOD for details). This means that it is not (yet) possible to mimic torsional space constraints in a Cartesian space run but that it is possible to follow a typical MD protocol by simulating a flexible macromolecule with some bond length constraints in a bath of rigid water molecules.
- The existence of virtual sites (effectively atoms with no mass) poses stringent requirements to Cartesian dynamics, in that those sites have to be constrained exactly relative to real atoms. At each integration time step, the forces acting on these sites are transferred to the surrounding atoms, and their positions are rebuilt post facto (see elsewhere for more details). Virtual sites in internal coordinate space simulations can only cause issues if a degree of freedom's effective mass depends solely on such sites. Then, CAMPARI will automatically freeze the corresponding degree of freedom.
This keyword lets the user choose the thermostat to be used to generate an NVT (or NVT-like) ensemble in dynamics simulations using a Newtonian formalism (option 2 or 5 in DYNAMICS). Currently, three options are fully supported:
- Berendsen weak-coupling scheme (reference):
This is a deterministic and global velocity rescaling scheme which creates an exponential relaxation toward the target temperature. The velocity rescaling factor is computed for each coupling group (see TSTAT_FILE) according to:
$f_{v,i}^2 = 1.0 + (\delta t/\tau_T) \cdot [ (T_{target}/T_i) - 1.0 ]$
As is apparent, whenever the instantaneous group temperature ($T_i$) matches the ensemble target ($T_{target}$), velocities are not rescaled ($f_{v,i}$ is unity). Any deviations from $T_{target}$ will lead to a systematic rescaling for all velocities that are part of the coupling group toward the target with a decay time of $\tau_T$ (→ TSTAT_TAU). If $\tau_T$ approaches the discrete time step ($\delta t$), the relaxation becomes instantaneous. Note that the coupling of subparts of the system to essentially different thermostats is a largely obsolete method used in early days of simulations to prevent obscure freezing events sometimes encountered when the system is effectively partitioned into subsystems with very different levels of integrator stability, noise, and inherent relaxation. Then such an approach may circumvent the most dramatic pitfalls resulting from the inherent incorrectness of the weak-coupling scheme (and masking said incorrectness in the process). It is crucially important to realize that the Berendsen thermostat does not generate a well-defined ensemble and that the method only relaxes "safely" to the microcanonical one for $\tau_T$ approaching infinity. The quenched fluctuations observed in the Berendsen scheme may severely distort results on fluctuation-sensitive computations such as free energy growth calculations (see GHOST). Since the Berendsen scheme is a global coupling scheme, it is compatible with holonomic constraints but prone to equipartition artifacts (see option 4 below) like the freezing of subparts of the system mentioned above. Global thermostats are generally good at preserving velocity cross-correlations, which is an important property for tightly coupled systems, e.g. dense, polar liquids. They are also good at absorbing integrator (discretization) error.
- Andersen scheme (reference):
The Andersen thermostat is a stochastic thermostat which introduces "collisions" re-randomizing the velocity associated with a given degree of freedom to one coming from the ensemble at that given temperature. This method has been shown to sample the canonical ensemble and is one of the recommended options for any calculation sensitive to the details of ensemble fluctuations. Implementation-wise, it works by re-assigning the velocity for each degree of freedom at each time step with a probability equivalent to $\delta t/\tau_T$. This effectively gives rise to a "bath"-induced relaxation over a timescale $\tau_T$. Note that implementations in other software packages may synchronize the application of these velocity resets. This is not the case in CAMPARI where each degree of freedom is treated independently. In Cartesian space, this means that each dimension for every atom is coupled individually, i.e., the velocity resets are uncoupled. As a consequence, the Andersen thermostat as implemented currently is incompatible with holonomic constraints (while the constraints are maintained, their imposition systematically bleeds kinetic energy from the resets leading to an artificial cooling). Without constraints and much like in Langevin dynamics, a remaining concern here can be the artificial loss of velocity correlations between multiple particles. Unlike in Langevin dynamics, however, the stochastic process is uncoupled and instantaneous, which has an additional downside. If there are noticeable discretization errors (there usually are), one cannot rely on the Andersen thermostat to absorb them in the same way that a globally coupled thermostat does. This is (most likely) because errors accumulating locally must also be dissipated locally whereas in a global scheme they are dissipated globally.
- Extended ensemble methods:
Methods such as those by Nosé-Hoover, Martyna-Tobias-Klein, or Stern are currently not supported, but may be in the future. They often show poor relaxation behavior due to coupled oscillations, in particular in the NPT ensemble, which they are most useful for.
- Bussi et al. scheme (reference):
This thermostat can be thought of as a hybrid approach of the Nosé-Hoover and Berendsen thermostats. It preserves the exponential relaxation kinetics of the weak-coupling scheme if the ensemble target is far away but introduces fluctuations to the kinetic energy such that at equilibrium the global re-scaling does not quench fluctuations. This thermostat is currently the one offering the most general support for different classes of system and different types of degrees of freedom and thus the recommended option in most applications. The implementation is that of evolving the kinetic energy via an auxiliary stochastic dynamics much like the Langevin piston for pressure coupling does. Here:
$f_{v,i}^2 = e^{-\delta t/\tau_T} + f_{T,i} \, (1 - e^{-\delta t/\tau_T}) \, (R_1^2 + R_{\Gamma,N_{f,i}-1}) + 2 e^{-0.5 \delta t/\tau_T} \cdot R_1 \, [ f_{T,i} \, (1 - e^{-\delta t/\tau_T}) ]^{0.5}$
$f_{T,i} = N_{f,i}^{-1} \cdot (T_{target}/T_i)$
Here, $N_{f,i}$ is the number of degrees of freedom in the respective coupling group, $R_1$ is a normal random number with mean of zero and unity variance, and $R_{\Gamma,N_{f,i}-1}$ is a random number drawn from the gamma distribution with outside scale factor of 2.0 and shape of $(N_{f,i}-1)/2$. Like any thermostat acting globally, there is a higher risk for equipartition artifacts than for thermostats/methods that couple degrees of freedom individually. Equipartition artifacts are generally more likely to occur in weakly coupled and inhomogeneous systems, e.g., a peptide in a bath of ions in an implicit solvent model.
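The two global rescaling rules given above (Berendsen and Bussi et al.) can be written down directly. The following sketch evaluates the squared velocity scale factor for a single coupling group; it is an illustration of the formulas, not CAMPARI code. Function and argument names are assumptions, and the random numbers are passed in explicitly so that the deterministic part of the formula is easy to inspect. The helper `draw_bussi_randoms` uses Python's standard library generators and requires at least two degrees of freedom.

```python
import math
import random


def berendsen_scale_sq(t_inst, t_target, dt, tau):
    """Squared Berendsen factor: 1 + (dt/tau) * (T_target/T_i - 1)."""
    return 1.0 + (dt / tau) * (t_target / t_inst - 1.0)


def bussi_scale_sq(t_inst, t_target, n_dof, dt, tau, r1, r_gamma):
    """Squared Bussi et al. factor for given random numbers r1 and r_gamma."""
    c = math.exp(-dt / tau)
    f_t = (t_target / t_inst) / n_dof
    return (c
            + f_t * (1.0 - c) * (r1 * r1 + r_gamma)
            + 2.0 * math.sqrt(c) * r1 * math.sqrt(f_t * (1.0 - c)))


def draw_bussi_randoms(n_dof, rng=random):
    """Draw R_1 (standard normal) and R_Gamma (shape (n_dof-1)/2, scale 2)."""
    r1 = rng.gauss(0.0, 1.0)
    r_gamma = rng.gammavariate((n_dof - 1) / 2.0, 2.0)
    return r1, r_gamma


# Group that is 50 K too cold, 2 fs time step, 1 ps coupling time.
print(berendsen_scale_sq(250.0, 300.0, 0.002, 1.0))
r1, rg = draw_bussi_randoms(300, random.Random(7))
print(bussi_scale_sq(250.0, 300.0, 300, 0.002, 1.0, r1, rg))
```

When the coupling time equals the time step, the Berendsen factor reduces to the ratio of target to instantaneous temperature, which is the instantaneous relaxation mentioned in the text. The Bussi factor fluctuates around the same deterministic relaxation because the sum of the two random terms has an expectation equal to the number of degrees of freedom.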
If the simulation is performed in the NVT ensemble and if Newtonian dynamics are used, this keyword allows the user to set the key parameter of the employed thermostat, i.e. its coupling (decay) time, τT, in units of ps (the default is 1.0 ps). Note that it is really the ratio of the time step δt (see TIMESTEP) and this number that matters, hence TSTAT_TAU cannot be less than the integration time step.
TSTAT_FILE
If the simulation is performed in the NVT ensemble and if Newtonian dynamics are used, this keyword sets name and location of an optional input file for defining thermostat coupling groups. These are meaningful only if the Berendsen weak-coupling or the Bussi et al. scheme is used (options 1 or 4 for TSTAT). For details, the user is referred to the description of the input file itself.
SYSFRZ
This keyword controls the removal of net drift artifacts in dynamics runs (which are primarily relevant for fully ballistic MD). Predominantly in periodic boundary conditions (see BOUNDARY), it can happen that all kinetic energy is transferred into global translations or rotations of the system. This collective "degree of freedom" is typically friction-free and therefore represents a stable trap for the system's kinetic energy to accumulate in. Such behavior will give rise to grossly misleading results (the effective ensemble sampled has a much lower temperature). This can be avoided by periodically removing such global motions. For translational displacements, this is easy, but for rotational motion problems arise if subensembles have access to modes that are quasi friction-free themselves. This is often the case in mixed rigid-body/torsional dynamics and at the moment not dealt with properly.
Choices are as follows:
- No removal of global motions is performed (the safest setting for most applications).
- CAMPARI will attempt to only remove translational motion of the system.
- CAMPARI will attempt to remove both global translation and global rotation (this option should be used with caution and is also automatically disabled if certain types of constraints are in use).
If a simulation is performed in mixed torsional/rigid-body space that contains a Newtonian dynamics portion, then this keyword allows the user to choose between (currently) two basic integrator variants. All integrators are derived from the following discrete scheme that relies on the aforementioned assumptions, i.e., a diagonal mass matrix (equations of motion formally decoupled) and the accuracy/correctness of the total kinetic energy expressed in terms of this diagonal mass matrix. Then, we can define pseudo-symplectic conditions as shown below for a (rotational) degree of freedom with index $k$:
$I_k(t_2)\,\omega_k(t_2)^2 - I_k(t_1)\,\omega_k(t_1)^2 - \delta t \, [ \omega_k(t_1) + \omega_k(t_2) ] \, F_k(t_{1.5}) = 0$
Here, $\delta t$ is the integration time step, $I_k$ denotes the diagonal element of the mass matrix for the $k$th degree of freedom (function of time), $\omega_k$ is the associated angular velocity, and $F_k$ denotes the deterministic force projected onto this degree of freedom (torque). The projection yielding the torques and the mass matrix elements are computed with recursive schemes, i.e., they operate in linear time with the number of atoms in the molecule (more or less irrespective of how many rotatable bonds there are). More information on this recursive scheme can be obtained indirectly with the help of keyword TMDREPORT. The above scheme defines a quadratic equation that has a maximum of two solutions for $\omega_k(t_2)$ (formula omitted). The correct one must be picked (which may be difficult), and an alternative must be defined if no solutions are possible. For both purposes, we use a well-defined approximation to the full solution that yields:
$\omega_k(t_2) \approx [ I_k(t_1)/I_k(t_2) ]^{1/2} \, \omega_k(t_1) + \delta t \, F_k(t_{1.5})/I_k(t_2)$
This solution is always available and can be used to pick the correct solution amongst two alternatives for the full quadratic equation (simply as the closer one). The setting for TMD_INTEGRATOR determines whether the correct solution to the quadratic equation should be used whenever possible (option 1), or whether the approximation is used exclusively (option 2, which is the default for historical reasons). As written, the equations still contain the problem that they require knowledge of Ik(t2), whereas only the half-step mass matrix elements (which are structural quantities) are available in a typical leapfrog scheme. If the Ik are slowly varying functions in time, a simple approximation solving this problem is to allow a lag of half a time step:
$\omega_k(t_2) \approx [ I_k(t_{0.5})/I_k(t_{1.5}) ]^{1/2} \, \omega_k(t_1) + \delta t \, F_k(t_{1.5})/I_k(t_{1.5})$
This is again written for the approximative version (setting 2). The resultant leapfrog integrator is extremely simple and efficient, and it is obtained by setting the related keyword TMD_INT2UP to 0. However, at each integration time step, we can also take a half-step guess using a similar approximation to obtain a value for all the Ik(t2). This is done by explicitly perturbing the coordinates and recomputing just the mass matrix elements (little additional cost for all but tiny or trivial systems). With the values obtained, we can integrate the second equation above as written (this is obtained by setting TMD_INT2UP to 1). While theoretically more accurate, this variant can be noisy due to the extrapolation of the masses. In practice, for systems with very small and quickly varying Ik (such as rigid water molecules), performance is similar for all four pairings (TMD_INTEGRATOR 1 or 2, TMD_INT2UP 0 or 1), and reveals that additional corrections are recommended if the rate of change of the Ik is high (see below). Conversely, if the rate of change is negligible, all possible settings obtainable by combinations of the two keywords mentioned here relax to the exact same integrator (standard leap-frog in rotational space). This covers the special case of linear (translational) degrees of freedom, which have constant mass.
Note that this keyword is currently irrelevant for stochastic dynamics (always uses a derivation analogous to the last equation above), but that it is relevant for the stochastic minimizer. Another crucial keyword relevant to TMD integrators in general is ALIGN. Specifically for the Newtonian case, coupling parameters become relevant as well (TSTAT and TSTAT_TAU in particular).
If a simulation is performed in mixed torsional/rigid-body space that contains a Newtonian dynamics portion, then this keyword allows the user to control the number of incremental velocity update steps used to improve integrator stability for cases with quickly varying elements of the mass matrix (see above). The cases of 0 and 1 have already been covered in the documentation on TMD_INTEGRATOR. The remaining options assume that values for the diagonal elements of the mass matrix at times t1, t1.5, and t2 are available explicitly (as in: computed directly from coordinates) when trying to compute the updated angular velocity for a degree of freedom at time t2. Rather than solving the velocity update step in one step, the interval from t1 to t2 is instead divided into TMD_INT2UP subintervals, and the velocity is updated incrementally for each subinterval. If TMD_INT2UP is larger than 2, additional values are obtained by linearly interpolating between explicit values at the three times. This is why it is recommended to set this keyword to multiples of 2, and this is also why the added benefit becomes successively smaller. A recommended value is 4. Note that this only matters for velocity updates, and that the torque is assumed constant over the entire interval (Fk(t1.5) above). As a result, this option does not notably alter speed for a system of appreciable size and is not at all equivalent to a change in integration time step.
FUDGE_DYN
If a simulation is performed in mixed torsional/rigid-body space that contains a dynamics portion, this keyword, when set to "1", enables an automatic adjustment of inertial masses. Because the equations shown above do not involve an explicit coupling of inertial masses to each other (the mass matrix is assumed diagonal), they can be scaled independently. As long as the scale factor is constant, this effectively increases the rotating mass for an individual degree of freedom without altering that of others. This effect cannot be achieved by atomic mass patches because atoms contribute to the inertial mass of more than one degree of freedom, including rigid motion of the molecule in question.
If FUDGE_DYN is 1, CAMPARI will, based on the initial structure it has of the system, scan all rotational degrees of freedom, identify those with inertial masses of less than 6.0 g·Å²/mol, and calculate and store a constant scale factor that brings the inertial mass up to 6.0 g·Å²/mol. This factor is never adjusted again and is preserved on restarts, which is critical. For normal molecular systems, the correction affects terminal moieties involving just hydrogen atoms (e.g., torsions in methyl or hydroxyl groups) and rigid rotations of molecules composed of a single heavy atom plus hydrogen atoms (like methane and water). Dihedral angles across nearly collinear arrangements may also be picked up (like in alkynes) although they may also be frozen (based on a numerical threshold). Finally, molecules with tiny overall masses are excluded from this procedure as this signifies to CAMPARI that the parameters, units, and/or systems are meant to be nonstandard.
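The scale-factor rule described for FUDGE_DYN can be illustrated with a short Python sketch. It assumes that the inertial masses of the initial structure are already available as positive numbers in g·Å²/mol; the function name and the list-based interface are hypothetical and chosen only for illustration.

```python
MIN_INERTIA = 6.0  # threshold quoted in the text, in g*A^2/mol

def fudge_scale_factors(initial_inertias, min_inertia=MIN_INERTIA):
    """Return one constant scale factor per rotational degree of freedom.

    Degrees of freedom at or above the threshold keep a factor of 1.0;
    lighter ones are scaled so that their inertial mass reaches the threshold.
    Inputs are assumed to be strictly positive.
    """
    return [max(1.0, min_inertia / inertia) for inertia in initial_inertias]

# A light (methyl-like) rotor, one exactly at the threshold, and a heavy one.
factors = fudge_scale_factors([1.5, 6.0, 24.0])
```

Because the factors are computed once from the initial structure, they would have to be stored and reused unchanged on restarts, as the text emphasizes.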
If a simulation is performed that contains a dynamics portion, this keyword allows the user to implement an artificial safety net to protect against a numerically unstable simulation. It works by defining an upper threshold temperature. As soon as the net system temperature exceeds this threshold, CAMPARI will rescale all velocities such that the exact target temperature is set instantaneously. There are several caveats to be wary of:
- Since the size of temperature fluctuations scales inversely with the number of degrees of freedom, the threshold should be system-specific.
- For systems with many degrees of freedom, a local instability will lead to the undesired behavior of cooling down large parts of the system well below the target temperature.
- If there is a settings-related reason for the instability (e.g., time step issues), it will reoccur and ultimately lead to an uninterpretable result or a crash.
- An ensemble rescaled one or more times using this technique has no obvious thermodynamic interpretation. One should keep in mind, however, that all discrete integration schemes implemented here yield ensembles that are, rigorously speaking, not the correct thermodynamic ones (see this reference for further information).
- Generally speaking, as the conformation is not altered, the use of this keyword offers no guarantees that an initially unstable simulation will survive - it simply increases the chances.
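The safety net described above amounts to a conditional, uniform velocity rescaling. The following Python sketch shows the idea under the assumption that the instantaneous temperature is proportional to the kinetic energy, so that scaling all velocities by the square root of the temperature ratio sets the target temperature exactly. All names are illustrative and not taken from CAMPARI.

```python
import math

def apply_temperature_safety_net(velocities, current_temp, target_temp, threshold_temp):
    """Return rescaled velocities if current_temp exceeds threshold_temp.

    Hypothetical helper: velocities is a flat list of velocity components.
    Below the threshold the velocities are returned unchanged.
    """
    if current_temp <= threshold_temp:
        return list(velocities)
    # Temperature scales with velocity squared, hence the square root.
    factor = math.sqrt(target_temp / current_temp)
    return [v * factor for v in velocities]

# System at 1200 K with a 1000 K threshold and a 300 K target: factor 0.5.
rescaled = apply_temperature_safety_net([2.0, -4.0], 1200.0, 300.0, 1000.0)
# System at 900 K stays untouched.
untouched = apply_temperature_safety_net([2.0, -4.0], 900.0, 300.0, 1000.0)
```

The sketch also makes the second caveat visible: the same factor is applied to every velocity, so a local instability cools the entire system.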
If a simulation is performed that contains a dynamics portion, this keyword allows the user to define a criterion for when such a gradient-based propagation is deemed numerically unstable. The keyword is interpreted as a fractional (relative) step size. As soon as any propagation makes a step larger than the resultant effective maximum increment, a premature termination condition is triggered. For rotational motion (only relevant in internal coordinate spaces), the effective maximum increment is THRESHOLD_INCR·360.0°, whereas for translational motion it is THRESHOLD_INCR·V^(1/3), where V is the volume of the simulation container.
In production simulations, this keyword is useful primarily to avoid cryptic terminations due to integrator instability. The main actual use of the keyword is in small molecule screens: here, an uncontrolled floating point exception would lead to the termination of the entire process, and relying on THRESHOLD_INCR instead allows CAMPARI to continue with the run. The default value is 1.0. For the keyword to work as intended, this value should generally be smaller (0.1 or less).
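To make the two effective maximum increments concrete, the Python sketch below computes them from a value of THRESHOLD_INCR and a container volume. The function names are assumptions for illustration; only the two formulas come from the description above.

```python
def max_increments(threshold_incr, box_volume):
    """Return (maximum rotation in degrees, maximum translation in length units).

    box_volume is the volume of the simulation container, e.g. in cubic
    Angstrom, so that the translational limit has units of Angstrom.
    """
    max_rotation_deg = threshold_incr * 360.0
    max_translation = threshold_incr * box_volume ** (1.0 / 3.0)
    return max_rotation_deg, max_translation

def step_is_unstable(step_size, max_increment):
    """Flag a single propagation step that exceeds the allowed increment."""
    return abs(step_size) > max_increment

# THRESHOLD_INCR of 0.1 in a container of 1000 cubic Angstrom.
rot_limit, trans_limit = max_increments(0.1, 1000.0)
```

With the default of 1.0, a rotational step would have to exceed a full turn to trigger the criterion, which illustrates why smaller values are suggested.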
If a simulation is performed in mixed torsional/rigid-body space with a gradient-based sampler (including minimization), then this keyword controls default constraints operating on certain rotatable dihedral angles. A second function of this keyword occurs in structural clustering using a distance function based on dihedral angles (see end of description). As described for sequence input, there is a selection of "native" CAMPARI torsional degrees of freedom that does not include every rotatable dihedral angle in natively supported residues, and for obvious reasons does not include any degrees of freedom within unsupported residues. This keyword therefore controls how to deal with these two categories of additional degrees of freedom. Options are as follows:
- Only native CAMPARI degrees of freedom are sampled. This will leave any unsupported residues and molecules completely rigid.
- In addition to native CAMPARI degrees of freedom, all identified degrees of freedom in unsupported residues and molecules will be sampled.
- In addition to native CAMPARI degrees of freedom, all torsional degrees of freedom in natively supported residues, which are frozen by default, are sampled. This will leave any unsupported residues and molecules completely rigid.
- All aforementioned classes of degrees of freedom are sampled.
In the context of structural clustering, this keyword co-controls which dihedral angles are eligible as dimensions for a distance function. This applies to cases where a custom request is made through the appropriate input file and to cases where the full dimensionality is meant to be used. Further information is provided elsewhere.
This simple logical keyword enables the printing out of information regarding internal degrees of freedom (rigid body, torsional). This file is particularly useful for constructing input for a specific input mode for custom constraints. For every molecule it lists the index of the first atom in that molecule ("Ref."), the total number of atoms ("Atoms"), the total mass ("Mass") after applying all patches, and, if a gradient-based sampler in torsional space is in use, information on whether the (up to) 6 rigid body degrees of freedom are frozen or not ("Frozen"). The order is translation in x, y, and z followed by rotation around x, y, and z axes.
The output on the dihedral angles in the molecule provides the following information. For each atom that, in the Z matrix, corresponds to the definition of a relevant (rotatable) dihedral angle, the structure of the rotation list setup is provided. Specifically, the number of rotating (swiveling) atoms ("Rotat.") is printed out along with their total mass ("Mass"). It is specified how many of the rotating atoms are unique ("Unique") with respect to that atom's rotation list a level above in the hierarchy that contains the rotation list for the present atom entirely ("Parent"). The hierarchy is understood by considering the polymer as a branched chain with a number of tips and a base of motion. This base of motion is defined by keyword ALIGN. Degrees of freedom at the tip have minimal rank (starting at 0) whereas those near the base have maximal rank in the hierarchy ("Rank"). The hierarchy necessitates a particular sequence of processing the individual degrees of freedom ("Order"). The report also provides information on the chemical elements ("Ele.") of the 4 constituent atoms for the dihedral angle (the atom defining the dihedral angle comes last) and on whether the degree of freedom is frozen in torsional space molecular dynamics ("Frozen").
The last bit of information is available only if a gradient-based sampler in torsional space is in use. The report is available irrespective of the type of calculation being performed. Note that keyword ALIGN, while conceptually controlling the same thing, is implemented differently in Monte Carlo moves. This means that most of the columns are representative for the Monte Carlo part only if ALIGN is set to 1. In hybrid samplers, the dynamics portion takes precedence.
This keyword specifies name and location (full or relative path) of the input file for the selection of molecules or residues for which selected degrees of freedom are to be excluded from sampling by explicit removal from Monte Carlo sampling lists and/or by not integrating equations of motion for them. This means that only such degrees of freedom can be constrained that are in fact explicit degrees of freedom of the sampling scheme in use (see DYNAMICS and CARTINT). If this keyword is not present, no constraints are going to be used beyond the system-imposed ones, which may be sampler-dependent. Note that restricting the Monte Carlo move set defines effective and group-wise constraints not covered here. The same is true for modifying the default picking probabilities for individual degrees of freedom as part of a Monte Carlo move. In Cartesian space, explicit constraints to the x, y, and z coordinates of selected atoms are possible. However, indirect geometric constraints are also supported (differently and independently via SHAKESET).
The input for explicit constraints is described in detail elsewhere. Hard constraints may be necessary for specialized applications, for example when one attempts to just re-equilibrate the sidechains in a folded protein while leaving the fold intact. In general, it will be possible to use restraints (see for example SC_TOR or SC_DREST) as alternatives. Those allow the selected degrees of freedom to respond and fluctuate around a stable equilibrium position. Another example of a highly specialized application would be the simulation of a two-dimensional system.
Note that constraint requests are not entirely arbitrary, and that the level of control being offered depends on the sampling engine and the chosen input mode. In general, it may prove challenging to match custom constraints exactly when a hybrid sampling approach is used. This means that the sampled sets of degrees of freedom between Monte Carlo and dynamics segments may differ slightly, which is sometimes desired. An example is the sampling of five-membered rings, which are frozen in internal coordinate dynamics but can be sampled by a dedicated Monte Carlo move (→ PKRFREQ and SUGARFREQ). Furthermore, introducing constraints may prohibit certain MC samplers from being applied not just to the residues carrying the constraints but surrounding ones as well (such as concerted rotation methods → CRFREQ) due to underlying and conflicting assumptions. Lastly, CAMPARI will exit with an error if user-selected constraints deplete the sampling list for a given MC move type entirely. Here, it is requested of the user to explicitly adjust the move set, since otherwise these moves would have to be converted to another type that is not necessarily desirable (note that this still happens if moves are requested that the chosen system simply does not support).
If explicit coordinate constraints are used (→ FRZFILE), this keyword acts as a simple logical determining whether or not to write out a summary of the constraints in the system to log-output.
SKIPFRZ
If constraints are used (→ FRZFILE) in torsional space simulations, this keyword gives the user some control over the calculation of effectively frozen interactions due to constraints. In Monte Carlo simulations (see DYNAMICS), incremental energies are computed by only considering the parts of the system that move relative to one another. This automatically addresses constraints. Conversely, in dynamics the total system energy and forces are calculated at each step. If this keyword is set, interactions between parts that have no chance of moving relative to one another (relative orientation completely constrained) will no longer be considered. Note that the potentials rigorously have to be (at most) pairwise decomposable for this option to be available (e.g., the polar term in the ABSINTH implicit solvation model is not strictly pairwise decomposable; → SC_IMPSOLV and SCRMODEL). Usage of this keyword can significantly accelerate dynamics runs or minimization runs in heavily constrained systems (such as ligand optimizations within a rigid protein binding site). Note that any reported energies do not contain the frozen contributions either if this option is chosen.
MOL2FOCUS
If a small molecule screen is performed and any sampler in internal coordinate space is used, this keyword controls which degrees of freedom are available for sampling in the remainder of the system (excluding the screened molecule). Specifically, the use of this keyword allows the addition of extra constraints, e.g., those specified through keywords TMD_UNKMODE and FRZFILE. It can never be used to remove existing constraints. This keyword is ignored in scoring mode.
Two options are currently available.
- If the value given is positive, this is interpreted as a threshold distance in Å. Any torsional or rigid-body (IMD) degree of freedom moving any atom that is within this distance from any atom of the reference ligand, i.e., the first one in the MOL2 input file, will remain free if it is free to begin with. All the other ones will be frozen unless they are already frozen. Because of the definition, small values will be enough to keep the immediate vicinity of the ligand flexible. This will include main chain degrees of freedom if those are not constrained by other means.
- If the value is given as negative, this is interpreted as follows. Any IMD degree of freedom to remain free must only move atoms that are all within (the absolute value of) the value given in Å of all atoms of the reference ligand, i.e., the first one in the MOL2 input file. This is by definition fairly restrictive and generally requires larger values to prevent everything from being frozen.
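The difference between the two interpretations is essentially "any atom pair" versus "all atom pairs". A minimal Python sketch of the distance test, assuming coordinates are given as (x, y, z) tuples in Å and treating "within" as less than or equal, could look as follows. The function name and data layout are hypothetical, and a value of zero is not covered by the description and therefore not handled.

```python
import math

def passes_focus_criterion(moved_atoms, ligand_atoms, focus_value):
    """Distance test for one degree of freedom that is free to begin with.

    moved_atoms: coordinates of all atoms moved by the degree of freedom.
    ligand_atoms: coordinates of the reference ligand atoms.
    focus_value: positive or negative threshold as described in the list above.
    """
    cutoff = abs(focus_value)
    distances = [math.dist(a, b) for a in moved_atoms for b in ligand_atoms]
    if focus_value > 0:
        # Positive: one close atom pair is enough to keep the DOF free.
        return any(d <= cutoff for d in distances)
    # Negative: every moved atom must be close to every ligand atom.
    return all(d <= cutoff for d in distances)

moved = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
ligand = [(1.0, 0.0, 0.0)]
free_positive = passes_focus_criterion(moved, ligand, 2.0)
free_negative = passes_focus_criterion(moved, ligand, -2.0)
free_negative_large = passes_focus_criterion(moved, ligand, -9.5)
```

The example shows why negative settings generally need larger magnitudes: a single distant atom in the rotation list is enough to freeze the degree of freedom.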
With additional coordinates flexible, it may be essential for the CAMPARI output files to document these additional coordinates. This is achievable either with the help of keyword MOL2PDBMODE or in the mol2 output file itself (by means of keyword MOL2AUXINDEX). The former solution is often impractical since the entire system is written and the output is to individual files. The latter solution is particularly convenient if the special option "automatic" is chosen for MOL2AUXINDEX, which ensures that only a near-minimal set of extra coordinates is written exactly in response to the present constraints. Note that any flexibility on the receptor side means that poses missing this information can no longer be analyzed meaningfully, e.g., in a (re)scoring run.
If a small molecule screen is performed and a sampler in internal coordinate space is used, this keyword offers some control over how torsional degrees of freedom are sampled in the screened molecules. For runs in Cartesian space, there is currently no specific customization available.
Since the screened molecules are always single-residue molecules with the same index for the first atom, rigid-body motion can be controlled with maximum flexibility using a custom constraint input file (mode 'A'), and it is thus not covered by this keyword. For both Monte Carlo and gradient-based runs, basic settings are governed by the move set controls and the preferential sampling utility. The major difference between Monte Carlo and gradient-based runs is that the MC move set's prior state may prevent some degrees of freedom from being added. For example, if single dihedral pivot moves on unsupported residues are not enabled to begin with, options 1-3 below will not change this. Note that MOL2FRZMODE is processed but only diagnostically relevant in scoring mode.
The options are as follows:
- Constraints are left as is, i.e., they are determined by the setting for TMD_UNKMODE, custom constraints, and, for Monte Carlo, by the move set choices.
- Only those dihedral angles rotating only hydrogen atoms are left free (the detection of a hydrogen atom is based on mass, which is ultimately dependent on the type matching algorithm). All of the eligible dihedral angles receive the same sampling weight in Monte Carlo.
- All dihedral angles are left free. In Monte Carlo, the sampling weights are proportional to the number of atoms being rotated except if there are more than 3 (below, the weights are flat).
- All dihedral angles are frozen. This is rarely useful and may lead to crashes.
It is standard practice in molecular dynamics simulations in Cartesian space to employ holonomic constraints such that the system evolves according to Gauss's principle of least constraint. The reader is referred to the literature as to what exactly constitutes a time-reversible, symplectic integrator if holonomic constraints are enforced. In general, it will be possible to formulate an algorithm which at least is drift-free, has some target precision for the constraints, and is approximately symplectic when the microcanonical ensemble is in use.
The idea behind holonomic constraints in molecular dynamics is to eliminate fast vibrational modes in the system to allow for a larger integration time step to be used. This keyword allows the user different choices for which holonomic constraints to employ as follows:
- No holonomic constraints are used.
- All "native" bonds to terminal atoms with a mass of less than 3.5 a.m.u. are constrained in length. A terminal atom is defined as any atom bound to exactly one other atom. "Native" means that only bonds consistent with the assumed molecular topology (code-internal) are considered. This selection will usually constrain all bonds to hydrogen atoms.
- All "native" bonds of any type are constrained in length. This does include bonds formed by virtue of chemical crosslinks.
- All "native" bonds of any type are constrained in length as in mode 3. In addition, several bond angles are constrained explicitly. For a molecule free of rings of size 6 or less all bond angles are constrained (this also constrains improper dihedral angles at trigonal centers). For molecules with rings of size 6 or less, ring-internal bond angles are generally omitted. Note that more bond angles can be formulated at a tetrahedral site than constraints are needed, and that - system-dependent - redundant constraints may be created (which may be harmful). This option is only supported for the standard SHAKE constraint algorithm at the moment.
- This is nearly identical to option 4. However, bond angles are constrained by additional distance constraints rather than explicitly. This means this option is theoretically available for constraint algorithms other than SHAKE.
- An input file is read and used to derive the list of constraints. Note that it is possible to derive intra- and intermolecular long-distance constraints that way (geometric information will be taken from starting structure), but that those will very easily cause CAMPARI to crash.
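The selection rule of the second option in the list above (bonds to light terminal atoms) can be sketched in Python as follows. The bond list and mass table are hypothetical inputs chosen for illustration; CAMPARI derives the corresponding information from its internal topology.

```python
def bonds_to_light_terminal_atoms(bonds, masses, max_mass=3.5):
    """Select bonds that involve a terminal atom lighter than max_mass (a.m.u.).

    bonds: list of (i, j) atom index pairs from the assumed topology.
    masses: dict mapping atom index to mass in a.m.u.
    A terminal atom is one that is bound to exactly one other atom.
    """
    neighbor_count = {}
    for i, j in bonds:
        neighbor_count[i] = neighbor_count.get(i, 0) + 1
        neighbor_count[j] = neighbor_count.get(j, 0) + 1
    selected = []
    for i, j in bonds:
        if any(neighbor_count[a] == 1 and masses[a] < max_mass for a in (i, j)):
            selected.append((i, j))
    return selected

# Methanol-like example: C(0), O(1), three H on C (2-4), one H on O (5).
example_bonds = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5)]
example_masses = {0: 12.011, 1: 15.999, 2: 1.008, 3: 1.008, 4: 1.008, 5: 1.008}
constrained = bonds_to_light_terminal_atoms(example_bonds, example_masses)
```

In this example only the C-O bond is left unconstrained, which matches the statement that the selection usually covers all bonds to hydrogen atoms.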
The cost, accuracy, and applicability of constraint algorithms all scale poorly with the level of coupling. Options 4 and 5 from the list above will therefore be usable only in special cases (→ SHAKEMETHOD) such as systems without any rings or planar, trigonal centers. For specific applications using angle constraints, we strongly recommend defining a minimum set of distance-based constraints via option 6 above. This has the best chance to succeed.
If SHAKESET is set to 6, this keyword specifies the name and location of the file defining user-selected holonomic constraints to be enforced during the simulation. Its format and requirements are documented elsewhere.
SETTLEH2O
This keyword allows the user to append/modify the constraint set selected via SHAKESET to replace all preexisting constraints acting on three-, four-, or five-site water molecules (SPC, TIP3P, TIP4P, TIP4P-Ew, or TIP5P) with constraints that completely rigidify each water molecule. It acts as a simple logical and is turned on by default since CAMPARI as of now does not explicitly support any inherently flexible water models. This means that a setting of 2 or 3 for SHAKESET in a calculation in explicit water will still constrain waters to be rigid, and therefore corresponds to a standard (and - for the supported water models - correct) simulation setup. Specifying this keyword and setting it to anything else but 1 will disable this override. Note that for water models possessing virtual sites (all four- and five-site models), it is assumed that the extra sites have no mass (see below). If this is not the case, the use of the analytical SETTLE algorithm for water is no longer possible, and the more complex set of constraints may no longer be solved efficiently (or may no longer be solved at all).
Note that the geometries are hard-coded, but a small set of alternative options is available. This is distinct from native support for different water models and is only effective if the residue names "T3P", "T4P", and "T5P" are in use (but not "T4E" or "SPC"), see keywords WATER3S_GEOM, WATER4S_GEOM, and WATER5S_GEOM.
This keyword allows the user to control how the actual values for the set of holonomic constraints are determined at the beginning of simulations in Cartesian space. There are currently the following options:
- Irrespective of structural input, all distance constraints of atoms bound covalently to one another are taken directly from the (hard-coded) CAMPARI default geometries that use, wherever possible, databases of high-resolution crystallographic structures of biomolecules (see for example the reference by Engh and Huber). For indirect angle constraints, i.e., constrained distances of atoms separated by two covalent bonds, the bond angle and two bond lengths in question are used to compute the effective length in similar fashion. For explicit angle constraints the reference value can be used directly. Simulations of unsupported residues require the structural input to be used directly (see comments on option 3 below for implied caveats). For many small molecules the logic is slightly different: they either inherit their geometric parameters from the parent values in polymers or they are set to be in compliance with specific computational models. The latter is particularly important for water molecules. Some control is offered through keywords WATER3S_GEOM, WATER4S_GEOM, and WATER5S_GEOM. This option is currently the default.
- Irrespective of structural input, CAMPARI will try to reconstruct the required constraint lengths from the minimum positions of bonded potentials (see SC_BONDED_B and SC_BONDED_A) that are provided by the force field in use. As for option 1, this extends to indirect and explicit angle constraints. If terms are missing in the force field (such as for rigid water models), covalent distances are taken from the default CAMPARI geometry as for option 1. This option is the recommended one for "standard" molecular dynamics calculations. Note that patches to bonded parameters are recognized and respected in this context.
- CAMPARI takes all reference values for constrained degrees of freedom directly from the structure the simulation is started in. This requires no adjustments but comes with caveats. Since input in pdb format is of limited precision, the various bond lengths and angles can only be extracted to the same precision. This means that constraints that are chemically identical will be set to slightly different values (e.g., the C-H bonds in a methyl group), which can cause small artifacts. For bond lengths involving hydrogen, rebuilding is an alternative to circumvent this problem (see PDB_HMODE). A second problem arises due to the lack of reproducibility caused by the limited precision. Specifically, simulations started from two different and minimized conformations of the system will end up using different values for the constraints. A more extreme version of this problem is encountered when starting simulations from snapshots of other simulations, in which the constrained degrees of freedom had been left free to move (this exceeds the mere precision effect). Due to these caveats, it is not recommended to use this option. Note that CAMPARI will have to use values defined by structural input for cases where no other information is available.
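For the indirect angle constraints mentioned in the options above, the constrained distance between two atoms separated by two covalent bonds follows from the two bond lengths and the enclosed bond angle via the law of cosines. The Python sketch below shows this calculation; the function name is an assumption, and the example geometry is an illustrative rigid-water-like input rather than a value read from CAMPARI.

```python
import math

def indirect_angle_constraint_length(bond_a, bond_b, angle_deg):
    """Distance between the outer atoms of a bond angle (law of cosines).

    bond_a, bond_b: the two bond lengths sharing the central atom.
    angle_deg: the bond angle at the central atom in degrees.
    """
    theta = math.radians(angle_deg)
    return math.sqrt(bond_a**2 + bond_b**2 - 2.0 * bond_a * bond_b * math.cos(theta))

# Right-angle check: legs of 3 and 4 give a distance of 5.
check = indirect_angle_constraint_length(3.0, 4.0, 90.0)
# Water-like geometry: two 0.9572 A bonds enclosing 104.52 degrees.
hh_distance = indirect_angle_constraint_length(0.9572, 0.9572, 104.52)
```

The same relation explains why limited-precision input (third option) propagates into slightly different constraint lengths for chemically identical sites.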
When restarting simulations, this keyword should generally be left unchanged. In case of option 3 being in use, it is recommended to either never supply an input pdb file or to always use the same one supplied as a template. For a simulation started normally, options 1 and 2 above entail the possibility that constrained degrees of freedom are adjusted before the simulation begins. This adjustment is reflected in the reference structure files written at the beginning of each run ({basename}_START.pdb, {basename}, and {basename}_START.mol2).
This keyword allows the user to choose which of the currently implemented algorithms CAMPARI should use to enforce the chosen set of holonomic constraints during a molecular dynamics simulation in Cartesian space. Options are as follows:
- The standard, iterative SHAKE procedure is used. Coupled constraints are solved iteratively by assuming independence and linearity (Newton's method). SHAKE may converge in very few steps to good accuracy if the coupling is weak (coupling matrix is sparse). This is the only method that currently supports explicit constraints on bond angles (see SHAKESET and SHAKEATOL). It is also the only method that allows parallelization of an individual constraint group across multiple threads if the shared memory (OpenMP) parallelization of CAMPARI is in use, and if the constraint group is otherwise expected to become a bottleneck. Due to the use of Newton's method, SHAKE is not guaranteed to converge if the underlying "landscape" is non-linear due to the introduction of coupling between constraints. Then convergence is only guaranteed within a small enough environment around the actual solution. Therefore, SHAKE places an upper limit on the time step that can be used even though it is meant to allow increases of precisely that time step. Nonetheless, in canonical applications (bond length constraints only), SHAKE will be a reasonably efficient solution, i.e., the desired tolerance can usually be reached within few steps. The main weakness of SHAKE and related algorithms is their inherent inability to enforce planarity at a given site. This is because at a planar site all bond vectors which form the basis set for the application of iterative corrections are part of the same plane, i.e., it is impossible to correct an out-of-plane motion using those vectors. Depending on the exact set of constraints used, SHAKE will require many steps, not converge at all or converge with limited accuracy, and occasionally crash if bond length and angle constraints at a site deem it to be perfectly planar.
- A mix of the SHAKE and P-SHAKE (see below) algorithms is used in which P-SHAKE is applied only to those constraint groups which are internally entirely rigid. Because P-SHAKE, at least as implemented, fails as a general purpose constraint algorithm, this option is practically obsolete. It can be enabled for testing purposes with the help of keyword UNSAFE.
- The so-called P-SHAKE (preconditioned SHAKE) procedure is used. In P-SHAKE, SHAKE is augmented by introducing a pre-conditioning step which changes the convergence rate from linear to quadratic. The preconditioning step is a matrix multiplication essentially forming linear combinations from the bond vectors in the constraint vectors. Corrections employed along those new directions minimize the linear error by decoupling the constraints (within the bounds of a linear theory → hence the quadratic and not instantaneous convergence). Unfortunately, this method currently is implemented either inefficiently or incorrectly and does not usually offer a discernible improvement. Moreover, it is also fundamentally flawed as large constraint groups are handled inefficiently due to the requirement of a full matrix multiplication that is needed to increment the coordinates at each iteration step. This operation in P-SHAKE has a cost of 3·np·nc and of only 6·nc in standard SHAKE. In addition, the matrix used to precondition the procedure has to be recalculated frequently if a molecule undergoes significant conformational changes (currently hard-coded to 100 integration steps). P-SHAKE is therefore suitable only for enforcing holonomic constraints in small rigid or quasi-rigid molecules that can be solved by SHAKE as well. Just like SHAKE, it fails badly for planar sites (see above). As a consequence of the above, P-SHAKE can be enabled for testing purposes only with the help of keyword UNSAFE (otherwise CAMPARI terminates). When using P-SHAKE, CAMPARI may crash without any indicative messages due to failures in the LAPACK routines used by the algorithm (see installation).
- The LINCS method is used. LINCS is a linear constraint solver that uses a projection approach. In the end, a matrix equation needs to be solved which requires the inversion of a matrix related to the coupling matrix of the constraints in the group. This is the critical step and grossly ineffective as a general procedure. For sparse matrices, however, the inversion can be performed approximately by a series expansion. It is the order of this expansion and its applicability that will determine the success and accuracy of LINCS. LINCS is generally inapplicable to anything involving bond angle constraints, in particular in all-atom representation. It will work well for loosely coupled groups of constraints. Since the accuracy depends on the unknown convergence properties of an infinite sum, the accuracy of LINCS cannot be tuned directly to yield a specific tolerance for satisfying the constraints. Instead, a combination of the expansion order and a number of corrective iterations controls the achieved discrepancy. To make LINCS more comparable to SHAKE in results, CAMPARI will dynamically adjust the former to achieve the target tolerance. Note that LINCS currently cannot split the workload for an individual constraint group across multiple threads if the shared memory (OpenMP) parallelization of CAMPARI is in use. This can be a performance limitation.
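The role of the expansion order in LINCS can be illustrated with a generic truncated series for a matrix inverse, (1 - A)^-1 ≈ 1 + A + A^2 + ... . The Python sketch below is not CAMPARI's implementation and uses an arbitrary small, weakly coupled matrix as a stand-in; it only shows how the truncation order controls the accuracy when the coupling is weak.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def truncated_series_inverse(a, order):
    """Approximate (1 - A)^-1 by 1 + A + A^2 + ... + A^order."""
    n = len(a)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    power = [row[:] for row in identity]
    for _ in range(order):
        power = mat_mul(power, a)
        result = [[result[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return result

# Two constraints with a coupling element of 0.5 (hypothetical values).
coupling = [[0.0, 0.5], [0.5, 0.0]]
low_order = truncated_series_inverse(coupling, 4)
high_order = truncated_series_inverse(coupling, 16)
```

For this matrix the exact inverse has diagonal elements of 4/3, and the truncated series approaches that value as the order grows. For strongly coupled constraints the series converges slowly or not at all, which mirrors the limitation described for bond angle constraints.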
There is an additional issue that arises when virtual sites (technically atoms with no mass) are used, for example in rigid water models like TIP4P. Such sites have to be circumvented by the integration scheme (displacement is dependent on inverse mass), and therefore they have to be exactly constrained with respect to the positions of atoms with finite mass. These constraints cannot be solved within the standard framework (also dependent on inverse mass). Instead, the least constraint solution is obtained by simply rebuilding the positions of these sites with fixed internal geometry. For this to yield a correct integrator, however, the forces acting on the sites need to be remapped to the atoms they are connected to. This is done by decomposing the Cartesian force acting on the site into internal forces, for which compensating terms are added to all the atoms comprising the respective internal degree of freedom. This cancels exactly the net force on the site, and makes integration symplectic. Virtual sites cannot occur in constraint groups that are handled by a method other than standard SHAKE or SETTLE.
Note that if the shared memory (OpenMP) parallelization of CAMPARI is in use, there are at most two levels of parallelization: across constraint groups and within a given constraint group. The second level is only available with standard SHAKE at the moment. The first level is balanced dynamically (see THREADS_DLB_FREQ and related keywords for general information on load balancing in CAMPARI).
If SHAKE or P-SHAKE are in use (→ SHAKEMETHOD), this keyword allows the user to set the target tolerance for satisfying distance constraints. The tolerance is relative to the target value of the constraint. As soon as the maximum deviation is less than this value, the iteration stops unless it is terminated earlier for other reasons (→ SHAKEMAXITER).
If LINCS is in use, this keyword still has meaning even though the tolerance cannot be set explicitly. Should CAMPARI find that LINCS with the given settings satisfies the constraints significantly worse than defined by this keyword, it will adjust one of the open parameters of the method (→ LINCSORDER) in an attempt to remedy this situation. Similarly, should the opposite occur (LINCS satisfies constraints significantly more accurately than the desired tolerance), the parameter will be adjusted in the opposite direction. All this happens within sane bounds (2-16).
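The role of a relative distance tolerance and an iteration cap can be illustrated with a minimal, textbook-style SHAKE loop. This is only a sketch and not CAMPARI's implementation: the function name `shake`, its arguments, and the example geometry are assumptions made for illustration, and angular constraints are not covered.

```python
import math

def shake(pos, old_pos, masses, constraints, tol=1e-6, max_iter=500):
    """Textbook SHAKE for distance constraints (illustrative sketch only).

    pos, old_pos: lists of [x, y, z] (unconstrained new and constrained old positions)
    constraints:  list of (i, j, target_length)
    tol:          relative tolerance on each constrained distance
    Returns (sweeps_used, converged); `pos` is corrected in place.
    """
    for sweep in range(1, max_iter + 1):
        max_dev = 0.0
        for i, j, d0 in constraints:
            r = [pos[i][k] - pos[j][k] for k in range(3)]
            r2 = sum(c * c for c in r)
            dev = abs(math.sqrt(r2) - d0) / d0  # deviation relative to the target
            max_dev = max(max_dev, dev)
            if dev <= tol:
                continue
            # Correct along the old bond vector, weighted by inverse masses.
            r_old = [old_pos[i][k] - old_pos[j][k] for k in range(3)]
            dot = sum(a * b for a, b in zip(r, r_old))
            g = (d0 * d0 - r2) / (2.0 * dot * (1.0 / masses[i] + 1.0 / masses[j]))
            for k in range(3):
                pos[i][k] += g * r_old[k] / masses[i]
                pos[j][k] -= g * r_old[k] / masses[j]
        if max_dev <= tol:  # a full sweep needed no correction
            return sweep, True
    return max_iter, False  # iteration cap reached; constraints may be violated

# Hypothetical rigid three-site molecule, slightly distorted by an integration step.
old = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
new = [[0.02, -0.01, 0.0], [1.05, 0.03, 0.01], [-0.02, 1.04, -0.01]]
cons = [(0, 1, 1.0), (0, 2, 1.0), (1, 2, math.sqrt(2.0))]
sweeps, ok = shake(new, old, [16.0, 1.0, 1.0], cons)
print(sweeps, ok)
```

The inverse masses in the correction also show why massless virtual sites cannot be treated within this framework.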
If SHAKE (→ SHAKEMETHOD) is in use with explicit bond angle constraints (→ SHAKESET), this keyword allows the user to set the target tolerance for satisfying angular constraints. The tolerance is absolute and applies to the unitless cosine of the respective angle. As soon as both maximum deviations drop below the threshold tolerances (see also SHAKETOL) the iteration stops unless it is terminated earlier for other reasons (→ SHAKEMAXITER).
SHAKEMAXITER
If SHAKE or P-SHAKE are in use (→ SHAKEMETHOD), this keyword allows the user to alter the maximum number of iterations permissible to the algorithm. Since poor convergence properties are generally indicative of a more fundamental problem, increasing the value for SHAKEMAXITER will rarely be useful. After exceeding this many steps, the algorithm will simply continue with its current solution, meaning that - for a good case - constraints will be violated slightly more than specified by SHAKETOL and, where applicable, SHAKEATOL. Note that CAMPARI will then adjust the constraint targets in an attempt to rescue a simulation otherwise doomed. This may not always work and may also lead to unwanted drift. Appropriate warnings are provided.
LINCSORDER
If LINCS is in use (→ SHAKEMETHOD), this keyword allows the user to define the initial expansion order for the approximate matrix inversion technique. As mentioned above, the convergence properties of this approximation are not really known and prevent LINCS from satisfying an exact tolerance explicitly. In particular, for a fixed number of corrective iterations, convergence does not improve strongly for comparatively large changes in expansion order. Thus, CAMPARI adjusts the expansion order dynamically if it finds that constraints are satisfied significantly better or worse than the desired tolerance provided through SHAKETOL. The allowed range is from 2 to 16, and relative tolerances below 10⁻⁴ will generally require a setting of 2 or larger for keyword LINCSITER. Very small tolerances are feasible and meaningful in CAMPARI since the representation is entirely in 64-bit floating point precision. Warnings are produced if the tolerance is missed, and the average expansion order across all LINCS constraint groups is reported at the end. If this value is large (in particular, if it is close to 16), it is strongly recommended to increase LINCSITER.
LINCSITER
If LINCS is in use (→ SHAKEMETHOD), this keyword allows the user to define the number of iterations for correcting rotational lengthening. Typically, LINCS assumes only one such correction, but the matrix expansion cannot become arbitrarily precise with a single iteration. Thus, this keyword, which defaults to 2 in CAMPARI, can be used to alter the minimum tolerance achievable by LINCS. Since CAMPARI operates entirely in 64-bit floating point precision, meaningful tolerances can be chosen that are comfortably solvable by SHAKE in few iterations yet for which a LINCSITER of 1 is insufficient.
CAMPARI will not vary the number of iterations throughout the run, but it will vary the expansion order per constraint group to achieve the desired tolerance. It thus may be computationally less efficient to set LINCSITER to 1 (depending on tolerance, system, integration time step, etc.) as it requires a larger expansion order.
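The dependence of LINCS on an expansion order can be illustrated with the generic truncated series for a matrix inverse, (I - A)⁻¹ ≈ I + A + A² + ... The sketch below is not CAMPARI code: the 3×3 coupling matrix is a hypothetical example of a loosely coupled chain of constraints, and the helper names are made up.

```python
def mat_mul(a, b):
    """Plain dense matrix product for small square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def series_inverse(a, order):
    """Approximate (I - A)^-1 by I + A + A^2 + ... + A^order."""
    n = len(a)
    total = identity(n)
    term = identity(n)
    for _ in range(order):
        term = mat_mul(term, a)
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total

def residual(a, approx):
    """Largest absolute element of (I - A)·approx - I (zero for an exact inverse)."""
    n = len(a)
    eye = identity(n)
    i_minus_a = [[eye[i][j] - a[i][j] for j in range(n)] for i in range(n)]
    prod = mat_mul(i_minus_a, approx)
    return max(abs(prod[i][j] - eye[i][j]) for i in range(n) for j in range(n))

# Hypothetical coupling matrix: each constraint couples only to its chain neighbors.
A = [[0.0, 0.3, 0.0],
     [0.3, 0.0, 0.3],
     [0.0, 0.3, 0.0]]
for order in (2, 4, 8, 16):
    print(order, residual(A, series_inverse(A, order)))
```

The residual shrinks with the order only as fast as the powers of the coupling matrix decay, which is why strongly coupled groups (for example those with bond angle constraints) are problematic for this type of expansion.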
If a minimization run is performed, this keyword lets the user select the method of choice. CAMPARI currently supports three canonical and one nonstandard minimizer. All minimizers can operate either in mixed rigid-body/torsional space, i.e., the "native" CAMPARI degrees of freedom, or in Cartesian space (→ CARTINT). However, for algorithmic reasons, the canonical minimizers (options 1-3 below) only support trivial constraints (see FMCSC_FRZFILE), which is an issue in Cartesian space (rigid water models, etc).
Let us define γ as a vector of base increment sizes suitable for each of the degrees of freedom (partitioned into three classes: rigid-body translation, rigid-body rotation, and dihedral angles; keywords MINI_XYZ_STEPSIZE, MINI_ROT_STEPSIZE, and MINI_INT_STEPSIZE are used to specify each element $\gamma_i$). Also, let $f_m$ be an outside scaling factor in units of mol/kcal set by keyword MINI_STEPSIZE. Lastly, we introduce a unitless dynamic step length factor λ. If we now denote the heterogeneous vector of phase space coordinates as x, and the Hamiltonian is written as U(x), then we can write how the system is evolved through one of four different protocols as follows:
- Steepest-descent:
$x_{i+1} = x_i - \lambda \, f_m \, \gamma \bullet \nabla U(x_i)$
Here, "•" denotes the Hadamard (Schur) product, i.e., simply the element-by-element multiplication. Should the new conformation have overstepped in the direction of steepest descent, λ is iteratively reduced by a constant factor until a valid step is found (lower energy). In case of successful steps, λ is iteratively increased to improve the efficiency of the procedure if the underlying landscape is relatively smooth and flat. Successful steps are used as well to construct an appropriate guess for the initial step size should a complete reset be necessary. This mimics a line search.
- Conjugate-gradient:
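$x_{i+1} = x_i - \lambda \, f_m \left[ \gamma \bullet \nabla U(x_i) + f_{CG,i} \, d_{i-1} \right]$
$f_{CG,i} = \left[ \nabla U(x_i) \cdot \nabla U(x_i) \right] / \left[ \nabla U(x_{i-1}) \cdot \nabla U(x_{i-1}) \right]$
$d_{i-1} = \gamma \bullet \nabla U(x_{i-1}) + f_{CG,i-1} \, d_{i-2}$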
This conjugate-gradient method follows the Polak-Ribiere scheme and augments the steepest-descent prediction by an additional term that is estimated according to the suggestion by Fletcher and Reeves. Much like in steepest-descent, should the new conformation have overstepped, λ is iteratively reduced by a constant factor until a valid step is found (lower energy). In case of successful steps, λ is iteratively increased analogously to what is described above.
- Memory-efficient Broyden-Fletcher-Goldfarb-Shanno method (L-BFGS) according to Nocedal (reference):
$x_{i+1} = x_i - \lambda \left[ H^{-1} \cdot \left( \gamma \bullet \nabla U(x_i) \right) \right]$
This quasi-Newton approach technically employs the inverse of the Hessian which is typically unknown. However, the L-BFGS method constructs a numerical estimate directly for the matrix product $H^{-1} \cdot (\gamma \bullet \nabla U(x_i))$ from the recent history of the minimization process. This widely used recursive two-loop scheme has the advantage of i) only requiring very few floating point operations, and ii) not requiring a running guess for the complete Hessian (inverse or not) due to the recursive formulation. Note that the inverse Hessian in our implementation is constructed from $\gamma \bullet \nabla U(x_i)$, i.e., has units of mol/kcal throughout, irrespective of which degree of freedom is considered. This means that the factor $f_m$ does not show up in the L-BFGS equation except for the first step (initially or after a reset) when the steepest-descent approximation is used (see mode 1). The usage of (estimated) second derivative information should generally help inform the minimizer of more useful directions to pursue but step size limitations and inadequate guesses of the Hessian may render this potential benefit ineffectual. The reader is referred to the literature for further details.
- Thermal noise
quasi-stochastic (akin to simulated, thermal annealing):
This minimizer couples the system to a variable temperature bath. By changing the coupling parameters, the degrees of freedom are successively brought to a state consistent with a very low temperature ensemble. A similar quench in conditions is used in simulated annealing, a general solution strategy for optimization problems.
Initially, the system uses a heat bath as defined by the settings for TSTAT and TEMP. The system is then evolved using NVT molecular dynamics in either mixed rigid-body/torsional space or Cartesian space. Depending on initial conditions, this may heat up the system to a variable extent, and the maximum temperature is recorded. After a prescribed fraction of the total simulation steps, the target temperature is successively lowered to the value specified by keyword MINI_SC_TBATH. This interpolation uses a Gaussian function on the normalized time axis such that all interpolation curves can be rescaled in temperature to exactly coincide. Simultaneously, the algorithm measures the rate of change of temperature from the recorded maximum toward MINI_SC_TBATH. If the actual rate appears too slow or too fast, the time constant, τT, of the thermostat in use (→ TSTAT_TAU) is successively altered so as to achieve a cooling down of the system to a negligible temperature within the remaining number of available iterations. These alterations happen within bounds of 10 times the integration time step on the low end and the original setting for TSTAT_TAU on the high end.
This minimization approach employs two convergence criteria as soon as the number of steps specified via MINI_SC_HEAT has passed. During the cooling schedule, the procedure will stop either because the RMS-gradient fell below the threshold (→ MINI_GRMS) or because the target temperature (MINI_SC_TBATH) was reached which - per se - does not provide information on the local gradient. Of course, it may be possible to minimize such a structure further using a canonical approach. Both temperature and RMS-gradient are written to log-output to allow for easy inspection whether the parameters are set reasonably well. As an additional note it must be pointed out that - much like in standard molecular dynamics - runs starting from very unfavorable structures will cause large accelerations which may lead to a catastrophic blow-up of the system. This behavior can be avoided by performing a number of steepest descent minimization moves upfront. This number is set by keyword MINI_SC_SDSTEPS.
In general a minimization run will terminate after either the maximum number of iterations has passed (see NRSTEPS) or after convergence is achieved (see MINI_GRMS). Note that bad combinations of the various step sizes and the convergence criterion can easily lead to non-terminating runs even if convergence is achieved de facto.
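The interplay of the base step sizes γ, the global factor set by MINI_STEPSIZE, the dynamic factor λ, and the normalized RMS gradient criterion can be sketched for the steepest-descent case as follows. This is a hedged illustration of the update rule written above, not CAMPARI's code: the shrink and growth factors, the function name, and the test function are assumptions.

```python
import math

def steepest_descent(energy, gradient, x, gamma, f_m=0.1, grms_tol=1e-2,
                     max_evals=10000, shrink=0.5, grow=1.2):
    """Steepest descent with a dynamic step length factor (illustrative sketch).

    gamma:     base step size per degree of freedom (Hadamard-multiplied onto the gradient)
    f_m:       global scale factor (the role played by MINI_STEPSIZE)
    grms_tol:  threshold on the RMS of the normalized gradient (the role of MINI_GRMS)
    max_evals: cap on energy evaluations for trial steps
    Returns (x, energy, converged).
    """
    lam = 1.0
    e = energy(x)
    evals = 0
    while evals < max_evals:
        g = gradient(x)
        scaled = [gi * ci for gi, ci in zip(gamma, g)]  # gamma • grad U
        grms = math.sqrt(sum(s * s for s in scaled) / len(scaled))
        if grms < grms_tol:
            return x, e, True
        while evals < max_evals:
            trial = [xi - lam * f_m * si for xi, si in zip(x, scaled)]
            e_trial = energy(trial)
            evals += 1
            if e_trial < e:       # valid step: accept and try a longer step next time
                x, e = trial, e_trial
                lam *= grow
                break
            lam *= shrink         # overstepped: reduce the step length and retry
    return x, e, False

# Hypothetical two-dimensional test surface with its minimum at (1, -2).
def u(p):
    return (p[0] - 1.0) ** 2 + 2.0 * (p[1] + 2.0) ** 2

def grad_u(p):
    return [2.0 * (p[0] - 1.0), 4.0 * (p[1] + 2.0)]

x_min, e_min, converged = steepest_descent(u, grad_u, [0.0, 0.0], [1.0, 1.0])
print(x_min, e_min, converged)
```

Because rejected trial steps also cost an energy evaluation, the number of evaluations exceeds the number of successful steps, and an overly strict `grms_tol` relative to the step sizes can exhaust the evaluation cap without formal convergence.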
In general, minimizations are unlikely to be interesting for on-the-fly analysis. This is because the conformations encountered do not correspond to a meaningful ensemble: neither in terms of coverage nor in terms of relative weights. Nevertheless, all analysis routines are supported and will work assuming that a single step corresponds to a single successful perturbation during minimization (due to overstepping, the number of energy/gradient evaluations in minimization is usually larger than the actual number of steps: keyword NRSTEPS sets the maximum for the former).
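For readers unfamiliar with the recursive two-loop scheme mentioned for the L-BFGS option, the following is a generic sketch following the standard formulation by Nocedal. It is not taken from CAMPARI: the function name and the convention that histories are stored oldest first are assumptions. With an empty history it returns the input gradient, i.e., the steepest-descent direction.

```python
def two_loop_direction(grad, s_hist, y_hist):
    """Estimate H^-1·grad from recent history without storing any matrix.

    s_hist: past coordinate changes s_k = x_{k+1} - x_k (oldest first)
    y_hist: past gradient changes   y_k = g_{k+1} - g_k (oldest first)
    """
    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    q = list(grad)
    stored = []  # (alpha, rho) pairs, newest first
    for s, y in zip(reversed(s_hist), reversed(y_hist)):
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        stored.append((alpha, rho))
        q = [qi - alpha * yi for qi, yi in zip(q, y)]

    # Initial inverse-Hessian guess: a scaled identity from the newest pair.
    if s_hist:
        scale = dot(s_hist[-1], y_hist[-1]) / dot(y_hist[-1], y_hist[-1])
    else:
        scale = 1.0
    r = [scale * qi for qi in q]

    for (alpha, rho), s, y in zip(reversed(stored), s_hist, y_hist):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return r

# Hypothetical history from a quadratic surface with Hessian diag(2, 8), so y = H·s.
s_hist = [[1.0, 0.0], [0.5, 1.0]]
y_hist = [[2.0, 0.0], [1.0, 8.0]]
print(two_loop_direction(y_hist[-1], s_hist, y_hist))  # reproduces the newest step
```

The memory length (→ MINI_MEMORY) corresponds to the number of stored pairs, and a reset amounts to emptying both histories.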
If a canonical minimization run is performed, this keyword acts as a scale factor applied to all conformational increments applied during minimization. It therefore sets the global step size and corresponds to factor $f_m$ in the equations above. It - for technical reasons - has units of mol/kcal to eliminate the energy units of the normalized gradients. There are no canonical rules one can formulate but values significantly less than unity will typically be most appropriate to avoid that the algorithm frequently oversteps in a subset of the degrees of freedom and then has to iteratively reduce the step size. However, step size management is dynamic (consult factor λ introduced in the equations for minimization modes 1-2(3) above). This means that the impact this keyword has may be less than what one would generally expect.
MINI_GRMS
If a minimization run is performed, this keyword allows the user to set the convergence criterion in units of kcal/mol. Since minimization runs can occur in torsional and rigid-body space, the "raw" gradient over all degrees of freedom is unsuitable. CAMPARI utilizes a simple workaround by normalizing all gradients by a basic step size for the respective types of degrees of freedom (see keywords MINI_XYZ_STEPSIZE, MINI_ROT_STEPSIZE, and MINI_INT_STEPSIZE). The resultant, normalized gradient is used to obtain its root mean square (→ GRMS) which is compared to the convergence criterion provided here. Since the normalized gradients assume a default step size, this parameter becomes dependent on them. For unit values for all three base step sizes, values around 10⁻² are recommended. Conversely, in Cartesian space, only MINI_XYZ_STEPSIZE is relevant for the gradient criterion.
MINI_XYZ_STEPSIZE
If a minimization run is performed, this keyword determines a basic step size to be considered for all rigid-body translations of molecules and for all Cartesian displacements of atoms. This value is to be provided in units of Å. Note that this keyword determines the effective initial translation step size in conjunction with MINI_STEPSIZE and that it is mostly needed to be able to handle the different units occurring when minimizing in mixed rigid-body and torsional space. All translational gradients are normalized by this number such that numerical estimates of the Hessian (→ BFGS) or even a meaningful root mean square can be written (→ MINI_GRMS). Note that for simulations in (effective) Cartesian space, it would be possible to combine this parameter with MINI_STEPSIZE to a single step size parameter.
MINI_ROT_STEPSIZE
If a minimization run in mixed rigid-body and torsional space is performed, this keyword determines a basic step size to be considered for all rigid-body rotations. This value is to be provided in units of degrees (compare MINI_XYZ_STEPSIZE).
MINI_INT_STEPSIZE
If a minimization run in mixed rigid-body and torsional space is performed, this keyword determines a basic step size to be considered for all dihedral angles. This value is to be provided in units of degrees (compare MINI_XYZ_STEPSIZE).
MINI_UPTOL
If a minimization run is performed, and if the BFGS method is used, this keyword lets the user choose a tolerance criterion in kcal/mol for accepting uphill steps. At most ten or MINI_MEMORY (whichever one is smaller) such steps will be tolerated until a reset of the estimate of the Hessian occurs. This reset will reorient the (multidimensional) direction back onto a steepest descent path and the procedure can start anew. This feature is included since the curvature-based estimate of the direction in the BFGS method does not always guarantee a downhill direction, i.e., the energy resulting from a perturbation in such a direction can be larger than the current one for all steps within a finite interval, including arbitrarily small ones (this is a different problem from "overstepping", for which step size reductions are employed).
MINI_MEMORY
If a minimization run is performed, and if the BFGS method is used, this keyword lets the user choose the memory length for the running estimate of the Hessian. Since the system will evolve throughout the minimization, the estimate of the Hessian is of course a moving target and it will only be useful to include points from the immediate vicinity in its numerical, gradient-based estimate. This keyword simply gives the (integer) number of immediately preceding steps to consider. Note that very large values will typically be irrelevant since the BFGS procedure will - in rough landscapes - frequently propose an ill-fated (uphill) direction (see MINI_UPTOL for comparison). Such moves will eventually lead to a reset of the estimate of the Hessian which includes "forgetting" all the memory. Hence, the effective usable memory length will be limited by the system as well. Note that the resets are necessary for the BFGS method to find any minima.
MINI_SC_SDSTEPS
If a stochastic minimization run is performed, this keyword allows the user to request the program to first run the specified number of steps as canonical steepest-descent (SD) minimization. These SD moves will follow the same parameter settings as described above and are completely independent of the stochastic steps. Note that these steps are always skipped if the settings request the use of holonomic constraints when minimizing in Cartesian space.
MINI_SC_HEAT
If a stochastic minimization run is performed, this keyword specifies the fraction of the total number of steps (NRSTEPS) that are going to be used to perform NVT dynamics at the user-supplied initial temperature and thermostat settings. Generally, for an efficient annealing protocol, it is probably advisable to combine a large value for this keyword with a high enough temperature and/or a comparatively large value for the thermostat's time constant, τT, such that NVE dynamics are mimicked over short periods of time (this will lead to heating in itself). Conversely, for straight minimization, it will be more appropriate to supply small values in conjunction with tight thermostat settings and low initial temperature.
MINI_SC_TBATH
If a stochastic minimization run is performed, this keyword lets the user specify the target temperature of the bath the system will be coupled to at the very end of the run. From the simulation step defined by MINI_SC_HEAT onward, the target temperature is interpolated between TEMP and MINI_SC_TBATH using a Gaussian function operating on a normalized time axis. For the protocol to work as intended, it will not be useful to specify anything but values close to (but not exactly) zero here.
Move Set Controls (MC):
Preamble (this is not a keyword)
A Monte Carlo simulation is a series of biased or unbiased random perturbation attempts to the system, in which some moves will be accepted (the Markov chain transitions to a new microstate) and the others rejected (the Markov chain remains in place) dependent on some criterion. This acceptance criterion is designed to sample a specific distribution, and the most common example is the Metropolis criterion designed to produce Boltzmann-distributed ensembles.
The types of random perturbation attempts that are possible constitute the move set, and the resultant microstate transitions are usually very different from those observed in molecular dynamics (MD). In dynamics, all unconstrained degrees of freedom evolve simultaneously (high correlation), but in small increments (low effective step size). In Monte Carlo, one or few degrees of freedom evolve at a given time, but in step sizes of varying amplitudes. It is not required that individual degrees of freedom are all sampled with equal weight (nor would it be clear how to establish this). The effective sampling weight is determined by three components:
- The overall picking frequencies for move types (e.g., OTHERFREQ) are implemented by CAMPARI through a binary decision tree invoked at each step of the MC simulation. This means that the decisions taken at the root will influence the actual number of attempted moves of types chosen further up the tree, and that it may be complicated to calculate the expected numbers of attempts for those moves. This is why formulas are provided. Some totals (attempted and accepted moves) are reported in the log output at the end.
- The organizational unit for a move is often a residue, but not all residues may possess equal numbers of degrees of freedom. For instance, sidechain moves have a variable number of degrees of freedom they sample (→ NRCHI), but the actual numbers per degree of freedom will not be uniformly distributed since different residues may have different numbers of χ-angles.
- Sampling weights can be adjusted explicitly with the help of the preferential sampling utility.
Because elementary Monte Carlo moves only change a few degrees of freedom at a time, the algorithms should be (and usually are) smart enough to only consider an incremental energy change associated with a move. The energy complexity of moves differs by type (see reference for details). The technical complexity with regards to applying the random perturbation also differs by type and sometimes behaves antagonistically to energy complexity. Taken together, these characteristics mean that it is challenging to parallelize a Monte Carlo sampler efficiently. CAMPARI generally tries to parallelize both coordinate operations and incremental energy calculations and strives to achieve load balance by explicitly estimating and then splitting the overall load for each significant task. This involves specialized routines for special move types. Because of the heterogeneity of MC move sets (and the controls offered over them), it is recommended to always perform a quick scaling check when using CAMPARI's shared memory (OpenMP) parallelization.
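The accept/reject logic described in this preamble can be sketched with the Metropolis criterion applied to a single degree of freedom. This is a generic illustration, not CAMPARI code: the harmonic energy function, the function names, and the step size are made-up examples, and only the incremental energy change of the perturbed coordinate is evaluated.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol·K)

def metropolis_accept(delta_u, temperature, rng=random):
    """Accept downhill moves always and uphill moves with probability exp(-ΔU/kT)."""
    if delta_u <= 0.0:
        return True
    return rng.random() < math.exp(-delta_u / (K_B * temperature))

def sample_harmonic(n_steps, step_size, temperature, seed=1):
    """Metropolis sampling of U(x) = 0.5·x² (kcal/mol) with stepwise perturbations."""
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(n_steps):
        trial = x + rng.uniform(-step_size, step_size)
        delta_u = 0.5 * trial * trial - 0.5 * x * x  # incremental energy change
        if metropolis_accept(delta_u, temperature, rng):
            x = trial          # the chain moves to the new microstate
            accepted += 1
        # otherwise the chain remains in place and the old state is counted again
    return x, accepted

x_final, n_accepted = sample_harmonic(10000, 0.5, 300.0)
print(x_final, n_accepted / 10000.0)
```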
This keyword is relevant only when ENSEMBLE is set to either 5 or 6, i.e., those ensembles which allow numbers of particles to fluctuate. In this case, the keyword defines the fraction of all moves that attempt to sample the particle number dimension of the thermodynamic state of the system. For the semi-grand ensemble, this corresponds to attempting to transmute one particle type into another while preserving the position of the target particle. For the grand ensemble, it will with 50% probability try to insert a particle of permissible type in a random location in the simulation container and with 50% probability attempt to delete a permissible particle. These moves are applied at the molecule level and most closely related to rigid-body moves in terms of complexity (→ RIGIDFREQ).
Technically, the GC ensemble is supported in CAMPARI by maintaining a set of ghost particles for each fluctuating type which work as "stand-ins". This framework entails certain limitations which are detailed elsewhere.
Expected numbers of such moves overall are calculated trivially as:
NRSTEPS · PARTICLEFLUCFREQ
Note that the default picking probabilities are such that every molecule type allowed to fluctuate in numbers receives equal weight. In case of particle permutation moves, which are implemented as joint insertion/deletion, there is no way to adjust these. This is because the implementation mandates the molecule types to be different, which would require additional corrections in the acceptance probability, which would cancel out the preferential sampling weights. For independent insertion and deletion available in the grand ensemble, the preferential sampling utility allows the user to at least adjust the picking probabilities on a per-type basis. This can be relevant for example in electrolyte mixtures with disparate target concentrations (and correspondingly disparate bath particle numbers), for which it would make sense to preferentially insert and delete those particle types with overall larger numbers. Such an adjustment would also bring the sampling weights in line with the default picking probabilities for rigid-body moves, which are flat on a per-molecule basis.
This keyword specifies what fraction of all remaining moves (i.e., 1.0 - PARTICLEFLUCFREQ) is to perturb rigid-body degrees of freedom. This encompasses translations and rotations of individual molecules as well as of groups of molecules. The default picking probabilities are even for all molecules regardless of type, size, or other properties. They can be adjusted via the preferential sampling utility, and this may be relevant in dense or semi-dilute systems with different molecule types of vastly different size (e.g., proteins and inorganic ions). In such a case, the acceptance rates for the macromolecules will be noticeably smaller, and this could be compensated for by sampling them preferentially.
While translations are generally simpler than rotations, they encounter difficulties when systems with nonperiodic boundaries are used in conjunction with large amplitudes of displacement. Relevant auxiliary keywords are RIGIDRDFREQ, TRANSSTEPSZ, and RIGIDRDBUF. Rotations are particularly tricky when more than one molecule is considered. The relevant keywords for rigid rotation moves are RIGIDRDFREQ, ROTSTEPSZ, CLURBROTFREQ, and CLURBSPINFREQ. Finally, single molecule rigid-body moves can be coupled (→ COUPLERIGID). In case the moves are not coupled, keyword ROTFREQ sets the fraction of rotation attempts. Some more implementation details are also given there.
This keyword is a simple logical deciding whether or not to couple translational and rotational rigid-body moves for single molecules (see CLURBFREQ for multi-molecule moves). Like any type of move coupling, this means that up to six independent perturbations of individual degrees of freedom are employed (translation in x, y, and z, rotation around three axes) before energies and the acceptance criterion are evaluated. Note that molecules with no rotational degrees of freedom, while still enjoying the full sampling weight in terms of picking probabilities, will have their moves downgraded and counted as pure translation moves in the log-output.
ROTFREQ
This keyword can be used to set the sub-frequency for purely rotational moves if uncoupled moves are used (→ COUPLERIGID is false). It will then determine the fraction of those rotational moves. Total number:
NRSTEPS · (1.0-PARTICLEFLUCFREQ) · RIGIDFREQ · (1.0-CLURBFREQ) · ROTFREQ
And the total number of purely translational moves will be:
NRSTEPS · (1.0-PARTICLEFLUCFREQ) · RIGIDFREQ · (1.0-CLURBFREQ) · (1.0-ROTFREQ)
Note that the above formulas do not account for the choice between randomizing and stepwise perturbations (→ RIGIDRDFREQ), which would introduce an additional factor into the above product (this is always a terminal decision).
Rotation moves for single molecules are implemented as follows. For a free molecule, three rotation axes are identified as the eigenvectors of the molecule's gyration tensor. For each axis, a random angle is picked with uniform probability, either globally or from a finite interval centered at 0.0 (see RIGIDRDFREQ). The composite rotation is constructed from the individual values and applied to the molecule. The composite rotation corresponds to a particular order of applying these rotations individually (note that rotations in 3D do not commute), which is arbitrary but fixed. The composite rotation axis passes through the molecule's geometric center (if the calculation is a pure MC run) or through its center of mass (if the calculation is a hybrid MC/dynamics run). For a molecule with custom constraints on individual rotational axes (a, b, c), the implementation is adjusted so that it uses the cardinal axes as rotation axes. If this were not the case, the constraints would be largely meaningless.
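As an illustration of a composite rotation applied in a fixed order about axes passing through the geometric center, consider the following sketch. To stay self-contained it uses the cardinal axes (the variant described above for molecules with axis constraints) rather than the eigenvectors of the gyration tensor. Function names, the x-y-z order, and the coordinates are assumptions and not CAMPARI's conventions.

```python
import math
import random

def random_angles(step_deg=None, rng=random):
    """Three angles in degrees: fully random if step_deg is None, else within ±step_deg."""
    if step_deg is None:
        return [rng.uniform(-180.0, 180.0) for _ in range(3)]
    return [rng.uniform(-step_deg, step_deg) for _ in range(3)]

def rotate_about_center(coords, angles_deg):
    """Rotate about the x, y, and z axes (in this fixed order) through the geometric center."""
    n = len(coords)
    center = [sum(c[k] for c in coords) / n for k in range(3)]
    ax, ay, az = (math.radians(a) for a in angles_deg)
    rotated = []
    for c in coords:
        x, y, z = (c[k] - center[k] for k in range(3))
        # rotation about x
        y, z = y * math.cos(ax) - z * math.sin(ax), y * math.sin(ax) + z * math.cos(ax)
        # rotation about y
        x, z = x * math.cos(ay) + z * math.sin(ay), -x * math.sin(ay) + z * math.cos(ay)
        # rotation about z
        x, y = x * math.cos(az) - y * math.sin(az), x * math.sin(az) + y * math.cos(az)
        rotated.append([x + center[0], y + center[1], z + center[2]])
    return rotated

# Hypothetical three-atom molecule and a stepwise move with a 10 degree maximum.
molecule = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
moved = rotate_about_center(molecule, random_angles(10.0, random.Random(5)))
print(moved)
```

Because the individual rotations do not commute, changing the order of the three blocks in the loop generally produces a different final orientation for the same three angles.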
This keyword sets a terminal choice in the selection tree that is common to many of the moves in CAMPARI (see similar keywords PIVOTRDFREQ, NUCRDFREQ, and so on). Amongst the available rigid-body moves (it applies to three separate branches: coupled single-molecule moves, decoupled multiple-molecule moves, and decoupled single-molecule moves), this keyword chooses the fraction to completely randomize the underlying degrees of freedom. For example, the complete randomization of translational degrees of freedom would displace the molecule's reference center to an arbitrary point in the simulation container (with some tolerance for aperiodic boundaries, see below). The remaining fraction will correspond to stepwise perturbations in which a usually small random increment is added to the degrees of freedom in question. For example, such a move would displace a molecule's reference center by a random vector small in absolute magnitude.
As an example consider single-molecule translation moves. The total number of expected randomizing translation moves would be (assuming COUPLERIGID is false):
NRSTEPS · (1.0-PARTICLEFLUCFREQ) · RIGIDFREQ · (1.0-CLURBFREQ) · (1.0-ROTFREQ) · RIGIDRDFREQ
And the number of stepwise translation moves would be:
NRSTEPS · (1.0-PARTICLEFLUCFREQ) · RIGIDFREQ · (1.0-CLURBFREQ) · (1.0-ROTFREQ) · (1.0-RIGIDRDFREQ)
The same modifications apply to any other branch of rigid-body move as explained above. As an additional complication, the decision about randomization vs. stepwise perturbations is decoupled itself in coupled rigid-body moves. Also note that the log output does not distinguish between the stepwise and randomizing varieties for any move type.
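The products of picking frequencies for the rigid-body branch can be evaluated with a few lines of code. The sketch below assumes uncoupled single-molecule moves (COUPLERIGID is false) and simply multiplies the fractions along each path of the decision tree as described in the text. The function and the dictionary keys are hypothetical and not part of CAMPARI, and the numeric settings are arbitrary examples.

```python
def expected_rigid_moves(nrsteps, particleflucfreq, rigidfreq, clurbfreq,
                         rotfreq, rigidrdfreq):
    """Expected numbers of attempted rigid-body moves (uncoupled single-molecule moves)."""
    rigid = nrsteps * (1.0 - particleflucfreq) * rigidfreq
    cluster = rigid * clurbfreq              # multi-molecule moves
    single = rigid * (1.0 - clurbfreq)       # single-molecule moves
    rotation = single * rotfreq
    translation = single * (1.0 - rotfreq)
    return {
        "particle_fluctuation": nrsteps * particleflucfreq,
        "rigid_total": rigid,
        "cluster": cluster,
        "single_rotation": rotation,
        "single_translation": translation,
        # terminal decision between randomizing and stepwise perturbations
        "single_translation_randomizing": translation * rigidrdfreq,
        "single_translation_stepwise": translation * (1.0 - rigidrdfreq),
    }

counts = expected_rigid_moves(nrsteps=1000000, particleflucfreq=0.0, rigidfreq=0.2,
                              clurbfreq=0.1, rotfreq=0.5, rigidrdfreq=0.25)
for name, value in counts.items():
    print(name, round(value))
```

These are expectation values only. The actual numbers of attempts fluctuate from run to run, and the log output does not separate stepwise from randomizing moves.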
Randomizing rigid-body translations have one peculiarity. Unless a spatial dimension is periodic (see BOUNDARY), the absolute coordinates in this dimension have no strict bounds, which means that a fully random prior would have to extend to infinity. If the restraining force of the present interacting boundaries is not infinitely strong, it therefore gives rise to a consistent bias if particles are regularly placed randomly in a volume confined exactly to the formal definition of system size (the regions extending beyond this formal size become undersampled). This bias decreases with increasing restraining force. It is also less noticeable whenever the boundary potential does not act on just a single atom in the molecule. It acts on just a single atom for monoatomic "molecules" with an atom-based boundary condition and for single-residue molecules with a residue-based boundary condition (see BOUNDARY). In other cases, the bias is less apparent because the translation is applied to the molecule's geometric center, which in general is more likely to reside in the formal volume than the outermost atoms/residues. The same reason causes cluster rigid-body moves to be affected less as well. To avoid this type of bias, keyword RIGIDRDBUF can be used.
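The effect of enlarging the sampling volume for randomizing translations can be sketched for a spherical container. The code below is an assumption-laden illustration and not CAMPARI's routine: it places a point uniformly inside a sphere centered at the origin whose radius is the formal radius multiplied by a buffer factor in the spirit of RIGIDRDBUF, using simple rejection sampling.

```python
import random

def random_point_in_sphere(radius, rigidrdbuf=1.0, rng=random):
    """Uniform random point inside a sphere of radius rigidrdbuf·radius (origin-centered)."""
    if rigidrdbuf < 1.0:
        raise ValueError("the buffer factor must be at least 1.0")
    r_max = rigidrdbuf * radius
    while True:
        # Draw from the enclosing cube and reject points outside the sphere.
        p = [rng.uniform(-r_max, r_max) for _ in range(3)]
        if sum(c * c for c in p) <= r_max * r_max:
            return p

rng = random.Random(0)
points = [random_point_in_sphere(50.0, 1.5, rng) for _ in range(1000)]
outside = sum(1 for p in points if sum(c * c for c in p) > 50.0 ** 2)
print(outside, "of 1000 trial positions lie beyond the formal radius")
```

Trial positions beyond the formal radius are subject to the boundary potential and will often be rejected by the acceptance criterion, which is the efficiency cost of removing the placement bias.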
For any stepwise perturbation of rotational rigid-body degrees of freedom, this keyword sets the maximum step size in degrees. It is implemented such that the actual step size is drawn with uniform probability from an interval from -ROTSTEPSZ° to ROTSTEPSZ° for every individual axis that enters the computation of the net rotation axis and increment.
TRANSSTEPSZ
For any stepwise perturbation of translational rigid-body degrees of freedom, this keyword sets the maximum step size in Å. Analogous to ROTSTEPSZ, it is implemented such that the actual step size is drawn with uniform probability from an interval from -TRANSSTEPSZ to TRANSSTEPSZ Å.
RIGIDRDBUF
For any full randomization attempt in a rigid translation move in the presence of an explicit boundary potential acting on at least one of the coordinates, this keyword sets a scale ratio of sampling dimension and the size of the simulation container. Specifically, this means that the point to be placed, which is the geometric center of a molecule or a group of molecules, samples from a uniform distribution inside a container with dimensions that are RIGIDRDBUF times larger than the formal size specified. For a rectangular box these are the side lengths, for a sphere it is a radius, and for a cylinder it is the height and the radius. In all cases, the scale factor applies only to those dimensions that are not periodic, i.e., that are "closed" by an explicit boundary potential of controllable strength. The factor is applied uniformly to all of these eligible dimensions. The default value is 1.0, and only values greater than or equal to 1.0 are allowed. The computational efficiency of these move types decreases with increasing RIGIDRDBUF because an increasing number of moves will be rejected on account of the boundary potential. However, a value of 1.0 introduces systematic biases. Essentially, the full randomization move is nonergodic and will lead to a systematic underestimation of occupancy probabilities for positions with finite values of the boundary potential. This effect is strongest for the displacement of single-atom molecules by single-molecule rigid translation moves. Appropriate values for RIGIDRDBUF are those that effectively mask the bias. These appropriate values depend on boundary force, temperature, formal size, and system composition (see above for details regarding the last point).
CLURBFREQ
This keyword sets the fraction of all available rigid-body moves to simultaneously perturb the rigid-body degrees of freedom of more than one molecule in concerted fashion. In other words, these moves allow the concerted translation (by the same vector) and two types of rotations (either around the "cluster" center-of-mass or collectively around individual axes) of several molecules in one shot.
The expected total number of multi-molecule moves would be:
And that of all single molecule rigid-body moves would be:
Currently, the picking of the molecules in a "cluster" is completely random. This is generally inefficient when there are many molecules. In particular, these moves offer absolutely no advantage for dense systems composed of many molecules because the "cluster" will almost always be a set of molecules whose degrees of freedom are not strongly correlated (and allowing correlation is the only real motivation for attempting multi-degree of freedom moves in MC).
As mentioned, multi-molecule moves are split into a translational variant and two rotational variants, with the latter explained below (see CLURBROTFREQ and CLURBSPINFREQ). Cluster translation moves proceed by picking a common displacement vector for more than one molecule (→ CLURBMAX). The expected total number of multi-molecule translation moves is:
Since these molecules have no controlled spatial relationship, picking random displacements for them can easily get very inefficient. For example, in droplet boundary conditions, translations of clusters formed by distal molecules will frequently incur significant boundary penalties. Like all other rigid-body moves, cluster moves can be stepwise or completely randomizing (still in concerted fashion). This is all regulated by the previously introduced keywords RIGIDRDFREQ, RIGIDRDBUF, ROTSTEPSZ and TRANSSTEPSZ.
With the preferential sampling utility, it is possible to alter the picking weights on a per-molecule basis (the default weights are the same for every molecule irrespective of size, shape, etc). Note that this should yield either zero or reasonably large weights for all molecules, because the weights combine in a product sense during the picking process. This also means that it is tedious to compute the expected sampling probabilities for all possible "clusters" of molecules of sizes 2 to the maximum value. These moves are likely to be replaced by a more efficient variant in the future that uses spatial information to make the correlation component more meaningful.
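To give a sense of why enumerating the sampling probabilities for all possible "clusters" quickly becomes tedious, the following Python sketch counts the distinct sets of molecules for each cluster size from 2 up to a chosen maximum. It assumes that a "cluster" is an unordered set of molecules and ignores picking weights altogether; the function name and the example numbers are illustrative and not part of CAMPARI.

```python
from math import comb

def cluster_counts(n_molecules, max_size):
    """Number of distinct unordered clusters for each size from 2 to max_size."""
    upper = min(max_size, n_molecules)
    return {k: comb(n_molecules, k) for k in range(2, upper + 1)}

# Hypothetical system: 20 molecules, clusters of up to 5 molecules
counts = cluster_counts(20, 5)
for size, n in counts.items():
    print(size, n)
print("total:", sum(counts.values()))
```

Even for this small system, more than twenty thousand distinct clusters exist, and each would need its own probability once the per-molecule weights are no longer uniform.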
Custom constraints on individual translational directions (x, y, z) are observed by these moves in a restrictive manner. This means that if just one of the molecules attempted to be moved has constraints in a given Cartesian direction, this direction is eliminated for the entire cluster. In the unlikely case that different molecules have different constrained directions, it is therefore possible that the move is depleted entirely. This is inefficient as it constitutes a type of "empty" MC move with absolutely no sampling benefits. The correct handling of individual constraint directions is most useful for simulations in reduced dimensions (2 or 1).
CLURBROTFREQ
If cluster rigid-body moves are in use, this keyword selects the fraction of these moves to attempt rotations. The "normal" way of rotating is around a common cluster centroid and thus perturbs translational degrees of freedom of individual molecules. There is an alternative type of rotation move enabled by keyword CLURBSPINFREQ, in which the same rotation is applied to several molecules. These rotations are not around inertial axes of individual molecules but always around the cardinal axes of the laboratory reference frame.
Note that cluster rotation moves of the first type can become somewhat counterintuitive. In periodic boundary conditions, the nearest image and hence the internal structure of the cluster may actually change upon rotation of a cluster. Similarly, in droplet boundary conditions, rotations (and translations) of clusters formed by distal molecules may incur significant boundary penalties and suffer from low acceptance rates. Like all other rigid-body moves, cluster rotation moves can be stepwise or completely randomizing (still in concerted fashion). This is all regulated by the previously introduced keywords RIGIDRDFREQ and ROTSTEPSZ.
The expected number of multi-molecule rotation moves around a common pivot (the centroid of the "cluster") would be:
Note that molecules whose translational degrees of freedom have been frozen (by custom constraints) are ineligible for rotations around a common pivot (while they are eligible for the spinning moves explained below). The rotational axes considered for the "cluster" of molecules are always the cardinal axes of the laboratory frame passing through the geometric center of the cluster.
Custom constraints on individual translational directions (x, y, z) are observed by these moves in a restrictive manner. This means that if just one of the molecules attempted to be moved has constraints in a given Cartesian direction, no rotations are allowed that would change this direction. This means that already a single constrained direction eliminates two of the three axes and more than one constrained direction prevents the entire move. The latter case is inefficient as it constitutes a type of "empty" MC move with absolutely no sampling benefits. The correct handling of individual constraint directions is most useful for simulations in 2D where a single rotation axis remains.
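The restrictive treatment of translational constraints can be summarized in a few lines of Python. This is only a sketch of the logic as described above, not CAMPARI's implementation: the function names and the representation of constraints as sets of frozen directions are assumptions made for illustration.

```python
AXES = ("x", "y", "z")

def free_translation_directions(constraints):
    """constraints: one set of frozen Cartesian directions per molecule in the cluster.

    A direction is usable only if no molecule in the cluster has it frozen.
    """
    frozen = set().union(*constraints)
    return [d for d in AXES if d not in frozen]

def free_pivot_rotation_axes(constraints):
    """A rotation around axis a changes the other two directions, so it is
    allowed only if none of those two directions is frozen for any molecule."""
    frozen = set().union(*constraints)
    return [a for a in AXES if frozen <= {a}]

# One molecule unconstrained, one frozen along z (e.g., a 2D setup)
cluster = [set(), {"z"}]
print(free_translation_directions(cluster))   # ['x', 'y']
print(free_pivot_rotation_axes(cluster))      # ['z']

# Different molecules frozen along different directions
cluster = [{"x"}, {"y"}]
print(free_translation_directions(cluster))   # ['z']
print(free_pivot_rotation_axes(cluster))      # [] -> "empty" move
```

An empty result corresponds to the "empty" MC move mentioned above: the attempt is counted but samples nothing.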
CLURBSPINFREQ
If cluster rigid-body moves are in use, and cluster rotation moves are enabled, this keyword selects the fraction of these moves to be of the spinning variety. "Spinning" here refers to the selection of a "cluster" of molecules and applying the same rotation around parallel axes where each axis passes through each molecule's geometric center. As a result, these moves do not change translational coordinates of molecules.
The expected number of multi-molecule rotation moves of the spinning variety would be:
The three individual rotational axes considered for every molecule in a "cluster" are always the cardinal axes of the laboratory frame passing through the respective geometric centers (if the calculation is a pure MC run) or the centers of mass (if the calculation is a hybrid run). This differs from the single molecule rigid rotation moves (and the coupled ones), which use axes derived from the gyration tensor whenever all three dimensions are free to move.
Custom constraints on individual rotational axes per molecule (x, y, z) are observed by these moves in a restrictive manner. This means that if just one of the molecules attempted to be moved has constraints for rotating around that cardinal axis, this axis is eliminated entirely when constructing the net rotation axis. This means that in the unlikely case of different rotation axes having been eliminated for different molecules, the constraints prevent the entire move. This is inefficient as it constitutes a type of "empty" MC move with absolutely no sampling benefits. The correct handling of individual constraint directions is clearly most useful for simulations in 2D where a single rotation axis remains.
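The defining property of the spinning variant, namely that the same rotation is applied to every molecule around an axis through its own center so that no center moves, can be checked with a small Python sketch. For brevity it only rotates around the laboratory z-axis and uses geometric centers (the pure MC case); the function names and coordinates are hypothetical.

```python
import math

def geometric_center(coords):
    """Unweighted mean position of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def spin_about_z(coords, angle_deg):
    """Rotate one molecule around the lab z-axis passing through its geometric center."""
    cx, cy, _ = geometric_center(coords)
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    return [(cx + c * (x - cx) - s * (y - cy),
             cy + s * (x - cx) + c * (y - cy),
             z) for x, y, z in coords]

def spin_cluster(molecules, angle_deg):
    """Apply the same rotation to every molecule in the cluster."""
    return [spin_about_z(m, angle_deg) for m in molecules]

# Two hypothetical diatomic molecules
molecules = [[(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
             [(5.0, 5.0, 1.0), (5.0, 7.0, 1.0)]]
spun = spin_cluster(molecules, 90.0)
for before, after in zip(molecules, spun):
    print(geometric_center(before), geometric_center(after))
```

The printed centers agree before and after the move (up to rounding), whereas a rotation around a common pivot would displace them.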
CLURBMAX
This keyword sets the maximum "cluster" size for concerted multi-molecule rigid-body moves (see CLURBFREQ). The assignment is completely random at any given step such that detailed balance is maintained. Note that the number of possible "clusters" grows as binomial coefficients with increasing size of the cluster until CLURBMAX reaches half the number of molecules in the system. It is important to point out that picking values close to the number of molecules can cause search problems that CAMPARI actively avoids. Specifically, if the total sampling weight of available molecules remaining is less than 10%, a new molecule has not been found to add to the "cluster" in 100 tries, and the current size is at least 2, then the value picked initially for CLURBMAX is decreased to the current size. This is to avoid the code spending an excessive amount of time in an inefficient search procedure. The control on total sampling weight is particularly relevant for cases where the picking weights have been altered on account of the preferential sampling utility.
ALIGN
This keyword is an integer indicating how to handle the fact that lever arm effects can be asymmetric in multimolecular simulations. A brief explanation is in order. Consider a macromolecule with multiple dihedral angles along the backbone. Then, a perturbation of an individual one of those dihedral angles may be implemented in two basic ways corresponding to two building directions of the (unbranched) main chain. Either one of the ends will swivel around (lever-arm) while the other remains fixed in place. In a simulation with just a single molecule, the new conformations for either type will be identical except for an implied rotation of the reference frame. In a simulation with multiple molecules, however, the two conformations will be explicitly different since the other molecules define the now static reference frame. In general, moves with longer lever-arms will have lower acceptance rates and are slower to evaluate and should generally be avoided. For MC, this affects polypeptide pivot moves (coupled and uncoupled (see COUPLE)), ω-moves (see OMEGAFREQ), Favrin et al. inexact CR moves (see CRMODE), pivot-type nucleic acid moves (see NRNUC), sugar pucker moves (see SUGARFREQ), and polypeptide cyclic residue pucker moves (see PKRFREQ). It affects single torsion pivot moves (see OTHERFREQ) in a slightly different manner, and this is described there. It is also relevant for torsional dynamics for which it in similar vein determines the assumed building direction and resultant base of motion for the chains. Options are as follows:
- Always leave N-terminus unperturbed (C-terminus swings around).
- Always leave C-terminus unperturbed (N-terminus swings around). This is only recommended in special applications since the C-terminal alignment requires the whole molecule to be rotated around, which makes this mode more expensive but analogously asymmetric when compared to mode 1.
- Always leave the longer end unperturbed (shorter lever-arm is chosen). This is the default (and a good) choice as it should be the most efficient one for simulations with multiple chains of significant length. It is also the recommended setting for torsional dynamics in which the kinetics at one of the termini will otherwise be artificially slowed (note that the criterion determining lever arm length uses number of atoms rotated rather than number of residues in dynamics).
- A stochastic modification of mode 3 only available in MC: The probability with which the longer end swivels around is equal to:
p_lt = (L_st + 1) / (L_st + L_lt + 2)
And conversely:
p_st = (L_lt + 1) / (L_st + L_lt + 2)
Here, L_st is the smaller number of residues beyond the pivot point towards the nearer terminus and L_lt is the larger number of residues beyond the pivot point towards the more distant terminus such that L_st + L_lt + 1 yields the total number of residues in the molecule. For example, a molecule with six residues would yield probabilities for doing C-terminal alignment (the N-terminus swings around) of 6/7 for residue 1, 5/7 for residue 2, and so on down to 1/7 for residue 6.
This choice represents the most flexible move set and should normally be preferred in MC when sampling problems are encountered.
Finally, it should be mentioned that the N-terminal alignment corresponds exactly to the way molecules are built (hierarchically) in CAMPARI from their Z-matrix entries, and that this is a purely technical choice that should not have an impact on the thermodynamics of a system.
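The probabilities of the stochastic mode can be reproduced with a short Python sketch. It expresses the two formulas in terms of the number of residues on either side of the pivot, which is equivalent because each end swings with a probability proportional to the length of the opposite end plus one. The function name and the 1-based residue indexing are choices made for this illustration only.

```python
from fractions import Fraction

def swing_probabilities(n_residues, pivot):
    """Return (P[N-terminus swings], P[C-terminus swings]) for a 1-based pivot residue."""
    n_side = pivot - 1            # residues beyond the pivot towards the N-terminus
    c_side = n_residues - pivot   # residues beyond the pivot towards the C-terminus
    denom = n_side + c_side + 2
    # Each end swings with a probability that grows with the length of the other end
    p_n_swings = Fraction(c_side + 1, denom)
    p_c_swings = Fraction(n_side + 1, denom)
    return p_n_swings, p_c_swings

# Six-residue example from the text
for residue in range(1, 7):
    p_n, p_c = swing_probabilities(6, residue)
    print(residue, p_n, p_c)
```

For the six-residue chain, the first column of probabilities runs from 6/7 at residue 1 down to 1/7 at residue 6, and the two probabilities always sum to one.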
COUPLE
If this keyword is set to 1 (logical true), all polypeptide pivot moves are coupled to sidechain moves on the same residue (→ PIVOTMODE). This means that new conformations for the φ- and ψ-angles as well as for some of the sidechain χ-angles (if any) are proposed before the energy and acceptance criterion are evaluated. Like any other unbiased move perturbing multiple degrees of freedom, this procedure drastically increases the chance of generating an unacceptable conformation (assuming a typical excluded-volume interaction potential is used). Consequently, acceptance rates will be very low and it is generally not recommended to use this option. Note that it is still possible to use independent sidechain moves but that it is impossible to do independent pivot moves for residues with sidechains. In other words, all frequency settings are used as normal but all standard polypeptide pivot moves (the default move type of the decision tree) are coupled to a mandatory sidechain move (of a (sub)set of sidechain angles in that residue). Keywords PIVOTRDFREQ, PIVOTSTEPSZ, CHIRDFREQ, CHISTEPSZ, and NRCHI are all observed in the respective parts of the underlying base move types.
The expected number of those coupled moves would be:
Note that the same formula applies to uncoupled polypeptide pivot moves.
PIVOTMODE
Polypeptide pivot moves are historically the oldest move type in CAMPARI. Therefore, they are placed at the outermost branch of the move selection tree and possess no frequency selection keyword. In general, pivot moves simultaneously sample the φ- and ψ-angles of a single polypeptide residue unless the residue is ring-constrained (such as proline or hydroxyproline) in which case only the unconstrained degree of freedom (ψ for proline) is sampled. See PKRFREQ for "pivot" moves which sample the φ-angles of proline and analogous residues. The default picking probabilities for polypeptide moves are even for all residues with peptide φ/ψ-angles. They can be adjusted with the help of the preferential sampling utility. An example where this can be useful is in reducing the picking weight of proline and similar residues, for which the number of degrees of freedom is smaller.
Mostly for historical reasons, this keyword allows the selection of different modes for pivot moves as follows:
- Blind backbone sampling, i.e., all angles have equal likelihood (unbiased and the default)
- Using grids (requires GRIDDIR), i.e., sampled angle pairs come from within an approximate envelope derived from the space available to the corresponding dipeptide if one assumes typical excluded volume interactions (biased).
PIVOTRDFREQ
Much like for other move types, CAMPARI allows the user to mix two types of polypeptide pivot moves: the first randomizing the φ- and ψ-angles of the residue in question (for proline only the ψ-angle, for coupled moves also the sidechain χ-angles → COUPLE), the second perturbing them by a small increment whose size is set by the auxiliary keyword PIVOTSTEPSZ. Note that randomizing moves may be extremely ineffective for the sampling of dense phases (collapsed states of macromolecules) and that the only accepted moves will be those realizing small displacements by chance.
To calculate the expected number of randomizing and stepwise polypeptide pivot moves, the user may employ the formula listed under COUPLE and multiply it with PIVOTRDFREQ and 1.0-PIVOTRDFREQ, respectively.
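The mixture of randomizing and stepwise proposals can be sketched in Python as follows. This is an illustration under stated assumptions rather than CAMPARI's code: the function names are hypothetical, `rd_freq` and `step_size` merely stand in for PIVOTRDFREQ and PIVOTSTEPSZ, and the step size is interpreted as the full width of the symmetric interval, as described under PIVOTSTEPSZ.

```python
import random

def wrap_degrees(angle):
    """Map an angle to the interval from -180 to 180 degrees."""
    return (angle + 180.0) % 360.0 - 180.0

def propose_pivot(phi, psi, rd_freq, step_size, rng):
    """Propose new (phi, psi) values in degrees for a single residue."""
    if rng.random() < rd_freq:
        # Randomizing move: draw both angles from the full range
        return rng.uniform(-180.0, 180.0), rng.uniform(-180.0, 180.0)
    # Stepwise move: the displacement is at most half the step size in either direction
    half = 0.5 * step_size
    return (wrap_degrees(phi + rng.uniform(-half, half)),
            wrap_degrees(psi + rng.uniform(-half, half)))

rng = random.Random(42)
n_attempts = 10000
n_random = sum(rng.random() < 0.2 for _ in range(n_attempts))
print("expected randomizing:", 0.2 * n_attempts, "observed:", n_random)
print(propose_pivot(-60.0, 140.0, rd_freq=0.0, step_size=10.0, rng=rng))
```

With a step size of 10°, the stepwise proposal stays within 5° of the current values, while the split between the two move types follows the chosen fraction on average.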
PIVOTSTEPSZ
This keyword sets the step size in degrees for local perturbation attempts to the φ- and ψ-angles of polypeptide residues (see PIVOTRDFREQ). Note that this step size encompasses the entire symmetric interval around the original position, i.e., a value of 10° will attempt uniformly distributed random displacements within the interval of -5° to 5°.
GRDWINDOW
This keyword sets a parameter determined by external input files which are used to assist conformation space sampling in biased fashion when PIVOTMODE is set to 2. Then, GRDWINDOW needs to specify half the bin size for the steric grids (see GRIDDIR). The files are supplied in the data-directory and the default value to be used here would be 5.0°. Note that grid-assisted sampling is not a fully supported option in CAMPARI and may be removed entirely in the future.
OMEGAFREQ
In polypeptides, the dihedral angle along the actual peptide bond (ω) is different from the φ- and ψ-bonds since the carbon and nitrogen atoms have partial sp2-character. This inhibits free rotation around the bond due to electronic effects and means that only a very narrow range of conformations is typically available to the ω-angle. The two dominant states are the planar cis- and trans-conformations with the latter being almost exclusively seen for non-proline residues and both contributing for proline. In molecular mechanics force fields, these effects are typically represented via strong torsional potentials (see SC_BONDED_T and SC_EXTRA). From a sampling point of view, this means that it would be unwise to couple the sampling of such a stiff degree of freedom to any other degree of freedom. ω-moves therefore perturb nothing but the ω-angle of an individual polypeptide residue. They technically are equivalent to pivot moves in that the "free" end will swivel around lowering the acceptance rates additionally if the perturbations are large (→ ALIGN).
To calculate the number of expected ω-moves use:
Note that the moves are additionally split up into those attempting to completely randomize the ω-angle and those that attempt stepwise perturbations (→ OMEGARDFREQ). It should be emphasized that the randomizing move will typically be the only way of converting between cis- and trans-conformations due to the height of the barrier separating the two. The default picking probabilities are identical for all residues with ω-type bonds. They can be adjusted with the help of the preferential sampling utility, and such adjustment could be useful in mixed systems with small molecule amides and polypeptides, where it may be beneficial to preferentially sample the polypeptide ω-bonds.
OMEGARDFREQ
This keyword is completely analogous to PIVOTRDFREQ but applies to ω-moves instead of φ/ψ-moves.
OMEGASTEPSZ
This keyword is completely analogous to PIVOTSTEPSZ but applies to ω-moves instead of φ/ψ-moves.
PKRFREQ
Of the fraction of all pivot-type polypeptide backbone moves, what is the fraction of backbone moves to selectively alter the dihedral angles around the N-Cα bond in proline or similar residues? These rotations are hindered by the presence of the ring and hence they cannot be sampled independently. Moves of this type therefore alter the pucker state of the amino acid sidechain belonging to the chosen residue and the backbone conformation of the polypeptide (pivot-type move) simultaneously. These moves are analogous to sugar pucker moves for polynucleotides (see SUGARFREQ).
The expected number of polypeptide pucker moves would be:
Note that these moves are split up into two variants - a non-ergodic one which inverts the pucker state, and one which introduces new degrees of freedom, bond angles, but allows sampling of most of the relevant phase space (bond length changes remain quenched). This is determined by PKRRDFREQ. When analyzing high-resolution structural databases, it can be seen that proline residues occupy two dominant pucker states separated by a barrier. The non-ergodic move can jump across this barrier but is unable to explore the basin around its current position. The latter requires bond angle changes as otherwise the problem is overconstrained. This introduction of new degrees of freedom is generally undesirable (see discussion under ANGCRFREQ) but in this particular case of small impact since none of the bond angles along the main chain are allowed to change. This keeps the effects of bond angle changes local while allowing exploration of the continuous manifold of conformations of the five-membered ring.
The exact set of degrees of freedom used to sample the ergodic move type is explained in detail elsewhere, and an implementation reference is given in the literature. The default picking probabilities for this move type are flat for all polypeptide residues possessing ring pucker degrees of freedom. The probabilities can be adjusted by the preferential sampling utility, and this could be used to fine-tune sampling weights in polymers. For example, puckering equilibria for central residues in polyproline are expected to be both more relevant and more difficult to sample than those for terminal residues and may benefit from being sampled preferentially. Finally, as for sugar pucker moves, the solution of the ring closure problem in these moves is not parallelized when CAMPARI's shared memory (OpenMP) parallelization is in use, which is a limitation.
PKRRDFREQ
As pointed out above, finding arbitrary conformations of a five-membered ring while keeping all bond lengths and angles constant is an overconstrained problem (→ PKRFREQ). Therefore, CAMPARI releases the constraint on bond angle rigidity for those systems which include proline and similar polypeptide residues. This necessitates the use of bond angle potentials (see SC_BONDED_A) to keep local geometries reasonable. To sample different ring conformers effectively, CAMPARI uses a strategy of combining a non-ergodic reflection of the pucker state (non-local) with stepwise but unbiased excursions away from the current state. This keyword regulates the fraction of pucker moves to be of the former type (reflection). The formulas listed under PKRFREQ multiplied with PKRRDFREQ and (1.0-PKRRDFREQ), respectively, would give the expected numbers for either type. Note that it typically is not a good idea to set this to either zero or unity. A value of unity would create an effective two-state model (with fixed bond angles), while a value of zero would make it very difficult for the gross pucker state to switch due to the barrier separating the two (this last statement assumes typical interaction potentials).
PUCKERSTEP_DI
This keyword applies to the second type of pucker sampling (see PKRRDFREQ) and controls the maximum step size for dihedral angles in degrees for the random stepwise excursions from the current state. It simultaneously applies to the problem of sugar pucker sampling (→ SUGARFREQ). In both cases, four of the seven freely sampled degrees of freedom are dihedral angles.
PUCKERSTEP_AN
This keyword applies to the second (stepwise) type of pucker sampling (see PKRRDFREQ) and controls the maximum step size for bond angles in degrees for the random stepwise excursions from the current state. Much like PUCKERSTEP_DI, this keyword simultaneously applies to the problem of sugar pucker sampling (→ SUGARFREQ). In both cases, two of the seven freely sampled degrees of freedom are bond angles and one bond angle is derived to correctly close the loop.
NUCFREQ
This keyword controls the frequency of all types of polynucleotide moves excepting those sampling just sidechain degrees of freedom. This set includes algorithms to sample stretches of polynucleotides with end-constraints (concerted rotation → NUCCRFREQ), dedicated algorithms to sample the constrained dihedral angles around the sugar bond (→ SUGARFREQ), and simple polynucleotide backbone pivot moves. The description below applies only to the latter type which does not possess a dedicated keyword but is the default fall-through choice for this branch of the decision tree.
Non-terminal polynucleotides have six backbone degrees of freedom one of which is not sampled by this type of move. Much like for proline, the rotation around the sugar bond is hindered and a dedicated algorithm is needed to sample this dihedral angle (→ SUGARFREQ). An overview of the backbone degrees of freedom for terminal and non-terminal nucleotides can be gleaned from the description of sequence input. Nucleotide pivot moves are physically analogous to polypeptide φ/ψ-moves in that they sample the backbone of a single nucleotide residue. The new conformation will imply the rotation of a lever arm which will render large-scale perturbations very unlikely to be accepted (→ ALIGN). Technically, these moves are implemented slightly differently in that the number of sampled degrees of freedom may vary (→ NRNUC). This is to make it possible to fine-tune sampling efficiency. As with any move coupling the sampling of independent degrees of freedom blindly, efficiency will typically be unacceptably low for more than two backbone dihedral angles given a realistic interaction potential and the complicated topology of polynucleotides. In the future, these moves are sought to cover any type of non-polypeptide polymer and the flexible setup was implemented partially with that in mind.
Expected numbers for all polynucleotide pivot moves may be calculated as follows:
Remember that NUCFREQ does not control the fraction of polynucleotide pivot moves directly but only sets the expected number for all polynucleotide moves. Note that the moves are additionally split up into those attempting to completely randomize the nucleotide backbone angles and those that attempt stepwise perturbations (→ NUCRDFREQ). The default picking probabilities for these pivot moves are flat on a per-residue basis. They can be adjusted by the preferential sampling utility, and this could become routinely relevant in future applications, for which other polymer types are subjected to pivot moves through this facility. In such a case, it would almost certainly be desirable to make the picking frequencies (at the very least) proportional to the number of backbone degrees of freedom in each residue, which may not necessarily be homogeneous.
NRNUC
This keyword allows the user to set the maximum number of nucleic acid backbone angles to be sampled within a pivot polynucleotide move. The dihedral angles will always come from the same residue. The implementation has the following features:
- Whenever NRNUC is equal to or larger than the number of backbone angles on a certain residue, all backbone angles on that residue will be sampled simultaneously.
- Whenever NRNUC is smaller than the number of backbone angles on a certain residue, on average NRNUC of the available angles should be sampled simultaneously. However, the actual average will be larger since always at least one angle has to be sampled (in other words, there is a stochasticity to the number of angles chosen, and the asymmetry is introduced by the constraint to always have at least one angle in the set).
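The statement that the actual average exceeds NRNUC can be made concrete with a Python sketch. The selection scheme used here is an assumption chosen because it is consistent with the description, not necessarily what CAMPARI does: every available angle is included independently with probability NRNUC divided by the number of available angles, and the draw is repeated whenever no angle was selected. All names are hypothetical.

```python
import random

def pick_angles(n_available, nr_max, rng):
    """Return the indices of the angles to perturb (never empty).

    ASSUMPTION: independent inclusion with probability nr_max / n_available,
    redrawn until at least one angle is selected.
    """
    if nr_max < 1:
        raise ValueError("nr_max must be at least 1")
    if nr_max >= n_available:
        return list(range(n_available))
    p = nr_max / n_available
    while True:
        chosen = [i for i in range(n_available) if rng.random() < p]
        if chosen:
            return chosen

def expected_count(n_available, nr_max):
    """Mean number of selected angles under the assumed scheme."""
    if nr_max >= n_available:
        return float(n_available)
    p = nr_max / n_available
    # Conditioning on a non-empty selection raises the mean above nr_max
    return n_available * p / (1.0 - (1.0 - p) ** n_available)

rng = random.Random(7)
sizes = [len(pick_angles(5, 2, rng)) for _ in range(20000)]
print("sampled mean:", sum(sizes) / len(sizes))
print("analytic mean:", expected_count(5, 2))
```

Under this scheme, five available angles and a target of two give a mean slightly above two, which mirrors the asymmetry introduced by always requiring at least one angle. The same reasoning would carry over to the analogous sidechain keyword NRCHI.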
NUCRDFREQ
This keyword is completely analogous to PIVOTRDFREQ but applies to polynucleotide backbone pivot moves instead of φ/ψ-moves.
NUCSTEPSZ
This keyword is completely analogous to PIVOTSTEPSZ but applies to polynucleotide backbone pivot moves instead of φ/ψ-moves.
NUCCRFREQ
This keyword sets the fraction of exact nucleic acid concerted rotation (CR) moves amongst all nucleotide moves. Concerted rotation algorithms are provided both for polypeptides and polynucleotides and function generally analogously although there are important implementation differences. Important general information for this type of move is provided elsewhere, along with parameters that apply to all variants of exact CR moves (such as UJCRBIAS, UJCRSTEPSZ, and UJCRWIDTH). In particular, the reader is referred to both the literature and the documentation on CR moves for polypeptides (→ CRFREQ and TORCRFREQ), especially with regard to the interpretation of auxiliary keywords (NUCCRMIN and NUCCRMAX) and the handling of picking probabilities and their alteration by user-level constraints and preferential sampling weights.
The general idea of a concerted rotation move is to sample a stretch of polymer without changing the absolute positions and relative orientation of the termini. Six degrees of freedom are required to solve this constrained problem. Note that for nucleic acid CR moves the rotation around the sugar bond (C4*-C3*) is always excluded from the algorithm (treated as a rigid segment). The order of angles is as follows:
- Any number of consecutive and permissible backbone dihedral angles immediately preceding nuc_bb_4 on residue i
- O5P-C5*-C4*-C3* (nuc_bb_4 on residue i)
- C4*-C3*-O3P-P (nuc_bb_5 on residue i)
- C3*-O3P-P-O5P (nuc_bb_1 on residue i+1)
- O3P-P-O5P-C5* (nuc_bb_2 on residue i+1)
- P-O5P-C5*-C4* (nuc_bb_3 on residue i+1)
- O5P-C5*-C4*-C3* (nuc_bb_4 on residue i+1)
The expected number of nucleic acid concerted rotation moves is obtained as follows:
Expected numbers for polynucleotide CR moves may be calculated as follows:
The user is reminded again that some of the parameters required for this move type apply universally to all exact CR methods while some apply specifically to the nucleic acid variant. Finally, these and all other exact torsional concerted rotation moves are (currently) unavailable when using CAMPARI's shared memory (OpenMP) parallelization.
SUGARFREQ
This keyword sets the fraction of polynucleotide backbone moves to selectively alter the dihedral angles around the sugar bond (C4*-C3*) amongst all polynucleotide moves not of the CR variety. Exactly analogous to the case for proline and similar cyclic residues in polypeptides (→ PKRFREQ), these rotations are hindered by the presence of the ring and cannot be sampled blindly. Moves of this type will therefore alter the pucker state of the sugar belonging to the chosen nucleotide and the backbone conformation of the polynucleotide (including lever arm) simultaneously.
The expected number may be calculated as follows:
The approach chosen to sample sugars is identical to the one for proline. There are two basic move types, one which inverts the pucker state by flipping the sign of two dihedral angles, and a second one which perturbs the bond angles and dihedral angles defining the five-membered ring by small random increments while maintaining bond lengths exactly (→ SUGARRDFREQ). The default picking probabilities for this move type are even for all eligible, sugar-containing residues. They can be adjusted by the preferential sampling utility. An example application could be to preferentially sample sugars close to the binding interface of a well-defined protein-DNA complex rather than those in the rigid portion of the DNA. Finally, as for polypeptide pucker moves, the solution of the ring closure problem in these moves is not parallelized when CAMPARI's shared memory (OpenMP) parallelization is in use, which is a limitation.
SUGARRDFREQ
This keyword is exactly analogous to PKRRDFREQ but applies to sugar pucker moves in polynucleotides instead of to polypeptide pucker moves.
CHIFREQ
Most biologically relevant polymers possess at least minor branches off the main chain. These sidechains are typically short and usually encode the alphabet underlying for instance polypeptides and polynucleotides. From a technical point of view, such short branches are much easier to sample than the backbone of a polymer since the impact of a change in conformation of the branch only affects the branch (lever arm effects are minimal and the assumed direction is always from the main chain outward towards the end of the branch). Since the perturbation is local, energy evaluations are much less costly and acceptance rates generally higher. There is no need for advanced algorithms and simple pivot-style moves re-setting or perturbing the dihedral angles in such a sidechain branch are sufficient to explore phase space. This keyword sets the fraction of all sidechain moves including a specialized move type used for analysis only (→ PHFREQ).
Expected numbers for actual sampling moves (denoted as χ-moves) are:
And for moves trying to determine the pK-values of ionizable polypeptide sidechains:
Note that the former are decomposed further into those randomizing the contributing degrees of freedom and those applying stepwise perturbations (→ CHIRDFREQ). The default picking probabilities for this move type give equal weight to all residues with at least one χ-angle independent of the number of χ-angles. This can be adjusted by the preferential sampling utility, which as an example would allow making all residue picking probabilities directly proportional to the number of χ-angles for each residue.
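The difference between the default picking probabilities and a weighting proportional to the number of χ-angles can be illustrated with a few lines of Python. The residue labels and χ-angle counts below are invented for the example, and the sketch says nothing about how weights are actually supplied to the preferential sampling utility.

```python
from fractions import Fraction

def picking_probabilities(weights):
    """Normalize non-negative per-residue weights to picking probabilities."""
    total = sum(weights)
    return [Fraction(w, total) for w in weights]

# Hypothetical residues with 1, 2, and 4 chi-angles
n_chi = {"res_A": 1, "res_B": 2, "res_C": 4}

default = picking_probabilities([1 for _ in n_chi])      # equal weight per residue
proportional = picking_probabilities(list(n_chi.values()))

for name, p_def, p_prop in zip(n_chi, default, proportional):
    print(name, p_def, p_prop)
```

With equal weights each residue is picked a third of the time, whereas the proportional choice picks the residue with four χ-angles four times as often as the one with a single χ-angle.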
CHIRDFREQ
This keyword is completely analogous to PIVOTRDFREQ but applies to χ-moves instead of φ/ψ-moves.
CHISTEPSZ
This keyword is completely analogous to PIVOTSTEPSZ but applies to χ-moves instead of φ/ψ-moves.
NRCHI
Many sidechains have different numbers of χ-angles and the complexity of a move would depend on the number of such angles sampled concurrently. Therefore, this keyword allows the user to set the maximum number of χ-angles to be sampled within a sidechain move. The dihedral angles will always come from the same sidechain on the same residue. Analogously to NRNUC, the implementation has the following features:
- Whenever NRCHI is equal to or larger than the number of χ-angles on a certain residue, all χ-angles on that residue will be sampled simultaneously.
- Whenever NRCHI is smaller than the number of sidechain angles on a certain residue, on average NRCHI of the available angles should be sampled simultaneously. However, the actual average will be larger since always at least one angle has to be sampled (in other words, there is a stochasticity to the number of angles chosen, and the asymmetry is introduced by the constraint to always have at least one angle in the set).
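The asymmetry described in the second item can be illustrated with a small simulation. This is only a sketch under an assumed selection rule (each angle is picked independently with probability NRCHI/n, and the draw is repeated if the set is empty); the function names are hypothetical, and CAMPARI's actual selection code may work differently.

```python
import random

def pick_chi_angles(n_chi, nrchi, rng):
    """Return indices of the chi-angles sampled in one move (assumed rule)."""
    if nrchi >= n_chi:
        return list(range(n_chi))  # all angles are sampled simultaneously
    p = nrchi / n_chi  # assumed per-angle picking probability
    while True:
        picked = [i for i in range(n_chi) if rng.random() < p]
        if picked:  # constraint: at least one angle must be in the set
            return picked

def expected_set_size(n_chi, nrchi):
    """Analytic mean set size under the assumed rule (binomial conditioned on >= 1)."""
    if nrchi >= n_chi:
        return float(n_chi)
    p = nrchi / n_chi
    return n_chi * p / (1.0 - (1.0 - p) ** n_chi)
```

Under this assumed rule, the mean set size always exceeds NRCHI whenever NRCHI is smaller than the number of available angles, because empty sets are discarded.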
MC move sets are highly specialized tools that have to reflect the choice of the system's degrees of freedom, its density, etc. Some of the choices enforced by the "standard" CAMPARI move sets and mandated by the default parameterization of the ABSINTH implicit solvent model are somewhat arbitrary. This is primarily an issue for degrees of freedom describing rotations around electronically hindered bonds and for rotations around terminal bonds between heavy-atoms (methyl and ammonium spins). For example, the amide bond in secondary amides is allowed to vary with dedicated moves, but these are not available for primary amides (the reasoning behind it is connected to the vanishing relevance of cis/trans isomerization in the latter case). However, these choices may not always be desirable. Second, when attempting to simulate entities that CAMPARI does not support natively, the majority of "standard" move types may not be available (exceptions apply if the entities are recognized as conforming to a supported biopolymer type). This would limit simulations containing such entities to pure rigid-body sampling.
To address both issues, CAMPARI offers a separate class of dihedral angle pivot moves that can be applied to any freely rotatable torsion angle in any of the system's components. All candidate dihedral angles in residues supported natively by CAMPARI that are frozen by default fall into this category (e.g., the C-N bond in the lysine sidechain, all C-N bonds in primary amides, the CA-CB bond in alanine, and so on). For unsupported residues, the Z-matrix is inferred from the input structure, and it may require some reordering of atoms to achieve the desired results (see a tutorial relevant in this context). In addition, these moves can also sample torsional degrees of freedom supported by other move sets as long as they fulfill the criterion that they correspond to a unique Z-matrix line describing the rotation around a topologically unhindered bond.
Prior to version 5 of CAMPARI, this excluded the polypeptide φ/ψ-angles, which are supported by the widest range of specialized move sets. As a consequence, Monte Carlo runs with a fixed random seed produce immediately divergent results across this version step (4 to 5) if polypeptide φ/ψ-angles are present and single dihedral angle moves are enabled for native degrees of freedom (see OTHERNATFREQ).
In terms of parameters, some care has to be taken that torsional potentials describing electronic effects (e.g., in primary amides) are included. Technically, moves of this type are unique in that they always sample only a single degree of freedom. Chain alignment works slightly differently for these moves. Specifically, for options 3 and 4, the number of atoms (rather than the number of residues) moving is critical in determining alignment. Also, all degrees of freedom are eligible for an inverted alignment including sidechain degrees of freedom. Even for option 3, this may consequently lead to the absence of a "base of motion" that would stay rigorously in place in the absence of rigid body moves. Because these moves will allow a main chain to swivel around a bond in a sidechain (branch), degree of freedom-based custom constraints would be seriously violated if ALIGN is set to 4. Thus, CAMPARI will defer option 4 to option 3 when there are active constraints set through the input file. Note that this does not apply to the Monte Carlo move set in general; it is only effective for the moves controlled by OTHERFREQ. For option 2, CAMPARI attempts to preserve a well-defined base of motion at the C-terminus, but this may not work as expected, in particular for polynucleotides and/or very short chains.
To calculate the number of all expected moves of type OTHER, use:
Note that these moves are additionally split up into three basic types (see OTHERUNKFREQ and OTHERNATFREQ for choosing different subsets of degrees of freedom), each of which is again split into two variants, i.e., those completely randomizing the dihedral angle and those that attempt stepwise perturbations (→ OTHERRDFREQ). The default picking probabilities for OTHER moves are different from other move types in CAMPARI, since they are identical for all eligible degrees of freedom (and not identical for all residues containing at least one eligible degree of freedom). For each subcategory of degrees of freedom, sampling weights can be adjusted individually with the preferential sampling utility. Details and examples are given for the individual subcategories.
If single dihedral angle pivot (OTHER) moves are in use, and if the simulation utilizes entities (residues, molecules) that are not natively supported by CAMPARI, this keyword allows the user to choose the bulk sampling weight for degrees of freedom in those unsupported residues. The use of unsupported residues in simulations is explained in a dedicated tutorial.
To calculate the number of expected moves acting on single dihedral angles in unsupported residues, use:
As mentioned above, these moves are additionally split up into two subtypes, i.e., those completely randomizing the dihedral angle and those that attempt stepwise perturbations (→ OTHERRDFREQ). The default picking probabilities for OTHER moves are different from other move types in CAMPARI, since they are identical for all eligible degrees of freedom (and not identical for all residues containing at least one eligible degree of freedom). They can be adjusted at the level of individual degrees of freedom by the preferential sampling utility. As an example, this can be useful when sampling an unsupported polymer (e.g., a polyester) and greater sampling emphasis should be placed on backbone degrees of freedom.
If single dihedral angle pivot (OTHER) moves are in use, and if not all OTHER moves are consumed on unsupported residues (→ OTHERUNKFREQ), this keyword allows the user to choose the bulk sampling weight amongst remaining OTHER moves for degrees of freedom that are supported natively by CAMPARI.
To calculate the number of expected moves acting on single dihedral angles natively supported, use:
This keyword also controls the fraction of moves acting on dihedral angles frozen by default, but located in residues supported natively by CAMPARI. Compute expected number as:
Both subclasses are additionally split up into two subtypes, i.e., those completely randomizing the dihedral angle and those that attempt stepwise perturbations (→ OTHERRDFREQ). The default picking probabilities for OTHER moves are different from other move types in CAMPARI, since they are identical for all eligible degrees of freedom (and not identical for all residues containing at least one eligible degree of freedom). They can be adjusted at the level of individual degrees of freedom by the preferential sampling utility. For the natively supported degrees of freedom, this could be useful in order to aid sampling of backbone degrees of freedom, whereas for the natively frozen degrees of freedom it could be used to selectively enable a few of those degrees of freedom (e.g., enable flexibility of arginine sidechains, but keep suppressing the methyl spins in hydrophobic residues).
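The nested splitting of OTHER moves described in the text can be sketched as simple arithmetic. This is an illustration only: `n_other` (the total expected number of OTHER moves) depends on upstream frequency settings that are not reproduced here, the function name is hypothetical, and the reading of OTHERRDFREQ as the fully randomizing fraction is an assumption based on its stated analogy to PIVOTRDFREQ.

```python
def other_move_counts(n_other, otherunkfreq, othernatfreq, otherrdfreq):
    """Split an assumed total of OTHER moves into the three subclasses
    and their randomizing/stepwise variants."""
    unsupported = n_other * otherunkfreq       # moves on unsupported residues
    remaining = n_other - unsupported          # moves not consumed above
    native = remaining * othernatfreq          # natively supported angles
    frozen = remaining - native                # angles frozen by default
    counts = {}
    for name, n in (("unsupported", unsupported),
                    ("native", native),
                    ("frozen", frozen)):
        counts[name] = {
            "randomizing": n * otherrdfreq,    # assumed meaning of OTHERRDFREQ
            "stepwise": n * (1.0 - otherrdfreq),
        }
    return counts
```

For example, `other_move_counts(1000, 0.2, 0.5, 0.25)` splits an assumed 1000 OTHER moves into three subclasses whose six entries sum back to 1000.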
This keyword is completely analogous to PIVOTRDFREQ but applies to all moves of type OTHER instead of polypeptide backbone pivot moves.
OTHERSTEPSZ
This keyword is completely analogous to PIVOTSTEPSZ but applies to all moves of type OTHER instead of polypeptide backbone pivot moves.
CRFREQ
This keyword is a global frequency setting which controls an entire branch of Monte Carlo moves all sharing the feature that they are of the concerted rotation (CR) type and apply to polypeptides. The general idea of a CR move is to sample a stretch of polymer without changing the absolute positions and relative orientation of the termini. Six degrees of freedom are required to solve this constrained problem exactly but simpler methods exist to use more degrees of freedom to solve it approximately (→ CRMODE). The reader is referred to NUCCRFREQ for CR moves on polynucleotides.
There are four different types of CR moves for polypeptides provided in CAMPARI:
- Exact CR moves utilizing both bond angles and dihedral angles along the polypeptide backbone to solve the closure problem exactly given fixed end constraints: these moves are based on the work of Ulmschneider and Jorgensen (→ ANGCRFREQ). (reference)
- Exact CR moves utilizing φ-, ψ-, and ω-angles along the polypeptide backbone to solve the closure problem exactly given fixed end constraints: these moves are primarily based on the work of Dinner (→ TORCRFREQ and TORCROFREQ). (reference)
- Exact CR moves utilizing just φ- and ψ-angles along the polypeptide backbone to solve the closure problem exactly given fixed end constraints: these moves are also based on the work of Dinner (→ TORCRFREQ and TORCROFREQ).
- Inexact CR moves utilizing just φ- and ψ-angles along the polypeptide backbone to approximate a solution to the closure problem by linear response: these moves are based on the work of Favrin, Irbäck, and Sjunnesson (default fall-through for this branch). (references)
The general appeal of exact CR methods partially lies in the reduced complexity of energy evaluations since the move only perturbs conformation locally and large parts of the polymer (assuming sufficient length) will remain static with respect to each other. This is never true for pivot-type moves applied to residues at the center of the chain. The other aspect which makes CR moves appealing is that they introduce correlation into the MC move set (the reader is referred to Vitalis and Pappu for further reading).
To compute expected numbers, use (same numbering as above):
This keyword selects the (sub-)fraction of Ulmschneider-Jorgensen (UJ) CR moves (see J. Chem. Phys. 118 (9), pp. 4261-4271 (2003)) according to the formulas shown above. Like any other exact CR move implemented in CAMPARI, UJ-CR moves combine two strategies for efficient conformational sampling: the approach of Favrin et al. (→ CRMODE) is used to obtain a variable length pre-rotation which biases the end of the pre-rotation segment to a position with high chance of having at least one real solution when attempting to close it. The closure problem is solved exactly using a numerical root search for an algebraically transformed equation for the following six degrees of freedom:
- Dihedral angle Ci-2, Ni-1, Cα,i-1, Ci-1 (φi-1)
- Bond angle Ni-1, Cα,i-1, Ci-1
- Dihedral angle Ni-1, Cα,i-1, Ci-1, Ni (ψi-1)
- Bond angle Cα,i-1, Ci-1, Ni
- Bond angle Ci-1, Ni, Cα,i
- Dihedral angle Ci-1, Ni, Cα,i, Ci (φi)
- The chain closure algorithm relies on a search process to locate roots for a complicated equation. This requires repeated matrix operations, which generate considerable computational overhead for a single UJ-CR move. This is true for all exact CR methods, and even more so for the exact torsional variants than for UJ-CR moves (→ TORCRFREQ).
- The inclusion of bond angles into the pre-rotation stretch is not a particularly useful extension but required for reasons of ergodicity. Additional parameters are needed to manage this aspect properly (→ UJCRSCANG). The inclusion of bond angles in the closure segments simplifies the root search procedure by eliminating branches of solution space and generally reducing the number of possible solutions. This makes the algorithm faster than comparable methods using dihedral angles only. However, varying bond angles cause two crucial issues:
- Allowing bond angles to change violates CAMPARI's typical paradigm of fixed geometry in MC calculations and therefore might invalidate some of the force field calibration done under this assumption. In general, it is very important to match the degrees of freedom chosen for the calibration phase of a force field with that for the application phase. The commonly held belief that the introduction of constraints does not alter the positions and relative weights of basins but merely influences barriers in the free energy landscape is not correct.
- CAMPARI currently has no way of independently sampling bond angles in Monte Carlo simulations. This means that effectively a subset of all bond angles are introduced as new degrees of freedom, for which there is no a priori justification whatsoever (in other words: selectively sampling a few bond angles makes unjustified assumptions about the remaining bond angles). It is therefore recommended to use this feature with the utmost caution until a more sound implementation surrounding it is added. Presently, it may be most suitable as part of the MC move set in hybrid runs (see DYNAMICS) employing Cartesian sampling in the dynamics portions (see CARTINT) although this approach has its own caveats.
Aside from the UJ-CR moves which employ bond angles (see ANGCRFREQ), analogous methods have been formulated to instead employ exclusively dihedral angles in both the closure and pre-rotation stretches. This keyword sets the frequency with which both subtypes of those moves occur during the simulation according to the formulas listed above. The preceding discussion has outlined the appeal of exact CR methods and it is not repeated here. Much like Ulmschneider and Jorgensen, CAMPARI employs a hybrid scheme of biased pre-rotations according to Favrin et al. (see CRMODE) and of exact closures according to Dinner. The latter half of the algorithm is the cost-intensive one. The algebraically transformed equation requires a numerical root search, for which we use a modified Newton scheme outlined below. Typically, multiple solutions need to be found, and a careful weighting and bias-removal strategy has to be employed to choose solutions with the proper probabilities (→ TORCRMODE). Those comments apply equally to exact polynucleotide CR moves (see NUCCRFREQ). For polypeptides, there are two variants available which differ in which peptide torsions are used to close the chain (described below).
Note that proline (or any other cyclic residue with constrained flexibility around any of the backbone dihedral angles) causes additional problems. In theory, one could formulate algebraic solutions which skip the proline φ-torsion. Since the number and positions of proline residues in the closure stretch are not known a priori, this appears impractical. We therefore provide a coupling to (weakly biased and simplified) pucker moves (see PKRFREQ) which will simultaneously determine and propose a new pucker state while solving the chain closure problem. This means that:
- Sampling of the φ-angle becomes coupled to the proline sidechain conformation (as it should be).
- The acceptance rate for CR moves will be significantly lower due to the extra degrees of freedom included.
- The sampling of the sidechain conformation will be weakly biased towards proper pucker states. In detail, some of the proposed closures will yield φ-angle values incompatible with sidechain closure and those will be discarded. For those which yield a sane φ-angle, a corresponding χ1-value is proposed with bias toward closable states. One of two free bond angles is perturbed slightly in random fashion and the last one is given by the closure as usual.
- Due to the above, it will be advantageous to not rely overly on CR-sampling for proline-rich systems - both for reasons of efficiency and accuracy. Conversely, it should be difficult to find a statistically significant impact of the sampler on global chain properties for polypeptides with low proline content.
This keyword lets the user set the fraction of exact, torsional polypeptide CR moves that include ω-angles in the formulation of the closure problem. Conversely, the remaining moves will use only φ/ψ-angles to close the chain. Expected numbers for either type are listed above. In detail, the ω-variant uses the following six degrees of freedom:
- Dihedral angle Cα,i-2, Ci-2, Ni-1, Cα,i-1 (ωi-1)
- Dihedral angle Ci-2, Ni-1, Cα,i-1, Ci-1 (φi-1)
- Dihedral angle Ni-1, Cα,i-1, Ci-1, Ni (ψi-1)
- Dihedral angle Cα,i-1, Ci-1, Ni, Cα,i (ωi)
- Dihedral angle Ci-1, Ni, Cα,i, Ci (φi)
- Dihedral angle Ni, Cα,i, Ci, Ni+1 (ψi)
Conversely, for the non-ω-variant we have:
- Dihedral angle Ci-3, Ni-2, Cα,i-2, Ci-2 (φi-2)
- Dihedral angle Ni-2, Cα,i-2, Ci-2, Ni-1 (ψi-2)
- Dihedral angle Ci-2, Ni-1, Cα,i-1, Ci-1 (φi-1)
- Dihedral angle Ni-1, Cα,i-1, Ci-1, Ni (ψi-1)
- Dihedral angle Ci-1, Ni, Cα,i, Ci (φi)
- Dihedral angle Ni, Cα,i, Ci, Ni+1 (ψi)
Different implementations are needed because the problems differ algebraically (for one) and because the stiffness of the ω-bond may make those moves using the ω-bonds in the closure particularly ineffective. This is not the only reason, however, to favor the non-ω-variant, which is also better behaved in terms of finding solutions to the closure reliably. Note that several diagnostics of the performance of exact CR methods are reported during the simulation and after its completion in the log-file.
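Each degree of freedom in the two lists above is a dihedral angle defined by four consecutive backbone atoms. As a generic illustration (not CAMPARI code; the function names are hypothetical), such an angle can be computed from Cartesian coordinates with the standard atan2 formula, which returns 0° for a cis and ±180° for a trans arrangement:

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) for four points, e.g. C(i-1), N(i), CA(i), C(i) for phi(i)."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)   # central bond: the rotation axis
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane spanned by the first three atoms
    n2 = _cross(b2, b3)  # normal of the plane spanned by the last three atoms
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```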
This defines the mode to use for concerted rotation moves roughly according to the Favrin et al. reference: J. Chem. Phys. 114 (18), 8154-8158 (2001). In general, this type of move attempts to introduce correlation into an MC move by coupling several consecutive backbone angles (only φ/ψ are considered) together to minimize a cost function which in this case is the difference of the position of the last atom in the stretch compared to its original position. Larger biases lead to smaller moves and higher acceptance. More often than not, this algorithm suffers from its computational inefficiency. Because the loop is only approximately closed, energy evaluations of high complexity (even more expensive than a pivot move) are necessary. It is not recommended to use moves of this type extensively.
There are two modes available:
- A matrix relating changes in the degrees of freedom to changes in the cost function (dr/dφ) is computed by considering effective lever arms. In this implementation six effective restraints are imposed through the three reference atoms (N, Cα, C) on the residue following the last one of those whose torsions are sampled (note, though, that algorithmically all nine Cartesian positions are used). Note that this mode therefore requires an additional buffer residue at the C-terminus. Specifically, sampling is possible only within an interval from the third residue (in addition to the ineligible terminal residues, there is a symmetry-creating N-terminal buffer residue as well) to the third last residue in each polypeptide chain. In that sense, these moves are trivially non-ergodic since they fail to sample a subset of the chosen degrees of freedom (i.e., those within terminal residues).
- The dr/dφ matrix is computed by nested rotation matrices (propagating changes via matrix multiplication). This directly accounts for peptide geometry within the reference atoms and yields six actual restraints. Here, the reference atoms are Cα, C, and O on the last residue of which torsions are to be sampled. The implementation with nested rotation matrices is costlier and this mode is only marginally supported, i.e., offers very limited adaptability through the keywords below.
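The lever-arm idea behind the first mode can be sketched with the textbook result for an infinitesimal rotation: turning by dφ about a unit axis u passing through a point p displaces a position r by dφ · u × (r − p). The snippet below is a generic illustration with hypothetical function names, not CAMPARI's implementation, and the Gram matrix shown is only assumed to correspond to the quadratic form used for biasing.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def lever_arm_jacobian(axes, pivots, target):
    """One column dr/dphi_k = u_k x (r - p_k) per torsion k.

    axes:   unit vectors along the rotatable bonds
    pivots: a point on each bond (e.g., one of its two atoms)
    target: position r of the reference atom whose motion is restrained
    """
    return [_cross(u, _sub(target, p)) for u, p in zip(axes, pivots)]

def gram_matrix(columns):
    """G[i][j] = (dr/dphi_i) . (dr/dphi_j), the matrix entering the quadratic bias."""
    n = len(columns)
    return [[_dot(columns[i], columns[j]) for j in range(n)] for i in range(n)]
```

With several reference atoms, one such matrix would be accumulated per atom; large entries of G mark torsions with long lever arms, which a bias toward loop closure suppresses most strongly.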
If inexact concerted rotation moves for polypeptides are in use (→ CRMODE), this keyword allows the user to provide the exact number of torsions to use each time such a move is performed. The default value is eight but a different number may be chosen as long as the chain is long enough to accommodate these moves. A minimum of seven degrees of freedom applies since the linear equations are otherwise overdetermined and only trivial solutions are (asymptotically) found. Note that this keyword is only supported if CRMODE is set to 1. Extensions of this to support mode 2 or to allow random, variable lengths during the simulations are currently not anticipated. This is due to the overall inefficiency of the Favrin et al. approach (see discussion here).
CRWIDTH
This keyword gives the standard deviation in radians of the random normal distribution underlying inexact concerted rotation moves for polypeptides (→ CRMODE), from which the (unbiased) displacement vectors are implicitly drawn. This corresponds to parameter "a" in the reference but is specified here as its inverse (a = 1/CRWIDTH). Note that the actual resultant distribution width is only set by this keyword if the bias toward minimizing the cost function is zero. If the latter is non-zero, the resultant distribution width will be co-controlled by the setting for CRBIAS. Note that only values up to π/2 may be specified to avoid wrap-around artifacts which may upset the procedure of removing the bias from these moves.
CRBIAS
This keyword specifies the strength of the bias for inexact concerted rotation moves for polypeptides (→ CRMODE) and corresponds to parameter "b" in the reference. It essentially controls how close the end of the rotated segment will end up to its original position (satisfying the restraints). Unfortunately, this also co-regulates the step size, hence there is a need for parameter optimization (i.e., the variance of the resultant biased distribution cannot be controlled easily). Intuitively, the reason is that - in a linear response-type theory - tiny step sizes always represent one way of satisfying the restraint. Note that with a choice of zero for this keyword, these inexact CR moves relax to random pivot moves of multiple residues in a row (→ CRDOF) with a sampling width controlled by CRWIDTH. Conversely, when choosing very large numbers for this keyword, it should be kept in mind that the evaluation of the acceptance criterion requires inclusion of an exponential factor, exp[- (Δφᵀ A Δφ) + (Δφ'ᵀ A' Δφ') ]. Here, the primed quantities are for the reverse move. Matrix A is diagonal if this keyword is set to zero which implies A = A', and the bias correction is unity. For large values of CRBIAS, the two elements within the exponential become disparate in magnitude very quickly and the exponential may exceed numerical limits even for double precision variables. This may cause some compilers to throw exceptions. Note that the complete bias correction formula includes the determinant of matrix A as well.
UJCRBIAS
Despite its name, this keyword regulates the biasing strength for the pre-rotation steps in all exact CR methods, i.e., nucleic acid CR moves, UJ-CR moves and both types of exact polypeptide CR moves (→ ANGCRFREQ, TORCRFREQ, and NUCCRFREQ). The strength of the bias controls how close the end of the pre-rotation segment remains to its original position, hence improving the chances for successful closure. This parameter is strongly co-dependent with the default distribution width in the absence of any bias (→ UJCRWIDTH). It is analogous to CRBIAS in the Favrin et al. scheme and is called "c2" in the UJ reference. It should be stressed that all caveats outlined above apply here as well.
UJCRWIDTH
Despite its name, this keyword regulates the general (in the absence of bias) width of the distribution (in degrees) sampled in the pre-rotation segment for all exact CR methods (→ ANGCRFREQ, TORCRFREQ, and NUCCRFREQ). As in the Favrin et al. scheme (which is practically embedded in all exact CR methods implemented in CAMPARI), the resultant width is co-dependent on the bias factor (see UJCRBIAS and for comparison: CRBIAS and CRWIDTH). It corresponds to "1/c1" in the UJ reference and therefore larger values give wider distributions.
UJCRSTEPSZ
The chain closure algorithm works in most exact CR implementations by reducing a multi-dimensional variable search to a 1D root-search, which is then solved by some form of step-through protocol and subsequent bisection. This keyword allows the user to choose the step-size for that root search in degrees for all exact CR methods. Currently, the UJ-CR method (→ ANGCRFREQ) uses a simple, non-adaptive stepping protocol (see also UJCRINTERVAL). Larger step-sizes there increase the speed of the algorithm significantly, but also increase the fraction of attempts in which no solution is found at all (a quantity reported at the end of the log-file). The recommended value by the authors is 0.05°. Conversely, the exact torsional CR methods for both polypeptides and polynucleotides (→ TORCRFREQ and NUCCRFREQ) employ a modified Newton scheme to map out the complete solution space in three hierarchical steps. In those cases, this keyword merely defines the largest step size to ever be used (i.e., if target function and derivative indicate that no root is near, the step size is not adjusted to very large values but instead to the value given by this keyword). For these methods, a setting of around 1.0 appears much more appropriate. In the future, the implementation of the UJ-CR method may be adjusted to use the same protocol as the torsional methods. For clarity, it shall be repeated that this keyword applies to all exact CR methods (but is inapplicable to inexact CR moves: → CRMODE). It is very important to understand that the numerical root search will invariably be unreliable, i.e., that there are conformations for which the function may be approaching zero asymptotically while also approaching imaginary solution space.
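The step-through-and-bisect idea can be sketched as follows. This is a generic illustration with a fixed, non-adaptive step over an interval [center − half_width, center + half_width]; it is not CAMPARI's closure code, and the function names, the toy target function, and the tolerance are all hypothetical.

```python
def scan_and_bisect(f, center, half_width, step, tol=1e-10):
    """Scan the interval in fixed steps and refine every bracketed sign change by bisection."""
    roots = []
    n_steps = int(round(2.0 * half_width / step))
    x_lo = center - half_width
    f_lo = f(x_lo)
    for k in range(1, n_steps + 1):
        x_hi = center - half_width + k * step
        f_hi = f(x_hi)
        if f_lo == 0.0:
            roots.append(x_lo)           # grid point happens to be a root
        elif f_lo * f_hi < 0.0:          # sign change: a root is bracketed
            a, b, fa = x_lo, x_hi, f_lo
            while b - a > tol:
                m = 0.5 * (a + b)
                fm = f(m)
                if fa * fm <= 0.0:
                    b = m
                else:
                    a, fa = m, fm
            roots.append(0.5 * (a + b))
        x_lo, f_lo = x_hi, f_hi
    if f_lo == 0.0:
        roots.append(x_lo)
    return roots

def toy_target(x):
    """Toy function with two closely spaced roots at 1.012 and 1.038."""
    return (x - 1.012) * (x - 1.038)
```

Calling `scan_and_bisect(toy_target, 0.0, 20.0, 0.01)` brackets both roots, whereas a step of 0.05 places both roots between the same pair of grid points, so no sign change is seen and neither is found. Missed solutions of this kind are one simple way in which such a search loses reliability as the step size grows.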
This implies that with such a technique, it will be nearly impossible to eliminate all biases rigorously although it will be possible to reduce their amplitude below that of statistical noise, even when the settings are such that satisfactory computational efficiency is provided (which of course is a crucial element to consider for expensive algorithms such as exact CR methods).
UJCRMIN
Specifically for the bond angle-based Ulmschneider-Jorgensen algorithm (→ ANGCRFREQ), this specifies the minimum requested length (in terms of number of residues) for the pre-rotation segment in the implementation. Note that if no molecule in the system is at least UJCRMIN+4 residues long (two for closure, two terminal buffer residues that can be caps), CR moves will be disabled entirely. Due to the problems outlined above, this suboptimal implementation has not yet been improved. Note that UJCRMIN and UJCRMAX are analogous to keywords TORCRMIN_DO and TORCRMAX_DO, but use residue numbers instead of numbers of degrees of freedom. Another restriction is that - unlike for TORCRMIN_DO and analogous keywords - UJCRMIN is enforced strictly, i.e., candidate residues are only those that provide the correct padding on either side (for the exact, torsional variants, the specified minimum padding is generally adjusted to the absolute minimum for stretches that would otherwise be too short). Therefore, the implementation of the angular UJ-CR moves generally offers less flexibility.
UJCRMAX
Specifically for the bond angle-based Ulmschneider-Jorgensen algorithm (→ ANGCRFREQ), this keyword specifies the maximum requested length (in numbers of residues) for the pre-rotation segment in those moves. Note that this parameter is automatically reduced if a move is attempted for a molecule which is too short to allow the full range of segment lengths (but long enough to satisfy UJCRMIN of course). This will make it difficult to predict the resultant distribution of pre-rotation segment lengths (compare TORCRMIN_DO).
UJCRINTERVAL
Specifically for the bond angle-based Ulmschneider-Jorgensen algorithm (→ ANGCRFREQ), this keyword lets the user choose the size of the search interval for the one-dimensional root-search (see UJCRSTEPSZ). The algebraically isolated degree of freedom is scanned over the interval [φ-d;φ+d] where φ is the original value and d is the (half-)interval size specified by this keyword. The recommended value is 20.0°. Note that this implementation is unique to the bond angle UJ-CR method and offers much reduced overhead cost per CR move compared to the exhaustive search performed by exact torsional methods. The efficiency and justifiability of the method both rely on the crucial assumption that - given a typical pre-rotation - approximately one solution will be found in the scanned interval. If the number of solutions is often zero or larger than one, the algorithm violates detailed balance and the resultant distributions will be strongly biased. It is generally recommended to analyze the performance of the algorithm beforehand by checking for proper Boltzmann weights in the distributions of both torsional and angular degrees of freedom. This is most easily and meaningfully done employing only bond angle potentials (→ SC_BONDED_A) but no other terms in the Hamiltonian. Then, the distributions of the dihedral angles must be flat and those for the angular degrees of freedom must be such that -kBT·ln(p(α)) equals the acting bond angle potential on α.
UJCRSCANG
This keyword applies exclusively to the bond angle-based Ulmschneider-Jorgensen CR algorithm for polypeptides (→ ANGCRFREQ). It lets the user set a scaling factor to reduce the magnitude of pre-rotation perturbations of bond angle degrees of freedom (in the absence of pre-rotation bias, resultant width will be proportional to UJCRWIDTH·UJCRSCANG → values less than unity are desirable). Large perturbations on those bond angles would reduce the efficacy of the method considerably due to the stiff potentials typically used to keep bond angles in the valid regimes. Note that the UJ-CR method never considers ω-angles for conformational sampling and that they are consequently excluded from pre-rotation sampling in their entirety. This is a bit of an arbitrary choice - in particular when considering the problems introduced by the bond angle sampling in the first place (discussion here) - and remedied in exact but purely torsional CR methods (→ TORCRFREQ). The parameter specified here corresponds to "1/c3" in the UJ reference.
TORCRMODE
Unlike standard MC moves (such as φ/ψ-pivot moves), exact CR methods do not constitute an ergodic move set beyond the subspace satisfying the constraint (which is of course invariant toward sampling on that manifold). This necessitates mixing exact CR moves with other types of moves to achieve sampling of the entire phase space. Moreover, they solve an analytical problem numerically with finite error rate, i.e., not all solutions are always found. If these errors are dependent on the "position" of the constraint, i.e., on polymer conformation, the resultant sampling is biased even though Jacobian corrections are applied. This small bias is nearly impossible to remove entirely. CAMPARI supports two implementations for exact, torsional CR methods:
- When set to 1, at each step, a superset of solutions is created containing the original solution, a set of alternative closures given the original pre-rotation state, and a set of new conformations with a given, altered pre-rotation state and a set of closures for that altered state. For each solution, the Jacobian determinants with respect to the closure constraint and the pre-rotation constraint are evaluated, multiplied, and a solution is picked using the net Jacobian as a weight factor. The chosen move is then evaluated via the acceptance criterion given the additional bias correction of evaluating the randomness of the pre-rotation move forward and backward as in the Favrin et al. scheme. In the absence of any pre-rotation bias, this algorithm is conceptually rejection-free. It also (in theory) satisfies detailed balance on account of the construction of the solution superset.
- When set to 2, at each step, a finite number of trials (see UJMAXTRIES) of pre-rotations according to the Favrin et al. scheme is performed. Closure is attempted and in case solutions are found, the possible closures along with the sampled pre-rotation constitute the set of possible moves. A random one is chosen (uniform probability) and the new conformation is evaluated via Metropolis with the Jacobian corrections for the proposed vs. the current state (with respect to both types of constraints) and the randomness correction for the pre-rotation step. Because solutions only need to be found given the pre-rotation, this algorithm is usually twice as fast as the one above given sane pre-rotation settings. This implementation does not satisfy detailed balance even in theory but attempts to remain globally balanced.
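The selection step of the two modes can be illustrated with a minimal Python sketch. It is not CAMPARI's implementation: the function names, the representation of solutions as pairs of Jacobian factors, and the `find_closures` callback are all hypothetical, and the subsequent acceptance test (including the pre-rotation randomness correction) is deliberately left out.

```python
import random


def pick_by_jacobian(solutions, rng):
    """Mode-1-like choice: pick a solution with probability proportional
    to its net Jacobian (closure factor times pre-rotation factor).

    solutions: list of (closure_jacobian, prerotation_jacobian) pairs;
               this representation is an assumption made for illustration.
    Returns the index of the chosen solution.
    """
    weights = [j_closure * j_prerot for j_closure, j_prerot in solutions]
    if sum(weights) <= 0.0:
        raise ValueError("no solution with positive weight")
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def pick_uniform_with_retries(find_closures, max_tries, rng):
    """Mode-2-like choice: try up to max_tries pre-rotations and pick one
    closure uniformly for the first pre-rotation that can be closed.

    find_closures: hypothetical callback mapping a trial index to a list
                   of closure solutions (empty if closure fails).
    Returns (trial, closure), or None if the move counts as rejected.
    """
    for trial in range(max_tries):
        closures = find_closures(trial)
        if closures:
            return trial, rng.choice(closures)
    return None
```

In this sketch, a solution whose net Jacobian is zero is never selected in the first function, and the second function returns `None` once the trial budget (the role played by UJMAXTRIES) is exhausted.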
TORCRMIN_DO
This specifies the minimum requested number of degrees of freedom for the pre-rotation segment for exact CR moves for polypeptides utilizing ω-angles during closure (→ TORCRFREQ). Note that this minimum number is not rigorously enforced but will be ignored if closure residues too close to the N-terminus are used. This is done in the interest of generality and to prevent the code from disabling these types of moves frequently. It is therefore not as straightforward as one may think to compute the expected distribution of pre-rotation segment lengths (and which residues are part of them with what probability) for each polypeptide. Note that here numbers of degrees of freedom are specified whereas for the bond angle UJ method, numbers of residues are specified (→ UJCRMIN).
TORCRMAX_DO
This specifies the maximum requested number of degrees of freedom for the pre-rotation segment for exact CR moves for polypeptides utilizing ω-angles during closure (→ TORCRFREQ). Note that this maximum number is in fact a rigorous upper limit and never exceeded but that the length of some polypeptides in the system may be such that it is never realizable. In the latter case, there will be an additional complication in predicting the resultant distribution of pre-rotation segment lengths (see TORCRMIN_DO as well).
TORCRMIN_DJ
This keyword is exactly analogous to TORCRMIN_DO but applies to exact CR moves for polypeptides without using ω-angles in the closure.
TORCRMAX_DJ
This keyword is exactly analogous to TORCRMAX_DO but applies to exact CR moves for polypeptides without using ω-angles in the closure.
TORCRSCOME
This parameter is analogous to UJCRSCANG and scales down the magnitude of the step-size for ω-bonds in the pre-rotation segment of exact torsional CR methods for polypeptides. Since stiff torsional potentials usually act on ω-bonds (→ OMEGAFREQ), the likelihood of obtaining rejected moves mostly on account of excursions of the ω-angle is high. This unwanted behavior may be alleviated by employing small values for TORCRSCOME. Remember, however, that the pre-rotation step size will often be relatively small in general.
UJMAXTRIES
Despite its name, this keyword regulates the maximum number of pre-rotation sampling events to consider in exact, torsional CR methods with TORCRMODE set to 2. If no solution is found within UJMAXTRIES attempts, the move is counted as rejected. Naturally, detailed balance is maintained only if there is always at least one solution found given the new pre-rotation (i.e., this keyword is rendered obsolete). As alluded to above, this is never the case for the entirety of a simulation. It is difficult to predict what setting in those cases would best preserve global balance. The main utility of this keyword, however, lies in different sampling applications, e.g., in the efficient and exhaustive sampling of different loop conformations given a fixed constraint.
NUCCRMIN
This keyword is analogous to TORCRMIN_DO but applies to exact CR moves for polynucleotides. Note that the sugar bond (C3*-C4*) is always excluded from pre-rotation sampling.
NUCCRMAX
This keyword is analogous to TORCRMAX_DO but applies to exact CR moves for polynucleotides. Note that the sugar bond (C3*-C4*) is always excluded from pre-rotation sampling.
PHFREQ
This sets the fraction of all sidechain moves (see CHIFREQ) that are performed as (de)ionization MC moves. These moves will be turned off automatically in case there are no titratable residues in the system (currently only polypeptide residues D, E, R, K, and H (use neutral form) are supported). Note that these are pseudo-MC moves, i.e., they do not interface intuitively with the rest of the MC code. This means that the guidance criterion for accepting / rejecting titration moves is based on a distinct and simplified energy evaluation which has no impact on the actual Markov chain. These moves therefore analyze (on-the-fly) an independently generated Markov chain (using whatever Hamiltonian was specified) but do not perturb the conformational ensemble generated by said Markov chain in any way. This essentially corresponds to the assumption that the generated ensemble is independent of titration states - an assumption which is always wrong but may - in certain circumstances such as extreme denaturing conditions - nonetheless be justified. These moves rely on environmental settings (PH and IONICSTR) and are required for obtaining output in PHTIT.dat. The default picking probabilities for ionizable residues are flat and cannot be altered.
PSWFILE
This keyword specifies name and location (full or relative path) of an optional input file parsed to alter the default picking probabilities for all types of moves in CAMPARI at most down to the residue level (but not further). In general, the idea of preferential sampling rests on the realization that any ergodic and unbiased move set is theoretically capable of producing a Markov chain yielding the correct phase space distribution. This means that the sampling weights given to degrees of freedom of the system need not be equivalent, but rather can be chosen arbitrarily (as long as a choice of zero somewhere does not eliminate ergodicity). Of course, the convergence properties of a Monte Carlo simulation are an exceptionally complicated function of the move set, and therefore deviation from default choices should be properly justified. Examples have been listed above, e.g. in the discussion of sidechain sampling.
CAMPARI generally allows the preferential sampling facility to overlap with user-level constraints. Constraints are applied first, and then picking probabilities are altered. In the process, it is possible to effectively introduce additional constraints on account of setting selected sampling weights to zero. This is tolerated as long as it does not deplete the pool for a class of moves entirely. In such a case, the program terminates with an error. There is a notable difference between zero sampling weights and constraint requests for concerted rotation moves of polymers (described elsewhere). Note that it is not possible to control frequencies that would lead to incorrect sampling. In particular, it is impossible to control picking probabilities for particle permutation moves, and particle insertion and deletion moves can only be controlled down to the molecule type level. Rigid-body moves are generally limited to the scope of molecules, not residues. The format of the input file is described elsewhere.
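The order of operations described above (constraints first, then user weights, with an error if a move class ends up with an empty pool) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the dictionary of per-residue weights, and the set of constrained residues are hypothetical and do not reflect the actual file format or CAMPARI's internal data structures.

```python
def picking_probabilities(weights, constrained):
    """Turn per-residue sampling weights into picking probabilities.

    weights:     dict mapping a residue index to a non-negative weight
                 (hypothetical stand-in for user-supplied weights).
    constrained: set of residue indices removed by user-level constraints.
    """
    if any(w < 0.0 for w in weights.values()):
        raise ValueError("sampling weights must be non-negative")
    # Step 1: constraints are applied first and remove residues outright.
    eligible = {res: w for res, w in weights.items() if res not in constrained}
    # Step 2: the remaining weights are normalized; zero weights act as
    # additional, effective constraints.
    total = sum(eligible.values())
    if total <= 0.0:
        # The pool for this class of moves is depleted entirely.
        raise ValueError("no residue left to pick for this move type")
    return {res: w / total for res, w in eligible.items()}
```

For example, with weights `{1: 1.0, 2: 3.0, 3: 0.0, 4: 1.0}` and residue 1 constrained, residue 2 is picked three times as often as residue 4, and residue 3 is never picked.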
If the default picking probabilities are altered (→ PSWFILE) in torsional space Monte Carlo simulations, this keyword acts as a simple logical that controls whether a summary of the resultant picking frequencies is written to the log-file for every move type that is active and has been modified.
Files and Directories:
(back to top)
Preamble (this is not a keyword)
In general, files and directories should be provided using absolute paths. This is often advantageous in deployment-based computing where relative directory structures and/or shortcuts may change or not exist. However, CAMPARI may fail to read strings longer than 200 characters, which leads to truncation and subsequent failure. This should be kept in mind. Also, this section is merely a list of the auxiliary files potentially required by CAMPARI. The functionalities themselves (including the files) are usually explained elsewhere (links are provided).
BASENAME
This keyword allows the user to pick a name for the simulation/system that is going to be used in the names of all structural output files. However, all other output files produced by CAMPARI use generic names and will be overwritten if simulations are continuously run in the same directory.
SEQFILE
This is the most important input file as it instructs CAMPARI which system to simulate. Its format and possible entries are explained in detail elsewhere.
SEQREPORT
This keyword is a simple logical (specifying 1 means "true", everything else means "false") that controls whether CAMPARI writes out a summary of some of the system's features initially. In detail, it will provide an overview of the identified molecule types, viz., the numbers of each molecule type present, the first instance, their formal concentration, their molecular mass, and their high-level suitability for performing CAMPARI-internal analyses. The latter would for example report that urea molecules are not suitable for peptide-centric analysis such as secondary structure analyses. In addition, the parsing of these molecule types into analysis groups is written to log-output.
PDBFILE
Among other functions, this is the main input file for providing a starting conformation for a simulation. See below for details.
DCDFILE
See below.
XTCFILE
See below.
FRZFILE
See above.
PSWFILE
This keyword is relevant only when ENSEMBLE is set to either 5 or 6 (ensembles with fluctuating particle numbers). It provides the location of the file that specifies the particle types that are allowed to fluctuate, the numbers of particles of those types to initially include in the system, and the chemical potentials of each fluctuating particle type (see here).
REFILE
See below.
FEGFILE
This keyword lets the user specify name and location of the input file from which CAMPARI extracts which residues and/or molecules to subject to scaled interaction potentials with the rest of the system in free energy growth (ghosting) calculations.
DRESTFILE
See below.
TORFILE
See below.
POLYFILE
See below.
MOL2FILE
This keyword lets the user choose an input file containing a map annotating φ/ψ-space for polypeptides with canonical secondary structure regions. This mapping is used to perform segment-based analyses of polypeptide secondary structure. CAMPARI provides two such files already (in the data/ subdirectory). These and the files' format are explained in detail elsewhere.
GRIDDIR
This keyword sets the directory CAMPARI browses to find input files for grid-assisted sampling (see above). CAMPARI provides by default sample input files in $CAMPARI_HOME/data/grids/. The code assumes filenames to follow a systematic naming convention "xyz_grid.dat", where xyz is the lower-case, three-letter code of the standard 20 amino acids.
This functionality is de facto obsolete and should not be used. It may be removed entirely in the future.
This keyword sets the location and name of the file CAMPARI expects to read for the tabulation of Fourier-Bessel (Hankel) transforms. This is required for diffraction analysis and is normally contained in the CAMPARI data directory. Details on the input are found elsewhere.
PCCODEFILE
See below.
CFILE
See below.
CLUFILE
See below.
NBLFILE
See below.
NCDM_CFILE
See below.
Structure Input and Manipulation:
(back to top)
RANDOMIZE
This keyword determines the randomization aspects of initial structure generation. CAMPARI can generate default structures, completely random structures or, alternatively, use some or all available information from a structural input file. It contains a hard-coded database of required bond lengths, bond angles, and dihedral angles, which enables CAMPARI to construct molecules that it knows from scratch. The database is derived from high-resolution crystallographic structures of biomolecules (see for example the reference by Engh and Huber) and reference values of computational models (see, e.g., WATER3S_GEOM). The outcome of the initial structure generation is determined both by the choices below and by the choice for keyword PDB_READMODE. In general, the program employs a hierarchical procedure whereby stretches of the input sequence are randomized residue-by-residue or molecule-by-molecule. If no excluded volume term has been enabled (SC_IPP or SC_WCA), the randomization will almost certainly produce structures with steric clashes (the majority of energy terms are ignored; for example, it is not possible to implement excluded volume only by means of tabulated potentials and to rely on randomization to produce a clash-free initial structure). The only other, possibly relevant terms during randomization are boundary potentials, bonded potentials (like torsional potentials), and bias potentials acting on individual degrees of freedom (compartmentalization potentials, distance/position restraints, and torsional restraints). It is a general limitation that other stiff potentials (including bias potentials acting globally such as spatial density restraints) as well as suboptimal clash resolution can easily generate very large energies and forces for the initial structures produced by randomization. While not generally a problem in Monte Carlo runs, the large forces will make any gradient-based simulation immediately unstable. 
In these cases, the cleanest workaround is to set up a hybrid calculation (see DYNAMICS) that first runs a number of Monte Carlo steps large enough to resolve all large forces (see CYCLE_MC_FIRST) followed by a single, very long dynamics segment (keywords CYCLE_DYN_MIN and CYCLE_DYN_MAX should both be set to NRSTEPS-1). Alternatively, two separate calculations can be run with the Monte Carlo one being restarted as a gradient-based one with the help of keywords RESTART and RST_MC2MD. Because of the large forces, CAMPARI will exit during a randomization encountering an unresolvable clash for a pure gradient-based calculation (including minimization) unless UNSAFE is set to 1. The definition of "clash" is provided primarily by keyword RANDOMTHRESH.
Any possible randomizations proceed according to the following three-step hierarchy, although, depending on the choice for RANDOMIZE, not all steps are performed.
- If a structural input file was read and parts but not all coordinates of one or
more complex molecules were read, CAMPARI may generate random conformations for sidechains (i.e. short branches
off a main chain confined entirely to a single residue, which excludes crosslinks). This happens if one or more relevant
heavy atoms in the sidechain are missing from structural input. The step is desirable because sidechains constructed
using default dihedral angles are likely to create local but significant clashes.
The interactions the sidechain in question is subjected to are evaluated with respect to all residues read at least in
part from structural input and include an excluded volume bias (if either
SC_IPP or SC_WCA is turned on),
all enabled bonded potentials (SC_BONDED_B, etc.), any
possible boundary potentials, torsional bias potentials
Sidechain resampling treats every sidechain independently in the order they appear in the sequence by a Monte Carlo
minimization procedure. Importantly, some natively frozen degrees of freedom in supported residues (such as the out-of-plane torsion χ5
in arginine) can be moved during this procedure. CAMPARI will print, in the summary of the calculation, some information
as to how many residues the procedure was applied to (if nonzero).
At the end of the first stage, if performed (RANDOMIZE is 1 or 2), structural input has been augmented by missing sidechains, and the resultant conformations, aside from missing parts, should be free of major clashes that are not already present in the input pdb file. If this stage was skipped (RANDOMIZE is 0 or 3), missing sidechains are in default conformations, and clashes are extremely likely.
- If a structural input file was read and parts but not all coordinates of one or more complex molecules were read, CAMPARI may generate random conformations for any missing tails. This is fully supported only in conjunction with option 3 for PDB_READMODE because otherwise at most a single C-terminal tail in the last processed molecule is treated this way. Missing chain-internal residues are a separate problem and not dealt with. The tails are built in a systematic and hierarchical Monte Carlo minimization procedure starting from the residue closest to the part that was read in and proceeding towards the respective terminus. Different tails are processed in the order in which they appear in the sequence. During randomization, every tail interacts at most with those atoms read in from file and those having already been placed as part of tails occurring before in the sequence. The interactions are an excluded volume bias (if either SC_IPP or SC_WCA is turned on), all enabled bonded potentials (SC_BONDED_B, etc.), torsional bias potentials (SC_TOR), any possible boundary potentials, and any distance and position restraints (SC_DREST) that are meaningfully modified at this stage by the procedure. "Meaningfully modified" is a complicated criterion at this stage for distance restraints. For example, considering a distance restraint between an N-terminal and a C-terminal tail in the same molecule really requires optimizing them simultaneously which is not possible in the aforementioned hierarchical framework. 
Thus, the results are likely to be similarly suboptimal to those for chemical crosslinks, which are described next. If the tails contain residues participating in intramolecular crosslinks, these crosslinks will at best be satisfied approximately. During the randomization procedure, they are whenever possible implemented in the same way as during the final simulation, i.e., by means of bonded potentials (which are thus required for obtaining a meaningful result). This is the case if a crosslink exists entirely within the same tail or if it links a tail to part of the coordinates having been read in. The situation is more complicated if a crosslink links tails in different molecules or two different tails in the same molecule. In these cases, results may be entirely unsatisfactory because all the burden of achieving a "closable" conformation is deferred to the tail occurring later in the sequence whereas the tail occurring earlier ignores crosslink constraints entirely. At the end of the second stage, if performed (RANDOMIZE is 1 or 2), all molecules read in at least partially from structural input should be in a state that is defined by the input or built randomly in a way that is approximately free of intramolecular clashes. If RANDOMIZE is 1, and there is more than one such molecule, they should additionally be free of intermolecular clashes. If the second stage is not performed but eligible tails exist, they are simply constructed in default polymer conformations, disregarding any clashes or crosslinks. This is unlikely to be useful and will almost certainly cause an error unless the simulation starts with a sufficient number of Monte Carlo steps or uses only Monte Carlo steps.
- The third stage loops over all molecules and deals, for each molecule in this order, with internal
followed by external degrees of freedom. The internal conformation of molecules for which no structural input
was provided is constructed randomly unless RANDOMIZE is 0 or 3 or unless the molecule has no relevant
internal degrees of freedom. This random conformation is constructed as follows. For each residue, CAMPARI reserves
RANDOMATTS total attempts per residue and applies a threshold
penalty of RANDOMTHRESH kcal/mol. This penalty corresponds to
the required mean interaction energy per relevant (i.e., included by the short-range cutoff)
residue pair. The relevant energy terms are a possible excluded volume bias and any
torsional potentials but do not include boundary potentials
or other bonded potentials. Energies are evaluated for residue pairs involving the current
residue and all residues further toward the N-terminus of the stretch (already processed) and the single
residue immediately following in the stretch (not yet processed). If the sum of these energies plus the difference
in any torsional potential energies from the initial state is less than
the threshold, the algorithm proceeds to the next residue.
Thus, the excluded volume contribution is evaluated as its absolute value, which means that the threshold will have to depend
on the particular choice of Lennard-Jones parameters (→ PARAMETERS).
If the threshold criterion is passed, the calculation proceeds to the next residue.
This randomization occurs in three hierarchical phases (1/3 each of the total attempts per residue). In the first phase, only freely rotatable backbone angles (excluding all pucker and ω-angles) are considered, e.g., the φ/ψ-angles of polypeptides, or any backbone-like angles in unsupported residues. In the second phase, rotatable sidechain angles (excluding those in native CAMPARI residues that are frozen by default) of the current residue are added to the set as well. In the third phase, all aforementioned degrees of freedom for the residue immediately prior in the sequence are added. It is obvious that even with all 3 phases triggered, a stretch may be "stuck", e.g. fold back onto itself, thus requiring a completely new solution. Resolving such situations is not supported as this would lead to an uncontrollable runtime. Instead, the energetically most favorable conformation of the sampled ones is picked and a warning or error (depending on keyword DYNAMICS) is produced.
For molecules free of internal crosslinks, the stretch considered is the entire molecule. If there are internal crosslinks, the molecule is divided hierarchically into stretches. Stretches under no constraint are processed sequentially (from N- to C-terminus). Crosslink-constrained stretches are parsed, and CAMPARI tries to find a hierarchical order starting with the "innermost" stretches in the hope of arriving at a solution that is both clash-free and satisfies all intramolecular crosslinks exactly. Priority is given to crosslinks over clashes because it is very easy to arrive at structures that are more or less clash-free and have a dramatically perturbed crosslink geometry that cannot relax properly without a complete reorganization of the molecule. This procedure can be very slow because RANDOMATTS attempts per residue are used to construct a potentially closable stretch conformation (using an empirical bias in addition to the aforementioned potentials) followed by an energetic evaluation of all identified loop closures. If the solutions have too many clashes or none are found, the entire cycle is repeated up to RANDOMATTS times. In the case of coupled crosslinks (for example, nested or staggered), it remains likely that the hierarchical procedure encounters a dead end and exits with a warning or error as before.
If successful, at the end of this step, the molecule processed should be in one of four possible states:
- In a newly generated, random, and clash-free conformation that satisfies any intramolecular crosslinks exactly (RANDOMIZE is 1 or 2 and no structural input was provided).
- In a conformation partially or fully supplied by structural input with any tails randomized (RANDOMIZE is 1 or 2 and structural input was provided, see above).
- In a conformation partially or fully supplied by structural input with any tails in default conformations (RANDOMIZE is 0 or 3 and structural input was provided, see above).
- In its default conformation, which is generally not clash-free, with all (if any) intramolecular crosslinks broken (RANDOMIZE is 0 or 3 and no structural input was provided).
If the molecule is not connected to another molecule occurring earlier in the sequence by an intermolecular crosslink, the procedure is simple. There is only a single phase with the same number of total attempts (now per molecule). Energies are evaluated in pairwise fashion for all molecules occurring prior in sequence input vs. the current molecule. As before, the computed energy is taken as the mean interaction energy per relevant (i.e., included by the short-range cutoff) residue pair. The relevant energy terms are a possible excluded volume bias, all enabled bonded and torsional bias potentials (although they matter only for intermolecular crosslinks), any boundary potentials, and all relevant position and intermolecular distance restraints all of which are taken as absolute values. The step ends as soon as the computed mean interaction energy is below the specified threshold. The only additional complexity is provided by keyword RANDOMBURIAL, which only affects this stage and case. If the specified value is smaller than 1.0, the randomized molecule is a single-residue molecule (most often an ion or water), and partial structural input is present, then an additional effective energy is added that tries to forcefully avoid the randomization to place the molecule on the "inside" of structurally resolved parts. The most common scenario is with solvating macromolecules supplied in a well-defined starting structure with explicit water molecules. In this case, it can happen that, due to the density, positions in small pockets inside the macromolecules appear as favorable sterically as the remaining void spaces in the box, and RANDOMBURIAL is meant to protect from that.
Conversely, a molecule bound by an intermolecular crosslink to a molecule earlier in the sequence may not be placed freely. CAMPARI analyzes intermolecular crosslinks to determine a hierarchy that prioritizes molecules with more crosslinks (these should ideally be placed as early as possible in the sequence). While the higher-priority molecule is placed freely, a lower-priority molecule instead satisfies the crosslink exactly, and the molecule is placed with the only randomization coming from the central 3 dihedral angles of the actual crosslink. This may require deferring this step until the higher-priority molecule has been placed. Depending on the conformations of the involved molecules determined previously, this can easily lead to an unresolvable clash, which, as before, will be reported as a warning or error. Notably, there is no mechanism in place to displace molecules that have already been placed in previous iterations of the loop. This means that it is advantageous to place molecules joined by an intermolecular crosslink directly adjacent in the sequence file. If a molecule is crosslinked to more than one molecule occurring earlier in the sequence, a warning is produced and at most one of these crosslinks is respected in the generated conformation. Intermolecular crosslinks will at best be satisfied approximately in such a case (the same is true if only the position of a lower-priority molecule in a crosslinked pair is determined by structural input).
At the end of the second step of the third stage, the molecule should be placed randomly in the simulation container without clashing with any molecules occurring earlier in the sequence. It should also satisfy intermolecular crosslinks to at most one molecule occurring earlier in the sequence. Note that even if PDB_READMODE is 3, it is not possible for a molecule requiring random placement to precede, in sequence input, a molecule placed based on structural input. The only molecules not placed randomly at the end of this stage are those read from structural input if RANDOMIZE is 0 or 3. To continue, the third stage proceeds to the next molecule until there are no unprocessed molecules left. The placement of molecules is very unlikely to be clash-free if the density is high (for example, liquid water).
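The per-residue logic of the third stage (a fixed attempt budget split over three phases, a threshold applied to the mean energy per relevant residue pair, and a fallback to the best sampled conformation) can be sketched in Python as follows. This is a schematic illustration only: the function name and the `sample_energy` callback, which stands in for sampling a trial conformation and evaluating its penalty energy, are hypothetical and not part of CAMPARI.

```python
def randomize_residue(sample_energy, n_pairs, max_attempts, threshold, rng=None):
    """Sketch of threshold-based randomization for a single residue.

    sample_energy: hypothetical callback (phase, rng) -> total penalty
                   energy of one trial conformation; phase 1 samples
                   backbone angles, phase 2 adds sidechain angles, and
                   phase 3 adds the preceding residue.
    n_pairs:       number of relevant residue pairs (must be positive).
    max_attempts:  total attempt budget (the role played by RANDOMATTS).
    threshold:     mean energy per pair to reach (the role of RANDOMTHRESH).
    Returns (energy, phase, passed) for the conformation that is kept.
    """
    per_phase = max(1, max_attempts // 3)  # one third of the budget each
    best = None
    for phase in (1, 2, 3):
        for _ in range(per_phase):
            energy = sample_energy(phase, rng)
            # The total energy decides which trial is the best so far.
            if best is None or energy < best[0]:
                best = (energy, phase)
            # The threshold is tested on the mean energy per residue pair.
            if energy / n_pairs < threshold:
                return energy, phase, True
    # No trial passed: keep the most favorable one and flag the failure,
    # which corresponds to the warning or error mentioned above.
    return best[0], best[1], False
```

The split between "total energy picks the best trial" and "mean energy per pair is tested against the threshold" follows the description given for RANDOMTHRESH.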
- Minimal randomization is performed. It is the same as option 1 below if a structural input file is given that provides coordinates for all parts of the system. It is the same as option 3 if no structural input file whatsoever is provided. With this option, any missing tails in molecules with partial input are built in default conformations. No intramolecular crosslinks missing from structural input are satisfied initially. All molecules missing from structural input are built in default conformations and placed randomly in the box. Intermolecular crosslinks are satisfied as possible (see above).
- Supplementary randomization is performed, which is the default. This option is the same as option 0 if a structural input file is given that provides coordinates for all parts of the system. It is the same as option 2 if no structural input file whatsoever is provided. With this option, any missing tails in molecules with partial input are built in random, clash-free conformations that satisfy any crosslinks at best approximately. All molecules missing entirely from structural input are built in random conformations that satisfy intramolecular crosslinks exactly, and are placed randomly in the box. Intermolecular crosslinks are satisfied as possible (see above).
- This is the same as option 1 above only that the rigid-body coordinates are randomized even for those molecules read fully or at least partially from the structural input file. This option is the same as option 3 if a structural input file is given that provides coordinates for all parts of the system. This option can be useful for generating random starting structures for studies of the assembly of a protein complex from rigid components.
- This is the same as option 0 above only that the rigid-body coordinates are randomized even for those molecules read fully or at least partially from the structural input file. This option is the same as option 2 if a structural input file is given that provides coordinates for all parts of the system.
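The four options (0 to 3, in the order listed) can be summarized by which randomization steps they trigger. The following Python sketch encodes that summary; the function name, the flag names, and the three-way classification of structural input as "none", "partial", or "full" are assumptions made for illustration, and many details (crosslinks, PDB_READMODE) are ignored.

```python
def randomize_plan(option, input_coverage):
    """Summarize which randomization steps a RANDOMIZE option triggers.

    option:         0, 1, 2, or 3.
    input_coverage: "none", "partial", or "full" (hypothetical labels for
                    how much of the system structural input provides).
    """
    if option not in (0, 1, 2, 3):
        raise ValueError("option must be 0, 1, 2, or 3")
    if input_coverage not in ("none", "partial", "full"):
        raise ValueError("unknown input coverage")
    return {
        # Stages one and two: rebuild missing sidechains and tails randomly.
        "rebuild_missing_parts": option in (1, 2) and input_coverage == "partial",
        # Stage three: random internal conformations for molecules that
        # are absent from structural input.
        "random_internal_for_missing": option in (1, 2) and input_coverage != "full",
        # Rigid-body coordinates of molecules read from input are randomized.
        "randomize_rigid_body_of_input": option in (2, 3) and input_coverage != "none",
    }
```

With this encoding, the equivalences stated in the list hold: options 0 and 1 coincide for full input, as do options 2 and 3, while options 0 and 3 coincide without any input, as do options 1 and 2.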
It is a very important restriction that initial structure randomization does not observe user-level constraints. In order to have a degree of freedom, which is accessible to randomization, start out in a well-defined state, randomization of the corresponding class of degrees of freedom must be disabled entirely (in which case the initial state comes either from the CAMPARI default or - more likely - from structural input). This is somewhat similar to the limitation regarding restraint potentials before. Stiff potentials acting globally such as polymeric biases, density restraints, and secondary structure biases are ignored by the initial structure generation. This means that a (partly) random structure can be clash-free, can satisfy all crosslinks, yet can still be subjected to very large forces initially. Performance-wise, the initial structure randomization procedure (like all initial setup) is unaware of the shared memory (OpenMP) parallelization of CAMPARI. Computationally, it may thus be more efficient to use an explicit Monte Carlo simulation to generate initial structures in comparable fashion (although the overall cost is often negligible regardless).
In general, the importance of initial structure randomization lies in avoiding initial structure biases that may be difficult to detect. Alternative procedures found in the literature often use simple reference states (fully extended chains) or results from high-temperature runs of an experimentally determined structure. With these approaches, it is quite difficult or even impossible to rigorously assert that the final results are not subtly influenced by the choice of starting conformation(s). Conversely, the random structures generated by CAMPARI are usually so independent from each other that the convergence of results is a good indicator of statistical error at the level of the chosen analysis. This is not to say that they are not biased by the simplified Hamiltonian used to construct them, which they of course are. Intramolecular crosslinks in particular generate constraints that can restrict the available space down to just a few "clusters" of solutions, and CAMPARI's hierarchical procedure may well pick with a strong bias from this set. Disregarding crosslinks, the excluded volume-centric Hamiltonian used in randomization will differ from the one used in the actual production simulations. This in itself is a bias of course. In most cases, the production Hamiltonian differs dramatically, in particular since it will generally contain net attractive potentials. This means that the beginning of a simulation corresponds to a quench/relaxation scenario, similar to what is seen experimentally in temperature-jump experiments or computationally in methods like simulated annealing. This instantaneous quench, broadly speaking, leads to the sampler taking the system to configurations, which are (more) compliant with the production Hamiltonian and most easily accessible from the starting configuration. Here, "most easily accessible" depends critically on the chosen sampler. 
Thus, memory can be retained and errors can be masked if the starting structures are not completely independent. It should be kept in mind that it is, outside of trivial cases, never possible to know a priori the subspace of configurations relevant to the system under the chosen conditions as this is usually the question one tries to answer by means of simulations. Errors pertaining to a lack of exhaustiveness can therefore not be diagnosed or understood based on randomized starting structures. For this, a robustness across different samplers and simulation lengths is advisable.
Note that in replica exchange or MPI averaging runs, all replicas will start from different conditions unless RANDOMSEED is given explicitly by the user.
If any type of initial structure randomization is requested, this keyword sets the general number of maximum attempts in randomizing the permissible degrees of freedom for a single residue or molecule. Large numbers (> 10000) may produce unacceptably slow performance when trying to randomize a long, complex polymer and/or a dense fluid. Large numbers in conjunction with too small a threshold can also be counterproductive. This is because this scenario corresponds to a hierarchical minimization and thus may limit the search space for dependent elements of the hierarchy. This problem can be exacerbated for intramolecular constraints, in particular coupled ones.
RANDOMTHRESH
If any type of initial structure randomization is requested, this keyword sets the universal energy threshold to be applied with respect to energetic penalties for excluded volume, boundary potentials (rigid-body only), and bonded terms (e.g., torsional potential energies). Roughly speaking, for every residue or molecule being processed, there will be a given number of interacting residues (depending on the short-range cutoff). While the total energy is used to pick the best current solution, the threshold is evaluated against a mean interaction energy per residue pair, and this is the value specified by RANDOMTHRESH (in kcal/mol). Different terms contribute for different stages of randomization as described above. All these terms are pure penalty terms and cannot yield negative energies. Specifying small values for the threshold will generally yield lower starting energies because they make the procedure more minimization-like. There is a caveat in that parts of the randomization procedure are hierarchical, i.e., the solvability of a subproblem may depend on the solution of a previous problem. Since the algorithm has very limited capability to "go back," a well-minimized result for a particular task may actually prevent subsequent problems from being solved satisfactorily. It is thus recommended to keep the threshold large enough and the number of attempts small enough that solutions remain diverse.
RANDOMBURIAL
If a combination of initial structure randomization, structural input, and sequence input is used that leaves one or more small (single-residue) molecules/atoms with undefined rigid-body coordinates, then this keyword might become relevant. It sets a threshold in the degree of (normalized) buriedness, estimated with the help of accessible nearby volumes, for such a small molecule to tolerate (1.0 means fully buried, 0.0 means free of neighbors). The burial, and this is the key aspect, is only evaluated with respect to parts of the system that have been read in from structural input.
The normal use of this keyword, which defaults to 0.4, is to prevent the randomizer from placing water molecules on the inside of macromolecules. Whenever the density is high, steric reasons fail to provide a clear priority for placing them in the bulk, so frequently water molecules or monoatomic ions end up in very small spaces that they cannot escape from. While not all of these cavities are necessarily erroneous, an erroneously trapped small molecule can have dramatic consequences for the stability of the macromolecule in subsequent simulations. A complementary and/or alternative approach is to retain so-called "crystal waters," i.e., water molecules that are structurally resolved, in the input structure (see elsewhere).
Technically, the keyword adds an artificial and very large penalty term to any randomization attempt that leads to a violation of this burial rule. Supplying a value of 1.0 means that all burial levels are accepted, so the function is disabled (the default prior to version 4.1 of CAMPARI). Supplying values smaller than 0.1 is prohibited to avoid erroneous burial signals from creating misleading results. The method depends on keyword SAVPROBE for the burial calculation: larger values increase the range of what is considered for burial and lead to more spatial averaging.
If a small molecule screen is performed, the system consists of a fixed and a variable part. While keyword RANDOMIZE describes the initial randomization of the fixed part (and, in virtually all circumstances, this should really be fixed in a small molecule screen, so RANDOMIZE should be "0"), MOL2RANDOMIZE controls how the input coordinates found in the main input file of the screen are treated:
- The coordinates of the small molecules are not randomized. This is a good option to use if alignment to a tether group defined in a reference molecule occurs but can, depending on the sampler, lead to biases if this is not the case. Such bias might stem from absolute positions in the main input file that are systematic (which is frequently the case).
- The rigid-body coordinates of the small molecules are randomized if and only if no alignment to a tether group defined in a reference molecule occurs. This is the most common and default option.
- The rigid-body coordinates of the small molecules are randomized. This happens after any possible alignment to a tether group defined in a reference molecule (otherwise it would be ineffectual). The decisions made by the matching algorithm (see MOL2RMSDTHRESH) are not affected by this. Randomization does not occur if MOL2DRESTMODE is 1 because this might imply different Hamiltonians for different tetherings. This combination of options should be avoided, i.e., if MOL2DRESTMODE is 1, MOL2RANDOMIZE should be 0 or 1.
- The internal conformations (conformers) of the small molecules are randomized. This should only be used in conjunction with proper bonded potentials acting on dihedral angles. Most curated libraries of small molecules contain only sterically feasible conformers, and conformer randomization should not be necessary. This option should only be used in very particular circumstances.
- Both the internal conformations (conformers) and rigid-body coordinates of the small molecules are randomized. This should only be used in conjunction with proper bonded potentials acting on dihedral angles.
To help with efficiency, the rigid-body randomization of the small molecules will generally try to refer to a subvolume of the chosen simulation container. If position and/or distance restraints are found, this subvolume is defined entirely by these restraints plus a buffer parameter that is at least 10Å and at most twice the maximum extent of the cuboid defined by the input conformer. Distance restraints with large target distances should be avoided for this to work efficiently. If no such restraints are found, the fixed part of the system is measured, and the buffered volume is derived from this extent instead. If neither is present, randomization defaults to the normal container defined by SHAPE and BOUNDARY. Using periodic boundary conditions or using restraints and/or a fixed system that favors positions outside of the container might lead to unwanted consequences in a small molecule screen when used in conjunction with options 2, 4, and possibly 1 above. These result from the randomization volume no longer being a proper subvolume.
The logic behind the idea of focusing the randomization of small molecule positions is to enable users to define a spatially large simulation container without relying on excessively large numbers for RANDOMATTS to achieve compliance with restraints and/or "find" the receptor. Note that RANDOMTHRESH should be set to a relatively small value for restraints to matter sufficiently during randomization.
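The interplay between RANDOMATTS and RANDOMTHRESH mentioned here can be pictured with a small Python sketch. It is only an illustration of the behavior described in the text (the best candidate is chosen by total penalty energy, while the stopping threshold is compared to a mean energy per interacting pair); the function name and its arguments are hypothetical and do not correspond to CAMPARI routines.

```python
def randomize_element(propose, energy, n_pairs, max_attempts, threshold):
    """Toy randomization loop for one residue or molecule (hypothetical helper).

    propose      -- callable returning a random candidate configuration
    energy       -- callable returning a non-negative penalty energy (kcal/mol)
    n_pairs      -- number of interacting residue pairs for this element
    max_attempts -- role of RANDOMATTS in this sketch (must be >= 1)
    threshold    -- role of RANDOMTHRESH in this sketch (kcal/mol per pair)
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    best, best_energy = None, float("inf")
    attempts_used = 0
    for attempts_used in range(1, max_attempts + 1):
        candidate = propose()
        e_total = energy(candidate)
        # The best current solution is picked by total energy ...
        if e_total < best_energy:
            best, best_energy = candidate, e_total
        # ... while the threshold is tested against the mean energy per pair.
        if best_energy / max(n_pairs, 1) <= threshold:
            break
    return best, best_energy, attempts_used
```

In this picture, a small threshold combined with a large number of attempts makes the loop behave like a minimizer, which is the regime the text warns about for hierarchical problems.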
This keyword provides the (base)name and location of a structural input file in pdb convention. This can either be a pdb trajectory (for analysis) or, more commonly, the intended (partial) starting conformation of the system. The two interpretation modes are switched based on keyword PDBANALYZE and have different requirements. General and specific formatting information for pdb files (which also apply to keyword PDB_TEMPLATE) are given in the corresponding input file documentation. The parsing of pdb files depends on a number of auxiliary keywords, specifically PDB_READMODE, PDB_HMODE, PDB_INPUTSTRING, PDB_TOLERANCE_A, PDB_TOLERANCE_B, PDB_R_CONV, and PDB_MPIMANY.
If trajectory analysis mode is enabled, CAMPARI will interpret the input to this keyword either as a pdb trajectory file (using the MODEL/ENDMDL records) or as the first in a series of systematically numbered files (→ PDB_FORMAT). For the former, the MODEL/ENDMDL syntax is checked and has to be interpretable (the actual numbering on the MODEL line is ignored, however). For the latter, a systematic numbering scheme is inferred from the provided file name (based on plain numbers, or numbers with leading zeros). In this scenario, the first of such files should be provided; CAMPARI will then try to extract the numbering scheme and open NRSTEPS-1 consecutive snapshots. Note that in this mode the filename must not contain any additional numeric characters (i.e., foo_001.pdb is permissible while ala7_001.pdb is not). To choose between single-file and multiple-file formats, keyword PDB_FORMAT is used. If the set of numbered files or the trajectory does not provide enough snapshots to satisfy the selected value for NRSTEPS, CAMPARI will either terminate with an error (if any MPI parallel execution mode is used) or dynamically adjust the run length (if serial or OpenMP-only code is used). 
The latter can be confusing and may produce nonsensical output from built-in analysis routines (if the run is shortened enough to effectively disable an analysis that would have been enabled given NRSTEPS, it is not guaranteed that all output from this analysis is correctly suppressed).
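To check whether a set of snapshot files follows a numbering scheme of the kind described above before starting an analysis run, a small Python sketch such as the following can be used. It is not CAMPARI's own parser: the function name is hypothetical, and the rule that a field starting with "0" is zero-padded to fixed width is an assumption made for illustration.

```python
import re
from pathlib import Path


def numbered_series(first_file, nrsteps):
    """List nrsteps file names, starting from first_file (hypothetical helper)."""
    path = Path(first_file)
    fields = re.findall(r"\d+", path.stem)
    # The text requires that the name contains no additional numeric characters.
    if len(fields) != 1:
        raise ValueError(
            f"expected exactly one numeric field in {path.name!r}, found {len(fields)}"
        )
    field = fields[0]
    start = int(field)
    # Assumption: a leading zero signals zero-padding to a fixed width.
    width = len(field) if field.startswith("0") else 0
    prefix, suffix = path.stem.split(field, 1)
    names = []
    for i in range(nrsteps):
        number = str(start + i).zfill(width)
        names.append(str(path.with_name(f"{prefix}{number}{suffix}{path.suffix}")))
    return names
```

For example, `numbered_series("foo_001.pdb", 3)` lists `foo_001.pdb`, `foo_002.pdb`, and `foo_003.pdb`, whereas `ala7_001.pdb` is rejected because its name contains a second numeric field.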
Since the point here is to analyze a given trajectory, CAMPARI expects the specified sequence input to match exactly what is present in the pdb input file. This does not necessarily require that all atoms in every residue are read successfully, but it does require that all residues are found. The use of pdb files with atoms that were not read successfully or are missing can of course be very confusing depending on the types of analyses to be performed. This is because the positions of these atoms will be reconstructed, i.e., some of the coordinates entering analysis may be derived, arbitrary, or, in the worst case scenario, numerically ill-defined. While possibly tolerated by CAMPARI, supplying a pdb trajectory (or set of files) that do not all contain the exact same set of atoms with the exact same names should be avoided. The total number of rebuilt atoms will be reported at the end of the run to log output. Mismatches between CAMPARI's representation of a residue and what is present in the pdb file may be circumvented with keyword UAMODEL and can always be masked by renaming them to be recognized as unsupported residues as demonstrated in Tutorial 10. Lastly, if the pdb files (or trajectory) encode fluctuating/changing volumes, this can be accounted for through keyword ENSEMBLE by choosing to assume NPT conditions. CAMPARI will then read and recompute box parameters directly from the file.
The more common use of this keyword is for CAMPARI to attempt to read an external file to construct an initial nonrandom conformation for the system. Depending on the setting for RANDOMIZE, only some of the information may be used. Naturally, the system (sequence) in the pdb file has to be at least partially consistent with the choices made via SEQFILE. Note that parallel runs can use multiple input structures (→ PDB_MPIMANY). In particular, CAMPARI will not reorder atoms or residue blocks in the pdb except for very specific exceptions. In a box with a protein, solvent, and ions, it is therefore necessary that the order of the components in sequence input is the same as in pdb input. If not, most of the system information will be discarded. The input file can be processed with varying degrees of leeway and two different paradigms (both depend on the choice for PDB_READMODE).
Note that it is not possible to directly start a simulation from a structure provided in a binary trajectory file format. In this case, however, CAMPARI can be used to extract a suitable pdb file from the trajectory with the help of keywords PDBANALYZE, XYZPDB, XYZOUT, XYZMODE, and - for example - DCDFILE.
Lastly it is important to mention that PDBFILE provides some functionality that is overlapping with that provided by PDB_TEMPLATE. Specifically, runs containing residues not natively supported by CAMPARI require the topology of those moieties to be inferred from file. If an analysis run operates on a single pdb file, a trajectory file in pdb format or a series of pdb files, or if a simulation run is supposed to start from a specific structure supplied via PDBFILE, then PDBFILE can (but need not) serve the function of topology inference as described for PDB_TEMPLATE. Conversely, PDBFILE never replaces the function of PDB_TEMPLATE in the other contexts it is relevant in, viz., to provide a map from binary trajectory input file to CAMPARI, and to serve as a reference structure for alignment.
This is only relevant if PDBANALYZE is true: It then specifies name and location of the trajectory (xtc format) to analyze. Like all other xtc-related options, this is only available if the code was in fact compiled and linked with XDR support (→ installation instructions). Keywords PDB_TEMPLATE and PDB_ATOMMAP offer instructions on how to convert binary trajectory files with non-CAMPARI atom order. If the analysis run is parallel (→ REMC), an example is given elsewhere. Because binary trajectory files are not annotated, many of the above formatting options apply, at most, to the template. Specifically, keywords PDB_READMODE, PDB_HMODE, PDB_TOLERANCE_A, PDB_TOLERANCE_B, PDB_R_CONV, and PDB_NUCMODE are all irrelevant for the processing of the actual information in the xtc file whereas XYZ_FORCEBOX is respected. Should the data in the trajectory file be corrupted or exhausted before NRSTEPS snapshots have been read successfully, CAMPARI will either terminate with an error (if any MPI parallel execution mode is used) or dynamically adjust the run length (if serial or OpenMP-only code is used). Binary xtc files have a header section for each snapshot that specifies box coordinates, the number of atoms, and additional information. The number of atoms is always checked against the current system, and there is no tolerance mechanism for mismatches. If the box vectors (stored as a 3x3 matrix) encode fluctuating/changing volumes, they are important but will only be read if keyword ENSEMBLE is set to 3 (NPT conditions). CAMPARI will then read the box parameters directly from the file. This will only work safely for cuboids and triclinic boxes.
DCDFILE
Analogous to XTCFILE, this keyword is only relevant if PDBANALYZE is true: It then specifies name and location of the trajectory (dcd format, CHARMM-style) to analyze. Keywords PDB_TEMPLATE and PDB_ATOMMAP offer instructions on how to convert binary trajectory files with non-CAMPARI atom order. Binary dcd files have a single header section at the beginning of the file that specifies several control parameters including the number of atoms. The number of atoms is always checked against the current system, and there is no tolerance mechanism for mismatches. If the box vectors (stored as 6 floating point numbers) encode fluctuating/changing volumes, they are important but CAMPARI will attempt to read them only if keyword ENSEMBLE is set to 3 (NPT conditions). This can at best work safely for cuboids and triclinic boxes. Note, however, that dcd files have a poorly (read: no) defined standard for how exactly this information is stored, and the assumption that the 6 numbers are 3 vector lengths (abc) and 3 angles (bc ac ab) might not always hold. Using only 6 values implies not only that an absolute origin is undefined (a cosmetic issue in three-dimensional PBC but a problem otherwise) but also that the box vectors follow a specific convention. This convention normally states that the first two vectors lie in the xy-plane and that all three form a right-handed coordinate system. For triclinic boxes, this is not automatically true in CAMPARI. To achieve output that is compliant with this convention irrespective of choices for BOXVECTOR1, BOXVECTOR2, and BOXVECTOR3, there is an auxiliary keyword BOXROTATE.
NETCDFFILE
Analogous to XTCFILE, this keyword is only relevant if PDBANALYZE is true: It then specifies name and location of the trajectory (NetCDF format) to analyze. Like all other NetCDF-related options, this is only available if the code was in fact compiled and linked with NetCDF-support (→ installation instructions). Keywords PDB_TEMPLATE and PDB_ATOMMAP offer instructions on how to convert binary trajectory files with non-CAMPARI atom order. Unlike xtc or dcd files, NetCDF files do not need to be parsed sequentially and are partially annotated. CAMPARI thus determines immediately whether NRSTEPS snapshots are present in the file. If not, CAMPARI will adjust this number to the available one. If any MPI parallel execution mode is used, this is the minimum across all MPI processes (which read different files). Binary NetCDF files encode a well-defined standard. For trajectory data of particle systems, CAMPARI resorts to the standard developed for the AMBER program suite described elsewhere both in writing and in reading. Because NetCDF files do not need to be processed sequentially, they offer an additional benefit of analyzing snapshots in a specific order that is not the same as the original trajectory (→ FRAMESFILE for details).
As for dcd files, the box vectors are stored as 6 floating point numbers that can encode fluctuating/changing volumes. CAMPARI will attempt to read them only if keyword ENSEMBLE is set to 3 (NPT conditions). This can at best work safely for cuboids and triclinic boxes. The 6 numbers are 3 vector lengths (abc) and 3 angles (bc ac ab) stored in separated fields. Using only 6 values implies not only that an absolute origin is undefined (a cosmetic issue in three-dimensional PBC but a problem otherwise) but also that the box vectors follow a specific convention. This convention normally states that the first two vectors lie in the xy-plane and that all three form a right-handed coordinate system. 
For triclinic boxes, this is not automatically true in CAMPARI. To achieve output that is compliant with this convention irrespective of choices for BOXVECTOR1, BOXVECTOR2, and BOXVECTOR3, there is an auxiliary keyword BOXROTATE.
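As an illustration of the 6-number convention described for dcd and NetCDF files, the following Python sketch converts three lengths and three angles into box vectors with the first vector along x, the second in the xy-plane, and all three right-handed. It assumes that angles are given in degrees and in the order (bc, ac, ab); the function name is hypothetical, and this is a textbook construction rather than CAMPARI's own routine.

```python
import math


def box_vectors(a, b, c, angle_bc, angle_ac, angle_ab):
    """Build three box vectors from lengths and angles (hypothetical helper).

    angle_bc, angle_ac, angle_ab are the angles (in degrees, by assumption)
    between the vector pairs (b, c), (a, c), and (a, b), respectively.
    """
    alpha, beta, gamma = (math.radians(x) for x in (angle_bc, angle_ac, angle_ab))
    # First vector along x, second vector in the xy-plane.
    v1 = (a, 0.0, 0.0)
    v2 = (b * math.cos(gamma), b * math.sin(gamma), 0.0)
    # Third vector from the two remaining angle conditions and its length.
    cx = c * math.cos(beta)
    cy = c * (math.cos(alpha) - math.cos(beta) * math.cos(gamma)) / math.sin(gamma)
    cz_squared = c * c - cx * cx - cy * cy
    if cz_squared <= 0.0:
        raise ValueError("lengths and angles do not describe a valid box")
    # A positive z-component makes the three vectors right-handed.
    v3 = (cx, cy, math.sqrt(cz_squared))
    return v1, v2, v3
```

Because only lengths and angles are stored, any rigid rotation of the original box yields the same 6 numbers, which is why a fixed orientation convention is needed when the vectors are rebuilt.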
This keyword is only relevant if PDBANALYZE is true. It then specifies the name and location of a simple input file that is meant to provide a map of atom indices from CAMPARI's internal order to that found in the (usually) binary trajectory file (→ NETCDFFILE, etc.). The normal way to address this problem is through a pdb template. In rare circumstances, however, the binary trajectory might be constructed in a way that is incompatible with normal pdb files, e.g., by not having atoms belonging to the same residue be consecutive or by having residues arranged in a way that does not match the normal sequence direction. This is where this keyword can be used. Its format is a single column of integer values covering all values from 1 to n where n is the number of atoms in the system. Any violation in this regard will lead to CAMPARI terminating with an error. The input format is the same as that for other atom index sets. As an example, if the first atom in CAMPARI order is actually the 7th atom in the binary trajectory, the first line should read "7" in this file.
FRAMESFILE
If PDBANALYZE is true, it is possible for CAMPARI to analyze a specific set of frames from the trajectory file (see PDB_FORMAT) rather than the entire trajectory. It is also possible to give every analyzed snapshot a sampling weight, and both functionalities are implemented by this keyword. Example applications are the extraction of structural clusters from a trajectory or the reweighting of biased simulations.
Most input trajectories currently need to be processed sequentially (this applies to xtc, dcd, and pdb trajectory files, i.e., PDB_FORMAT is 1, 3, or 4). For these, the list of requested frames is sorted first, and duplicates are removed. This means that any newly written trajectory files (→ XYZOUT) will have exactly the same order of snapshots as the input. Conversely, the snapshots encoded in individual pdb files and NetCDF trajectory files (PDB_FORMAT is 2 or 5) can be accessed in arbitrary order. For these two settings, the frames file is processed "as is" unless there are floating point weights per snapshot or unless this is a parallel trajectory analysis run. Frames files processed "as is" have the advantage that they can arbitrarily reorder and duplicate individual simulation snapshots, which is relevant, for example, in the construction of synthetic trajectories.
It is important to note that the settings for NRSTEPS and EQUIL and all related frequency settings for analysis routines (see corresponding section) lose their straightforward interpretations if not all snapshots in the original trajectory are processed exactly once and in sequence. For the case of a processed frames file (sorted and free of duplicates), the analysis frequencies will still refer to the original, full trajectory file. This means that CAMPARI will read all frames sequentially and increment step counters accordingly. However, all the frames that are not part of the list are simply skipped. This implies that it is possible for a selection of 20 frames from a larger trajectory to fail to produce any output for polymeric quantities if POLCALC is set to 10, 5, or even 2 (simply on account of chance). It will therefore generally be easier to set such frequency flags to 1 if processed frame lists are used (this is the only setting that guarantees that the number of analyzed snapshots will be exactly proportional to the size of the list). Conversely, for a frames file used "as is," the unused frames are never read and no step counters are incremented. This means that the effective step becomes the processing of the frames file itself. Returning to the above example, a selection of 20 (possibly duplicated) frames from a larger trajectory will in this case always produce output for polymeric quantities if POLCALC is set to any value of 20 or less.
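The difference between the two step-counting schemes can be made concrete with a short Python sketch. It is a simplified model of the behavior described above, not CAMPARI code: the function name is hypothetical, equilibration (EQUIL) is ignored, and it is assumed that an analysis fires whenever the step counter is a multiple of the chosen frequency.

```python
SEQUENTIAL_FORMATS = {1, 3, 4}     # pdb trajectory, xtc, dcd (PDB_FORMAT values from the text)
RANDOM_ACCESS_FORMATS = {2, 5}     # individual pdb files, NetCDF


def analyzed_frames(frames, pdb_format, calc_freq, has_weights=False, parallel=False):
    """Return the frames that reach an analysis with frequency calc_freq."""
    if pdb_format not in SEQUENTIAL_FORMATS | RANDOM_ACCESS_FORMATS:
        raise ValueError("format not covered by this sketch")
    as_is = pdb_format in RANDOM_ACCESS_FORMATS and not has_weights and not parallel
    if as_is:
        # Step counter is the position within the frames file itself.
        return [f for step, f in enumerate(frames, start=1) if step % calc_freq == 0]
    # Processed list: sorted, duplicates removed; the step counter is the
    # frame index in the original, full trajectory.
    return [f for f in sorted(set(frames)) if f % calc_freq == 0]
```

With a list of 20 odd-numbered frames, the processed variant and a frequency of 2 analyze nothing, whereas the "as is" variant with a frequency of 20 analyzes exactly the last listed frame. This mirrors the POLCALC example in the text.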
As mentioned above, the frames file allows the user to alter the type of averaging that is normally assumed for CAMPARI analysis functions. By default, each data point (trajectory snapshot) contributes the same weight to computed averages or histograms (distribution functions). This implies that the input trajectory conforms to (was sampled from) the distribution and ensemble of interest. If, however, the input trajectory does not correspond to a well-defined ensemble (or corresponds to a different one), it is common and possible to apply snapshot-reweighting techniques based on analyses of system energies or coupled parameters using weighted histogram methods. The result is a set of weights for each snapshot, which allows simulation averages and distribution functions to conform to that target distribution and ensemble. As an example, one may combine all data from a replica-exchange run (that no longer conform to a canonical ensemble at a given temperature), use a technique such as T-WHAM to derive a set of snapshot weights for a target temperature that was not part of the replica-exchange set, and then use this input file containing the weights to compute proper simulation averages at the target temperature.
The input file for this functionality is very simple and explained elsewhere. There are three important points of caution. First, floating-point weights imply that floating-point precision errors may occur. The implied summation of weights of very different sizes may then become inaccurate. CAMPARI provides a warning if it expects such errors to be large (based purely on the weights themselves). Second, snapshot weights do not influence the values reported for instantaneous output such as POLYMER.dat or for analyses that do not imply averaging (such as structural clustering). Third, reweighting techniques have associated errors that are difficult to predict. Simultaneous assessment of statistical errors via block averaging or similar techniques is therefore strongly recommended.
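The first point of caution can be illustrated with a minimal Python sketch of a weighted average. It shows the generic definition only and makes no claim about how CAMPARI accumulates weights internally; the function name is hypothetical, and `math.fsum` is used here simply as one way of reducing round-off when weights differ by many orders of magnitude.

```python
import math


def weighted_mean(values, weights):
    """Weighted average of per-snapshot values (hypothetical helper)."""
    if len(values) != len(weights):
        raise ValueError("need exactly one weight per snapshot")
    if any(w < 0.0 for w in weights):
        raise ValueError("snapshot weights must be non-negative")
    # fsum tracks partial sums exactly, so small weights are not lost
    # when added to a much larger running total.
    total = math.fsum(weights)
    if total <= 0.0:
        raise ValueError("sum of weights must be positive")
    return math.fsum(v * w for v, w in zip(values, weights)) / total
```

With uniform weights, this reduces to the plain average that is assumed when no frames file with weights is supplied.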
This simple keyword lets the user select the file format for a trajectory analysis run. The name PDB_FORMAT is a historical misnomer since the keyword covers all kinds of trajectory input as follows:
- CAMPARI expects a single trajectory file in pdb-format using the MODEL/ENDMDL syntax to denote the individual snapshots.
- CAMPARI expects to find multiple pdb files with one snapshot each that are systematically numbered starting from the file provided via PDBFILE.
- CAMPARI expects to find a single trajectory in binary xtc-format (GROMACS style).
- CAMPARI expects to find a single trajectory in binary dcd-format (CHARMM/NAMD style).
- CAMPARI expects to find a single trajectory in binary NetCDF-format (AMBER style). (reference)
- CAMPARI expects to find data in binary representation in a PostgreSQL
database of name PSQL_R_DATABASE (an original format defined by CAMPARI)
and stored under a specific set of reference identifiers (see PSQL_R_TRAJKEY
and, possibly, PSQL_R_SIMUOFF and
This keyword (integer) controls how the information in a supplied pdb file is meant to be used (see keyword PDBFILE and input file documentation). A maximum of three options is available with the first one offering restricted support depending on the type of calculation:
- CAMPARI attempts to read in the Cartesian coordinates of heavy atoms from the pdb file, proceeds to extract the values for CAMPARI's "native" degrees of freedom (i.e., the unconstrained dihedral angles and the rigid body degrees of freedom in Monte Carlo or torsional molecular dynamics runs → CARTINT), and lastly rebuilds the entire structure using the determined values as well as internal geometry parameters for the constrained internal degrees of freedom (extracted from high resolution crystallographic databases). This hybrid approach will often lead to a propagation of error along the backbone of longer polymers and is therefore unsuitable for reading larger proteins or particularly for macromolecular complexes. While it is never a useful choice for structural input that contains complex molecules but does not exactly encode the same covalent geometry as what CAMPARI uses by default, it is of limited usefulness even when these conditions are met. Specifically, it should be used in conjunction with high precision PDB input (see PDB_INPUTSTRING and PDB_OUTPUTSTRING) for the remaining cases (essentially, CAMPARI runs in rigid-body/dihedral angle space not relying on any structural input). This input mode does not support the processing of unsupported residues (see PDB_TEMPLATE) and, upon discovery of unsupported residues in sequence input, will be changed automatically to option 3 (the default) below. There are further limitations to this mode. 
For example, it requires strictly that the first 3 atoms (in CAMPARI convention) are present for each molecule (unless there are fewer atoms in the molecule or it is a water, ammonium, or methane molecule), it does not recover (read) values for degrees of freedom in supported residues that are considered nonnative (e.g., hydrogen positions in methyl groups irrespective of values for TMD_UNKMODE and OTHERFREQ), and the read-in is stopped as soon as there is any mismatch in sequence and structure inputs at the residue level (the remaining degrees of freedom missing from the pdb file are treated according to keyword RANDOMIZE). Overall, this option should be considered largely obsolete.
- CAMPARI attempts to read in the Cartesian coordinates of all atoms from the pdb file and uses those explicitly (i.e., it implicitly adopts the encoded geometry even for degrees of freedom that are normally constrained within CAMPARI). This will produce warnings if very unusual bond lengths or angles are encountered (see PDB_TOLERANCE_A and PDB_TOLERANCE_B), which are most often indicative of missing atoms in the pdb-file (in particular termini and hydrogens). Some of these problems will be dealt with automatically, but it is always recommended to check the file {basename}_START.pdb to make sure that no drastic deviations occur. Such drastic deviations are almost inevitable if backbone atoms are missing from polymer chains, and in these cases preprocessing of the pdb file may be necessary. Conversely, if the input geometries are merely distorted (experimental structures do not have arbitrary resolution or correctness), the automatic rebuilding CAMPARI may perform should probably be circumvented by increasing the thresholds PDB_TOLERANCE_A. Of course, atoms might be missing in the input PDB file, and their positions will be reconstructed based on standard covalent geometries and on default values for dihedral angles. This applies to missing hydrogen atoms or missing sidechains in crystal structures. The rebuilt coordinates do not consider an energy function, and use of TMD_RELAX, which offers localized relaxation in a hierarchical and focused manner, is recommended. It is worth acknowledging that the algorithms dealing with missing atoms in structural input have been refined and adjusted repeatedly throughout CAMPARI's development history (much more so than other algorithms), which means that even if you use a fixed random seed, you cannot expect the exact same reconstructed starting structure across different versions, especially if tails or heavy atoms were missing (obviously, the latest version is normally to be preferred).
Note that simulations with constraints cannot preserve arbitrarily precise values for the constrained degrees of freedom upon restarts of simulations from standard pdb files. If the sampler is in Cartesian space and constraints are used, keyword SHAKEFROM is a potential remedy. Conversely, simulations in rigid body or torsional space have no way of relaxing input geometries to the built-in (or any user-desired) values for bond lengths, angles, and rigid dihedral angles. In these cases, it may be useful (like for mode 1 above) to modify the assumed pdb format to improve precision (see PDB_INPUTSTRING and PDB_OUTPUTSTRING) and/or to rely on restart files whenever possible. The limitations of mode 2 are that atom names must be understood (automatic translation routines are in-place but not completely exhaustive), that residue names must be matched unambiguously (which means that automatically translated names such as "HSE" instead of "HIE" are OK, but ambiguous names like "HIS" are not), and that the read-in stops as soon as there is any mismatch in sequence and structure inputs at the residue level (the remaining degrees of freedom missing from the pdb file are treated according to keyword RANDOMIZE).
- This mode is identical to mode 2 above with the exception of how mismatches between pdb and sequence input are addressed. Here, CAMPARI will assume that all of the structural input is potentially relevant but that some parts of polymer chains may be missing, which is a common issue with experimental structures. It will thus try to match maximally long sequence stretches from individual molecules (in the order of appearance in the sequence) with the sequence in the pdb file. Read-in stops as soon as structural input is exhausted or whenever an unresolvable mismatch occurs. 
An unresolvable mismatch occurs when, for a molecule present in sequence input, no information whatsoever can be found in the pdb, or when the pdb file contains any residue that cannot be mapped to the current or next molecule in the sequence. Unlike in the previous mode, CAMPARI will apply some soft-matching rules to be able to work with ambiguous labeling of different histidine protomers as "HIS" and to distinguish ambiguous assignments between DNA and RNA or between 5'-nucleoside and full 5'-nucleotide residue types. This mode enables the generation of initial structures with multiple C- and N-terminal tails to polymers being rebuilt. The rebuilding is under the control of keyword RANDOMIZE. This option is the default option. Since longer tails can easily get stuck in the randomization procedure, the initial reconstruction might have to be repeated a few times. In addition, the use of TMD_RELAX is recommended to try to resolve conflicts at the sidechain level not only in general but specifically for the newly built tail(s). It is currently not (yet) possible to reconstruct closed loops. In order to use structures with missing loops, it is recommended, as it is for all PDB files (see Tutorial 16 for an example), to use the sequence file tool shipped with CAMPARI. It will automatically break such loops into two capped tails.
As a more general consideration, the partial read-in of structural input might mean that CAMPARI is used to (also) generate solvent coordinates ("solvate the system") as is done, for example, in Tutorial 6. This is not a problem with absolutely standardized solutions and is generally addressed heuristically, for example by overlaying the system with pre-equilibrated solvent boxes and generously removing solvent molecules that overlap. CAMPARI is more general in this regard and any missing coordinates will be generated randomly, supporting the creation of boxes containing random solvent mixtures, electrolyte solutions, etc. In this context, it becomes a choice whether and how many of the auxiliary coordinates in the input file to retain, most prominently for water molecules in crystal structures. Retaining them has the advantage that it makes it less likely that solvent molecules end up "inside" the macromolecules, a problem also addressed by keyword RANDOMBURIAL. On the flip side, water coordinates in crystal structures are not always well-defined and do not always correspond to water in reality.
This keyword allows changing the assumed PDB formatting string (Fortran) for PDB files. This is required to enable CAMPARI to read in altered PDB files produced by the analogous keyword PDB_OUTPUTSTRING or by other software or scripts. The default format for the controllable section of the standard layout (up to column 54) is "a6,a5,1x,a4,a1,a3,1x,a1,i4,a1,3x,3f8.3" (if you use this keyword, the format must be supplied with the quotes). Because Fortran in general deals poorly with string-based I/O, any improper use of this keyword can easily lead to abnormal program termination. In the format string, the letters (a, i, f) give the type of variable, which must not change. The numbers give the field lengths, and these can be customized for variables of type integer ("i") or real ("f"). It is also possible to modify the field widths of string variables ("a"), but it is not possible for extra content to be read, i.e., the resultant behavior is undefined. The only exception to this is the second variable (atom number), which is of the "wrong" type here because these values are ignored on input. This particular field width can be increased without harm. It is of course intended and required that the corresponding output string format uses an integer field here, by default "i5" instead of "a5".
Common problems with standard PDB files, which can be addressed at least in part by the format string, are that the integer number for atom index overflows, that the chain indicator becomes fused to neighboring columns (because of overlong residue names or large residue numbers), that the residue number column overflows, that the coordinate entries get fused or overflow (if absolute coordinates are not centered at small (in absolute magnitude) values), or that the coordinate precision is insufficient for recovering exact covalent geometries based on this information alone.
These limitations, in particular the system size restrictions, have led to the proposed phasing out of the format by rcsb itself. Unfortunately, the PDBx/mmCIF format, which replaces it, is not particularly convenient for simulation software, but eventually all software will have to provide conversion tools or adapt directly (planned for Version 6).
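To make the meaning of the default format string more concrete, the following Python sketch splits one coordinate record into fields according to "a6,a5,1x,a4,a1,a3,1x,a1,i4,a1,3x,3f8.3". It is only an illustration of how fixed-width descriptors map onto columns, not CAMPARI's Fortran reader; the helper names `expand_format` and `read_record` and the example record are hypothetical.

```python
import re

# Default format quoted in the text above (controllable section, up to column 54)
DEFAULT_FORMAT = "a6,a5,1x,a4,a1,a3,1x,a1,i4,a1,3x,3f8.3"


def expand_format(fmt):
    """Expand a simple Fortran-style format into a list of (kind, width) pairs."""
    fields = []
    for token in fmt.lower().split(","):
        match = re.fullmatch(r"(\d*)([aifx])(\d*)(?:\.\d+)?", token.strip())
        if match is None:
            raise ValueError(f"unsupported descriptor: {token!r}")
        repeat, kind, width = match.groups()
        if kind == "x":
            # "3x" skips three columns: the count precedes the letter
            fields.append(("x", int(repeat or 1)))
        else:
            if not width:
                raise ValueError(f"missing field width in {token!r}")
            # "3f8.3" means three real fields of width 8
            fields.extend([(kind, int(width))] * int(repeat or 1))
    return fields


def read_record(line, fmt=DEFAULT_FORMAT):
    """Slice a fixed-width record and convert integer and real fields."""
    values, pos = [], 0
    for kind, width in expand_format(fmt):
        chunk = line[pos:pos + width]
        pos += width
        if kind == "a":
            values.append(chunk)
        elif kind == "i":
            values.append(int(chunk))
        elif kind == "f":
            values.append(float(chunk))
    return values


# Hypothetical record, assembled field by field so the column widths are explicit
line = ("ATOM  " "    1" " " " N  " " " "ALA" " " "A" "   1" " " "   "
        "  11.104" "   6.134" "  -6.504")
record = read_record(line)
print(record)
```

With the default widths, the coordinates are limited to three decimals; widening the real fields in both the input and output format strings is the route the text describes for higher precision.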
If structural input from a pdb file is requested in modes 2 or 3 (see PDB_READMODE and PDBFILE) or if a trajectory analysis run is being performed, this keyword offers two choices for dealing with hydrogen atoms (which will often be missing from the pdb input file):
- CAMPARI will attempt to read in the Cartesian coordinates of all hydrogen atoms directly and only rebuild those hydrogen (and other) atoms which cause a geometry violation defined by keywords PDB_TOLERANCE_B and PDB_TOLERANCE_A.
- CAMPARI will rebuild all hydrogen atoms according to its underlying default models for local geometry in chemical building blocks. This is most useful if hydrogen atoms are missing entirely from the input file.
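The notion of a "geometry violation" used in the first choice can be sketched as follows. This is a minimal illustration that assumes the default tolerances documented under PDB_TOLERANCE_A (20.0° to either side of the reference angle) and PDB_TOLERANCE_B (80% to 125% of the reference bond length); the function names and the reference values in the example are hypothetical and do not reflect CAMPARI's internal code.

```python
import math


def bond_length(a, b):
    """Distance between two atoms given as (x, y, z) tuples."""
    return math.dist(a, b)


def bond_angle(a, b, c):
    """Angle a-b-c in degrees, with atom b at the apex."""
    v1 = [p - q for p, q in zip(a, b)]
    v2 = [p - q for p, q in zip(c, b)]
    dot = sum(p * q for p, q in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against round-off
    return math.degrees(math.acos(cosine))


def length_ok(value, reference, lower=0.8, upper=1.25):
    """Fractional interval around the reference length (PDB_TOLERANCE_B defaults)."""
    return lower * reference <= value <= upper * reference


def angle_ok(value, reference, tolerance=20.0):
    """Symmetric interval in degrees around the reference (PDB_TOLERANCE_A default)."""
    return abs(value - reference) <= tolerance


# Hypothetical reference values, for illustration only
print(length_ok(0.95, 1.01))    # within 80%..125% of the reference
print(length_ok(1.40, 1.01))    # too long, would be flagged
print(angle_ok(135.0, 109.5))   # deviates by more than 20 degrees
```

An atom flagged by either test would be a candidate for rebuilding from the default geometry, which is why very small tolerances can cause unwanted rebuilding along the main chain.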
For processing structural input, keyword PDB_NUCMODE explained below is ignored. It is listed here nonetheless to explain what CAMPARI actually does when reading in a pdb file supplied via PDBFILE or via PDB_TEMPLATE:
If the input file is in CAMPARI convention, i.e., the O3* oxygen atom is part of the same residue as the phosphate it belongs to, the read in is consistent with internal convention. If, however, the input file is in pdb convention (also used by almost all other simulation software), i.e., the O3* oxygen atom is always part of the same residue as the sugar it belongs to, a heuristic is used to avoid an incorrect assignment. This heuristic relies on the geometry of the input structure being sane as it checks the bond distance to the appropriate phosphorus atom. For the heuristic to be successful, it is essential that the 4-letter atom name for the phosphorus atoms is always " P ". In terminal residues, it is possible that two oxygen atoms appear, and in this case it is important that they have different names (" O3*" and "2O3*" in standard CAMPARI convention).
As long as atom names can be parsed (see also below), the user should therefore not have to worry about the placement convention used in pdb input files. This implies that it is possible to supply a binary trajectory file (for example via DCDFILE) written in the non-CAMPARI convention of assigning the O3*-atom to the residue carrying the sugar it is attached to by the use of an appropriate template.
CAMPARI can in general process different conventions for the formatting of PDB files. A large fraction of simple atom naming convention multiplicities is handled automatically without the use of any keywords. PDB_R_CONV allows the user to select the format a read-in pdb-file is assumed to be in to be able to deal with more severe discrepancies. Possible choices currently consist of:
- CAMPARI format (of course suitable for reading back in any CAMPARI-generated output even if PDB_NUCMODE was used → see above).
- GROMOS format (nucleotide naming). This option offers very little unique functionality since most of the supported conversions are handled automatically regardless of the setting for this keyword. It is primarily used to handle the GROMOS residue names for nucleotides (ADE, DADE, and so on).
- CHARMM format (in particular atom naming, cap and nucleotide residue names and numbering (patches), ...). Note that there are two exceptions pertaining to C-terminal cap residues (NME and NH2) which arise due to non-unique naming in CHARMM: 1) NH2 atoms need to be called NT2 (instead of NT) and HT21, HT22 (instead of HT1, HT2), and 2) NME methyl hydrogens need to be called HAT1, HAT2, HAT3 (instead of HT1, HT2, HT3). For nucleotides, there is an additional exception for 5'-residues carrying a 5'-terminal phosphate (the hydrogen in the terminal POH unit needs to be called " HOP" instead of " H5T"). This is again due to non-unique naming conventions within CHARMM.
- AMBER format (atom and residue naming in particular for nucleotides). Note that this option is the least tested one. Please let the developers know of any additional issues you may encounter.
This keyword is a combined input/output keyword and is explained below. It can be used to process structural input with molecules that are broken up for periodic systems.
PDB_TOLERANCE_A
This setting allows the user to override CAMPARI's built-in defaults for the tolerances it applies on a read-in structure (usually xyz from pdb). Since it is not always easy to distinguish distorted structures from missed input, the code applies a tolerance when comparing read-in bond angles to the internal reference value (which is derived from crystallographic databases). The default is an interval to either side of 20.0° and this relatively stringent setting can be expanded or contracted using this keyword. If a violation is found, the code usually overrides the faulty value with the default since it assumes that atomic positions were missing. If this happens along the main chain of a processed polymer from a pdb input file, this usually leads to unwanted effects, which can be avoided by setting this to a large number. The main benefit of keeping a lower tolerance is that it will relatively clearly flag possible errors or unusual features in the structural input file (which are often difficult to recognize through visual inspection).
PDB_TOLERANCE_B
This is analogous to PDB_TOLERANCE_A, but allows the user to change the interval for considering bond length exceptions. The difference here is that two numbers are required: a lower fractional (down to 0.0) and an upper fractional number (preferably larger than 1.0 of course). This is because bond length ranges are inherently not normalized and in addition nonlinear (exceptions with too long bond lengths are much more frequent). The default is an interval between 80% and 125% of the crystallographic reference value (settings 0.8 and 1.25).
PSQL_R_TRAJKEY
If structural input from a PostgreSQL database is requested (→ PDB_FORMAT) in a trajectory analysis run, this keyword sets the unique identifier string of the system that is meant to be analyzed. If the identifier is not present in the database specified by keyword PSQL_R_DATABASE, an error is produced. If the identifier has no associated simulations and tables with snapshot data in them (see description of standard elsewhere), an error is produced as well. If data are found but errors are produced due to mismatches between the specified system and the one deposited in the database, both a pdb file to be used as input for keyword PDB_TEMPLATE and a sequence file to be used as input for keyword SEQFILE are available to diagnose and/or rectify the issue. Keyword PDB_ATOMMAP offers an alternative route for solving the problems with non-CAMPARI atom order.
If, on the other hand, the key exists, has associated data, and the deposited system matches the one CAMPARI assumes based on the supplied sequence file, the following modes of operation are possible:
- Keywords PSQL_R_SIMUOFF and PSQL_R_RANK are 0 and -1, respectively. If the analysis run does not use multiple copies (per MPI), this will simply concatenate all tables, up to a limit set by PSQL_R_LIMIT, and thereby snapshots associated with the requested key in order of increasing "sim_id" and, in second instance, "sim_rank" (if keyword PSQL_R_PREFSORT is 1) or in order of increasing "sim_rank" and, in second instance, "sim_id" (if keyword PSQL_R_PREFSORT is 2, which is the default). A possible file with custom input frames will have to refer to the implicit numbering defined by this concatenation. In a parallel analysis run, instead all replicas will try to find "their" table(s) by matching the "sim_rank" attribute to their MPI rank. The data will automatically be truncated to the largest common set. Errors will be produced as soon as any replica cannot find a matching table with a sufficient amount of deposited data.
- Keyword PSQL_R_SIMUOFF is a positive number and keyword PSQL_R_RANK is -1. This produces the same as the previous option except that only simulations matching the "sim_parent" attribute specified by PSQL_R_SIMUOFF are considered. This would be the analog of a parallel trajectory analysis run operating on separate files per replica. This is also the only option that is straightforwardly compatible with manipulations dependent on supplying a replica exchange or PIGS trace file or, alternatively, utilizing the connectivity information stored directly in the database (→ PSQL_R_TRACKBACK).
- Keyword PSQL_R_SIMUOFF is a positive number and keyword PSQL_R_RANK is 0 or larger. In this case, the table to be analyzed is completely fixed to match a "sim_parent" and "sim_rank". There should only be one such table in a given database for a given key.
- Keyword PSQL_R_SIMUOFF is 0 and keyword PSQL_R_RANK is 0 or larger. In this case, the table(s) to be analyzed are all those that match the key supplied with this keyword and the "sim_rank" defined by PSQL_R_RANK. As for the previous option, in a parallel analysis run, all replicas will analyze the same data.
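The four combinations above amount to filtering the simulation entries for a key and then ordering the resulting tables. The following Python sketch illustrates this logic for a serial analysis run only; it is not CAMPARI's implementation and does not touch a real database. The `Simulation` tuple, the function `select_tables`, and the second pair of entries (IDs 20 and 21) are hypothetical, while the attribute names and the "snapshots_i_j" table naming follow the text.

```python
from typing import NamedTuple


class Simulation(NamedTuple):
    """Stand-in for one row of the "simulations" table (assumed layout)."""
    sim_id: int
    sim_parent: int
    sim_rank: int


def select_tables(simulations, simuoff=0, rank=-1, prefsort=2, limit=None):
    """Filter by sim_parent and sim_rank, sort, and truncate (serial run)."""
    chosen = [s for s in simulations
              if (simuoff == 0 or s.sim_parent == simuoff)   # PSQL_R_SIMUOFF
              and (rank == -1 or s.sim_rank == rank)]        # PSQL_R_RANK
    if prefsort == 1:
        chosen.sort(key=lambda s: (s.sim_id, s.sim_rank))    # ID first
    else:
        chosen.sort(key=lambda s: (s.sim_rank, s.sim_id))    # rank first (default)
    if limit is not None:                                    # PSQL_R_LIMIT
        chosen = chosen[:limit]
    return [f"snapshots_{s.sim_id}_{s.sim_rank}" for s in chosen]


sims = [Simulation(14, 14, 0), Simulation(15, 14, 1),
        Simulation(20, 20, 0), Simulation(21, 20, 1)]

print(select_tables(sims))                      # everything, rank-major order
print(select_tables(sims, prefsort=1))          # everything, ID-major order
print(select_tables(sims, simuoff=14))          # one parallel ensemble
print(select_tables(sims, simuoff=14, rank=1))  # exactly one table
print(select_tables(sims, rank=0))              # all tables of rank 0
```

In a parallel analysis run with PSQL_R_RANK set to -1, each replica would instead apply the rank filter with its own MPI rank, which this sketch does not model.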
If structural input from a PostgreSQL database is requested (→ PDB_FORMAT) in a trajectory analysis run, this keyword sets the target value of the attribute "sim_parent" as long as it is a positive number when identifying the snapshot tables associated with the database key given by keyword PSQL_R_TRAJKEY. The database will contain different tables, and those closely associated with one another (e.g., the different replicas in a deposited replica-exchange calculation) will be found as simulation entries that share the same value of "sim_parent". For example, a two-copy simulation might produce two tables "snapshots_14_0" and "snapshots_15_1". Here, 14 and 15 would be the associated "sim_id", and in the table "simulations" the corresponding rows would each have "sim_parent" as 14 but "sim_rank" as 0 and 1, respectively.
The special option of putting 0, which is also the default, means that CAMPARI will look for all snapshot tables associated with the database key that also match the choice for keyword PSQL_R_RANK. More details are found in the description of keyword PSQL_R_TRAJKEY. The analogous keyword for write access is PSQL_W_SIMUOFF.
If structural input from a PostgreSQL database is requested (→ PDB_FORMAT) in a serial or parallel trajectory analysis run, this keyword defines the specific MPI rank to use to find the correct snapshot table(s) associated with the database key given by keyword PSQL_R_TRAJKEY. In the CAMPARI standard, snapshot tables are named "snapshots_i_j" where j refers to the rank ("sim_rank"), see the description elsewhere. By choosing -1 for this keyword, CAMPARI will assume that all ranks present in the analysis run find a corresponding table of shared parent "sim_parent" and their own respective ranks encoded as j. This means that, in a parallel analysis run, different replicas will analyze data from different tables. If the specified value instead is 0 or larger, the value is interpreted as the fixed rank, and one or more tables with fixed j are analyzed by the single run or by all of the replicas in case of a parallel run. The match is made safe because each simulation entry in the database has a "sim_parent" field, which is what is uniquely associated to the value specified for PSQL_R_SIMUOFF. More details are found in the description of keyword PSQL_R_TRAJKEY.
PSQL_R_TRACKBACK
If structural input from a PostgreSQL database is requested (→ PDB_FORMAT) in a serial or parallel trajectory analysis run, this keyword (0/1) asks CAMPARI to trace back the continuous path(s) of snapshots stored in the database. This information is part of the standard (see the description elsewhere) and normally entered during deposition (fields "previous_id", "previous_rank", and "previous_snap" in the snapshot tables). It is of interest in trajectory ensembles where the individual copies exchange information. Two types are supported: swaps as in replica-exchange runs and terminations/duplications (one-to-many) as in PIGS or related adaptive sampling methods but not mixtures of the two. The former is normally for copies evolving in different conditions (temperature, Hamiltonian, etc.) while the latter is normally for copies evolving in the same condition. This is why swaps (or other cyclical operations) are, generally speaking, redundant in adaptive sampling.
In practice, the option requires supplying an input file with a single input frame for keyword FRAMESFILE. From this starting point, the backward map stored in the trajectory database is used to find the prior stored snapshot, from which the current one derives. This is continued until either an end point is reached (no more prior snapshot available) or the number of snapshots requested through NRSTEPS has been reached. The user should take care to ensure it is the latter. In tracking back a (standard CAMPARI) replica-exchange trajectory in this way, a geometrically continuous path is recovered that jumps through conditions. In tracking back a PIGS trajectory in this way, a geometrically continuous path is recovered that recapitulates the origin of the structure and will encapsulate the features that contributed to its survival. Because the reseeding of PIGS generates a tree-like structure, backward paths are unequivocal while forward ones are not.
In a parallel trajectory analysis run that uses the replica-exchange setup, where PSQL_R_RANK is -1, and where a specific parallel trajectory ensemble has been selected, this feature will be used independently by all replicas. Suppose there are 32 trajectories with 1000 snapshots each, a frames file with "1000" in it will select the last snapshot separately for all ranks, thus leading to the simultaneous analysis of all backward-continuous paths. In a replica-exchange calculation, this is conceptually the same as what would be accomplished by reading and processing the trace file (but only one of the two must be used) except that the snapshots are obtained in reverse order. In a PIGS calculation, an increasing number of replicas will eventually converge onto the same surviving branch as the backtracking proceeds leading to redundancy (duplicated snapshots). This does not replace the step of reading the trace file for graph-related analyses when a parallel analysis run uses the MPI averaging setup. Instead, it is an explicitly different procedure that, as mentioned, not only duplicates snapshots but also removes dead branches. To see branches that did not survive to the end, the snapshot in the frames file needs to be adjusted accordingly.
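The backtracking idea can be illustrated with a small in-memory stand-in for the backward map. In the Python sketch below, each (rank, snapshot) pair points to the pair it derives from; the dictionary layout, the function name `track_back`, and the example data are hypothetical simplifications (the "previous_id" field is omitted for brevity), not the database schema or CAMPARI's code.

```python
def track_back(previous, start, max_steps):
    """Follow backward links from `start` until the chain ends or
    `max_steps` snapshots (the analog of NRSTEPS) have been collected."""
    path = [start]
    current = start
    while len(path) < max_steps:
        current = previous.get(current)
        if current is None:        # end point: no prior snapshot stored
            break
        path.append(current)
    return path


# Two replicas (ranks 0 and 1) with four snapshots each; the copies swap
# between snapshots 2 and 3, as in a replica-exchange run.
previous = {
    (0, 2): (0, 1), (0, 3): (1, 2), (0, 4): (0, 3),
    (1, 2): (1, 1), (1, 3): (0, 2), (1, 4): (1, 3),
}

print(track_back(previous, (0, 4), 10))  # stops at the end point
print(track_back(previous, (1, 4), 2))   # stops at the requested count
```

The first call recovers a path that changes rank at the swap and is returned in reverse chronological order, which mirrors the statement that snapshots are obtained in reverse order.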
If structural input from a PostgreSQL database is requested (→ PDB_FORMAT) in a serial or parallel trajectory analysis run, this keyword sets the maximum number of SQL snapshot tables to consider per replica. More than a single table per replica might be considered because of PSQL_R_SIMUOFF being 0 and/or PSQL_R_RANK being -1.
PSQL_R_PREFSORT
If structural input from a PostgreSQL database is requested (→ PDB_FORMAT) in a serial or parallel trajectory analysis run, this keyword defines the preference for how to order tables in SQL in case the settings for PSQL_R_SIMUOFF and PSQL_R_RANK request a dynamic scope (multiple tables to be processed). Choosing 1 prioritizes the simulation ID ("sim_id") over its rank ("sim_rank"), choosing 2 (which is the default) does the opposite.
PSQL_R_DATABASE
If structural input from a PostgreSQL database is requested (→ PDB_FORMAT) in a serial or parallel trajectory analysis run, this keyword lets the user specify the name of the database to read trajectory information from. Naturally, using this functionality requires that CAMPARI has been compiled and linked with PostgreSQL (PSQL) and libpqxx support (see installation instructions for further information). If the database does not exist or is not accessible by the user running CAMPARI, an error is produced. Connection errors can be down to user permissions (ask your system/SQL administrator to correct this if you are trying to access a database you do not administer yourself) but also down to incorrect settings for keywords PSQL_R_DBPORT and PSQL_R_DBHOST. Lastly, the analogous keyword for PSQL_R_DATABASE for the writing of structural information is PSQL_W_DATABASE.
PSQL_R_DBHOST
If structural input from a PostgreSQL database is requested (→ PDB_FORMAT) in a serial or parallel trajectory analysis run, this keyword lets the user specify the (possibly remote) host where the database specified by keyword PSQL_R_DATABASE can be found. This access is handled through the libpqxx library, which supports remote connections. The default is "" (without the double quotes), which on most Linux systems maps to the local machine. The communication channel uses a specific port, which can be set by keyword PSQL_R_DBPORT. If a remote connection fails, it might be because the database server is not configured to accept noninteractive connections without additional authentication (such as a user password). Lastly, the analogous keyword for PSQL_R_DBHOST for the writing of structural information is PSQL_W_DBHOST.
PSQL_R_DBPORT
If structural input from a PostgreSQL database is requested (→ PDB_FORMAT) in a serial or parallel trajectory analysis run, this keyword lets the user specify the port number to use when accessing the database specified by keyword PSQL_R_DATABASE hosted on the server specified by PSQL_R_DBHOST. This access is handled through the libpqxx library, which supports remote connections. Allowed values are legitimate port numbers, and the default is 5432. The only common reason to change this is if, possibly for security reasons or to avoid clashes, the SQL administrator has proactively moved the port number to a different value. The default value of 5432 is the officially registered port for the PostgreSQL database. Lastly, the analogous keyword for PSQL_R_DBPORT for the writing of structural information is PSQL_W_DBPORT.
PDB_TEMPLATE
This keyword allows the user to provide name and location of a pdb file that serves in possibly several auxiliary functions. A template pdb file is relevant in the following circumstances:
- In a trajectory analysis run, it can serve as a map to correct a mismatch in atom ordering between a binary trajectory file (dcd, xtc, NetCDF, PostgreSQL) and CAMPARI's intrinsic convention (this function can alternatively be fulfilled by supplying a file input for keyword PDB_ATOMMAP). Typically, a pdb file provided by the program having generated the binary file will serve this purpose. In order for the map to work, it is crucial to ensure that every single atom to be read in has a proper match (by atom name) in the pdb file, i.e., it is not tolerable to provide a pdb template with missing atoms or with atom names that CAMPARI cannot parse. In general, CAMPARI's pdb parser is relatively flexible and allows additional control via PDB_R_CONV. It is typically not possible, however, to correct mismatches in the grouping of atoms into residues (with the exception of the processing of coordinates for nucleotides, see PDB_NUCMODE). This is because CAMPARI treats a change in residue number on consecutive coordinate records as the signal for delineating entries by residue. Conversely, the absolute numbers are irrelevant.
- The template pdb file can simultaneously serve as a reference structure if alignment is requested in trajectory analysis runs (→ ALIGNCALC). This has the same requirements as the previous function meaning that it is not trivially possible to align trajectories using an incomplete or different reference structure. However, alignment to a reference structure is a functionality offered by almost every molecular visualization program.
- In all types of runs, the template pdb file can be used to infer the topology of residues not natively supported by CAMPARI. This is crucial for handling these systems. Importantly, using the template for this purpose decouples the topology determination from structural input for simulation runs, which allows initial randomization (possibly partial) of systems containing such unsupported residues. The content of the template should contain each unique unsupported residue in the same order in which they appear in the sequence file. For unsupported residues that are part of a polymer chain, each occurrence in the sequence must have its own entry also containing the immediate N- and C-terminal sequence context in the polymer chain. This is unless consecutive unsupported residues are part of the same polymer chain (sequence context is mutually shared). If the unsupported residue is a small molecule (single residue), the template should contain a single instance. Beyond this, there are precise requirements for both input files. They are listed in the corresponding documentation for both the sequence file and the pdb file.
Assuming both files to be properly formatted, CAMPARI then does the following:
- From the sequence file, the number of unknown residues and their intended linkages are extracted.
- The template is read and the atomic indices delimiting all unknown residues are extracted. Single-residue molecules that are unsupported and occur repeatedly in the sequence can (but need not necessarily) reuse the same indices as the first instance of this type (matched by residue name in the pdb file). The procedure will try to match sequence input with the data in the template while allowing for gaps in both: i) supported residues present in sequence input and not required as context for unsupported polymer residues are skipped; ii) supported residues present in the template are ignored; iii) unsupported single-residue molecules occurring a second or more times in sequence input are skipped if (and only if) a different unsupported residue is found next in the template. The last condition means that the second occurrence of an unsupported single-residue molecule may or may not be used (depending on sequence input). After this, basic parameters such as the effective residue radius and the reference atom are inferred. It is therefore important that the conformation of the residue in the pdb file is somewhat representative.
- The remainder of the system topology is constructed. The internal order of atoms for unsupported residues always reflects the order in the input pdb file exactly. The pdb template file is parsed again to ensure that the required sequence context is present. This applies only to those unsupported residues that are part of a chain (polymer).
- From the pdb atom names, the chemical element is guessed (C, O, N, H, P, S, halogens, various metals, and metalloids) and the mass is set to that of an appropriate atom type in the parameter file (identification by attempts to match mass and valence). The assignment will be poor if the parameter file does not support the chemical element in question. Further details are found elsewhere. This can later be overridden by a biotype patch and/or a combination of other patches.
- A new biotype is created for every new atom type encountered. This biotype is initialized to be empty with the exception of keeping the atom name and the (already) assigned atom type. The numbering of these new biotypes continues from what the highest number in the parameter file is. It is therefore not possible to use the parameter file for these assigned biotypes directly. Instead, it is recommended to use a biotype patch or specialized patches. The assignment of an atom type is sufficient to provide basic support, so for certain applications no patches may be required. For residues duplicated in sequence relative to the template, this and all subsequent information is simply copied from the first (and usually only) instance. For valid unsupported residues with identical names, which occur multiple times in the sequence, only the type (but not the geometry) information is copied from the reference instance.
- The covalent bond information is used to infer the molecular topology (including a detection of rings). This defines the Z-matrix entries (internal coordinate representation) for unsupported residues. Similarly, the linkage to covalently bound residues that are either supported or also unsupported is inferred. In the process, rotatable dihedral angles are detected automatically. This procedure, which explicitly tests for bond angle or length variations upon rotation, is critical to most subsequent assignments.
- Given a set of pdb names, atom types, valences, and a topology, CAMPARI attempts to conclude by analogy whether the residue conforms to the backbone of one of the supported polymer types (currently, polypeptides and polynucleotides). If it does, as many internal pointers as possible are set to identify the residue accordingly (this does not work for single-residue molecules).
- If a residue is recognized as being part of a supported polymer type, the topology itself is corrected (the goal is that it should make no difference to CAMPARI whether a residue is supported or whether it is masked (by changing the name) as an unsupported one and all the information has to be inferred from the input structure). Further corrections pertain to the setup of interactions, etc. Note that the match cannot always be perfect, e.g., fudge factors that are not zero or unity in conjunction with MODE_14 being 2 and INTERMODEL being 1 may lead to energetic inconsistencies. The interaction setup relies on determining local rigidity via its knowledge of which dihedral angles are rotatable. Due to code-specific reasons (scanning for short-range exceptions, exclusions, etc), it is highly recommended to parse the chain into residues such that any pair of atoms in residues i and i+2 is separated by at least four rotatable bonds.
- All flexible dihedral angles may be made part of basic sampling routines if the simulation is in internal coordinate space. These are the torsional dynamics sampler (→ TMD_UNKMODE for details) and the basic Monte Carlo moves for degrees of freedom of this type (→ OTHERUNKFREQ). Furthermore, access will be granted to the specialized samplers if the residue is detected as eligible. This, however, may sometimes lead to an altered interpretation of the absolute values of certain dihedral angles or even alter details of the sampler slightly, e.g., the pucker sampling of proline-like, unsupported residues may end up perturbing different sets of auxiliary bond angles.
- If analyses are requested, these routines will respond to the unsupported residue according to the values set in the previous steps. Basically, the better the match to natively supported entities is, the more analysis functionalities will be available. Straightforward cases depend only on Cartesian coordinates (e.g., RHCALC or CONTACTCALC), whereas polymer type-specific analyses (e.g., DSSPCALC) require an unsupported residue to be recognized as the corresponding polymer type. Care must be taken in mixed polymers or other exotic cases, and it may occasionally be necessary to disable certain analysis routines.
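The topology-inference step in the list above (ring detection and the identification of rotatable dihedral angles from covalent bond information) can be illustrated with a purely graph-based stand-in. Note that this is not the procedure described above, which explicitly tests for bond angle or length variations upon rotation; the sketch below only uses connectivity, and all names and the example bond list are hypothetical.

```python
from collections import deque


def build_graph(bonds):
    """Adjacency sets from a list of (atom, atom) covalent bonds."""
    graph = {}
    for a, b in bonds:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def in_ring(graph, a, b):
    """True if a and b stay connected when the a-b bond itself is ignored."""
    seen, queue = {a}, deque([a])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if {node, nxt} == {a, b}:
                continue                  # do not walk across the tested bond
            if nxt == b:
                return True               # alternative route found: ring bond
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def candidate_rotatable(bonds):
    """Non-ring bonds whose two atoms each carry at least one other neighbor."""
    graph = build_graph(bonds)
    return [(a, b) for a, b in bonds
            if not in_ring(graph, a, b)
            and len(graph[a]) > 1 and len(graph[b]) > 1]


# Three-membered ring (C1, C2, C3) with a two-atom tail (C4, C5)
bonds = [("C1", "C2"), ("C2", "C3"), ("C3", "C1"), ("C3", "C4"), ("C4", "C5")]
print(candidate_rotatable(bonds))
```

A connectivity-only test like this cannot recognize bonds that are rigid for chemical reasons (for example, double bonds), which is one reason why a geometric test on the actual structure is the more robust choice.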
For MPI parallel (multi-copy) runs, this logical keyword (1 means "on") allows the user to provide different starting structures via pdb files for different replicas (copies). The keyword is irrelevant in parallel trajectory analysis mode where this is the required and automatic behavior.
CAMPARI used to restrict the use of this keyword to certain classes of calculations, but this is no longer the case. There are some risks associated with PDB_MPIMANY as follows: In internal coordinate space, the default accuracy of pdb files is too low to ensure that the covalent geometries across multiple input structures are sufficiently similar even when they were exactly the same in the underlying full precision coordinates. The distortions in covalent geometries mean that the simulated systems are no longer exactly the same, which is undesirable in cases where this is implied because the copies are coupled, e.g., in replica exchange or PIGS calculations. The magnitude of this effect can be diagnosed, for instance, by analyzing these geometries (→ INTCALC) directly or by comparing short-range energy terms, in particular bonded ones (such as bond angle potentials). The effect can be circumvented with the help of keyword PDB_INPUTSTRING, which allows redefining the pdb format to be high-precision. This is the most obvious solution but only available if pdb files can be generated from higher-precision coordinates to begin with. In Cartesian space simulations, geometries usually relax to their force field minima unless holonomic constraints are in use (in which case keyword SHAKEFROM can be used to circumvent precision issues with PDB_MPIMANY).
If this option is active, CAMPARI expects to find systematically named pdb files with the base name given via keyword PDBFILE. The naming is analogous to the convention CAMPARI uses for outputs of parallel runs and also identical to what parallel trajectory analysis runs require. It is explained elsewhere. A list of keywords specific to running CAMPARI in parallel is found below.
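The precision risk described for PDB_MPIMANY can be illustrated with a small sketch. It assumes the standard pdb coordinate precision of three decimal places; the helper names and the two example "replicas" are hypothetical and are not part of CAMPARI. The same 1.53 Å bond is stored in two different orientations, and only one of them survives the rounding unchanged.

```python
import math

def round_pdb(coord, decimals=3):
    """Round a coordinate triple the way a 3-decimal pdb record stores it."""
    return tuple(round(x, decimals) for x in coord)

def bond_length(a, b):
    """Euclidean distance between two points (in Å)."""
    return math.dist(a, b)

# Hypothetical replicas: the same bond length, aligned with x in A and
# along the space diagonal (with an arbitrary offset) in B.
step = 1.53 / math.sqrt(3.0)
replica_a = ((0.0, 0.0, 0.0), (1.53, 0.0, 0.0))
replica_b = ((0.12345, 4.56789, -2.34567),
             (0.12345 + step, 4.56789 + step, -2.34567 + step))

for name, (p, q) in (("A", replica_a), ("B", replica_b)):
    exact = bond_length(p, q)
    stored = bond_length(round_pdb(p), round_pdb(q))
    print(f"replica {name}: exact {exact:.6f} Å, from pdb precision {stored:.6f} Å")
```

In internal coordinate space such differences are frozen in, which is why the copies end up simulating slightly different systems.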
If a small molecule screen is performed, and the specified files for MOL2FILE and (optionally) MOL2_REFMOL contain not only atoms for the small molecules to be screened but also coordinates belonging to the fixed part of the system, this keyword is a required input file to tell CAMPARI so. The file should contain the indices (in CAMPARI numbering) of the atoms additionally specified in the input file. It is a requirement that they have to have smaller indices than the atoms of the screened molecules (that is, they must occur first). The format is shared with other atom index input files (see elsewhere).
MOL2_PDB_RELAXED
If a small molecule screen is performed in scoring mode, then this keyword allows the user to supply an additional PDB file containing (only) the fixed part of the system. This PDB file is processed according to the same rules and conventions outlined elsewhere. The structure read in this way is used as the reference conformation for computing the "Protein" energies in the finite difference that is reported in the scores. Thus, it can be used to correct scores for what is often referred to as "strain" in the complex (caused by the presence of the small molecules). For a completely fixed base part, this simply creates an offset whereas for a partially dynamic base part, it can also change the rankings (because without using this keyword, the scores would all reference different "Protein" energies).
Energy Terms:
(back to top)
Preamble (this is not a keyword)
There are various classes of energy terms. They include the core nonbonded energy terms (SC_IPP, SC_ATTLJ, SC_POLAR, SC_IMPSOLV, SC_TABUL, SC_WCA, and ghosted variants thereof), which typically use a truncation scheme involving neighbor lists. When CAMPARI's shared memory (OpenMP) parallelization is in use, these interactions are calculated in parallel using a detailed scheme operating at the neighbor lists level to achieve load balance. The second class of energy terms are those that are not pairwise, require no cutoff scheme, and are truly "local". Some of these, e.g., bonded terms, are generally split by residue across threads without difficulty because they are sums of independent terms. The final class of energy terms are various restraint (bias) terms requiring a synthesis of information across many residues, e.g., spatial density restraints. For these, the documentation below lists explicitly the parallelization support in each case.
HSSCALE
This keyword controls a generic scaling factor for size parameters (Lennard-Jones σii and σij) that were read in from the parameter file. This fundamentally alters the excluded volume properties of the system and also affects derived properties. Motivation for using this keyword (which naturally defaults to 1.0) may arise during parameter development or in specialized calculations. The directly affected potentials are SC_IPP, SC_ATTLJ, and SC_WCA, but the ABSINTH solvation model will also generally respond to it (this depends on the use of "radius" overrides in the parameter file). The analogous universal scaling of the ε parameters is achievable by means of the scale factors (SC_IPP, SC_ATTLJ, and SC_WCA).
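This keyword allows the user to specify the linear scaling factor controlling the strength of the inverse power potential (IPP) defined as:
$$E_{\mathrm{IPP}} = c_{\mathrm{IPP}} \cdot 4.0 \sum\sum_{i,j} \varepsilon_{ij}\, f_{1\text{-}4,ij} \cdot \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{t}$$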
Here, the σij and εij are the size and interaction parameters for atom pair i,j, f1-4,ij are potential 1-4 fudge factors (see FUDGE_ST_14) that generally will be unity, rij is the interatomic distance, t is the exponent, and the (double) sum runs over all interacting pairs of atoms. Lastly, cIPP is the linear scaling factor controlled by this keyword which - unlike most other scaling factors for energy terms - defaults to 1.0. In most applications, the inverse power potential will be the repulsive arm of the Lennard-Jones potential (t = 12 → 12th power, see IPPEXP). The interpretation and application of the provided parameters (see documentation and keyword PARAMETERS) can be controlled through keywords SIGRULE, EPSRULE, INTERMODEL, FUDGE_ST_14, and MODE_14. Note that the use of the Weeks-Chandler-Andersen (WCA) potential (→ SC_WCA) is mutually exclusive with inverse power potentials.
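The IPP expression can be written out as a minimal sketch. This is not CAMPARI's implementation: the function name and argument layout are hypothetical, each atom pair is assumed to be counted once, and cutoffs, exclusion rules (INTERMODEL), and units are ignored.

```python
import math
from itertools import combinations

def ipp_energy(coords, sigma, eps, c_ipp=1.0, t=12, fudge=None):
    """Sum c_IPP * 4 * eps_ij * f_ij * (sigma_ij / r_ij)**t over all atom pairs.

    coords: list of (x, y, z); sigma, eps: nested lists indexed [i][j];
    fudge: optional {(i, j): factor} for 1-4 pairs (unity if absent).
    """
    total = 0.0
    for i, j in combinations(range(len(coords)), 2):
        r = math.dist(coords[i], coords[j])
        f = 1.0 if fudge is None else fudge.get((i, j), 1.0)
        total += eps[i][j] * f * (sigma[i][j] / r) ** t
    return 4.0 * c_ipp * total

# Two atoms separated by exactly sigma: the ratio is 1, so E = 4 * eps.
two_atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
sig = [[3.0, 3.0], [3.0, 3.0]]
ep = [[0.1, 0.1], [0.1, 0.1]]
print(ipp_energy(two_atoms, sig, ep))
```

Passing a different even `t` mimics the effect of IPPEXP in this toy setting.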
This keyword allows the user to adjust the exponent (an even integer that defaults to 12) for the inverse power potential. An important restriction is that many of the optimized loops in dynamics calculations do not support any other choice except 12. Note that very large numbers will of course - possibly in compiler-dependent fashion - slow down code execution due to the increasing complexity of expensive operations in innermost loops. By (formally) setting this to a value greater than 100, CAMPARI is instructed to replace the IPP potential with a hard-sphere (HS) potential, which is only available in pure Monte Carlo runs. In this case the scaling factor is ignored, the "infinity"-value (penalty for nuclear fusion) is determined by the setting for BARRIER, and the use of a size reduction factor (HSSCALE) is strongly recommended. In hard-sphere potentials, any energy readout for the IPP term should now be in multiples of BARRIER, and all persisting non-zero values would indicate a frustrated (non-relaxable) system. The actual value specified for IPPEXP is then irrelevant.
SIGRULE
This keyword defines the combination rule for combining the size parameters of Lennard-Jones (and WCA) potentials, i.e., how to construct σij from σii and σjj if σij is not provided as a specific override in the parameter file (for details see PARAMETERS). The choices are:
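1) $\sigma_{ij} = 0.5 \cdot (\sigma_{ii} + \sigma_{jj})$ (arithmetic mean)
2) $\sigma_{ij} = (\sigma_{ii} \cdot \sigma_{jj})^{0.5}$ (geometric mean)
3) $\sigma_{ij} = 2.0 \cdot (\sigma_{ii}^{-1} + \sigma_{jj}^{-1})^{-1}$ (harmonic mean)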
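A short sketch of the three combination rules follows. The function name `combine` is hypothetical; the rule numbers match the option values listed above, and the same function applies equally to interaction parameters.

```python
import math

def combine(p_ii, p_jj, rule=1):
    """Combine two like-pair parameters into a cross parameter."""
    if rule == 1:   # arithmetic mean
        return 0.5 * (p_ii + p_jj)
    if rule == 2:   # geometric mean
        return math.sqrt(p_ii * p_jj)
    if rule == 3:   # harmonic mean
        return 2.0 / (1.0 / p_ii + 1.0 / p_jj)
    raise ValueError(f"unknown combination rule: {rule}")

for rule in (1, 2, 3):
    print(rule, combine(2.0, 8.0, rule))
```

For unequal inputs the arithmetic mean is the largest and the harmonic mean the smallest, so the choice of rule systematically shifts all cross terms that lack an explicit override.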
Analogous to SIGRULE, this keyword defines the combination rule for interaction parameters of Lennard-Jones potentials. The same options are available and the same caveats apply with respect to overrides in the parameter file.
SC_ATTLJ
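This keyword allows the user to specify the linear scaling factor controlling the strength of the dispersive (van der Waals) interactions defined as:
$$E_{\mathrm{ATTLJ}} = -c_{\mathrm{ATTLJ}} \cdot 4.0 \sum\sum_{i,j} \varepsilon_{ij}\, f_{1\text{-}4,ij} \cdot \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}$$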
Here, the σij and εij are the size and interaction parameters for atom pair i,j, f1-4,ij are potential 1-4 fudge factors (see FUDGE_ST_14) that generally will be unity, rij is the interatomic distance, and the (double) sum runs over all interacting pairs of atoms. Together with an inverse power potential with scaling factor 1.0 and exponent 12 (see SC_IPP), the canonical Lennard-Jones potential is constructed if the scaling factor, cATTLJ, is set to unity.
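As a check on how the two scaled terms combine, the following sketch evaluates a single atom pair. The function name is hypothetical and nothing here reflects CAMPARI's internal code. With both scaling factors at unity and an exponent of 12, the minimum lies at 2^(1/6)·σ with depth -ε, as expected for the canonical Lennard-Jones potential.

```python
def lj_pair(r, sigma, eps, c_ipp=1.0, c_attlj=1.0, t=12, f14=1.0):
    """Repulsive IPP term plus attractive dispersion term for one atom pair."""
    x = sigma / r
    e_ipp = c_ipp * 4.0 * eps * f14 * x ** t
    e_attlj = -c_attlj * 4.0 * eps * f14 * x ** 6
    return e_ipp + e_attlj

sigma, eps = 3.0, 0.2
r_min = 2.0 ** (1.0 / 6.0) * sigma
print(lj_pair(sigma, sigma, eps))   # zero crossing at r = sigma
print(lj_pair(r_min, sigma, eps))   # well depth of -eps at the minimum
```

Setting `c_attlj=0.0` leaves a purely repulsive potential, which corresponds to using SC_IPP without SC_ATTLJ.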
This very important keyword controls the exclusion rules for short-range interactions of the excluded volume and dispersion types (see SC_IPP, SC_ATTLJ, and SC_WCA). For Monte Carlo or torsional dynamics simulations assuming rigid geometries, the computation of spurious (constant) LJ interactions is inefficient. Conversely, in Cartesian sampling, bonded interactions are almost always parametrized with all 1-4, and certainly with all 1-5-interactions in place. The latter refer to intramolecular atom pairs separated by either exactly three (1-4) or four (1-5) bonds. The ABSINTH implicit solvation model, which is one of the core features of CAMPARI, was parametrized with a reduced interaction model. Hence, this keyword allows the following choices:
- Consider only interactions which are not rigorously or effectively frozen when using internal coordinate space sampling. This setting for example excludes all interactions within aromatic rings. As for determining 1-4-interactions, the rules outlined under MODE_14 apply.
- Consider all interactions separated by at least three bonds to be valid. This is the default setting for molecular mechanics force fields. Note, however, that many of these interactions are quasi-rigid and that their computation is somewhat nonsensical even in a full Cartesian description. Also note that due to the inherent assumption that every bond is rotatable the setting for MODE_14 does not matter if INTERMODEL is set to 2. All atoms separated by exactly three bonds will be considered 1-4. It is important to point out that the setting chosen for INTERMODEL affects the setting for ELECMODEL as well (see ELECMODEL).
- The GROMOS force field uses a very specific set of non-bonded exclusions which is supported by choosing this option for INTERMODEL. It is essentially a weakened version of the first (sane) option. Note that to reproduce the GROMOS force field exactly, ELECMODEL (which remains an independent setting) has to be set to 2 and INTERMODEL to 3.
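The purely topological rule of the second option (at least three bonds apart is valid, exactly three bonds is 1-4) can be sketched with a breadth-first search over the bond graph. This is an illustration only: the function names are hypothetical, a single molecule is assumed, and the frozen-interaction screening of the first option and the GROMOS-specific exclusions of the third are not modeled.

```python
from collections import deque

def bond_separations(n_atoms, bonds):
    """Return {(i, j): bonds on the shortest path} for all i < j in one molecule."""
    neighbors = [[] for _ in range(n_atoms)]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    separations = {}
    for start in range(n_atoms):
        depth = {start: 0}
        queue = deque([start])
        while queue:  # breadth-first search from this atom
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in depth:
                    depth[v] = depth[u] + 1
                    queue.append(v)
        for other, d in depth.items():
            if start < other:
                separations[(start, other)] = d
    return separations

def classify_pairs(n_atoms, bonds):
    """Sort atom pairs into excluded (< 3 bonds), 1-4 (== 3), and beyond (> 3)."""
    classes = {"excluded": [], "one_four": [], "beyond": []}
    for pair, d in sorted(bond_separations(n_atoms, bonds).items()):
        key = "excluded" if d < 3 else "one_four" if d == 3 else "beyond"
        classes[key].append(pair)
    return classes

# Carbon skeleton of n-pentane, atoms numbered 0..4 along the chain.
pentane_carbons = classify_pairs(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
print(pentane_carbons)
```

Pairs in the `one_four` class are the ones to which fudge factors such as FUDGE_ST_14 would apply.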
This keyword can be used to provide the location and name of an input file that allows reassigning the size exclusion and dispersion parameters used in describing generic short-range potentials of the Lennard-Jones (see SC_ATTLJ and SC_IPP) or WCA types. The parameter file that CAMPARI parses will contain atom entries that specify general atom types. These types have associated with them entries of the contact and epsilon types specifying the Lennard-Jones σij and εij parameters (see equations provided with scale factor keywords). Within the list of biotypes, each biotype is assigned an atom type, and the patch functionality described here allows the user to change this to a different atom type for a specific instance of a biotype. Note that the reassignment is restricted to the Lennard-Jones parameters, but excludes other atomic parameters specified by atom types such as mass, proton number, description, or valence. Conversely, parameters derived from Lennard-Jones parameters are altered. This is particularly important for the derived atomic radii and volumes used in the continuum solvation model and analysis. If those parameters are meant to be left unchanged or set to yet another set of values, either the radius facility of the parameter file must be employed (if it is not already in use for the original atom type in question), or a patch of atomic radii must be applied in addition. Because size exclusion and dispersion parameters rely on combination rules and/or many overrides for special cases, it can be tedious to patch them. This is because a patch will often require the user to define a new atom type, which, for example, for the GROMOS force fields can be a lot of work. Some more details are given elsewhere.
SYBYLLJMAP
This keyword can be used to provide the location and name of an input file that allows the user to override the automatically determined map of Sybyl (mol2) atom types to the atom type definitions in the chosen parameter file. This is currently relevant only for output files used for validation/debugging (see elsewhere) and in the context of a small molecule screen where a reliance on patch files is not practical. The mol2 input file must contain atom types according to the Sybyl standard for any inference to be working properly. These types contain information about element and hybridization but do not contain Lennard-Jones parameters. This is why a map is needed. Normally, CAMPARI will scan the existing atom types using valence and nucleus number information in particular to find appropriate matches. This may not work well with all parameter files. In particular, the automatic detection uses three hierarchical searches with increasingly relaxed matching criteria. However, the correct nucleus number is always required and element mismatches are thereby impossible. To offer both more control and more flexibility, this keyword exists. The details regarding atom types and the file format are provided elsewhere.
SC_EXTRA
This (largely obsolete) keyword specifies a linear scaling factor for certain structural correction potentials. Assuming the typical set of torsional space constraints (see CARTINT), these are applied to rotatable bonds with electronic effects which cannot be captured by atomic pairwise contributions. In all applications, this keyword should be left at 0.0 and instead SC_BONDED_T should be set to 1.0. The covered terms consist of:
- Secondary amides: The rotation around the C-N bond is hindered due to the partial double-bond character present in amides. Corrections are therefore applied to residues which have an ω-angle (all non-N-terminal peptide residues excluding NH2 as well as the secondary amides NMF and NMA → sequence input). These keep the peptide bond roughly planar while allowing for cis/trans-isomerization and increased overall flexibility. The potentials are directly ported from OPLS-AA.
- Phenolic polar hydrogens: The rotation around the C-O bond in phenols is hindered due to its partial double bond character and in-plane arrangements of the attached hydrogen are favored. Corrections are applied to tyrosine (TYR) and p-cresol (PCR). These keep the polar hydrogen in its favored position. The potential is not overly stiff so that out-of-plane arrangements will be populated as well. The parameters are again ported directly from OPLS-AA.
This keyword gives the linear scaling factor for all bond length potentials. Their usage is permissible in all simulations but not meaningful unless bond lengths are actually allowed to vary, i.e., typically unless sampling happens fully in Cartesian degrees of freedom (see CARTINT). It is important to remember, however, that even in rigid-body / torsional space simulations, specific move types and systems will require setting this to unity (so we recommend it throughout). For bond length potentials, the only such exceptions are crosslinked molecules (see CRLK_MODE). Note that the parameter file has to provide support to be able to use this energy term (see PARAMETERS for details), and that simulations relying on those terms will otherwise fail, crash, or produce nonsensical results. Use GUESS_BONDED to circumvent those issues for incomplete parameter files, for runs with unsupported residues, and for small molecule screens.
SC_BONDED_A
Similar to SC_BONDED_B for all bond angle potentials. Bond angle potentials (see PARAMETERS for details) matter for sampling in Cartesian space (see CARTINT), for crosslinked molecules (see CRLK_MODE), and for the sampling of five-membered, flexible rings (see PKRFREQ and SUGARFREQ). The coordinate derivatives for bond angles diverge at the extreme values of both 0° and 180°. This means that care must be taken in setting up the Z-matrix such that no terms are included that would explicitly demand these values. In other software, this is sometimes overcome by the use of dummy atoms. In CAMPARI, this is unlikely to be problematic in Monte Carlo simulations. In dynamics, forces are buffered to avoid program crashes due to floating point errors, but the actual values are no longer meaningful. This issue is primarily relevant when modifying the code or when simulating unsupported residues, for which the Z-matrix is inferred from input (see elsewhere for details).
SC_BONDED_I
Similar to SC_BONDED_A for all improper dihedral angle potentials.
SC_BONDED_T
Similar to SC_BONDED_B for all torsional potentials. Note that these do in fact encompass degrees of freedom sampled in all types of simulations supported within CAMPARI and hence are always relevant. As alluded to above, torsional potentials can be easily set up to cover the same correction terms as the ones applied within SC_EXTRA. If that is the case, we therefore recommend not using SC_EXTRA (otherwise energy terms will in fact be applied twice, which is effectively scaling up those torsions; in such a case, CAMPARI produces an appropriate warning as well).
SC_BONDED_M
Similar to SC_BONDED_B for all CMAP potentials. These grid-based correction potentials are part of the CHARMM force field and explained in PARAMETERS. This keyword specifies the "outside" scaling factor. Note that CMAP corrections can theoretically be relevant for all possible simulations of biopolymers within CAMPARI since they act on consecutive dihedral angles. The default CMAP corrections from CHARMM only apply to polypeptides, however, and are only contained within the reference CHARMM parameter file.
IMPROPER_CONV
If improper dihedral potentials are in use (→ SC_BONDED_I), this very specialized keyword can be used to force a reinterpretation of the input sequence for the assignment of improper dihedral angle potentials to bonded types (see elsewhere). When set to 2, this keyword forces CAMPARI to switch the meaning of the first and third specified bonded type when it comes to energy and force evaluations. This allows a more or less exact match to the convention used in the AMBER set of force fields (and by extension: in OPLS-AA). For any other value specified, CAMPARI will use the CAMPARI-native convention (that is the same as in the CHARMM and GROMOS force fields).
CMAPORDER
If CMAP corrections are used (→ SC_BONDED_M), this keyword sets the interpolation order for cardinal splines (assuming those are chosen through parameter input → PARAMETERS). A higher spline order will yield a smoother surface. Since the splines are non-interpolating, however, rapidly varying or coarsely tabulated functions may not be well approximated in such cases. The only interpolating cardinal B spline is the linear one which requires a choice of 2 for this keyword. This keyword is irrelevant should bicubic splines be chosen.
CMAPDIR
If CMAP corrections are used (→ SC_BONDED_M), this keyword lets the user specify the absolute path of the directory in which the CMAP files are to be found (by default they are in the data/-subdirectory of the main distribution tree).
BPATCHFILE
This keyword can be used to provide the location and name of an input file that allows reassigning or adding bonded potential terms (see bond length potentials, bond angle potentials, improper dihedral angle potentials, torsional potentials, and CMAP potentials). At the level of the parameter file that CAMPARI parses to generate default assignments based on biotypes (see elsewhere), there are limitations to how finely the system can be parsed. For instance, it is technically not possible to have different bond length potentials acting on the N→Cα bonds of two non-terminal glycine residues (because biotypes are identical). Of course, even providing bonded parameter assignments exactly at biotype resolution would generally be inordinately complicated, which is the reason for grouping biotypes into so-called bonded types in the parameter file. In cases where specific alterations to a given system are desired, the patch functionality provided by this input file will generally be the most convenient (and often the only) route to take. For stiff terms, CAMPARI can also guess values based on initial geometries. Applied patches to bonded interactions are always printed to log-output. In order to diagnose their correctness more easily, it is recommended to use the report functionality for bonded potential terms. Note that the most critical limitation is that extra or alternative bonded potentials can only be applied to such internal coordinates that are eligible for default assignments themselves, e.g., it is not possible to apply a bond angle potential to atoms a-b-c if a is not covalently bound to b or if b is not covalently bound to c.
GUESS_BONDED
This keyword lets the user instruct CAMPARI to construct a set of bonded parameters from the most basic information available, which are the default molecular geometries (usually from high-resolution crystal structures) for residues natively supported by CAMPARI or structural input for unsupported residues (see documentation on sequence input for details). Options are as follows:
- No potentials are guessed. This means that the only bonded potentials available are those defined in the parameter file and mapped by matching the entries on bonded types to the available potentials, e.g., bond length potentials, as well as those potentials defined in a corresponding patch file, which also rely on the parameter file. Missing bonded interactions can make it impossible or meaningless to run simulations in Cartesian space.
- CAMPARI will guess missing harmonic potentials (type 1) for all bond lengths and angles defined by molecular topology. The equilibrium positions are defined as mentioned above. The force constants are flat, which is obviously a crude approximation, and evaluate to 300 kcal·mol⁻¹·Å⁻² for bond lengths and 80 kcal·mol⁻¹·rad⁻² for bond angles, respectively. It is not possible to add potentials to doublets or triplets of atoms that are not topologically derived. This is important for unsupported residues where a suboptimal Z matrix may be constructed because of bad atom input order.
- This option implies the previous option. In addition, eligible improper dihedral angles (see documentation on bonded types) are given a harmonic potential (type 2), and the strength is set to 40 kcal·mol⁻¹·rad⁻². Note that the functional form here includes a factor of 1/2 not present for bond angles.
- In addition to what is done for options 1 and 2, CAMPARI will also guess the chemistry of the underlying systems purely based on atom types, connectivity, and geometry information. This procedure is not aware of the true chemistry of any supported systems being simulated as it is largely meant for dealing with unsupported residues. It will assign harmonic or twofold dihedral angle potentials to nonterminal bonds that are expected to display electronic limitations to rotation around them. This includes aromatic ring bonds, carbon-carbon double bonds, amides, esters, and many other double-bonded or conjugated systems. The planarity of sp2-hybridized atoms participating in such groups is adjudicated by improper dihedral angles and a tolerance setting (see keyword PLANAR_TOLS). The recognition routine provides information about all relevant bonds, which allows an assessment of the quality of the parsing. This assignment of torsional potentials, which does not assign any threefold potentials to bonds with only weak rotational barriers (e.g., carbon-carbon single bonds), is consistent with the ABSINTH force field paradigm, which implies using an internal coordinates space. All potentials assigned in this manner are reported if BONDREPORT is used.
While these potentials are obviously too crude to study problems requiring very high resolution at the local geometry level, they can be very useful to quickly enable Cartesian space simulations of unsupported systems where often calibration data are missing (or unreliable) to begin with. In addition, option 3 above is helpful when trying to simulate unsupported residues in internal coordinate space. The guessed potentials are written to log output, and parsing this with a script can help in creating templates for exhaustive patch files, which are tedious to create from scratch. Note that the source data do not come from structural input for supported residues, which means that initial structures deviating dramatically from the assumed local geometries can be subject to large forces. Conversely, since the source data for unsupported residues must come from structural input, it is critical that the input conformations allow the parser to make the right inference. There are many ambiguous cases in terms of electronic conjugation (e.g., in substituted biphenyls) or conformational preferences (Z vs. E). In addition, the presence of flexible rings makes it harder for the force field to be accurate because their conformational preferences are notoriously difficult to describe and determine experimentally. If these input structures are present in odd conformers, CAMPARI's inference will likely be odd as well and either try to preserve the status quo or not put any potential. This is because CAMPARI will no longer understand the bond in question and is unwilling to speculate.
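To give a feeling for the stiffness of the guessed terms, the sketch below evaluates them for small distortions. The force constants are the flat values quoted above. The functional forms are assumptions: bond lengths and angles are taken as k·(x - x0)² without a prefactor (for bond lengths this is inferred by analogy with the remark on bond angles), impropers as 0.5·k·(φ - φ0)², and no periodic wrapping of angles is applied. The function names are hypothetical.

```python
import math

K_BOND = 300.0      # kcal/mol/Å^2, flat guess for bond lengths
K_ANGLE = 80.0      # kcal/mol/rad^2, flat guess for bond angles
K_IMPROPER = 40.0   # kcal/mol/rad^2, flat guess for improper dihedrals

def guessed_bond_energy(r, r0):
    """Assumed form k * (r - r0)^2 with lengths in Å."""
    return K_BOND * (r - r0) ** 2

def guessed_angle_energy(theta_deg, theta0_deg):
    """Assumed form k * (theta - theta0)^2 with the difference in radians."""
    d = math.radians(theta_deg - theta0_deg)
    return K_ANGLE * d * d

def guessed_improper_energy(phi_deg, phi0_deg):
    """Assumed form 0.5 * k * (phi - phi0)^2; includes the factor of 1/2."""
    d = math.radians(phi_deg - phi0_deg)
    return 0.5 * K_IMPROPER * d * d

print(guessed_bond_energy(1.63, 1.53))        # bond stretched by 0.1 Å
print(guessed_angle_energy(119.5, 109.5))     # angle opened by 10 degrees
print(guessed_improper_energy(10.0, 0.0))     # 10 degrees out of plane
```

Because the equilibrium values come from the reference or input geometry, an initial structure that deviates strongly from it starts out with correspondingly large energies and forces.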
If the general chemistry parser is used to automatically assign some dihedral angle potentials to a system, this keyword sets two tolerances (in degrees) for impropers measured at sp2-hybridized atoms. The first is for atoms within rings, and the second is for atoms not in rings. If the measured improper exceeds the relevant tolerance, the atom is considered as something other than sp2-hybridized, which usually will lead to a lack of classification.
MOL2BONDMODE
If a small molecule screen is performed and bonded potentials are in use, this keyword functions exactly like keyword GUESS_BONDED in controlling which types of potentials to guess from input and/or considerations of general chemistry. This refers to the bonded terms for the screened molecules alone. Since the source data for the screened molecules must come from the input conformers, it is critical that these conformers allow the parser to make the right inference. For example, a bond between sp2-hybridized atoms is unlikely to receive a bonded potential favoring planarity if the input conformer has an angle between the two planes exceeding the relevant tolerance setting. This is especially tricky if multiple conformers of the same or very similar molecules are present in the input file (inconsistencies here could split what was meant to be the same molecule into two or more different ones). To be able to diagnose errors of this type, the output of the bond parser can be instructed to be very verbose. To circumvent problems of this type conceptually, keyword MOL2RESPECTBOND allows deferring the inference of the type of chemical bond to the input file as a custom (CAMPARI-specific) additional column in the @<TRIPOS>BOND section.
BONDREPORT
This report flag allows the user to request a summary of the bonded potentials found and not found during processing of the parameter file. This is primarily useful as a sanity and debugging tool for creating parameter files. Note that missing but necessary parameters (necessary ones are all bond length and angle potentials if and only if CARTINT is 2) as well as guessed parameters (see GUESS_BONDED) are always reported upon.
SC_WCA
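Mutually exclusive to the use of the Lennard-Jones potential, CAMPARI allows using the extended Weeks-Chandler-Andersen (WCA) potential which is defined as:
$$E_{\mathrm{WCA}} = 4.0 \cdot c_{\mathrm{WCA}} \sum\sum_{i,j} \varepsilon_{ij}\, f_{1\text{-}4,ij} \cdot \left[ \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6} + 0.25 \cdot (1.0 - c_2) \right] \quad \text{if } r_{ij} < \sigma_{ij} \cdot 2^{1/6}$$
$$E_{\mathrm{WCA}} = c_2 \cdot c_{\mathrm{WCA}} \sum\sum_{i,j} \varepsilon_{ij}\, f_{1\text{-}4,ij} \cdot \left[ 0.5 \cdot \cos\left( c_3 \cdot \left(\frac{r_{ij}}{\sigma_{ij}}\right)^{2} + c_4 \right) - 0.5 \right] \quad \text{if } r_{ij} < \sigma_{ij} \cdot c_1$$
$$E_{\mathrm{WCA}} = 0.0 \quad \text{else}$$
$$c_3 = \pi \cdot \left( c_1^{2} - 2^{1/3} \right)^{-1}$$
$$c_4 = \pi - c_3 \cdot 2^{1/3}$$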
Here, the size, interaction, and fudge parameters are used as defined before. c1 is the interaction cutoff (in units of σij) while c2 is the depth of the attractive well to be spliced in (in units of εij). c1 and c2 can be set by keywords CUT_WCA and ATT_WCA, respectively. The potential provides a continuous function mimicking a LJ potential in which the dispersive term can be spliced in without shifting the position of the minimum. cWCA denotes the linear scaling factor specified by this keyword.
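The piecewise definition can be checked numerically with a single-pair sketch. The function name is hypothetical, and c1 (cutoff, see CUT_WCA) and c2 (well depth, see ATT_WCA) must be passed explicitly because no defaults are assumed here. The example confirms that the two branches meet at -c2·ε at the minimum distance and that the potential reaches zero at the cutoff.

```python
import math

def wca_pair(r, sigma, eps, *, c1, c2, c_wca=1.0, f14=1.0):
    """Extended WCA energy for one atom pair, following the three branches."""
    r_min = sigma * 2.0 ** (1.0 / 6.0)
    if r < r_min:
        x6 = (sigma / r) ** 6
        return 4.0 * c_wca * eps * f14 * (x6 * x6 - x6 + 0.25 * (1.0 - c2))
    if r < sigma * c1:
        c3 = math.pi / (c1 ** 2 - 2.0 ** (1.0 / 3.0))
        c4 = math.pi - c3 * 2.0 ** (1.0 / 3.0)
        return c2 * c_wca * eps * f14 * (0.5 * math.cos(c3 * (r / sigma) ** 2 + c4) - 0.5)
    return 0.0

sigma, eps, c1, c2 = 3.0, 0.2, 1.5, 0.5
r_min = sigma * 2.0 ** (1.0 / 6.0)
for r in (r_min - 1e-9, r_min + 1e-9, sigma * c1 - 1e-9, sigma * c1):
    print(f"r = {r:.9f}  E = {wca_pair(r, sigma, eps, c1=c1, c2=c2):.9f}")
```

With `c2=0.0` the attractive branch vanishes and the classic, purely repulsive WCA form remains.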
This allows the user to specify the well depth (positive number) for the attractive part of the WCA potential in units of εij (parameter c2 under SC_WCA).
CUT_WCA
This allows the user to specify the cutoff value for the extended WCA potential in units of σij (parameter c1 under SC_WCA). Note that the minimum allowed choice here is 1.5.
VDWREPORT
This keyword is a simple logical deciding whether or not to print out a summary of the Lennard-Jones (size exclusion and dispersion) parameters, i.e., to report the base values (σii and εii), the combination rules, and in particular all "special" values which overwrite the default combination rule-derived result.
INTERREPORT
Mostly for debugging purposes, this simple logical allows the user to demand a summary of short-range interactions. Naturally, this output can be very large and the keyword should only be used when absolutely needed, for example to understand the settings for MODE_14 and FUDGE_ST_14.
SC_POLAR
CAMPARI only supports fixed-charge atom-based electrostatic interactions which work by defining partial charges for each atom and then writing the potential as:
$$E_{\mathrm{POLAR}} = c_{\mathrm{POLAR}} \cdot \sum\sum_{i,j} \left[ \frac{f_{1\text{-}4,C,ij} \cdot q_i q_j}{4.0 \pi \varepsilon_0 \cdot r_{ij}} \right]$$
Here, the qi are the atomic partial charges, f1-4,C,ij are potential 1-4 fudge factors (see FUDGE_EL_14) that generally will be unity, ε0 is the vacuum permittivity, rij is the interatomic distance, and the (double) sum runs over all eligible atom pairs (see ELECMODEL). cPOLAR is the linear scaling factor for all polar interactions specified by this keyword. Since electrostatic interactions are characterized by the potential to yield long-range effects (distance scaling ranges from $r^{-1}$ for monopole-monopole terms to $r^{-6}$ for dipole-dipole terms between molecules tumbling freely), the Coulombic term can employ a different cutoff in MC calculations (see below) than short-range potentials. The correct long-range treatment of electrostatic interactions is one of the most investigated areas in molecular simulations and the user is referred to current literature and keywords LREL_MC and LREL_MD for details. All required partial charges are read either through the parameter file or can be set by a dedicated patch.
Note that the functional form given above is only correct if no implicit solvation model is in use. In such a scenario, Coulomb interactions are usually modified by an extra term sij which can be a complicated function of interatomic distance and/or the positions of all nearby atoms. The reader is referred to Vitalis and Pappu for the exact functional forms used in the ABSINTH implicit solvation model.
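For the vacuum form given above (no implicit solvent, no long-range correction, no charge-group screening), the sum can be sketched as follows. The function name is hypothetical. The conversion to kcal/mol with charges in units of e and distances in Å is derived here from SI constants, and the resulting prefactor is an assumption that need not match the exact value used inside CAMPARI.

```python
import math
from itertools import combinations

# SI constants used to express e^2 / (4 pi eps0) in kcal·Å/mol.
E_CHARGE = 1.602176634e-19      # C
EPS0 = 8.8541878128e-12         # F/m
AVOGADRO = 6.02214076e23        # 1/mol
COULOMB_KCAL = E_CHARGE ** 2 * AVOGADRO / (4.0 * math.pi * EPS0 * 1.0e-10 * 4184.0)

def polar_energy(coords, charges, c_polar=1.0, fudge=None):
    """Sum c_POLAR * f_ij * q_i * q_j / (4 pi eps0 r_ij) over all atom pairs."""
    total = 0.0
    for i, j in combinations(range(len(coords)), 2):
        r = math.dist(coords[i], coords[j])
        f = 1.0 if fudge is None else fudge.get((i, j), 1.0)
        total += f * charges[i] * charges[j] / r
    return c_polar * COULOMB_KCAL * total

ion_pair = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(COULOMB_KCAL)
print(polar_energy(ion_pair, [1.0, -1.0]))
```

The slow 1/r decay visible here is the reason the Coulombic term may need a longer cutoff than the short-range potentials.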
This important keyword is somewhat analogous to INTERMODEL and allows the user to set the interaction model for electrostatic interactions:
- Depending on the setting for INTERMODEL, interactions are either screened for connectivity and frozen interactions are excluded (INTERMODEL is 1), or are purely considered based on the number of bonds separating two atoms (INTERMODEL is 2). In any case, partial charges interact without considerations of net neutrality (see below), which is problematic for short-range interactions. Consider for example the ω-bond in polypeptides and assume that CO and NH both form neutral groups supposed to indicate dipole moments. If INTERMODEL is 2 and ELECMODEL is 1, the interaction between O and H is considered (1-4) but none of the others as they are topologically too close. This leads to spurious (and very strong) Coulomb interactions between what essentially are fractional, net charges. This is an inherent weakness of the point charge model which is typically addressed by extensive co-parameterization of bonded parameters, 1-4-fudge factors, etc. (see FUDGE_EL_14).
- The partial charge set in the parameter file is
read and the assigned charges are screened for (generally) net neutral
charge groups. These charge groups are determined largely automatically and are
currently not patchable per se. The automatic charge group generation operates by trying
to group partial charges into groups of minimum size and spanning a minimum
number of covalent bonds satisfying a target net charge without isolating atoms that have
only one covalently bonded partner. The net charges serving as default targets
are derived from knowledge of every CAMPARI-supported residue and assumptions about
their titration states (if any). This means, for example, that a nonterminal lysine residue
will be processed by looking for a charge group with a net charge of +1 along with
as many net neutral groups as possible.
While CAMPARI does not allow grouping charges arbitrarily, there is a
dedicated patch that allows defining
a set of (arbitrary) target values for the net charges of charge groups in a given residue. This is
required to deal with charge sets that do not group at all, or to deal with
residues that contain multiple ionic moieties. For example, depending on the charge set
in use, one may want to partition free, zwitterionic alanine either as multiple
groups with +1, -1, and 0 charges, or simply as one or more net neutral groups.
The procedure is described in detail in the corresponding section in
Parameters. It can be controlled not only
by the aforementioned patch but also by
keyword POLTOL, which sets a general upper limit for the
accuracy with which the targets can be matched.
Resultant groups that are not well-defined charge groups according to CAMPARI's standards will be reported in log output.
With the groups in place, only interactions between those groups, for which all possible atom-atom pairs are separated by at least one significant degree of freedom, are computed. Interactions within a group are always excluded. What constitutes a significant degree of freedom is predetermined by the choice for INTERMODEL, and the reader is encouraged to read up on this if necessary. Essentially, INTERMODEL will define the maximum set of short-range interaction pairs that can also be considered for polar interactions. As an example, for the 6 net neutral CH units in benzene, if INTERMODEL is 1, no intramolecular polar interactions can be considered (the maximum set is empty). Conversely, if INTERMODEL is 2, several group-group interactions are now permissible (C1H-C4H, C2H-C5H, C3H-C6H). The list of retained atom-atom interactions is printed if ELECREPORT is turned on.
Depending on the charge set and on the choice for INTERMODEL, setting ELECMODEL to 2 can lead to a massive depletion of short-range electrostatics. The problems motivating this paradigm are of course not unknown in traditional force field development as evidenced by fudge factors, force field-specific exclusion rules (see, for example, choice 3 for INTERMODEL), or topology-specific nonbonded interaction parameters (see Parameters). There are cases where the exclusions incurred by this model are harmful. For example, if charges group poorly, the Coulombic repulsion between neighboring phosphate backbone units might be omitted, and this is what the next option (#3) is provided for.
- This is the same as the previous model (#2) except that a correction is introduced for the
leading term of any omitted short-range electrostatic interactions, viz., monopole-monopole
interactions. This correction was introduced formally in the context of the small
molecule extension of ABSINTH (see keyword MOL2SCREEN) but has since been
generalized to become part of the normal force field. It uses the same functional form as the monopole-monopole
corrections for choices 3 and 4 for LREL_MC and
LREL_MD, respectively. In words, it loops over the unique pairs of charge groups
within residues and over the pairs of charge groups in topologically adjacent residues, and finds those that carry a nonzero charge and
whose interatomic interactions are excluded. It then adds a single interaction mapping the full charge to the
two atoms closest to the respective centers of charge. If the ABSINTH model
is in use, all solvation considerations are dealt with at the atom level. In other words, the only group-based
consideration is the lumping of the formal net charge of the group onto a single atom.
The correction is conditional upon having such pairs of charge groups with nonzero charge whose interactions are excluded. With standard polypeptide residues and normal partial charges, the only case where this might become relevant is for terminal amino acids with charged sidechains, especially aspartate, and, most notably, for free amino acids. It might be highly relevant along the polynucleotide backbone, as mentioned above, but also in small molecules as explained in the reference publication. In all cases, the relevance hinges on both the partial charges and their processing by the grouping algorithm (see Parameters). For example, for free amino acids, if the ammonium and carboxylate functions are not put into separate groups, the correction does nothing since it "sees" only a single net-neutral group. If a system is simulated where choices 2 and 3 lead to a different model, it is likely that setting ELECMODEL to 2 causes systematic errors that are at least qualitatively corrected/improved by setting ELECMODEL to 3. Due to the short distances, the energy scales can be quite large, and the two atoms serving as centers of the monopole-monopole interaction should not be topologically very close (usually, they should be at least two bonds apart). This can be checked with the help of output file MONOPOLES.vmd. The added corrections are shown as atom-atom interactions when ELECREPORT is turned on.
- The charge groups are important for deciding how long-range electrostatic interactions between ionic groups are computed exactly (see options 1, 2, and 3 for LREL_MC and options 4 and 5 for LREL_MD).
- The charge groups are used as the basis for computing group-averaged screening factors for certain screening models in the ABSINTH framework (see options 1, 3, 5, and 7 for SCRMODEL).
One "flaw" in the biotype setup in CAMPARI (see PARAMETERS) is the fact that the two polar hydrogens on primary amides are treated as chemically equivalent, which - on a typical simulation timescale - they are not. Instead of creating yet more biotypes, this keyword simply allows adding a small polarization term to the partial charges on those hydrogens. The value specified will be added to the hydrogen cis to the oxygen (the nearby electronegative atom increases the partial positive charge) and subtracted from the trans-H so that the total charge of the pair is preserved. For example, if both hydrogens have a charge of +0.36, a specification of 0.05 here will yield charges of 0.41 (cis-O) and 0.31 (trans-O). It will be useful to track these changes using ELECREPORT. It is very important to note, however, that fundamentally a sampling algorithm may isomerize the amide bond and hence render the correction incorrect and - moreover - that reading in a structure may flip the two hydrogens to start out with (because of inconsistent numbering between two software packages). Hence, this keyword should be used only when absolutely necessary (and its sign may have to be flipped to achieve the desired effect).
This correction to primary amides is a specific example for the occasional need to overwrite partial charge parameters for atoms due to "biotype splitting". The more general approach provided by CAMPARI for this explicit purpose is to "patch" the partial charge set by a dedicated input file.
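The charge shift described above for primary amide hydrogens can be sketched in a few lines. This is only an illustration of the arithmetic; the function name is hypothetical and does not correspond to any CAMPARI routine.

```python
def polarize_amide_hydrogens(q_cis, q_trans, shift):
    """Move `shift` units of charge from the trans-H to the cis-H.

    Hypothetical helper for illustration only (not part of CAMPARI).
    The sum of the two charges is unchanged; a negative `shift`
    reverses the direction of the correction.
    """
    return q_cis + shift, q_trans - shift


q_cis, q_trans = polarize_amide_hydrogens(0.36, 0.36, 0.05)
print(round(q_cis, 2), round(q_trans, 2))  # prints: 0.41 0.31
```

Passing a negative value corresponds to the sign flip mentioned above for cases where the two hydrogens are numbered in the opposite order.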
If the polar potential is in use, this keyword can be used to provide the location and name of an input file that allows overriding some or all of the partial charge parameters CAMPARI obtains from the parameter file (see elsewhere). This can be required to match the exact standard given by a force field with a finer biotype parsing. Note that - by default - such corrections are error-prone and should only be used when absolutely needed. In any case, the user is recommended to use ELECREPORT for a detailed summary of final partial charges in the system.
DIPREPORT
This simple logical will - when turned on by the user - produce two summary files (see DIPOLE_GROUPS.vmd and MONOPOLES.vmd), which allow graphical assessment of the automatically determined charge groups. The former will visualize all charge groups in the system (not just the net neutral ones) by highlighting all atoms belonging to each group. The second will visualize the "center" atom of all groups carrying a net charge (the meaning of this is defined by the value for POLTOL). Note that - naturally - this option is not available if SC_POLAR is zero.
NCPATCHFILE
If the polar potential is in use, CAMPARI automatically determines charge groups, i.e., groups of atoms within a residue that are topologically close and whose partial charges sum up to zero or to an integer net charge. If LREL_MD is 4 or 5 and/or LREL_MC is 1, 2, or 3, this information is used to flag residues as carrying ionic groups, which leads to the computation of additional interactions even if residues are not in each other's neighbor lists. A residue is flagged if it contains at least one charge group with a total, absolute charge greater than a tolerance that is zero by default (increasing this tolerance also affects the charge group partitioning in other ways, see the corresponding section in Parameters).
This keyword allows the user to specify location and name of an optional input file that can perform two important tasks:
- It allows removal of the net charge flag for specific residues, thereby altering the overall interaction model (if the corresponding options for LREL_MD and/or LREL_MC are selected).
- It allows the manual specification of a set of target values for the total charges of charge groups to be identified. This is currently the only way to manually alter the charge group partitioning, and can be crucial when simulating unsupported residues and/or when dealing with charge sets that do not group naturally (such as those for nucleic acids in most common force fields).
If the polar potential is in use, CAMPARI automatically determines charge groups, i.e., groups of atoms within a residue that are topologically close and whose partial charges sum up to zero or (normally) to an integer net charge. This keyword allows a simple way to rescale charges and associated reference free energies of solvation (if SC_IMPSOLV is positive) for those groups carrying a net charge (not zero). This patch can also be achieved by a (tedious) combination of a charge patch and a solvation group patch.
This patch is relevant irrespective of the long-range and short-range interaction models. It is applied after possible charge and solvation group patches but before the patch modifying the charge flag status of residues. As such, the value for POLTOL may matter. The scale factors in the corresponding input file are interpreted to scale the net energy accordingly, i.e., the square root of each factor is applied to the partial charges, and the factor itself is applied to the reference free energies of solvation. This keyword should be used with utmost caution as the resultant systems effectively contain fractional charges and, often, an overall charge imbalance.
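The scaling convention can be illustrated with a short sketch. It assumes the reading given above, namely that charges are multiplied by the square root of the scale factor (so that the Coulomb energy between two equally scaled groups changes by the factor itself) while the reference free energy of solvation is multiplied by the factor directly. The function name and the numbers are hypothetical.

```python
import math


def scale_charged_group(charges, fos_ref, factor):
    """Rescale one net-charged group (illustrative assumption only).

    charges : partial charges of the atoms in the group
    fos_ref : reference free energy of solvation of the group
    factor  : energy scale factor from the patch file
    """
    root = math.sqrt(factor)
    scaled_charges = [root * q for q in charges]
    return scaled_charges, factor * fos_ref


# Hypothetical carboxylate-like group with net charge -1
charges, fos = scale_charged_group([-0.8, -0.8, 0.6], -100.0, 0.81)
print(round(sum(charges), 6), round(fos, 6))  # prints: -0.9 -81.0
```

The example also shows why caution is advised: the scaled group carries a fractional net charge of -0.9.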
If the polar potential is in use, CAMPARI automatically determines charge groups, i.e., groups of atoms within a residue that are topologically close and whose partial charges sum up to zero or to an integer net charge. In this context, this keyword has two functions.
First, as described above, these net charge values can be patched. This may, for example, be used to obtain a grouping into approximately neutral groups for partial charge sets that include complex polarization patterns. In order to avoid having the resultant groups cause CAMPARI to flag the corresponding residue as carrying a net charge (i.e., they are treated like ions), this keyword allows the user to define an increased tolerance for what is considered "approximately neutral". This is relevant because treatment of residues as ions can have substantial implications for the interaction model, in particular in terms of computational efficiency (see LREL_MC and LREL_MD). Note that this keyword operates at the charge group level, whereas patches via NCPATCHFILE can (also) disable the ionic flag status of residues. Therefore, the two offer different levels of control. The numerical value specified here (in units of e) is compared to the total charge of a given charge group.
As an example, consider a terminal nucleotide residue carrying a 5'-phosphate with an integer negative charge. Suppose that the partial charges on the phosphate linker to the next residue are such that - in addition to the terminal phosphate - this leaves a charge group with a small, fractional charge. In this case, the residue-level patch could only remove the net charge flag for the entire residue (probably undesirable), whereas the tolerance setting described here could specifically eliminate the group with the fractional charge from the list of ionic groups. The default tolerance is set to be zero within reasonable numerical precision. This first function is relevant only if LREL_MD is 4 or 5 and/or LREL_MC is 1, 2, or 3.
The second use of this keyword is in the charge group parser, which might be particularly important in a small molecule screen and for natively supported species that have partial charges that group poorly (most notably, nucleic acids in almost all common force fields). The tolerance is applied directly to the charge group identification, which is described in the corresponding section in Parameters and, for the small molecule screen, briefly for keyword MOL2POLMODE. This procedure has changed from V3 to V4, and for some special cases, a prior poor assignment in V3 is now improved, which has the side effect that polar energies are no longer the same at least as long as ELECMODEL is not 1 and/or a group-consistent screening model is in use for ABSINTH.
If a small molecule screen is performed and the Coulomb potential is in use, this keyword instructs CAMPARI how to deal with input charges in the mol2 input file. Options are as follows:
- Leave them exactly as is. This implies potentially setting UNSAFE to 1 to override issues with lacking exact integer charges.
- Correct them at the level of the entire molecule (the deviation from the nearest integer charge divided by the number of atoms with nonzero partial charge is subtracted from all atoms with nonzero partial charge).
- Correct them at the level of charge groups. Charge groups are determined in largely the same way as that described in the corresponding section in Parameters, the most important differences being that no hard-coded charge group targets exist (instead, they are inferred from a rough chemical fingerprint and the total charge), that no fractional charge targets are supported, and that the merging step is currently absent. Bad assignments are of course possible, in particular for delocalized ring systems. Some partial charge sets found in small molecule databases have no regularization whatsoever (not even at the level of the entire molecule). Note that charge groups matter if ELECMODEL is 2 and/or if group-consistent screening models are used within the ABSINTH model. In addition, monopole group assignments matter if LREL_MD is 4 or 5 and/or LREL_MC is 1, 2, or 3.
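The molecule-level correction (the second option above) amounts to the following arithmetic. This is a minimal sketch based only on the description given here; the function name is hypothetical, and details such as the handling of totals exactly halfway between two integers are assumptions.

```python
def correct_to_integer_charge(charges):
    """Shift all nonzero partial charges so the total becomes an integer.

    Illustrative sketch, not CAMPARI code. Atoms with exactly zero
    charge are left untouched. Note that Python's round() resolves
    exact .5 ties to the even integer, which is an arbitrary choice here.
    """
    n_nonzero = sum(1 for q in charges if q != 0.0)
    if n_nonzero == 0:
        return list(charges)
    total = sum(charges)
    deviation = total - round(total)      # distance from nearest integer
    shift = deviation / n_nonzero         # spread evenly over charged atoms
    return [q - shift if q != 0.0 else q for q in charges]


fixed = correct_to_integer_charge([0.5, -0.3, 0.0, -0.1])
print(round(abs(sum(fixed)), 9))  # prints: 0.0
```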
This keyword provides a flat 1-4 scaling factor for interatomic, non-bonded interactions of specific types. 1-4 interactions are defined according to the choice for MODE_14 and depend on the setting for INTERMODEL as well. The value for FUDGE_ST_14 is applied to all steric and dispersion potentials, i.e., the potentials turned on by SC_IPP, SC_ATTLJ, and SC_WCA. The only other 1-4-scaled interaction potential is the electrostatic one for which a separate 1-4-scaling factor is in use (see FUDGE_EL_14). All other pairwise, non-bonded potentials are never subjected to 1-4-corrections (see for example SC_TABUL or SC_DREST). Note that the value for FUDGE_ST_14 is applied in addition to corrections applied at the parameter level by providing 1-4-specific σ- and ε-parameters in the parameter file (see PARAMETERS).
FUDGE_EL_14
Similar to FUDGE_ST_14, this keyword specifies a scale factor for 1-4-interactions. Here, the provided value will be applied specifically to electrostatic interactions (see SC_POLAR) only. If ELECMODEL is set to 2, any charge group interaction will be scaled as a whole by this factor, as soon as any of the possible atom pairs fulfill the 1-4-criterion (see MODE_14).
MODE_14
This keyword's relevance is limited to the case in which INTERMODEL is 1. Then, this essentially defines what a 1-4-interaction is, specifically whether anything separated by exactly three bonds or by exactly one relevant rotatable bond should be considered 1-4:
- Only two interacting atoms separated by exactly three bonds are treated as 1-4.
- Two interacting atoms separated by exactly one relevant, freely rotatable bond are always treated as 1-4.
Take a phenylalanine residue and consider the CA-CB-CG-CD1 stretch (from Cα to one of the Cδ). This is exactly three bonds and the bond CB-CG is the only relevant rotatable one (CA-CB is also rotatable but irrelevant, since CA lies on the axis, while CG-CD1 is not rotatable). CA and CD1 are treated as 1-4 in both modes. Now consider the CA-CB-CG-CD1-CE1 stretch. These are four bonds and CA and CE1 are not considered 1-4 in mode 1. However, there is still only one relevant rotatable bond in between (CB-CG, since CG-CD1 is rigid), so CA and CE1 are in fact treated as 1-4 in mode 2.
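The two criteria can be contrasted with a small sketch that encodes the phenylalanine example. The bond counts are taken from the text above; the function and its arguments are hypothetical and say nothing about how CAMPARI determines these counts internally.

```python
def is_one_four(n_bonds, n_relevant_rotatable, mode):
    """Classify an atom pair as 1-4 under the two modes described above.

    n_bonds              : number of covalent bonds separating the atoms
    n_relevant_rotatable : number of relevant, freely rotatable bonds between them
    mode                 : 1 (bond count) or 2 (rotatable-bond count)
    """
    if mode == 1:
        return n_bonds == 3
    if mode == 2:
        return n_relevant_rotatable == 1
    raise ValueError("mode must be 1 or 2")


# (bonds, relevant rotatable bonds) for the two phenylalanine pairs
pairs = {"CA-CD1": (3, 1), "CA-CE1": (4, 1)}
for name, (n_bonds, n_rot) in pairs.items():
    print(name, is_one_four(n_bonds, n_rot, 1), is_one_four(n_bonds, n_rot, 2))
# prints:
# CA-CD1 True True
# CA-CE1 False True
```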
Note that CAMPARI allows specific modifications of 1-4-interactions, either through the use of fudge factors (see FUDGE_ST_14 and FUDGE_EL_14) or through specific parameters provided in the parameter file. If neither of those indicates a deviation from normal interaction rules, then this keyword becomes irrelevant as well.
If the polar potential is in use, this simple logical allows the user to request a summary for the close-range electrostatic interactions in the system. Similarly to INTERREPORT, this keyword mostly serves debugging purposes and should only be needed/required to understand the details of the short-range interaction setup.
SC_IMPSOLV
This keyword serves two functions. First, as a logical it enables the ABSINTH implicit solvent model, i.e., it will compute the direct mean-field interaction (DMFI) of each solute with the continuum and enable screening of polar interactions (if turned on → SC_POLAR). For the former (the DMFI) it simultaneously serves as the linear scaling factor. Note that the amount of screening of polar interactions is not dependent on this keyword and solely determined by other parameters (in particular IMPDIEL). The DMFI is defined as:
\[ E_{\mathrm{DMFI}} = c_{\mathrm{DMFI}} \cdot \sum_k \Delta G_{\mathrm{FES},k} \left[ \sum_i \zeta_{ik} \cdot \upsilon_{ik} \right] \]
Here, υik is the solvation state of the ith atom in the kth solvation group and ζik is its weight factor. The solvation states are computed by CAMPARI and vary throughout the simulation whereas the weight factors are constant. The reference free energies of solvation for each solvation group (ΔGFES,k) are provided through the parameter file and are constant as well (for the latter see PARAMETERS). Note that the computation of the DMFI given the υik is a computation of negligible cost and that CAMPARI obtains the υik while computing short-range non-bonded interactions at a moderate additional cost. This implies that the ABSINTH implicit solvation model is speed-limited almost exclusively by the complications incurred by the screening of polar interactions. The user is referred to Vitalis and Pappu for further details (reference).
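Given the solvation states, the DMFI sum defined above is indeed cheap to evaluate, as the following sketch illustrates. The data layout (a list of groups, each holding its reference free energy and per-atom weight/state pairs) and all numbers are hypothetical and chosen only for illustration.

```python
def dmfi_energy(c_dmfi, groups):
    """Evaluate the DMFI double sum.

    c_dmfi : linear scaling factor (the value given to SC_IMPSOLV)
    groups : list of (dg_fes, atoms), where atoms is a list of
             (zeta, upsilon) pairs, i.e., weight factor and solvation
             state of each atom in that solvation group
    """
    return c_dmfi * sum(
        dg_fes * sum(zeta * upsilon for zeta, upsilon in atoms)
        for dg_fes, atoms in groups
    )


# One hypothetical group with two equally weighted atoms,
# one fully solvated and one half solvated
groups = [(-10.0, [(0.5, 1.0), (0.5, 0.5)])]
print(dmfi_energy(1.0, groups))  # prints: -7.5
```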
To employ the ABSINTH implicit solvent model as published use:
In simulations relying on gradient-based propagation, it is a disadvantage that the originally published model creates force discontinuities at 3 levels:
- Cutoffs (these are usually of the least concern, especially given that the effective cutoff distances in CAMPARI are often significantly larger than the chosen value due to the reliance on residue-based lists)
- Linear approximation to the calculation of sphere overlap volumes
- Truncation of mapping function from solvent-accessible volume fractions to solvation states at assumed theoretical limits
FMCSC_FOSFUNC 3 # rather than 1
FMCSC_SCRFUNC 3 # rather than 1
FMCSC_SAVMODE 2 # rather than 1
FMCSC_LREL_MC 3 # rather than 1
FMCSC_LREL_MD 4 # rather than 5
The changes to the cutoff treatments are necessary to keep simulations of large systems computationally feasible since an explicit enumeration of all monopole-dipole interactions at atomic resolution is prohibitively slow for, for example, larger proteins.
Note that the more rigorous screening model (option 1) has not only been used in a highly cited work on arginine-rich peptides (Mao et al.) but has also been shown to be quantitatively superior in predicting the binding free energies between small molecules and proteins (see Marchand et al.). In addition, it is known, at least at the community level, that a conjunction of atom-based screening with cutoff treatments that neglect monopole-dipole interactions (see LREL_MC and LREL_MD for details) can lead to artifacts when simulating charged species, especially charged polymers with a blocky charge pattern. These arise because atom-based screening effectively creates tiny effective monopoles from dipole groups because the atoms in a net neutral charge group are not all screened by the same factor. If there are also strong monopoles around, especially distally but on the same polymer, these interactions become numerous enough that they have a measurable influence on the conformational properties of these polymers. Many other screening models are fully implemented but without any published data available (as of 04/2016). Similarly, and as mentioned above, it is possible to switch the functional forms for mapping from solvent-accessible volume fractions to solvation states using keywords FOSFUNC and SCRFUNC and to change the way overlap volumes are computed (→ SAVMODE). In both cases, conservative options are available that simply remove force discontinuities from the model. Finally, note that the DMFI can be made temperature-dependent by additions to the parameter file and use of keyword FOSMODE.
This keyword can be used to provide the location and name of an input file that allows overriding the default, topology-derived values for the maximum fractions of the solvent-accessible volume, ηi,max. Because values depend on hard-coded parameters (geometry) and user-level settings (choice of parameters and keyword FMCSC_SAVPROBE), CAMPARI (re)computes these values at the beginning of each run. This utilizes the default local geometries (not input structures) and works by decomposing the molecule into suitably small model compound units. The patch prints a summary of all successful changes, and results can also be assessed via column 4 in output file SAV_BY_ATOM.dat. Note that these values rely on other patchable quantities, most notably atomic radii. Patches follow a hierarchy, and a patched value for the ηi,max overrides values derived from radii that could be patched themselves (here, RPATCHFILE overrides indirect reassignment via LJPATCHFILE) without touching the atomic radii. This means that it is possible for the patched values of ηi,max to be grossly inconsistent with the underlying set of radii.
ASRPATCHFILE
This keyword can be used to provide the location and name of an input file that allows overriding the default, topology-derived values for the pairwise reduction factor for atomic volumes used in most computations using the atomic volume, most prominently the ABSINTH implicit solvation model. Reduction factors are needed, because the exclusion volumes of covalently bound atoms overlap. The reduction factors are computed in linear approximation, and - by default - the overlap volume is subtracted evenly from the remaining atomic volume of each partner. These values depend on various parameters (parameters and hard-coded geometry), and CAMPARI (re)computes them at the beginning of each run. The patch prints a summary of all successful changes, and results can also be assessed via column 7 in output file SAV_BY_ATOM.dat. See SAVPATCHFILE for remarks on the hierarchy of patches of atomic parameters.
FOSPATCHFILE
Since there is no external way to control details of the solvation group assignments relevant to the computation of the DMFI (→ SC_IMPSOLV) through the parameter file, CAMPARI allows users to alter the default group partitioning and to control reference free energies of solvation on a per-moiety basis through a dedicated input file. This also supports alterations to transfer enthalpies and heat capacities at the patch level if a temperature-dependent DMFI is in use. This keyword is used to provide the location and name of this input file. There are some underlying restrictions to the freedom of choices, but in principle it is possible to completely redesign the underlying DMFI model using this facility. Restrictions and formatting are explained elsewhere. The applied patch implies that CAMPARI will keep the built-in default partitioning along with the default reference values from the parameter file (see elsewhere) for unpatched residues and molecules. As with other force field patches, these corrections are error-prone and CAMPARI output should always be double-checked against the intended input. For this purpose, keyword FOSREPORT and associated output file FOS_GROUPS.vmd will be of particular use.
FOSMODE
Simulation temperature is used frequently in biomolecular sampling both to explicitly probe temperature-dependent behavior and to enhance sampling. For the former, the correctness of fixed force field parameters becomes questionable. If the DMFI of the ABSINTH implicit solvation model is in use, this keyword allows the user to make some of the parameters of the model temperature-dependent themselves. There are currently two options:
- All values for ΔGFES in the equation above are fixed to the reference values specified in the parameter file independent of temperature or any other environmental parameters. This is the default.
- CAMPARI tries to extract values for temperature-independent enthalpies and heat capacities of the transfer
process of a given model compound from a fixed conformation in the gas phase into water from the
parameter file. By default, all CAMPARI
parameter files do not contain these parameters. The temperature-dependent values are computed as:
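\[ \Delta G_{\mathrm{FES}}(T) = \left( \Delta G_{\mathrm{FES}}(T_0) - \Delta H_{\mathrm{FES}} \right) \frac{T}{T_0} + \Delta H_{\mathrm{FES}} + \Delta C_{p,\mathrm{FES}} \left[ T \left( 1 - \ln\frac{T}{T_0} \right) - T_0 \right] \]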
Here, ΔHFES and ΔCp,FES are the aforementioned enthalpies and heat capacities of transfer, whereas T denotes the simulation temperature and T0 denotes the reference temperature for the listed free energy value. T0 is set by keyword FOSREFT.
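The temperature dependence given above can be evaluated directly. The sketch below is a plain transcription of the formula; the function name is hypothetical, consistent units are assumed for all quantities, and the numerical inputs are made up for illustration.

```python
import math


def dg_fes_at_temperature(dg_ref, dh, dcp, temp, temp_ref):
    """Reference free energy of solvation at temperature `temp`.

    dg_ref   : free energy of solvation at the reference temperature
    dh       : temperature-independent enthalpy of transfer
    dcp      : temperature-independent heat capacity of transfer
    temp     : simulation temperature (K)
    temp_ref : reference temperature T0 (K), set by FOSREFT
    """
    ratio = temp / temp_ref
    return ((dg_ref - dh) * ratio + dh
            + dcp * (temp * (1.0 - math.log(ratio)) - temp_ref))


# At the reference temperature the listed value is recovered exactly
print(dg_fes_at_temperature(-10.0, -15.0, 0.05, 298.0, 298.0))  # prints: -10.0
```

At T = T0 the logarithm vanishes and the heat capacity term cancels, so the listed reference value is returned unchanged, which is a useful sanity check for any set of patched parameters.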
If the DMFI of the ABSINTH implicit solvation model is in use, and if a temperature-dependent model has been requested, this keyword sets the assumed reference temperature for transfer free energies of solvation listed in the corresponding section of the parameter file. It defaults to 298K.
FOSREPORT
This simple logical allows the user to request CAMPARI to print a summary of the group-based reference free energies, enthalpies, and heat capacities of solvation read from the parameter file. The latter two terms are only relevant if a temperature-dependent model has been selected. In general, the reference free energies will correspond exactly to the terms ΔGFES,k above. Note, however, that this initial output is not a summary of the system but rather of the parameters, i.e., it is more like VDWREPORT and unlike ELECREPORT or INTERREPORT. If some solvation group assignments and parameters are changed via a corresponding patch file, this keyword will also ensure that the applied patch is documented in detail in CAMPARI's log output. The actual group partitioning for the system at hand (but not the associated numerical parameters) is available from output file FOS_GROUPS.vmd.
SAVPROBE
This keyword is crucial for the ABSINTH implicit solvent model and specifies the size of the solvation shell around individual atoms. The input value is interpreted to be the radius in Å of a solvent sphere rolled around each atom and consequently twice the value of SAVPROBE will yield the thickness of the assumed first solvation layer. The resultant solvation shell volume is the starting point for determining solvent-accessible volume fractions (ηi) which are then mapped to yield atomic solvation states (υi) which are relevant for the DMFI and screened electrostatic interactions (→ SCRMODEL). In order to compute solvent-accessible volumes, overlap volumes of spheres need to be calculated or estimated, and how to do that is controlled by keyword SAVMODE. It is important to note that SAVPROBE is the only keyword other than SAVMODE directly controlling the ηi which are otherwise purely functions of atomic parameters (see PARAMETERS). Lastly, note that this keyword is still relevant for SAV analysis even though the implicit solvent model might not be used (→ SAVCALC).
SAVMODE
This keyword controls how CAMPARI calculates solvent-accessible volumes. The size of the solvation shell is defined by atomic radius and the setting for SAVPROBE. There are currently two options:
- Linear approximations are used to calculate pairwise overlap volumes. Individual atomic volumes are scaled by reduction factors given by molecular topology.
- Pairwise overlap volumes are calculated exactly (polynomial equation). Individual atomic volumes are scaled by reduction factors given by molecular topology.
If the DMFI of the ABSINTH implicit solvation model is in use, this keyword controls which functional form is used to map the solvent-accessible volume to the DMFI solvation state.
- The published smoothed and stretched sigmoidal function is used, which relies on 2 parameters,
viz., χf and τf.
The functional form is:
\[ \upsilon_{i,f} \sim \left[ 1.0 + \exp\left( -\frac{\eta_i - h(\chi_f)}{\tau_f} \right) \right]^{-1} \]
Here, υi,f is the DMFI solvation state for atom i, ηi is the solvent-accessible volume fraction for atom i, and h(...) is a linear function shifting the mid-point parameter χf (set by FOSMID) such that symmetry between the two natural limits of ηi is obtained. The normalizer is not shown in the equation. The function is smooth over the interval over which it applies.
- A stair-stepped, stretched sigmoidal function is used, which relies on 5 parameters,
viz., χf, τf,
gf, ζf, and
FOSSHIFT. This is a piecewise-defined function. The width in solvent-accessible
volume space is set directly by gf starting from the lower natural limit.
Within each piece, the fractional interval gfζf is flat with values set by
functional form 1 above (thus relying on χf and
τf). If we term two neighboring plateau values as υ1 and
υ2, then the functional form for the interval of width gf(1-ζf) is:
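\[ \upsilon_{i,f} = 0.5 \left( \upsilon_2 - \upsilon_1 \right) \cdot \left( 1 - \cos\left( \frac{(\eta_i - \eta_1) \cdot \pi \, (1 - \zeta_f)^{-1}}{\Delta\eta} \right) \right) \]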
Here, η1 corresponds to the left boundary of the interval in question, and Δη is the equivalent of gf in solvent-accessible volume fraction space. The position of the interpolation interval within the total interval of width Δη is defined by keyword FOSSHIFT. Note that functional form 1 is theoretically recovered if gf approaches 0.0. Note also that FOSSHIFT becomes irrelevant as ζf approaches 0.0 and that the limit of ζf reaching 1 (true step function) is numerically forbidden. Note that these additional parameters are not independent and that the coarse shape of the default (smooth) function is preserved, especially if FOSGRANULE is small (left near the default).
- The published smoothed and stretched sigmoidal function is used, using the same parameters, except for a small interval for solvent-accessible volume fractions close to the maximum defined by the initial or reference bond topology. The width of this interval is set by keyword FOSTAPER. If ηi,max - ηmin defines the width between minimally and maximally achievable solvent-accessible volume fractions (the maxima can be patched, see SAVPATCHFILE), then the relevant interval is from ηi,max - FOSTAPER(ηi,max - ηmin) to ηi,max. Within this interval, cubic Hermite interpolation is performed using the analytical gradient and value of the standard function at ηi,max - FOSTAPER(ηi,max - ηmin) as anchor points together with ηi,max and a desired derivative of 0.0 at that point. Thus, the resulting piecewise function has continuous gradients and avoids the force discontinuity at solvent-accessible volume ratios approaching or exceeding the maximum. While this is unlikely to happen in polymers, it can happen in small molecules, especially those corresponding to a single solvation group (model compound) that have intrinsic flexibility. The difference this choice causes relative to option 1 is generally expected to be small, in particular if τf is small.
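The tapering in the third option relies on standard cubic Hermite interpolation, which can be sketched generically as follows. The sketch shows only the textbook interpolant on an interval with prescribed values and slopes at both ends; how CAMPARI chooses the end value at ηi,max is not restated here, so it appears as a plain argument. All names and numbers are hypothetical.

```python
def hermite_taper(x, x0, f0, d0, x1, f1, d1=0.0):
    """Cubic Hermite interpolant on [x0, x1].

    f0, d0 : value and slope at the left anchor x0
    f1, d1 : value and slope at the right anchor x1 (slope 0.0 by
             default, mirroring the zero derivative requested at the
             upper limit)
    """
    width = x1 - x0
    t = (x - x0) / width
    h00 = 2.0 * t**3 - 3.0 * t**2 + 1.0   # weights f0
    h10 = t**3 - 2.0 * t**2 + t           # weights d0
    h01 = -2.0 * t**3 + 3.0 * t**2        # weights f1
    h11 = t**3 - t**2                     # weights d1
    return h00 * f0 + h10 * width * d0 + h01 * f1 + h11 * width * d1


# Hypothetical anchors: taper from (0.9, 0.97) with slope 0.4 to (1.0, 1.0)
print(hermite_taper(0.9, 0.9, 0.97, 0.4, 1.0, 1.0))  # prints: 0.97
print(hermite_taper(1.0, 0.9, 0.97, 0.4, 1.0, 1.0))  # prints: 1.0
```

Because both the value and the slope are matched at the left anchor and the slope is zero at the right anchor, the piecewise function has a continuous first derivative, which is the property that removes the force discontinuity.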
The atomic solvent-accessible volumes, ηi, are mapped to solvation states by two different sets of parameters, the first being responsible for obtaining υi,f which are the solvation states describing the change in DMFI with changes in conformation (the second set is responsible for obtaining υi,s which describe the change in dielectric response with change in conformation). The details of the mapping function are complicated by the requirement to normalize the υi,f to the well-defined interval [0:1] but in essence it holds:
$$\upsilon_{i,f} \sim \left[ 1.0 + \exp\left( -\frac{\eta_i - h(\chi_f)}{\tau_f} \right) \right]^{-1}$$
Here, τf is the parameter determining the steepness of the sigmoidal interpolation and this is the parameter determined by this keyword. Large values will yield an approximately linear re-mapping between the natural limits of ηi which are derived from closest packing of spheres (lower limit) and model compound topology (upper limit). This case is not obvious from the above equation but is obtained via τf-dependent re-scaling to match the target interval. Conversely, very small values yield a step-function like interpolation. h(x) is a linear function shifting the mid-point parameter χf (set by FOSMID) such that symmetry between the two natural limits of ηi is obtained.
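The effect of τf on the shape of the mapping can be illustrated with a minimal Python sketch. It rests on two assumptions that are not confirmed by the text above: the normalization is taken to be a plain min-max rescaling of the sigmoid, and h(χ) is taken to place the mid-point linearly between the two natural limits of ηi. The function name and the numerical limits are hypothetical.

```python
import math

def solvation_state(eta, eta_min, eta_max, chi, tau):
    """Map a solvent-accessible volume fraction eta to a state in [0, 1].

    Assumptions (not taken from CAMPARI's source): h(chi) places the
    mid-point linearly between eta_min and eta_max, and the sigmoid is
    rescaled so that eta_min maps to 0 and eta_max maps to 1.
    """
    mid = eta_min + chi * (eta_max - eta_min)  # assumed form of h(chi)

    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-(x - mid) / tau))

    lo, hi = sigmoid(eta_min), sigmoid(eta_max)
    return (sigmoid(eta) - lo) / (hi - lo)

# Hypothetical natural limits of eta for one atom
ETA_MIN, ETA_MAX = 0.2, 0.8
for tau in (0.02, 0.25, 100.0):
    states = [solvation_state(e, ETA_MIN, ETA_MAX, 0.5, tau) for e in (0.3, 0.5, 0.7)]
    print(tau, [round(s, 3) for s in states])
```

Under these assumptions, a small τ gives a step-like response around the mid-point, while a very large τ gives an almost linear re-mapping between the two limits, which is the behavior described above.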
As explained for FOSTAU, the mapping from solvent-accessible volumes ηi to solvation states υi,f relies on a mid-point parameter, χf. In the functional form given above, the mid-point of the sigmoidal function (i.e., the point of maximal slope) can be shifted toward either one of the natural limits of ηi by varying this keyword between zero and unity. Since the sigmoidal nature of the interpolation disappears in the limit of large values chosen for FOSTAU, FOSMID is only relevant for sufficiently small values of FOSTAU and its impact diminishes progressively with growing FOSTAU. Note that the default is 0.5 but that it is easily possible to generate fairly asymmetric interpolation functions in the process (i.e., at values close to zero atoms are considered solvated at almost all times while at values close to unity the opposite is true). There is a Matlab script in the tools-directory (sigmainterpol.m) that helps assess the effect FOSTAU and FOSMID have given values for the natural limits of ηi.
FOSGRANULE
As explained above, this keyword applies only to the case of a stair-stepped interpolation function from solvent-accessible volume fraction to DMFI solvation state. It sets the volume increment to assume for each step (in Å³). The solvation shell volume of each atom is then discretized by this increment and a step-like function is applied to each resulting interval in η-space. The default is set to the volume available to a single water molecule when assuming liquid water with a density of 1 g/cm³.
FOSTIGHT
As explained above, this keyword applies only to the case of a stair-stepped interpolation function from solvent-accessible volume fraction to DMFI solvation state. It sets the narrowness of the cosine-based step interpolation within each interval defined by FOSGRANULE. A value of 0.0 gives a smooth piecewise function without any plateau regions whereas larger values intersperse plateau regions by making the transition narrower (the function remains smooth, however). The theoretical limit of 1.0 gives a true step function, but this is numerically forbidden.
FOSSHIFT
As explained above, this keyword applies only to the case of a stair-stepped interpolation function from solvent-accessible volume fraction to DMFI solvation state. It sets the position of the step interpolation within an interval defined by FOSGRANULE and FOSTIGHT, and values from 0.0 (left) to 1.0 (right) are possible. The keyword is irrelevant if FOSTIGHT (ζf above) is zero.
FOSTAPER
As explained above, this keyword applies only to the case of a tapered interpolation function meant to remove the force discontinuity at large solvent-accessible volume fractions. If ηi,max is the maximum solvent-accessible volume fraction for atom i and ηmin is the minimal value defined by spherical close-packing, then the taper interval is from ηi,max - FOSTAPER(ηi,max - ηmin) to ηi,max. The allowed values are positive values up to a limit of 0.5. Values close to 0.0 should be avoided since this would introduce strong gradient curvature over a very small range of solvent-accessible volume fractions. The default is 0.05.
MOL2FOSLIBFILE
This file provides a library of abstract solvation groups indexed by integers. It is, at the moment, only relevant in the context of a small molecule screen where it provides the database of solvation groups available to parameterize the ABSINTH DMFI for the small molecules. The groups are simply specified line-by-line with no names or chemical information beyond two specifications of numbers of atoms. No file of this type is provided with CAMPARI, and the process of preparing a mol2 input file so that it utilizes the information in the solvation group library has to be performed independently of CAMPARI. Formatting details are given elsewhere.
This keyword lets the user set the assumed continuum dielectric. Primarily, this is used in the ABSINTH solvation model to treat the screening of electrostatic interactions. The dielectric constant enters the equation for the modified Coulomb sum in different ways depending on the choice for SCRMODEL. In general, the solvent-accessible volumes, ηi, will be mapped to yield solvation states υi,s for dielectric screening. The mapping process is equivalent to the one described for the DMFI but relies on a separate set of parameters (see SCRFUNC). In the published ABSINTH model, the screening factor for the polar interaction is given as:
$$s_{ij} = \left[ 1 - a\,\upsilon_{i,s} \right]\cdot\left[ 1 - a\,\upsilon_{j,s} \right]$$
$$a = \left(1 - \varepsilon_r^{-1/2}\right)$$
Here, εr is the relative dielectric constant set by this keyword. The above equation corresponds rigorously only to using screening model 2. Note how the functional form ensures an interpolation between the vacuum (υi,j = 0.0 → εeff = 1.0) and the fully screened cases (υi,j = 1.0 → εeff = εr).
In a completely different context, this keyword also sets the assumed continuum dielectric outside the cutoff sphere when treating electrostatic interactions with reaction-field methods (→ LREL_MD). For this latter purpose, it may be advantageous to set a very large value.
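The interpolation between the vacuum and fully screened limits can be checked numerically. The following Python sketch evaluates the published atom-based screening factor given above; the function names and the value chosen for εr are illustrative assumptions, not CAMPARI defaults.

```python
def screening_factor(v_i, v_j, eps_r):
    """Atom-based screening factor s_ij = (1 - a*v_i) * (1 - a*v_j)."""
    a = 1.0 - eps_r ** -0.5
    return (1.0 - a * v_i) * (1.0 - a * v_j)

def effective_dielectric(v_i, v_j, eps_r):
    """Effective dielectric implied by s_ij, which scales the Coulomb term."""
    return 1.0 / screening_factor(v_i, v_j, eps_r)

EPS_R = 78.2  # hypothetical choice for the continuum dielectric
for v in (0.0, 0.5, 1.0):
    print(v, effective_dielectric(v, v, EPS_R))
```

With both solvation states at 0.0 the effective dielectric is 1.0, and with both at 1.0 it equals εr, in line with the limits stated above. For an asymmetric pair (one state 1.0, the other 0.0) this form gives the square root of εr.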
This keyword has several options which allow the user to control how dielectric screening of charges is done, specifically what functional form is used for the pairwise screening factor sij for a pair of interacting atoms i and j. The electrostatic framework within ABSINTH aims specifically at ensuring that only moieties with well-defined net charges interact (this is discussed in a different context for ELECMODEL). This means that for every base functional form of sij there will be two variants, one in which the υi,s are used directly (atom-based) and one in which a charge group-based υk,s is pre-computed for each group k out of its constituent atoms' solvation states υi,sk. Only the latter ensures rigorously that two formally neutral charge groups interacting will not create effective charge imbalances by atom-specific screening. The downside of those models (and the reason we generally do not recommend using them) is the higher computational cost associated and the dependence on the local neutrality in the partial charge set (i.e., should the base parameters not yield any locally neutral subgroups within a residue, the relevant charge group may be as large as an entire polynucleotide residue and dielectric responses of fairly distant moieties may become coupled which suggests a length scale to the solvent response vastly inconsistent with the setting for SAVPROBE). In the latter case, it may be necessary to attempt to patch the charge groups so that an approximate grouping is obtained (for details on the charge group identification, please refer to the corresponding section in Parameters).
- For every charge group, the solvation states for the individual sites are averaged in charge-weighted fashion (group-based → see above). The resultant group solvation state υk,s is used to screen all the charges belonging to this group:
$$s_{ij} = \left[ 1 - a\,\upsilon_{k,s} \right]\cdot\left[ 1 - a\,\upsilon_{l,s} \right]$$
$$a = \left(1 - \varepsilon_r^{-1/2}\right)$$
Here, we assume atom i is part of the kth charge group and atom j is part of the lth charge group. εr is provided by IMPDIEL.
- This is the published atom-based model and explained above (→ IMPDIEL). The atom-specific screening via atomic solvation states υi,s will break the neutral paradigm somewhat but localizes and strengthens specific interactions.
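- Since electrostatic interactions tend to be somewhat weak with the aforementioned options, this model extends the default model (1) by an important change. If the distance of atoms i and j, rij, approaches the length-scale of the first solvation shell, the dielectric is augmented by a distance-dependent contribution intended to strengthen specific interactions. This yields a very complicated (although computationally not much more expensive) model:
$$s_{ij} = s_{env,ij} \quad \text{if } r_{ij} \geq (r_{0,ij}+d_W) \text{ or } s_{env,ij} > \left[ \varepsilon_c\cdot r_{0,ij} \right]^{-1}$$
$$s_{ij} = \left[ 1 - f_{MIX}\cdot\left[1 - d_W^{-1}(r_{ij}-r_{0,ij})\right] \right]\cdot s_{env,ij} + f_{MIX}\cdot\left[1 - d_W^{-1}(r_{ij}-r_{0,ij})\right]\cdot\left[ \varepsilon_c\cdot r_{0,ij} \right]^{-1} \quad \text{if } r_{ij} < (r_{0,ij}+d_W) \text{ and } r_{ij} > r_{0,ij}$$
$$s_{ij} = (1 - f_{MIX})\cdot s_{env,ij} + f_{MIX}\cdot\left[ \varepsilon_c\cdot r_{0,ij} \right]^{-1} \quad \text{if } r_{ij} < r_{0,ij}$$
$$s_{env,ij} = \left[ 1 - a\,\upsilon_{k,s} \right]\cdot\left[ 1 - a\,\upsilon_{l,s} \right]$$
$$a = \left(1 - \varepsilon_r^{-1/2}\right)$$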
Here, dW is the thickness of the solvation shell (2·SAVPROBE) and r0,ij is given by the sum of the atomic radii of atoms i and j. fMIX is the impact of the distance-dependent contribution and set by keyword SCRMIX. εc is set by CONTACTDIEL (compare model 4). Note that the distance-dependence is achieved by the interpolation performed in the distance regime r0,ij < rij < (r0,ij+dW) but that no explicit distance-dependence is introduced otherwise. Furthermore, the contact dielectric εc·r0,ij is generally overridden if the environmental dielectric senv,ij would lead to a stronger interaction (less screening). Importantly, model 3 operates on the group-consistent solvation states (as model 1 does). The atom-specific modification corresponds to model 9. It should be noted that these models are largely untested and were part of initial calibration studies with the ABSINTH implicit solvent model. They are fully supported by CAMPARI, however.
- This model implements a (more or less) pure distance-dependent dielectric:
$$s_{ij} = \left[ \varepsilon_c\cdot r_{ij} \right]^{-1} \quad \text{if } r_{ij} > r_{0,ij}$$
$$s_{ij} = \left[ \varepsilon_c\cdot r_{0,ij} \right]^{-1} \quad \text{else}$$
Here, εc is the strength of the distance increase of the dielectric constant and r0,ij is the contact distance below which no further distance dependence to sij is applied. The resultant effective dielectric constant is εc·r0,ij which should never be less than unity. εc is set by CONTACTDIEL and r0,ij is defined by the sum of the atomic radii of atoms i and j. This means that the derivative of the potential is discontinuous at the contact point. Note that distance-dependent dielectric models break for a variety of limiting cases, in particular for anything involving net charged species. They also rely on a cutoff criterion since they otherwise do not converge upon a meaningful limiting dielectric. In this way, distance-dependent dielectrics may be seen as somewhat analogous to reaction-field treatments (see LREL_MD).
- This model is a group-based variant and therefore similar to option 1. It attempts to take a different route toward computing an effective dielectric. Whereas models 1, 2, 3, and 9 use an effective charge approach, this model (just like models 6, 7, and 8) employs an effective dielectric approach. The former implies that the solvation state enters the potential energy for Coulombic interactions as υi,s·υj,s, so that EPOLAR will scale with changes in the υi,s differently than the DMFI. Consequently, screening model 5 implies:
$$s_{ij} = M\left( \left[ 1 - a\cdot\upsilon_{k,s} \right], \left[ 1 - a\cdot\upsilon_{l,s} \right] \right)$$
$$a = \left(1 - \varepsilon_r^{-1}\right)$$
Here, we assume atom i is part of the kth charge group and atom j is part of the lth charge group and M is a function corresponding to a generalized mean whose exact form is determined by the choice for ISQM. The latter will be able to give rise to fundamentally different scaling behavior of EPOLAR with the υi,s illustrated for example by taking the arithmetic mean. This can more closely approximate the behavior seen for the DMFI and may allow using much more similar parameter sets τs and χs compared to τf and χf than is the case with models 1 or 2.
- This model is the atom-based variant of model 5:
$$s_{ij} = M\left( \left[ 1 - a\cdot\upsilon_{i,s} \right], \left[ 1 - a\cdot\upsilon_{j,s} \right] \right)$$
$$a = \left(1 - \varepsilon_r^{-1}\right)$$
- This model is an equivalent modification to model 5 as model 3 is to model 1.
- This model is an equivalent modification to model 6 as model 3 is to model 1.
- This model is an equivalent modification to model 2 as model 3 is to model 1.
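The piecewise definitions for the distance-dependent models can be condensed into a short Python sketch. It follows the equations given for models 3 and 4 above and takes the environmental screening factor senv,ij as a precomputed input; the function names and all numerical values are hypothetical.

```python
def s_model4(r, r0, eps_c):
    """Pure distance-dependent screening, frozen below the contact distance r0."""
    return 1.0 / (eps_c * max(r, r0))

def s_model3(r, r0, s_env, eps_c, f_mix, d_w):
    """Environmental screening s_env with a contact term spliced in near r0."""
    s_contact = 1.0 / (eps_c * r0)
    # Beyond the first solvation shell, or if the environment already screens
    # less than the contact dielectric would, s_env is used unchanged.
    if r >= r0 + d_w or s_env > s_contact:
        return s_env
    # Weight of the contact term: f_mix for r <= r0, decaying linearly
    # to zero at r0 + d_w.
    w = f_mix * (1.0 - max(r - r0, 0.0) / d_w)
    return (1.0 - w) * s_env + w * s_contact

# Hypothetical pair: contact distance 3.0 A, shell thickness 2.8 A
for r in (2.0, 3.0, 4.4, 5.8, 8.0):
    print(r, s_model3(r, 3.0, 0.02, 5.0, 0.5, 2.8), s_model4(r, 3.0, 5.0))
```

Writing the three distance regimes of model 3 with a single linear weight makes it easy to see that the function is continuous at both r0,ij and r0,ij+dW, and that a mixing factor of zero recovers the unmodified base model.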
For certain screening models (SCRMODEL = 3, 4, 7, 8, or 9), the effective dielectric at an interatomic distance exactly matching the sum of the two atomic radii is postulated to have the limiting value of εc·r0,ij (see equations above). This keyword provides the value for the parameter εc.
SCRFUNC
If the ABSINTH implicit solvation model is in use and Coulombic interactions are enabled, this keyword controls which functional form is used to map the solvent-accessible volume to the solvation state for charge screening (υi,s for atom i above). As explained before (see IMPDIEL), the ABSINTH implicit solvent model employs two sets of solvation states, υi,f and υi,s. The υi,s determine the effective dielectric acting between polar atoms (see equations above). The text below is basically the same as that for keyword FOSFUNC.
- The published smoothed and stretched sigmoidal function is used, which relies on 2 parameters, viz., χs and τs. The functional form is:
$$\upsilon_{i,s} \sim \left[ 1.0 + \exp\left( -\frac{\eta_i - h(\chi_s)}{\tau_s} \right) \right]^{-1}$$
Here, υi,s is the charge screening solvation state for atom i, ηi is the solvent-accessible volume fraction for atom i, and h(...) is a linear function shifting the mid-point parameter χs (set by SCRMID) such that symmetry between the two natural limits of ηi is obtained. The normalizer is not shown in the equation. The function is smooth over the interval over which it applies.
- A stair-stepped, stretched sigmoidal function is used, which relies on 5 parameters, viz., χs, τs, gs, ζs, and SCRSHIFT. This is a piecewise-defined function. The width in solvent-accessible volume space is set directly by gs starting from the lower natural limit. Within each piece, the fractional interval gsζs is flat with values set by functional form 1 above (thus relying on χs and τs). If we term two neighboring plateau values as υ1 and υ2, then the functional form for the interval of width gs(1-ζs) is:
$$\upsilon_{i,s} = 0.5\,(\upsilon_2 - \upsilon_1)\cdot\left(1 - \cos\left(\frac{(\eta_i - \eta_1)\cdot\pi\,(1-\zeta_s)^{-1}}{\Delta\eta}\right)\right)$$
Here, η1 corresponds to the left boundary of the interval in question, and Δη is the equivalent of gs in solvent-accessible volume fraction space. The position of the interpolation interval within the total interval of width Δη is defined by keyword SCRSHIFT. Note that functional form 1 is theoretically recovered if gs approaches 0.0. Note also that SCRSHIFT becomes irrelevant as ζs approaches 0.0 and that the limit of ζs reaching 1 (true step function) is numerically forbidden. Note that these additional parameters for the second functional form are not independent and that the coarse shape of the default function is preserved, especially if SCRGRANULE is small (left near the default).
- The published smoothed and stretched sigmoidal function is used, using the same parameters, except for a small interval for solvent-accessible volume fractions close to the maximum defined by the initial or reference bond topology. The width of this interval is set by keyword SCRTAPER. If ηi,max - ηmin defines the width between minimally and maximally achievable solvent-accessible volume fractions (the maxima can be patched, see SAVPATCHFILE), then the relevant interval is from ηi,max - SCRTAPER(ηi,max - ηmin) to ηi,max. Within this interval, cubic Hermite interpolation is performed using the analytical gradient and value of the standard function at ηi,max - SCRTAPER(ηi,max - ηmin) as anchor points together with ηi,max and a desired derivative of 0.0 at that point. Thus, the resulting piecewise function has continuous gradients and avoids the force discontinuity at solvent-accessible volume ratios approaching or exceeding the maximum. While this is unlikely to happen in polymers, it can happen in small molecules, especially those corresponding to a single solvation group (model compound) that have intrinsic flexibility. The difference this choice causes relative to option 1 is generally expected to be small, in particular if τs is small.
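The tapered variant can be sketched in Python as follows. This is only an illustration under several assumptions that the text does not confirm: the sigmoid is normalized by min-max rescaling, h(χ) is a linear interpolation between the natural limits, the anchor value at ηi,max is the value of the standard function there, and values beyond ηi,max are held flat. All names and numbers are hypothetical.

```python
import math

def sigmoid_state(eta, eta_min, eta_max, chi, tau):
    """Return the normalized sigmoid value and its analytical slope at eta."""
    mid = eta_min + chi * (eta_max - eta_min)  # assumed form of h(chi)

    def sig(x):
        return 1.0 / (1.0 + math.exp(-(x - mid) / tau))

    lo = sig(eta_min)
    norm = sig(eta_max) - lo  # assumed min-max normalizer
    s = sig(eta)
    return (s - lo) / norm, s * (1.0 - s) / (tau * norm)

def tapered_state(eta, eta_min, eta_max, chi, tau, taper):
    """Standard sigmoid, with a cubic Hermite taper near eta_max."""
    h = taper * (eta_max - eta_min)  # width of the taper interval
    eta_t = eta_max - h              # start of the taper interval
    if eta <= eta_t:
        return sigmoid_state(eta, eta_min, eta_max, chi, tau)[0]
    v0, m0 = sigmoid_state(eta_t, eta_min, eta_max, chi, tau)
    v1 = sigmoid_state(eta_max, eta_min, eta_max, chi, tau)[0]
    m1 = 0.0                         # desired derivative at eta_max
    t = min((eta - eta_t) / h, 1.0)  # assumed clamp: flat beyond eta_max
    return ((2 * t**3 - 3 * t**2 + 1) * v0 + (3 * t**2 - 2 * t**3) * v1
            + h * ((t**3 - 2 * t**2 + t) * m0 + (t**3 - t**2) * m1))

for eta in (0.5, 0.77, 0.785, 0.8, 0.85):
    print(eta, tapered_state(eta, 0.2, 0.8, 0.5, 0.25, 0.05))
```

Because the spline reproduces both value and slope of the standard function at the start of the taper interval and ends with zero slope, the gradient stays continuous across the whole range, which is the property the third option is meant to provide.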
This is the specification analogous to FOSTAU for the charge screening solvation state and provides τs rather than τf.
SCRMID
This is the specification analogous to FOSMID for the charge screening solvation state and provides χs rather than χf.
SCRGRANULE
This is the specification analogous to FOSGRANULE for the charge screening solvation state and provides gs rather than gf.
SCRTIGHT
This is the specification analogous to FOSTIGHT for the charge screening solvation state and provides ζs rather than ζf.
SCRSHIFT
This is the specification analogous to FOSSHIFT.
SCRTAPER
This is the specification analogous to FOSTAPER.
SCRMIX
Several of the screening models (choice of 3, 7, 8, or 9 for SCRMODEL) splice a distance-dependent term into the environmental charge-screening over a well-defined length scale. The impact of this contribution is set by this keyword which corresponds to the parameter fMIX in the equations above. If set to values close to zero, the model approaches its unmodified base model, e.g. model 3 essentially converges to model 1. Conversely, a value close to 1.0 would yield maximum impact and let - for example - model 3 approximate model 4 for distances close to the contact distance r0,ij. The choice here is naturally tightly coupled to that for CONTACTDIEL.
ISQM
In those screening models postulating an effective dielectric rather than effective charges, the generalized mean function M(x,y) was introduced (see equations above). This keyword defines the order m of the generalized mean. It can be an integer from -10 to 10, but large absolute values slow down the computation drastically and are not recommended. The generalized mean is:
$$M(x,y) = \left[ 0.5\cdot\left( x^m + y^m \right) \right]^{1/m} \quad \text{if } m \neq 0$$
With the limiting case of:
$$M(x,y) = (x\cdot y)^{1/2} \quad \text{if } m = 0$$
Common cases aside from the geometric (m=0) are the arithmetic (m=1) or the harmonic (m=-1) mean. Any m>1 will favor large values in an asymmetric pair, i.e., let both participating atoms appear desolvated leading to stronger interactions, while any m<1 will favor small values in an asymmetric pair, i.e., let both participating atoms appear solvated and weaken such interactions (it is the derived screening factors and not the solvation states that enter the mean). The former scenario (m>1) would rarely seem desirable as it means that - for instance in solutions of small, polar molecules - the cooperativity for converting between fully dissociated and fully associated states becomes overly pronounced on account of the positive coupling between adding more and more species to a growing cluster and the enthalpic benefit offered by that process.
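The influence of the order m on an asymmetric pair is easy to verify numerically. The Python sketch below implements the generalized mean as defined for ISQM and plugs it into the atom-based effective dielectric form given for screening model 6; the function names and the value of εr are illustrative assumptions.

```python
def generalized_mean(x, y, m):
    """Generalized mean of order m for two positive numbers (m = 0: geometric)."""
    if m == 0:
        return (x * y) ** 0.5
    return (0.5 * (x ** m + y ** m)) ** (1.0 / m)

def s_model6(v_i, v_j, eps_r, m):
    """Atom-based screening factor with a = 1 - 1/eps_r and mean order m."""
    a = 1.0 - 1.0 / eps_r
    return generalized_mean(1.0 - a * v_i, 1.0 - a * v_j, m)

# Asymmetric pair: one atom fully solvated, the other fully desolvated
for m in (-1, 0, 1, 2):
    print(m, s_model6(1.0, 0.0, 78.2, m))
```

For the asymmetric pair, the screening factor grows with m, so the interaction becomes stronger as m increases. This matches the statement that m>1 favors large values and m<1 favors small values.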
This keyword specifies the linear scaling factor controlling the "outside" scaling of torsional bias terms, VTOR. Such a potential allows either harmonically restraining virtually all freely rotatable dihedral angles to specific target values or softly biasing them toward such target values. The setup for these is handled through an input file (details of the format are described elsewhere). Note that a particularly useful application of ETOR is to apply torsional restraints according to structural input which is useful for equilibrating molecules meant to remain in a specific, internal arrangement. Notably, torsional bias terms are respected during almost all stages of any possible structural randomization happening during the setup stage of a simulation task. This means that they can be used to generate starting configurations of polymers that are in compliance with a particular secondary structure pattern.
TORFILE
This keyword specifies the location and name (absolute paths preferable) of the input file for individual backbone torsional bias potentials, VTOR (see elsewhere for description).
TORREPORT
This is a simple logical allowing the user to instruct CAMPARI to write out a complete summary of the torsional bias terms contributing to VTOR (naturally parsed by residue) in the system. In addition to the annotated log-output, this will also create the output file SAMPLE_TORFILE.dat, which is a rewriting of the current input specifications to a fully explicit and residue-based version. This is useful primarily in preserving input to the definition of torsional bias potentials that comes from structural input. It is recommended to utilize this option.
SC_ZSEC
This keyword gives the linear scaling factor for a global secondary structure bias term. For values larger than zero, a harmonic bias is applied on two order parameters, fα and fβ which measure the secondary structure content of the chain. fα and fβ are calculated as the sequence-averaged (excluding termini) values of a mapping function defined for each residue:
$$z_\alpha = e^{-\tau_\alpha\cdot(d_\alpha - r_\alpha)^2} \quad \text{if } d_\alpha > r_\alpha$$
$$z_\alpha = 1.0 \quad \text{else}$$
The radius of the (spherical) α-region, rα, is provided by ZS_RAD_A and its center φ/ψ-position by keyword ZS_POS_A. The distance dα is taken from the center of the circle and corrected for periodic wraparounds in φ/ψ-space. zβ is defined analogously. This function represents a smooth "top hat" function which is continuous and differentiable. By tuning the parameter τα through keywords ZS_STP_A and ZS_STP_B, the Gaussian decay beyond the limits of the spherical plateau region can be turned from very shallow to step function-like. The default definitions (all of which can be overridden) are:
α-basin: center φ/ψ = (-60.0,-50.0)°; rα = 35.0°; $1.0/\tau_\alpha^{1/2} \cong 22.36°$
β-basin: center φ/ψ = (-155.0,160.0)°; rβ = 35.0°; $1.0/\tau_\beta^{1/2} \cong 22.36°$
The global values (if there are multiple polypeptide chains in the system, the average is over all of them) are then restrained:
$$V_{ZSEC} = c_{ZSEC}\cdot\left( k_\alpha\cdot(f_\alpha - f_\alpha^0)^2 + k_\beta\cdot(f_\beta - f_\beta^0)^2 \right)$$
Here, cZSEC is the linear scaling factor specified by this keyword. The other parameters are explained below. Note that it may not be a good idea to use such a residue-based restraint potential for very short sequences. Here, the net content idea breaks down and (for typical choices of τα/β) the chain will have access only to values in the vicinity of those given by a discrete residue content. This may lead to a specific sampling of the ring regions around the plateaus to satisfy intermediate target values, which runs counter to the intent of the potential.
When CAMPARI's shared memory (OpenMP) parallelization is in use, the calculation of VZSEC is currently executed by a single thread, possibly in concurrence with another thread addressing the complementary DSSP term but not with anything else. This is a scaling limitation, and a corresponding warning is produced.
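The construction of the order parameters and of VZSEC can be illustrated with a small Python sketch. It uses the default basin definitions quoted above and assumes that the plateau (value 1.0) lies inside the basin radius with the Gaussian decay beyond it, as the prose describes. The steepness 0.002 per degree squared is derived from the quoted value 1.0/τ^(1/2) ≅ 22.36°. The function names and the example torsions are hypothetical.

```python
import math

ALPHA_CENTER = (-60.0, -50.0)   # default alpha-basin center (phi, psi), degrees
BETA_CENTER = (-155.0, 160.0)   # default beta-basin center (phi, psi), degrees

def wrapped_diff(a, b):
    """Signed angular difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def basin_value(phi, psi, center, radius=35.0, tau=0.002):
    """Smooth top-hat: 1.0 inside the basin radius, Gaussian decay outside."""
    d = math.hypot(wrapped_diff(phi, center[0]), wrapped_diff(psi, center[1]))
    if d <= radius:
        return 1.0
    return math.exp(-tau * (d - radius) ** 2)

def v_zsec(torsions, c_zsec, k_a, k_b, f_a0, f_b0):
    """Bias energy for a list of (phi, psi) pairs of the non-terminal residues."""
    f_a = sum(basin_value(p, s, ALPHA_CENTER) for p, s in torsions) / len(torsions)
    f_b = sum(basin_value(p, s, BETA_CENTER) for p, s in torsions) / len(torsions)
    return c_zsec * (k_a * (f_a - f_a0) ** 2 + k_b * (f_b - f_b0) ** 2)

# Hypothetical chain with one residue in each basin center
print(v_zsec([(-60.0, -50.0), (-155.0, 160.0)], 1.0, 10.0, 10.0, 1.0, 0.0))
```

The example also shows the limitation noted for short sequences: with only two contributing residues, fα and fβ can take values only near 0.0, 0.5, and 1.0 unless residues sit in the decay region around a plateau.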
This keyword specifies the target α-content (fα0) for the global secondary structure bias potential (values [0.0:1.0]).
ZS_FR_B
This keyword specifies the target β-content (fβ0) for the global secondary structure bias potential (values [0.0:1.0]). Note that the sum of fβ0 and fα0 (see ZS_FR_A) should usually not exceed unity, especially in conjunction with stiff spring constants. Doing so would generate a frustrated system for which results will often be irrelevant.
ZS_FR_KA
Through this keyword, (twice) the spring constant (in kcal/mol) operating on fα is provided (kα) if the global secondary structure bias potential is in use.
ZS_FR_KB
Analogous to ZS_FR_KA, this keyword lets the user specify the spring constant (in kcal/mol) operating on fβ (kβ) if the global secondary structure bias potential is in use. If both parameters are meant to be restrained, it usually would not seem meaningful to choose very different values for the two spring constants. In doing so, one would essentially create a primary bias (stiffer term) and a secondary bias (softer term) operating "within" the primary restraint.
ZS_POS_A
This is one of the few keywords that requires two floating point numbers as input. It allows the user to override the default location of the α-basin (see SC_ZSEC). The two numbers are interpreted to be the φ- and ψ-values (in degrees) for the center of the (spherical) basin. The setting is relevant for the corresponding restraint potential and the output in ZSEC_HIST.dat, ZAB_2DHIST.dat, and ZBETA_RG.dat.
ZS_POS_B
See ZS_POS_A, only for the β-basin.
ZS_RAD_A
This keyword requires one floating point number to be specified. It allows overriding the default radius of the α-basin (see SC_ZSEC) and is assumed to be given in degrees. The setting is relevant for the corresponding restraint potential and the output in ZSEC_HIST.dat, ZAB_2DHIST.dat, and ZBETA_RG.dat.
ZS_RAD_B
See ZS_RAD_A, only for the β-basin.
ZS_STP_A
This keyword requires one floating point number. It allows overriding the default steepness of the decay (τα) of the order parameter value beyond the spherical plateau region defining the α-basin (see SC_ZSEC). It is assumed to be provided in inverse degrees squared. The setting is relevant for the corresponding restraint potential and the output in ZSEC_HIST.dat, ZAB_2DHIST.dat, and ZBETA_RG.dat.
ZS_STP_B
See ZS_STP_A, only for the β-basin.
SC_DSSP
This keyword provides the outside scaling factor, cDSSP, on the biasing potential acting on order parameters derived from the secondary structure annotation of polypeptides in the simulation system using the DSSP algorithm. In essence, this allows biasing the system to populate more and stronger hydrogen bonds characteristic for either α-helices (H) or β-sheets - whether parallel, antiparallel, multi-pleated or hairpins (E). Since secondary structure annotation is essentially a discretized and on/off variable, it may seem surprising that a restraint potential can be applied in meaningful fashion. The functional form is:
$$V_{DSSP} = c_{DSSP}\cdot\left( k_H\cdot(f_H - f_H^0)^2 + k_E\cdot(f_E - f_E^0)^2 \right)$$
Here, the kH and kE are (twice) the spring constants for the harmonic restraints applied to the secondary structure scores, fH and fE. The spring constants are set by keywords DSSP_HSC_K and DSSP_ESC_K for H-score and E-score, respectively. fH and fE are exactly identical to the H-score and E-score defined below and rely on the same base parameters (→ DSSP_MODE). Essentially, they correspond to a multiplicative function of the assignment and the quality of the hydrogen bonds giving rise to the assignment. They can - depending on system and DSSP settings - be continuous and approximately smooth order parameters over a large part of the accessible regime. The target values fH0 and fE0 are set via keywords DSSP_HSC and DSSP_ESC. There are a few noteworthy peculiarities which the user should keep in mind:
- DSSP E-assignments can rely both on intra- and intermolecular hydrogen bonds rendering the DSSP term a true system-wide potential. Currently, CAMPARI only allows restraining global E- and H-scores which may make calculations with multiple polypeptides more difficult to interpret.
- In the limit of no hydrogen bonds, the order parameters will always be discontinuous since the discrete assignment score has to be non-zero for the quality score to matter.
- Due to the potential discontinuities, dynamics calculations utilizing the DSSP biasing potential may suffer from substantial noise, in particular for stiff restraints and small systems.
- Again, due to the functional form, there is no direct driving force to form new hydrogen bonds of the right type. The potential relies on random encounters and the cooperativity of secondary structure elements.
- Lastly, in case some proper hydrogen bonds are formed, the resultant energy landscape is often very rugged and sampling may be severely hampered by the presence of the restraints. It is therefore advisable - at the very least - to perform multiple independent simulations when using DSSP restraints.
When CAMPARI's shared memory (OpenMP) parallelization is in use, the DSSP restraint potential is currently calculated by a single thread, possibly in concurrence with another thread addressing the complementary ZSEC term but not with anything else. This is a scaling limitation, and a corresponding warning is produced.
In case DSSP restraints are used (→ SC_DSSP), this keyword allows the user to set the target H-score (α-content, fH0 above). Its value is limited to the interval from zero to unity. A large value will steer the system toward forming many i→i+4 hydrogen bonds.
DSSP_ESC
In case DSSP restraints are used (→ SC_DSSP), this keyword lets the user set the target E-score (β-content, fE0 above). Just like for DSSP_HSC, values are restricted to the interval [0.0:1.0]. A large value will bias the system toward forming characteristic β-hydrogen bonds but does not distinguish between parallel and anti-parallel arrangements. Note that the sum of DSSP_HSC and DSSP_ESC should probably never approach unity. Also note that the E-score can never be exactly unity for a monomeric, finite length polypeptide even when discarding termini (turn requirement).
DSSP_HSC_K
If DSSP restraints are in use (→ SC_DSSP), this keyword sets (twice) the spring constant (in kcal/mol) operating on the DSSP H-score, i.e., it sets the value of kH above.
DSSP_ESC_K
If DSSP restraints are in use (→ SC_DSSP), this keyword sets (twice) the spring constant (in kcal/mol) operating on the DSSP E-score, i.e., it sets the value of kE above.
SC_POLY
In studies of generic polymers, coarse descriptors like size and shape of the macromolecule may be more relevant than structural characteristics tailored specifically to polypeptides. CAMPARI supports restraint potentials on such coarse descriptors, specifically the parameters t and δ (see description of output file POLYAVG.dat) which measure size and shape asymmetry, respectively. Two-dimensional histograms of these quantities can be computed and written by CAMPARI (see output file RDHIST.dat). These molecule-based restraint potentials yield a bias term to the total potential energy, VPOLY, and this keyword provides its "outside" scaling factor cPOLY. Note that with the exception of the scaling factor, requests are generally handled through a dedicated input file (see elsewhere for details). As mentioned above, when CAMPARI's shared memory (OpenMP) parallelization is in use, all threads contribute to calculating VPOLY synchronously.
POLYFILE
This keyword should point to the location of the input file for individual molecular polymeric biasing potentials (→ elsewhere for description).
POLYREPORT
Like other report flags, this keyword is a simple logical which allows the user to obtain a complete summary of the polymeric bias terms (by molecule) in the system. It is only meaningful if polymeric biasing terms are in use (→ SC_POLY).
SC_TABUL
CAMPARI has an extensive facility to supply tabulated non-bonded potentials which are then applied to the system. This keyword specifies the "outside" linear scaling factor cTABUL according to:
$$E_{TABUL} = c_{TABUL}\cdot\sum\sum_{i,j} I\left(V_{ij}^{k}, V_{ij}^{k+1}, m_{ij}^{k}, m_{ij}^{k+1}, d_{ij}\right)$$
Here, the sum runs over all atom pairs i,j which have a tabulated potential specified for them, Vijk is the kth tabulated value of the acting potential and dij is the interatomic distance. dij is located uniquely within the interval given by the kth and k+1th tabulated value. I(...) is the interpolation function, and CAMPARI currently performs only cubic interpolation with cubic Hermite splines:
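$$I\left(V_{ij}^{k}, V_{ij}^{k+1}, m_{ij}^{k}, m_{ij}^{k+1}, d_{ij}\right) = (2t^3 - 3t^2 + 1)\cdot V_{ij}^{k} + (3t^2 - 2t^3)\cdot V_{ij}^{k+1} + (d_{k+1}-d_k)\cdot\left[ (t^3 - 2t^2 + t)\cdot m_{ij}^{k} + (t^3 - t^2)\cdot m_{ij}^{k+1} \right]$$
$$t = (d_{ij} - d_k)/(d_{k+1}-d_k)$$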
Here, t is the relative position in the interval from k to k+1 normalized to unit length. The mijk are the tangents to (slopes at) the control points (tabulated values) of the potentials. The spline is set up to recover both values and tangents at the control points. This means that the resultant function is continuously differentiable regardless of the values used for the tangents. Tangents are either read from file (without error checks → description of dedicated input file) or estimated numerically via finite differences from the potential input (see description of dedicated input file). In the latter case, some options are available to tune the spline (see TABIBIAS and TABITIGHT).
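The interpolation formula can be tried out with a short Python sketch for a single tabulated potential. The table lookup via bisection and the example table are assumptions made for illustration; they do not describe how CAMPARI stores or searches its tables.

```python
from bisect import bisect_right

def hermite(v_k, v_k1, m_k, m_k1, d_k, d_k1, d):
    """Cubic Hermite interpolation between the control points k and k+1."""
    h = d_k1 - d_k
    t = (d - d_k) / h  # relative position in the interval, 0 <= t <= 1
    return ((2 * t**3 - 3 * t**2 + 1) * v_k + (3 * t**2 - 2 * t**3) * v_k1
            + h * ((t**3 - 2 * t**2 + t) * m_k + (t**3 - t**2) * m_k1))

def tabulated_energy(dists, values, tangents, d):
    """Interpolate one tabulated potential at distance d (inside the table)."""
    if not dists[0] <= d <= dists[-1]:
        raise ValueError("distance outside the tabulated range")
    # Index k of the interval [dists[k], dists[k+1]] that contains d
    k = min(bisect_right(dists, d) - 1, len(dists) - 2)
    return hermite(values[k], values[k + 1], tangents[k], tangents[k + 1],
                   dists[k], dists[k + 1], d)

# Hypothetical table sampled from V(d) = d^3 - 2d with exact tangents
grid = [1.0, 2.0, 3.0, 4.0]
vals = [d**3 - 2.0 * d for d in grid]
tans = [3.0 * d**2 - 2.0 for d in grid]
print(tabulated_energy(grid, vals, tans, 2.5))
```

Since the spline reproduces both values and tangents at the control points, a cubic test function with exact tangents is recovered exactly. This makes a convenient sanity check for any table-reading code.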
There are a few additional characteristics of the implementation of tabulated potentials in CAMPARI:
- Aside from Coulombic terms, these potentials are the only ones captured by the longer of the non-bonded cutoffs in MC runs (→ ELCUTOFF).
- When used concurrently with other non-bonded potentials, many redundant distance calculations may be performed. This is because tabulated potentials have to use their own data structure to be able to function efficiently both for cases with universal use and for very sparse use.
- Atom pairs that are in close proximity and are excluded from all other non-bonded potentials are not excluded from tabulated potentials.
This keyword provides the index input file which determines which tabulated potential to use for which atom pair (see elsewhere for format description). Naturally, this is only relevant if the tabulated potential is in use.
TABPOTFILE
This keyword should give the name and location of the actual input file for the tabulated potentials (see elsewhere for format description). Naturally, this is only relevant if the tabulated potential is in use.
TABTANGFILE
This keyword should give the name and location of the optional input file for providing derivatives of the tabulated potentials specified via another keyword. If this file is not provided, the derivatives are estimated numerically to generate the necessary tangents for the cubic interpolation scheme. If the file is provided, however, no checks are performed on the supplied values (see elsewhere for format description). Naturally, this is only relevant if the tabulated potential is in use.
TABITIGHT
If tabulated potentials are in use, and if the input file providing derivatives of the potentials is either missing or incomplete, the cubic interpolation scheme applied to the discrete input data (using cubic Hermite splines) utilizes numerical estimates of the tangents (slopes) at the nodes (control points). The shape and nature of the resulting spline can be varied somewhat with two control parameters, the first controlling the "tightness", and the second (see below) controlling a left/right-sided bias with respect to the control points. The control parameters are used in the construction of the tangents as follows:
\[ m_{ij}^{k} = \frac{(1 - t_t) \cdot (1 + t_b) \cdot (V_{ij}^{k} - V_{ij}^{k-1}) + (1 - t_t) \cdot (1 - t_b) \cdot (V_{ij}^{k+1} - V_{ij}^{k})}{d_{k+1} - d_{k-1}} \]
This is essentially a simplified Kochanek-Bartels spline scheme skipping the discontinuity parameter and assuming identical distance spacings. The Vj are the potential values at the specified distances, dk, supplied via the required input file. tt is the tightness parameter controlled by this keyword, and tb is the bias parameter controlled by TABIBIAS. For both parameters being zero, the well-known Catmull-Rom spline is obtained. Regardless of the choices for tt and tb (allowed values span the interval from -1 to 1), the resultant interpolation scheme will yield a function that is continuous and smooth (i.e., continuously differentiable). However, unless the control points are very sparse with respect to the features of the potentials, any non-zero settings for tt and/or tb will most likely lead to undesirable effects, in particular at the level of derivatives.
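As a rough illustration of the tangent construction, the following Python sketch evaluates the expression above for all interior control points. The function name `estimate_tangents` is hypothetical, and the one-sided slopes used at the two end points are an assumption, since the text does not state how the first and last control points are treated.

```python
def estimate_tangents(d, v, tightness=0.0, bias=0.0):
    """Estimate spline tangents from tabulated values.

    d: tabulated distances, v: potential values at those distances.
    tightness and bias correspond to the two control parameters
    (both restricted to the interval from -1 to 1).
    """
    if len(d) != len(v) or len(d) < 2:
        raise ValueError("need at least two matching control points")
    if not (-1.0 <= tightness <= 1.0 and -1.0 <= bias <= 1.0):
        raise ValueError("tightness and bias must lie in [-1, 1]")
    n = len(d)
    m = [0.0] * n
    # End points have only one neighbor: plain one-sided slopes (assumption).
    m[0] = (v[1] - v[0]) / (d[1] - d[0])
    m[-1] = (v[-1] - v[-2]) / (d[-1] - d[-2])
    for k in range(1, n - 1):
        left = (1 - tightness) * (1 + bias) * (v[k] - v[k - 1])
        right = (1 - tightness) * (1 - bias) * (v[k + 1] - v[k])
        m[k] = (left + right) / (d[k + 1] - d[k - 1])
    return m


# With both parameters at zero, interior tangents are central differences
# (the Catmull-Rom choice).
catmull_rom = estimate_tangents([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 4.0, 9.0])
```

Setting the tightness to 1 zeroes all interior tangents, which makes the interpolated function flat at every control point; this shows why non-zero settings can distort derivatives when the control points are dense.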
TABIBIAS
If tabulated potentials are in use, and if the input file providing derivatives of the potentials is either missing or incomplete, the cubic interpolation scheme applied to the discrete input data (using cubic Hermite splines) utilizes numerical estimates of the tangents (slopes) at the nodes (control points). The shape of the resulting spline utilizes a bias parameter, tb, that is specified by this keyword. Its exact interpretation is explained above. Simply speaking, positive values lead to a lag (along the distance axis) in the interpolated, piecewise polynomial compared to the control points, whereas negative values do the opposite.
TABREPORT
If tabulated potentials are in use (see SC_TABUL), this keyword lets the user instruct CAMPARI to print out a report of all the tabulated interactions in the system. This output can be quite large and is written to a separate output file (see TABULATED_POT.idx).
SC_DREST
This keyword controls the "outside" scaling factor for quadratic potentials acting on either interatomic distances between any pair of selected atoms or on absolute coordinates of individual atoms. Such requests are handled and processed through a dedicated input file (see FMCSC_DRESTFILE). Details regarding functional forms and available choices and parameters are in the input file documentation. SC_DREST simply controls the "outside" scaling factor $c_{DREST}$ for the individual $V_{DREST}$ terms, i.e., $E_{DREST} = c_{DREST} \cdot \sum_m V_{DREST}^{(m)}$. A summary of the requested terms can be obtained with keyword DRESTREPORT.
One prominent role of using such restraints would be to allow a restrained relaxation of parts of a system (e.g., it is common to restrain protein atoms and relax water molecules in the presence of these restraints before starting a simulation of a protein in a crystallographic conformation in explicit solvent). A second important role lies in the fact that several experimental techniques (in particular NMR or FRET) can derive distance restraints on the relative position of two sites in a biomolecule. Hence, computational techniques are able to utilize this experimental information as restraints (prominent for example in the computational determination of protein structures via NMR).
CAMPARI offers the simple facility to harmonically restrain atoms which otherwise need not have any particular relationship. These restraints can be made one-sided, i.e. they can also restrain a distance to simply be within or beyond a certain threshold, which is usually a more appropriate treatment for incorporating experimental results. Nevertheless, both distance and position restraints are harmonic potentials acting on variables constrained only by the box size. Thus, a starting structure that is not compliant with a selected set of restraints will suffer from large energies/forces. Because of the stiffness of these terms, CAMPARI tries to respect distance and position restraints during structure randomization. This will frequently be imperfect, in particular since it is easily possible to set up sets of restraints that are mutually inconsistent or inconsistent with molecular topology and/or other interaction potentials.
As mentioned elsewhere, when CAMPARI's shared memory (OpenMP) parallelization is in use, all threads contribute to calculating EDREST synchronously. Note that there is no incremental treatment of this term in Monte Carlo calculations, which is a limitation.
DRESTFILE
This keyword should give the location and name of the input file containing specific requests for specific interatomic distance and/or specific atom position restraints (see elsewhere for format description). Naturally, this is only relevant if custom distance/position restraints are in use. In a small molecule screen, if CAMPARI was instructed to create some automatic position/distance restraint terms, this file can be missing. In all other cases, the potential will be disabled if this file cannot be found or parsed.
DRESTREPORT
If distance or absolute position restraint potentials are in use (see SC_DREST), this keyword allows the user to request a summary of the active restraint terms in the system. Note that this will not reflect terms added via MOL2DRESTMODE.
MOL2DRESTMODE
In a small molecule screen, CAMPARI offers to create some position/distance restraint terms as follows.
- No terms are added. This is the default.
- The data for the reference molecule are analyzed. This is either the first molecule in the mol2 input file or the molecule contained in the file supplied for keyword MOL2_REFMOL. The coordinates of specially indicated atoms are extracted and stored. They are used to set the minimum positions for position restraints in all three dimensions. These position restraints are then applied to all molecules using a matching scheme described elsewhere (see input documentation and keyword MOL2ASSIMILAR). This option makes sense only if the reference molecule is in a meaningful position relative to the fixed part of the system (in ligand binding terms, it is "properly docked"). Molecules for which the matching fails will have fewer or even no position restraints added.
- Similar to the previous option, the data for the reference molecule are analyzed. The coordinates of all heavy atoms are extracted and stored. They are used to define a generous envelope used to set upper and lower bounds for position restraints in all three dimensions (flat-bottom potentials). These position restraints are then applied to the heavy atoms in all subsequent molecules. As for the previous option, this option makes sense only if the reference molecule is in a meaningful position relative to the fixed part of the system (in ligand binding terms, it is "properly docked"). The number of restraints added will be proportional to the number of heavy atoms of every screened molecule. Similar restraints applied to all atoms can also be defined manually via the normal input file.
Note that pose randomization can be used in conjunction with these restraints but that it is usually relatively meaningless to destroy an alignment, especially if mode 1 has been selected. Choosing mode 1 here will in fact prevent randomization since the Hamiltonians might differ between different alignments. Conversely, randomization might make some sense for mode 2 in case the tethering to the reference molecule is primarily meant to create a binding site volume rather than a specific chemotype match.
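To make the idea of envelope-based, flat-bottom position restraints more concrete, here is a small Python sketch. It rests on several assumptions that are not confirmed by the text: the envelope is taken to be the axis-aligned bounding box of the reference heavy atoms, a hypothetical `buffer` argument widens it by the same amount on both sides of every axis, and the penalty outside the bounds is k·x² with a hypothetical force constant `k`. All function names are illustrative.

```python
def flat_bottom_energy(x, lower, upper, k=1.0):
    """Harmonic penalty outside [lower, upper], zero penalty inside."""
    if x < lower:
        return k * (lower - x) ** 2
    if x > upper:
        return k * (x - upper) ** 2
    return 0.0


def envelope_bounds(ref_coords, buffer=0.0):
    """Per-axis (lower, upper) bounds spanning the reference heavy atoms."""
    bounds = []
    for axis in range(3):
        values = [atom[axis] for atom in ref_coords]
        bounds.append((min(values) - buffer, max(values) + buffer))
    return bounds


def restraint_energy(coords, bounds, k=1.0):
    """Sum the flat-bottom penalties over all atoms and all three axes."""
    return sum(flat_bottom_energy(atom[axis], bounds[axis][0], bounds[axis][1], k)
               for atom in coords for axis in range(3))


# Reference molecule with two heavy atoms; envelope widened by 0.5 per side.
reference = [(0.0, 0.0, 0.0), (2.0, 1.0, 3.0)]
box = envelope_bounds(reference, buffer=0.5)
inside = restraint_energy([(1.0, 1.0, 1.0)], box)
outside = restraint_energy([(3.5, 0.0, 0.0)], box)
```

In this sketch, an atom anywhere inside the widened envelope contributes nothing, while an atom 1 Å beyond the upper x bound contributes k·1² to the bias.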
MOL2DRESTBUF
In a small molecule screen, if CAMPARI was instructed to create some automatic position/distance restraint terms, this keyword sets a buffer value in Å for position restraints. If these automatic restraints are atom-specific (mode 1 for MOL2DRESTMODE), the reference positions extracted from the tagged substructure in the reference molecule are now extended by a flat bottom of the width prescribed by this keyword. In other words, specifying a positive value for MOL2DRESTBUF will switch the position restraint from a simple harmonic potential to a potential consisting of two harmonic arms interspersed by a flat region of zero penalty that is MOL2DRESTBUF Å wide. If instead the restraints are on the envelope prescribed by the entire reference molecule (mode 2 for MOL2DRESTMODE), the potentials already have a flat bottom, and MOL2DRESTBUF can be used to extend it. This is useful if the screened molecules would not fit into the envelope of the reference molecule, which is a common problem. Note that this extension is always applied symmetrically to all heavy atoms and all coordinate axes.
SC_OSMO
This keyword controls the outside scaling factor of a simple bias potential that allows separating the simulation container (irrespective of shape) into two or more compartments and applying a compartment-specific restraint potential to specific residues in the system. In essence, this creates a set of soft, semi- or impermeable membranes. The use of the method in conjunction with umbrella sampling on position restraints to calculate transfer free energies is explained in the reference publication.
Specifically, keyword OSMO_MODE selects the number of planar boundaries to be introduced. Unless the system container is a general, triclinic box, each boundary is parallel to one of the cardinal axes of the system and passes through the simulation container's geometric center. The first boundary to be added is that orthogonal to the z-axis (xy-plane) followed by the xz- and yz-planes. In the triclinic case, the boundaries are parallel to the container walls and divide up the system in the hierarchy of the plane orthogonal to BOXVECTOR3, followed by BOXVECTOR2 and BOXVECTOR1. Once an assignment of residues to compartments has been made (via a mandatory input file → OSMOFILE), the boundaries required to define the compartment exert a restoring force if an atom in a residue assigned to it tries to leave the compartment:
Here, cOSMO is the outside scaling factor defined by SC_OSMO, kBND is the rigidity of the compartmental membrane (controlled by SOFTWALL), and di,BND is a shorthand notation for the effective distance of atom i from the boundary, which is always 0.0 inside the compartment. This treatment is exactly the same as that for an atom-based soft-wall boundary condition. Note that in finite systems with external boundaries, the relative strengths of inner compartment and external boundaries can be made different by choosing a value for SC_OSMO different from 1.0. The compartmental boundaries that do not coincide with an external boundary are a little more tricky to understand in periodic systems. For example, in a 3D periodic, cubic box with a single inner boundary in the yz-plane passing through the origin, an additional boundary is required at the xy-border of the box to avoid connecting the compartments via periodic images. This boundary is felt differently by particles assigned to different compartments. Periodic dimensions that are not parallel to an added inner boundary are not (and must not be) corrected in the same way.
With the help of keyword OSMOREPORT, CAMPARI will report the expected bulk densities per compartment. Because the boundaries are soft, they respond to the internal pressure of each compartment. Thus, choosing compartments with dramatically different internal pressures will tend to change the effective volumes and densities per partition. In general, it is of course difficult to assess how meaningful a compartmentalization of this type can be if the compartments contain non-mixing phases (e.g., a hydrophobic gas vs. liquid water or two immiscible liquids). In many cases, results from such a simulation will not be straightforwardly interpretable.
OSMO_MODE
If compartmentalization potentials are in use (see SC_OSMO), this keyword specifies the number and type of inner compartment boundaries to be added. If the system is periodic, additional boundaries might be added (see above). Currently, the available options are:
- A single planar boundary is added in the xy-plane (general case) or in a plane orthogonal to BOXVECTOR3 (triclinic case) and passing through the origin/center of the system (split in z or along BOXVECTOR3, respectively). This creates exactly 2 compartments.
- Two planar boundaries are added: one in the xy-plane and one in the xz-plane, and both passing through the origin/center of the system (split in y and z) in the general case. In the triclinic case, the first boundary is in a plane orthogonal to BOXVECTOR3, the second in a plane orthogonal to BOXVECTOR2. This creates exactly 4 compartments.
- Three planar boundaries are added: in the general case, one in the xy-plane, one in the xz-plane, and one in the yz-plane, and all are passing through the origin/center of the system (split in x, y and z). In the triclinic case, the first boundary is in a plane orthogonal to BOXVECTOR3, the second in a plane orthogonal to BOXVECTOR2, and the third in a plane orthogonal to BOXVECTOR1. This creates exactly 8 compartments.
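The three options above can be summarized in a short Python sketch for the non-triclinic case. It assumes that the option values equal the number of boundaries (1, 2, or 3), which the text suggests but the list does not show explicitly. The function name `compartment_index` and the numbering of compartments are purely illustrative and do not reflect the format of the compartment input file.

```python
def compartment_index(position, center, mode):
    """Return a compartment label in the range 0 .. 2**mode - 1.

    position, center: (x, y, z) tuples; mode: number of planar boundaries.
    Boundaries are added in the order z-split, y-split, x-split.
    """
    if mode not in (1, 2, 3):
        raise ValueError("mode must be 1, 2, or 3")
    split_axes = (2, 1, 0)[:mode]  # z first, then y, then x
    index = 0
    for bit, axis in enumerate(split_axes):
        # Each boundary contributes one bit: which side of the plane.
        if position[axis] >= center[axis]:
            index |= 1 << bit
    return index


# With three boundaries, the eight octants map to eight distinct labels.
origin = (0.0, 0.0, 0.0)
octants = {compartment_index((x, y, z), origin, 3)
           for x in (-1.0, 1.0) for y in (-1.0, 1.0) for z in (-1.0, 1.0)}
```

Each additional boundary doubles the number of compartments, which is why the options create exactly 2, 4, and 8 compartments.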
OSMOFILE
If compartmentalization potentials are in use (see SC_OSMO), this keyword should give the location and name of the input file containing the assignment of residues to individual compartments. Different input modes exist (by residue, by molecule, or by molecule type). This is a mandatory input file for this bias potential. The user is referred elsewhere for a detailed description of the input format. The details of the assignment can be printed to log-output by means of keyword OSMOREPORT.
OSMOREPORT
If compartmentalization potentials are in use (see SC_OSMO), this keyword allows the user to request a summary of the assignment of all residues in the system to compartments. The report will also list the formal bulk densities for the individual compartments.
SC_EMICRO
This keyword sets the global scaling factor for a spatial density restraint potential. The method was introduced recently (Vitalis and Caflisch), and the user is referred there for additional details. The potential relies on reading and quantitatively interpreting an input density map. The interpreted density for a given lattice cell with indices l, m, and n is denoted $\Xi_{lmn}$ and is meant to correspond to some atomic property such as mass (→ EMPROPERTY). The potential itself is as follows:
\[ E_{EMICRO} = f_{EMICRO} \sum_{ijk} \left( \rho_{ijk} - \Xi_{ijk} \right)^2 \]
The value of fEMICRO is set by this keyword. The potential is extensive with the number of grid cells. If it is the dominant contribution in terms of CPU time to energy evaluations, the use of Monte Carlo sampling is currently quite wasteful since the values for ΔEEMICRO are not actually computed incrementally. The sum implied in the above equation is over all lattice cells of an evaluation grid reduced in resolution to exactly that of the input density map. Note that the dimensions of the evaluation grid are controlled by system size and shape, and that its formal resolution is either assumed to be that of the input map or set explicitly by keyword EMDELTAS (although the resultant lattice is required to have cell boundaries that align exactly with those of the input map). If the resolution of the evaluation grid is finer, the values for its cells are summed up to give the coarser resolution. Furthermore, the evaluation grid may extend beyond the input map, and in such a case the summation also includes (coarse) cells where the input is assumed to be exactly the background density. Taken together, these caveats mean that it is rarely useful not to match the input lattice exactly. That said, it is also fundamentally important, in aperiodic boundary conditions, to prevent solutes from leaving the evaluation grid. When this happens, the behavior is undefined (meaning that forces and energies can be wrong or misleading). The correct workaround is provided by keyword EMBUFFER, and its value must be set considering at least temperature and the harmonic spring constant of the boundary.
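The following Python sketch illustrates the evaluation of the restraint term on a one-dimensional toy lattice, including the reduction of a finer evaluation grid to the resolution of the input map. The function names are hypothetical, and the reduction shown here (averaging densities over blocks of equal-volume cells, which is equivalent to summing cell contents and renormalizing by the coarse cell volume) is an assumption about how the summation is normalized.

```python
def coarsen(fine_density, factor):
    """Reduce a 1D density to a coarser grid with `factor` cells per block."""
    if factor < 1 or len(fine_density) % factor != 0:
        raise ValueError("cell boundaries of both grids must align")
    return [sum(fine_density[i:i + factor]) / factor
            for i in range(0, len(fine_density), factor)]


def density_restraint_energy(rho, xi, f_emicro):
    """Scaled sum of squared deviations over all lattice cells."""
    if len(rho) != len(xi):
        raise ValueError("both densities must live on the same lattice")
    return f_emicro * sum((r - x) ** 2 for r, x in zip(rho, xi))


# Evaluation grid with twice the resolution of the (interpreted) input map.
simulation_density = coarsen([1.0, 3.0, 2.0, 2.0], factor=2)
energy = density_restraint_energy(simulation_density, [1.0, 2.0], f_emicro=0.5)
```

Because every lattice cell contributes a term, the energy grows with the number of cells, and any change in conformation requires touching all affected cells again.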
Importantly, the spatial density restraint provides an absolute reference in space, which means that it is most likely incorrect to use drift removal techniques. The grids used by CAMPARI are always rectangular lattices, which precludes the use of input density maps that do not have only right angles (although no warning or error is produced, they are just interpreted as rectangular). By extension, this means that triclinic containers only work with this potential when full aperiodic boundary conditions are used. In periodic ones, the lattices would have to mimic the triclinic unit cell, which is currently unsupported. 1D-periodic cylinders are supported on the other hand since the periodic dimension aligns with the cardinal (z-)axis. The grids are fixed throughout, which means that in ensembles with volume fluctuations, aperiodic boundary conditions have to be assumed as well. Another unusual aspect about this potential is that it only applies to physically present molecules in simulations in ensembles with fluctuating particle numbers. This is despite it not being a pairwise interaction term, and distinguishes it from potentials affecting the bath particles as well (such as bonded potentials). Because the potential is strictly a penalty term, this creates an effective mismatch that must be lumped manually into the excess chemical potential. This is neither pretty nor clean meaning that concurrent use of these techniques should be accompanied by the appropriate skepticism.
Depending on the choice for EMMODE, EEMICRO can also be written using an average of the simulation density that is typically not equivalent to the canonical ensemble average:
\[ E_{EMICRO} = f_{EMICRO} \sum_{ijk} \left( \langle \rho_{ijk} \rangle - \Xi_{ijk} \right)^2 \]
Here, the angular brackets indicate an average that depends on keyword EMIWEIGHT and is explained there. Further details as to why the canonical average is not used are below. Note that the potential utilizing this average no longer corresponds to a unique Hamiltonian, i.e., every time the average is updated the energy landscape changes. This means that the ensembles generated are no longer straightforward to interpret. The obvious benefits of using an ensemble-averaged restraint are twofold. First, explicit heterogeneity can explain data that would be inconsistent with a unique structure. Second, sampling is aided by the fact that "stuck" conformations will tend to become unstable in terms of EEMICRO over time. As a final remark, users should keep in mind that the actual ensemble average generated may not agree with input given that this quantity was never actually restrained during the simulation.
As mentioned above, when CAMPARI's shared memory (OpenMP) parallelization is in use, all threads contribute to calculating EEMICRO synchronously. However, the parallel efficiency is generally poor if either the lattices are large (in number of grid cells) relative to the number of atoms or if the solutes in a dilute system change absolute positions rapidly.
EMMODE
If the density restraint potential is in use, this keyword allows the user to choose between two options. Setting this keyword to 1 computes the restraint term by comparing the instantaneous simulation density to the input density map, whereas a choice of 2 computes the restraint term by comparing an ensemble-averaged simulation density to the input density map.
While the first option is straightforward, the second one requires some additional considerations as follows. Irrespective of whether a run is in parallel or not, the ensemble average is currently obtained over the previous sampling history (beyond equilibration) of the exact trajectory in question. Note that any average is created in terms of numbers of steps, which may cause inconsistencies in hybrid sampling runs due to the different average phase space increments. Choosing an appropriate type of average is not trivial (see, e.g., this reference), because the naive approach of including the entire sampling history leads to a continuously decreasing impact of the restraint term. There are currently two ways to address this. First, the accumulation frequency for the ensemble average can be reduced by keyword EMCALC. This slows down the reduction in impact and effectively gives the system more time to explore, because it results in concatenated runs of length EMCALC, during which the potential is in fact constant. Second, CAMPARI uses a fixed weight for the instantaneous component of the average while evaluating the potential. This fixed weight is set by keyword EMIWEIGHT and provides a way to utilize the entire history without degrading the impact of the restraint potential. A third route would be to use an appropriate kernel function in the time averaging, but this is inconvenient and potentially inefficient for spatial density analysis due to the large number of terms that would have to be stored and processed to recompute the kernel-based average.
A third option for this keyword may be added in the future that allows a lateral ensemble average to be restrained in MPI averaging calculations.
EMIWEIGHT
If the density restraint potential is in use, and if the potential acts on some ensemble-averaged simulation density, this keyword allows the user to set a fixed weight for the constructed average:
\[ \langle \rho_{ijk} \rangle = (1 - w_{inst}) \, N_{steps}^{-1} \sum_i \rho_{ijk}(i) + w_{inst} \, \rho_{ijk}(current) \]
Here, the factor winst is set by this keyword and bound to the interval from 0 to 1. The ρijk(i) are the Nsteps values contributing to the running, canonical average of the density, and ρijk(current) is the density produced by the current conformation at that given lattice cell. The limiting case of winst being 1.0 recovers the instantaneous treatment (→ EMMODE). The limiting case of winst being 0.0 does not, however, produce a meaningful restraint (since it is independent of the current conformation). Both limiting cases are therefore forbidden. Note that it is currently not possible to recover the naive approach of a restraint that continuously decreases in relevance.
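A minimal Python sketch of this weighted average is given below. The class name `DensityAverage` and its methods are hypothetical, and the sketch assumes that the current conformation is not yet part of the accumulated history when the restrained average is formed, which the text does not specify.

```python
class DensityAverage:
    """Running density average combined with a fixed instantaneous weight."""

    def __init__(self, n_cells, w_inst):
        # Both limiting cases (0 and 1) are excluded, as described above.
        if not 0.0 < w_inst < 1.0:
            raise ValueError("w_inst must lie strictly between 0 and 1")
        self.w_inst = w_inst
        self.n_steps = 0
        self.running = [0.0] * n_cells

    def accumulate(self, rho):
        """Add one density snapshot to the running (history) average."""
        self.n_steps += 1
        self.running = [avg + (r - avg) / self.n_steps
                        for avg, r in zip(self.running, rho)]

    def restrained_average(self, rho_current):
        """Mix the history average with the current density."""
        if self.n_steps == 0:
            raise RuntimeError("no history accumulated yet")
        w = self.w_inst
        return [(1.0 - w) * avg + w * r
                for avg, r in zip(self.running, rho_current)]


# Two history snapshots on a lattice with two cells, then one evaluation.
average = DensityAverage(n_cells=2, w_inst=0.25)
average.accumulate([1.0, 2.0])
average.accumulate([3.0, 4.0])
mixed = average.restrained_average([6.0, 7.0])
```

The fixed weight keeps the contribution of the current conformation constant no matter how long the history becomes, which is the property that prevents the restraint from losing impact over time.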
EMMAPFILE
This keyword provides the location and name of the mandatory density input file when using the density restraint potential. The file format is described in detail elsewhere, and here it suffices to say that the external NetCDF library is needed, that currently no other common density file formats (.ccp4, .mrc, ...) are read directly by CAMPARI, and that only lattice-based densities on rectangular grids are properly supported. UCSF Chimera is able to convert between various density-based file formats, and does read and write NetCDF files.
The most common application is likely that of a simulation with 3D periodic boundary conditions and a rectangular cuboid simulation volume. Here, the cells of the input lattice should align exactly with those of the analysis and evaluation lattice CAMPARI uses, and generally it will be easiest to match both origin and dimensions exactly. By default, CAMPARI will obtain the lattice cell dimensions from the input map. For nonperiodic boundaries (including simulation systems with curved boundaries), it will be required, however, to deviate from such an exact match. Here, keyword EMBUFFER can be used to define the buffer in size for the evaluation grid at any nonperiodic boundaries. Furthermore, keyword EMDELTAS can always be used to request the analysis and evaluation lattice to have cells of a smaller size, which, with the restraint potential in place, has to yield the exact input cell size by integer multiplication for all three dimensions. Lastly, keyword EMREDUCE can be used to average the input map to a lower resolution by re-binning.
Assuming no further transformations are applied (→ keywords EMREDUCE, EMTRUNCATE, EMFLATTEN), the interpreted density based on the input file is as follows:
\[ \Xi_{ijk} = \rho_{sol} + c \, (\omega_{ijk} - \omega_{bg}) \]
Here, the final density for a given lattice cell, $\Xi_{ijk}$, has units of physical density, c is a scale factor explained below, $\omega_{ijk}$ is the original input density for the same lattice cell, and $\rho_{sol}$ and $\omega_{bg}$ are the assumed physical and input background signals, respectively. $\rho_{sol}$ is set by keyword EMBGDENSITY, and $\omega_{bg}$ can be set by keyword EMBACKGROUND if the value determined automatically from the histogram of input densities is not appropriate. Factor c is given as follows:
\[ c = \left[ M_M - \rho_{sol} \sum_{ijk} V_{ijk} H(\omega_{ijk} - \omega_t) \right] \cdot \left[ \sum_{ijk} (\omega_{ijk} - \omega_{bg}) V_{ijk} H(\omega_{ijk} - \omega_t) \right]^{-1} \]
Here, the first term in square brackets is a hypothetical excess signal (mass) using the apparent macromolecular volume (the sum of the volume of all lattice cells with signals exceeding the threshold, ωt) and the assumed total mass. The Vijk are the volumes of individual lattice cells and currently have to be all equal, and H(x) denotes the Heaviside step function. The second term in square brackets is the actual excess signal (mass) derived from the input map obtained by analogous summation. Factor c has units that convert optical density (input) to physical density. It is important to note the crucial impact of keywords EMTHRESHOLD and EMTOTMASS on the quantitative interpretation of the map. In particular, many combinations of values will be rejected by CAMPARI, because they cannot produce an excess signal larger than the background. The resultant interpreted map is written to a dedicated output file at the beginning of each run. Note that this includes all optional transformations controlled by keywords EMREDUCE, EMTRUNCATE, and EMFLATTEN.
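The linear transform described above can be sketched in Python as follows. All names are hypothetical, the lattice is flattened to a plain list, all cells share one volume, and the Heaviside function is taken to be zero at the threshold itself, which is an assumption. The rejection of unusable parameter combinations is mimicked by raising an error; the actual checks performed by CAMPARI may differ.

```python
def scale_factor(omega, total_mass, rho_sol, omega_bg, omega_t, cell_volume):
    """Factor c converting input (optical) density to physical density."""
    # Cells exceeding the threshold define the apparent molecular volume.
    above = [w for w in omega if w > omega_t]
    apparent_volume = cell_volume * len(above)
    excess_target = total_mass - rho_sol * apparent_volume
    excess_input = cell_volume * sum(w - omega_bg for w in above)
    if excess_target <= 0.0 or excess_input <= 0.0:
        raise ValueError("no excess signal above background for these settings")
    return excess_target / excess_input


def interpret_map(omega, total_mass, rho_sol, omega_bg, omega_t, cell_volume):
    """Apply the linear transform to every lattice cell."""
    c = scale_factor(omega, total_mass, rho_sol, omega_bg, omega_t, cell_volume)
    return [rho_sol + c * (w - omega_bg) for w in omega]


# Toy map with four cells, two of which exceed the threshold.
raw = [0.0, 0.0, 2.0, 4.0]
xi = interpret_map(raw, total_mass=10.0, rho_sol=1.0, omega_bg=0.0,
                   omega_t=1.0, cell_volume=1.0)
# Integral of the interpreted density over the above-threshold cells.
mass_above = sum(x for x, w in zip(xi, raw) if w > 1.0) * 1.0
```

By construction, integrating the interpreted density over the cells above the threshold returns the assumed total mass, which illustrates why the threshold and the total mass jointly determine the quantitative interpretation of the map.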
EMREDUCE
If the density restraint potential is in use, this keyword can be used to change the formal resolution of the input density map. This is accomplished by simple re-binning, i.e., the target and original lattices are aligned at the origin, and the original signal for each cell is distributed to the target cells by simple overlap. Because the input is assumed to be a density, volume renormalization is performed. Note that it is generally meaningless to create a finer grid this way, because no new information is available, and CAMPARI distributes signal assuming a flat distribution inside each original input cell. Similar to keyword EMDELTAS, this keyword requires the specification of three floating point numbers that set the target lattice cell sizes of the re-binned input map in Å for the x, y, and z dimensions, respectively. Note that the exact values will generally be slightly different because of the requirement to have the outer dimensions of both grids align exactly. Finally, users should keep in mind that physical resolution and formal resolution of the lattice used to represent the data are two distinct quantities.
EMBACKGROUND
If the density restraint potential is in use, this optional keyword can be used to override the value determined to correspond to background in the input density map (ωbg in the equation above). This value is commonly set by binning the densities in all cells, and identifying a well-resolved peak in the histogram. If the map does not encode much background signal, the histogram-based determination may be inappropriate, and this is when this keyword is useful. Note that values refer to the original input density map. The meaning of the background level for positive definite atomic properties like mass is that of an approximate lower bound. The meaning for other properties (where values for individual atoms can offset each other, like charge) is typically more that of a median or mean. If more than one custom atomic property is used, this keyword should have the corresponding number of entries, one for each property. Users are advised not to confuse this keyword with EMBGDENSITY, which sets the expected background level in physical units (see the first equation shown for EMMAPFILE).
EMTHRESHOLD
If the density restraint potential is in use, this important keyword controls the linear transform used to interpret the input density map in terms of a physical density of one or more atomic properties of interest. Specifically, it sets a threshold level in units and numbers of the (potentially re-binned) input that distinguishes signal from background. Since measurements often have low contrast, the threshold is not an obvious property of the input map. The threshold set here corresponds to parameter ωt in the equation above. It is primarily responsible for the overall scaling factor, i.e., larger threshold values will generally produce interpreted maps with a wider spectrum of physical density values. Using the apparent molecular volume and the total integral, the chosen threshold directly determines the apparent physical density (reported in log-output). This quantity poses constraints on the chosen value, because the integrated signal must yield a density larger than the assumed physical background density. For atomic properties for which values for individual atoms can offset each other, the threshold is compared to the absolute values of the input density. This absolute value is, however, applied to the difference of the actual value from the automatically determined or specified background level. If more than one custom atomic property is used, this keyword should have the corresponding number of entries, one for each property.
EMTOTMASS
If the density restraint potential is in use, this keyword sets the net integral to be assumed for the selected atomic property or properties (units differ accordingly). This is basically the "total signal above noise" expected to be explained by the input density map, and it is consequently restricted to the lattice cells exceeding the chosen threshold. In general, for positive definite properties like atom or proton mass, this can be set to correspond exactly to the explicitly represented matter in the simulation (this is the default), but some cases may call for an override, e.g., when simulating only a part of the system without wanting to distort the interpretation of the map. The parameter corresponds to MM in the equation above. For other properties like partial charges, the value should be the sum of the absolute values of these properties. This is required to arrive at a meaningful interpretation of the map that is usable in a quantitative way for the density restraint potential. If more than one custom atomic property is used, this keyword should have the corresponding number of entries, one for each property.
EMTRUNCATE
If the density restraint potential is in use, this keyword enables truncation of the input map below the chosen value as long as it is higher than the minimum and lower than the assumed threshold level (ωt in the equation above). Truncation implies that the spectrum of values for the interpreted density is completely depleted below the specified level. For a positive definite property like mass, all values below the threshold level are simply assigned the background level, ωbg. Conversely, for a property that can assume both positive and negative values, all values below the threshold level are simply assigned the threshold level itself. In the former scenario, this technique is commonly used to eliminate noise from the input that may hamper sampling (it flattens a noisy background). In the latter scenario, EMTRUNCATE is the exact complement to EMFLATTEN and can be used to remove spikes in the negative direction. In all cases, the specified value or values refer to the density levels as found in the original input density map. If more than one custom atomic property is used, this keyword should have the corresponding number of entries, one for each property.
EMFLATTEN
Depending on how a density map is generated, the signal may cover a wide spectrum of values. This is particularly true if the contrast to the background is generally low, and the lack of contrast is compensated for by averaging over similar, but heterogeneous conformations. In such cases, the ratio of peak to barely detectable signals may be impossible to describe by physical densities of instantaneous conformations. If the density restraint potential is in use, this keyword therefore allows the user to flatten an input density map at the level specified here. The requirement is that the value be larger than the assumed threshold level. This keyword is complementary to EMTRUNCATE: in an exact manner for atomic properties that can take both negative and positive values (EMFLATTEN works only on the positive end) and in a qualitative manner for positive definite properties (where negative "spikes" are not realistic). Both keywords used concurrently can produce an interpreted map that is purely an envelope of a more or less fixed and homogeneous density. Finally, if more than one custom atomic property is used, this keyword should have the corresponding number of entries, one for each property.
EMHEURISTIC
The evaluation of the density restraint potential involves the summation of contributions from all the grid cells. Each cell contributes a squared difference of the input density and the actual density for the current conformation of explicit matter in the system. If the formal resolution is high, the evaluation of the potential can be costly. Occasionally, it may be possible to save some CPU time by applying dedicated heuristics, and this is what is controlled by this keyword. Choices are as follows:
- No heuristic is used. At each global evaluation of the density restraint potential, all grid cells are recomputed and summed up.
- When spreading the atomic masses in the system onto the analysis and evaluation grid, CAMPARI keeps track of whether any given xz-slice of the input map actually received a contribution from any atom. If not, the cells constituting this xz-slice are not recomputed, but instead a precomputed value for the entire slice is used. This is possible because the simulation densities in all the cells of the slice will be equivalent to the assumed background density. Efficacy of this heuristic obviously depends on the details of the system.
- This works identically to the previous option, except that x-lines are considered rather than xz-slices.
- This works identically to the previous options, except that local rectangular supercells are used rather than xz-slices or x-lines. Here, the algorithm will try to combine existing grid cells to yield approximately 1000 supercells. This option is probably the most successful in general, because it can match arbitrary arrangements of explicit matter best.
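The slice-based heuristic can be illustrated with a minimal Python sketch. It is not CAMPARI's implementation: the nested-list grid layout, the function names, and the `touched` flags are assumptions made for illustration, and any scaling prefactor of the potential is omitted. An xz-slice (fixed y index) that received no atomic contribution has the background density in every cell, so its sum of squared differences can be computed once and reused.

```python
def precompute_empty_slices(target, background):
    """Sum of squared differences per xz-slice (fixed y), assuming the
    simulated density equals the background in every cell of the slice."""
    nx, ny, nz = len(target), len(target[0]), len(target[0][0])
    return [
        sum((target[x][y][z] - background) ** 2
            for x in range(nx) for z in range(nz))
        for y in range(ny)
    ]


def restraint_sum(target, simulated, touched, empty_slice_sum):
    """Sum of squared density differences over all grid cells.

    touched[y] is True if any atom contributed to xz-slice y (hypothetical
    bookkeeping done while spreading atoms onto the grid)."""
    nx, ny, nz = len(target), len(target[0]), len(target[0][0])
    total = 0.0
    for y in range(ny):
        if not touched[y]:
            # Untouched slice: reuse the precomputed value.
            total += empty_slice_sum[y]
            continue
        total += sum((target[x][y][z] - simulated[x][y][z]) ** 2
                     for x in range(nx) for z in range(nz))
    return total
```

The same idea carries over to x-lines or supercells by changing which group of cells shares a flag and a precomputed value.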
Not yet documented.
GHOST
This keyword is a simple logical that determines whether or not to (partially) "ghost" the interactions of selected particles (see FEGFILE) with the rest of the system (and possibly amongst themselves → FEG_MODE). Such scaling of interactions creates artificial systems which can be used to interpolate between two well-defined end states. The most common need for such an application arises in cases where the two end states are significantly different and one is interested in the free energy difference. For example, to calculate the free energy of solvation of a small molecule in water, one could scale the interactions of the small molecule with water from zero to their full value. Such growth-based calculations are usually complicated to set up and perform since i) trajectories evolved at a given Hamiltonian have to be evaluated (on-the-fly usually) assuming different Hamiltonians, and ii) it is difficult to maintain an internally consistent system of interactions such that all changes induced by the ghosting can be mapped to atomic parameters of the ghosted species. In CAMPARI, FEG (free energy growth/ghosting) calculations are therefore supported in conjunction with limited Hamiltonians only: the only potentials allowed are IPP, ATTLJ, POLAR, and the bonded interactions. In other cases, it may be possible to extract the same or related quantities through other techniques realizable in CAMPARI. As an example, the free energy of solvation for a flexible (single) solute immersed in the ABSINTH continuum solvation model can be obtained by simultaneously scaling the dielectric from 1.0 to 78.0 and the DMFI from 0.0 to 1.0. The default settings for the auxiliary keywords to GHOST are such that the molecules or residues listed in FEGFILE will be completely ghosted (i.e., invisible to the system).
FEG_MODE
In FEG calculations, interactions (see GHOST) are always scaled between the ghosted species and the rest of the system. A natural question is what happens to interactions between or within ghosted species (if any are present). If they are not scaled but instead use the background Hamiltonian, it will be impossible to map the effect of the scaling to a change in atomic parameters, which is desirable from the viewpoint of rigor. As an example, consider polar interactions between a single ghosted butane molecule and a bath of non-ghosted water. A scaling of the atomic charges on the ghost butane by a factor f would give rise to interactions with the bath scaled by f and self-interactions scaled by f². This type of scaling is enforced in CAMPARI if a method requires it, such as treating electrostatics with the reaction-field method (see LREL_MD). In general, however, it is impossible to find a unique mapping while leaving the background Hamiltonian intact. It is therefore left to the user to determine which of two options to choose:
- Interactions between/within ghosted species use the full background Hamiltonian.
- Interactions between/within ghosted species use the scaled Hamiltonian.
FEG_IPP
This keyword specifies the "outside" scaling factor for the ghosted inverse power potential. Note that depending on the choice for FEG_LJMODE this is not as simple as SC_IPP and that additional parameters may determine the impact this keyword has. The setting here corresponds to the parameter sgIPP below. Note as well that the inverse power potential supported in calculations with ghosted interactions always uses an exponent of 12 (i.e., setting IPPEXP to anything but the default of 12 will cause CAMPARI to abort). This keyword is only relevant if GHOST is true.
FEG_ATTLJ
This keyword is analogous to FEG_IPP but controls the "outside" scaling of the attractive r⁻⁶ dispersive term. The setting here corresponds to the parameter sgattLJ below. Note that scaling this up while FEG_IPP is set to zero (or, depending on the mode, even set to something smaller) will potentially lead to numerical instabilities.
FEG_LJMODE
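The exact functional form of the scaled (ghosted) Lennard-Jones potential is as follows:
\[ E_{gLJ} = 4.0 \sum\sum_{i,j} \varepsilon_{ij}\, f_{1\text{-}4,ij} \left[ g(s_{gIPP}) \left[ \alpha\, h(s_{gIPP}) + (r_{ij}/\sigma_{ij})^{6} \right]^{-2} - g(s_{gattLJ}) \left[ \alpha\, h(s_{gattLJ}) + (r_{ij}/\sigma_{ij})^{6} \right]^{-1} \right] \]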
Here, the εij and σij are the standard pairwise Lennard-Jones parameters (see PARAMETERS), the f1-4,ij are potential 1-4 fudge factors (see FUDGE_ST_14) that generally will be unity, g(s) and h(s) are auxiliary functions whose functional form depends on the choice for this keyword, and α is the so-called soft-core radius (unitless). The two scaling factors sgIPP and sgattLJ are provided by keywords FEG_IPP and FEG_ATTLJ. There are three possible choices determining g(s) and h(s):
- Mode 1: \( g(s) = s \) and \( h(s) = 0 \)
- Mode 2: \( g(s) = s^{f_1} \) and \( h(s) = 1.0 - s^{f_2} \)
- Mode 3: \( g(s) = (1.0 - e^{-s f_1})/(1.0 - e^{-f_1}) \) and \( h(s) = (1.0 - s)^{f_2} \)
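As an illustration of the functional form above, the following Python sketch evaluates the ghosted Lennard-Jones energy for a single atom pair under the three choices for g(s) and h(s). It is a sketch under stated assumptions, not CAMPARI code: the function names and the default values for `alpha`, `f1`, and `f2` are chosen for illustration only and are not CAMPARI defaults.

```python
import math


def g_and_h(s, mode, f1=1.0, f2=1.0):
    """Auxiliary functions g(s) and h(s) for the three choices listed above."""
    if mode == 1:
        return s, 0.0
    if mode == 2:
        return s ** f1, 1.0 - s ** f2
    if mode == 3:
        # Requires f1 != 0, otherwise the denominator vanishes.
        return (1.0 - math.exp(-s * f1)) / (1.0 - math.exp(-f1)), (1.0 - s) ** f2
    raise ValueError("mode must be 1, 2, or 3")


def ghosted_lj_pair(r, eps, sigma, s_ipp, s_attlj, mode,
                    alpha=0.5, f1=1.0, f2=1.0, fudge_14=1.0):
    """Ghosted LJ energy of one pair; s_ipp and s_attlj play the roles of the
    two "outside" scaling factors, alpha that of the soft-core radius."""
    x6 = (r / sigma) ** 6
    g_rep, h_rep = g_and_h(s_ipp, mode, f1, f2)
    g_att, h_att = g_and_h(s_attlj, mode, f1, f2)
    repulsive = g_rep / (alpha * h_rep + x6) ** 2
    attractive = g_att / (alpha * h_att + x6)
    return 4.0 * eps * fudge_14 * (repulsive - attractive)
```

With both scaling factors at 1.0, all three choices reduce to the standard Lennard-Jones form. For intermediate scaling factors, the second and third choices keep the energy finite even at zero distance, which is the purpose of the soft core.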
FEG_LJRAD
This keyword allows the user to specify the parameter α in the above equations (see FEG_LJMODE), i.e., the soft-core "radius" for the modified Lennard-Jones potential. It is generally of limited utility to set this to zero since in that case the scaled potential could as well be created by setting FEG_LJMODE to 1, in which case this parameter becomes meaningless. Conversely, for large soft-core radii, the potential is modified at large distances, which generally represents an unnecessary modification that may slow down convergence in free energy calculations relying on interpolation via ghosting. Generally speaking, values around 0.5 are recommended for either mode 2 or 3. This keyword is only relevant if GHOST is true.
FEG_LJEXP
This keyword sets the parameter f1 in the above equations (see FEG_LJMODE). It offers a simple way to alter the weight of change experienced by the system depending on the choices of FEG_IPP and FEG_ATTLJ. In that sense, it is very closely tied to the design of the interpolation schedule (i.e., both address the exact same issue). There are no gold standard rules for picking this, and the user is referred to the literature for further details. In case of free energy calculations, it will be best to inspect the schedule empirically by metrics such as the statistical precision of the pairwise estimates or overlap metrics such as (theoretical) swap probabilities and to then refine either the schedule itself or the global settings accordingly. This keyword is only relevant if GHOST is true.
FEG_LJSCEXP
This keyword sets the parameter f2 in the above equations (see FEG_LJMODE). Much of the same discussion applies here as already mentioned for keywords FEG_LJRAD and FEG_LJEXP. This keyword is only relevant if GHOST is true.
FEG_POLAR
The only other non-bonded potential besides Lennard-Jones supported in FEG calculations is the polar potential (see SC_POLAR). This keyword provides a scaling factor (sgPOLAR) for the soft-core Coulomb potential. Much like the case for scaled LJ interactions (see above), this may involve three additional parameters (see FEG_CBMODE). Note that it would be most common to only scale this up while FEG_IPP is set to unity so as to avoid potential numerical instabilities.
FEG_CBMODE
In analogy to FEG_LJMODE, this keyword determines what exact functional form CAMPARI uses for the scaled (ghosted) Coulomb potential (with the "outside" scaling factor sgPOLAR set by FEG_POLAR):
\[ E_{gPOLAR} = (4.0\pi\varepsilon_0)^{-1} \sum\sum_{i,j} g(s_{gPOLAR})\, q_i q_j\, f_{1\text{-}4,C,ij} \left[ \alpha_C\, h(s_{gPOLAR}) + r_{ij} \right]^{-1} \]
Here, the atomic partial charges are represented as qi,j, ε0 is the vacuum permittivity, and rij is the interatomic distance. f1-4,C,ij denotes potential fudge factors acting on 1-4-separated atom pairs (see FUDGE_EL_14) but will generally assume a value of unity. g(s) and h(s) are the same auxiliary functions defined above for the Lennard-Jones potential (→ FEG_LJMODE) and αC is the soft-core radius (unitless) specific to the Coulomb potential (controlled by keyword FEG_CBRAD). For completeness the options are listed again in detail:
- Mode 1: \( g(s) = s \) and \( h(s) = 0 \)
- Mode 2: \( g(s) = s^{f_{C,1}} \) and \( h(s) = 1.0 - s^{f_{C,2}} \)
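The scaled Coulomb form can be sketched for a single pair in the same spirit. This is an illustrative assumption-laden example rather than CAMPARI code: the function name, the default parameter values, and the numerical prefactor (the Coulomb constant expressed in kcal·Å/(mol·e²)) are chosen here for illustration, and the soft-core radius is simply added to the distance as written in the formula.

```python
# 1/(4*pi*eps0) in kcal*Angstrom/(mol*e^2); value assumed for this sketch
COULOMB_PREFACTOR = 332.0637


def ghosted_coulomb_pair(r, qi, qj, s_polar, mode,
                         alpha_c=0.5, fc1=1.0, fc2=1.0, fudge_14=1.0):
    """Ghosted Coulomb energy of one pair of partial charges qi and qj."""
    if mode == 1:
        g, h = s_polar, 0.0
    elif mode == 2:
        g, h = s_polar ** fc1, 1.0 - s_polar ** fc2
    else:
        raise ValueError("mode must be 1 or 2")
    return COULOMB_PREFACTOR * g * qi * qj * fudge_14 / (alpha_c * h + r)
```

In mode 1 the result is the plain Coulomb energy multiplied by the scaling factor, which is why the soft-core radius and exponents have no effect there.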
FEG_CBRAD
This keyword is analogous to FEG_LJRAD and allows the user to choose the value for the soft-core radius specific to the Coulomb potential (αC in the equations under FEG_CBMODE). The specification is meaningless if FEG_CBMODE is set to 1.
FEG_CBEXP
This keyword is analogous to FEG_LJEXP and allows the user to choose the value for the polynomial scaling exponent to the Coulomb potential (fC,1 in the equations under FEG_CBMODE). The specification is meaningless if FEG_CBMODE is set to 1.
FEG_CBSCEXP
This keyword is analogous to FEG_LJSCEXP and allows the user to choose the value for the soft-core scaling exponent to the Coulomb potential (fC,2 in the equations under FEG_CBMODE). The specification is meaningless if FEG_CBMODE is set to 1.
FEG_BONDED_B
Non-bonded interactions provide a straightforward interpretation for parsing the energetics of the system into solute-solvent, solute-solute, and solvent-solvent contributions. This is used in a thermodynamic cycle argument when computing, for instance, the free energy of solvation of a solute in solvent via FEG methods. Sometimes (as alluded to under FEG_MODE), it may be desirable to scale intramolecular non-bonded interactions as well. But what about intramolecular bonded interactions? This keyword allows the FEG-like scaling of bonded terms associated with a ghosted species but not of those associated with non-ghosted particles. Beyond that, this keyword operates just like SC_BONDED_B. Note that this almost certainly creates a pathological situation if bond length potentials are allowed to approach zero and naturally relies on bond lengths being allowed to vary (see CARTINT) to be meaningful. Note that for all bonded parameters the assignment of terms to individual residues in a multi-residue molecule is somewhat arbitrary if atoms from two different residues participate.
FEG_BONDED_A
This is analogous to FEG_BONDED_B but applies to bond angle potentials. Note that this may lead to a pathological simulation if bond angle potentials are allowed to approach 0° or 180° and, again, relies on bond angles actually being varied throughout the simulation to be meaningful.
FEG_BONDED_I
This is analogous to FEG_BONDED_B but applies to improper dihedral angle potentials. Note that this may lead to a pathological simulation if improper dihedral angle potentials are allowed to approach zero and, again, relies on these degrees of freedom actually being varied throughout the simulation to be meaningful.
FEG_BONDED_T
This is analogous to FEG_BONDED_B but applies to proper dihedral angle potentials. Note that this relies on torsional angles actually being varied throughout the simulation to be meaningful (there may be subsets).
FEGREPORT
This simple logical keyword lets the user instruct CAMPARI to write out a summary of the ghosted particles (residues or molecules) in free energy growth/ghosting calculations.
SCULPT
The accelerated molecular dynamics method of Hamelberg et al. offers a general (parameter-dependent) way to modify the potential energy landscape or individual terms thereof (torsional potentials and 1-4 interactions have been used most often). The idea is that a controlled modification of the landscape that leads to reduced barrier heights is capable of massively accelerating the effective dynamics without reducing the ensemble overlap dramatically. CAMPARI offers a generalization of this approach as follows:
\[ E_{ELS} = \sum_i \left[ E_i + \Delta E_{i,ELS} \right] \]
\[ \Delta E_{i,ELS} = \begin{cases} 0 & \text{if } V_i^f < E_i < V_i^s \\ \dfrac{(V_i^f - E_i)^2}{V_i^f - E_i + \alpha_i^f} & \text{if } V_i^f > E_i \\ \dfrac{(V_i^s - E_i)^2}{V_i^s - E_i - \alpha_i^s} & \text{if } E_i > V_i^s \end{cases} \]
Here, the sum runs over all active terms of the Hamiltonian. These are generally the terms CAMPARI offers a global scaling factor for, e.g., the total DMFI of the ABSINTH model, EDMFI, the total sum of improper torsional potentials, EBONDED_I, etc. Limitations are discussed below. By default, the threshold energy parameters for every energy term, Vif and Vis, are initialized such that ΔEi,ELS is always zero, i.e., no sculpting occurs. They can be modified with the auxiliary keywords ELS_FILLS and ELS_SHAVES. Naturally, Vif must always be less than or equal to Vis. The parameters αif and αis must always be greater than or equal to zero. They serve as buffer parameters. The modified energy landscape for a given term has two possible modifications. First, its low energy states (local minima) can be filled up. Setting αif to zero flattens all low energy states to the specified threshold, Vif. Larger values for αif preserve the unbiased shape of the landscape more and more, and the limit of αif reaching infinity recovers the unbiased potential exactly. Second, its high energy states (barrier regions) can be shaved off, and setting αis to zero flattens all barrier regions to the value of Vis exactly. The effect of larger values is exactly analogous. Note, however, that potentials allowing for large positive energy values must be treated with caution (notably inverse power potentials). The value of ΔEi,ELS for large negative values of (Vis - Ei) obviously approaches (Vis - Ei) itself, which means that the barriers are more or less completely eliminated. This can be dangerous in conjunction with attractive nonbonded interactions (numerically speaking) and also lead to poor behavior during reweighting (see below).
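The piecewise sculpting correction for a single energy term can be written as a short Python sketch. The function and argument names are hypothetical; only the formula itself is taken from the equations above.

```python
def sculpt_delta(e, v_fill, v_shave, alpha_fill, alpha_shave):
    """Sculpting correction for one energy term with value e (kcal/mol).

    v_fill / v_shave are the lower and upper threshold energies and
    alpha_fill / alpha_shave the corresponding buffer parameters."""
    if v_fill > v_shave:
        raise ValueError("fill threshold must not exceed shave threshold")
    if alpha_fill < 0.0 or alpha_shave < 0.0:
        raise ValueError("buffer parameters must be zero or positive")
    if e < v_fill:
        d = v_fill - e              # positive: basin is filled up
        return d * d / (d + alpha_fill)
    if e > v_shave:
        d = v_shave - e             # negative: barrier is shaved off
        return d * d / (d - alpha_shave)
    return 0.0
```

With a buffer parameter of zero, `e + sculpt_delta(...)` equals the respective threshold exactly (a flat surface), whereas very large buffer parameters drive the correction toward zero and recover the unbiased term.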
This keyword (SCULPT) allows the user to specify one or more terms to be sculpted (list of integers). The choices available correspond exactly to the columns of output file ENERGY.dat (click the link for a list). It includes the total energy (choice 2), which is mutually exclusive with any other term. There are further limitations as follows:
- In gradient-based simulations (including hybrid runs), nonbonded interactions can only be controlled as a single joint term (sum), viz., the sum of all active short-range steric interactions (see SC_IPP, SC_ATTLJ, and SC_WCA) as well as polar and tabulated interactions (see SC_POLAR and SC_TABUL). The correct code to use for this joint term is 15.
- The use of the (quasi-obsolete) correction potential is not supported when using any energy landscape sculpting.
ELS_FILLS
If the energy landscape sculpting method is in use, this keyword supplies the parameters Vif described above. Values are to be provided in kcal/mol. For example, if the choices for SCULPT are "20 22", then a choice for ELS_FILLS of "5.0 5.0" would provide lower threshold energies of 5.0 kcal/mol each to both proper dihedral angle potentials and to CMAP potentials. It is not possible to skip values, i.e., the length of the list supplied here should be identical to that for SCULPT. To disable the basin filling aspect of sculpting, it is generally safe to supply a very large negative energy here.
ELS_SHAVES
If the energy landscape sculpting method is in use, this keyword supplies the parameters Vis described above. Values are to be provided in kcal/mol. The interpretation is identical to keyword ELS_FILLS above. To disable the barrier shaving aspect of sculpting, it is generally safe to supply a very large positive energy here.
ELS_ALPHA_F
If the energy landscape sculpting method is in use, this keyword supplies the parameters αif described above. Values are to be provided in kcal/mol and must be zero or positive. Note that a choice of zero inevitably leads to force discontinuities. In addition, the absence of any force (flat surface) will lead to the natural shape of the landscape being completely forgotten, which can deteriorate the statistical significance of the reweighted results.
ELS_ALPHA_S
If the energy landscape sculpting method is in use, this keyword supplies the parameters αis described above. Values are to be provided in kcal/mol and must be zero or positive. The keyword is interpreted identically to ELS_ALPHA_F above and applies to the barrier shaving aspect.
ELS_PRINT_WEIGHTS
If the energy landscape sculpting method is in use, this keyword controls the output frequency for output file ELS_WFRAMES.dat, which contains the corresponding simulation step numbers (that will of course increase in steps of ELS_PRINT_WEIGHTS) and the associated weights. These weights are derived from knowledge of the applied net sculpting potential for each snapshot as wi = exp(β EELS). They can be used in a trajectory analysis run with user-supplied frame weights. Note that large positive values of the sculpting potential will make the reweighting susceptible to shot-like noise (due to few conformations receiving very large weights).
ELS_THRESHOLD
If the energy landscape sculpting method is in use, and weights are requested, this largely undocumented keyword can be used to apply a minimum threshold to prevent values that are practically zero from being written.
EWALD
CAMPARI supports using the Ewald decomposition technique to compute long-range electrostatic interactions in fully 3D periodic systems (see LREL_MD). This includes orthorhombic and triclinic simulation containers (see SHAPE). There are two supported approaches to computing the reciprocal space sums in the Ewald formalism:
- Particle-Mesh Ewald (PME): This elegant and vastly popular method, introduced originally by Darden et al. and Essmann et al., uses discrete Fourier transforms (DFFTs) and cardinal B-splines to simplify the computation of the reciprocal space sum. Due to the DFFTs, CAMPARI needs to be linked against the free, open-source FFTW library for this option to be available. Briefly, PME reciprocal space sums have different scaling components: i) the number of charges; ii) the number of grid-points; iii) the interpolation order for the cardinal B-splines. It depends strongly on the system which of these components is the speed-limiting factor, in particular since the accuracy of the reciprocal sum depends on the simultaneous optimization of the spline order (see BSPLINE) and the grid-size (EWFSPAC) given that the real-space part co-determines the Ewald parameter (EWPRM). Note, however, that the fundamental scaling with the number of charges is O(N). The performance of PME is only partially controlled by CAMPARI as the library calls can be the bottleneck. Coarser grids, higher spline orders, and higher number densities of partial charges decrease the relative workload of the DFFTs. The general performance of the DFFTs can sometimes be improved by providing or computing a better "plan". This is supported by keyword EWFFTPLANNER and the associated wisdom-file keywords (see EWWISDOMFILE). If the shared memory (OpenMP) parallelization of CAMPARI is in use, this also triggers calling the threaded FFTW library. The performance of this call is tricky to predict, however, because it is spawned from a multi-threaded execution region to begin with (inside an OpenMP MASTER construct). This causes additional thread generation and destruction operations that are a cost factor, both directly and indirectly (through the kernel having to manage a temporarily oversubscribed machine). It implies that the performance results become strongly dependent on the thread affinity model and respond to environment variables such as OMP_PROC_BIND or OMP_PLACES. Keyword THREADS_TEST allows a quick way to test parallel FFTW performance for the system at hand. Irrespective of these complications, PME is the recommended (since fastest) implementation of Ewald sums.
- Standard Ewald: A straightforward computation of the reciprocal part of the original decomposition introduced by Ewald is supported by CAMPARI as well. This method is very slow and scales poorly (K³) with the (linear) cutoff size in the reciprocal dimension. Much like PME, the reciprocal sum fundamentally scales as O(N) with the number of charges, however. Standard Ewald is unlikely to ever be a reasonably efficient alternative. For this, tight cutoffs in reciprocal space must be permissible (an example case might occur when PME is slowed down due to a dominant cost imposed by DFFTs such as in very dilute systems using very big boxes). If the shared memory (OpenMP) parallelization of CAMPARI is in use, the standard Ewald sum is expected to scale well as long as the number of residues is reasonably large.
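The role of the Ewald parameter in the decomposition can be illustrated with the textbook split of the 1/r kernel into a short-range part handled in real space and a smooth complement handled in reciprocal space. The Python sketch below shows only this split for a single pair distance; it is not CAMPARI's implementation, the function name is an assumption, and self-terms, charges, and prefactors are deliberately left out.

```python
import math


def ewald_pair_split(r, kappa):
    """Split 1/r into real-space and reciprocal-space kernels.

    r is the pair distance in Å and kappa the Ewald parameter in Å^-1."""
    real_space = math.erfc(kappa * r) / r   # decays quickly, truncated at the cutoff
    reciprocal = math.erf(kappa * r) / r    # smooth, evaluated in reciprocal space
    return real_space, reciprocal


# Real-space kernel remaining at a 10 Å cutoff for two values of kappa
for kappa in (0.2, 0.3):
    print(kappa, ewald_pair_split(10.0, kappa)[0])
```

The two parts always add up to 1/r. A larger Ewald parameter makes the real-space part negligible at a shorter distance (permitting a shorter real-space cutoff) while shifting more of the work, and more of the accuracy burden, to the reciprocal sum.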
If the Ewald method is used for treating long-range electrostatic interactions, this keyword can be used to set an accuracy tolerance for the tabulated computation of the (complementary) error function (and its derivative). This uses additional tricks to save operations (details omitted). Because the table of values can be large, the performance of this implementation is usually cache- rather than FLOP-limited. The tabulation can be disabled at the compilation stage by passing the variable "DISABLE_ERFTAB" (see installation instructions), as many modern compilers offer support for fast math libraries, occasionally also with controllable precision.
BSPLINE
When using the PME method (see LREL_MD and EWALD), this keyword determines the order of the cardinal B-splines to be used. The order can be increased at a moderate cost, such that it is sometimes advantageous to choose a higher interpolation order coupled to a relatively coarse mesh (see EWFSPAC) instead of a lower interpolation order coupled to a finer mesh. The default order is 6, and currently only even numbers are permitted (odd numbers are adjusted to the default automatically). For various reasons, it is not recommended to use orders below 4. In any case, it can be useful to try different settings and study the predicted accuracy and initial energies that are reported at the beginning (summary of the calculation written to log-output) or to test convergence of energies manually on a single-point calculation (using, for example, the trajectory analysis framework).
The B-splines are always treated as products of independent contributions in the three dimensions, irrespective of whether a triclinic or orthorhombic box is used. This is because the implementation works in fractional unit cell coordinates. The only corrections required are to calculate these correctly and to account for the correct chain rule derivative when incrementing the Cartesian forces. These minor complications do not add noticeable computational cost.
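For readers unfamiliar with cardinal B-splines, the following Python sketch evaluates them with the standard recursion used in the smooth PME literature and computes the weights with which a charge is spread onto the nearest mesh points along one dimension. The function names are hypothetical and the code is a didactic sketch, not CAMPARI's (optimized) implementation.

```python
def cardinal_bspline(order, u):
    """Cardinal B-spline M_order(u), nonzero only for 0 < u < order."""
    if order < 2:
        raise ValueError("order must be at least 2")
    if u <= 0.0 or u >= order:
        return 0.0
    if order == 2:
        return 1.0 - abs(u - 1.0)
    # Recursion: M_n(u) = [u*M_{n-1}(u) + (n-u)*M_{n-1}(u-1)] / (n-1)
    return (u * cardinal_bspline(order - 1, u)
            + (order - u) * cardinal_bspline(order - 1, u - 1.0)) / (order - 1)


def spline_weights(order, frac):
    """Weights on the `order` mesh points affected by a charge whose
    fractional offset from a mesh point is 0 <= frac < 1 (one dimension)."""
    return [cardinal_bspline(order, frac + k) for k in range(order)]


# Weights for the default order of 6; in three dimensions the weights
# are products of three such one-dimensional sets.
print(spline_weights(6, 0.37))
```

The weights are non-negative and sum to one, so the total charge is conserved when spreading. A higher order distributes each charge over more mesh points, which is why it can compensate for a coarser mesh at moderate extra cost.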
EWFSPAC
When using the PME method (see LREL_MD and EWALD), this keyword determines the grid spacing for the mesh in Å. A smaller value yields a finer mesh, which in turn yields more accuracy. The cost associated with finer grids easily becomes substantial (K³-scaling), though, even when using the DFFTs provided by FFTW. The code will occasionally adjust too coarse a value since the interpolation order (BSPLINE) requires a certain minimum for the number of available mesh points in each dimension. When using the standard Ewald method, keyword EWFSPAC determines the reciprocal space cutoffs to either side directly as the ratio of half the box side length to the value of EWFSPAC. In any case, it can be useful to try different settings and study the predicted accuracy and initial energies that are reported at the beginning (summary of the calculation written to log-output).
EWPRM
When using the Ewald method (see LREL_MD and EWALD), this can be used to overwrite the automatically determined value for the Ewald parameter. The Ewald parameter is given in units of Å⁻¹ (but can just as well be defined as a dimensionless parameter). It determines the relative weight of the real-space and the reciprocal sum in determining the total electrostatic energy of the system. The larger EWPRM is, the more weight shifts to the reciprocal sum. Note that the accuracy of the Ewald method is highly sensitive to this parameter in conjunction with the real-space and reciprocal space cutoffs and that a catastrophic lack of accuracy can easily be realized. Therefore, the code tries to determine a reasonable value for the Ewald parameter based on the (hard) settings for the real-space cutoff (NBCUTOFF) as well as EWFSPAC and, in the case of the PME method, BSPLINE. Unfortunately, the accuracy predictor formulas in use are currently somewhat flawed (they are based on the mean force error estimates presented by Petersen). They should be more accurate for the standard Ewald method than for PME since in the latter certain error contributions from the spline-based interpolation are missing. Hence, the automatically chosen parameter should by no means be considered an optimal one, merely one which, given the cutoff settings, provides comparatively small errors in forces and energies. Should the procedure be deemed inadequate or should there be an independent estimate of the error, this keyword comes into play. In any case, it can be useful to try different settings and study the predicted accuracy and initial energies that are reported at the beginning (summary of the calculation written to log-output).
EWFFTPLANNER
This keyword is only relevant if the particle-mesh Ewald method is used (see LREL_MD and EWALD). It can then be used to control how hard the linked FFTW library will try to compute an efficient plan for the involved DFFTs before the start of the simulation. The options are as follows:
- A heuristic and practically cost-free estimate is used. For simple cases and geometries, this is often appropriate enough (corresponds to FFTW_ESTIMATE).
- Explicit measurements are performed to pick a reasonable plan but the algorithmic space explored is limited (this is the default and corresponds to FFTW_MEASURE).
- Explicit measurements are performed to pick a reasonable plan and the algorithmic space is widened relative to the previous option (corresponds to FFTW_PATIENT).
- Many explicit measurements are performed across a wide algorithmic space to determine the best plan (this is very expensive and corresponds to FFTW_EXHAUSTIVE).
- A previously determined plan is read in in the form of a "wisdom file" (see EWWISDOMFILE).
This keyword is only relevant if the particle-mesh Ewald method is used (see LREL_MD and EWALD). It instructs CAMPARI to save a DFFT plan generated by the FFTW library in the form of a wisdom file. For this, the value for EWFFTPLANNER has to be 1-4. Note that wisdom files are not compatible between the threaded and serial FFTW libraries. The former is invoked automatically if the shared memory (OpenMP) parallelization of CAMPARI is in use. The name for the wisdom file is chosen by keyword EWWISDOMFILE.
EWWISDOMFILE
This keyword is only relevant if the particle-mesh Ewald method is used (see LREL_MD and EWALD). It provides the name for a FFTW wisdom file to be either read in (if EWFFTPLANNER is 5) or written (if EWFFTPLANNER is not 5). Note that the wisdom file is an autonomous output file of the FFTW library and not documented further either here or in the dedicated documentation. Users can find more information elsewhere (links out).
RFMODE
When using the Reaction-Field method (see LREL_MD), this keyword determines whether the corrections include a continuum electrolyte assumption (generalized reaction field) or not:
- The generalized reaction-field correction is used. By default, the code determines the concentration of net charges (including those which are part of macromolecules) and derives an effective ionic strength. The default can be overridden by supplying a positive value for keyword IONICSTR. This bulk electrolyte concentration is used to model the dielectric response outside of the cutoff sphere for an individual charge in a Poisson-Boltzmann sense. If an ensemble with fluctuating volumes is in use, this value should be the equilibrium bulk value, which means that either the initial volume would have to be close to the true value or that a suitable override be used.
- The standard reaction-field correction is used. Irrespective of the existence of free, net charges in the system, the dielectric response is simply an approximate solution to the Poisson equation.
Cutoff Settings:
If nonbonded interactions dependent on interatomic distances are in use (IPP, ATTLJ, WCA, IMPSOLV, TABUL, and POLAR), it is often necessary to truncate these interactions. Historically, there have been a large number of implementations to achieve this, both in how to effectively determine the interactions to compute and how to deal with the force discontinuity and truncation. CAMPARI does not implement empirical switching functions. The WCA and IMPSOLV potentials have exact cutoffs by virtue of their functional forms. IPP, ATTLJ, and TABUL can only be truncated. Long-range electrostatics options are supported (LREL_MC and LREL_MD). Keyword CUTOFFMODE controls whether to apply truncation at all and how to search for nearby interaction partners. This is (currently) always done using residue-based, buffered neighbor lists. The neighbor lists are in general not post-processed to achieve exact truncation at the chosen distance, which means that the effective range of interactions is larger (dependent on the buffer sizes, which are computed from residue radii).
All cutoff implementations are responsive to the shared memory (OpenMP) parallelization although some particular combinations of samplers, cutoffs, and Hamiltonians may not be supported. In general, the parallel efficiency of the neighbor list calculations is acceptable. The major performance-limiting factor is that relatively complex and large data structures are used in relation to a relatively low number of floating point operations, which can expose weaknesses in cache management.
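To make the notion of a residue-based, buffered neighbor list concrete, here is a minimal Python sketch. It assumes, purely for illustration, that the buffer for a residue pair is the sum of the two residue radii and that distances are measured between residue reference atoms without periodic images; the actual buffer rule and data structures in CAMPARI may differ, and all names are hypothetical.

```python
import math


def residue_neighbor_pairs(ref_positions, radii, cutoff):
    """Return residue pairs (i, j) whose reference atoms are closer than
    the cutoff plus a buffer derived from the residue radii.

    ref_positions: list of (x, y, z) reference-atom coordinates in Å
    radii: list of residue radii in Å (assumed buffer: sum for the pair)
    cutoff: interaction cutoff in Å"""
    pairs = []
    n = len(ref_positions)
    for i in range(n):
        for j in range(i + 1, n):
            distance = math.dist(ref_positions[i], ref_positions[j])
            if distance <= cutoff + radii[i] + radii[j]:
                pairs.append((i, j))
    return pairs


positions = [(0.0, 0.0, 0.0), (13.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
print(residue_neighbor_pairs(positions, [2.0, 2.0, 2.0], 10.0))
```

In this example the first two residues are listed as neighbors although their reference atoms are 13 Å apart and the cutoff is 10 Å. All atom pairs between listed residues are then computed, which is why the effective interaction range exceeds the nominal cutoff.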
The following modes are available:
- If - for whatever reason - cutoffs are undesirable, the code will assume that all residues are spatial neighbors and compute all interactions at every step. Note that not all combinations of samplers and Hamiltonians might support this option since optimized loops relying on neighbor lists are often employed (and/or the method may rely in its formulation on a cutoff). Limited support (or performance) may also exist if the shared memory (OpenMP) parallelization is in use. It implies that most other keywords in this section become meaningless (e.g., NBCUTOFF, LREL_MD, etc.).
- This option is obsolete.
- This option instructs CAMPARI to employ grid-based cutoffs. The grid association is governed at the residue level by the position of the residues' reference atoms. All grid-based methods (with a uniform mesh) are difficult/inefficient for systems with very asymmetric density (such as a single very long extended chain in a large periodic box) since those systems would require grids that are either too large (inefficient and memory-consuming) or so coarse that no efficient pre-screening can occur. Grid-based cutoffs are a good choice for systems with homogeneous density and many small (few atoms) residues. They are absolutely indispensable for simulations of large explicit water systems as any other cutoff mode supported by CAMPARI will critically slow down in such scenarios. Like all cutoffs in CAMPARI, the grid is used with a buffer size dependent on residue radii (in addition to the actual interaction cutoffs, viz., NBCUTOFF and ELCUTOFF). The parameters of the grid are controlled using a number of keywords, including GRIDMAXGPNB and GRIDMAXRSNB. If the shared memory (OpenMP) parallelization is used, the performance of grid-based cutoffs is hampered if residues are reassigned frequently, which happens in many Monte Carlo calculations (all trial moves matter, not just the accepted ones). This is because the global copy of the grid association must be kept in sync. Note that some limitations exist: for any aperiodic boundary, CAMPARI uses a fixed buffering, which requires that SOFTWALL must not be too small; in (assumed) NPT conditions, the set of grid neighbor points is set only once at the beginning, and large volume fluctuations could lead to catastrophic errors (missed interactions), in both periodic (grid rescales) and aperiodic (fixed buffer) dimensions.
- The last available option instructs CAMPARI to employ topology-assisted cutoffs, which is currently the default. Here, interatomic distances are simply pre-screened by a master value for the two reference atoms of residue pairs. This takes advantage of molecular topology to simplify the generation of spatial neighbor lists since only residues which pass the pre-screen are assumed to be spatial neighbors. Note that the program will compare the distance between the two reference atoms to the sum of the cutoff and the effective radii of the two residues in question. These radii are currently hard-coded. This mode is the method of choice for systems with heterogeneous density and/or large (many atoms) but relatively few (<1000) residues. Note that in the presence of nonbonded interactions, method 3 (grid-based cutoffs) reduces the scaling of CPU time with system size from N² to something considerably more favorable (where N is the number of atoms). Method 4 (topology-assisted cutoffs) does not change the scaling behavior but reduces the constant factor for this cost dramatically (by experience, for a water box of ~1000 molecules, mode 4 is still slightly faster). This option should generally perform well in conjunction with the shared memory (OpenMP) parallelization of CAMPARI.
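The residue-level pre-screen described for topology-assisted cutoffs can be sketched as follows. This is only an illustration of the distance test named in the text, not CAMPARI's implementation: the function names, the plain coordinate tuples, and the all-pairs loop are assumptions made for the example.

```python
import math

def residues_are_neighbors(ref_a, ref_b, radius_a, radius_b, cutoff):
    """Residue-level pre-screen (hypothetical helper): compare the distance
    between two reference atoms to the cutoff plus both residue radii."""
    return math.dist(ref_a, ref_b) <= cutoff + radius_a + radius_b

def neighbor_pairs(ref_atoms, radii, cutoff):
    """Return all residue index pairs (i, j) with i < j that pass the pre-screen.
    The all-pairs loop keeps the N^2 scaling; only the constant factor shrinks
    because atom-level distances are evaluated later for listed pairs only."""
    n = len(ref_atoms)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if residues_are_neighbors(ref_atoms[i], ref_atoms[j], radii[i], radii[j], cutoff)
    ]

# Three residues on a line (coordinates and radii in Angstrom, made-up values)
pairs = neighbor_pairs([(0, 0, 0), (12, 0, 0), (30, 0, 0)], [2.0, 2.0, 2.0], 10.0)
print(pairs)
```

In this toy setup, the first two residues are listed as neighbors although their reference atoms are 12 Å apart with a nominal cutoff of 10 Å, which illustrates why the effective interaction range exceeds the specified value when no exact truncation is applied.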
If cutoffs of nonbonded interactions have been requested, this keyword sets the interaction range for a part of the nonbonded interactions. It is interpreted differently depending on the type of calculation:
- For Monte Carlo calculations (see DYNAMICS), it simply sets the non-bonded (IPP, ATTLJ, WCA, and IMPSOLV) cutoff in Å. Neighbor lists are populated based on this value, and exact truncation is performed unless keyword MCCUTMODE is set differently from the default value of 1. All the potentials governed by NBCUTOFF should conceptually be short-range in nature. For WCA and the calculation of solvent-accessible volumes for the ABSINTH DMFI, users should choose values that are guaranteed to comply with the intrinsic cutoffs of these methods, which are parameter-dependent (see parameter file and keyword WCA_CUT). For ATTLJ, it must be noted that the properties of systems of heterogeneous density can depend significantly on the truncation of the 6th-power dispersive term. This affects, for example, simulations of polymers in implicit solvent. In these cases, both the cutoff value and the choice for MCCUTMODE are important parameters of the Hamiltonian.
- For gradient-based calculations, NBCUTOFF defines the short-range regime, within which all interactions and forces are guaranteed to be computed at every time step. This is complementary to the choice for ELCUTOFF. If these two cutoffs are not the same, they form a twin-range pair with some interactions being recomputed only every NBL_UP steps. It is important to realize that the neighbor lists underlying these types of cutoffs are residue-based. They are buffered in order to fulfil the aforementioned guarantee, and the buffer values depend on residue size. This means that the true interaction range can be considerably larger than the value specified (unless the simulation involves only monoatomic molecules). For certain Hamiltonians, CAMPARI supports having the same exact truncation scheme that is the default for Monte Carlo runs (see MCCUTMODE). In this framework, excluding simulations of single atoms, any usefulness of a twin-range approach to the cutoffs is largely eliminated, and this is because of the buffering.
If cutoffs of nonbonded interactions have been requested, this keyword sets the interaction range for the remainder of the nonbonded interactions (not covered by keyword NBCUTOFF). It is interpreted differently depending on the type of calculation:
- For MC calculations (see DYNAMICS), it simply sets the second nonbonded (TABUL and POLAR) cutoff in Å. All the potentials governed by ELCUTOFF are potentially long-range in nature. It is up to the user to ensure that tabulated potentials can be truncated safely in the distance regime set by ELCUTOFF. Importantly, the terms supported in a Monte Carlo run and governed by this cutoff are never truncated exactly at the cutoff distance. Such an exact truncation leads to severe cutoff artifacts due to imbalances in evaluating dipole-dipole interactions between groups of partial charges as part of Coulomb terms. This is also the reason why CAMPARI will complain if the partial charges in a residue do not add up to an integer value. Note that even interactions beyond the buffered second cutoff may be computed: Coulomb terms involving moieties flagged as carrying a net charge may not be subjected to any cutoffs based on the chosen setting for LREL_MC.
- For gradient-based calculations, it defines the mid-range regime, within which all interactions and forces are computed accurately, but only every nth time step, i.e., at a lower frequency which is set by the neighbor list update frequency (see keyword NBL_UP). As for the Monte Carlo case above, exact truncation does not happen with one prominent exception, i.e., Coulomb terms in conjunction with a reaction-field treatment where the truncation is necessary. Other options for LREL_MD may, like for LREL_MC, lead to interactions being computed even when they are not in the neighbor lists based upon the value of ELCUTOFF. If the values for NBCUTOFF and ELCUTOFF are not the same, the twin-range terms are assumed to be approximately constant for the number of steps between neighbor list updates. Twin-range cutoffs are explicitly disallowed for the Ewald and reaction-field methods. If CAMPARI computes additional interactions, i.e., if LREL_MD is either 4 or 5, these interactions are subjected to the same assumption for forces and energies (residue pairs with distances beyond the buffered cutoff).
When using nonbonded interaction potentials in conjunction with cutoffs, Monte Carlo calculations typically truncate short-range interactions (IPP, ATTLJ, WCA, and solvent-accessible volume calculations for the ABSINTH DMFI) exactly at the cutoff distance. The latter two terms have intrinsic cutoffs, and the choice for NBCUTOFF should be consistent with these intrinsic limits. This is not true for IPP and ATTLJ. In particular for the latter, if the system has inhomogeneous density, the properties of the system can depend significantly on the truncation scheme. In gradient-based calculations, the cutoff on short-range terms is, by default, always used exclusively to populate the corresponding neighbor lists. This means that no exact truncation occurs. Clearly, these two settings are not identical except for pairs of single-atom residues. Because of this, keyword MCCUTMODE offers the following two options:
- Favor exact truncation. This is the default and will apply, without limitations, to pure Monte Carlo calculations. For other choices for keyword DYNAMICS, only limited support is available. A hybrid sampler with a Hamiltonian for which exact truncation cannot be achieved (due to code-internal reasons) will terminate CAMPARI with an error. A pure gradient-based calculation may see the choice for MCCUTMODE be changed (a warning is printed).
- Avoid exact truncation. This is the more general setting and is supported for all Hamiltonians. By choosing a pure residue-level exclusion approach, many more interactions tend to be computed for the biopolymers supported in CAMPARI. This is because of the large buffer radii in use. This option can be used to safely achieve exactly the same Hamiltonian in hybrid MC/MD calculations or for gradient testing.
At present, there are three supported options as follows:
- Every input conformer is scanned and the maximum distance of any atom to any other atom is collected per atom. The atom with the shortest such distance is chosen as the reference atom, and the residue radius is set to 1.5 times this distance. This is meant to account for subsequent variations in the conformation of the small molecule. The downsides of this approach are: i) atoms can be picked which in some conformers are central but in others are very poor choices; ii) different atoms and radii might be picked for different conformers, which causes problems with consistency across different conformers (for example when rescoring existing poses or when using conformer libraries as starting points). Due to these issues, it is not recommended to use this mode (maintained for legacy reasons).
- The molecular topology (bond matrix) is scanned and the minimum number of covalent bonds it takes to reach any other atom in the molecule is collected per heavy atom. The atom with the smallest such number and, in case of ties, occurring first in the input is chosen as the reference atom. The residue radius is set heuristically to 1.1·(BC+1) where BC is the respective minimum number of bonds. The downside of this approach is that typical bond lengths are effectively hard-coded into the heuristic. This is the default since CAMPARI version 4.1.
- The molecular topology (bond matrix) is scanned and the minimum number of covalent bonds it takes to reach any other atom in the molecule is collected per heavy atom. The atom with the smallest such number and, in case of ties, occurring first in the input is chosen as the reference atom. The residue radius is set to 1.25 times the largest distance from this atom to any other atom in the input conformer. This is a compromise that is more tolerant to unusual geometries than mode 2 and more stable than mode 1.
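The topology-based choice of reference atom (second option above) can be illustrated with a breadth-first search over the bond graph. This is a sketch under assumptions: the adjacency-list input, the function names, and the reading of "number of bonds it takes to reach any other atom" as the largest bond count from a given atom to any other atom are ours, not taken from the CAMPARI source.

```python
from collections import deque

def bond_eccentricity(adjacency, start):
    """Number of bonds needed to reach the most distant atom from start (BFS)."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in depth:
                depth[neighbor] = depth[atom] + 1
                queue.append(neighbor)
    return max(depth.values())

def pick_reference_atom(adjacency, is_heavy):
    """Pick the heavy atom with the smallest bond count BC (first in input on ties)
    and return (atom index, heuristic radius 1.1*(BC+1))."""
    best_atom, best_bc = None, None
    for atom in range(len(adjacency)):
        if not is_heavy[atom]:
            continue
        bc = bond_eccentricity(adjacency, atom)
        if best_bc is None or bc < best_bc:  # strict '<' keeps the first atom on ties
            best_atom, best_bc = atom, bc
    return best_atom, 1.1 * (best_bc + 1)

# Linear chain of five heavy atoms: 0-1-2-3-4
chain5 = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(pick_reference_atom(chain5, [True] * 5))
```

For the five-atom chain, the central atom (index 2) needs at most two bonds to reach any other atom, so it is selected with a radius of 1.1·(2+1). Because the result depends only on the bond matrix, it is identical for every conformer of the same molecule, which is the consistency advantage over the first option.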
This keyword provides the update frequency for neighbor lists in gradient-based calculations. Every NBL_UPth step, it is recalculated which residues are within a distance of NBCUTOFF Å (short-range) and which ones are within a distance of ELCUTOFF Å (mid-range). Interactions with the former are computed at every time step explicitly and those with the latter are computed only every NBL_UPth step explicitly. For interactions outside of either cutoff, truncation occurs unless the electrostatic model chosen provides a long-range term (see LREL_MD). These latter interactions will then be recomputed at the same frequency as the mid-range ones (with the exception of the reciprocal space sum in Ewald methods which is always computed at every step). Note that this keyword is irrelevant if CUTOFFMODE is set to 1, a setting useful only for debugging purposes.
The assumptions made by this keyword are rather aggressive, and it is therefore recommended to use it with caution. Specifically, the neighbor lists here should not be thought of as "buffered" in any way. The integrator noise accumulating by setting this to something large can be quite substantial, and should probably be offset by a large choice for the outer cutoff distance (→ ELCUTOFF). Conversely, the use of residue-level neighbor lists with large effective radii tends to bloat the effective cutoff radius, which creates something akin to an effective buffer zone. This implementation may be changed in the future.
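The twin-range bookkeeping implied by NBL_UP can be sketched as a loop that evaluates short-range terms at every step but refreshes mid-range terms only at list updates. The function and argument names are hypothetical, forces are reduced to single numbers, and counting updates from step 0 is an assumption.

```python
def twin_range_forces(n_steps, nbl_up, short_range, mid_range):
    """Return per-step total 'forces' and the number of mid-range evaluations.
    short_range(step) is evaluated every step; mid_range(step) only every
    nbl_up-th step, with the cached value reused in between."""
    cached_mid = 0.0
    mid_evaluations = 0
    totals = []
    for step in range(n_steps):
        if step % nbl_up == 0:
            # neighbor lists would be rebuilt here as well
            cached_mid = mid_range(step)
            mid_evaluations += 1
        totals.append(short_range(step) + cached_mid)
    return totals, mid_evaluations

# Toy model: constant short-range term, mid-range term drifting with time
totals, n_mid = twin_range_forces(6, 3, lambda s: 1.0, lambda s: float(s))
print(totals, n_mid)
```

In the toy run, the mid-range contribution is frozen between updates, so the total lags behind the true value by an amount that grows with NBL_UP. This is the source of the integrator noise mentioned above.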
This keyword determines CAMPARI's method of handling long-range electrostatic interactions in MC calculations. There are currently several options for this with more being added in the future. A general problem is hidden in the fact that MC calculations have to be able to compute relative energies of drastically different configurations at every step such that similarity assumptions cannot be used to speed up the calculations as is the case in MD/LD/BD.
- All monopole-dipole and monopole-monopole interactions are computed explicitly (at full atomic resolution). By default, the governing factor is the parser for the partial charge sets which determines the individual charge groups (see option 2 for ELECMODEL and output files DIPOLE_GROUPS.vmd and MONOPOLES.vmd). Those with a total charge exceeding a threshold (usually zero) are considered "net charges", and those without are considered "dipoles". The flagging is at the residue level, and can be overwritten by a dedicated patch facility. Interactions between dipole groups are skipped even if one or both of the participating residues are flagged. For large systems, the number of interactions can grow dramatically of course. Using this option also requires allocation of a potentially large matrix if grid-based cutoffs are in use, which can hamper parallel performance.
- All monopole-monopole interactions are computed explicitly (at full atomic resolution). As in the option above, the flagging is at the residue level, and here both residues are required to be flagged. Dipole-dipole and dipole-monopole interactions are skipped even if both of the participating residues are flagged. For plasmas or ionic liquids or concentrated ionic solutions, the number of interactions can become prohibitively large of course. It also requires allocation of a potentially large matrix if grid-based cutoffs are in use, which can hamper parallel performance.
- This is identical to the previous option except that monopole-monopole terms are computed at a reduced resolution, viz., polyatomic monopole groups are represented by collapsing the total charge onto a single atom, which is nearest to the true monopole center. This choice is currently the default. The same caveats as for option 2 apply.
- No additional interactions are computed (rigorous truncation).
Note that periodic boundary conditions are inconsistent with any of the above treatments with the exception of truncation. This is because in PBC the largest effective cutoff value for nonbonded interactions must not exceed half of the smallest linear dimension of the box. In case of a hybrid sampler, the values for LREL_MC and LREL_MD should be matched to achieve a consistent Hamiltonian. Compatible value pairs (LREL_MC/LREL_MD) are 1/5, 3/4, and 4/1.
Much like LREL_MC, this keyword controls how CAMPARI handles long-range electrostatic interactions in gradient-based calculations. There are currently several options for this, which are generally different from those available for Monte Carlo runs since two core assumptions hold for dynamics calculations: i) only global energy/force evaluations are needed; and ii) the system remains self-similar through several integration steps. The options are as follows:
- No additional interactions are computed, i.e., everything beyond the mid-range cutoff is discarded. This setting can be used along with LREL_MC set to 4 and ELCUTOFF being equal to NBCUTOFF to create an exact match between dynamics and MC Hamiltonians which may be relevant for hybrid calculations (→ DYNAMICS).
- Ewald summation is used, which relies on periodic boundary conditions, and (currently) cubic boxes (→ BOUNDARY and SHAPE). This technique relies on the decomposition of an infinite sum over all periodic images into two quickly convergent contributions, a real-space and a reciprocal space part. The real-space part involves a modified Coulomb interaction, which therefore requires separate loops. Hence, support for Ewald sums is currently limited to "gas-phase"-type calculations with nonbonded interactions corresponding to Lennard-Jones and polar interactions only. Even though possible in theory, there is currently no support for the ghosting of interactions, which is used in the context of free energy calculations. The reciprocal space part can be solved in a number of different ways (see EWALD and associated keywords). Note that the two cutoffs are collapsed into the shorter one (there is no mid-range regime) when using Ewald techniques. Both the real-space and the reciprocal sums are recomputed at every step. Ewald summation replaces the standard Coulomb term and is relevant for all polar interactions even in the absence of full charges. It always requires the error function and a tabulated approximation exists in case the built-in variant is too slow (see installation instructions → DISABLE_ERFTAB and keyword EWERFTOL).
- The (generalized) reaction-field correction is used. The mode is picked with keyword RFMODE. This involves a modified Coulomb sum and relies on the assumption that truncation can be dealt with by assuming that a low dielectric cutoff sphere is embedded in a high dielectric medium, which gives rise to a reaction-field correction, which lets the force on a charge vanish at the cutoff distance if the difference in dielectric constants is large. The high dielectric is set with keyword IMPDIEL, and the size of the cutoff sphere is given by ELCUTOFF. This method requires modified Coulomb interactions and support for the type of nonbonded interactions is limited similar to Ewald sums except that the ghosting of interactions is supported for net neutral solutes. Note that reaction-field corrections assume dielectric homogeneity, i.e., the underlying theory breaks down if the effective dielectric inside or outside the cutoff sphere might become inhomogeneous. The latter is always the case, if, for example, a large enough macromolecule is present or if the system is non-periodic. Note that algorithmically this is not a long-range correction and that (G)RF-corrected terms are computed with the same frequency as short- and mid-range terms are (see NBCUTOFF and ELCUTOFF). Due to stability issues, twin-range cutoffs are not allowed for reaction-field methods. Even then, the force discontinuity at the cutoff distance (vanishes only if the dielectric is assumed to be infinite) may cause more noise than a simple truncation scheme (option 1). The reaction-field solution replaces the standard Coulomb term, i.e., it is relevant for all polar interactions even in the absence of full charges.
- The same option as 3) in LREL_MC. The same rules and caveats apply. By matching the methods this way and setting the two cutoff criteria equal to one another, this allows a consistent choice of Hamiltonian in hybrid runs (→ DYNAMICS). This option is currently the default choice.
- The same option as 1) in LREL_MC. The same rules and caveats apply. By matching the methods this way and setting the two cutoff criteria equal to one another, this allows a consistent choice of Hamiltonian in hybrid runs (→ DYNAMICS).
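To make the statement about the force discontinuity in the reaction-field option concrete, the following sketch evaluates the textbook (standard, not generalized) reaction-field pair term in reduced units with an inner dielectric of 1. It is not claimed to be CAMPARI's exact expression. Here, `eps_rf` plays the role of the outer dielectric (conceptually IMPDIEL) and `r_cut` that of the cutoff sphere radius (conceptually ELCUTOFF); the function names are hypothetical.

```python
def rf_constants(r_cut, eps_rf):
    """Textbook reaction-field constants for a cutoff sphere of radius r_cut
    embedded in a continuum with dielectric eps_rf (inner dielectric = 1)."""
    k_rf = (eps_rf - 1.0) / ((2.0 * eps_rf + 1.0) * r_cut**3)
    c_rf = 1.0 / r_cut + k_rf * r_cut**2  # shifts the energy to zero at r_cut
    return k_rf, c_rf

def rf_pair_energy(qi, qj, r, r_cut, eps_rf):
    """Modified Coulomb energy; zero at and beyond the cutoff."""
    if r >= r_cut:
        return 0.0
    k_rf, c_rf = rf_constants(r_cut, eps_rf)
    return qi * qj * (1.0 / r + k_rf * r**2 - c_rf)

def rf_pair_force(qi, qj, r, r_cut, eps_rf):
    """Radial force -dU/dr; its value just inside r_cut is the discontinuity."""
    if r >= r_cut:
        return 0.0
    k_rf, _ = rf_constants(r_cut, eps_rf)
    return qi * qj * (1.0 / r**2 - 2.0 * k_rf * r)

r_in = 10.0 - 1e-6  # just inside a 10 Angstrom cutoff
print(rf_pair_force(1.0, 1.0, r_in, 10.0, 78.0))   # finite jump for water-like dielectric
print(rf_pair_force(1.0, 1.0, r_in, 10.0, 1e12))   # jump vanishes as eps_rf grows
```

In this form, the energy is continuous at the cutoff for any `eps_rf`, whereas the force jump scales as 3/((2·eps_rf+1)·r_cut²) and therefore only disappears in the limit of an infinite outer dielectric.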
If grid-based cutoffs are in use (→ CUTOFFMODE), this keyword allows the user to specify the three integers determining the x,y,z dimensions for the rectangular cutoff grid. The origin and the size of the grid are determined by the box parameters (see BOUNDARY and SHAPE). In a droplet boundary condition, the grid cannot be aligned with the simulation container exactly, and parts of it are wasteful. The extra buffer space is computed automatically, and this may lead to crashes of CAMPARI complaining that a part of the system is "off the grid". This most often occurs with an unstable (exploding) simulation but can also happen if a residue-based boundary condition is used in conjunction with bulky residues, if the restraining force is very small, or if large volume fluctuations occur.
The total number of grid points should not be so large that operations scaling linearly with this number become a contribution of significant computational cost. Setting the size of the grid cells equal to the cutoff is typically not an effective strategy due to the requirement of having large margins. The latter are a result of the residue-based grid association CAMPARI uses which requires accounting for the effective residue radii in determining spatial neighbor relationships via the grid.
If grid-based cutoffs are in use (→ CUTOFFMODE), this keyword allows the user to specify an initial limit for the maximum number of residues associated with a single grid point. Arrays are dynamically re-sized during the simulation but if the initial setup fails already, an error is returned (see also GRIDMAXGPNB). This keyword is required mostly so CAMPARI has a realistic estimate of the required memory at the beginning.
GRIDMAXGPNB
If grid-based cutoffs are in use (→ CUTOFFMODE), static grid-point neighbor lists are set up initially and used to simplify the generation of neighbor lists using the grid. This keyword specifies the maximum number of grid-point neighbors each grid point may possess. If the number is too small, the program will fail during the initial setup. This is again to avoid inadvertent memory emergencies (as for GRIDMAXRSNB).
It can be annoying to find an acceptable value for this keyword as the distance range depends on the system and the grid. For a big system, it may be advisable to use a temporary sequence file with just the largest residue present to speed up the remainder of the initial setup. Once a proper value has been found for GRIDMAXGPNB, the real sequence can be restored and GRIDMAXRSNB can be calculated relatively easily.
If grid-based cutoffs are in use (→ CUTOFFMODE), this simple logical instructs CAMPARI to write out a summary of the initial grid occupation statistics.
CHECKFREQ
This keyword is interpreted differently depending on the type of calculation. In pure gradient-based simulations, CHECKFREQ simply sets the interval for how often to report global ensemble variables to log output. This can be useful to track simulation progress and make sure that no unexpected behavior (instability) occurs. This output overlaps with output file ENSEMBLE.dat. There is no significant cost incurred by this reporting as the relevant numbers have been computed anyway. In trajectory analysis runs, CHECKFREQ is ignored. For monitoring the progress of processing large input data sets, keyword FLUSHTIME can be used instead.
CHECKFREQ takes on a more important role in Monte Carlo calculations or the MC stretches of hybrid sampling runs. Here, it specifies the interval (in elementary steps) at which the total energy is recomputed globally. This number is compared to the incremental energy obtained from the energy updates for individual MC moves (which do not compute the global nonbonded energy). The global value supersedes the incremental one (i.e., it is a reset). The numerical drift error from the incremental calculations is usually very small. Thus, the reference energy can be chosen to be either the same as what propagates the Markov chain (affected by keyword CUTOFFMODE and all associated choices) or it can be chosen as the N² sum assuming a lack of cutoffs. This is controlled by keyword N2LOOP. The choice of reference energy has no implications for the Markov chain but can (and usually does) affect absolute energy values. This may be relevant for certain free energy calculations, for comparisons of simulation results obtained with different cutoff lengths, etc. Whenever absolute energies need to be comparable, it is best that N2LOOP is set to zero. If it is not zero and the cutoff-assisted and N² energies differ, the cutoff-sensitive values reported to output file ENERGY.dat will begin to deviate within each interval of CHECKFREQ steps. In this case, consistent output to ENERGY.dat is achievable only if ENOUT is a multiple of CHECKFREQ. The drifting inconsistency in each interval is precisely what was the original motivation of the output, i.e., to understand the magnitude of cutoff effects and to be able to diagnose the correctness of incremental energy calculations. As an additional function, if cutoffs are turned on and N2LOOP has not been set to zero, a sanity check is performed as well, i.e., given the current structure, are the derived interactions in fact complete given the chosen maximum cutoff distance set by ELCUTOFF? If not, this would most likely mean that the parameters used for deriving the list of relevant interactions (specifically, the maximum residue radii) are inappropriate (this can happen for simulations of unsupported residues).
Because both the N² energy evaluation and the cutoff check can be extremely slow for large systems, low frequencies are highly recommended for these cases especially if N2LOOP is not zero.
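The interplay between incremental energy updates and the periodic global reset can be sketched as below. The function and argument names are hypothetical, and the reference energy is passed in as a callable so that it can stand for either the cutoff-based energy or the N² sum.

```python
def run_energy_bookkeeping(initial_energy, move_deltas, reference_energy, checkfreq):
    """Accumulate incremental MC energy changes and reset the running total to a
    globally recomputed reference every checkfreq steps.
    Returns the final running energy and a list of (step, mismatch) records."""
    running = initial_energy
    mismatches = []
    for step, delta in enumerate(move_deltas, start=1):
        running += delta  # incremental update from one elementary move
        if step % checkfreq == 0:
            reference = reference_energy(step)
            mismatches.append((step, running - reference))
            running = reference  # the global value supersedes the incremental one
    return running, mismatches

# Toy model: each move adds 1.0 with cutoffs; the cutoff-free reference is
# offset by a constant 0.5 (made-up numbers)
final, records = run_energy_bookkeeping(0.0, [1.0] * 4, lambda s: s + 0.5, 2)
print(final, records)
```

In the toy run, the first reset reveals a mismatch of -0.5 between the incremental and reference energies. After the reset, the running total carries the offset, so values written between resets mix both definitions unless the output interval is a multiple of the reset interval.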
This keyword is a simple logical which allows the user control over whether or not to compute the full N²-loop of non-bonded interactions (on by default) as a reference. In pure gradient-based simulations, this number is reported initially only for information purposes but serves no other function. Setting N2LOOP to zero disables this initial calculation, which can be very slow for large systems, in particular as it does not benefit from the OpenMP parallelization. In restarted calculations of this type, N2LOOP never comes into play. In trajectory analysis runs, N2LOOP has no effect even if energies only are calculated (DYNAMICS is 1).
The primary use of N2LOOP is to choose the reference energy in Monte Carlo calculations (see CHECKFREQ). When turned on (default), MC simulations will continuously reset the total energy to the cutoff-free value. When it is turned off (zero), they will reset the total energy to the user-selected cutoff scheme (which can be the same of course → CUTOFFMODE). This happens at regular intervals of CHECKFREQ steps. If N2LOOP is set to zero, it will additionally suppress the sanity check procedure for cutoffs. Note that the Markov chain of MC calculations is unaffected by this keyword (it corresponds to a shift of the arbitrary zero point). In particular in hybrid samplers, N2LOOP should probably be 0 to avoid confusion. It is important to keep in mind that, whatever the context, the N² sum of nonbonded energies may not be a useful reference state, especially in periodic boundary conditions.
This logical keyword applies to all Monte Carlo elementary moves (except particle deletion moves). The normal sequence of events in CAMPARI is:
1. Perturb configuration.
2. Compute short-range terms for moving parts for new conformation.
3. Compute corresponding long-range terms.
4. Restore original conformation.
5. Compute short-range terms for moving parts for original conformation.
6. Compute corresponding long-range terms.
7. Evaluate Metropolis criterion.
8. Process acceptance or rejection.
From the above, it is clear that at step 2 we do not yet have access to a difference in energies (which is only available after step 5). Consequently, this quantity is simply compared to the net value of the short-range energy terms (→ SC_IPP, SC_ATTLJ, SC_WCA, boundary interactions, SC_BONDED_B, SC_BONDED_A, SC_BONDED_I, SC_BONDED_T, SC_EXTRA), and certain bias terms (→ SC_ZSEC, SC_POLY, SC_DSSP, SC_EMICRO, SC_DREST). With the exception of SC_ATTLJ, SC_WCA, SC_BONDED_T, and SC_BONDED_I, these are all strictly penalty terms that can only yield positive contributions to the total energy. Because of the above, the screen is most useful if SC_IPP is used. Inverse power potentials diverge for small distances and can yield arbitrarily large values, which allow meaningful choices for the associated keyword BARRIER. If all aforementioned terms are either zero or negative, the screen will not have any effect. Harmonic potentials (as used in most of the bias terms) can also yield very large values, but the likelihood of this happening during simple MC moves is very small except for SC_DREST, SC_BONDED_B, and SC_BONDED_A (for the latter two terms, this only holds in the presence of soft crosslinks). Therefore, the difficult cases are those for which the penalty terms are generally high, but do not necessarily vary quickly or strongly upon MC moves. It may then become impossible to use a simplification of this type, i.e., if the chosen screen height is too small, the Markov chain will be corrupted, and if it is made larger, the screen no longer has any effect. To buffer against incorrect use of the method, there is an additional criterion that the incremental energy must exceed twice the total system energy (for typical interaction potentials and an equilibrated system, the latter is often a negative number, and this condition becomes trivially fulfilled).
Note that this technique assumes that the Markov chain remains unperturbed even though the actual acceptance criterion is circumvented. Depending on the setting for BARRIER, this will often be rigorously true for a finite-length simulation. Because the same threshold is used for all types of moves, the efficacy of the screen is likely move type-dependent. Finally, simulations using the Wang-Landau acceptance criterion may not be able to use this technique (a warning is printed in any case).
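The early-rejection logic of the screen can be sketched as follows. This is an interpretation rather than the CAMPARI implementation: the function names are hypothetical, `short_range_new` stands for the net short-range energy available after step 2, `barrier` for the value of BARRIER, and the remaining steps of the move are hidden behind a callable that is only invoked if the screen is passed.

```python
import math
import random

def passes_screen(short_range_new, total_energy, barrier):
    """Return False if the move can be rejected right after step 2.
    Both conditions must hold for an early rejection: the short-range energy
    exceeds the barrier AND exceeds twice the current total system energy."""
    exceeds_barrier = short_range_new > barrier
    exceeds_guard = short_range_new > 2.0 * total_energy
    return not (exceeds_barrier and exceeds_guard)

def mc_step(short_range_new, total_energy, barrier, full_delta, beta, rng=random.random):
    """Apply the screen, then fall back to the usual Metropolis test.
    full_delta() stands for steps 3 to 6 and returns the total energy difference."""
    if not passes_screen(short_range_new, total_energy, barrier):
        return False  # rejected without evaluating the remaining terms
    delta = full_delta()
    return delta <= 0.0 or rng() < math.exp(-beta * delta)

# A steric clash (1e6 kcal/mol) is rejected early; a mild move is evaluated fully
print(mc_step(1.0e6, -500.0, 1.0e4, lambda: 0.0, 1.0))
print(mc_step(50.0, -500.0, 1.0e4, lambda: -1.0, 1.0))
```

The second condition matters only for systems with large positive total energies. In that case, a short-range energy above the barrier is not necessarily unusual, and the move is passed on to the full evaluation.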
This keyword is used in two different contexts. First, Monte Carlo moves can take advantage of a cutoff-like screen eliminating proposed conformations after only a partial evaluation of the relevant energy terms (this is enabled with USESCREEN). Then, BARRIER sets the energy threshold (screen height, cutoff value, barrier) in kcal/mol.
Second, the value of BARRIER in kcal/mol is used as the hard-sphere penetration penalty in the hard-sphere excluded-volume implementation (enabled by setting IPPEXP to a sufficiently large value).
Parallel Settings: MPI (Multi-Copy) Parallelism (Replica exchange (REX), PIGS, and MPI Averaging) and OpenMP Parallelism (Task Decomposition):
(back to top)
Preamble (this is not a keyword)
Most biomolecular simulation software packages allow a form of parallelization which one may refer to as domain decomposition. Here, the system is partitioned into a number of subsystems corresponding to the number of processor cores available to the parallel computation. Each core then - more or less - computes only interactions of its own subsystem. The main requirements for an efficient implementation are to keep the communication load as small as possible and the workload even. While for specific classes of systems (dense, truncated interactions, etc.) this method is undoubtedly superior, CAMPARI does not currently implement it. Instead, it offers a general-purpose shared memory parallelization relying on OpenMP. A shared memory parallelization has the advantage of replacing communication calls with conceptually simpler synchronization calls. The shared memory parallelization in CAMPARI is primarily a way to speed up simulation and analysis tasks within the confines of a single machine. Current (2016) compute nodes in supercomputers offer tens of CPU cores, and significant gains can be made for many practically relevant applications. While the OpenMP parallelization is the inner layer of parallelism of CAMPARI, there is also an outer layer that implements sparse communication algorithms such as replica exchange. Like most simulation software, CAMPARI uses the MPI standard for handling the communication requirements of this outer parallelization layer. The resultant hybrid OpenMP/MPI code is particularly well-suited to multi-copy (replica exchange, PIGS) simulations of medium-sized systems.
NRTHREADS
This keyword controls the number of threads that the workload of the calculation is distributed across. The actual value is respected only if the pure OpenMP-enabled version of CAMPARI is used (campari_threads). For the hybrid MPI/OpenMP executable (campari_mpi_threads), the number of threads cannot be set this way in CAMPARI (because it must be known during creation of the MPI universe). Instead, the environment variable OMP_NUM_THREADS should be defined accordingly. The threads parallelization as described next is the same in both cases, so the documentation here is relevant in both cases.
The OpenMP parallelization of CAMPARI is not a domain decomposition in the vein of almost all MPI-parallel decompositions found in molecular dynamics codes. CAMPARI is meant to cover a diverse set of simulations (including those using different samplers or those in implicit solvent). Spatial domain decompositions work well for systems with homogeneous density undergoing global and continuous evolution (which creates workloads that are all large enough and remain comparable across many simulation steps). Conversely, a single Monte Carlo move of a small molecule in a dilute solution of biopolymers and small molecules is not meaningfully addressable by spatial decomposition techniques. Task-based parallelizations are conceptually simpler but offer less scalability (communication/cache/memory issues). Here, the necessary calculations are simply divided at all cost-intensive stages across processes, which usually requires that all parallel processes “see” the entire system. This is suitable for a shared memory (OpenMP) decomposition (although cache and memory issues remain). As in any other parallel implementation, Amdahl's law holds. This is critical if the decomposable workload is very low to begin with as in those cases any nonparallel task will become a bottleneck (performance tapers off). Additionally, synchronization costs increase with increasing number of threads.
Consequently, performance reversal will occur for systems with unfavorable size or properties. This is particularly tricky on modern many-core architectures where memory management (cache), task pinning, etc. are all nontrivial or sometimes impossible to analyze and control. The systems that parallelize best in general are those that have high floating point operations (FLOP) counts and can take advantage of blocky memory layouts.
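The interplay of Amdahl's law and growing synchronization costs described above can be illustrated with a small model. This is only a sketch under stated assumptions: the function name `speedup`, the linear form of the synchronization penalty, and all numerical values are hypothetical and are not taken from CAMPARI.

```python
def speedup(n_threads, parallel_fraction, sync_cost=0.0):
    """Amdahl-type speedup relative to a serial run of normalized time 1.

    sync_cost is an assumed (hypothetical) overhead per additional thread,
    expressed as a fraction of the serial run time.
    """
    serial_part = 1.0 - parallel_fraction
    parallel_part = parallel_fraction / n_threads
    overhead = sync_cost * (n_threads - 1)
    return 1.0 / (serial_part + parallel_part + overhead)


if __name__ == "__main__":
    # Compare the ideal Amdahl curve with one that includes a sync penalty.
    for n in (1, 2, 4, 8, 16, 32):
        ideal = speedup(n, 0.95)
        with_sync = speedup(n, 0.95, sync_cost=0.005)
        print(f"{n:3d} threads: ideal {ideal:5.2f}, with sync cost {with_sync:5.2f}")
```

Without the penalty term, the speedup saturates at 1/(1 - parallel_fraction). With it, the curve passes through a maximum and then decreases, which is the performance reversal referred to in the text.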
The main “modes” or tasks covered by the shared memory parallelization in CAMPARI are as follows:
- Global force and energy evaluations. This is the most cost-intensive step in any gradient-based technique implemented in CAMPARI. In general, forces and energies are grouped into terms even though this leads to partial redundancy. For the nonbonded terms, which in general have costs that are conformation-dependent (cutoffs), and a specific group of bonded and other terms, loads are balanced dynamically. The limits are usually subsets of atoms, residues, molecules, or residue-residue interactions. Some bias terms have an inner parallelization that is independent of these types of limits, specifically the spatial density potential, the polymeric biasing potential, and custom distance/position restraints. This is in contrast to other bias and bonded terms, which are occasionally evaluated asynchronously by individual threads.
- Incremental energy evaluations. Incremental energy calculations account for the bulk of the cost of most Monte Carlo simulations and can differ dramatically in extent from step to step. This is why CAMPARI will compute an explicit distribution of work load onto threads every time. No dynamic balancing occurs. The workload can easily be so low that the scaling limit is reached. To avoid adverse effects for larger numbers of threads, the number of synchronization operations per step is kept small. The fact that no forces need to be computed simplifies both the required synchronization operations and the required data structures considerably. The second point can lessen the negative impact of cache management on performance.
- Determination of neighbor/interaction lists. Irrespective of the sampler, any scalable simulation on a system of appreciable size featuring nonbonded interactions will require truncation or transformation of these interactions (→ CUTOFFMODE), which in turn necessitates algorithms to identify nearby species efficiently. These algorithms are all parallelized with generally good efficiency.
- Managing complex constraints. If holonomic constraints are in use, CAMPARI identifies all groups that can be solved independently. This may offer a trivial parallelization across threads. If one or few groups are expected to dominate the cost, CAMPARI evaluates whether these groups are large enough. If so, each of these groups is solved in parallel by all threads, which requires a number of synchronizations proportional to the number of actual SHAKE iterations (only SHAKE is supported for this). Note that with this and most similar decompositions, it is currently not supported to use just a subset of the requested threads to define an optimal setting. Instead, either 1 or all threads have to work on such a constraint group. This means that this type of parallelization is sometimes inactivated upon increasing the number of threads further as it is no longer deemed efficient, which can be a limitation. If Cartesian or internal coordinate space is used in conjunction with a gradient-based sampler, atom-based forces must be redistributed to these degrees of freedom and their effective masses need to be computed. Both of these happen in recursive loops for complex molecules. CAMPARI will again detect whether the sizes of molecules are large enough to warrant "internal" parallelization. If so, each of the eligible molecules is solved in parallel by all threads, which requires synchronization operations proportional to the longest continuous branch in the molecule in question. Here, some threads become inactive (but blocked) based on the choice for keyword NRTHREADSONMOL. This is because the data dependency is generally so high that performance degradation occurs already for comparatively few threads. This can be diagnosed with the help of THREADS_TEST.
- Coordinate operations for large molecules. When sampling in Cartesian or internal coordinate space with methods that propagate many or all degrees of freedom simultaneously, it will be necessary to globally reconstruct the Cartesian coordinates of the system based on the altered values. This conversion (Z matrix to Cartesian coordinates) scales linearly with the number of atoms but requires a large number of trigonometric functions and has very high data dependency because it is strictly hierarchical. If required at every step, it can thus become a significant cost factor when the remainder of the rate-limiting computations are parallelized efficiently. Thus, for each molecule deemed large enough, CAMPARI analyzes automatically the hierarchy and creates a parallel procedure for solving this problem with a fixed number of threads that can be less than the total available number. This works through a TASK logic in OpenMP in conjunction with keyword NRTHREADSONMOL. The very high data dependency will usually require that this helper keyword is set to a relatively small number (2-8). This is particularly relevant if the total number of available threads can be much larger. Large numbers of threads assigned to this task will normally lead to a performance degradation, which can be diagnosed with the help of THREADS_TEST. Generally speaking, any advanced OpenMP construct will make performance harder to control (compare the comments below for FFTW). Conversely, other coordinate operations such as the simple propagation in Cartesian dynamics, or simple translations by a single vector are straightforwardly parallelized at the outer loop level.
- Threaded computation of fast Fourier transforms with FFTW. Unlike all other libraries linked by CAMPARI (e.g., NetCDF or HSL), the code looks for and uses explicitly the threaded implementation of the FFTW library. Because of the way the interface works, this involves creating a new team of threads from the parallel CAMPARI execution, i.e., it entails nested parallelism. Because the other threads of the original team are idle during FFTW calls (but not destroyed), performance is harder to control and predict, i.e., it depends on the way the kernel, compiler, and custom runtime environment end up distributing threads onto the available resources. This is why keyword THREADS_TEST also enables a performance test for threaded FFTW execution (if available and in use). This can be used to understand better the influence of environment variables like "OMP_WAIT_POLICY" or "OMP_PROC_BIND" on FFTW performance inside CAMPARI.
- A number of utility operations required at every step of a calculation. Especially in gradient-based calculations, there are a number of simple tasks to complete that depend on atomic coordinates of the entire system. These include the actual coordinate propagation (whether Cartesian or internal), the correction of drift velocities, the calculation of total kinetic energies, or the management of polymeric descriptors and reference frames (center of mass, etc.). The overall cost of all of these is linear with the total number of atoms. Synchronization requirements occur if a molecular property like the center of mass needs to be computed by multiple threads (the necessity to do so is detected automatically) but they are generally low. In Monte Carlo calculations, quaternion-based parallelizations of pivot-type moves occur in many move types along with repeated copy operations on coordinate arrays. Most of these operations for any sampler have a low cost to begin with and are thus limited in scalability unless the systems get larger. They differ from the holonomic constraint handler or the coordinate operations in that they should simply plateau in parallel performance with increasing numbers of threads (rather than become slower again). This is because of the low (but nonzero) synchronization requirements.
- Simple analysis tasks. At every step of any run (exceptions caused by the use of FRAMESFILE aside), CAMPARI invokes a high-level routine that goes over all possible analysis tasks, evaluates whether they need to be performed at that step (depending on the output and calculation frequencies listed elsewhere), and executes the identified ones. The majority of tasks are not worthwhile to be handled by multiple threads at once, so CAMPARI uses a task parallelization using a large "SECTIONS" construct. This means that it is beneficial for the parallel efficiency of these analyses if the calculation/output frequencies are matched with each other. Conversely, it is inefficient to enter the function with multiple threads with only a single simple task to be performed.
- Specific analysis tasks. Some analysis tasks, which are inherently expensive, have been parallelized specifically to be handled by all threads at once. This includes the calculation of spatial densities, overlap metrics in hybrid MPI/OpenMP multi-copy simulations (as they rely on global energy evaluations), and a number of tasks that are part of the structural clustering utility invoked during post-processing. Specifically, the tree-based clustering, the approximate progress index, and iterative algorithms in graph processing (see CADDLINKMODE, CREWEIGHT, and CMSMCFEP) have been OpenMP-parallelized. Other tasks could, due to their unfavorable scaling with system size, benefit from this type of parallelization, and they may be implemented in the future, e.g., parallel contact, DSSP, or scattering analyses.
A few more comments are needed. First, the OpenMP parallelization does not extend to all parts of CAMPARI. In particular, the initial setup and final clean-up are completely serial with no thread awareness whatsoever (the only exception to this rule is the force-based relaxation procedure). This comment extends to the (highly repetitive) setup work required over and over again in a small molecule screen. Some of the procedures performed during setup, for example, initial structure randomization, can be time-consuming, and it is a limitation that they cannot be accelerated. Second, it is always recommended to get a quick estimate of parallel efficiency by timing the code (keyword FLUSHTIME can be used to force frequent production rate estimates). Note that some calculations have inherently heterogeneous production rates, e.g., hybrid dynamics/Monte Carlo calculations. Third, task decomposition by threads generally changes the order in which individual summands are combined to yield net properties like the total force acting on an atom. Because floating point math is not associative, this leads to a generally lower level of exact reproducibility compared to multiple executions of 100% serial code. For large sums, a multi-threaded calculation will generally be more precise. The loss in reproducibility is generally smaller than the loss of reproducibility across architectures and, especially, compilers (which often has fundamentally the same reasons).
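The third comment, that floating point addition is not associative, can be demonstrated with a deliberately extreme toy case. The helper names `ordered_sum` and `chunked_sum` and the input values are hypothetical; the chunked sum merely mimics how partial sums from several threads might be combined and is not CAMPARI's reduction scheme.

```python
import math

# Contrived values: large terms of opposite sign mask the small ones.
values = [1.0e16, 1.0, -1.0e16, 1.0]


def ordered_sum(data):
    """Plain left-to-right accumulation in double precision."""
    total = 0.0
    for x in data:
        total += x
    return total


def chunked_sum(data, n_chunks):
    """Sum contiguous chunks separately (as threads would), then combine."""
    size = -(-len(data) // n_chunks)  # ceiling division
    partials = [ordered_sum(data[i:i + size]) for i in range(0, len(data), size)]
    return ordered_sum(partials)


if __name__ == "__main__":
    print("one chunk :", chunked_sum(values, 1))
    print("two chunks:", chunked_sum(values, 2))
    print("reference :", math.fsum(values))
```

The three printed numbers differ only because the order of additions differs. The example says nothing about which ordering is more accurate in realistic sums; it only shows why bitwise reproducibility is lost when the summation order changes.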
NRTHREADSONMOL
If the shared memory (OpenMP) parallelization of CAMPARI is in use, this keyword provides a number of threads to deploy for certain operations suffering from poor parallel performance due to high data dependency. At the moment, these are coordinate generation and propagation of forces onto internal degrees of freedom for large molecules (relative to system size), as outlined above. The use of this keyword does not imply a generation of nested parallelism, it does not alter actual team sizes, and it does not free the idle threads: it simply caps the maximum effective team size. Setting this keyword to be the same as or larger than NRTHREADS means that no adjustments are performed. Conversely, choosing a smaller value will lead to some threads idling during the aforementioned operations. Clearly, unless the algorithm were to be very poorly written, it will generally be harmless in terms of actual calculations performed to use too many threads for poorly parallelizable tasks. Due to the actual team size staying the same, there is also no real gain in thread synchronization costs. The reason for the existence of this keyword is another factor: memory/cache access. Many threads in a large team performing a very small workload on shared data can lead to an actual performance loss compared to a smaller subset of threads performing a comparatively larger workload while other threads are idle. Recommended values are 2-8, but this should be tested explicitly with the help of keyword THREADS_TEST.
THREADS_DLB_FREQ
If the shared memory (OpenMP) parallelization of CAMPARI is in use, this keyword determines after how many steps dynamic load balancing is periodically enabled. Dynamic load balancing conservatively shifts bounds on internal entities of representation (such as atoms, molecules, degrees of freedom) between adjacent threads to improve the balance of times spent per thread. It is important to realize that embedded synchronization requirements destroy the ability to meaningfully measure load balance, which means that dynamic balancing is only performed for synchronization-free subtasks (of which there are several, e.g., evaluation of nonbonded forces or neighbor list generation). It does so for at most THREADS_DLB_STOP elementary simulation steps after the start of each interval.
In detail, this means that every THREADS_DLB_FREQ steps a new data collection and balancing interval is started and continued for up to THREADS_DLB_STOP elementary steps. The information used for balancing can be pre-averaged across multiple steps using keyword THREADS_DLB_EXT. For each block, if satisfactory balance is achieved, the balancing will stop until the next interval is encountered. For small systems in particular, some subtasks will not be able to achieve a satisfactory balance (insufficient granularity). If in these cases THREADS_DLB_STOP is equal to or larger than THREADS_DLB_FREQ, continuous load balancing is obtained. This is not recommended because the measurements themselves are not completely cost-free, and because a continuous adjustment of bounds is likely to yield inferior cache performance.
There are two notes of caution. First, the idea of dynamic load balancing implies that the load does not change dramatically from step to step. This may be violated in trajectory analysis runs where OpenMP is used to compute energies. Splitting the trajectories and running an MPI-parallel analysis is likely to be a superior strategy here. Second, dynamic load balancing expects threads to have the same "computing power" available to them at every step. This is not necessarily the case in oversubscribed systems (more threads than CPU cores) where system-induced waiting times occur. It can also happen in undersubscribed (fewer threads than CPU cores) cases on multi-CPU systems where threads are not pinned to specific cores. This is because the available cache differs depending on how many threads reside on a CPU (socket). These issues are (at least theoretically) controllable at the level of the operating system, for example using environment variables such as OMP_PROC_BIND. In practice, it is very hard to predict performance accurately, and some amount of trial-and-error (benchmarking) is usually needed. In particular, native hyper-threading for Intel chips should be tested as, possibly contrary to expectation, it has proven beneficial in many applications (most likely due to better cache use).
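The interval logic and the conservative shifting of bounds between adjacent threads can be sketched as follows. Everything here is a hypothetical illustration: the function names, the 1-based step convention, the tolerance, and the choice to move a boundary by exactly one element per adjustment are assumptions and do not reproduce CAMPARI's internal algorithm.

```python
def balancing_active(step, dlb_freq, dlb_stop):
    """True if `step` (1-based, assumed) lies within a balancing interval.

    A new interval starts every dlb_freq steps and lasts at most dlb_stop
    steps; early termination upon satisfactory balance is not modeled.
    """
    return (step - 1) % dlb_freq < dlb_stop


def shift_bounds(bounds, times, tolerance=0.05):
    """Move interior boundaries by one element toward the slower neighbor.

    bounds[i] is the first element handled by thread i and bounds[-1] is the
    end of the range; times[i] is the measured time of thread i.
    """
    new = list(bounds)
    for i in range(1, len(bounds) - 1):
        left, right = times[i - 1], times[i]
        if abs(left - right) <= tolerance * max(left, right):
            continue  # this pair of neighbors is balanced well enough
        if left > right and new[i] - new[i - 1] > 1:
            new[i] -= 1  # left thread is slower: hand one element to the right
        elif right > left and new[i + 1] - new[i] > 1:
            new[i] += 1  # right thread is slower: hand one element to the left
    return new


if __name__ == "__main__":
    bounds = [0, 25, 50, 75, 100]
    times = [4.0, 1.0, 1.0, 1.0]
    print(shift_bounds(bounds, times))
    print([s for s in range(1, 25) if balancing_active(s, 10, 3)])
```

In this sketch, a stop value equal to or larger than the frequency makes `balancing_active` return true at every step, which corresponds to the continuous balancing that the text advises against.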
THREADS_DLB_STOP
If the shared memory (OpenMP) parallelization of CAMPARI is in use, this keyword determines the maximum length of each periodic data collection interval for dynamic load balancing. The interval frequency is set by keyword THREADS_DLB_FREQ. Data collection and bounds adjustment are stopped as soon as the load imbalance is satisfactorily small or as soon as the number of elementary steps passed since the beginning of each interval is equivalent to THREADS_DLB_STOP. Note that generally it is recommended that the chosen value be small in relation to that chosen for THREADS_DLB_FREQ as repeated measurement cycles and bounds adjustments can themselves adversely affect performance.
THREADS_DLB_EXT
If the shared memory (OpenMP) parallelization of CAMPARI is in use, this keyword determines the number of elementary steps over which the execution times per thread are averaged in dynamic load balancing. Choosing a value different from 1, which is the default, can violate the requirement that the load has to be balanced in every step for performance to be optimal. However, for small systems and certain execution blocks, the measured times can be so small (and noisy) that averaging may be required to balance them effectively. This is the role of this keyword. Note that choosing larger values also limits the rate of convergence for load balance.
THREADS_VERBOSE
If the shared memory (OpenMP) parallelization of CAMPARI is in use, this keyword controls the level of diagnostic output written to a dedicated output file, viz., THREADS.log. Options are as follows:
- No output is provided.
- Only timing information (performance and expected time to finish) is written at intervals controlled by keyword FLUSHTIME. This is the default and recommended option for normal usage.
- In addition to all output produced by the previous options, CAMPARI periodically reports updated bounds resulting from dynamic load balancing in case a reasonable balance is achieved. This is available for different categories. No output is written if the balancing approach fails to find a satisfactory solution after the requested number of steps for the current interval. The bounds specify the chunk processed by each thread and can refer to different representation constructs (atoms, residues, dihedral angles, and so on).
- In addition to all output produced by the previous options, CAMPARI initially reports fixed bounds for operations with predictable cost. These bounds are used in various places, and the information will not be of much use for regular users.
- In addition to all output produced by the previous options, CAMPARI frequently reports load imbalance measures for any tasks that undergo dynamic load balancing. It is not recommended to use this option outside of specific debugging or optimization tasks as the amount of data written gets very large very quickly. Note also that the significant file I/O can interfere with external performance measurements. The output can highlight aspects of the calculation that fail to become balanced.
THREADS_TEST
This keyword is primarily for developer use. It instructs CAMPARI not to perform the actual simulation or analysis but instead to test a subset of relevant threaded execution routines against their serial counterparts. These tests use the actual system specified by the key-file. The output of the tests is mostly self-explanatory but understanding all of the reported deviations may require some insight into algorithm structure. If the particle-mesh Ewald method is in use, the correctness tests are followed by a scaling test for the linked FFTW library in threaded execution mode. This is a point of concern because the library does not allow an existing parallel region to access it. More details are given elsewhere.
REMC
This logical keyword - when set to 1 - instructs CAMPARI to perform a calculation employing and evolving a number of copies of the system. Unlike for the mutually exclusive keyword MPIAVG, here it is allowed that each replica is evolving under a different condition (e.g., temperature). This covers the replica exchange (RE) method for standard simulations, the dynamic splitting of an input library for a small molecule screen, and parallel analysis runs yielding as many results as there are trajectories. Like all multi-copy (or multi-replica) methods in CAMPARI, the communication between copies is handled by MPI, and it is therefore necessary to use an MPI-enabled executable. The shared memory (OpenMP) parallelization of CAMPARI can be used simultaneously as this inner parallelization layer does not deal with the exchange of information between copies. In hybrid MPI/OpenMP mode, the number of threads is no longer settable by NRTHREADS but has to use an environment variable (OMP_NUM_THREADS) at the system level instead.
For a simulation task, REMC activates the replica exchange method employing REPLICAS separate conditions (processes). The conditions differ in one or more parameters (→ REDIM), and there is a dedicated input file FMCSC_REFILE to specify them. Note that the order of conditions may matter (→ RENBMODE). Irrespective of whether the underlying propagator is pure Monte Carlo (see DYNAMICS), a dynamics-based method, or any hybrid method, restrictions apply in that the sampled ensemble must be the canonical (NVT) one (see ENSEMBLE). This can either be achieved by running constant particle number MC, Newtonian dynamics with a proper thermostat (see TSTAT), or stochastic (Langevin) dynamics (which inherently tempers the ensemble). In the RE method, structures (or conditions) are exchanged periodically between replicas using a well-defined acceptance criterion. This is controlled by keyword REFREQ and includes the case of disabling these exchanges altogether.
The exchanges are generally meant to improve sampling by allowing excursions into conditions or Hamiltonians in which (enthalpic) barriers are reduced. The evaluation of the acceptance probability implies that energies of current structures must be computed for different conditions. Independently of any exchanges, this functionality is useful in free energy calculations (perturbations) as the exponential average of the work required to change condition (energy difference) is directly related to the free energy difference between those conditions (→ REOLCALC). Parameters of the method are the exchange frequency (REFREQ), the scope for possible exchange partners (RENBMODE), the number of exchange attempts in a single exchange cycle (RESWAPS), and for dynamics propagators the way of dealing with velocities upon a successful exchange (RE_VELMODE).
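The relation between the exponential average of energy differences and the free energy difference mentioned above is the standard free energy perturbation (Zwanzig) identity, ΔF = -kT ln⟨exp(-ΔU/kT)⟩. The following is a minimal sketch of how such "foreign" energy differences could be post-processed externally. The function name, the argument layout, and the sample numbers are hypothetical and do not correspond to any CAMPARI output format.

```python
import math


def free_energy_difference(energy_differences, kT):
    """Exponential averaging: dF = -kT * ln < exp(-dU / kT) >.

    energy_differences holds dU = U_target(x) - U_reference(x) for structures
    x sampled at the reference condition; kT uses the same energy unit.
    """
    scaled = [-du / kT for du in energy_differences]
    shift = max(scaled)  # subtract the maximum to avoid overflow in exp()
    mean_exp = sum(math.exp(s - shift) for s in scaled) / len(scaled)
    return -kT * (shift + math.log(mean_exp))


if __name__ == "__main__":
    samples = [0.8, 1.1, 0.3, 1.6, 0.9]  # made-up values in kcal/mol
    print(free_energy_difference(samples, kT=0.596))
```

Because the average is dominated by the smallest energy differences, the estimate is at most the arithmetic mean of the samples and converges slowly when the overlap between conditions is poor.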
In CAMPARI, each replica and its output will correspond to both instantaneous and averaged information from the associated condition, i.e., the underlying trajectory is no longer continuous in conformational terms. The typical assumption is that, depending on the settings for the parameters of the method and given a suitable arrangement of replicas in the RE input file, it can be achieved that the resultant ensemble averages and distributions are, for finite samples, indistinguishable within error from a correct reference simulation for the same condition that does not utilize exchange moves. This issue is not trivial, however, and the more general and precise approach to the analysis of replica exchange data is to reweight all samples to a given target condition that should either have been part of the original replica space or that can be obtained by interpolation. This reweighting is technically possible in CAMPARI (→ FRAMESFILE) for almost all analysis features in trajectory analysis mode, but the weights have to be determined externally (e.g., by the weighted histogram analysis method, WHAM).
If a RE run contains Monte Carlo moves, and is combined with the Wang-Landau acceptance criterion, there is a necessary note of caution. Specifically, if WL_MODE is set to 1, this may result in identical copies of Wang-Landau runs if the exchanged parameters do not alter the Hamiltonian (since environmental conditions are irrelevant to the Wang-Landau sampler in such a case). In any case, the Wang-Landau iterations will proceed independently for each replica. This implies that it may yield results that are difficult to interpret if replica-exchange swap moves are allowed (because those - currently - always follow a Boltzmann criterion).
In a small molecule screen, the difference in replicas is no longer a difference in Hamiltonian or temperature but a difference in the molecule being sampled. In this mode, exchanges are not allowed, the replica exchange input file is not required, and the above comments regarding ensembles, reweighting, etc. do not apply. Importantly, unless keyword MOL2ISSPLIT is turned on, the master process of this MPI setup does not perform any actual calculations; instead, it simply reads the mol2 input file and distributes the molecules in it to the remaining MPI processes. The only exception to this is the (nonsensical) case of there being only a single replica. There is some communication overhead to this procedure, but this is not expected to play a significant role for reasonable ratios of the number of molecules to the number of replicas. It is rare for the cost of the calculation for a single molecule to be so small that the read-in and the distribution of the information from the mol2 file become a significant cost factor. In those cases where it may be significant, CAMPARI also supports the alternative approach of having all MPI processes perform calculations on separate input files (by turning keyword MOL2ISSPLIT on). The downside of this is the loss of automatic load balancing.
If an analysis run is performed, the meaning of keyword REMC changes compared to the case of simulation tasks described above. For analysis, REMC instructs CAMPARI to perform trajectory analysis in parallel while keeping the data from all replicas separate (for parallel analysis runs combining data, see keyword MPIAVG). Example applications are to speed up expensive analyses of large data sets, to compute free energy differences between ensembles, or to obtain data suitable for error estimates via block averaging. Keywords REFILE, REPLICAS, and REDIM are required. All other RE simulation-related keywords are ignored. For the RE setup, analysis keywords REOLCALC, REOLINST, and REOLALL are respected. As alluded to, this can be useful in post-processing simulation data for free energy growth or related calculations requiring "foreign" energies. There is another complication with RE data and that is the question how to evaluate a possible sampling benefit. Users should always keep in mind that a RE trajectory with swaps inherently averages over data from several coupled trajectories. A simple consequence of this is that data tend to look smoother and better converged if the number of replicas is increased. An assessment of the actual purpose of the method, i.e., increased barrier crossing rates by excursions into conditions amenable to barrier crossing, is more feasibly obtained by unscrambling trajectories, i.e., by looking at trajectories continuous in conformation (and not in condition). This is why CAMPARI allows the user to supply an input file with the swap history of a set of trajectories with the goal of transcribing the set of trajectories to a new set that are all continuous in conformation. The input file needs to be similar in format to the analogous output file created by CAMPARI during RE simulations. If this option is enabled, auxiliary keywords RE_TRAJSKIP and RE_TRAJOUT may become relevant.
Technically, parallel trajectory analysis requires that the REPLICAS individual trajectories are systematically named and numbered in a fashion similar to how CAMPARI writes trajectories in RE simulations. This means that every file is prefixed with "N_XXX_", where XXX gives the replica number (started from "000"). Since there is only a single key-file, the input trajectory name specified should not include this prefix (it will be added automatically). An example is given elsewhere. Frame-specific analyses (and thereby frame weights) are not yet supported in parallel trajectory analysis runs.
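The naming convention can be illustrated with a short helper that lists the files expected for a given base name. The function name and the example file name "traj.xtc" are hypothetical; only the "N_XXX_" prefix with zero-based, three-digit numbering is taken from the text above.

```python
def replica_file_names(base_name, n_replicas):
    """Prefix base_name with "N_XXX_" for replicas numbered from 000."""
    return [f"N_{i:03d}_{base_name}" for i in range(n_replicas)]


if __name__ == "__main__":
    for name in replica_file_names("traj.xtc", 4):
        print(name)
```

In the key-file, only the base name (here "traj.xtc") would be specified, since the prefix is added automatically.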
REFREQ
For any multi-replica simulation that supports structure transfer between replicas, this keyword sets a fixed interval for attempting these structure transfers. It is an important parameter of both the replica exchange method and the PIGS method. Unlike frequencies supplied to define Monte Carlo move sets described above, this parameter is a deterministic interval, i.e., a setting of 104 will imply that possible exchanges are attempted exactly every 104 elementary steps. This is because, in general, the communication requirement will mandate that all replicas remain synchronized regardless. For replica exchange, a swap cycle counts as a single (Monte Carlo) step in the trajectory. For PIGS, the reseeding does not count as a step. Instead, it is performed exactly between the steps corresponding to multiples of REFREQ and the respective next ones (see elsewhere for details).
All structure exchange is implemented in peer-to-peer mode. For generality reasons, the head node is always involved in decision making for structure exchange. This imposes an unfavorable (centralized) communication structure for some data (e.g., reassignment maps).
For replica exchange runs, structure transfer is realized as swaps between conditions. Viewed as a Monte Carlo move, such a swap attempt is defined in the context of a multicanonical ensemble. This means that any analysis should consider the entire set of simulation data and employ appropriate reweighting protocols to obtain canonical averages corresponding to the individual or even interpolated conditions. It is not immediately clear how justified it is to assume that the individual replicas in a replica exchange run can be analyzed as if they satisfied the canonical distribution for each condition individually. For a large fraction of published replica exchange simulations, swap attempts are restricted to the immediate neighbors along a one-dimensional temperature coordinate, and the data coming from replicas are treated independently. Keyword RENBMODE allows the user to choose between neighbor-only and global swap protocols. We emphasize again that CAMPARI does support the computation of reweighted averages and distributions by adding floating-point weights to a frames file.
It is difficult to provide guidelines for useful settings for this keyword. In replica exchange, very small values for this exchange attempt interval can lead to relaxation problems. With dynamics samplers, the treatment of velocities becomes an important consideration (see RE_VELMODE). There is a considerable body of literature on this subject (some of it is cited in this reference).
RESWAPS
If the replica exchange method is in use, this keyword specifies the number of swaps within a swap cycle. Each time a step is encountered that is a multiple of REFREQ, CAMPARI will collect the data from all replicas, construct the required energy matrix, and randomly pick pairs of eligible replicas (see RENBMODE) for which the swap move Boltzmann acceptance criterion is evaluated. This process is repeated RESWAPS times, and the map matrix (structure to condition) is updated after every successful swap. This means that it is possible for no pairs of replicas to effectively swap structures despite the presence of accepted moves (for example, when the same pair is swapped twice within a cycle). This stochastic implementation differs from that seen in other software and requires a careful choice for this keyword. For exchanges between all replicas (see RENBMODE), this should probably be at least Nrep·(Nrep-1)/2, where Nrep is the number of replicas in the simulation. For neighbor swaps only, it should be Nrep-1. The reason for choosing a number proportional to or larger than the number of unique possible exchanges is that the computational cost of computing necessary cross-energies (in Hamiltonian replica exchange) and of communicating the information required for the aforementioned matrix is, in our implementation, independent of the final number of accepted swaps. This means that the cost of a swap cycle would be largely wasted by exchanging just a single pair chosen from a much larger number of replicas. For neighbor swaps, the set of possible swaps is limited because the required energy matrix is only a tridiagonal band matrix. This means that "secondary" swaps may be rejected due to lack of information rather than the Boltzmann criterion, which can introduce biases.
Note that the acceptance rates become very small once there is hardly any overlap between different replicas (in turn, the acceptance is always strictly unity if the conditions are the same, regardless of the two structures).
A large number of attempted swaps in conjunction with all-against-all exchange corresponds to an equilibration of current structures across conditions. In the limit of tiny acceptance rates, the impact of the replica exchange method is no longer felt, and it reduces to a set of independent canonical simulations at different conditions (the same limit is achieved explicitly by setting REFREQ to be very large). Because of this, a reasonable swap acceptance rate is often taken as the primary diagnostic tool for the choice of conditions (see output file for swap probabilities).
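The swap cycle described for this keyword can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not CAMPARI's implementation: the function names, the layout of the reduced-energy matrix `u[c][s]` (energy of structure `s` evaluated under condition `c`, already multiplied by that condition's inverse temperature), and the standard Metropolis form of the acceptance test are all assumptions made for the example. The sketch also assumes that the full matrix is available, so it does not reproduce the band-matrix limitation of neighbor-only swaps mentioned above.

```python
import math
import random


def suggested_reswaps(nrep, neighbors_only):
    """Number of unique replica pairs, i.e., the guideline quoted in the text."""
    return nrep - 1 if neighbors_only else nrep * (nrep - 1) // 2


def swap_cycle(u, cond_to_struct, reswaps, neighbors_only, rng):
    """Attempt `reswaps` random pair swaps and return the number accepted.

    u[c][s]           reduced energy of structure s under condition c (assumed layout)
    cond_to_struct[c] structure currently held by condition c (updated in place)
    """
    nrep = len(cond_to_struct)
    accepted = 0
    for _ in range(reswaps):
        if neighbors_only:
            # Neighbors are adjacent entries in the list of conditions.
            ci = rng.randrange(nrep - 1)
            cj = ci + 1
        else:
            ci, cj = rng.sample(range(nrep), 2)
        si, sj = cond_to_struct[ci], cond_to_struct[cj]
        # Change in total reduced energy if the two structures trade conditions.
        delta = u[ci][sj] + u[cj][si] - u[ci][si] - u[cj][sj]
        if delta <= 0.0 or rng.random() < math.exp(-delta):
            # The map is updated immediately, so later attempts see the new assignment.
            cond_to_struct[ci], cond_to_struct[cj] = sj, si
            accepted += 1
    return accepted
```

Because the map is updated after every accepted attempt, the same pair can be swapped back later in the cycle, which is how accepted moves can leave the final assignment unchanged. With identical conditions, `delta` is zero for every pair and all attempts are accepted, in line with the note above.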
REFILE
This keyword defines location and name of the file containing the specifications for the replica exchange method (see elsewhere for details).
RENBMODE
As alluded to above, the replica exchange method represents a rigorous sampling technique if one considers the multicanonical ensemble it defines. This can cause problems in the interpretation of data obtained for an individual condition. Moreover, the energetic overlap between distant conditions is often small, leading to negligible swap likelihood for all but the replicas most similar in condition. This is the typical scenario for temperature replica exchange calculations in explicit solvent. Here, it is very common to restrict swap attempts to the (at most) two neighboring replicas for a series of conditions. In Hamiltonian replica exchange, the same idea might actually be more useful as it also restricts the computation of the energy matrix to neighboring conditions. Recomputing energy values for many different conditions can be costly. Therefore, the available options are:
- Swaps are attempted with all available replicas.
- Only the (at most two) neighboring replicas are eligible for swap moves, and neighbor relationships are determined by the sequence of conditions as they appear in the input file (this is the default).
Note that almost all exchange-related problems naturally disappear in the limit of few attempted swaps (→ REFREQ) or in the limit of poor overlap and consequently few accepted swaps. This limit is very easily reached for large, condensed-phase systems with typical interaction potentials (fluctuations decrease with increasing size).
REPLICAS
This keyword sets the number of subprocesses intended to be created by a multi-copy simulation. For replica exchange calculations, this has to rigorously correspond to the number of processes granted by the system. A large enough number of different conditions in the corresponding input file (→ FMCSC_REFILE) has to be present. For MPI averaging calculations, which include PIGS runs and parallel Wang-Landau runs, this keyword will be overridden by the actual processor number granted by the system. Note that if the shared memory (OpenMP) parallelization of CAMPARI is also used (hybrid OpenMP/MPI), the management of access to hardware resources (compute cores) by both MPI processes and their spawned OpenMP threads is not trivial on modern many-core CPUs. For example, a machine with two sockets with a 12-core CPU each can host a calculation with 4 MPI processes using 6 OpenMP threads each in many different ways, including nonsensical mappings like running everything on a single core. While this mapping can often be controlled by the user at both levels, it should really be managed automatically by the operating environment whenever possible.
REDIM
If the replica exchange method is in use (→ REMC), this keyword sets the number of dimensions specifying the conditions to be expected in the dedicated input file (→ FMCSC_REFILE). Note that replica exchange calculations may rely on neighbor relations (see RENBMODE), and that those may be difficult to define if multiple dimensions are used to specify each condition.
REMC_DOXYZ
For any multi-replica simulation that uses pure MC sampling and supports structure transfer between replicas, this simple logical keyword lets the user choose to use Cartesian rather than torsional/rigid-body coordinates to be transferred. The keyword is ignored if the propagator is fully or partially reliant on a dynamics method. This keyword can be useful if internal degrees of freedom not sampled by MC diverge in any node-specific input files (for example, through rare scenarios when trying to restart an MC run from (modified) restart files produced by MD).
RE_VELMODE
This keyword selects how to deal with velocities for any multi-replica calculation allowing structural transfer between replicas. As such, it is relevant for replica-exchange molecular dynamics runs and for PIGS runs using a molecular dynamics propagator (see DYNAMICS). One of the complications of these types of calculations arises from the need to pass on or re-assign velocities upon any successful structure change. The options for handling this difficulty are as follows:
- All velocities are always randomly re-assigned upon receiving a new structure. This is equivalent to an instantaneous, global action of an Andersen-type thermostat (see TSTAT). It might be the safest option to use for pure Hamiltonian replica-exchange, especially if the Andersen thermostat is used in conjunction with Newtonian dynamics. It is also the required option for PIGS runs with propagators lacking a stochastic component.
- Velocities are re-scaled by a factor equivalent to (T_i/T_j)^(1/2), where T_i is the temperature of the current node and T_j is the temperature of the node the received structure originated from. Note that this does not scale the instantaneous temperature to a specific value, but rather by a specific factor. Unlike the first option, it preserves directions and relative magnitudes of all velocities. This mode relaxes to the third option if temperature is not one of the replica exchange dimensions, or if the run is using the PIGS protocol instead of replica exchange.
- Velocities are taken directly from the node the incoming structure originated from, i.e., always remain associated with "their" structure. This will almost certainly lead to small artifacts for replica exchange calculations with temperature as one of its dimensions. It is the preferred choice for PIGS runs with stochastic propagators.
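The rescaling used by the second option above can be illustrated with a short Python sketch. The function name and the nested-list representation of velocities are hypothetical and chosen only for the example; the scaling factor itself is the one given in the text.

```python
import math


def rescale_velocities(velocities, t_current, t_origin):
    """Scale every velocity component by (T_i/T_j)^(1/2).

    t_current  temperature T_i of the node receiving the structure
    t_origin   temperature T_j of the node the structure came from
    """
    factor = math.sqrt(t_current / t_origin)
    # Directions and relative magnitudes are preserved; only the overall scale changes.
    return [[factor * component for component in atom] for atom in velocities]
```

Since the kinetic energy is quadratic in the velocities, this multiplies the instantaneous kinetic temperature by T_i/T_j rather than setting it to a fixed target value.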
RETRACE
This keyword is only relevant in MPI replica exchange or in MPI PIGS (or PIGS analysis) calculations with swaps or reseedings performed. It requests that an instantaneous integer trace is written that allows the user to recapitulate the complete history of structure transfer between replicas. For replica exchange, the trace indicates which condition is (after each swap or reseeding cycle) associated with which initial starting conformation (see N_000_REXTRACE.dat). For PIGS, the trace indicates every reseeding event in an even simpler form (see N_000_PIGSTRACE.dat). For replica exchange, these data can be used to reconstruct trajectories continuous in geometric variables rather than continuous in exchange condition (the latter being the CAMPARI default). This is useful to be able to estimate the sampling enhancement provided by replica exchange in terms of conformational decorrelation or similar metrics. For MPI PIGS calculations, the trace can be read in by CAMPARI to recover the structural connectivity of configurations in graph-related analyses.
MPIAVG
This logical keyword - when set to 1 - instructs CAMPARI to perform a calculation employing and evolving a number of copies of the system. Unlike for the mutually exclusive keyword REMC, here it is assumed that each replica is evolving under exactly the same condition. Note that keyword REPLICAS is no longer used to set the number of copies (the value is taken directly from the MPI environment instead). Like all multi-copy (or multi-replica) methods in CAMPARI, the communication between copies is handled by MPI, and it is therefore necessary to use an MPI-enabled executable. The shared memory (OpenMP) parallelization of CAMPARI can be used simultaneously as this inner parallelization layer does not deal with the exchange of information between copies. In hybrid MPI/OpenMP mode, the number of threads is no longer settable by NRTHREADS but has to be set through an environment variable (OMP_NUM_THREADS) at the system level instead.
Additional keywords can activate specific multi-copy methods that utilize a similar framework, viz., parallel Wang-Landau runs (via MC_ACCEPT) and the PIGS protocol (via MPI_PIGS). In detail, the possible tasks are:
MPI averaging
This is the mode achieved without any additional keywords. Here, the individual copies (replicas) are strictly independent (no communication requirement) until the very end, when on-the-fly analysis data are automatically collected and processed by the head node (see documentation of output files for details). Some analysis functions or simulation algorithms may not be supported. This is primarily a mode to save time for the user since it essentially reproduces a common mode of running molecular simulations, i.e., running multiple trajectories in parallel and analyzing the results together. Starting conditions (see RANDOMIZE and PDBFILE) and stochasticity of the propagator are important here to avoid multiple replicas diverging only on account of numerical drift.
Parallel Wang-Landau runs
If the simulation is a pure Monte Carlo simulation, and if the Wang-Landau acceptance criterion is used, the behavior changes. Wang-Landau runs are essentially iterative, and in such a case keyword MPIAVG will create a parallel version of the Wang-Landau scheme. At an interval set by WL_FLATCHECK, the histograms are recombined over the individual nodes. The combined histogram is then what determines the move acceptance and what is used to evaluate whether to update the convergence parameter or not. The value of the convergence parameter and all other relevant settings remain synchronized throughout. In between update steps, the individual replicas evolve according to the last global histogram that was since incremented locally. This means that the value chosen for WL_FLATCHECK is a delicate quantity since both too small and too large values may impede convergence. While the former may remove the bias for an individual replica to traverse phase space faster than a canonical simulation, the latter may result in several replicas exploring the same area of phase space, thereby amplifying a lack of global convergence. Note that the communication routines used in the parallel Wang-Landau implementation can be fine-tuned using keywords MPICOLLS and MPIGRANULESZ.
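One way to picture the recombination step is sketched below in Python. It assumes that "recombined" means adding up the increments each replica accumulated locally since the last synchronization; this reading, the function name, and the list-based histograms are assumptions made for illustration, and the actual scheme in CAMPARI may differ in detail.

```python
def recombine_histograms(global_hist, local_hists):
    """Merge the local increments of all replicas into a new global histogram.

    global_hist  histogram shared by all replicas at the last synchronization
    local_hists  one histogram per replica: the shared one plus local increments
    """
    merged = list(global_hist)
    for local in local_hists:
        for b, (loc, glob) in enumerate(zip(local, global_hist)):
            # Only the part added since the last synchronization is new information.
            merged[b] += loc - glob
    return merged
```

After such a step, every replica would continue from the same merged histogram, which corresponds to the statement that replicas evolve according to the last global histogram between updates.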
PIGS runs
PIGS runs are explained in detail below. Here, CAMPARI collects information from all replicas over a specified interval. Rather than biasing the potential energy surface, this information is used to make decisions on whether to truncate some of the trajectories and to restart them from more interesting points corresponding to the current states of other replicas. PIGS stands for Progress Index-Guided Sampling and utilizes information from a method described elsewhere (please refer also to the published articles: progress index; PIGS).
Parallel analysis runs
It is possible to let CAMPARI analyze several trajectories in parallel using either the MPI replica exchange setup or the MPI averaging setup. This is enabled by setting MPI_PIGS to 0 (the default) and PDBANALYZE to 1. For an analysis run in the MPI averaging setup, the behavior is exactly as in an MPI averaging calculation, i.e., data are collected and analyzed by each MPI process and pooled at the end. The results from such a calculation should be the same as the result from a serial analysis of a single long trajectory obtained by concatenating the individual trajectories. Technically, parallel trajectory analysis requires that the REPLICAS individual trajectories are systematically named and numbered in a fashion similar to how CAMPARI writes trajectories in RE simulations. This means that every file is prefixed with "N_XXX_", where XXX gives the replica number (started from "000"). Since there is only a single key-file, the input trajectory name specified should not include this prefix (it will be added automatically). An example is given elsewhere. Frame-specific analyses (and thereby frame weights) are not yet supported in any parallel trajectory analysis runs of this type. For parallel analysis runs using the replica exchange setup, please see above.
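The naming convention for the input trajectories can be illustrated with a short Python sketch. The function name and the base name "traj.xtc" are hypothetical; the "N_XXX_" prefix with zero-based, three-digit numbering follows the description above.

```python
def replica_file_names(base_name, replicas):
    """Return the file names expected for `replicas` individual trajectories."""
    # Replica numbers start at 000 and are zero-padded to three digits.
    return [f"N_{i:03d}_{base_name}" for i in range(replicas)]


# Example: the key-file would specify only "traj.xtc"; the prefix is added per replica.
for name in replica_file_names("traj.xtc", 3):
    print(name)
```

A helper of this kind can be handy for renaming independently generated trajectories before a parallel analysis run.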
Parallel analysis runs emulating a PIGS stretch
This option is identical to the previous one with the single exception that structural clustering and related analyses are not available in their standard form. Instead, CAMPARI will emulate the behavior of the PIGS reseeding heuristic and print out a line of the PIGS trace file. This respects input for keywords MPI_GOODPIGS, RE_TRAJTOTAL, RE_TRAJOUT, and all the required keywords for computing the progress index (starting with CCOLLECT). For details, please consult the documentation on the PIGS simulation method. This mode is enabled by setting MPI_PIGS to 1 and PDBANALYZE to 1.
MPI_PIGS
If a multi-replica simulation is requested via keyword MPIAVG, this keyword allows the user to enable the PIGS enhanced sampling scheme. We refer the user to the literature for a detailed overview. Technically, PIGS utilizes parts of the infrastructure of both replica-based parallel simulation protocols (keywords MPIAVG and REMC as described below). Briefly, PIGS works as follows:
Each of the REPLICAS processes running in parallel propagates a copy of the same system. After an interval of REFREQ steps has elapsed, the algorithm evaluates a heuristic that is used to selectively terminate some of the trajectories and to reseed them from the current states of other replicas. To avoid bit-identical trajectories, the propagator must have a stochastic component to it, e.g., Langevin dynamics, Monte Carlo, or Newtonian molecular dynamics with suitable thermostats. Unlike in replica exchange, the conditions associated with each replica are identical, and swaps would be redundant. The termination and reseeding of trajectories implies a loss of information and is justified only if the reseeding point ultimately leads to better sampling. The notion of "better" is not general. For PIGS, it consists of the desire to diversify individual replicas, e.g., to prevent sampling of overlapping regions of phase space. The truncation and selective reseeding of simulations is used in many methods such as distributed computing or transition path sampling.
To evaluate the heuristic, PIGS collects data from every trajectory over every interval of size REFREQ. To remain scalable, it is a memory-free algorithm, i.e., the slice of data determining the reseeding is always of the same size. From the composite data slice, the so-called progress index is constructed (see option 4 to CMODE). The size of the data slice is therefore set by the combination of keywords REFREQ, CCOLLECT, and REPLICAS (or the actual number of replicas available). Construction of the progress index requires as essential input only the definition of a representation and distance between snapshots, which is provided by CDISTANCE and possibly CFILE. Again, for scalability reasons, the approximate progress index is constructed, and this entails additional parameters of minor importance (see keyword CPROGINDMODE for details).
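As a rough illustration of how these keywords set the size of the data slice, the following Python sketch counts the snapshots gathered per reseeding interval. The function name and the example values are hypothetical. The sketch assumes that every replica contributes one snapshot every CCOLLECT steps over an interval of REFREQ steps, and it enforces the documented requirement that REFREQ be a multiple of CCOLLECT.

```python
def pigs_slice_size(refreq, ccollect, replicas):
    """Number of snapshots in the composite data slice for one PIGS interval."""
    if refreq % ccollect != 0:
        raise ValueError("REFREQ must be a multiple of CCOLLECT")
    # Each replica contributes REFREQ/CCOLLECT snapshots per interval.
    return replicas * (refreq // ccollect)


# Example with hypothetical settings: 16 replicas, REFREQ 5000, CCOLLECT 50.
print(pigs_slice_size(5000, 50, 16))
```

Because the slice size is fixed by these three settings, the cost of evaluating the heuristic does not grow with the length of the simulation.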
With the complete progress index in hand, it is possible to locate the current snapshots for all replicas in the index. This requires that REFREQ be a multiple of CCOLLECT. The progress index is an ordered sequence of snapshots that arranges geometrically similar snapshots close to one another without using a reference state. Every snapshot is associated with a specific distance that corresponds to the length of an edge in an underlying spanning tree. From this information, a composite rank of three individual ranks is constructed. The latter are:
- Position in the progress index (larger is better as snapshots from low likelihood regions are more likely to appear there).
- Length of the associated edge (larger is better as distances tend to be larger in low likelihood regions).
- Distance from any other current snapshot in terms of progress index position (larger distances are better as they indicate more unique sampling domains).
p(X → Y) = [ ζ(X)-ζ(Y) ] / (Δζmax+1)
Here, ζ(X) and ζ(Y) are the composite (summed) ranks for replicas X and Y, respectively. Δζmax is the maximum realizable difference in composite rank. The result is compared to a uniformly distributed random number between 0 and 1. A reseeding is accepted putatively if this random number is smaller than the evaluated expression, which biases acceptance toward cases with large rank difference. It is only putatively accepted because every replica can be protected on account of a uniqueness criterion. This is evaluated by finding the first and third quartiles of the snapshots coming from the replica in question in terms of progress index position. If they are tightly clustered, the number is small, and it is inferred that this replica samples a relatively unique area of phase space. Conversely, if they are distributed, this indicates sampling overlap. The difference in the positions of the first and third quartiles is compared to the number REFREQ/CCOLLECT, which is twice the minimum value. If it is less than this number, any putative reseeding is rejected for the replica in question.
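The reseeding test and the uniqueness criterion described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than CAMPARI's implementation: the function names are hypothetical, the formula is applied literally with the sign convention as written, and the quartiles are estimated with a simple index-based rule that may differ from what CAMPARI uses.

```python
import random


def reseed_probability(zeta_x, zeta_y, delta_zeta_max):
    """p(X -> Y) = [zeta(X) - zeta(Y)] / (delta_zeta_max + 1), as given in the text."""
    return (zeta_x - zeta_y) / (delta_zeta_max + 1)


def is_protected(positions, refreq, ccollect):
    """Uniqueness criterion: a tight interquartile spread in progress index
    position protects the replica from being reseeded."""
    ordered = sorted(positions)
    n = len(ordered)
    q1 = ordered[n // 4]          # simple index-based first quartile (assumption)
    q3 = ordered[(3 * n) // 4]    # simple index-based third quartile (assumption)
    return (q3 - q1) < refreq / ccollect


def reseed_decision(zeta_x, zeta_y, delta_zeta_max, positions_x, refreq, ccollect, rng):
    """Putative acceptance by random number, then veto by the uniqueness criterion."""
    p = reseed_probability(zeta_x, zeta_y, delta_zeta_max)
    putative = rng.random() < p
    return putative and not is_protected(positions_x, refreq, ccollect)
```

If the snapshots of a replica occupy one contiguous block of the progress index, the interquartile spread is about half of REFREQ/CCOLLECT, which is consistent with the threshold being described as twice the minimum value.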
A positive reseeding decision incurs the same mechanism as that of accepting a new structure in replica exchange. Principally, all required settings and variables for the propagator are transferred. This excludes the seed of the random number generator, i.e., otherwise identical trajectories can diverge quickly on account of the different sequences of random numbers. This is how the stochastic component of the propagator as mentioned above is relevant. For molecular dynamics propagators, velocities can be kept or reassigned, i.e., both meaningful controls supplied by RE_VELMODE are supported. For pure MC propagators, keyword REMC_DOXYZ is also supported. The history of reseeding decisions can be recorded in a trace file. This file is similar to the same output file for the replica exchange method and can be obtained with keyword RETRACE. It is strongly recommended to always write this file for subsequent diagnosis and analysis.
With the exception of structural clustering, on-the-fly data analysis is supported by PIGS in the same way as it is by the default multi-replica (MPI averaging) setup. Data are gathered across replicas, combined, and total averages and distributions are provided. In general, however, PIGS leads to biased distributions, and it may therefore be more useful to adopt a standard protocol that stores trajectories individually for each replica (with MPIAVG_XYZ), disables most on-the-fly analyses, and performs all further analyses strictly in post-processing (with PDBANALYZE).
Structural clustering uses the same infrastructure that PIGS requires to evaluate the heuristic but the data are deleted after each interval of length REFREQ. Note that this implies that the memory requirement of the head node can be large if REFREQ/CCOLLECT and the number of replicas are both large. Scalability of the protocol with respect to the number of replicas requires parallelization of the progress index computation itself, and this is only implemented at the level of the master MPI process thus far. Conversely, the subordinate ranks do nothing but communicate their snapshot to the master instantly after each collection event. Once the data for the entire stretch have been received from all replicas, the shared memory (OpenMP) parallelization of the data mining algorithms comes into play (in a hybrid MPI/OpenMP run). The available OpenMP threads are only those of the master MPI process even if multiple MPI processes reside on the same shared-memory node. Because of these limitations, it is recommended to ensure through appropriate parameter settings that the cost added by the heuristic is kept manageable. Keep in mind that some aspects of the structural clustering facility are not available in the context of the PIGS heuristic. Obviously, CMODE is not selectable, nor are CPROGINDSTART or CPROGINDMODE controllable (they default to 4, -1, and 2, respectively). Keyword CPROGMSTFOLD has no effect (its use would be undesirable in the context of the first ranking criterion mentioned above). All keywords related to editing or utilizing the link structure of the network are irrelevant. Data preprocessing and the utilization of weights are both supported but the application of linear transforms is not (yet). The use of weights can entail a number of associated parameters that reflect or describe a (time) locality in the sequence of snapshots, e.g., a lag time.
It is important to keep in mind that the PIGS algorithm simply concatenates the data from all replicas, which can lead to artificial periodicities or spikes in locally estimated fluctuations, which may or may not be desired. Lastly, the technical parameters controlling the tree-based clustering and the short spanning tree construction for the progress index are of course relevant in PIGS (→ CRADIUS, CMAXRAD, BIRCHHEIGHT, BIRCHMULTI, CREFINE, CLEADER, BIRCHCHUNKSZ, CMERGEDIAM, CPROGINDRMAX, CPROGRDEPTH, CPROGRDBTSZ).
MPI_GOODPIGS
This, along with REFREQ, CCOLLECT, and CDISTANCE, is one of the main parameters of the PIGS protocol (see link for details). It determines how many of the parallel replicas are protected from being reseeded and serve as database for the remaining replicas. If MPI_GOODPIGS equals the actual number of replicas (normally set by REPLICAS), the PIGS algorithm relaxes to the propagation of independent, identical copies of the system (basic functionality of MPIAVG). There is no consensus rule for good choices for this parameter, but a reasonable starting point is usually given by setting it to REPLICAS/2.
MPIAVG_XYZ
If the MPI averaging technique is in use (→ MPIAVG), this simple logical keyword lets the user choose to obtain trajectory data for each of the independent, identical replicas separately (which is also the default). If this keyword is explicitly set to zero (logical false), only a single trajectory file will be written with entries cycling not only through the time or equivalent axis but also through replica space (see elsewhere for details). The choice here is mostly a matter of convenience for post-processing, but note that with individual trajectories REPLICAS times as much structural data are written as with a single file. Lastly, note that very frequent write operations by different processes to a shared output file may occasionally cause race conditions and/or be inefficient due to long waiting times.
MPICOLLS
This keyword acts as a simple logical (turned off by default) that allows the user to enable the usage of collective communication routines defined by the MPI standard for selected communication operations in CAMPARI (routines such as MPI_ALLREDUCE, MPI_BCAST, etc). These routines should at all times be functionally equivalent to what CAMPARI would use otherwise, i.e., collective primitives constructed exclusively from blocking send and receive operations (MPI_SEND and MPI_RECV).
The reason for having such a keyword is twofold. First, buggy code in conjunction with these MPI-defined collective communication routines can be difficult to diagnose and debug, because the MPI standard requires an outcome, but not a specific implementation. Essentially, developers and users cannot make any assumptions about the underlying communication flow. In general, this is of course desired (especially from a performance point of view), since it leaves the optimization of said communication to the MPI library rather than forcing the calling program to address these issues. Second, there are enough reports on the web of potentially faulty implementations of these routines in common MPI libraries. In conjunction with additional concerns regarding thread safety, etc, it could prove advantageous to developers to have modifiable implementations in place.
MPIGRANULESZ
If custom CAMPARI routines for collective communications are in use (→ MPICOLLS), and if a calculation is performed that relies on such collective communication operations, this keyword lets the user alter the communication flow structure CAMPARI sets up to handle these cases. The keyword specifies a number of processes, amongst which communication is presumed fast (most often the number of CPU cores on a single board). The communication flow is then set up in a way that minimizes the required communication between such blocks of processes (they are generally assumed to be in sequence and to all be of identical size). This keyword is therefore unlikely to be useful for heterogeneous allocations (different numbers of cores granted on different machines or processes distributed non-sequentially). Between blocks, communication attempts to minimize latency (tree topology), whereas within blocks communication is (currently) strictly hierarchical and sequential with a single head process for each block. This means that (currently) setting MPIGRANULESZ to the number of processes granted by MPI will generate a global hierarchical flow with a single master, whereas setting it to 1 will generate a global tree-like flow.
Output and Analysis:
Preamble (this is not a keyword)
Unlike most other simulation software, CAMPARI offers to analyze certain quantities while the simulation is being performed ("on-the-fly"). This has the advantage that the frequency of dumping raw trajectory data to the disk does not have to control the frequency of analyses. This can save time and money by circumventing expensive write operations to disk. Of course, in a typical simulation setting, the user will still want to obtain trajectory data: for visualization, for not-yet-defined analyses, and so on. However, the built-in analyses can still prove beneficial by utilizing as much data as possible. This is generally controlled by several interval settings: analysis X should be performed or instantaneous data Y should be reported every N steps. Such keywords (see for example ANGCALC) are interpreted the same way unless otherwise noted. For example, if ANGCALC is 250 and NRSTEPS is 1000, the analysis would be performed at steps 250, 500, 750, and 1000. There is only one other keyword affecting this: the number of equilibration steps. If in the above example EQUIL is 400, the analysis would only be performed at steps 500, 750, and 1000 (i.e., the count is always relative to the 0th step).
Note that some analyses can be costly. Their scaling with system size will usually be stated. At the very end, the log-output will typically report the fraction of CPU time spent performing analysis routines. This may help assess whether some of the frequency settings should be reduced. Simple ways to disable built-in analyses are provided.
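The interval logic in the example above can be reproduced with a few lines of Python. The function name is hypothetical. The sketch assumes that an analysis is skipped for every step up to and including the last equilibration step; the text does not state how a step that coincides exactly with EQUIL is treated, so that boundary case is an assumption.

```python
def analysis_steps(interval, nrsteps, equil=0):
    """Steps at which an analysis with the given interval would be performed.

    The count is relative to step 0, so equilibration only removes early
    entries and never shifts the remaining ones.
    """
    return [step for step in range(interval, nrsteps + 1, interval) if step > equil]


# Values from the example in the text: ANGCALC 250, NRSTEPS 1000, EQUIL 0 or 400.
print(analysis_steps(250, 1000))
print(analysis_steps(250, 1000, equil=400))
```

The same reasoning applies to the other interval keywords unless their documentation notes otherwise.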
CAMPARI often groups statistics together. For example, for a melt of identical polymers, CAMPARI would by default compute only a single histogram of end-to-end distances. This grouping is at times undesired and is overcome by the concept of analysis groups. Unfortunately, the opposite task of grouping information from different species together is currently not supported for such analyses.
For long CAMPARI runs, the instantaneous analysis has the downside that (currently) no intermediate results are produced (everything remains in memory until the very end). In this case, it can be useful to utilize the restart functionality (→ RSTOUT) to produce simulation blocks of equivalent sizes each with complete analysis output files. This strategy also serves to preserve more information in case of unexpected crashes or job terminations. The alternative is to follow the classical route of shifting the entire analysis burden to post-processing by only saving instantaneous trajectory output. As mentioned above, this has the downside of dealing with larger amounts of data and, more importantly, with a loss of coordinate precision (for example, when using the xtc compression library). Another issue that can prove problematic with long runs is that some instantaneous output files (such as the file with running energy terms) are subject to buffered I/O. This depends on compiler and operating and file systems and means information can be lost in case of unclean terminations, which makes it harder to diagnose the error. Keyword FLUSHTIME helps with this problem.
To offer some more flexibility in its analysis pipeline, CAMPARI also supports, since version 4.1, an interface with a Python module that contains template functions that are fed data directly from CAMPARI. The module can then be customized by the user to perform analyses not supported directly in CAMPARI or to provide instantaneous output, also utilizing additional libraries. This is generally simpler than modifying the Fortran source code because there is no compilation required and knowledge of Python is more prevalent. Using this feature does have a cost of course: generally speaking, it will be fairly straightforward to write correct Python code but very challenging to write Python code that is both correct and comparably efficient to compiled Fortran code. The module is enabled with the help of keyword PYCALC. Notably, the Python interface is fundamentally different in logic from the NetCDF analysis mode. While the former focuses exclusively on customizing computations, the latter focuses exclusively on generalizing input data.
When the shared memory (OpenMP) parallelization is in use, there are a few additional considerations to be taken into account. First, few analyses are per se parallelized (mentioned below for these cases). For the remainder, which are largely inexpensive in terms of CPU time, CAMPARI uses an inhomogeneous and dynamic task parallelization approach at every step. This means that it is beneficial to make sure that analyses are triggered at the same simulation/trajectory steps. Conversely, if their executions are staggered, i.e., only one or few tasks are executed at a given step, the load distribution across threads by task cannot provide a benefit. Notably, the final processing/printing of results, with the exception of structural clustering, does not employ the shared memory parallelization at all. In contrast, it is possible to chop a large trajectory into multiple pieces and to use MPI-parallel trajectory analysis to speed up analyses on large data sets (whether this involves combining the results in the end or not).
This keyword sets the interval specifying how often to write out a restart file. Such a file will allow continuing both crashed and normally terminated runs without losing significant accuracy due to truncation of significant digits (such as in pdb-files). Note that they are not bitwise perfect, however. The concept is described elsewhere. Restart files are written to two files continuously replacing themselves on an alternating schedule such that even if a crash occurs during a write-operation at least one sane restart file should exist. These files are generally named {basename}_1(2).rst. Settings for EQUIL are (of course) irrelevant for this output. Whenever a restart file is written, the system's potential energy is recalculated, which is the only part aided by CAMPARI's shared memory (OpenMP) parallelization.
ANGRPFILE
This keyword sets path and name to the input file for determining analysis groups by custom request rather than by molecule type (→ ANGRPFILE). By default, CAMPARI will often combine collected analysis data for molecules of identical type. This is not always the desired behavior. For example, CAMPARI fails to recognize differences introduced to molecules of the same type by virtue of molecule-specific constraints or biasing potentials. Analysis groups alleviate this and similar problems by allowing molecules of identical type to be grouped into arbitrary analysis groups. Note that it is never possible to combine data for molecules of chemically different type or to split a single molecule into multiple groups (although the latter may be implemented in the future). Systems employing chemical crosslinks (please refer to sequence input for details) pose a special case: here, intermolecular crosslinks do not conjoin two molecules in terms of data structures and analysis, i.e., it will for example (currently) not be possible to obtain the net radius of gyration of two crosslinked polypeptide chains. Instead, both chains will be analyzed and treated as if they were separate molecules.
FLUSHTIME
This keyword determines the desired time interval (in minutes, with the caveat that there is only a single measurement per elementary step) for two things. First, every FLUSHTIME minutes (fractional values are allowed), CAMPARI will report the current production rate (in elementary steps/day) and time to finish. This happens to either log-output (for serial or pure MPI executables) or to a dedicated output file in case the shared memory (OpenMP) parallelization is in use. This performance estimation is bound to be misleading for inhomogeneous calculations where the average cost of a step per time interval can change dramatically (for example, in simulations with hybrid propagators or analysis runs where not all analysis frequencies are 1). Second, CAMPARI will flush the buffers of all instantaneous output files every FLUSHTIME minutes. This can be useful because I/O buffers on a given system may be so large that the information in these files (which are often used to monitor a running simulation) is rarely up to date, and that significant data loss can occur upon unexpected terminations.
DISABLE_ANALYSIS
In CAMPARI, many analysis features are turned on by default. This keyword exists to simplify turning them off, for example, when trajectory post-processing is desired, for trial runs, etc. DISABLE_ANALYSIS is processed at the lowest hierarchy level of the key-file parser and, like all other keywords, sequentially for a given hierarchy level. It is not a keyword in the traditional sense since it does not set any CAMPARI-internal variable associated with it to a given value. Instead, it is a shortcut for listing explicitly analysis features with calculation intervals larger than the simulation length. This is exactly what DISABLE_ANALYSIS does: it sets all the affected calculation intervals (see below) like XYZOUT, DSSPCALC, and so on to NRSTEPS + 1.
The options are:
1. All analysis and instantaneous output options are disabled. The only features not affected are the writing of initial and final structure files, the printing to log output (warnings, summary, and some progress information), the log-file for the OpenMP parallelization, the printing of the trace file in certain MPI multi-copy runs, and the writing of restart files. Using this option without other keywords overriding DISABLE_ANALYSIS, a run will not provide much useful information.
2. All analysis options are disabled. This is the same as the previous option except that simple instantaneous output features like energy or trajectory output are not turned off. Instantaneous output relying on a built-in analysis feature, e.g., running DSSP output, is disabled implicitly.
3. All instantaneous output options are disabled. This is the same as option 1 except that it leaves all built-in analysis features at their defaults (some are disabled by default in any case since they rely on additional information) but disables both simple and dependent instantaneous output features.
This would result in XYZOUT being set to 1000001, which means it is neither the seemingly requested number of 1000 nor is it disabled. Since using DISABLE_ANALYSIS is equivalent to changing the default settings for many analysis features, it usually leads to a shorter key-file than doing this by hand.
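The interplay between sequential key-file processing and DISABLE_ANALYSIS can be sketched in Python. The parser below is a hypothetical model, not CAMPARI's code: it assumes that NRSTEPS is already known when the lowest hierarchy level is processed, that later entries simply overwrite earlier ones, and it covers only the two interval keywords named above (the real list is longer).

```python
# Only the two interval keywords named in the text; the real list is longer.
AFFECTED = ("XYZOUT", "DSSPCALC")


def process_keys(entries, nrsteps):
    """Apply (keyword, value) pairs in order; later entries override earlier ones.

    Hypothetical model of sequential processing at one hierarchy level.
    """
    settings = {}
    for key, value in entries:
        if key == "DISABLE_ANALYSIS":
            # Shortcut: set every affected interval to NRSTEPS + 1.
            for name in AFFECTED:
                settings[name] = nrsteps + 1
        else:
            settings[key] = value
    return settings


# XYZOUT listed before DISABLE_ANALYSIS: the request for 1000 is overwritten.
before = process_keys([("XYZOUT", 1000), ("DISABLE_ANALYSIS", 1)], nrsteps=1000000)
# XYZOUT listed after DISABLE_ANALYSIS: the request for 1000 survives.
after = process_keys([("DISABLE_ANALYSIS", 1), ("XYZOUT", 1000)], nrsteps=1000000)

print(before["XYZOUT"], after["XYZOUT"])
```

Under these assumptions, keywords meant to override DISABLE_ANALYSIS have to appear after it in the key-file.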
This keyword defines the interval at which current potential energy data are written to a file called ENERGY.dat. Note that the total energy is decomposed into the individual terms controllable by keywords of the type SC_XYZ (for example SC_IPP). It is presently not possible to obtain energy decompositions based on subcomponents of the system. The data in ENERGY.dat are the only direct print-out of unperturbed energy values if energy landscape sculpting is in use. Settings for EQUIL are ignored for this output. Note that energies are calculated at every step for every term that is turned on. Energies are never calculated specifically for reporting purposes. This means that ENOUT has no significant associated cost per se and, more importantly, cannot be used to speed up post-processing runs where energy values are required only for selected frames of an input trajectory (the correct way to deal with this situation is to first extract these frames and to then run the energy analysis in a second step).
MOL2ENMODE
Complementary to running output of instantaneous energies through keyword ENOUT, if the run is a small molecule screen, this keyword allows the user to report an energy signature for every pose it prints to the main output file of the screen. The printed poses are of four different classes (minimal energy, cluster centroid, final, regular interval), and their availability depends on the choice for MOL2OUTMODE. The options for the printed energies, which will be found in a dedicated output file, are the following:
0. No output of energies is requested.
1. Single-point energies are requested. These are always available and will be the current energies, resolved by term as in ENERGY.dat but only printing necessary columns with a header, for the entire system (what would be the "complex" energy in a small molecule scoring calculation).
2. Averaged energies are requested. Meaningful averaging requires classifying points into mutually similar chunks or fulfilling the criterion for mutual similarity by other means. Only the first route is supported, which is why pose clustering must be enabled for this option to be available (see MOL2CLUMODE). If so, the printed energies of poses will be the (linear) averages across all the snapshots that entered pose clustering (the number of available snapshots being defined by keywords NRSTEPS, EQUIL, and CCOLLECT) and were assigned to the same cluster. This works for all possible output poses except when MOL2OUTMODE is 5, for which only options 0 and 1 are available. Since the minimum energy or final pose can be in a retained cluster (see MOL2THRESH), this implies that the printed energies can be the same for two (slightly) different poses.
3. This option is the same as option 2 except that the averaging is only applied to poses that are the centroids of retained clusters. This is generally not recommended since it mixes averages with single-point values where some of the single-point values (energy minima) are going to be systematically more negative.
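The cluster-wise averaging described for the averaged-energy option can be sketched as follows. This is a minimal illustration, not CAMPARI's implementation: the energies and cluster assignments are made-up numbers, and a single total energy per snapshot stands in for the term-resolved values.

```python
from collections import defaultdict


def cluster_mean_energies(energies, assignments):
    """Linear average of snapshot energies, computed separately per cluster label."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for energy, cluster in zip(energies, assignments):
        sums[cluster] += energy
        counts[cluster] += 1
    return {cluster: sums[cluster] / counts[cluster] for cluster in sums}


# Made-up snapshot energies and cluster labels, for illustration only.
energies = [-10.0, -12.0, -8.0, -20.0]
assignments = [0, 0, 1, 0]

means = cluster_mean_energies(energies, assignments)
# Snapshots 0 and 3 are different poses but share cluster 0, so both would be
# reported with the same averaged energy.
print(means[assignments[0]], means[assignments[3]])
```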
By this keyword, the user sets the interval at which current ensemble data are written to a file called ENSEMBLE.dat. This is only relevant if DYNAMICS is not set to 1 or 6 (pure Monte Carlo sampling or minimization). The reported quantities are informative ensemble variables (limited output presently) including - most prominently - potential and kinetic energies. Settings for EQUIL are ignored for this output. As is the case for keyword ENOUT, these values are calculated at every step regardless, and no computation occurs specifically for reporting purposes.
ACCOUT
If pure Monte Carlo or hybrid sampling is used (→ DYNAMICS), this keyword sets the interval at which cumulative acceptance data are reported to a file called ACCEPTANCE.dat. Note that these data are only mildly informative in that they do not directly allow acceptance rates to be computed. They are mostly useful in analyzing a running simulation and assessing the performance of the move set. CAMPARI will report acceptance statistics as well as residue- and molecule-resolved acceptance counts at the very end of the simulation to log-output. The data in ACCEPTANCE.dat are only resolved by move type. Settings for EQUIL are ignored for this output.
TOROUT
This keyword lets the user decide how often to write sets of internal coordinate space degrees of freedom to a file FYC.dat in a one structure-per-line format. These files can easily become large due to the number of degrees of freedom in general scaling linearly with system size. There are two options selected by using a positive (mode 1) or negative integer (mode 2) for TOROUT.
1. Native CAMPARI degrees of freedom are written with a header providing residue-level information. These generally correspond to the unconstrained degrees of freedom in Monte Carlo or torsional dynamics calculations (see sequence input for details). All but rigid-body coordinates are written to FYC.dat and much more information is provided there. Because rigid-body coordinates are missing, the information in the file is never enough to completely reconstruct the system even when assuming the default covalent geometries.
2. Sampled, dihedral angle degrees of freedom are written with a header that provides atomic indices corresponding to the various Z-matrix lines describing these dihedral angles. This mode excludes degrees of freedom that are actually frozen, and can include degrees of freedom that are not native to CAMPARI. All values are again written to FYC.dat, and more details are provided there. This mode never includes bond angles and/or dihedral angles that have no explicit Z-matrix entry.
This very important keyword sets the frequency with which snapshots containing (at least) the Cartesian coordinates of the system (or selected subsystem) are written to a new file or appended to an existing trajectory file (→ documentation on output files and keyword XYZPDB). Part of the filename(s) will be determined by keyword BASENAME unless the calculation is a small molecule screen. For the latter, multiple files with different names are written, thus necessitating a set of unique molecule names in the corresponding input file.
This is the fundamental saving frequency for obtaining trajectory data and should be chosen carefully whenever the proposed simulation is resource-intensive. These files can easily be very large, and it is possible for significant I/O lag to arise. The printing of trajectory data is done by a single thread if CAMPARI's shared memory (OpenMP) parallelization is in use. This happens with other threads performing other tasks concurrently (if there are any).
If structural output is requested (→ XYZOUT), this keyword chooses the output file format (see documentation on output files). It is an integer [1-3(4,5,6)] interpreted as:
1. Tinker-style .arc-files (ASCII)
2. ASCII .pdb-files (default option) in various output conventions (see PDB_W_CONV and PDB_OUTPUTSTRING)
3. CHARMM-style binary .dcd-files (these include the box information for each snapshot and have a CHARMM-style header - note that the header is written only once by CAMPARI and contains the number of snapshots in the file. This may not always be correct if simulations are prematurely terminated or trajectory files appended)
4. Compressed binary .xtc-files as used in GROMACS: note that this option is only available if the program is linked against a proper version of XDR (see installation instructions)
5. Compressed binary .nc-files as defined by the NetCDF format in AMBER convention: note that this option is only available if the program is linked against a proper NetCDF library (see installation instructions).
6. PostgreSQL database as defined in an original format for CAMPARI, see documentation on output files for details. This relies on a number of auxiliary keywords, most prominently PSQL_W_TRAJKEY.
If structural output is requested (→ XYZOUT), this integer [1-2] keyword determines whether to write to a series of numbered files (1) or a single file (2, the default). This, however, currently works for pdb only (specifically: .arc are always multiple files, and the binary formats always write to (append) a single file).
MOL2PDBMODE
This keyword controls whether in a small molecule screen the final and/or relaxed starting conformations of the entire system should be saved for every molecule. This requires having unique molecule names in the corresponding input file. The resultant file names end in "_END.pdb" and "_RELAXED.pdb".
- No pdb files are written.
- Only pdb files of final conformations are written (requires that a run for the given molecule is not terminated prematurely).
- Only pdb files of relaxed starting conformations are written (requires that this feature is enabled → TMD_RELAX).
- Both types of pdb files are written (if possible).
If structural output is requested (→ XYZOUT) and the chosen output format is the binary .xtc-format (option 4 for XYZPDB), this keyword can be used to specify the multiplicative factor determining the accuracy of compressed xtc-trajectories (the minimum is 100.0). Conversely, during read-in of xtc-trajectories in xtc-analysis mode (see PDBANALYZE and XTCFILE), the precision is read directly from the input file, and this keyword is ignored.
PSQL_W_DATABASE
If structural output is requested (→ XYZOUT) and the chosen output format is database (SQL) format (option 6 for XYZPDB), this keyword lets the user specify the name of the database to write trajectory information into. Naturally, using this functionality requires that CAMPARI has been compiled and linked with PostgreSQL (PSQL) and libpqxx support (see installation instructions for further information). If the database does not exist or is not accessible by the user running CAMPARI, an error is produced. Connection errors can be down to user permissions (ask your system/SQL administrator to correct this if you are trying to access a database you do not administer yourself) but also down to incorrect settings for keywords PSQL_W_DBPORT and PSQL_W_DBHOST. Lastly, the keyword analogous to PSQL_W_DATABASE for the read-in of structural information is PSQL_R_DATABASE.
PSQL_W_DBHOST
If structural output is requested (→ XYZOUT) and the chosen output format is database (SQL) format (option 6 for XYZPDB), this keyword lets the user specify the (possibly remote) host where the database specified by keyword PSQL_W_DATABASE can be found. This access is handled through the libpqxx library, which supports remote connections. The default is "" (without the double quotes), which on most Linux systems maps to the local machine. The communication channel uses a specific port, which can be set by keyword PSQL_W_DBPORT. If a remote connection fails, it might be because the database server is not configured to accept noninteractive connections without additional authentication (such as a user password). Lastly, the keyword analogous to PSQL_W_DBHOST for the read-in of structural information is PSQL_R_DBHOST.
PSQL_W_DBPORT
If structural output is requested (→ XYZOUT) and the chosen output format is database (SQL) format (option 6 for XYZPDB), this keyword lets the user specify the port number to use when accessing the database specified by keyword PSQL_W_DATABASE hosted on the server specified by PSQL_W_DBHOST. This access is handled through the libpqxx library, which supports remote connections. Allowed values are legitimate port numbers, and the default is 5432. The only common reason to change this is if, possibly for security reasons or to avoid clashes, the SQL administrator has proactively moved the port number to a different value. The default value of 5432 is the officially registered port for the PostgreSQL database. Lastly, the keyword analogous to PSQL_W_DBPORT for the read-in of structural information is PSQL_R_DBPORT.
PSQL_W_TRAJKEY
If structural output is requested (→ XYZOUT) and the chosen output format is database (SQL) format (option 6 for XYZPDB), this keyword sets the unique identifier string of the system that is being deposited. If the identifier is not yet present in the database specified by keyword PSQL_W_DATABASE, new entries and tables are created as needed, and the auxiliary keyword PSQL_W_SIMUOFF does not come into play. If, on the other hand, it does exist, different scenarios may arise as follows:
- The system existing in the database has no snapshots deposited and is different from the one that is meant to be deposited. This is checked twofold: by, essentially, the contents of the sequence file, which is deposited into the SQL database (see description of standard), and by the number of atoms. In this case, a warning is produced, and the existing system is overwritten.
- The system existing in the database has snapshots deposited and is different from the one that is meant to be deposited (same check as for previous option). In this case, an error is produced.
- The system existing in the database is deemed to be the same, and PSQL_W_SIMUOFF is 0, indicating that a new simulation(s) and snapshot table(s) should be created. In this case, the run will proceed irrespective of other modalities.
- The system existing in the database is deemed to be the same, PSQL_W_SIMUOFF is larger than 0, and there is no simulation entry matching the supplied value. In this case, an error is produced.
- The system existing in the database is deemed to be the same, PSQL_W_SIMUOFF is larger than 0, there is a simulation entry matching the supplied value, and the run is not a multi-copy calculation. In this case, the existing simulation is used and the associated snapshot table is appended.
- The system existing in the database is deemed to be the same, PSQL_W_SIMUOFF is larger than 0, there is a simulation entry matching the supplied value, and the run is a multi-copy calculation. In this case, CAMPARI will search for a set of snapshot tables and simulation entries in the database that map to the same parent, which is defined by the index supplied to PSQL_W_SIMUOFF. Only if these are all found and of equivalent data set size will the simulation proceed by appending the existing tables. Otherwise an error is produced.
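The scenarios listed above amount to a small decision tree, which the following Python sketch summarizes. The function, its argument names, and the returned labels are hypothetical and only restate the documented outcomes; no database access is modeled.

```python
def trajkey_outcome(key_exists, same_system=True, has_snapshots=False,
                    simuoff=0, sim_entry_found=False, multi_copy=False,
                    copy_set_complete=False):
    """Map the documented PSQL_W_TRAJKEY scenarios to an outcome label.

    All argument names and return labels are hypothetical, for illustration.
    """
    if not key_exists:
        return "create_system"           # new entries and tables as needed
    if not same_system:
        # A differing system is overwritten only if nothing was deposited yet.
        return "error" if has_snapshots else "warn_and_overwrite"
    if simuoff == 0:
        return "create_simulation"       # new simulation and snapshot table(s)
    if not sim_entry_found:
        return "error"                   # no simulation entry matches the value
    if not multi_copy:
        return "append"                  # append to the existing snapshot table
    # Multi-copy runs need the full, size-consistent set of tables and entries.
    return "append" if copy_set_complete else "error"


print(trajkey_outcome(key_exists=True, simuoff=3, sim_entry_found=True))
```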
If structural output is requested (→ XYZOUT) and the chosen output format is database (SQL) format (option 6 for XYZPDB), this keyword sets the simulation ID ("sim_id") of the parent simulation entry to append to. In order to create a new entry, use 0, which is also the default. Note that the keyword refers only to the "parent" of a set of simulations in case a multi-copy calculation is performed. Details are provided in the documentation for keyword PSQL_W_TRAJKEY. Further background information on the layout of tables in SQL is found in the description of the standard.
PSQL_SIMSUM
If structural output is requested (→ XYZOUT) and the chosen output format is database (SQL) format (option 6 for XYZPDB), this keyword allows the user to provide a (plain text) file with a description of the simulation to be deposited. This is supported only in trajectory analysis mode and meant to facilitate depositing trajectories generated by other software packages. The text file is parsed, line by line (see elsewhere for details), and stored in the (text) field "sim_summary" of the table "simulations". In a multi-copy calculation, separate files can be provided by the normal CAMPARI prefix logic.
If this keyword is not present or the run is a normal simulation run, CAMPARI will produce a file with its own simulation summary (the one printed to log-output) and use this instead. If the keyword is present in a parallel trajectory analysis run, possible prefixed files are checked first before deferring to the literal name specified (which is then used by all replicas).
If structural output is requested (→ XYZOUT) and the chosen output format is database (SQL) format (option 6 for XYZPDB), this keyword allows the user to specify the name and version of the software that generated the data being deposited. This is meaningful primarily for trajectory analysis runs depositing trajectories generated by other simulation programs.
PDB_NUCMODE
CAMPARI's internal representation of polynucleotides has one peculiarity. It assigns the entire PO4- functional group to the same nucleotide residue whereas most other programs seem to assign the 3'-oxygen atom to the residue carrying the sugar. This causes a non-trivial inconsistency when trying to use CAMPARI-generated pdb-files as input for other software. Therefore, this keyword defines how to assign the O3*-atom of nucleic acids in pdb-output only. There are two options:
- The O3*-atom is assigned to the residue carrying the 5'-phosphate it is part of, i.e., it is the very first atom in that residue. This is the CAMPARI-inherent convention and reflects the authentic structure of arrays in CAMPARI (which is relevant for any analysis requiring atom numbers, see for example the input on selecting specific distance distributions to be collected).
- The O3*-atom is assigned to the residue carrying the sugar it is part of; this is the PDB-typical convention. Note that this inherently disrupts the 1:1-correspondence between numbering in the pdb-file and how nucleic acids are represented internally. It is recommended if and only if CAMPARI-output is sought to be compatible with other software working in this latter convention. Note that for this option to work correctly with unsupported polynucleotide residues (recognized as such) atom names have to be canonical.
The pdb file format is a keyword-based specification. Most of the keywords are not relevant in the context of a molecular simulation software (like experimental details, authors, gene information, etc.) but some are relevant to achieve proper display with visualization software. These include information about unexpected (non-protein, non-nucleic acid, non-water) entities, information about unexpected covalent bonds (like disulfide bridges), information about parts of the system missing in the output, and information about the container (unit cell). This keyword controls how much of that information is provided.
0. No additional information is printed (default). This means the output files contain only coordinate sections and container/unit cell information (see SHAPE and SIZE).
1. By default, CAMPARI truncates ATOM/HETATM lines. This is remedied by choosing this option (or 3 or 4 below), and these lines will now include information about chemical elements and net charge. The latter in particular is not a very clean annotation and works mostly for monoatomic ions. In addition, a header is produced that adds a "SEQRES" section to represent the (full) sequence, uses "REMARK 465" entries to report missing coordinates (resulting from the use of keywords XYZ_SOLVENT and TRAJIDXFILE), adds "MODRES" entries for nonstandard polymer residues, reproduces the "HET" and "HETNAM" (but not "HETSYN" or "FORMUL") entries for all nonstandard entities, and adds an appropriate "SSBOND" section. While this tries to be consistent with the PDB standard, the latter is written for experimental structures, which are ambiguous in some aspects, e.g., the numbers of resolved solvent (water or other) molecules. This does not fit well with the logic of simulations where the system is almost always defined exactly. It also does not fit well with multi-model PDB files, even for experimental (e.g., NMR) structures.
2. In addition to coordinate and container/unit cell information, "CONECT" records are written at the end. These are sorted lists of integers indicating the presence of covalent bonds between the first atom (in the atom numbering present in the file) and all subsequent atoms. Only those covalent bonds are required that are not between standard polymer entries or within water molecules.
3. This combines both previous options (both a header and "CONECT" records are written as additional information).
4. This option only includes the extended ATOM/HETATM lines to provide element and net charge (if available) annotations.
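As an illustration of what "CONECT" records look like, the sketch below builds them from a list of bonded atom pairs using the fixed-width layout of the PDB format (the record name followed by 5-character integer fields, at most four bonded partners per record). Whether CAMPARI lists every bond from both atoms, as done here, is an assumption, and the atom numbers are made up.

```python
def conect_records(bonds):
    """Build PDB-style CONECT lines from (i, j) pairs of atom serial numbers.

    Assumption: every bond is listed from both of its atoms.
    """
    partners = {}
    for i, j in bonds:
        partners.setdefault(i, set()).add(j)
        partners.setdefault(j, set()).add(i)
    lines = []
    for atom in sorted(partners):
        bonded = sorted(partners[atom])
        # The PDB layout holds at most four bonded partners per record.
        for k in range(0, len(bonded), 4):
            chunk = bonded[k:k + 4]
            lines.append("CONECT" + f"{atom:5d}" + "".join(f"{b:5d}" for b in chunk))
    return lines


# Made-up example: a single bond (e.g., a disulfide) between atoms 5 and 12.
for line in conect_records([(5, 12)]):
    print(line)
```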
CAMPARI can in general process different atom and residue naming conventions for the formatting of PDB files. This keyword selects the format for written files. Choices are:
- CAMPARI format
- GROMACS format (atom naming, nucleotide and cap residue names, ...)
- CHARMM format (atom naming, cap residue names and numbering (these result from the patch logic of CHARMM), ...): Note that there are two exceptions pertaining to C-terminal cap residues (NME and NH2) which arise due to non-unique naming in CHARMM: 1) NH2 atoms are called NT2 (instead of NT) and HT21, HT22 (instead of HT1, HT2), and 2) NME methyl hydrogens are called HAT1, HAT2, HAT3 (instead of HT1, HT2, HT3).
- AMBER format (atom naming, nucleotide residue names, ...)
- Partial CHARMM format (not including the merging of polypeptide cap residues and the ambiguous naming of DNA residues): This is meant for interoperability with GROMACS using official CHARMM ports
- Partial AMBER format (not including the 2/3 rather than 1/2 naming convention for methylene hydrogen atoms): This is meant for interoperability with GROMACS using GROMACS-provided AMBER ports
- GROMOS-friendly format (very similar to default CAMPARI)
This keyword allows changing the formatting string (Fortran) used for the ATOM/HETATM lines in PDB files written by CAMPARI. This can be useful to make PDB files of very large systems and in particular for changing the precision of PDB files. In order for CAMPARI to read these files back in, the analogous keyword PDB_INPUTSTRING has to be used. Because Fortran in general deals poorly with string-based I/O, any improper use of this keyword can easily lead to abnormal program termination. The default format for the controllable section of the standard layout (up to column 54) is "a6,i5,1x,a4,a1,a3,1x,a1,i4,a1,3x,3f8.3" (if you use this keyword, the format must be supplied with the quotes). The letters (a, i, f) give the type of variable, which must not change. The numbers give the field lengths, and these can be customized for variables of type integer ("i") or real ("f"). It is also possible to modify the field widths of string variables ("a") but that is likely harmful because the variables in use are tied to the default format. Note that the insertion code (the last "a1" element) is always written as a blank by CAMPARI since all residues are renumbered consecutively. The same is true for the alternate location indicator (the first "a1" element).
The vast majority of modifications to the output format will create files that are no longer readable (at least correctly) by other software, e.g., visualization software or other simulation codes, and it may be useful to use CAMPARI itself (or to write a script) to convert back these nonstandard files whenever needed.
Common problems with standard PDB files, which can be addressed at least in part by the format string, are that the integer number for atom index overflows, that the chain indicator becomes fused to neighboring columns (because of overlong residue names or large residue numbers), that the residue number column overflows, that the coordinate entries get fused or overflow (if absolute coordinates are not centered at small (in absolute magnitude) values), or that the coordinate precision is insufficient for recovering exact covalent geometries based on this information alone. These limitations, in particular the system size restrictions, have led to the proposed phasing out of the format by the RCSB itself. Unfortunately, the PDBx/mmCIF format, which replaces it, is not particularly convenient for simulation software, but eventually all software will have to provide conversion tools or adapt directly (planned for Version 6).
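To make the column layout implied by the default format string "a6,i5,1x,a4,a1,a3,1x,a1,i4,a1,3x,3f8.3" more tangible, here is a Python sketch that mimics it with format specifications. It is an illustration only: the function and its arguments are hypothetical, strings are assumed to be left-justified in their fields, and Python widens an overflowing integer field whereas Fortran fills it with asterisks.

```python
def format_atom_line(record, serial, name, altloc, resname, chain,
                     resnum, icode, x, y, z, serial_width=5, resnum_width=4):
    """Mimic 'a6,i5,1x,a4,a1,a3,1x,a1,i4,a1,3x,3f8.3' (columns 1-54).

    serial_width and resnum_width stand in for customized integer fields.
    """
    return (f"{record:<6.6s}{serial:{serial_width}d} {name:<4.4s}{altloc:1.1s}"
            f"{resname:<3.3s} {chain:1.1s}{resnum:{resnum_width}d}{icode:1.1s}"
            f"   {x:8.3f}{y:8.3f}{z:8.3f}")


# Made-up atom; alternate location and insertion code are blanks.
line = format_atom_line("ATOM", 1, " N  ", " ", "ALA", "A", 1, " ", 1.0, 2.0, 3.0)
print(line)
print(len(line))  # the controllable section ends at column 54

# Widening the atom index field (as i7 would) shifts all later columns by two.
wide = format_atom_line("ATOM", 1234567, " N  ", " ", "ALA", "A", 1, " ",
                        1.0, 2.0, 3.0, serial_width=7)
print(len(wide))
```

The second call shows why most modifications break compatibility: every field after the widened one starts at a different column than a standard PDB reader expects.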
If structural output is requested (→ XYZOUT), this logical keyword allows the user to suppress trajectory output for molecules labeled as solvent. This can be useful to down-convert trajectory files from explicit solvent runs or - more generally - to isolate certain parts of the system from existing trajectory data (employing PDBANALYZE and ANGRPFILE). It may also be used to save space during actual simulations but it should be kept in mind that information about the solvent may be lost irrevocably and that the resultant trajectories may no longer be straightforward to analyze. A more general option for printing only a static subset of the system is provided by supplying an index file via keyword TRAJIDXFILE (mutually incompatible). An alternative way to reduce the number of solvent molecules (but keep a subset of interest) in a dynamic manner is provided by keyword XYZ_PRUNE.
TRAJIDXFILE
Usage of keyword XYZ_SOLVENT in conjunction with the concept of analysis groups allows the user some amount of fine control over what is written to the trajectory file. In some scenarios this may not be enough (for example, if external scripts or software or even CAMPARI itself are meant to analyze non-trivial subsets of the system). Then, an input file supplied to TRAJIDXFILE is the workaround. It is relevant in two different contexts, either when XYZ_PRUNE is present and used or when it is not present or unused (0). In both cases, the file is a simple index file of atom indices. If XYZ_PRUNE is unused, the file offers simple per-atom control over what coordinate information is written to the trajectory file. This simply means that only the selected atoms appear in the output with no corrections applied to preserve residue or molecule integrity. This mode is strictly incompatible with XYZ_SOLVENT being 0 and takes precedence. Conversely, if XYZ_PRUNE is used, then the atom indices define a reference set of atoms, and molecules of the selected types will be selected dynamically in output based on proximity to that reference set. This proximity is evaluated at a pairwise minimum level (residue-centroid of molecules of selected types vs. atoms chosen through TRAJIDXFILE). This mode is partially compatible with XYZ_SOLVENT being 0: it will work as long as none of the molecule types tagged as solvent are under dynamic control; otherwise an error will be produced.
If XYZ_PRUNE is unused, the ease of use of reduced trajectories for subsequent trajectory analysis runs hinges on whether the selected subset preserves the integrity of all molecules to remain in the output or, if the output format is pdb, such that missing atoms can be re-built. This is explored in Tutorial 10.
For a description of the input file format, see here. Note that the subsetting of the output is applied to all structural output files (binary or not) except the files written for sanity checking at the beginning of a run. This might sometimes lead to errors when trying to use the automatic visualization scripts.
If structural output is requested (→ XYZOUT), this keyword allows the user to select molecule types to restrict the number of molecules of this type in the output to a dynamic set of fixed size. The most common application is to reduce water molecules in explicit solvent calculations and retain only those that are close to a target species of interest, e.g., an entire protein, a ligand binding site, or a particular macromolecular interface. Because of the constraints of trajectory files, the numbers per type and the order of the types in the output must be fixed. The numbering of molecule types is by occurrence in sequence input, as usual.
This keyword is one of the few expecting multiple inputs. The first should be an integer telling CAMPARI how many types should be put under dynamic control. This must be followed by two numbers each per such type: the type index and the target number. For example, "2 3 500 4 50" would ask that 500 molecules of type 3 are retained and 50 of type 4. All other molecule types are retained in full with the possible exception of those disabled by keyword XYZ_SOLVENT being 0 if they are tagged as solvent (see ANGRPFILE).
The subset to be retained is selected by proximity to a reference set of atoms. Specifically, for each residue in the molecule (usually just 1), the residue-centroid is computed, and its distance to all of the reference set's atoms is calculated. The minimum of those values is stored as the criterion for the molecule. Then, separately for each type, these distance criterion values are sorted, and the n nearest molecules are selected for output (where n is the number chosen, like the 500/50 above). The reference set can be supplied by the user with the help of keyword TRAJIDXFILE (it is a simple index file). If no file is supplied or found, CAMPARI will try to create an automatic reference subset. For each interesting residue (those that are not duplicated many times), up to three atoms are identified that span a triangle with the largest circumference (but one must be the reference atom of that residue). This is to account for the spatial extent of larger residues.
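The selection procedure described above can be sketched in a few lines of Python. This is an illustrative sketch only, not CAMPARI's implementation: the function names (`parse_xyz_prune`, `select_nearest`), the data layout, and the use of plain Euclidean distances without periodic images are all assumptions made for the example, and the automatic construction of a reference set is not covered.

```python
import math


def parse_xyz_prune(value):
    """Parse a multi-value entry such as '2 3 500 4 50' into {type: count}."""
    fields = [int(tok) for tok in value.split()]
    n_types, pairs = fields[0], fields[1:]
    if len(pairs) != 2 * n_types:
        raise ValueError("expected %d (type, count) pairs" % n_types)
    return {pairs[2 * i]: pairs[2 * i + 1] for i in range(n_types)}


def centroid(points):
    """Geometric center of a list of (x, y, z) tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def molecule_criterion(residues, reference_atoms):
    """Smallest distance between any residue centroid and any reference atom."""
    return min(
        math.dist(centroid(res), ref)
        for res in residues
        for ref in reference_atoms
    )


def select_nearest(molecules, reference_atoms, targets):
    """Pick the n nearest molecules separately for each dynamic type.

    molecules: list of (type_index, residues), where residues is a list of
               residues and each residue is a list of (x, y, z) tuples.
    targets:   {type_index: n} as returned by parse_xyz_prune.
    Returns {type_index: [molecule indices, nearest first]}.
    """
    kept = {}
    for mol_type, n_keep in targets.items():
        candidates = [i for i, (t, _) in enumerate(molecules) if t == mol_type]
        if len(candidates) < n_keep:
            raise ValueError("not enough molecules of type %d" % mol_type)
        candidates.sort(
            key=lambda i: molecule_criterion(molecules[i][1], reference_atoms)
        )
        kept[mol_type] = candidates[:n_keep]
    return kept
```

Because the retained molecules are chosen again for every frame, the same output slot can hold different molecules in different frames.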
There are a number of caveats. First, and this is always the case, molecule index identity is lost, which means that molecules appearing in the same place in the output files are not actually the same. This usually prevents all analyses relying on correlation in time. Second, and this is a related point, the procedure silently assumes strict equivalence between molecules meaning that a patch modifying a specific instance of that molecule type might render this dynamic subsetting misleading. Third, errors are produced if there are not enough molecules of a given type, if a selected type is tagged as solvent and XYZ_SOLVENT is set to 0, or if the reference set is missing and cannot be constructed automatically. Fourth, the output files would still be garbled if molecules in the output order are not grouped into their types exactly. For example, having alternating entries of water and urea in sequence input and retaining both of these types dynamically would be problematic since the sequence of nearest molecules will not preserve the alternating order. Thus, in these cases, the coordinates are reordered for output only such that molecules of the same type always appear in blocks. This means that users should rely only on the pdb file written at the beginning of a run for extracting atom indices, etc. Generally speaking, this scenario should be avoided when possible to minimize confusion. Fifth, while centering images around a reference molecule via XYZ_REFMOL is compatible, this leads to a very confusing visualization if the reference set is not also from the reference molecule.
If a system is simulated or analyzed that utilizes periodic boundary conditions, this keyword can be used to alter the standard CAMPARI way of placing atoms with respect to the unit cell. By default, CAMPARI will never break up molecules in trajectory output, which implies that the absolute coordinates in the trajectory file(s) can extend significantly beyond the formal boundary of the unit cell. Similarly, by default CAMPARI will assume that structural input preserves the integrity of molecules, i.e., that it conforms to exactly this standard. Sometimes (for example, for visualization or for certain analyses), it may be desired to instead have all atoms be inside the unit cell, and this is what this keyword accomplishes.
There are currently 4 different options related to both input and output.
- For both input and output, CAMPARI assumes that molecules are intact and intended to be left intact.
- For output, CAMPARI will leave molecules intact. For input, CAMPARI will assume that the coordinates are such that molecules have been forced to reside inside the central unit cell. Using box information, CAMPARI will calculate image shift vectors and apply them to the processed input. This option is incompatible with PDB_READMODE being 1. Image shift vectors are evaluated before possible tolerance violations are considered (see keywords PDB_TOLERANCE_B and PDB_TOLERANCE_A).
- For input, CAMPARI assumes that molecules are intact (no corrections applied). For output, it will force coordinates to reside inside the central unit cell by breaking up molecules as needed.
- This option combines the properties of options 1 and 2 for input and output, respectively.
Note that in some cases trajectory files with broken-up molecules may be ambiguous unless information about the expected topology is present or provided. The input strategy currently implemented works as long as the length of Z matrix bonds remains small in comparison to the box dimensions. Note that the start of the simulation from a pdb file is affected by this keyword as is the analysis of (binary) trajectory files.
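The two coordinate operations involved, forcing atoms into the central cell and reassembling molecules from such coordinates, can be illustrated as follows. This is a minimal sketch under loud assumptions: the box is rectangular, the atoms of a molecule are listed so that each atom is bonded to the previous one (a simplification of the Z matrix topology), and the function names are hypothetical. It is not CAMPARI's code.

```python
def wrap_into_cell(coords, origin, lengths):
    """Place every atom inside the central cell independently.

    This may break molecules apart across the periodic boundary.
    coords:  list of (x, y, z) tuples
    origin:  lower corner of the rectangular cell
    lengths: cell side lengths
    """
    return [
        tuple(o + (x - o) % l for x, o, l in zip(atom, origin, lengths))
        for atom in coords
    ]


def reassemble_chain(coords, lengths):
    """Undo per-atom wrapping for atoms listed in bonded order.

    Each atom is shifted by whole box lengths so that it lands on the image
    closest to the previous atom. This only works if bonds are short compared
    to the box dimensions, which mirrors the condition stated in the text.
    """
    fixed = [tuple(coords[0])]
    for atom in coords[1:]:
        prev = fixed[-1]
        fixed.append(
            tuple(x - l * round((x - p) / l) for x, p, l in zip(atom, prev, lengths))
        )
    return fixed
```

The second function shows why topology information is needed: without knowing which atoms are bonded, there is no unique way to decide which image of an atom belongs to the molecule.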
It is not recommended to use options 1 or 3 above for writing trajectories from a production simulation, rather the output feature is intended to transform preexisting trajectories (via trajectory analysis mode). Lastly, in the case of an analysis run, any structural input with entire molecules given as the "wrong" images will also be adjusted by options 1 and 3 above. This scenario should be avoided as it leads to inconsistencies in any operations relying on absolute coordinates. For options 0 and 2, output coordinates will not be altered by XYZ_FORCEBOX but they may still be altered by keywords XYZ_REFMOL and ALIGNCALC.
If a system is simulated or analyzed that utilizes periodic boundary conditions, this keyword can be used to alter the standard CAMPARI way of placing molecules in three contexts, viz., trajectory output, structural clustering using absolute Cartesian coordinates (RMSD), and, similarly, the derivation of the alignment operator for output trajectory alignment. The use of this keyword is explained primarily in the context of the first role.
By default, CAMPARI will never allow the geometric center (or center of mass in gradient-based simulation runs) to "leave" the central unit cell. When looking at intermolecular interfaces, this can lead to the undesirable effect of the interface being broken across the periodic boundary. These images often flicker back and forth, which makes visual inspection difficult unless periodic images are explicitly replicated. XYZ_REFMOL allows the user to specify a reference molecule whose center serves as reference point for all images, i.e., the coordinates of all other molecules printed to trajectory output are those of the nearest image of these molecules with respect to the chosen reference. This operation does not destroy information (i.e., it does not center or align anything) but leads to molecules being displayed that are outside of the central unit cell. In fact, the reference molecule is the only one that is guaranteed to reside in the central cell at all times.
Note that this keyword does not actually alter coordinates used internally, and therefore has no impact on the majority of analysis functions, etc. The only exceptions are structural clustering relying on absolute Cartesian coordinates (options 5, 6, and 10 for CDISTANCE) and the trajectory alignment facility. For the latter XYZ_REFMOL simply picks the reference molecule to override the internal heuristic. In both scenarios, the alignment operator is derived using image-corrected coordinates for all conformations in question. This role of XYZ_REFMOL is distinct but not separable from the role for trajectory output (i.e., it is not possible to use XYZ_REFMOL to pick the reference molecule for image selection for clustering or trajectory alignment while retaining the default trajectory output). XYZ_REFMOL is also ignored for the pdb files written at the beginning and end of a simulation. Along similar lines, trajectory files created in such a manner can be read back by CAMPARI without problems (internally, every molecule is translated to the central unit cell upon read-in as long as the box information (BOUNDARY, SIZE, and ORIGIN) is preserved). Note that XYZ_REFMOL relies either on geometric centers or on centers of mass, depending on the specific simulation or analysis settings.
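The nearest-image selection relative to a reference molecule can be sketched as below. The sketch assumes a rectangular box and uses geometric centers throughout (a mass-weighted center would take its place where centers of mass apply); the names `center` and `nearest_image` are hypothetical and not part of CAMPARI.

```python
def center(coords):
    """Geometric center of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(a[k] for a in coords) / n for k in range(3))


def nearest_image(molecule, ref_center, lengths):
    """Translate a whole molecule by box lengths so that its center becomes
    the periodic image closest to ref_center. The molecule stays intact."""
    c = center(molecule)
    shift = tuple(
        -l * round((ci - ri) / l) for ci, ri, l in zip(c, ref_center, lengths)
    )
    return [tuple(x + s for x, s in zip(atom, shift)) for atom in molecule]
```

For a reference center at x = 9 in a box of length 10, a molecule centered at x = 1 is moved to x = 11, i.e., outside the central cell but next to the reference.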
If a system is simulated or analyzed that utilizes periodic boundary conditions in a triclinic unit cell, this simple logical keyword (1 means "yes") can be used to transform coordinates before depositing them to trajectory output. Specifically, all coordinates are rotated around the user-defined origin so that the first two box vectors (BOXVECTOR1 and BOXVECTOR2) lie in the xy-plane (positive x- and y-coordinates, respectively) and so that the third box vector (BOXVECTOR3) forms a right-handed coordinate system. This is a convenience option, and it never affects the reference pdb files written at the beginning and end. It is provided because many simulation and visualization software packages automatically assume this reference arrangement while just taking box vector lengths and angles as inputs.
ALIGNCALC
In trajectory analysis runs CAMPARI offers the option to structurally superpose the current Cartesian coordinates to a suitable reference. Note that this functionality is conveniently available through almost all molecular visualization software packages. CAMPARI provides automatically generated visualization scripts designed to work with VMD. If these options are unavailable or inconvenient, for example, because the visualization program tries to read an entire very large data set into memory, ALIGNCALC lets the user set the interval at which CAMPARI should perform structural alignment. For example, to create - from an original trajectory - a superposed trajectory of every 10th frame, XYZOUT would have to be 10 and ALIGNCALC would have to be 10 or a factor of 10 (5, 2, or 1).
For convenience, the root mean square deviation over the alignment set after alignment can be written to an instantaneous output file. This can be enabled by specifying a negative number to ALIGNCALC, which is, except for the sign, interpreted in the same way. Sometimes it may also be desirable to align on one set of atoms and compute RMSD values for another set. CAMPARI supports this, if in addition to an appropriate choice of alignment set, the user provides another index set via keyword CFILE. This second input then becomes an additional set to compute RMSD values for. This is the same logic as found in RMSD-based structural clustering with split sets (see option 6 for keyword CDISTANCE).
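The relationship between XYZOUT and ALIGNCALC, including the meaning of the sign, can be checked with a small helper. This is a sketch only: `alignment_plan` is a hypothetical name, and it assumes that a step is selected whenever the step number is divisible by the respective interval.

```python
def alignment_plan(xyzout, aligncalc):
    """Summarize how a pair of XYZOUT and ALIGNCALC values would interact."""
    interval = abs(aligncalc)
    if interval == 0 or xyzout <= 0:
        raise ValueError("both intervals must be nonzero")
    return {
        # the sign of ALIGNCALC is ignored for the interval itself
        "align_interval": interval,
        # a negative value additionally requests RMSD output
        "write_rmsd": aligncalc < 0,
        # every written frame is aligned only if the alignment interval
        # divides the output interval
        "every_written_frame_aligned": xyzout % interval == 0,
    }
```

With XYZOUT set to 10, values of 10, 5, 2, or 1 (or their negatives) for ALIGNCALC satisfy the last condition, whereas 3 does not.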
Alignment happens before any of the analysis routines are called and works by first defining a reference set of atom indices (→ ALIGNFILE). It can be somewhat time-consuming and is currently not aided in any way by CAMPARI's shared memory (OpenMP) parallelization. Using a quaternion-based algorithm, an optimal translation and rotation is determined that minimizes - when applied to the current coordinates - the deviation between the transformed current coordinates and the reference coordinates (i.e., a set of coordinates for all atoms in the alignment set). Note that this procedure will always preserve the internal state of molecules and - except for certain cases in periodic boundary conditions - the relative arrangement of molecules. It will not, however, preserve the relative position of the system boundary. This may lead to artifacts in energetic analyses of aligned trajectories or any analyses that rely upon relative, intermolecular coordinates.
There are two ways of defining/providing the coordinates for the alignment set. The first is via an external file. Here, CAMPARI reuses the pdb-template functionality. If keyword PDB_TEMPLATE is specified and successfully read, then the reference coordinate set is extracted from this file for the set of atoms defined via ALIGNFILE. Note that the template may serve a double purpose in this scenario as it may still provide the atom numbering map needed to read binary trajectory formats with non-CAMPARI atom order. If no template is specified, the reference coordinates will be defined by the previously aligned structure. This successive alignment therefore uses a different reference coordinate set each time and will consequently lead to drift.
As described for keyword XYZ_REFMOL, the combination of periodic boundary conditions and multimolecular assemblies can become ambiguous in terms of absolute coordinates. By default, CAMPARI will scan the alignment set and use the molecule with the largest number of contributing atoms as the reference one. This choice can be overridden by keyword XYZ_REFMOL. It can be confusing that the alignment operator is derived for the image-corrected coordinate sets yet the transformation is applied to the coordinates in their default state. This is relevant if molecules are not found in the central unit cell in the trajectory files. In this scenario, the output gets particularly difficult to interpret if XYZ_REFMOL and/or XYZ_FORCEBOX modify the output coordinates as well (not recommended).
If system alignment is possible and requested (→ ALIGNCALC), this keyword allows the user to supply the path and name of a mandatory input file containing an atomic index list defining the set of atoms to align on. For example, in the simulation of a macromolecule with co-solutes it will not be meaningful to use the entire set of atoms in the system as the alignment set since the randomly dispersed co-solutes will dominate the alignment. Instead, one will typically want to only supply nonsymmetric protein atoms here.
This keyword serves a second purpose, viz., if structural clustering is requested, and if an RMSD distance criterion with differing alignment and distance atom index sets is desired, this keyword lets the user specify the input file with the alignment set. Simultaneous use of both functionalities is permitted. The converse is also possible, i.e., to specify an additional distance set for RMSD evaluation and instantaneous output in the same logic. Then, keyword CFILE can be used to specify this additional distance set. Lastly, note that any set used for alignment must consist of at least three atoms.
This keyword determines in coarse fashion what information, per screened molecule, is written to the running output file of a small molecule screen. The options available are as follows:
- The coordinates corresponding to the respective minimum energy conformations are written. In addition, coordinates corresponding to cluster centroids of clusters exceeding the desired threshold are provided.
- This is the same as option 1 only that in addition the final conformation is written to file as well.
- Only the respective minimum energy conformations are written to the mol2 file.
- Only the respective final conformations are written to file.
- With this option, the output to the mol2 file can be synchronized with the standard trajectory output enabled by keyword XYZOUT. This means that the file is appended periodically (respecting keyword EQUIL for each individual calculation).
The general policy of CAMPARI-written mol2 output files is that they preserve the input file structure and content for the 3 main sections (MOLECULE, ATOM, BOND) as much as possible. Other sections are not preserved (like SUBSTRUCTURE) or generated from scratch (like ALT_TYPE). For bonds, no bond annotation information (like "ar", "am", or "2") is provided for those bonds that are part of the auxiliary set but is preserved otherwise. Moreover, in the ATOM section, the substructure and substructure index fields can be utilized in different manners, which is controlled by keyword MOL2SUBPRT. In addition, there are minor variations, e.g., molecule names are appended to distinguish the various poses and tetherings listed above for the same molecule, or the partial charge set is always given as "USER_CHARGES" in the MOLECULE section.
This keyword works exactly like keyword TRAJIDXFILE only that it selects a subset of additional atomic positions to be written to the running output file of a small molecule screen. The indexing as usual is with respect to CAMPARI's internal numbering.
A special option for MOL2AUXINDEX is given by "automatic" (without the double quotes). In this mode, CAMPARI will analyze the constraints specified by keywords FMCSC_FRZFILE, FMCSC_TMD_UNKMODE, and FMCSC_MOL2FOCUS and determine a suitable index list automatically. This does not respond correctly to constraints introduced exclusively by move set choices in pure Monte Carlo runs. The index list is written to output file MOL2EXTRAATOMS.idx. Since the resultant mol2 files no longer contain just small molecule information, their (re)processing by CAMPARI requires specifying the same index list as a separate input file (and the receptor to be present), i.e., for keyword MOL2AUXINPUT.
This keyword sets the priority for what to write to the substructure fields in the ATOM section of a running output file of a small molecule screen. The options available are as follows:
- Prioritize residue information over solvation group information over charge group information.
- Prioritize residue information over charge group information over solvation group information.
- Prioritize solvation group information over charge group information over residue information.
- Prioritize charge group information over solvation group information over residue information.
This keyword sets how often to compute and write current system-wide polymeric variables (→ POLYMER.dat). This instantaneous output can be useful to easily monitor structural changes (such as dimerization events) in dilute systems with heterogeneous density. It is completely uninformative for systems with homogeneous density. For simulations of a single polymer chain, distributions of polymeric order parameters as well as correlation functions can be computed from the output in POLYMER.dat. When CAMPARI's shared memory (OpenMP) parallelization is in use, the computation of the system-wide gyration tensor and related properties as well as the writing to POLYMER.dat are done by a single thread. This happens with other threads performing other tasks concurrently (if there are any).
POLCALC
This keyword lets the user specify the frequency with which values for polymeric properties incurring low computational cost are computed. These data are collected and reported resolved by analysis group and include characteristic values for shape and size, histograms of end-to-end distances, etc. If this keyword is set such that polymeric analyses are performed, several output files are generated (→ POLYAVG.dat, RGHIST.dat, RETEHIST.dat, and RDHIST.dat). Furthermore, POLCALC controls the interval for data collection to obtain averages of the suitably defined angular correlation function along the polymer backbone, which may be related to the intrinsic stiffness or persistence length of the polymer (→ PERSISTENCE.dat and TURNS_RES.dat). Lastly, this keyword controls the frequency for the computation and averaging of molecular, radial density profiles, i.e., the mass distribution function along the radial coordinate originating from each molecule's center of mass considering only atoms belonging to that molecule (→ DENSPROF.dat). This quantity is used in Lifshitz-type polymer theories. When CAMPARI's shared memory (OpenMP) parallelization is in use, these calculations are performed by a single thread while other threads perform other tasks concurrently (if there are any).
RHCALC
Since the computation of comprehensive polymer-internal distances is more expensive, this dedicated keyword controls the data collection interval for analyses relying on such data. A comprehensive set of internal distances in CAMPARI is used to compute three quantities:
- An alternative estimate of the polymer's spatial size which is sometimes related to the hydrodynamic radius (→ corresponding entry in POLYAVG.dat; note that should RHCALC be set such that no analysis is performed but POLCALC be chosen such that the other quantities in POLYAVG.dat are computed and provided, the corresponding column must be ignored).
- A scaling profile of the internal distances with distance of separation in primary sequence (→ INTSCAL.dat).
- The scattering (Kratky) profile of the polymer (→ KRATKY.dat; this relies on the additional frequency setting SCATTERCALC).
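The dependence of the scattering profile on both RHCALC and SCATTERCALC amounts to a nested schedule: SCATTERCALC counts only among the steps already selected by RHCALC, so the effective interval is the product of the two settings. The sketch below illustrates this. It assumes that a step is selected when its number is divisible by RHCALC and that every SCATTERCALC-th of those steps is used; the function name is hypothetical.

```python
def scattering_steps(n_steps, rhcalc, scattercalc):
    """Return the steps at which scattering data would be accumulated."""
    # steps at which internal distances are computed at all
    rh_steps = [s for s in range(1, n_steps + 1) if s % rhcalc == 0]
    # every scattercalc-th entry of that list
    return rh_steps[scattercalc - 1::scattercalc]


# RHCALC 10 and SCATTERCALC 20 give an effective interval of 200 steps
print(scattering_steps(600, 10, 20))  # [200, 400, 600]
```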
As alluded to above, this keyword sets an auxiliary frequency for the calculation of scattering properties resolved by analysis group (→ KRATKY.dat). This requires computing Fourier transforms of internal distances for a series of wave vectors and is consequently a very expensive calculation. Due to the coupling to the computation of internal distances (see RHCALC), this keyword is not interpreted like the other interval keywords (???CALC). Instead, SCATTERCALC sets the calculation interval amongst only those steps chosen already via RHCALC. For example, if RHCALC is 10 and SCATTERCALC is 20, then scattering data will be accumulated every 200 steps. The data in KRATKY.dat can be used to compare simulation data directly to experiment. In a double-logarithmic plot, it may also be possible to identify linear regimes ("power law regime" in contrast to the "Guinier regime" for smaller wave vectors) which can be fit to yield the scaling exponent for fractal objects. Conversely, for globular polymers, Porod's law may hold.
SCATTERRES
Since the required number of points and range of wave vectors for the prediction of scattering profiles may be system-dependent, this keyword allows the user to adjust the spacing of wave vectors assuming scattering data are being calculated at all (→ RHCALC and SCATTERCALC). The first wave vector's absolute magnitude q=|q| will always be 0.5·SCATTERRES with units of Å⁻¹. In general, the larger the chain, the smaller the absolute magnitudes of wave vectors needed.
SCATTERVECS
Since the required number of points and range of wave vectors for the prediction of scattering profiles may be system-dependent, this keyword allows the user to adjust the total number of employed wave vectors assuming scattering data are being calculated at all (→ RHCALC and SCATTERCALC). Together with SCATTERRES, this determines the range of the wave vectors. Note that generally a coarse resolution (and hence a small number of vectors) is sufficient as scattering profiles tend to be very smooth functions.
HOLESCALC
For polymers it may be interesting to analyze the distribution of "internal" void spaces. In CAMPARI, a rudimentary analysis routine exists which attempts to place spheres of varying size at different distances from the molecule's center-of-mass and to record whether any overlap with part of the polymer is encountered. This analysis is recorded in instantaneous output (HOLES.dat), and the latter needs to be post-processed. Note that this analysis is restricted to simulations of monomeric polymers. When CAMPARI's shared memory (OpenMP) parallelization is in use, it is performed by a single thread while other threads perform other tasks concurrently (if there are any).
RGBINSIZE
If standard polymeric analyses are performed (→ POLCALC), this keyword sets the size of the bins in Å for the three output files RGHIST.dat, RETEHIST.dat, and DENSPROF.dat. It therefore determines the resolution along the radius of gyration or related axes.
POLRGBINS
If standard polymeric analyses are performed (→ POLCALC), this keyword can be used to set the number of bins of size RGBINSIZE for the three output files RGHIST.dat, RETEHIST.dat, and DENSPROF.dat. Since quantities like the radius of gyration or end-to-end distances are strongly system-dependent, it is up to the user to ensure the appropriate number of bins. Note that - just like all other histograms in CAMPARI - terminal bins will be overstocked should range exceptions occur.
PHOUT
This keyword controls how often to output ionization states of certain ionizable residues. Currently, this analysis relies on pseudo-Monte Carlo moves (see PHFREQ) to work and is therefore only available in straight MC runs. Further limitations are listed in the descriptions of sampler and output file.
ANGCALC
This keyword lets the user define how often to extract polypeptide backbone torsion angle statistics, i.e., how often to go through all non-terminal polypeptide residues and bin values for the φ/ψ-angles into a two-dimensional histogram. This keyword also controls the data collection frequency for estimation of vicinal NMR J-coupling constants (HN to Hα → JCOUPLING.dat). The Ramachandran analysis itself is reported globally in a file called RAMACHANDRAN.dat. Due to the system-wide averaging (including over molecules of different type), this is probably most meaningful for simulations of single homopolymers. For more detailed control, further output files may be obtained: residue-specific as well as analysis group-specific maps should requests have been provided via keywords RAMARES and RAMAMOL, respectively. When CAMPARI's shared memory (OpenMP) parallelization is in use, these analyses are performed by a single thread while other threads perform other tasks concurrently (if there are any).
ANGRES
This keyword matters only if ANGCALC is chosen such that polypeptide backbone φ/ψ-statistics are accumulated. If so, it sets the resolution in degrees for such angular distribution functions. The smallest permissible value at the moment is 1.0°.
RAMARES
This keyword matters only if polypeptide φ/ψ-analysis is requested (→ ANGCALC). If so, it allows the user to monitor the distributions specifically for selected polypeptide residues in the system. The first entry, which defaults to zero, specifies the number of such specific requests. The user then has to provide the appropriate number of integer values (residue numbers as defined per sequence input) on that same line in the key-file. The maximum number for individually monitored residues is limited to 1000, but even this requires increasing the default string length CAMPARI assumes (in a file called macros.i) during compilation. Successful requests (those pointing to non-polypeptide, non-existing, or terminal residues will be ignored) will create output files like "RESRAMA_00024.dat".
RAMAMOL
This keyword is exactly analogous to RAMARES only that it operates not on residues but on analysis groups (all residues of all molecules in that analysis group are pooled, numbering as reported initially in the log-output). It will create files like "MOLRAMA_00002.dat".
INTCALC
This keyword sets how often to compute comprehensive statistics for typical internal coordinates of the system, i.e., all bond lengths, angles, torsional angles, as well as improper torsional angles (trigonal-planar centers - consult PARAMETERS for further details). Note that molecular topology defines which atom pairs - for example - share a bond. With this analysis, it is therefore not possible to analyze arbitrarily defined distances, angles, and torsion angles in the system. If turned on, up to five different output files are provided, namely INTERNAL_COORDS.idx, INTHISTS_BL.dat, INTHISTS_BA.dat, INTHISTS_DI.dat, and INTHISTS_IM.dat. When CAMPARI's shared memory (OpenMP) parallelization is in use, all of these analyses are performed (in sequence) by a single thread while other threads perform other tasks concurrently (if there are any).
WHICHINT
This is one of the few keywords expecting multiple inputs and matters only if internal coordinate analysis is requested (→ INTCALC). Four integers should be provided and each one is interpreted as a logical to turn on individual groups of internal coordinate analyses. The first turns on the calculation of bond length histograms, the second that of bond angle histograms, the third that of improper dihedral angle histograms, and the fourth that of proper torsional angle histograms. Note that the number of possible internal coordinates quickly exceeds the number of atoms for any complex molecule. These analyses can therefore easily become fairly time-consuming as well as data-rich (in terms of the sizes of the output files). This is one of the reasons for introducing this selection mechanism. The other lies simply in the fact that in any simulation using CAMPARI-typical torsional space constraints (see CARTINT), analysis of bond length, bond angle, and improper dihedral distributions is meaningless.
SEGCALC
This keyword lets the user specify the interval how often to scan the polypeptide backbone for stretches of similar secondary structure (as defined in the specified file through FMCSC_BBSEGFILE). The annotation - in contrast to DSSP - is obtained purely on torsional criteria and relies on defining consensus regions within φ/ψ-space. These consensus definitions are found in a supplied data file (→ BBSEGFILE). At the end of the simulation results are written to files named BB_SEGMENTS_NORM.dat, BB_SEGMENTS_NORM_RES.dat, BB_SEGMENTS.dat, and BB_SEGMENTS_RES.dat This analysis is resolved by analysis group and useful to identify coarse secondary structure propensities in polypeptides. As an example, the data in BB_SEGMENTS_NORM_RES.dat can be used to compute parameters of the helix-coil transition according to the Lifson-Roig formalism (see for example Tutorial 3 or this reference). SEGCALC also controls the computation of global (at a molecular level) secondary structure order parameters fα and fβ (which are also used for the corresponding bias potentials → SC_ZSEC used in Tutorial 9 or this reference). Various distribution histograms are written to files ZSEC_HIST.dat, ZAB_2DHIST.dat, and ZBETA_RG.dat. Analysis of these order parameters is similarly performed in analysis group-resolved fashion. When CAMPARI's shared memory (OpenMP) parallelization is in use, both of these analyses are performed sequentially by a single thread while other threads perform other tasks concurrently (if there are any).DSSPCALC
This keyword specifies how frequently to perform DSSP analysis. DSSP is a secondary structure assignment procedure for proteins (reference). All eligible (i.e., full peptide) residues are scanned for backbone-backbone hydrogen bond patterns and various statistics and running output is provided if so desired (see DSSP_NORM_RES.dat, DSSP_NORM.dat, DSSP.dat, DSSP_RES.dat, DSSP_HIST.dat, DSSP_EH_HIST.dat, and DSSP_RUNNING.dat). The DSSP results typically complement the results from backbone segment statistics (see for example BB_SEGMENTS_NORM_RES.dat) well as the former are based exclusively on hydrogen bond patterns while the latter are based exclusively on dihedral angles. When CAMPARI's shared memory (OpenMP) parallelization is in use, the DSSP analysis is performed by a single thread while other threads perform other tasks concurrently (if there are any). Similar to contact analyses, the determination of the hydrogen bond patterns scales, at some level, poorly with system size. They thus can become performance-limiting, which will similarly limit parallel efficiency.INSTDSSP
If DSSP analysis is requested (→ DSSPCALC), this keyword is interpreted as a simple logical that controls whether to write out running traces of the full DSSP assignment for the current snapshot (see DSSP_RUNNING.dat). This can be useful when analyzing input trajectories or even individual pdb-structures with CAMPARI. Instantaneous DSSP output is currently not supported for MPI-averaging calculations (see MPIAVG). This output file can easily become very large, and it is possible for significant I/O lag to occur because of this.
DSSP_MODE
Based on DSSP analysis (→ DSSPCALC), the code computes two order parameters to measure canonical secondary structure content. The E-score corresponds to the β-content and the H-score to the α-content. They are system-wide quantities and are computed as follows:
$$\text{E-score} = \text{E-fraction} \cdot \left( \text{H-bond-Score}_E \right)^{1/n}$$
$$\text{H-score} = \text{H-fraction} \cdot \left( \text{H-bond-Score}_H \right)^{1/n}$$
Here, E-fraction and H-fraction are simply the fractions of residues which are assigned E or H according to DSSP. n is an arbitrary scaling exponent (see DSSP_EXP). H-bond-Score_E is a continuous variable which measures the mean quality of the hydrogen bonds forming the β-sheets in the system and H-bond-Score_H is the analog for α-helices. In principle, all the hydrogen bond energies are collected and divided by the value for the same number of good hydrogen bonds (see DSSP_GOODHB). The quantity can be capped, however, based on the choice for DSSP_MODE:
- Every hydrogen bond can maximally contribute the value of DSSP_GOODHB. Therefore, H-bond-Score_X never exceeds unity and only reaches unity if each and every relevant H-bond is at least as favorable as the cutoff given by DSSP_GOODHB. This is the most stringent score. The resultant X-scores will always be less than or equal to the corresponding X-fractions.
- Every hydrogen bond can maximally contribute DSSP_MINHB, which is always more negative than DSSP_GOODHB. The value of H-bond-Score_X, however, is capped to be at most unity. In this score, very strong H-bonds can compensate for the effects of a few weak ones, but the value for X-score is still capped by the corresponding X-fraction.
- Every hydrogen bond can maximally contribute DSSP_MINHB. The value of H-bond-Score_X is not capped and can adopt values larger than unity. The X-score is capped, however, to never exceed unity. This is the most lenient score and the only one in which X-score can exceed the value of X-fraction.
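The three capping schemes can be illustrated with a minimal Python sketch. It is not CAMPARI's implementation: the function names, the numbering of the modes as 1 to 3 in the order listed above, and the numerical values used for DSSP_GOODHB, DSSP_MINHB, and DSSP_EXP are assumptions made purely for illustration.

```python
def hbond_score(energies, good_hb, min_hb, mode):
    """Mean H-bond quality relative to a 'good' H-bond.

    energies: H-bond energies in kcal/mol (negative = favorable).
    mode: 1, 2, or 3, assumed to follow the order of the list above.
    """
    if not energies:
        return 0.0
    # Per-bond cap: DSSP_GOODHB in mode 1, DSSP_MINHB in modes 2 and 3.
    cap = good_hb if mode == 1 else min_hb
    capped = [max(e, cap) for e in energies]
    # Divide by the value for the same number of "good" H-bonds.
    score = sum(capped) / (len(capped) * good_hb)
    if mode == 2:
        score = min(score, 1.0)  # the H-bond score itself is capped at unity
    return score


def dssp_score(fraction, energies, good_hb=-2.0, min_hb=-5.0, exponent=3, mode=1):
    """X-score = X-fraction * (H-bond-Score_X)^(1/n); defaults are hypothetical."""
    score = fraction * hbond_score(energies, good_hb, min_hb, mode) ** (1.0 / exponent)
    if mode == 3:
        score = min(score, 1.0)  # only the final X-score is capped
    return score


energies = [-1.0, -3.0, -6.0]  # made-up H-bond energies
for mode in (1, 2, 3):
    print(mode, dssp_score(0.5, energies, exponent=1, mode=mode))
```

With these made-up energies, only the third scheme yields a score above the underlying fraction of 0.5, which matches the ordering from most stringent to most lenient described above.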
DSSP_EXP
For the DSSP analysis in CAMPARI (→ DSSPCALC), this keyword chooses the integer scaling exponent for the H-bond term in computing E- and H-scores (see DSSP_MODE).
DSSP_GOODHB
For the DSSP analysis in CAMPARI (→ DSSPCALC), this keyword defines the standard energy for a "good" hydrogen bond. This is used to evaluate the smoothed E- and H-scores (see DSSP_MODE) and is not part of the original DSSP standard. Permissible values lie between -1.0 and -4.0 kcal/mol.
DSSP_MINHB
For DSSP analysis (→ DSSPCALC), this keyword specifies the minimal (= lowest possible = most favorable) energy for any hydrogen bond. Since the DSSP-formula is based on inverse distances, it is useful to introduce this lower cap such that conformations with steric overlap do not overly bias the analysis (for example in pdb-analyses → PDBANALYZE). Permissible values lie between -10.0 and -4.0 kcal/mol.
DSSP_MAXHB
For DSSP analysis (→ DSSPCALC), this keyword allows the user to define the maximal (= highest possible = least favorable) energy for any hydrogen bond. This is the fundamental cutoff for DSSP to consider H-bonds and therefore a very important quantity for the analysis to be meaningful. The recommended value is -0.5 kcal/mol but values between -1.0 and 0.0 kcal/mol are allowed.
DSSP_CUT
For DSSP analysis (→ DSSPCALC), this keyword defines the distance cutoff applied to the Cα-atoms of two peptide residues to consider them for hydrogen bonds. This can be relatively short (defaults to 10 Å) but the accuracy hinges on the choice for DSSP_MAXHB. Consistency has to be ensured by the user. Using a Cα cutoff for pre-screening of residue pairs significantly reduces the computation time needed by the DSSP analysis.
CONTACTCALC
This keyword specifies how often to perform contact analysis, i.e., how often to get information about which and how many solute residues are close to each other. Such contacts are generally calculated according to two definitions in CAMPARI: by considering center-of-mass distances and by considering minimum atom-atom distances (both applied to pairs of residues). The output includes a map of average contact frequencies (CONTACTMAP.dat), histograms of contact numbers (CONTACT_HISTOGRAMS.dat), and a dependent analysis of solution structure by molecule (CLUSTERS.dat, MOLCLUSTERS.dat, and COOCLUSTERS.dat). The last analysis relies on an additional keyword: CLUSTERCALC. Note that these analyses are always restricted to residues of molecules tagged as solutes (→ FMCSC_ANGRPFILE) in order to facilitate frequent contact analysis even if solvent molecules are explicitly represented (which may be prohibitively expensive otherwise). When CAMPARI's shared memory (OpenMP) parallelization is in use, contact analysis is performed by a single thread while other threads perform other tasks concurrently (if there are any). Similar to DSSP analyses, the determination of spatial proximity patterns scales, at some level, poorly with system size and can thus become performance-limiting, which will similarly limit parallel efficiency.
CLUSTERCALC
This keyword (along with CONTACTCALC) controls the computation frequency for solute cluster statistics (i.e., cluster sizes, cluster contact orders, and molecule-resolved cluster statistics) where a cluster is defined through the minimum atom-atom distance contact definition (between any pair of residues). Note that this is the interval at which to perform cluster analysis from within the calculation of contacts (i.e., CLUSTERCALC is relative to CONTACTCALC, as SCATTERCALC is to RHCALC). The reason is that the cluster detection algorithm relies on the determination of contacts but that it may not always be a meaningful analysis to perform (see CLUSTERS.dat, MOLCLUSTERS.dat, and COOCLUSTERS.dat for further details on the output).CONTACTOFF
If contact analysis is requested (→ CONTACTCALC), this keyword defines a sequence-space offset to exclude neighboring residues from the analysis. For topologically connected systems (i.e., polymer chains) data for near-neighbor contacts such as i↔i+1 may be uninformative as they will always be in contact on account of the underlying topology. Note that the omission only applies to intramolecular contacts. Setting this to zero includes everything (even i↔i), and any larger integer lets the analysis start from this distance. The default here is zero, and there is rarely a reason to change it.CONTACTMIN
For contact and cluster analysis (→ CONTACTCALC), this keyword provides the threshold value for a residue-residue contact in Å. Here, the threshold is applied to the minimum distance between any arbitrary pair of atoms formed by the two residues in question. This defaults to 5.0 Å. Note that this computationally more expensive definition has the advantage of rendering the contact probabilities more or less size-independent for polyatomic residues. In the presence of excluded volume interactions, monoatomic residues (ions) of different size will still yield contact statistics which include physically meaningless biases, however.
CONTACTCOM
For contact and cluster analysis (→ CONTACTCALC), this keyword gives the alternative threshold value for a residue-residue contact in Å. Here, the threshold applies to the distance between the centers of mass of the two residues in question. It also defaults to 5.0 Å. Note that (in the presence of excluded volume interactions) contact probabilities obtained this way are by design dependent on the size of the interacting residues and results may be misleading if contact statistics between pairs of residues with highly variable size are compared.
PCCALC
This keyword allows the user to specify how often to perform pair correlation analysis, i.e., to collect distance counts for a variety of intra- and intermolecular distances and, in the case of intermolecular distances, to normalize them properly by the current volume element for the cases where the analytical result has been implemented in CAMPARI (rectangular, periodic boxes and spherical droplets). It controls the computation frequency for three different classes of distance distributions:
- Generic intramolecular amide-amide distributions covering various acceptor-donor pairs, as well as a centroid-centroid distribution (→ AMIDES_PC.dat), only relevant for polypeptide systems.
- Generic intermolecular pair correlation functions for solutes (→ RBC_PC.dat), only relevant for systems with more than one solute. Note that this option can consume inordinate amounts of memory should a lot of different solute types be present. Workarounds consist of disabling this analysis or of using the analysis group feature to redeclare most of those as solvent molecule types and to use specific atom-atom distributions instead.
- Specific atom-atom distributions and/or pair correlation functions as defined through an index file provided by keyword PCCODEFILE (→ GENERAL_PC.dat).
When CAMPARI's shared memory (OpenMP) parallelization is in use, each of these 3 analyses is treated independently. This means that each one is executed by a single thread while other threads perform other tasks concurrently (if there are any).
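The idea of normalizing intermolecular distance counts by a volume element can be sketched in a few lines of Python. This is a simplified illustration and not CAMPARI's algorithm: it assumes identical point particles in a cubic periodic box, uses the minimum image convention, and takes a plain spherical shell as the volume element, which is only valid up to half the box length. The function name `pair_correlation` and its arguments are hypothetical; `bin_size` plays the role of PCBINSIZE.

```python
import math


def pair_correlation(coords, box_length, bin_size, r_max):
    """Radial pair correlation g(r) for points in a cubic periodic box.

    Assumes r_max <= box_length / 2 so that a spherical shell is the
    correct volume element for every bin.
    """
    n = len(coords)
    n_bins = int(r_max / bin_size)
    counts = [0] * n_bins
    for i in range(n - 1):
        for j in range(i + 1, n):
            d2 = 0.0
            for a, b in zip(coords[i], coords[j]):
                d = a - b
                d -= box_length * round(d / box_length)  # minimum image
                d2 += d * d
            k = int(math.sqrt(d2) / bin_size)
            if k < n_bins:
                counts[k] += 1
    volume = box_length ** 3
    n_pairs = n * (n - 1) / 2
    g = []
    for k, count in enumerate(counts):
        r_lo = k * bin_size
        r_hi = r_lo + bin_size
        shell = 4.0 / 3.0 * math.pi * (r_hi ** 3 - r_lo ** 3)
        # Observed count divided by the count expected for an ideal gas.
        g.append(count * volume / (n_pairs * shell))
    return g


# Two particles that are 1 Å apart across the periodic boundary.
print(pair_correlation([(0.5, 0.0, 0.0), (9.5, 0.0, 0.0)], 10.0, 0.5, 5.0))
```

Beyond half the box length, or in a spherical droplet, the shell volume has to be replaced by the appropriate analytical expression, which is the part the keyword description refers to.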
If pair correlation analysis is requested (→ PCCALC), this keyword allows the user to disable the computation of intramolecular amide-amide distance distribution functions (→ AMIDES_PC.dat) by setting it to zero.
PCBINSIZE
This keyword specifies the distance bin size in Å for pair correlation analysis (→ PCCALC).
PCCODEFILE
This keyword specifies the path and filename of the input file for requesting specific pair correlation or distance distribution analyses (see FMCSC_PCCODEFILE). It is also possible to generate instantaneous traces for the selected distances with keyword INSTGPC. In general, the input is rather flexible and it is possible to pool many analogous or even unrelated atom-atom distances under a certain code or to use unique codes for very specific requests. Upon successful parsing of the input and given that pair correlation analysis is globally requested (→ PCCALC), the output file GENERAL_PC.dat is created.
GPCREPORT
This logical keyword instructs CAMPARI whether or not to write out a summary of the terms requested through FMCSC_PCCODEFILE (→ GENERAL_PC.idx). It is only available if distance distribution / pair correlation analysis is in use (→ PCCALC).
INSTGPC
This keyword lets the user instruct CAMPARI how often to print out instantaneous values for all the specific distances selected via FMCSC_PCCODEFILE. Note that this does not include the generic distances CAMPARI analyzes, and consequently the keyword has no effect if no usable input has been provided via FMCSC_PCCODEFILE or of course if pair correlation analysis is not in use. This keyword is understood as a dependent frequency, i.e., a setting of 1 will print instantaneous values for every PCCALCth step. Note that this feature is disabled by default and that the output in GENERAL_DIS.dat can easily become large.SAVCALC
This keyword specifies how often to compute (or record) solvent-accessible volume (SAV) fractions and solvation states for the system. If the ABSINTH implicit solvent model is in use (→ SC_IMPSOLV), this analysis can rely on the current values for those quantities (no additional computational cost); otherwise computing atomic SAV fractions incurs a moderate computational cost. The solvent-accessible volume will globally depend on the choice for the thickness of the assumed solvation shell (→ SAVPROBE). The mapped solvation states as reported for individual atoms (please refer to the ABSINTH publication for details) will depend on further ABSINTH parameters. Some of these can be adjusted through patches, e.g., user-supplied values for overlap reduction factors.
SAV analysis creates at most three output files: an instantaneous one (SAV.dat) that depends on auxiliary keyword INSTSAV, an atom-resolved output file that reports simulation averages (→ SAV_BY_ATOM.dat), and finally a file containing distribution functions (histograms) for selected atoms for those quantities (→ SAV_HISTS.dat). The latter file is dependent on another auxiliary keyword, i.e., SAVATOMFILE. The instantaneous output is primarily useful as a diagnostic tool for the system while the simulation is running, and to be able to compute correlation functions, multidimensional histograms, etc. for quantities related to the solvation of specific sites on macromolecules. Please refer to the descriptions of the output files for further details.
When CAMPARI's shared memory (OpenMP) parallelization is in use, the SAV analysis is handled by a single thread while other threads perform other tasks concurrently (if there are any). However, if the ABSINTH DMFI is turned on, the analysis task simply consists of recording the values. Conversely, the evaluation of the DMFI is thread-assisted as usual.
INSTSAV
If analysis of solvent-accessible volume fractions is requested (→ SAVCALC), this keyword allows the user to have a quantity related to the total SAV along with a running average being printed to a dedicated output file (→ SAV.dat). In addition, the values for SAV fractions for selected atoms (via SAVATOMFILE) are written out. The latter allows the construction of correlation functions, multidimensional histograms, etc. The keyword (positive integer) is interpreted as a print-out frequency relative to the frequency with which SAV analysis is performed per se. This means that the effective print-out interval will be SAVCALC·INSTSAV steps. Depending on the choices, the resultant output file can easily become very large, and it is possible for significant I/O lag to occur because of this.
SAVATOMFILE
If analysis of solvent-accessible volume fractions is requested (→ SAVCALC), this keyword specifies the location and name of a simple input file (list of atomic indices, format is described elsewhere) that allows the user to select a subset of the system's atoms for creating histograms of both SAV fraction and resultant solvation state (see above). These histograms are written to a dedicated output file (→ SAV_HISTS.dat). In addition, if instantaneous output of SAV-related quantities is requested (→ INSTSAV), the values for the SAV fractions for the selected atoms are written to the corresponding output file (SAV.dat). Note that instantaneous values for the SAV fractions allow manual computing (during post-processing) of solvation states (using parameters set in the key-file and/or reported in SAV_BY_ATOM.dat, and using the reference publication to retrieve the necessary expressions). It should be kept in mind that with normal settings for SAVPROBE, SAV fractions of nearby atoms are tightly coupled. This means for example that requesting information for atoms that are covalently bound will rarely yield additional information. Lastly, the binning for the histograms is fixed and uses 100 bins across the interval from zero to unity (both quantities are restricted to this interval).NUMCALC
This keyword is relevant only when the chosen thermodynamic ensemble allows for particle number fluctuations (simulation is performed in the (semi-)grand canonical ensemble). It then specifies the number of simulation steps between successive accumulations of number-present histograms for each fluctuating particle type. For a description of the corresponding output file please refer to PARTICLENUMHIST.dat. Instantaneous numbers can be printed with the help of the subordinated keyword NUMINST.NUMINST
This keyword is relevant only when the chosen thermodynamic ensemble allows for particle number fluctuations (simulation is performed in the (semi-)grand canonical ensemble). It then specifies indirectly the number of simulation steps between printing out instantaneous numbers of physically present molecules to a dedicated output file (PARTICLENOS.dat). The interval is taken relative to the instances defined by NUMCALC, and so the real step interval is the product of the two keywords.COVCALC
This simple keyword instructs CAMPARI to collect raw data (signal trains) for select degrees of freedom in the system (currently this is restricted to all flexible dihedral angles → TRCV_xxx.tmp) every COVCALC steps. This is a near-obsolete functionality that has large practical and technical overlaps with the output written to FYC.dat via TOROUT. It was meant to provide intrinsic support for variance/covariance analyses, e.g., with the ultimate goal of performing dimensionality reduction. Given that merely raw data are provided and that dihedral angle data are generally circular (periodic) variables requiring the use of circular statistics (not as trivial as it may sound), usage of this facility is generally not recommended. This option is available in different modes (see COVMODE) and may eventually be revived or extended later. Note that CAMPARI can perform intrinsic principal component analysis (PCA) and time-lagged independent component analysis (tICA) as part of the structural clustering facility (→ CCOLLECT and PCAMODE).COVMODE
This keyword chooses between (currently) two types of raw data to be provided by CAMPARI in output files TRCV_xxx.tmp. It can be set to:
- Internal degrees of freedom (i.e., torsions) directly in torsional space (radian)
- Internal degrees of freedom (i.e., torsions) expressed as their cosine and sine components
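Why the cosine and sine representation matters for periodic variables can be shown with a short Python sketch. It is a generic illustration of circular statistics and does not read or reproduce the TRCV_xxx.tmp format; the function names are hypothetical.

```python
import math


def to_components(angles_rad):
    """Express each torsion (in radian) as its (cosine, sine) pair."""
    return [(math.cos(a), math.sin(a)) for a in angles_rad]


def circular_mean(angles_rad):
    """Mean direction computed from summed cosine and sine components."""
    c = sum(math.cos(a) for a in angles_rad)
    s = sum(math.sin(a) for a in angles_rad)
    return math.atan2(s, c)


# Two torsions on either side of the periodic boundary at +/-180 degrees.
angles = [math.radians(179.0), math.radians(-179.0)]
naive = sum(angles) / len(angles)   # arithmetic mean lands at 0 degrees
circ = circular_mean(angles)        # circular mean lands at +/-180 degrees
print(math.degrees(naive), abs(math.degrees(circ)))
```

The arithmetic mean of the raw angles points in the opposite direction of both samples, whereas the mean obtained from the components does not. Variances and covariances of raw torsions suffer from the same boundary problem.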
This keyword specifies how often to compute molecular and residue-wise dipole moments for net-neutral molecules (or residues). Because the analysis relies on atomic partial charges, dipole analysis requires SC_POLAR to be set to a value larger than zero as charges are otherwise not assigned. The (somewhat preliminary) analysis produces output files MOLDIPOLES.dat and RESDIPOLES.dat. When CAMPARI's shared memory (OpenMP) parallelization is in use, these analyses are executed by a single thread while other threads perform other tasks concurrently (if there are any).
EMCALC
This keyword specifies how often to compute spatial density distributions for the simulated system. If the density restraint potential is in use, this analysis is automatically performed at every step given that it is computed regardless. The result is an averaged density on a three-dimensional grid of dimensions controlled generally by keywords EMDELTAS and SIZE. For nonperiodic boundaries, the evaluation grid, which is always rectangular and aligned to the cardinal (x, y, z) axes as well as static, is not (or cannot be) mapped to the system dimensions exactly, and keyword EMBUFFER becomes relevant. This implies that periodic dimensions are only supported when they align with the cardinal axes, which currently excludes triclinic, periodic boxes and periodicity in conjunction with ensembles allowing volume fluctuations. When using the density restraint potential, the grid serves both the purpose of analysis as described here and the purpose of evaluating the potential itself, which implies that it is an option to adopt the grid dimensions from the input density map. This is the default behavior for a cuboid system with 3D periodic boundary conditions when EMDELTAS is not provided.
The resultant spatial density (or densities) is that of a given atomic property selected by keyword EMPROPERTY (which can be more than one). It is written to an output file in NetCDF format, which means that the external NetCDF library is required to use this feature. The details of the file format CAMPARI uses are described elsewhere. For a given property X, the spatial density is computed as follows:
$$\rho_{ijk} = \rho_{sol} + V_{ijk}^{-1} \sum_{n=1}^{N} \left[ X_n - \gamma_n V_n \rho_{sol} \right] \prod_{d=1}^{3} B_A\left( r_n^d - P_{ijk}^d \right)$$
Here, $V_{ijk}$ is the volume of the grid cell with indices i, j, and k, N is the number of atoms in the system, $X_n$ is the target property of the atom with index "n", $V_n$ is that atom's volume, and $r_n^d$ are the three components of its position vector. The parameter $\gamma_n$ is a pairwise, volume overlap reduction factor that corrects atomic volume for overlap with covalently bound atoms. It is explained in some detail elsewhere. The parameter $\rho_{sol}$ sets a physical background density for the property in question, and this is relevant when not all matter contributing to the property density in the system is represented explicitly. In such a case, an assumed vacuum would lead to severe errors. Note that atomic volumes and volume reduction factors are no longer relevant if $\rho_{sol}$ is zero in the above equation. Finally, the product in the above equation utilizes cardinal B-spline functions of order "A", $B_A$, which are assumed centered at the center of each grid cell (vector $P_{ijk}$ with components $P_{ijk}^d$ for each dimension). This technique of distributing a property on a lattice is shared with the particle-mesh Ewald method.
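A one-dimensional analog of the equation above can be written as a short Python sketch. It is meant only to illustrate how cardinal B-splines distribute an atomic property over neighboring cells. It is not CAMPARI's code: the function names are hypothetical, the grid is assumed periodic with its origin at zero, the cell length stands in for the cell volume, and `reduced_volumes` stands in for the products of overlap reduction factor and atomic volume.

```python
import math


def cardinal_bspline(order, u):
    """Cardinal B-spline of the given order, nonzero only for 0 <= u < order."""
    if order == 1:
        return 1.0 if 0.0 <= u < 1.0 else 0.0  # simple binning function
    left = u * cardinal_bspline(order - 1, u)
    right = (order - u) * cardinal_bspline(order - 1, u - 1.0)
    return (left + right) / (order - 1)


def spread_1d(positions, values, n_cells, cell_size, order,
              rho_sol=0.0, reduced_volumes=None):
    """Distribute per-atom values on a periodic 1D grid with B-splines."""
    if reduced_volumes is None:
        reduced_volumes = [0.0] * len(positions)
    grid = [rho_sol] * n_cells  # background density everywhere
    for x, value, vol in zip(positions, values, reduced_volumes):
        excess = value - vol * rho_sol  # property in excess of displaced background
        first = math.floor(x / cell_size - 0.5 - order / 2.0)
        for k in range(first, first + order + 2):
            center = (k + 0.5) * cell_size
            # Shift the argument so that the spline is centered on the cell center.
            weight = cardinal_bspline(order, (x - center) / cell_size + order / 2.0)
            grid[k % n_cells] += excess * weight / cell_size
    return grid


binned = spread_1d([2.3], [5.0], n_cells=8, cell_size=1.0, order=1)
smooth = spread_1d([2.3], [5.0], n_cells=8, cell_size=1.0, order=3)
print(binned)
print(smooth, sum(smooth))
```

With order 1 the full value ends up in a single cell, while order 3 spreads it over three cells with the same total. In three dimensions the weights are products over the three axes, which is why the cost per atom grows with the third power of the order.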
Like the corresponding density restraint potential, the accumulation of these data has been parallelized, i.e., when CAMPARI's shared memory (OpenMP) parallelization is in use, all threads work on this task synchronously. If the potential is turned on, there is no additional significant cost; otherwise the grid has to be incremented with the current configuration. Parallel efficiency can suffer in this mode of operation if subsequent snapshots have very different configuration (due to poor load balance). Parallel efficiency is generally poor if the lattices are large (in number of grid cells) relative to the number of atoms.
EMDELTAS
If the density restraint potential is not in use, but spatial density analysis is requested, this keyword is mandatory and sets the lattice cell size of the analysis grid by providing three floating point numbers corresponding to the lattice cell sizes in Å for the x, y, and z dimensions, respectively.
Conversely, if the density restraint potential is in use, this keyword is optional and allows the user to set a lattice cell size different from the one used by the input density map. The keyword again requires the specification of three floating point numbers that set the lattice cell sizes in Å for the x, y, and z dimensions of the analysis and evaluation grid, respectively. Note that acceptable choices require that it be possible to superpose the cells of the input density map exactly with the analysis grid after reducing its resolution to that of the input map. Minor adjustments may be made automatically to system size and/or the origin of the input map. If, for example, in the x-dimension the input map has 10 cells of width 2 Å, and the evaluation grid has 26 cells of width 1 Å, then the system origin has to be chosen such that the left boundary of the first cell of the input density aligns with the left boundary of the first, third, or fifth cell of the evaluation grid (but not any others). In the same example, CAMPARI would reject a system size of 25 Å, because the resultant number of cells in the x-dimension would not be divisible by the integer factor corresponding to the differences in resolution (here 2). It would also reject an origin aligning the first input cell to the seventh evaluation grid cell, because this would mean that the input map extends beyond the system boundaries. Finally, implied boundary conditions of the input map are not made to correspond to system boundary conditions automatically. For any periodic boundaries of the system, the evaluation grid is and must be fit exactly to the system dimensions.
The grid is the same for all properties being analyzed.
EMPROPERTY
If spatial density analysis is requested, or if the density restraint potential is in use, this keyword lets the user pick an atomic property to be distributed on a lattice. The options are as follows:
- Use atomic mass (resultant units are g/cm³).
- Use atomic number, i.e., proton mass (resultant units are also g/cm³ for convenience).
- Use atomic charge (resultant units are e/Å³) relying on the parameters read-in from parameter file and used in the Coulomb potential, the use of which is required.
- Use a custom property (resultant units are a.u./Å³) relying on a custom input file. This is the only option that allows the use of more than one density to be collected and/or restrained. The presence of more than one property means that certain keywords should take multiple input values, notably FMCSC_EMTHRESHOLD, FMCSC_EMTOTMASS, FMCSC_EMTRUNCATE, FMCSC_EMBGDENSITY, FMCSC_EMBACKGROUND, and FMCSC_EMFLATTEN.
It is important to realize that properties that can take both negative and positive values (like partial charge) must be and are treated slightly differently from those that are positive definite (like mass). This is because the interpretation of the input map must sometimes consider the absolute value of the property to arrive at a meaningful interpretation of the parameters for total integral, threshold level, and truncation level.
If spatial density analysis is requested, or if the density restraint potential is in use, this keyword sets an assumed background level for the atomic property or properties in question. This means that, if EMPROPERTY is 4, more than one value may be required (on the same line). In general, background densities should be zero if all relevant matter in the system is represented explicitly, i.e., if empty space is indeed meant to correspond to a vacuum. If not, for example if the background is supposed to capture contributions from solvent molecules not represented explicitly, the values should be given in appropriate units depending on the property that the density is derived from. These are g/cm³ for mass and proton densities (atomic number), e/Å³ for charge, and arbitrary for custom properties. Note that the assumed background can also be zero for other reasons, e.g., because the average charge density of an implicit, net-neutral solvent buffer is zero.
EMPROPERTYFILE
If spatial density analysis is requested, or if the density restraint potential is in use, this keyword lets the user specify the path and name of a custom input file to supply values for arbitrary atomic properties to CAMPARI, of which to calculate and possibly restrain their spatial densities (see option 4 for EMPROPERTY). The details of the input format and the consequences of choosing more than one property are described elsewhere.EMBUFFER
If spatial density analysis is requested, or if the density restraint potential is in use, this keyword sets a ratio for how much to extend the evaluation grid for spatial densities beyond any nonperiodic boundaries of the system. In the direction of a nonperiodic boundary, CAMPARI takes the maximum dimension (e.g., the diameter of a sphere) and multiplies it with this factor to obtain the (approximate) size of the rectangular cuboid grid. Alignment with a potential input grid is achieved by shifting the origin of the evaluation grid slightly. Note that the behavior will generally be undefined for cases where solute material samples positions off the evaluation grid. It is up to the user to ensure that the buffer spacing is big enough for the stiffness of the boundaries to prevent this from happening. This is particularly problematic in Monte Carlo simulations with large-scale moves such as rigid-body moves of the randomizing kind. In such a case, a trial conformation might have to be evaluated that extends significantly beyond the boundary. Even though this move will almost certainly be rejected on account of the boundary, the density restraint potential might create nonsensical answers, which is to be avoided. By default, all solute coordinates are periodically wrapped onto the analysis/evaluation grid even with a nonperiodic boundary (because no algorithms are implemented that deal with explicit grid edges in terms of B-splines, etc). This is always undesirable in a nonperiodic boundary. In extreme cases (molecules protruding more than one side length from the box), warnings are produced.
EMBSPLINE
If spatial density analysis is requested, or if the density restraint potential is in use, this keyword sets the order of B-splines used to distribute the one or more atomic properties of interest on the lattice. This setting corresponds to parameter "A" in the equation above. B-splines of order 3 or higher lead to functions with smooth derivatives, and are appropriate for gradient-based methods. B-splines have finite support, and the cost per atom will increase with A³ for a three-dimensional lattice. The limiting case of A being unity corresponds to a simple binning function, whereas for large A, a Gaussian function is recovered. The effective width does not grow linearly with A, but it is rather the tails of the functions that grow. This implies that very large values for A are probably not a useful investment of CPU time. Note that the effective width of the B-spline can be thought of as setting an inherent resolution or averaging scale for a given atom in question, since it replaces a point function with a distribution. The choice for this keyword should therefore be made in concert with the choice of formal grid resolution.
DIFFRCALC
This keyword specifies how often to compute approximate fiber diffraction patterns for the whole system (excluding ghost particles in GC simulations → ENSEMBLE). The system is aligned according to an assumed fiber axis in the system (see DIFFRAXIS), and amorphous diffraction patterns using cylindrical coordinates (through Fourier-Bessel transform) are computed. The code currently assumes atomic scattering cross sections which are proportional to atomic mass with the additional modification that all hydrogen atoms are excluded from the diffraction calculation. Specifically, the atomic scattering function for heavy atom i is proportional to mi/mC with a proportionality constant yielding units of the square root of scattering intensity. It is zero for hydrogen atoms. See DIFFRACTION.dat for more details. As a cautionary comment it should be noted that these calculations are somewhat untested and that output should be carefully examined. When CAMPARI's shared memory (OpenMP) parallelization is in use, the diffraction pattern is calculated by a single thread while other threads perform other tasks concurrently (if there are any). This is a limitation because of the high inherent cost of this analysis.DIFFRRMAX
For diffraction calculations (→ DIFFRCALC), this specifies the maximum number of bins in the reciprocal radial dimension (r in cylindrical coordinates). The resultant bins will be centered around zero.
DIFFRZMAX
For diffraction calculations (→ DIFFRCALC), this specifies the maximum number of bins in the reciprocal axial dimension (z in cylindrical coordinates). The resultant bins will be centered around zero.
DIFFRRRES
For diffraction calculations (→ DIFFRCALC), this gives the resolution in the reciprocal radial dimension (r in cylindrical coordinates) in Å⁻¹.
DIFFRZRES
For diffraction calculations (→ DIFFRCALC), this gives the resolution in the reciprocal axial dimension (z in cylindrical coordinates) in Å⁻¹.
DIFFRJMAX
This defines the maximum order of Bessel functions to use in the Fourier-Bessel (Hankel) transform to generate the (fiber) diffraction pattern (→ DIFFRCALC). Note that the transform takes the product of actual and reciprocal radial coordinate as its argument. Hence, the maximum order will determine how meaningful the generated information for large values of inverse radial dimensions is. This soft cutoff will scale reciprocally with the size of the system in the radial dimension. These features arise due to the fact that Bessel functions of order n only contribute non-zero values beyond a (unitless) argument value of ca. n. Also note that the input file for the Bessel functions (see FMCSC_BESSELFILE) needs to provide the tabulated functions up to the necessary order.
DIFFRAXIS
For diffraction calculations (→ DIFFRCALC), it is possible (and usually meaningful and necessary) to use a fixed system axis as the assumed fiber axis. This is (naturally) particularly appropriate for single-point calculations on specific structures. The axis' x, y, and z components have to be provided as three floating point numbers. The length of the vector is not important. The axis will pass through the point defined (see DIFFRAXON). If this keyword is not specified, the program will identify the longest possible atom-atom distance in the system, and use the resultant axis. Note that this axis will not be constant with respect to the absolute (lab) coordinates, but that it is supposed to cover cases where changes in configuration are allowed (especially if rigid-body movement is permitted).
DIFFRAXON
This keyword specifies the point the (constant) axis (see DIFFRAXIS) for diffraction analysis (→ DIFFRCALC) will pass through. This will define the zero-point in the z-coordinate, and hence the origin of the cylindrical coordinate system. If this keyword is not provided, CAMPARI will assume the {0.0 0.0 0.0}-point for this (independent of specifications for the system origin).
REOLCALC
This keyword is only relevant in MPI replica exchange calculations (or parallel trajectory analysis runs using the same setup). It instructs CAMPARI to compute various overlap measures between the different Hamiltonians employed in the REMC/D run (see N_XXX_OVERLAP.dat). Note that this relies on the evaluation of the system energy at different conditions, i.e., Hamiltonians. Unless the only exchange dimension is temperature, CAMPARI makes the assumption that the energy has to be fully reevaluated for each condition, which means that there is a significant cost associated with the overlap calculation. Cutoffs and long-range corrections (see keywords CUTOFFMODE, LREL_MC, and LREL_MD) are always respected by these additional evaluations of cross- (or foreign) energies. In dynamics runs, an additional complication arises if neighbor list updates are performed infrequently (see NBL_UP). Here, CAMPARI enforces an extra update of neighbor lists that is always out-of-sync with the schedule of the simulation propagation (this is for technical reasons). The unfortunate consequence is that, for an identical random seed, trajectories are not going to be identical if NBL_UP is greater than 1 and overlap calculations are performed with different frequencies.
The user controls whether to calculate foreign energies across all replicas (see REOLALL). If only neighboring conditions are requested, output in N_XXX_OVERLAP.dat may be truncated or uninformative. It is important to mention that the MC branch of the energy functions is used only in plain REMC calculations, and that in all other cases (including hybrid methods → DYNAMICS) the dynamics branch is used. This is important since cutoff and long-range treatments can easily be inconsistent between the two (see LREL_MC and LREL_MD). Because the main cost of the overlap calculation is the evaluation of "foreign" energies, CAMPARI's shared memory (OpenMP) parallelization can employ its full parallelization scope for this task.
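The notion of "foreign" energies can be illustrated with a minimal Python sketch. It is not CAMPARI code: the function name, the toy harmonic Hamiltonians, and the ordering of the returned vector are assumptions made purely for illustration. It shows why restricting the evaluation to neighboring conditions yields a vector of only 2 or 3 entries.

```python
def foreign_energies(structure, hamiltonians, own, all_replicas=True):
    """Evaluate one structure under several Hamiltonians (hypothetical helper).

    hamiltonians: list of callables energy(structure) -> float, one per condition.
    own: 0-based index of the condition that generated the structure.
    """
    n = len(hamiltonians)
    if all_replicas:
        targets = range(n)
    else:
        # Own condition plus existing neighbors: 2 entries at the ends, 3 otherwise.
        targets = [k for k in (own - 1, own, own + 1) if 0 <= k < n]
    return [hamiltonians[k](structure) for k in targets]

# Toy example: harmonic energies that differ only in their force constant.
hams = [lambda x, k=k: 0.5 * k * x * x for k in (1.0, 2.0, 3.0, 4.0)]
full = foreign_energies(2.0, hams, own=0)
edge = foreign_energies(2.0, hams, own=0, all_replicas=False)
inner = foreign_energies(2.0, hams, own=2, all_replicas=False)
```

Each entry requires a full energy evaluation, which is the cost referred to above whenever the exchange dimension is not temperature alone.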
REOLINST
This keyword is only relevant in MPI replica exchange calculations (or parallel trajectory analysis runs using the same setup). It requests instantaneous "foreign" energies to be written (see N_XXX_EVEC.dat). "Foreign" or "cross"-energies are simply the energies of the current structure evaluated at Hamiltonians different from the one generating the ensemble. Note that the user controls whether to calculate foreign energies across all replicas (see REOLALL). If only neighboring conditions are requested, a truncated vector (length 2 or 3) is provided in N_XXX_EVEC.dat. To facilitate frequent overlap analysis with sparser instantaneous output, this keyword is interpreted as a subordinated frequency for REOLCALC (as SCATTERCALC is relative to RHCALC).
REOLALL
This keyword is only relevant in MPI replica exchange calculations (or parallel trajectory analysis runs using the same setup). It is interpreted as a simple logical that determines whether "foreign" energies are computed over all other replicas or just the neighboring ones (see N_XXX_EVEC.dat and N_XXX_OVERLAP.dat).
TRACEFILE
This optional keyword is relevant for the post-processing of two types of parallel simulation runs. First, if a parallel trajectory analysis run in the RE setup is performed (→ details elsewhere), it allows the user to supply a file with a running map of replicas to starting conditions. Details of format and interpretation are given elsewhere. The default map assumed by CAMPARI is the identity mapping 1..REPLICAS. If a trace file is provided, sets of step number and an updated map for that specific step are read. This is primarily meant to make replica exchange trajectories that are continuous in condition (i.e., have conformational jumps in them) continuous in conformation (i.e., afterwards they have jumps in condition in them). In such a case, the trace file is the history of replica exchange moves such as output by CAMPARI itself. CAMPARI will then recombine information from the input trajectories according to the trace. This means that all analyses performed are on the unscrambled trajectory that can of course also be written (→ XYZOUT). Naturally, this keyword can also be used to specify any other map for other applications, e.g., to create trajectories for obtaining bootstrap-type error estimates. The relation of step numbers in the trace file to frames in the trajectories is handled by keywords RE_TRAJOUT and RE_TRAJSKIP.
Second, if a serial trajectory analysis run or a parallel trajectory analysis run using the MPI averaging framework is performed, it can be used to post-process data from a parallel PIGS run. PIGS runs provide their own trace file. For the serial analysis, the trajectories from individual replicas must be concatenated in numerical order; otherwise they should be left as is. Unless the trace file itself is edited (its first column has the step number), keywords RE_TRAJTOTAL, RE_TRAJOUT, and RE_TRAJSKIP define the output settings for the original simulation run, and the settings must be matched exactly.
For example, with 4 replicas, XYZOUT 50, NRSTEPS 1000, and EQUIL 500, each trajectory will have 10 snapshots. In trajectory analysis mode, the concatenated trajectory (40 snapshots) can then be supplied with settings of RE_TRAJTOTAL 10, RE_TRAJOUT 50, RE_TRAJSKIP 500, and NRSTEPS 40. Alternatively, the set of individual trajectories can be supplied to a parallel analysis run using NRSTEPS as 10 instead. The trace file is processed exclusively in the context of network-based analyses (see CCOLLECT, CMODE, output file STRUCT_CLUSTERING.graphml, and so on). Reading in the PIGS trace accomplishes the automatic removal and addition of (conformational) network links incurred by the PIGS protocol. Overlapping functionality is provided by keywords TRAJBREAKSFILE and TRAJLINKSFILE, but these are only available in serial analysis mode. If appropriate geometric information is present, this function will also produce an output file containing snapshot weights for the PIGS runs estimated by the weighted ensemble strategy (see documentation for the corresponding output file for details). Note that a PIGS analysis run (see elsewhere for details) does of course not process the PIGS trace as it emulates the behavior of only a single PIGS stretch (reseeding interval). The presence of this keyword in the input key file is explicitly not allowed (results in a halt of the execution) when operating in NetCDF analysis mode.
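The arithmetic of the worked example above can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the assumption that the first saved frame corresponds to step RE_TRAJSKIP + RE_TRAJOUT is ours rather than something stated in this documentation.

```python
def frames_per_replica(nrsteps, equil, xyzout):
    # Number of snapshots written per replica after the equilibration period.
    return (nrsteps - equil) // xyzout

def frame_to_step(frame, re_trajtotal, re_trajout, re_trajskip):
    """Map a 0-based frame of a concatenated trajectory to (replica, step).

    Assumption: the first saved frame of each replica belongs to
    step re_trajskip + re_trajout.
    """
    replica, local = divmod(frame, re_trajtotal)
    return replica + 1, re_trajskip + (local + 1) * re_trajout

replicas, xyzout, nrsteps, equil = 4, 50, 1000, 500
per_replica = frames_per_replica(nrsteps, equil, xyzout)  # RE_TRAJTOTAL
total = replicas * per_replica                            # NRSTEPS of the serial analysis run
first = frame_to_step(0, per_replica, xyzout, equil)
last = frame_to_step(total - 1, per_replica, xyzout, equil)
```

With these settings, `per_replica` is 10 and `total` is 40, matching the values of RE_TRAJTOTAL and NRSTEPS quoted in the example.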
RE_TRAJOUT
This keyword is relevant for some trajectory analysis runs. In particular, those runs relying on an input file with the reseeding/exchange history of a parallel simulation run need to translate the information about step numbers in this file to the analyzed data. This keyword therefore lets the user set the trajectory output frequency CAMPARI is supposed to assume for the supplied input trajectories (separate or concatenated) being analyzed. This is important because the trace is meant to use simulation step numbers that are not preserved in trajectory analysis mode (no step number or time information from input trajectories is read and used).
If a parallel analysis run in the replica exchange setup is performed, a successful unscrambling of the trajectories requires that the exchange trace is exhaustive at the level of the output frequency of this keyword. This means that it is sufficient to provide the current map of condition to starting structure for every snapshot in the input trajectories (more information can be supplied without harm, less information will lead to errors). In the replica exchange case, keyword RE_TRAJSKIP is also essential. If an analysis run on a PIGS data set is performed (possibly in parallel), the trace must contain information about all reseedings. Here, both keywords RE_TRAJSKIP and RE_TRAJTOTAL are processed as well.
Unlike in the cases outlined above, which are both related to processing a trace file, this keyword attains a different function in a parallel PIGS analysis run. Since such a run is supposed to emulate the PIGS heuristic, it serves as an output control setting to compute the step number as RE_TRAJTOTAL times RE_TRAJOUT, which is then printed to the output trace file.
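The unscrambling of replica exchange trajectories described for the trace file can be sketched as follows. This is an illustration under several assumptions, not CAMPARI's implementation: the trace is held in memory as (step, map) pairs rather than in the actual file format, indices are 0-based, `map[c]` names the starting structure currently at condition `c`, and a map recorded at step s already applies to a snapshot saved at step s.

```python
def unscramble(trajs, trace, re_trajout, re_trajskip):
    """Reorder frames from 'continuous in condition' to 'continuous in conformation'.

    trajs[c][f]: frame f of the trajectory written at condition c.
    trace: iterable of (step, map) pairs (hypothetical in-memory form).
    """
    n_rep, n_frames = len(trajs), len(trajs[0])
    current = list(range(n_rep))  # identity map is the default
    events = sorted(trace, key=lambda ev: ev[0])
    out = [[None] * n_frames for _ in range(n_rep)]
    e = 0
    for f in range(n_frames):
        # Step number assumed for frame f (see RE_TRAJOUT and RE_TRAJSKIP).
        step = re_trajskip + (f + 1) * re_trajout
        while e < len(events) and events[e][0] <= step:
            current = events[e][1]
            e += 1
        for cond, struct in enumerate(current):
            out[struct][f] = trajs[cond][f]
    return out

# Two conditions, three frames; one exchange recorded at step 20.
by_condition = [["a1", "a2", "a3"], ["b1", "b2", "b3"]]
by_structure = unscramble(by_condition, [(20, [1, 0])], re_trajout=10, re_trajskip=0)
```

The sketch also shows why the trace has to be exhaustive at the level of the output frequency: a missed map update would silently assign frames to the wrong structure.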
RE_TRAJSKIP
This keyword is relevant for some trajectory analysis runs. In particular, those runs relying on an input file with the reseeding/exchange history of a parallel simulation run need to translate the information about step numbers in this file to the analyzed data. This keyword therefore lets the user set the equilibration period for trajectory output that CAMPARI is supposed to assume for the supplied input trajectories (separate or concatenated) being analyzed. This is important because the trace is meant to use simulation step numbers that are not preserved in trajectory analysis mode (no step number or time information from input trajectories is read and used).
Both RE_TRAJOUT and this keyword are required for CAMPARI to correctly relate the frames in the trajectories to the step numbers in the trace file. Of course, it is also possible to edit the file with the trace to match the saved trajectory data exactly, and to then set RE_TRAJOUT and RE_TRAJSKIP to 1 and 0, respectively. In the case of a PIGS run being analyzed, keyword RE_TRAJTOTAL is also essential.
RE_TRAJTOTAL
If a serial trajectory analysis run or a parallel trajectory analysis run is performed and a file with the PIGS reseeding history (trace) has been provided, this keyword lets the user set the length in numbers of snapshots per replica that CAMPARI is supposed to assume for the trajectory input (→ elsewhere). In the serial case, this is usually NRSTEPS/REPLICAS whereas in the parallel case it is just NRSTEPS. This is important because the trace is meant to use simulation step numbers that are not preserved in trajectory analysis mode (no step number or time information from the input trajectory is read and used). RE_TRAJOUT, RE_TRAJSKIP, and this keyword are required for CAMPARI to correctly relate the frames in the trajectories to the step numbers in the trace file. Note that when using an input file with subsets of frames in random-access mode, this keyword has to be adjusted to the actual number of selected frames per replica, which still has to be constant.
Unlike in the processing of a trace file, this keyword attains a different function in a parallel PIGS analysis run. Since such a run is supposed to emulate the PIGS heuristic, it serves as an output control setting to compute the step number as RE_TRAJOUT times RE_TRAJTOTAL, which is then printed to the output trace file.
PYCALC
If CAMPARI was compiled with Python support (see installation instructions), this keyword sets the step interval (post equilibration) at which (a user-edited version of) CAMPARI's Python module is activated. This exposes many of CAMPARI's internal data structures directly to Python, largely in the form of NumPy arrays, which is a required library (along with SciPy although this is only used in pre-supplied functions and can be masked by removing those functions). Emphasis is placed on the added value: it should neither occur that functions that CAMPARI already supports are recreated in Python (no added value) nor should it happen that the data passed on are primarily the Cartesian coordinates (no added value relative to a reader of binary files, see PDB_FORMAT). Thus, instead, CAMPARI collects and sends a large amount of available data every PYCALC steps. These data are received by two functions: "campana_step", which is called every time, and "campana_final", which is called only after the regular run has finished. In the simplest cases, you would write Python code to gather, average, or bin some data into a custom data structure in "campana_step" and then print or plot this data structure as part of "campana_final". The workflow is explained in detail in Tutorial 18. All data are passed as copies, which means that presently it is not possible to modify CAMPARI (Fortran) arrays through the module. Python 3 is assumed throughout.
Much of the difference between a binary trajectory reader and a simulation package is their comprehension of what the numbers mean. For example, a simulation package like CAMPARI is aware that the system observes certain boundary conditions, contains molecules of known topologies such as polypeptides or polynucleotides, can be described energetically by a biomolecular force field, etc. This information is essential in generating the aforementioned added value.
Furthermore, CAMPARI already contains manifold analysis routines with good out-of-the-box performance (due to the use of compiled Fortran code). This implies that the Python module should have access to most of their results, in particular since naive Python code tends to perform very poorly. This can be used to enable very simple extensions of existing analyses like constructing block averages and generating a measure of error from their distribution.
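As an illustration of the kind of simple extension mentioned above, the following standalone sketch computes block averages and a standard error from their distribution. It is deliberately independent of CAMPARI's module: the function name is hypothetical, and the actual signatures of "campana_step" and "campana_final" are not reproduced here. In practice, one would presumably append the per-step values to a list in "campana_step" and call such a helper from "campana_final".

```python
from math import sqrt
from statistics import mean, stdev

def block_average(samples, n_blocks):
    """Return (mean of block means, standard error from the block distribution).

    Trailing samples that do not fill a complete block are ignored.
    """
    size = len(samples) // n_blocks
    if size == 0 or n_blocks < 2:
        raise ValueError("need at least 2 blocks with at least 1 sample each")
    blocks = [mean(samples[i * size:(i + 1) * size]) for i in range(n_blocks)]
    return mean(blocks), stdev(blocks) / sqrt(n_blocks)

# Toy data standing in for values collected once per PYCALC interval.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
avg, err = block_average(values, n_blocks=4)
```

The error estimate is only meaningful if the blocks are long compared to the correlation time of the observable, which the sketch does not check.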
The two functions, "campana_step" and "campana_final" receive the same data as follows. First, there are a number of static data structures that are extremely helpful in parsing the system, e.g. to calculate properties for subsets of molecules or residues, to find covalently bound atoms, etc. These containers are systematically named and exist for 3 hierarchies (atoms, residues, molecules) in 2 variants each (integer and floating point data types). Their contents are listed in detail as part of Tutorial 18 but summarized here as well:
The properties per atom in "static_atom_ints" are:
- BIOTYPE: CAMPARI biotype as explained elsewhere
- LJ_TYPE: CAMPARI atom (Lennard-Jones) type as explained elsewhere
- RESIDUE_NR: The running residue number in the sequence defined by sequence input
- MOLECULE_NR: The running molecule number in the sequence defined by sequence input (note that crosslinked molecules count as separate molecules)
- BACKBONE_CNT: The running index per residue of backbone atoms, which is 0 otherwise (in unsupported residues and small molecules, this assignment can be uninformative)
- ZMATRIX_I1, ZMATRIX_I2, ZMATRIX_I3: The indices of atoms that define its Z-matrix row (determined automatically and possibly misleading in unsupported residues); 0 indicates that atoms are part of molecular reference frame
- ZMATRIX_CHIRAL: If -1 or 1, this distinguishes chirality in ambiguous Z-matrix entries (those relying on 2 bond angles), 0 otherwise
- NR_BOUND_ATOMS: The number of covalently bound atoms (determined automatically and possibly misleading in unsupported residues)
- BOUND_ATOM1, BOUND_ATOM2, ...: The indices of covalently bound atoms (determined automatically and possibly misleading in unsupported residues); 0 once exhausted for a given atom
The floating point properties per atom are:
- MASS: Atomic mass (a.m.u.) as inferred from atom types (see elsewhere)
- RADIUS: Effective atomic radius in Å as explained elsewhere
- PARTIAL_CHARGE: If the Coulomb potential is enabled, the partial charge per atom in e- (set from charge types or patches)
- RFOS_INCR: If the ABSINTH DMFI is enabled, the atomic contribution to the reference free energy of solvation in kcal/mol (set from solvation parameters or patches)
The properties per residue in "static_res_ints" are:
- TYPE: CAMPARI residue type that is translatable with the "restype_to_name" array of "user_counters"
- NR_ATOMS: Number of atoms
- NR_SC_ATOMS: Number of sidechain atoms (in unsupported residues and small molecules, this assignment can be uninformative)
- FIRST_ATOM: The index (starting at 1!) of the first atom in the residue
- REF_ATOM: The index (starting at 1!) of the atom defined as the reference (mostly for residue-based distance calculations)
- CROSSLINK: The index (starting at 1!) of the residue the current one is chemically cross-linked to (0 if none, for disulfide bridges)
- BB_CA, BB_N, BB_O, BB_C, BB_HN: The indices (starting at 1!) of the respective polypeptide backbone atoms (0 if the residue does not feature them, can be misleading for unsupported residues and small molecules)
- BB_O3*,BB_P,BB_O5*,BB_C5*,BB_C4*,NUCBASE_N1/9: The indices (starting at 1!) of the respective polynucleotide backbone and sidechain atoms (0 if the residue does not feature them, can be misleading for unsupported residues and small molecules)
- MOLECULE_NR: The running molecule number in the sequence defined by sequence input (note that crosslinked molecules count as separate molecules)
The floating point properties per residue are:
- MAX_DIS_FROM_REFAT: Effective CAMPARI residue "radius" in Å that is a measure of the maximal distance of any point in the residue from the assigned reference atom (see static_res_ints above)
- MASS: Total mass (in a.m.u.)
- TOTAL_CHARGE: The sum of all partial charges of atoms in the residue (only available if the Coulomb potential is turned on)
The integer properties per molecule are:
- START_ATOM,END_ATOM: Indices of the first and last atoms defining the molecule (these are always continuous, and crosslinked molecules always count as separate molecules)
- START_RES,END_RES: The same for residue indices
- TYPE: The molecule type (CAMPARI numbers types consecutively as they are discovered in sequence input)
- ANALYSIS_GROUP_NR: The assigned analysis group as determined by default or from user input
- NR_TORSIONS: The number of bonds with topologically flexible dihedral angles (can be misleading for unsupported residues)
- NR_IMD_DOFS: The total number of internal coordinate space degrees of freedom as configured by custom constraints, TMD_UNKMODE, and related settings
The floating point properties per molecule are:
- MASS: Total mass in a.m.u.
- CONTOUR_LENGTH: A very approximate measure of how long a polymer chain could be in Å when fully stretched but covalent bonds and angles remain intact (ignore for anything not a polymer)
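The static per-atom data listed above can be used, for instance, to build a bonded-neighbor lookup from NR_BOUND_ATOMS and the BOUND_ATOM columns. The sketch below uses plain Python lists as stand-ins for the arrays; how these columns are actually laid out inside "static_atom_ints" is not specified here, so the argument layout is an assumption and the function name is hypothetical. Only the conventions stated in the list (indices starting at 1, 0 once the partners are exhausted) are taken from the documentation.

```python
def bonded_neighbors(nr_bound, bound_atoms):
    """Build a dictionary atom index -> list of covalently bound atom indices.

    nr_bound[i]: number of bound partners of atom i+1 (NR_BOUND_ATOMS).
    bound_atoms[i]: 1-based partner indices padded with 0 (BOUND_ATOM1, ...).
    """
    neighbors = {}
    for i, (n, row) in enumerate(zip(nr_bound, bound_atoms)):
        # Keep only the first n entries and drop any zero padding.
        neighbors[i + 1] = [j for j in row[:n] if j > 0]
    return neighbors

# Toy water molecule: O is atom 1, the hydrogens are atoms 2 and 3.
nr_bound = [2, 1, 1]
bound_atoms = [[2, 3, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
nbrs = bonded_neighbors(nr_bound, bound_atoms)
```

Because CAMPARI's indices start at 1 while Python's start at 0, any direct indexing into arrays with these values needs an offset of one.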
Finally, we have three holder objects that are fully documented inside CAMPARI's Python module itself. The first, "box", is used to pass on information about the simulation container, so essentially the information encoded in keywords SHAPE, BOUNDARY, ORIGIN, SIZE, SOFTWALL, BOXVECTOR1, etc. Notably, these will be updated values in case the run is a trajectory analysis run assuming fluctuating volume conditions (in this scenario, the required information is contained in binary trajectory files such as those supplied as arguments to keywords XTCFILE or NETCDFFILE). The second, "analyses", can contain a wide range of properties that CAMPARI computes in a conditional manner. These are primarily the results (or raw values required to arrive at these results) of its built-in analysis functions. For example, DSSP is a highly specialized framework for assigning secondary structure to polypeptides based on hydrogen bonding patterns between backbone atoms. In CAMPARI, it is easily requested through keyword DSSPCALC. If you set both PYCALC and DSSPCALC to 10, at every 10th step (once past equilibration), CAMPARI will deposit the list of backbone-backbone hydrogen bonds, their energies, and the resultant instantaneous secondary structure assignment into "analyses.holder" as fields "dssp_d_hb_list", "dssp_a_hb_list" (donor and acceptor lists per eligible residue), "dssp_d_hb_energies", "dssp_a_hb_energies" (the associated energies), and "dssp_assignment" (the integer assignment). This is an example where CAMPARI sends the raw, instantaneous data. This is documented directly in the "analysis_holder" class of the Python module. These data are sufficient to replicate histograms such as those produced by CAMPARI itself, e.g., DSSP_EH_HIST.dat. If the step intervals of PYCALC and DSSPCALC are mismatched, the Python module might receive outdated data.
The general case where a mismatch is nevertheless useful is when one still wants to take advantage of the built-in analyses and their outputs but operate the Python module at a lower frequency (i.e., PYCALC is a multiple of DSSPCALC). Not everything is sent, and in some cases this might change in the future. The primary reasons for data being omitted are that the raw data are already contained in the dynamic content arrays discussed above, that the data are easily computable with functions that already exist in the module (like most of the data in POLYAVG.dat), or that the corresponding features are themselves still considered experimental in CAMPARI (like diffraction maps). The third holder object, "user_counters", is largely empty and meant to hold the variables required by the custom modifications introduced by the user. By default, it contains a call counter and a map from residue type to legible name.
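Whether a given pair of PYCALC and DSSPCALC settings can deliver outdated DSSP data to the Python module can be checked ahead of a run with a small helper. This sketch assumes that both features are triggered on post-equilibration steps that are exact multiples of their respective intervals; this triggering rule and the function name are assumptions made for illustration.

```python
def stale_python_calls(nrsteps, equil, pycalc, dsspcalc):
    """List steps at which the Python module is called without a DSSP update.

    Assumption: a feature with interval k fires at post-equilibration
    steps s with s % k == 0.
    """
    return [s for s in range(equil + 1, nrsteps + 1)
            if s % pycalc == 0 and s % dsspcalc != 0]

matched = stale_python_calls(100, 0, pycalc=10, dsspcalc=10)
multiple = stale_python_calls(100, 0, pycalc=20, dsspcalc=10)
mismatched = stale_python_calls(60, 0, pycalc=15, dsspcalc=10)
```

An empty list, as obtained when PYCALC is a multiple of DSSPCALC, indicates that every call of the module coincides with a fresh DSSP evaluation.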
For the Python module to work correctly, it must be found and free of syntax errors. For finding it, you can use either environmental variables or keyword PYDIR. The interface between Python and Fortran is provided by Elias Rabel's ForPy module, which takes care of all variable interoperability. To get it running, ForPy must be compiled, which entails linking to additional libraries. This is why the compilation of the Python interface is optional and controlled by a conditional as explained in the installation instructions.
To get started with this feature, it is strongly recommended to work through Tutorial 18, which explains the logic, workflow, and some caveats in detail.
If CAMPARI's shared memory (OpenMP) parallelization is in use, only the master thread will execute the Python functions, and there is (almost) no support from other threads in setting up things either. Because the Python module requires other analyses to be complete, other threads will generally have to wait idly for Python to return control to Fortran. In theory, it should be possible to invoke shared memory parallelization from within Python but users should keep in mind that the threads are not released leading to all the known issues resulting from oversubscription. This is a complicated topic touched upon in a few tutorials like Tutorial 7. Conversely, when CAMPARI is run in an MPI-parallel mode (→ REMC and MPIAVG), the Python module is invoked by separate processes. Generally speaking, this will run independently for every replica, which might put restrictions on what can be done with output, interactivity, etc., and it is up to the user to ensure that there are no clashes in this regard. The only notable exception is the sending of features for structural clustering, which sends aggregated data only from the master process if a corresponding MPI mode is chosen. There is an additional peculiarity in this in that the master process does not update these data at every step implied by CCOLLECT. Instead, these data are buffered and pooled with a generally unknown granularity. To infer this granularity, users have to check the size of the 2nd dimension of the received array (field "clu_features") and divide by the number of replicas.
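The rule for inferring the pooling granularity of clustering features in MPI runs can be written down directly. The sketch takes the shape of the received "clu_features" array (as reported, e.g., by the `shape` attribute of a NumPy array) and the number of replicas; the function name and the consistency check are our additions.

```python
def pooled_snapshots_per_replica(clu_features_shape, n_replicas):
    """Infer how many snapshots per replica were pooled into 'clu_features'.

    clu_features_shape: shape tuple of the received array; the 2nd dimension
    is assumed to hold the pooled snapshots of all replicas.
    """
    n_columns = clu_features_shape[1]
    if n_columns % n_replicas != 0:
        raise ValueError("2nd dimension is not a multiple of the replica number")
    return n_columns // n_replicas

# Example: 12 features, 40 pooled snapshots, 4 replicas.
granularity = pooled_snapshots_per_replica((12, 40), 4)
```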
What follows are some more details on the object "analyses" of type "analysis_holder" that is passed to "campana_step" and "campana_final". As can be seen in the module itself, the majority of its fields are related to analysis functions. As mentioned, for DSSP analysis, instantaneous information is sent in five arrays. For polymer statistics that scale linearly with system size, aggregate, lumped information (by analysis group) is sent. This is either in the form of cumulative histograms (matching output files RGHIST.dat, RDHIST.dat, and RETEHIST.dat) or in the form of instantaneous profiles (output files DENSPROF.dat and PERSISTENCE.dat). For polymer statistics that scale superlinearly with system size, instantaneous data are sent that underlie the averages in output files INTSCAL.dat and KRATKY.dat. Also, estimates of the "hydrodynamic" radius as reported in averaged form in POLYAVG.dat are transmitted. In all cases, the lumping by analysis group might complicate keeping track of the proper normalization (and those data are not sent). For spatial density analysis, lattice information is sent along with the instantaneous density maps (as obtained in averaged form in the corresponding output file). This analysis is free if the corresponding bias potential is enabled. The same logic applies to global secondary structure contents (output file ZSEC_HIST.dat and associated bias potential), which are sent as instantaneous quantities. In contrast, segment statistics (output file BB_SEGMENTS_RES.dat) are transmitted in cumulative form. Residue-level contact data are sent as cumulative maps (as in output file CONTACTMAP.dat) and as instantaneous assignments of molecular clusters (data underlying output files like MOLCLUSTERS.dat, controlled by additional keyword CLUSTERCALC). Solvent-accessible volumes and solvation states (see SAV_BY_ATOM.dat) are either enabled by using the ABSINTH DMFI or by enabling SAV analysis, and these are sent as instantaneous values.
Conversely, distance histograms are only sent in the form of cumulative histograms matching output files RBC_PC.dat and GENERAL_PC.dat.
The first group of arrays sent that deviates a bit is related to forces and energies. Depending on the Hamiltonian constructed through the associated keywords, different terms found in output file ENERGY.dat will be present (nonzero). If the run is a trajectory analysis run, care must be taken to match the force fields in use to the original simulation; otherwise, the reported energies and forces might be of reduced or even vanishing usefulness. For forces to be available, keyword DYNAMICS must not be 1. Cartesian forces are always sent albeit not necessarily in absolutely complete form (see CHECKGRAD for details). This information could for example be used to identify quickly points where an input structure has energetic (usually steric) clashes. If keyword CARTINT is 1 (the default), also forces acting on internal coordinate space variables (rigid-body coordinates and dihedral angles) are made available. These would be the same forces analyzed by the relaxation facility (→ TMD_RELAX). Importantly, parsing these arrays requires additional information, which is actually static, but sent as part of the "analyses" object (field "ia_imd_explain"). Lastly, inertial masses (either translational or rotational in nature) are contained as well. These are, for the rotational case, generally dynamical variables (see keywords TMD_INTEGRATOR or FUDGE_DYN for further information). The second group is information on the MPI setup, specifically the rank that is calling the Python module and some exchange statistics for replica exchange. If MPI-enabled CAMPARI is used, this opens up the possibility even to do heterogeneous processing in different ranks, i.e., have a version of the module that has explicit conditionals for the MPI rank.
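The use of Cartesian forces to locate energetic clashes in an input structure can be sketched as below. The representation of the forces as a sequence of (fx, fy, fz) tuples, the function name, and the threshold value are all assumptions for illustration; the actual array layout is documented in the module itself.

```python
from math import sqrt

def clash_candidates(forces, threshold):
    """Return 1-based indices of atoms whose force magnitude exceeds a threshold.

    forces: sequence of (fx, fy, fz) per atom (hypothetical layout).
    threshold: user-chosen cutoff in the same units as the forces.
    """
    flagged = []
    for i, (fx, fy, fz) in enumerate(forces):
        if sqrt(fx * fx + fy * fy + fz * fz) > threshold:
            flagged.append(i + 1)
    return flagged

# Toy forces: only the second atom experiences a very large force.
toy_forces = [(1.0, 2.0, 2.0), (300.0, 400.0, 0.0), (0.0, 0.0, 5.0)]
suspects = clash_candidates(toy_forces, threshold=100.0)
```

A suitable threshold depends on the force field and system, so it is best chosen by inspecting the distribution of force magnitudes for a relaxed structure first.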
PYDIR
If CAMPARI was compiled with Python support (see installation instructions), and the invocation of CAMPARI's Python module was requested through keyword PYCALC, this keyword can be used to point CAMPARI to a folder where it will find an edited copy of the module. This serves the same function as appending to the "PYTHONPATH" environmental variable on a Linux operating system.
CCOLLECT
This keyword controls the frequency with which a selected set of features (see CDISTANCE and CFILE) extracted from the trajectory data (typically in a trajectory analysis run → PDBANALYZE) is stored in a large array in memory for post-processing. Such post-processing currently consists of different algorithms (→ CMODE), for example to identify structural clusters in the data, and is performed after the last step of the run has completed. If CCOLLECT is set to something larger than the number of simulation steps (NRSTEPS), the clustering analysis is disabled (also the default).
Various output will be produced aside from information written directly to standard out or the log-file. At the most basic level, the extracted features themselves, after the various preprocessing steps outlined below, can be written to disk in an optional output file (see keyword CDUMP). The most common output file is a list of cluster annotations per analyzed snapshot (→ STRUCT_CLUSTERING.clu) that is produced along with a helper script for the visualization software VMD (→ STRUCT_CLUSTERING.vmd). Furthermore, CAMPARI will print a file representing the clustering as a graph in an xml-based (so-called "graphml") format (→ STRUCT_CLUSTERING.graphml). Taken together, these files allow further analyses of the clustering, primarily those that take advantage of the fact that the clustering yields a complex network/graph (e.g., cut-based free energy profiles using committor probabilities).
All clustering algorithms and also the progress index algorithm (→ CMODE) will write a number of diagnostic and reporting summaries to log-output. For clustering algorithms, this includes a summary of the determined clusters (usually involving at least the number of contained snapshots and a measure of size) to log-output. The exact progress index method is an exception as it does not explicitly record a clustering (the three aforementioned output files are missing). With any progress index method in use, at least one additional output file is obtained. This file is the essential requirement to create plots as in the progress index reference.
Note that structural clustering breaks the typical CAMPARI paradigm of "on-the-fly" analysis since the bulk of the CPU time for analysis will be invested only at the very end. Therefore, structural clustering will most often be used in trajectory analysis runs as it will be highly undesirable to risk an unclean termination of an actual simulation (certain algorithms for structural clustering require large amounts of memory and/or CPU time). Note as well that structural clustering should not be confused with the (much simpler) analysis of molecular clusters (see CLUSTERCALC and its corresponding output files). Because structural clustering and related analyses can be CPU time-intensive tasks, they are handled by CAMPARI's shared memory (OpenMP) parallelization, i.e., many algorithms are tackled by all threads at once. Most importantly, the tree-based clustering and the approximate progress index method (options 4 and 5 to keyword CMODE) as well as iterative algorithms operating on derived graphs (see MAXTIME_ITERS for details) have been parallelized this way. Details are provided below, in particular for keyword CMODE.
A special remark is required for simulation runs using the MPI averaging technique. Similar to any use of the clustering functionality "on-the-fly", trajectory output should be generated in accordance with the setting for CCOLLECT (most easily by using MPIAVG_XYZ and a matching value of XYZOUT). This is so the clustering results can be annotated and understood at all. In an MPI averaging run, CAMPARI will then at each collection step gather data from all replicas and store them in an array allocated exclusively by the master process. The data arrangement is such that trajectories will be continuous and ordered by increasing replica number. The concatenation introduces spurious transitions that may affect subsequent computations. Data collection causes a synchronization and communication requirement absent in other types of MPI averaging calculations. At the end of the simulation, the resultant concatenated trajectory is analyzed exclusively by the master process, which - depending on settings and algorithms in use - may lead to severe imbalances in terms of both memory consumption and CPU time requirements. This should be kept in mind when using this approach across machines not sharing any memory. To enforce the complementary behavior of every identical replica analyzing its own trajectory, it is possible to use a fake replica exchange run by using a single dummy (or irrelevant) parameter for exchange. In a hybrid MPI/OpenMP calculation, the OpenMP layer on the master process performing the analysis will be limited to the number of threads granted initially even though other MPI processes residing on the same shared-memory environment will be idle during this time. Note that the feature extraction itself, which is the only task performed during the run, does not benefit from threads parallelization except for the calculation of dynamic weights for options 2 or 4 for keyword CDISTANCE. 
Conversely, feature extraction in parallel does occur in an MPI averaging run as outlined above.
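The bookkeeping implied by the concatenation in an MPI averaging run can be sketched as follows. This is a minimal illustration in Python and not CAMPARI code: the function names `replica_boundaries` and `replica_of` are hypothetical, and the sketch assumes that every replica contributes the same number of collected snapshots. It lists the positions in the concatenated trajectory where consecutive snapshots come from different replicas, which is where the spurious transitions sit.

```python
def replica_boundaries(n_replicas, n_snapshots):
    """Indices i (0-based) where snapshots i and i+1 belong to different replicas."""
    if n_replicas < 1 or n_snapshots < 1:
        raise ValueError("need at least one replica and one snapshot per replica")
    # Replica r (1-based) occupies indices (r-1)*n_snapshots ... r*n_snapshots-1.
    return [r * n_snapshots - 1 for r in range(1, n_replicas)]


def replica_of(index, n_snapshots):
    """1-based replica number owning a 0-based index of the concatenated trajectory."""
    return index // n_snapshots + 1


# Four replicas with 250 collected snapshots each (illustrative numbers only).
breaks = replica_boundaries(4, 250)
print(breaks)
```

Any analysis that interprets consecutive snapshots as connected in time (for example, counting transitions) may want to treat these positions separately.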
Because the chosen set of degrees of freedom often is a superset of an unknown subspace of particular interest to the user, CAMPARI offers two common routes for a dimensionality reduction. These rely on standard linear algebra techniques and are available if i) the chosen proximity metric is not circular (this excludes options 1-2 for CDISTANCE); ii) the code was linked to a linear algebra library (LAPACK-compliant, see installation instructions for general information on linking libraries); and iii) there are more samples than variables (degrees of freedom). The reason that circular (periodic) data are currently not supported is that the required measures of variance and in particular covariance become somewhat empirical and laborious to compute. If this type of transformation is performed (→ PCAMODE), CAMPARI produces up to two output files, one containing the eigenvectors themselves (PRINCIPAL_COMPONENTS.evs) and another optional one containing the data matrix in the transformed space (PRINCIPAL_COMPONENTS.dat). The latter can be used to derive probability or free energy surfaces in reduced-dimensional spaces.
If data for structural clustering are collected (→ CCOLLECT), this keyword instructs CAMPARI to calculate and perform a linear transformation on the collected data. As mentioned above, this option is not available for all measures of conformational distance. The linear algebra works straightforwardly for options 3 and 7 for keyword CDISTANCE and always involves centering the data first (subtraction of dimension-wise means). For options 4, 9, and 10 (local weights), the locally adaptive weights are averaged, and the input data are scaled by dimension-wise average weights. The same scaling idea is used for option 8 (global weights). Lastly, the possibility of alignment of 3D coordinates (options 5, 6, and 10 → CALIGN) causes additional complications. The general strategy here is to first align all snapshots to the last one (static alignment), which may or may not provide a meaningful description. Five options are currently available:
- No transformation is performed
- Principal component analysis (PCA) is performed via singular value decomposition (SVD), and the eigenvectors of the covariance matrix are written to a dedicated output file. PCA works by identifying linear transforms of the centered data that collect maximal sample variance in as few components as possible. The principal components are normalized and orthogonal, i.e., have unit length and zero (linear) covariance. The latter should not be equated with a lack of correlation. Many nonlinear correlations between variables yield zero covariance. The amount of variance contained in the first few components can differ dramatically between data sets. The printed eigenvectors and eigenvalues are the only result of this analysis, i.e., the transform is not actually used.
- This is the same as the previous option with an important difference. Here, the original sample data are centered and transformed to PCA space. The transformed data set is written to an additional output file. If keyword CREDUCEDIM is not zero, the original data are overwritten and lost, and any algorithm relying on conformational distance evaluations thereafter will treat these as the simplest case (CDISTANCE becomes 7). This is because the weighting or alignment requests were taken care of before. The benefit of using CREDUCEDIM is to be able to obtain a more informative representation in a space of reduced dimensionality in an unsupervised fashion.
- Time structure-based independent component analysis (tICA) is performed, which is based on original work from the 90s. tICA solves the matrix equation $T F = \Sigma F \Lambda$, where $\Sigma$ is the covariance matrix, $T$ is a time-lagged and symmetrized covariance matrix (lag time is set by keyword CLAGTIME), $F$ is the matrix of eigenvectors, and $\Lambda$ is a diagonal matrix with eigenvalues, which correspond to the values of the autocorrelation function at the specified lag time for the transformed variables. Unlike in PCA, the eigenvectors do not form an orthonormal basis (rather, they satisfy $F^{\top} \Sigma F = I$, with $I$ the identity matrix). This means that, unlike in PCA, the transformed data do not preserve values of Euclidean distances between points even if the full dimensionality is used. As in option 2, the printed eigenvectors and eigenvalues are the only result of this analysis, i.e., the transform is not actually used.
- This option relates to option 4 as option 3 relates to option 2, i.e., the original sample data are transformed to tICA space and centered, with the aforementioned options, implications, and consequences.
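The PCA options above can be illustrated with a small, self-contained sketch. It is not CAMPARI's implementation: instead of an SVD through a linear algebra library, it uses the closed-form eigen-decomposition of a 2x2 covariance matrix, so it is restricted to two-dimensional data, and the function name `pca_2d` and the sample data are made up for illustration. It centers the data, obtains eigenvalues and orthonormal eigenvectors, and projects the centered data onto them.

```python
import math


def pca_2d(samples):
    """PCA of 2D samples; returns (eigenvalues, eigenvectors, transformed data)."""
    n = len(samples)
    if n < 3:
        raise ValueError("need more samples than variables")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    centered = [(x - mean_x, y - mean_y) for x, y in samples]

    # Entries of the symmetric sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    half_trace = 0.5 * (sxx + syy)
    root = math.hypot(0.5 * (sxx - syy), sxy)
    eigenvalues = (half_trace + root, half_trace - root)

    # Rotation angle of the leading eigenvector; the second one is perpendicular.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    v1 = (math.cos(theta), math.sin(theta))
    v2 = (-math.sin(theta), math.cos(theta))

    # Project the centered data onto the eigenvectors (data in PCA space).
    transformed = [(x * v1[0] + y * v1[1], x * v2[0] + y * v2[1]) for x, y in centered]
    return eigenvalues, (v1, v2), transformed


# Made-up, strongly correlated sample data.
data = [(0.0, 0.1), (1.0, 1.1), (2.0, 1.8), (3.0, 3.2), (4.0, 3.9)]
eigenvalues, eigenvectors, transformed = pca_2d(data)
```

Because the eigenvectors are orthonormal, the projection is a rotation of the centered data: the components have zero covariance, and pairwise Euclidean distances are unchanged as long as both components are kept. This is the property that the tICA transform does not share.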
If data for structural clustering are collected (→ CCOLLECT), this keyword defines what type of data to collect and how to define structural proximity. There are currently 10 supported options:
- This option is tailored toward the intrinsic degrees of freedom of a typical CAMPARI simulation
that are also the essential internal degrees of freedom of most molecular systems, i.e. the molecules' dihedral angles.
The values $\{\phi_k\}$ for a set of K dihedral angles are collected throughout the run.
A list can be provided by using a dedicated input file (→ CFILE), otherwise most of
CAMPARI's internal degrees of freedom are used (excluding those pertaining to the conformation of five-membered
rings). The details of the set of eligible dihedral angles are controllable by keyword
TMD_UNKMODE. More information can be found in the
description of the input file. The distance between two states is given as:
$$d_{l \leftrightarrow m} = \left[ \frac{1}{K} \sum_{k=1}^{K} \left( (\phi_k^l - \phi_k^m) \bmod 2\pi \right)^2 \right]^{1/2}$$
Because dihedral angles are periodic (circular) quantities, a meaningful metric of proximity must account for boundary conditions, hence the "mod 2π" term. Dihedral angle-based clustering poses - aside from periodicity - the challenge that all considered degrees of freedom are bounded and that the strongest contribution to the signal will come from those torsions with large variance, which unfortunately are often the ones of least interest (for example sidechain torsions). Therefore, a careful selection of the subset to use is critical for an informative clustering. Like any other method, dihedral angle-based clustering is vulnerable to Euclidean distances in high-dimensional spaces becoming uninformative. Note that all dihedral angle-based proximity criteria are useful primarily for single molecules since relative intermolecular orientations are not representable whatsoever.
- This is identical to the previous option only that each dihedral angle is also associated with a locally adaptive weight. Adaptive weights
are those that change from snapshot to snapshot. Initially, the weights for this representation are set to the effective masses (the
associated diagonal element in the mass matrix, i.e., mass-metric tensor) of a given dihedral angle. Evaluating a distance
requires combining these adaptive weights for 2 respective snapshots l and m, e.g. $w_k^{lm} = f(\mathrm{IM}_k^l, \mathrm{IM}_k^m)$.
The distance between two states will then be given as:
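$$d_{l \leftrightarrow m} = \left[ \left( \sum_{k=1}^{K} w_k^{lm} \right)^{-1} \sum_{k=1}^{K} w_k^{lm} \left( (\phi_k^l - \phi_k^m) \bmod 2\pi \right)^2 \right]^{1/2}$$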
The actual values for the weights for individual snapshots, e.g., $\mathrm{IM}_k^l$, can be altered using keyword CMODWEIGHTS. The function for combining the weights to yield $w_k^{lm}$ is selected with the help of keyword CWCOMBINATION. Adaptive weights are generally normalized per snapshot (such that $\sum_{k=1}^{K} \mathrm{IM}_k^l$ evaluates to 1.0 for all l). This is different from what is described in 2 reference publications (see here and here). Any type of weighting scheme (static or adaptive) can be used to remedy the problem with the previous option regarding the impact of "uninteresting" degrees of freedom. The weighting with the effective masses ensures that slow degrees of freedom (e.g. central backbone torsions) will contribute much more to the overall signal than sidechain torsions. This effect becomes exacerbated for long chains. There are two additional caveats. First, the initial mass matrix-based weights are affected by the choice for ALIGN. Second, dihedral angles describing disulfide bonds are supported, but the presence of disulfide bonds destroys the notion of the effective masses (see CRLK_MODE for some background). The default weights for the Cα-Cβ-S-S and Cβ-S-S-Cβ torsions are simply set to 1.0. This means that a meaningful use of this option while selecting disulfide bonds as part of the representation requires setting CMODWEIGHTS to something other than 0.
- This option is largely identical to option 1. It carries all the same caveats with the exception of the periodicity
of dihedral angles. Here, we expand each dihedral angle into its sine and cosine terms to construct a distance metric as follows:
$$d_{l \leftrightarrow m} = \left[ \frac{0.5}{K} \sum_{k=1}^{K} \left( (\sin\phi_k^l - \sin\phi_k^m)^2 + (\cos\phi_k^l - \cos\phi_k^m)^2 \right) \right]^{1/2}$$
Note that the sine and cosine terms of the same angle are nonlinearly but strictly correlated. This has consequences for the interpretation of dimensionality in this representation.
- This is the analogous modification of the previous option by introducing locally adaptive weights that are initially composed from the effective masses
and can be altered by keyword CMODWEIGHTS:
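$$d_{l \leftrightarrow m} = \left[ 0.5 \left( \sum_{k=1}^{K} w_k^{lm} \right)^{-1} \sum_{k=1}^{K} w_k^{lm} \left( (\sin\phi_k^l - \sin\phi_k^m)^2 + (\cos\phi_k^l - \cos\phi_k^m)^2 \right) \right]^{1/2}$$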
Note that this implies the presence of only a single weight per pair of Fourier terms.
- This option is probably the most commonly used variant, the positional RMSD. The
Cartesian position vectors $\{\mathbf{r}_k\}$ for a set of K atoms
are collected throughout the run. A list can be provided by using a dedicated input file
(→ CFILE), otherwise all atoms in the system are used.
The distance between two states is then given as:
$$d_{l \leftrightarrow m} = \left[ \frac{1}{K} \sum_{k=1}^{K} \left( \mathbf{r}_k^l - \mathrm{RoTr}(\mathbf{r}_k^m) \right)^2 \right]^{1/2}$$
Here, RoTr is meant to indicate rotation and translation operators that superpose the $\{\mathbf{r}_k\}^m$ optimally with the frame provided by the $\{\mathbf{r}_k\}^l$. This alignment uses the same quaternion-based algorithm mentioned elsewhere. Superposition (alignment) implies that the atomic RMSD is not necessarily a bona fide metric of distance as it is not guaranteed to satisfy $d_{l \leftrightarrow m} \leq d_{l \leftrightarrow p} + d_{p \leftrightarrow m}$, i.e., the triangle inequality. This is because the operator RoTr is different for computing $d_{p \leftrightarrow m}$ than it is for computing the other two distances. In reality, for similar structures, this is never really a problem in the context of clustering. RMSD-based clustering is - like any other method - vulnerable to Euclidean distances in high-dimensional spaces becoming uninformative and - in particular - to obscuring of the signal by uneven variances (a reason why very commonly terminal parts of polymers are excluded from such analyses). The alignment step for both this and the next option can be disabled with the help of keyword CALIGN (RoTr is then simply the identity operator). Without alignment, external degrees of freedom become part of the distance criterion. The coordinate-based RMSD is generally difficult to use for sets of atoms spanning multiple molecules since intermolecular motion can easily provide most of the variance in the signal. In periodic boundary conditions, there is a particular difficulty of which image of a molecule to use. Keyword XYZ_REFMOL is supported in this context and can be used to circumvent this problem (although it should be kept in mind that there is no unique solution for assemblies of more than 2 molecules).
- This is similar to the previous option, and is only relevant if alignment is performed.
Then, this option allows the user to split the atomic index sets used for alignment and distance computation, i.e., the
alignment operator, RoTr, minimizes pairwise distances computed over an independent set of atoms that can either be a superset,
subset or completely different set of atoms than the one specified via CFILE.
Then, if we term the distance set {D} and the alignment set {A}, with {A} to be provided via
ALIGNFILE, the distance between two states will be given as:
$$d_{l \leftrightarrow m} = \left[ \frac{1}{|D|} \sum_{d=1}^{|D|} \left( \mathbf{r}_d^l - \mathrm{RoTr}_{\{A\}}(\mathbf{r}_d^m) \right)^2 \right]^{1/2}$$
Note that choosing disparate sets can easily destroy the fundamental meaning of alignment, i.e., the removal of differences caused purely by external (rigid-body) degrees of freedom. This in turn would almost certainly lead to violations of the assumption that members of different clusters are dissimilar, and can also eliminate the notion of similarity amongst members of the same cluster. Conversely, it can be useful in improving the signal-to-noise ratio for cases where one is interested in states populated by a specific part of a much larger system that moves as a single entity (specifically, states characterized by relative arrangements of parts of a system may emerge more clearly if alignment is performed on the whole entity, but distances are computed only over a small portion of interest). Note that errors in calculations relying on mean cluster properties computed for example in the tree-based algorithm or hierarchical clustering (→ CMODE) using mean linkage can easily become large if the two atom sets have little overlap. Specifically, a cluster of similar snapshots as determined by the distance set, which is constituted by elements with large differences in the alignment set, will produce deteriorating accuracy of, for example, computing a snapshot's mean distance to it. This is because the heterogeneity of the alignment operator is masked by the simplified algebra used to compute these properties in constant time. The general caveats for RMSD-based clustering mentioned for option 5 above remain relevant as well.
- Let us define a set of K interatomic distances, $\{r_{ij}\}$, over unique atom pairs i and j. These distances
are collected throughout the run. A list can be provided by using a dedicated input file
(→ CFILE), otherwise a subset of randomly selected but unique interatomic distances
is used. The number of randomly selected degrees of freedom is usually set to 3N where N is the number
of atoms (it can be smaller for small N). Because the $\{r_{ij}\}$ are geometric distances, they are also positive
and potentially large. CAMPARI allows a functional transform to be applied during data collection,
which, generally speaking, allows focusing the sensitivity on particular distance regimes. If we consider this transformed
set of distances, $\{f(r_{ij})\}$ ($f(x)$ can be the identity function of course, which is also the default),
the distance between two states will then be given as:
$$d_{l \leftrightarrow m} = \left[ \frac{1}{K} \sum_{k=1}^{K} \left( f(r_{ij(k)}^l) - f(r_{ij(k)}^m) \right)^2 \right]^{1/2}$$
I.e., the chosen distance metric is simply the root mean square deviation across the set of transformed interatomic distances. Distance-based clustering inherently removes external degrees of freedom from the proximity measure, and it is therefore suitable for most applications. As with any other measure, Euclidean distances in high-dimensional spaces may become uninformative and results may be obscured by uneven variances.
- This is identical to the previous option only that each distance, which is potentially transformed,
is subjected to a static weight. This weight is computed initially from the combined mass of
the constituting atoms. The distance between two states would then be given as:
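$$d_{l \leftrightarrow m} = \left[ \left( \sum_{k=1}^{K} (m_{i(k)} + m_{j(k)}) \right)^{-1} \sum_{k=1}^{K} (m_{i(k)} + m_{j(k)}) \left( f(r_{ij(k)}^l) - f(r_{ij(k)}^m) \right)^2 \right]^{1/2}$$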
Here, $m_i$ denotes the mass of atom i, and $f(x)$ is the function used for transformation (defaults to the identity function). These (static) weights are not particularly useful in the default form but can be altered by changing masses, e.g., by a suitable patch, or by means of the dedicated facility (keyword CMODWEIGHTS). They are normalized such that similar distance thresholds can be used as in the unweighted case.
- This is identical to option 7 above only that each interatomic distance, which is potentially transformed,
is subjected to a locally adaptive weight (as in options 2 and 4 above).
These weights increase the corresponding memory demands by a factor of 2 and are all initialized to be unity.
It is necessary to use the dedicated facility (keyword CMODWEIGHTS) to make
them meaningful. All localized weights available for interatomic distances require at least a window size parameter
and a rule for how to combine weights from different snapshots. The latter is expressed as function $g(x_1, x_2)$ specified by keyword CWCOMBINATION.
The resultant functional form for pairwise distance between snapshots is:
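$$d_{l \leftrightarrow m} = \left[ \left( \sum_{k=1}^{K} g(\Omega_k^l, \Omega_k^m) \right)^{-1} \sum_{k=1}^{K} g(\Omega_k^l, \Omega_k^m) \left( f(r_{ij(k)}^l) - f(r_{ij(k)}^m) \right)^2 \right]^{1/2}$$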
Here, $\Omega_k^l$ is the locally adaptive weight for the k-th feature and the l-th snapshot, and $f(x)$ is the function used for transformation (defaults to the identity function). The same general caveats apply as for options 2 and 4 above. In particular, it is important to reemphasize that all locally adaptive weights are now normalized per snapshot in contrast to the descriptions found in the literature (see here and here).
- This is similar to options 5 and 9 above. Here, each of the 3K Cartesian coordinates, X, of a system of K selected atoms is subjected to a separate, locally adaptive weight.
Due to the presence of these weights, pairwise alignment is currently not
supported for this option. CAMPARI computes the Euclidean distance between snapshots, which means that any type of input data
can be analyzed straightforwardly by transcribing the data set into a fake trajectory of atoms with each Cartesian coordinate
corresponding to an input data dimension.
The locally adaptive weights increase the corresponding memory demands by a factor of 2 and are all initialized to be unity.
It is necessary to use the dedicated facility (keyword CMODWEIGHTS) to make
them meaningful. As for option 9, weights require at least a window size parameter
and a rule for how to combine them for different snapshots. The latter is expressed as function $g(x_1, x_2)$ specified by keyword CWCOMBINATION.
The resultant functional form for pairwise distance between snapshots is:
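$$d_{l \leftrightarrow m} = \left[ 3 \left( \sum_{k=1}^{3K} g(\Omega_k^l, \Omega_k^m) \right)^{-1} \sum_{k=1}^{3K} g(\Omega_k^l, \Omega_k^m) \left( X_{(k)}^l - X_{(k)}^m \right)^2 \right]^{1/2}$$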
For the weighting aspect, the same caveats apply as for options 2, 4, and 9 above. Due to the distance definition relying on absolute coordinates, the caveats mentioned for option 5, which relate to atom sets encompassing multiple molecules, remain relevant as well.
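Some of the distance definitions above can be written down compactly. The following Python sketch is an illustration under stated assumptions and is not taken from the CAMPARI source: all function names are hypothetical, angles are assumed to be in radians, the "mod 2π" term is interpreted as the smallest signed angular difference, and the power mean used for combination modes other than -1, 0, 1, -999, and 999 is an assumption. It covers the functional forms of options 1 and 3 and the weighted form shared by the locally weighted options (shown on plain feature differences, as in option 9 with the identity transform).

```python
import math

TWO_PI = 2.0 * math.pi


def periodic_diff(a, b):
    """Smallest signed difference between two angles in radians, in [-pi, pi)."""
    return (a - b + math.pi) % TWO_PI - math.pi


def dihedral_distance(phi_l, phi_m):
    """Option 1 form: root mean square of periodic angle differences."""
    k = len(phi_l)
    return math.sqrt(sum(periodic_diff(a, b) ** 2 for a, b in zip(phi_l, phi_m)) / k)


def fourier_distance(phi_l, phi_m):
    """Option 3 form: distance over sine and cosine terms, no periodicity issue."""
    k = len(phi_l)
    total = sum((math.sin(a) - math.sin(b)) ** 2 + (math.cos(a) - math.cos(b)) ** 2
                for a, b in zip(phi_l, phi_m))
    return math.sqrt(0.5 * total / k)


def combine(w1, w2, mode=1):
    """Combine two positive weights: -1 harmonic, 0 geometric, 1 arithmetic mean,
    -999 minimum, 999 maximum; other values use a power mean (assumption)."""
    if mode == -999:
        return min(w1, w2)
    if mode == 999:
        return max(w1, w2)
    if mode == 0:
        return math.sqrt(w1 * w2)
    if mode == 1:
        return 0.5 * (w1 + w2)
    if mode == -1:
        return 2.0 * w1 * w2 / (w1 + w2)
    return (0.5 * (w1 ** mode + w2 ** mode)) ** (1.0 / mode)


def weighted_distance(x_l, x_m, w_l, w_m, mode=1):
    """Weighted form: per-dimension combined weights, normalized by their sum."""
    g = [combine(a, b, mode) for a, b in zip(w_l, w_m)]
    norm = sum(g)
    if norm <= 0.0:
        raise ValueError("combined weights must not all vanish")
    return math.sqrt(sum(gk * (a - b) ** 2 for gk, a, b in zip(g, x_l, x_m)) / norm)
```

For two angles just on either side of the ±π boundary, `dihedral_distance` returns the short way around the circle, whereas a naive difference would report nearly 2π. `fourier_distance` avoids the boundary altogether at the price of two correlated terms per angle.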
If data for structural clustering or related analyses are to be collected (→ CCOLLECT), this keyword provides the path and location to an input file selecting a subset of the possible coordinates. For options 1-4 of the proximity measure, this file is a single column list of indices specifying specific system torsions (see elsewhere). For options 5, 6, and 10 it is a single column list of atomic indices (see elsewhere). Lastly, for options 7-9, it is a list of pairs of atomic indices (two columns, see elsewhere). The keyword can take on an additional meaning if instantaneous output of RMSD values is requested through ALIGNCALC. In this context, CFILE specifies an atomic index set as for option 6 of CDISTANCE. In a small molecule screen, the presence of such an input file is mandatory if the structural clustering facility is intended to be used. Details are provided below.
CALIGN
If structural clustering is performed (→ CCOLLECT), and an atomic RMSD variant is chosen as the proximity measure (→ CDISTANCE), this keyword can be used to specifically disable the alignment step that occurs before the actual RMSD of the two coordinate sets is computed. To achieve this, provide any value other than 1 (the default) for this on/off-type keyword. Note that alignment must be disabled for option 10 to be available for CDISTANCE.
MOL2CLUMODE
If data for structural clustering or related analyses are to be collected (→ CCOLLECT) during a small molecule screen, this keyword serves to amend the functionality normally provided by CFILE. The allowed proximity measures in this execution mode are (currently) 1-5, 7, 9, and 10. For coordinate RMSD-based measures (5 and 10), the options are as follows:
- All atoms are used.
- All atoms except aliphatic hydrogen atoms (absolute value of partial charge below 0.2) are used.
- Only those atoms defining a dihedral angle in the Z matrix are used (in addition to the three atoms forming the "base" of the molecule). Conceptually, this option is best understood with the help of keyword TMDREPORT and associated documentation.
- All dihedral angles are used.
- All dihedral angles except those involving the rotation of only terminal atoms of identical atom type are used. Conceptually, this option is best understood with the help of keyword TMDREPORT and associated documentation.
- This is the same as option 1.
The use of interatomic distances is peculiar in this regard. Here, the input file can contain, in addition to any pairs within the static part, distance pairs referencing the first atom of the screened molecules. All the partner atoms coming from the static part defined by these pairs are then paired with all atoms from the small molecules as per the selection for MOL2CLUMODE explained above. Of course, it is, for all three classes, possible to request a clustering representation composed purely of the static part of the system, but this will seldom be meaningful.
If data for structural clustering or related analyses are to be collected (→ CCOLLECT) during a small molecule screen, this keyword defines a fractional size threshold for clusters. The centroids of clusters exceeding this threshold, i.e., those that have more members than MOL2THRESH*N, where N is the number of snapshots collected for the molecule in question, are written to the main output file of the screen if a suitable option for keyword MOL2OUTMODE is chosen.
CDISTRANSFORM
If structural clustering is performed (→ CCOLLECT), and the raw features are interatomic distances (CDISTANCE is 7, 8, or 9), this keyword can be used to specify a function that stores a transformation of the interatomic distance as the feature (this is the function $f(x)$ in the description above). Options are as follows:
- $f(x) = x$: This is the identity function and the default.
- $f(x) = 1.0 - \left[ 1.0 + \exp(-(x-\chi)/\tau) \right]^{-1}$: This is a sigmoidal function decaying from a maximum value of 1.0 to 0.0 with increasing distance. The step is centered at $\chi$ and the sharpness is given by $\tau$. Vanishing values for $\tau$ give a transform that is equivalent to a contact map transform. For larger values of $\tau$, the values of $f(x)$ when x is small will increasingly deviate from 1.0 (be smaller).
- $f(x) = \left[ x + r_{\mathrm{buf}} \right]^{-1.0/h_{\mathrm{exp}}}$: This is a hyperbolic function. At an interatomic distance of $1.0 - r_{\mathrm{buf}}$, the value is always 1.0. Smaller values will diverge, which should be avoided by choosing at least 1.0 for this parameter. At larger distances the values are very small and approach 0.0 asymptotically. The rate of approach is controlled by $h_{\mathrm{exp}}$, and smaller values give a faster approach. Note that the so-called DRID metric (reference) fundamentally relies on hyperbolic transforms where $r_{\mathrm{buf}}$ is 0.0 and $h_{\mathrm{exp}}$ is 1, 1/2, or 1/3.
- $f(x) = 1.0$ if $x > r_{\mathrm{cut}}$ and $f(x) = \sin(x \cdot 0.5\pi / r_{\mathrm{cut}})$ otherwise: This is a piecewise function that for small distances resembles the identity function (with an effective scale) before tapering off to a constant value. The point where the function becomes 1.0 exactly is $r_{\mathrm{cut}}$.
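The three nontrivial transforms listed above translate directly into code. The Python sketch below only restates the formulas; the function and parameter names (`chi`, `tau`, `r_buf`, `h_exp`, `r_cut`) are chosen here for illustration and are not CAMPARI identifiers. Distances are assumed to be in Å.

```python
import math


def sigmoid_transform(x, chi, tau):
    """Sigmoidal transform: close to 1.0 well below chi, close to 0.0 well above it."""
    # Note: math.exp raises OverflowError for arguments above roughly 709,
    # i.e., for x far below chi combined with a very small tau.
    return 1.0 - 1.0 / (1.0 + math.exp(-(x - chi) / tau))


def hyperbolic_transform(x, r_buf, h_exp):
    """Hyperbolic transform: equals 1.0 where x + r_buf == 1.0 and decays beyond."""
    return (x + r_buf) ** (-1.0 / h_exp)


def sine_transform(x, r_cut):
    """Piecewise sine transform: rises from 0.0 and stays at 1.0 beyond r_cut."""
    if x > r_cut:
        return 1.0
    return math.sin(x * 0.5 * math.pi / r_cut)


# Illustrative parameter values only.
for distance in (2.0, 5.0, 8.0):
    print(distance,
          sigmoid_transform(distance, chi=5.0, tau=1.0),
          hyperbolic_transform(distance, r_buf=0.0, h_exp=0.5),
          sine_transform(distance, r_cut=6.0))
```

With a small `tau`, the sigmoidal transform approaches a step function at `chi`, which is the contact map limit mentioned above.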
If structural clustering is performed (→ CCOLLECT), the raw features are interatomic distances (CDISTANCE is 7, 8, or 9), and a transform other than the identity function is used, this keyword sets a shift parameter. For the sigmoidal function, this is the parameter $\chi$, for the hyperbolic function, it is the parameter $r_{\mathrm{buf}}$, and for the sine transform, it is the parameter $r_{\mathrm{cut}}$. The equations are given above. The value is to be provided in Å and can be zero or positive.
CDISTRANS_P2
If structural clustering is performed (→ CCOLLECT), the raw features are interatomic distances (CDISTANCE is 7, 8, or 9), and either a sigmoidal or a hyperbolic transform is used, this keyword sets a scale (or width) parameter. For the sigmoidal function, this is the parameter $\tau$ and for the hyperbolic function it is the parameter $h_{\mathrm{exp}}$. The equations are given above. The value is either to be given in Å (sigmoidal) or unitless (hyperbolic) and must be positive.
CWCOMBINATION
If data for structural clustering or related analyses are collected (→ CCOLLECT) and locally adaptive weights are in use, this keyword sets the function to be used for combining locally adaptive weights from different snapshots. This is relevant for options 2, 4, 9, and 10 for CDISTANCE. The input is interpreted identically to that for keyword ISQM, i.e., values of -1, 0, and 1 give harmonic, geometric, and arithmetic means, respectively. Values outside of this range can be expected to degrade performance due to expensive powers being evaluated. Special options avoiding most arithmetic altogether simply use the smaller or larger of the two values. In reality, these correspond to the limits of negative and positive infinity, and they are available by selecting -999 and 999, respectively.
CPREPMODE
If data for structural clustering or related analyses are collected (→ CCOLLECT), this keyword offers the user a choice to perform simple data preprocessing operations. Specific options are as follows and are all applied independently for all data dimensions:
- The data are untouched.
- The data are centered (subtraction of the means).
- The data are centered and scaled by the inverse standard deviation. The resultant data are often referred to as standard or Z-scores.
- The data are smoothed by cardinal B-splines of specified order. This operation scales linearly with this order, and it is therefore computationally wasteful to specify very large values (the long tails of the polynomial functions contribute little to the smoothing). Note that virtually no result obtainable from these data is preserved upon smoothing (except the mean), which means that results may become difficult to interpret.
- The data are centered and smoothed.
- The data are converted to Z-scores and then smoothed.
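The first three preprocessing choices (untouched, centered, Z-scores) can be sketched as below. This is an illustrative Python sketch and not CAMPARI's code: the function names are hypothetical, the mode numbers 0, 1, and 2 are an assumed mapping onto the first three list entries, and the standard deviation is computed with the population formula (division by the number of snapshots), which is also an assumption. The B-spline smoothing options are not covered.

```python
import math


def center(column):
    """Subtract the mean of one data dimension."""
    mean = sum(column) / len(column)
    return [x - mean for x in column]


def z_scores(column):
    """Center one data dimension and scale it by the inverse standard deviation."""
    centered = center(column)
    variance = sum(x * x for x in centered) / len(centered)
    if variance == 0.0:
        raise ValueError("a constant dimension cannot be converted to Z-scores")
    sd = math.sqrt(variance)
    return [x / sd for x in centered]


def preprocess(data, mode):
    """data: list of snapshots, each a list of features; every operation is
    applied independently per data dimension (column)."""
    columns = [list(col) for col in zip(*data)]
    if mode == 0:
        result = columns
    elif mode == 1:
        result = [center(col) for col in columns]
    elif mode == 2:
        result = [z_scores(col) for col in columns]
    else:
        raise ValueError("only modes 0-2 are sketched here")
    return [list(row) for row in zip(*result)]


# Three snapshots with two features of very different scale (made-up numbers).
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
scaled = preprocess(data, 2)
```

After conversion to Z-scores, both features contribute with unit variance to any Euclidean distance, irrespective of their original scale.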
If data for structural clustering or related analyses are collected (→ CCOLLECT), data smoothing may be in use. It is enabled by certain choices for keywords CMODWEIGHTS and CPREPMODE. Smoothing currently relies on cardinal B-splines, and this keyword lets the user specify the order of these functions. Cardinal B-splines are also used elsewhere (keywords BSPLINE for the PME method and EMBSPLINE for structural density restraints), but the keywords are completely independent.
CMODWEIGHTS
If data for structural clustering or related analyses are collected (→ CCOLLECT), and either static or locally adaptive weights are in use (options 2, 4, 8-10 for CDISTANCE), it is possible to override the default weights with data-derived information obtained in post-processing. This is required for options 9 and 10 to be meaningful (the locally adaptive weights for these cases are all initialized to be 1.0). Depending on the chosen option, additional parameters may be required. A detailed list is as follows:
- This leaves all weights unchanged.
- This option computes local estimates of the root mean square fluctuation (RMSF) and takes the inverse as the resultant, locally adaptive weight. The window size is chosen by the user. The definition of "local" by proximity in the trajectory itself implies that the data are ordered, usually along a time or similar progress axis. Note that this option is invariant only to data translation (centering). The windowed MSF are computed using an incremental algorithm that has constant cost with window size.
- This replaces weights with weights derived from the autocorrelation function (ACF) evaluated at fixed lag time. The weights are static, i.e., they can be understood as a pre-scaling of the data. For dimensions with a negative ACF at the chosen lag time, the weight is explicitly adjusted to zero, which means that the effective dimensionality can be reduced considerably. As second moments, ACF values are noisy and generally more reliable at short lag time. For options 2 and 4 for CDISTANCE, the resultant weight is always the larger of the two obtained for sine and cosine terms. The ACF is invariant under data translation and global scaling operations.
- This option computes a composite weight by taking the square root of the product of the ACF at fixed lag time (as for option 2) and the inverse RMSF over a window of specified size (as for option 1).
- This option defines locally adaptive weights based on crossings of the global mean. Specifically, for each dimension, the global data mean is computed. Over a window of a user-defined size, it is then counted how many times the value of that dimension crosses the mean. Each data point receives a weight based on a window centered at this point in terms of the trajectory. The definition of "local" by proximity in the trajectory itself implies that the data are ordered, which is most often but not necessarily by time. Because it is possible that the count is zero, the resultant, locally adaptive weights are computed as $(n_{\mathrm{cross}} + a)^{-1}$, where $n_{\mathrm{cross}}$ is the aforementioned number of crossings of the global mean and $a$ is a user-defined buffer parameter (see keyword CTRANSBUF). For options 2 and 4 for CDISTANCE, the resultant weight is always the larger of the two obtained for sine and cosine terms. The idea behind this type of weight is to deemphasize data dimensions sampling roughly symmetric distributions with a single peak and to emphasize data dimensions sampling multimodal distributions with locally small variance. False negatives can be produced if the global mean happens to coincide with one of the peaks of a multimodal distribution. These weights are exceptionally simple, can be computed efficiently and with high accuracy for large data sets, and require no additional parameters beyond the window size. They are also invariant for data translation and global scaling.
- This option is the same as the previous one (#4) except that the data are smoothed for the purpose of generating weights. This leaves the original data untouched, i.e., it does not imply data smoothing in general (see CPREPMODE for the latter). The smoothing entails an additional parameter, viz., CSMOOTHORDER.
- This option is a combination of options #2 and #4. The final, locally adaptive weights correspond to the square root of the product of the static weights derived from the ACF at fixed lag time and the weights derived from crossings of the global means within windows of user-defined size.
- This option is the same as the previous one (#6) except that the data are smoothed for the purpose of generating the local component of the weights (based on crossings of the mean). This does not imply data smoothing in general. The smoothing entails an additional parameter, viz., CSMOOTHORDER.
- Similar to option #4, this defines locally adaptive weights based on counting crossings. Here, a histogram is created for each data dimension (fixed number of 100 bins). From the histogram, CAMPARI automatically locates minima in the histogram (at least 3 bins to either side have to have larger counts). Over a window of user-defined size, crossings of any of these minima are counted, and the weight is constructed as $w_{\mathrm{max}} / (n_{\mathrm{cross}} + 1)$. Here, $w_{\mathrm{max}}$ is an adjusting weight. Each minimum splits the data into two fractions of unequal size, and $w_{\mathrm{max}}$ is the maximum of the smaller fractional populations across all minima. If no minima are found, this option reverts to option #4 for the dimension in question. The histogram construction and minima parsing mask many parameters that cannot be controlled by the user at the moment. Histograms are constructed in a way that makes these weights invariant for shifted and scaled data. This option is marred primarily by the lack of both robustness and significance of the minima detection procedure. The meaning of keyword CTRANSBUF is preserved in exactly the same way as for option #4.
- This option is the same as the previous one (#8) except that the data are smoothed for the purpose of generating weights. This leaves the original data untouched, i.e., it does not imply data smoothing in general. The smoothing entails an additional parameter, viz., CSMOOTHORDER.
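The window-based counting of crossings that underlies several of the locally adaptive options can be illustrated with a minimal Python sketch. It is not CAMPARI's implementation: the function name `crossing_weights`, its arguments, and the handling of values lying exactly on the mean are assumptions made for illustration. It counts crossings of the global mean in a window centered on each snapshot, uses a weight of the form 1/(n+a) with a buffer a, and copies the weight of the first (or last) snapshot with a complete window to the points near the edges.

```python
def crossing_weights(series, window, buffer=1.0):
    """Per-snapshot weights 1/(n_cross + buffer) for one data dimension.

    n_cross counts crossings of the global mean inside a window centered
    on the snapshot. All names here are hypothetical (illustration only).
    """
    n = len(series)
    if window < 2 or window > n // 2:
        raise ValueError("window must be at least 2 and at most half the data length")
    mean = sum(series) / n
    # crossed[i] is 1 if the signal changes sides of the mean between i and i+1
    crossed = [1 if (series[i] - mean) * (series[i + 1] - mean) < 0 else 0
               for i in range(n - 1)]
    half = window // 2
    first, last = half, n - 1 - half  # first/last snapshot with a complete window
    weights = [0.0] * n
    for i in range(first, last + 1):
        n_cross = sum(crossed[i - half:i + half])
        weights[i] = 1.0 / (n_cross + buffer)
    # Edge snapshots inherit the weight of the nearest complete window
    for i in range(first):
        weights[i] = weights[first]
    for i in range(last + 1, n):
        weights[i] = weights[last]
    return weights


# A rapidly oscillating stretch followed by two slow stretches
data = [1, -1] * 4 + [1] * 4 + [-1] * 4
w = crossing_weights(data, window=4)
```

In this toy series, the rapidly oscillating part receives lower weights than the slowly varying part, which is the intended effect of weights based on transition counts.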
If data for structural clustering or related analyses are collected (→ CCOLLECT), and certain types of locally adaptive weights are in use, this keyword sets the window size (in numbers of snapshots) from which to obtain the weight. Each snapshot is given a weight derived from data in a window centered around that point. This makes sense primarily if the data are in a specific order; most often, they are assumed to be sorted by time. Points toward the beginning (or end) of the data set all obtain the same weight as the first (or last) snapshot to have access to a complete window. This implies that windows should generally be much smaller than the data set length (they can at most extend to half the data set length). This keyword is relevant for locally adaptive weights based on variances and transition counts that can be selected via CMODWEIGHTS.
CLAGTIME
If data for structural clustering or related analyses are collected (→ CCOLLECT), the autocorrelation function (ACF) at fixed lag time can play a role, and this lag time is set by CLAGTIME. This is relevant if either static or locally adaptive weights are in use (options 2, 4, 8-10 for CDISTANCE), or if time structure-based independent component analysis (tICA) is performed (see PCAMODE). This keyword sets the time (in numbers of snapshots) to be used for this purpose.
In the case of weighted distance functions, the ACF is evaluated for each dimension independently and assumes a single, generating process:
$$\mathrm{ACF}(\tau) = \frac{\sum_{n=1}^{N-\tau} \left(X^{(k)}(n)-\mu^{(k)}\right)\left(X^{(k)}(n+\tau)-\mu^{(k)}\right)}{\left(\sigma^{(k)}\right)^{2}\,(N-\tau)}$$
Here, the global data mean and variance, $\mu^{(k)}$ and $(\sigma^{(k)})^{2}$, are estimated directly from the data for each dimension. Note that fewer data are available for large τ. Importantly, negative values for the ACF are all set exactly to zero, meaning that these data dimensions are eliminated from distance evaluations. When applied to dihedral angles (options 2 or 4), the ACF is always evaluated separately for sine and cosine terms to avoid ambiguous definitions of variance for circular variables. The weight is then set to the larger of the two values. In case of tICA, the ACF features as a time-lagged covariance matrix that is computed for simple, centered data (no circular variables, no pairwise alignment, no locally adaptive weights). No corrections and truncations are applied to this matrix.
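As a minimal sketch of the weight derived from the ACF at fixed lag time, the following Python functions evaluate the formula for a single data dimension and set negative values to zero. The function names, the use of the population variance, and the treatment of constant signals (weight of zero) are assumptions made for illustration and are not taken from CAMPARI's source code.

```python
def acf_at_lag(x, lag):
    """Autocorrelation of a one-dimensional series at a fixed lag (in snapshots)."""
    n = len(x)
    if not 0 < lag < n:
        raise ValueError("lag must be between 1 and len(x) - 1")
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n  # population variance (assumption)
    if var == 0.0:
        return 0.0  # constant signal: treated as uncorrelated (assumption)
    cov = sum((x[i] - mu) * (x[i + lag] - mu) for i in range(n - lag))
    return cov / (var * (n - lag))


def static_weight(x, lag):
    """Weight for one dimension: the ACF at the given lag, with negative values set to zero."""
    return max(0.0, acf_at_lag(x, lag))


signal = [1.0, -1.0] * 10
w_lag1 = static_weight(signal, 1)  # anticorrelated at lag 1, weight is zero
w_lag2 = static_weight(signal, 2)  # perfectly correlated at lag 2
```

For dihedral angles, the same function would be applied to the sine and cosine series separately, and the larger of the two results would be kept.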
If data for structural clustering are collected (→ CCOLLECT), and a linear transformation is computed and applied (→ PCAMODE), this keyword allows the user to elect to run all further post-processing (→ CMODE) on a data set of reduced dimensionality that corresponds to the first NV data vectors in the transformed space, where NV is set by the choice for this keyword. The components are sorted from largest to smallest eigenvalues such that the maximum amount of variance (PCAMODE is 3) or autocorrelation (PCAMODE is 5) is included.
Note that the transformed data are interpreted as simple, aperiodic signals, i.e., none of the peculiarities for different choices of CDISTANCE are considered any longer (CAMPARI internally converts everything to CDISTANCE being 7, which may lead to confusing output regarding units, etc.). Specifically, for options 4, 9, and 10 for CDISTANCE, the underlying locally adaptive weights are averaged, and the data are pre-scaled by these averages. This means that use of this keyword for those cases changes more than just the dimensionality. Similarly, for options 5 and 6, if alignment is requested, this alignment is performed as a preprocessing step, and the last snapshot of the data is used as reference. Furthermore, for option 6, only the atom set chosen for distance evaluations is retained, and it is from this set that further dimensions are eliminated with the help of this keyword. Note that Euclidean distances are invariant for the full-dimensional transformed data set relative to the original data set in PCA (PCAMODE is 3) but not in tICA (PCAMODE is 5). This of course applies only to the linear transformation and not to any possible preprocessing operations.
If no linear transform is computed, or if the choice for PCAMODE implies that the data transform is not actually computed, this keyword can be used to simply discard dimensions at the end of the internal list of dimensions. This is supported for specialized applications and should not be used unless absolutely needed (use CFILE to control dimensionality precisely). This option does not work with any distance measure requiring alignment. In all cases, if CREDUCEDIM is not specified or set to too large a value, data processing will proceed with the original data and the original size. If linear transforms have been computed, the transformed data are simply written to output file PRINCIPAL_COMPONENTS.dat but not used otherwise.
If data for structural clustering or related analyses are collected (→ CCOLLECT), and weights are in use (options 2, 4, 8-10 for CDISTANCE), it is possible to alter the definition of weights relying on counts of crossings (see option #4 and following for CMODWEIGHTS). In a general functional form of $w = (n+a)^{-1}$, the offset or buffer parameter a is set by this keyword. The default value is 1. Large values will lead to weights with low sensitivity. The limit of CTRANSBUF approaching zero will lead to cases with n=0 receiving all the weight, which is not generally useful.
CDUMP
If data for structural clustering or related analyses are collected (→ CCOLLECT), this simple logical keyword instructs CAMPARI to write the preprocessed set of features, along with possible weights and after any dimensionality reduction, to a binary output file. This file can later be reused in CAMPARI's data mining mode.
CMODE
If data for structural clustering are to be collected (→ CCOLLECT), this keyword allows the user to specify the algorithm by which the accumulated data are to be clustered. Before going into detailed options, a few general words are in order:
- CAMPARI strives to allow the geometric and other net quantities of a collection of snapshots to be computable irrespective of which metric of proximity is chosen (→ CDISTANCE). For options 3 and 7 this is trivial. For option 1, periodicity has to be accounted for. This is solved approximately by i) making sure the proper image of an added snapshot is considered, ii) adding appropriate periodic shifts to the geometric center increments each time a boundary violation is found after updating. Other transforms are corrected accordingly. Options 2 and 4 incur the use of several additional cluster sums (means) due to the dynamic weights (for details, the reader is referred to the source code in clustering_utils.f90: key subroutines are "clustering_distance" or "cluster_addsnap"). For option 5 (atomic coordinate RMSD), the first member of a cluster defines a reference frame. This frame is used for alignment of all subsequently added frames (therefore, the definition and all derived quantities are approximate although the error is usually small for small clusters). Option 6 is a generalization of this allowing for a split of the set of coordinates into an alignment and a distance subset. Option 8 is a mass-weighted equivalent of option 7. Options 9 and 10 are extensions of 7 and 5 to use dynamic weights as for options 2 and 4, which again requires maintaining additional cluster sums.
- With the geometric center being defined, certain properties of a cluster are computable at constant cost with respect
to cluster size. For example, the square of the average distance from the center ("radius") is, in the simplest case, given as:
$$R^{2} = D^{-1} N^{-2} \left[ N \sum_{k=1}^{N} \mathbf{x}_k^{2} - \sum_{k=1}^{N} \mathbf{x}_k \cdot \sum_{k=1}^{N} \mathbf{x}_k \right]$$
Here, $\mathbf{x}_k$ denotes the coordinate vector belonging to the k-th member of the cluster, D is the number of coordinates, and N is the number of members of the cluster. Other properties such as the mean snapshot-to-snapshot distance ("diameter") are similarly available. All that is required is that each cluster accumulates the necessary cluster sums.
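The role of cluster sums can be illustrated with a short Python sketch for the simplest case (plain Euclidean coordinates, no weights, no periodicity, no alignment). The class `ClusterSums` and its method names are hypothetical and only serve to show that the center and the squared radius follow from three accumulated quantities, so that their evaluation does not require revisiting the members.

```python
class ClusterSums:
    """Accumulates N, the sum of coordinate vectors, and the sum of squared norms."""

    def __init__(self, dim):
        self.n = 0                 # number of members N
        self.linear = [0.0] * dim  # sum over members of x_k (per coordinate)
        self.square = 0.0          # sum over members of x_k . x_k

    def add(self, x):
        self.n += 1
        for d, v in enumerate(x):
            self.linear[d] += v
        self.square += sum(v * v for v in x)

    def center(self):
        return [s / self.n for s in self.linear]

    def radius_squared(self):
        # R^2 = [N * sum(x_k^2) - sum(x_k) . sum(x_k)] / (D * N^2)
        dim = len(self.linear)
        dot = sum(s * s for s in self.linear)
        return (self.n * self.square - dot) / (dim * self.n ** 2)


cluster = ClusterSums(dim=2)
cluster.add((0.0, 0.0))
cluster.add((2.0, 0.0))
r2 = cluster.radius_squared()  # 0.5: mean squared distance from the center (1.0) divided by D = 2
```

Merging two clusters only requires adding their respective sums, which is why these quantities remain cheap to maintain during clustering.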
The currently implemented options for CMODE are as follows (the short discussion above applies to all of them except option 4 when CPROGINDMODE is 1):
- The data are clustered according to the leader algorithm. This is a very simple algorithm that sequentially scans the data. Each new snapshot is compared to the center snapshots of preexisting clusters and added to the first one for which a provided distance threshold is satisfied (→ CRADIUS). If no such cluster is found, a new cluster is spawned. Results will be input order dependent and clusters will have ill-defined "centers" since the central snapshot is set at the time the cluster is spawned and remains unchanged. Processing direction(s) can be chosen with the auxiliary keyword CLEADER. The leader algorithm has not been parallelized and is executed by a single thread when CAMPARI's shared memory (OpenMP) parallelization is in use. It generally offers no obvious benefit over the tree-based algorithm below (option 5). The performance of the leader algorithm deteriorates with decreasing threshold because the number of spawned clusters will eventually become significant relative to the number of snapshots.
- The data are clustered according to a modified leader algorithm. This works very similarly to the standard leader algorithm with two important modifications. First, each new snapshot is compared to the current geometric center of preexisting clusters to evaluate the threshold criterion. Second, the result is (optionally → CREFINE) post-processed and snapshots belonging to smaller clusters that would also satisfy the threshold criterion for a larger cluster are transferred to that larger cluster. There are exactly two passes over the data of this refinement step (iteration is difficult and time-consuming due to continuously changing cluster centers). Processing direction(s) can be chosen with the auxiliary keyword CLEADER and the threshold criterion is set via CRADIUS. Modified leader-based clustering tends to generate fewer clusters compared to the standard leader algorithm due to better cluster centers. Due to centers changing position, the maximum snapshot-to-snapshot distance is no longer guaranteed to be below twice the value for CRADIUS (although in typical scenarios violations are very rare). The modified leader algorithm has not been parallelized and is executed by a single thread when CAMPARI's shared memory (OpenMP) parallelization is in use. It generally offers no obvious benefit over the tree-based algorithm below (option 5).
- The data are clustered according to a hierarchical algorithm. In theory, a hierarchical algorithm works by first creating a sorted list of all N(N-1)/2 unique snapshot-to-snapshot distances. Starting with the shortest distance, the two constituting snapshots do one of the following:
  - They spawn a new cluster (if they are both unassigned and the threshold criterion is fulfilled).
  - They merge the two clusters they belong to (if they are both assigned and the threshold criterion is fulfilled).
  - The cluster the previously assigned snapshot is part of is appended with the unassigned snapshot (if one of them is unassigned and the threshold criterion is fulfilled).
  - They terminate the algorithm (if the threshold criterion is not fulfilled).
Because the problem as stated is intractable for large data sets, CAMPARI uses a dedicated scheme to help keep the computation as feasible as possible. In the first step, a snapshot neighbor list is generated that uses a truncation cutoff set by CCUTOFF. The neighbor list generation uses a pre-processing trick that aims to reduce the number of required distance calculations. This pre-processing step relies on a truncated leader algorithm whose target (threshold) cluster size is set by the (borrowed) keyword CMAXRAD. The resultant clusters are then used to screen groups of snapshot pairs and to exclude them from distance computations. Unfortunately, the problem of dimensionality often renders this procedure worthless. In high-dimensional spaces (→ CFILE), volume grows with distance so quickly that the distance spectrum becomes increasingly δ function-like and, in turn, unsuitable for exploiting additive relationships. This stems from conformational distances having a rigorous upper bound for systems in finite volume and with fixed topology. The situation is obfuscated further if many of the dimensions are tightly correlated (such that the effective number of dimensions is indeed lower). Alternatively, this neighbor list can be read in from a previously obtained file (→ NBLFILE). The neighbor list is then further truncated to exactly match the size threshold specified via CRADIUS. For the algorithm to work properly, CCUTOFF has to be at least twice the value of CRADIUS. From this truncated list, a global list is created and sorted according to size. This can be quite memory-demanding. The global list is then fed into the algorithm as described. The results of hierarchical clustering depend very strongly on the linkage criterion (→ CLINKAGE). For many real and high-dimensional data sets, the limitations in both processing time and memory footprint mean that analyses are restricted to thousands of snapshots.
The hierarchical algorithm has not been parallelized and is executed by a single thread when CAMPARI's shared memory (OpenMP) parallelization is in use. This is in part because such a parallelization would not solve the memory problem.
- The data are arranged according to the so-called progress index method described in detail elsewhere (→ reference).
The progress index is a rearrangement of the snapshots such that a given snapshot at position i is added on account of it having
the shortest available distance to any snapshot j<i. In its exact form, this resembles the hierarchical clustering
described directly above. In technical terms, the main (and only real) (hyper)parameter is a specified
criterion of distance. Using this criterion, either the exact minimum spanning tree (MST) or an
approximation to it is constructed for the underlying complete graph constituted by all trajectory snapshots (vertices) and
the N·(N-1)/2 unique, pairwise distances (weighted edges) between them. The spanning tree is a convenient data structure
for deriving the progress index but it is not conceptually fundamental to the algorithm.
Provided a certain starting snapshot, the spanning tree is mined to generate a sequence of snapshots (progress index)
with the above property, i.e., the snapshot added next is the one that has the minimum distance
to any other snapshot already added. The complete progress index has the desirable property that it is likely to group similar objects together
without overlap. In order to work, it requires the sampling density to be sufficiently inhomogeneous, i.e.,
there are enclosed regions (basins) that are sampled preferentially and that consequently have higher point density than the regions
connecting them. It is important to keep in mind that the chosen features along with any
preprocessing steps (e.g., weights) may project/distort the full, underlying phase space.
The method provides an annotation function for the progress index that contains kinetic (or effectively kinetic) information. This function assumes that the evolution of the system is incremental and happens on a continuous manifold. Therefore, apparent jumps in phase space such as those introduced by the replica-exchange methodology may diminish the value of this annotation. There are alternative annotation functions, and some are discussed further in the documentation of the corresponding output file. For practical concerns, there is a methodological choice to pick either the exact or the approximate scheme (→ CPROGINDMODE) in addition to providing a starting snapshot (→ CPROGINDSTART). There are further keywords associated exclusively with this methodology, the most important one being CPROGINDRMAX, which sets the number of search attempts per snapshot per iteration for the approximate scheme (this is the primary controllable determinant of computational cost). Keyword CPROGINDWIDTH is a parameter related to annotations while CBASINMIN and CBASINMAX are related to automatic ways of finding starting snapshots. The approximate scheme runs almost entirely in parallel with excellent efficiency when CAMPARI's shared memory (OpenMP) parallelization is in use. Compared to the published algorithm, there are a few technical tweaks to allow this parallelization: these pertain to the auxiliary clustering (see option 5 below as well as keyword BIRCHMULTI), to search exhaustiveness (→ CPROGRDEPTH), to the handling of points in low-density regions that are not between basins (→ CPROGMSTFOLD), and to the parallel random search procedure itself (→ CPROGRDBTSZ).
- The data are clustered according to a tree-based algorithm (→ reference) that
shares architectural similarities with the BIRCH clustering algorithm.
The tree algorithm implemented in CAMPARI is not focused on memory efficiency but instead keeps the entire data set stored in memory.
The tree is assumed to be of a set height (number of hierarchical levels → BIRCHHEIGHT) that span a provided
range of threshold criteria (upper and lower bounds set by CMAXRAD and
CRADIUS, respectively) for cluster sizes. In the process of
providing a parallel version for CAMPARI's
shared memory (OpenMP) parallelization, a few minor modifications relative to the
published algorithm were also introduced for the serial version. These modifications
are included in the description below.
Briefly, the algorithm consists of three phases. In the first phase, the levels are looped over starting at the root of the tree (the coarsest level)
and going up to the penultimate level.
At each level, every snapshot is added to an existing cluster (if the nearest distance is below the threshold for that level) or it spawns a new one
(if it is not). The metric is defined
by the criterion of distance applied to the snapshot and the geometric center of the cluster.
The key trick is that the search space (in terms of clusters) for a given snapshot is only the (growing) set of children of the cluster
this snapshot belongs to on the previous level. By spacing the thresholds accordingly, it can thus be achieved that the
number of clusters searched per level and snapshot is constant irrespective of data set size. During the first phase, cluster centroids
move on account of snapshots being added (compare the modified leader algorithm above). The first phase can be understood
as "learning the tree" from the data. In the second phase, cluster centroids are frozen and all snapshots are reassigned
to the nearest existing cluster. This again loops over the same levels as phase 1. The second phase can be understood as "binning"
all snapshots into the learned tree. In the third and last phase, the results from phase 2 are used as follows. For a given
cluster on a given level, all the snapshots it contains after phase 2 are subjected to a "local" reclustering at the next finer level, i.e.,
the search space is restricted to the binning results at the next coarser level. Phase 3 is executed for the penultimate level,
which gives rise to the finest (leaf level) clustering, and potentially additional levels (see BIRCHMULTI).
As for refinement, the challenge is to find protocols that do not exceed the time/space complexity of the algorithm itself.
Currently, there is only one type of optional refinement step that will locally merge leaf clusters
that have different but proximal parent clusters if the diameter of the joint cluster decreases upon merging (relative to
the individual values).
Except for refinement, the tree-based algorithm has been fully OpenMP-parallelized. While phases 2 and 3 are relatively straightforward in this regard, phase 1 is trickier. To achieve stable results that are identical to those of the (slightly modified) serial version, the coarsest level can only be addressed by a single thread at a time, which obviously impairs load balance. This requires keyword BIRCHCHUNKSZ to be 1, which is the default. In parallel, the clustering can alternatively try to achieve better load balance by a divide-and-merge scheme described in the context of keywords BIRCHCHUNKSZ and CMERGEDIAM. In general, the tree-based algorithm is extremely fast and will generate more clusters than, for example, the leader algorithm with the same setting for CRADIUS. However, the cluster distribution is altered nonuniformly: the largest clusters in the tree-based algorithm will often be larger, but the number of very small clusters (1-5 snapshots) will increase substantially, especially for large height. Overall, the clusters tend to be substantially tighter. In essence, the multiple hierarchical levels act, metaphorically speaking, as a layered array of filters that creates a resultant net pore size that is smaller than that of any one of the filters by itself.
- The data are assumed to be clustered already, and the clustering is read from a dedicated input file. This takes a list of snapshot-to-cluster mappings that is identical to the simple output produced in STRUCT_CLUSTERING.clu. This mode is obviously redundant unless further operations dependent on the clustering are performed, e.g., obtaining a graph output file, network-based reweighting, calculating a cut-based free energy profile, and so on. The actual data are read and fully processed. This has two important consequences. First, the clustering is completely reconstructed including information about geometric sizes and distances. Second, the time savings relative to redoing the clustering may not be significant (e.g., for typical applications with large data sets, the tree-based clustering itself takes little time compared to reading and processing the input trajectory). Note that the setting for CRADIUS is meaningful in this execution mode: both for computing cluster quality (reported to log-output) and for options 8/-8 for keyword CADDLINKMODE.
- This is the same as the previous option, i.e., the data are assumed to be clustered already, and the clustering is read from a dedicated input file. The important difference is that the actual data are not read. This means that information about geometric sizes and distances of clusters and snapshots is not present, and any options relying on this information are either disabled or redundant. This option is useful if repeated graph-based operations are to be performed on a fixed clustering, e.g., changing the lag time of a Markov state model and recalculating the steady state. It can reduce both execution time and memory usage dramatically. In this mode, CAMPARI's normal functionality is entirely skipped (although it is still required to define a(n arbitrary) system, which can lead to (irrelevant) warnings and messages being printed).
Note that the connectivity map for snapshots always refers to the actual data that ends up in memory. This is controlled by NRSTEPS, CCOLLECT, EQUIL, and, if present, an input file with subsets of frames. It is consequently up to the user to ensure that the constructed network model remains meaningful.
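For the progress index option of CMODE described above, the defining property (the snapshot added next is the one with the minimum distance to any snapshot already added) can be sketched in a few lines of Python. This is a naive, exact O(N²) variant written for illustration only; it does not use a neighbor list or an explicit spanning tree, it is unrelated to CAMPARI's approximate scheme, and the function name and plain Euclidean distance are assumptions.

```python
import math


def progress_index(points, start=0):
    """Return (order, distances): the snapshot sequence and the distance at which each was added."""
    n = len(points)
    best = [math.inf] * n  # shortest distance to any snapshot already in the index
    added = [False] * n
    order, distances = [], []
    best[start] = 0.0
    current = start
    for _ in range(n):
        added[current] = True
        order.append(current)
        distances.append(best[current])
        nxt = None
        for j in range(n):
            if added[j]:
                continue
            d = math.dist(points[current], points[j])
            if d < best[j]:
                best[j] = d
            if nxt is None or best[j] < best[nxt]:
                nxt = j
        current = nxt
    return order, distances


# Two well-separated groups on a line: {0, 1, 2} and {10, 11}
snapshots = [(0.0,), (1.0,), (10.0,), (11.0,), (2.0,)]
order, distances = progress_index(snapshots, start=0)
# order is [0, 1, 4, 2, 3]: the first group is exhausted before the second is entered
```

The order of additions is the same as in Prim's algorithm for the minimum spanning tree, which is why the spanning tree is a convenient data structure for deriving the progress index.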
If data for structural clustering are to be collected (→ CCOLLECT), this keyword lets the user provide the name and location of a file containing a series of integers to be interpreted as the main input for file-based clustering. This file should be formatted identically to the output file STRUCT_CLUSTERING.clu written by CAMPARI itself. The associated keyword CLUFILECOL can be used to pick from the possibly multiple columns in the input file. Some more details are provided elsewhere.
The limitations are as follows. First, trajectory analysis mode has to be enabled. Second, no MPI support exists if the data are not also read. Third, the file should be analyzed with settings that pertain to the underlying data and are not remapped to the file itself (i.e., if the file is produced by CAMPARI itself, input settings should be the same as the settings for the original clustering). For example: Using NEQUIL 1000 and CCOLLECT 10, a trajectory of 10000 snapshots would give rise to 900 entries for the output in STRUCT_CLUSTERING.clu (corresponding to snapshots 1010, 1020, 1030, ..., 10000). While it would be possible to reset NEQUIL to 0, NRSTEPS to 900, and CCOLLECT to 1 to read these data back in, doing so would destroy the validity of any auxiliary input file that refers to snapshot numbers in the original trajectory. Files that do so irrespective of the setting for CCOLLECT and other modifiers are files with additional breaks, trace files from PIGS runs, or files with additional links. Instead, keywords should be left at their original settings. This is because the snapshot connectivity structure is inferred independently and before the clustering (irrespective of algorithm) is performed or (if CMODE is not 7) the data are even read.
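The arithmetic of this example can be checked with a few lines of Python. The selection rule used here (a snapshot is collected if its number exceeds the equilibration length and is a multiple of the collection interval) is an assumption chosen to reproduce the numbers quoted above; the function name is hypothetical.

```python
def collected_snapshots(n_steps, n_equil, collect_every):
    """Original snapshot numbers (1-based) that end up in memory for clustering."""
    return [s for s in range(1, n_steps + 1)
            if s > n_equil and s % collect_every == 0]


frames = collected_snapshots(n_steps=10000, n_equil=1000, collect_every=10)
# 900 entries: 1010, 1020, ..., 10000
# Row i of the cluster file (0-based) refers to original snapshot frames[i],
# which is the mapping that auxiliary files with snapshot numbers rely on.
```

Resetting the keywords to read the 900 rows as if they were snapshots 1 to 900 would discard this mapping, which is why the original settings should be kept.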
If data for structural clustering are to be collected (→ CCOLLECT), and file-based clustering has been selected (CMODE is either 6 or 7), this keyword allows the user to pick a particular column (default is 1) from the input file. The input file format is the same as that of output file STRUCT_CLUSTERING.clu, and some more details are provided elsewhere.
If the clustering was generated by the tree-based clustering with a number of informative resolutions (see keywords BIRCHHEIGHT and BIRCHMULTI), CLUFILECOL can be used to quickly calculate network-derived properties for a number of resolutions, which can be used to assess the robustness of a result or to find the best resolution given a target property.
If data for structural clustering are to be collected (→ CCOLLECT), and an algorithm is used that requires a rigorous snapshot neighbor list (currently either hierarchical clustering or the exact variant of the progress-index based scheme → CMODE), this keyword defines the cutoff distance for said neighbor list. It is critical to choose an appropriate (as small as possible) value for this parameter as otherwise CAMPARI will both run out of (virtual) memory and create very large files that are written to disk. Note that even with a minimal setting, the problem of computing and storing the neighbor list can very easily become intractable. Often, simulation data in high-dimensional spaces will be clustered very unevenly in space, meaning that multiple "length scales" in distance space matter. This is detrimental to a neighbor list relying on defining a single, specific length scale through CCUTOFF.
NBLFILE
If data for structural clustering are to be collected (→ CCOLLECT), and an algorithm is used that requires a rigorous snapshot neighbor list (currently either hierarchical clustering or the exact variant of the progress-index based scheme → CMODE), this keyword can be used to provide name and location of an input file in the appropriate format. CAMPARI uses the versatile binary NetCDF format for this purpose, and consequently the code needs to be linked to the NetCDF library for this option to be available (see installation instructions). Most commonly, this type of file will have been created by CAMPARI itself (it is automatically written if the code is linked against NetCDF and if an algorithm is used that requires a neighbor list → corresponding documentation). This keyword is primarily meant to circumvent the costly neighbor list generation in subsequent applications of the algorithm (for instance, with different settings for CRADIUS).
CRADIUS
If structural clustering is performed (→ CCOLLECT), and an algorithm is used that uses a distance (span) threshold criterion (→ CMODE), this keyword sets the value for said threshold criterion. For leader-based clustering this is either the distance from the center snapshot (standard leader) or from the current geometric center (modified leader) and therefore constitutes a maximum cluster radius. For hierarchical clustering, twice this value is the maximum distance of any two snapshots to be part of the same cluster, so again CRADIUS will control the maximum cluster radius. For tree-based clustering, this keyword again sets the maximum distance from the current geometric center. Values are to be provided in Å for proximity measures 5-10 and in degrees for 1-2, and they are unitless for 3-4 (→ CDISTANCE).
When working with a new set of data, it is often difficult to estimate appropriate values for this keyword right away. The same is true when changing dimensionality after data transformation on a well-known data set. To help with this, CAMPARI will print out a summary of the distance spectrum reporting subsampled quantiles and the mean before entering the selected clustering algorithm. To avoid extremely slow and/or memory-consuming executions, it is thus recommended to choose, in the first instance, a leader-based or the tree-based clustering with extremely large values for CRADIUS (and possibly CMAXRAD) and to set, in a second run, appropriate values based on this provided summary of the estimated spectrum.
If structural clustering is performed (→ CCOLLECT), this simple logical keyword lets the user control whether to apply any possible refinement strategies to the initial clustering results. Currently, there are two such procedures. For the modified leader algorithm, a refinement procedure is available which redistributes polyvalent snapshots to larger clusters. For the tree-based algorithm (for descriptions of these methods see elsewhere), a possible refinement consists of a (noniterative) merging of clusters with sufficient overlap. Both are largely experimental procedures and can have a strong negative impact on performance. In particular, in the OpenMP parallel execution of the tree-based algorithm, only a single thread does the refinement.
CRESORT
If structural clustering is performed (→ CCOLLECT), this simple logical keyword lets the user control whether to break ties in the sorting of clusters by size in a systematic way. If set to 1, CAMPARI will re-sort clusters with identical sizes by the indices of their centroid representatives or origin snapshots in increasing order. This is useful primarily in OpenMP parallel executions of the tree-based clustering (CMODE is 4 or 5) when clustering consistency is achieved (the batch size parameter is 1). Under these circumstances, results are technically identical across multiple runs with more than one thread, but can differ in the order of clusters of identical sizes. This inconvenience can be avoided using keyword CRESORT.
CLEADER
If structural clustering is performed (→ CCOLLECT), and a leader-based algorithm is used (→ CMODE), this keyword allows the user to alter the processing directions of the leader algorithm by the following codes:
- The collected trajectory data are processed forward. Clusters are searched backward (starting with the most recently spawned one).
- The collected trajectory data are processed forward. Clusters are searched forward (starting with the one spawned first).
- The collected trajectory data are processed backward. Clusters are searched backward (starting with the most recently spawned one).
- The collected trajectory data are processed backward. Clusters are searched forward (starting with the one spawned first).
If structural clustering is performed (→ CCOLLECT), and the hierarchical algorithm is used (→ CMODE), this keyword allows the user to choose between different linkage criteria:
- Maximum linkage: Appending a cluster with a snapshot implies that the new snapshot is less than twice the value for CRADIUS away from all snapshots currently part of the cluster. For merging two clusters, maximum linkage implies that all possible inter-cluster distances satisfy the threshold condition. This creates clusters with an exact upper bound for their diameter (maximum intra-cluster distance) and therefore resembles leader clustering.
- Minimum linkage: Appending a cluster with a snapshot implies that the new snapshot is within a distance of twice the value for CRADIUS of at least one snapshot already contained in the cluster. Merging two clusters implies that at least one inter-cluster distance satisfies the threshold condition. With a minimum linkage criterion clusters no longer have a well-defined radius and tend to get very large unless tiny values are used for CRADIUS. This is rarely a useful option for molecular simulation data.
- Mean linkage: Appending a cluster with a snapshot implies that the snapshot is within a distance of CRADIUS of the current geometric center of the cluster. Merging two clusters implies that their respective geometric centers are within a distance of CRADIUS of one another. This will create clusters that no longer have a rigorous upper bound for the intra-cluster distance and therefore resembles the modified leader algorithm.
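The three append criteria can be written down compactly. The following is a sketch on one-dimensional values with hypothetical names (can_append and the linkage strings are not CAMPARI identifiers); it only covers appending a single snapshot, not merging two clusters.

```python
def can_append(cluster, snapshot, cradius, linkage):
    """Check whether a snapshot may join a cluster under a given linkage (toy 1D sketch)."""
    dists = [abs(snapshot - x) for x in cluster]
    if linkage == "maximum":
        # must be closer than 2*CRADIUS to every current member
        return all(d < 2.0 * cradius for d in dists)
    if linkage == "minimum":
        # must be closer than 2*CRADIUS to at least one current member
        return any(d < 2.0 * cradius for d in dists)
    if linkage == "mean":
        # must be closer than CRADIUS to the geometric center
        center = sum(cluster) / len(cluster)
        return abs(snapshot - center) < cradius
    raise ValueError("unknown linkage: " + linkage)


cluster = [0.0, 1.0, 2.0]
results = {name: can_append(cluster, 3.5, 1.0, name)
           for name in ("maximum", "minimum", "mean")}
```

For this example only the minimum linkage accepts the snapshot, which illustrates why clusters grow easily under that criterion.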
If structural clustering is performed (→ CCOLLECT), and the tree-based algorithm or the approximate progress index-based scheme is used (→ CMODE), this keyword sets the upper distance threshold value for the hierarchical tree, i.e., it corresponds to the coarsest threshold used outside of the (virtual) root (see BIRCHHEIGHT for additional details).
BIRCHHEIGHT
If structural clustering is performed (→ CCOLLECT), and the tree-based algorithm or the approximate progress index-based scheme is used (→ CMODE), this keyword sets the number of hierarchy levels in the clustering algorithm. Briefly, the tree-based algorithm works by defining a series of threshold criteria (set by interpolating between CRADIUS and CMAXRAD) that define hierarchical levels. An initial clustering tree is learned from the data (phase 1), the snapshots are reassigned to the learned clusters (phase 2), and the reassignment is used to recluster at the next finer level (phase 3). In all phases, the fundamental trick is to restrict the search space for adding snapshots to clusters with the help of the tree structure (parent-child relations). The base of the tree is never counted toward BIRCHHEIGHT as it always encloses all snapshots. By specifying 1 for BIRCHHEIGHT one can thus recover an algorithm that is - in its basic outline - very similar to the modified leader scheme (see CMODE).
Larger numbers of levels generally lead to the formation of more clusters. This is because of a specific characteristic that is linked to the fact that children of a cluster (i.e., a set of clusters at the next finer level) can occupy similar space as the children from a nearby parent. If a snapshot explores only the children of a single cluster, inevitably the chances increase that an actual appropriate target cluster at finer levels is missed. In phase 3, this leads to the creation of tight clusters that are "too small" with respect to the desired threshold, i.e., this new cluster could theoretically be combined with one or more nearby clusters without the maximum intracluster distance ever exceeding the distance threshold. The offered refinement option is meant precisely to combat these errors. However, this merging scheme cannot extend arbitrarily far toward the root without destroying the computational efficiency of the algorithm. It also needs to apply stringent criteria. In practical terms, the value of BIRCHHEIGHT is a relatively free parameter if the smaller clusters can be tolerated (their quality is not compromised). In particular it can be used, in conjunction with BIRCHMULTI, to create a high-quality multi-resolution clustering, which is a desirable starting point for network model optimization. At coarser resolutions, if a multi-threaded executable is in use, the divide-and-merge strategy (see BIRCHCHUNKSZ) actually offers a separate handle on this characteristic.
If structural clustering is performed (→ CCOLLECT), and the tree-based algorithm or the approximate progress index-based scheme is used (→ CMODE), this keyword sets the number of hierarchy levels to refine during the second stage of the algorithm. Normally, only the most fine-grained (so-called leaf) level is populated in phase 3 of the algorithm. This leaves all levels closer toward the root in a less refined state. By specifying a value for BIRCHMULTI that is larger than the default of zero, the user requests CAMPARI to extend phase 3 to additional levels toward the root. The (virtual) root (single cluster) and the level with the coarsest actual threshold are both excluded from this refinement. The output file STRUCT_CLUSTERING.clu is adjusted to provide the correct number of coarse-grained trajectory annotations. Other analyses, unless specified otherwise, are only performed for the leaf level clustering (network). With appropriate settings for CRADIUS, CMAXRAD, and BIRCHHEIGHT, BIRCHMULTI can be used to create a high-quality multi-resolution clustering. The output in STRUCT_CLUSTERING.clu can then be used to perform graph-based analyses for all resolutions by choosing CMODE to be 6 or 7 and relying on keywords CLUFILE and CLUFILECOL, which is an efficient way of overcoming the aforementioned limitation.
BIRCHCHUNKSZ
If structural clustering is performed (→ CCOLLECT), the tree-based algorithm or the approximate progress index-based scheme is used (→ CMODE), and the multi-threaded executable is in use, this keyword controls an important aspect of the parallel clustering algorithm. To avoid load imbalance and scaling limitations, the first phase of the algorithm can, optionally, employ a divide-and-merge strategy for the determination of clusters at tree levels with few clusters overall. This divide-and-merge strategy can be enabled by choosing an integer value larger than 1 for this keyword (the default is 1). As a result of the divide-and-merge approach, clustering results become dependent on the number of threads in use. Only a setting of 1 guarantees that the clustering result will be the same as in single-threaded execution.
The precise role of this keyword is to set a threshold of Ns/BIRCHCHUNKSZ for coarse clusters (including the tree's root, where all snapshots are in a single cluster) to enable the divide-and-merge approach. Here, Ns is the total number of snapshots. Larger values will favor further divides and can improve load balance and scalability. The merging procedure is of course associated with a cost itself. The amount of merging is tunable by the associated keyword CMERGEDIAM. This in itself is a viable approach to controlling the properties of the clustering at coarser levels (see BIRCHHEIGHT for additional information).
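The threshold Ns/BIRCHCHUNKSZ can be made concrete with a small sketch. The function name is hypothetical, and the strict "larger than" comparison is an assumption; it is chosen here because it reproduces the documented behavior that a value of 1 never triggers the divide-and-merge approach, not even at the root.

```python
def uses_divide_and_merge(cluster_size, n_snapshots, birchchunksz=1):
    """Toy check of whether a coarse cluster would be split across threads (assumed rule)."""
    threshold = n_snapshots / birchchunksz
    # assumption: only clusters strictly larger than the threshold are divided
    return cluster_size > threshold


n_s = 10000
root_default = uses_divide_and_merge(n_s, n_s, birchchunksz=1)
root_chunked = uses_divide_and_merge(n_s, n_s, birchchunksz=4)
coarse_small = uses_divide_and_merge(2000, n_s, birchchunksz=4)
coarse_large_chunks = uses_divide_and_merge(2000, n_s, birchchunksz=8)
```

With 10000 snapshots, a cluster of 2000 snapshots stays undivided for a value of 4 (threshold 2500) but is divided for a value of 8 (threshold 1250), in line with larger values favoring further divides.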
If structural clustering is performed (→ CCOLLECT), the tree-based algorithm or the approximate progress index-based scheme is used (→ CMODE), and the multi-threaded executable is in use, this keyword controls an important aspect of the parallel clustering algorithm. To avoid load imbalance and scaling limitations, the first phase of the algorithm can, optionally, employ a divide-and-merge strategy for the determination of clusters at tree levels with few clusters overall. This divide-and-merge strategy is enabled by the associated keyword BIRCHCHUNKSZ. Applying the divide-and-merge approach implies that clustering results become dependent on the number of threads in use.
The precise role of CMERGEDIAM is to define a leniency on a criterion for merging clusters in the divide-and-merge approach. This is necessary because individual threads will likely have produced clusters with very substantial overlap. A merging of a smaller into a larger cluster is accepted whenever two conditions hold. First, the centroid-to-centroid distance is required to be less than CMERGEDIAM times a data-derived mean cluster radius at the same tree level. Second, the normalized difference of joint radius and mean snapshot-to-snapshot distance must be less than CMERGEDIAM-1.0. The default value of 1.0 therefore provides a stringent merging criterion. Smaller values make this even more stringent whereas larger values increase leniency. Note that only clusters that are part of a divide-and-merge block and were created by different threads are considered for merging. This makes this functionality complementary to refinement, which operates at the leaf level.
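The two merging conditions can be summarized as a single test. In this sketch the function and argument names are hypothetical, and the normalized difference of joint radius and mean snapshot-to-snapshot distance is taken as a precomputed input because its exact normalization is not spelled out here.

```python
def accept_merge(centroid_distance, mean_level_radius, normalized_radius_diff,
                 cmergediam=1.0):
    """Toy version of the two-part merge test (inputs assumed to be precomputed)."""
    close_enough = centroid_distance < cmergediam * mean_level_radius
    compact_enough = normalized_radius_diff < cmergediam - 1.0
    return close_enough and compact_enough


# with the default of 1.0, the second condition requires a negative difference
strict_ok = accept_merge(0.5, 1.0, -0.1)
strict_rejected = accept_merge(0.5, 1.0, 0.05)
lenient_ok = accept_merge(0.5, 1.0, 0.05, cmergediam=1.2)
```

The same candidate pair that is rejected at the default value is accepted once the leniency is raised to 1.2.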
If structural clustering is performed (→ CCOLLECT), and the progress index-based algorithm is used (→ CMODE), this keyword allows the user to choose between the exact (1) and the approximate scheme (2 = default). The two cases differ as follows:
- In the exact scheme, CAMPARI attempts to construct the true minimum spanning tree (MST) for the trajectory of interest. This is achieved by following the same setup procedure used in hierarchical clustering (described under option 3 to CMODE), i.e., a heuristics-based scheme is used to construct a neighbor list in snapshot space up to a certain hard cutoff. Alternatively, the neighbor list can be read from a dedicated input file. From this list, a globally sorted list of near distances is constructed. This setup work provides the foundation to construct the MST without additional parameters via Kruskal's algorithm. The high cost (both in terms of time and memory) makes the exact scheme impractical for large data sets. Note that the neighbor list must be sufficient for the algorithm to run. This means that all the edges for the MST have to occur in the neighbor list, which is unfortunately not guaranteed even if each snapshot has multiple neighbors listed. Potential failures are therefore difficult to predict. When using CAMPARI's threads parallelization, the exact scheme is carried out by a single thread alone. This limitation is in part due to the same reasons as the same limitation for hierarchical clustering: threads parallelization does not solve the memory problem.
- In the approximate scheme, CAMPARI utilizes a two-stage approach. The goal is to improve upon the large computational
cost associated with the exact scheme without sacrificing too much information encoded in the progress index. This is done by replacing
the calculation of the minimum spanning tree with an approximation, called a short spanning tree (SST). First, the trajectory data
are clustered using the highly efficient tree-based algorithm (described under option 5 to CMODE).
For clarity, the hierarchical tree of groups of snapshots (clusters) is not to be confused with the spanning tree we wish to generate. Because
the tree-based clustering is used, keywords CRADIUS, CMAXRAD, and
BIRCHHEIGHT are all relevant. The hierarchical tree is then used as follows. For every snapshot,
a fixed number (sometimes an upper limit) of guesses is made to find the shortest available distance to any other eligible snapshot. Rather than
searching exhaustively among all snapshots (as would be required for the MST), we restrict the search to pairs of eligible snapshots belonging to
the same cluster at the finest possible level of the clustering tree. The shortest guess for every snapshot becomes a candidate
edge for the SST. At every one of the ~logN iterations a number of guesses are discarded because they would introduce cycles. The algorithm is continued
until the SST is complete. At any given stage, the eligible snapshots are those belonging to different subtrees of the SST.
The SST will be formed primarily by connections between snapshots in the same clusters at the finest level.
This procedure emulates Borůvka's algorithm with a search space limited by the hierarchical
tree. Because the spanning tree thus constructed is not strictly minimal, it is important to update component memberships after
each merging operation.
The algorithm is dependent on two parameters. The first regulates the maximum number of search attempts used for finding the next-nearest and eligible neighbor for any snapshot (the minimum across a spanning tree component then becomes the candidate edge for that component). It is set by keyword CPROGINDRMAX. The respective clusters at the finest level of the hierarchical tree offering any eligible candidate edges may not offer CPROGINDRMAX guesses. In this case, the second parameter becomes relevant. It controls a depth as to how many additional levels of the hierarchical tree to descend into in order to satisfy the maximum number of guesses. This second parameter is set by keyword CPROGRDEPTH. There is a third parameter, CPROGRDBTSZ, which is a technical setting controlling how a cluster is searched randomly. This is only necessary if the number of eligible candidates in a cluster exceeds the number of missing guesses requested by CPROGINDRMAX. Then, CPROGRDBTSZ can be used to reduce the number of required random numbers. Depending on the settings, the algorithm is expected to run in approximately N log N time with the constant prefactor determined by the clustering and the choice for CPROGINDRMAX. Similarly, the quality of the generated spanning tree depends nontrivially on both aforementioned search parameters as well as on the properties of the tree-based clustering. It is of course unlikely that the SST is in fact the true MST for trajectories of appreciable length. By using appropriately large values for both CPROGINDRMAX and CPROGRDEPTH one can create an asymptotic limit for recovering the true MST. This limit can be of practical use even though a guaranteed MST computed this way requires at least O(N^2) time, which is worse than the time complexity of the exact form (aided by safe heuristics). However, the space (memory) complexity of this approach is much superior (linear rather than O(N^1.5) to O(N^2)).
The above description reveals some minor differences relative to the original published algorithm. The change associated with keyword CPROGRDBTSZ is directly related to the parallelization of the SST construction. When CAMPARI's thread-parallel version is in use, this operation offers excellent scaling properties, and limiting the number of required random numbers is beneficial for large numbers of threads. Because threads access the random number generator in nondeterministic order, keyword RANDOMSEED cannot be used to ensure that the SST in two successive executions is exactly identical for more than one thread. The second modification relative to the original publication is the straightforward generalization offered by keyword CPROGRDEPTH (in essence, the previous algorithm implied this keyword to always be 0).
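The approximate scheme can be illustrated with a heavily simplified sketch. It is not the CAMPARI algorithm: it uses one-dimensional data, a two-level "tree" (leaf clusters plus the root), a fallback to the root whenever the leaf cluster offers no eligible candidates, and hypothetical names throughout. It keeps the essential ingredients: a limited number of guesses per snapshot, one candidate edge per spanning tree component, rejection of guesses that would close a cycle, and updated component memberships after every merge.

```python
import random


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def short_spanning_tree(points, leaf_label, max_guesses=5, seed=0):
    """Boruvka-like spanning tree with a cluster-restricted search (toy sketch)."""
    rng = random.Random(seed)
    n = len(points)
    members = {}
    for i, lab in enumerate(leaf_label):
        members.setdefault(lab, []).append(i)
    parent = list(range(n))
    edges = []
    while len(edges) < n - 1:
        best = {}  # component root -> (distance, i, j)
        for i in range(n):
            ci = find(parent, i)
            # eligible snapshots belong to a different spanning tree component
            pool = [j for j in members[leaf_label[i]] if find(parent, j) != ci]
            if not pool:  # nothing left in the leaf cluster: search at the root
                pool = [j for j in range(n) if find(parent, j) != ci]
            if len(pool) <= max_guesses:
                guesses = pool  # deterministic search
            else:
                guesses = [rng.choice(pool) for _ in range(max_guesses)]  # with replacement
            d, j = min((abs(points[i] - points[j]), j) for j in guesses)
            if ci not in best or d < best[ci][0]:
                best[ci] = (d, i, j)
        for d, i, j in sorted(best.values()):
            ri, rj = find(parent, i), find(parent, j)
            if ri != rj:  # discard guesses that would introduce a cycle
                parent[ri] = rj
                edges.append((i, j, d))
    return edges


points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
labels = [0, 0, 0, 1, 1, 1]
edges = short_spanning_tree(points, labels)
cross = [e for e in edges if labels[e[0]] != labels[e[1]]]
```

In this example, all but one of the edges connect snapshots within the same leaf cluster, and the single remaining edge joins the two clusters at their closest pair.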
If structural clustering is performed (→ CCOLLECT), and the progress index-based algorithm is used (→ CMODE), this keyword allows the user to pick a specific snapshot to serve as starting point for the generation of the progress index. Because, like INISYNSNAP, the keyword uses a snapshot index, it is important to point out that the value of this keyword must always be specified in absolute terms of the input data, i.e., generally speaking, no corrections need to be applied in case CCOLLECT is greater than 1, a sequential access file with user-selected input frames is specified, or frames are discarded at the beginning (this was different in previous versions). CAMPARI takes care of this automatically. The use of a frames file requires particular care. If the file accesses the trajectory in random access ("as is") mode, the snapshot index is assumed to refer to the line number in the frames file rather than the index of the frame on that line. This is a general change of interpretation inherent to FMCSC_FRAMESFILE with certain input file formats. Conversely, if the file accesses the trajectory in strictly sequential mode, step numbers continue to refer to the original trajectory.
As a special option, specifying zero instructs CAMPARI to find a set of suitable starting snapshots. These are generally found by generating a sample profile (discussed elsewhere) that is then scanned for extrema using an automated detection system that can be tuned with two additional keywords, CBASINMAX and CBASINMIN. The idea behind this is to generate profiles starting from a complete set of putative basins. If this automatic detection is unsuccessful, CAMPARI will revert to using the first snapshot as a starting point.
As a further option that is only available in the approximate scheme (→ CPROGINDMODE), a specified value of "-1" instructs CAMPARI to use as starting snapshot the central snapshot of the largest cluster found during the preparatory tree-based clustering. The default options are -1 in the approximate scheme and 0 in the exact scheme.
If structural clustering is performed (→ CCOLLECT), and the progress index-based algorithm is used (→ CMODE), this keyword allows the user to modify the spanning tree underlying the progress index before the index is computed. The modification consists of "folding" or collapsing the leaves into their parent vertex, which means that they are added first as soon as the index encounters the parent in question. By specifying a positive integer, the user requests CPROGMSTFOLD applications of this inward folding procedure (each of which scales linearly with the number of snapshots in terms of computational cost). After each iteration, the identity of vertices as leaf vertices is updated, which means that branches are continuously folded inward. Note that already a single iteration will fold a large number of edges (the actual number is reported to log output). For multiple folded vertices connected to the same parent CAMPARI preserves the expected order (shortest distance first).
The reasoning behind this modification is the following. When operating on the (minimum) spanning tree, Prim's algorithm proceeds by always finding the shortest distance available. As long as basins are sampled densely and transitions are rare, this has the desired effect of arranging snapshots in a way that allows identification of basins by suitable annotation. However, it is common for basins to have "fringe" regions where sampling density becomes low (and distances are large). Points in these regions will often be missed by the progress index and placed at the end (far away from "their" parent basin). Points in these regions are also likely to correspond to leaf vertices in the spanning tree. Therefore, it can be assumed that collapsing them into their parent will partially ameliorate this issue (they will occur in the correct basin). Users should keep in mind that this alters the rule that the progress index is built to track local density as much as possible.
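The iterative nature of the folding can be seen in a small sketch. It only records which vertex is folded into which parent in each round; how the folded vertices are then ordered in the progress index is not shown. The function name and the adjacency-dictionary representation of the spanning tree are assumptions for illustration.

```python
def fold_leaves(adjacency, n_folds):
    """Return {folded_vertex: parent} after n_folds rounds of inward folding (toy sketch)."""
    degree = {v: len(nbrs) for v, nbrs in adjacency.items()}
    folded = {}
    for _ in range(n_folds):
        # leaf status is re-evaluated at the start of every round
        leaves = [v for v in adjacency if degree[v] == 1]
        for v in leaves:
            parent = next((u for u in adjacency[v] if u not in folded), None)
            if parent is None:  # last remaining vertex, nothing to fold into
                continue
            folded[v] = parent
            degree[v] = 0
            degree[parent] -= 1
    return folded


# a path-shaped spanning tree: 0 - 1 - 2 - 3 - 4
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
one_round = fold_leaves(path, 1)
two_rounds = fold_leaves(path, 2)
```

After the first round only the two end points are folded; the second round folds the vertices that became leaves as a result, which is the sense in which branches are continuously folded inward.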
If structural clustering is performed (→ CCOLLECT), and the approximate progress index-based algorithm is used (→ CMODE and CPROGINDMODE), this keyword allows the user to control the maximum search depth for random guesses. In this method, a hierarchical tree is used in conjunction with a parameter, CPROGINDRMAX, to restrict the search space for finding edges of a short spanning tree. The hierarchical tree is based on the tree-based clustering algorithm, and its height is set by keyword BIRCHHEIGHT. For each snapshot, the algorithm will start searching for putative edges within the cluster the snapshot is part of at the finest level offering any eligible candidates. Often, the number of candidates is smaller than the setting for CPROGINDRMAX. Then, CAMPARI will descend the hierarchical tree toward the root by at most CPROGRDEPTH levels to fulfill the requested number of guesses per snapshot. The reason for offering this restriction is that the search at additional levels is often inefficient. This is because it introduces additional redundancy (the same candidates are evaluated more than once), and the candidates at a coarser-than-necessary level are unlikely to be better guesses than the ones at the finest available level. The default for CPROGRDEPTH is zero. Note that, with a meaningful clustering in place, the default setting will prevent the spanning tree from approaching the correct minimum spanning tree in almost all cases. This is because of the hard search space restrictions. At considerable cost, this keyword can overcome the impact of these restrictions.
CBASINMAX
If structural clustering is performed (→ CCOLLECT), the progress index-based algorithm is used (→ CMODE), and an automatic determination of multiple starting snapshots for profiles is requested (→ CPROGINDSTART), this keyword controls how a test profile using the standard annotation function described elsewhere is parsed to automatically identify minima in this function. Specifically, around each eligible point in the profile, environments of varying sizes n_e are considered, and the following criteria are used:
- The sum of values to the left over a stretch of n_e points must be greater than the sum of values over a stretch of n_e points centered at the point currently considered.
- The sum of values to the right over a stretch of n_e points must be greater than the sum of values over a stretch of n_e points centered at the point currently considered.
- The sum of values to the left and right over a stretch of n_e points each must be greater than a reference sum that is given as twice the sum of values over a stretch of n_e points centered at the point currently considered plus 4n_e.
- The left (far) half of the sum of values to the left over a stretch of n_e points must be greater than the right (near) one.
- The right (far) half of the sum of values to the right over a stretch of n_e points must be greater than the left (near) one.
- No point toward the left over a stretch of n_e points must be greater than or equal to the point currently considered.
- No point toward the right over a stretch of n_e points must be greater than the point currently considered.
If structural clustering is performed (→ CCOLLECT), the progress index-based algorithm is used (→ CMODE), and an automatic determination of multiple starting snapshots for profiles is requested (→ CPROGINDSTART), this keyword controls the minimum value considered for the environment size n_e as explained in the documentation of keyword CBASINMAX.
CPROGINDRMAX
If structural clustering is performed (→ CCOLLECT), the progress index-based algorithm is used (→ CMODE), and the approximate version is chosen (→ CPROGINDMODE), this keyword controls the maximum number of attempts for a search of the next correct spanning tree neighbor of a growing spanning tree component. Depending on the choice for keyword CPROGRDEPTH, such a search first exhausts the possibilities within a given cluster of the hierarchical tree underlying the approximate algorithm and will only consider a limited number of clusters at coarser-than-necessary levels. Therefore, the parameter is interpreted as a maximum and not generally an actual value. Whenever the number of eligible candidate snapshots in a cluster is less than the missing number of guesses for the snapshot in question, the search becomes deterministic. Otherwise, it is random (with replacement). In both cases, the eligible snapshot with the minimum distance to the spanning tree component under consideration becomes a candidate for the next link of the approximate MST. If CAMPARI's shared memory (OpenMP) parallelization is in use, the choice for this keyword affects parallel performance at most weakly. This is because the parallelization is at the level of snapshots and not at the level of guesses.
CPROGRDBTSZ
If structural clustering is performed (→ CCOLLECT), the progress index-based algorithm is used (→ CMODE), and the approximate version is chosen (→ CPROGINDMODE), this keyword controls the structure of the random search in a cluster with a number of eligible candidates that exceeds the remaining number of required guesses for all but the first stage of Borůvka's algorithm. The default is 1 meaning that every single random guess requires a random number. Values larger than 1 imply that the random search proceeds in systematic stretches of length CPROGRDBTSZ in the contiguous stretch of eligible candidates starting from a member selected with uniform probability. The specified value is an upper limit, i.e., the number of guesses is never exceeded. Use in the first stage is forbidden so as to avoid bias from the input order, which can still be present in the list of snapshots constituting a cluster. In later stages, cluster snapshot lists have been reordered by subtree memberships, and systematic biases become increasingly unlikely. The keyword is relevant primarily if CAMPARI's shared memory (OpenMP) parallelization is in use. In this scenario, the cost of random number production can become significant relative to distance evaluations because the individual threads share the same random number generator (which leads to minor waiting times). Consequently, if the cost of the distance evaluations is high to begin with (dependent on features and metric), the default should not be changed.
CPROGINDWIDTH
If structural clustering is performed (→ CCOLLECT), and the progress index-based algorithm is used (→ CMODE), this keyword controls the auxiliary annotation function defined elsewhere. Specifically, it corresponds to the parameter l_p in the documentation found by following the link.
TMAT_MD
This important keyword lets the user set the time direction(s) for the inferred transition matrix to be used in all related analyses, which may be iterative solutions of the steady state (see CREWEIGHT), mean first-passage times (see CMSMCFEP and MFPT_MATRIX), the generation of synthetic trajectories (or random walks, see SYNTRAJ_MD) and/or, if the code was compiled and linked with HSL support (→ installation instructions), the computation of committor probabilities (DOPFOLD and CMSMCFEP) and/or the computation of the spectral properties of the transition matrix itself (EIGVAL_MD). For the present keyword and related analyses to have an effect, a clustering analysis must be performed (see CCOLLECT and CMODE). In case CMODE is set to 4, the approximate version of the progress index algorithm must be used (CPROGINDMODE set to 2). In this case, or if CMODE is set to 5, the underlying state space used for the evaluation of the matrix is always the one at the leaf level (→ BIRCHHEIGHT).
In general, the analysis of transition matrix-derived properties will be done exclusively for those states that form the reference strongly connected component, which is identified by the cluster selected with the keyword INISYNSNAP. Some algorithms may provide solutions for all eligible components individually (see INISYNSNAP). Results of all analyses that explicitly reference user-selected snapshots (e.g. committor probabilities for folded and unfolded sets) are obtained over this reference strongly connected component only. The reference component is isolated from all the other ones (if any) by removing all the one-way transitions. The transition matrices can be requested to be written to file with the specialized keyword TMATREPORT. For the present keyword, the relevant options are:
- 1: Only data derived from one transition matrix are calculated, i.e., the one that uses the forward-time transitions between the clusters of the reference component (default). In the simplest scenario - viz. CLAGT_MSM set to 1, no breaks, no links and no trace files specified - these transitions are simply reflected in the output file STRUCT_CLUSTERING.clu when processed line by line from top to bottom.
- 2: Only data derived from one transition matrix are calculated, i.e., the one that uses the backward-time transitions between the clusters. Breaks and links are interpreted by reversing the time-information in the relevant input files. Similar to option 1, in the simplest situation, the transitions for this case are reflected in the STRUCT_CLUSTERING.clu output file when processed line by line from bottom to top.
- 3: Both types of transition matrices (i.e. forward and backward time) are constructed and most subsequent analyses are performed twice, once per type of transition matrix.
Since the basic inference of the transition matrix is based on the count matrix, the initial estimate depends on the assumptions of snapshot-to-snapshot connectivity in the input trajectory. The default assumption (subsequent snapshots are connected in time) can be altered by several keywords. The general time spacing (lag time) can be changed with keyword CLAGT_MSM. Custom links and breaks can be added with specific input files TRAJBREAKSFILE and TRAJLINKSFILE. In addition, a special automatic handling of rerouted transitions is offered in case the input trajectory has been generated using the PIGS protocol and the associated trace file is provided as input. Importantly, these snapshot-based modifications can be handled and analyzed by CAMPARI at the beginning of the run. Obviously, all changes to the transition matrix impact all the routines that use it subsequently. Keyword BRKLNKREPORT can be used to instruct CAMPARI to print a summary of rerouted (or all) snapshot transitions.
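For the simplest scenario (lag time of 1, no breaks, links, or trace files), the relation between the forward-time and backward-time matrices can be sketched as follows. The list clu stands in for a coarse-grained trajectory of cluster indices (zero-based here for convenience); the helper names are hypothetical, and the restriction to a reference strongly connected component is not modeled.

```python
def count_matrix(traj, n_states):
    """Count transitions between subsequent entries of a cluster-index trajectory."""
    counts = [[0] * n_states for _ in range(n_states)]
    for a, b in zip(traj[:-1], traj[1:]):
        counts[a][b] += 1
    return counts


def row_normalize(counts):
    """Turn a count matrix into a row-stochastic transition matrix."""
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


clu = [0, 0, 1, 2, 0, 1]
forward_counts = count_matrix(clu, 3)         # read from top to bottom
backward_counts = count_matrix(clu[::-1], 3)  # read from bottom to top
T_forward = row_normalize(forward_counts)
T_backward = row_normalize(backward_counts)
```

The backward count matrix is the transpose of the forward one, but because rows are normalized separately, the two transition matrices generally differ.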
This keyword is interpreted as a simple logical. When set to 1, it asks CAMPARI to write to one or more files (see TMAT_xxxxxx_yyy.dat for details on formatting) the non-zero entries of the processed transition matrix or matrices (see TMAT_MD). For this option to be available, structural clustering analysis must be performed (see CCOLLECT and CMODE). If any method is used that relies on keyword INISYNSNAP, which includes synthetic trajectories, spectral decomposition, and committor probabilities, this snapshot will be used to identify the strongly connected component, for which the file(s) are written.
CLAGT_MSM
This integer value specifies the lag time τ to be used to compute the transition matrix for any relevant analysis based on a network (graph, Markov state model) derived from a structural clustering, e.g., SYNTRAJ_MD, EIGVAL_MD and DOPFOLD. Setting its value to any number greater than 1 (default) entails superimposing all the transition counts between clusters as derived at fixed time distance τ along the coarse-grained trajectory (STRUCT_CLUSTERING.clu), an approach that is often called "sliding window" in the relevant literature. This way, there are as many superposition steps as the integer value of the lag time. CLAGT_MSM strictly refers to the spacing (in number of frames) of the data actually stored for clustering, which depend on EQUIL, CCOLLECT, and, possibly, an input file with user-selected frames. The actual distance in units of time that CLAGT_MSM corresponds to has to be computed by considering the actual spacing of the underlying data set (e.g., for a CAMPARI molecular dynamics trajectory, this would have been controlled by TIMESTEP and XYZOUT). Because CLAGT_MSM ultimately edits the way snapshots (frames) of the input trajectory are linked together, it is also relevant for the output of the progress index method (see output file PROGIDX_000000000001.dat).
The sliding window mode of operation will automatically propagate modifications to the connectivity introduced by input files TRAJBREAKSFILE and TRACEFILE. Conversely, any manually added links are always kept "as is." If the user is interested in processing all the trajectories that are superimposed this way as separate entities, it is necessary to prepare as many dedicated input frames files as the integer value of the lag time and to perform a separate analysis for each of them. It is worth pointing out that the independent clustering on each input frames file may introduce some inconsistencies in this workaround.
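The statement that a sliding window at lag τ superimposes τ sub-trajectories can be checked with a short sketch. The function names are hypothetical, and breaks, links, and trace files are not modeled; the input is simply a list of cluster indices.

```python
def sliding_window_counts(traj, n_states, lag):
    """Count all transitions between frames t and t+lag."""
    counts = [[0] * n_states for _ in range(n_states)]
    for t in range(len(traj) - lag):
        counts[traj[t]][traj[t + lag]] += 1
    return counts


def superimposed_counts(traj, n_states, lag):
    """Sum lag-1 counts over the `lag` sub-trajectories obtained by striding."""
    total = [[0] * n_states for _ in range(n_states)]
    for offset in range(lag):
        sub = traj[offset::lag]  # every lag-th frame, starting at offset
        for a, b in zip(sub[:-1], sub[1:]):
            total[a][b] += 1
    return total


traj = [0, 1, 0, 2, 1, 1, 0, 2, 2, 1]
lag = 3
windowed = sliding_window_counts(traj, 3, lag)
strided = superimposed_counts(traj, 3, lag)
n_transitions = sum(sum(row) for row in windowed)
```

Both routes give the same count matrix, and the total number of counted transitions is the trajectory length minus the lag.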
If structural clustering is performed (→ CCOLLECT), this keyword allows the user to request different modifications of the link (edge) structure of the derived network (graph). This is unavailable if the exact progress index method has been selected. Modifying the link structure can be useful because the transition counts usually suffer from poor statistics for many if not most links. This can cause problems, e.g., by splitting the graph into several strongly connected components or by creating dramatic sensitivities of network-derived properties (such as the steady state) on very few elements of the transition matrix. For small values of chosen lag times, networks are assumed to be locally connected only (sparse). While this potentially reduces the impact of statistical errors, a large number of subsequent analyses (whether in CAMPARI or elsewhere) unfortunately assume a memory-less evolution, which is difficult to fulfill. Conversely, for large values of chosen lag times, the memory-free nature of the dynamics may become appropriate, but the number of relevant transition matrix elements grows dramatically while the counts available for inference decrease considerably.
These joint concerns mean that it is unfortunately not at all simple to identify the optimal transition matrix based on counts. The available options are meant to deal with this problem as follows:
- The network is left as is, i.e., all transition matrices will be inferred directly from the observed transition counts. This is the statistically optimal estimator if the system is truly Markovian. It can lead to fractured graphs, which introduce arbitrary probability relationships between subgraphs.
- Strongly connected components are identified using Tarjan's algorithm. They can result from supplying a file with trajectory breaks or a trace file for an MPI PIGS calculation. Any one-way links between different components are augmented with the reverse transition. The floating point weight for this reverse link is set by keyword CLINKWEIGHT. If there is no link in either direction, multiple components will remain as in option 0.
- Any clusters (vertices) without any observed self-transitions (self-loops) are augmented with a self-transition with a floating point weight of CLINKWEIGHT. In a Markov model sense, this will increase residence times and populations for the augmented nodes. It also removes deterministic chains of singleton clusters, which often occur in high resolution networks in fringe regions (regions of low sampling density).
- This is a combination of options 1 and 2.
- The count matrix is symmetrized. If one of the two corresponding elements is zero, this creates a new reverse link with the same properties as the existing forward one. If both directions are already populated, this means that the transition with a lower number of observed counts is augmented to match the exact count number of the more populated one. This option ignores keyword CLINKWEIGHT. This is different from symmetrization achieved by adding the entire transition count matrix obtained from the same trajectory reversed in time. Both variants imply detailed balance. Again, if no link exists in either direction, nothing is done, and multiple strongly connected components may persist as in option 1.
- This is a combination of options 2 and 4.
- Symmetrization of the count matrix is a crude way to impose detailed balance. In particular, it is almost certainly suboptimal in a statistical likelihood sense. If we assume the likelihood of the inferred transition matrix as $\prod_{i,j} T_{ij}^{c_{ij}}$, where $c_{ij}$ is the number of observed counts for the transition from i to j, and $T_{ij}$ is the inferred transition matrix element, then it is possible to solve a constrained problem that maximizes this likelihood while maintaining row normalization and detailed balance as constraints on the $T_{ij}$. It is important that the particular form of the likelihood function is a strong imposition, i.e., the transition counts are assumed independent, which is equivalent to asserting Markovianity. Markovianity is a very challenging property to achieve with sufficient accuracy (in the sense of a true statistical test). This means that in many applications the resultant transition matrix does not actually maximize a meaningful quantity. This holds as much for the non-augmented inference (option 0) as for this option, which includes the added constraint of maintaining detailed balance. Bowman et al. derived an iterative estimator solving this constrained problem. This estimator was subsequently simplified by Prinz et al., and their version, which works on the log-likelihood as usual, is implemented in CAMPARI. Also with this method, links with weight zero in both directions will remain empty, which again means that multiple strongly connected components may persist. This iterative algorithm benefits from CAMPARI's shared memory (OpenMP) parallelization and is under time control (the procedure can be slow). Parallelization is such that the results should always be identical to serial execution.
- Undersampled transition matrices can give rise to very noisy predictions. All options above (except 0) can be understood as attempts to regularize these matrices and thus make them more well-behaved. A different framework for doing so is the inclusion of objective prior information. This option adds pseudocounts of weight CLINKWEIGHT to every element of the transition matrix. This implies that the resulting graph is complete. In a statistical sense, this means that the inferred transition matrix is the maximum a posteriori estimate assuming a Dirichlet prior (which is conjugate to the multinomial distribution) with flat concentration parameters of value 1+CLINKWEIGHT. This is in a way an objective estimate assuming full connectivity. Note that this option does not impose detailed balance. A common choice for the pseudocount weight is 1/N where N is the number of states. This can be achieved by using a negative value for CLINKWEIGHT.
- Similar to the previous option, this uses a pseudocount strategy, except that the prior is nonuniform. Briefly, every row of the transition matrix receives a total weight of CLINKWEIGHT in fractional counts. These counts are distributed according to the distribution, given the metric and representation and given the lag time, of conformational distance between snapshots connected in time. Specifically, the weight is proportional to the observed frequency without corrections (fitting, smoothing). The finest binning considered depends on keyword CRADIUS (10%). If this is not fine enough, the frequency estimate may be too coarse to add useful information. This is not generally an issue unless file-based clustering (option 6 for CMODE) is used. This option was introduced in a reference publication in 2019.
- All links connecting different strongly coupled components are removed completely. This can create separate graphs (networks).
- This is currently redundant (it is exactly the same as option 2 above).
- This is the same as option 3 above only that all links connecting different strongly coupled components are removed completely (this does not affect self-transitions).
- This is the same as option 4 above only that all links connecting different strongly coupled components are removed before imbalanced transitions are symmetrized. This removal creates symmetry and will thus leave the components separate.
- This is the same as option 5 above only that all links connecting different strongly coupled components are removed before imbalanced transitions are symmetrized (this does not affect self-transitions).
- This is the same as option 7 above only that all existing links connecting different strongly coupled components are removed and no pseudocounts are added for links connecting different strongly coupled components in the original network (they remain separate).
- This is the same as option 8 above only that all existing links connecting different strongly coupled components are removed and no pseudocounts are added for links connecting different strongly coupled components in the original network (they remain separate).
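Three of the simpler link modifications listed above (self-loop augmentation, symmetrization to the larger count, and uniform pseudocounts) can be illustrated on a small count matrix. This is a minimal sketch under stated assumptions: the function names and the example matrix are hypothetical, the matrix is dense, and strongly connected components are not considered, so it does not reproduce CAMPARI's actual code paths.

```python
def add_self_loops(counts, weight=1.0):
    """Give every state without observed self-transitions a self-loop of `weight`."""
    out = [row[:] for row in counts]
    for i in range(len(out)):
        if out[i][i] == 0:
            out[i][i] = weight
    return out

def symmetrize_max(counts):
    """Raise the weaker direction of each link to the count of the stronger one."""
    n = len(counts)
    return [[max(counts[i][j], counts[j][i]) for j in range(n)] for i in range(n)]

def add_pseudocounts(counts, weight):
    """Add a uniform pseudocount to every element (the graph becomes complete)."""
    return [[c + weight for c in row] for row in counts]

def row_normalize(counts):
    """Turn a count matrix into a row-stochastic transition matrix."""
    result = []
    for row in counts:
        total = sum(row)
        result.append([c / total if total > 0 else 0.0 for c in row])
    return result

# Hypothetical count matrix: counts[i][j] is the number of observed i -> j transitions
counts = [[5, 2, 0],
          [0, 0, 3],
          [1, 0, 4]]
sym = symmetrize_max(counts)
T_prior = row_normalize(add_pseudocounts(counts, 0.5))
```

Because `sym` is a symmetric count matrix, the row-normalized transition matrix derived from it satisfies detailed balance with respect to the normalized row sums. In this sketch the pseudocount variant makes all elements nonzero but does not impose detailed balance.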
If structural clustering is performed (→ CCOLLECT), and the addition of links (edges) to the derived network (graph) is requested (→ CADDLINKMODE), this keyword sets the floating-point weight for some of the added links (see above for details). Note that the basic unit is an (integer) count of observed transitions in the input trajectory. The default is therefore 1.0.
TRAJBREAKSFILE
If any type of structural clustering is performed (→ CCOLLECT), or if the exact progress index-based algorithm is used (→ CMODE), the resultant trajectory is used to infer the properties of a network. Essentially, the sequence of events in the trajectory defines a transition matrix. However, not all transitions in a trajectory may be equally valid, as they may be caused by trajectory concatenation (e.g., when using structural clustering with the MPI averaging technique, by replica exchange swaps, by nonlocal Monte Carlo moves and so on). It may therefore be appropriate to remove such spurious transitions from the analysis in order to keep inferences regarding the underlying dynamics accurate. This is what this file accomplishes, and the input and its interpretation are described in detail elsewhere.
The removal of links is relevant for a number of output files, most obviously in STRUCT_CLUSTERING.graphml and TMAT_xxxxxx_yyy.dat (the mesostate (cluster) network and implied transition matrix). All output files that depend on the transition matrix (→ TMAT_MD) and the output of the progress index method are affected as well. There are two additional notes. First, CAMPARI will not remove any transitions by default, and it may sometimes be difficult to obtain or preserve the required information (e.g., the replica exchange trace file must be used to extract the exact history of accepted swaps). Second, there is no guarantee that the graph remains intact (it may fracture into multiple, disconnected subgraphs), and this may impact the interpretability of the data in the aforementioned output files. Conversely, the native processing of a PIGS trace via keyword TRACEFILE is both more convenient and more universally supported.
If any type of structural clustering is performed (→ CCOLLECT), or if the exact progress index-based algorithm is used (→ CMODE), the resultant trajectory is used to infer the properties of a network. Essentially, the sequence of events in the trajectory defines a transition matrix. However, trajectory concatenation may give rise to scenarios where some links are spurious (→ TRAJBREAKSFILE) and others are missing, e.g., if multiple trajectories are branched off from a common starting point and simply appended for analysis purposes. This keyword can be used to add such missing links at the snapshot (frame) level. This function can overlap with keyword CADDLINKMODE, which operates at the cluster level. It also overlaps with the use of keyword TRACEFILE for managing the reseeding operations of a PIGS calculation, which is a type of simulation yielding such a set of branched trajectories. The input format is described in detail elsewhere.
The addition of links is relevant for a number of output files, most obviously in STRUCT_CLUSTERING.graphml and TMAT_xxxxxx_yyy.dat (the mesostate (cluster) network and implied transition matrix). All output files that depend on the transition matrix (→ TMAT_MD) and the output of the progress index method are affected as well. We emphasize that considerable care is required to manage the links in a conformational space network through keywords (TRAJLINKSFILE, TRAJBREAKSFILE, CADDLINKMODE, TRACEFILE, and CLAGT_MSM). This is mostly due to the fact that data generation and post-processing (necessarily) are usually separate operations, which makes it difficult to achieve a compromise between controllability and ease of use.
This is a simple keyword that allows the user to request information on the snapshot-to-snapshot connectivity map CAMPARI assumes for all network-based analyses (→ TMAT_MD and the output of the progress index method). Options are as follows:
- No report is printed.
- Rerouted snapshot-to-snapshot links with a step spacing that is different from the requested lag time are printed to log-output at the beginning of the run (before any data are read). Indexing both relative to the stored data and relative to the original input is provided (the latter depends on CCOLLECT and EQUIL and, possibly, the presence of a file with user-selected frames). During post-processing, for the same links, the geometric distance for the two snapshots in question is printed as well.
- This is the same as the previous option only that all links are printed.
If structural clustering is performed (→ CCOLLECT), which includes the case of the approximate progress index method, the resultant coarse-grained trajectory serves to define a network (graph) of clusters (vertices). If the original trajectory carries strong initial state (but no energetic) bias (for example, if it is a concatenation of many short trajectories), it may be of interest to attempt to quantify the bias in the data. This is what this keyword is meant for, and it currently supports the following options:
- No network-based reweighting is undertaken.
- The steady state (equilibrium probability distribution) of the underlying (and assumed!) Markov state model is computed using an iterative algorithm. As alluded to, the resultant graph may not be strongly connected or even fractured. Any modification to the link (edge) structure of the network (→ CADDLINKMODE, TRAJBREAKSFILE, TRACEFILE, TRAJLINKSFILE, CLAGT_MSM) can influence the steady state and any other network-derived properties profoundly. Even for a single continuous trajectory, the observed probability distribution in cluster space does not exactly agree with the network-derived prediction due to the imbalance caused by having a beginning and an end. Consequently, a simultaneous use of network-dependent properties such as mean first-passage times or synthetic (state-based) trajectories and the raw sampling weight per state will be inconsistent. This is why the steady state - if computed - will be used in the subsequent computation of cut-based free energy profiles. Note that the steady state can also be computed using linear algebra (→ EIGVAL_MD) and is required in the computation of the (-) committor.
- This option is the same as the previous one, except that all edges are first scaled by their geometric lengths.
The computation of the steady state uses an iterative algorithm that can become quite time-consuming due to the slow convergence behavior. There is a time control for all iterative schemes of this type. The algorithm is also numerically weak in that the convergence measure is unable to estimate the deviation of the current from the exact solution accurately and in that the convergence properties can differ across the network. The algorithm does detect periodicity, which generally prevents convergence (the easiest example is a system of two mutually connected states with no self-transitions), and will eventually report this and terminate. Unfortunately, the difficult cases for the iterative scheme are the same as those for the linear algebra solution. It can be illustrative to compute both solutions if possible and compare them (→ STRUCT_CLUSTERING.graphml). The steady state also provides a route toward reweighting a set of simulation data biased by initial conditions, i.e., an ensemble of short trajectories. The resultant weights are written by default to dedicated output file(s) unless CREWEIGHT is 0, and these files can usually be used as an input to FRAMESFILE for subsequent weighted analysis. The iterative algorithm has the advantage over the linear algebra solution that it benefits from CAMPARI's shared memory (OpenMP) parallelization. Parallelization is such that the results should always be identical to serial execution.
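The iterative computation of the steady state, its time control, and the failure to converge for periodic networks can be sketched as follows. This is a plain power iteration written for illustration only; the function name `steady_state`, its parameters, and the convergence measure (largest change between successive iterates) are assumptions and need not match what CAMPARI does internally.

```python
import time

def steady_state(T, p0=None, tol=1e-12, max_seconds=None, max_iter=100000):
    """Iterate p <- p T until the largest change per step drops below `tol`.

    Returns (p, converged). The loop also stops when the optional wall-clock
    budget `max_seconds` is exhausted and then returns the current estimate.
    """
    n = len(T)
    p = list(p0) if p0 is not None else [1.0 / n] * n
    start = time.monotonic()
    for _ in range(max_iter):
        new = [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]
        delta = max(abs(a - b) for a, b in zip(new, p))
        p = new
        if delta < tol:
            return p, True
        if max_seconds is not None and time.monotonic() - start > max_seconds:
            break
    return p, False

# Hypothetical two-state transition matrix (rows sum to 1)
T = [[0.9, 0.1],
     [0.2, 0.8]]
p, ok = steady_state(T)

# Two mutually connected states without self-transitions: the iterate oscillates
T_periodic = [[0.0, 1.0],
              [1.0, 0.0]]
q, ok_periodic = steady_state(T_periodic, p0=[1.0, 0.0], max_iter=50)
```

Note that the change between successive iterates is only a proxy for the distance to the exact solution, which is one reason why such a criterion can be a weak estimate of the true error.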
If structural clustering is performed (→ CCOLLECT), which includes the approximate progress index method, this keyword can be set to define a maximum execution time (in seconds) of any iterative scheme computing convergent properties of/from the transition matrix, which are currently used for the steady state (→ CREWEIGHT), the mean first-passage times (→ CMSMCFEP) to a reference cluster, and the iterative maximum likelihood inference (→ CADDLINKMODE is 6). The normal convergence threshold for these algorithms must for accuracy reasons be set to such a small number that the execution can easily time out (without a printed solution) on time-limited resources if the network is large and not well-connected. This is why this keyword, which defaults to an unlimited execution time, can be set to force a solution after a given time irrespective of convergence. Note that the execution time specified here refers to a single execution of an individual algorithm, and that multiple invocations as well as the remainder of CAMPARI's execution time must be estimated independently and corrected for. Note that all 3 of the aforementioned iterative algorithms can take advantage of CAMPARI's shared memory (OpenMP) parallelization. The efficiency of this depends on the size and connectedness of the network (larger is better for both). Notably, the keyword does not control other iterative schemes such as flux decomposition (which has its own maximum time control) or the generation of synthetic trajectories, which has a much more predictable and controllable cost.
INISYNSNAP
If any type of structural clustering is performed (→ CCOLLECT), the underlying trajectory is used to infer a transition network. With this keyword, the user indicates the snapshot that is used for the selection of the reference cluster (and reference strongly connected component) in a number of related analyses (SYNTRAJ_MD, EIGVAL_MD, DOPFOLD, and CMSMCFEP). The reference cluster is simply that cluster that contains the snapshot indicated by the value of this keyword and the reference component is the one the reference cluster belongs to. In the case of the generation of random walks on the network, the reference cluster will be the starting node for options 1 and 2. This keyword also becomes relevant when committor probabilities are requested (DOPFOLD) but no input file for the reference set B (see DOPFOLD for definitions) is specified or found (CLUFOLDFILE). In this case, CAMPARI reverts to use as the only cluster of the set B the cluster selected here. If no value is specified, the default will take the cluster with the largest number of frames and its component as reference (option 0 described below).
Because the keyword uses a snapshot index, it is important to point out that the value of this keyword must always be specified in absolute terms of the input data, i.e., generally speaking, no corrections must be applied in case CCOLLECT is greater than 1, a sequential access file with user-selected input frames is specified, or frames are discarded at the beginning. CAMPARI takes care of this automatically. The use of a frames file requires particular care. If the file accesses the trajectory in random access ("as is") mode, the snapshot index is assumed to refer to the line number in the frames file rather than the index of the frame on that line. This is a general change of interpretation inherent to FMCSC_FRAMESFILE with certain input file formats. Conversely, if the file accesses the trajectory in strictly sequential mode, step numbers continue to refer to the original trajectory. If the selected reference snapshot is not present in the data to be finally extracted, the program will terminate at the very beginning.
As special values we have:
- 0 (default): The cluster with the largest number of snapshots is selected as reference one for SYNTRAJ_MD, and possibly reference set B. Only the eigenvalue decomposition, the pseudo free energy profile, and the matrix of mean first-passage times for the corresponding strongly connected component are calculated (depends also on TMAT_MD), if requested.
- -1: The cluster with the largest number of snapshots is selected as reference one for SYNTRAJ_MD and possibly reference set B. CAMPARI will (if possible) utilize all strongly connected components of the underlying graph and use the largest cluster within each component (subgraph) as reference for multiple, distinct pseudo free energy profiles and the matrices of mean first-passage times (separate output files), if requested. Conversely, the eigenvalue decomposition will only be obtained for the single component containing the largest cluster overall.
- -2: The cluster with the largest number of snapshots within the largest strongly connected component is selected as reference one for SYNTRAJ_MD and possibly reference set B. Only the eigenvalue decomposition, the pseudo free energy profile, and the matrix of mean first-passage times for the corresponding strongly connected component are calculated (depends also on TMAT_MD), if requested.
If any type of structural clustering is performed (→ CCOLLECT), and CAMPARI was compiled and linked with HSL support (→ installation instructions), it is possible to perform a spectral analysis of the transition matrix(ces) derived from clustering (→ TMAT_MD), providing that the rank N of the transition matrix is > 3. The HSL library deputed to this task is the FORTRAN double precision version of EB13, viz. calls to EB13ID, EB13AD, and possibly EB13BD are made in CAMPARI whenever required. Those routines implement the Arnoldi method for large sparse matrices and the user is invited to read up on the relevant documentation (EB13). CAMPARI hides, however, some of the functionality offered by the HSL library itself. For example, referring to the relevant documentation (EB13), the Arnoldi method used by CAMPARI is always the one with Chebychev acceleration of the starting vectors, i.e., ICNTL(9) is hard-coded to 2, which is the only option we have tested. Currently, this choice can be altered only by modifying the source code where the relevant initialization happens (subroutine calc_eigs_msm(...) in source file graph_algorithms.f90). With the present keyword, the user can decide whether or not to perform the spectral decomposition of the transition matrix and, in case it is performed, how to sort the eigenvalues of the transition matrix (with this keyword the "IND" variable of EB13AD is set to the same value with an offset of -1). The following options are available:
- No spectral decomposition is performed (default).
- A selected number (→ NEIGV) of eigenvalues with largest absolute values are computed.
- A selected number (→ NEIGV) of right-most eigenvalues are computed. These are the eigenvalues with the largest real parts. This option is probably the only useful option for the spectral analysis of a transition matrix as complex eigenvalues are not generally interpretable to begin with.
- A selected number (→ NEIGV) of eigenvalues with largest imaginary parts are computed.
If the spectral analysis of the transition matrix is requested, several dependent keywords controlling the task to be solved as well as parameters of the Arnoldi method become relevant. The number of eigenvalues to be computed is set with NEIGV, while keyword DOEIGVECT lets the user request the computation of the eigenvectors associated with the NEIGV eigenvalues as well. The Arnoldi method is controlled by keywords NEIGBLOCKS, NEIGSTEPS, NEIGRST, and EIGTOL. The output produced by the use of this keyword is always written to a dedicated output file (EIGENS_xxx.dat). In addition, if the chosen option is 2 and the eigenvectors are available, output file STRUCT_CLUSTERING.graphml will contain the first eigenvector as well. Lastly, note that the same routines may be called in case the computation of the (-) committor was requested. In this particular case, the options are not directly controllable by the user, however.
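Real eigenvalues below 1 obtained from the spectral analysis are commonly converted to characteristic time scales via t = -τ/ln(λ), where τ is the lag time (see DOEIGVECT and CLAGT_MSM). The helper below is a small illustration of that conversion; the function name `implied_timescale` is hypothetical, and the result carries the same unit as the lag time passed in (frames or physical time).

```python
import math

def implied_timescale(eigenvalue, lag):
    """Return t = -lag / ln(eigenvalue) for a real eigenvalue in (0, 1)."""
    if not 0.0 < eigenvalue < 1.0:
        raise ValueError("eigenvalue must lie strictly between 0 and 1")
    return -lag / math.log(eigenvalue)

# Hypothetical example: eigenvalue 0.9 at a lag time of 10 (arbitrary units)
t_slow = implied_timescale(0.9, 10)
```

The eigenvalue of exactly 1 (steady state) has no finite time scale, and complex or negative eigenvalues are rejected here because their interpretation is unclear.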
Because the HSL routines are not (currently) threads-parallel, this functionality does not benefit from CAMPARI's shared memory (OpenMP) parallelization, which is a limitation.
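The constraints and default values documented for the Arnoldi keywords (NEIGV, NEIGBLOCKS, NEIGSTEPS) can be checked before a run with a short script. This is a sketch that only encodes the relations stated in this documentation; the function names are hypothetical, and CAMPARI may apply further adjustments internally.

```python
import math

def arnoldi_defaults(neigv):
    """Defaults quoted in the text: NEIGBLOCKS = NEIGV + 2 and
    NEIGSTEPS = ceiling(8 * NEIGV / (NEIGV + 2))."""
    nblocks = neigv + 2
    nsteps = math.ceil(8.0 * neigv / (neigv + 2))
    return nblocks, nsteps

def arnoldi_settings_ok(n, neigv, nblocks, nsteps):
    """Check N > 3, NEIGBLOCKS >= 1, NEIGSTEPS >= 2, and
    min(N, NEIGV) <= NEIGSTEPS * NEIGBLOCKS <= N."""
    if n <= 3 or nblocks < 1 or nsteps < 2:
        return False
    return min(n, neigv) <= nsteps * nblocks <= n

# Hypothetical request for 5 eigenvalues
nblocks, nsteps = arnoldi_defaults(5)
fits_large = arnoldi_settings_ok(100, 5, nblocks, nsteps)
fits_small = arnoldi_settings_ok(30, 5, nblocks, nsteps)
```

For small networks, the product of the default values can exceed the rank N, in which case this check fails and explicit values for NEIGBLOCKS and NEIGSTEPS would have to be chosen.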
If any type of structural clustering is performed (→ CCOLLECT), CAMPARI was compiled and linked with HSL support (→ installation instructions), and EIGVAL_MD is not zero, this integer value defines how many eigenvalues should be returned by the spectral decomposition (EIGVAL_MD) of the transition matrix(ces) (TMAT_MD). The returned eigenvalues are maximal in some sense, and this is defined by the choice for EIGVAL_MD. This keyword effectively sets the value of variable "NUMEIG" in the underlying HSL routine EB13. The value for this keyword influences the choice for NEIGSTEPS and NEIGBLOCKS, since it is required that min(N, NEIGV) ≤ NEIGSTEPS·NEIGBLOCKS ≤ N, where N is the rank of the transition matrix (N > 3). It is worth noting that the cost for the Arnoldi steps at each iteration scales as (NEIGBLOCKS·NEIGSTEPS)²·N, while the cost of computing the Hessenberg matrix is proportional to (NEIGBLOCKS·NEIGSTEPS)³ and the memory requirements are proportional to (NEIGBLOCKS·NEIGSTEPS)², as outlined in the documentation for EB13. Therefore, increasing the number of eigenvalues to be computed can both make it harder to reach the desired convergence criterion (EIGTOL), which may be addressed by keyword NEIGRST, and have a dramatic effect on execution time and memory footprint.
DOEIGVECT
If any type of structural clustering is performed (→ CCOLLECT), CAMPARI was compiled and linked with HSL support (→ installation instructions), and EIGVAL_MD is not zero, this simple logical (1 is true) allows the user to request CAMPARI to compute eigenvectors along with eigenvalues, which are added to the same output file, viz., EIGENS_xxx.dat. If EIGVAL_MD is set to 2, the first eigenvector contains the steady state of the transition network (TMAT_MD), which is also reported in the output file STRUCT_CLUSTERING.graphml. In case the network is fractured into multiple, strongly connected components, the computation and output are limited to the strongly connected component the reference cluster (INISYNSNAP) resides in. Additional eigenvectors (→ NEIGV), which refer to the eigenvalues smaller than 1, are often interpreted to report on the involvement of each cluster in the transition associated with a characteristic time scale, which is given by the corresponding eigenvalue λ as t = - τ/ln(λ) where τ is the lag time (→ CLAGT_MSM).
NEIGBLOCKS
If any type of structural clustering is performed (→ CCOLLECT), CAMPARI was compiled and linked with HSL support (→ installation instructions), and EIGVAL_MD is not zero, this keyword lets the user set the number of blocks for the Arnoldi method. It corresponds to the variable "NBLOCKS" in the reference library documentation (EB13). NEIGBLOCKS must be ≥ 1 and the conditions min(N, NEIGV) ≤ NEIGSTEPS·NEIGBLOCKS ≤ N must always hold true, with N being the rank of the transition matrix (TMAT_MD), N > 3. If NEIGBLOCKS is set to 1, the unblocked Arnoldi method is used (EB13). If this keyword is not found, the current default choice is to set NEIGBLOCKS to NEIGV + 2. However, the best choice for this value together with the value for NEIGSTEPS depends on the problem. In the reference documentation (EB13), the suggestion is to set NEIGBLOCKS to at least the value of NEIGV and to set NEIGSTEPS such that NEIGSTEPS·NEIGBLOCKS lies in the range between 3·NEIGV and 10·NEIGV.
NEIGSTEPS
If any type of structural clustering is performed (→ CCOLLECT), CAMPARI was compiled and linked with HSL support (→ installation instructions), and EIGVAL_MD is not zero, this integer variable sets the number of steps for the Arnoldi method and corresponds to the variable "NSTEPS" in the reference library documentation (EB13). The minimum allowed value is 2 and the requirements min(N, NEIGV) ≤ NEIGSTEPS·NEIGBLOCKS ≤ N must always be respected, with N being the rank of the transition matrix (TMAT_MD), N > 3. The current default choice is to set this variable to ceiling((8.·NEIGV)/(NEIGV + 2)), if no specifications are given from the user. However, the best choice for this value together with the value for NEIGBLOCKS is dependent on the problem. In the reference documentation (EB13), the suggestion is to set NEIGBLOCKS to at least the value of NEIGV and to set NEIGSTEPS such that NEIGSTEPS·NEIGBLOCKS lies in the range between 3·NEIGV and 10·NEIGV.
NEIGRST
If any type of structural clustering is performed (→ CCOLLECT), CAMPARI was compiled and linked with HSL support (→ installation instructions), and EIGVAL_MD is not zero, this keyword lets the user select the number of restarts of the Arnoldi method before the execution is terminated in case the desired convergence (EIGTOL) has not been achieved in the previous set of steps. The default value is set to 10, which means that the execution is aborted after 10 restarts from the possible intermediate and not yet converged solution. For the sake of completeness and clarity, the hard-coded value for the number of iterations within EB13, viz. ICNTL(11), is 999, and that is not the value set by this keyword.
EIGTOL
If any type of structural clustering is performed (→ CCOLLECT), CAMPARI was compiled and linked with HSL support (→ installation instructions), and EIGVAL_MD is not zero, this keyword sets the tolerance on the residuals that needs to be achieved before the computed solution of the eigenvalue problem on the transition matrix(ces) (→ TMAT_MD) is deemed appropriate. It defaults to 10³·"machine precision" and corresponds to CNTL(1) in the library documentation (EB13). For completeness, we mention here that ICNTL(7) is hard-coded to 1, which means that convergence is checked against the Frobenius norm of the matrix, which is computed by default.
HSL_VERBOSITY
If any type of structural clustering is performed (→ CCOLLECT) and CAMPARI was compiled and linked with HSL support (→ installation instructions), then this keyword controls the verbosity level for reporting from calls to HSL library functions. This is largely undocumented and depends on HSL implementation details. Values from 0 to 6 are allowed.
CMSMCFEP
If structural clustering is performed (→ CCOLLECT), which includes the case of the approximate progress index method, this keyword allows the user to select a type of cut-based pseudo free energy profile to be computed (reference). The target node for this profile can be chosen with keyword INISYNSNAP, which is snapshot-based and includes the selection of the largest cluster(s) (by sampling weight). Depending on the assumed direction of time (→ TMAT_MD) and the choice below more than 1 profile may be generated (separate output files). Currently, there are 4 fully supported options producing output (some hidden options exist, which will not be disabled):
- No cut-based free energy profiles are computed. This is the default.
- The mean first-passage times to the reference node in the Markov state model approximation are computed iteratively. After sorting all clusters according to these mean first-passage times, partitions can be defined as a function of a threshold time. The cut-based pseudo free energy profile associates each threshold time with the total weight of edges (number of transitions) crossing this threshold along the trajectory, and plots the normalized weight in logarithmic fashion (see elsewhere for details). Because the iterative algorithm may be slow to converge, its maximum execution time can be controlled by keyword MAXTIME_ITERS. In this mode, a separate profile for each strongly connected component of a nonergodic graph can be produced if INISYNSNAP is -1. These profiles are referenced to the respective largest clusters (by sampling weight) in each component. The iterative algorithm used here benefits from CAMPARI's shared memory (OpenMP) parallelization (similar to network equilibration). Parallelization is such that the results should always be identical to serial execution.
- The (+) committor probabilities for a set of clusters defining a target set and an unfolded set are used to sort all clusters (folded and unfolded set members have values of 1.0 and 0.0 by definition, and clusters are sorted in decreasing order). This requires having computed those committor probabilities separately (CMSMCFEP can not be used to enable this calculation) with the help of keyword DOPFOLD. Because the committor probabilities are only available for the reference component the sets reside in, INISYNSNAP has no direct influence (in particular, option -1 is not available). See elsewhere for details on the corresponding output file(s).
- This is the same as the previous option only that the (-) committor probabilities are used instead, which additionally relies on keyword DOPFOLD_MINUS. See elsewhere for details on the corresponding output file(s).
- This is the combination of the prior 2 options, i.e., separate profiles based on both (+) and (-) committor probabilities are produced. The same dependencies and restrictions apply.
- In this case, clusters are ordered by the entries of the eigenvector(s) computed from the spectral decomposition of the transition matrix. In order to enable this mode, it is thus necessary to consider keywords EIGVAL_MD, which must be set to 2, DOEIGVECT, NEIGV, and CCOLLECT, and possibly related ones (INISYNSNAP, TMAT_MD, CMODE, etc.). This means that the required functionalities are not automatically enabled by the present keyword CMSMCFEP. Since the entries of the first eigenvector with EIGVAL_MD enforced to 2 are the steady state of the associated MSM, NEIGV must be ≥ 2 to possibly produce any output. The number of output files produced is at maximum equal to the number of converged eigenvectors. Note, however, that computations are stopped (with no crash) at the first occurrence of a complex eigenvector, as it is unclear how complex spectral properties should be interpreted. This may change in the future. As a final note, analogously to modes 8, 9, and 10, only the clusters belonging to the component identified with INISYNSNAP will compose the cut-profile(s) (see elsewhere for information on the output file(s)). This also means that setting INISYNSNAP to -1 will not produce multiple sets of output file(s) (one set of file(s) per component), but it will just amount to using the largest cluster by sampling weight to identify the one reference component (same as INISYNSNAP set to 0). The map between these component-local clusters and the original absolute numbering can be recovered, as usual, by the corresponding centroids annotated in the output file(s), or by file STRUCT_CLUSTERING.graphml, or through the simplified transition matrix output if TMATREPORT is enabled.
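The iterative computation of mean first-passage times to a reference node, which underlies the first profile type above, can be sketched as a simple fixed-point iteration. This is an illustrative sketch only: the function name `mfpt_to_target`, the dense matrix, and the stopping rule are assumptions, and the construction of the cut-based profile itself is not shown.

```python
def mfpt_to_target(T, target, tol=1e-12, max_iter=100000):
    """Iteratively solve m_i = 1 + sum_{j != target} T[i][j] * m_j, m_target = 0.

    Times are in units of the lag time. Returns (m, converged).
    """
    n = len(T)
    m = [0.0] * n
    for _ in range(max_iter):
        new = [0.0 if i == target else
               1.0 + sum(T[i][j] * m[j] for j in range(n) if j != target)
               for i in range(n)]
        delta = max(abs(a - b) for a, b in zip(new, m))
        m = new
        if delta < tol:
            return m, True
    return m, False

# Hypothetical two-state transition matrix; state 0 is the reference node
T = [[0.9, 0.1],
     [0.2, 0.8]]
m, ok = mfpt_to_target(T, target=0)

# Clusters sorted by increasing mean first-passage time to the reference
order = sorted(range(len(m)), key=lambda i: m[i])
```

Sorting the clusters by these values yields the ordering along which partitions can then be defined as a function of a threshold time.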
If any type of structural clustering is performed (→ CCOLLECT), and CAMPARI has been linked to LAPACK (see installation instructions), this keyword acts as a simple logical (0 or 1) to request the computation of the matrix of mean first-passage time values, using the classic procedure of Kemeny and Snell (see Hunter for details). If successful, this produces one or more output files (MFPT_MATRIX_yyy_xxxxxxxx.dat). They can be obtained for multiple strongly connected components (→ INISYNSNAP) and assumed directions of time (→ TMAT_MD). The details are provided in the documentation of the output file itself. Note that the computation is currently skipped automatically if the number of states (clusters) in the component is larger than 10000. This is to protect from memory overflow. Knowing the steady state of the underlying transition matrix is essential for computing the mean first-passage times in this way, and thus it is necessary to set CREWEIGHT to 1 if HSL routines are not available. If HSL routines are available, any missing steady-state information will be computed in essentially the same way as setting EIGVAL_MD to 2, DOEIGVECT to 1, and NEIGV to 1.
Similarly to almost all graph-derived quantities, mean first-passage times will respond to any modification of the underlying transition matrix, e.g., via keywords CADDLINKMODE or CLAGT_MSM (indirectly). If the transition matrix does not fulfill the criterion of detailed balance, mean first-passage times will not satisfy any symmetries. If it does, the forward and backward time versions should be the same (within numerical accuracy). The matrix itself is never symmetric. Note that switching the time arrow does not switch the meaning of "to" and "from" in the network prediction but only the directionality of increments to the transition matrix. This is why low-likelihood states, which are inherently difficult to reach irrespective of network connectivity, will always have higher MFPT values across a column but not show any systematic bias across a row (since getting to other states is largely independent of the steady-state probability of the starting state). This is directly related to an invariant of this matrix, the so-called Kemeny constant.
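The relation between mean first-passage times, the steady state, and the Kemeny constant can be illustrated with a minimal sketch. It does not reproduce CAMPARI's LAPACK-based Kemeny and Snell procedure; instead it solves the equivalent first-step equations with a small Gaussian elimination, and the steady state comes from plain power iteration. The 3-state matrix `T` (row-stochastic, `T[i][j]` being the probability of stepping from i to j) and all function names are made up for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def steady_state(T, iters=2000):
    """Steady state (left eigenvector for eigenvalue 1) by power iteration."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
    return pi


def mfpt_matrix(T):
    """M[i][j]: mean number of steps to first reach j from i (M[j][j] is left at 0)."""
    n = len(T)
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        rest = [i for i in range(n) if i != j]
        # First-step equations: m_i = 1 + sum_{k != j} T_ik m_k for all i != j
        a = [[(1.0 if r == c else 0.0) - T[r][c] for c in rest] for r in rest]
        for r, value in zip(rest, solve(a, [1.0] * len(rest))):
            M[r][j] = value
    return M


# Hypothetical transition matrix; state 2 has the lowest steady-state weight
T = [[0.90, 0.08, 0.02],
     [0.10, 0.85, 0.05],
     [0.30, 0.20, 0.50]]
pi = steady_state(T)
M = mfpt_matrix(T)
# Kemeny constant: the pi-weighted row sum is the same for every starting state
kemeny = [sum(pi[j] * M[i][j] for j in range(3)) for i in range(3)]
print(pi)
print(M)
print(kemeny)
```

The three entries of `kemeny` agree to numerical accuracy, which is the invariant mentioned above, while the matrix `M` itself is not symmetric.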
If any type of structural clustering is performed (→ CCOLLECT), this keyword specifies whether random walks on a transition network should be performed (and recorded), and how the initial and termination conditions are chosen. Synthetic trajectories are always confined within the strongly connected component that contains the cluster that hosts the initial reference snapshot (INISYNSNAP). The values allowed for the present keyword and their associated outcomes are:
- No synthetic trajectories are generated (default).
- Random walks are initiated in the cluster that contains the reference initial snapshot (INISYNSNAP) and are terminated either when the walker hits the cluster that contains the target snapshot (ENDSYNSNAP) or when the target number of steps per trajectory is exceeded (NSYNSNAPS). In the latter scenario, the unsuccessful trajectory is not written to file (MSM_SYN_TRAJ_xxxxx_yyy.frames). The target number of trajectories to be generated is set by the keyword NSYNTRAJS. Since trajectories may fail to hit the target end node, it is possible that the number of successful trajectories is less than NSYNTRAJS. If trajectories fail repeatedly, it is advisable to increase the number of steps per trajectory (NSYNSNAPS). If the fraction of productive trajectories is small, their lengths will obviously be biased systematically toward shorter lengths.
- Synthetic trajectories are started at the cluster that hosts the reference snapshot (INISYNSNAP) and propagated for NSYNSNAPS steps, regardless of where they end. Therefore, the generation of a trajectory is always successful, viz., NSYNTRAJS trajectories are always written to file (MSM_SYN_TRAJ_xxxxx_yyy.frames).
- Each of the NSYNTRAJS trajectories starts in a random cluster. The probability that a cluster is the starting one reflects its statistical weight, which is proportional to the raw population of the cluster if no equilibration of the transition network is performed or to the steady state of the Markov State Model otherwise (see CREWEIGHT and EIGVAL_MD). Each trajectory is propagated for NSYNSNAPS steps and written to file (MSM_SYN_TRAJ_xxxxx_yyy.frames). Keyword INISYNSNAP is used solely to identify the strongly connected component where the random walk takes place.
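The three generation modes listed above can be mimicked with a few lines of Python. This is only a sketch under stated assumptions: the transition matrix `T`, the statistical weights, the seed, and all function names are hypothetical, and nothing here reflects how CAMPARI implements or stores its random walks.

```python
import random


def pick_next(row, rng):
    """Draw the next cluster index from one row of the transition matrix."""
    return rng.choices(range(len(row)), weights=row, k=1)[0]


def walk_to_target(T, start, target, max_steps, rng):
    """Mode-1-like walk: return the path if target is hit within max_steps, else None."""
    path = [start]
    for _ in range(max_steps):
        path.append(pick_next(T[path[-1]], rng))
        if path[-1] == target:
            return path
    return None  # unsuccessful attempt, would not be written to file


def walk_fixed_length(T, start, n_steps, rng):
    """Mode-2-like walk: always propagate for n_steps steps, wherever it ends."""
    path = [start]
    for _ in range(n_steps):
        path.append(pick_next(T[path[-1]], rng))
    return path


# Hypothetical 3-cluster transition matrix (rows sum to 1)
T = [[0.90, 0.08, 0.02],
     [0.10, 0.85, 0.05],
     [0.30, 0.20, 0.50]]
rng = random.Random(12345)

attempts = 200   # plays the role of NSYNTRAJS in the first mode
max_steps = 20   # plays the role of NSYNSNAPS
tries = (walk_to_target(T, 0, 2, max_steps, rng) for _ in range(attempts))
reactive = [p for p in tries if p is not None]
print(len(reactive), "of", attempts, "attempts reached the target")

fixed = [walk_fixed_length(T, 0, max_steps, rng) for _ in range(10)]

# Mode-3-like choice of starting clusters according to (assumed) statistical weights
weights = [0.5, 0.4, 0.1]
starts = rng.choices(range(3), weights=weights, k=10)
weighted = [walk_fixed_length(T, s, max_steps, rng) for s in starts]
```

With a short step limit, only the fastest walks reach the target, which is the systematic bias toward short productive trajectories described for the first mode.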
If any type of structural clustering is performed (→ CCOLLECT), and the generation of synthetic trajectories with a target end point has been requested, this keyword lets the user select the target snapshot for these random walks. It works analogously to keyword INISYNSNAP, and it is up to the user to ensure that the reference end target node selected this way differs from the starting one and belongs to the same strongly connected component. If these conditions are not met, the relevant analyses will be skipped. In case committor probabilities are requested but no input clusters are provided (CLUUNFOLDFILE) for set A (see DOPFOLD for definitions), CAMPARI reverts to using the cluster selected here as the only one forming set A. If this keyword is not specified but needed, CAMPARI will use the last stored snapshot from the trajectory as reference end snapshot (default).
Because, like INISYNSNAP, the keyword uses a snapshot index, it is important to point out that the value of this keyword must always be specified in absolute terms of the input data, i.e., generally speaking, no corrections need to be applied in case CCOLLECT is greater than 1, a sequential access file with user-selected input frames is specified, or frames are discarded at the beginning. CAMPARI takes care of this automatically. The use of a frames file requires particular care. If the file accesses the trajectory in random access ("as is") mode, the snapshot index is assumed to refer to the line number in the frames file rather than the index of the frame on that line. This is a general change of interpretation inherent to FMCSC_FRAMESFILE with certain input file formats. Conversely, if the file accesses the trajectory in strictly sequential mode, step numbers continue to refer to the original trajectory.
If any type of structural clustering is performed (→ CCOLLECT), and the generation of synthetic trajectories has been requested, this keyword specifies the target number of synthetic trajectories (random walks) to be generated. The default value is 10 and the upper limit is 104. This number is guaranteed to be the actual number of generated trajectories only if SYNTRAJ_MD is not set to 1. For trajectories with requested start and end points (mode 1), this keyword specifies the total number of attempts instead. The fraction and average length of productive trajectories will be reported to log output, but only successful trajectories are written to file.
NSYNSNAPS
If any type of structural clustering is performed (→ CCOLLECT), and the generation of synthetic trajectories has been requested, this keyword sets the (maximum) number of steps per synthetic trajectory (random walk). All trajectories will have this length unless keyword SYNTRAJ_MD is set to 1. Note that in mode 1, where both a starting and an end point are used, too small a value for NSYNSNAPS will obviously bias the distribution of productive (reactive) trajectories to short ones.
SYNTRAJOUT
If any type of structural clustering is performed (→ CCOLLECT), and the generation of synthetic trajectories has been requested, this keyword sets the output frequency for the synthetic trajectories (random walks) themselves. These files are documented elsewhere, but they ultimately contain lists of integers, which can get large (in total file size) very quickly. This is why this keyword allows the user to print only every SYNTRAJOUT-th step of each random walk to the corresponding output file. If SYNTRAJ_MD is 1, the keyword can also be set to 0, in which case all output is suppressed (using a very large value instead causes the individual files and just their respective header lines to be written).
DOPFOLD
If any type of structural clustering is performed (→ CCOLLECT), and CAMPARI was compiled and linked with HSL support (→ installation instructions), this simple logical (1 is true) selects whether or not to compute (+) committor probabilities (or pfolds+ values) for the clusters that belong to the reference component of the graph inferred from the clustering. The underlying transition matrix can be modified in various ways (see TMAT_MD for details), which may weaken or fracture the graph into multiple strongly connected components.
A set of clusters to form the target set B (CLUFOLDFILE) and a set of clusters to form an alternative set A (CLUUNFOLDFILE) are required and must belong to the same component. If these input files are missing, choices are deferred to keywords INISYNSNAP and ENDSYNSNAP. The remaining clusters in the same component constitute the intermediate set, and for each of them CAMPARI can calculate the probability that a random walker started in that intermediate state reaches any cluster in the target set B before it reaches any cluster in the other set A. By definition (as boundary condition), all the clusters that belong to the target set B have a pfold+ value equal to 1, while all the clusters that belong to set A have a pfold+ value equal to 0.
If DOPFOLD is set to 1, the committor probabilities are obtained, with the aid of HSL-provided (double precision) external routines (HSL_MA48), as the solution of a linear system of equations for the clusters i in the intermediate set I (Noé et al.):
$$ -p^{+}_{\mathrm{fold},i} + \sum_{j \in I} T_{ij}\, p^{+}_{\mathrm{fold},j} = -\sum_{j \in B} T_{ij} $$
Here, T is the underlying transition matrix (→ TMAT_MD). Once solved, the committor probabilities are written to a specific output file (PFOLD_PLUS_xxx.dat). The computed pfold+ values are obviously sensitive to any modifications to the transition matrix. The time direction matters unless detailed balance holds (→ CADDLINKMODE), in which case the pfold+ values become equivalent to 1.0-pfold- computed via keyword DOPFOLD_MINUS. In case detailed balance does not hold, the values for the (-) committors (pfolds-) must be computed separately.
One of the reasons to compute committor probabilities may be probability flux analyses and decompositions of pathways according to transition path theory (Noé et al., Berezhkovskii et al.), and committors are indeed fundamental to these analyses. This aspect is covered by keyword TPTMODE.
Because the HSL routines are not (currently) threads-parallel, this functionality does not benefit from CAMPARI's shared memory (OpenMP) parallelization, which is a limitation.
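As a small illustration of the linear system for the (+) committors, the sketch below solves it for a hypothetical four-state chain. It deliberately swaps in a plain Gauss-Seidel fixed-point iteration for the direct sparse solver (HSL_MA48) that CAMPARI uses; the matrix `T`, the choice of sets, and the function name are assumptions made for this example only.

```python
def committor_plus(T, set_a, set_b, tol=1e-13, max_iter=100000):
    """Iterate q_i = sum_j T_ij q_j on the intermediate set, with q = 0 on A and q = 1 on B."""
    n = len(T)
    q = [0.0] * n
    for b in set_b:
        q[b] = 1.0
    intermediate = [i for i in range(n) if i not in set_a and i not in set_b]
    for _ in range(max_iter):
        change = 0.0
        for i in intermediate:
            # Terms with j in A vanish (q_j = 0); terms with j in B contribute T_ij
            new = sum(T[i][j] * q[j] for j in range(n))
            change = max(change, abs(new - q[i]))
            q[i] = new
        if change < tol:
            break
    return q


# Hypothetical linear chain 0 - 1 - 2 - 3 with set A = {0} and target set B = {3}
T = [[0.50, 0.50, 0.00, 0.00],
     [0.25, 0.50, 0.25, 0.00],
     [0.00, 0.25, 0.50, 0.25],
     [0.00, 0.00, 0.50, 0.50]]
q_plus = committor_plus(T, set_a={0}, set_b={3})
print(q_plus)
```

For this symmetric chain the two intermediate states obtain values of 1/3 and 2/3, as expected from solving the two coupled equations by hand.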
If any type of structural clustering is performed (→ CCOLLECT), CAMPARI was compiled and linked with HSL support (→ installation instructions), and (+) committor probabilities have been computed, this simple logical keyword (1 is true) lets the user request the computation of (-) committor probabilities or pfolds- values as well. These committors are defined as the probability that a random walk that reaches an intermediate state i was last seen in the alternative set (A) rather than in the target set (B). Details on nomenclature and background are found in the description of keyword DOPFOLD. Their computation requires the solution of a linear system similar to the one specified for the (+) committors (pfold+):
$$ -p^{-}_{\mathrm{fold},i} + \sum_{j \in I} \bar{T}_{ij}\, p^{-}_{\mathrm{fold},j} = -\sum_{j \in A} \bar{T}_{ij} $$
Here, T̄ is defined by $\bar{T}_{ij} = (\pi_j/\pi_i)\, T_{ji}$, where $\pi_i$ is the steady state probability of node i of the original transition matrix T. The underlying transition matrix is affected by a number of keywords (see TMAT_MD for details). The resultant (-) committor probabilities are written to a specific output file (PFOLD_MINUS_xxx.dat). If microscopic reversibility (detailed balance) holds (→ CADDLINKMODE), the (-) committor probabilities can simply be computed as 1.0-pfold+, and there is no need to use DOPFOLD_MINUS. If it does not hold, T̄ is not simply the backward time transition matrix (→ TMAT_MD). Therefore, if DOPFOLD_MINUS is used, CAMPARI always computes T̄ from the definition above, which may be numerically problematic in case state probabilities differ by some orders of magnitude. Values for the (-) committor are required in transition path theory (→ TPTMODE). Since the steady state (first eigenvector) is required, CAMPARI will check whether an acceptable solution is already available from the use of keywords EIGVAL_MD or CREWEIGHT (in this order). If not, CAMPARI attempts to solve for the steady state using the HSL library EB13, similar to what would be done if EIGVAL_MD is 2, DOEIGVECT is 1, and NEIGV is 1. For this solution, keywords NEIGBLOCKS and NEIGSTEPS are not respected (values of 3 and 3 are used instead), but the settings for EIGTOL and NEIGRST remain relevant. If the solution is successful and acceptable, it will also be reported in output file STRUCT_CLUSTERING.graphml.
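The construction of T̄ from the steady state and its use for the (-) committors can be sketched as follows. The example is not CAMPARI's implementation: the steady state is obtained by simple power iteration rather than HSL EB13, the linear systems are solved by Gauss-Seidel iteration rather than HSL_MA48, and the four-state chain and all names are hypothetical. Because the chosen chain satisfies detailed balance, the result can be checked against 1.0-pfold+.

```python
def steady_state(T, iters=5000):
    """Steady state (left eigenvector for eigenvalue 1) by power iteration."""
    n = len(T)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * T[i][j] for i in range(n)) for j in range(n)]
    return pi


def reversed_matrix(T, pi):
    """T_bar[i][j] = (pi_j / pi_i) * T[j][i], following the definition in the text."""
    n = len(T)
    return [[pi[j] / pi[i] * T[j][i] for j in range(n)] for i in range(n)]


def committor(M, zero_set, one_set, tol=1e-13, max_iter=100000):
    """Solve q_i = sum_j M_ij q_j on the intermediate set (q = 0 / 1 on the two boundary sets)."""
    n = len(M)
    q = [1.0 if i in one_set else 0.0 for i in range(n)]
    inner = [i for i in range(n) if i not in zero_set and i not in one_set]
    for _ in range(max_iter):
        change = 0.0
        for i in inner:
            new = sum(M[i][j] * q[j] for j in range(n))
            change = max(change, abs(new - q[i]))
            q[i] = new
        if change < tol:
            break
    return q


# Hypothetical reversible chain with set A = {0} and target set B = {3}
T = [[0.50, 0.50, 0.00, 0.00],
     [0.25, 0.50, 0.25, 0.00],
     [0.00, 0.25, 0.50, 0.25],
     [0.00, 0.00, 0.50, 0.50]]
A, B = {0}, {3}
pi = steady_state(T)
T_bar = reversed_matrix(T, pi)
q_plus = committor(T, zero_set=A, one_set=B)        # equals 1 on B
q_minus = committor(T_bar, zero_set=B, one_set=A)   # equals 1 on A
print(q_plus)
print(q_minus)
```

The division by `pi[i]` in `reversed_matrix` is where the numerical problems mentioned above would arise if the state probabilities differed by many orders of magnitude.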
If this simple logical is enabled (1 is true), several files that document the linear system(s) solved for the computation of committor probabilities (DOPFOLD, DOPFOLD_MINUS) may be written. Those files are: fold_clus.out, unfold_clus.out, mat_pfold_XXX_YYY.dat, rhs_pfold_XXX_YYY.dat, and possibly tmat_pfold_XXX_minus.dat and ss_pfold_XXX.dat. Files fold_clus.out and unfold_clus.out simply replicate the clusters that make up set B and set A (see also DOPFOLD for definitions), respectively, while files mat_pfold_XXX_YYY.dat contain the coefficients of the linear system(s) solved to obtain the committors, viz. T - I for DOPFOLD and/or T̄ - I for DOPFOLD_MINUS. These last two outputs obviously consider only the entries relevant to the edges between the intermediate states and the self transitions, similarly to the rhs_pfold_XXX_YYY.dat files, where the right-hand sides of the linear systems (DOPFOLD and/or DOPFOLD_MINUS) are written. Files tmat_pfold_XXX_minus.dat contain the entries of T̄ and are written only if DOPFOLD_MINUS is enabled, while files ss_pfold_XXX.dat contain the steady state that was used to construct T̄. Those are written only in case the steady state was not available from other options (for details see DOPFOLD_MINUS).
CLUFOLDFILE
If any type of structural clustering is performed (→ CCOLLECT), CAMPARI was compiled and linked with HSL support (→ installation instructions), and the computation of at least (+) committor probabilities has been requested, this keyword lets the user specify the path and name of the file that stores the reference snapshots for the selection of the clusters forming the target set B for all committor probabilities. Details on the format and interpretation of the input are provided elsewhere.
In case this file is not provided or not found, CAMPARI will revert to the cluster defined by keyword INISYNSNAP as the only representative of set B. In case keyword INISYNSNAP is not specified either, CAMPARI will use the cluster that contains the largest number of snapshots. Because the sets A and B have to reside in the same strongly connected component of the clustering graph, it may not be possible to know reasonable values a priori. It can therefore be helpful to use this analysis in conjunction with a previously generated clustering-derived graph. The graph can be analyzed to identify suitable values, and the committor probability analysis can be performed in a second step by reading the coarse-grained trajectory in conjunction with any link structure modifications back in by means of file-based clustering (modes 6-7 for CMODE).
If any type of structural clustering is performed (→ CCOLLECT), CAMPARI was compiled and linked with HSL support (→ installation instructions), and the computation of at least (+) committor probabilities has been requested, this keyword lets the user specify the path and name of the file that stores the reference snapshots for the selection of the clusters forming the alternative set A for all committor probabilities. Details on the format and interpretation of the input are provided elsewhere.
In case this file is not provided or not found, CAMPARI will revert to the cluster defined by keyword ENDSYNSNAP as the only representative of set A. In case keyword ENDSYNSNAP is not specified either, CAMPARI will use the cluster that contains the last snapshot of the trajectory, which is not generally meaningful. Because the sets A and B have to reside in the same strongly connected component of the clustering graph, it may not be possible to know reasonable values a priori. It can therefore be helpful to use this analysis in conjunction with a previously generated clustering-derived graph. The graph can be analyzed to identify suitable values, and the committor probability analysis can be performed in a second step by reading the coarse-grained trajectory in conjunction with any link structure modifications back in by means of file-based clustering (modes 6-7 for CMODE).
If any type of structural clustering is performed (→ CCOLLECT), CAMPARI was compiled and linked with HSL support (→ installation instructions), and the computation of both (+) committor probabilities and (-) committor probabilities has been requested, this keyword lets the user choose between several approaches to extract paths from an underlying flux network. Generally speaking, the flux along an edge will be given as:
$$ f_{ij} = \pi_i\, p^{-}_{\mathrm{fold},i}\, T_{ij}\, p^{+}_{\mathrm{fold},j} $$
The net flux, $rf_{ij}$, is $f_{ij} - f_{ji}$ if $f_{ij}$ is greater than $f_{ji}$ and 0 otherwise. Here, the $T_{ij}$ are the transition matrix elements, which are dependent on various possible modifications (see TMAT_MD for details), while $p^{+}_{\mathrm{fold},i}$ and $p^{-}_{\mathrm{fold},i}$ are the forward and backward committor values for cluster i as also reported in output files PFOLD_PLUS_xxx.dat and PFOLD_MINUS_xxx.dat. The source and target states can be defined as explained for keyword DOPFOLD. From the boundary conditions of the calculation, it is clear that any edge going into the source state will carry zero flux (because $p^{+}_{\mathrm{fold},j}$ will be 0.0), just like any edge going out of the target state will carry zero flux (because $p^{-}_{\mathrm{fold},i}$ will be 0.0). The total reactive flux is easily measured by summing over all outgoing edges for the source state, or, alternatively, over all incoming edges of the target state.
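The edge flux and net flux definitions above translate directly into a few lines of Python. The sketch uses a hypothetical reversible four-state chain whose steady state and committor values are entered by hand (they are assumptions of this example, not CAMPARI output), with state 0 as the source and state 3 as the target.

```python
def edge_fluxes(T, pi, q_minus, q_plus):
    """f[i][j] = pi_i * q_minus_i * T_ij * q_plus_j; self-transitions are set to 0
    because they cancel in the net flux anyway."""
    n = len(T)
    return [[pi[i] * q_minus[i] * T[i][j] * q_plus[j] if i != j else 0.0
             for j in range(n)] for i in range(n)]


def net_fluxes(f):
    """rf[i][j] = f_ij - f_ji if that difference is positive, else 0."""
    n = len(f)
    return [[max(0.0, f[i][j] - f[j][i]) for j in range(n)] for i in range(n)]


# Hypothetical example data for the chain 0 - 1 - 2 - 3
T = [[0.50, 0.50, 0.00, 0.00],
     [0.25, 0.50, 0.25, 0.00],
     [0.00, 0.25, 0.50, 0.25],
     [0.00, 0.00, 0.50, 0.50]]
pi = [1 / 6, 1 / 3, 1 / 3, 1 / 6]          # steady state of T
q_plus = [0.0, 1 / 3, 2 / 3, 1.0]          # (+) committors, target set B = {3}
q_minus = [1.0, 2 / 3, 1 / 3, 0.0]         # (-) committors, source set A = {0}
source, target = 0, 3

f = edge_fluxes(T, pi, q_minus, q_plus)
rf = net_fluxes(f)
flux_out_of_source = sum(rf[source])
flux_into_target = sum(row[target] for row in rf)
print(flux_out_of_source, flux_into_target)
```

Both sums give the same total reactive flux, and in this linear chain every forward edge carries that same net flux.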
The standard approach to pathway decomposition first imposes that p+fold must increase rigorously along the pathway. This is fundamentally dissimilar from stochastic pathways such as those found by the generation of synthetic trajectories. The network containing only reactive fluxes as edge weights has non-zero edge weights only in one direction for every edge (see above) and satisfies this requirement (albeit exactly only if detailed balance holds). Then, we can identify paths that carry a proportionally large amount of flux by finding the shortest paths in a (slightly) modified network that uses as edge weight the quantity $-\log( rf_{ij} / \sum_k rf_{ik} )$. The logarithm is used to avoid numerical underflow when dealing with large products of small numbers, and the sign is reversed because the algorithms are written to look for shortest, not longest paths. The normalization by the net reactive flux into or out of a node might appear confusing at first, but it is necessary to prevent overcounting flux values. For example, in a linear graph, there is only one (forward) path between the two end states, and all edges in the network already carry the same net reactive flux. Thus, the total flux of a path must contain the absolute flux value exactly once, which is the outgoing edge from the source state. A single shortest path can be found very quickly using Dijkstra's algorithm, while a set of k shortest paths can be found more readily using Eppstein's algorithm.
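The edge weights and the single-shortest-path search described above can be sketched with Python's `heapq` module. The net flux network `rf` (two parallel channels from source 0 to target 3), the function names, and the way the path flux is recovered from the path cost are assumptions of this example and not a description of CAMPARI's code.

```python
import heapq
import math


def flux_weights(rf):
    """Edge weights -log(rf_ij / sum_k rf_ik) for all edges with positive net flux."""
    out = {}
    for (i, _), v in rf.items():
        out[i] = out.get(i, 0.0) + v
    return {(i, j): -math.log(v / out[i]) for (i, j), v in rf.items() if v > 0.0}


def dijkstra(weights, source, target):
    """Return (path, cost) of the shortest path, or (None, inf) if target is unreachable."""
    adj = {}
    for (i, j), w in weights.items():
        adj.setdefault(i, []).append((j, w))
    best, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > best[u]:
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            if d + w < best.get(v, math.inf):
                best[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    if target not in best:
        return None, math.inf
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1], best[target]


# Hypothetical net flux network: channel 0-1-3 carries 0.03, channel 0-2-3 carries 0.01
rf = {(0, 1): 0.03, (1, 3): 0.03, (0, 2): 0.01, (2, 3): 0.01}
source, target = 0, 3
total_flux = sum(v for (i, _), v in rf.items() if i == source)

path, cost = dijkstra(flux_weights(rf), source, target)
# The product of normalized fluxes along the path is exp(-cost); multiplying by the
# total flux out of the source counts the absolute flux value exactly once.
path_flux = total_flux * math.exp(-cost)
print(path, path_flux)
```

Because every normalized flux is at most 1, all weights are non-negative, which is what makes Dijkstra's algorithm applicable here.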
Taken together, this setup gives rise to the following strategies implemented in CAMPARI. Note that none of these take proper advantage of CAMPARI's shared memory (OpenMP) parallelization, which can be an issue if the interest is in discovering very many paths or when working on very large graphs:
- This is the default and disables the analysis of transition paths.
- This option only computes and reports the net flux without performing any decomposition into transition paths.
- This option follows the traditional method of Berezhkovskii et al., in which a single shortest path is determined. This is followed by removing the smallest edge capacity along the path from the flux network. This effectively eliminates all paths using that edge. The procedure is applied iteratively until a threshold for the net integrated flux has been met, the allotted maximum time has been exceeded, or the maximum allowed number of paths has been reached.
- This is the same as the previous option only that rather than removing the entire capacity of the minimum edge (bottleneck), a fraction of this capacity is removed to update the flux network, as introduced by Bacci et al. It can lead to redundancy in the output but will be more appropriate than the previous option (see below for details) in capturing path diversity. Setting TPTBTNFRACTION to 1.0 makes this option equivalent to the previous one.
- This is the same as the previous option only that the base quantity to remove is not the capacity of the minimum edge along the path but rather flux carried by the path. This quantity is the same as the bottleneck capacity if and only if the bottleneck edge is unique to the path in question. Keyword TPTBTNFRACTION is interpreted analogously to the previous option. This option will almost always lead to redundancy in the output but can be more appropriate in resolving path degeneracy than option 2.
- This is the only strictly rigorous option of decomposing a flux network into a selected (maximum) number of paths. This is the correct approach for any network where this decomposition is able to encapsulate the majority of the flux. For networks with many states that are well-connected, at least in part, this is usually not the case. Here, the integrated flux of a very large number of paths will instead be vanishingly small, and the results become harder to interpret (see below for details).
- This is a generalization of option 3 above where the flux network is updated only after extracting a finite number of paths using Eppstein's algorithm; this number is set by TPTMAXPATHS (despite its name). The algorithm proceeds for TPTMAXKSP iteration cycles. With TPTMAXKSP set to 1, it is the same as option 5. With TPTMAXPATHS set to 1, it is the same as option 3 except that TPTMAXKSP plays the role of TPTMAXPATHS.
- This is the same generalization to option 4 that option 6 is to option 3, and the keywords controlling it are the same as for option 6.
Options 2-7 require some additional comments. Calculating a very large number of shortest paths in the same network (option 5) is clearly the most appropriate variant from a conceptual point of view. The reason why it is not often used in practice is that, for large numbers of states that are at least partially well-connected, the number of similar paths grows combinatorially, all contributing a vanishing amount of flux. Conversely, with small numbers of states (< 50), it may well be both a possible and an advisable approach (see Tutorial 19). The idea of modifying the flux network and recomputing one or several shortest paths in options 2-4 and 6 is exactly related to this. Here, paths are lumped together in an implicit manner by arguing, in essence, that all paths that share the same edge with the lowest flux capacity (bottleneck) will be closely related. This is very much rooted in the idea that the state space is well-partitioned, offering kinetically homogeneous regions (metastable states) that are separated by a large difference in time scales from each other (a Kramers-like picture). It is then further assumed that the state partitioning is coarse enough that the edges do not suffer so much from statistical noise that the position of the bottleneck becomes arbitrary. In other words, the bottleneck edge is expected to be found between metastable states and not within them, and it should ideally measure the number of transitions between these coarse states directly. If, lastly, there are alternative but well-separated pathways, then a flux decomposition according to option 2 should produce exactly one representative pathway for each of these channels of probability flux. By removing the entire capacity of the bottleneck edge for the pathway, no further pathways involving this edge can be found, and all pathways that use this edge are implicitly contained in the single pathway being reported.
Clearly, it will be difficult to rely on all of these conditions being met in general, which is why the results of this approach can be hard to interpret, in particular the values for the carried flux.
Option 3 generalizes this by allowing only a fraction of the bottleneck capacity to be removed. This means that the edge in question remains "alive" but is successively penalized every time it is part of a path. This algorithm can never be "complete" (it approaches the total integrated flux only exponentially) and reports many more paths, which can be identical. The latter point means that some post-processing is usually required. The advantage is that the degeneracy of those paths sharing the same bottleneck edge can be resolved, at least to some extent. Option 6 offers another level of generalization by extracting several shortest pathways at once before updating the flux network as is done for option 3. Option 4 is the same as option 3 except that the quantity removed along the path is the actual carried flux of the path, which is the same as the bottleneck capacity only if the bottleneck edge is unique to this path. In all cases, the ad hoc modification of the flux network can have unwanted consequences: paths are not obtained in the correct order and the level of implied degeneracy cannot generally be read off. Additionally, if TPTBTNFRACTION is significantly smaller than 1.0 (not relevant for mode 2) or, for mode 6, TPTMAXPATHS is large, the list of paths can get exceptionally long with high redundancy.
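The difference between removing the full bottleneck capacity (option 2) and only a fraction of it (option 3) can be illustrated with the following sketch. It is an assumption-laden toy and not CAMPARI's implementation: the net flux network `rf`, the function names, and the stopping logic (`max_paths`, `max_fraction`, loosely analogous to TPTMAXPATHS and TPTMAXFRACTION) are hypothetical, and the time limit is omitted.

```python
import heapq
import math


def shortest_path(rf, source, target):
    """Dijkstra on weights -log(rf_ij / sum_k rf_ik); returns a node list or None."""
    out = {}
    for (i, _), v in rf.items():
        out[i] = out.get(i, 0.0) + v
    adj = {}
    for (i, j), v in rf.items():
        adj.setdefault(i, []).append((j, -math.log(v / out[i])))
    best, prev, heap = {source: 0.0}, {}, [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > best[u]:
            continue  # stale heap entry
        for v, w in adj.get(u, []):
            if d + w < best.get(v, math.inf):
                best[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    if target not in best:
        return None
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]


def decompose(rf, source, target, fraction=1.0, max_paths=100, max_fraction=0.95):
    """Repeatedly find the shortest path and remove fraction * bottleneck capacity from it."""
    rf = dict(rf)  # work on a copy of the flux network
    total = sum(v for (i, _), v in rf.items() if i == source)
    found, removed = [], 0.0
    while len(found) < max_paths and removed < max_fraction * total:
        path = shortest_path(rf, source, target)
        if path is None:
            break
        edges = list(zip(path[:-1], path[1:]))
        cut = fraction * min(rf[e] for e in edges)
        for e in edges:
            rf[e] -= cut
            if rf[e] <= 1e-15:
                del rf[e]  # edge is exhausted and no longer available
        found.append((path, cut))
        removed += cut
    return found


# Hypothetical net flux network with two channels from source 0 to target 3
rf = {(0, 1): 0.03, (1, 3): 0.03, (0, 2): 0.01, (2, 3): 0.01}
full = decompose(rf, 0, 3, fraction=1.0)                  # option-2-like behavior
partial = decompose(rf, 0, 3, fraction=0.5, max_paths=6)  # option-3-like behavior
print(full)
print(partial)
```

With `fraction=1.0`, each channel is reported once and the decomposition is complete after two paths. With `fraction=0.5`, the same two paths are reported repeatedly with shrinking increments, and the integrated flux only approaches the total, which is the redundancy and incompleteness described above.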
If any type of structural clustering is performed (→ CCOLLECT), CAMPARI was compiled and linked with HSL support (→ installation instructions), the computation of both (+) committor probabilities and (-) committor probabilities has been requested, and TPTMODE is 2 or larger (meaning a flux decomposition into pathways is requested), this keyword sets a total time limit for the pathway decomposition. It works analogously to MAXTIME_ITERS. CAMPARI offers a separate keyword because the cost of the operations controlled by this keyword is not directly related to the cost of those controlled by MAXTIME_ITERS. Note that pathway decompositions do not currently benefit from CAMPARI's shared memory (OpenMP) parallelization in a meaningful way.
TPTBTNFRACTION
If any type of structural clustering is performed (→ CCOLLECT), CAMPARI was compiled and linked with HSL support (→ installation instructions), the computation of both (+) committor probabilities and (-) committor probabilities has been requested, and TPTMODE is set to a mode that supports this option (such as 3), then this keyword determines how the flux network is successively modified with the purpose of discovering additional paths. Specifically, it determines the fraction to be multiplied with the flux increment that has been identified for removal from the path in question, and the value needs to be between 0 and 1 (not including 0). The flux increment is either the so-called bottleneck capacity, meaning the capacity of the edge along the path that carries the lowest amount of flux, or the flux carried by the path (→ TPTMODE). These two quantities will be the same if the bottleneck edge is unique to the path; otherwise the carried flux will always be smaller. The default value for this keyword is 1.0.
TPTMAXFRACTION
If any type of structural clustering is performed (→ CCOLLECT), CAMPARI was compiled and linked with HSL support (→ installation instructions), the computation of both (+) committor probabilities and (-) committor probabilities has been requested, and TPTMODE is 2 or larger (but not 5), this keyword sets a threshold for the pathway decomposition to stop. The threshold refers to the net integrated flux, and thus the keyword can be between 0 and 1 (not including 1). Note that in modes where TPTBTNFRACTION is supported and it is less than 1, it is theoretically impossible for the fractional, integrated flux to reach 1.0 exactly. With this combination, the runtime has to be bounded by the other limits, namely those on the maximum time and either the maximum number of paths or the maximum number of iteration cycles. The default value for this keyword is 0.95.
TPTMAXPATHS
If any type of structural clustering is performed (→ CCOLLECT), CAMPARI was compiled and linked with HSL support (→ installation instructions), the computation of both (+) committor probabilities and (-) committor probabilities has been requested, and TPTMODE is 2 or larger, this keyword sets the number of paths to discover in one outer iteration. For all modes of TPTMODE except the hybrid one (6), this is simply the upper limit for the number of paths that can be found and which are reported in SHORTESTPATHS_xxx.dat. For the hybrid mode, it is the number of paths found with the Eppstein algorithm at each iteration before updating the flux network. Other settings also limit the runtime, namely the maximum time, the threshold on the integrated flux, and, for the hybrid mode, the maximum number of iteration cycles.
Note that, for the pure Eppstein algorithm (mode 5), it is not recommended to rely on this keyword to control the output size (use TPTMAXFRACTION or MAXTIME_TPT instead). This is because the algorithm has no clean way of being interrupted at a fixed number of paths since it does not discover paths in order (but rather needs to find the minimum in a hierarchical set, which has a fixed allocation size determined by this keyword). Imposing a limit on the computed paths leads to them not necessarily being the shortest ones. This effect is difficult to control in practice. CAMPARI tries to alleviate it by buffering the search space dynamically, but this is a heuristic that offers no rigorous guarantees. It is therefore always safer, in terms of rigor, to use a very large value. Note, however, that this can lead to memory issues. This caveat also affects the hybrid modes (e.g., TPTMODE is 6).
If any type of structural clustering is performed (→ CCOLLECT), CAMPARI was compiled and linked with HSL support (→ installation instructions), the computation of both (+) committor probabilities and (-) committor probabilities has been requested, and TPTMODE is a hybrid approach (like 6), this keyword sets the maximum number of iterations of applying Eppstein's algorithm before updating the flux network by removing the accumulated flux from the TPTMAXPATHS paths discovered at a given cycle. Setting this to 1 is redundant (same as option 5 for TPTMODE). Setting TPTMAXPATHS to 1 along with a large value for this keyword is also redundant (same as the Dijkstra-based approaches, but with TPTMAXKSP taking the role of TPTMAXPATHS).
NetCDF Data Mining:
Preamble (this is not a keyword)
This mode is enabled by a separate executable that, however, rests on the same library as CAMPARI itself. The compilation of this executable is analogous to CAMPARI executables and described in the installation instructions. This executable provides functionality to analyse input data provided in NetCDF binary format. The conversion of compliant ASCII files to the required NetCDF standard is also possible. It is only available if the code was compiled and linked with NetCDF-support (see installation instructions). The available functionality is restricted to structural clustering of the input data (permissible values of CDISTANCE are currently limited to 1, 2, 7, 8 and 9) and downstream analyses such as, e.g., cut-based free energy profile, progress index, principal component analysis and other network analyses (see for instance CREWEIGHT, SYNTRAJ_MD etc.). In case the code was also compiled with HSL support (→ installation instructions), spectral decomposition of transition matrices and the computation of committor probabilities are also supported. The presence of the keyword trace file in the input key file is explicitly not allowed and results in a halt of the execution if detected.
NCDM_NCFILE
With this keyword the user specifies the path and name of the input file in NetCDF format that encodes the data to be analyzed. For the required file standard (dimensions, variables, attributes etc.) we refer the user to the relevant input file delineation and to the external NetCDF library documentation. Keyword NCDM_WRTINPUT can be used to produce a summary of the information that was processed from the input NetCDF data source file in the guise of a NetCDF output file. This output file, useful for debugging, is in compliance with the internal standard, but since it reflects choices such as CCOLLECT, NCDM_FRAMESFILE and NCDM_CFILE, it should not be used as input for the present keyword. The reason is that it may be incomplete if the same key file is used again (see NCDM_WRTINPUT). As a final note, keyword NCDM_NCFILE is mutually exclusive with NCDM_ASFILE: the user provides either a binary NetCDF file or an ASCII file as input.
NCDM_ASFILE
This keyword is used to indicate the path and name of the input ASCII file that contains the data to be either converted to NetCDF format according to internal conventions, or analyzed with structural clustering and downstream routines, or both. This behavior is controlled by keywords NCDM_WRTINPUT (to explicitly write all the contents of the ASCII file to a NetCDF database compliant with the correct standard) and NCDM_ANONAS (to perform the analysis on the input ASCII data). Either one, or both, should be specified. For the ASCII file standard required by the program see elsewhere. This keyword is mutually exclusive with NCDM_NCFILE: the user provides either a binary NetCDF file or an ASCII file as input. When an ASCII file is provided as input, more is required from the user than when a NetCDF file is used, and keywords NCDM_NRFEATS, NCDM_NRFRMS and NCDM_CHECKAS become relevant in addition to, possibly, NCDM_PRDCRNG.
NCDM_WRTINPUT
By setting this keyword to 1, it is possible to ask the program to write a NetCDF file that is either the converted input ASCII file (and in this case choices from keywords such as CCOLLECT, NCDM_FRAMESFILE, NCDM_CFILE, etc., have no influence on the output, only on the possible downstream analyses, see NCDM_ANONAS) or the key information of the input NetCDF database that has been inferred by the software (for debugging). The output names are "" and "", respectively. It is important to mention here that only the converted ASCII file is safe to be used as an input NetCDF database, since the output named "" may not fully reflect the input NetCDF file to which it corresponds. This is because "" is sensitive to keywords such as CCOLLECT, NCDM_FRAMESFILE and NCDM_CFILE, unlike the output named "".
NCDM_PRDCRNG
Needed only when conversion and/or analysis of an input ASCII file that contains periodic variables is requested, this keyword lets the user specify the left and right boundaries of the periodic range of the degrees of freedom as two consecutive real values, respectively. The program may use this information in the analysis if CDISTANCE is compatible with periodic data (set to either 1 or 2), and it always adds the "periodic_range" attribute to the NetCDF variable that contains the degrees of freedom if conversion is requested (see also elsewhere).
For periodic variables CAMPARI uses the interval [-180°:180°] as reference. This can be tricky if periodic variables are defined on a different interval, e.g., [0°:360°]. In that case NCDM_PRDCRNG should still be set to -180 180 as the correct remapping happens automatically. In other words, even if the periodic variables take values in [0°:360°], setting NCDM_PRDCRNG to 0 360, or setting the periodic attribute of the NetCDF variable (see also elsewhere) to 0 360, would generate wrong results. These aspects are presented and discussed in Part B of Tutorial 14, especially in Step B.4.
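To make the reference-interval convention concrete, the following is a minimal Python sketch of what mapping values from [0°:360°] onto [-180°:180°] means. It is an illustration only and not CAMPARI's internal remapping routine: the function names are hypothetical, and the choice of assigning the boundary value to -180 rather than 180 is an assumption of this sketch.

```python
def wrap_to_reference(angle_deg, low=-180.0, high=180.0):
    """Map an angle in degrees onto the half-open interval [low, high).

    Hypothetical helper: the boundary convention (180 maps to -180)
    is an assumption of this sketch, not a documented CAMPARI rule.
    """
    period = high - low
    return (angle_deg - low) % period + low


def periodic_difference(a_deg, b_deg, period=360.0):
    """Signed shortest difference a - b for a variable with the given period."""
    half = period / 2.0
    return (a_deg - b_deg + half) % period - half


# Values recorded on [0, 360) and their images on the reference interval.
values_0_360 = [0.0, 90.0, 180.0, 270.0, 359.0]
remapped = [wrap_to_reference(v) for v in values_0_360]
print(remapped)

# 359 and 1 are only 2 degrees apart once periodicity is respected.
print(periodic_difference(359.0, 1.0))
```

The second function shows why the periodic range matters for distance evaluations: treating 359° and 1° as plain numbers would suggest a separation of 358°, whereas the periodic separation is 2°.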
This keyword informs the program of the name of the variable in the input NetCDF file that contains the values of the coordinates per frame to be analyzed. The default choice is "featuresvals". It is important to note that there are tools that allow renaming variables, dimensions and attributes of a NetCDF database (see NCDM_NCFILE, NetCDF documentation and the Internet), as the software does not offer the possibility to specify names for all the required specifications of the standard it expects. The keyword is also used to change the name of the same variable from the default ("featuresvals") in case conversion of an input ASCII file is requested.
NCDM_ANONAS
This simple logical (1 is true) allows the user to activate the analysis (structural clustering and possibly downstream routines) on the input ASCII file according to the specifications in the key file.
NCDM_NRFEATS
Number of features contained in the input ASCII file, required for conversion and/or analysis. This number has to strictly correspond to the number of columns of the input ASCII file (see also elsewhere). The only way to subsample the provided degrees of freedom is to use the dedicated analysis input file. The user can request specific (but slow) row-by-row checks of the integrity of the input ASCII file; these checks are disabled by default (see NCDM_CHECKAS).
NCDM_NRFRMS
Number of frames contained in the input ASCII file, required for conversion and/or analysis. This keyword corresponds to the number of rows in the input ASCII file that contain the degrees of freedom (number of frames). This means that the possible presence of feature weights in the ASCII file must not be accounted for here (see elsewhere for the ASCII standard that the data mining executable expects).
NCDM_CFILE
This keyword provides the path and name of an input file selecting a subset of the system coordinates for structural clustering and dependent analyses. The file is always a single-column list of integer indices specifying a subset of the system degrees of freedom encoded in the input files (NCDM_NCFILE or NCDM_ASFILE).
NCDM_FRAMESFILE
It is possible to analyze just a specific set of frames from the input file (NCDM_NCFILE or NCDM_ASFILE). The entries in NCDM_FRAMESFILE always refer to the frames, viz., possible feature weights are adjusted automatically to match the specified frames, regardless of the input file of origin (ASCII file or NetCDF file). With reference to the corresponding keyword FRAMESFILE, the frames file for the NetCDF analysis mode is always read "as is" (it is not sorted and duplicates are allowed). Therefore, frame weights are not allowed and the contents of TRAJBREAKSFILE and/or TRAJLINKSFILE have to be treated accordingly. This means that the numbers specified in TRAJLINKSFILE and/or TRAJBREAKSFILE are interpreted as referring to the lines of the frames file. This is because the frames file becomes the actual trajectory. For example, if the first two frames in the frames file are 1541 and 760, then specifying a 2 in the break file entails enforcing a break after the second snapshot of the frames file, i.e., of the actual trajectory (here snapshot 760 of the original trajectory) with respect to everything that follows in the actual trajectory (i.e., in the frames file after the second line). These changes in connectivity are reflected in network analyses (see for example TMAT_MD), and in the relevant output files (e.g., STRUCT_CLUSTERING.graphml and TMAT_xxxxxx_yyy.dat). The same behavior applies to all keywords that specify frame numbers, e.g., INISYNSNAP. For example, if frame 760 of the original trajectory is to serve as INISYNSNAP and it is found in the second line of the frames file, then a 2 has to be specified as INISYNSNAP. The other keywords affected are: ENDSYNSNAP, folded set, unfolded set and CPROGINDSTART. The format of the NCDM_FRAMESFILE file is a single column of integers and is detailed elsewhere.
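The renumbering described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping a user has to do when preparing TRAJBREAKSFILE, TRAJLINKSFILE or INISYNSNAP values; the function names and the example list are hypothetical, and the only frame numbers taken from the text are 1541 and 760.

```python
def frame_at_line(frames, line_number):
    """Original frame number stored at a given (1-based) line of the frames file."""
    return frames[line_number - 1]


def lines_holding_frame(frames, original_frame):
    """All (1-based) lines of the frames file that hold a given original frame.

    More than one line can match because the file is read "as is",
    that is, unsorted and with duplicates allowed.
    """
    return [i for i, f in enumerate(frames, start=1) if f == original_frame]


# Hypothetical frames file contents; the first two entries follow the
# example in the text, the remaining ones are made up for illustration.
frames = [1541, 760, 12, 760]

# A "2" in the break file refers to line 2, i.e. original snapshot 760.
print(frame_at_line(frames, 2))

# To use original frame 760 as INISYNSNAP, one of these line numbers is given.
print(lines_holding_frame(frames, 760))
```

Because duplicates are permitted, an original frame number does not always identify a unique line, which is one reason why the affected keywords are expressed in terms of lines of the frames file.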
The choices made with NCDM_FRAMESFILE are not reflected in the possible conversion of an ASCII file, but are reflected in the output database in case an input NetCDF file is provided and debug output is requested (see NCDM_WRTINPUT). As a final note, the use of NCDM_FRAMESFILE with CCOLLECT not equal to 1 is explicitly disabled.
NCDM_CHECKAS
This simple logical (disabled by default, 1 is true) allows the user to check the integrity (appropriateness) of the numbers contained in the columns of the input ASCII file. This considerably slows down the reading of the input ASCII file and should be enabled only for debugging purposes.
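As a complement to the built-in check, the values to supply for NCDM_NRFEATS and NCDM_NRFRMS can be determined before a run with a small external script. The Python sketch below is not part of CAMPARI and rests on several assumptions: columns are separated by whitespace, every non-blank line is one frame, and the file contains no feature-weight rows, header lines or comments (the actual ASCII standard is described elsewhere). The function name is hypothetical.

```python
import math


def inspect_ascii_table(lines):
    """Return (number of rows, number of columns) of a whitespace-delimited table.

    Raises ValueError if a row has a different number of columns than the
    first one or holds an entry that is not a finite number. Blank lines
    are skipped, which is an assumption of this sketch.
    """
    n_rows = 0
    n_cols = None
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue
        try:
            values = [float(field) for field in fields]
        except ValueError:
            raise ValueError(f"line {line_number}: non-numeric entry") from None
        if not all(math.isfinite(v) for v in values):
            raise ValueError(f"line {line_number}: non-finite entry")
        if n_cols is None:
            n_cols = len(values)
        elif len(values) != n_cols:
            raise ValueError(f"line {line_number}: expected {n_cols} columns")
        n_rows += 1
    return n_rows, (n_cols or 0)


# In-memory example; for a real file, pass the open file object instead,
# e.g. "with open(path) as handle: inspect_ascii_table(handle)".
sample = ["10.0 -35.5 170.2", "12.1 -40.0 168.9"]
n_frames, n_features = inspect_ascii_table(sample)
print(n_frames, n_features)
```

Under the stated assumptions, the first returned number would be used for NCDM_NRFRMS and the second for NCDM_NRFEATS.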