CAMPARI Development

Introduction to CAMPARI Source Code

A priority of the CAMPARI project is to provide code that is easy to access and can be tailored to better suit the research needs of each user, whether it be obtaining different output, using novel molecule types, or by introducing a new algorithm that improves sampling. However, experience and familiarity with a code base are indispensable to make development less and less cumbersome with time. Therefore, at this point, we simply hope that the "average" knowledgeable programmer familiar with Fortran will find CAMPARI adequately readable and customizable.

CAMPARI is generally written in Fortran 90/95 but uses certain features from the Fortran 03/08 standards (see elsewhere). From version 3 onward, CAMPARI relies heavily on Fortran OpenMP directives to produce threads-parallel code. Fortran is a language originally developed and optimized for scientific programming with many implicit functions that are useful for algorithms in mathematics and physics. Conversely, it lacks comprehensive support for many of the features that modern code would attempt to utilize from a software engineering point of view. Given that the target audience (researchers in the fields ranging from molecular biology to molecular physics) will inevitably span a wide range of programming experience and skills, it is not clear to us whether this is a caveat or a feature with respect to CAMPARI development. The choice of programming language is mostly historical for this particular project. Scientific software is rarely written by a team of dedicated software engineers over a well-defined stretch of time, but rather evolves and adapts continuously to the needs of the researchers using it. We considered translating CAMPARI to C++ early during development, but ultimately decided that the immediate upside would be too limited given the high time investment necessary for such an endeavor. We maintain that Fortran is a good language for an open source project where accessibility and readability by an inhomogeneous audience are priorities. Since version 3, readability has become worse due to the OpenMP directives. We recommend producing preprocessed source files (with OpenMP support removed) if developers or users are interested primarily in understanding what a specific part of the code does (and not how this function is parallelized). Your compiler should have a way of producing such files, usually from the C compiler. For gfortran, it is "gfortran -E".

Getting Started

Depending on the complexity of the planned development and your familiarity with a code, it may be useful to obtain a clean copy of CAMPARI rather than introducing multiple changes to the same copy. To give you a rough idea where to start looking for things, the following overview may be helpful:

Module files are called mod_*.f90 and hold variables and derived data type declarations. The only exception is "mod_interfaces.f90", which holds interfaces for basic operations (sort, etc.) to support different input data types, optional arguments, or arguments with the "ALLOCATABLE" attribute. Source files are called *.f90 and hold subroutines. Only those modules that are needed should be "included" by each subroutine ("use" commands just below the subroutine declaration). Some modules are large, and it may help to use "ONLY" statements with the "use" commands to avoid accidental overloading of variable names (this is done inconsistently throughout the code at the moment).
The source-files creating the actual binaries are called "chainsaw.f90" and "datasaw.f90". The latter is for a reduced functionality executable that provides a flexible handle on data inputs for analysis tasks (see documentation elsewhere). From these outermost "drivers", you are theoretically able to recover the entire execution tree by tracing calls to subroutines and searching the source-files for these subroutines. Many subroutines are nonuniversal, i.e., are called only in specific executions of the program.
Theoretically, you can place new variables in any module file, and new subroutines in any source-file. However, we attempt to loosely group them into intuitive blocks. Often, several subroutines are found in the same source-file. This serves to achieve a compromise between file size and file number (which is, admittedly, tipping over to the size side). You can copy-paste existing subroutines with similar scope and delete the actual "meat" to avoid trivial errors. The choice of having few source files is largely due to personal preference, and nothing stops you from creating new source files or using ones that have very little in it, e.g., "getkey.f90" during development (reduces incremental compile time). We supply a specific file for this purpose that is compiled and linked by default, called "hacking.f90".
When creating new files, make sure to append the lists in the three files "Makefile.in" (the template for the Makefile used by the configure script), "Makefile.manual" (the explicit Makefile for manually building the code), and "make_dependencies.sh", which are all found in the source-subdirectory. Otherwise your additions will not get compiled, and the compilation and/or linking step will fail (these lists are in variables called "MODULES" and "PRESOURCES" in the latter and in "PREPREMODS" and "PREPRESRCS" in the other two).
CAMPARI uses a hybrid MPI/OpenMP approach. Currently, the OpenMP (inner) layer is used to decompose tasks for a single copy of the simulation system whereas the MPI (outer) layer handles only communication between multiple copies of the simulation system. This means that interaction with the MPI layer is rare while interaction with the OpenMP layer is almost inevitable. OpenMP works through compiler-interpreted directives while MPI works through linked library calls. Essentially, the threads "library" is completely embedded in the compiler.
The driver routines for MPI-based simulation schemes are called "MPI_REMaster", "MPI_ASMaster", and "MPI_AVGCollect" in "fmsmcmpi.f90".
Generally speaking, the team of OpenMP threads is generated only once and maintained from there on throughout the entire calculation. This happens in "master.f90" (for the normal executables) and in "datasaw.f90" (for the NetCDF data mining executables) after all the initial setup stages have completed. Whenever the master routine is invoked multiple times in a single run, this also means that OpenMP is (re)initialized repeatedly, such as in a small molecule screen. This program-wide approach with respect to threads parallelization implies that the scope of variables for individual threads is private by default (since they encounter stack-allocated variables). Shared variables are the ones with global scope found in the modules. The general variable to identify a thread is called "tpi" or sometimes "tpi2", and all subroutines entered by multiple threads at once should have this argument (its presence helps clarify that the code needs to be thread-aware). At the time of writing, there is no satisfactory solution for intermittently destroying and recreating the team of threads into its prior state. This would sometimes be required to allow better interactions with linked libraries that have their own threads parallelization, such as FFTW.
Many subroutines are maintained in a way that serial and threads-enabled code will execute correctly (preprocessor directives "#ifdef ENABLE_THREADS"). This is beneficial for keeping the algorithms robust and results reproducible but can make for very scattered reading. It thus may be helpful to use the preprocessor to produce processed source files for a clearer reading of either version. Other subroutines have been explicitly duplicated, most notably "force3_threads" in "force_wrap.f90". In these cases, any additional modifications must be made twice to maintain proper functionality. Many OpenMP-associated routines are found in "thread_utils.f90".
From experience, it is very useful to think in parallel from the beginning for new algorithms to be added to time-consuming parts of CAMPARI's execution path (energy, forces, neighbor searching, coordinate operations, ...). An example would be the calculation of instantaneous pressures based on the virial. Expensive features that are exclusively serial will become a bottleneck quickly and thus have low viability in CAMPARI.
The driver routines for the small molecule execution mode are all in "docking.f90" along with most of the required auxiliary routines. In this mode, the same base calculation is repeated as many times as there are small molecules in the input file.
The top-level file containing the decision tree for picking and executing a specific Monte Carlo move is in "mcmove.f90".
The top-level subroutines containing the Newtonian leap-frog integrators in Cartesian and torsional/rigid-body space are in files "cartmd.f90" and "intmd.f90", respectively. The Langevin integrator in Cartesian space is in "cartld.f90".
The majority of energy and force evaluations are based on the organizational unit of the residue. This means that interactions are computed as residue-residue interactions. For Monte Carlo (no forces needed), the necessary wrappers are in "energy_wrap.f90", and the actual routines in "inner_loops_en.f90" and "energy.f90". For forces, it is slightly less complicated on the wrapper-side ("force_wrap.f90"), but a lot more complicated on the execution side ("inner_loops.f90", "inner_loops_imp.f90", and "force.f90"). The innermost routines are generally expected to be executed by a single OpenMP thread at a given time. It is important that they do exploit vectorization, however.
The building process for the molecules representable in CAMPARI is controlled through "makepept.f90", and uses the straightforward subroutines in "proteus.f90" and "sidechain.f90". Many of the core algorithms related to coordinate operations/representations are found in "topology.f90". None of the initial setup routines are OpenMP-aware and basically all are called identically by different MPI ranks.
Nearly all CAMPARI-internal analysis is handled through a top-level wrapper in "mcstat.f90" (see the second FAQ below for more on this).
After having written and compiled a new feature, we strongly recommend to use debugging options provided by the compiler you are using. Examples of necessary flags can be found in "Makefile.manual" for several common compilers. The ability of compilers to pick out and correctly report on runtime (or even compile time) errors is far from universal, which is why using more than one compiler can be beneficial. A related issue that can be unmasked this way is the (silent) reliance on language standard extensions offered by a specific compiler suite.
For a simplistic development environment, it will be enough to have access to a good text editor, and to use (on UNIX) simple command line tools such as "grep" efficiently (for example to find out where specific global variable defined in one of the modules is used). In this regard, it will also be beneficial to keep the formatting consistent with the rest of the code.
Lastly, it is important to keep in mind that in any complex software package, changes to one section of the code may have unexpected consequences on others. If you intend to see a new feature you developed incorporated permanently, it will be necessary to make sure most of these consequences are avoided upfront.

Unfortunately, having comprehensive in-code documentation is very difficult to achieve, since it is not typically possible to anticipate the questions other developers may have (for instance, it is relatively easy to provide full documentation on the inputs and outputs for each subroutine, but in most cases that will not help anyone understand the algorithm inside). CAMPARI is documented internally rather inhomogeneously and usually requires a conceptual understanding of what a specific piece of code is trying to do. It is always recommended to first consult the actual user documentation on a feature of interest before inspecting the code itself. In the same spirit, higher level information such as the FAQs below may (at least in certain cases) be an efficient means of "documenting" code.

Code Navigation Tools

Currently, the only code development tool we provide is a little script written in Python to help track global variables. There is a README file that explains its usage (link). It can be found in the doc-subdirectory as a file called "typeglobals_html.py".
When run successfully, this will create a set of html-files summarizing which subroutines access and modify which global variables. Naturally, it assumes CAMPARI-style syntax, and may not always work 100% accurately if formatting violations occur. The html-files produced can be very valuable for avoiding conflicts/pitfalls during development early on.

FAQ

This is a short list of (two) questions that will occur frequently. Both deal with extending pre-existing functionality in CAMPARI by different but at some level analogous additional cases.

Q: What do I need to do and know if I want to add (permanent and complete) support for a new small molecule, or a new polypeptide or polynucleotide residue type which has identical backbone architecture?
A: This is one of the more tedious exercises in CAMPARI. Note that CAMPARI does allow simulations and trajectory analyses of systems containing entities (residues) not natively supported by CAMPARI (→ FMCSC_PDB_TEMPLATE and sequence input) and even has an automatic mode for doing this (→ FMCSC_MOL2SCREEN) for a library of unsupported single-residue molecules. In addition, there is an automatic parameterization tool for the CHARMM force field that can much more easily provide most of the required patch files than when done manually (→ Ruby parameterization script).
If permanent modification is deemed superior or preferred for any reason, then there are a number of steps involved as follows. Importantly, there is no need to worry about either OpenMP or MPI parallelization here because all steps are during the initial execution stages where (currently) all MPI ranks independently perform the same tasks and no OpenMP threads have been created yet.

Append the biotypes section of all force field parameter files (see dedicated section of the documentation) with the necessary information for the new residue. Note that there is freedom to make each atom its own biotype although CAMPARI aims at summarizing chemically equivalent atoms under a single biotype (for example, in this convention, propane has only four biotypes). While you are at it, assign LJ and bonded types (which are the only ones absolutely needed). Leave zeros for these types for force fields you do not intend to support: this will allow CAMPARI to correctly infer this lack of support. You should generally use shell-level scripting for performing this task to minimize the risk of random errors (typos).
Next, edit "mod_aminos.f90". Increase MAXAMINO accordingly and append the arrays amino (three-letter code to be used), aminopolty (whether bona fide polypeptide ('P'), polynucleotide ('N') residue or whether anything else ('X')), and rescontlen (a reasonable diameter for small molecules or the contour length for polymer residues). The variable seqtyp defined in "mod_sequen.f90" will during runtime hold - for each residue in the system - the index into these arrays. This is occasionally used to identify a residue by an integer number.
Following up on this, another very simple step for a new new polymer residue is to tell the wrapper-routine makepept() in "makepept.f90" that this residue with its new, unique index (seqtyp) is a valid chain-residue. makepept() compares in two more or less identical loops the values of a variable called sickle and of seqtyp(i), respectively, to hard-coded ranges permissible to support the residue in specific positions (N-terminal, C-terminal, or non-terminal). These conditionals (ranges) will have to be appended accordingly.
Now comes one of the difficult steps: actually building the geometry of the molecule / residue. Depending on the intended application, find force field parameters or (usually preferred) high-resolution crystallographic information for the intrinsic geometry of the residue (except rotatable bonds). You are explicitly providing Z-matrix information here, so it may not be as easy as anticipated (and require a fair amount of trial and error). If a file with example Cartesian coordinates is available, it may be easier to explicitly generate a Z-matrix first (e.g., using CAMPARI by reading in the residue as an unsupported one) and to then transfer values. In general, the Z-matrix format references back a new atom i1 to three previously built ones: i2, i3, i4. i1 is directly bound to i2 which is directly bound to i3 which is directly bound to i4. i1, i2, and i3 form a well-defined bond angle. Beyond that there are three alternative formats: first, i1, i2, and i4 may form a second bond angle which in addition then employs a chirality indicator to distinguish two otherwise indistinguishable placements. This is fully supported, but numerically problematic (to be avoided). Second, i1, i2, i3, and i4 may form a proper dihedral angle. This is the recommended case for the first atom connected to the second atom of a chemical bond. Third, i1, i2, i3, and i4 may form an improper dihedral angle (i2 is covalently bound to all other atoms). This is the recommended setup for further atoms connected to the terminal atom of a chemical bond. The chirality indicator is zero for both the second and third cases. The Z-matrix is populated by calls to z_add(bio_type,r2,r3,r4,i2,i3,i4,chirality) in "topology.f90":

z_add(bio_type,r2,r3,r4,i2,i3,i4,chirality) bio_type = Defines current atom (i1) i2 = The previously built atom to which i1 is connected i3 = A previously built atom to which i2 is connected i4 = A previously built atom to which i3 or i2 is connected r2 = The bond length between i2 and i1 r3 = The bond angle between i1-i2-i3 r4 = The dihedral angle (or bond angle) between i1-i2-i3-i4 (or between i1-i2-i4) chirality = If 0, r4 = dihedral. If +/-1, angle between i1-i2-i4
As stated, if chirality is zero, r4 is interpreted as the proper or improper dihedral angle i1, i2, i3, and i4. If it is either +/-1, r4 is interpreted as the bond angle between i1, i2, and i4. For the first three atoms in any new molecule, not all references are available and zeros are passed to indicate this.
For a small molecule, add the new molecule in "proteus.f90" in the first major block that stands under the conditional control

if ((rsmol(imol,2)-rsmol(imol,1)).eq.0) then
which simply implies that the molecule has only a single residue. For a canonical polypeptide residue add a block in subroutine sidechain() in "sidechain.f90" and for a canonical polynucleotide residue add a block in nuc_sidechain() in "sidechain.f90". You will note that residues are either identified by their name (in a variable often called resname; this is for better readability only), their residue type index (the aforementioned seqtyp), or a coarser classification into polymer types (the aforementioned "aminopolty" that takes the residue typ as index). Note that other useful arrays are not yet populated ("seqflag" or "seqpolty") when these routines are called. For a polymer residue, conditionals will have to appended throughout in "proteus.f90" to ensure that the new residue is available in all positions within the polymer. For example, non-terminal polypeptide residues are screened for by the following conditional:

else if (((seqtyp(i).ge.1).AND.(seqtyp(i).le.25)).OR.& & ((seqtyp(i).ge.31).AND.(seqtyp(i).le.35)).OR.& & (seqtyp(i).eq.51).OR.& & ((seqtyp(i).ge.108).AND.(seqtyp(i).le.119))) then
Such a conditional allows only residues with seqtyp-values in the indicated ranges to be built in the middle of a polypeptide chain. A new polypeptide residue will have values outside of these ranges, thereby requiring the extension.
Also for polymer residues, you may note the local data arrays in both "proteus.f90" and "sidechain.f90" (such as "ntyp") which hold biotype indices. A correct entry for a new polymer residue is crucial since otherwise the wrong biotype and hence the wrong parameters may be used. Aside from calling z_add as many times as necessary while passing along the right geometry information (utilize arrays bo, an, and dh for this), several arrays or indicators must be set manually:
1. Increment the arrays at(rs)%bb and at(rs)%sc and the corresponding counters at(rs)%nbb and at(rs)%nsc. Here, "rs" is the residue number (usually "i" in "proteus.f90" and "sidechain.f90"). Note that the first atom in a residue or small molecule should always be a backbone one and that branched mobile groups have to be in sc-arrays. The second condition mentioned is naturally irrelevant for small molecules but crucial for polymeric residues (never append at(rs)%bb within sidechain.f90). For adding a canonical polypeptide or polynucleotide residue, you should therefore not have to touch at(rs)%nbb and at(rs)%bb.
2. Populate the pointer arrays for degrees of freedom. This is very important for giving the residue access to dedicated move sets in Monte Carlo, but also has relevance in internal coordinate space dynamics (see FMCSC_TMD_UNKMODE). There are pointer structures that can be incremented (nchi+chiline and nnucs+nucline) and those that cannot (fline, yline, wline). In the construction of sidechains and small molecules, it is strongly advised to only use the nchi and chiline construct although exceptions exist (see for example ω-sampling using wline in the small molecule amides NMF and NMA). As you can see from inspecting the code, the initial values for those degrees of freedom are already held in their respective pointer arrays (for example "chi"). They are initialized elsewhere (in subroutine makepept() in "makepept.f90") and can be adjusted there if special, initial values are required.
3. Populate the array eqatm if necessary. This array holds information about rigidity within the molecule, i.e., whether there are non-rotatable dihedral angles or rigid ring systems. It generally works best by assigning that prior atom which is the nearest to the mobile parts to all downstream atoms which are connected rigidly to that atom. For example, in the histidine sidechain, atoms CE1 and HE1 always have atom ND1 set as their equivalent atom in equatm. This implies that (for certain interaction models) they are going to be treated just like the nitrogen atom in determining excluded interactions. In this case, for example, the CG-to-HE1 interaction would be excluded on account of the eqatm setting, even though those atoms are separated by three bonds and hence would interact if flexibly connected.
4. Populate the array atmres which is simply a residue number per atom array.
5. For polymers, certain prominent atoms along the main chain have their own pointer arrays (for example ni, cai, ...). These will not usually be required for building small molecules or new canonical residues of a supported polymer type.
6. For all of the above, it is important to be mindful of the ua_model flag (user-controllable through keyword FMCSC_UAMODEL) that chooses between different united-atom representations (non-polar hydrogen atoms are not represented explicitly) and all-atom representation (see below).
This concludes the necessary groundwork to allow CAMPARI to build the new unit. Further modifications are needed to extend the functionality (some of it essential) that the new residue or molecule is supposed to cover.
A further essential step is to set the reference atom and sequence flag setting for the residue / molecule. This is trivial and done in subroutine absinth_residue() in "assignprm.f90". The reference atom ("refat") is used to identify roughly the center of the unit for either just topology- or grid- and topology-based cutoffs. The sequence flag ("seqflag") is used to group residues into classes, e.g., regular polypeptide residue vs. proline-like residue, etc.
With the geometry of the new residue and its reference atom cast in stone now, another simple step is to append the subroutine setup_resrad() in "structure.f90", which populates arrays for two variables: the first is the effective residue radius (defined as the maximum possible distance measured from any atom to the residue's reference atom); the second is the maximally expected number of atoms in that residue. In most cases, this will simply be the actual number of atoms. Note, however, that through conditions for termini in polymeric residues and through future extensions pertaining to titratable sites, the number is in some cases variable. The value specified here has to be the rigorous upper limit. Outside of wasted memory and minute efficiency drawbacks, it will not be harmful to specify something too large here.
The setup of possible degrees of freedom for MC and TMD calculations is generally automatic. If you accidentally set one of the pointers (fline, etc.) to an ineligible Z-matrix line, subroutine find_rotlsts(...) should identify this problem and exit with an error. This subroutine flags all dihedral angles that seem to correspond to freely rotatable ones by allocating a rotation list (list of atoms moving as a result of perturbation to the dihedral angle assuming the default building direction). This can be probed quickly by checking whether izrot(i)%alsz is larger than zero ("i" being the atom index in question). The only other atoms for which this condition is true are the very first atoms in each molecule (they hold a trivial rotation list for rigib-body rotation). All setup for TMD runs (dealing with effective building direction, recursive computation of torsional gradients, ,etc.) should work without additional code modifications.
If permanent support is needed for the ABSINTH implicit solvation model (rather than using patches such as solvation group patches), a number of additional steps are involved. It may take some time to get familiarized with the data structures employed, so the learning curve may be a bit steeper than for other modifications:
1. The new sidechain and/or small molecule needs to be analyzed: is there a corresponding, chemically well-defined model compound for which transfer free energies have been determined experimentally? If no experimental data are available, are there in silico estimates available or possible to attain? It is important to note that the ABSINTH paradigm rests somewhat on the assumption that the polymer is modular and that the linkages between the model compounds are not dominant solvation sites themselves. As an example, it would be misleading to break up an ether at the oxygen atom and to use the aliphatic substituents as the two model compounds.
2. With numbers and a satisfactory decomposition in mind, the next step is to add a global variable in "mod_fos.f90" and the facility to read and report this value in "readprm.f90". These are trivial extensions of the subroutines read_fos() and fos_summary(). It is also recommended to add the values in all of the parameter files distributed with the software right away since the code does not warn about missing free energy of solvation (FOS) parameters. Remember that for polymers, the specific entries only address the variable parts, i.e., the sidechains. For example, if FOS_ALA in the parameter file is set to 1.9, then this is meant to be interpreted as the FOS of the alanine sidechain model compound, i.e., methane (although the code can handle this flexibly: see below).
3. Next, open "sav.f90" and append subroutine setup_freesolv(). This is trivial: note that the array freesolv has room for only three values per residue (corresponding to the (maximum of three) numerical values on a single input line in the parameter file). This array simplifies coding later, but is not used beyond setup tasks. The next major step is adding a section in the helper routines to subroutine solvation_groups(). This populates the solvation group data structure (see "mod_polypep.f90"). For small molecules, an entry is needed in subroutine helper_svgrps_single() and for polymer residues one in helper_svgrps_scpep() or helper_svgrps_scnuc(). Note that we assume the polymer residues fit into the pre-existing backbone paradigms flawlessly. As a counterexample, a polynucleotide residue with a modified sugar would require further modifications. Because of the support for DNA/RNA, subroutines helper_svgrps_middle(), helper_svgrps_Nter(), and helper_svgrps_Cter() all contain explicit conditionals checking for residue names, which you will have to extend. To get a feel for the data structures representing the solvation model, check a simple entry in helper_svgrps_single(). Afterwards, implement the decomposition accordingly. You basically have to define groups of atoms that form solvation groups on the unit you added (in the simplest case, this can be just a single group). Note that sometimes an experimentally available number is further decomposed if the unit is intrinsically anisotropic. The tyrosine sidechain - for example - has a benzene-like part and a polar part, which actually form separate solvation groups. This may create subtle parameter interdependencies (in this case, the resultant FOS value for a tyrosine sidechain in a specific conformation of a polypeptide would depend weakly on the value used for benzene in the parameter file; this is because the two actual solvation groups derived from the entries for tyrosine and benzene will be different). Conversely, for fully solvated states, the sum of all values should always yield the desired reference FOS.
4. While it may seem as if this should be enough, there is one more major step involved, which does nothing but set the parameters in the array atsavmaxfr. These values are the fractions of solvent-accessible volume remaining after the volumes of the atoms belonging to the same model compound have been subtracted. In theory, for fixed size and solvation parameters and fixed initial geometry, it is possible to compute these only once and to simply hard-code the values or read them from a file. The subroutine setup_savol() and its daughter routines in "sav_auxil.f90" do this dynamically at the beginning of each simulation. The process is three-fold: i) for each model compound, build any necessary dummy atoms (usually aliphatic hydrogens on aliphatic C-connectors); ii) assemble the list of atoms belonging to the model compound (this can be tricky if atoms are shared such as along the polypeptide backbone by consecutive NMA units); iii) run a pairwise loop over the complete list and determine volume reductions to be captured by atsavmaxfr. For small molecules, this is trivial as seen in subroutines gen_complst_bb() and gen_dummyH_bb() in "sav_auxil.f90". For novel peptide or polynucleotide residues, this can be relatively easy if analogy coding is used (especially with respect to the generation of the few necessary dummy hydrogens): see subroutine gen_all_sc(). For a completely new polymer or backbone modifications, this would be much more cumbersome. It is definitely worthwhile to check the values obtained for atsavmaxfr, which can be written out by turning on SAV-analysis (written to SAV_BY_ATOM.dat).
To support polar interactions, the only things needed are to assign charge types in the parameter file and to append any appropriate conditionals in subroutine polar_groups() in "polar.f90" (often none, as seqflag is used for parsing, the primary exception being in offering hard-coded guidance to assign net charge groups). In some cases, certain parameterization paradigms and/or molecular architectures require specific exclusion rules or adjustment of parameters. As outlined above, exclusions are generally handled via the eqatm construct. This can be seen for polar interactions in the latter half of subroutine polar_groups() and for other non-bonded interactions in subroutine setup_srinter() in "unbond.f90". In general, modifications should not be necessary if eqatm is populated correctly. Developers may note that there is an automatic treatment available for unsupported residues (uses subroutine ia_rotlsts(...) in "topology.f90"), which can theoretically be used also for built-in residues (if the eqatm treatment appears infeasible). There are further exceptions: the GROMOS force field uses its own set of rather obscure excludes (see FMCSC_INTERMODEL). For compatibility, CAMPARI has two subroutines (GROMOS_excludes_polar() in "polar.f90" and GROMOS_excludes() in "unbond.f90"), which may need to be appended. CHARMM on the other hand uses pre-polarized charges on the two hydrogens in primary amides (cis vs. trans) which have the same biotype in CAMPARI. A special rule to support this paradigm is found in polar_groups() (see FMCSC_AMIDEPOL) and may need to be appended as well.
If you have gotten this far, this is a good time to append the documentation, specifically that on sequence input. If there are any special considerations for samplers or analysis routines, you should insert them into the documentation. If not done already, it would also be very useful to append all nonobsolete parameter files distributed with the code. As alluded to above, the biotype sections must be consistent across parameter files even if there is no actual parameter support.
From here on, the major part of the work consists of making sure that everything works as intended. CAMPARI utilizes plenty of conditionals checking for residue identity through the array seqpolty, through the arrays seqflag and seqtyp, through the presence of positive entries pointer arrays (such as fline, wline, ...), or (rarely) through the residue name via the array aminos. If the new unit/residue is supposed to work just like another existing case for some of the code's functionality, then it will often be necessary to append the corresponding conditionals. This is best illustrated using an example: suppose you added a non-canonical polynucleotide residue with identical backbone architecture, for example 5-methylcytosine. Then, conditionals which test for seqpolty being 'N' or for seqflag being 22/24 will be unaffected as will be those testing for the pointer arrays nucsline. Conversely, the remaining checks require code modifications (at the very least: appending conditionals). CAMPARI has been evolved to become more and more developer-friendly in this regard, i.e., the goal is that all the required changes in the code actually correspond to conscious decisions regarding the new residue, and that all things that should work automatically do in fact work automatically.
One of the more important auxiliary things to fully support a new residue or small molecule is to enable PDB structures to be written and read-in correctly. The former is simple. If special names are required (compared to those in the biotype section of the parameter file), the output strings for atom names may need to be adjusted. This is done in routines name_heavies() and name_hydrogens() in "prtpdb.f90". Reading from PDB requires more work. The first thing to do will be to consider whether the new unit has any common ambiguities with respect to its PDB atom names, which can be corrected upfront in subroutine pdbcorrect() in "readpdb.f90". The only thing to watch out for here is the structure of the conditionals. Beyond that, there are three modes for reading in a PDB file (see FMCSC_PDB_READMODE). Options 2 and 3 (the defaults for trajectory analysis and simulation tasks, respectively) both use xyz-coordinates in the PDB directly. In the absence of exceptions, extending this functionality to support the new residue type is simple. For a small molecule, the subroutine trf_sms() needs to be appended; for new polynucleotide or polypeptide residues, it is just trf_scs() to add support for reading in the novel sidechain. These routines are generally trivial to append. In trf_nts(), trf_cts(), and trf_bbs(), you will at most have to append conditionals, which is particularly relevant for polynucleotide residues. Possible exceptions in dealing with PDB input are the shifting of atoms to a neighboring residue because of convention (as for FMCSC_PDB_NUCMODE, this should be avoided at all cost), problems with duplicate atom names (CAMPARI will crash if the names are not unique), and the use of wild card residue names (as for distinguishing deoxyribo- from ribonucleotides or as for supporting the residue name "HIS" in PDB files). If the new residue warrants any of these exceptions, you will also need to modify subroutine "FMCSC_readpdb2" → search for the string "HIS" in this subroutine to see an example.
The first option for FMCSC_PDB_READMODE is considered largely obsolete. We recommend that you simply add a call to "fexit" with an error message in a subroutine like "readpdb1_chi" in "readpdb.f90" and skip to the next point. For completeness, this PDB parsing mode extracts CAMPARI's internal list of MC degrees of freedom but keeps CAMPARI's built-in geometries as follows:
1. Through some pre-parsing, the file is read and residues are identified (which is why residue numbering has to be proper). Each residue block is scanned for specific atom names assigned in readpdb1_collect() along with an indicator variable ("ok") that a certain name was found.
2. For each molecule, the code attempts to extract the rigid-body coordinates which are simply fixated by the positions of the first three atoms. The code balks if this fails, which happens in subroutine readpdb1_rbs().
3. Next, the sidechain angles are extracted in subroutine readpdb1_chis(). Here, it is important to know which reference atoms each dihedral degree of freedom uses in order to ensure that the correct value is assigned. The four relevant positions (if found) are then passed to subroutine dihed() which calculates the dihedral angle which is subsequently assigned to the appropriate pointer array (such as chi).
4. After this, backbone angles are extracted through two subroutines (readpdb1_omegas(), which is hardly relevant, and readpdb1_bbs()). Lastly, some of the presence indicators are shifted so that knowledge of some positions of atoms in prior residues is carried through to the next one. None of this should be relevant if one is adding just a new small molecule or a canonical residue.
Given this rough overview, the necessary modifications in readpdb1_collect(), readpdb1_rbs(), and readpdb1_chis() should again be trivial to undertake. More than anything else, the pdb-writing and -reading facilities are just cumbersome.
Lastly, if one can think of any circumstances under which the new unit could cause problems that are not already covered by the existing sanity check machinery, it will prove useful to implement additional high-level sanity checks (in "sanity_checks.f90"), which disable the respective features upon encountering the new residue or molecule. Such holes are best identified by trying to run a few calculations with the new group which attempt to utilize all of CAMPARI's functionality (all analysis flags small, all running output enabled, different samplers, ...).
As an additional comment: as mentioned above in the context of building, CAMPARI alternatively creates residues and molecules internally in AA- or UA-representation (see FMCSC_UAMODEL). This can lead to plenty of shifts in atomic arrays or even torsional degrees of freedom (for example for ethanol or p-cresol), which may require adding conditionals reflecting those shifts. These occur in a wide variety of places. In this context it is useful to know that a de facto UA model can always be set up through the parameter file even though the molecules are built in AA representation. This works fine (and hence UA modifications are not required even if the force field is UA) but is inefficient - hence the explicit (and cumbersome) UA support.

Q: What do I need to do and know if I want to include a custom analysis routine?
A: This is generally simple unless the custom analysis routine itself is complicated and/or computationally expensive. As always, a decent knowledge of the CAMPARI data structures will be helpful. For quick hacks, there is usually no need to worry about parallelization (whether shared- or distributed memory), but for making a contribution permanent there is of course.
Disclaimer: Since version 5, the intended method of supporting analyses that are not built-in is through the Python interface. This is explained in detail in Tutorial 18. Importantly, the main Python module can also be extended permanently: for example, the function align_by_set() in campana.py replicates the built-in CAMPARI functions accessible through FMCSC_ALIGNCALC. This offers a second route toward permanent extensions of CAMPARI's analysis facilities. The advantages of this route are that only a single file needs to be modified, that no compilation is required, and that you can take advantage of other Python modules and built-in language features. The downsides are that you cannot take advantage of the OpenMP parallelization, that the code can be harder to debug, and that the end user needs to invest more effort to use this function (setting a single keyword vs enabling the Python module and extending the run-specific copy of campana.py by calling the new function and printing its results).

For modifications within CAMPARI's Fortran code, the following steps are required:

First, decide which new global variables are needed and add them (preferably) in an existing module that deals with similar quantities (e.g., "mod_polyavg.f90" or "mod_torsn.f90"). If necessary, initialize the non-allocatable variables in "initial.f90". Make sure memory of the requisite size is allocated for allocatable variables. This should but - especially for temporary code modifications - must by no means happen in "allocate.f90". It is recommended to maintain the existing structure in "allocate.f90", i.e., there are many allocation blocks controlled by an individual subroutine that contains creators (including initialization) and destructors. Conditional controls are generally applied (where necessary) to prevent unneeded memory from being allocated. As an example, have a look at subroutine allocate_polyavg().
For permanent analysis routines, the next step would be to provide the user interface for controlling the analysis, i.e., to add keywords and their read-in statements in "parsekey.f90". This usually works without much thought by appending any of the lists in a suitable helper subroutine. Copy a block which reads in a keyword assigning a variable of identical dimension and data type and then simply edit the details (make sure to match the length of the substring of the keyword correctly and do not delete the assignment of "have_legit_key" to true ← otherwise your keyword will be incorrectly reported as unrecognized). Note, however, that the main subroutine parsekey() is called several times with different flags for the hierarchy level being passed. These are needed if and because keywords depend on each other, i.e., require a conditional assignment (for example, the code needs to know first whether the simulation system is a cubic box or a sphere before it can decide whether to read in three or just one number for the size specification). For temporary modifications aiming to facilitate specific analyses, these modifications are usually more time-consuming compared to modifying the code directly and hard-coding the settings.
Once the scope is clear, add information in "summary.f90" that reports to the user at runtime whether your new analysis function is enabled or not along with any, if necessary, additional parameters set by the user. The block reporting analysis settings starts at:

write(logi,*) '**OUTPUT AND ANALYSIS PARAMETERS:'
Try to find a suitable place near similar analyses. A second thing to do, once the scope appears clear, is to write (at least) the keyword documentation for the analysis. While this might seem paradoxical, it can help to write the documentation first to define a feature more clearly. Frequently, you might discover that a particular idea, for example for how to request additional input from the user or for how to design the actual algorithm, works only at first glance. It can be more efficient to "debug" this at the documentation level, which forces you to explicitly define these things, than at the code level (and there is usually little or no time lost since you will eventually have to write the documentation regardless).
Almost all analysis routines are called from subroutine mcstat() in "mcstat.f90". In general, each analysis functionality has at least two subroutines, one for collecting data during the simulation or trajectory analysis run, and one for producing the final result. Because analysis may be the dominant CPU time cost of a given run, the OpenMP shared-memory parallelization has to address it. This currently happens in one of two ways: either the analysis code is serial and dealt with by an individual thread (there are two large OpenMP SECTIONS directives for this), or the analysis code is itself OpenMP-parallelized and the function is called by all threads (see for example the call to "do_em_density_map(tpi)" in "mcstat.f90" with the actual function found in "emicroscopy.f90"). The former scenario is of course much simpler to handle (no writing of parallel code required) while the latter is recommended only for analyses that scale poorly with system size. For a block added into one of the SECTIONS directives, simply copy an existing block, e.g., the one calling "do_rama". You will see that the call to an analysis routine only happens every so often (the mod(istep,???calc) construct accomplishes that). Thus, if the functionality is to be added permanently, ???calc should be a global frequency variable you created. A second subroutine invocation, viz., to the one producing the final output, should be added if necessary. For this, find the block under the control statement "if ((istep.eq.nsim).AND.(nsim.gt.nequil)) then" which simply means that the last step of a simulation (or analysis run) is reached (in which not all steps were discounted as equilibration). There are two such blocks (see for example the calls to "printrama"), one for MPI code with rank-separate analyses, and one for non-MPI code (see below for the case of MPI code with lumped analyses). There are usually no OpenMP concerns here, since the block is currently in an OpenMP MASTER directive. One prominent function outside of the MASTER directive is the call to "do_clustering(tpi)", which is the gateway toward the large module dealing with structural clustering, the progress index, graph-based analysis (MSMs), etc. Additional analyses that operate on these data structures (such as a novel clustering algorithm) should only rely on the data collected through FMCSC_CCOLLECT and have no independent calculation interval.
Write the analysis subroutine. To keep things simple, use an existing source file. First you can develop within "hacking.f90", which exists precisely for this purpose. In the end, you can then migrate your new subroutine to a source file with similar scope such as "structure.f90", "polymer.f90", or "torsion.f90". Study a few of the existing analyses to get familiar with some of the data structures, which allow you to access information about the system (for example, x,y,z for Cartesian coordinates found in "mod_atoms.f90"; ztor, bang, blen for Z-matrix coordinates found in "mod_zmatrix.f90"; pointer arrays to specific sites and/or degrees of freedom found in "mod_polypep.f90" and "mod_fyoc.f90", and so on ...). Never forget to declare "implicit none" in a subroutine and to include the needed modules. Make sure the interfaces you used in "mcstat.f90" match those you create now. If you prefer to be neat, write explicit interfaces for all subroutines with "INTENT" statements for passed arguments; this helps the compiler detect incorrect calls (currently, CAMPARI uses interfaces only where required, and they are collected in a single module: "mod_interfaces.f90"). For a specific but simple example, study do_rbc_pc() and prt_rbc_pc() in "structure.f90": these subroutines compute distance distribution functions between molecular centers of mass and print the results to a file called RBC_PC.dat (for MPI-unaware or MPI averaging runs) or N_XYZ_RBC_PC.dat (MPI rank-separated analyses/runs). The rest is up to you!
If your analysis is meant to work with MPI averaging calculations, there is an additional step. The functions MPI_AVGCollect() and MPI_AVGSend() in "fmsmcmpi.f90" handle the transfer of information from the harvesting ranks (which call MPI_AVGSend()) to the master rank (which is the only one to call MPI_AVGCollect()). If you are unfamiliar with MPI programming, you may want to read up on it a little bit. What is needed is very simple, though: while executing the custom analysis routine, each rank collected information in one or more global variables: these must be collected and combined across all ranks. For this, we pass specific messages of specific size in specific order from the slaves to the master. The aforementioned subroutines are straightforward to extend, but the following must be kept in mind: i) all conditional controls in both subroutines must be identical, ii) all message codes (integers) in both subroutines must be identical, iii) whenever passing a message, the size and data type of the message must match the passed variable exactly, iv) if the analysis requires a separate normalization / print-out call, this subroutine must be called from within MPI_AVGCollect() but (obviously) after combining the data from all the ranks. This is why there is no third block in "mcstat.f90" to cover this case (as mentioned above): it is fully contained in MPI_AVGCollect in "fmsmcmpi.f90" but with the additional collect operations spliced in between.
If you coded your analysis routine with explicit OpenMP support, it may be useful to add a test for it. Have a look at subroutine "threads_test" in "thread_utils.f90" for reference (this routine is executed only when keyword FMCSC_THREADS_TEST is set to 1). The idea is simply to run through a sequence of unit-like tests using serial execution of the same (or functionally equivalent) code on the same but dynamically generated data sets for comparison. Adding a test can be very useful in excluding (or finding) specific parallel programming bugs as sources of error later on.