Tutorial 18: Writing Your Own Analyses Through the Python Interface

Preface:

This tutorial complements the general overview of trajectory analysis presented in Tutorial 10. As shown in Tutorial 17, it goes without saying that the data to be analyzed can come from other software as well. Here, the focus is on customizing user interactions with the trajectory data through your own code. This is possible of course by modifying and recompiling CAMPARI but this requires some guidance (see elsewhere), some familiarity with Fortran, and, ideally, some familiarity with thread-based and MPI parallel programming models. It is also possible, in many scenarios, to achieve the desired result by pursuing tedious workarounds, for example rewriting partial trajectories, as explained in Tutorial 10. To lower this barrier, CAMPARIv4.1 introduced a hook that exposes most of the relevant molecular data (dynamic and static) to the user by exporting it into a Python module.

Step-by-Step:

Step 1 - Understanding the logic of the Python module
To make accessing CAMPARI data structures in another language worthwhile, two things should be true. First, the data made visible should not just be the coordinates. Doing so would just lead to a reader module of binary files with little immediate usability. Instead, the data structures exposed to the user should be maximally cognizant of the systems that are represented. This is exactly where CAMPARI comes into play since that is what it has been designed to do. This means, for example, knowing which positions in the coordinates correspond to those of the backbone Cα atoms of polypeptides, which atoms are bound covalently to one another, what sequence of residues constitutes the various molecules in the system and how are they delineated in the coordinates, etc. Second, the language is a scripting language (compilation-free) that offers significant advantages in convenience, accessibility, library support, and/or community penetration.
With these two requirements arguably fulfilled, we can now address the actual logic. Generally speaking, CAMPARI will run as usual (whether in trajectory analysis mode or not). At the beginning of the run, all static elements (like residue bounds in atom numbers, Z-matrix structure, atomic masses, residue types, etc.) are collected, and Python objects are created from them. These Python objects will remain persistent in memory, and they are, in part, of variable size across different runs depending on which features are enabled (for example, partial charges are only available if the polar potential is turned on). In addition, NumPy arrays are created that will hold copies of dynamic content (like atom positions, molecular centers of mass, etc.). Every FMCSC_PYCALC steps, the dynamic NumPy arrays, which might also include box information, are updated by CAMPARI, and sent, along with the static objects, through a module called ForPy (developed by Elias Rabel, GitHub page) to a function in CAMPARI's own Python module (campana.py). In addition, the Python module contains a persistent object that is meant to hold user-defined variables required by the analysis, which is also sent to the function "campana_step". Within "campana_step", the user can write code that does the analysis that is possible and desired with the data for an individual snapshot, and also stores the data required for performing calculations across multiple snapshots. At the end of the run, "campana_final" is called with identical arguments although the dynamic content should not generally be useful then (the static content might still be required). In this function, the user can write code to finalize (e.g., normalize) or further process the data stored in the object "user_counters". Note that initialization tasks must be put into "campana_step" with a conditional check on the number of calls.
Importantly, any such run must produce output to be meaningful. For the analysis function, there is no feed-in mechanism from Python to Fortran meaning that the user-written code is strictly responsible for producing output. This output can be instantaneously written in "campana_step" either to dedicated files or to the standard output stream or only provided at the end in "campana_final" or both. CAMPARI is relatively restricted when it comes to instantaneous output that is not in the form of trajectories, so it is a common application for the Python module to circumvent this restriction. The script-like nature of Python and its broad usage allow the straightforward inclusion of additional libraries by adding "import" declarations to campana.py. The data exposed to Python are generally in NumPy arrays or present as simple variables. The static content is arranged into dictionaries to provide some embedded annotation.
A particular point to understand is what exactly the coordinates are that get sent to Python. CAMPARI supports several keyword-based modifiers that control what coordinates are deposited to trajectories in what form. As a general rule, these will not be available in Python. This means that keywords FMCSC_TRAJIDXFILE, FMCSC_XYZ_SOLVENT, FMCSC_XYZ_PRUNE, and FMCSC_XYZ_REFMOL have no or only limited impact on the coordinates sent to "campana_step" and "campana_final" in "dyn_atom" (see below). Generally, this is necessary to avoid mismatches with static arrays, with box settings, etc. In order to have access to modified or pruned coordinates, it is always recommended to first rewrite the trajectory using CAMPARI to the desired standard and to then use the Python module on the rewritten data. Importantly, the reading-related capability of keyword FMCSC_XYZ_FORCEBOX is respected (so molecules broken on input are sent as restored molecules when this keyword is either 1 or 3, see below). Similarly, if alignment is requested through FMCSC_ALIGNCALC, the Python module will receive the aligned coordinates, which generally invalidates box-dependent operations. These subtleties are very important to keep in mind to avoid confusion. CAMPARI does a lot of things behind the scenes, which is why a 2-step approach is often more favorable compared to working on the raw trajectory as is done in this tutorial. For example, in what follows below, information on water molecules is always read, processed, and sent, but never used.
The remainder of this tutorial will deal with three different use cases that do not rely on additional computing libraries (only "matplotlib" is added for plotting directly from Python). The system we study is the same as that in Tutorial 17 with the basic key-file just provided directly here. If you care to understand how this key-file comes about, please refer to steps 1-3 in Tutorial 17.
Step 2 - Creating a basic key-file (trajectory analysis)
In the folder of tutorial 17, you can find an example trajectory in xtc-format and an associated pdb file. The corresponding sequence file is in the folder for tutorial 18. Copy these files into an empty working directory. The pdb file needs a simple change as follows:

sed -i 's| SOL | T3P |g' tutorial17.pdb

To have CAMPARI process this trajectory, we need to have a number of basic keywords set, including a choice of parameters. Note that the choice of parameters and the mapping of biotypes to the various parameter types might be relevant for certain analyses, and in these cases it will be best to match the original parameters.

PARAMETERS <full path to folder>/campari/params/amber99.prm
FMCSC_SEQFILE tutorial18.seq
FMCSC_DYNAMICS 2 # Change the default
FMCSC_PDBANALYZE 1
FMCSC_PDB_FORMAT 3 # 1 for pdb, 4 for dcd, 5 for NetCDF, 6 for PostgreSQL
FMCSC_XTCFILE tutorial17.xtc
FMCSC_XTCPREC 1000000.0 # actually overridden by XDR library on input but here to preserve precision later
FMCSC_PDB_TEMPLATE tutorial17.pdb
FMCSC_PDB_R_CONV 4 # AMBER, actually needed here
FMCSC_NRSTEPS 20 # extremely short trajectory useful only for illustration

For the parameter file, we go with the one that matches most closely the original settings. It must be kept in mind that for energetic analyses in a trajectory analysis in particular it is a limitation that not only the force field might not be exactly the same but also that the detailed algorithms by which different software packages calculate forces and energies are not generally matchable (primarily in the dealing with truncation of interactions and with added approximations). The setting for FMCSC_PDB_R_CONV corrects some naming convention issues for the pdb-file but, generally speaking, the vast majority of these is taken care of automatically. From the PDB-file, in particular the nucleic acid part, the file is closest to AMBER convention.
To express the container, we need to calculate what a box vector of 85Å at an angle of 60° means. The third angle, so the one between the first two box vectors, is the only one that is not orthogonal:

FMCSC_SHAPE 4 # triclinic
FMCSC_BOUNDARY 1 # full, 3D periodic boundary conditions
FMCSC_ORIGIN 0.0 0.0 0.0 # undefined in the available files
FMCSC_BOXVECTOR1 85.0 0.0 0.0 # first along x-axis
FMCSC_BOXVECTOR2 42.5 73.61216 0.0 # second in the xy-plane at angle of 60 deg. so that third points in positive z for a right-handed system
FMCSC_BOXVECTOR3 0.0 0.0 85.0 # third along z-axis (orthogonal to both first and second)
FMCSC_ENSEMBLE 3 # assume NPT conditions (only trajectory analysis at the moment)
FMCSC_PDB_AUXINFO 3 # maximal auxiliary information for PDBs
FMCSC_XYZ_FORCEBOX 1 # 1 means yes on input, no on output

The final section deals (for now) energy terms and specifies the relevant analysis-relevant settings:

FMCSC_SC_IPP 0.0 # the only term enabled by default
FMCSC_SEQREPORT 1 # to get some information about composition
FMCSC_DISABLE_ANALYSIS 1 # disable analyses for now, make sure this is always below FMCSC_NRSTEPS
FMCSC_PYCALC 1 # call Python at every step
FMCSC_PYDIR <full path to current working directory>

Put all these keywords into a file called "t18.key". Now run this once, and it should finish with an error.

campari_threads -k t18.key

CAMPARI complains that it cannot find the Python module. To solve this issue, copy the Python file from the tools directory into the current working directory as well, then retry. Now everything should finish fine. Further below, if this issue reoccurs, it generally implies that the module file that CAMPARI tries to load is syntactically corrupt. To see what exactly is the problem, it will be beneficial to launch a Python interactive session (python3) and import the module manually (import campana), possibly after setting your PYTHONPATH variable correctly. As you can tell from the output (stdout), the default versions of the functions "campana_step" and "campana_final" do nothing but increment a counter and print a confirmation. If these are not printed, the Python hook is not executed fully, which happens, generally speaking, if the module throws a runtime error. In this scenario, print statements can help to find the exact point of the runtime problem. Note that for syntax Python3 is assumed throughout. We are now ready to interact with the data in Python.
Step 3 - Monitor box volume
Take a moment to read the comments in campana.py: they explain roughly the data structures that are available. Whenever you edit this file, make sure to edit the working copy in the local folder, otherwise the safe, original version might be lost. Importantly, you must not remove any functions or definitions and you must not change the existing parts with three exceptions: first, you can and should rewrite the content of the functions "campana_step" and "campana_final" at will (but you must not change the argument list); second, you can add and should add variables to the "counters" class either in the constructor or in "campana_step" and "campana_final"; third, you can define new functions and import additional modules to be called and used by "campana_step" and "campana_final".
Before proceeding, there is a very important caveat to be aware of. Native Python code interpreted by the reference implementation (CPython) is, by default, very slow. In fact, it can be so slow that it makes the resultant code practically unusable when deployed on large data sets. Keep this in mind throughout: while it might seem attractive to save development time, there are still many scenarios where it will be much more cost-efficient to modify CAMPARI directly in Fortran or call C from Fortran. There are many reasons for this such as lack of compiler optimization, need for many additional checks due to weak typing, inefficient memory layouts, etc. but as of 2022 it is also true that other languages of a similar paradigm (like Javascript) are superior in this regard. The most common Python workarounds to the performance issue focus on packaging precompiled library code to circumvent CPython such as NumPy, which is used here, or to move away from the reference implementation. These can both have their own downsides. For example, NumPy must never be used in an element-by-element logic for it to retain its speed advantage over plain Python, which puts heavy restrictions on how to write down your algorithms (as will become clear below).
For our first task, we want to simply calculate the average box volume along with its minimum and maximum values. The box in this simulation is triclinic but, fortunately, the volume as calculated by CAMPARI is already passed to "campana_step" as part of the "box" argument, which is an object of class "box_holder". Thus, the modifications consist primarily in defining a holding variable and incrementing it. First, extend the "counters" class by a holder array (the italic code is the code already there):
self.ncalls = 0 # insert own code from here self.boxvols = np.array([0.0,-1000.0,1.0e20])

For all edits, keep in mind that any syntactic error (like with indentation) will prevent the loading of the module (automatic disabling) or lead to errors in the CAMPARI run, some of which might be obscure (the worst case scenario being an undiagnosed stop of execution of "campana_step" or "campana_final"). When developing the Python code, it is therefore recommended to initially add/preserve logging statements that confirm that the functions have been traversed in full. Modify "campana_step" as follows:
user_counters.ncalls = user_counters.ncalls + 1 user_counters.boxvols[0] += box.volume user_counters.boxvols[1] = max(box.volume,user_counters.boxvols[1]) user_counters.boxvols[2] = min(box.volume,user_counters.boxvols[2]) print("campana.py: This was invocation #",user_counters.ncalls, " of campana_step.")

Now, we just need to normalize and report in "campana_final":
print("The average is ",user_counters.boxvols[0]/user_counters.ncalls,"A^3 and the min/max are ",user_counters.boxvols[2]," and ",user_counters.boxvols[1]," A^3, respectively.") print("campana.py: This is the end of campana_final.")

Double-check the Python syntax, then run again:

campari_threads -k t18.key

From the printed log, you can find out that the average volume is ca. 571nm3 and that the maximum, unlike the minimum, is very close to this value. As an exercise, check why this is the case and also check whether the box angles and aspect ratios have been preserved, i.e. whether the box has been scaled isotropically in the original simulation. These are examples of questions for which CAMPARI has no built-in, targeted analysis tools. Of course, you could still use a combination of FMCSC_PDB_AUXINFO, FMCSC_XYZOUT (as 1), and FMCSC_XYZPDB (as 2), to get the desired data (as pdb-file headers) but that requires a two-step calculation.
Step 4 - Create a custom contact map
Residue-level contact maps are supported by CAMPARI through keyword FMCSC_CONTACTCALC. However, these contact maps offer only two types of contact definitions. Using a strict threshold function on the residues' centers of mass or on the minimum atom-atom distance. Calculating other contact maps is possible in post-processing through the instantaneous distance output facility, but the required file sizes to write to disk might be prohibitively large. Here, we create a side chain-only, heavy atom-only contact map that uses a smooth step-like mapping function instead of a true step function.
In principle a contact map requires nothing but a function to calculate distances. Naively, one might assume that this is a matter of only a few lines of code (and sometimes it is), but, as usual, calculating a subset of distances correctly is more tedious than one might think. First, we need to understand what are side chains and what are heavy atoms. A binary trajectory file like the xtc-file here is nothing but a dump of floating-point numbers in binary format that is not interpretable without further information. The annotation is passed to "campana_step" and "campana_final" as several Python objects. In essence, these contain static information about the system at the resolution of atoms ("atom"), residues ("res"), or molecules ("mol"). They are present in two data types: 32-bit integers, "ints", and 64-bit floating point numbers, "dbls".
For the contact map, "static_atom_ints", "static_atom_dbls", and "static_res_ints" are going to be used. The properties per atom in "static_atom_ints" are:
  • BIOTYPE: CAMPARI biotype as explained elsewhere
  • LJ_TYPE: CAMPARI atom (Lennard-Jones) type as explained elsewhere
  • RESIDUE_NR: The running residue number in the sequence defined by sequence input
  • MOLECULE_NR: The running molecule number in the sequence defined by sequence input (note that crosslinked molecules count as separate molecules)
  • BACKBONE_CNT: The running index per residue of backbone atoms, which is 0 otherwise (in unsupported residues and small molecules, this assignment can be uninformative)
  • ZMATRIX_I1, ZMATRIX_I2, ZMATRIX_I3: The indices of atoms that define its Z-matrix row (determined automatically and possibly misleading in unsupported residues); 0 indicates that atoms are part of molecular reference frame
  • ZMATRIX_CHIRAL: If -1 or 1, this distinguishes chirality in ambiguous Z-matrix entries (those relying on 2 bond angles), 0 otherwise
  • NR_BOUND_ATOMS: The number of covalently bound atoms (determined automatically and possibly misleading in unsupported residues)
  • BOUND_ATOM1, BOUND_ATOM2, ...: The indices of covalently bound atoms (determined automatically and possibly misleading in unsupported residues); 0 once exhausted for a given atom
Generally speaking, it is very important to keep in mind that any reference to an index in these data structures is to CAMPARI's own indexing, which starts at 1 as in Fortran. In contrast, the arrays themselves start their index at 0 as in Python/NumPy. Thus, to find, for example, the biotype of the first Z matrix reference atom of atom 42, you would have to consult:

static_atom_ints.theprops[static_atom_ints.theprops[41,static_atom_ints.col_dic["ZMATRIX_I1"]]-1,static_atom_ints.col_dic["BIOTYPE"]]

The properties per atom in "static_atom_dbls" are:
The properties per residue in "static_res_ints" are:
  • TYPE: CAMPARI residue type that is translatable with the "restype_to_name" array of "user_counters"
  • NR_ATOMS: Number of atoms
  • NR_SC_ATOMS: Number of sidechain atoms (in unsupported residues and small molecules, this assignment can be uninformative)
  • FIRST_ATOM: The index of the first atom in the residue
  • REF_ATOM: The index of the atom defined as the reference (mostly for residue-based distance calculations)
  • CROSSLINK: The index of the residue the current one is chemically cross-linked to (0 if none, for disulfide bridges)
  • BB_CA, BB_N, BB_O, BB_C, BB_HN: The indices (starting at 1!) of the respective polypeptide backbone atoms (0 if the residue does not feature them, can be misleading for unsupported residues and small molecules)
  • BB_O3*,BB_P,BB_O5*,BB_C5*,BB_C4*,NUCBASE_N1/9: The indices (starting at 1!) of the respective polynucleotide backbone and sidechain atoms (0 if the residue does not feature them, can be misleading for unsupported residues and small molecules)
  • MOLECULE_NR: The running molecule number in the sequence defined by sequence input (note that crosslinked molecules count as separate molecules)
It is a good opportunity to place a reminder that the indices like "FIRST_ATOM" or "BB_CA" are all to CAMPARI's own indexing starting at 1 and thus require shifting down by 1 to find the corresponding elements in the atom-based Python arrays.
To start with the custom contact map code, insert this code block, which performs initialization, into "campana_step" directly after the call counter increment. Make sure that all indentation is correct and that you do not accidentally duplicate or delete the call counter increment or other pre-existing lines. Whether you leave the code from task #3 or remove it is irrelevant:
user_counters.ncalls = user_counters.ncalls + 1
if (user_counters.ncalls == 1): col_bb = static_atom_ints.col_dic['BACKBONE_CNT'] col_rnr = static_atom_ints.col_dic['RESIDUE_NR'] col_mass = static_atom_dbls.col_dic['MASS'] schats_bool = ((static_atom_ints.theprops[:,col_bb] <= 0) * (static_atom_dbls.theprops[:,col_mass] >= 5.0)) user_counters.custom_ccm_rsnrs = np.unique(static_atom_ints.theprops[schats_bool,col_rnr]) user_counters.custom_ccm_sz = len(user_counters.custom_ccm_rsnrs) user_counters.custom_ccm = np.zeros((user_counters.custom_ccm_sz,user_counters.custom_ccm_sz)) user_counters.custom_ccm_atoms = {} for rsn in np.nditer(user_counters.custom_ccm_rsnrs): user_counters.custom_ccm_atoms[int(rsn)] = [] for atn in range(0,len(static_atom_ints.theprops[:,col_rnr])): rsn = static_atom_ints.theprops[atn,col_rnr] if ((rsn in user_counters.custom_ccm_atoms) and (schats_bool[atn])): user_counters.custom_ccm_atoms[rsn].append(atn-1)

What this block of code performs is mostly NumPy-supported vector operations (NumPy is loaded as "np"). Everything is within the conditional of checking that the call counter is at 1. The first 3 lines use the dictionaries to find the column indices of the required fields in the two objects "static_atom_ints" and "static_atom_dbls". Then, the joint occurrence of the two desired conditions is written to a Boolean vector "schats_bool", where we take advantage of the fact that the property "BACKBONE_CNT" is 0 whenever an atom is considered to be a sidechain atom. All elements in "schats_bool" that are "True" correspond to the atoms we want, but they are not yet partitioned by residue. To do so, we first find the unique set of residue numbers associated with these atoms (stored as "custom_ccm_rsnrs" in "user_counters") and how many such residues there are ("custom_ccm_sz"). We then initialize a contact map object as a 2D NumPy array ("custom_ccm"). This is potentially risky for a system like this since any error retaining all water molecules in "custom_ccm_rsnrs" could lead to a memory violation when trying to do this allocation. We also initialize an empty Python dictionary as "custom_ccm_atoms" to store the map of relevant atoms to residues more handily. In the two subsequent loops, this dictionary is first initialized to an empty list for each relevant residue, and then the lists are appended with the atoms where we again take advantage of "schats_bool" and rely on a single iteration across atoms. Note that the atom indices are shifted by -1 because we will later index "dyn_atom" with them, which starts indexing at 0 not 1 (see above).
To actually calculate distances, we take advantage of the predefined function "vector_calc_dis2", which is able to compute correct minimum-image distances for all simulation containers and boundary conditions supported by CAMPARI. It is of course possible to write any number of utility functions in Python and make them semi-permanent members of a user-modified version of "campana". If so desired, it is good practice to consider the CAMPARI source code for the possible complexities, which might not always be obvious to users who do not write or modify simulation code normally. As mentioned above, due to how NumPy works, it will also generally be recommended and often necessary to write these routines in a vector logic where many input arguments are evaluated at once using NumPy's array-based routines.
For simplicity, we consider here the contact map between the centroids of the side chains. Since side chains are of different sizes, we want to make the properties of our step-like function somewhat responsive to the properties of any contact pair in question. Fortunately, CAMPARI passes some useful information in this regard.
The properties per residue in "static_res_dbls" are:
  • MAX_DIS_FROM_REFAT: Effective CAMPARI residue "radius" in Å that is a measure of the maximal distance of any point in the residue from the assigned reference atom (see static_res_ints above)
  • MASS: Total mass (in a.m.u.)
  • TOTAL_CHARGE: The sum of all partial charges of atoms in the residue (only available if the Coulomb potential is turned on)
Next, insert the definition of the step-like function directly in "campana.py" (paste it before the definition of "campana_step"):
def vector_steplike(start,end,val): if (np.sum(end <= start) > 0): raise Exception("The function vector_steplike requires that the 2nd argument is larger than the 1st for all pairs.") rval = np.zeros(len(val)) l1 = (val < end)*(val > start) if (np.sum(l1) > 0): scf = (end[l1]-start[l1])/np.pi rval[l1] = (np.cos((val[l1]-start[l1])/scf)+1.0)/2.0 l2 = (val <= start) if (np.sum(l2) > 0): rval[l2] = 1.0 return rval

As the name implies, this is a vector-based function that takes an array of distance values ("val") and jointly evaluates an interpolation between 1.0 and 0.0 with the help of a cosine (mapping the interval between "start" and "end" to that from 0 to π). It will produce undiagnosed errors when called with arguments of wrong type or mismatched in size. However, almost no errors are catchable by default through ForPy, so debugging is a bit more tedious than would be ideal.
Now we can complete "campana_step" with the remaining code to accumulate the contact map:
if ((rsn in user_counters.custom_ccm_atoms) and (schats_bool[atn])): user_counters.custom_ccm_atoms[rsn].append(atn-1)
col_radius = static_res_dbls.col_dic['MAX_DIS_FROM_REFAT'] tmp_cogx = np.zeros(user_counters.custom_ccm_sz) tmp_cogy = np.zeros(user_counters.custom_ccm_sz) tmp_cogz = np.zeros(user_counters.custom_ccm_sz) tmp_crad = static_res_dbls.theprops[user_counters.custom_ccm_rsnrs-1,col_radius] rs1 = 0 while (rs1 < user_counters.custom_ccm_sz): rsn1 = user_counters.custom_ccm_rsnrs[rs1] ilst1 = user_counters.custom_ccm_atoms[rsn1] nlst1 = 1.0*len(ilst1) tmp_cogx[rs1] = np.sum(dyn_atom[ilst1,0],axis=0)/nlst1 tmp_cogy[rs1] = np.sum(dyn_atom[ilst1,1],axis=0)/nlst1 tmp_cogz[rs1] = np.sum(dyn_atom[ilst1,2],axis=0)/nlst1 rs1 += 1 rs1 = 0 while (rs1 < (user_counters.custom_ccm_sz-1)): ds = vector_calc_dis2(box.shape,box.type,box.bmatrix,box.ibmatrix,box.minimage2,tmp_cogx[rs1],tmp_cogy[rs1],tmp_cogz[rs1],tmp_cogx[(rs1+1):],tmp_cogy[(rs1+1):],tmp_cogz[(rs1+1):],False) user_counters.custom_ccm[rs1,(rs1+1):] += vector_steplike(tmp_crad[(rs1+1):],tmp_crad[rs1]+tmp_crad[(rs1+1):]+1.0,np.sqrt(ds)) rs1 += 1

This code has to go into the main body of "campana_step" and must be evaluated at every step. The first lines initialize holding arrays for the centroids of residues and their effective radii. The first while loop populates the centroid arrays by averaging the corresponding positions of the atoms identified earlier and stored in the "custom_ccm_atoms" dictionary. The second while loop then calls the vector-based distance function "vector_calc_dis2", passing all the required box parameters. It expects x, y, z scalars for the reference and vectors for the comparison set (shortened to avoid redundancy). The last argument is an override to forcefully ignore image distances. In CAMPARI logic, it would be correct to set this override any time two positions are from the same molecule (this issue is inconsequential here).

Double-check the Python syntax, then run again:

campari_threads -k t18.key

If everything runs through, you should still see the step counter being printed. You should also note that the run is much slower, which, despite the use of NumPy, is still the problem with Python performance coupled to the fact that triclinic boxes are nontrivial when it comes to finding minimum images.
In the last step, we normalize, save, and visualize the contact map. First, add:
import matplotlib.pyplot as pl

You can do this, for example, at top-level. This will import the general plotting capabilities of "matplotlib" into a names module "pl". While it is not generally useful to do the final plotting directly from CAMPARI, we can at least illustrate the usage here (there might be applications in continuously updated plots for example). Now, the actual code doing the remaining work is the following, and it has to go into the function "campana_final":
user_counters.custom_ccm /= user_counters.ncalls lastrs = int(np.max(user_counters.custom_ccm_rsnrs)) full_ccm = np.empty([lastrs,lastrs]) full_ccm[:,:] = -1.0 rs1 = 0 for rsn in np.nditer(user_counters.custom_ccm_rsnrs): full_ccm[int(rsn)-1,user_counters.custom_ccm_rsnrs-1] = user_counters.custom_ccm[rs1,:] rs1 += 1 with open('sc_cmap.npy', 'wb') as f: np.save(f,full_ccm) allrs = range(0,lastrs+1) pl.pcolormesh(allrs,allrs,full_ccm,shading="flat") pl.axvline(x = 176.5, color = 'r') pl.axhline(y = 176.5, color = 'r') pl.axvline(x = 155.5, color = 'r') pl.axhline(y = 155.5, color = 'r') pl.axvline(x = 85.5, color = 'r') pl.axhline(y = 85.5, color = 'r') pl.show() print("campana.py: This is the end of campana_final.")

This first normalizes the map, then creates a separate 2D array that solves the visualization problem that our list of residues in "custom_ccm_rsnrs" contains gaps. If we were to plot "custom_ccm" directly, the axes would be nonlinear (in sequence number) and would have to be explained or annotated explicitly. Conversely, "full_ccm" contains all polymer residues and implies a linear axis in sequence numbering. We also take advantage of creating an offset (-1.0 rather than 0.0) for those residues that are not eligible. Then, the matrix is saved as a binary NumPy file so that it can be accessed and reloaded later on. Finally, the plot is created, here relying on "pcolormesh" and adding a few delineation lines (in red) to be able to visually distinguish the two protein chains (these are actually a single chain normally but linked by a flexible linker of 50+ residues, see PDB entry 5N8M) and the two RNA strands in the map.
Double-check the Python syntax, and add the following keyword:

FMCSC_CONTACTCALC 1 # every step

to t18.key, then run again:

campari_threads -k t18.key

The image that pops up makes a few points. The antiparallel nature of the DNA strands is very obvious as is the similarity in the folds of the two protein domains and the lack of dominant quaternary interactions between them. Within each domain, the dominant tertiary contacts, which are likely the hydrophobic core, involve residues near the termini more so than residues in the middle of the chain. As soon as you close the image, you can cross-check against the "normal" CAMPARI contact map whether the picture is reasonable. This has not only been computed and printed out to CONTACTMAP.dat" as you would expect but it also sent directly to the Python module as part of the "analyses" argument. This is a container that is dynamically stocked with arrays covering whatever optional functionalities have been enabled through keywords. A full list is found directly in the definition and description of the "analysis_holder" class in campana.py. The particular field in question is named "contact_map" and setting FMCSC_CONTACTCALC has caused it to become defined and populated.
Further exercises:
  • Extend the plotting to show the custom map along with the default map side-by-side. Access the default map through analyses.holder["contact_map"]. Adjust keywords FMCSC_CONTACTCOM and FMCSC_CONTACTMIN to increase comparability.
  • Change the custom contact map from residue centroids to residue centers of mass.
  • Add a contact map based on minimum atom-atom distances to the other half-matrix.
  • Make the computation more efficient by introducing residue-level screening as is done in CAMPARI. This involves checking the distance of the reference atoms ("REF_ATOM" in "static_res_ints"), while buffering them with the effective radii ("MAX_DIS_FROM_REFAT" in "static_res_dbls"), against the upper limit for considering a contact.

Step 5 - Measure two order parameters for double-stranded RNA
The system contains a piece of double-stranded RNA (dsRNA). CAMPARI has relatively few built-in analysis tools applicable to nucleic acids compared to those specific to polypeptides. While it is possible to read off pucker distributions or print out the data required to study backbone dihedral angle correlations, the larger structural aspects unique to nucleic acids are not covered. Here, we use the Python module to estimate the distributions of helical rotation and rise (per bp). First, add the following block to "campana_step":
user_counters.ncalls = user_counters.ncalls + 1 if (user_counters.ncalls == 1): n19_col = static_res_ints.col_dic['NUCBASE_N1/9'] mol_col = static_res_ints.col_dic['MOLECULE_NR'] bool_n19_1 = ((static_res_ints.theprops[:,n19_col] > 0) * (static_res_ints.theprops[:,mol_col] == 3)) user_counters.p1lst = static_res_ints.theprops[bool_n19_1,n19_col] - 1 bool_n19_2 = ((static_res_ints.theprops[:,n19_col] > 0) * (static_res_ints.theprops[:,mol_col] == 4)) user_counters.p2lst = static_res_ints.theprops[bool_n19_2,n19_col] - 1 user_counters.twist = np.zeros(20) user_counters.rise = np.zeros(20) user_counters.twistbins = np.arange(0.0,61.0,3.0) user_counters.risebins = np.arange(1.0,5.1,0.2) user_counters.totaltwist = 0.0 user_counters.totalrise = 0.0

This initialization block takes advantage of the pointer arrays in "static_res_ints" to isolate the atom indices of the atoms in the nucleic acid bases that connect to the sugar (either N1 (pyrimidines) or N9 (purines) in pdb terminology). The shift down by 1 is because we want to access "dyn_atom" which starts indexing at 0, not 1 (same as in Step 4). Fortunately, the system contains no modified bases, so we do not have to account for exceptions. The indices for the two chains (molecules 3 and 4 in the system) are stored in "user_counters.p1lst" and "user_counters.p2lst", respectively. Finally, we allocate some arrays to accumulate histograms with fixed bin edges ("twistbins" and "risebins") and store them in "user_counters" as well. Setting the bin edges to meaningful values requires having run the analysis at least once while printing out the data range (not shown here).
There is no unique, perfect answer to determine parameters like the rise per base pair for a flexible object in a simulation. It is inevitable that choices have to be made unless the community has defined a standard that is unique (while not necessarily optimal). Here we assume no such standard exists. If you visualize the pdb file, you might note that the particular dsRNA is quite different from B-DNA, the most common form of dsDNA. The bases are angled relative to the plane orthogonal to the central axis and the central axis itself does not pass through the base pairs (instead they are wound around), to name two obvious differences. In our particular case, on both chains, there are also two uracil-carrying residues on the 3'-end that do not undergo base pairing. Since they are two chains, we need to worry about images (again). Here, we choose a different solution from Step 4 that takes advantage of the fact that keyword FMCSC_XYZ_FORCEBOX is set to ensure molecules are whole. Thus, it is enough to evaluate shifts at the molecule level, for which we take advantage of further data structures at our disposal:
The properties per molecule in "static_mol_ints" are:
  • START_ATOM,END_ATOM: Indices of the first and last atoms defining the molecule (these are always continuous, and crosslinked molecules always count as separate molecules)
  • START_RES,END_RES: The same for residue indices
  • TYPE: The molecule type (CAMPARI numbers types consecutively as they are discovered in sequence input)
  • ANALYSIS_GROUP_NR: The assigned analysis group as determined by default or from user input
  • NR_TORSIONS: The number of bonds with topologically flexible dihedral angles (can be misleading for unsupported residues)
  • NR_IMD_DOFS: The total number of internal coordinate space degrees of freedom as configured by custom constraints, TMD_UNKMODE, and related settings
For completeness, the properties in "static_mol_dbls" are:
  • MASS: Total mass in a.m.u.
  • CONTOUR_LENGTH: A very approximate measure of how long a polymer chain could be in Å when fully stretched but covalent bonds and angles remain intact (ignore for anything not a polymer)
In addition, we have the dynamic information for molecules in "dyn_mol". As per the legend in (campana.py), columns 0-2 contain the x, y, and z-coordinates of molecular centers of mass and columns 3-5 those of molecular centroids. This is enough information to shift all the molecules into a single reference frame defined by a single molecule (this is essentially what FMCSC_XYZ_REFMOL does in CAMPARI).
Insert the following code, to take advantage of this information:
user_counters.totalrise = 0.0 minat_col = static_mol_ints.col_dic['START_ATOM'] maxat_col = static_mol_ints.col_dic['END_ATOM'] sx = np.empty(3,'float64') sy = np.empty(3,'float64') sz = np.empty(3,'float64') d2 = vector_calc_dis2(box.shape,box.type,box.bmatrix,box.ibmatrix,box.minimage2,dyn_mol[3,0],dyn_mol[3,1],dyn_mol[3,2],dyn_mol[0:3,0],dyn_mol[0:3,1],dyn_mol[0:3,2],False,svecx=sx,svecy=sy,svecz=sz) for i in range(0,3): lo = static_mol_ints.theprops[i,minat_col] hi = static_mol_ints.theprops[i,maxat_col] dyn_atom[(lo-1):hi,0] += sx[i] dyn_atom[(lo-1):hi,1] += sy[i] dyn_atom[(lo-1):hi,2] += sz[i] dyn_mol[i,0] += sx[i] dyn_mol[i,1] += sy[i] dyn_mol[i,2] += sz[i] dyn_mol[i,3] += sx[i] dyn_mol[i,4] += sy[i] dyn_mol[i,5] += sz[i]

This code uses the same distance function ("vector_calc_dis2") as before, just not on individual atoms or residue centroids but on molecular coordinates. Different from before, we also, with the help of optional keyword arguments ("svecx", "svecy", "svecz") retrieve the image shift vectors. These then need to be added to the coordinates that are not from the reference molecule, which is what the loop below accomplishes. Note that we only do this for the polymer molecules here and that we directly modify the values in both "dyn_atom" and "dyn_mol". Generally speaking, doing so might have unwanted consequences downstream.
If you want to, you can add a little sanity check and rerun CAMPARI at this stage:
d2 = vector_calc_dis2(box.shape,box.type,box.bmatrix,box.ibmatrix,box.minimage2,dyn_mol[3,0],dyn_mol[3,1],dyn_mol[3,2],dyn_mol[0:3,0],dyn_mol[0:3,1],dyn_mol[0:3,2],True) print(np.sqrt(d2))

This uses the last argument as "True" to skip any image corrections in "vector_calc_dis2". If the centers of mass have been brought into a proper reference frame, all three distances to the second RNA molecule should be below ca. 25Å. Note that in this particular trajectory the two RNA chains, unlike the protein ones, are already in the same image throughout. The correction is shown here as a general solution to this problem.
The next block of code computes the twist-related data (how much the helix turns per base pair).
dyn_mol[i,5] += sz[i] nps = len(user_counters.p2lst) s1 = np.zeros(nps-3) s2 = np.zeros(nps-3) axiscom = np.zeros(3) for i in range(0,nps-3): midstacki = 0.5*(dyn_atom[user_counters.p1lst[i],0:3] + dyn_atom[user_counters.p2lst[nps-3-i],0:3]) midstackip1 = 0.5*(dyn_atom[user_counters.p1lst[i+1],0:3] + dyn_atom[user_counters.p2lst[nps-4-i],0:3]) axiscom += midstacki if (i == (nps-4)): axiscom += midstackip1 s1[i] = scalar_calc_dih_by_vector(dyn_atom[user_counters.p1lst[i],0:3],midstacki,midstackip1,dyn_atom[user_counters.p1lst[i+1],0:3]) s2[i] = scalar_calc_dih_by_vector(dyn_atom[user_counters.p2lst[nps-4-i],0:3],midstackip1,midstacki,dyn_atom[user_counters.p2lst[nps-3-i],0:3]) countss1, unused = np.histogram(s1,bins=user_counters.twistbins,density=False) countss2, unused = np.histogram(s2,bins=user_counters.twistbins,density=False) user_counters.twist += countss1 + countss2 user_counters.totaltwist += 0.5*(sum(s1)+sum(s2)) axiscom /= (nps-2.0)

Our strategy is to measure the dihedral angle of the reference points (N1/N9) on a single chain, using an approximation of the effective helix axis as the "virtual" bond. The block above initializes temporary vectors, then computes the mid-point of base-paired reference atoms starting from the first base pair (the 5'-residue in the first RNA chain, which is paired with the 3rd last residue in the second chain). In each iteration, this point is constructed for both residue i+1 and i+2 on the 3rd chain (note that i starts from 0). Then, we use another function built into "campana.py" to compute a dihedral angle from 4 vectors, passing the reference points on either chain along with the effective bond. The answers (in °) are stored in the temporary arrays "s1" and "s2". Past the loop, we bin using NumPy's built-in function and accumulate the data in our dedicated data structure. We furthermore keep track of the average total twist along the 19 base pairs. Finally, and this is needed for the next step, we also compute the centroid of the approximated axis positions in a 3-vector called "axiscom". This is done here just to save a loop.
The last bit of code in "campana_step" is the following:
axiscom /= (nps-2.0) s3 = np.zeros(nps-3) axisgyrten = np.zeros([3,3]) for i in range(0,nps-2): midstacki = 0.5*(dyn_atom[user_counters.p1lst[i],0:3] + dyn_atom[user_counters.p2lst[nps-3-i],0:3]) axisgyrten += np.outer(midstacki-axiscom,midstacki-axiscom) axisgyrten[:,:] /= (nps-2.0) evals, evecs = np.linalg.eig(axisgyrten) longest = np.argmax(evals) for i in range(0,nps-3): midstacki = 0.5*(dyn_atom[user_counters.p1lst[i],0:3] + dyn_atom[user_counters.p2lst[nps-3-i],0:3]) midstackip1 = 0.5*(dyn_atom[user_counters.p1lst[i+1],0:3] + dyn_atom[user_counters.p2lst[nps-4-i],0:3]) s3[i] = np.abs(np.sum((midstackip1-midstacki)*evecs[:,longest])) countss3, unused = np.histogram(s3,bins=user_counters.risebins,density=False) user_counters.rise += countss3 user_counters.totalrise += sum(s3)

The code computes with the help of NumPy's linear algebra routines the gyration tensor (note that there is also a built-in function in the module that performs this task but, without modifications, it operates directly on atomic positions, see "gyration_tensor_by_bool"). The tensor is incremented with the help of the outer product function ("np.outer"), normalized, and then diagonalized ("np.linalg.eig"). The largest eigenvalue identifies our putative helix direction, and the associated eigenvector ("evecs[:,longest]") already gives us the unit vector along that direction. We compute the projection length of the difference vector of two subsequent helix axis mid-points onto this axis direction to get the effective rise along the axis. These values are then binned and averaged as before.
All that is left to do is to add code to "campana_final".
print("Average rise ",user_counters.totalrise/user_counters.ncalls) print("Average number of turns ",user_counters.totaltwist/user_counters.ncalls/360.0) fig1, subplots = pl.subplots(2) subplots[0].plot(user_counters.twistbins[1:]-1.5,user_counters.twist) subplots[0].set_title('Empirical distribution of twist (in deg., x-axis)') subplots[0].axvline(x=user_counters.totaltwist/user_counters.ncalls/18.0,color='red') subplots[1].plot(user_counters.risebins[1:]-0.1,user_counters.rise) subplots[1].set_title('Empirical distribution of rise (in Angstrom, x-axis)') subplots[1].axvline(x=user_counters.totalrise/user_counters.ncalls/18.0,color='red') pl.show() print("This is the end of campana_final.")

This is just printing the total averages and opening a two-panel plot for showing the distributions. As before, the hard-coded bin ranges and sizes are not something you would likely want to use in practice. In the plots, we overlay vertical lines to show the averages per base pair (knowing that we had analyzed 18 neighbor pairs for both rise and twist.
Now the module is ready to be run again:

campari_threads -k t18.key

The image that pops up and the printed text make a few points, especially if you compare it to what you might infer from direct visualization of the pdb file. First, for the rise, the length of ca. 50Å is in very good agreement with measurements on the structure. The distribution highlights that the lack of coplanarity of the base stacks with the plane that is orthogonal to the axis makes the distribution comparatively wide. Naturally, looking along the axis of an individual stack (two base pairs), that distance cannot be shorter nor much longer than the size of a heavy atom, so roughly 3.0-4.0Å. Smaller and larger values occur whenever there is tilt, which is the case for this piece of dsRNA. Second, for the twist, visual inspection gives roughly 12 base pairs per helical turn, leading to ca. 1.5 turns for the 18 neighbor pairs we could study here. It appears that our dihedral angle-based measure does not get this quite right, as the reported twist is higher on average, closer to that of B-DNA as shown for example in Tutorial 6 and its pdb file. This is most likely because we use a pair-specific axis each time.
Further exercises:
  • Switch the twist calculation so that it uses the same axis as the rise calculation (use molecular centroids to position this axis in space, which is important).
  • Investigate correlations of the base pair-resolved twist and rise features with themselves, with types (base, base pair, or pairs of base pairs), with position along the helix axis, and with pucker state to see if these explain the bimodality seen in both (original) histograms.
  • Make step 4 (contact map) more efficient by introducing molecule-level shifts as done here.
This concludes this tutorial. Several aspects of the Python module have not been touched upon a lot, most importantly the dynamic content that is sent as part of the "analyses" object. It is important to be aware of what is available to avoid needless duplication of code, and the best source for this is the module itself. Summarizing information is also given in the documentation for keyword FMCSC_PYCALC. Other tasks might be simplified further by using additional libraries. For example, to perform alignment operations, SciPy has a dedicated module/package called "scipy.spatial.transform.Rotation" that covers the required mathematics well. There are also further functions available directly in campana.py such as for computing bond angles, dipole moments, or inertia tensors across groups of atoms. In addition, MPI-parallel versions of CAMPARI will in fact instantiate and use the module several times in parallel, and it is possible to write modules that effectively perform heterogeneous task (as long as no data exchange at the Python level is required).
Lastly, if you find that the setup (initialization) time that CAMPARI takes due to the size of the system is interfering with quick Python debugging, rewrite the trajectory to one without solvent (FMCSC_XYZ_SOLVENT is the easiest), adjust the sequence file (pulling up the "END"), and then perform all further analyses on the pruned trajectory. Also remember to refer back to Tutorial 17 to learn more about the possible post-processing options for dealing with this type of trajectory data.


Design downloaded from free website templates.