Tutorial 16: Small Molecule Screens for Computational Docking
Preface:
This long tutorial covers the most important functionalities of the small molecule extension of CAMPARI (see keyword FMCSC_MOL2SCREEN). It walks you through setting up a receptor, a reference ligand, and performs three docking runs on a supplied library of small (drug) molecules. From the beginning, it should be clear to you that the goal here is not to generate many different poses quickly. Instead it is demonstrated how to use a full atomistic force field with implicit solvent as it would be used in a molecular simulation for the purpose of generating and, more importantly, evaluating poses. The workflow covers aspects of what was shown in a 2020 reference paper. If you already have a receptor-ligand complex with the ligand in mol2-format, and you are amply familiar with CAMPARI otherwise, you can skip most of the first three steps (but you will have to make adjustments in the other steps of course). In particular, you might not have a library of molecules parameterized specifically for ABSINTH, in which case you should switch to a primitive model (FMCSC_SC_POLAR small and FMCSC_SC_IMPSOLV 0.0). The details of the Hamiltonian are not particularly relevant for what there is to learn from this tutorial. That said, the CPU time required for the first three steps is negligible.Step-by-Step:
Step 1 - Preparing the receptor
In the folder for this tutorial, you can find the PDB-file for structure 3P1C (version as of 11/2020) in the same way that you could retrieve it yourself from the PDB website. It has two copies of the bromodomain in the unit cell, which is a result of the experiment rather than a biological feature. To get this system as a folded protein to be simulated in CAMPARI, the very least we need, in addition to the PDB-file, is a sequence file. In this tutorial, we are going to use the tool that CAMPARI ships with for this purpose. This would be the normal workflow when simulating biomolecules in their folded states. The tool parses the PDB file, and tries to extract what the real sequence of the construct in the underlying experiment was (this is contained in the "SEQRES" entries). What is actually in the coordinates (ATOM/HETATM sections) is not as straightforwardly relatable to this sequence as one may think: well-defined small molecules are listed but only those that were resolved in the experiment, water molecules (the universally assumed main buffer component) are only found in the coordinate section (but not anywhere else), entire polymer residues might be missing in the coordinate section (loops and tails) that are listed in the sequence entry, protonation states, tautomers, etc. are ambiguous or undefined, and covalent modifications of polymer residues, might be missing or misplaced (glycosylations are particularly ill-defined). For proceeding with this tutorial, the main question to answer would be what to retain and how to deal with residues missing coordinates (listed under "REMARK 465"). In a clean working directory, execute the following command:<full path to folder>/campari/tools/convert_SEQRES_toseq.sh <full path to folder>/campari/examples/tutorial16/3p1c.pdb tutorial16.seq HID 5 0 no A
With this command, we are asking the script to write to "tutorial16.seq" the sequence to be simulated, requesting that all histidine residues are protonated (only) at the Nδ atom, that missing tails/loops are included up to a length of 5 residues, that unsupported residues are not part of the sequence, that no water molecules are included in the sequence, and that we only want chain A (rather than both, which would be the default). If you inspect the sequence file or rerun the command above with various options and compare the results, you might note that there are actually no histidine residues in the bromodomain of CREBBP and that the longest missing tail in chain A is less than 5 residues long. As you will be able to read in the terminal, the structure also contains unsupported entities, here two ions (SCN- and K+) and two ligands (ALY), one per chain, which correspond to free acetyllysine (amino acid). This is actually a supported residue in CAMPARI (KAC), and the warning reflects that there are "HET" and "HETNAM" entries in the PDB-file for it, and that it occurs like a small molecule (i.e., not in a polymer). Note that the tool has a few important limitations that are not covered by this example: single-residue missing "loops" in the coordinate information require editing the PDB file, the desire to keep some but not all resolved small molecules requires editing the PDB file, shortened (2- or 1-character) nonstandard nucleotide names are not handled well or at all, the correspondence between records of missing residues ("REMARK 465") and sequence numbering ("SEQRES") must be linear (often fails when sequence insertion codes are in use in missing segments), and names of CAMPARI-known but nonstandard amino acids are not automatically translated where this would be necessary (as in the ALY/KAC example). Appropriate warnings and comments are generally produced for these cases, however. One exception to this are D-amino acids: these are automatically translated from their official PDB names (e.g., "DPR" resolves to "PRO_D"), and there is no way to have a different residue with the name DPR. Importantly, an inverted chirality is not a topological difference, so structural input can easily override the sequence-based chirality assignment. Appropriate warnings and comments are generally produced for these cases, however.
Now let us create a basic key file named "tutorial16.key":
PARAMETERS <full path to folder>/campari/params/abs4.2_charmm36.prm
FMCSC_SEQFILE tutorial16.seq
FMCSC_PDBFILE <full path to folder>/campari/examples/tutorial16/3p1c.pdb
FMCSC_PDB_READMODE 3 # the default
FMCSC_PDB_HMODE 1 # the default (and irrelevant if input contains no hydrogen coordinates)
FMCSC_UNSAFE 1
FMCSC_SHAPE 2
FMCSC_BOUNDARY 4 # these two options set up an aperiodic spherical droplet
FMCSC_SIZE 50.0
FMCSC_ORIGIN 20. -15. 5. # should be set in accordance with PDB input
FMCSC_SC_IPP 1.0 # the default
FMCSC_SC_BONDED_B 1.0 # bond length potentials
FMCSC_SC_BONDED_A 1.0 # bond angle potentials
FMCSC_SC_BONDED_I 1.0 # improper dihedral angle potentials
FMCSC_SC_BONDED_T 1.0 # dihedral angle potentials
FMCSC_RANDOMIZE 1
FMCSC_RANDOMATTS 1000
FMCSC_RANDOMTHRESH 1.0
FMCSC_DYNAMICS 2
FMCSC_NRSTEPS 1
FMCSC_EQUIL 100000
FMCSC_DISABLE_ANALYSIS 1
FMCSC_TMD_RELAX 2.0
FMCSC_TMD_NRRELAX 50
You should be familiar with most of these options from earlier tutorials. Note that we use a 2020 parameter file (see the reference publication for further details). Here, it suffices to say that these parameters were in part developed and tested precisely with an application mind like the one in this tutorial. Keyword FMCSC_PDB_READMODE is set to its default, which ensures that missing tails at the N- and C-termini can be generated in (if possible) excluded volume-obeying conformations. When longer tails are involved, keywords FMCSC_RANDOMATTS and FMCSC_RANDOMTHRESH become particularly important in controlling the chance of arriving at a reasonably clash-free starting conformation. We set FMCSC_DYNAMICS to 2 to enable another useful function in dealing with unknown, usually experimental structures: CAMPARI offers a relaxation procedure, activated by setting FMCSC_TMD_RELAX to a positive value. This procedure identifies degrees of freedom (side chain torsions, rigid rotations of small molecules, finally rigid translations of small molecules) that are subject to high forces. Based on a force threshold (the units are not directly interpretable), these internal coordinate space degrees of freedom are subjected to a controllable number of rounds of Monte Carlo moves. The result of this procedure will be written to a specific output file. We are currently using a Hamiltonian that uses only excluded volume and bonded interactions. The latter are almost always useful to turn on right away: they cannot cause terminations due to missing parameters (unless you simulate in Cartesian space), and they prevent unreasonable geometries from occurring (such as non-planar polypeptide amide groups). Since small structural errors like this are difficult to detect visually, it is good practice to prevent them from creeping in at any stage.
Now run:
campari -k tutorial16.key > t16_test.log
mv yourcalc_RELAXED.pdb t16_receptor.pdb
chimera t16_receptor.pdb <full path to folder>/campari/examples/tutorial16/3p1c.pdb
If you have not installed UCSF Chimera, you can use your own favorite viewer instead of course. The comparison of the two structures should reveal that they are identical at the backbone level for chain A (where both have information), that the C-terminal tail has been constructed reasonably, and that there are only a few differences in side chain conformations. If you inspect "t16_test.log", you can retrieve more accurate information on what the relaxation procedure accomplished, on whether there were problems with the tail randomization, and how much information was missing from the PDB. Part of the need for relaxation arises from missing heavy atoms in some side chains in our input structure.
In a docking application, there is usually no interest in globally sampling receptor flexibility. At most, one might be interested in sampling receptor flexibility locally, i.e., in and near the binding site. Thus, it is not particularly important for us how good the structure is overall and what the conformation of a distal tail is. For the bromodomains, both termini are on the opposite end from the binding site, and thus we can largely ignore the details on that opposite end. When working with an incomplete PDB structure, it is nevertheless always recommended to at least check what the missing bits are and where they roughly end up. Very long tails and loops are particularly problematic in this regard, and will often introduce major caveats into a docking application. With a complete atomistic model of a bromodomain constructed, we can now turn our attention to the natural ligand.
Step 2 - Preparing a reference ligand
Docking with a full atomistic energy function is a tricky endeavor: there is usually very limited sampling available per screened compound, and the energy surface in a frozen binding site is extremely rugged. Thus, blind docking requires a mix of simplifications and clever (often hierarchical) sampling strategies many of which are not implemented in CAMPARI. CAMPARI's philosophy for working with small molecules is that it can simply parallelize (almost) any calculation it natively supports across many molecules, i.e., the same calculation is run for every single one of the small molecules in the input file, see the documentation of FMCSC_MOL2SCREEN, which we will use further below. One such strategy that is docking-specific and supported by CAMPARI is the use of prior information from a reference ligand. A reference ligand from an experimental structure provides information on what and where the binding is exactly, and on what chemical moieties fit, both spatially and interaction-wise, into the one or more pockets that the binding site has to offer. If you visualize the binding site of 3p1c.pdb, you might notice an important asparagine residue (#1168 in the PDB, #90 in our renumbered reconstruction), some nearby water molecules, etc. However, unfortunately, the ligand is rather incomplete and only the last few atoms of the acetyllysine side chain are resolved.This is a very common problem, and the general case is not easy to address. At the very least, the experiment to give rise to the structure should know the complete 2D structure of the ligand in question. Then, the PDB itself will usually have a reference for that ligand available for download, which is however not in the absolute position or conformation compliant with any specific PDB-file (the download is available in sdf-format, which is a common file format for small molecules). Other sources of three-dimensional information can be the Cambridge Structural Database, predictions from cheminformatics packages like OpenBabel or RDKit, and from databases curated specifically for docking like ZINC. However, there are frequently uncertainties that are not straightforward to address. For example, the protonation states of titratable sites have to be predicted (if you visualize the sdf-file for ALY, you will see that it is not in its zwitterionic form), chiral centers not resolved in the structure might imply that there are actually multiple enantiomers in play, etc. These issues become highly nontrivial very quickly, and often there simply might not be a definitive answer. Since all of the above is beyond the scope of a CAMPARI tutorial, we rely here on the tools we are using anyway.
Execute the following, which takes advantage of CAMPARI's native knowledge of what acetyllysine looks like:
echo -e "KAC_N_C\nEND" > ligonly.seq
Now populate a new key-file, t16_lig.key, with the following entries:
PARAMETERS <full path to folder>/campari/params/abs4.2_charmm36.prm
FMCSC_SEQFILE ligonly.seq
FMCSC_SHAPE 2
FMCSC_BOUNDARY 4 # these two options set up an aperiodic spherical droplet
FMCSC_SIZE 50.0
FMCSC_ORIGIN 0. 0. 0. # should be set in accordance with PDB input
FMCSC_SC_IPP 1.0 # the default
FMCSC_SC_BONDED_B 1.0 # bond length potentials
FMCSC_SC_BONDED_A 1.0 # bond angle potentials
FMCSC_SC_BONDED_I 1.0 # improper dihedral angle potentials
FMCSC_SC_BONDED_T 1.0 # dihedral angle potentials
FMCSC_DYNAMICS 1
FMCSC_NRSTEPS 1
FMCSC_DISABLE_ANALYSIS 1
And run CAMPARI once, just to get the native atom order of acetyllysine (KAC) in CAMPARI:
campari -k t16_lig.key > t16_lig.log
From this run, we simply use the starting PDB file as follows:
awk 'BEGIN{nd=0}; {if ((substr($0,1,6) == "HETATM") && (substr($0,18,5) == "ALY A")) {for (k=1;k<=3;k++) {nd=nd+1; lns[nd] = sprintf("%5s %2d 1 %12.6g 10.0 ",substr($0,13,4),-1*k,substr($0,24+k*8,8));}}} END{print nd; for (k=1;k<=nd;k++) {print lns[k];}}' <full path to folder>/campari/examples/tutorial16/3p1c.pdb > ligonly.drest
k=0; for i in $(awk '{if (substr($0,1,4) == "ATOM") {print substr($0,13,4)}}' yourcalc_START.pdb); do k=$(echo "${k} + 1" | bc); sed -i -e "s|${i} |${k} |" ligonly.drest; done
The first command extracts the atom coordinates from our input PDB and writes them out as absolute position restraints for the distance and position restraint facility in CAMPARI. Since there are 6 resolved atoms, this makes for 18 restraints on individual (x/y/z) coordinates. The file produced by the first command contains atom names rather than indices, and this is corrected (largely) by the second command. We will now use Monte Carlo sampling together with restraints to generate an acetyllysine conformation compliant with the resolved atom positions in the bromodomain co-crystal structure.
To do this, add/replace the following keywords:
FMCSC_NRSTEPS 100000 # change existing keyword
FMCSC_SC_DREST 10.0 # these position restraints will be used later on as well
FMCSC_DRESTFILE ligonly.drest
FMCSC_RIGIDFREQ 0.4
FMCSC_OTHERFREQ 0.8 # most work on the side chain
FMCSC_OTHERNATFREQ 0.8
FMCSC_ALIGN 4 # allow longer end to move occasionally
FMCSC_SYBYLLJMAP <full path to folder>/campari/params/abs4.2.ljmap
With the above options, we take advantage of the fact that pivot moves of the type controlled by FMCSC_OTHERFREQ can align in such a way that the shorter end remains in place even if it is a side chain. This is not possible with the normal side chain pivot moves controlled by FMCSC_CHIFREQ. This can be useful here since the restraints we just created limit the motion of the terminal part of the acetyllysine side chain rather than its amino acid backbone. The rest of the sampler will be a default move set.
Now, simply run CAMPARI again, and save the final conformation:
campari -k t16_lig.key > t16_lig.log
mv yourcalc_END.mol2 t16_ligand.mol2
mv yourcalc_END.pdb t16_ligand.pdb
chimera t16_ligand.mol2 t16_receptor.pdb
The visualization should demonstrate to you that the coordinates in "t16_ligand.pdb" are such that the side chain is inserted into the binding pocket in exactly the same way as in the reference PDB file. If not, the MC calculation might have not found an appropriate position but this is unlikely with the chosen settings (re-run in that case). However, it is also clear that the remainder of the amino acid is not (necessarily) in a well-defined position with respect to the bromodomain, and this is something we should correct (see next step). Note that we obtain the final conformation also in mol2-format. The mol2-format is another common file format for small molecules (like sdf). You should open the file and look at it if you are not familiar at all with its conventions. Some basic information is also included in the CAMPARI documentation, see elsewhere. In addition to coordinates in the "@TRIPOS<ATOM>" section (fields 3-5), which also includes Sybyl atom types (field 6), it contains an explicit bond matrix including bond orders (the "@TRIPOS<BOND>" section). You might notice that the bond orders are all (incorrectly) 1, which is because CAMPARI does not have inherent knowledge of bond orders in residues it supports natively and does not try to guess them. The atom entries can also contain additional information (see FMCSC_MOL2SUBPRT), commonly including partial charges (here, they are all 0.0 so far since we did not actually use Coulombic interactions in the calculation).
Step 3 - Relaxing the complex in the presence of the reference ligand
In this step, we want to turn on a full interaction model, which for this tutorial will be based on the ABSINTH implicit solvent model. In order to use an arbitrary ligand in such a model, we require parameters for this ligand: partial charges, bonded parameters (the extent is controlled by whether we sample (in part) in Cartesian space), Lennard-Jones parameters, and ABSINTH-specific parameters (solvation groups, etc). For acetyllysine this is not a problem since acetyllysine is natively supported in CAMPARI, but where are the other parameters obtained from? In CAMPARI, the logic is that parameters are either determined (semi)automatically from what is provided in the mol2-file or directly from the same file. In other words, when working with a library of ligands, it is assumed that the mol2-formatted input file for such a calculation is the only file containing ligand-specific information. We will see in the next step how this works in CAMPARI.For relaxing the complex, we use instead the native support. To do so, we must first merge the PDB-files "t16_ligand.pdb" and "t16_receptor.pdb" and extend the sequence file. You can do this manually or programmatically as follows:
awk -v fn=t16_ligand.pdb 'BEGIN{k=0; while ( (getline ln<fn) > 0) {if (substr(ln,1,4) == "ATOM") {k=k+1; lns[k]=ln;}}} {if ($1 == "END") {for (i=1;i<=k;i++) {print lns[i];}; print "END"} else {print}}' t16_receptor.pdb > t16_complex.pdb
awk '{if ($1 == "END") {print "KAC_N_C"; print "END"} else {print}}' tutorial16.seq > t16_complex.seq
Now we can modify the initial key-file, viz. "tutorial16.key". Let us first point it to the complete PDB-file of our complex and add the interaction model.
FMCSC_PDBFILE t16_complex.pdb # edit existing keyword
FMCSC_SEQFILE t16_complex.seq # edit existing keyword
FMCSC_NRSTEPS 10000 # edit existing keyword
FMCSC_UAMODEL 0
FMCSC_INTERMODEL 1
FMCSC_ELECMODEL 3 # not the default
FMCSC_MODE_14 1
FMCSC_FUDGE_EL_14 1.0
FMCSC_FUDGE_ST_14 1.0
FMCSC_EPSRULE 2
FMCSC_SIGRULE 1
FMCSC_SC_ATTLJ 1.0
FMCSC_SC_POLAR 1.0
FMCSC_POLTOL 0.05
FMCSC_SC_IMPSOLV 1.0
FMCSC_FOSFUNC 3 # not the default
FMCSC_FOSMID 0.1
FMCSC_FOSTAU 0.25
FMCSC_SCRFUNC 3 # not the default
FMCSC_SCRMID 0.9
FMCSC_SCRTAU 0.5
FMCSC_SAVMODE 2 # not the default
FMCSC_SAVPROBE 2.5
FMCSC_IMPDIEL 78.2
FMCSC_SCRMODEL 1 # not the default
If you did not do a previous tutorial with the ABSINTH model, you might want to repeat that or read up on some of these keywords in the documentation. In particular, the meanings of the choices for FMCSC_UAMODEL, FMCSC_INTERMODEL, and FMCSC_ELECMODEL should be unequivocally clear. For this tutorial, we are deviating from the original 2008 reference. First, we increase the compatibility with gradient-based samplers by choosing appropriate and conservative options for FMCSC_SAVMODE, FMCSC_FOSFUNC, and FMCSC_SCRFUNC that offer continuous derivatives. Second, we use group-consistent screening of charges, which is not only more rigorous but was also shown to be superior in applications of this type (consult the reference publication for details). Third, we are allowing some leniency in the detection of charge groups (here, 0.05e). This should not be required in general for supported entities, but free amino acids are a particular case. Fourth, we turn on a self-correction for monopole-monopole interactions excluded due to the short-range interaction model (see FMCSC_ELECMODEL). This was also introduced in the 2020 paper.
Next, we need to think about what should move during the relaxation of the complex. Generally speaking, in a medium- or high-throughput virtual screening application, you will practically never allow the entire receptor to move because this creates a tremendous amount of noise, thus increasing sampling requirements dramatically. Consequently, we might just as well address the issue of relaxation now. Here, we know that the experimental structure contains the free acetyllysine but that we simply do not know, unlike the conformations of protein side chains in the binding site, the exact position of its backbone. Thus, it would be ill-advised to allow the protein to relax. Execute the following to create a simple file with custom constraints:
echo -e "M\n1 ALL" > t16.frz
Then, add keywords or edit existing ones as follows:
FMCSC_NRSTEPS 10000 # change existing keyword
FMCSC_TMD_RELAX 0.0 # edit existing keyword
FMCSC_DYNAMICS 1 # edit existing keyword
FMCSC_FRZFILE t16.frz
FMCSC_RIGIDFREQ 0.0
FMCSC_CHIFREQ 0.0
FMCSC_OTHERFREQ 1.0
FMCSC_OTHERNATFREQ 0.8
FMCSC_CHECKFREQ 1000
FMCSC_NRTHREADS 4 # if you can allow it, set it to the number of physical or even logical cores in your system (hyperthreading can be beneficial)
FMCSC_THREADS_VERBOSE 1 # level of diagnostic output
FMCSC_THREADS_DLB_FREQ 500 # restarts DLB segments every X steps
FMCSC_THREADS_DLB_STOP 300 # stops DLB for each segment after X steps
FMCSC_THREADS_DLB_EXT 1 # accumulate load data over X steps for DLB adjustments
This will use a Monte Carlo sampler exclusively on the KAC residue and using exclusively single-torsion pivot moves. The acetyllysine ligand cannot really escape from the binding site nor can the protein conformation change. Run with this key-file, possibly a few times, each time inspecting the final conformation:
campari_threads -k tutorial16.key > t16_relax_complex.log
mv yourcalc_END.pdb t16_complex_relaxed.pdb
chimera t16_complex_relaxed.pdb
Once you are satisfied that the final conformation is reasonable (you should watch out in particular that the interaction with the conserved asparagine residue is preserved), the last thing to do in this step is to extract the ligand conformation on its own in mol2-format. As usual, there are a number of ways of doing this, but the simplest might be the following:
awk '{if (((substr($0,1,4) != "ATOM") && (substr($0,1,6) != "HETATM")) || (substr($0,18,3) == "KAC")) {print}}' t16_complex_relaxed.pdb > t16_ligand_relaxed.pdb
echo "FMCSC_PDBFILE t16_ligand_relaxed.pdb" >> t16_lig.key
campari -k t16_lig.key > t16_ignore.log
mv yourcalc_START.mol2 t16_ligand_relaxed.mol2
It should be noted, however, that the resultant mol2-file contains only coordinates and basic atom type information but no partial charges or ABSINTH-specific parameters.
Step 4 - Screening a library of pre-parameterized small molecules blindly
To activate and use a small molecule screen within CAMPARI, we have a number of specialized keywords to use that do not appear in any other context. To make their meaning clear, you should of course follow the links and read the corresponding documentation, but a short overview is also provided here.Add keywords to tutorial16.key:
FMCSC_MOL2SCREEN 1
FMCSC_MOL2FILE <full path to folder>/campari/examples/tutorial16/t16_library.mol2
The two added keywords are the ones that are strictly necessary to enable this particular execution mode of CAMPARI. With these in place, CAMPARI will try to run the same calculation specified by all the other keywords for a molecular system in which the last residue, which has to be a single-residue molecule in the sequence file, is successively replaced by each of the molecules found in the input file. The last residue must not be a CAMPARI-supported one, so execute:
sed -i -e 's|KAC_N_C|LIG_N_C|g' t16_complex.seq
Remember to change it back if you want to return to one of the prior steps. The specific input file used here, t16_library.mol2, contains 43 molecules, and you should inspect it briefly. In particular, you should note that in the "@<TRIPOS>ATOM" section, columns 7-9 contain specific information. Column 9 are partial charges (here, generated by CGenFF). Column 7 is a code (integer) that refers to a particular reference set of solvation groups. Column 8 is a counter for the solvation groups occurring in a particular molecule. You cannot know from this file alone what the integer code refers to, and we have to point CAMPARI to the underlying reference set:
FMCSC_MOL2FOSLIBFILE <full path to folder>/campari/examples/tutorial16/t16_library.foslib
Briefly, the idea behind the small molecule parameterization for ABSINTH is to parse every molecule into a set of solvation groups from a reference database of experimentally known values (see the 2020 paper). Because only limited numbers of experimental values are available, some molecules containing unusual chemotypes cannot be parameterized while others have to rely on corrective values (usually single atoms whose reference free energy of solvation is estimated by a decomposition analysis). This has been done in t16_library.mol2, and the results are stored in columns 7 and 8, which become quantitatively interpretable only with the help of t16_library.foslib. As for partial charges, mol2-files might contain unconstrained parameterization results leading to charges that do not sum up exactly to integer values, even at the level of the entire molecule. In our case, this is not a problem because they are based on CGenFF, but it does not hurt to proactively prevent fractional charges from being used. Thus, set:
FMCSC_MOL2POLMODE 1
This will at least guarantee that charges sum up meaningfully at the level of the entire molecule. In addition, the aforementioned threshold of 0.05e set via FMCSC_POLTOL is particularly relevant in a small molecule screen. Frequently, because charges for small molecules can be much more delocalized and complicated than those occurring in the relatively simple building blocks used in polypeptides and polynucleotides, it is difficult to find neutral groups (meaning compact sets of atoms summing up to a charge of zero). This is where this tolerance setting becomes important. Note that the algorithm finding these groups was extended and refined for this purpose (see Parameters for a description) when transitioning from Version 3 to Version 4. The downside of not detecting meaningful charge groups is twofold: first, intramolecular polar interactions will be largely eliminated (this can be particularly problematic if the molecule is polar/charged and has substantial flexibility and/or if ABSINTH (or any other implicit solvent model with a conformation-dependent effective dielectric) is in use; second, the use of group-consistent screening models becomes questionable.
The remaining parameters we need to have values for are those for the Lennard-Jones potential and bonded parameters. For the former, we use a map between Sybyl types and CAMPARI atom types as mentioned above through keyword FMCSC_SYBYLLJMAP. If no map is provided, a map is guessed automatically based on names, valence, etc. This procedure is always dependent on the parameter file in use, and, of the ones shipped with CAMPARI by default, only the abs4.1* and abs4.2* parameter files have been used/tested extensively in this regard. Bonded parameters are currently guessed in their entirety with only very limited possibilities of modifying or supplying parameters from the user side (the only currently available option to do so is to use options 2 or 3 for FMCSC_MOL2RESPECTBOND, which allows bonded potential types to be assigned in a file-based manner, but the integer code in use is not exposed to the user). Here, it will be sufficient to have CAMPARI guess the potentials via:
FMCSC_MOL2BONDMODE 3
FMCSC_PLANAR_TOLS 15.0 30.0 # in degrees
FMCSC_MOL2VERBOSE 4 # maximal
FMCSC_MOL2BONDMODE works in the same way as FMCSC_GUESS_BONDED, and using anything except 3 will generally not be advisable. Bond lengths and angles are always assumed to be in a harmonic basin around a single equilibrium value, and this equilibrium value is extracted from the input conformer. This means that CAMPARI handles only mol2 input files that have coordinates representing realistic, three-dimensional conformers of molecules. Any errors in the input conformers at the level of bond lengths and angles will be propagated to the results. These parameters come, currently, into play only if you sample in Cartesian space, however. In an internal coordinate space, there currently is no way of sampling flexible rings (as there is for proline and polynucleotides), which is a caveat. In terms of bond lengths and angles, mismatches between input conformers and chemically reasonable values are of course propagated and maintained exactly in an internal coordinate space (CAMPARI has no internal knowledge of "chemically reasonable" for arbitrary small molecules). The most critical element in guessing parameters is to prescribe rotational barriers that are due to electronic effects. For this purpose, CAMPARI performs a relatively detailed scan of the bonds in a molecule. If the bond is rotatable (meaning not part of a ring and not involving a terminal atom), its rotation might be hindered. This is the case, for example, for the N-C(=O) bond in the amide moiety in the acetyllysine side chain. Based on (Sybyl) atom types, valence, and geometric criteria (planarity, co-planarity of substituents, etc), which are assigned using two threshold values, an attempt is made to identify the local chemotype and to assign a strong enough potential that biases the molecule toward a planar conformation. These can be asymmetric (favoring particular E/Z-style isomers), e.g., in the case of amides with asymmetric substituents. To get an idea of how detailed the parsing is, we use the verbosity setting. At maximal verbosity, we will get a report on every non-terminal bond in every molecule reporting the inferred chemotype. Note that some chemotypes are difficult to make general rules for, in particular those involving atoms offering more diverse hybridization states, e.g. sulfonamides. You should always expect this procedure to yield some erroneous assignments. A particular caveat is that some assignments depend on whether the input conformer is planar. As a result, subsequent runs for the same molecule but different starting conformer might yield different assignments if a dihedral angle is changed to an unexpected value. This is exactly the type of problem we could and should address using options 2 or 3 for FMCSC_MOL2RESPECTBOND (see Step 6).
Having defined how we arrive at parameters for the small molecules, we next need to define how to sample them. This is primarily a matter of defining a base sampler, and for this first run, we will stick with a Monte Carlo sampler. Replace the existing move set keywords with the following:
FMCSC_RANDOMIZE 0 # change existing keyword (for clarity)
FMCSC_RIGIDFREQ 0.5
FMCSC_RIGIDRDFREQ 0.5
FMCSC_CHIFREQ 0.0
FMCSC_OTHERFREQ 1.0
FMCSC_OTHERRDFREQ 1.0
FMCSC_OTHERUNKFREQ 1.0 # only in small molecules
FMCSC_ALIGN 4 # allow longer end to move occasionally
FMCSC_MOL2FRZMODE 2
FMCSC_MOL2RANDOMIZE 1
Here, the only keywords specific to a small molecule screen are the last two. For FMCSC_MOL2FRZMODE, the value of 2 implies that we want all dihedral angles in the small molecules to be able to move. Dihedral angles moving larger numbers of atoms are favored by the picking scheme (since they tend to be both more important and more difficult to sample). Note that the receptor is completely frozen since we are still going to be using "t16.frz" for FMCSC_FRZFILE. By choosing FMCSC_MOL2RANDOMIZE to be 1, our molecules will start from a random position in the box. In version 4, this was the entire simulation container, which necessitated a large investment in a search just to "find" the receptor. This is changed since version 5 where the meaning changed. First, there are additional options for FMCSC_MOL2RANDOMIZE. Second, a choice of 1 for FMCSC_MOL2RANDOMIZE performs randomization only in the absence of alignment to a tether group (see below). Third, only a subvolume is considered: this is either around the receptor (using a minimum buffer size of 10Å to pad the volume), or it is the binding site if implicitly defined via distance or position restraints (as it would be in steps 5 and 7 below).
We also need more steps:
FMCSC_NRSTEPS 100000 # change existing keyword
FMCSC_EQUIL 50000 # change existing keyword
FMCSC_CHECKFREQ 10000 # change existing keyword
The above means that for every one of the 43 molecules, 100000 Monte Carlo steps need to be performed.
The last two aspects we should control explicitly are what outputs we want to retain and how to distribute our available computational resources. Since we are using a sampler at finite temperature in conjunction with extensive constraints, we should expect this sampler to produce a narrow distribution of energies but a distribution nonetheless. Since any inference of binding free energies will rely on energies, CAMPARI, through FMCSC_MOL2OUTMODE, offers the user the choice to make energy a part of the criterion for retaining output conformations (poses) of the small molecules (and any possible moving parts of the receptor, here none so far). This information is printed to the central output file(s) of a small molecule screen.
FMCSC_MOL2OUTMODE 4 # prints only final pose
FMCSC_MOL2ENMODE 1 # prints instantaneous energy per pose
FMCSC_REMC 1 # the correct way to distribute molecules across MPI processes
FMCSC_REPLICAS XXX # for clarity, this should be the number of MPI processes (will be overridden in this mode)
When deciding what poses to retain, the danger in using minimum energies in conjunction with a finite-temperature sampler and the ABSINTH model is twofold: first, total energies of the complex might be influenced by receptor-only contributions (irrespective of constraints); second, fluctuations are random and the minimum encountered during a finite number of steps is statistically poorly defined. A better approach is to use geometric clustering as we will see in the next step. When using clustering, the question is what score (energy) to associate with a given pose, and this is what FMCSC_MOL2ENMODE will be useful for.
For small molecule screens, it is recommended to either use MPI (but not OpenMP) or OpenMP (but not MPI). In the first mode, and this is controlled by FMCSC_REMC, the available N MPI processes will be split into a single master processes that reads and parses the input file before sending individual molecule blocks to one of the N-1 subordinate processes (whichever one becomes available). These are doing the actual work meaning that the master is largely idle, and you can often pick N to be one larger than the number of cores available on your system. This mode ensures load balancing and is maximally resource-efficient with the only caveat being the largely "unemployed" master. Instead, if you let a single multi-threaded process handle the calculation, all setup steps are inefficient (no threads parallelization), and all sampling steps are bound to the parallel efficiency offered by the combination of system, sampler, and number of threads. On platforms with large numbers of cores, you might sometimes elect to mix OpenMP and MPI in order to save memory and cache and to achieve a higher throughput per MPI process (since jobs may only be allowed to run for a certain amount of time uninterruptedly, as is common on supercomputers). See Tutorial 7 and Tutorial 9 for more technical details on parallel execution modes in general.
You are now ready to run this calculation. Paste/replace all the keywords to "tutorial16.key", and then execute either of the following:
campari_threads -k tutorial16.key > t16_blindrun_threads.log
mv yourcalc.mol2 blind_run.mol2
mv yourcalc.energies blind_run.energies
chimera t16_complex_relaxed.pdb blind_run.mol2
Or:
mpirun -np XXX campari_mpi -k tutorial16.key > t16_blindrun_mpi.log
cat N_*_yourcalc.mol2 > blind_run.mol2
cat N_*_yourcalc.energies | awk '{if ((NR == 1) || ($1 != "IDENTIFIER")) {print}}' > blind_run.energies
chimera t16_complex_relaxed.pdb blind_run.mol2
Chimera (unlike Pymol) will visualize all output poses at the same time by default. While the calculation is running, which will take some time depending on your available resources, you should use the opportunity to look at the log file, either "t16_blindrun_threads.log" (for the OpenMP case) or "N_001.log", "N_002.log" and so on (for the MPI case) while opening the mol2-library file in a molecular viewer and opening it as text. You should then try to recapitulate for the first screened molecule how the molecule was understood chemically, what charge groups were assigned, and what solvation groups are present. This might be a bit tedious (due to working with indices and atom names), but it will give you a glimpse of how well CAMPARI understands the molecule at hand (of course this requires some chemical knowledge on your part). The atom-based labeling feature in your selected viewer might help in this regard.
Once the calculation is finished (or at least once it has proceeded sufficiently far), inspection of the poses will probably reveal that only a few molecules, if any, are close to the binding site. They are clealy not random, and many of them will probably cluster in particular spots on the surface. These surface spots could offer complementary charge for charged molecules, be hydrophobic, or have a shallow cavity shape. Note that the display might be a bit irritating because the heavier halogen atoms in the small molecules carry an extra lone-pair particle, which is a consequence of relying on CGenFF for their parameterization. The corresponding energies found in their dedicated output file are not interpretable on their own because they "contain" the intrinsic ligand (and receptor) energies (no finite difference performed).
The poses might seem confusing at first, but we should recall what exactly was done: from a random starting position, we used a pure Monte Carlo sampler in an atomistic force field to simulate the position of relatively complicated small molecules in the presence of a frozen receptor. The constraints on the receptor mean that the energy surface is even more rugged than it would be for a flexible receptor. This is also a good time to realize that we simulate an ill-defined system in many ways: the receptor has a (admittedly small) net charge (-1), there are no compensating ions around, and the small molecules might differ in their own net charge. Furthermore, we did not check how close the receptor is to the boundary of our simulation system, which might also play a role in restricting access to certain spots on the surface. Taken together, these considerations all point to a calculation set up to yield limited useful information. Molecular simulations sampling from a standard Boltzmann distribution are, generally speaking, not fast or efficient enough search procedures to provide good and somewhat systematic coverage of the available phase space, and no docking software relies exclusively on such a strategy. While the idea of importance sampling is useful also in this type of application, we have to find a way to prevent the system from spending large amounts of time being "stuck" in an uninteresting place. Play around with alternative strategies: (much) higher temperature, using many repeated runs rather than only one run per molecule (for example, via FMCSC_MPIAVG), etc. There might be settings with which you achieve better consistency, but it should be clear that the problem remains hard if we do not restrict the available search space in some way.
Step 5 - Screening a library of pre-parameterized small molecules with the help of the reference ligand
In many docking programs, binding sites are defined by choosing a set of receptor residues, which then sets up a spatial subregion to which any search algorithms are restricted. We can achieve something similar in multiple ways in CAMPARI. For example, you can set up preemptive distance and position restraints that apply to all atoms in the screened molecules. These restraints can provide a description of a binding site, and pose randomization (→ FMCSC_MOL2RANDOMIZE) would respond to these constraints. If these are flat-bottom potentials constructed from two half-potentials (→ elsewhere), a binding site can be defined that does not lead to an overly rugged potential energy landscape. Something similar can be achieved using spatial density restraints. The solution explored in this tutorial is to enlist the help of a reference ligand, which we prepared above (but which still needs to be modified). Add this keyword:FMCSC_MOL2_REFMOL t16_ligand_relaxed_amidetag.mol2
This keyword has three primary functions that are all based on its coordinates. First, it can define a molecular substructure (here, simply a set of topologically connected atoms) that other molecules can be aligned to if they possess the same or a similar substructure. This is known as "tethered" docking, and it removes most of the required searching in rigid-body coordinates by choosing an appropriate starting conformation of the tether group in relation to the binding site. Second, and this is a related function, it can set up positional restraints that will strongly bias the screened molecules toward the binding site (and prevent them from escaping). Third, CAMPARI can identify nearby receptor residues and set up automatic constraints that freeze everything in the receptor except those nearby side chains.
To enable the first function, we need to tag the substructure of interest in "t16_ligand_relaxed.mol2". In our case, we want to tag the terminal secondary amide group. It makes a difference whether we include hydrogen atoms in the substructure, and here we omit the methyl hydrogen atoms. The atoms making up the remainder of the terminal amide are "NZ", "HZ", "CH", "OH", and "CI". You can tag them manually by appending suffixes starting from "_1" through to "_5" or you can do it programmatically like this:
awk 'BEGIN{k=0;} {if (($2 == "HZ") || ($2 == "NZ") || ($2 == "CH") || ($2 == "CI") || ($2 == "OH")) {k=k+1; printf("%d %s ",$1,$2 "_" k); for (i=3;i<=NF;i++) {printf(" %s",$i);}; printf("\n");} else {print}}' t16_ligand_relaxed.mol2 > t16_ligand_relaxed_amidetag.mol2
Now that we have tagged a substructure, CAMPARI will look in the molecules it is planning to screen for similar structures. There are two thresholds for this that we need to set. The first is a threshold on how chemically different the substructures can be, and the second is a threshold for how geometrically dissimilar the substructure coordinates are allowed to be after the alignment has happened.
FMCSC_MOL2ASSIMILAR 0.1 # per atom, this is very stringent
FMCSC_MOL2RMSDTHRESH 0.5 # in A, this is stringent
If we were to include the chemically equivalent hydrogen atoms here, we would incur two problems: first, the chemical matches are combinatorially duplicated (here, to 6 possible matches of hydrogen to hydrogen); second, an ideal alignment would be possible only if the uninteresting spin of the methyl group is the same as in our tether. The second aspect would force us to increase FMCSC_MOL2RMSDTHRESH slightly, which would put as at the risk of redundantly enumerating what is essentially the same tethering. Sometimes, this enumeration is desired, for instance when dealing with molecules with symmetric rings used for tethering. Including the hydrogens is also beneficial or even required when it should be clear to the chemical matching algorithm that there can be no continuation at this methyl group.
If you read the documentation on keyword FMCSC_MOL2ASSIMILAR, you will find that we are trying strategy #1 here. If you had visualized the results in the previous step, you might have noticed that almost all of the molecules in the library for this tutorial contain a terminal amide group as the one in acetyllysine.
For the second function, we simply need to add the following keywords:
FMCSC_MOL2DRESTMODE 1 # enable automatic tether position restraints
FMCSC_MOL2DRESTBUF 2.0 # produces at bit of leeway if positive (in Angstrom)
FMCSC_SC_DREST 1.0 # default strength can be scaled globally by this
In order to enable the position restraints, we need to enable their parent potential by choosing a positive scale factor. Normally, we would need to provide an input file for this, but CAMPARI recognizes our implicit request and can skip this step (you could still provide a file with additional restraints, though). The automatically added terms have a fixed prefactor of 1.0 kcal·mol-1·Å-2, which is strong depending on the number of terms (the number of atoms in the tether) and the buffer setting. The latter, here set to 2.0Å, creates flat bottom-type restraints where the 2.0 corresponds to the width of the flat part. This can be advantageous because the covalent geometry of different molecules might imply that the tether group cannot always be in exactly the same place. The position restraints are applied to the analogs found for the selected tether atoms using the reference positions, which is chosen by mode 1 for FMCSC_MOL2DRESTMODE. In mode 2, the reference molecule would instead provide a global restraint.
For the third function, CAMPARI offers the logic that, at least in an atomistic force field, completely rigid side chains create too much ruggedness, which in turn creates unrealistic penalties. Due to the inverse power nature of excluded volume interactions, a shift of a side chain in the binding site even by 1.0Å or less can have dramatic consequences for the steric favorability of the binding of different small molecules. By allowing protein residues to adjust to small molecules, the space of feasibly docked candidate molecules should be widened. That said, it is important that the number of receptor degrees of freedom be kept small to avoid sampling bottlenecks and to limit the size of energy fluctuations due to these degrees of freedom. In our first attempt, we will stick with a rigid receptor. To do so, we could keep using the same constraint input file as before, but we can also use the automatic mode as follows:
FMCSC_MOL2FOCUS -0.1 # nothing will move
# FMCSC_FRZFILE t16.frz # comment existing keyword
This achieves the same effect by telling CAMPARI that only those receptor degrees of freedom will be be kept free that only move atoms that are all within 0.1Å of all atoms of the reference ligand. This is clearly impossible and any value meant to provide some receptor flexibility must either be positive (then the interpretation changes, see FMCSC_MOL2FOCUS) or must be much larger in absolute magnitude, at least as large as the span (size) of the reference molecule itself.
Finally, we need to consider our sampler. The rigid environment in the binding site means that degrees of freedom should be correlated, and this is achieved most easily with gradient-based samplers. That said, the initial conformations resulting from the alignment have very high chances of nuclear overlap leading to extreme forces, which makes such samplers unusable. So what we want is to relax the initial conformation using Monte Carlo so that a clash-free conformation can be found (if any) and then optimize/sample this conformation using molecular dynamics. For the Monte Carlo part, the use of option 4 for FMCSC_ALIGN is now fundamental: without it, the base of motion of the molecule will never be able to move and any clash created by will only ever be resolved by rigid-body displacement, which would be an unnecessary restriction. The hybrid method just described is available and can be activated like this:
FMCSC_NRSTEPS 10000 # change existing keyword
FMCSC_EQUIL 5000 # change existing keyword
FMCSC_DYNAMICS 5 # change existing keyword, now hybrid MD/MC
FMCSC_TMD_INTEGRATOR 1
FMCSC_TMD_INT2UP 4
FMCSC_TIMESTEP 0.005 # in ps
FMCSC_TSTAT 2 # adapted Andersen method
FMCSC_CYCLE_MC_FIRST 9000
FMCSC_CYCLE_DYN_MIN 1000
FMCSC_CYCLE_DYN_MAX 1000
FMCSC_CYCLE_MC_MIN 250 # relevant later
FMCSC_CYCLE_MC_MAX 250 # relevant later
This type of hybrid sampler is also used in Tutorial 15 and in Tutorial 7 (for the latter, with a similar goal in mind). The settings for the internal coordinate space dynamics engine should be familiar from earlier tutorials except that here we use the Andersen thermostat. The alternative of relying on the velocity rescaling version has a higher chance of producing equipartition artifacts for a system like the one used here. The hybrid schedule prescribed above samples 9000 steps of MC followed by a single MD segment of 1000 steps. The total sampling is reduced compared to our blind run, and this is a reflection of the fact that we have simplified the problem by using information from the reference molecule. Of course, there is no universal guide or rule for how much sampling is sufficient: generally speaking, the more the better.
We also want to prevent numerical problems and make the MC and MD potential energy functions as identical as possible:
FMCSC_NBCUTOFF 10.0 # in A
FMCSC_ELCUTOFF 10.0 # in A
FMCSC_TMD_UNKMODE 3 # sample everything (overlaps with FMCSC_MOL2FRZMODE)
FMCSC_MCCUTMODE 2 # required for consistency between MC and MD Hamiltonians
FMCSC_N2LOOP 0 # avoid using N^2 energy as reference anywhere
FMCSC_FUDGE_DYN 1 # avoid problems with XH-group dihedral angles
FMCSC_THRESHOLD_INCR 0.1 # stop unstable simulations aggressively
The use of FMCSC_THRESHOLD_INCR is fundamental for ensuring that the small molecule screen can continue if the dynamics part of a given screened molecule becomes unstable. The desired behavior is that such a molecule is aborted without producing output. This is why it is important that such cancellations occur only when there is an actual steric problem. The use of FMCSC_FUDGE_DYN ensures that there is no other obvious source of integrator instability.
In theory, we could save time by safely identifying tetherings that are incompatible with the rigid part of the molecule and the shape of the binding site. Some clashes might be impossible resolve, and keyword FMCSC_MOL2PRUNEMODE is meant to save needless computations for these cases. However, this filtering works best if the receptor is locally flexible, and we omit it here (introduced in step 7 below). Now we are ready to proceed with this run as before (be careful throughout that you do not accidentally append an existing "yourcalc.mol2" or its MPI equivalents; these files are not deleted but appended by default):
campari_threads -k tutorial16.key > t16_tetheredrigidrun_threads.log
mv yourcalc.mol2 tetheredrigid_run.mol2
chimera t16_complex_relaxed.pdb tetheredrigid_run.mol2
Or:
mpirun -np XXX campari_mpi -k tutorial16.key > t16_tetheredrigidrun_mpi.log
cat N_*_yourcalc.mol2 > tetheredrigid_run.mol2
chimera t16_complex_relaxed.pdb tetheredrigid_run.mol2
Upon visualizing these results, you should see a dramatic difference compared to the blind run. There are fewer poses but practically all of these poses should have a terminal amide group in the binding site that is engaging important residues like the conserved asparagine (you can use Chimera's FindHBond tool under "Structure Analysis" in "Tools" to selectively show intermolecular hydrogen bonds).
You might wonder about the speed of the screening calculations. There is a significant problem in that we have a complete but constrained receptor. All the interactions within the receptor are recalculated at every step, so we are effectively simulating a protein that cannot move. The normal workaround for this is twofold: in Monte Carlo steps, energies are incremental and this implies recognizing the terms that did not change. Conversely, in dynamics steps, we would have to rely on keyword FMCSC_SKIPFRZ to omit constrained interactions, but this is only supported if the Hamiltonian is pairwise decomposable (which ABSINTH is not). Enabling this requires a much more complicated heuristic that is not available at the moment (02/2024).
If you have limited resources, you can speed up the tethered run like this:
FMCSC_SKIPFRZ 1 # does not work with ABSINTH
FMCSC_SC_POLAR 0.05 # change existing keyword
FMCSC_SC_IMPSOLV 0.0 # change existing keyword
This creates a primitive model (no desolvation, fixed dielectric of 20.0). Note that FMCSC_SKIPFRZ has no impact on the speed of incremental energy calculations during Monte Carlo steps. Some caution is required in interpreting energy values during hybrid runs when using this keyword, and a warning is produced. This is because the Monte Carlo segments report the complete energies.
Step 6 - Scoring the resultant poses
So far, we have produced geometric information and complex energies, but we have stored nothing else. So how do we produce a prediction or proxy of a binding free energy out of this (which is the main purpose of computational docking)? In most virtual screening workflows, we rely on the rigid complex assumption. This means that we take a single conformation (pose and receptor) and simply compute the finite difference of potential energies of complex versus ligand and protein in isolation. This has a number of limitations, and the reader is referred to the reference publication for what these limitations are and how some of them can be addressed. Here, we will keep it simple. CAMPARI offers a scoring mode for the rigid complex assumption with the option to also supply a separate receptor conformation (keyword FMCSC_MOL2_PDB_RELAXED), which is activated as follows:FMCSC_MOL2SCOREONLY 1 # activates scoring mode
FMCSC_MOL2FILE tetheredrigid_run.mol2 # change existing keyword
FMCSC_MOL2RESPECTBOND 3 # keep as much as possible from the previous output
Here, we turn to the aforementioned FMCSC_MOL2RESPECTBOND in order to prevent changes in assigning conformation-dependent potentials. Now, we can run again:
campari -k tutorial16.key | sed -e "s|Energies-Difference:|Energies-Difference |g" > t16_tetheredrigidrun_scores.log
/bin/bash <full path to folder>/campari/examples/tutorial16/get_scores.sh t16_tetheredrigidrun_scores.log "Energies-Difference" > t16_tetheredrigidrun.sco
The bash script simply extracts information from the log-file and puts it into a table format that can be read by R or other plotting software. Now repeat the same thing for the blind run (and any other runs you might have performed). Edit the key-file again:
FMCSC_MOL2FILE blind_run.mol2 # change existing keyword
Then:
campari -k tutorial16.key | sed -e "s|Energies-Difference:|Energies-Difference |g" > t16_blindrun_scores.log
/bin/bash <full path to folder>/campari/examples/tutorial16/get_scores.sh t16_blindrun_scores.log "Energies-Difference" > t16_blindrun.sco
Have a look at the log-files to understand how the information is constructed. In addition to "TOTAL" energies, the scores contain individual contributions to this total energy difference and some ligand-specific information (sum of reference free energies of solvation, total charge, numbers of nonzero net charge groups). For the tethered run, the complex energies ("Energies-Complex" and "TOTAL") should be tightly correlated with those found in "tetheredrigidrun.energies." This holds if you subtract the distance restraint potential (if any) from the latter. The same test does not work as well for the blind run because it used a slightly different potential energy function (see the changes introduced in step 5 to enforce consistency of the MC and MD parts, which you would have to revert first), but the correlation should nevertheless be very high. This is a useful sanity check. Note, however, that we take a detour through limited-precision coordinates, which impacts this comparison in particular for the excluded volume potential.
Inevitably, you might notice some issues with individual molecules. As an example, the reference molecule, which was run and scored in the tethered run, reports correct charge groups but a bad reference free energy of solvation. This arises because we did not prepare "t16_ligand_relaxed.mol2" in congruence with the information supplied to FMCSC_MOL2FOSLIBFILE. This means that a single, nonsensical solvation group was assigned to acetyllysine once it was treated as a small molecule via FMCSC_MOL2_REFMOL. Thus, we have to ignore this score. Similarly, you will find that the charge group detection often fails to partition the molecules perfectly with the chosen tolerance, see for example "bf47e41dcce3df21fb232a6de7ef8744". It is an important caveat when working with libraries of unknown small molecules that it will never be possible to avoid all such errors. This is irrespective of the protocols and software used to process them: instead, it is a conceptual limitation of not being able to test for the unknown beforehand.
Find a way to plot histograms of the total energies and individual contributions for the two runs. If you have additional tethered runs, you might want to also try to correlate the scores for specific molecules (note that the tables might have different sets of molecules in different order, though; the order changes when using MPI). You should find that, maybe to your surprise, the net scores from the blind run are not worse than the ones from the tethered run. This says that, at the level of the total energy, the binding site is not a better fit for most of the molecules than a spot on the surface elsewhere. This is surprising in that all molecules contain a specific recognition motif. At the same time, it can be expected because no other property of the molecules is tailored toward the binding site, which is also completely rigid. By imposing prior information and restricting the search space using the tether, we are proactively preventing better binding spots from being found.
We do have some systematic issues, though. First, you can plot the "NETCHARGE" column against the "TOTAL" column and you will likely find that there is a negative correlation. Such net charge biases are very common in docking with physics-based scoring functions, and they result from missing appropriate corrections, see the 2020 reference paper. Our receptor is in fact negatively charged, but the signature is much stronger than it should be because there are no counterions. Second, if you check the score decompositions, you will find that the blind run most likely has a few data points with very favorable polar interactions (add "POLAR" and "IMPSOLV" for this) but substantially less negative nonpolar interactions (add "ATTLJ" and "IPP") on average. The more favorable nonpolar contributions are the direct result of forcing molecules into a largely hydrophobic binding site through tethering. Third, and this is not evident here, these data are based on finite-temperature sampling and highly stochastic. Without independent reruns, it will be very hard to adjudicate how systematic any of the observed effects really are. For further illustration, you might want to identify the best-scoring molecules in both runs and inspect them visually to rationalize these favorable scores. If we treat the scores thermodynamically, and there are poses for a given molecule that produce dramatically worse scores than others, it means that these poses would not contribute anything significant to a binding equilibrium between this small molecule and the receptor and should thus be ignored. This conclusion always comes with an asterisk in docking due to missing corrections, such as net charge or entropy effects. Limitations of the rigid complex approximation are overcome most commonly by adding correction terms or turning to explicit free energy calculations on unconstrained systems, see Tutorial 7.
Step 7 - Introducing receptor flexibility and collecting more information
In the final part of the tutorial, we repeat the calculation from step 5 above but add some flexibility to side chains in the binding site. As mentioned before, a negative value for keyword FMCSC_MOL2FOCUS is interpreted as a minimum distance (in absolute value) that all atoms moved by a particular degree of freedom ("moved" in the sense of CAMPARI's internal coordinate space samplers and their respective base of motion, see FMCSC_ALIGN) have to be within of for every atom in the reference molecule. Our reference ligand is relatively small, and a base value for its extent/span is about 7-8Å. Thus, change/add these keywords in tutorial16.key:FMCSC_NRSTEPS 15000 # change existing keyword
FMCSC_CYCLE_DYN_MIN 250 # change existing keyword
FMCSC_CYCLE_DYN_MAX 250 # change existing keyword
FMCSC_MOL2FILE <full path to folder>/campari/examples/tutorial16/t16_library.mol2 # change existing keyword
FMCSC_MOL2SCOREONLY 0 # change existing keyword
FMCSC_MOL2FOCUS -12.0 # in Angstrom, change existing keyword
FMCSC_MOL2AUXINDEX automatic # special mode
FMCSC_CHIFREQ 0.5
FMCSC_CHISTEPSZ 5.0 # relatively small step size for local moves
In addition to turning on receptor flexibility, we restore the other small molecule-specific settings. Furthermore, we increase the number of steps (since there are more degrees of freedom to sample now), we make sure that there are several alternating segments of Monte Carlo and dynamics after the first Monte Carlo segment, and we enable side chain pivot moves specific to biopolymers. In conjunction with our choice of 4 for keyword FMCSC_ALIGN, we have to be careful, though. Any Monte Carlo moves on the receptor that could move the rest of the protein rather than the few atoms at the end of a side chain would destroy our frame of reference, which is essential for the position restraints. This restricts us to the use of the aforementioned side chain moves, which do not respond to FMCSC_ALIGN and to single dihedral angle-pivot moves used in conjunction with fixed or automatically determined constraints on receptor degrees of freedom. Constraints acting on the rigid-body degrees of freeom of the receptor have an indirect effect here since they prevent the longer end from swiveling around if the rigid-body motion of the parent molecule is meant to be constrained.
Of course, many combinations of molecules and binding sites will remain incompatible even if flexibility is expanded, in particular when the binding site is narrow. In these cases, it is worth trying to recognize this incompatibility as quickly as possible, which, outside of rigid molecules in rigid binding sites, is unfortunately difficult. CAMPARI offers a type of filtering to address this as follows. We turn it on here to illustrate the mechanism. For the specific case at hand, it is probably harmful because very few (if any) of the molecules in the set are strictly incompatible with the binding site:
FMCSC_TMD_RELAX 100.0 # change existing keyword
FMCSC_TMD_NRRELAX 150 # change existing keyword
FMCSC_MOL2PRUNEMODE 4 # permissive
The procedure works analogously to the initial structure relaxation and specifically detects and addresses degrees of freedom that are subject to a high force and are allowed to move. Notably, this excludes the rigid-body coordinates of all molecules at this stage. In this run, it will thus operate only on torsional degrees of freedom in the ligand and on side chain degrees of freedom in the receptor. The danger with any such approach is obvious: to discard good candidates for bad reasons, here an incomplete attempt at relaxation.
In order to collect information, especially if receptor degrees of freedom are kept mobile, it is not a particularly meaningful protocol to just take the last snapshot as the rigid complex if the run employed finite temperature sampling (this would make more sense in minimization). During the run, the minimum energy structure can be collected by changing the setting for keyword FMCSC_MOL2OUTMODE, but this might actually be misleading. As discussed already in step 4, the reason is that this energy is not the score we calculated above but simply the total energy of the complex. In addition, the distribution of the minimum is sampled unreliably even for few degrees of freedom since, by definition, it corresponds to an unlikely fluctuation. The more appropriate logic with sampling at finite temperature is to identify states that are visited frequently and to extract representative snapshots for these "basins". Fortunately, this is relatively easy to do for a system where very little moves and the coordinate frame is external and absolutely fixed. In CAMPARI, we can add these keywords:
FMCSC_EQUIL 10000 # change existing keyword
FMCSC_MOL2OUTMODE 1 # change existing keyword
FMCSC_MOL2CLUMODE 2 # work with a reduced set of atoms
FMCSC_CFILE mock.cfi
FMCSC_CCOLLECT 10
FMCSC_CMODE 5 # tree-based algorithm
FMCSC_CALIGN 0 # no alignment necessary
FMCSC_CDISTANCE 5 # RMSD
FMCSC_CRADIUS 1.5 # in A
FMCSC_CMAXRAD 6.0 # in A
FMCSC_MOL2THRESH 0.1 # at least 10%
You might recognize the clustering-related keywords from other tutorials like Tutorial 11 or Tutorial 14. Here, we request tree-based clustering, collecting data every 10 steps, with thresholds ranging from 1.5Å to 6.0Å (the former being most relevant). We should not perform pose alignment because the frozen parts of the receptor provide a fixed reference frame. We need to supply a mock input file to FMCSC_CFILE. The two specific keywords are the one selecting how to automatically pick the features for different molecules. Here, option 2 with RMSD implies that only such atoms are used that define the global base of motion or a dihedral angle. This set should be sufficient, but it should not give substantially different results to just use all heavy atoms (option 1). Finally, keyword FMCSC_MOL2THRESH instructs CAMPARI to save the poses corresponding to the cluster centroids of any cluster that encompasses at least 10% of all snapshots. This is risky since there might be cases for which no cluster fulfills this criterion, which is why we also save the minimum (complex total) energy pose via option 1 for FMCSC_MOL2OUTMODE.
Now we can create the mock input file and run again:
echo "1998" > mock.cfi
campari_threads -k tutorial16.key > t16_tetheredflexiblerun_threads.log
mv yourcalc.mol2 tetheredflexible_run.mol2
mv MOL2FOCUS.frz tetheredflexible_run.frz
mv MOL2EXTRAATOMS.idx tetheredflexible_run.aux
chimera t16_complex_relaxed.pdb tetheredflexible_run.mol2
Or:
echo "1998" > mock.cfi
mpirun -np XXX campari_mpi -k tutorial16.key > t16_tetheredflexiblerun_mpi.log
cat N_*_yourcalc.mol2 > tetheredflexible_run.mol2
mv N_001_MOL2FOCUS.frz tetheredflexible_run.frz
mv N_001_MOL2EXTRAATOMS.idx tetheredflexible_run.aux
chimera t16_complex_relaxed.pdb tetheredflexible_run.mol2 # use pymol if it is too cluttered
To produce interpretable results, receptor flexibility brings some additional factors to consider. First and foremost, the protein conformations for every molecule are now more induced fit-like and they are different for every small molecule. To get an idea of how much individual side chains actually move, the superposition in UCSF Chimera is useful (if you want to look at individual small molecules, use Pymol instead, which by default lets you step through the entries in the mol2-file). We have to store these auxiliary coordinates in order to be able to calculate scores for these poses later. The way CAMPARI handles this is by embedding the coordinates of mobile receptor atoms in the main output file of the run. Have a look at this file as it is being produced to understand this. The indices of receptor atoms to include can either be selected manually through an input file or automatically as we have chosen here (this is the role of keyword FMCSC_MOL2AUXINDEX). Later on, we will have to use FMCSC_MOL2AUXINPUT with the same index file to tell CAMPARI that the mol2-file it reads contains more than just the information about the small molecules. Second, receptor flexibility incurs the possibility of inducing strain within the receptor. This strain is an energetic contribution and needs to be accounted for when calculating scores since it differs between different small molecules. Whenever a source of strain arises (here, an incompatibility between tether, binding site geometry, and small molecule geometry), this strain will be distributed across the set of coupled degrees of freedom, which is a direct result of the equipartition principle. Third, we have determined a set of constraints from the reference ligand's coordinates. It is important to realize that our reference ligand was partially reconstructed and only crudely relaxed. Re-running the ligand preparation might give a slightly different set of constraints. Thus, in general, it is important to store these constraints (so that they can be reused later, see file "tetheredflexible_run.frz") and to rely on more definite information such as a crystal structure with a complete ligand.
To get the scores for this run, we proceed as before with two additions as follows:
FMCSC_MOL2SCOREONLY 1 # change existing keyword
FMCSC_MOL2FILE tetheredflexible_run.mol2 # change existing keyword
FMCSC_MOL2AUXINPUT tetheredflexible_run.aux # what receptor atoms are in the mol2-file
FMCSC_MOL2_PDB_RELAXED t16_complex_relaxed.pdb
When CAMPARI is reading through the generated poses, it will now use the information from the file with the indices to specifically adjust the coordinates of mobile receptor atoms before calculating the complex energy. Conversely, for the receptor-only energy, it will use the structure supplied via FMCSC_MOL2_PDB_RELAXED. Unlike before, this now important also for the ranking rather than for just the absolute values as explained above. Run:
campari -k tutorial16.key > t16_tetheredflexiblerun_scores.log
/bin/bash <full path to folder>/campari/examples/tutorial16/get_scores.sh t16_tetheredflexiblerun_scores.log "Energies-Difference:" > t16_tetheredflexiblerun.sco
You will note that there are now multiple stored poses with scores per molecule. This means that we would have to make a decision on how to construct a score from this information (the by far most common version would be to just pick the minimum). Looking at the scores, you should find that in almost all cases the pose tagged "_MINIMUM" also has the lowest predicted binding free energy. If you plot a histogram of the values, you will likely encounter a broader distribution reaching more negative values. To understand this better, take a molecule where the lowest recorded score has dropped considerably relative to the data in "t16_tetheredrigidrun.sco", then look at them both structurally and in terms of score details. Are the poses significantly different? Is the protein conformation significantly different? Which terms give rise to the change?
Epilogue:
There are many parameters, settings, and even conceptual choices to consider in this tutorial. Some questions or tasks that might be of interest for you to pursue are the following:- Repeat the workflow with a different reference ligand. You could ask CAMPARI to build a short oligopeptide instead (this might require modifying the mol2-file of the ligand by hand).
- Modify the tether settings, FMCSC_MOL2ASSIMILAR and FMCSC_MOL2RMSDTHRESH, to try placing also different moieties in the binding site. To try out the tethering itself relatively quickly, you should set the number of simulation steps to 1 and only proceed to a full calculation once you have tuned the parameters to your liking.
- Collect average energies (→ FMCSC_MOL2ENMODE) and find a way to use them in a score calculation. You can no longer use the same route as in Step 6, which is specific to single-point energies.
- Account for ligand strain by sampling the ligands in the absence of the receptor and collecting their energies. You will probably find that the logistics of safely associating scores from different calculations via names or orders of molecules quickly become tedious, in particular when there are exceptions (like aborted runs). This is where good automation is hugely beneficial.
- Get an idea of how statistically robust our scores can be for the limited amount of sampling. Most docking approaches do not rely on finite-temperature sampling or other noisy estimators, and some might even have zero statistical errors.
- Try to manually define a set of flexible binding site residues that is better at accommodating molecules than the one defined by the (relatively small) reference ligand. If you did the visualization in Step 7, you should have noted that the flexibility of the receptor is very modest.