Tutorial 9: Using Umbrella Sampling to Assess the Ability of Polyglutamine (Q30) to Access Conformations with High β-Content
Preface:
Polyglutamine (polyQ) expansions have been implicated in several neurodegenerative diseases. These expansions are long, continuous repeats of glutamine (Q) residues present in several different disease related proteins. In healthy individuals, several proteins have polyQ expansions, but the expansion is not long enough to cause disease. For example, the wild-type protein "huntingtin" implicated in Huntington's disease (HD) contains a small polyQ repeat which does not cause disease. If this expansion becomes longer than a threshold of around 35 glutamine residues, then Huntington's disease (HD) develops within the human life span. The longer the polyQ expansion, the earlier affected individuals develop symptoms and the more severe the disease.
Individuals with polyQ expansion diseases contain polyQ-rich protein aggregates in the brain. These aggregates are thought to cause neuronal cell death. These Q-rich aggregates are rich in β-sheet. However, simulation and experimental evidence suggests that individual polyQ peptides (monomers) are disordered, meaning that they do not adopt any repeating structure (implying that they do not contain any discernible amount of β-sheet). In earlier tutorials, you have simulated polyQ monomers. A good exploratory exercise would be to go back to these simulations and query the amount of β-sheet you observed (aside from trajectory data, potential output files helpful with this are DSSP.dat or ZSEC_HIST.dat). It should be fairly low.
If the β-sheet content of polyQ monomers is low, then why are large polyQ-rich aggregates rich in β-sheet? Simplistically, one can hypothesize that the high β-sheet content arises from two limiting cases, viz., a) monomers spontaneously forming β-sheet rich structures, then aggregating, or b) polyQ forming disordered aggregates before the intermolecular interactions and flanking protein sequences cause a conversion to β-sheet rich aggregates. The focus of this tutorial is to evaluate the free energy penalty for monomeric polyQ to form structures with high β-content. Note that this does not allow us to pick which hypothesis is correct, but is the first step in collecting evidence to build a case for which appears more plausible.
To understand the context of these two hypotheses and the bigger picture of this research we direct you to work from the Wetzel group (evidence for hypothesis a) and to recent computational and experimental work that provides support for hypothesis b. There is additional evidence that flanking sequences play an important role in altering the mechanism of polyQ aggregation. The main body of work from which this tutorial is derived is found in the following publication: Thermodynamics of β-Sheet Formation in Polyglutamine. Given that the focus of this tutorial is to learn how to use umbreall sampling in CAMPARI with a nontrivial restraint potential, most of the interesting and salient points of the research on which this tutorial is based are either glossed over or not covered at all.
Use of umbrella sampling and structural restraints:
Technical clarificationsUp to this point, we have been referring to β-sheet content, which can be, for example, defined by the fraction of residues that make a specific pattern of intra-backbone hydrogen bonds. A precise definition is offered by Kabsch and Sander's DSSP algorithm and corresponds to the fraction of residues assigned to the "E", or "extended" structural classifier. β-Content, which is discussed below, is a superset (more generic) of β-sheet content and is the fraction of residues whose backbone φ/ψ-angles are contained in the β-basin of the Ramachandran map. The exact definition is given in the documentation on FMCSC_SC_ZSEC.
Existing experimental evidence suggests that polyQ forms disordered, collapsed structures with almost no secondary structure content. The goal of this tutorial is to ascertain the free energy penalty associated with a polyQ monomer adopting high β-content conformations. Thus, we need to know the relative probabilities of sampling high β-content conformations vis-à-vis those of visiting disordered conformations. These structural transitions are exceedingly rare and need to be well sampled to obtain accurate statistics. Using brute-force, equilibrium sampling approaches to study these rare events is highly inefficient and generally infeasible. To solve this problem, we use an artificial potential which biases sampling toward these rare conformational states, which enables us to obtain reliable statistics on these states and reconstruct the potential of mean force (PMF) along a reaction coordinate measuring β-content.
A harmonic restraint potential on global secondary structure contentWe use a harmonic structural restraint potential Ufβ to bias sampling of conformations with high β content. Our total potential Utot then becomes:
Utot = Ufβ + Wabs
where Wabs is the unbiased ABSINTH potential energy function. The documentation for keyword FMCSC_SC_ZSEC provides the definition for Ufβ (and the keyword itself is the "outside" scale factor for this term).
fβ (the global β-content) is a progress variable or reaction coordinate that corresponds to the fraction of residues that sample backbone dihedral angles from the β-basin in φ,ψ-space. fβ ranges from 0 to 1. Proteins which are almost entirely β-sheet have fβ values between 0.7 and 0.8 (the necessary presence of turns makes the values significantly smaller than 1.0). For a more detailed description of fβ, we refer you to the following work: Thermodynamics of β-Sheet Formation in Polyglutamine. There exists a more recent application using the fβ biasing potential to describe the monomeric Alzheimer's amyloid-β peptide. Ufβ is used to restrain a simulation to a particular target value of fβ, referred to as fβ0. Thus, given a large enough force constant, the fβ distribution of a restrained simulation should be approximately peaked at its value of fβ0.
To ensure adequate sampling in fβ space, we will run multiple simulations each at a different target fβ0 value. The set of fβ0 values we will use are: {0.0, 0.1, 0.25, 0.3, 0.4, 0.5, 0.6, 0.75, 0.8, 0.9, 1.0}. Replica exchange is used to further enhance sampling. If you have completed the tutorial on the FS-peptide, you already have some experience with using replica exchange. However, with this exercise, we exchange coordinates across simulations with different fβ0 values as opposed to different temperatures. For this tutorial, all of the data will be collected at a constant temperature of 298K.
We will post-process the simulation results using the WHAM method to obtain an unbiased probability distribution in fβ space. This allows us to compute the PMF as a function of fβ and answers the original question of what is the free energy penalty associated with accessing conformations with high β-content (high values of fβ).
Step-by-Step:
Step 1 - Build the appropriate key-fileDownload the following key-file: fb.key and save it as fb.key in an empty folder for this tutorial. Open the key-file and let us look at the sections most relevant for this simulation:
FMCSC_SHAPE 2
FMCSC_SIZE 150
FMCSC_BOUNDARY 4
We will use a spherical droplet (FMCSC_SHAPE is 2) with a radius of 150Å and the setting for FMCSC_BOUNDARY of 4, which corresponds to the atom-based soft-wall boundary conditions (a good choice for most systems using continuum models).
FMCSC_NRSTEPS 40000000
FMCSC_EQUIL 10000000
This simulation will be run a total of 4x107 Monte Carlo (MC) steps. The first 107 of these steps will be considered equilibration, and will not be included in analysis output files and other system averages. Because we are using a restraint potential, discarding initial data points is particularly important (the system needs time to "find" values close to the target).
FMCSC_SC_ZSEC 1.0
FMCSC_ZS_FR_KB 150
As mentioned, FMCSC_SC_ZSEC is the outside scaling factor for the Ufβ harmonic restraint potential and a value of zero would disable it. FMCSC_ZS_FR_KB is the kβ value (force constant = restraint strength) in the restraint potential. If we have restraints only on β-content, the use of FMCSC_SC_ZSEC and FMCSC_ZS_FR_KB for scaling the potential is partially redundant (but both need to be specified as nonzero values since the product is taken). This potential is applied over all residues of the system (here, Q30) simultaneously, meaning that the average restraint strength per residue is FMCSC_ZS_FR_KB·N-1 where N is the number of residues. When comparing systems with different chain lengths, one should probably have an approximately constant value for FMCSC_ZS_FR_KB·N-1, thus FMCSC_ZS_FR_KB should be adjusted accordingly.
FMCSC_SEGCALC 50
The fβ histogram is updated, i.e., data are accumulated, every 50 MC steps. It is beneficial for this to happen at very high frequency since this calculation is not cost-intensive, and since we are interested in obtaining the best statistics possible. These distributions will serve as raw data in the WHAM post-processing step.
FMCSC_MPIAVG 0
FMCSC_REMC 1
FMCSC_REPLICAS 11
FMCSC_REDIM 1
FMCSC_RENBMODE 2
FMCSC_RESWAPS 10
FMCSC_REFREQ 10000
The FMCSC_REMC keyword enables replica exchange. By setting FMCSC_RENBMODE to 2, only neighbor swaps are attempted. This is common protocol for replica exchange simulations. Allowing non-neighbor swaps may sometimes provide further sampling boosts but can also complicate subsequent analyses (the proper ensemble re-weighting techniques become mandatory, see documentation on FMCSC_FRAMESFILE). For the system here, non-neighbor swaps would likely have very low acceptance rates anyway, which is due to the restraint potentials being harmonic but with clear differences in minimum position. The frequency of exchanges (swap cycles) is controlled by the FMCSC_REFREQ keyword. Lastly, FMCSC_RESWAPS refers to the number of swaps that will be attempted for each swap cycle and is chosen to be equal to the number of possible neighbors (FMCSC_REPLICAS - 1) here.
FMCSC_REFILE remc.in
This file defines the quantity that is to be swapped across replicas which is defined on the first line of this input file. Note that the first line is NOT the total number of replicas (set by FMCSC_REPLICAS). The remaining lines define the target fβ0 values for each replica. This automatically sets the value normally set through keyword FMCSC_ZS_FR_B for each replica. Download remc.in and place it in the same directory as the key-file.
FMCSC_SEQFILE q30.in
Download the sequence file, q30.in, and place it in the same directory as the key-file. This defines the system that CAMPARI will simulate, which here is a polypeptide with 30 glutamine residues. The ends are capped with acetyl (ACE) and N-methylamide (NME) groups, respectively, neither of which carries a net charge. The starting configuration is drawn randomly from an excluded volume-like ensemble trying to avoid steric overlaps. The completely random nature of the protocol used to generate these structures ensures that there is no conformational bias engendered by the starting configuration. The last thing to do is to adjust the explicit paths for PARAMETERS and FMCSC_BBSEGFILE in the key-file.
Step 2 - Launch the simulation
Use of replica exchange calculations requires a CAMPARI executable compiled with MPI. See the installation instructions to set this up correctly. Here, we require 11 replicas. Let us assume we have compute nodes available with 2 8-core CPUs each. If each replica runs on an entire CPU, we would need 5.5 nodes. If nodes can only be used as a whole, it may be useful to reconsider the fβ schedule and add one or more replicas. In this case, keyword FMCSC_REPLICAS has to be adjusted as well and keyword FMCSC_RESWAPS should be update accordingly. As an MC calculation, the scaling with the number of threads will be limited here, and it may be useful to first test with non-MPI runs the efficiency of the threads parallelizaiton for this system. Note that keyword NRTHREADS only works with non-MPI OpenMP code. For the hybrid executable, we need to use the environment variable defined by the OpenMP standard (this is explained in more detail in Tutorial 7. A direct invocation of "mpirun" as part of a job submission script may look like this if we assume OpenMPI syntax:
export CAMPDIR=<full path to CAMPARI bin-directory>
export OMP_NUM_THREADS=8
mpirun -np 11 -npernode 2 ${CAMPDIR}/campari_mpi_threads -k fb.key >& masterlog
These are only the minimal settings that are required. The command should work if the job scheduler (like SGE) granted the job 5.5 nodes and handles the distribution of resources to individual MPI processes intelligently by default. This may not always work as desired, and thus the mapping may have to be done explicitly. An example with the srun command of the SLURM scheduler (which abstracts the mpirun command) is as follows:
export CAMPDIR=<full path to CAMPARI bin-directory>
export SLURM_CPU_BIND=verbose
export OMP_NUM_THREADS=8
srun --unbuffered --ntasks-per-node 2 --cpus-per-task 8 --ntasks-per-core 1 -n 11 --cpu_bind=sockets ${CAMPDIR}/campari_mpi_threads -k fb.key >& masterlog
Please refer to the documentation of the scheduler and/or MPI library you use for the necessary information. More explanations on the above example are provided in Tutorial 7.
Step 3 - Sanity checking the simulation (before you waste too much time)
First of all, make sure that you have simulated what you expect. The fβ0 value is assigned one-to-one to the node number: replica 1 (N_000_* output) corresponds to fβ0=0, while replica 11 (N_010_* output) corresponds to fβ0=1.0. Of course the details will differ if you changed the schedule as mentioned above. A quick sanity check is to plot the restraint energy potential found in instantaneous energy output files. Its average should be higher for unlikely values (here, values of fβ0 close to 1.0). Note that the simulations start from excluded volume-informed coil-like structures, which means that there is no major barrier toward reaching a plateau level for the (purely torsional) potential from the initial configurations. In fact, the slow process is to find states that simultaneously allow chain compaction (ABSINTH water is a poor solvent for polyglutamine under the chosen conditions) and satisfy the restraints. Once the equilibration period has passed, it is also possible to look at the simulation trajectory, which will allow a visual confirmation that, for example, there is minimal β-content for low values of fβ0 and high β-content (polypeptide backbone forms linear segments, or there are β-sheets present) for large fβ0. Visualization is most conveniently done using the CAMPARI-generated VMD scripts. From the directory containing the simulation output, issue, for example, the following command, which of course assumes the presence of VMD:
vmd -e N_010___VIS.vmd
Note that there are in fact 3 underscores between "010" and "VIS". This is due to the selection of "_" as the simulation's basic name in the supplied key-file. From within the VMD console, the command "align_it" is available, which will perform an RMSD alignment of the polypeptide across all frames to ease visualization. Note that if you do this when the simulation is still running, there may be several drawing artifacts (long, nonsensical bonds) that occur because the pdb-file of the initial structure is used as a template; this issue resolves itself once the final pdb-file can be used, i.e., upon completion of the simulation. If you copy trajectory files, make sure to also copy the available initial and/or final pdb files as they are required by VMD as a template to interpret the binary trajectory data. It is also perfectly possible to analyze (from a separate directory) trajectories of running simulations using CAMPARI's trajectory analysis mode, for example to obtain current copies of output files like ZSEC_HIST.dat. In case any of these checks yielded something unexpected, there may be issues with your key-file. Open the N_0xx.log files to review what may have gone wrong.
Step 4 - Analyzing the data
If you have not already done so, copy the supplied R-script to the same directory as the N_*_ZSEC_HIST.dat output files. This is the only required information from the simulation. Umbrella sampling runs are typically analyzed using the weighted histogram analysis method (WHAM). WHAM iteratively determines the relative free energies of neighboring windows exploiting the fact that, with knowledge of the exact nature of the restraint potential, the overall unbiased probability must be computable from the relative free energies and vice versa. Discretization errors on the restraint potential are typically neglected. In comparison to other relative free energy estimates such as those covered in Tutorial 7, WHAM has the advantage of only requiring reaction coordinate histograms. In case you made any modifications to the tutorial regarding the choice of window centers in remc.in, regarding simulation temperature, regarding the force constant, or regarding the scale factor for ZSEC, those will have to be adjusted in the R-script. To obtain the WHAM reconstruction the potential of mean force (PMF) from the raw (biased) histograms in fβ, start R in your working directory and do:
source("tutorial9.R")
From the script, you will (hopefully) obtain two plots. The first is a simple plot of the biased distributions underlying the analysis. The second is a plot of the PMF as a function of fβ along with the free energies of the windows (relative to a common reference) as a function of fβ0 (note that these two horizontal axes are not the same). This will allow you to assess the free energy difference between conformations with high and low values of β-content. Be advised that the results you are analyzing come from only one independent replica exchange simulation and may not be sampled enough, i.e., the noise level may be high. Better statistics could of course be obtained by running multiple independent replica exchange simulations and averaging the results. In fact, if you inspect that trajectories visually you will most likely find that, after the initial phase of the simulation, there is not too much sampling outside of exchange moves. This is a direct consequence of a rugged free energy landscape and a sampler that is not able to follow this landscape very well (because that would require correlations between degrees of freedom). The ruggedness comes from two partially antagonistic restraints: to collapse and to have a prescribed β-content.
Even though the result is noisy, you should find that the PMF along fβ as reaction coordinate indicates a large free energy barrier/penalty to accessing configurations with a high amount of β-content. This suggests that for monomeric Q30 (and probably for similar chain lengths as well), the adoption of structures high in β-content is rare. This provides some limited evidence that the high β-sheet content of polyQ-rich aggregates may be a consequence of intermolecular interactions and not from assembly of well-ordered monomers. For more details, see our publication from 2009 on this topic.
Step 5 - Epilogue: Improving sampling
As an exercise, you should use trajectory analysis mode (which can be run in parallel as well, see for example FMCSC_REMC) to calculate a block averaging error estimate for the PMF. This entails analyzing trajectories in separate blocks, reading block-specific fβ histograms into R, and calculating (and plotting) a standard deviation across the selected number of blocks. While this type of re-sampling strategy is respecting time correlation in the data, it is unaware of the averaging of information provided by the replica-exchange methodology and it does not possess a theoretical distribution to be able to quantify likelihoods of outcomes. This is an extremely common conundrum in molecular simulations: a system's stochastic evolution featuring an overlay of different time scales leads to a data set that, at the end of the day, frequently requires brute-force error estimation by simply rerunning the same calculation multiple times with maximally independent initial conditions. Here, the conditions can be made independent relatively easily because we start from randomized conformations but the same luxury is not available in studies of folded proteins for instance.
To demonstrate to yourself more clearly why analyzing replica-exchange data is not necessarily straightforward, as a second step you should "unmix" the trajectories. This means that we want to generate trajectories which are not continuous in condition (here, parameters of the secondary structure biasing potential) but instead are continuous in configuration (and thus contain swaps in condition). For this, you will need the trace file of the run and the individual trajectories. You must then specify the suffix of the trajectory filenames (here "__traj.xtc" with the default settings from the tutorial) as the main input file and the replica exchange trace as auxiliary input file for a parallel analysis run. Aside from box settings, parameters, sequence file, and replica exchange input file, which should be the same as before, the basic required keywords should read like this:
FMCSC_PDBANALYZE 1 # activate trajectory analysis mode
FMCSC_PDB_FORMAT 3 # .xtc as input
FMCSC_XTCFILE __traj.xtc # the rest of the name will be constructed
FMCSC_TRACEFILE N_000_REXTRACE.dat # back this file up in any case
FMCSC_MPIAVG 0
FMCSC_REMC 1 # need for REX unmixing
FMCSC_REPLICAS 11 # or what you chose
FMCSC_REDIM 1
FMCSC_RE_TRAJOUT 5000 # XYZOUT from production run
FMCSC_RE_TRAJSKIP 10000000 # EQUIL from production run
FMCSC_DISABLE_ANALYSIS 1
FMCSC_BASENAME UNMIXED
FMCSC_NRSTEPS 6000 # from default setting of 40M total - 10M equil.
FMCSC_EQUIL 0
FMCSC_XYZOUT 1
FMCSC_XYZPDB 4 # xtc again
FMCSC_SC_IPP 0.0 # no need to calculate nonbonded energies
FMCSC_SC_ZSEC 1.0 # to avoid errors with old RE input file
To get an idea of how efficiently the Monte Carlo sampler propagates the system, we can use the RMSD values between subsequent trajectory snapshots after this unmixing. This analysis can be directly enabled in the same run and will conveniently align the trajectories as well:
FMCSC_ALIGNCALC -1 # the minus sign means we request files like N_000_RMSD.dat to be written
FMCSC_ALIGNFILE your.idx # you could use the CAMPARI atom indices of all CA, all heavy atoms, etc.
Now execute this calculation in the same vein as the production run (it should only take a few seconds). Have a look at a few of the unmixed trajectories and plot the RMSD traces. You should find that rapid excursions are rare indicating that sampling is problematic. As the final exercise for this tutorial, propose and run an attempt to improve sampling. There would be three main routes to pursue:
- Try to optimize the replica exchange settings and the choices for the bias potential (meaning the artificial components of the calculation). The final PMF should be independent of these settings but the convergence rate is not.
- Try to optimize the propagator itself either by optimizing the Monte Carlo move set or by switching to a hybrid or pure gradient-based propagator.
- Exchange the (Hamiltonian) replica-exchange protocol entirely, for example with alternative methods that promise to calculate exactly the same potential of mean force (flat histogram methods) or with the methods introduced in Tutorial 13.