Tutorial 9: Using Umbrella Sampling to Assess the Ability of Polyglutamine (Q30) to Access Conformations with High β-Content


Polyglutamine (polyQ) expansions have been implicated in several neurodegenerative diseases. These expansions are long, continuous repeats of glutamine (Q) residues present in several different disease related proteins. In healthy individuals, several proteins have polyQ expansions, but the expansion is not long enough to cause disease. For example, the wild-type protein "huntingtin" implicated in Huntington's disease (HD) contains a small polyQ repeat which does not cause disease. If this expansion becomes longer than a threshold of around 35 glutamine residues, then Huntington's disease (HD) develops within the human life span. The longer the polyQ expansion, the earlier affected individuals develop symptoms and the more severe the disease.

Individuals with polyQ expansion diseases contain polyQ-rich protein aggregates in the brain. These aggregates are thought to cause neuronal cell death. These Q-rich aggregates are rich in β-sheet. However, simulation and experimental evidence suggests that individual polyQ peptides (monomers) are disordered, meaning that they do not adopt any repeating structure (implying that they do not contain any discernible amount of β-sheet). In earlier tutorials, you have simulated polyQ monomers. A good exploratory exercise would be to go back to these simulations and query the amount of β-sheet you observed (aside from trajectory data, potential output files helpful with this are DSSP.dat or ZSEC_HIST.dat). It should be fairly low.

If the β-sheet content of polyQ monomers is low, then why are large polyQ-rich aggregates rich in β-sheet? Simplistically, one can hypothesize that the high β-sheet content arises from two limiting cases, viz., a) monomers spontaneously forming β-sheet rich structures, then aggregating, or b) polyQ forming disordered aggregates before the intermolecular interactions and flanking protein sequences cause a conversion to β-sheet rich aggregates. The focus of this tutorial is to evaluate the free energy penalty for monomeric polyQ to form structures with high β-content. Note that this does not allow us to pick which hypothesis is correct, but is the first step in collecting evidence to build a case for which appears more plausible.

To understand the context of these two hypotheses and the bigger picture of this research we direct you to work from the Wetzel group (evidence for hypothesis a) and to recent computational and experimental work that provides support for hypothesis b. There is additional evidence that flanking sequences play an important role in altering the mechanism of polyQ aggregation. The main body of work from which this tutorial is derived is found in the following publication: Thermodynamics of β-Sheet Formation in Polyglutamine. Given that the focus of this tutorial is to learn how to use umbreall sampling in CAMPARI with a nontrivial restraint potential, most of the interesting and salient points of the research on which this tutorial is based are either glossed over or not covered at all.

Use of umbrella sampling and structural restraints:

Technical clarifications

Up to this point, we have been referring to β-sheet content, which can be, for example, defined by the fraction of residues that make a specific pattern of intra-backbone hydrogen bonds. A precise definition is offered by Kabsch and Sander's DSSP algorithm and corresponds to the fraction of residues assigned to the "E", or "extended" structural classifier. β-Content, which is discussed below, is a superset (more generic) of β-sheet content and is the fraction of residues whose backbone φ/ψ-angles are contained in the β-basin of the Ramachandran map. The exact definition is given in the documentation on FMCSC_SC_ZSEC.

Existing experimental evidence suggests that polyQ forms disordered, collapsed structures with almost no secondary structure content. The goal of this tutorial is to ascertain the free energy penalty associated with a polyQ monomer adopting high β-content conformations. Thus, we need to know the relative probabilities of sampling high β-content conformations vis-à-vis those of visiting disordered conformations. These structural transitions are exceedingly rare and need to be well sampled to obtain accurate statistics. Using brute-force, equilibrium sampling approaches to study these rare events is highly inefficient and generally infeasible. To solve this problem, we use an artificial potential which biases sampling toward these rare conformational states, which enables us to obtain reliable statistics on these states and reconstruct the potential of mean force (PMF) along a reaction coordinate measuring β-content.

A harmonic restraint potential on global secondary structure content

We use a harmonic structural restraint potential U to bias sampling of conformations with high β content. Our total potential Utot then becomes:

Utot = U + Wabs

where Wabs is the unbiased ABSINTH potential energy function. The documentation for keyword FMCSC_SC_ZSEC provides the definition for U (and the keyword itself is the "outside" scale factor for this term).

fβ (the global β-content) is a progress variable or reaction coordinate that corresponds to the fraction of residues that sample backbone dihedral angles from the β-basin in φ,ψ-space. fβ ranges from 0 to 1. Proteins which are almost entirely β-sheet have fβ values between 0.7 and 0.8 (the necessary presence of turns makes the values significantly smaller than 1.0). For a more detailed description of fβ, we refer you to the following work: Thermodynamics of β-Sheet Formation in Polyglutamine. There exists a more recent application using the fβ biasing potential to describe the monomeric Alzheimer's amyloid-β peptide. U is used to restrain a simulation to a particular target value of fβ, referred to as fβ0. Thus, given a large enough force constant, the fβ distribution of a restrained simulation should be approximately peaked at its value of fβ0.

To ensure adequate sampling in fβ space, we will run multiple simulations each at a different target fβ0 value. The set of fβ0 values we will use are: {0.0, 0.1, 0.25, 0.3, 0.4, 0.5, 0.6, 0.75, 0.8, 0.9, 1.0}. Replica exchange is used to further enhance sampling. If you have completed the tutorial on the FS-peptide, you already have some experience with using replica exchange. However, with this exercise, we exchange coordinates across simulations with different fβ0 values as opposed to different temperatures. For this tutorial, all of the data will be collected at a constant temperature of 298K.

We will post-process the simulation results using the WHAM method to obtain an unbiased probability distribution in fβ space. This allows us to compute the PMF as a function of fβ and answers the original question of what is the free energy penalty associated with accessing conformations with high β-content (high values of fβ).


Step 1 - Build the appropriate key-file

Download the following key-file: fb.key and save it as fb.key in an empty folder for this tutorial. Open the key-file and let us look at the sections most relevant for this simulation:


We will use a spherical droplet (FMCSC_SHAPE is 2) with a radius of 150Å and the setting for FMCSC_BOUNDARY of 4, which corresponds to the atom-based soft-wall boundary conditions (a good choice for most systems using continuum models).

FMCSC_EQUIL 10000000

This simulation will be run a total of 4x107 Monte Carlo (MC) steps. The first 107 of these steps will be considered equilibration, and will not be included in analysis output files and other system averages. Because we are using a restraint potential, discarding initial data points is particularly important (the system needs time to "find" values close to the target).


As mentioned, FMCSC_SC_ZSEC is the outside scaling factor for the U harmonic restraint potential and a value of zero would disable it. FMCSC_ZS_FR_KB is the kβ value (force constant = restraint strength) in the restraint potential. If we have restraints only on β-content, the use of FMCSC_SC_ZSEC and FMCSC_ZS_FR_KB for scaling the potential is partially redundant (but both need to be specified as nonzero values since the product is taken). This potential is applied over all residues of the system (here, Q30) simultaneously, meaning that the average restraint strength per residue is FMCSC_ZS_FR_KB·N-1 where N is the number of residues. When comparing systems with different chain lengths, one should probably have an approximately constant value for FMCSC_ZS_FR_KB·N-1, thus FMCSC_ZS_FR_KB should be adjusted accordingly.


The fβ histogram is updated, i.e., data are accumulated, every 100 MC steps. It is beneficial for this to happen at very high frequency since this calculation is not cost-intensive, and since we are interested in obtaining the best statistics possible. These distributions will serve as raw data in the WHAM post-processing step.


The FMCSC_REMC keyword enables replica exchange. By setting FMCSC_RENBMODE to 2, only neighbor swaps are attempted. This is common protocol for replica exchange simulations. Allowing non-neighbor swaps may sometimes provide further sampling boosts but can also complicate subsequent analyses (the proper ensemble re-weighting techniques become mandatory, see documentation on FMCSC_FRAMESFILE). For the system here, non-neighbor swaps would likely have very low acceptance rates anyway, which is due to the restraint potentials being harmonic but with clear differences in minimum position. The frequency of exchanges (swap cycles) is controlled by the FMCSC_REFREQ keyword. Lastly, FMCSC_RESWAPS refers to the number of swaps that will be attempted for each swap cycle and is chosen to be equal to the number of possible neighbors (FMCSC_REPLICAS - 1) here.


This file defines the quantity that is to be swapped across replicas which is defined on the first line of this input file. Note that the first line is NOT the total number of replicas (set by FMCSC_REPLICAS). The remaining lines define the target fβ0 values for each replica. This automatically sets the value normally set through keyword FMCSC_ZS_FR_B for each replica. Download remc.in and place it in the same directory as the key-file.


Download the sequence file, q30.in, and place it in the same directory as the key-file. This defines the system that CAMPARI will simulate, which here is a polypeptide with 30 glutamine residues. The ends are capped with acetyl (ACE) and N-methylamide (NME) groups, respectively, neither of which carries a net charge. The starting configuration is drawn randomly from an excluded volume-like ensemble trying to avoid steric overlaps. The completely random nature of the protocol used to generate these structures ensures that there is no conformational bias engendered by the starting configuration. The last thing to do is to adjust the explicit paths for PARAMETERS and FMCSC_BBSEGFILE in the key-file.

Step 2 - Launch the simulation

Use of replica exchange calculations requires a CAMPARI executable compiled with MPI. See the installation instructions to set this up correctly. Here, we require 11 replicas. Let us assume we have compute nodes available with 2 8-core CPUs each. If each replica runs on an entire CPU, we would need 5.5 nodes. If nodes can only be used as a whole, it may be useful to reconsider the fβ schedule and add a replica. As an MC calculation, the scaling with the number of threads will be limited here, and it may be useful to first test with serial runs the parallel efficiency. Note that keyword NRTHREADS only works with non-MPI OpenMP code. For the hybrid executable, we need to use the environment variable defined by the OpenMP standard (this is explained in more detail in Tutorial 7. A direct invocation of "mpirun" as part of a job submission script may look like this if we assume OpenMPI syntax:

export CAMPDIR=<full path to CAMPARI bin-directory>
mpirun -np 11 -npernode 2 ${CAMPDIR}/campari_mpi_threads -k fb.key >& masterlog

These are only the minimal settings that are required. The command should work if the job scheduler (like SGE) granted the job 5.5 nodes and handles the distribution of resources to individual MPI processes intelligently by default. This may not always work as desired, and thus the mapping may have to be done explicitly. An example with the srun command of the SLURM scheduler (which abstracts the mpirun command) is as follows:

export CAMPDIR=<full path to CAMPARI bin-directory>
export SLURM_CPU_BIND=verbose
srun --unbuffered --ntasks-per-node 2 --cpus-per-task 8 --ntasks-per-core 1 -n 11 --cpu_bind=sockets ${CAMPDIR}/campari_mpi_threads -k fb.key >& masterlog

Please refer to the documentation of the scheduler and/or MPI library you use for the necessary information. More explanations on the above example are provided in Tutorial 7.

Step 3 - Sanity checking the simulation (before you waste too much time)

First of all, make sure that you have simulated what you expect. The fβ0 value is assigned one-to-one to the node number: replica 1 (N_000_* output) corresponds to fβ0=0, while replica 11 (N_010_* output) corresponds to fβ0=1.0. A quick sanity check is to plot the restraint energy potential found in instantaneous energy output files. Its average should be higher for unlikely values (here, values of fβ0 close to 1.0). Note that the simulations start from excluded volume-informed coil-like structures, which means that there is no major barrier toward reaching a plateau level for the (purely torsional) potential from the initial configurations. In fact, the slow process is to find states that simultaneously allow chain compaction (ABSINTH water is a poor solvent under the chosen conditions) and satisfy the restraints. Once the equilibration period has passed, it is also possible to look at the simulation trajectory, which will allow a visual confirmation that, for example, there is minimal β-content for low values of fβ0 and high β-content (polypeptide backbone forms linear segments, or there are β-sheets present) for large fβ0. Visualization is most conveniently done using the CAMPARI-generated VMD scripts. From the directory containing the simulation output, issue, for example, the following command, which of course assumes presence of VMD:

vmd -e N_010___VIS.vmd

Note that there are in fact 3 underscores between "010" and "VIS". This is due to the selection of "_" as the simulation's basic name in the supplied key-file. From within the VMD console, the command "align_it" is available, which will perform an RMSD alignment of the polypeptide across all frames to ease visualization. Note that if you do this when the simulation is still running, there may be several drawing artifacts (long, nonsensical bonds) that occur because the pdb-file of the initial structure is used as a template; this issue resolves itself once the final pdb-file can be used, i.e., upon completion of the simulation. If you copy trajectory files, make sure to also copy the available initial and/or final pdb files as they are required by VMD as a template to interpret the binary trajectory data. It is also perfectly possible to analyze (from a separate directory) trajectories of running simulations using CAMPARI's trajectory analysis mode, for example to obtain current copies of output files like ZSEC_HIST.dat. In case any of these checks yielded something unexpected, there may be issues with your key-file. Open the N_0xx.log files to review what may have gone wrong.

Step 4 - Output and Analysis

If you have not already done so, copy the supplied R-script to the same directory as the N_*_ZSEC_HIST.dat output files. This is the only required information from the simulation. Umbrella sampling runs are typically analyzed using the weighted histogram analysis method (WHAM). WHAM iteratively determines the relative free energies of neighboring windows exploiting the fact that, with knowledge of the exact nature of the restraint potential, the overall unbiased probability must be computable from the relative free energies and vice versa. Discretization errors on the restraint potential are typically neglected. In comparison to other relative free energy estimates such as those covered in Tutorial 7, WHAM has the advantage of only requiring reaction coordinate histograms. In case you made any modifications to the tutorial regarding the choice of window centers in remc.in, regarding simulation temperature, regarding the force constant, or regarding the scale factor for ZSEC, those will have to be adjusted in the R-script. To obtain the WHAM reconstruction the potential of mean force (PMF) from the raw (biased) histograms in fβ, start R in your working directory and do:


From the script, you will (hopefully) obtain two plots. The first is a simple plot of the biased distributions underlying the analysis. The second is a plot of the PMF as a function of fβ along with the free energies of the windows (relative to a common reference) as a function of fβ0 (note that these two horizontal axes are not the same). This will allow you to assess the free energy difference between conformations with high and low values of β-content. Be advised that the results you are analyzing come from only one independent replica exchange simulation and may not be sampled enough, i.e., the noise level may be high. Better statistics could of course be obtained by running multiple independent replica exchange simulations and averaging the results. You should find that the PMF along fβ as reaction coordinate indicates a large free energy barrier to accessing configurations with a high amount of β-content. This suggests that for monomeric Q30 (and probably for similar chain lengths as well), the adoption of structures high in β-content is rare. This provides some limited evidence that the high β-sheet content of polyQ-rich aggregates may be a consequence of intermolecular interactions and not from assembly of well-ordered monomers. For more details, see our publication from 2009 on this topic. It is left as an exercise to use trajectory analysis mode (which can be run in parallel as well, see for example FMCSC_REMC) to calculate a block averaging error estimate for the PMF. This entails analyzing trajectories in separate blocks, reading block-specific fβ histograms into R, and calculating (and plotting) a standard deviation across the selected number of blocks.
Design downloaded from free website templates.