## Tutorial 2: Doing Science with CAMPARI - Determining the Scaling Exponent for Polypeptides in the Excluded Volume Limit

### Preface:

The inverse power potential (IPP) is the steep repulsive (12^{th}power) part of the Lennard-Jones potential. It describes steric or excluded volume interactions only. Consequently, when all terms of the Hamiltonian are excluded except for the IPP term, the system explores the so-called "excluded volume limit." Essentially, all favorable interaction potentials are removed and sampling is governed exclusively by the requirement for each atom to occupy a certain volume.

The excluded volume limit corresponds to a well-defined reference state in polymer physics. Polymer chains in this limit act like fractals meaning that their self-avoiding local behavior is renormalizable to all length scales. If a solvent is present in addition to the polymer, such a solvent would be called "good" or "perfect" implying that chain-solvent interactions remain favored over chain-chain interactions for all conditions (theoretical limit). In this tutorial, we explore a universal feature of polymer chains in the excluded volume limit,

*viz.*their polynomial and universal scaling law of radius of gyration (R

_{g}) with chain length (measured for instance by N, the number of monomers). We have:

<R

_{g}> = R

_{0}·N

^{γ}

Here, R

_{0}is a factor related to monomer size and local, topology-dependent stiffness, and γ is the scaling exponent that is theoretically predicted to assume a value of ~0.6 in the excluded volume limit. By performing simulations for multiple lengths of a homopolymer, we can perform an

*in silico*experiment testing this prediction. It should be mentioned at the beginning that the exact value of the scaling exponent is a theoretical limit for infinitely long chains. In numerical, particle-based simulations we thus will encounter finite size effects, and these depend primarily on the range of the repulsive potential (if is not a truly athermal limit) and the degree of local flexibility (which is a function of molecular topology). This tutorial aims to demonstrate both of these elements (the theoretical limit and finite size effects) in CAMPARI.

For biomacromolecules, the self-avoiding limit takes its fundamental relevance from the fact that in strong denaturing conditions (such as high concentrations of urea or GdnCl) polypeptide chains have been shown to obey the same scaling law cited above with the same universal exponent (reference). Theoretical work by Tran and Pappu (reference) has explored this similarity computationally. It is - in this context - important to distinguish the Gaussian random coil from the excluded volume limit. A model of the former could be obtained by disabling steric repulsions as well, and it would be described by a universal scaling exponent of γ = 0.5. The Gaussian random coil is most relevant as being somewhat congruent with the θ-point of the canonical polymeric coil-to-globule transition. Feel free to explore this limit as well by introducing the appropriate modification to the key-file constructed below.

### Step-by-Step:

Step 1 - Modify the sample key-file developed in Tutorial 1Place a copy of the final key-file used in Tutorial 1, which was originally derived from this sample key-file, and save it as tutorial2.key in a new folder for this tutorial. Below, only those keywords are listed, which require changes if you followed Tutorial 1 exactly. Open the key-file and make sure that all scale factors for individual energy terms are set to zero with the exception of the inverse power potential and the torsional potentials (the latter are required here for the ω-angles to observe electronic barriers):

`FMCSC_SC_ATTLJ 0.0`

FMCSC_SC_IMPSOLV 0.0

FMCSC_SC_POLAR 0.0

FMCSC_SC_BONDED_A 0.0

FMCSC_SC_BONDED_I 0.0

FMCSC_SC_IMPSOLV 0.0

FMCSC_SC_POLAR 0.0

FMCSC_SC_BONDED_A 0.0

FMCSC_SC_BONDED_I 0.0

It is a general feature of CAMPARI keywords that there can be multiple specifications for the same keyword, and only the last one is relevant (unless an immediate error occurs,

*e.g.*, an input type mismatch). Thus, in theory, you could just append the above to the end of IPPchain.key to achieve the same effect (but accept a much larger potential for confusion; errors involving a duplicate keyword the user is not aware of can be really frustrating).

For this simulation, we will use periodic boundary conditions in conjunction with a very large cubic box of side lengths 1000Å. This is done to ensure that the longer chains are not artificially confined (through explicit boundary terms), which would lead to artifactual results (compaction).

`FMCSC_BOUNDARY 1`

FMCSC_SHAPE 1

FMCSC_SIZE 1000.0 1000.0 1000.0

FMCSC_SHAPE 1

FMCSC_SIZE 1000.0 1000.0 1000.0

We also disable most of the report flags to reduce output size in the log-file:

`FMCSC_DIPREPORT 0`

FMCSC_VDWREPORT 0

FMCSC_FOSREPORT 0

FMCSC_VDWREPORT 0

FMCSC_FOSREPORT 0

Finally, set the frequencies (by lowering the interval) at which average polymeric values are gathered in the output and disable all other outputs. The primary analysis of interest is the one that will accumulate statistics for R

_{g},

*i.e.*, our primary readout. In addition, change the original settings for two related keywords for a few complementary analyses, and disable trajectory output:

FMCSC_POLCALC 100

FMCSC_RHCALC 5000 # scales poorly with chain length, thus less frequent

FMCSC_SCATTERCALC 100000000 # scales very poorly with chain length, thus disabled

FMCSC_XYZOUT 10000000000

FMCSC_POLCALC 100

FMCSC_RHCALC 5000 # scales poorly with chain length, thus less frequent

FMCSC_SCATTERCALC 100000000 # scales very poorly with chain length, thus disabled

FMCSC_XYZOUT 10000000000

For this tutorial to work from a statistical point of view, the simulations have to be run for a large enough number of steps, in particular for the longer chains. Otherwise, the size estimates will be noisy, and the estimation of the scaling exponent may consequently be so far away from the predicted value that the validity of the limit is difficult to test and that finite size effects may be masked. A statistical viewpoint is critical for most simulations performed with whatever software package because, at typical conditions of interest, molecules do not occupy just one "ground state". Given this, namely that molecular systems sample their conformational space from distributions, statistical mechanics also demands that the convergence of ensemble averages needs to be assessed

*with respect to the quantities of interest*(outside of toy systems, there is no such thing as general convergence in molecular simulations, which is a direct result of the very high dimensionality of the phase space the simulated systems exist in). By purely empirical considerations, the following simulation length is sufficient for doing the analysis:

`FMCSC_NRSTEPS 2000000`

FMCSC_EQUIL 200000

FMCSC_EQUIL 200000

Depending on the available resources, feel free to increase FMCSC_NRSTEPS further. Note that for reasons of simplicity we do not actually try to estimate the statistical error on the estimates here. Next, specify the sequence file (which we will continuously regenerate by a script), change the basic name, and make sure to correct any paths to parameter files or input files, which have not been corrected yet:

`FMCSC_SEQFILE tutorial2.in`

FMCSC_BASENAME POLY

FMCSC_BASENAME POLY

Step 2 - Use an automated setup (script) to run multiple simulations efficiently

If working in a bash environment, you can save the provided script, modify the variable specifications it relies on, and run it to start the full range of length-specific simulations. In particular, you will have to point the script to the CAMPARI executable you want to use (change the value for the variable "CAMPARIEXE"; note that the default path uses environment variables "CAMPARI_HOME" and "ARCH", which are

*not*globally exported by the installation,

*i.e.*, they will be unset). You can also change the value for the variable "AMINO", which specifies the type of amino acid we create long homopolymers of. The residue type influences both run time and the magnitude of finite size effects.

Relevant results for each chain length will be stored in separate folders. This script will also setup input files and gather the necessary output automatically. Run it by:

`./IPPChain.sh >& tutorial2.masterlog &`

Note that when executing the script repeatedly, you should keep in mind that the individual simulation logs are always concatenated (in file "tutorial2.log"). In case, the script is useless or incomprehensible, the steps it prescribes can of course be performed manually as well. Simply create one folder for each homopolymer length, create sequence input files that span a range of chain lengths, and run CAMPARI in each folder. It is important that each chain length is run in a separate folder as CAMPARI will by default overwrite most output files. It should be straightforward to circumvent the usage of the script after completing the introductory tutorial.

Step 3 - Analyze the data to compute the scaling exponent γ

The script will automatically gather the ensemble-averaged radius of gyration (<R

_{g}>) for each chain length from the corresponding output file, and add them to the file "IPPscaling.dat". This file contains two columns; chain length and <R

_{g}>. To obtain the scaling exponent γ, simply create a double-logarithmic plot of N vs. <R

_{g}>. γ is then simply obtained as the slope of a linear fit with R

_{0}being the intercept. Alternatively (and this is generally the better approach), one could do the nonlinear fit directly (for example, using a non-linear least squares method). It is important to create a comparison plot (fit

*versus*data) to see why the result may deviate from the theoretical prediction (for this visual inference, the linearization is beneficial). Outside of ever-present statistical noise, it is the aforementioned finite size artifacts that give rise to systematic deviations here. As mentioned above, the limiting value of ~0.6 is really obtained for infinitely long chains. We (with the default script) simulated chain lengths from 20 to 900 residues, and the shortest chains can introduce errors. Based on the visual inference from the linearization, you can try to leave out the first few values from the fit, and the result should generally approach the theoretical limit more accurately. It is left as an exercise to study 1) the influence of using (<R

_{g}

^{2}>)

^{0.5}instead of <R

_{g}>; 2) see if a similar scaling law holds for the end-to-end distance; and 3) repeat the tutorial for a few chain lengths but split each run into a number of separate shorter runs so that the statistical error on <R

_{g}> can be estimated.

To finish, we touch upon a complementary analysis. In the folder for each chain length, you will find among the output files results regarding the scaling of internal residue-to-residue distances as a function of sequence separation. Due to the fractal nature of chains in the excluded volume limit, this plot mirrors the scaling law itself:

<R

_{ij}> ~ |i-j|

^{γ}

You should plot all the obtained internal scaling curves in the same graph. This illustrates the universality of the scaling relationship and finite size effects at the same time. Regarding the latter, you should see in particular that for a given chain the values for large sequence separation are systematically below values for the same sequence separation in a longer chain, which means that the limiting scaling exponent cannot be gathered safely from single chain data without discarding some of the sequence separation values. Note that these data were accumulated only every 5000 steps (→ FMCSC_RHCALC), which is a statistical limitation.