IRC - Project
Welcome to the GitHub page for the IRC Project. This repo is mostly intended for versioning and documentation: while "IRCpkg" can and should be installed, it is meant as a collection of files rather than a full-fledged Python package. The code listed here should be sufficient to generate most of the data used in the IRC project.
Requirements & Installation
A working Python environment (for allegro) named openMM with the recommended packages can be created from the allegro_conf.yml file by running:
conda env create -f allegro_conf.yml
If this is not desirable, the necessary packages can (in principle) be installed with:
conda install -c conda-forge cudatoolkit=XXX jupyter matplotlib mdanalysis mdtraj numba=0.58 numpy openmm=8.1.0 pandas python scipy seaborn pymbar tqdm
The cudatoolkit version (XXX) has to be selected according to the installed driver.
Furthermore, multipletau has to be installed:
pip install multipletau
In order to use the files here as a package, install it by opening a terminal in this directory and executing:
pip install -e .
Usage
The scripts used for generating the results can be found in the folders of the parent directory. Some example scripts can also be found in the IRCpkg/examples folder.
Tips & possible Improvements
While the report is sufficient to get the idea behind this code and shows applications, it does not explain every crucial detail. Here are some general tips on how to use this method.
Generally speaking, the runtime parameters are very important and not entirely trivial to choose. Some possible improvements that could make this easier are outlined below.
Two rather common problems are that it is hard to guess sensible equilibration times and integration timesteps for every parameter set the algorithm might try. For example, the algorithm might propose values like 2000 kbT for eps_2. This may simply "break the simulation", with particles flying around very quickly, yet the pair numbers generated from the broken simulation may still look good to the algorithm, so the move is accepted (and the optimization cannot recover from it). One possible remedy would be to use the "VariableLangevinIntegrator" from openMM in the future; this integrator adjusts the step size to the force field (as outlined in the openMM documentation). Apart from this, there is the rule of thumb of setting dt to 1/10th of the fastest vibration in the system; perhaps sensible choices can be made based on the approximation eps_2 \approx k for a harmonic oscillator (or something along those lines).

Equilibration poses a similar problem: how can we know beforehand that a given equilibration period is sufficient for all possible eps_2 values proposed? Once again, the algorithm may make "wrong" moves by accepting out-of-equilibrium samples. Solutions here are probably more involved. Ideally there would be some sanity check inside the simulation script, but that would also require a fair amount of extra code. So for the time being, it is best to just pick a conservative estimate.
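As a rough illustration of both ideas, here is a minimal sketch; the temperature, friction, error tolerance, mass, and stiffness values are placeholders, not the project's actual settings:

```python
import math

import openmm
import openmm.unit as unit

# Option 1: let openMM adapt the step size to the forces instead of fixing dt.
# Constructor arguments: temperature, friction coefficient, error tolerance.
integrator = openmm.VariableLangevinIntegrator(
    300 * unit.kelvin,       # placeholder temperature
    1.0 / unit.picosecond,   # placeholder friction coefficient
    1e-4,                    # error tolerance controlling the adaptive dt
)

# Option 2: estimate a fixed dt from the harmonic-oscillator rule of thumb,
# dt ~ T_fastest / 10 with T = 2*pi*sqrt(m/k), treating the stiffest eps_2
# the optimizer might propose as an effective spring constant k.
m = 1.0       # effective particle mass in amu (placeholder)
k = 2000.0    # stiffest expected eps_2 in kJ/mol/nm^2 (placeholder)
dt = 0.1 * 2.0 * math.pi * math.sqrt(m / k)  # comes out in picoseconds
fixed_integrator = openmm.LangevinMiddleIntegrator(
    300 * unit.kelvin, 1.0 / unit.picosecond, dt * unit.picosecond
)
```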
The simulation runs much quicker during equilibration anyway, since nothing has to travel from GPU to CPU to disk, and no derivatives have to be calculated, which makes the force field faster. So a conservative equilibration period does not cost much.
Furthermore, longer simulation times give more accurate gradients, which can be very important during the later steps of the optimization. Even though it may seem tempting to make each simulation run as quick as possible, in practice running simulations for rather long times really pays off here.

Another consideration in this vein is system size. Broadly speaking, we are interested in the number of pairs, so the size of the observables scales the same way as the force/distance calculation does. Since what matters is the total number of pairs sampled (for derivatives and observables), scaling the system up is as efficient as increasing production time. However, the main overhead in this implementation is the export of data to disk (and the number of different pair types), so making the system larger increases the ratio of "useful calculation time" to "time spent exporting data". You want the GPU to calculate as many pairs as possible before anything has to be saved to disk, and the only way to do so is to scale the system up. Up to some scale, this scaling barely increases the wall-clock time.
In short: scaling up helps a lot at very little cost, whereas sampling too little or with the wrong equilibration time and dt can really break the simulation, and longer simulations are good.
It is much better to let the optimization run for a week and forget about it than to find out every two days that the equilibration time should have been longer, the timestep is not great, and so on.
ForceField explanation
The way the force field works is possibly one of the strangest ways to implement a FF in openMM, so it needs a little explaining. In terms of design, we want the force field to run as quickly as possible, but we also need a separate global parameter for each protein pair type in order to get derivatives. The most elegant solution would be to specify the eps_2 parameters via a 2D tabulated function and feed them as precomputed values for each pair. However, tabulated functions do not support derivatives, so this cannot be used here. Another option is to create separate CustomNonbondedForce objects for each pair type and restrict each force to the pairs in question. This works, but for some reason it is quite slow.

So instead, what was implemented is a single force object that calculates the interaction for every possible protein pair, with one term per pair type in the force expression. Delta functions then disable all terms except the one matching the given pair type; these are implemented as precomputed values via a 2D tabulated function. It is unclear why, given the JIT compilation in openMM, this works out the way it does, but it was about 4-5x quicker than the scheme outlined above, so this strange complication was deemed worth it for the project.
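To make the delta-function trick more concrete, here is a hedged, self-contained sketch of what such a force could look like for two particle types. The pair-type names (eps_AA, eps_AB, eps_BB), the Gaussian well shape, and all numerical values are illustrative assumptions, not the project's actual force field:

```python
import openmm
import openmm.unit as unit

n_types = 2  # e.g. two protein species -> pair types AA, AB, BB

# One energy term per pair type; the "delta" tables switch off every term
# except the one matching the types of the two interacting particles.
energy = (
    "is_AA(type1, type2)*eps_AA*well + "
    "is_AB(type1, type2)*eps_AB*well + "
    "is_BB(type1, type2)*eps_BB*well;"
    "well = -exp(-r^2/(2*sigma^2));"
)
force = openmm.CustomNonbondedForce(energy)
force.addPerParticleParameter("type")  # integer particle type: 0 or 1
force.addGlobalParameter("sigma", 1.0 * unit.nanometer)

# One global eps parameter per pair type, with derivatives enabled so that
# dU/d(eps) can be read from a State requested with getParameterDerivatives=True.
for name in ("eps_AA", "eps_AB", "eps_BB"):
    force.addGlobalParameter(name, 1.0 * unit.kilojoule_per_mole)
    force.addEnergyParameterDerivative(name)

# Delta tables: is_XY(i, j) = 1 if (i, j) is that pair type, 0 otherwise.
def delta_table(i, j):
    values = [0.0] * (n_types * n_types)
    values[i + n_types * j] = 1.0
    values[j + n_types * i] = 1.0
    return openmm.Discrete2DFunction(n_types, n_types, values)

force.addTabulatedFunction("is_AA", delta_table(0, 0))
force.addTabulatedFunction("is_AB", delta_table(0, 1))
force.addTabulatedFunction("is_BB", delta_table(1, 1))

# Each particle only carries its type index, e.g.:
# force.addParticle([0])  # type-0 particle
# force.addParticle([1])  # type-1 particle
```

The derivatives with respect to the eps parameters can then presumably be collected during production via context.getState(getParameterDerivatives=True).getEnergyParameterDerivatives().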