blog - intro to gromacs for protein design

What is GROMACS?

GROMACS runs molecular dynamics (MD): you simulate your protein (usually in explicit water + ions) using a molecular mechanics force field, and watch how it behaves over time.

For protein design, MD is most useful for answering:

Does the fold stay compact, or does it unravel?
Does a binder interface stay engaged, or drift apart?
Are there obvious weak points (fraying helices, unstable loops, poor core packing)?

Units note: GROMACS distances are typically in nm (0.10 nm = 1 Å). Time is usually in ps or ns.

Why use MD for protein design?

Structure predictors and design models give you a plausible snapshot. MD asks whether that snapshot is dynamically stable when atoms are allowed to move.

Tool	What it tells you	Limitation
RFdiffusion / BoltzGen	Generates plausible backbone geometries	Static structure only
AlphaFold2 / ESMFold	Predicts likely folded structure	Static structure only
GROMACS MD	Whether the modeled structure stays stable when atoms move	Slower; setup details matter

MD is best used as a relative filter on a shortlist (e.g., 3–20 candidates) under one consistent protocol. It is not a guarantee of experimental success.

What MD is great for

Screening a shortlist under identical conditions
Catching "fails fast" designs before spending lab resources
Identifying problem regions to redesign (high flexibility, loss of contacts)
Comparing stability at different temperatures (e.g., 300 K vs 350 K)

What MD is NOT

A guarantee of experimental stability, activity, or affinity
A replacement for correct protonation states, disulfides, cofactors, metals, or glycans
A substitute for experimental validation

Key file types

Understanding what each file contains helps you debug problems and interpret results.

Extension	What it is	When you use it
`.pdb`	Input structure (Protein Data Bank format)	Starting point for `pdb2gmx`
`.gro`	GROMACS coordinate file (positions + box)	Passed between steps; final snapshots
`.top`	Topology (atoms, bonds, force field params)	Describes the system for the simulation
`.mdp`	Run parameters (timestep, temperature, etc.)	Controls what kind of simulation you run
`.tpr`	Portable run input (binary, everything needed)	Created by `grompp`, used by `mdrun`
`.xtc`	Compressed trajectory (positions over time)	Main analysis input
`.edr`	Energy file (temperature, pressure, etc.)	Used by `gmx energy`
`.cpt`	Checkpoint (restart file)	Continue interrupted runs
`.xvg`	Time series output (plottable data)	RMSD, Rg, energy plots, etc.
`.xpm`	Matrix/image output (e.g., contact maps)	DSSP, contact analysis

Typical workflow (prep → solvated MD)

This is the standard "minimal but sane" workflow for proteins. Each step builds on the previous one.

Step 1: Build topology and add hydrogens (`pdb2gmx`)

gmx pdb2gmx -f protein.pdb -o protein_processed.gro -p topol.top -ff amber99sb-ildn -water tip3p

What it does / key args:

pdb2gmx: generates topology + coordinates with hydrogens added.
-f: input PDB structure.
-o: output coordinates (GRO format).
-p: output topology file.
-ff: force field choice — keep consistent across candidates. Common choices:
- amber99sb-ildn: well-validated, good for folded proteins
- charmm36: another solid choice with good lipid support
- amber14sb: newer Amber variant
-water: water model (must match force field; TIP3P for Amber, SPC/E for some OPLS).

Tip: If pdb2gmx complains about missing atoms or unknown residues, your input PDB may need cleaning (remove ligands, fix non-standard residues, check for missing loops).

Step 2: Define a solvent box and solvate (`editconf`, `solvate`)

gmx editconf -f protein_processed.gro -o box.gro -bt dodecahedron -d 1.0 -c
gmx solvate -cp box.gro -cs spc216.gro -o solv.gro -p topol.top

What it does / key args:

editconf: defines the periodic simulation box.
-bt dodecahedron: box shape — dodecahedron is more efficient than cubic (fewer waters for same buffer).
-d 1.0: minimum distance from protein to box edge (~1.0 nm is typical; larger for big conformational changes).
-c: centers the protein in the box.
solvate: fills the box with water molecules.
-cp: input (protein + box).
-cs spc216.gro: water coordinate template.
-p: updates topology with water molecule counts.

Step 3: Add ions (`grompp`, `genion`)

gmx grompp -f ions.mdp -c solv.gro -p topol.top -o ions.tpr
gmx genion -s ions.tpr -o solv_ions.gro -p topol.top -pname NA -nname CL -neutral -conc 0.15

What it does / key args:

grompp: "preprocess" — combines .mdp settings + coordinates + topology into a runnable .tpr.
genion: replaces water molecules with ions.
-neutral: adds enough ions to neutralize the system's net charge.
-conc 0.15: sets ~0.15 M salt concentration (physiological-ish).
-pname NA -nname CL: ion names (check your force field for correct names).

You'll be prompted to choose a group to replace with ions — usually pick "SOL" (solvent).

Step 4: Minimization + equilibration (`mdrun`)

# Energy minimization (remove bad contacts)
gmx grompp -f em.mdp -c solv_ions.gro -p topol.top -o em.tpr
gmx mdrun -v -deffnm em

# NVT equilibration (heat up, constant volume)
gmx grompp -f nvt.mdp -c em.gro -r em.gro -p topol.top -o nvt.tpr
gmx mdrun -v -deffnm nvt

# NPT equilibration (constant pressure, let density settle)
gmx grompp -f npt.mdp -c nvt.gro -t nvt.cpt -r nvt.gro -p topol.top -o npt.tpr
gmx mdrun -v -deffnm npt

What it does / key args:

em.mdp: energy minimization — removes steric clashes and bad geometry.
nvt.mdp: NVT equilibration — brings system to target temperature with position restraints on protein.
npt.mdp: NPT equilibration — adds pressure coupling, lets box size and density equilibrate.
mdrun -deffnm X: writes outputs as X.xtc (trajectory), X.edr (energies), X.gro (final coords).
grompp -r: reference structure for position restraints (if enabled in .mdp).
grompp -t: continue from checkpoint (.cpt) — preserves velocities and thermostat state.

Position restraints during equilibration let the water and ions relax around a fixed protein, preventing early instability artifacts.

Step 5: Production MD

gmx grompp -f md.mdp -c npt.gro -t npt.cpt -p topol.top -o md.tpr
gmx mdrun -v -deffnm md

What it does: This is the actual simulation you'll analyze. No more restraints — the protein moves freely.

Run length guidelines (very approximate):

10–20 ns: "does it fail fast?" — catches major instabilities quickly.
50–200 ns: shortlist ranking — enough to see meaningful trends.
Multiple short replicas: often more informative than one long run (different starting velocities can reveal sampling issues).

Critical: Your results are only comparable if you keep the protocol identical across candidates — same force field, water model, salt concentration, temperature, .mdp settings, and run length.

Key outputs explained (what they mean)

Most analysis commands use these common arguments:

-s: run input (usually md.tpr)
-f: trajectory (usually md.xtc)
-o: output plot/file (often .xvg)

0) First: PBC-clean the trajectory (important!)

Many "binder dissociates" plots are actually periodic boundary artifacts. The protein didn't fall apart — it just wrapped around the simulation box edge. Always make molecules whole and center before analysis.

gmx trjconv -s md.tpr -f md.xtc -o md_centered.xtc -pbc mol -center

What it does / key args:

trjconv: converts and post-processes trajectories.
-pbc mol: makes molecules whole across the periodic box.
-center: centers the selected group (you'll be prompted to choose).

You'll be prompted twice: once for centering group (usually "Protein") and once for output group (usually "System").

1) RMSD (Root Mean Square Deviation)

What it measures: How far the structure drifts from the reference after alignment. This is your primary "is it staying folded?" signal.

gmx rms -s md.tpr -f md_centered.xtc -o rmsd.xvg -tu ns

What it does / key args:

rms: calculates RMSD vs time (prompts you to choose fit group and RMSD group).
-tu ns: output time in nanoseconds (easier to read than ps).

Interpretation (heuristics; depends on protein size/fold and your atom selection):

✅ RMSD rises then plateaus (common ballpark backbone RMSD: ~0.10–0.30 nm) → stable-ish fold under this protocol.
⚠️ RMSD drifts then plateaus high → relaxation to a different basin (can be fine, especially from a strained starting geometry).
❌ RMSD keeps rising without plateau → unfolding or major rearrangement in progress.

Tip: RMSD depends heavily on which atoms you select (backbone vs all-atom) and whether you handled PBC correctly. Backbone C-alpha is a common choice.

2) RMSF (Root Mean Square Fluctuation)

What it measures: Per-residue flexibility — great for finding weak regions in your design.

gmx rmsf -s md.tpr -f md_centered.xtc -o rmsf.xvg -res

What it does / key args:

rmsf: calculates fluctuation per atom/residue.
-res: report per-residue values (averaged over atoms in each residue).

Interpretation:

✅ Core residues relatively rigid; loops/termini more flexible → expected behavior.
❌ High RMSF across helices/sheets or "everything is floppy" → poor packing, instability, or design problems.

RMSF is like crystallographic B-factors: it tells you which parts of the structure are mobile. A well-packed core should be rigid.

3) Radius of gyration (Rg)

What it measures: Overall compactness — how "spread out" the protein is.

gmx gyrate -s md.tpr -f md_centered.xtc -o gyrate.xvg

What it does:

gyrate: calculates Rg vs time (choose the group you want, typically Protein).

Interpretation:

✅ Stable Rg → fold remains compact.
❌ Steadily increasing Rg → often accompanies unfolding (but some folds "breathe" without actually failing).

4) Hydrogen bonds

What it measures: Hydrogen bond counts over time (global or between specific groups).

gmx hbond -s md.tpr -f md_centered.xtc -num hbnum.xvg

What it does / key args:

hbond: H-bond analysis (choose donor/acceptor groups interactively).
-num: outputs H-bond count vs time.

Interpretation:

✅ Stable secondary-structure H-bond patterns → helices/sheets intact.
⚠️ Interface H-bonds can be meaningful if you define groups correctly (e.g., chain A donors vs chain B acceptors).

5) Secondary structure over time (DSSP)

What it measures: Whether helices/sheets persist or fray/collapse over time.

gmx do_dssp -s md.tpr -f md_centered.xtc -o ss.xpm -sc scount.xvg

What it does / key args:

do_dssp: calls DSSP algorithm to assign secondary structure per frame.
-o ss.xpm: per-residue, per-time "image" output (can convert to PNG).
-sc scount.xvg: secondary structure counts (helix, sheet, coil) vs time.

Note: do_dssp requires a DSSP executable (mkdssp or dssp) in the environment.

6) Binder/interface stability (for complexes)

What it measures: Whether two proteins stay together or drift apart. Track more than one signal:

gmx mindist -s md.tpr -f md_centered.xtc -od mindist.xvg -tu ns
gmx mdmat   -s md.tpr -f md_centered.xtc -mean mean_contacts.xpm

What they do / key args:

mindist: minimum distance between two groups over time (choose groups interactively, e.g., chain A vs chain B).
mdmat -mean: contact matrix summary averaged over time.

Interpretation:

✅ Distance stays low and contacts persist → interface likely stable under this protocol.
❌ Distance jumps and contacts drop → likely dissociation (but verify PBC-cleaning first!).

7) Potential energy and "sanity dashboard"

What it measures: Simulation stability signals — not protein stability directly, but whether the run is behaving sanely.

gmx energy -f md.edr -o energy.xvg

What it does / key args:

energy: extracts time series from the energy file (interactive selection of terms: temperature, pressure, potential energy, etc.).

Interpretation:

✅ Temperature stable around setpoint; density settles during NPT; no huge energy spikes → simulation is running correctly.
❌ Giant energy spikes → bad contacts, unstable integration, or something went wrong.

Practical examples

Example 1: Monomer stability screen

Goal: Check if a designed monomer stays folded and compact.

Run a production trajectory, then analyze RMSD and Rg:

gmx rms    -s md.tpr -f md_centered.xtc -o rmsd.xvg -tu ns
gmx gyrate -s md.tpr -f md_centered.xtc -o gyrate.xvg

What to look for:

RMSD rises then plateaus (no steady drift).
Rg stays roughly constant.
Compare multiple candidates under identical protocols — rank by stability.

Example 2: Binder/complex stability

Goal: Check whether a designed binder stays bound to its target.

gmx mindist -s md.tpr -f md_centered.xtc -od mindist.xvg -tu ns
gmx mdmat   -s md.tpr -f md_centered.xtc -mean mean_contacts.xpm

What to look for:

Minimum distance remains low (sub-nm).
Contact summary stays consistent (no sustained collapse of interface).
Remember to define index groups properly (chain A vs chain B).

Example 3: Thermostability stress test

Goal: Compare stability at different temperatures (screening, not proof).

Run comparable protocols at 300 K vs 350 K and compare RMSD trends:

gmx rms -s md_300K.tpr -f md_300K_centered.xtc -o rmsd_300K.xvg -tu ns
gmx rms -s md_350K.tpr -f md_350K_centered.xtc -o rmsd_350K.xvg -tu ns

What to look for:

Does one design stay stable at 350 K while another unfolds?
Useful for comparing thermostable variants, but don't over-interpret absolute values.

Example 4: Finding problem regions

Goal: Identify which part of your design is unstable for redesign.

gmx rmsf -s md.tpr -f md_centered.xtc -o rmsf.xvg -res
gmx do_dssp -s md.tpr -f md_centered.xtc -o ss.xpm -sc scount.xvg

What to look for:

High RMSF spikes in regions that should be structured → redesign targets.
DSSP showing helix/sheet loss in specific residue ranges → weak spots.

Validation checklist

Use this as a screening checklist. These are heuristics, not hard rules — absolute cutoffs depend on protein size, fold, and your analysis choices.

✅ Structural stability

RMSD reaches a plateau (commonly ~0.10–0.30 nm backbone)
Rg not trending upward
Secondary structure doesn't collapse (DSSP)
Visual inspection doesn't show obvious unfolding

✅ Local stability

Core RMSF is low relative to loops/termini
No obvious water penetration into hydrophobic core (visual check)
Key interactions (salt bridges, H-bonds) persist

✅ Interface stability (if binder/complex)

Distance/contact metrics don't trend toward dissociation
Interface H-bonds persist (if designed to have them)
Binder orientation doesn't rotate away from target

✅ Simulation sanity

Temperature stable at setpoint
No huge energy spikes
Density/pressure stabilized after equilibration

Common issues & troubleshooting

Problem: "It explodes immediately"

Common causes:

Steric clashes / overlapping atoms in starting structure
Wrong protonation states or missing disulfides
Too-large time step for your constraint setup
Missing parameters (unknown residues/ligands)

Try:

Longer/stronger energy minimization (more steps, steeper descent)
More restrained equilibration (stronger position restraints)
Smaller time step early (e.g., 1 fs) before switching to 2 fs
Check your input PDB for problems (missing atoms, clashes, non-standard residues)

Problem: RMSD keeps rising

Possibilities:

Real unfolding (bad packing, poor design)
Relaxation away from a strained starting model (not necessarily bad)
Analysis artifact (fit group choice, PBC issues)

Try:

Re-run analysis after PBC-cleaning (trjconv -pbc mol -center)
Compare RMSD with different fitting/selection choices (backbone only, C-alpha only, specific domains)
Use RMSF to find which region is driving the motion
Visually inspect the trajectory in VMD/PyMOL/ChimeraX

Problem: Binder "dissociates" in plots

Most common cause: Periodic boundary artifacts — the complex looks separated because one chain wrapped to the other side of the box.

Try:

Always analyze a centered/whole trajectory (trjconv -pbc mol -center)
Visually inspect before concluding dissociation is real
For stubborn cases, try -pbc nojump first, then -pbc mol

Problem: pdb2gmx fails with unknown residues

Common causes:

Non-standard residues (modified amino acids, ligands)
Missing atoms or incomplete residues
Waters/ions in unusual formats

Try:

Remove ligands/waters before pdb2gmx, add them back later
Use a tool like pdb4amber or pdbfixer to clean the input
Check if you need to add custom force field parameters

Scope & complementary tools

MD is one tool in the design validation toolkit. Understanding its scope helps you use it appropriately.

What MD is good at

Catching obviously unstable designs — things that fall apart in nanoseconds aren't going to work
Relative ranking — comparing candidates under identical conditions
Finding flexible regions — identifying redesign targets
Checking binder engagement — does the interface stay together?

What MD can't tell you

Absolute stability — "stable in MD" ≠ "experimentally stable"
Binding affinity — you need free energy calculations for that
Folding from scratch — typical MD timescales can't observe folding
Expression/solubility — totally different properties

Complementary tools

Tool	What it adds
AlphaFold2 pLDDT	Quick confidence filter before MD
Rosetta ddG	Mutation effects, stability scoring
FoldX	Fast stability estimates
Free energy perturbation	Binding affinity predictions (more expensive)
Coarse-grained MD	Longer timescales, bigger systems
Enhanced sampling	Metadynamics, replica exchange for rare events

Typical validation pipeline

Generate candidates: RFdiffusion, BoltzGen, etc.
Quick filter: pLDDT, Rosetta scoring, visual inspection
MD screening: shortlist → solvated MD → rank by stability
Deeper analysis: longer runs, replicas, free energy calculations
Experimental validation: expression, purification, biophysical characterization

SubSeq-specific notes

On subseq.bio, GROMACS runs as "one command per line":

Inputs go under /inputs
Outputs go under /outputs
No pipes, &&, output redirects (>), or subshells
Interactive prompts can be handled via stdin redirection from a file: ... < /inputs/... or ... < /aux/...

See /docs/gromacs for the exact runner rules, GPU flags used by the image, and a minimal relaxation example.

Intro to GROMACS for protein design

Table of contents

What is GROMACS?

Why use MD for protein design?

What MD is great for

What MD is NOT

Key file types

Typical workflow (prep → solvated MD)

Step 1: Build topology and add hydrogens (pdb2gmx)

Step 2: Define a solvent box and solvate (editconf, solvate)

Step 3: Add ions (grompp, genion)

Step 4: Minimization + equilibration (mdrun)

Step 5: Production MD

Key outputs explained (what they mean)

0) First: PBC-clean the trajectory (important!)

1) RMSD (Root Mean Square Deviation)

2) RMSF (Root Mean Square Fluctuation)

3) Radius of gyration (Rg)

4) Hydrogen bonds

5) Secondary structure over time (DSSP)

6) Binder/interface stability (for complexes)

7) Potential energy and "sanity dashboard"

Practical examples

Example 1: Monomer stability screen

Example 2: Binder/complex stability

Example 3: Thermostability stress test

Example 4: Finding problem regions

Validation checklist

✅ Structural stability

✅ Local stability

✅ Interface stability (if binder/complex)

✅ Simulation sanity

Common issues & troubleshooting

Problem: "It explodes immediately"

Problem: RMSD keeps rising

Problem: Binder "dissociates" in plots

Problem: pdb2gmx fails with unknown residues

Scope & complementary tools

What MD is good at

What MD can't tell you

Complementary tools

Typical validation pipeline

SubSeq-specific notes

Step 1: Build topology and add hydrogens (`pdb2gmx`)

Step 2: Define a solvent box and solvate (`editconf`, `solvate`)

Step 3: Add ions (`grompp`, `genion`)

Step 4: Minimization + equilibration (`mdrun`)