BioEmu (Microsoft)

Equilibrium-ensemble sampling and side-chain reconstruction for protein monomers.

How SubSeq Runs BioEmu

Guided mode runs one sequence-to-ensemble workflow in a single job: it precomputes embeddings from the A3M alignment, samples using the warmed embedding cache, optionally applies physical steering, and optionally reconstructs all-heavy-atom side chains.
Advanced args use one BioEmu command per line, similar to GROMACS.
Supported commands are python -m bioemu.subseq_precompute_embeds ..., python -m bioemu.sample ..., and python -m bioemu.sidechain_relax ....
Arbitrary Python scripts, python -c, and unsupported bioemu.* modules are rejected.
Shell operators, comments, quotes, and command chaining are rejected in advanced BioEmu command lines.
Runtime dependencies, model assets, and caches are managed by SubSeq.
Sampling paths are enforced: --output_dir=/outputs, --cache_embeds_dir=/outputs/.bioemu_embeds_cache, --cache_so3_dir=/ref/.bioemu_so3_cache.
This deployment rejects direct sequence strings and FASTA input at submit time; --sequence must point to an .a3m file.
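
The rules above can be sketched as a small checker. This is a minimal illustration, not SubSeq's actual validator: the names check_command_line and flag_value, and the exact set of rejected characters, are assumptions made for the example.

```python
# Modules listed as supported in this section.
ALLOWED_MODULES = {
    "bioemu.subseq_precompute_embeds",
    "bioemu.sample",
    "bioemu.sidechain_relax",
}
# Modules whose examples in this document take --sequence.
SEQUENCE_MODULES = {"bioemu.subseq_precompute_embeds", "bioemu.sample"}

# Assumed stand-in for "shell operators, comments, quotes, and chaining".
# The real rejected set is not documented here.
FORBIDDEN_CHARS = set(";&|<>`$#'\"\\(){}")


def flag_value(tokens, name):
    """Return the value of a flag written as '--name=value' or '--name value'."""
    for i, tok in enumerate(tokens):
        if tok.startswith(name + "="):
            return tok[len(name) + 1:]
        if tok == name and i + 1 < len(tokens):
            return tokens[i + 1]
    return None


def check_command_line(line):
    """Return the module name if the line passes the checks, else raise ValueError."""
    bad = FORBIDDEN_CHARS.intersection(line)
    if bad:
        raise ValueError(f"forbidden characters: {sorted(bad)}")
    tokens = line.split()
    # Only 'python -m <module> ...' is accepted; this also excludes 'python -c'
    # and arbitrary script paths.
    if len(tokens) < 3 or tokens[:2] != ["python", "-m"]:
        raise ValueError("expected: python -m <bioemu module> ...")
    module = tokens[2]
    if module not in ALLOWED_MODULES:
        raise ValueError(f"unsupported module: {module}")
    if module in SEQUENCE_MODULES:
        seq = flag_value(tokens[3:], "--sequence")
        if seq is None or not seq.endswith(".a3m"):
            raise ValueError("--sequence must point to an .a3m file")
    return module
```

Running each advanced-args line through a check like this before submitting can catch a rejected line early; the authoritative decision is still made by SubSeq at submit time.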

Sampling

Generate a backbone-frame equilibrium ensemble from an A3M alignment:

python -m bioemu.subseq_precompute_embeds --sequence=/inputs/sequence.a3m --cache_embeds_dir=/outputs/.bioemu_embeds_cache

python -m bioemu.sample --sequence=/inputs/sequence.a3m --num_samples=100 --batch_size_100=10 --filter_samples=True --base_seed=42

--sequence: path to an .a3m file.
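--num_samples: number of sampled structures to generate.
--batch_size_100: batch size to use for a sequence of length 100; BioEmu scales the actual batch size down for longer sequences.
--filter_samples: whether to filter out unphysical samples, such as those with chain breaks or steric clashes.
--base_seed: base random seed; set it for reproducible sampling.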
In guided mode, physical steering is exposed as a strength parameter, not as a separate mode.
In guided mode, the atom detail setting selects either the native backbone outputs (topology.pdb and samples.xtc) or all-heavy-atom side-chain outputs.
--denoiser_config and --steering_config must point to mounted config files under /inputs, /aux, or /ref.
User-supplied --ckpt_path, --model_config_path, and --model_name are ignored by SubSeq.
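
As a rough sketch of how the path and model rules above combine, the function below takes sampling flags as a dictionary and returns what would effectively run. It is illustrative only: normalize_sample_args is a hypothetical name, and treating "enforced" as "overwritten" (rather than "rejected if different") is an assumption.

```python
import posixpath

# Paths stated as enforced in this document.
ENFORCED = {
    "--output_dir": "/outputs",
    "--cache_embeds_dir": "/outputs/.bioemu_embeds_cache",
    "--cache_so3_dir": "/ref/.bioemu_so3_cache",
}
# Flags stated as ignored when supplied by the user.
IGNORED = {"--ckpt_path", "--model_config_path", "--model_name"}
# Flags whose values must be config files under a mounted root.
CONFIG_FLAGS = {"--denoiser_config", "--steering_config"}
MOUNT_ROOTS = ("/inputs/", "/aux/", "/ref/")


def normalize_sample_args(args: dict[str, str]) -> dict[str, str]:
    """Drop ignored flags, check config paths, and apply the enforced paths."""
    out = {}
    for flag, value in args.items():
        if flag in IGNORED:
            continue
        if flag in CONFIG_FLAGS:
            # Collapse '..' segments so '/inputs/../etc/x' is not accepted.
            path = posixpath.normpath(value)
            if not path.startswith(MOUNT_ROOTS):
                raise ValueError(f"{flag} must be under /inputs, /aux, or /ref")
        out[flag] = value
    # Assumption: user-supplied values for these flags are overwritten.
    out.update(ENFORCED)
    return out
```

For example, passing --output_dir=/tmp/run and --model_name=custom alongside --sequence would, under these assumptions, yield arguments with --output_dir=/outputs and no --model_name.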

Side-chain Reconstruction

Reconstruct all-heavy-atom side chains from BioEmu topology.pdb and samples.xtc outputs. Guided full-atom jobs run this after sampling using the files just written under /outputs:

python -m bioemu.sidechain_relax --pdb-path /outputs/topology.pdb --xtc-path /outputs/samples.xtc --outpath /outputs --prefix samples --no-md-equil

--outpath is normalized to /outputs.
Without --no-md-equil, BioEmu can run OpenMM local minimization/equilibration after side-chain reconstruction.
Outputs include files such as samples_sidechain_rec.pdb, samples_sidechain_rec.xtc, samples_md_equil.pdb, and samples_md_equil.xtc.
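
The relationship between --prefix, --no-md-equil, and the output names can be sketched as follows. The helper names are hypothetical, and the sketch assumes that the prefix is used as the file-name stem and that the *_md_equil files appear only when equilibration runs, which is inferred from the file names listed above.

```python
def sidechain_relax_command(prefix: str = "samples", md_equil: bool = False) -> str:
    """Build the side-chain reconstruction command line used in this section."""
    parts = [
        "python", "-m", "bioemu.sidechain_relax",
        "--pdb-path", "/outputs/topology.pdb",
        "--xtc-path", "/outputs/samples.xtc",
        "--outpath", "/outputs",
        "--prefix", prefix,
    ]
    if not md_equil:
        # Skip the OpenMM minimization/equilibration step.
        parts.append("--no-md-equil")
    return " ".join(parts)


def sidechain_output_names(prefix: str = "samples", md_equil: bool = False) -> list[str]:
    """List the output files expected under /outputs (assumed naming)."""
    names = [f"{prefix}_sidechain_rec.pdb", f"{prefix}_sidechain_rec.xtc"]
    if md_equil:
        names += [f"{prefix}_md_equil.pdb", f"{prefix}_md_equil.xtc"]
    return names
```

With the defaults, sidechain_relax_command() reproduces the command shown above, and sidechain_output_names() lists only the two reconstruction files.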

This BioEmu module does not select representative centroid frames from the trajectory; use a trajectory postprocessor to cluster the ensemble or extract representative PDBs.
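
As one illustration of what such a postprocessor might do, the sketch below picks a medoid: the frame with the smallest total distance to all other frames. It is not part of BioEmu or SubSeq. It assumes a symmetric pairwise RMSD matrix has already been computed by a trajectory analysis tool, and medoid_index is a hypothetical name.

```python
def medoid_index(dist: list[list[float]]) -> int:
    """Return the index of the frame with the smallest summed distance to all frames.

    dist[i][j] is the precomputed RMSD between frames i and j.
    """
    if not dist:
        raise ValueError("distance matrix is empty")
    totals = [sum(row) for row in dist]
    # Ties resolve to the lowest frame index.
    return min(range(len(totals)), key=totals.__getitem__)
```

Applied within each cluster, the same function gives one representative frame per cluster, which can then be written out as a PDB by the postprocessor.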

Outputs and Caches

Sampling writes topology.pdb, samples.xtc, sequence.fasta, and batch_*.npz under /outputs.
Embedding cache is written to /outputs/.bioemu_embeds_cache.
SO(3) lookup cache is written to /ref/.bioemu_so3_cache.
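
A quick completeness check on a downloaded output directory could look like the sketch below. The function name missing_sampling_outputs is hypothetical, and the check only tests that the files listed above exist, not that their contents are valid.

```python
from pathlib import Path

# Fixed-name sampling outputs listed in this section.
REQUIRED = ("topology.pdb", "samples.xtc", "sequence.fasta")


def missing_sampling_outputs(output_dir: str = "/outputs") -> list[str]:
    """Return the names (or patterns) of expected sampling outputs that are absent."""
    root = Path(output_dir)
    missing = [name for name in REQUIRED if not (root / name).is_file()]
    # At least one per-batch archive is expected.
    if not any(root.glob("batch_*.npz")):
        missing.append("batch_*.npz")
    return missing
```

An empty list means all four kinds of output are present. The cache directories are not checked here.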

Offline/Reproducible Use

A3M input, which this deployment requires, keeps jobs independent of remote MSA services.
Side-chain reconstruction uses image-bundled OpenMM and HPacker so jobs do not install dependencies at runtime.