ESMC

Overview

Embedding protein sequences for clustering, retrieval, annotation, or ML features.
Per-residue embeddings for site-level analysis.
Comparing related variants or designs by masked-token pseudo-likelihood.

Modes

Mode	Input shape	When to use it
`embed` Embeddings - Typed Sequence Default	No uploaded input is required by the mode itself.	Extract mean or per-token embeddings from one typed protein sequence.
`embed_inputs` Embeddings - Input FASTA	Consumes a folder or output set; useful for batches and pipeline handoffs.	Extract mean or per-token embeddings from FASTA inputs in the selected source.
`score` Masked-token Scoring - Typed Sequence	No uploaded input is required by the mode itself.	Score one typed protein sequence by masked-token pseudo-likelihood.
`score_inputs` Masked-token Scoring - Input FASTA	Consumes a folder or output set; useful for batches and pipeline handoffs.	Score FASTA inputs in the selected source by masked-token pseudo-likelihood.

Canonical Job Configuration

These are the fields exposed by the default job configuration for esmc. They are also returned by GET /api/v1/program/params?program=esmc and submitted as the params JSON object to POST /api/v1/job/submit.

Parameter	Type	Modes	What it does
`sequence` Protein Sequence	Sequence	Embeddings - Typed Sequence, Masked-token Scoring - Typed Sequence	Protein sequence using standard or common ambiguous amino-acid codes. Whitespace is ignored. Required
`sequence_name` Sequence Name	Text	Embeddings - Typed Sequence, Masked-token Scoring - Typed Sequence	Output label for a typed sequence. Default: sequence
`pool` Embedding Output	Text	Embeddings - Typed Sequence, Embeddings - Input FASTA	Selects which embeddings are written: mean writes one vector per record, tokens writes per-token embeddings, and both writes both outputs. Default: mean; Options: mean, tokens, both

Advanced configuration fields

Parameter	Type	Modes	What it does
`batch_size` Batch Size	Integer	All modes	Sequence or masked-position batch size. The default is safest for long inputs. Default: 1; Range: 1-4
`max_score_length` Max Score Length	Integer	Masked-token Scoring - Typed Sequence, Masked-token Scoring - Input FASTA	Reject sequences longer than this token count during masked-token scoring. Default: 1024; Range: 1-2048
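
The defaults, options, and ranges in the two tables above can be checked client-side before a job is submitted. The following is a minimal sketch, not the service's own validation: the function name check_params is hypothetical, and the rule that `pool` and `max_score_length` are only checked in the modes listed for them is an assumption drawn from the Modes column.

```python
POOL_OPTIONS = {"mean", "tokens", "both"}
EMBED_MODES = {"embed", "embed_inputs"}
SCORE_MODES = {"score", "score_inputs"}
TYPED_MODES = {"embed", "score"}


def check_params(params: dict) -> list[str]:
    """Return a list of problems found in an esmc params object (empty if none)."""
    problems = []
    mode = params.get("mode")
    if mode not in EMBED_MODES | SCORE_MODES:
        return [f"unknown mode: {mode!r}"]

    # Typed-sequence modes require a sequence; whitespace is ignored.
    if mode in TYPED_MODES:
        sequence = "".join(str(params.get("sequence", "")).split())
        if not sequence:
            problems.append("sequence is required for typed-sequence modes")

    # pool applies to embedding modes. Default: mean.
    if mode in EMBED_MODES and params.get("pool", "mean") not in POOL_OPTIONS:
        problems.append("pool must be one of: mean, tokens, both")

    # batch_size applies to all modes. Default: 1; Range: 1-4.
    batch_size = params.get("batch_size", 1)
    if not isinstance(batch_size, int) or not 1 <= batch_size <= 4:
        problems.append("batch_size must be an integer from 1 to 4")

    # max_score_length applies to scoring modes. Default: 1024; Range: 1-2048.
    if mode in SCORE_MODES:
        limit = params.get("max_score_length", 1024)
        if not isinstance(limit, int) or not 1 <= limit <= 2048:
            problems.append("max_score_length must be an integer from 1 to 2048")

    return problems


print(check_params({"mode": "embed", "sequence": "MKTAYIAK", "pool": "mean"}))
print(check_params({"mode": "score", "sequence": "MKTAYIAK", "batch_size": 8}))
```

The first call prints an empty list; the second reports the out-of-range batch size.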

Outputs And Metrics

Embedding manifest and NPY arrays for mean and/or token embeddings.
Score TSV/JSON files plus per-position masked-token detail tables.
mean_log_probability and perplexity are normalized by sequence length, so they are the metrics to use when comparing sequences of different lengths.
Lower perplexity and less-negative mean log probability indicate a sequence more expected under the model, not necessarily more stable or active.
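
To illustrate why the length-normalized metrics compare better across lengths, here is a small sketch. It assumes natural-log probabilities per masked position and the conventional definition of pseudo-perplexity as exp(-mean log probability); the exact formula used by the job is not stated here, and the function name summarize and the numbers are made up for illustration.

```python
import math


def summarize(log_probs):
    """Summarize per-position log probabilities (assumed natural log)."""
    total = sum(log_probs)
    mean_lp = total / len(log_probs)
    return {
        "total_log_probability": total,
        "mean_log_probability": mean_lp,
        # Conventional pseudo-perplexity; an assumption about this job's output.
        "perplexity": math.exp(-mean_lp),
    }


# Two hypothetical sequences with the same per-position fit but different lengths.
short = summarize([-1.0] * 10)
long = summarize([-1.0] * 100)

print(short)
print(long)
```

The summed log probability differs tenfold between the two sequences purely because of length, while the mean log probability and perplexity are identical.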

Common Examples

Family embedding screen: FASTA input, Embedding Output mean.
Residue features: token embeddings for one sequence or FASTA set.
Variant triage: score a related variant panel and inspect local score drops.
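
For the variant triage case, "local score drops" can be found by comparing per-position log probabilities of a variant against its reference. The sketch below is illustrative only: it assumes substitution-only variants of equal length, per-position values already read from the detail tables into plain lists, and a hypothetical threshold; local_drops is not part of the ESMC job.

```python
def local_drops(ref_log_probs, var_log_probs, threshold=-1.0):
    """Return (position, delta) pairs where the variant scores much lower.

    Positions are 1-based. delta = variant - reference, so negative values
    mean the variant residue is less expected under the model at that site.
    """
    if len(ref_log_probs) != len(var_log_probs):
        raise ValueError("reference and variant must have the same length")
    drops = []
    for pos, (ref, var) in enumerate(zip(ref_log_probs, var_log_probs), start=1):
        delta = var - ref
        if delta <= threshold:
            drops.append((pos, delta))
    return drops


# Made-up per-position log probabilities for a 5-residue example.
reference = [-0.5, -0.7, -0.4, -0.9, -0.6]
variant = [-0.5, -0.8, -2.9, -1.0, -0.6]

for pos, delta in local_drops(reference, variant):
    print(f"position {pos}: log-probability change {delta:.2f}")
```

A flagged position is a prompt for closer inspection, not evidence of lost stability or activity.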

Example API params

{
  "mode": "embed",
  "sequence": "MKTAYIAKQRQISFVKSHFSRQDILDL",
  "pool": "mean",
  "batch_size": 1
}
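
A params object like the one above travels as a single JSON string in the params form field. The sketch below builds the parameter-listing URL and the form fields without sending anything; the base URL is taken from the curl example in this document, and the helper names params_url and submit_form are hypothetical.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://subseq.bio"  # as used in the curl example in this document


def params_url(program: str) -> str:
    """URL for GET /api/v1/program/params, which lists the canonical fields."""
    return f"{BASE_URL}/api/v1/program/params?{urlencode({'program': program})}"


def submit_form(program: str, params: dict) -> dict:
    """Form fields for POST /api/v1/job/submit; params is one JSON string."""
    return {"program": program, "params": json.dumps(params, separators=(",", ":"))}


params = {
    "mode": "embed",
    "sequence": "MKTAYIAKQRQISFVKSHFSRQDILDL",
    "pool": "mean",
    "batch_size": 1,
}
form = submit_form("esmc", params)

print(params_url("esmc"))
print(form["params"])
```

The resulting fields can be passed to any HTTP client as form data, together with the Authorization: Bearer header.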

Caveats

ESMC uses sequence context only; it does not directly model ligands, cofactors, PTMs, assay conditions, or oligomeric state.
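Interpret scores comparatively within a controlled set of related sequences, such as variants of one parent, rather than as absolute measures across unrelated proteins.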
This job returns embeddings and scores, not folded structures or generated sequences.

Advanced Submit

Advanced submit is still available for direct program arguments through POST /api/v1/job/submit-advanced. Prefer canonical configuration unless you need exact low-level arguments or are reproducing a known command line.

Advanced submit can run the ESMC wrapper directly for batch scoring or embedding layouts not covered by canonical configuration.
Keep FASTA records unaligned and avoid gaps or stop symbols.

curl -X POST https://subseq.bio/api/v1/job/submit \
  -H "Authorization: Bearer <api_key>" \
  -F program=esmc \
  -F 'params={"mode":"embed_inputs","input_fasta":"sequences.fasta","pool":"mean","batch_size":1}'