Protein Engineering

What generative protein design actually means for enzyme catalysis

[Image: protein ribbon structure with reaction pathway visualization]

The phrase "generative protein design" has developed a bad habit of meaning whatever the speaker wants it to mean. In practice it usually describes one of three quite distinct things: structure generation (producing a backbone or folded conformation from scratch), sequence generation conditioned on a target fold, or function-conditioned generation, where the target is a chemical transformation rather than a shape. For enzyme engineering, only the third category actually solves your problem, and the gap between the first two and the third is larger than most teams realize when they first encounter the literature.

We built Fermvyne specifically because that gap was causing real workflow friction. Labs were running promising-looking generative pipelines, getting sequences with excellent predicted folds and clean predicted solubility, then watching most variants fail at the activity screen. The fold was right. The function wasn't encoded anywhere in the generation process.

Why fold-first generation misleads enzyme engineers

Structure-based design tools are trained on experimentally resolved structures from the PDB: RFdiffusion generates backbones, and ProteinMPNN designs sequences for a given backbone. Neither is a protein language model in the strict sense, and what both implicitly optimize is whether the output is compatible with a plausible, stable fold. That's genuinely useful for de novo protein design: scaffolds, binding proteins, structural domains. For enzymes it's necessary but far from sufficient.

An enzyme does its job not through fold alone, but through a catalytic mechanism that depends on precise residue geometry at the active site, electrostatic environment tuned to transition-state stabilization, and often a cofactor coordination architecture that must survive the mechanical stress of substrate binding and product release. Fold stability and catalytic competence are correlated in natural proteins because evolution co-optimized both. The moment you start generating de novo or far-from-homolog sequences, that correlation weakens substantially.

The quantitative consequence: in our internal benchmarking against a dataset of 2,800 experimentally tested enzyme variants across seven EC classes, structure-conditioned generation alone produced active variants at a rate of roughly 9–14%, where "active" means more than 20% of wild-type specific activity. Conditioning on EC number and substrate SMILES prior to structure generation lifted that to 31–38% across the same enzyme families. Fold quality was the same in both cases. The functional context wasn't.
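To make the activity threshold concrete, here is a minimal sketch of how an active-variant rate could be computed from screening data. The record layout, the names, and the example numbers are illustrative assumptions, not our benchmarking code or data; only the 20% threshold comes from the text above.

```python
from dataclasses import dataclass

# Threshold from the text: a variant counts as active when it retains
# more than 20% of wild-type specific activity.
ACTIVE_THRESHOLD = 0.20


@dataclass
class VariantResult:
    variant_id: str            # hypothetical identifier
    specific_activity: float   # measured value, e.g. U/mg
    wild_type_activity: float  # same units, same assay


def is_active(result, threshold=ACTIVE_THRESHOLD):
    """Return True if the variant exceeds `threshold` of wild-type activity."""
    if result.wild_type_activity <= 0:
        raise ValueError("wild-type activity must be positive")
    return result.specific_activity / result.wild_type_activity > threshold


def active_fraction(results, threshold=ACTIVE_THRESHOLD):
    """Fraction of screened variants that pass the activity threshold."""
    if not results:
        raise ValueError("no screening results supplied")
    return sum(is_active(r, threshold) for r in results) / len(results)


# Made-up screening results, for illustration only.
screen = [
    VariantResult("v1", 4.2, 10.0),  # 42% of wild type: active
    VariantResult("v2", 1.5, 10.0),  # 15% of wild type: inactive
    VariantResult("v3", 1.0, 10.0),  # 10% of wild type: inactive
    VariantResult("v4", 0.0, 10.0),  # no detectable activity
]
print(f"active fraction: {active_fraction(screen):.0%}")
```

The point of fixing the threshold in one place is that hit rates from different campaigns are only comparable when they use the same definition of "active".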

EC numbers as a generation prior — what they actually encode

The Enzyme Commission (EC) number system classifies enzyme-catalyzed reactions at four hierarchical levels: class (the general reaction type), subclass, sub-subclass, and a serial number that identifies the specific enzyme within its sub-subclass. EC 1.1.1.X covers oxidoreductases acting on the CH-OH group of donors with NAD+ or NADP+ as acceptor, a description that immediately constrains the active site geometry, cofactor-binding residue requirements, and transition-state topology an engineer cares about.
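The four-level hierarchy can be sketched as a small parser that turns an EC string into its levels and treats trailing levels as undefined. This is an illustrative helper, not part of any Fermvyne interface; the names `ECNumber` and `parse_ec` and the accepted wildcard symbols are assumptions.

```python
from typing import List, NamedTuple, Optional

# Symbols treated as "level not specified" (an assumption for this sketch).
WILDCARDS = {"-", "X", "x"}


class ECNumber(NamedTuple):
    ec_class: int                # level 1: general reaction type
    subclass: Optional[int]      # level 2
    sub_subclass: Optional[int]  # level 3
    serial: Optional[int]        # level 4: specific enzyme

    @property
    def depth(self) -> int:
        """Number of levels that are actually defined (1 to 4)."""
        return sum(level is not None for level in self)


def parse_ec(text: str) -> ECNumber:
    """Parse strings such as 'EC 1.1.1.X', '4.1.1.72', or 'EC 1.1.1'."""
    body = text.strip()
    if body.upper().startswith("EC"):
        body = body[2:].strip()
    parts = body.split(".")
    if not 1 <= len(parts) <= 4:
        raise ValueError(f"expected 1 to 4 levels in {text!r}")
    levels: List[Optional[int]] = []
    for part in parts:
        if part in WILDCARDS:
            levels.append(None)
        elif levels and levels[-1] is None:
            raise ValueError(f"defined level after an undefined one in {text!r}")
        else:
            # int() raises ValueError for anything non-numeric,
            # including preliminary serials such as 'n1'.
            levels.append(int(part))
    levels += [None] * (4 - len(levels))
    # Seven top-level classes exist (translocases are class 7).
    if levels[0] is None or not 1 <= levels[0] <= 7:
        raise ValueError(f"EC class must be between 1 and 7 in {text!r}")
    return ECNumber(*levels)


print(parse_ec("EC 1.1.1.X"))
print(parse_ec("4.1.1.72").depth)
```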

When we treat EC numbers as generation priors, we're encoding this mechanistic knowledge directly into the conditioning signal. Our transformer architecture represents the EC number as a hierarchical embedding — the four levels are embedded separately, then combined with learned attention weights that reflect how conserved each level is across the training corpus. EC class (level 1) contributes the most constraint; EC serial number (level 4) contributes the most specificity. For novel enzyme targets where level 4 is undefined, we can still generate competent sequences against a level-2 or level-3 prior.
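As a rough illustration of the idea, the sketch below combines per-level embeddings with normalized weights and masks out undefined levels so that a level-3 prior still yields a usable vector. Everything in it is an assumption made for illustration: the embedding dimension, the fixed logits, and the pseudo-random vectors stand in for learned parameters, and a plain masked softmax stands in for learned attention. It is not the production architecture.

```python
import math
import random

EMBED_DIM = 8  # toy dimension for the sketch

# Hypothetical per-level logits. In a trained model these would be learned;
# here they are fixed so that level 1 receives the largest weight.
LEVEL_LOGITS = [2.0, 1.0, 0.5, 0.0]


def level_embedding(prefix):
    """Stand-in for a learned embedding table keyed by EC prefix, e.g. '1.1'."""
    rng = random.Random(prefix)  # string seed gives a repeatable vector
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]


def masked_softmax(logits, mask):
    """Softmax over the entries where mask is True; masked entries get 0."""
    kept = [value for value, keep in zip(logits, mask) if keep]
    if not kept:
        raise ValueError("at least one level must be defined")
    peak = max(kept)
    exps = [math.exp(v - peak) if keep else 0.0 for v, keep in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


def ec_prior(levels):
    """Combine level embeddings for e.g. [1, 1, 1, None] into one vector."""
    depth = next((i for i, lv in enumerate(levels) if lv is None), len(levels))
    if depth == 0 or any(lv is not None for lv in levels[depth:]):
        raise ValueError("undefined levels must be trailing and level 1 is required")
    mask = [i < depth for i in range(len(levels))]
    weights = masked_softmax(LEVEL_LOGITS, mask)
    combined = [0.0] * EMBED_DIM
    for i in range(depth):
        prefix = ".".join(str(lv) for lv in levels[: i + 1])
        vector = level_embedding(prefix)
        combined = [c + weights[i] * v for c, v in zip(combined, vector)]
    return combined, weights


vector, weights = ec_prior([1, 1, 1, None])
print([round(w, 3) for w in weights])
```

Keying each level's embedding by its full prefix (`1`, `1.1`, `1.1.1`) keeps sibling sub-subclasses distinct while letting them share their parent levels' vectors.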

We're not claiming EC-conditioned generation is a solved problem — the confidence intervals at EC level 4 for underrepresented reaction classes (particularly some lyases and ligases with sparse training data) remain wide, and we're transparent about that on a per-query basis. But the practical implication for your lab is that "generate me something in EC 1.1.1 that accepts this substrate SMILES" is a tractable query that produces a better starting point than any fold-first approach for that reaction class.

What the transformer actually sees during generation

The Fermvyne generation model is a masked language model trained on ~420 million residue tokens from curated enzyme sequences, with generation conditioned on three input types: the EC embedding described above, a SMILES-derived substrate fingerprint (extended-connectivity fingerprint, ECFP4 at radius 2), and optionally a reference protein embedding if the user wants to stay close to a known scaffold.

The SMILES conditioning deserves explanation. We don't pass the raw SMILES string. Instead we convert it to a 2048-bit ECFP4 fingerprint and map that into the same embedding space as the EC prior. The reason is that the same molecule can be written as many different SMILES strings (canonical vs non-canonical, different ring notation conventions), and we want the model to learn substrate-binding requirements from the molecular topology, not from string patterns. After fine-tuning on experimentally validated active site geometries from a curated subset of PDB structures, the substrate embedding captures functional group identity and arrangement in a way that correlates with binding pocket requirements.
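The string-independence point can be checked with the open-source RDKit toolkit, as in the sketch below. It assumes RDKit is installed and uses its Morgan fingerprint generator with radius 2 and 2048 bits as the ECFP4 equivalent. The example molecule is an arbitrary branched 2-keto acid chosen for illustration, and the mapping from fingerprint to embedding space is not shown.

```python
from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

# Morgan fingerprint with radius 2 is the usual ECFP4 equivalent in RDKit.
_MORGAN = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)


def substrate_fingerprint(smiles):
    """Return a 2048-bit Morgan (radius 2) fingerprint for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return _MORGAN.GetFingerprint(mol)


# The same molecule (4-methyl-2-oxopentanoic acid) written two ways.
fp_a = substrate_fingerprint("CC(C)CC(=O)C(=O)O")
fp_b = substrate_fingerprint("OC(=O)C(=O)CC(C)C")

print(fp_a.GetNumBits())
print(list(fp_a.GetOnBits()) == list(fp_b.GetOnBits()))
```

Because the fingerprint is computed from the parsed molecular graph, the two spellings set the same bits, which is the property that makes it a safer conditioning input than the raw string.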

A practical scenario: a team we worked with in late 2024 was engineering a 2-keto acid decarboxylase variant (EC 4.1.1.72) to accept a branched C6 substrate for a flavor compound synthesis route. Their reference enzyme had 73% sequence identity to enzymes with published structures, but the branched substrate introduced a steric clash in the binding pocket that their homology-model-based redesign didn't resolve. We ran generation conditioned on EC 4.1.1.72 plus the SMILES of the target substrate, with their reference enzyme as scaffold prior. The top-ranked outputs shifted two binding-pocket residues that hadn't been on their mutagenesis list. Six of their ten top-ranked synthesis orders showed decarboxylase activity above 40% of reference on the new substrate, a result their prior round hadn't achieved in three months of rational design.

The honest picture: what generative design doesn't fix

We're not saying generative protein design eliminates wet-lab validation — it categorically does not. What it changes is the composition of your synthesis queue. Instead of ordering 96 rational mutants based on structural intuition and hoping 5–8 are active, you order 24 generated candidates ranked by predicted activity, predicted solubility, and predicted Tm, and expect 7–10 to pass your primary screen. The calendar time savings come from running fewer synthesis-and-test rounds, not from skipping any of them.
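The arithmetic behind that comparison is worth spelling out. The sketch below uses only the counts quoted above (5–8 hits from 96 rational mutants, 7–10 hits from 24 generated candidates); the function name is ours and purely illustrative.

```python
def hit_rate_range(hits_low, hits_high, ordered):
    """Return (low, high) hit rates for a synthesis queue of `ordered` constructs."""
    if ordered <= 0:
        raise ValueError("queue size must be positive")
    return hits_low / ordered, hits_high / ordered


queues = {
    "rational, 96 mutants": (5, 8, 96),
    "generated, 24 candidates": (7, 10, 24),
}

for label, (low_hits, high_hits, ordered) in queues.items():
    low, high = hit_rate_range(low_hits, high_hits, ordered)
    # Constructs ordered per expected hit: fewer is better.
    best, worst = ordered / high_hits, ordered / low_hits
    print(f"{label}: {low:.1%} to {high:.1%} hit rate, "
          f"{best:.1f} to {worst:.1f} constructs per hit")
```

On these figures the rational queue yields roughly 5% to 8% hits (12 to 19 constructs per hit), while the generated queue yields roughly 29% to 42% (about 2.4 to 3.4 constructs per hit), which is where the per-round savings come from.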

There are also enzyme classes where generation remains genuinely hard. Multi-subunit complexes where catalysis depends on quaternary assembly are poorly served by single-chain generation. Enzymes whose activity is dominated by post-translational modifications (glycosylation, phosphorylation) don't benefit from sequence generation alone — the modification context is absent from the training signal. Radical SAM enzymes and B12-dependent enzymes, with their exquisitely sensitive metal-radical intermediates, currently sit at the edge of our confidence range.

For the oxidoreductases, hydrolases, and transferases that represent the majority of industrial biocatalysis campaigns, the combination of EC conditioning + substrate SMILES fingerprinting + expression property prediction produces a generation pipeline that consistently outperforms iterative rational design in terms of validated hit rate per synthesis dollar. That's the practical claim we stand behind.

Getting reaction context into your design brief

The most common mistake we see when labs first interact with the Fermvyne API is submitting a target enzyme with no substrate specification. They provide the EC number and ask for high-Tm variants. We can produce those — but without the substrate context we're optimizing for thermal stability in a functional vacuum, which is exactly the fold-without-function failure mode described above.

A complete design brief in our system has three components, two of them required: (1) the target EC number, or a reaction SMARTS if the reaction is novel enough to lack a defined EC number, (2) the substrate SMILES, including any known co-substrates or required cofactors such as NADPH, and (3) a reference sequence if you want generation anchored near a known scaffold rather than sampling freely from the EC class distribution. The third is optional, since free generation away from known scaffolds sometimes surfaces genuinely novel active site architectures, but the first two are mandatory for the generation to have functional coherence.
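One way to catch an incomplete brief before submission is a small local check like the sketch below. The class and field names are hypothetical and do not reflect the actual Fermvyne API schema; the sketch only encodes the rule stated above, that a reaction definition and a substrate are mandatory while the reference sequence is optional.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DesignBrief:
    """Hypothetical container for the design brief components."""
    substrate_smiles: str
    ec_number: Optional[str] = None        # e.g. "1.1.1" or "4.1.1.72"
    reaction_smarts: Optional[str] = None  # alternative when no EC number exists
    cofactor_smiles: List[str] = field(default_factory=list)
    reference_sequence: Optional[str] = None  # optional scaffold anchor

    def problems(self) -> List[str]:
        """Return human-readable issues; an empty list means the brief is usable."""
        issues = []
        if not (self.ec_number or self.reaction_smarts):
            issues.append("provide an EC number or a reaction SMARTS")
        if not self.substrate_smiles.strip():
            issues.append("substrate SMILES is required")
        return issues


# The failure mode described above: EC number given, substrate missing.
incomplete = DesignBrief(substrate_smiles="", ec_number="1.1.1")
print(incomplete.problems())

# A brief with both mandatory components and no scaffold anchor.
complete = DesignBrief(substrate_smiles="CC(C)CC(=O)C(=O)O", ec_number="4.1.1.72")
print(complete.problems())
```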

The distinction between generating sequences that fold and generating sequences that catalyze is the core of what makes enzyme generative design different from protein generative design in general. It's a distinction that sounds obvious once stated, but took us most of the first year building Fermvyne to get right — and still requires honest calibration about which reaction classes are tractable and which aren't yet.