Directed evolution versus generative AI: the right question is not which one

June 5, 2025

A question we get from protein engineers evaluating Fermvyne, usually in the first conversation: "Are you replacing directed evolution?" The framing is understandable. There's been considerable noise in the protein engineering space about computational methods displacing classical evolution-based approaches, and anyone running a directed evolution campaign needs to know whether they're betting on a deprecated methodology.

The short answer is no. The longer answer is more useful: directed evolution and generative sequence design solve different parts of the same engineering problem, and the teams getting the most out of generative design are typically using it to improve the starting point for evolution, not to skip evolution entirely. The question worth asking isn't "which one" — it's "where in the workflow does each approach have the higher information yield per dollar spent."

What directed evolution actually does exceptionally well

Directed evolution is agnostic to mechanism. You don't need to understand why a mutation improves the property you're selecting for — you need a reliable selection or screening assay, a source of sequence diversity (error-prone PCR, DNA shuffling, saturation mutagenesis at target sites), and enough throughput to cover the diversity you generated. The mechanism reveals itself after the fact, in the sequences that survive selection.

This mechanistic agnosticism is not a limitation — it's the entire point. For properties where the structure-function relationship is poorly understood (enantioselectivity on a novel substrate class, activity in a complex organic solvent mixture, functional expression in a specific non-standard host), directed evolution can navigate toward solutions that no rational or computational approach would have predicted because it doesn't require a predictive model of the target property. It requires only a selection pressure and sequence diversity.

The classic directed evolution workflows (error-prone PCR, or epPCR, for random mutagenesis; gene shuffling for recombining diversity from multiple homologs; focused saturation mutagenesis at hotspot positions) have been optimized over decades of practice. The throughput limits are well understood. For high-throughput screening assays (fluorescence, absorbance, growth coupling), you can practically screen 10^5–10^6 variants per round in a well-equipped lab. For low-throughput assays (HPLC product quantification, GC-MS), you're limited to 10^2–10^3 variants per round, and the number of rounds you can afford becomes the constraint.

Where generative design adds value in the evolution workflow

The limitation of unguided directed evolution is that most of the diversity you generate is non-functional. In a random single-nucleotide mutagenesis library of an enzyme, typically only 1–5% of variants meaningfully improve the target property. For properties where the functional sequence space is sparse (high thermostability combined with maintained activity, or broad substrate scope with retained enantioselectivity), that rate drops to 0.1–0.5%. You're burning screening capacity exploring regions of sequence space that don't contain solutions.
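To make the effect of hit rate on screening capacity concrete, here is a minimal sketch in Python. It assumes each screened variant is an independent draw with the same probability of being improved (a simple binomial model, which is our simplification and not a claim about any particular library). The function names are illustrative.

```python
def expected_hits(n_screened: int, hit_rate: float) -> float:
    """Expected number of improved variants among n_screened clones."""
    return n_screened * hit_rate


def prob_at_least_one_hit(n_screened: int, hit_rate: float) -> float:
    """Probability that a round contains at least one improved variant,
    assuming independent draws with a constant hit rate."""
    return 1.0 - (1.0 - hit_rate) ** n_screened


# Throughput and hit-rate values taken from the ranges quoted in the text.
for n_screened in (10**2, 10**3, 10**5):
    for hit_rate in (0.05, 0.01, 0.001):
        print(
            f"screened={n_screened:>7,d}  hit_rate={hit_rate:.3f}  "
            f"expected_hits={expected_hits(n_screened, hit_rate):>8.1f}  "
            f"P(>=1 hit)={prob_at_least_one_hit(n_screened, hit_rate):.3f}"
        )
```

Under this model, a low-throughput round of 100 variants at a 0.1% hit rate has only about a 10% chance of containing any improved variant, and even 1,000 variants gives about 63%. At 10^5 variants the same hit rate yields roughly 100 expected hits, which is why sparse functional space hurts low-throughput assays the most.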

Generative design's contribution is to bias the starting point and the library composition toward higher-density functional regions of sequence space. Suppose you are about to build a saturation mutagenesis library at 10 positions in the binding pocket. If generative predictions first identify which 3–4 of those positions are most likely to tolerate diversity without destroying activity, the library can target only the positions with the highest expected information return. You're still doing directed evolution (the library, the screening, the selection), but you're seeding it with a more informed starting scaffold and focusing diversity where the model predicts it's tolerated.
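The arithmetic behind narrowing 10 positions to 3–4 can be sketched as follows. This is a rough illustration under stated assumptions: every chosen position is fully randomized, NNK degenerate codons are used (32 codons per position), all codon combinations are equally likely, and the number of clones needed for an expected fractional coverage F of a library of size V is approximated by N = -V ln(1 - F). The function names are ours.

```python
import math


def protein_variants(n_positions: int, n_amino_acids: int = 20) -> int:
    """Distinct protein sequences when n_positions are fully randomized."""
    return n_amino_acids ** n_positions


def codon_variants(n_positions: int, codons_per_position: int = 32) -> int:
    """Distinct DNA sequences, assuming NNK codons (32 per position)."""
    return codons_per_position ** n_positions


def clones_for_coverage(library_size: int, coverage: float = 0.95) -> int:
    """Clones to screen for the given expected coverage of an equiprobable
    library, using N = -V * ln(1 - F)."""
    return math.ceil(-library_size * math.log(1.0 - coverage))


for n_positions in (10, 4, 3):
    dna = codon_variants(n_positions)
    print(
        f"positions={n_positions:>2d}  "
        f"protein_variants={protein_variants(n_positions):.3e}  "
        f"codon_variants={dna:.3e}  "
        f"clones_for_95%_coverage={clones_for_coverage(dna):.3e}"
    )
```

With these assumptions, full saturation at 10 positions gives on the order of 10^13 protein variants, far beyond any screen. Three positions needs roughly 10^5 clones for 95% coverage, which fits a high-throughput assay, while four positions needs about 3 million clones with NNK codons and so sits at or above the top of the 10^5–10^6 range quoted earlier. Reduced codon sets would lower these numbers; that choice is outside the scope of this sketch.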

The practical result: labs that use generation to inform the directed evolution starting point typically run one to two fewer rounds than their baseline, because round 1 starts from a scaffold that's already closer to the target property profile. We've seen this pattern most clearly in thermostability campaigns where the target is maintaining activity at 60°C: starting from a computationally generated scaffold with a predicted Tm of 68–72°C (rather than the wild-type 52°C) means the first directed evolution round has a smaller improvement gap to close. Rather than needing 3–4 rounds to reach the target, labs often get there in 1–2.

The honest failure mode of generation without evolution

We need to be direct about where purely computational generation fails, because overstating the capability is bad for the field and bad for the labs that build their experimental plans around unrealistic predictions.

Generative models trained on existing protein sequence space cannot reliably design enzymes for reaction classes with very sparse training data. For reactions without close analogs in UniProt, BRENDA, or other characterized enzyme databases, the model is essentially extrapolating from distant relatives, and the quality of its output degrades rapidly with evolutionary distance. We flag low-coverage reaction classes explicitly, but the fundamental limitation remains: the model can only extrapolate as far as its training data supports.

There is also a specific failure mode for multi-property optimization that doesn't get discussed enough: when you simultaneously optimize for activity, thermostability, solubility, and substrate specificity in a single generation pass, the predicted Pareto front can be biased by correlations in the training data that don't reflect true mechanistic tradeoffs. If the thermostable variants in the training set for an enzyme family happen to come from thermophiles that also have high surface charge density, the model will generate variants with high surface charge even if the thermostability mechanism in the target family is actually hydrophobic core packing. The model learned the correlation; it didn't learn the mechanism.

This is precisely where directed evolution is irreplaceable: it's running in the actual physicochemical and biological context of your target reaction and host. No prediction model produces a higher signal-to-noise ratio than direct experimental selection under your actual conditions.

Library design: where the two methods converge

The most productive framing we've arrived at, after working with labs at various stages of enzyme engineering projects, is treating generative design as a library design tool rather than a replacement for experimental screening. Instead of asking "generate me an enzyme that works at 65°C with this substrate," ask "generate the 24 sequences most likely to be diverse in functional space and most likely to survive my first expression and activity screen — then I'll evolve the best of those."

This framing changes how you interact with the generation output. You're not hoping the top-ranked sequence is your final production enzyme. You're using the ranked, filtered candidate list as the seeding set for an evolution campaign where you know each sequence in the seed set is predicted to (a) express solubly, (b) fold stably at your target temperature, and (c) have the right active site geometry for your substrate. The evolution then navigates from that seeding set rather than from an arbitrary wild-type.
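One way to picture seed-set selection is a filter followed by a greedy diversity pick. The sketch below is illustrative only and is not a description of how Fermvyne ranks candidates: the `Candidate` fields, the thresholds, and the use of Hamming distance on equal-length aligned sequences are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # All fields are hypothetical model outputs, named for illustration.
    name: str
    sequence: str          # assumed aligned to a common length
    p_soluble: float       # predicted probability of soluble expression
    predicted_tm: float    # predicted melting temperature in deg C
    active_site_ok: bool   # predicted active site geometry fits substrate
    score: float           # overall ranking score from the model


def hamming_fraction(a: str, b: str) -> float:
    """Fraction of differing positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def select_seed_set(candidates, k=24, min_p_soluble=0.5, min_tm=60.0):
    """Filter on predicted properties, then pick up to k diverse seeds.

    Greedy farthest-point selection: start from the top-scoring candidate,
    then repeatedly add the candidate whose nearest already-selected
    neighbor is farthest away, breaking ties by score.
    """
    passing = [
        c for c in candidates
        if c.p_soluble >= min_p_soluble
        and c.predicted_tm >= min_tm
        and c.active_site_ok
    ]
    if not passing:
        return []
    passing.sort(key=lambda c: c.score, reverse=True)
    selected, remaining = [passing[0]], passing[1:]
    while remaining and len(selected) < k:
        best = max(
            remaining,
            key=lambda c: (
                min(hamming_fraction(c.sequence, s.sequence) for s in selected),
                c.score,
            ),
        )
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data, invented purely to exercise the function.
toy = [
    Candidate("a", "MKTAYIAK", 0.9, 68.0, True, 0.95),
    Candidate("b", "MKTAYIAR", 0.8, 70.0, True, 0.90),  # near-duplicate of a
    Candidate("c", "MRSGWLAK", 0.7, 66.0, True, 0.60),  # more distant from a
    Candidate("d", "MKTAYIAK", 0.2, 71.0, True, 0.99),  # fails solubility
]
seed = select_seed_set(toy, k=2)
print([c.name for c in seed])
```

In the toy run, candidate d is removed by the solubility filter despite its high score, and the second seed is the more distant c instead of the near-duplicate b. That is the design choice the framing implies: the seed set is chosen for coverage of functional space, not just for the top of the ranking.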

Error-prone PCR on a generated seed sequence with a predicted Tm of 67°C and confirmed activity produces a library where the starting point is already beyond what three conventional directed evolution rounds on wild-type typically achieve. You're not compressing the biology out of the process — you're front-loading the computational reasoning so the experimental rounds start from a better position.

What this means for how you should plan a campaign

If your enzyme engineering campaign has a high-throughput screening assay (absorbance-based, fluorescence-based, or growth coupling), directed evolution alone is probably adequate if you have time and budget for 3–4 rounds. If your assay is low throughput, or if you need to hit tight multi-property specifications (specific activity AND thermostability AND expression yield), starting with generated candidates and evolving from there compresses the total rounds required.

If your target reaction class is well-represented in the enzyme databases and you have a clear substrate SMILES, generation produces high-confidence candidates that are worth testing as starting scaffolds. If your reaction class is novel or under-characterized, use generation conservatively — as one input to scaffold selection alongside literature and homology — and weight your experimental selection accordingly.

The labs that waste time in this space are usually the ones that tried to skip evolution entirely because a generated sequence looked promising in prediction, or the ones that refused to use any computational pre-filtering because "directed evolution already works." Both extremes are suboptimal. The real efficiency comes from putting the two approaches in the right order: generation to choose and focus the starting point, evolution to optimize under your actual conditions.