How we benchmark enzyme generation models without cherry-picking

September 4, 2025

Every generative protein design company has benchmark numbers. The question is whether those numbers mean anything. We have spent a lot of time thinking about what honest benchmarking looks like in this domain — not because we are unusually virtuous, but because the alternative (inflated or cherry-picked benchmarks) will eventually burn any trust we have built with the protein engineers who use our platform. This post explains the choices we made and where we think the field has systemic problems.

The data leakage problem is worse than people admit

Protein sequence databases are not static. UniProt adds new entries and revises existing annotations with every release, and a functional annotation often arrives long after the sequence itself was deposited. If your training data was fixed at a cutoff date and your holdout set was built from a later snapshot of the same database, an entry can look "novel" because its experimental characterization postdates the cutoff while the sequence itself was already sitting in the training set. The model has then effectively memorized the sequence it is being tested on, even though the functional annotation was added later.
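A minimal sketch of the check this implies, under stated assumptions: the `Entry` record and its fields (`accession`, `sequence`, `annotated_on`) are hypothetical, and matching is by exact sequence only. Near-duplicates would need an identity-based clustering step that is not shown here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Entry:
    accession: str      # hypothetical identifier
    sequence: str       # amino-acid sequence
    annotated_on: date  # date the functional annotation was added


def split_temporal_candidates(candidates, training_sequences, cutoff):
    """Split post-cutoff annotations into clean and leaked holdout entries.

    An entry is "leaked" when its annotation is newer than the cutoff
    but its exact sequence was already present in the training set.
    """
    seen = set(training_sequences)
    clean, leaked = [], []
    for entry in candidates:
        if entry.annotated_on <= cutoff:
            continue  # annotation predates the cutoff: not a temporal candidate
        if entry.sequence in seen:
            leaked.append(entry)
        else:
            clean.append(entry)
    return clean, leaked
```

Only the `clean` list is safe to score as a temporal holdout; the `leaked` list is worth reporting separately, since its size shows how much an unfiltered split would have overstated performance.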

More concretely: a sequence-based holdout split at 30% sequence identity to any training example sounds rigorous. In practice, if your training set is large enough and covers enzyme families well, the 30% threshold still allows structural homologs to act as near-neighbors. A generative model conditioned on EC number 1.1.1.x with a structurally similar enzyme in training will perform much better on that holdout family than on a genuinely novel enzyme class it has only seen sparse examples of. Reporting a single aggregate hit rate obscures this completely.
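To see how an aggregate can hide a family-level failure, here is a small sketch. The data is invented purely for illustration (it is not a benchmark result), and the family labels `well_covered` and `sparse` are hypothetical.

```python
from collections import defaultdict


def hit_rates_by_family(results):
    """Compute aggregate and per-family hit rates.

    results: iterable of (family, is_hit) pairs.
    Returns (aggregate_rate, {family: rate}).
    """
    counts = defaultdict(lambda: [0, 0])  # family -> [hits, total]
    for family, is_hit in results:
        counts[family][0] += int(is_hit)
        counts[family][1] += 1
    total_hits = sum(hits for hits, _ in counts.values())
    total = sum(n for _, n in counts.values())
    aggregate = total_hits / total if total else float("nan")
    per_family = {family: hits / n for family, (hits, n) in counts.items()}
    return aggregate, per_family


# Invented toy data: 90 designs from a well-covered family (45 hits)
# and 10 designs from a sparsely covered family (0 hits).
toy = [("well_covered", i < 45) for i in range(90)] + [("sparse", False)] * 10
aggregate, per_family = hit_rates_by_family(toy)
print(f"aggregate: {aggregate:.2f}")
for family, rate in sorted(per_family.items()):
    print(f"{family}: {rate:.2f}")
```

With these made-up counts the aggregate is 0.45, which looks respectable, while the sparse family sits at zero. An engineer working in that family would learn nothing useful from the headline number.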

Our position: aggregate benchmark numbers reported without family-level breakdown are close to useless for a protein engineer trying to decide whether to use a tool for their specific application. We always report per-family confidence and performance separately, and we call out enzyme families where our training data is thin and our confidence intervals are consequently wide.
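One way to make "thin data means wide intervals" concrete is a binomial confidence interval per family. The sketch below uses the Wilson score interval, which is a choice made for this example rather than a description of our production statistics; the function name and the default `z = 1.96` (roughly 95% coverage) are likewise assumptions.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion hits / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= hits <= n:
        raise ValueError("hits must be between 0 and n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Same observed rate (50%), very different amounts of evidence.
for hits, n in [(5, 10), (50, 100)]:
    low, high = wilson_interval(hits, n)
    print(f"{hits}/{n}: [{low:.2f}, {high:.2f}]")
```

Two families with the same observed rate can deserve very different levels of trust: the interval for 5 of 10 is far wider than the interval for 50 of 100, and reporting the width alongside the rate makes that visible.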

How we constructed our holdout set

We use a three-tier holdout strategy. The first tier is a temporal holdout: sequences whose experimental functional annotations were added to public databases after our training cutoff. The model could not have seen these annotations, so this tier tests generalization under realistic conditions. The second tier is a family-exclusion holdout: entire enzyme families (at the EC third-digit level) are excluded from training and held out for evaluation. This tests the model's ability to generalize to enzyme classes it has never seen in training. The third tier is a substrate-exclusion holdout: enzymes from families that are well represented in training, but evaluated specifically on substrate classes not present in the training data.

Each tier answers a different question. Tier one answers: how does the model handle sequences from known families where the specific sequence was not seen during training? Tier two answers: can the model generalize to new enzyme architectures? Tier three answers: does substrate-conditioned generation actually work for novel chemistry, or only for chemistry similar to what was in training?
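The tier logic can be sketched as a simple assignment function. Everything named here is hypothetical: the function names, the tier labels, the inputs, and especially the precedence order (family exclusion first, then substrate exclusion, then temporal), which is an assumption made so that each entry lands in exactly one bucket.

```python
def ec_family(ec_number):
    """Truncate an EC number to its first three digits, e.g. '1.1.1.27' -> '1.1.1'."""
    return ".".join(ec_number.split(".")[:3])


def assign_tier(ec_number, substrate_class, annotated_after_cutoff,
                excluded_families, training_substrates):
    """Assign an entry to a holdout tier or to the training set.

    excluded_families: set of three-digit EC families held out entirely.
    training_substrates: set of substrate classes present in training data.
    """
    if ec_family(ec_number) in excluded_families:
        return "tier2_family_exclusion"
    if substrate_class not in training_substrates:
        return "tier3_substrate_exclusion"
    if annotated_after_cutoff:
        return "tier1_temporal"
    return "train"


excluded = {"3.5.2"}
substrates = {"alcohol", "beta-lactam"}
print(assign_tier("1.1.1.27", "alcohol", True, excluded, substrates))
print(assign_tier("3.5.2.6", "beta-lactam", False, excluded, substrates))
print(assign_tier("1.1.1.1", "terpene", False, excluded, substrates))
```

Keeping the buckets mutually exclusive matters for reporting: an entry counted in two tiers would blur exactly the distinction the tiers exist to draw.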

We perform best on tier one. We are honest that tier two is harder and our performance there is more variable. Tier three is where the research is most active for us right now — generalization to genuinely novel substrate classes is a hard problem and the field does not have a clean solution yet.

Expression validation as the ground truth

Computational benchmarks using sequence-based or structure-based metrics are necessary but insufficient. The only ground truth we care about is whether a generated sequence expresses as a soluble, active protein in the host of interest. Everything else is a proxy.

We have an ongoing expression validation workflow where generated enzyme sequences are synthesized and tested in E. coli BL21(DE3) under standard IPTG induction conditions. We track expression yield by SDS-PAGE, soluble fraction by centrifugation, and activity by a substrate-appropriate assay. These results feed back into the model as additional training signal but, critically, we freeze the model before evaluating the next validation batch. That way, benchmark results are never contaminated by the data used to compute them.
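The freeze-then-evaluate rule is easy to state and easy to violate by accident, so it helps to enforce it in bookkeeping. The sketch below is illustrative only: `ValidationLedger`, its methods, and the checkpoint and batch identifiers are all hypothetical names, not a description of our internal tooling.

```python
class ValidationLedger:
    """Track which wet-lab batches each frozen checkpoint was trained on."""

    def __init__(self):
        self._trained_on = {}  # checkpoint id -> frozenset of batch ids

    def freeze(self, checkpoint_id, training_batches):
        """Record a checkpoint together with the batches used to train it."""
        if checkpoint_id in self._trained_on:
            raise ValueError(f"checkpoint {checkpoint_id!r} is already frozen")
        self._trained_on[checkpoint_id] = frozenset(training_batches)

    def can_evaluate(self, checkpoint_id, batch_id):
        """True only if the checkpoint never trained on this batch."""
        if checkpoint_id not in self._trained_on:
            raise KeyError(f"checkpoint {checkpoint_id!r} was never frozen")
        return batch_id not in self._trained_on[checkpoint_id]


ledger = ValidationLedger()
ledger.freeze("ckpt-03", ["batch-01", "batch-02"])
print(ledger.can_evaluate("ckpt-03", "batch-03"))  # new batch, never trained on
print(ledger.can_evaluate("ckpt-03", "batch-02"))  # batch used in training
```

A benchmark number is then reported only for checkpoint and batch pairs where `can_evaluate` returns `True`.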

Our current running expression rate — across all generated sequences where we have wet-lab validation — sits in a range broadly consistent with what you would expect from a high-quality computational pre-screen. We do not report a single number because it varies meaningfully by enzyme family and substrate complexity, and a single aggregate number would be misleading. What we do report is the per-family breakdown when users ask for it.

Why we do not report "hit rate" as a primary metric

Hit rate (the fraction of generated sequences that meet a functional threshold in wet-lab validation) is seductive because it sounds like the bottom line. It is not, for two reasons. First, hit rate is acutely sensitive to threshold definition. "Active" can mean kcat greater than 0.1 s⁻¹ or kcat greater than 10 s⁻¹ depending on the application. A model that generates lots of weak-activity variants looks good by a permissive threshold and terrible by a stringent one. Second, hit rate conflates two different things: the fraction of variants that are active, and the quality distribution of the active variants. A campaign that yields 40% of variants with marginal activity is not better than one that yields 20% of variants with excellent activity and a clear structure-activity relationship.
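The threshold sensitivity is easy to demonstrate. In this sketch the kcat values are invented to mirror the 40% versus 20% example above; they are not measurements, and the function name `hit_rate` is just a label for this illustration.

```python
def hit_rate(kcats, threshold):
    """Fraction of variants whose kcat (in 1/s) is strictly above the threshold."""
    if not kcats:
        raise ValueError("need at least one measured variant")
    return sum(k > threshold for k in kcats) / len(kcats)


# Invented values in 1/s; 0.0 stands for no detectable activity.
campaign_a = [0.2, 0.3, 0.5, 0.15, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # many weak variants
campaign_b = [25.0, 40.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # few strong variants

for threshold in (0.1, 10.0):
    a = hit_rate(campaign_a, threshold)
    b = hit_rate(campaign_b, threshold)
    print(f"kcat > {threshold}: A = {a:.0%}, B = {b:.0%}")
```

At the permissive threshold campaign A looks twice as good as campaign B. At the stringent threshold A drops to zero while B is unchanged. The same data supports opposite rankings depending on where the line is drawn.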

We report hit rate because users ask for it and it is a reasonable sanity check. But the metrics we think matter more are: synthesis rounds to a validated variant meeting specifications, total number of sequences synthesized per successful campaign, and confidence interval width on our predictions — because narrow confidence intervals mean the model is giving you useful signal rather than noise with a lucky mean.
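The two campaign-level metrics can be computed from very simple records. The sketch below rests on assumptions about definitions the text leaves open: a campaign is a list of `(n_synthesized, n_meeting_spec)` tuples, one per synthesis round, and "sequences per successful campaign" divides total synthesis across all campaigns (failed ones included) by the number of successes. The function names and the toy numbers are hypothetical.

```python
def rounds_to_success(rounds):
    """Return the 1-based index of the first round with a variant meeting spec.

    rounds: list of (n_synthesized, n_meeting_spec) tuples in chronological order.
    Returns None if no round produced a qualifying variant.
    """
    for index, (_, n_meeting_spec) in enumerate(rounds, start=1):
        if n_meeting_spec > 0:
            return index
    return None


def sequences_per_successful_campaign(campaigns):
    """Total sequences synthesized across all campaigns per successful campaign.

    Counting failed campaigns in the numerator charges their synthesis
    spend against the successes. Returns None if nothing succeeded.
    """
    total = sum(n for rounds in campaigns for n, _ in rounds)
    successes = sum(rounds_to_success(rounds) is not None for rounds in campaigns)
    return total / successes if successes else None


# Invented toy campaigns, for illustration only.
campaigns = [
    [(24, 0), (24, 1)],  # succeeds in round 2
    [(48, 0), (48, 0)],  # never succeeds
    [(12, 2)],           # succeeds in round 1
]
print([rounds_to_success(c) for c in campaigns])
print(sequences_per_successful_campaign(campaigns))
```

Including failed campaigns in the numerator is deliberate in this sketch: it keeps the metric tied to total synthesis spend rather than to the cost of the campaigns that happened to work.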

What benchmark transparency actually requires

We think benchmark transparency in protein design requires at minimum: explicit description of train/test split methodology; family-level performance breakdown rather than aggregate only; reporting of failures and enzyme classes with poor performance alongside successes; and a clear statement of what the benchmark was designed to measure and what it cannot measure.

We are not saying other groups are being deliberately misleading. We are saying the field has a collective incentive problem where groups that report less nuanced benchmarks look better in the short term, and groups that report more honest numbers look worse even when their models are actually more trustworthy for real-world use. The downstream cost of this is that protein engineers struggle to make calibrated decisions about which tools to trust for their specific problem, and wasted synthesis spend follows.

Transparency is not just a matter of disclosing the bad numbers. It means reporting in a way that lets a user with a specific enzyme design problem make a rational decision about whether our model is likely to be useful for that problem. That requires honest communication about where our confidence is high, where it is lower, and why. We think that approach builds more durable trust than leading with the best-looking aggregate number and hoping users do not probe deeper.

The longer-term picture

As more wet-lab validation data accumulates — from our own workflow and from the broader field — benchmarking methodology will mature. Right now the field is still in a period where model architecture papers tend to dominate over honest applied evaluation. We expect that to shift as practitioners who care about synthesis cost and campaign efficiency gain more influence over what the field considers important to measure. We are building toward a world where the benchmark you report is the thing that is actually predictive of campaign success — not the thing that looks best in a paper abstract.