Protein Engineering

Predicting substrate scope: what our model gets right and where it struggles

October 20, 2025

Substrate scope prediction is the capability that comes up most often in conversations with synthetic biology teams who are evaluating computational enzyme design tools. The question is reasonable: before you commit to engineering an enzyme for a specific reaction, you want to know how confident the model is that the enzyme will actually accept your substrate — and, equally important, what substrates it might also accept that could interfere with your intended pathway.

We have run substrate scope prediction across a growing library of enzyme families for long enough to see clearly where our model performs well and where its uncertainty is genuinely high. This post is a direct account of what we see across the 11 enzyme families we benchmark most frequently. We are going to be specific about which families are difficult and why, because that is more useful than a headline accuracy number.

What substrate scope prediction actually involves

Substrate scope in our model is not a binary accept/reject prediction. We output a predicted kcat/Km ratio relative to a reference substrate, with a confidence interval. For a given enzyme variant and target substrate, we ask: is this substrate likely to be converted at a rate sufficient for your application, and how certain are we?

The prediction draws on three sources: structural predictions of active site geometry (we use the AlphaFold2-derived structure as a starting point, then run a ligand docking estimate for the specific substrate-enzyme pair), learned sequence-to-activity relationships from our training data across that enzyme family, and substrate chemical similarity to substrates with known kinetics for close sequence homologs. These three signals are combined through an ensemble that outputs both a point estimate and an uncertainty estimate.
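One simple way to picture how three signals can yield both a point estimate and an uncertainty estimate is inverse-variance weighting in log space. The sketch below is an illustration under that assumption, not a description of the actual ensemble. The `Signal` class, the `combine` function, and all numbers are hypothetical, and the weighting treats the three signals as independent, which real signals rarely are.

```python
import math
from dataclasses import dataclass


@dataclass
class Signal:
    """One source's estimate of log10(kcat/Km relative to the reference substrate)."""
    name: str
    log10_ratio: float  # point estimate in log10 units
    sd: float           # standard deviation in log10 units


def combine(signals):
    """Inverse-variance weighted mean in log10 space.

    Returns (combined_mean, combined_sd). Signals with a smaller sd
    pull the combined estimate harder toward their own value.
    """
    weights = [1.0 / (s.sd ** 2) for s in signals]
    total = sum(weights)
    mean = sum(w * s.log10_ratio for w, s in zip(weights, signals)) / total
    return mean, math.sqrt(1.0 / total)


# Hypothetical values for one substrate-enzyme pair.
signals = [
    Signal("docking", -0.5, 0.8),
    Signal("sequence_model", -0.2, 0.4),
    Signal("substrate_similarity", -0.3, 0.6),
]

mean, sd = combine(signals)

# Approximate 95% interval, assuming a normal distribution in log10 space.
low = 10 ** (mean - 1.96 * sd)
high = 10 ** (mean + 1.96 * sd)
print(f"relative kcat/Km ~ {10 ** mean:.2f} (interval {low:.2f} to {high:.2f})")
```

Working in log space keeps the interval multiplicative, which matches how a relative kcat/Km is usually read: as a fold change against the reference substrate.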

The uncertainty estimate is not decorative. When we report that our confidence interval spans an order of magnitude for a given substrate-enzyme pair, that is real information: it means you should synthesize more candidates and test them rather than depending on the rank order we provide.
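To make "spans an order of magnitude" concrete, the width of an interval can be measured in log10 units. The helper names and the default threshold of 1.0 below are assumptions chosen for illustration, not values taken from our output format.

```python
import math


def ci_span_orders(low, high):
    """Width of a confidence interval in orders of magnitude (log10 units)."""
    if low <= 0 or high < low:
        raise ValueError("expected 0 < low <= high")
    return math.log10(high / low)


def needs_broader_screen(low, high, max_orders=1.0):
    """Flag pairs whose interval is at least `max_orders` wide (hypothetical cutoff)."""
    return ci_span_orders(low, high) >= max_orders


# Interval from 0.05x to 0.8x of the reference: a 16-fold spread, about 1.2 orders.
print(needs_broader_screen(0.05, 0.8))  # True

# Interval from 0.3x to 0.9x of the reference: a 3-fold spread, about 0.48 orders.
print(needs_broader_screen(0.3, 0.9))   # False
```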

Enzyme families where predictions are reliable

Short-chain dehydrogenases/reductases (SDRs) on aliphatic substrates. This is our strongest area. SDRs are structurally highly conserved, our training data is dense, and aliphatic substrate scope correlates well with active site cavity volume and electrostatic character. For C4-C10 aliphatic alcohols and ketones, our kcat/Km rank ordering is accurate enough that the top-ranked substrate typically agrees with wet-lab results and the confidence intervals are narrow.

Glycoside hydrolases (GHs) on well-annotated glycan substrates. The CAZy database provides deep coverage of experimentally characterized GH families. For GH families with more than a few hundred characterized sequences in our training data, substrate scope across the canonical glycan substrates is predicted well. The difficulty comes at the edges of known substrate space — novel glycan linkages or unusual substituents that fall outside the training distribution.

Cytochrome P450s on substrates closely related to known substrates. P450s are notoriously broad in scope, which makes them useful and also makes comprehensive prediction hard. Where we perform well: substrates within a carbon skeleton class that is well-represented in our training data. Where the confidence intervals widen significantly: substrates that require predicting regioselectivity of hydroxylation on an unfamiliar carbon framework.

Enzyme families where our confidence intervals are honestly wide

Halogenases for non-natural halogenation. Flavin-dependent halogenases (FDHs) are an exciting target for biocatalysis because regioselective halogenation is difficult to achieve with conventional chemistry. The problem is that training data for FDH substrate scope is sparse. Published characterization has focused heavily on tryptophan halogenation; the diversity of non-tryptophan aromatic substrates in our training data is limited. When a team asks us to design an FDH variant for halogenation of a substrate that is chemically similar to tryptophan, we can be useful. For aromatic substrates with different substitution patterns or different heteroatom positions, we report wide confidence intervals and recommend synthesizing a larger candidate panel.

Non-ribosomal peptide synthetase (NRPS) adenylation domains for non-canonical amino acids. NRPS A-domain substrate scope prediction is one of the harder problems in the field. The substrate binding pocket (the 10-residue "nonribosomal code") has been studied extensively, but predicting adenylation activity for amino acids that differ substantially from canonical substrates is still at the limit of what sequence-structure models can do reliably. We can tell you that a given A-domain variant is likely to accept substrates that closely resemble its native amino acid. We cannot yet predict activity on chemically distant amino acid analogs with the confidence we achieve for SDR substrate scope.

Lytic polysaccharide monooxygenases (LPMOs) on recalcitrant substrates. LPMOs are increasingly important for biomass deconstruction applications. Their substrate scope correlates with surface binding geometry rather than classical active site shape complementarity, and predicting binding to complex polysaccharide surfaces from sequence alone is genuinely hard. Our predictions here are useful for ranking closely related substrate pairs but should not be treated as quantitative kinetics estimates.

A specific case: predicting transaminase scope for pharmaceutical intermediates

A team at a growing synthetic biology contract organization asked us to evaluate substrate scope for a panel of transaminase variants they were considering for producing a chiral amine intermediate. The target substrate was a cyclopropyl-containing amino ketone — structurally unusual for the omega-transaminase families we had most training data on.

We returned predictions for 40 variants, ranked by predicted kcat/Km for the target substrate, with explicit confidence intervals. For variants with close homologs in our training data that had been characterized on cyclopropyl substrates, confidence was reasonable. Variants more distant from our characterized training examples were flagged explicitly with wide confidence intervals, and we recommended that the team deprioritize them in synthesis planning unless it had bandwidth for a larger screen.

When the lab tested the top 12 variants, the rank ordering within the high-confidence group was predictive: 8 of the top 10 high-confidence predictions correctly identified the most active variants. The two misses were in the medium-confidence group where we had been explicit about the uncertainty. The low-confidence variants we had recommended deprioritizing were largely inactive, consistent with our warnings.

What this means for how you should use substrate scope predictions

The confidence interval is the most important number in our substrate scope output, more important than the point estimate. A narrow confidence interval on a predicted kcat/Km means you can synthesize fewer variants, because the prediction is reliable enough to guide your synthesis plan. A wide confidence interval means the model is telling you it does not have strong evidence for this pair, and you need to sample more broadly rather than following the rank order strictly.

We are not saying that wide confidence intervals mean the tool is not useful. They mean the tool is calibrated. A model that always outputs narrow confidence intervals regardless of evidence quality is dangerous because you cannot distinguish reliable predictions from noise. A model that acknowledges uncertainty enables you to allocate synthesis budget rationally — concentrate synthesis in regions of prediction space where the signal is strong, and hedge against uncertainty in regions where it is not. That is the decision support we aim to provide, and it requires honest uncertainty quantification rather than confidently wrong predictions with artificially tight bounds.
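As a rough sketch of what allocating a synthesis budget by confidence could look like, the example below splits a fixed number of synthesis slots between narrow-interval variants (taken in rank order) and a hedge drawn from wide-interval variants. Everything here is hypothetical: the `Prediction` fields, the 0.5-order cutoff, the 25% hedge fraction, and the choice to rank uncertain variants by the upper end of their interval. Treat it as one possible heuristic, not a prescribed workflow.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    variant: str
    log10_ratio: float  # point estimate of log10(relative kcat/Km)
    ci_orders: float    # confidence interval width in log10 units


def plan_synthesis(preds, budget, narrow=0.5, hedge_fraction=0.25):
    """Split `budget` synthesis slots into (exploit, hedge) lists of variant names.

    exploit: narrow-interval variants, best point estimate first.
    hedge:   wide-interval variants, best optimistic (upper-bound) estimate first,
             assuming the interval is symmetric in log space.
    """
    confident = sorted(
        (p for p in preds if p.ci_orders <= narrow),
        key=lambda p: p.log10_ratio,
        reverse=True,
    )
    uncertain = sorted(
        (p for p in preds if p.ci_orders > narrow),
        key=lambda p: p.log10_ratio + p.ci_orders / 2,
        reverse=True,
    )
    n_hedge = min(len(uncertain), int(budget * hedge_fraction))
    n_exploit = min(len(confident), budget - n_hedge)
    # Slots the confident group cannot fill go back to the hedge group.
    n_hedge = min(len(uncertain), budget - n_exploit)
    return (
        [p.variant for p in confident[:n_exploit]],
        [p.variant for p in uncertain[:n_hedge]],
    )


# Hypothetical panel of six variants.
preds = [
    Prediction("v1", 0.4, 0.3),
    Prediction("v2", 0.1, 0.4),
    Prediction("v3", -0.2, 0.2),
    Prediction("v4", 0.3, 1.5),
    Prediction("v5", -0.5, 2.0),
    Prediction("v6", 0.0, 1.0),
]

exploit, hedge = plan_synthesis(preds, budget=4)
print("exploit:", exploit)
print("hedge:", hedge)
```

Raising `hedge_fraction` corresponds to having bandwidth for a larger screen; lowering it corresponds to trusting the rank order where intervals are narrow.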