Protein Engineering

Why solubility prediction matters more than sequence identity

April 17, 2025

There's an assumption embedded in most early-stage enzyme engineering campaigns that sequence identity to a well-expressed reference enzyme is a reliable proxy for expressibility. If your reference enzyme expresses solubly in E. coli BL21(DE3) at 37°C with IPTG induction, and your candidate variant is 78% identical to that reference, the reasonable working assumption is that the variant will also express solubly — maybe not at the same yield, but probably not as inclusion bodies.

This assumption fails more often than most labs want to admit. We have expression data across ~8 million recombinant protein expression records (aggregated from published datasets and internal validation partnerships), and for most enzyme families the correlation between percent sequence identity to a solubly expressed reference and soluble expression in E. coli BL21 weakens sharply below 85% identity. At 70% identity, which is still well within what most labs would call "a close homolog," sequence identity predicts soluble expression in our dataset only modestly better than a baseline that uses amino acid composition alone. Sequence identity tracks structural conservation, not the specific residues that determine aggregation propensity in a heterologous host.

What actually drives insoluble expression

In most cases, inclusion body formation in E. coli is a kinetic problem, not a thermodynamic one. Your target enzyme need not be thermodynamically unstable; it might have excellent predicted fold stability. The problem is that productive folding competes unfavorably with off-pathway aggregation when the protein is synthesized faster than the cellular chaperone systems (DnaK/DnaJ/GrpE, GroEL/GroES) can handle the folding intermediates.

The specific sequence features that drive this competition are: (a) long hydrophobic stretches in the N-terminal 100 residues that are exposed before the C-terminal domain has folded and can shield them, (b) strong secondary structure disruptions at domain-linker boundaries that slow co-translational folding, (c) high local hydrophobicity patches in loop regions that are exposed at the surface of the folded protein and that facilitate inter-molecular interactions at high local concentrations in the cytoplasm, and (d) rare codon clusters that cause ribosome pausing at precisely the wrong moments during co-translational folding.

Points (a), (b), and (c) are sequence-encoded and predictable from the amino acid sequence without structure information. Point (d) requires matching the coding sequence against the codon usage table of your expression host, which is a straightforward calculation. None of these four features is captured by percent sequence identity to a reference protein.
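The sliding-window idea behind point (a) can be sketched in a few lines. This is a minimal illustration, not the production feature extractor: it assumes the Kyte-Doolittle hydropathy scale, a 20-residue window, and the first 100 residues, and the function names are hypothetical.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydropathy(seq, window=20, n_term=100):
    """Mean hydropathy of every full window inside the first n_term residues."""
    region = seq[:n_term].upper()
    if len(region) < window:
        return []
    # Unknown symbols (e.g. X) are scored as 0.0; an assumption for this sketch.
    values = [KYTE_DOOLITTLE.get(aa, 0.0) for aa in region]
    total = sum(values[:window])
    means = [total / window]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        means.append(total / window)
    return means


def max_n_terminal_hydropathy(seq, window=20, n_term=100):
    """Most hydrophobic window in the N-terminal region, or None if too short."""
    means = window_hydropathy(seq, window, n_term)
    return max(means) if means else None


if __name__ == "__main__":
    toy = "M" + "L" * 25 + "DEKR" * 30  # made-up sequence for illustration
    print(round(max_n_terminal_hydropathy(toy), 2))
```

A single maximum is only one possible summary of the profile; a classifier could equally consume the whole list of window means.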

What the solubility scoring model actually does

The Fermvyne solubility model is a supervised classifier trained on ~8M expression records mapping protein sequences to binary outcomes (soluble vs. inclusion body) under standard BL21 conditions (37°C, 0.5 mM IPTG, LB media, 4h induction). We also have a smaller dataset (~220,000 records) for lower-temperature expression conditions (16–20°C overnight induction), which allows a secondary classifier for cold-induction scenarios.

The feature set includes 94 sequence-derived predictors: the features mentioned above (N-terminal hydrophobicity profile over a sliding 20-residue window, predicted secondary structure transitions, surface patch hydrophobicity estimated from a lightweight fold prediction), plus aggregation-propensity scores from the PASTA algorithm, charge distribution along the sequence, and codon adaptation index calculated against E. coli K-12 codon frequencies. The model has a two-stage architecture: a convolutional layer over sequence-position-specific features, followed by a gradient-boosted classifier over the global feature summary.
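For readers who have not computed a codon adaptation index before, here is a minimal sketch of the standard definition: the geometric mean of each codon's relative adaptiveness, where a codon's adaptiveness is its frequency divided by the frequency of the most-used synonymous codon. The usage table below is a made-up placeholder covering two amino acids, not real E. coli K-12 frequencies, and the function names are hypothetical.

```python
import math

# PLACEHOLDER frequencies for illustration only; NOT real E. coli values.
# A real calculation needs a full codon usage table for the expression host.
TOY_USAGE = {
    "K": {"AAA": 0.75, "AAG": 0.25},
    "R": {"CGT": 0.40, "CGC": 0.40, "AGA": 0.05, "AGG": 0.15},
}


def relative_adaptiveness(usage):
    """Map each codon to its frequency relative to the best synonymous codon."""
    weights = {}
    for codons in usage.values():
        best = max(codons.values())
        for codon, freq in codons.items():
            weights[codon] = freq / best
    return weights


def cai(cds, weights):
    """Geometric mean of codon weights over a coding sequence.

    Codons missing from the weight table are skipped (in practice this is
    how single-codon amino acids such as Met and Trp are usually handled).
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


if __name__ == "__main__":
    w = relative_adaptiveness(TOY_USAGE)
    print(round(cai("AAACGTAAGAGA", w), 3))
```

A whole-gene CAI averages over position, so it cannot by itself locate the rare codon clusters discussed earlier; a windowed version of the same weights would be one way to look for those.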

AUROC on the held-out test set (stratified by homology cluster): 0.83 for the standard BL21 conditions classifier. Precision at 90% recall: 0.71. This means that if you set the threshold to retain 90% of the variants that would actually express solubly, about 29% of the variants passing the filter on that test set are still false positives, so you'll still order a few insoluble variants. But you'll suppress the majority of known inclusion body-formers before synthesis.
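To make "precision at 90% recall" concrete, here is a small sketch of how that operating point can be read off a scored validation set. It assumes binary labels (1 = soluble) and ignores tied scores for simplicity; the function name and the toy data are hypothetical.

```python
def precision_at_recall(scores, labels, target_recall=0.90):
    """Return (precision, threshold) at the first cutoff reaching target recall.

    Variants are ranked by score, highest first, and the cutoff is lowered
    one variant at a time until the target share of true solubles is kept.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tp = fp = 0
    for score, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        if tp / total_pos >= target_recall:
            return tp / (tp + fp), score
    raise ValueError("target recall must be at most 1.0")


if __name__ == "__main__":
    # Toy data, not taken from the expression dataset.
    toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
    toy_labels = [1, 1, 0, 1, 0, 1]
    precision, threshold = precision_at_recall(toy_scores, toy_labels, 0.75)
    print(precision, threshold)
```

The precision you get at a fixed recall depends on the share of soluble variants in the set being scored, so a figure measured on a test set will not carry over unchanged to a library with a different base rate.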

The more useful practical metric is the fraction of the synthesis order that you expect to yield purifiable soluble protein. In a typical generated library without solubility filtering, roughly 25–40% of variants end up in the inclusion body fraction under standard expression conditions. With solubility scoring applied as a pre-synthesis filter (discarding variants scoring below 0.45 on a 0–1 scale), that rate drops to roughly 12–18%. That's not perfect, but for a 24-variant synthesis order it raises the expected number of soluble variants from roughly 14–18 to roughly 20–21.
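The effect on a single order is simple arithmetic on the ranges quoted above. A short sketch, with a hypothetical helper name:

```python
def expected_soluble(order_size, insoluble_low, insoluble_high):
    """Expected soluble-variant count as a (worst case, best case) pair."""
    worst = order_size * (1 - insoluble_high)
    best = order_size * (1 - insoluble_low)
    return worst, best


if __name__ == "__main__":
    order = 24
    unfiltered = expected_soluble(order, 0.25, 0.40)  # 25-40% insoluble
    filtered = expected_soluble(order, 0.12, 0.18)    # 12-18% insoluble
    print("unfiltered:", [round(x, 1) for x in unfiltered])
    print("filtered:  ", [round(x, 1) for x in filtered])
```

These are expectations, not guarantees: on an order of 24, run-to-run variation around the expected count will be noticeable.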

Where sequence identity remains useful — and where it doesn't

We're not saying sequence identity is irrelevant to expression outcomes; it clearly carries signal about evolutionary fitness in some expression context. But that context is the organism the enzyme evolved in, not E. coli BL21. A thermophilic oxidoreductase from Thermus thermophilus with 78% identity to a well-expressed mesophilic ortholog will very often form inclusion bodies in BL21 at 37°C, not because it's unstable, but because its hydrophobic core packing is denser than what the chaperones of a mesophilic host handle efficiently, and its surface electrostatics are tuned for the thermophile's cytoplasmic environment. Sequence identity tracks where it came from, not whether it will cooperate with a BL21 production system.

Where sequence identity remains a reliable first filter: when you're making single-point mutations or small combinatorial libraries in a scaffold that already expresses well. The background expression quality dominates, and local mutations rarely disrupt that unless they happen to introduce a large hydrophobic surface patch. For those campaigns, full solubility scoring adds marginal value over the identity heuristic. For campaigns that involve generating sequences with <85% identity to any known reference — which describes the interesting part of enzyme engineering space — sequence identity is a weak predictor and solubility scoring becomes the primary pre-synthesis filter.

Practical implications for synthesis budget allocation

The downstream effect of poor solubility prediction isn't just the cost of synthesizing variants that end up in inclusion bodies; it's also the cost of the expression experiment itself. A standard small-scale expression screen in 24-deepwell plates, with lysis and SDS-PAGE or a quick western blot to distinguish soluble from insoluble protein, takes approximately 3–4 days of bench time and $180–$320 per plate in consumables, depending on reagent costs. If you're running this screen on 96 variants (four plates at one well per variant, or $720–$1,280) and 35% end up insoluble, you've spent roughly $250–$450 of that budget characterizing proteins you couldn't use, before considering the time cost.
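The screening-cost arithmetic is easy to rerun for your own plate format. This sketch counts consumables only and allocates plate cost to insoluble variants in proportion to their share of the wells; the function name and the wells_per_variant parameter are assumptions added for illustration, since replicate wells and controls scale the result.

```python
import math


def wasted_screen_cost(n_variants, insoluble_fraction, cost_per_plate,
                       wells_per_plate=24, wells_per_variant=1):
    """Return (plates, total_cost, cost_spent_on_insoluble_variants)."""
    wells = n_variants * wells_per_variant
    plates = math.ceil(wells / wells_per_plate)  # partial plates still cost a plate
    total = plates * cost_per_plate
    return plates, total, total * insoluble_fraction


if __name__ == "__main__":
    for per_plate in (180, 320):  # per-plate consumable range from the text
        plates, total, wasted = wasted_screen_cost(96, 0.35, per_plate)
        print(plates, total, round(wasted))
```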

The business case for solubility pre-filtering is straightforward: reduce the proportion of synthesis orders that produce insoluble protein, and you reduce the number of expression screening experiments needed per campaign, which shortens the calendar time between gene order and first activity data. The tradeoff is that solubility scoring isn't perfect: the filter will discard some variants that would have expressed well. But an AUROC of 0.83 means a randomly chosen soluble variant outscores a randomly chosen insoluble one 83% of the time, so a score threshold removes insoluble variants at a much higher rate than soluble ones.

The case for cold-induction as a first-line response

A note on the secondary cold-induction classifier: when the solubility score for a variant is in the 0.35–0.55 range (ambiguous), rather than discarding it, we flag it as a "cold induction candidate." Expressing at 16–18°C with lower IPTG (0.1 mM) for 16–20 hours dramatically improves soluble yield for a substantial fraction of sequences that score ambiguously under standard conditions. The cold-induction dataset shows that ~41% of sequences that form inclusion bodies at 37°C/0.5 mM IPTG express solubly under cold conditions. This isn't always practical for large-scale production (cold expression reduces volumetric yield), but for the initial expression and activity screen it can rescue variants worth keeping in the campaign.

We flag cold-induction candidates explicitly in the generation output so labs can make a conscious triage decision: discard, or run with modified expression protocol. That's more information than a binary solubility prediction, and it changes how you structure the expression screening experiment. Getting more of your synthesis order into the soluble fraction, with clearly calibrated confidence levels rather than a false sense of certainty from sequence identity, is the practical goal of the entire prediction stack.
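The triage logic can be sketched as a small function. The 0.35–0.55 ambiguous band comes from the text above; treating everything below the band as a discard and everything above it as a standard-conditions candidate is a simplifying assumption made here for illustration, and the names are hypothetical rather than the actual output schema.

```python
from enum import Enum


class Triage(Enum):
    STANDARD = "standard_conditions"
    COLD_INDUCTION = "cold_induction_candidate"
    DISCARD = "discard"


def triage(score, ambiguous_low=0.35, ambiguous_high=0.55):
    """Assign a solubility score (0-1 scale) to an expression triage bucket."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    if score < ambiguous_low:
        return Triage.DISCARD          # assumed handling below the band
    if score <= ambiguous_high:
        return Triage.COLD_INDUCTION   # ambiguous band: flag, don't discard
    return Triage.STANDARD             # assumed handling above the band


if __name__ == "__main__":
    for s in (0.20, 0.40, 0.55, 0.80):
        print(s, triage(s).value)
```

Returning a label instead of a yes/no keeps the decision with the lab: a flagged variant can still be dropped, or carried into the screen with the modified protocol.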