HMM-Gated PLM Embeddings

Abstract

The majority of microbial diversity remains uncultivated, in large part because we do not know what cultivation conditions a given lineage requires. Existing genome-based phenotype predictors operate on a single feature family — protein family inventories, codon usage statistics, or single-trait marker panels — and most binarize continuous phenotypes such as optimum temperature, pH, and salinity into coarse labels. Here we introduce HMM-gated protein language model embeddings (PTPE, phenotype-targeted PLM embeddings): for each genome we run pyhmmer against eight curated phenotype-relevant HMM marker families (oxygen handling, thermotolerance, pH homeostasis, osmotic response, vitamin biosynthesis, nitrogen cycling, carbon utilization, and a "special" category), embed only the matched proteins with a frozen protein language model (ESM-2), and mean-pool within each category to produce a compact per-genome functional fingerprint.

We integrate PTPE with five additional feature paths — amino acid composition, MediaDive recipe metadata, Pfam HMM marker counts, KEGG module fractional completeness (570 modules), and parsed BacDive isolation metadata — for a total of 6,313 features per genome, and train a multi-task XGBoost over 46,029 BacDive strains (22,300 unique genomes) for four cultivation targets (optimum temperature, pH, oxygen requirement, salt tolerance) with five-fold family-grouped cross-validation.
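Family-grouped cross-validation means that all strains from one taxonomic family land in the same fold, so a model is never evaluated on a family it saw during training. The following is a minimal sketch of one way to build such folds, not the pipeline's actual splitter; the function name, the greedy balancing rule, and the toy family labels are assumptions made for illustration.

```python
from collections import Counter

def family_grouped_folds(families, n_folds=5):
    """Assign each sample to a fold so that no family is split across folds.

    `families` holds one taxonomic-family label per sample. Families are
    placed largest-first into the currently lightest fold, which keeps
    fold sizes roughly balanced.
    """
    sizes = Counter(families)
    fold_load = [0] * n_folds
    family_to_fold = {}
    # Sort by size (descending), then by name, so the result is deterministic.
    for family, count in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(range(n_folds), key=lambda i: fold_load[i])
        family_to_fold[family] = target
        fold_load[target] += count
    return [family_to_fold[f] for f in families]

# Toy example with hypothetical family labels.
families = (["Bacillaceae"] * 4 + ["Vibrionaceae"] * 3
            + ["Halomonadaceae"] * 2 + ["Thermaceae"])
folds = family_grouped_folds(families, n_folds=2)
```

Libraries such as scikit-learn provide a comparable grouped splitter; the hand-written version is shown only to make the grouping constraint explicit.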

The cultivation bottleneck

Cultivation of microbes from environmental samples is bottlenecked not by sequencing or isolation hardware but by the prior question of which conditions to use: optimum temperature, pH, oxygen tolerance, and salinity must be chosen, typically from broad ranges, before a strain can be enriched in pure culture. For the >99% of bacterial and archaeal lineages that have never been cultivated, 16S placement provides only weak constraints on these conditions — phylogenetically close strains routinely differ by 10 °C in optimum temperature or several pH units. Direct prediction of cultivation conditions from genome sequence would lower this barrier substantially.

Why protein inventories are too coarse

A common limitation across the prior literature is that protein family inventories are coarse. A Pfam hit is binary ("does this genome contain a cytochrome c oxidase?"), so it collapses the substantial sequence-level variation between different cytochromes, variation that determines whether the organism is microaerophilic, strictly aerobic, or facultative. Protein language models such as ESM-2 are designed to expose precisely this continuous variation, but naive whole-proteome pooling averages ESM-2 embeddings across every protein in a genome, drowning the few phenotype-relevant proteins (for example, 12 cytochromes) among the ~4,000 housekeeping ones and producing a feature vector that is biologically dilute.
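The dilution argument can be made concrete with a little arithmetic. In a mean-pooled vector every protein carries equal weight, so the illustrative counts from the text (12 cytochromes among roughly 4,000 housekeeping proteins) give the relevant proteins only about 0.3% of the total weight. The helper name below is hypothetical and the counts are the text's example, not measured values.

```python
def pooled_share(n_relevant, n_other):
    """Fraction of a mean-pooled vector's weight carried by the relevant proteins."""
    return n_relevant / (n_relevant + n_other)

whole_proteome = pooled_share(12, 4000)  # every protein pooled together
gated = pooled_share(12, 0)              # only the relevant proteins pooled

print(f"whole-proteome: {whole_proteome:.2%}, gated: {gated:.0%}")
```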

A middle path: HMM-gated PLM embeddings

We propose a middle path: use the HMM as a gate for which proteins to embed. For each genome we run pyhmmer against a curated panel of 48 phenotype-relevant Pfam HMMs grouped into 8 categories, embed only the matched proteins with a frozen ESM-2 model, and mean-pool the embeddings within each category. The result is a phenotype-targeted protein language model embedding (PTPE): a compact per-genome functional fingerprint that keeps the specificity of HMM-based feature selection and the continuous functional resolution of PLMs. We integrate PTPE with five additional feature paths and train a multi-task XGBoost model on 46,029 BacDive strains with family-grouped cross-validation. This is the largest BacDive-anchored phenotype prediction corpus published to date, roughly twice the size of prior work.
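The pooling step can be sketched in a few lines. This is a simplified illustration rather than the authors' implementation: it assumes the HMM search and the ESM-2 forward pass have already been run, that each matched protein has been reduced to a single vector, and that categories with no hits are filled with zeros. The text does not specify any of these details, and all names in the code are hypothetical.

```python
from statistics import fmean

def ptpe_fingerprint(hits_by_category, embeddings, categories, dim):
    """Mean-pool per-protein embeddings within each marker category.

    hits_by_category: category -> IDs of proteins that matched an HMM in it
    embeddings:       protein ID -> embedding (list of `dim` floats)
    Categories with no hits are zero-filled (an assumption of this sketch).
    Returns one flat list of len(categories) * dim floats.
    """
    fingerprint = []
    for category in categories:
        vectors = [embeddings[p] for p in hits_by_category.get(category, [])]
        if vectors:
            # zip(*vectors) walks the matched proteins one dimension at a time.
            pooled = [fmean(column) for column in zip(*vectors)]
        else:
            pooled = [0.0] * dim
        fingerprint.extend(pooled)
    return fingerprint

# Toy input: two categories and 3-dimensional embeddings
# (real ESM-2 vectors have far more dimensions).
categories = ["oxygen", "thermotolerance"]
embeddings = {"protA": [1.0, 0.0, 2.0], "protB": [3.0, 2.0, 0.0]}
hits = {"oxygen": ["protA", "protB"]}
fingerprint = ptpe_fingerprint(hits, embeddings, categories, dim=3)
```

Because the output length depends only on the number of categories and the embedding width, every genome yields a vector of the same size regardless of how many proteins matched.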

Contributions

PTPE construction. A novel feature type for genome-level phenotype prediction. To our knowledge, none of the eight surveyed BacDive-era papers uses HMM-gated PLM mean pooling.
Multi-source feature fusion at BacDive scale. 46,029 strains × 6,313 features integrating composition, MediaDive, Pfam markers, KEGG modules, isolation metadata, and PTPE, with 5-fold family-grouped CV and the strongest pre-PTPE baseline released as a reproducible comparator.
An honest empirical evaluation. PTPE adds modest, target-dependent lift on the regression targets (1.0–2.4% lower MAE) but slightly lowers oxygen macro F1. We do not overclaim; we characterize where frozen PTPE helps and where it does not.
A fold-0 LoRA result and deployed hybrid predictor. End-to-end trained ESM-2 adapters improve oxygen classification substantially but do not replace tabular heads for continuous targets, so we deploy a phenotype-specific hybrid and apply it to 5,000 uncultured catalog genomes.
A GenomeSPOT comparator. We run GenomeSPOT on 5,000 held-out BacDive-derived genomes from the same family-grouped manifest, producing an external condition-trait comparator with no failed genomes.

What the evaluation shows

PTPE adds modest, target-dependent lift over the five-path baseline: optimum temperature MAE improves from 2.74 to 2.67 °C (2.4%), salt MAE from 1.94 to 1.92% (1.1%), and pH MAE from 0.473 to 0.469 (1.0%), while oxygen macro F1 drops from 0.412 to 0.402. Against GenomeSPOT on a deterministic 5,000-genome subset, our predictor is 39% more accurate for temperature, 23% for pH, and 3% for salt. Fine-tuning the protein language model with LoRA adapters on the HMM-gated marker sequences sharply improves oxygen classification (macro F1 0.945 vs. 0.402 for the tabular head). The retained production system is therefore a hybrid: tabular XGBoost for temperature, pH, and salt; all-task LoRA for oxygen; and a tabular MediaDive recommender for medium ranking. Applied to 5,000 GTDB-derived uncultured catalog genomes, the hybrid predicts 3,026 aerobes and 1,974 anaerobes.
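The hybrid amounts to a per-target routing rule: continuous targets go to the tabular heads and oxygen goes to the fine-tuned sequence model. The sketch below illustrates only that routing, with stand-in callables in place of trained models; the function name, the dictionary keys, and the placeholder predictions are assumptions, and the medium recommender is omitted.

```python
def predict_conditions(genome, tabular_heads, oxygen_model):
    """Route each target to the model family the evaluation favored.

    genome:        dict with "features" (tabular vector) and "marker_sequences"
    tabular_heads: target name -> callable on the feature vector
    oxygen_model:  callable on the HMM-gated marker sequences
    """
    predictions = {
        target: tabular_heads[target](genome["features"])
        for target in ("temperature", "ph", "salt")
    }
    predictions["oxygen"] = oxygen_model(genome["marker_sequences"])
    return predictions

# Stand-in models so the sketch runs; real heads would be trained predictors.
tabular_heads = {
    "temperature": lambda features: 30.0,
    "ph": lambda features: 7.0,
    "salt": lambda features: 2.0,
}
# Placeholder rule only; it carries no biological meaning.
oxygen_model = lambda sequences: "aerobe" if sequences else "anaerobe"

genome = {"features": [0.1, 0.2], "marker_sequences": ["MKTAYIAKQR"]}
result = predict_conditions(genome, tabular_heads, oxygen_model)
```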

Keywords — microbial cultivation · phenotype prediction · protein language models · HMM profiles · KEGG modules · BacDive · uncultivated microorganisms

HMM-Gated Protein Language Model Embeddings for Bacterial Cultivation Condition Prediction