# ProtEnrich - Ankh 3 XL
This model corresponds to ProtEnrich based on Ankh 3 XL.
[Github Repo] | [Dataset on HuggingFace] | [Model Collection] | [Cite]
## Abstract

Protein language models effectively capture evolutionary and functional signals from sequence data but lack explicit representation of the biophysical properties that govern protein structure and dynamics. Existing multimodal approaches attempt to integrate such physical information through direct fusion, often requiring multimodal inputs at inference time and distorting the sequence embedding space. Consequently, how to incorporate structural and dynamical knowledge into sequence representations without disrupting their established semantic organization remains an open research problem. We introduce ProtEnrich, a representation learning framework based on a residual multimodal enrichment paradigm. ProtEnrich decomposes sequence embeddings into two complementary latent subspaces: an anchor subspace that preserves sequence semantics, and an alignment subspace that encodes biophysical relationships. By converting multimodal information derived from ProstT5 and RocketSHP into a low-energy residual component, our approach injects physical information while preserving the original sequence embedding space. Across eight diverse protein foundation models trained on 550,120 SwissProt proteins with AlphaFold structures, enriched embeddings improved zero-shot remote homology retrieval, increasing Precision@10 and MRR by up to 0.13 and 0.11, respectively. Downstream performance also improved on structure-dependent tasks, reducing fluorescence prediction error by up to 16% and increasing metal ion binding AUROC by up to 2.4 points, while requiring only sequence input at inference.
## Model Details
ProtEnrich is a family of protein language model enrichment methods designed to inject low-energy structural and dynamical representations using only the sequence at inference time. Available models:
- ProtEnrich Ankh 3 XL: Enriched model based on Ankh 3 XL.
- ProtEnrich CARP 640M: Enriched model based on CARP 640M.
- ProtEnrich ESM1b: Enriched model based on ESM1b.
- ProtEnrich ESM2 T36: Enriched model based on ESM2 T36.
- ProtEnrich ESM Cambrian 600M: Enriched model based on ESM Cambrian 600M.
- ProtEnrich ProGen2: Enriched model based on ProGen2.
- ProtEnrich ProtBERT: Enriched model based on ProtBERT.
- ProtEnrich ProtT5: Enriched model based on ProtT5.
## Model Usage
You can use this model for feature extraction (embeddings) or fine-tune it for downstream prediction tasks. The embeddings may be used for similarity measurements, visualization, or training predictor models.
### Embedding Extraction
Use the code below to get started with the model for embedding extraction:
```python
from transformers import AutoModel, T5EncoderModel, T5Tokenizer
import torch

tokenizer = T5Tokenizer.from_pretrained("ElnaggarLab/ankh3-xl")
encoder = T5EncoderModel.from_pretrained("ElnaggarLab/ankh3-xl")
protenrich = AutoModel.from_pretrained("SaeedLab/ProtEnrich-Ankh3", trust_remote_code=True)

seqs = "MKTFFVLLL"
seqs = "[NLU]" + seqs  # Ankh 3 task prefix for sequence encoding

inputs = tokenizer([seqs], return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = encoder(**inputs)

# Mean-pool over residue positions, excluding the prefix and EOS tokens.
pooled = outputs.last_hidden_state[0, 1:-1].mean(dim=0)
enriched = protenrich(pooled)

print('H enrich:', enriched.h_enrich)
print('H anchor:', enriched.h_anchor)
print('H align:', enriched.h_algn)
print('Structure:', enriched.struct)
print('Dynamics:', enriched.dyn)
```
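Once extracted, enriched embeddings can be compared directly, e.g. for the zero-shot remote homology retrieval setting described in the abstract, by ranking candidates with cosine similarity. A minimal sketch using random vectors as stand-ins for two `h_enrich` embeddings (the dimension `d` here is an arbitrary placeholder; the real value depends on the base model):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-ins for two enriched embeddings (h_enrich); d is a placeholder dimension.
d = 1024
emb_a = torch.randn(d)
emb_b = torch.randn(d)

# Cosine similarity between the two enriched representations;
# higher values indicate more similar proteins for retrieval ranking.
sim = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
print(f"cosine similarity: {sim:.4f}")
```

In practice, similarities against a database of embeddings would be sorted to produce the retrieval lists scored by Precision@10 and MRR.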
### Downstream Prediction Task
Use the code below to get started with the model for downstream prediction tasks:
```python
from transformers import AutoConfig, AutoModelForSequenceClassification, T5EncoderModel, T5Tokenizer
import torch

tokenizer = T5Tokenizer.from_pretrained("ElnaggarLab/ankh3-xl")
encoder = T5EncoderModel.from_pretrained("ElnaggarLab/ankh3-xl")

config = AutoConfig.from_pretrained("SaeedLab/ProtEnrich-Ankh3", trust_remote_code=True)
config.num_labels = 10  # set to the number of classes in your task
protenrich = AutoModelForSequenceClassification.from_config(config, trust_remote_code=True)

seqs = "MKTFFVLLL"
seqs = "[NLU]" + seqs  # Ankh 3 task prefix for sequence encoding

inputs = tokenizer([seqs], return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = encoder(**inputs)

# Mean-pool over residue positions, excluding the prefix and EOS tokens.
pooled = outputs.last_hidden_state[0, 1:-1].mean(dim=0)
enriched = protenrich(pooled)
print('Logits:', enriched.logits)
```
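For actual fine-tuning, the classification head is optimized with a standard cross-entropy objective on labeled data. A minimal, self-contained sketch of one training step, using a plain `nn.Linear` head and random tensors as placeholders for the pooled embeddings above (the names and dimensions here are illustrative, not part of the ProtEnrich API):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

d, num_labels, batch = 1024, 10, 4
head = nn.Linear(d, num_labels)              # placeholder classification head
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

pooled = torch.randn(batch, d)               # stand-in for pooled embeddings
labels = torch.randint(0, num_labels, (batch,))

logits = head(pooled)                        # (batch, num_labels)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"loss: {loss.item():.4f}")
```

In a real run, the frozen encoder and ProtEnrich model would replace the random `pooled` tensor, and this step would be repeated over batches of a task-specific dataset.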
## Citation
The paper is under review. As soon as it is accepted, we will update this section.
## License

This model and associated code are released under the CC-BY-NC-ND 4.0 license and may only be used for non-commercial, academic research purposes with proper attribution. Any commercial use, sale, or other monetization of this model and its derivatives, including models trained on outputs from the model or datasets created from the model, is prohibited without prior approval. Downloading the model requires prior registration on Hugging Face and agreement to the terms of use. By downloading this model, you agree not to distribute, publish, or reproduce a copy of the model. If another user within your organization wishes to use the model, they must register as an individual user and agree to comply with the terms of use. Users may not attempt to re-identify the de-identified data used to develop the underlying model. If you are a commercial entity, please contact the corresponding author.
## Contact
For any additional questions or comments, contact Fahad Saeed (fsaeed@fiu.edu).