microsoft-dayhoff-170m-ur50
Version: 4
HuggingFace · Last updated August 2025

Model Card for microsoft/Dayhoff-170m-UR50

Dayhoff is an Atlas of both protein sequence data and generative language models: a centralized resource that brings together 3.34 billion protein sequences across 1.7 billion clusters of metagenomic and natural protein sequences (GigaRef), 46 million structure-derived synthetic sequences (BackboneRef), and 16 million multiple sequence alignments (OpenProteinSet).

The Dayhoff models can natively predict zero-shot mutation effects on fitness, scaffold structural motifs by conditioning on evolutionary or structural context, and perform guided generation of novel proteins within specified families. Learning from the metagenomic and structure-based synthetic data in the Dayhoff Atlas increased the cellular expression rates of generated proteins, highlighting the real-world value of expanding the scale, diversity, and novelty of protein sequence data.

The Dayhoff architecture is a hybrid of state-space Mamba layers and Transformer self-attention, interleaved with Mixture-of-Experts modules to maximize capacity while preserving efficiency. It natively handles long contexts, allowing both single sequences and unrolled MSAs to be modeled. Trained with an autoregressive objective in both N→C and C→N directions, Dayhoff supports order-agnostic infilling and scales to billions of parameters.

Model Details

Model Description

  • Developed by: Kevin K. Yang, Sarah Alamdari, Alex J. Lee, Kaeli Kaymak-Loveless, Samir Char, Garyk Brixi, Carles Domingo-Enrich, Chentong Wang, Suyue Lyu, Nicolo Fusi, Neil Tenenholtz, Ava P. Amini
  • Model type: Hybrid state-space-model transformer architecture with mixture-of-experts
  • License: MIT

Model Sources

Uses

Downstream Use

Dayhoff is intended for broad research use on protein language modeling. The model has been used and assessed on the following capabilities:
  1. Unconditional design of protein sequences
  2. Zero-shot mutation effect prediction on ProteinGym
  3. Designing scaffolds for structural motifs in sequence space, evaluated on the RFDiffusion benchmark problems and MotifBench
  4. Homolog conditioning with Dayhoff-3b-GR-HM and Dayhoff-3b-GR-HM-c
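For autoregressive protein language models, zero-shot mutation effect prediction (capability 2 above) is commonly scored as the difference in sequence log-likelihood between the mutant and the wild type. The sketch below shows that scoring logic only; the `toy_log_likelihood` function is a hypothetical stand-in, and in practice the score would come from summing Dayhoff's per-token log-probabilities.

```python
from typing import Callable

def mutate(seq: str, pos: int, new_aa: str) -> str:
    """Return seq with the residue at 0-based position pos replaced."""
    return seq[:pos] + new_aa + seq[pos + 1:]

def mutation_effect_score(
    wild_type: str,
    pos: int,
    new_aa: str,
    log_likelihood: Callable[[str], float],
) -> float:
    """Zero-shot effect score: log P(mutant) - log P(wild type).

    Positive scores mean the model prefers the mutant sequence.
    """
    return log_likelihood(mutate(wild_type, pos, new_aa)) - log_likelihood(wild_type)

# Toy stand-in scorer (NOT Dayhoff): rewards hydrophobic residues,
# purely to exercise the scoring logic above.
def toy_log_likelihood(seq: str) -> float:
    return sum(0.1 if aa in "AILMFVW" else -0.1 for aa in seq)

score = mutation_effect_score("MADQL", 2, "V", toy_log_likelihood)  # D -> V
```

Ranking these scores across all single mutants of a protein is what the ProteinGym Spearman correlations in the Evaluation section measure.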

Bias, Risks, and Limitations

This model should not be used to generate anything other than protein sequences or sets of homologous protein sequences. It is not intended for natural language or for other biological sequences, such as DNA. Not all generated sequences are guaranteed to be realistic: it remains difficult to generate high-quality sequences with no sequence homology to any natural sequence.

How to Get Started with the Model

You can use cURL or any REST Client to send a request to the AzureML endpoint with your AzureML token.
curl <AZUREML_ENDPOINT_URL> \
    -X POST \
    -d '{"inputs":"@MADQLTEEQIAEFKEAF","parameters":{"max_new_tokens":64,"do_sample":false}}' \
    -H "Authorization: Bearer <AZUREML_TOKEN>" \
    -H "Content-Type: application/json"

Example

To unconditionally generate a new protein sequence:
{
  "inputs": "@",
  "parameters": {
    "max_new_tokens": 64,
    "do_sample": false
  }
}
To complete a protein sequence:
{
  "inputs": "@MADQLTEEQIAEFKEAF",
  "parameters": {
    "max_new_tokens": 64,
    "do_sample": false
  }
}
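As the examples above show, prompts begin with the "@" sequence-start token. A small post-processing sketch for generated text follows; the exact response format and special-token handling depend on the endpoint, so treat the cleanup step as an assumption.

```python
# Standard 20-amino-acid alphabet; "@" is the sequence-start token used in
# the prompts above (handling of other special tokens is an assumption here).
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def clean_generation(text: str) -> str:
    """Strip the leading "@" start token, if present."""
    return text[1:] if text.startswith("@") else text

def is_valid_protein(seq: str) -> bool:
    """True if non-empty and composed only of canonical amino acids."""
    return bool(seq) and all(aa in AMINO_ACIDS for aa in seq)

seq = clean_generation("@MADQLTEEQIAEFKEAF")
```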
For a complete list of generation parameters and for detailed instructions on package usage, please refer to the README in the model repository.

Evaluation

Results

See the preprint for the latest benchmark results and evaluations. Model perplexity on held-out test sequences for Dayhoff models.
| Model | UniRef50 | GigaRef | Aligned homologs | Unaligned homologs |
|---|---|---|---|---|
| 170m-UR50 | 11.62 | 11.88 | NA | NA |
| 170m-UR90 | 11.52 | 11.85 | NA | NA |
| 170m-GR | 13.67 | 9.36 | NA | NA |
| 170m-UR50-BRn | 11.78 | 12.03 | NA | NA |
| 170m-UR50-BRq | 11.67 | 11.91 | NA | NA |
| 170m-UR50-BRu | 11.66 | 11.87 | NA | NA |
| 3b-UR90 | 8.95 | 9.64 | NA | NA |
| 3b-GR-HM | 11.95 | 6.68 | 4.34 | 4.60 |
| 3b-GR-HM-c | 10.11 | 9.21 | 3.57 | 3.56 |
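Perplexity here is the usual exponentiated mean negative log-likelihood per token. A minimal sketch of the computation, using a toy uniform model rather than Dayhoff itself:

```python
import math

def perplexity(token_log_probs: list[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# A model assigning every token probability 1/20 (uniform over the
# 20 amino acids) has perplexity exactly 20.
ppl = perplexity([math.log(1 / 20)] * 10)
```

Lower values in the table above therefore mean the model assigns higher probability to held-out sequences.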
Quality of generated sequences as measured by ESMFold pLDDT and scPerplexity. Dataset statistics are for 1024 randomly-sampled sequences. Model statistics are for 1024 generations at T=1 in the N-to-C direction.
| Model or dataset | pLDDT (mean ± s.d.) | scPerplexity (mean ± s.d.) |
|---|---|---|
| **Natural sequences** | | |
| UniRef50 | 0.653 ± 0.196 | 9.45 ± 2.89 |
| GigaRef-clusters | 0.619 ± 0.199 | 9.69 ± 2.83 |
| GigaRef-singletons | 0.561 ± 0.201 | 10.07 ± 2.88 |
| **Generated sequences** | | |
| 170m-UR50 | 0.421 ± 0.132 | 11.97 ± 2.14 |
| 170m-UR90 | 0.407 ± 0.125 | 12.12 ± 2.14 |
| 170m-GR | 0.422 ± 0.129 | 11.83 ± 2.12 |
| 170m-UR50-BRu | 0.441 ± 0.157 | 11.71 ± 2.18 |
| 170m-UR50-BRq | 0.434 ± 0.152 | 11.72 ± 2.24 |
| 170m-UR50-BRn | 0.432 ± 0.131 | 11.77 ± 2.24 |
| 3b-UR90 | 0.454 ± 0.150 | 11.79 ± 2.38 |
| 3b-GR-HM | 0.406 ± 0.126 | 11.50 ± 2.16 |
| 3b-GR-HM-c | 0.423 ± 0.132 | 11.91 ± 2.18 |
ProteinGym zero-shot performance: Spearman's correlation coefficient on ProteinGym substitutions and indels.

| Input | Model | Parameters | Substitutions | Indels |
|---|---|---|---|---|
| Single sequence | 170m-UR50 | 170M | 0.353 | 0.479 |
| | 170m-UR90 | 170M | 0.354 | 0.483 |
| | 170m-GR | 170M | 0.199 | 0.292 |
| | 170m-UR50-BRu | 170M | 0.341 | 0.476 |
| | 170m-UR50-BRq | 170M | 0.356 | 0.477 |
| | 170m-UR50-BRn | 170M | 0.341 | 0.478 |
| | 3b-UR90 | 3B | 0.394 | 0.497 |
| | 3b-GR-HM | 3B | 0.328 | 0.423 |
| | 3b-GR-HM-c | 3B | 0.417 | 0.466 |
| Aligned homologs | 3b-GR-HM-c | 3B | 0.368 | NA |
| Unaligned homologs | 3b-GR-HM-c | 3B | 0.372 | 0.401 |
RFDiffusion Benchmark Performance. Motif scaffolding: successes out of 100 for each problem, with total problems solved, total successes, and MotifBench score per model.

| Problem | 170m-UR50 | 170m-UR90 | 170m-GR | 170m-UR50-BRn | 170m-UR50-BRq | 170m-UR50-BRu | 3b-UR90 | 3b-GR-HM | 3b-GR-HM-c | EvoDiff-Seq |
|---|---|---|---|---|---|---|---|---|---|---|
| 1PRW | 62 | 72 | 81 | 95 | 91 | 90 | 94 | 81 | 79 | 82 |
| 1BCF | 0 | 0 | 5 | 0 | 0 | 0 | 10 | 8 | 0 | 7 |
| 5TPN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5IUS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3IXT | 12 | 17 | 12 | 14 | 18 | 12 | 18 | 11 | 14 | 20 |
| 5YUI | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1QJG | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1YCR | 2 | 5 | 0 | 6 | 7 | 6 | 2 | 3 | 4 | 2 |
| 2KL8 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 7MRX_60 | 1 | 0 | 0 | 0 | 0 | 2 | 42 | 0 | 9 | 0 |
| 7MRX_85 | 0 | 0 | 0 | 0 | 0 | 0 | 19 | 1 | 1 | 0 |
| 7MRX_128 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4JHW | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4ZYP | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5WN9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6VW1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5TRV_short | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5TRV_med | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5TRV_long | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6E6R_short | 2 | 2 | 1 | 3 | 3 | 2 | 14 | 7 | 8 | 6 |
| 6E6R_med | 0 | 1 | 2 | 0 | 0 | 2 | 4 | 0 | 2 | 0 |
| 6E6R_long | 0 | 1 | 0 | 0 | 0 | 1 | 3 | 0 | 1 | 0 |
| 6EXZ_short | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6EXZ_med | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6EXZ_long | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Problems solved | 6 | 8 | 6 | 5 | 4 | 10 | 10 | 7 | 9 | 6 |
| Successes | 80 | 100 | 102 | 119 | 119 | 118 | 207 | 112 | 119 | 118 |
| Score | 9.65 | 12.25 | 6.10 | 7.26 | 10.62 | 14.36 | 16.32 | 11.90 | 14.14 | 7.67 |
MotifBench Benchmark Performance. Motif scaffolding: successes out of 100 for each problem, with total problems solved, total successes, and MotifBench score per model.

| Problem | 170m-UR50 | 170m-UR90 | 170m-GR | 170m-UR50-BRn | 170m-UR50-BRq | 170m-UR50-BRu | 3b-UR90 | 3b-GR-HM | 3b-GR-HM-c | EvoDiff-Seq |
|---|---|---|---|---|---|---|---|---|---|---|
| 01_1LDB | 1 | 1 | 3 | 0 | 0 | 1 | 20 | 2 | 12 | 0 |
| 02_1ITU | 4 | 33 | 4 | 1 | 1 | 4 | 37 | 57 | 48 | 0 |
| 03_2CGA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 04_5WN9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 05_5ZE9 | 0 | 1 | 21 | 0 | 0 | 0 | 16 | 40 | 9 | 0 |
| 06_6E6R | 1 | 1 | 1 | 1 | 2 | 1 | 6 | 3 | 1 | 2 |
| 07_6E6R | 0 | 0 | 0 | 2 | 0 | 0 | 2 | 0 | 0 | 0 |
| 08_7AD5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 09_7CG5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 10_7WRK | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 11_3TQB | 4 | 11 | 3 | 4 | 3 | 7 | 40 | 8 | 26 | 0 |
| 12_4JHW | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13_4JHW | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14_5IUS | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15_7A8S | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 16_7BNY | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17_7DGW | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 18_7MQQ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 19_7MQQ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20_7UWL | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 21_1B73 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 22_1BCF | 0 | 0 | 3 | 0 | 0 | 0 | 20 | 9 | 0 | 19 |
| 23_1MPY | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 24_1QY3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 25_2RKX | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 26_3B5V | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 27_4XOJ | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 28_5YUI | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 29_6CPA | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 30_7UWL | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Problems solved | 4 | 5 | 6 | 4 | 3 | 4 | 7 | 6 | 5 | 2 |
| Successes | 10 | 47 | 35 | 8 | 6 | 13 | 141 | 119 | 96 | 21 |
| Score | 2.33 | 2.92 | 4.33 | 2.75 | 2.17 | 2.75 | 8.36 | 4.96 | 4.48 | 1.58 |

Technical Specifications

Compute Infrastructure

  • 170M-parameter models: trained on 8 NVIDIA A100 or 8 NVIDIA H100 GPUs using Distributed Data Parallel.
  • 3B-parameter models: trained on 176 NVIDIA H100 GPUs using Fully Sharded Data Parallel in hybrid-shard mode.

Responsible AI Considerations

The intended use of this model is to generate high-quality, realistic protein sequences or sets of homologous protein sequences. Generations can be designed from scratch or conditioned on partial sequences in both N→C and C→N directions. The code and datasets released in this repository are provided for research and development use only. They are not intended for use in clinical decision-making or for any other clinical use, and the performance of these models for clinical use has not been established. You bear sole responsibility for any use of these models, data, and software, including incorporation into any product intended for clinical use.

Citation

If you use the code, data, models, or results, please cite our preprint.