EvoDiff
Version: 1
Microsoft Research's EvoDiff is a diffusion modeling framework capable of generating high-fidelity, diverse, and novel proteins with the option of conditioning according to sequence constraints. Because it operates in the universal protein design space, EvoDiff can unconditionally sample diverse structurally-plausible proteins, generate intrinsically disordered regions, and scaffold structural motifs using only sequence information, challenging a paradigm in structure-based protein design.
We are thrilled to release EvoDiff-OADM 640M on Azure AI Foundry. For all other models in the EvoDiff suite, please see our GitHub repository. If you use the code in our repository, the results, or the model available on Azure AI Foundry, please cite our preprint.
Overview
We investigated two types of forward processes for diffusion over discrete data modalities to determine which would be most effective. In order-agnostic autoregressive diffusion (OADM), one amino acid is converted to a special mask token at each step in the forward process. After $T=L$ steps, where $L$ is the length of the sequence, the entire sequence is masked. We additionally designed discrete denoising diffusion probabilistic models (D3PM) for protein sequences. In EvoDiff-D3PM, the forward process corrupts sequences by sampling mutations according to a transition matrix, such that after $T$ steps the sequence is indistinguishable from a uniform sample over the amino acids. In the reverse process for both, a neural network model is trained to undo the previous corruption. The trained model can then generate new sequences starting from sequences of masked tokens or of uniformly sampled amino acids for EvoDiff-OADM or EvoDiff-D3PM, respectively.
We trained all EvoDiff sequence models on 42M sequences from UniRef50 using the dilated convolutional neural network architecture introduced in the CARP protein masked language model. We trained 38M-parameter and 640M-parameter versions for each forward corruption scheme and for left-to-right autoregressive (LRAR) decoding.
To explicitly leverage evolutionary information, we designed and trained EvoDiff MSA models using the MSA Transformer architecture on the OpenFold dataset. To do so, we subsampled MSAs to a maximum length of 512 residues per sequence and a maximum depth of 64 sequences, either by randomly sampling the sequences ("Random") or by greedily maximizing for sequence diversity ("Max"). Within each subsampling strategy, we then trained EvoDiff MSA models with the OADM and D3PM corruption schemes.
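To make the OADM forward process concrete, the toy sketch below masks one residue per step in a random order until the whole sequence is masked. It is an illustration only: the "#" mask symbol, the random masking order, and the example sequence are assumptions for the sketch, not the EvoDiff tokenizer or training code.

```python
import random

MASK = "#"  # placeholder mask symbol; the actual tokenizer's mask token may differ

def oadm_forward_corruption(sequence):
    """Illustrative OADM forward process: mask one residue per step in a
    random order, so after T = len(sequence) steps the sequence is fully masked."""
    tokens = list(sequence)
    order = random.sample(range(len(tokens)), len(tokens))
    trajectory = []
    for position in order:
        tokens[position] = MASK
        trajectory.append("".join(tokens))
    return trajectory

# Example: corrupt a short (hypothetical) sequence and show the first few steps
for step in oadm_forward_corruption("MKTAYIAKQR")[:3]:
    print(step)
```

The reverse process works in the opposite direction: starting from a fully masked sequence, the trained network fills in one residue at a time until no mask tokens remain.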
Generation on Azure AI Foundry
Start using EvoDiff on Azure AI Foundry with this Jupyter Notebook.
Intended Use
Primary Use Cases
Below are several use cases for EvoDiff. Currently, Azure AI Foundry supports unconditional or conditional design with EvoDiff-Seq. To use EvoDiff-MSA, please see our GitHub repository for more information.
- Unconditional generation with EvoDiff-Seq or EvoDiff-MSA (https://github.com/microsoft/evodiff/blob/main/README.md#unconditional-generation-with-evodiff-msa)
- Conditional sequence generation
- Evolution-guided protein generation with EvoDiff-MSA
- Generating intrinsically disordered regions with EvoDiff-Seq and EvoDiff-MSA
- Scaffolding functional motifs with EvoDiff-Seq and EvoDiff-MSA
Out-of-Scope Use Cases
This model is intended for use on protein sequences. It is not meant for natural language or other biological sequences, such as DNA sequences.
Bias, Risks, and Limitations
This model generates only protein sequences; it will not generate other biological sequences, such as DNA, or natural language. In other words, the model will perform best on data within its training distribution, which consists of protein sequences and multiple sequence alignments (MSAs). Based on a review of currently available information, EvoDiff is not expected to provide any notable uplift in expertise to users. It is also very unlikely to create any new or add to any known CBRN or advanced autonomy risks.
Training Data
We obtain sequences from the UniRef50 dataset, which contains approximately 42 million protein sequences. The multiple sequence alignments (MSAs) are from the OpenFold dataset, which contains 401,381 MSAs for 140,000 unique Protein Data Bank (PDB) chains and 16,000,000 UniClust30 clusters. The intrinsically disordered region (IDR) data was obtained from the Reverse Homology GitHub. For the scaffolding structural motif benchmark, we provide PDB and FASTA files used for conditionally generating sequences in the examples/scaffolding-pdbs folder. We also provide PDB files used for conditionally generating MSAs in the examples/scaffolding-msas folder.
Environmental Impact
- Hardware Type: 32GB NVIDIA V100 GPUs
- Hours used: 4,128 (14 days per sequence model, 10 days per MSA model)
- Cloud Provider: Azure
- Compute Region: East US
- Carbon Emitted: 485.21 kg
For full details, please refer to our preprint.
Testing Data
We provide all generated sequences on the EvoDiff Zenodo record. To download our unconditionally generated sequences from the unconditional_generations.csv file, run:
curl -O https://zenodo.org/record/8329165/files/unconditional_generations.csv?download=1
To extract all unconditionally generated sequences created using the EvoDiff-Seq oa_dm_640M model, run the following code:
import pandas as pd
df = pd.read_csv('unconditional_generations.csv', index_col = 0)
subset = df.loc[df['model'] == 'evodiff_oa_dm_640M']
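If you want to pass the extracted sequences on to other tools (folding, inverse folding, etc.), a minimal sketch like the one below writes them to a FASTA file. The 'sequence' column name and the output filename are assumptions about the CSV layout made for this sketch; verify the column names against the README before relying on them.

```python
import pandas as pd

# Reload the generated-sequence table and keep rows from the 640M OADM model
df = pd.read_csv('unconditional_generations.csv', index_col=0)
subset = df.loc[df['model'] == 'evodiff_oa_dm_640M']

# Assumption: sequences live in a column named 'sequence'; check df.columns
# if the CSV uses a different header.
with open('evodiff_oa_dm_640M_generations.fasta', 'w') as handle:
    for i, seq in enumerate(subset['sequence']):
        handle.write(f'>evodiff_oa_dm_640M_{i}\n{seq}\n')
```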
Please view our README.md for more information about the CSV files containing generated data.
Metrics
To analyze the quality of the generations, we look at the following metrics and tools; a minimal sketch of the amino acid KL divergence calculation follows this list.
- amino acid KL divergence (aa_reconstruction_parity_plot)
- secondary structure KL divergence (evodiff/analysis/calc_kl_ss.py)
- model perplexity for sequences (evodiff/analysis/sequence_perp.py)
- model perplexity for MSAs (evodiff/analysis/msa_perp.py)
- Fréchet inception distance (evodiff/analysis/calc_fid.py)
- Hamming distance (evodiff/analysis/calc_nearestseq_hamming.py)
- sc-RMSD score (analysis/rmsd_analysis.py)
- TM score
- OmegaFold
- ProteinMPNN
- ESM-IF1; see this Jupyter notebook for setup details.
- PGP
- DISOPRED3
- DR-BERT
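As promised above, here is a minimal sketch of an amino acid KL divergence calculation between a test set and a set of generated sequences. It illustrates the idea only and is not the implementation behind aa_reconstruction_parity_plot; the 20-letter canonical alphabet and the epsilon smoothing are assumptions of this sketch.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_distribution(sequences):
    """Frequency of each canonical amino acid across a list of sequences."""
    counts = np.zeros(len(AMINO_ACIDS))
    for seq in sequences:
        for i, aa in enumerate(AMINO_ACIDS):
            counts[i] += seq.count(aa)
    return counts / counts.sum()

def aa_kl_divergence(test_seqs, generated_seqs, eps=1e-10):
    """KL divergence between the test-set and generated amino acid distributions."""
    p = aa_distribution(test_seqs) + eps       # reference (test) distribution
    q = aa_distribution(generated_seqs) + eps  # generated distribution
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
```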
EvoDiff-Seq Performance
The reconstruction KL (Recon KL) was calculated between the distribution of amino acids in the test set and in generated samples (n=1000). The perplexity was computed on 25k samples from the test set. The minimum Hamming distance to any train sequence of the same length (Hamming) is reported for each model as the mean ± standard deviation over the generated samples; a minimal sketch of this calculation appears after the table footnotes. The Fréchet ProtT5 distance (FPD) was calculated between the test set and generated samples. The secondary structure KL (SS KL) was calculated between the means of the predicted secondary structures of the test and generated samples.
Model | Parameters | Recon KL | Perplexity | Hamming | FPD | SS KL |
---|---|---|---|---|---|---|
Test | - | 9.92e-4 [1] | - | 0.0039 [2] | 0.101 | 1.37e-5 [1] |
EvoDiff-Seq (D3PM BLOSUM) | 38M | 1.77e-2 | 17.16 | 0.83 ± 0.05 | 1.42 | 3.30e-5 |
EvoDiff-Seq (D3PM Uniform) | 38M | 1.48e-3 | 18.82 | 0.83 ± 0.05 | 1.31 | 3.73e-5 |
EvoDiff-Seq (OADM) | 38M | 1.11e-3 | 14.61 | 0.83 ± 0.07 | 0.92 | 1.61e-4 |
EvoDiff-Seq (D3PM BLOSUM) | 640M | 3.73e-2 | 15.74 | 0.83 ± 0.05 | 1.53 | 4.96e-4 |
EvoDiff-Seq (D3PM Uniform) | 640M | 2.90e-3 | 18.47 | 0.83 ± 0.05 | 1.35 | 2.13e-4 |
EvoDiff-Seq (OADM) | 640M | 1.26e-3 | 13.05 | 0.83 ± 0.08 | 0.88 | 1.48e-4 |
LRAR | 38M | 7.90e-4 | 12.38 | 0.82 ± 0.06 | 0.86 | 1.61e-4 |
CARP | 38M | 5.71e-1 | 25.13 | 0.74 ± 0.07 | 6.30 | 2.72e-3 |
LRAR | 640M | 7.01e-4 | 10.41 | 0.83 ± 0.06 | 0.63 | 1.76e-5 |
CARP | 640M | 3.56e-1 | 31.77 | 0.84 ± 0.05 | 1.78 | 5.03e-3 |
ESM-1b [3] | 650M | 4.91e-1 | 53.49 | 0.83 ± 0.06 | 6.67 | 5.48e-4 |
ESM-2 [3] | 650M | 5.00e-1 | 68.39 | 0.84 ± 0.06 | 6.79 | 3.05e-3 |
FoldingDiff [4] | 14M | 5.49e-2 | - | - | 1.64 | 1.76e-3 |
RFdiffusion [5] | 60M | 7.19e-2 | - | - | 1.96 | 5.98e-3 |
Random | - | 1.65e-1 | 20 | 0.85 ± 0.04 | 3.16 | 1.90e-4 |
[1] Calculated between the test set and validation set.
[2] Reported value is the minimum Hamming distance between any two natural sequences of the same length in UniRef50.
[3] Due to model constraints, the maximum sequence length sampled was 1022.
[4] For the FoldingDiff baseline, 1000 structures generated by FoldingDiff were randomly selected, and the corresponding 1000 inferred sequences were inverse-folded using ESM-IF. These sequences are between 50 and 128 residues in length.
[5] For the RFdiffusion baseline, 1000 structures were generated matching the UniRef training set length distribution, and 1000 corresponding sequences were inverse-folded using ESM-IF.
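As a companion to the Hamming column above, here is a minimal sketch of computing the minimum normalized Hamming distance from a generated sequence to training sequences of the same length. It illustrates the metric only and is not the script in evodiff/analysis/calc_nearestseq_hamming.py; the fallback value of 1.0 when no same-length training sequence exists is an assumption of this sketch.

```python
def min_hamming_to_train(generated_seq, train_seqs):
    """Minimum normalized Hamming distance from one generated sequence
    to any training sequence of the same length (1.0 if no length match)."""
    length = len(generated_seq)
    best = 1.0
    for train_seq in train_seqs:
        if len(train_seq) != length:
            continue  # Hamming distance is only defined for equal-length sequences
        mismatches = sum(a != b for a, b in zip(generated_seq, train_seq))
        best = min(best, mismatches / length)
    return best
```

The reported "Hamming" values are then the mean ± standard deviation of this quantity over all generated samples for a given model.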
EvoDiff-MSA performance
The perplexity is calculated based on the ability of each model to reconstruct a subsampled MSA from the validation set; a minimal sketch of a masked-token perplexity calculation appears after the table. "Max Perplexity" and "Rand. Perplexity" indicate MaxHamming and Random subsampling, respectively, for construction of the validation MSA.
Model | Subsampling | Params | Max Perplexity | Rand. Perplexity |
---|---|---|---|---|
EvoDiff-MSA (D3PM BLOSUM) | Random | 100M | 11.35 | 8.31 |
EvoDiff-MSA (D3PM BLOSUM) | Max | 100M | 10.98 | 7.61 |
EvoDiff-MSA (D3PM Uniform) | Random | 100M | 10.14 | 6.77 |
EvoDiff-MSA (D3PM Uniform) | Max | 100M | 10.06 | 6.66 |
EvoDiff-MSA (OADM) | Random | 100M | 6.05 | 3.64 |
EvoDiff-MSA (OADM) | Max | 100M | 6.14 | 3.60 |
ESM-MSA-1b | Max | 100M | 11.20 | 5.89 |
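For intuition on the perplexity numbers above, the sketch below shows how a masked-token perplexity can be computed from a model's per-position log-probabilities. It assumes a generic `model(tokens)` callable that returns logits over the amino acid vocabulary and a known `mask_idx`; it is not the implementation in evodiff/analysis/msa_perp.py.

```python
import torch
import torch.nn.functional as F

def masked_perplexity(model, tokens, mask_idx, masked_positions):
    """Perplexity of a model's predictions at masked positions.

    tokens: LongTensor of token indices, shape (1, seq_len).
    masked_positions: bool tensor of the same shape marking positions to corrupt.
    Assumes model(tokens) returns logits of shape (1, seq_len, vocab_size).
    """
    corrupted = tokens.clone()
    corrupted[masked_positions] = mask_idx
    with torch.no_grad():
        logits = model(corrupted)
    log_probs = F.log_softmax(logits, dim=-1)
    # Log-probability assigned to the true residue at every position
    true_lp = log_probs.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)
    # Average negative log-likelihood over masked positions, exponentiated
    nll = -true_lp[masked_positions].mean()
    return torch.exp(nll).item()
```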
EvoDiff-Seq structural plausibility metrics
Metrics are reported as the mean ± standard deviation over 1000 generated samples for each model.
Model | Params | ESM-IF scPerplexity | ProteinMPNN scPerplexity | OmegaFold pLDDT |
---|---|---|---|---|
Test | - | 8.04 ± 4.04 | 3.09 ± 0.63 | 68.25 ± 17.85 |
EvoDiff-Seq (D3PM BLOSUM) | 38M | 12.38 ± 2.06 | 3.80 ± 0.49 | 42.76 ± 14.55 |
EvoDiff-Seq (D3PM Uniform) | 38M | 12.03 ± 2.04 | 3.77 ± 0.50 | 42.37 ± 14.39 |
EvoDiff-Seq (OADM) | 38M | 11.61 ± 2.38 | 3.72 ± 0.50 | 43.78 ± 14.18 |
EvoDiff-Seq (D3PM BLOSUM) | 640M | 11.86 ± 2.21 | 3.73 ± 0.48 | 44.14 ± 13.80 |
EvoDiff-Seq (D3PM Uniform) | 640M | 12.29 ± 2.05 | 3.78 ± 0.49 | 41.65 ± 14.32 |
EvoDiff-Seq (OADM) | 640M | 11.53 ± 2.50 | 3.71 ± 0.52 | 44.46 ± 14.62 |
LRAR | 38M | 11.61 ± 2.38 | 3.64 ± 0.56 | 48.26 ± 14.87 |
CARP | 38M | 9.68 ± 2.56 | 3.66 ± 0.62 | 50.79 ± 12.06 |
LRAR | 640M | 10.99 ± 2.63 | 3.59 ± 0.54 | 48.71 ± 15.47 |
CARP | 640M | 14.13 ± 2.42 | 4.05 ± 0.52 | 41.56 ± 14.35 |
ESM-1b | 650M | 13.90 ± 2.44 | 3.47 ± 0.68 | 58.07 ± 15.64 |
ESM-2 | 650M | 14.02 ± 2.87 | 3.58 ± 0.69 | 50.70 ± 15.67 |
Random | - | 14.68 ± 1.97 | 3.96 ± 0.50 | 39.97 ± 14.05 |
EvoDiff-MSA homolog conditioned generation
Metrics are reported as the mean ± standard deviation over 250 generated samples for each model. The first subsampling method listed describes the subsampling procedure used to train the model, and the second describes the subsampling procedure used for generation.
Model | scPerplexity | pLDDT | Seq. similarity | TM score |
---|---|---|---|---|
Valid | 5.93 ± 3.19 | 73.99 ± 17.80 | 14.58 ± 21.64 [1] | - |
EvoDiff-MSA (OADM (Rand) - Rand MSA) | 9.41 ± 2.61 | 55.99 ± 14.75 | 6.13 ± 9.88 | 0.49 ± 0.23 |
EvoDiff-MSA (OADM (Max) - Max MSA) | 9.38 ± 2.57 | 57.08 ± 16.01 | 6.74 ± 11.00 | 0.50 ± 0.23 |
EvoDiff-MSA (OADM (Max) - Rand MSA) | 9.59 ± 2.69 | 54.95 ± 16.83 | 6.55 ± 10.49 | 0.46 ± 0.23 |
ESM-MSA-1b | 10.05 ± 2.92 | 51.64 ± 16.54 | 7.13 ± 11.60 | 0.40 ± 0.23 |
Potts | 10.34 ± 2.26 | 55.46 ± 13.82 | 12.01 ± 17.19 | 0.17 ± 0.10 |
[1] Sequence similarity is calculated between the original query sequence and all the sequences in the MSA.
Scaffolding performance of EvoDiff-Seq
Number of scaffolding successes out of 100 generations for RFdiffusion, EvoDiff-Seq, the LRAR baseline, the CARP baseline, and randomly sampled scaffolds (Random), for each of 17 scaffolding problems. The bottom row contains the total number of successful scaffolds generated per model.
PDB | RFdiffusion | EvoDiff-Seq | LRAR | CARP | Random |
---|---|---|---|---|---|
1BCF | 100 | 24 | 0 | 4 | 0 |
6E6R | 71 | 16 | 7 | 3 | 1 |
2KL8 | 88 | 0 | 1 | 1 | 0 |
6EXZ | 42 | 0 | 0 | 0 | 0 |
1YCR | 74 | 13 | 12 | 10 | 7 |
6VW1 | 69 | 1 | 0 | 0 | 0 |
4JHW | 0 | 0 | 0 | 0 | 0 |
5TPN | 61 | 0 | 0 | 0 | 0 |
4ZYP | 40 | 0 | 0 | 0 | 0 |
3IXT | 25 | 23 | 22 | 13 | 7 |
7MRX | 7 | 0 | 0 | 0 | 0 |
1PRW | 8 | 68 | 70 | 54 | 5 |
5IUS | 2 | 0 | 0 | 0 | 0 |
5YUI | 0 | 4 | 0 | 0 | 0 |
5WN9 | 0 | 0 | 0 | 0 | 2 |
1QJG | 0 | 0 | 0 | 0 | 0 |
5TRV | 22 | 0 | 0 | 0 | 0 |
Total | 610 | 149 | 112 | 85 | 22 |
Scaffolding performance of EvoDiff-MSA
Number of scaffolding successes out of 100 generations for RFdiffusion, EvoDiff-MSA (Max), EvoDiff-MSA (Random), and the ESM-MSA baseline, for each of 17 scaffolding problems. The bottom row contains the total number of successful scaffolds generated per model.
PDB | RFdiffusion | EvoDiff-MSA (Max) | EvoDiff-MSA (Random) | ESM-MSA |
---|---|---|---|---|
1BCF | 100 | 100 | 98 | 99 |
6E6R | 71 | 87 | 63 | 96 |
2KL8 | 88 | 11 | 31 | 42 |
6EXZ | 42 | 86 | 87 | 73 |
1YCR | 74 | 3 | 0 | 0 |
6VW1 | 69 | 4 | 3 | 4 |
4JHW | 0 | 0 | 0 | 0 |
5TPN | 61 | 0 | 0 | 0 |
4ZYP | 40 | 0 | 0 | 0 |
3IXT | 25 | 1 | 0 | 5 |
7MRX | 7 | 72 | 68 | 66 |
1PRW | 8 | 48 | 46 | 92 |
5IUS | 2 | 3 | 1 | 7 |
5YUI | 0 | 58 | 44 | 70 |
5WN9 | 0 | 0 | 0 | 0 |
1QJG | 0 | 34 | 22 | 38 |
5TRV | 22 | 15 | 12 | 12 |
Total | 610 | 522 | 475 | 604 |
Model Specifications
License: MIT
Last Updated: May 2025
Input Type: Text
Output Type: Text
Publisher: Microsoft
Languages: 1 Language