microsoft-minilm-l12-h384-uncased
Version: 7
HuggingFace · Last updated July 2025

MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation

MiniLM is a distilled model from the paper "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers". Please find information about preprocessing, training, and full details of MiniLM in the original MiniLM repository. Please note: this checkpoint can be used as an in-place substitution for BERT, but it needs to be fine-tuned before use!
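As an illustration of the in-place substitution, the checkpoint can be loaded like any BERT checkpoint with the Hugging Face transformers library. This is a minimal sketch, not part of the model card, and it assumes transformers and PyTorch are installed; the classification head it adds is randomly initialized and must be fine-tuned before the outputs are meaningful.

# Minimal sketch (not from the model card): load the checkpoint as a
# drop-in replacement for BERT. The classification head is randomly
# initialized and must be fine-tuned on a downstream task first.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "microsoft/MiniLM-L12-H384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("I like you. I love you", return_tensors="pt")
outputs = model(**inputs)    # logits are meaningless until the model is fine-tuned
print(outputs.logits.shape)  # torch.Size([1, 2])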

English Pre-trained Models

We release the uncased 12-layer model with a hidden size of 384, distilled from an in-house pre-trained UniLM v2 model of BERT-Base size.
  • MiniLMv1-L12-H384-uncased: 12-layer, 384-hidden, 12-heads, 33M parameters, 2.7x faster than BERT-Base

Fine-tuning on NLU tasks

We present the dev results on SQuAD 2.0 and several GLUE benchmark tasks.
Model             #Param   SQuAD 2.0   MNLI-m   SST-2   QNLI   CoLA   RTE    MRPC   QQP
BERT-Base         109M     76.8        84.5     93.2    91.7   58.9   68.6   87.3   91.3
MiniLM-L12xH384   33M      81.7        85.7     93.0    91.5   58.5   73.3   89.5   91.3
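The sketch below shows, roughly, how such fine-tuning could be run with the Hugging Face transformers Trainer on SST-2; the hyperparameters, output path, and sequence length are illustrative assumptions, not the settings used to produce the numbers above.

# Illustrative sketch only: fine-tune MiniLM on SST-2 with the transformers
# Trainer. Hyperparameters are assumptions, not the settings behind the
# results reported in the table above.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "microsoft/MiniLM-L12-H384-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("glue", "sst2")
dataset = dataset.map(
    lambda batch: tokenizer(batch["sentence"], truncation=True, max_length=128),
    batched=True,
)

args = TrainingArguments(
    output_dir="minilm-sst2",          # assumed output path
    learning_rate=5e-5,                # assumed hyperparameters
    per_device_train_batch_size=32,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,               # enables dynamic padding via the default collator
)
trainer.train()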

Citation

If you find MiniLM useful in your research, please cite the following paper:
@misc{wang2020minilm,
    title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
    author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
    year={2020},
    eprint={2002.10957},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

microsoft/MiniLM-L12-H384-uncased powered by Hugging Face Inference Toolkit

Send Request

You can use cURL or any REST client to send a request to the AzureML endpoint with your AzureML token.
curl <AZUREML_ENDPOINT_URL> \
    -X POST \
    -H "Authorization: Bearer <AZUREML_TOKEN>" \
    -H "Content-Type: application/json" \
    -d '{"inputs":"I like you. I love you"}'

Supported Parameters

  • inputs (string): The text to classify
  • parameters (object):
    • function_to_apply (enum): Possible values: sigmoid, softmax, none.
    • top_k (integer): When specified, limits the output to the top K most probable classes.
Check the full API specification in the Hugging Face Inference documentation.
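For example, a request body that applies a softmax over the logits and returns only the two most probable classes could look like the sketch below; the values are illustrative, and the payload can be sent as the JSON body of the POST request shown earlier.

# Sketch of a request body using the optional parameters listed above.
payload = {
    "inputs": "I like you. I love you",
    "parameters": {
        "function_to_apply": "softmax",  # one of: sigmoid, softmax, none
        "top_k": 2,                      # keep only the 2 most probable classes
    },
}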
Model Specifications
License: MIT
Last Updated: July 2025
Provider: HuggingFace