microsoft-multilingual-minilm-l12-h384
Version: 6
HuggingFace · Last updated July 2025

MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation

MiniLM is a distilled model from the paper "MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers". Please find information about preprocessing, training, and full details of MiniLM in the original MiniLM repository. Please note: this checkpoint uses BertModel with XLMRobertaTokenizer, so AutoTokenizer won't work with this checkpoint!

Multilingual Pretrained Model

  • Multilingual-MiniLMv1-L12-H384: 12-layer, 384-hidden, 12-heads, 21M Transformer parameters, 96M embedding parameters
Multilingual MiniLM uses the same tokenizer as XLM-R, but the Transformer architecture of our model is the same as BERT's. We provide the fine-tuning code on XNLI based on huggingface/transformers. Please replace run_xnli.py in transformers with ours to fine-tune multilingual MiniLM. We evaluate multilingual MiniLM on the cross-lingual natural language inference benchmark (XNLI) and the cross-lingual question answering benchmark (MLQA).
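The headline parameter counts in the bullet above can be sanity-checked with quick arithmetic, assuming the XLM-R vocabulary of ~250k tokens and the standard BERT feed-forward width of 4× the hidden size:

```python
# Rough parameter accounting for Multilingual-MiniLM-L12-H384.
# Assumptions: XLM-R vocabulary (~250k tokens), feed-forward width 4*hidden;
# biases and LayerNorm weights are small and ignored here.
vocab_size, hidden, layers, ffn = 250_002, 384, 12, 4 * 384

embedding_params = vocab_size * hidden  # token embeddings dominate
per_layer = (
    4 * hidden * hidden   # Q, K, V, and attention output projections
    + 2 * hidden * ffn    # feed-forward up- and down-projections
)
transformer_params = layers * per_layer

print(f"embeddings  ~{embedding_params / 1e6:.0f}M")   # ~96M
print(f"transformer ~{transformer_params / 1e6:.0f}M")  # ~21M
```

The result matches the 21M Transformer / 96M embedding split quoted above; most of this model's size is in the multilingual embedding matrix.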

Cross-Lingual Natural Language Inference - XNLI

We evaluate our model on cross-lingual transfer from English to other languages. Following Conneau et al. (2019), we select the best single model on the joint dev set of all the languages.
| Model | #Layers | #Hidden | #Transformer Parameters | Average | en | fr | es | de | el | bg | ru | tr | ar | vi | th | zh | hi | sw | ur |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mBERT | 12 | 768 | 85M | 66.3 | 82.1 | 73.8 | 74.3 | 71.1 | 66.4 | 68.9 | 69.0 | 61.6 | 64.9 | 69.5 | 55.8 | 69.3 | 60.0 | 50.4 | 58.0 |
| XLM-100 | 16 | 1280 | 315M | 70.7 | 83.2 | 76.7 | 77.7 | 74.0 | 72.7 | 74.1 | 72.7 | 68.7 | 68.6 | 72.9 | 68.9 | 72.5 | 65.6 | 58.2 | 62.4 |
| XLM-R Base | 12 | 768 | 85M | 74.5 | 84.6 | 78.4 | 78.9 | 76.8 | 75.9 | 77.3 | 75.4 | 73.2 | 71.5 | 75.4 | 72.5 | 74.9 | 71.1 | 65.2 | 66.5 |
| mMiniLM-L12xH384 | 12 | 384 | 21M | 71.1 | 81.5 | 74.8 | 75.7 | 72.9 | 73.0 | 74.5 | 71.3 | 69.7 | 68.8 | 72.1 | 67.8 | 70.0 | 66.2 | 63.3 | 64.2 |
This example code fine-tunes 12-layer multilingual MiniLM on XNLI.
# run fine-tuning on XNLI
DATA_DIR=/{path_of_data}/
OUTPUT_DIR=/{path_of_fine-tuned_model}/
MODEL_PATH=/{path_of_pre-trained_model}/

python ./examples/run_xnli.py --model_type minilm \
 --output_dir ${OUTPUT_DIR} --data_dir ${DATA_DIR} \
 --model_name_or_path microsoft/Multilingual-MiniLM-L12-H384 \
 --tokenizer_name xlm-roberta-base \
 --config_name ${MODEL_PATH}/multilingual-minilm-l12-h384-config.json \
 --do_train \
 --do_eval \
 --max_seq_length 128 \
 --per_gpu_train_batch_size 128 \
 --learning_rate 5e-5 \
 --num_train_epochs 5 \
 --per_gpu_eval_batch_size 32 \
 --weight_decay 0.001 \
 --warmup_steps 500 \
 --save_steps 1500 \
 --logging_steps 1500 \
 --eval_all_checkpoints \
 --language en \
 --fp16 \
 --fp16_opt_level O2

Cross-Lingual Question Answering - MLQA

Following Lewis et al. (2019b), we adopt SQuAD 1.1 as training data and use MLQA English development data for early stopping.
| Model (F1 score) | #Layers | #Hidden | #Transformer Parameters | Average | en | es | de | ar | hi | vi | zh |
|---|---|---|---|---|---|---|---|---|---|---|---|
| mBERT | 12 | 768 | 85M | 57.7 | 77.7 | 64.3 | 57.9 | 45.7 | 43.8 | 57.1 | 57.5 |
| XLM-15 | 12 | 1024 | 151M | 61.6 | 74.9 | 68.0 | 62.2 | 54.8 | 48.8 | 61.4 | 61.1 |
| XLM-R Base (Reported) | 12 | 768 | 85M | 62.9 | 77.8 | 67.2 | 60.8 | 53.0 | 57.9 | 63.1 | 60.2 |
| XLM-R Base (Our fine-tuned) | 12 | 768 | 85M | 64.9 | 80.3 | 67.0 | 62.7 | 55.0 | 60.4 | 66.5 | 62.3 |
| mMiniLM-L12xH384 | 12 | 384 | 21M | 63.2 | 79.4 | 66.1 | 61.2 | 54.9 | 58.5 | 63.1 | 59.0 |

Citation

If you find MiniLM useful in your research, please cite the following paper:
@misc{wang2020minilm,
    title={MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
    author={Wenhui Wang and Furu Wei and Li Dong and Hangbo Bao and Nan Yang and Ming Zhou},
    year={2020},
    eprint={2002.10957},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

microsoft/Multilingual-MiniLM-L12-H384 powered by Hugging Face Inference Toolkit

Send Request

You can use cURL or any REST Client to send a request to the AzureML endpoint with your AzureML token.
curl <AZUREML_ENDPOINT_URL> \
    -X POST \
    -H "Authorization: Bearer <AZUREML_TOKEN>" \
    -H "Content-Type: application/json" \
    -d '{"inputs":"I like you. I love you"}'

Supported Parameters

  • inputs (string): The text to classify
  • parameters (object):
    • function_to_apply (enum): Possible values: sigmoid, softmax, none.
    • top_k (integer): When specified, limits the output to the top K most probable classes.
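The cURL request above can also be sent from Python with only the standard library. This is a hedged sketch: `build_payload` and `classify` are illustrative helper names, and `<AZUREML_ENDPOINT_URL>` / `<AZUREML_TOKEN>` remain placeholders you must supply.

```python
import json
import urllib.request


def build_payload(text, top_k=None, function_to_apply=None):
    """Assemble the JSON body using the supported parameters listed above."""
    payload = {"inputs": text}
    params = {}
    if top_k is not None:
        params["top_k"] = top_k
    if function_to_apply is not None:
        params["function_to_apply"] = function_to_apply
    if params:
        payload["parameters"] = params
    return payload


def classify(endpoint_url, token, text, **kwargs):
    """POST the payload to the AzureML endpoint and return the parsed response."""
    req = urllib.request.Request(
        endpoint_url,
        data=json.dumps(build_payload(text, **kwargs)).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


# Example call (placeholders must be replaced with real values):
# classify("<AZUREML_ENDPOINT_URL>", "<AZUREML_TOKEN>",
#          "I like you. I love you", top_k=2, function_to_apply="softmax")
```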
Check the full API specification in the Hugging Face Inference documentation.
Model Specifications

  • License: MIT
  • Last Updated: July 2025
  • Provider: HuggingFace