microsoft-rho-math-1b-v0.1
Version: 1
Rho-1: Not All Tokens Are What You Need
[π Arxiv] β’ [π¬ HF Paper] β’ [π€ Models] β’ [π± GitHub]
Figure 1: Rho-1 is pre-trained with Selective Language Modeling (SLM). SLM improves average few-shot accuracy on GSM8k and MATH by over 16%, achieving the baseline performance 5-10x faster.
π₯ News
- [2024/04/12] π₯π₯π₯ Rho-Math-v0.1 models released at π€ HuggingFace!
- Rho-Math-1B and Rho-Math-7B achieve 15.6% and 31.0% few-shot accuracy on MATH dataset, respectively β matching DeepSeekMath with only 3% of the pretraining tokens.
- Rho-Math-1B-Interpreter is the first 1B LLM that achieves over 40% accuracy on MATH.
- Rho-Math-7B-Interpreter achieves 52% on MATH dataset, using only 69k samples for fine-tuning.
- [2024/04/11] Rho-1 paper and repo released.
π‘ Introduction
Rho-1 base models employ Selective Language Modeling (SLM) for pretraining, which selectively trains on clean and useful tokens that aligned with the desired distribution.Selective Lanugage Modeling (SLM)
Figure 2: Upper: Even an extensively filtered pretraining corpus contains token-level noise. Left: Previous Causal Language Modeling (CLM) trains on all tokens. Right: Our proposed Selective Language Modeling (SLM) selectively applies loss on those useful and clean tokens.
Figure 3: The pipeline of Selective Language Modeling. SLM optimizes language model performance by concentrating on valuable, clean tokens during pre-training. It involves three steps: (Step 1) Initially, train a reference model on high-quality data. (Step 2) Then, score each token's loss in a corpus using the reference model. (Step 3) Finally, train the language model selectively on tokens that show higher excess loss compared to the reference loss.
Evaluation Results
Base models (Few-shot CoT):| Model | Size | Data | Uniq. Token | Train Token | GSM8K | MATH | MMLU STEM | SAT |
|---|---|---|---|---|---|---|---|---|
| 1-2B Base Models | ||||||||
| Qwen1.5 | 1.8B | - | - | - | 36.1 | 6.8 | 31.3 | 40.6 |
| Gemma | 2.0B | - | - | - | 18.8 | 11.4 | 34.4 | 50.0 |
| DeepSeekMath | 1.3B | - | 120B | 150B | 23.8 | 13.6 | 33.1 | 56.3 |
| Rho-Math-1B-v0.1 | 1.1B | OWM | 14B | 30B | 36.2 | 15.6 | 23.3 | 28.1 |
| >= 7B Base Models | ||||||||
| Mistral | 7B | - | - | 41.2 | 11.6 | 49.5 | 59.4 | |
| Minerva | 540B | - | 39B | 26B | 58.8 | 33.6 | 63.9 | - |
| LLemma | 34B | PPile | 55B | 50B | 54.2 | 23.0 | 54.7 | 68.8 |
| InternLM2-Math | 20B | - | 31B | 125B | 65.4 | 30.0 | 53.1 | 71.9 |
| DeepSeekMath | 7B | - | 120B | 500B | 64.1 | 34.2 | 56.4 | 84.4 |
| Rho-Math-7B-v0.1 | 7B | OWM | 14B | 10.5B | 66.9 | 31.0 | 54.6 | 84.4 |
| Model | Size | SFT Data | GSM8k | MATH | SVAMP | ASDiv | MAWPS | TabMWP | GSM-Hard | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| gpt4-early (pal) | - | - | 94.2 | 51.8 | 94.8 | 92.6 | 97.7 | 95.9 | 77.6 | 86.4 |
| gpt-4-turbo-2024-04-09 (cot) | - | - | - | 73.4 | - | - | - | - | - | |
| Open-Source Small Models | ||||||||||
| MAmmoTH | 70B | MI-260k | 76.9 | 41.8 | 82.4 | - | - | - | - | - |
| ToRA | 7B | ToRA-69k | 68.8 | 40.1 | 68.2 | 73.9 | 88.8 | 42.4 | 54.6 | 62.4 |
| ToRA | 70B | ToRA-69k | 84.3 | 49.7 | 82.7 | 86.8 | 93.8 | 74.0 | 67.2 | 76.9 |
| DeepSeekMath | 7B | ToRA-69k | 79.8 | 52.0 | 80.1 | 87.1 | 93.8 | 85.8 | 63.1 | 77.4 |
| Rho-Math-1B-Interpreter-v0.1 | 1B | ToRA-69k | 59.4 | 40.6 | 60.7 | 74.2 | 88.6 | 26.7 | 48.1 | 56.9 |
| Rho-Math-7B-Interpreter-v0.1 | 7B | ToRA-69k | 81.3 | 51.8 | 80.8 | 85.5 | 94.5 | 70.1 | 63.1 | 75.3 |
π Quick Start
Evaluation
git clone git@github.com:microsoft/rho.git
cd rho-1/math-evaluation-harness
bash scripts/run_eval.sh cot microsoft/rho-math-7b-v0.1
bash scripts/run_eval.sh tora microsoft/rho-math-7b-interpreter-v0.1
rho-1/outputs.zip.
βοΈ Citation
If you find this repository helpful, please consider citing our paper:@misc{lin2024rho1,
title={Rho-1: Not All Tokens Are What You Need},
author={Zhenghao Lin and Zhibin Gou and Yeyun Gong and Xiao Liu and Yelong Shen and Ruochen Xu and Chen Lin and Yujiu Yang and Jian Jiao and Nan Duan and Weizhu Chen},
year={2024},
eprint={2404.07965},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
microsoft/rho-math-1b-v0.1 powered by Text Generation Inference (TGI)
Send Request
You can use cURL or any REST Client to send a request to the AzureML endpoint with your AzureML token.curl <AZUREML_ENDPOINT_URL> \
-X POST \
-d '{"inputs":"What is Deep Learning?"}' \
-H "Authorization: Bearer <AZUREML_TOKEN>" \
-H "Content-Type: application/json"
Supported Parameters
- inputs (string): Input prompt.
- parameters (object):
- best_of (integer): Generate best_of sequences and return the one if the highest token logprobs.
- decoder_input_details (boolean): Whether to return decoder input token logprobs and ids.
- details (boolean): Whether to return generation details.
- do_sample (boolean): Activate logits sampling.
- frequency_penalty (float): The parameter for frequency penalty. 1.0 means no penalty Penalize new tokens based on their existing frequency in the text so far, decreasing the modelβs likelihood to repeat the same line verbatim.
- grammar (object): One of the following
- #1 (object):
- type (enum): Possible values: json.
- value (string): A string that represents a JSON Schema. JSON Schema is a declarative language that allows to annotate JSON documents with types and descriptions.
- #2 (object):
- type (enum): Possible values: regex.
- value (string): The regular expression.
- #3 (object):
- type (enum): Possible values: json_schema.
- value (object):
- name (string): Optional name identifier for the schema
- schema (object): The actual JSON schema definition
- #1 (object):
- max_new_tokens (integer): Maximum number of tokens to generate.
- repetition_penalty (float): The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
- return_full_text (boolean): Whether to prepend the prompt to the generated text
- seed (integer): Random sampling seed.
- stop (string[]): Stop generating tokens if a member of stop is generated.
- temperature (float): The value used to module the logits distribution.
- top_k (integer): The number of highest probability vocabulary tokens to keep for top-k-filtering.
- top_n_tokens (integer): The number of highest probability vocabulary tokens to keep for top-n-filtering.
- top_p (float): Top-p value for nucleus sampling.
- truncate (integer): Truncate inputs tokens to the given size.
- typical_p (float): Typical Decoding mass See Typical Decoding for Natural Language Generation for more information.
- watermark (boolean): Watermarking with A Watermark for Large Language Models.
- stream (boolean): Whether to stream the output tokens or not. Defaults to false.
Example payload
{
"inputs": "What is Deep Learning?",
"parameters": {
"do_sample": true,
"top_p": 0.95,
"temperature": 0.2,
"top_k": 50,
"max_new_tokens": 256,
"repetition_penalty": 1.03,
"stop": ["\nUser:", "<|endoftext|>", "</s>"]
}
}
OpenAI Chat Completion API compatibility
Additionally, Text Generation Inference (TGI) offers an OpenAI Chat Completion API compatible layer under the endpoint/v1/chat/completions,check the full specification in the OpenAI Chat Completion Create Documentation .
Model Specifications
LicenseMit
Last UpdatedAugust 2025
ProviderHuggingFace