Llama-3.2-NV-rerankqa-1b-v2-NIM-microservice
Version: 2
Publisher: Nvidia
Last updated: March 2025
NVIDIA NeMo™ Retriever Llama3.2 reranking model is optimized to provide a logit score that represents how relevant a document is to a given query. The model was fine-tuned for multilingual, cross-lingual text question-answering retrieval, with support for long documents (up to 8192 tokens). It was evaluated on 26 languages: English, Arabic, Bengali, Chinese, Czech, Danish, Dutch, Finnish, French, German, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, and Turkish.

A reranking model is a component of a text retrieval system used to improve overall accuracy. A text retrieval system typically uses an embedding model (dense) or a lexical search (sparse) index to return relevant text passages for a given input; a reranking model then reorders these candidates into a final ranking. Because the reranking model takes question-passage pairs as input, it can compute cross-attention between the words of the question and the passage. It is not feasible to apply a ranking model to every document in a knowledge base, so ranking models are usually deployed in combination with embedding models. This model is ready for commercial use.

The Llama 3.2 1B reranking model is part of the NVIDIA NeMo Retriever collection of NIM microservices included in NVIDIA AI Enterprise, which provides state-of-the-art, commercially-ready models and microservices optimized for the lowest latency and highest throughput. It features a production-ready information retrieval pipeline with enterprise support. The models that form the core of this solution have been trained using responsibly selected, auditable data sources. With multiple pre-trained models available as starting points, developers can also readily customize them for their domain-specific use cases, such as information technology, human resources help assistants, and research & development research assistants.
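As a rough illustration of how the two stages fit together, here is a minimal retrieve-then-rerank sketch in Python. It assumes a NeMo Retriever embedding NIM and this reranking NIM are running locally; the endpoint paths, ports, and request/response field names shown are assumptions based on the typical NIM HTTP interface, so check the NIM API reference for your deployment.

```python
# Minimal retrieve-then-rerank sketch (illustrative only).
# Assumptions: an embedding NIM at EMBED_URL exposing an OpenAI-style
# /v1/embeddings endpoint, and this reranking NIM at RERANK_URL exposing a
# /v1/ranking endpoint. Field names may differ in your deployment.
import numpy as np
import requests

EMBED_URL = "http://localhost:8001/v1/embeddings"   # hypothetical host/port
RERANK_URL = "http://localhost:8000/v1/ranking"     # hypothetical host/port

def embed(texts, input_type):
    """Dense (bi-encoder) embeddings for queries or passages."""
    resp = requests.post(EMBED_URL, json={
        "model": "nvidia/llama-3.2-nv-embedqa-1b-v2",
        "input": texts,
        "input_type": input_type,        # "query" or "passage"
    })
    resp.raise_for_status()
    return np.array([d["embedding"] for d in resp.json()["data"]])

def rerank(query, passages):
    """Cross-encoder reranking of a small candidate set."""
    resp = requests.post(RERANK_URL, json={
        "model": "nvidia/llama-3.2-nv-rerankqa-1b-v2",
        "query": {"text": query},
        "passages": [{"text": p} for p in passages],
    })
    resp.raise_for_status()
    # Each entry carries the passage index and a relevance logit.
    return resp.json()["rankings"]

corpus = ["Passage A ...", "Passage B ...", "Passage C ..."]
query = "What is retrieval-augmented generation?"

# Stage 1: cheap dense retrieval over the whole corpus
# (dot product assumes unit-normalized embeddings).
doc_vecs = embed(corpus, "passage")
q_vec = embed([query], "query")[0]
scores = doc_vecs @ q_vec
top_k = np.argsort(scores)[::-1][:2]     # keep a small candidate set

# Stage 2: expensive cross-attention reranking of the candidates only.
candidates = [corpus[i] for i in top_k]
for r in rerank(query, candidates):
    print(candidates[r["index"]], r["logit"])
```

The key design point is that the expensive cross-attention scoring is paid only for the small candidate set, while the cheaper dense retrieval covers the full corpus.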
NVIDIA AI Enterprise

NVIDIA AI Enterprise is an end-to-end, cloud-native software platform that accelerates data science pipelines and streamlines development and deployment of production-grade co-pilots and other generative AI applications. Easy-to-use microservices provide optimized model performance with enterprise-grade security, support, and stability to ensure a smooth transition from prototype to production for enterprises that run their businesses on AI.

Intended Use

Primary Use Cases

The NeMo Retriever Llama 3.2 reranking model is most suitable for users who want to improve multilingual retrieval by reranking a set of candidate passages for a given question. The model was trained on question answering over text documents in multiple languages and was evaluated to work successfully with sequence lengths of up to 8192 tokens; longer texts should be either chunked or truncated. For each query-passage pair, the model returns a raw logit as its relevance score. Users can apply a Sigmoid activation function to the logits to obtain probability scores.
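Where a probability interpretation of the relevance score is useful, a Sigmoid can be applied to the returned logits. A minimal sketch (the logit values shown are made up for illustration):

```python
# Illustrative only: converting the reranker's raw logits into probabilities.
# The logit values below are hypothetical; in practice they come from the
# ranking endpoint's response, one per candidate passage.
import math

def sigmoid(logit: float) -> float:
    return 1.0 / (1.0 + math.exp(-logit))

logits = [4.2, -1.3, 0.7]            # hypothetical scores for 3 passages
probs = [sigmoid(x) for x in logits]
print(probs)                         # ~[0.985, 0.214, 0.668]
```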

Responsible AI Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. For more detailed information on ethical considerations for this model, please see the Model Card++ tab for the Explainability, Bias, Safety & Security, and Privacy subcards.

Training Data

The development of large-scale public open-QA datasets has enabled tremendous progress in powerful embedding models. However, one popular dataset, MSMARCO, restricts commercial use in its license, limiting the use of models trained on it in commercial settings. To address this, NVIDIA created its own training dataset blend based on public QA datasets, each of which is licensed for commercial applications.
Evaluation Results

We evaluated the retrieval pipelines on a set of benchmarks, applying the ranking model to the candidates returned by a retrieval embedding model. Overall, the pipeline llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nv-rerankqa-1b-v2 provides high BEIR+TechQA accuracy with multilingual and cross-lingual support, and the llama-3.2-nv-rerankqa-1b-v2 ranking model is 3.5x smaller than the nv-rerankqa-mistral-4b-v3 model. We compared the NVIDIA Retrieval QA models against open and commercial retriever models from the literature on academic question-answering benchmarks: NQ, HotpotQA, and FiQA (Finance Q&A) from the BEIR benchmark, plus the TechQA dataset. The metric used was Recall@5. As described above, the ranking model is applied to the output of an embedding model.
Model | Average Recall@5
llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nv-rerankqa-1b-v2 | 73.64%
llama-3.2-nv-embedqa-1b-v2 | 68.60%
nv-embedqa-e5-v5 + nv-rerankqa-mistral-4b-v3 | 75.45%
nv-embedqa-e5-v5 | 62.07%
nv-embedqa-e5-v4 | 57.65%
e5-large_unsupervised | 48.03%
BM25 | 44.67%
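For reference, here is a minimal sketch of the Recall@5 computation, under the common simplification that a query counts as a hit if at least one of its relevant passages appears in the top 5. This matches the usual single-relevant-passage QA setup and is not NVIDIA's exact evaluation harness.

```python
# Recall@k over a set of queries: the fraction of queries for which at least
# one relevant passage appears among the top-k ranked candidates.
def recall_at_k(ranked_ids_per_query, relevant_ids_per_query, k=5):
    hits = 0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        if any(doc_id in relevant for doc_id in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids_per_query)

# Two toy queries: the first has a relevant doc in its top 5, the second does not.
ranked = [["d3", "d7", "d1", "d9", "d2"], ["d4", "d8", "d6", "d5", "d0"]]
relevant = [{"d1"}, {"d2"}]
print(recall_at_k(ranked, relevant))   # 0.5
```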
We evaluated the model’s multilingual capabilities on the MIRACL academic benchmark, a multilingual retrieval dataset covering 15 languages, plus an additional 11 languages translated from the English and Spanish versions of MIRACL. The reported scores are based on a custom subsampled version of the corpus, created by selecting hard negatives for each query to reduce the corpus size.
Model | Average Recall@5
llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nv-rerankqa-1b-v2 | 65.80%
llama-3.2-nv-embedqa-1b-v2 | 60.75%
nv-embedqa-mistral-7b-v2 | 50.42%
BM25 | 26.51%
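The hard-negative subsampling mentioned above can be illustrated with a generic sketch (not NVIDIA's exact procedure): the reduced corpus keeps each query's relevant passages plus its highest-scoring non-relevant passages under some baseline retriever.

```python
# Generic corpus-subsampling illustration. `baseline_scores` is a hypothetical
# mapping query_id -> list of (doc_id, score) over the full corpus, and
# `relevant_ids_per_query` maps query_id -> set of relevant doc_ids.
def subsample_corpus(baseline_scores, relevant_ids_per_query, n_negatives=50):
    keep = set()
    for qid, scored_docs in baseline_scores.items():
        relevant = relevant_ids_per_query[qid]
        keep |= relevant                                   # always keep positives
        ranked = sorted(scored_docs, key=lambda d: d[1], reverse=True)
        negatives = [doc for doc, _ in ranked if doc not in relevant]
        keep |= set(negatives[:n_negatives])               # hardest negatives only
    return keep
```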
We evaluated the cross-lingual capabilities on the academic benchmark MLQA, which covers 7 languages (Arabic, Chinese, English, German, Hindi, Spanish, Vietnamese). We consider only evaluation sets in which the query and the documents are in different languages, and we report the average Recall@5 across the resulting 42 language pairs (7 x 6 ordered combinations).
Model | Average Recall@5
llama-3.2-nv-embedqa-1b-v2 + llama-3.2-nv-rerankqa-1b-v2 | 86.83%
llama-3.2-nv-embedqa-1b-v2 | 79.86%
nv-embedqa-mistral-7b-v2 | 68.38%
BM25 | 13.01%
The Llama 3.2 NV-rerankqa 1b-v2 NIM is optimized to run on the following compute configurations:
GPU | Total GPU memory (GB) | Azure VM compute | # GPUs on VM
A100 | 80 | Standard_NC24ads_A100_v4 | 1
A100 | 160 | Standard_NC48ads_A100_v4 | 2
A100 | 320 | Standard_NC96ads_A100_v4 | 4
A100 | 640 | Standard_ND96amsr_A100_v4 | 8
Model Specifications

License: Custom
Last Updated: March 2025
Publisher: Nvidia