Llama 3.1 Nemotron Nano VL 8B v1 NIM microservice
Llama 3.1 Nemotron Nano VL 8B v1 NIM microservice
Version: 2
NvidiaLast updated September 2025
Versatile vision-language model for querying and summarizing images and video, deployable from data center to edge (via AWQ 4-bit TinyChat), with key findings that interleaved image-text, LLM unfreezing, and re-blended text-only data are essential for stro
Multimodal
Vision
Summarization
Llama Nemotron Nano VL is a leading document intelligence vision language model (VLMs) that enables the ability to query and summarize images and video from the physical or virtual world. Llama Nemotron Nano VL is deployable in the data center, cloud and at the edge, including Jetson Orin and laptop by AWQ 4bit quantization through TinyChat framework. We find: (1) image-text pairs are not enough, interleaved image-text is essential; (2) unfreezing LLM during interleaved image-text pre-training enables in-context learning; (3)re-blending text-only instruction data is crucial to boost both VLM and text-only performance. This model was trained on commercial images and videos for all three stages of training and supports single image and video inference. Llama-3.1-Nemotron-Nano-VL-8B-V1 is available as an NVIDIA NIM™ microservice, part of NVIDIA AI Enterprise . NVIDIA NIM offers prebuilt containers for large language models (LLMs) that can be used to develop chatbots, content analyzers—or any application that needs to understand and generate human language. Each NIM consists of a container and a model and uses a CUDA-accelerated runtime for all NVIDIA GPUs, with special optimizations available for many configurations. NVIDIA AI Enterprise
NVIDIA AI Enterprise is an end-to-end, cloud-native software platform that accelerates data science pipelines and streamlines development and deployment of production-grade co-pilots and other generative AI applications. Easy-to-use microservices provide optimized model performance with enterprise-grade security, support, and stability to ensure a smooth transition from prototype to production for enterprises that run their businesses on AI.

Intended Use

Primary Use Cases

Llama-3.1-Nemotron-Nano-VL-8B-V1 Use Cases: Image summarization. Text-image analysis, Optical Character Recognition, Interactive Q&A on images, Comparison and contrast of multiple images, Text Chain-of-Thought reasoning

Responsible AI Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Training Data

NV-Pretraining and NV-CosmosNemotron-SFT were used for training and evaluation It includes:
  • Internal datasets built with public commercial images and internal labels, supporting tasks like conversation modeling and document analysis.
  • Public datasets sourced from publicly available images and annotations, adapted for tasks such as image captioning and visual question answering.
  • Synthetic datasets generated programmatically for specific tasks like tabular data understanding.
  • Specialized datasets for safety alignment, function calling, and domain-specific tasks (e.g., science diagrams, financial question answering).
BenchmarkScore
MMMU Val with chatGPT as a judge48.2%
AI2D84.8%
ChartQA86.3%
InfoVQA Val76.2%
OCRBench839
OCRBenchV2 English60.1%
OCRBenchV2 Chinese37.9%
DocVQA val91.2%
VideoMME49.2%
Source: Llama-3.1-Nemotron-Nano-VL-8B-V1 Llama-3.1-Nemotron-Nano-VL-8B-V1 NIM is optimized to run best on the following compute:
GPUTotal GPU memoryAzure VM compute#GPUs on VMLink
A10080Standard_NC24ads_A100_v41link
H10094STANDARD_NC40ADS_H100_V51link
Model Specifications
LicenseCustom
Last UpdatedSeptember 2025
Input TypeText,Image
Output TypeText
PublisherNvidia
Languages1 Language