NVIDIA Nemotron Parse NIM microservice
Version: 1
NVIDIA Nemotron Parse NIM microservice is a general-purpose text-extraction model designed specifically for documents. Given an image, nemotron-parse extracts formatted text along with bounding boxes and the corresponding semantic class for each region. This has downstream benefits for several tasks, such as increasing the availability of training data for Large Language Models (LLMs), improving the accuracy of retriever systems, and enhancing document-understanding pipelines. This model is ready for commercial use.
NVIDIA AI Enterprise is an end-to-end, cloud-native software platform that accelerates data science pipelines and streamlines development and deployment of production-grade co-pilots and other generative AI applications. Easy-to-use microservices provide optimized model performance with enterprise-grade security, support, and stability to ensure a smooth transition from prototype to production for enterprises that run their businesses on AI.
Input
Input Types:
- Image: RGB (Red, Green, Blue)
- Text: Prompt (String)
- Image Dimensions:
  - Maximum Resolution: 1648 x 2048 (Width x Height)
  - Minimum Resolution: 1024 x 1280 (Width x Height)
  - Channel Count: 3
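How the service handles images outside these bounds is not stated here, so a client may want to normalize the page image before encoding it. The sketch below (Python with Pillow; the client-side resizing policy is an assumption, not part of the API) converts to 3-channel RGB, downscales anything larger than the documented 1648 x 2048 maximum, and returns the base64 string used in the request payload.

```python
import base64
import io

from PIL import Image

MAX_W, MAX_H = 1648, 2048  # documented maximum resolution (width x height)


def encode_image(path: str) -> str:
    """Load an image, force 3-channel RGB, shrink it if it exceeds the
    documented maximum resolution, and return a base64-encoded PNG string."""
    img = Image.open(path).convert("RGB")      # channel count: 3
    if img.width > MAX_W or img.height > MAX_H:
        img.thumbnail((MAX_W, MAX_H))          # downscale, preserving aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("ascii")
```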
Output
Output Types:
- Text: The output is provided as a string.
- String (1D)
- The output string encodes the extracted text content (formatted or unformatted), along with bounding box coordinates and associated class attributes (e.g., title, section, caption, etc.).
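The exact encoding of the output string is not documented in this card, but the pieces it carries (text, bounding box, semantic class) map naturally onto a small record type. The following is an illustrative Python structure for holding parsed elements once the string has been decoded; the field names and coordinate convention are assumptions, not the service's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class ElementClass(str, Enum):
    """Semantic classes listed in this card."""
    TITLE = "title"
    SECTION = "section"
    CAPTION = "caption"
    INDEX = "index"
    FOOTNOTE = "footnote"
    LIST = "lists"
    TABLE = "tables"
    BIBLIOGRAPHY = "bibliography"
    IMAGE = "image"


@dataclass
class DocumentElement:
    text: str                                # extracted text, formatted or unformatted
    bbox: tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max) convention
    element_class: ElementClass              # semantic class of the region
```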
Intended Use Case
Nemotron-Parse provides comprehensive text understanding and document-structure understanding. It is used in retriever and curator solutions. Its text-extraction datasets and capabilities help with LLM and VLM training, as well as improve run-time inference accuracy of VLMs. The nemotron-parse model performs text extraction from PDF and PPT documents. Nemotron-parse can classify the objects (title, section, caption, index, footnote, lists, tables, bibliography, image) in a given document and provide bounding boxes with coordinates.
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal developer team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment. Please report security vulnerabilities or NVIDIA AI Concerns here.
Example Curl Request
#!/bin/bash
curl -X 'POST' \
  '<ENDPOINT_URL>/v1/chat/completions' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer <API_KEY>" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "<img src=\"data:image/png;base64,BASE_64_ENCODED_IMAGE\" />"
      }
    ],
    "max_tokens": 8192
  }'
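For reference, the same request issued from Python with the requests library. <ENDPOINT_URL> and <API_KEY> remain placeholders, and the response handling assumes an OpenAI-style body (choices[0].message.content); verify both against your deployment.

```python
import requests

ENDPOINT_URL = "<ENDPOINT_URL>"  # placeholder, as in the curl example
API_KEY = "<API_KEY>"            # placeholder


def parse_page(image_b64: str) -> str:
    """Send one base64-encoded page image to the chat/completions endpoint
    and return the extracted-text string. Mirrors the curl request above."""
    payload = {
        "messages": [
            {
                "role": "user",
                "content": f'<img src="data:image/png;base64,{image_b64}" />',
            }
        ],
        "max_tokens": 8192,
    }
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    }
    resp = requests.post(
        f"{ENDPOINT_URL}/v1/chat/completions",
        json=payload,
        headers=headers,
        timeout=120,
    )
    resp.raise_for_status()
    # Assumed OpenAI-style response structure; adjust if your deployment differs.
    return resp.json()["choices"][0]["message"]["content"]
```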
Model Architecture
Architecture Type: Transformer-based vision-encoder-decoder model
Network Components:
- Vision Encoder: ViT-H model
- Adapter Layer: 1D convolutions & normalization layers used to compress the dimensionality and sequence length of the latent space (from 1280 tokens to 320 tokens)
- Decoder: mBart, 10 blocks
- Tokenizer: The tokenizer included with this model is governed by the CC-BY-4.0 license
- Number of Parameters: Less than 1 Billion (<1B)
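A rough sketch of the adapter idea described above, written in PyTorch purely for illustration: two strided 1D convolutions reduce the encoder's sequence length from 1280 to 320 tokens while projecting the channel dimension, followed by layer normalization. The layer sizes, kernel widths, and ordering are assumptions; this is not the released implementation.

```python
import torch
import torch.nn as nn


class AdapterSketch(nn.Module):
    """Illustrative adapter: compresses a (batch, 1280, d_enc) encoder output
    to (batch, 320, d_out) using strided 1D convolutions and normalization.
    Hyperparameters are assumptions, not the model's actual values."""

    def __init__(self, d_enc: int = 1280, d_out: int = 1024):
        super().__init__()
        self.conv1 = nn.Conv1d(d_enc, d_out, kernel_size=3, stride=2, padding=1)  # 1280 -> 640 tokens
        self.conv2 = nn.Conv1d(d_out, d_out, kernel_size=3, stride=2, padding=1)  # 640 -> 320 tokens
        self.norm = nn.LayerNorm(d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len=1280, d_enc); Conv1d expects (batch, channels, seq_len)
        x = x.transpose(1, 2)
        x = torch.relu(self.conv1(x))
        x = self.conv2(x)
        x = x.transpose(1, 2)  # back to (batch, 320, d_out)
        return self.norm(x)


# Shape check: 1280 encoder tokens are compressed to 320.
tokens = torch.randn(1, 1280, 1280)
assert AdapterSketch()(tokens).shape == (1, 320, 1024)
```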
Training, Testing, and Evaluation Datasets
Training Datasets
Nemotron-Parse is first pre-trained on our internal datasets: human, synthetic, and automated.
Data Modality: Text, Image
Data Collection Method by Dataset: Hybrid: Human, Synthetic, Automated
Labeling Method by Dataset: Hybrid: Human, Synthetic, Automated
Testing and Evaluation Dataset
Nemotron-Parse is evaluated on multiple datasets for robustness, including public and internal datasets.
Data Collection Method by Dataset: Hybrid: Human, Synthetic, Automated
Labeling Method by Dataset: Hybrid: Human, Synthetic, Automated
The following external benchmarks are used for evaluating the model:
| Benchmark | Score |
|---|---|
| MMMU* | 68 |
| MathVista* | 76.9 |
| AI2D | 87.11 |
| OCRBenchv2 | 62.0 |
| OCRBench | 85.6 |
| OCR-Reasoning | 36.4 |
| ChartQA | 89.72 |
| DocVQA | 94.39 |
| Video-MME w/o sub | 65.9 |
| Vision Average | 74.0 |
Model Specifications
License: Custom
Last Updated: December 2025
Input Type: Text, Image
Output Type: Text
Provider: NVIDIA
Languages: 1 Language