NVIDIA Nemotron Parse NIM microservice
Version: 1
NVIDIA Nemotron Parse NIM microservice is a general-purpose text-extraction model designed specifically for documents. Given an image, nemotron-parse extracts formatted text along with bounding boxes and the corresponding semantic class for each region. This has downstream benefits for several tasks, such as increasing the availability of training data for Large Language Models (LLMs), improving the accuracy of retriever systems, and enhancing document-understanding pipelines. This model is ready for commercial use.
NVIDIA AI Enterprise is an end-to-end, cloud-native software platform that accelerates data science pipelines and streamlines development and deployment of production-grade co-pilots and other generative AI applications. Easy-to-use microservices provide optimized model performance with enterprise-grade security, support, and stability to ensure a smooth transition from prototype to production for enterprises that run their businesses on AI.
Input
Input Types:
- Image: RGB (Red, Green, Blue)
- Text: Prompt (String)
- Image Dimensions:
  - Maximum Resolution: 1648 x 2048 (Width x Height)
  - Minimum Resolution: 1024 x 1280 (Width x Height)
  - Channel Count: 3
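How the service handles images outside these bounds is not stated here, so a client may want to normalize the page image before encoding it. The sketch below (Python with Pillow; the client-side resizing policy is an assumption, not part of the API) converts to 3-channel RGB, downscales anything larger than the documented 1648 x 2048 maximum, and returns the base64 string used in the request payload.

```python
import base64
import io

from PIL import Image

MAX_W, MAX_H = 1648, 2048  # documented maximum resolution (width x height)


def encode_image(path: str) -> str:
    """Load an image, force 3-channel RGB, shrink it if it exceeds the
    documented maximum resolution, and return a base64-encoded PNG string."""
    img = Image.open(path).convert("RGB")      # channel count: 3
    if img.width > MAX_W or img.height > MAX_H:
        img.thumbnail((MAX_W, MAX_H))          # downscale, preserving aspect ratio
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode("ascii")
```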
Output
Output Types:
- Text: The output is provided as a string.
- String (1D)
- The output string encodes the extracted text content (formatted or unformatted), along with bounding box coordinates and associated class attributes (e.g., title, section, caption, etc.).
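The exact encoding of the output string is not documented in this card, but the pieces it carries (text, bounding box, semantic class) map naturally onto a small record type. The following is an illustrative Python structure for holding parsed elements once the string has been decoded; the field names and coordinate convention are assumptions, not the service's actual schema.

```python
from dataclasses import dataclass
from enum import Enum


class ElementClass(str, Enum):
    """Semantic classes listed in this card."""
    TITLE = "title"
    SECTION = "section"
    CAPTION = "caption"
    INDEX = "index"
    FOOTNOTE = "footnote"
    LIST = "lists"
    TABLE = "tables"
    BIBLIOGRAPHY = "bibliography"
    IMAGE = "image"


@dataclass
class DocumentElement:
    text: str                                # extracted text, formatted or unformatted
    bbox: tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max) convention
    element_class: ElementClass              # semantic class of the region
```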
Intended Use Case
Nemotron-Parse provides comprehensive text understanding and document-structure understanding. It is used in retriever and curator solutions. Its text-extraction datasets and capabilities help with LLM and VLM training, as well as improve run-time inference accuracy of VLMs. The nemotron-parse model performs text extraction from PDF and PPT documents. Nemotron-parse can classify the objects (title, section, caption, index, footnote, lists, tables, bibliography, image) in a given document and provide bounding boxes with coordinates.
Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal developer team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse. Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment. Please report security vulnerabilities or NVIDIA AI Concerns here.
Example Curl Request
#!/bin/bash
curl -X 'POST' \
  '<ENDPOINT_URL>/v1/chat/completions' \
  -H 'Accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer <API_KEY>" \
  -d '{
    "messages": [
      {
        "role": "user",
        "content": "<img src=\"data:image/png;base64,BASE_64_ENCODED_IMAGE\" />"
      }
    ],
    "max_tokens": 8192
  }'
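For reference, the same request issued from Python with the requests library. <ENDPOINT_URL> and <API_KEY> remain placeholders, and the response handling assumes an OpenAI-style body (choices[0].message.content); verify both against your deployment.

```python
import requests

ENDPOINT_URL = "<ENDPOINT_URL>"  # placeholder, as in the curl example
API_KEY = "<API_KEY>"            # placeholder


def parse_page(image_b64: str) -> str:
    """Send one base64-encoded page image to the chat/completions endpoint
    and return the extracted-text string. Mirrors the curl request above."""
    payload = {
        "messages": [
            {
                "role": "user",
                "content": f'<img src="data:image/png;base64,{image_b64}" />',
            }
        ],
        "max_tokens": 8192,
    }
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "Authorization": f"Bearer {API_KEY}",
    }
    resp = requests.post(
        f"{ENDPOINT_URL}/v1/chat/completions",
        json=payload,
        headers=headers,
        timeout=120,
    )
    resp.raise_for_status()
    # Assumed OpenAI-style response structure; adjust if your deployment differs.
    return resp.json()["choices"][0]["message"]["content"]
```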
Model Architecture
Architecture Type: Transformer-based vision-encoder-decoder model
Network Components:
- Vision Encoder: ViT-H model
- Adapter Layer: 1D convolutions & normalization layers used to compress the dimensionality and sequence length of the latent space (from 1280 tokens to 320 tokens)
- Decoder: mBart, 10 blocks
- Tokenizer: The tokenizer included with this model is governed by the CC-BY-4.0 license
- Number of Parameters: Less than 1 Billion (<1B)
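A rough sketch of the adapter idea described above, written in PyTorch purely for illustration: two strided 1D convolutions reduce the encoder's sequence length from 1280 to 320 tokens while projecting the channel dimension, followed by layer normalization. The layer sizes, kernel widths, and ordering are assumptions; this is not the released implementation.

```python
import torch
import torch.nn as nn


class AdapterSketch(nn.Module):
    """Illustrative adapter: compresses a (batch, 1280, d_enc) encoder output
    to (batch, 320, d_out) using strided 1D convolutions and normalization.
    Hyperparameters are assumptions, not the model's actual values."""

    def __init__(self, d_enc: int = 1280, d_out: int = 1024):
        super().__init__()
        self.conv1 = nn.Conv1d(d_enc, d_out, kernel_size=3, stride=2, padding=1)  # 1280 -> 640 tokens
        self.conv2 = nn.Conv1d(d_out, d_out, kernel_size=3, stride=2, padding=1)  # 640 -> 320 tokens
        self.norm = nn.LayerNorm(d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len=1280, d_enc); Conv1d expects (batch, channels, seq_len)
        x = x.transpose(1, 2)
        x = torch.relu(self.conv1(x))
        x = self.conv2(x)
        x = x.transpose(1, 2)  # back to (batch, 320, d_out)
        return self.norm(x)


# Shape check: 1280 encoder tokens are compressed to 320.
tokens = torch.randn(1, 1280, 1280)
assert AdapterSketch()(tokens).shape == (1, 320, 1024)
```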
Training, Testing, and Evaluation Datasets
Training Datasets
Nemotron-Parse is first pre-trained on our internal datasets: human, synthetic, and automated.
Data Modality: Text, Image
Data Collection Method by Dataset: Hybrid: Human, Synthetic, Automated
Labeling Method by Dataset: Hybrid: Human, Synthetic, Automated
Testing and Evaluation Dataset
Nemotron-Parse is evaluated on multiple datasets for robustness, including public and internal datasets.
Data Collection Method by Dataset: Hybrid: Human, Synthetic, Automated
Labeling Method by Dataset: Hybrid: Human, Synthetic, Automated
The following external benchmarks are used for evaluating the model:
| Benchmark | Score |
|---|---|
| MMMU* | 68 |
| MathVista* | 76.9 |
| AI2D | 87.11 |
| OCRBenchv2 | 62.0 |
| OCRBench | 85.6 |
| OCR-Reasoning | 36.4 |
| ChartQA | 89.72 |
| DocVQA | 94.39 |
| Video-MME w/o sub | 65.9 |
| Vision Average | 74.0 |
Model Specifications
License: Custom
Last Updated: December 2025
Input Type: Text, Image
Output Type: Text
Provider: NVIDIA
Languages: 1 Language