Llama-3.2-11B-Vision-Instruct

Excels in image reasoning capabilities on high-res images for visual understanding apps.

About this model

The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image.

Key model capabilities

Visual Question Answering (VQA) and Visual Reasoning: Imagine a machine that looks at a picture and understands your questions about it.
Document Visual Question Answering (DocVQA): Imagine a computer understanding both the text and layout of a document, like a map or contract, and then answering questions about it directly from the image.
Image Captioning: Image captioning bridges the gap between vision and language, extracting details, understanding the scene, and then crafting a sentence or two that tells the story.
Image-Text Retrieval: Image-text retrieval is like a matchmaker for images and their descriptions. Similar to a search engine but one that understands both pictures and words.
Visual Grounding: Visual grounding is like connecting the dots between what we see and say. It's about understanding how language references specific parts of an image, allowing AI models to pinpoint objects or regions based on natural language descriptions.

Use cases

Pricing

Technical specs

Training disclosure

Distribution

More information

Quick facts

Model providerMeta

TypeChat completion

LifecycleGenerally available (GA)

Input typetext, image, audio

Output typetext

Context window128k

Token limits4096 output

PricingView pricing

Llama-3.2-11B-Vision-Instruct

About this model

Key model capabilities

Quick facts

Quick start