Llama-3.3-70B-Instruct-NIM-microservice

Nvidia

Version: 2

Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction tuned generative model in 70B (text in/text out). The Llama 3.3 instruction-tuned text-only model is optimized for multilingual dialogue use cases and outperforms many of the available open source and closed chat models on common industry benchmarks. It has context length of 128k and token count of 15T+. Token counts refer to pretraining data only. All model versions use Grouped-Query Attention (GQA) for improved inference scalability. This model is developed by Meta and is ready for commercial use.

Llama 3.3 70B-Instruct is available as an NVIDIA NIM™ microservice, part of NVIDIA AI Enterprise . NVIDIA NIM offers prebuilt containers for large language models (LLMs) that can be used to develop chatbots, content analyzers—or any application that needs to understand and generate human language. Each NIM consists of a container and a model and uses a CUDA-accelerated runtime for all NVIDIA GPUs, with special optimizations available for many configurations. NVIDIA NIM is the fastest way to achieve accelerated generative AI inference at scale and has been benchmarked to have up to 2.6x improved throughput latency.

NVIDIA AI Enterprise
NVIDIA AI Enterprise is an end-to-end, cloud-native software platform that accelerates data science pipelines and streamlines development and deployment of production-grade co-pilots and other generative AI applications. Easy-to-use microservices provide optimized model performance with enterprise-grade security, support, and stability to ensure a smooth transition from prototype to production for enterprises that run their businesses on AI.

Quick facts

Model providerNvidia

TypeChat completion

LifecycleGenerally available (GA)

PricingView pricing

Llama-3.3-70B-Instruct-NIM-microservice

Quick facts

Quick start