databricks-dbrx-base
Version: 3
Publisher: Databricks
Last updated: April 2024

Model Overview

DBRX is a transformer-based decoder-only large language model (LLM) that was trained using next-token prediction. It uses a fine-grained mixture-of-experts (MoE) architecture with 132B total parameters, of which 36B parameters are active on any input. It was pre-trained on 12T tokens of text and code data. Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts. DBRX has 16 experts and chooses 4, while Mixtral-8x7B and Grok-1 have 8 experts and choose 2. This provides 65x more possible combinations of experts, and we found that this improves model quality.

DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA). It uses the GPT-4 tokenizer as provided in the tiktoken repository. We made these choices based on exhaustive evaluation and scaling experiments.

DBRX was pretrained on 12T tokens of carefully curated data and a maximum context length of 32K tokens. We estimate that this data is at least 2x better token-for-token than the data we used to pretrain the MPT family of models. This new dataset was developed using the full suite of Databricks tools, including Apache Spark™ and Databricks notebooks for data processing, and Unity Catalog for data management and governance. We used curriculum learning for pretraining, changing the data mix during training in ways we found to substantially improve model quality.
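The 65x figure follows directly from counting the possible expert subsets per token. A quick back-of-the-envelope check in Python (illustrative arithmetic only, not DBRX routing code):

import math

# DBRX: fine-grained MoE, 16 experts with 4 active per token.
dbrx_combinations = math.comb(16, 4)    # 1820 possible expert subsets

# Mixtral-8x7B / Grok-1: 8 experts with 2 active per token.
coarse_combinations = math.comb(8, 2)   # 28 possible expert subsets

print(dbrx_combinations / coarse_combinations)  # 65.0 -> "65x more combinations"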

Usage

There are several general ways to use the DBRX models:
  • DBRX Base and DBRX Instruct are available for download on Hugging Face (see the loading sketch after this list).
  • The DBRX model repository can be found on GitHub.
  • DBRX Base and DBRX Instruct are available with Databricks Foundation Model API via both Pay-per-token and Provisioned throughput endpoints. These are enterprise-ready deployments.
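As a minimal sketch of the Hugging Face download path, the snippet below loads DBRX Base with transformers. It assumes you have accepted the model license on Hugging Face, are authenticated with an access token, and have enough GPU memory for the 132B-parameter checkpoint; the prompt and generation settings are illustrative, not recommendations from this card.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "databricks/dbrx-base" is the gated Hugging Face repo; prior
# authentication (e.g., `huggingface-cli login`) is assumed.
model_id = "databricks/dbrx-base"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumes hardware with room for bf16 weights
    device_map="auto",
    trust_remote_code=True,      # needed on older transformers versions
)

inputs = tokenizer("Databricks was founded in", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs, max_new_tokens=50, do_sample=True, temperature=0.1, top_p=0.9
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))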

Limitations

Training Dataset Limitations

The DBRX models were trained on 12T tokens of text, with a knowledge cutoff date of January 2024. The training mix used for DBRX contains both natural-language and code examples. The vast majority of our training data is in the English language. We did not test DBRX for non-English proficiency. Therefore, DBRX should be considered a generalist model for text-based use in the English language. DBRX does not have multimodal capabilities.

Training Stack

MoE models are complicated to train, and the training of DBRX Base and DBRX Instruct was heavily supported by Databricks’ infrastructure for data processing and large-scale LLM training (e.g., Composer, Streaming, MegaBlocks, and LLM Foundry).

Composer is our core library for large-scale training. It provides an optimized training loop, easy checkpointing and logging, FSDP-based model sharding, convenient abstractions, extreme customizability via callbacks, and more.

Streaming enables fast, low-cost, and scalable training on large datasets from cloud storage. It handles a variety of challenges around deterministic resumption as node counts change, avoiding redundant downloads across devices, high-quality shuffling at scale, sample-level random access, and speed.

MegaBlocks is a lightweight library for MoE training. Crucially, it supports “dropless MoE,” which avoids inefficient padding and is intended to provide deterministic outputs for a given sequence no matter what other sequences are in the batch.

LLM Foundry ties all of these libraries together to create a simple LLM pretraining, fine-tuning, and inference experience.

DBRX was trained using proprietary optimized versions of the above open source libraries, along with our LLM training platform.
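For a feel of how Streaming fits into such a pipeline, here is a minimal, hypothetical sketch of reading a pre-converted dataset from cloud storage during training. The bucket path, cache directory, and batch size are placeholders, and this is not the DBRX training configuration.

from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Placeholder paths; the dataset is assumed to have been written in
# MDS shard format (e.g., with streaming's MDSWriter) beforehand.
dataset = StreamingDataset(
    remote="s3://my-bucket/pretraining-tokens",  # cloud copy of the shards
    local="/tmp/streaming-cache",                # local shard cache
    shuffle=True,                                # scalable, deterministic shuffling
    batch_size=8,                                # per-device batch size hint
)

loader = DataLoader(dataset, batch_size=8, num_workers=4)
for batch in loader:
    ...  # feed tokens to the training loop (e.g., a Composer-based trainer)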

Evaluation

We find that DBRX Instruct outperforms established open-source and open-weight base models on the Databricks Model Gauntlet, the Hugging Face Open LLM Leaderboard, and HumanEval. Full evaluation details can be found in our technical blog post.

Acknowledgements

The DBRX models were made possible thanks in large part to the open-source community, especially:
  • The MegaBlocks library, which established a foundation for our MoE implementation
  • PyTorch FSDP, which we built on for distributed training

Inference samples

Inference type | Python sample (Notebook) | CLI with YAML
Real time | text-generation-online-endpoint.ipynb | text-generation-online-endpoint.sh

Sample Inputs and Outputs (for real-time inference)

Sample Input

{
  "input_data": {
    "input_string": ["Write me a poem about Databricks."],
    "parameters": {
      "temperature": 0.1,
      "top_p": 0.9,
      "do_sample": true,
      "max_new_tokens": 100
    }
  }
}

Sample Output

[
  {
    "0": "Write me a poem about Databricks. I want it to be a sonnet, 14 lines, iambic pentameter, and I want it to be about the company's mission to accelerate innovation for its customers. I want it to mention how Databricks unifies data science, engineering, and business, and how it provides a collaborative workspace for data teams to work on big data and AI projects. I want it to mention how Databricks is built on Apache Spark and how it provides a managed platform for data engineering"
  }
]
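
For context, a payload like the sample input above could be posted to a deployed real-time endpoint roughly as follows. The scoring URL, API key, and header names are placeholders and may differ for your deployment, so treat this as a sketch rather than exact client code.

import json
import urllib.request

# Placeholders: substitute your own scoring URL and key.
endpoint_url = "https://<your-endpoint>.inference.ml.azure.com/score"
api_key = "<your-api-key>"

payload = {
    "input_data": {
        "input_string": ["Write me a poem about Databricks."],
        "parameters": {
            "temperature": 0.1,
            "top_p": 0.9,
            "do_sample": True,
            "max_new_tokens": 100,
        },
    }
}

request = urllib.request.Request(
    endpoint_url,
    data=json.dumps(payload).encode("utf-8"),
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    },
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read()))
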
Model Specifications

License: Other
Last Updated: April 2024
Publisher: Databricks