Description

Falcon-40B is a large language model (LLM) developed by the Technology Innovation Institute (TII) with 40 billion parameters. It is a causal decoder-only model trained on 1 trillion tokens from the RefinedWeb dataset, enhanced with curated corpora. Falcon-40B supports English, German, Spanish, and French languages, with limited capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish. It is available under the Apache 2.0 license. Falcon-40B is considered the best open-source model currently available, optimized for inference with features such as FlashAttention and multiquery. However, it is recommended to fine-tune the model for specific use cases. The training of Falcon-40B involved using 384 A100 40GB GPUs and took two months. The model carries biases and stereotypes encountered online and requires appropriate precautions for production use. It is suggested to finetune the model for specific tasks and consider guardrails. The technical specifications, training details, and evaluation results are provided in the summary.

The above summary was generated using ChatGPT. Review the original model card to understand the data used to train the model, evaluation metrics, license, intended uses, limitations and bias before using the model.

Training Details

Training Data

Falcon-40B was trained on 1,000B tokens of RefinedWeb , a high-quality filtered and deduplicated web dataset which we enhanced with curated corpora. Significant components from our curated copora were inspired by The Pile (Gao et al., 2020 ).

Data source	Fraction	Tokens	Sources
RefinedWeb-English	75%	750B	massive web crawl
RefinedWeb-Europe	7%	70B	European massive web crawl
Books	6%	60B
Conversations	5%	50B	Reddit, StackOverflow, HackerNews
Code	5%	50B
Technical	2%	20B	arXiv, PubMed, USPTO, etc.

RefinedWeb-Europe is made of the following languages:

Language	Fraction of multilingual data	Tokens
German	26%	18B
Spanish	24%	17B
French	23%	16B
Italian	7%	5B
Portuguese	4%	3B
Polish	4%	3B
Dutch	4%	3B
Romanian	3%	2B
Czech	3%	2B
Swedish	2%	1B

The data was tokenized with the Falcon-7B /40B tokenizer.

Training Procedure

Falcon-40B was trained on 384 A100 40GB GPUs, using a 3D parallelism strategy (TP=8, PP=4, DP=12) combined with ZeRO.

Training Hyperparameters

Hyperparameter	Value	Comment
Precision	`bfloat16`
Optimizer	AdamW
Learning rate	1.85e-4	4B tokens warm-up, cosine decay to 1.85e-5
Weight decay	1e-1
Z-loss	1e-4
Batch size	1152	100B tokens ramp-up

Speeds, Sizes, Times

Training started in December 2022 and took two months.

Evaluation

Paper coming soon. See the OpenLLM Leaderboard for early results.

Technical Specifications

Model Architecture and Objective

Falcon-40B is a causal decoder-only model trained on a causal language modeling task (i.e., predict the next token). The architecture is broadly adapted from the GPT-3 paper (Brown et al., 2020 ), with the following differences:

Positionnal embeddings: rotary (Su et al., 2021 );
Attention: multiquery (Shazeer et al., 2019 ) and FlashAttention (Dao et al., 2022 );
Decoder-block: parallel attention/MLP with a two layer norms.

For multiquery, we are using an internal variant which uses independent key and values per tensor parallel degree.

Hyperparameter	Value	Comment
Layers	60
`d_model`	8192
`head_dim`	64	Reduced to optimise for FlashAttention
Vocabulary	65024
Sequence length	2048

Compute Infrastructure

Hardware

Falcon-40B was trained on AWS SageMaker, on 384 A100 40GB GPUs in P4d instances.

Software

Falcon-40B was trained a custom distributed training codebase, Gigatron. It uses a 3D parallelism approach combined with ZeRO and high-performance Triton kernels (FlashAttention, etc.)

License

Falcon-40B is made available under the Apache 2.0 license.

Finetuning samples

Task	Use case	Dataset	Python sample (Notebook)	CLI with YAML
Text Classification	Emotion Detection	Emotion	emotion-detection.ipynb	emotion-detection.sh

Model Evaluation Sample

Task	Use case	Dataset	Python sample (Notebook)	CLI with YAML
Text generation	Text generation	cnn_dailymail	evaluate-model-text-generation.ipynb	evaluate-model-text-generation.yml

Inference samples

Inference type	Python sample (Notebook)	CLI with YAML
Real time	text-generation-online-endpoint.ipynb	text-generation-online-endpoint.sh
Batch	text-generation-batch-endpoint.ipynb	coming soon

Sample input (for real-time inference)

{
  "input_data": {
      "input_string":["The meaning of the life is"]
  }
}

Sample output

[
  {
    "0": "The meaning of the life is to find your gift. The purpose of life is to give it away"
  }
]

Model Specifications

LicenseApache-2.0

Last UpdatedAugust 2024

Publishertiiuae

Languages4 Languages

Quick Start