camembert-base
CamemBERT is a state-of-the-art language model for French based on the RoBERTa model.
It is now available on Hugging Face in 6 different versions that vary in parameter count, amount of pretraining data, and source domain of the pretraining data.
Training Data
OSCAR, or Open Super-large Crawled Aggregated coRpus, is a multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.
Training Procedure
| Model | #params | Arch. | Training data |
|---|---|---|---|
| camembert-base | 110M | Base | OSCAR (138 GB of text) |
| camembert/camembert-large | 335M | Large | CCNet (135 GB of text) |
| camembert/camembert-base-ccnet | 110M | Base | CCNet (135 GB of text) |
| camembert/camembert-base-wikipedia-4gb | 110M | Base | Wikipedia (4 GB of text) |
| camembert/camembert-base-oscar-4gb | 110M | Base | Subsample of OSCAR (4 GB of text) |
| camembert/camembert-base-ccnet-4gb | 110M | Base | Subsample of CCNet (4 GB of text) |
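
As a rough illustration of how the checkpoints in the table might be selected and loaded, here is a minimal sketch. The dictionary mirrors the table rows. The helper names `pick_variants` and `load_fill_mask` are hypothetical and not part of any library. The loader assumes the `transformers` package is installed with a backend such as PyTorch, along with `sentencepiece` for the tokenizer, and that the checkpoint can be downloaded from the Hugging Face Hub.

```python
from typing import List, Optional

# Checkpoint metadata copied from the table above
# (parameter counts in millions, data sizes in GB of text).
CAMEMBERT_VARIANTS = {
    "camembert-base": {"params_m": 110, "arch": "Base", "data": "OSCAR", "data_gb": 138},
    "camembert/camembert-large": {"params_m": 335, "arch": "Large", "data": "CCNet", "data_gb": 135},
    "camembert/camembert-base-ccnet": {"params_m": 110, "arch": "Base", "data": "CCNet", "data_gb": 135},
    "camembert/camembert-base-wikipedia-4gb": {"params_m": 110, "arch": "Base", "data": "Wikipedia", "data_gb": 4},
    "camembert/camembert-base-oscar-4gb": {"params_m": 110, "arch": "Base", "data": "OSCAR", "data_gb": 4},
    "camembert/camembert-base-ccnet-4gb": {"params_m": 110, "arch": "Base", "data": "CCNet", "data_gb": 4},
}


def pick_variants(arch: Optional[str] = None, max_data_gb: Optional[int] = None) -> List[str]:
    """Hypothetical helper: return checkpoint names matching the given filters."""
    selected = []
    for model_id, info in CAMEMBERT_VARIANTS.items():
        if arch is not None and info["arch"] != arch:
            continue
        if max_data_gb is not None and info["data_gb"] > max_data_gb:
            continue
        selected.append(model_id)
    return selected


def load_fill_mask(model_id: str = "camembert-base"):
    """Hypothetical helper: build a fill-mask pipeline for one checkpoint.

    The import is deferred so the metadata helpers above work even when
    transformers is not installed. Calling this downloads model weights.
    """
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)
```

For example, `pick_variants(max_data_gb=4)` lists the three checkpoints trained on 4 GB of text, and `load_fill_mask()("Le camembert est <mask> :)")` would return candidate completions for the masked token once the weights are downloaded.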
The model developers evaluated CamemBERT using four different downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI).