camembert-base

Version: 14
CamemBERT is a state-of-the-art language model for French based on the RoBERTa model. It is available on Hugging Face in six versions that differ in number of parameters, amount of pretraining data, and source domain of the pretraining data.
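
Because this model is a fill-mask model, a typical first use is to predict a hidden word in a French sentence. The following is a minimal sketch, not an official recipe. It assumes the Hugging Face `transformers` library (with a backend such as PyTorch and the `sentencepiece` tokenizer package) is installed and that the `camembert-base` checkpoint can be downloaded. The helper names `mask_word` and `predict_masked` are hypothetical and introduced only for illustration.

```python
MODEL_ID = "camembert-base"  # checkpoint name as listed in this card
MASK_TOKEN = "<mask>"        # mask token used by CamemBERT's tokenizer


def mask_word(sentence: str, word: str) -> str:
    """Replace the first occurrence of `word` with the mask token."""
    if word not in sentence:
        raise ValueError(f"{word!r} not found in sentence")
    return sentence.replace(word, MASK_TOKEN, 1)


def predict_masked(sentence: str, top_k: int = 5):
    """Return (token, score) pairs for the masked position.

    transformers is imported lazily so that mask_word() stays usable
    without it. The checkpoint is downloaded on first use.
    """
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID, tokenizer=MODEL_ID)
    results = fill_mask(sentence, top_k=top_k)
    # With a single mask, each result is a dict with "token_str" and "score".
    return [(r["token_str"], r["score"]) for r in results]
```

For example, `predict_masked(mask_word("Le camembert est délicieux", "délicieux"))` would return the model's top candidate tokens for the masked word with their scores. The exact predictions depend on the checkpoint and library version.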

Training Data

OSCAR (Open Super-large Crawled Aggregated coRpus) is a multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the Ungoliant architecture.

Training Procedure

Model                                    | #params | Arch. | Training data
camembert-base                           | 110M    | Base  | OSCAR (138 GB of text)
camembert/camembert-large                | 335M    | Large | CCNet (135 GB of text)
camembert/camembert-base-ccnet           | 110M    | Base  | CCNet (135 GB of text)
camembert/camembert-base-wikipedia-4gb   | 110M    | Base  | Wikipedia (4 GB of text)
camembert/camembert-base-oscar-4gb       | 110M    | Base  | Subsample of OSCAR (4 GB of text)
camembert/camembert-base-ccnet-4gb       | 110M    | Base  | Subsample of CCNet (4 GB of text)

The model developers evaluated CamemBERT using four different downstream tasks for French: part-of-speech (POS) tagging, dependency parsing, named entity recognition (NER) and natural language inference (NLI).

Quick facts

Model provider
Type: Fill mask
Lifecycle: Generally available (GA)