facebook-deit-base-patch16-224
Version: 20
DeiT (Data-efficient image Transformers) is an image transformer that does not require very large amounts of data for training. This is achieved through a novel teacher-student distillation procedure, which yields high throughput and accuracy. DeiT is pre-trained and fine-tuned on ImageNet-1k (1 million images, 1,000 classes) at resolution 224x224. The model was first released in this repository, but the weights were converted to PyTorch from the timm repository by Ross Wightman.
An image is treated as a sequence of patches and processed by a standard Transformer encoder as used in NLP. The patches are linearly embedded, and a [CLS] token is added at the beginning of the sequence for classification tasks. The model also adds absolute position embeddings before feeding the sequence to the Transformer encoder. Pre-training therefore creates an inner representation of images that can be used to extract features useful for downstream tasks: for instance, given a dataset of labeled images, a linear layer can be placed on top of the pre-trained encoder to train a standard classifier.
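The patch arithmetic above can be sketched in a few lines. This is illustrative only (the function name and helper are not part of the model's API); it shows why a 224x224 input with 16x16 patches gives the encoder 196 patch tokens plus the [CLS] token:

```python
# Illustrative sketch: how an image becomes a token sequence for the encoder.
# Names here are hypothetical, not from the DeiT/transformers API.
def deit_sequence_length(image_size=224, patch_size=16, num_extra_tokens=1):
    """Number of tokens the Transformer encoder sees: patches plus [CLS]."""
    patches_per_side = image_size // patch_size   # 224 // 16 = 14
    num_patches = patches_per_side ** 2           # 14 * 14 = 196
    return num_patches + num_extra_tokens         # 196 + 1 = 197

print(deit_sequence_length())  # 197
```

For the base model each of these 197 tokens is a 768-dimensional embedding; distilled DeiT variants add one more (distillation) token on top of [CLS].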
For more details on DeiT, refer to the original paper.
Training Details
Training Data
The DeiT model is pre-trained and fine-tuned on ImageNet 2012, consisting of 1 million images and 1,000 classes, at a resolution of 224x224.
Training Procedure
In the preprocessing step, images are resized to the same resolution, 224x224. Augmentations such as Rand-Augment and random erasing are used; for more details on the transformations applied during training/validation, refer to this link. At inference time, images are rescaled to resolution 256x256, center-cropped at 224x224, and then normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). The model was trained on a single 8-GPU node for 3 days at a training resolution of 224. For more details on hyperparameters, refer to table 9 of the original paper. For more details on pre-training (ImageNet-1k) followed by supervised fine-tuning (ImageNet-1k), refer to sections 2 to 5 of the original paper.
Evaluation Results
The DeiT base model achieves a top-1 accuracy of 81.8% and a top-5 accuracy of 95.6% on ImageNet with 86M parameters at image size 224x224. For DeiT image classification benchmark results, refer to table 5 of the original paper. Note that during fine-tuning, superior performance is attained at higher resolutions, and increasing the model size further improves performance.
License
apache-2.0
Inference Samples
Inference type | Python sample (Notebook) | CLI with YAML |
---|---|---|
Real time | image-classification-online-endpoint.ipynb | image-classification-online-endpoint.sh |
Batch | image-classification-batch-endpoint.ipynb | image-classification-batch-endpoint.sh |
Finetuning Samples
Task | Use case | Dataset | Python sample (Notebook) | CLI with YAML |
---|---|---|---|---|
Image Multi-class classification | Image Multi-class classification | fridgeObjects | fridgeobjects-multiclass-classification.ipynb | fridgeobjects-multiclass-classification.sh |
Image Multi-label classification | Image Multi-label classification | multilabel fridgeObjects | fridgeobjects-multilabel-classification.ipynb | fridgeobjects-multilabel-classification.sh |
Evaluation Samples
Task | Use case | Dataset | Python sample (Notebook) |
---|---|---|---|
Image Multi-class classification | Image Multi-class classification | fridgeObjects | image-multiclass-classification.ipynb |
Image Multi-label classification | Image Multi-label classification | multilabel fridgeObjects | image-multilabel-classification.ipynb |
Sample input and output
Sample input
{
"input_data": ["image1", "image2"]
}
Note: The "image1" and "image2" strings should be base64-encoded images or publicly accessible URLs.
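A minimal sketch of building this request body in Python. The file name and URL below are placeholders for illustration; only the `input_data` structure and the base64 encoding are taken from the format above:

```python
import base64
import json

def encode_image(image_bytes: bytes) -> str:
    # Base64-encode raw image bytes for the "input_data" field.
    return base64.b64encode(image_bytes).decode("utf-8")

# In practice the bytes would come from a real file,
# e.g. open("photo.jpg", "rb").read(); a stand-in is used here.
fake_image_bytes = b"stand-in-image-bytes"

payload = {
    "input_data": [
        encode_image(fake_image_bytes),    # base64-encoded image
        "https://example.com/image2.jpg",  # or a publicly accessible URL
    ]
}
body = json.dumps(payload)
print(body)
```

The resulting `body` string is what gets posted to the endpoint.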
Sample output
[
[
{
"label" : "can",
"score" : 0.91
},
{
"label" : "carton",
"score" : 0.09
}
],
[
{
"label" : "carton",
"score" : 0.9
},
{
"label" : "can",
"score" : 0.1
}
]
]
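Since the response is a list of per-image prediction lists, picking the top label per image is a one-liner. A small sketch using the sample output above:

```python
# Sample response in the format shown above: one list of
# {"label", "score"} dicts per input image.
response = [
    [{"label": "can", "score": 0.91}, {"label": "carton", "score": 0.09}],
    [{"label": "carton", "score": 0.9}, {"label": "can", "score": 0.1}],
]

# Highest-scoring label for each image.
top_predictions = [max(preds, key=lambda p: p["score"])["label"]
                   for preds in response]
print(top_predictions)  # ['can', 'carton']
```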
Visualization of inference result for a sample image
![Inference visualization for a sample image](/plot_facebook-deit-base-patch16-224_laptop_MC.png)
Model Specifications
License: Apache-2.0
Last Updated: April 2025
Publisher: Meta