facebook-deit-base-patch16-224
Version: 20
Publisher: Meta
Last updated: April 2025
DeiT (Data-efficient image Transformers) is an image transformer that does not require very large amounts of data for training. This is achieved through a novel distillation procedure that uses a teacher-student strategy, which results in high throughput and accuracy. DeiT is pre-trained and fine-tuned on ImageNet-1k (1 million images, 1,000 classes) at a resolution of 224x224. The model was first released in its original repository, but the weights were converted to PyTorch from the timm repository by Ross Wightman.

An image is treated as a sequence of patches and processed by a standard Transformer encoder, as used in NLP. The patches are linearly embedded, and a [CLS] token is added at the beginning of the sequence for classification tasks. Absolute position embeddings are also added before the sequence is fed to the Transformer encoder.

Pre-training thus produces an internal representation of images that can be used to extract features useful for downstream tasks. For instance, if a dataset of labeled images is available, a linear layer can be placed on top of the pre-trained encoder to train a standard classifier.

For more details on DeiT, see the original paper.
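The token arithmetic behind the patch sequence can be sketched in a few lines. This is only an illustration, not the model's implementation: the patch size of 16 is inferred from the model name (patch16), three RGB channels are assumed, and both function names are hypothetical.

```python
def sequence_length(image_size: int = 224, patch_size: int = 16,
                    extra_tokens: int = 1) -> int:
    """Number of tokens fed to the Transformer encoder.

    image_size and patch_size are in pixels; extra_tokens counts the
    [CLS] token prepended to the patch sequence.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side + extra_tokens


def patch_vector_size(patch_size: int = 16, channels: int = 3) -> int:
    """Length of one flattened patch before the linear embedding."""
    return patch_size * patch_size * channels


print(sequence_length())    # 14 * 14 patches + 1 [CLS] token = 197
print(patch_vector_size())  # 16 * 16 * 3 = 768
```

Under these assumptions, a 224x224 input becomes a sequence of 197 tokens, each patch starting out as a 768-value vector before it is linearly embedded.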

Training Details

Training Data

The DeiT model is pre-trained and fine-tuned on ImageNet 2012, which consists of 1 million images in 1,000 classes, at a resolution of 224x224.

Training Procedure

In the preprocessing step, images are resized to a common resolution of 224x224. Augmentations such as RandAugment and random erasing are applied. For more details on the transformations used during training and validation, refer to this link. At inference time, images are resized to 256x256, center-cropped to 224x224, and then normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). The model was trained on a single 8-GPU node for 3 days at a training resolution of 224. For hyperparameter details, refer to Table 9 of the original paper. For details on pre-training (ImageNet-1k) followed by supervised fine-tuning (ImageNet-1k), refer to Sections 2 to 5 of the original paper.
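The inference-time crop and normalization described above can be illustrated with plain arithmetic. This is a rough sketch rather than the actual preprocessing code: it assumes 8-bit pixel values are first scaled to the range 0 to 1 before the stated mean and standard deviation are applied, and the function names are hypothetical.

```python
def center_crop_box(width: int, height: int, crop: int = 224):
    """Return (left, top, right, bottom) of a centered crop x crop square."""
    if crop > width or crop > height:
        raise ValueError("crop size exceeds image size")
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def normalize_pixel(value: int, mean: float = 0.5, std: float = 0.5) -> float:
    """Scale an 8-bit channel value to 0..1 (assumed), then normalize it."""
    return (value / 255.0 - mean) / std


print(center_crop_box(256, 256))                 # (16, 16, 240, 240)
print(normalize_pixel(0), normalize_pixel(255))  # -1.0 1.0
```

With mean and standard deviation both set to 0.5, each channel ends up in the range -1 to 1.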

Evaluation Results

The DeiT base model achieves a top-1 accuracy of 81.8% and a top-5 accuracy of 95.6% on ImageNet, with 86M parameters at an image size of 224x224. For DeiT image classification benchmark results, refer to Table 5 of the original paper. Note that fine-tuning at a higher resolution yields better performance, and that increasing the model size also improves performance.
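For readers unfamiliar with the metrics, the following sketch shows how top-k accuracy is typically computed. It is not the evaluation code behind the reported numbers; the function name and the toy scores and labels are made up for illustration.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one per sample.
    labels: list of true class indices, one per sample.
    """
    if len(scores) != len(labels) or not labels:
        raise ValueError("scores and labels must be non-empty and equal in length")
    hits = 0
    for sample_scores, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(sample_scores)),
                        key=lambda i: sample_scores[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy data: 3 samples, 4 classes (invented for this example).
scores = [
    [0.1, 0.7, 0.1, 0.1],  # highest score: class 1
    [0.4, 0.3, 0.2, 0.1],  # highest score: class 0
    [0.2, 0.2, 0.5, 0.1],  # highest score: class 2
]
labels = [1, 1, 3]
print(top_k_accuracy(scores, labels, k=1))  # 1 of 3 samples correct
print(top_k_accuracy(scores, labels, k=2))  # 2 of 3 samples correct
```

Top-5 accuracy is the same computation with k=5, which is why it is always at least as high as top-1.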

License

apache-2.0

Inference Samples

| Inference type | Python sample (Notebook) | CLI with YAML |
| --- | --- | --- |
| Real time | image-classification-online-endpoint.ipynb | image-classification-online-endpoint.sh |
| Batch | image-classification-batch-endpoint.ipynb | image-classification-batch-endpoint.sh |

Finetuning Samples

| Task | Use case | Dataset | Python sample (Notebook) | CLI with YAML |
| --- | --- | --- | --- | --- |
| Image Multi-class classification | Image Multi-class classification | fridgeObjects | fridgeobjects-multiclass-classification.ipynb | fridgeobjects-multiclass-classification.sh |
| Image Multi-label classification | Image Multi-label classification | multilabel fridgeObjects | fridgeobjects-multilabel-classification.ipynb | fridgeobjects-multilabel-classification.sh |

Evaluation Samples

| Task | Use case | Dataset | Python sample (Notebook) |
| --- | --- | --- | --- |
| Image Multi-class classification | Image Multi-class classification | fridgeObjects | image-multiclass-classification.ipynb |
| Image Multi-label classification | Image Multi-label classification | multilabel fridgeObjects | image-multilabel-classification.ipynb |

Sample input and output

Sample input

{
  "input_data": ["image1", "image2"]
}
Note: the "image1" and "image2" strings should be base64-encoded images or publicly accessible URLs.
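A request body in this shape could be assembled as in the sketch below. It assumes raw image bytes are base64-encoded and URL strings are passed through unchanged; the helper name, the placeholder bytes, and the example URL are hypothetical, and sending the request to an endpoint is left out.

```python
import base64
import json


def build_payload(images):
    """Build the request body from raw image bytes or URL strings.

    bytes items are base64-encoded; str items are passed through
    unchanged and are assumed to be publicly accessible URLs.
    """
    encoded = []
    for item in images:
        if isinstance(item, bytes):
            encoded.append(base64.b64encode(item).decode("ascii"))
        elif isinstance(item, str):
            encoded.append(item)
        else:
            raise TypeError("expected bytes or str")
    return json.dumps({"input_data": encoded})


# Placeholder bytes stand in for the contents of a real image file.
body = build_payload([b"fake-image-bytes", "https://example.com/image2.jpg"])
print(body)
```

In practice the bytes would come from reading an image file in binary mode.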

Sample output

[
  [
    {
      "label" : "can",
      "score" : 0.91
    },
    {
      "label" : "carton",
      "score" : 0.09
    }
  ],
  [
    {
      "label" : "carton",
      "score" : 0.9
    },
    {
      "label" : "can",
      "score" : 0.1
    }
  ]
]
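A response in the shape shown above can be reduced to the top prediction for each image. The sketch below assumes the response arrives as valid JSON text, with one list of label and score objects per input image; the function name is hypothetical and the sample values are copied from the output above.

```python
import json

# Response text in the shape of the sample output (values copied from it).
response_text = """
[
  [{"label": "can", "score": 0.91}, {"label": "carton", "score": 0.09}],
  [{"label": "carton", "score": 0.9}, {"label": "can", "score": 0.1}]
]
"""


def top_predictions(text):
    """Return the (label, score) pair with the highest score for each image."""
    results = json.loads(text)
    best = []
    for predictions in results:
        winner = max(predictions, key=lambda p: p["score"])
        best.append((winner["label"], winner["score"]))
    return best


print(top_predictions(response_text))  # [('can', 0.91), ('carton', 0.9)]
```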

Visualization of inference result for a sample image

[Image: mc visualization]

Model Specifications

License: Apache-2.0
Last updated: April 2025
Publisher: Meta