facebook-deit-base-patch16-224

Version: 20

Meta • Last updated April 2025

DeiT (Data-efficient image Transformers) is an image transformer that does not require very large amounts of data for training. This is achieved through a novel distillation procedure based on a teacher-student strategy, which results in high throughput and accuracy. DeiT is pre-trained and fine-tuned on ImageNet-1k (1 million images, 1,000 classes) at a resolution of 224x224. The model was first released in the authors' original repository, and the weights were converted to PyTorch from the timm repository by Ross Wightman.

An image is treated as a sequence of patches and processed by a standard Transformer encoder of the kind used in NLP. The patches are linearly embedded, and a [CLS] token is added at the beginning of the sequence for classification tasks. Absolute position embeddings are added before the sequence is fed to the Transformer encoder.

Pre-training thus produces an inner representation of images from which features useful for downstream tasks can be extracted. For instance, if a dataset of labeled images is available, a linear layer can be placed on top of the pre-trained encoder to train a standard classifier.
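To make the patch sequence concrete, the following sketch works out how many tokens the encoder sees for one image. It assumes a patch size of 16 (taken from the "patch16" part of the model name), 3 RGB channels, and a single extra [CLS] token; the function name `patch_sequence_shape` is hypothetical and not part of any library.

```python
def patch_sequence_shape(image_size=224, patch_size=16, channels=3, extra_tokens=1):
    """Return (patches_per_side, num_patches, sequence_length, patch_dim).

    Assumptions: square image, square non-overlapping patches, and
    `extra_tokens` special tokens (here only [CLS]) prepended to the sequence.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    # One token per patch, plus the [CLS] token at the start of the sequence.
    sequence_length = num_patches + extra_tokens
    # Number of raw pixel values flattened into each patch before the linear embedding.
    patch_dim = patch_size * patch_size * channels
    return patches_per_side, num_patches, sequence_length, patch_dim


side, num_patches, seq_len, patch_dim = patch_sequence_shape()
print(side, num_patches, seq_len, patch_dim)  # 14 196 197 768
```

Under these assumptions a 224x224 image becomes a 14x14 grid of patches, so one position embedding is needed for each of the 197 sequence positions.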

For more details on DeiT, see the original paper.

Training Details

Training Data

The DeiT model is pre-trained and fine-tuned on ImageNet 2012, which consists of 1 million images and 1,000 classes, at a resolution of 224x224.

Training Procedure

In the preprocessing step, images are resized to a common resolution of 224x224. Augmentations such as RandAugment and random erasing are applied. For more details on the transformations used during training and validation, refer to this link. At inference time, images are resized to 256x256, center-cropped to 224x224, and then normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5).

The model was trained on a single 8-GPU node for 3 days at a training resolution of 224. For more details on hyperparameters, refer to Table 9 of the original paper. For more details on pre-training (ImageNet-1k) followed by supervised fine-tuning (ImageNet-1k), refer to Sections 2 to 5 of the original paper.
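The inference-time preprocessing described above (resize to 256x256, center crop to 224x224, per-channel normalization) can be sketched in plain Python. This is an illustration only, not the endpoint's actual implementation: the helper names are hypothetical, and scaling 8-bit pixel values to [0, 1] before normalizing is an assumption that the text does not state explicitly.

```python
def center_crop_box(width, height, crop_size=224):
    """Return (left, top, right, bottom) of a centered square crop."""
    if crop_size > width or crop_size > height:
        raise ValueError("crop_size must not exceed the image dimensions")
    left = (width - crop_size) // 2
    top = (height - crop_size) // 2
    return left, top, left + crop_size, top + crop_size


def normalize_pixel(rgb, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)):
    """Normalize one 8-bit RGB pixel channel by channel.

    Assumption: values are first scaled from [0, 255] to [0, 1],
    then mapped with (x - mean) / std.
    """
    return tuple((value / 255.0 - m) / s for value, m, s in zip(rgb, mean, std))


# A 256x256 image is cropped to the central 224x224 region.
print(center_crop_box(256, 256))      # (16, 16, 240, 240)
# With mean 0.5 and std 0.5, pixel values land in the range [-1, 1].
print(normalize_pixel((0, 255, 255)))  # (-1.0, 1.0, 1.0)
```

With mean and standard deviation both 0.5, the normalization simply maps the [0, 1] range onto [-1, 1].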

Evaluation Results

The DeiT base model achieves a top-1 accuracy of 81.8% and a top-5 accuracy of 95.6% on ImageNet with 86M parameters at an image size of 224x224. For DeiT image classification benchmark results, refer to Table 5 of the original paper. Note that fine-tuning at a higher resolution yields better performance, and increasing the model size also improves performance.
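For readers unfamiliar with the metrics, the sketch below shows how top-k accuracy is commonly computed: a prediction counts as correct when the true class is among the k highest-scoring classes. The function name `top_k_accuracy` and the toy scores are made up for illustration and are unrelated to the ImageNet results quoted above.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores.

    scores: list of per-class score lists, one list per sample.
    labels: list of true class indices, one per sample.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Toy data (hypothetical): 4 samples, 3 classes.
scores = [
    [0.1, 0.7, 0.2],  # true class 1: correct at top-1
    [0.5, 0.3, 0.2],  # true class 1: only correct at top-2
    [0.2, 0.3, 0.5],  # true class 0: incorrect even at top-2
    [0.6, 0.3, 0.1],  # true class 0: correct at top-1
]
labels = [1, 1, 0, 0]

print(top_k_accuracy(scores, labels, k=1))  # 0.5
print(top_k_accuracy(scores, labels, k=2))  # 0.75
```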

License

apache-2.0

Inference Samples

Inference type	Python sample (Notebook)	CLI with YAML
Real time	image-classification-online-endpoint.ipynb	image-classification-online-endpoint.sh
Batch	image-classification-batch-endpoint.ipynb	image-classification-batch-endpoint.sh

Finetuning Samples

Task	Use case	Dataset	Python sample (Notebook)	CLI with YAML
Image Multi-class classification	Image Multi-class classification	fridgeObjects	fridgeobjects-multiclass-classification.ipynb	fridgeobjects-multiclass-classification.sh
Image Multi-label classification	Image Multi-label classification	multilabel fridgeObjects	fridgeobjects-multilabel-classification.ipynb	fridgeobjects-multilabel-classification.sh

Evaluation Samples

Task	Use case	Dataset	Python sample (Notebook)
Image Multi-class classification	Image Multi-class classification	fridgeObjects	image-multiclass-classification.ipynb
Image Multi-label classification	Image Multi-label classification	multilabel fridgeObjects	image-multilabel-classification.ipynb

Sample input and output

Sample input

{
  "input_data": ["image1", "image2"]
}

Note: the "image1" and "image2" strings should be base64-encoded images or publicly accessible URLs.
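A request body in this shape could be assembled as in the following sketch, which uses only the Python standard library. The helper name `build_payload` and the example URL are hypothetical; raw image bytes are base64-encoded, while strings are passed through unchanged on the assumption that they are already URLs or base64 text.

```python
import base64
import json


def build_payload(images):
    """Build the JSON request body from a list of images.

    Each item is either `bytes` (raw image file contents, which get
    base64-encoded) or `str` (assumed to be a URL or base64 text already).
    """
    encoded = []
    for item in images:
        if isinstance(item, bytes):
            encoded.append(base64.b64encode(item).decode("utf-8"))
        else:
            encoded.append(item)
    return json.dumps({"input_data": encoded})


# Placeholder bytes stand in for a real image file read with open(path, "rb").
body = build_payload([b"abc", "https://example.com/image2.jpg"])
print(body)
```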

Sample output

[
  [
    {
      "label": "can",
      "score": 0.91
    },
    {
      "label": "carton",
      "score": 0.09
    }
  ],
  [
    {
      "label": "carton",
      "score": 0.9
    },
    {
      "label": "can",
      "score": 0.1
    }
  ]
]
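Each inner list in the output holds the label scores for one input image, in the same order as the inputs. A minimal sketch of picking the highest-scoring label per image is shown below; the helper name `top_predictions` is hypothetical, and the response is assumed to be already parsed from JSON into Python lists and dicts.

```python
def top_predictions(response):
    """Return the highest-scoring {"label", "score"} entry for each image."""
    return [max(candidates, key=lambda c: c["score"]) for candidates in response]


# The sample output above, as parsed Python objects.
response = [
    [{"label": "can", "score": 0.91}, {"label": "carton", "score": 0.09}],
    [{"label": "carton", "score": 0.9}, {"label": "can", "score": 0.1}],
]

for index, best in enumerate(top_predictions(response)):
    print(index, best["label"], best["score"])
# 0 can 0.91
# 1 carton 0.9
```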

Visualization of inference result for a sample image

Model Specifications

License: Apache-2.0

Last Updated: April 2025

Publisher: Meta
