microsoft-beit-base-patch16-224-pt22k-ft22k
BEiT (Bidirectional Encoder representation from Image Transformers) is a vision transformer(ViT) pre-trained with Masked Image Modeling(MIM), which is a self-supervised pre-training inspired by BERT from NLP, followed by Intermediate fine-tuning using ImageNet-22k dataset. It is then fine-tuned for Image Classification. Images have two views of representation in BEiT, image patches and visual tokens which serve as input and output during pre-training, respectively. During self-supervised pre-training stage, some percentage of image patches are masked randomly, and then the visual tokens corresponding to the masked patches are predicted.
Through pre-training, the model acquires an internal representation of images, enabling the extraction of features useful for subsequent tasks. After pre-training, a simple linear classifier layer is employed as a task layer on top of pre-trained encoder for image classification, which includes average pooling to aggregate the representations and feed the global to a softmax classifier.
For more details, refer BEiT-paper .
Training Data
The BEiT model is pre-trained on ImageNet-22k, encompassing 14 million images and 21,000 classes and fine-tuned on the same dataset.Training Procedure
In the preprocessing step, images are resized to the same resolution 224x224. Images are scaled with augmentations like random resized cropping, horizontal flipping, color jittering. Then normalized across the RGB channels with mean (0.5, 0.5, 0.5) and standard deviation (0.5, 0.5, 0.5). For more details on self-supervised pre-training (ImageNet-22k) followed by supervised fine-tuning (ImageNet-1k) refer to the section 2 and 3 of the original paper.For BEiT image classification benchmark results, Refer to the table 1 of the original-paper .