microsoft-swinv2-base-patch4-window12-192-22k
Version: 21
Microsoft · Last updated April 2025
The Swin Transformer V2 model is a Vision Transformer pre-trained on ImageNet-21k at a resolution of 192x192. It was introduced in the paper "Swin Transformer V2: Scaling Up Capacity and Resolution" by Liu et al. The model targets three issues in training and applying large vision models: training instability, the resolution gap between pre-training and fine-tuning, and the need for large amounts of labeled data. It builds hierarchical feature maps by merging image patches in deeper layers and computes self-attention within local windows, giving linear computational complexity with respect to input image size, a significant improvement over vision transformers whose complexity is quadratic. Swin Transformer V2 introduces three improvements:
  • a residual-post-norm method with cosine attention to improve training stability
  • a log-spaced continuous position bias method, aiding the transfer of pre-trained models from low-resolution images to tasks with high-resolution inputs
  • the application of a self-supervised pre-training method called SimMIM, designed to reduce the need for extensive labeled images
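To make the first improvement concrete, here is a minimal NumPy sketch of scaled cosine attention as described in the paper: queries and keys are L2-normalized so the attention logit is a cosine similarity divided by a temperature, plus a position bias. The function name, the fixed `tau`, and the toy shapes are illustrative choices, not the model's actual implementation.

```python
import numpy as np

def cosine_attention(q, k, v, tau=0.1, bias=None):
    """Scaled cosine attention sketch: cos(q, k) / tau + bias, then softmax.

    q, k, v: arrays of shape (tokens, dim). tau is the (normally learnable)
    temperature; bias stands in for the relative position bias term.
    """
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)  # unit-norm queries
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)  # unit-norm keys
    logits = qn @ kn.T / tau
    if bias is not None:
        logits = logits + bias
    # Numerically stable softmax over the key axis
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

Because the logits are cosine similarities bounded in [-1, 1] before scaling, attention values cannot blow up with large network depth, which is the training-stability argument made in the paper.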

License

apache-2.0

Inference Samples

| Inference type | Python sample (Notebook) | CLI with YAML |
| --- | --- | --- |
| Real time | image-classification-online-endpoint.ipynb | image-classification-online-endpoint.sh |
| Batch | image-classification-batch-endpoint.ipynb | image-classification-batch-endpoint.sh |

Finetuning Samples

| Task | Use case | Dataset | Python sample (Notebook) | CLI with YAML |
| --- | --- | --- | --- | --- |
| Image Multi-class classification | Image Multi-class classification | fridgeObjects | fridgeobjects-multiclass-classification.ipynb | fridgeobjects-multiclass-classification.sh |
| Image Multi-label classification | Image Multi-label classification | multilabel fridgeObjects | fridgeobjects-multilabel-classification.ipynb | fridgeobjects-multilabel-classification.sh |

Evaluation Samples

| Task | Use case | Dataset | Python sample (Notebook) |
| --- | --- | --- | --- |
| Image Multi-class classification | Image Multi-class classification | fridgeObjects | image-multiclass-classification.ipynb |
| Image Multi-label classification | Image Multi-label classification | multilabel fridgeObjects | image-multilabel-classification.ipynb |

Sample input and output

Sample input

{
  "input_data": ["image1", "image2"]
}
Note: the "image1" and "image2" strings should be base64-encoded images or publicly accessible URLs.
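A minimal sketch of building that request body in Python, assuming each input is either raw image bytes (to be base64-encoded) or an already-public URL; the helper name `build_payload` is illustrative, not part of the endpoint API.

```python
import base64
import json

def build_payload(images):
    """Build the scoring request body shown above.

    `images` is a list whose items are either raw image bytes
    (base64-encoded here) or publicly accessible URL strings.
    """
    encoded = []
    for img in images:
        if isinstance(img, bytes):
            encoded.append(base64.b64encode(img).decode("utf-8"))
        else:
            encoded.append(img)  # assumed to be a public URL
    return json.dumps({"input_data": encoded})
```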

Sample output

[
  [
    {
      "label" : "can",
      "score" : 0.91
    },
    {
      "label" : "carton",
      "score" : 0.09
    }
  ],
  [
    {
      "label" : "carton",
      "score" : 0.9
    },
    {
      "label" : "can",
      "score" : 0.1
    }
  ]
]
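The output is a list of per-image prediction lists, each entry holding a label and a score. A small sketch of extracting the top prediction per image (the function name is illustrative):

```python
def top_label(predictions):
    """Return the highest-scoring label for each image.

    `predictions` follows the sample output format: a list (one entry
    per image) of lists of {"label": ..., "score": ...} dicts.
    """
    return [max(preds, key=lambda p: p["score"])["label"]
            for preds in predictions]
```

Applied to the sample output above, this yields `["can", "carton"]`.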

Visualization of inference result for a sample image

Note: The labels produced by the SwinV2 model are class indices prefixed with "LABEL_" (from "LABEL_0" to "LABEL_21841"), e.g. "LABEL_3500" for "Giraffe". For visualization purposes, we explicitly mapped these labels to ImageNet-21k class names, as shown in the sample image above.
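That mapping can be sketched as a small lookup: parse the index out of the raw label and look it up in an index-to-name dictionary. The helper name and the toy dictionary below are assumptions for illustration; the real mapping comes from the ImageNet-21k class list.

```python
def pretty_label(raw_label, class_names):
    """Map a raw model label like "LABEL_3500" to a human-readable name.

    `class_names` maps an integer class index to an ImageNet-21k name;
    unknown indices fall back to the raw label.
    """
    idx = int(raw_label.split("_", 1)[1])
    return class_names.get(idx, raw_label)
```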
Model Specifications
License: Apache-2.0
Last Updated: April 2025
Provider: Microsoft