microsoft-swinv2-base-patch4-window12-192-22k
Version: 21
Swin Transformer V2 is a Vision Transformer pre-trained on ImageNet-21k at a resolution of 192x192, introduced in the paper "Swin Transformer V2: Scaling Up Capacity and Resolution" by Liu et al. The model targets three problems in training and applying large vision models: training instability, the resolution gap between pre-training and fine-tuning, and the need for large amounts of labeled data. It builds hierarchical feature maps by merging image patches in deeper layers and computes self-attention within local windows, giving linear computational complexity with respect to input image size, a significant improvement over standard Vision Transformers, whose global attention has quadratic complexity.
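The linear-versus-quadratic contrast can be illustrated by counting query-key pairs per attention layer (a deliberate simplification that ignores the channel dimension and later stages; the token counts follow from this checkpoint's 192x192 input, patch size 4, and window size 12):

```python
# Sketch: comparing the number of query-key pairs scored by global vs.
# windowed self-attention. Numbers correspond to stage 1 of this checkpoint:
# 192x192 input / patch size 4 -> a 48x48 grid of tokens, window size 12.
def attention_pairs_global(h, w):
    # Global self-attention: every token attends to every other token,
    # so cost grows quadratically in the number of tokens n = h * w.
    n = h * w
    return n * n

def attention_pairs_windowed(h, w, m):
    # Windowed self-attention: each token attends only within its own
    # m x m window, so cost grows linearly in n for a fixed window size.
    n = h * w
    return n * (m * m)

h = w = 192 // 4   # 48x48 = 2304 patch tokens
m = 12             # window size

print(attention_pairs_global(h, w))       # 5308416 pairs
print(attention_pairs_windowed(h, w, m))  # 331776 pairs (16x fewer)
```

Doubling the image side doubles the windowed cost but quadruples the global cost, which is why window attention scales to high-resolution inputs.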
Swin Transformer V2 introduces three improvements:
- a residual post-norm method combined with cosine attention, to improve training stability
- a log-spaced continuous position bias method, to effectively transfer models pre-trained on low-resolution images to downstream tasks with high-resolution inputs
- a self-supervised pre-training method, SimMIM, to reduce the need for large amounts of labeled images
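A minimal pure-Python sketch of the cosine-attention idea behind the first bullet (the temperature value is illustrative; in the paper it is a learned per-head scalar, and the log-spaced position-bias term is omitted here for brevity):

```python
# Sketch of scaled cosine attention: logits are cosine(q, k) / tau instead of
# dot products, keeping them bounded and stabilizing training at large scale.
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def attention_weights(queries, keys, tau=0.1):
    """Softmax over cosine similarities scaled by a temperature tau."""
    rows = []
    for q in queries:
        logits = [cosine(q, k) / tau for k in keys]  # bounded by +/- 1/tau
        m = max(logits)                              # stabilize the softmax
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows

weights = attention_weights([[1.0, 0.0]], [[1.0, 0.0], [1.0, 1.0]])
# The query aligns exactly with the first key, so its weight dominates.
```

Because cosine similarity is bounded in [-1, 1], no single query-key pair can blow up the attention distribution, unlike raw dot products whose magnitude grows with activation scale.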
License
apache-2.0
Inference Samples
| Inference type | Python sample (Notebook) | CLI with YAML |
|---|---|---|
| Real time | image-classification-online-endpoint.ipynb | image-classification-online-endpoint.sh |
| Batch | image-classification-batch-endpoint.ipynb | image-classification-batch-endpoint.sh |
Finetuning Samples
| Task | Use case | Dataset | Python sample (Notebook) | CLI with YAML |
|---|---|---|---|---|
| Image Multi-class classification | Image Multi-class classification | fridgeObjects | fridgeobjects-multiclass-classification.ipynb | fridgeobjects-multiclass-classification.sh |
| Image Multi-label classification | Image Multi-label classification | multilabel fridgeObjects | fridgeobjects-multilabel-classification.ipynb | fridgeobjects-multilabel-classification.sh |
Evaluation Samples
| Task | Use case | Dataset | Python sample (Notebook) |
|---|---|---|---|
| Image Multi-class classification | Image Multi-class classification | fridgeObjects | image-multiclass-classification.ipynb |
| Image Multi-label classification | Image Multi-label classification | multilabel fridgeObjects | image-multilabel-classification.ipynb |
Sample input and output
Sample input
{
  "input_data": ["image1", "image2"]
}
Note: The "image1" and "image2" strings should be in base64 format or publicly accessible URLs.
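A small sketch of building this request body in Python (the helper names and file paths are illustrative, not part of the model's API):

```python
# Sketch: preparing the "input_data" payload. Each entry must be either a
# base64-encoded image or a publicly accessible image URL.
import base64
import json

def encode_image(path: str) -> str:
    """Read a local image file and return its base64-encoded string."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def build_payload(*images: str) -> str:
    """Wrap base64 strings and/or public URLs as the JSON request body."""
    return json.dumps({"input_data": list(images)})

# Hypothetical usage (file and URL are placeholders):
# body = build_payload(encode_image("fridge.jpg"),
#                      "https://example.com/fridge2.jpg")
```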
Sample output
[
  [
    {
      "label": "can",
      "score": 0.91
    },
    {
      "label": "carton",
      "score": 0.09
    }
  ],
  [
    {
      "label": "carton",
      "score": 0.9
    },
    {
      "label": "can",
      "score": 0.1
    }
  ]
]
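A short sketch of consuming a response shaped like the sample above and picking the top prediction per image (the scores are copied from the sample, not real endpoint output):

```python
# Sketch: the response is a list per input image, each a list of
# {"label", "score"} dicts; pick the highest-scoring entry per image.
predictions = [
    [{"label": "can", "score": 0.91}, {"label": "carton", "score": 0.09}],
    [{"label": "carton", "score": 0.9}, {"label": "can", "score": 0.1}],
]

def top_label(scores):
    """Return (label, score) of the highest-scoring prediction."""
    best = max(scores, key=lambda s: s["score"])
    return best["label"], best["score"]

for i, image_scores in enumerate(predictions):
    label, score = top_label(image_scores)
    print(f"image{i + 1}: {label} ({score:.2f})")
# image1: can (0.91)
# image2: carton (0.90)
```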
Visualization of inference result for a sample image
Note: The labels returned by the SwinV2 model are class indices prefixed with "LABEL_" (from "LABEL_0" to "LABEL_21841"), e.g. "LABEL_3500" for "Giraffe". For visualization purposes, we explicitly mapped these labels to ImageNet-21k class names, which are shown above in the sample image.
Model Specifications
License: Apache-2.0
Last Updated: April 2025
Provider: Microsoft
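As the note above explains, raw predictions come back as "LABEL_&lt;index&gt;" strings. A minimal sketch of mapping them to readable names (the toy `class_names` list is a placeholder for the real ImageNet-21k name list, whose ordering must match the model's label indices):

```python
# Sketch: converting a "LABEL_<index>" string to a class name by indexing
# into a list of class names (toy list below; not the real ImageNet-21k list).
def to_class_name(label: str, class_names) -> str:
    index = int(label.removeprefix("LABEL_"))
    return class_names[index]

class_names = ["tench", "goldfish", "great white shark"]  # toy stand-in
print(to_class_name("LABEL_2", class_names))  # great white shark
```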