microsoft-swinv2-base-patch4-window12-192-22k

Microsoft
Version: 21
The Swin Transformer V2 model is a Vision Transformer pre-trained on ImageNet-21k at a resolution of 192x192. It was introduced in the paper "Swin Transformer V2: Scaling Up Capacity and Resolution" by Liu et al. The model addresses three problems in training and applying large vision models: training instability, the resolution gap between pre-training and fine-tuning, and the need for large amounts of labeled data. It builds hierarchical feature maps by merging image patches and computes self-attention only within local windows, so its computational cost grows linearly with input image size. This is a significant improvement over Vision Transformers that use global self-attention, whose cost grows quadratically. Swin Transformer V2 introduces three improvements:
  • a residual-post-norm method with cosine attention to improve training stability
  • a log-spaced continuous position bias method, aiding the transfer of pre-trained models from low-resolution images to tasks with high-resolution inputs
  • the application of a self-supervised pre-training method called SimMIM, designed to reduce the need for extensive labeled images
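The linear-versus-quadratic cost claim can be made concrete with a small counting sketch. It assumes the model name encodes a 192x192 input, 4x4 patches, and 12x12 attention windows, and it counts query-key pairs only for the first stage, before any patch merging. The function name `attention_pair_counts` is hypothetical and not part of any library.

```python
def attention_pair_counts(image_size: int, patch_size: int, window_size: int) -> dict:
    """Count query-key pairs for global vs. windowed self-attention.

    Illustrative only: covers the first stage of the hierarchy and ignores
    heads, channels, and shifted windows.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    side = image_size // patch_size          # tokens along one image side
    if side % window_size != 0:
        raise ValueError("token grid must be divisible by window_size")

    tokens = side * side                     # total tokens in the grid
    windows = (side // window_size) ** 2     # number of non-overlapping windows
    tokens_per_window = window_size ** 2

    return {
        "tokens": tokens,
        # Global attention: every token attends to every other token.
        "global_pairs": tokens ** 2,
        # Windowed attention: tokens attend only within their own window.
        "windowed_pairs": windows * tokens_per_window ** 2,
    }


base = attention_pair_counts(image_size=192, patch_size=4, window_size=12)
doubled = attention_pair_counts(image_size=384, patch_size=4, window_size=12)

print(base)
# Doubling the side length quadruples the pixel count.
print(doubled["windowed_pairs"] / base["windowed_pairs"])  # grows with pixel count
print(doubled["global_pairs"] / base["global_pairs"])      # grows with its square
```

With a fixed window size, the windowed count scales with the number of tokens, while the global count scales with its square, which is the gap the description refers to.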
License: apache-2.0

Quick facts

Model provider: Microsoft
Type: Image classification
Lifecycle: Generally available (GA)