-
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection. Facebook AI Research, UC Berkeley. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. Introduction. Motivation: Designing a single, simple, yet effective architecture for diverse visual recognition tasks (image, video, detection). While Vision Transformers (ViT) are powerful, their standard architecture struggles with…
-
MViT: Multiscale Vision Transformer
Multiscale Vision Transformers. Facebook AI Research, UC Berkeley. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. Introduction. Motivation: Convolutional Neural Networks (CNNs) have long benefited from multiscale feature hierarchies (pyramids), where spatial resolution decreases while channel complexity increases through the network. Vision Transformers (ViT) maintain a constant resolution and channel capacity throughout,…
-
[SAM] Segment Anything
Paper: https://arxiv.org/abs/2304.02643 Code: Web: https://segment-anything.com/ Motivation. Objective: Segment Anything (SAM) = Interactive Segmentation + Automatic Segmentation. Global Framework. Loss Function: 1) L_mask: supervise mask prediction with a linear combination of focal loss [65] and dice loss [73], at a 20:1 ratio of focal loss to dice loss. 2) L_IoU: the IoU prediction head is trained…
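The excerpt above describes L_mask as a 20:1 weighted combination of focal and dice loss. A minimal PyTorch-style sketch of that combination is shown below; the helper names (focal_loss, dice_loss, mask_loss) and the default alpha/gamma/eps values are illustrative assumptions, not SAM's released implementation.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Sigmoid focal loss over per-pixel mask logits; targets are binary masks
    # of the same shape (N, H, W).
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, targets, eps=1.0):
    # Soft Dice loss computed per sample on sigmoid probabilities, then averaged.
    p = torch.sigmoid(logits).flatten(1)
    t = targets.flatten(1)
    inter = (p * t).sum(-1)
    union = p.sum(-1) + t.sum(-1)
    return (1.0 - (2.0 * inter + eps) / (union + eps)).mean()

def mask_loss(logits, targets, focal_weight=20.0, dice_weight=1.0):
    # Linear combination in the 20:1 focal-to-dice ratio described in the excerpt.
    return focal_weight * focal_loss(logits, targets) + dice_weight * dice_loss(logits, targets)

# Usage sketch: per-pixel mask logits and binary ground-truth masks.
logits = torch.randn(2, 256, 256)
targets = (torch.rand(2, 256, 256) > 0.5).float()
print(mask_loss(logits, targets))
```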
-
[ViT] An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale