#242
summarized by: Anonymous
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

What kind of paper is this?

Proposes MViTv2, a general-purpose backbone for classification and detection tasks, obtained by improving the Multiscale Vision Transformer (MViT-v1).

Novelty

Replaces absolute position embeddings with decomposed relative position embeddings, and adds a residual pooling connection to compensate for the effect of pooling strides in the attention computation.
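
The two changes above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation and makes several simplifying assumptions: a single attention head, tokens laid out on a 2D H x W grid in row-major order, queries and keys at the same resolution (no pooling stride difference), and relative embeddings stored in plain dicts keyed by offset. The names `attention_with_rel_pos`, `rel_h`, `rel_w`, and `grid_w` are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_with_rel_pos(q, k, v, rel_h, rel_w, grid_w):
    """Single-head attention with a decomposed relative position term
    and a residual connection from the query (illustrative sketch).

    q, k, v : lists of N vectors of dimension d (row-major H x W grid).
    rel_h   : dict mapping a height offset (h_i - h_j) to a d-dim vector.
    rel_w   : dict mapping a width offset (w_i - w_j) to a d-dim vector.
    grid_w  : grid width W, used to recover (h, w) from a flat index.
    """
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        hi, wi = divmod(i, grid_w)
        scores = []
        for j, kj in enumerate(k):
            hj, wj = divmod(j, grid_w)
            # Decomposed embedding: one vector per axis, summed,
            # instead of one vector per joint (dh, dw) offset.
            r = [a + b for a, b in zip(rel_h[hi - hj], rel_w[wi - wj])]
            # Content term q.k plus position term q.r, then scaling.
            scores.append((dot(qi, kj) + dot(qi, r)) / math.sqrt(d))
        weights = softmax(scores)
        attended = [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d)]
        # Residual connection: add the query back to the attention output.
        # (In the paper the query is the pooled query; no pooling is done here.)
        out.append([a + b for a, b in zip(attended, qi)])
    return out
```

With this decomposition, an H x W grid needs (2H - 1) + (2W - 1) offset vectors rather than (2H - 1)(2W - 1) for a joint table, which is the motivation for splitting the embedding per axis.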

Results

Outperforms other vision transformers and achieves state-of-the-art results on classification, detection, and video recognition tasks.

Other (e.g., why was it accepted?)