MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

#242

summarized by : Anonymous

Yanghao Li; Chao-Yuan Wu; Haoqi Fan; Karttikeya Mangalam; Bo Xiong; Jitendra Malik; Christoph Feichtenhofer

What is this paper about?

Proposes MViTv2, a general backbone for classification and detection tasks, obtained by improving the Multiscale Vision Transformer (MViT-v1).

Novelty

Replaces the absolute position embedding with a decomposed relative position embedding, and adds a residual pooling connection to compensate for the effect of pooling strides in the attention computation.
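The two changes can be illustrated with a minimal NumPy sketch of a single attention head on a 2D feature map. This is not the authors' implementation: the function names (`avg_pool_2d`, `pooled_attention`), the use of average pooling, the omission of the linear projections and of query pooling, and the way query and pooled-key coordinates are mapped into the relative-position tables are all assumptions made for illustration. The sketch shows a relative position term split into separate height and width tables, and the query added back to the attention output as a residual.

```python
import numpy as np


def avg_pool_2d(x, stride):
    """Non-overlapping average pooling of an (H, W, D) map.

    Assumes H and W are divisible by stride (illustrative choice).
    """
    H, W, D = x.shape
    return x.reshape(H // stride, stride, W // stride, stride, D).mean(axis=(1, 3))


def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def pooled_attention(q_map, k_map, v_map, rel_h, rel_w, kv_stride=2):
    """Single-head pooling attention sketch.

    q_map, k_map, v_map: (H, W, D) arrays (projections omitted).
    rel_h: (2*H - 1, D) table of relative embeddings along height.
    rel_w: (2*W - 1, D) table of relative embeddings along width.
    """
    H, W, D = q_map.shape

    # Pool keys and values; queries are left unpooled here for brevity.
    k_p = avg_pool_2d(k_map, kv_stride)
    v_p = avg_pool_2d(v_map, kv_stride)
    Hk, Wk, _ = k_p.shape

    q = q_map.reshape(H * W, D)
    k = k_p.reshape(Hk * Wk, D)
    v = v_p.reshape(Hk * Wk, D)

    logits = q @ k.T  # (H*W, Hk*Wk)

    # Decomposed relative positions: one table lookup per axis instead of
    # one embedding per (dy, dx) pair. Pooled key coordinates are scaled
    # back to the query grid (an assumption of this sketch).
    idx_h = np.arange(H)[:, None] - kv_stride * np.arange(Hk)[None, :] + (H - 1)
    idx_w = np.arange(W)[:, None] - kv_stride * np.arange(Wk)[None, :] + (W - 1)
    bias_h = np.einsum("hwd,hkd->hwk", q_map, rel_h[idx_h])  # (H, W, Hk)
    bias_w = np.einsum("hwd,wkd->hwk", q_map, rel_w[idx_w])  # (H, W, Wk)
    bias = bias_h[:, :, :, None] + bias_w[:, :, None, :]     # (H, W, Hk, Wk)
    logits = logits + bias.reshape(H * W, Hk * Wk)

    attn = softmax(logits / np.sqrt(D), axis=-1)

    # Residual pooling connection: add the query back to the output.
    return attn @ v + q


rng = np.random.default_rng(0)
H, W, D = 4, 6, 8
q_map, k_map, v_map = (rng.normal(size=(H, W, D)) for _ in range(3))
rel_h = rng.normal(size=(2 * H - 1, D))
rel_w = rng.normal(size=(2 * W - 1, D))
out = pooled_attention(q_map, k_map, v_map, rel_h, rel_w, kv_stride=2)
print(out.shape)
```

With the decomposition, the number of learned relative embeddings grows with H + W rather than with H × W, which is the motivation for splitting the term per axis.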

Results

Outperforms other vision transformers and achieves state-of-the-art results on classification, detection, and video recognition tasks.

Other (e.g., why was it accepted?)

The images used on this page are taken from the paper.