- …
- …
#242
summarized by: Anonymous
What kind of paper is it?
A general-purpose backbone for classification and detection tasks, obtained by improving the Multiscale Vision Transformer (MViT-v1).
Novelty
Replaces absolute position embeddings with decomposed relative position embeddings, and adds a residual pooling connection to compensate for the effect of pooling strides in the attention computation.
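The two changes can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `decomposed_rel_pos_bias` and `attention_with_rel_pos_and_residual` are hypothetical, it handles a single head on a 2D grid only, and it assumes queries and keys live on the same H x W grid (the real model pools keys and values with different strides). The assumed formulation is that the relative embedding is split into a per-axis height term and width term that are summed, and that the query is added back to the attention output.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def decomposed_rel_pos_bias(q, rel_h, rel_w, H, W):
    """Bias[i, j] = q_i . (rel_h[dh(i, j)] + rel_w[dw(i, j)]).

    q:     (H*W, d) queries on an H x W grid (row-major order)
    rel_h: (2H-1, d) embeddings indexed by height offset (assumed layout)
    rel_w: (2W-1, d) embeddings indexed by width offset (assumed layout)
    """
    d = q.shape[-1]
    q_grid = q.reshape(H, W, d)
    # Offsets shifted into [0, 2H-2] and [0, 2W-2] so they can index the tables.
    idx_h = np.arange(H)[:, None] - np.arange(H)[None, :] + (H - 1)
    idx_w = np.arange(W)[:, None] - np.arange(W)[None, :] + (W - 1)
    # Per-axis terms: (H, W, H) for height, (H, W, W) for width.
    bias_h = np.einsum("hwd,hkd->hwk", q_grid, rel_h[idx_h])
    bias_w = np.einsum("hwd,wkd->hwk", q_grid, rel_w[idx_w])
    # Sum the two axes; a joint table would need (2H-1)*(2W-1) embeddings
    # instead of (2H-1) + (2W-1).
    bias = bias_h[:, :, :, None] + bias_w[:, :, None, :]
    return bias.reshape(H * W, H * W)


def attention_with_rel_pos_and_residual(q, k, v, rel_h, rel_w, H, W):
    # q is assumed to be the (already pooled) query tensor.
    d = q.shape[-1]
    logits = (q @ k.T + decomposed_rel_pos_bias(q, rel_h, rel_w, H, W)) / np.sqrt(d)
    # Residual pooling connection: add the query back to the attention output.
    return softmax(logits) @ v + q
```

The residual term gives the pooled query a direct path to the output, which is the part meant to offset the information lost to pooling strides.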
Results
Outperforms other vision transformers and achieves SOTA results on classification, detection, and video recognition tasks.
Other (why was it accepted?, etc.)
- …
- …