- …
- …
#370
summarized by: Anonymous
What is this paper about?
A general backbone design that tackles two weaknesses of local-window self-attention: its limited receptive field and its weak modeling capability.
Novelty
1) Combines local-window self-attention with depth-wise convolution in two parallel branches to enlarge the receptive field; 2) adds bi-directional interactions (channel and spatial attention) between the two branches (see the sketch below).
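A minimal PyTorch sketch of the two-branch block described above, not the authors' implementation: the module names, the reduction ratio, fusion by concatenation, and which branch gates which are all assumptions for illustration.

```python
import torch
import torch.nn as nn


class ChannelInteraction(nn.Module):
    """SE-style channel gate: global context from one branch
    re-weights the channels of the other branch."""
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),               # squeeze: (B, C, 1, 1)
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),                          # excite: per-channel weights
        )

    def forward(self, x):
        return self.gate(x)


class SpatialInteraction(nn.Module):
    """Spatial gate: a per-pixel weight map from one branch
    re-weights the spatial positions of the other branch."""
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, 1, 1),
            nn.Sigmoid(),                          # (B, 1, H, W) weights
        )

    def forward(self, x):
        return self.gate(x)


class ParallelMixBlock(nn.Module):
    """Local-window attention and depth-wise conv run in parallel;
    bi-directional gates exchange information between the branches."""
    def __init__(self, dim, window_size=7, num_heads=4):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dwconv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),  # depth-wise 3x3
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        self.channel_gate = ChannelInteraction(dim)  # conv branch -> attention branch
        self.spatial_gate = SpatialInteraction(dim)  # attention branch -> conv branch
        self.proj = nn.Conv2d(2 * dim, dim, 1)       # fuse the concatenated branches

    def window_attention(self, x):
        # Partition (B, C, H, W) into non-overlapping windows, attend within
        # each window, then stitch the windows back together. H and W are
        # assumed divisible by window_size for brevity.
        B, C, H, W = x.shape
        ws = self.window_size
        x = x.reshape(B, C, H // ws, ws, W // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, C)
        x, _ = self.attn(x, x, x)
        x = x.reshape(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
        return x

    def forward(self, x):
        conv_out = self.dwconv(x)
        attn_out = self.window_attention(x)
        # Bi-directional interaction: each branch gates the other.
        attn_out = attn_out * self.channel_gate(conv_out)
        conv_out = conv_out * self.spatial_gate(attn_out)
        return self.proj(torch.cat([attn_out, conv_out], dim=1))


# Quick smoke test with hypothetical dimensions.
block = ParallelMixBlock(dim=96, window_size=7)
y = block(torch.randn(1, 96, 56, 56))  # -> (1, 96, 56, 56)
```

Per the summary, the interactions are SE-style attention; here the convolution branch supplies a channel gate and the attention branch supplies a spatial gate, so each branch receives the kind of context it lacks, but that assignment is an assumption of this sketch.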
Results
Outperforms alternative backbones by significant margins at lower computational cost on five dense prediction tasks across MS COCO, ADE20K, and LVIS.
Other notes (e.g., why it was accepted)
The proposed design is limited to window-based vision transformers; experiments that use global attention instead show slightly worse results on ImageNet-1K; the interactions between branches are SE-style attention.
- …
- …