#370
summarized by: Anonymous
MixFormer: Mixing Features Across Windows and Dimensions

What kind of paper is it?

A general backbone design that tackles two issues of local-window self-attention: its limited receptive field and its weak modeling capability across channels.

Novelty

1) Combines local-window self-attention with a depth-wise convolution in parallel (two branches) to enlarge the receptive field; 2) adds bi-directional interaction (channel attention in one direction, spatial attention in the other) between the two branches. A rough sketch of the block follows below.
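As a rough illustration only, here is a minimal PyTorch sketch of what such a mixing block could look like. Everything in it is an assumption for illustration, not the paper's implementation: the class names (MixingBlockSketch, ChannelInteraction, SpatialInteraction), all hyperparameters, the exact tap points of the interactions, and the use of nn.MultiheadAttention as a stand-in for the paper's window attention.

```python
import torch
import torch.nn as nn


class ChannelInteraction(nn.Module):
    """SE-style channel gate: global pool -> bottleneck MLP -> sigmoid."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.gate(x)  # (B, C, 1, 1), broadcast over H, W


class SpatialInteraction(nn.Module):
    """Per-position spatial gate produced by pointwise convolutions."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, 1, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.gate(x)  # (B, 1, H, W), broadcast over channels


class MixingBlockSketch(nn.Module):
    """Hypothetical mixing block: parallel window-attention and
    depth-wise-conv branches with bi-directional interactions."""
    def __init__(self, dim, window_size=7, num_heads=4):
        super().__init__()
        self.window_size = window_size
        # Stand-in for the paper's local-window self-attention.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.dwconv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )
        self.channel_interaction = ChannelInteraction(dim)  # conv -> attn branch
        self.spatial_interaction = SpatialInteraction(dim)  # attn -> conv branch
        self.proj = nn.Conv2d(2 * dim, dim, 1)

    def window_attention(self, x):
        # Partition into non-overlapping windows and attend within each.
        B, C, H, W = x.shape
        ws = self.window_size
        assert H % ws == 0 and W % ws == 0, "H and W must be divisible by window_size"
        x = x.reshape(B, C, H // ws, ws, W // ws, ws)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, ws * ws, C)
        x, _ = self.attn(x, x, x)
        x = x.reshape(B, H // ws, W // ws, ws, ws, C)
        return x.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)

    def forward(self, x):
        conv_out = self.dwconv(x)
        attn_out = self.window_attention(x)
        # Bi-directional interaction: the conv branch gates the attention
        # branch channel-wise; the attention branch gates the conv branch
        # position-wise.
        attn_out = attn_out * self.channel_interaction(conv_out)
        conv_out = conv_out * self.spatial_interaction(attn_out)
        return self.proj(torch.cat([attn_out, conv_out], dim=1))


x = torch.randn(1, 64, 56, 56)  # 56 is divisible by window_size=7
print(MixingBlockSketch(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

The intuition the sketch tries to capture: the convolution branch sees outside the attention window (larger receptive field), the SE-style channel gate compensates for window attention's weak channel modeling, and the spatial gate feeds attention information back into the convolution branch.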

Results

Outperforms alternative backbones by significant margins at lower computational cost on five dense prediction tasks across MS COCO, ADE20K, and LVIS.

Other (e.g., why was it accepted?)

Limitations: the proposed method applies only to window-based vision transformers; an ablation that replaces window attention with global attention gives slightly worse results on ImageNet-1K; and the cross-branch interactions are essentially SE-style attention.