#195
summarized by: Anonymous
Shunted Self-Attention via Multi-Scale Token Aggregation

What is this paper about?

It proposes Shunted Self-Attention (SSA), which enables ViTs to model attention at hybrid scales within a single attention layer.

Novelty

Half of the attention heads learn keys and values from features downsampled with ratio r1, and the other half with ratio r2; the resulting multi-scale features are then aggregated in the FFN by a depth-wise convolution.
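
A minimal PyTorch sketch of this mechanism is below. The module names (ShuntedAttention, DetailSpecificFFN), the kernel size 3 / padding 1 in the downsampling convolutions, and the omission of normalization layers are assumptions of this sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ShuntedAttention(nn.Module):
    """Half of the heads attend over K/V downsampled by r1, the other half by r2."""

    def __init__(self, dim, num_heads=8, r1=8, r2=4):
        super().__init__()
        assert num_heads % 2 == 0 and dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.q = nn.Linear(dim, dim)
        # Two downsampling branches with different strides
        # (kernel size 3 and padding 1 are assumptions of this sketch).
        self.sr1 = nn.Conv2d(dim, dim, kernel_size=3, stride=r1, padding=1)
        self.sr2 = nn.Conv2d(dim, dim, kernel_size=3, stride=r2, padding=1)
        self.kv1 = nn.Linear(dim, dim)  # K and V for the coarse (r1) heads
        self.kv2 = nn.Linear(dim, dim)  # K and V for the finer (r2) heads
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N == H * W
        h = self.num_heads // 2
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        x_2d = x.transpose(1, 2).reshape(B, C, H, W)
        t1 = self.sr1(x_2d).flatten(2).transpose(1, 2)  # fewer, coarser tokens
        t2 = self.sr2(x_2d).flatten(2).transpose(1, 2)  # more, finer tokens
        kv1 = self.kv1(t1).reshape(B, -1, 2, h, self.head_dim).permute(2, 0, 3, 1, 4)
        kv2 = self.kv2(t2).reshape(B, -1, 2, h, self.head_dim).permute(2, 0, 3, 1, 4)
        k1, v1 = kv1[0], kv1[1]
        k2, v2 = kv2[0], kv2[1]

        # First half of the heads attends to the r1-downsampled tokens,
        # second half to the r2-downsampled tokens.
        out1 = ((q[:, :h] @ k1.transpose(-2, -1)) * self.scale).softmax(-1) @ v1
        out2 = ((q[:, h:] @ k2.transpose(-2, -1)) * self.scale).softmax(-1) @ v2
        out = torch.cat([out1, out2], dim=1).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


class DetailSpecificFFN(nn.Module):
    """FFN that aggregates the learned features with a depth-wise convolution."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1, groups=hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        x = self.fc1(x)
        x_2d = x.transpose(1, 2).reshape(B, -1, H, W)
        x = x + self.dwconv(x_2d).flatten(2).transpose(1, 2)  # inject local detail
        return self.fc2(self.act(x))
```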

Results

Achieves state-of-the-art results on ImageNet-1K and COCO compared with other backbones.

Other notes (e.g., why was it accepted?)

The downsampling is implemented by convolutions with the same kernel size but different strides, as sketched below.
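
A tiny sketch of that choice: two convolutions sharing a kernel size but using different strides yield the two K/V token resolutions (kernel size 3, padding 1, and the input size are assumed values for illustration).

```python
import torch
import torch.nn as nn

dim, r1, r2 = 64, 8, 4
# Same kernel size, different strides -> two token resolutions for K/V.
sr1 = nn.Conv2d(dim, dim, kernel_size=3, stride=r1, padding=1)  # coarse branch
sr2 = nn.Conv2d(dim, dim, kernel_size=3, stride=r2, padding=1)  # fine branch

x = torch.randn(1, dim, 56, 56)
print(sr1(x).shape)  # torch.Size([1, 64, 7, 7])
print(sr2(x).shape)  # torch.Size([1, 64, 14, 14])
```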