#80
summarized by: Anonymous
What is this paper about?
Can a Transformer backbone be trained from scratch? With a model architecture change, more training epochs, and gradient calibration, training from scratch can achieve performance comparable to ImageNet pre-training.
Novelty
Identifies that changing the model architecture from T-T-T-T to C-C-T-T (T and C stand for transformer and convolution blocks) is important for training a Transformer backbone from scratch.
Results
Training from scratch achieves competitive or better performance (COCO, Swin Transformer as the backbone combined with Faster R-CNN, FCOS, etc.).
Others (why it was accepted, etc.)
Replacing the early self-attention layers with CNN blocks introduces an inductive bias and mitigates the dependence on large-scale pre-training (the motivation for C-C-T-T); a rough sketch of such a hybrid backbone follows below.
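To make the C-C-T-T idea concrete, here is a minimal illustrative sketch of a hybrid backbone in PyTorch: the first two stages are convolutional (providing local inductive bias), the last two use self-attention. The module names (`ConvStage`, `TransformerStage`, `CCTTBackbone`), stage widths, depths, and head counts are assumptions for illustration, not the paper's actual configuration.

```python
# Illustrative sketch only: a hybrid backbone in the C-C-T-T spirit.
# All hyperparameters below are assumed, not taken from the paper.
import torch
import torch.nn as nn


class ConvStage(nn.Module):
    """Convolutional stage: downsample by 2, then a few conv blocks."""
    def __init__(self, in_ch, out_ch, depth=2):
        super().__init__()
        layers = [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                  nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
        for _ in range(depth - 1):
            layers += [nn.Conv2d(out_ch, out_ch, 3, padding=1),
                       nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        return self.body(x)


class TransformerStage(nn.Module):
    """Transformer stage: downsample by 2, flatten to tokens, run encoder layers."""
    def __init__(self, in_ch, out_ch, depth=2, heads=4):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)
        layer = nn.TransformerEncoderLayer(d_model=out_ch, nhead=heads,
                                           dim_feedforward=out_ch * 4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):
        x = self.down(x)                       # (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        tokens = self.encoder(tokens)
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class CCTTBackbone(nn.Module):
    """C-C-T-T: convolution stages first, self-attention stages later."""
    def __init__(self):
        super().__init__()
        self.stages = nn.Sequential(
            ConvStage(3, 64),             # stage 1: C
            ConvStage(64, 128),           # stage 2: C
            TransformerStage(128, 256),   # stage 3: T
            TransformerStage(256, 512),   # stage 4: T
        )

    def forward(self, x):
        return self.stages(x)


if __name__ == "__main__":
    feats = CCTTBackbone()(torch.randn(1, 3, 224, 224))
    print(feats.shape)  # torch.Size([1, 512, 14, 14])
```

In a detection setup such as the one described above (Faster R-CNN, FCOS), a backbone like this would feed its stage outputs into the detector's neck and heads; that wiring is omitted here.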