#80
summarized by: Anonymous
Training Object Detectors From Scratch: An Empirical Study in the Era of Vision Transformer

What is this paper about?

Can a Transformer backbone be trained from scratch for object detection? The paper shows that with a model architecture change, more training epochs, and gradient calibration, training from scratch achieves performance comparable to ImageNet pre-training.

Novelty

Identifies that changing the backbone stage layout from T-T-T-T to C-C-T-T (where T and C stand for Transformer and convolution blocks) is important for training a Transformer backbone from scratch.

Results

Training from scratch achieves competitive or better performance on COCO (Swin Transformer backbone with Faster R-CNN, FCOS, etc.).

Other notes (e.g., why it was accepted)

Replacing the early self-attention stages with convolutions introduces an inductive bias and reduces the dependence on large-scale pre-training (the motivation for the C-C-T-T layout); a minimal sketch of this stage layout follows.
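
A minimal PyTorch sketch of the C-C-T-T idea, not the paper's actual architecture: a 4-stage hierarchical backbone whose first two stages use convolution blocks and last two use Transformer blocks. The class names (CCTTBackbone, ConvBlock, TransformerBlock), channel dimensions, and stage depths are illustrative assumptions.

```python
# Illustrative sketch of a C-C-T-T stage layout (not the authors' code).
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Residual convolution block: stand-in for the 'C' stage block."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(dim),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x):
        return x + self.conv(x)


class TransformerBlock(nn.Module):
    """Self-attention over flattened spatial tokens: stand-in for the 'T' stage block."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H*W, C)
        q = self.norm1(tokens)
        tokens = tokens + self.attn(q, q, q, need_weights=False)[0]
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens.transpose(1, 2).reshape(b, c, h, w)


class CCTTBackbone(nn.Module):
    """4-stage backbone: stages 1-2 convolutional, stages 3-4 Transformer (C-C-T-T)."""
    def __init__(self, dims=(64, 128, 256, 512), depths=(2, 2, 2, 2)):
        super().__init__()
        block_types = [ConvBlock, ConvBlock, TransformerBlock, TransformerBlock]
        self.stem = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)
        self.downsamples = nn.ModuleList()
        self.stages = nn.ModuleList()
        in_dim = dims[0]
        for dim, depth, block in zip(dims, depths, block_types):
            self.downsamples.append(
                nn.Identity() if dim == in_dim else nn.Conv2d(in_dim, dim, kernel_size=2, stride=2)
            )
            self.stages.append(nn.Sequential(*[block(dim) for _ in range(depth)]))
            in_dim = dim

    def forward(self, x):
        x = self.stem(x)
        features = []  # multi-scale feature maps for a detector neck (e.g., FPN)
        for down, stage in zip(self.downsamples, self.stages):
            x = stage(down(x))
            features.append(x)
        return features


if __name__ == "__main__":
    model = CCTTBackbone()
    outs = model(torch.randn(1, 3, 224, 224))
    print([o.shape for o in outs])
```

The convolutional early stages supply locality and translation-equivariance priors where feature maps are large, while global self-attention is kept in the later, lower-resolution stages; this is the inductive-bias argument the summary attributes to the C-C-T-T choice.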