summarized by: Anonymous
Training Object Detectors From Scratch: An Empirical Study in the Era of Vision Transformer


Can we train a Transformer backbone from scratch? With a model architecture change, more training epochs, and gradient calibration, training from scratch can achieve performance similar to ImageNet pre-training


identified that changing the model architecture from T-T-T-T to C-C-T-T (T = transformer block, C = convolution block) is important for training a Transformer backbone from scratch


training from scratch achieves competitive or better performance (evaluated on COCO, with a Swin Transformer backbone plus detectors such as Faster R-CNN and FCOS)


replacing early self-attention layers with convolutions introduces inductive bias and mitigates the dependence on large-scale pre-training (the motivation for C-C-T-T)
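The C-C-T-T idea can be illustrated with a minimal PyTorch sketch: the first two (high-resolution) stages use convolution blocks, and the last two use self-attention blocks. This is an assumption-laden toy model, not the paper's actual architecture; all module names, channel widths, and block designs here are hypothetical.

```python
# Hypothetical sketch of a C-C-T-T hybrid backbone (not the paper's code):
# conv blocks in the early stages supply inductive bias, transformer blocks
# handle the later, lower-resolution stages.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Convolutional stage: strided conv + norm + activation (halves resolution)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)
        self.norm = nn.BatchNorm2d(out_ch)
        self.act = nn.GELU()

    def forward(self, x):
        return self.act(self.norm(self.conv(x)))

class TransformerBlock(nn.Module):
    """Transformer stage: downsample, then self-attention + MLP over tokens."""
    def __init__(self, in_ch, out_ch, heads=4):
        super().__init__()
        self.down = nn.Conv2d(in_ch, out_ch, 2, stride=2)  # patch-merging-style downsample
        self.norm1 = nn.LayerNorm(out_ch)
        self.attn = nn.MultiheadAttention(out_ch, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(out_ch)
        self.mlp = nn.Sequential(
            nn.Linear(out_ch, out_ch * 4), nn.GELU(), nn.Linear(out_ch * 4, out_ch)
        )

    def forward(self, x):
        x = self.down(x)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        q = self.norm1(t)
        t = t + self.attn(q, q, q)[0]     # pre-norm self-attention with residual
        t = t + self.mlp(self.norm2(t))   # MLP with residual
        return t.transpose(1, 2).reshape(b, c, h, w)

class CCTTBackbone(nn.Module):
    """C-C-T-T layout: two conv stages followed by two transformer stages."""
    def __init__(self, dims=(32, 64, 128, 256)):
        super().__init__()
        self.stages = nn.Sequential(
            ConvBlock(3, dims[0]),
            ConvBlock(dims[0], dims[1]),
            TransformerBlock(dims[1], dims[2]),
            TransformerBlock(dims[2], dims[3]),
        )

    def forward(self, x):
        return self.stages(x)

backbone = CCTTBackbone()
out = backbone(torch.randn(1, 3, 64, 64))
print(tuple(out.shape))  # each of the four stages halves resolution: 64 -> 4
```

Swapping the two transformer stages in for the conv stages (T-T-T-T) would recover a pure-attention layout; the summary's claim is that the conv-first variant is what makes from-scratch training work.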