#333
summarized by : Norihito Ishida
TAP: Text-Aware Pre-Training for Text-VQA and Text-Caption

What kind of paper is this?

Proposes TAP (Text-Aware Pre-training), an image-text multimodal pre-training method for Text-VQA and Text-Caption tasks.

Novelty

Feeds "Text word embedding", "Visual object embedding", and "Scene text embedding" into multi-modal Transformer layers, and pre-trains with three tasks: "MLM" (masked language modeling), "Relative position prediction", and "Image-text matching".
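
As a rough illustration of how the three input streams might be combined and how the MLM task prepares its targets, here is a minimal sketch in plain Python. It is not the authors' implementation: the helper names `build_input` and `mask_text_words`, the segment labels, the `[MASK]` string, and the choice to mask only the text-word segment are all assumptions made for illustration. The masking probability is left as a parameter because this summary does not state it.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, for illustration only


def build_input(text_words, object_ids, scene_text_words):
    """Concatenate the three modalities into one sequence with segment labels.

    The segment labels stand in for the three embedding types named above.
    """
    tokens = list(text_words) + list(object_ids) + list(scene_text_words)
    segments = (
        ["text"] * len(text_words)
        + ["object"] * len(object_ids)
        + ["scene_text"] * len(scene_text_words)
    )
    return tokens, segments


def mask_text_words(tokens, segments, mask_prob, rng):
    """MLM-style masking, applied here only to the text-word segment (an assumption).

    Returns the masked sequence and per-position labels; a label is None
    where nothing has to be predicted.
    """
    masked, labels = [], []
    for tok, seg in zip(tokens, segments):
        if seg == "text" and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


if __name__ == "__main__":
    toks, segs = build_input(["what", "brand"], ["obj0", "obj1"], ["coca", "cola"])
    print(mask_text_words(toks, segs, mask_prob=0.5, rng=random.Random(0)))
```

In a real model each token would be mapped to an embedding before entering the Transformer layers, and the other two tasks (relative position prediction and image-text matching) would add their own prediction heads on top of the same sequence.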

Results

Outperforms existing methods (+8.3% accuracy on TextVQA, +8.6% accuracy on ST-VQA, +10.2 CIDEr score on TextCaps).

Other (why was it accepted? etc.)