- …
- …
#333
summarized by : Norihito Ishida
Novelty
Feeds "Text word embedding", "Visual object embedding", and "Scene text embedding" into a Multi-modal Transformer Layer and pre-trains it on three tasks: "MLM" (masked language modeling), "Relative position prediction", and "Image-text matching".
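A minimal sketch of how the three input streams might be joined into one sequence and how an MLM-style target could be built. It is illustrative only and is not the authors' implementation: the function names, the `[MASK]` token, the masking rate, and the choice to mask only text words are all assumptions, and plain strings stand in for the embeddings.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def build_sequence(text_words, visual_objects, scene_texts):
    """Concatenate the three input streams, tagging each item with its modality."""
    return (
        [("text", w) for w in text_words]
        + [("object", o) for o in visual_objects]
        + [("scene_text", s) for s in scene_texts]
    )


def mask_text_words(sequence, mask_prob=0.15, rng=None):
    """Randomly replace text words with MASK (assumed scheme).

    Returns the masked sequence and a parallel list of labels, where
    labels[i] is the original word if position i was masked, else None.
    """
    rng = rng or random.Random()
    masked, labels = [], []
    for modality, item in sequence:
        if modality == "text" and rng.random() < mask_prob:
            masked.append((modality, MASK))
            labels.append(item)
        else:
            masked.append((modality, item))
            labels.append(None)
    return masked, labels


# Placeholder inputs: question words, detected objects, and OCR tokens.
sequence = build_sequence(
    ["what", "does", "the", "sign", "say"],
    ["obj_0", "obj_1"],
    ["STOP"],
)
masked, labels = mask_text_words(sequence, mask_prob=0.5, rng=random.Random(0))
```

In this sketch the model would be trained to recover each non-`None` label from the masked sequence. Relative position prediction and image-text matching would need their own targets, which are not shown here.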
Results
Outperforms existing methods (+8.3% accuracy on TextVQA, +8.6% accuracy on ST-VQA, +10.2 CIDEr score on TextCaps).
Other (why was it accepted?, etc.)
- …
- …