TY - GEN
T1 - MIRTT: Learning Multimodal Interaction Representations from Trilinear Transformers for Visual Question Answering
T2 - Findings of the Association for Computational Linguistics: EMNLP 2021
AU - Wang, Junjie
AU - Ji, Yatai
AU - Sun, Jiaqi
AU - Yang, Yujiu
AU - Sakai, Tetsuya
N1 - Funding Information:
This research was supported by the Key Program of the National Natural Science Foundation of China under Grant No. U1903213. We would like to thank members of The Real Sakai Laboratory, Waseda University, for their suggestions. Junjie Wang is especially grateful to our friend Yuxiang Zhang for his support, advice, and encouragement.
Publisher Copyright:
© 2021 Association for Computational Linguistics.
PY - 2021
Y1 - 2021
N2 - In Visual Question Answering (VQA), existing bilinear methods focus on the interaction between images and questions; as a result, answers are either spliced into the questions or used only as classification labels. On the other hand, trilinear models such as the CTI model of Do et al. (2019) efficiently exploit the inter-modality information among answers, questions, and images, while ignoring intra-modality information. Inspired by these observations, we propose a new trilinear interaction framework, MIRTT (Learning Multimodal Interaction Representations from Trilinear Transformers), which incorporates attention mechanisms to capture both inter-modality and intra-modality relationships. Moreover, we design a two-stage workflow in which a bilinear model reduces the free-form, open-ended VQA problem to a multiple-choice VQA problem. Furthermore, to obtain accurate and generic multimodal representations, we pretrain MIRTT with masked language prediction. Our method achieves state-of-the-art performance on the Visual7W Telling task and the VQA-1.0 Multiple Choice task, and outperforms bilinear baselines on the VQA-2.0, TDIUC, and GQA datasets.
AB - In Visual Question Answering (VQA), existing bilinear methods focus on the interaction between images and questions; as a result, answers are either spliced into the questions or used only as classification labels. On the other hand, trilinear models such as the CTI model of Do et al. (2019) efficiently exploit the inter-modality information among answers, questions, and images, while ignoring intra-modality information. Inspired by these observations, we propose a new trilinear interaction framework, MIRTT (Learning Multimodal Interaction Representations from Trilinear Transformers), which incorporates attention mechanisms to capture both inter-modality and intra-modality relationships. Moreover, we design a two-stage workflow in which a bilinear model reduces the free-form, open-ended VQA problem to a multiple-choice VQA problem. Furthermore, to obtain accurate and generic multimodal representations, we pretrain MIRTT with masked language prediction. Our method achieves state-of-the-art performance on the Visual7W Telling task and the VQA-1.0 Multiple Choice task, and outperforms bilinear baselines on the VQA-2.0, TDIUC, and GQA datasets.
UR - http://www.scopus.com/inward/record.url?scp=85129181144&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85129181144&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85129181144
T3 - Findings of the Association for Computational Linguistics: EMNLP 2021
SP - 2280
EP - 2292
BT - Findings of the Association for Computational Linguistics: EMNLP 2021
A2 - Moens, Marie-Francine
A2 - Huang, Xuanjing
A2 - Specia, Lucia
A2 - Yih, Scott Wen-tau
PB - Association for Computational Linguistics (ACL)
Y2 - 7 November 2021 through 11 November 2021
ER -