4 March 2022 Bi-direction co-attention network on visual question answering for blind people
Tung Le, Thong Bui, Huy Tien Nguyen, Minh Le Nguyen
Proceedings Volume 12084, Fourteenth International Conference on Machine Vision (ICMV 2021); 1208416 (2022) https://doi.org/10.1117/12.2623596
Event: Fourteenth International Conference on Machine Vision (ICMV 2021), 2021, Rome, Italy
Abstract
The visually impaired community, especially blind people, needs support from advanced technologies to understand image content and answer questions about it. In the multi-modal area, Visual Question Answering (VQA) is a notable cutting-edge task that requires combining images and text via a co-attention mechanism. Inspired by the Deep Co-attention Layer, we propose a Bi-direction Co-Attention VT-Transformer network that learns visual and textual features jointly. In our system, the relationships and interactions between objects of the two modalities are captured and combined in a meaningful space. In addition, using the Transformer architecture consistently in both the feature extractor and the multi-modal attention function makes it possible to reduce the number of attention layers as well as the computation cost. In experiments and ablation studies, our model achieves promising performance compared with existing approaches and with a uni-direction mechanism on the VizWiz-VQA 2020 dataset for blind people.
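To make the idea of attending in both directions concrete, the following is a minimal sketch of bi-directional cross-attention between a set of visual features and a set of textual features. It is an illustration under stated assumptions, not the authors' VT-Transformer: the function names, the feature sizes, the random inputs, and the omission of learned projections, multiple heads, and feed-forward layers are all choices made here for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, keys_values):
    """Scaled dot-product attention from one modality onto another.

    Assumption: no learned query/key/value projections, so the raw
    features are used directly as queries, keys, and values.
    """
    dim = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(dim)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ keys_values, weights


def bidirectional_co_attention(visual, textual):
    """Attend vision -> text and text -> vision, with residual connections.

    A uni-direction variant would keep only one of the two calls below.
    """
    visual_ctx, vis_to_txt = cross_attention(visual, textual)
    textual_ctx, txt_to_vis = cross_attention(textual, visual)
    return visual + visual_ctx, textual + textual_ctx, vis_to_txt, txt_to_vis


# Hypothetical inputs: 5 image-region features and 3 question-token
# features, each of dimension 8.
rng = np.random.default_rng(0)
visual = rng.normal(size=(5, 8))
textual = rng.normal(size=(3, 8))

visual_out, textual_out, vis_to_txt, txt_to_vis = bidirectional_co_attention(
    visual, textual
)
print(visual_out.shape, textual_out.shape)
print(vis_to_txt.shape, txt_to_vis.shape)
```

Each modality keeps its own sequence length after the update, while the two attention maps have transposed shapes: one row per image region in the vision-to-text map and one row per question token in the text-to-vision map.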
© (2022) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Tung Le, Thong Bui, Huy Tien Nguyen, and Minh Le Nguyen "Bi-direction co-attention network on visual question answering for blind people", Proc. SPIE 12084, Fourteenth International Conference on Machine Vision (ICMV 2021), 1208416 (4 March 2022); https://doi.org/10.1117/12.2623596