Paper
16 August 2024
JoAt: to dynamically aggregate visual queries in transformer for visual question answering
Mingben Wang, Juan Yang, Lixia Xue, Ronggui Wang
Proceedings Volume 13230, Third International Conference on Machine Vision, Automatic Identification, and Detection (MVAID 2024); 132300F (2024) https://doi.org/10.1117/12.3035665
Event: Third International Conference on Machine Vision, Automatic Identification and Detection, 2024, Kunming, China
Abstract
Attention mechanisms have shown impressive performance on downstream multi-modal tasks. However, a natural semantic gap between the vision and language modalities prevents conventional attention-based models from achieving effective cross-modal semantic alignment. In this paper, we present JoAt, a Joint Attention network, with which we investigate how to exploit visual background information more directly, in a query-adaptive manner, to enrich the querying semantics of each visual token, and how to bridge the semantic gap more fully so as to align visual-grid and textual features across modalities. Specifically, JoAt draws on each query’s neighboring pixels, aggregates visual query tokens from different receptive fields, and lets the model dynamically select the most relevant neighboring tokens for each query; the resulting representations are semantically better matched to the textual features and thus enable richer interaction between the visual and linguistic modalities. Experimental results show that JoAt fully exploits the semantic signals carried by visual features at different receptive fields and effectively narrows the natural semantic difference between the visual and language modalities. JoAt achieves an accuracy of 72.15% on the VQAv2.0 test-std benchmark and 98.90% on CLEVR.
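The abstract only outlines the mechanism, so the snippet below is a minimal PyTorch-style sketch of one way such dynamic multi-receptive-field query aggregation could be realized. The class name MultiScaleQueryAggregator, the depthwise-convolution branches, and the softmax gate are illustrative assumptions for exposition, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of aggregating visual query tokens
# from several receptive fields with a per-query dynamic selection gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleQueryAggregator(nn.Module):
    """Mix each visual query token with its neighbours at several
    receptive-field sizes; a learned gate picks the mix per query."""

    def __init__(self, dim, kernel_sizes=(1, 3, 5)):
        super().__init__()
        # Depthwise convolutions stand in for different receptive fields on the grid.
        self.branches = nn.ModuleList(
            nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim) for k in kernel_sizes
        )
        # Per-query scores over the branches (dynamic selection of neighbourhoods).
        self.gate = nn.Linear(dim, len(kernel_sizes))

    def forward(self, x):
        # x: (B, H*W, C) visual grid tokens flattened to a sequence (square grid assumed).
        b, n, c = x.shape
        h = w = int(n ** 0.5)
        grid = x.transpose(1, 2).reshape(b, c, h, w)
        # Candidate query tokens from each receptive field: (B, K, N, C).
        cands = torch.stack(
            [br(grid).reshape(b, c, n).transpose(1, 2) for br in self.branches], dim=1
        )
        # Softmax gate over the K branches, computed from the original query token.
        weights = F.softmax(self.gate(x), dim=-1)          # (B, N, K)
        weights = weights.transpose(1, 2).unsqueeze(-1)    # (B, K, N, 1)
        # Weighted sum: each query keeps the receptive fields most relevant to it.
        return (weights * cands).sum(dim=1)                # (B, N, C)
```

In such a setup, the aggregated query tokens would then attend over the textual features in an otherwise standard cross-attention layer.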
(2024) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Mingben Wang, Juan Yang, Lixia Xue, and Ronggui Wang "JoAt: to dynamically aggregate visual queries in transformer for visual question answering", Proc. SPIE 13230, Third International Conference on Machine Vision, Automatic Identification, and Detection (MVAID 2024), 132300F (16 August 2024); https://doi.org/10.1117/12.3035665
KEYWORDS: Semantics, Visual process modeling, Transformers, Information visualization, Performance modeling, Education and training, Feature extraction