21 July 2023 Image captioning based on an improved attention mechanism
Leiyan Liang, Nan Xiang
Proceedings Volume 12717, 3rd International Conference on Artificial Intelligence, Automation, and High-Performance Computing (AIAHPC 2023); 127173B (2023) https://doi.org/10.1117/12.2685386
Event: 3rd International Conference on Artificial Intelligence, Automation, and High-Performance Computing (AIAHPC 2023), 2023, Wuhan, China
Abstract
Attention mechanisms in image captioning models help the model focus on relevant regions while generating a caption. However, existing attention mechanisms are unable to identify the important regions and important visual features in an image. As a result, models sometimes pay excessive attention to unimportant regions and features during caption generation, which leads to coarse-grained or even incorrect captions. To address this problem, we propose an “Importance Discrimination Attention” (IDA) module, which discriminates between important and unimportant features and reduces the chance of being misled by unimportant features during caption generation. We also propose IDANet, an IDA-based image captioning model built entirely on the transformer framework. The encoder of IDANet consists of two parts: a pretrained Vision Transformer (ViT), which extracts visual features quickly, and a refining module, which models the positional and semantic relationships among grids. For the decoder, we propose the IDA-Decoder, whose structure is similar to that of the transformer decoder. Guided by IDA, the IDA-Decoder focuses on crucial regions and features rather than on all regions and features while generating a caption. Compared with other attention mechanisms, IDA captures the semantic relevance between important regions and other regions in a fine-grained and efficient way. The captions generated by IDANet can accurately capture the relationships among different objects and distinguish objects of similar size and shape. On the MSCOCO “Karpathy” offline test split, IDANet achieves a CIDEr-D score of 132.0 and a BLEU-4 score of 40.3.
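The abstract does not give the formulation of IDA, so the following is only an illustrative sketch of the general idea of attending to a subset of regions instead of all of them. It swaps in a plain top-k masked scaled dot-product attention, which is an assumption and not the authors' method. The function name `topk_masked_attention` and the parameter `k` are hypothetical.

```python
import math

def topk_masked_attention(queries, keys, values, k):
    """Scaled dot-product attention restricted to the k highest-scoring regions.

    Hypothetical stand-in for "focus on crucial regions only"; NOT the IDA
    module from the paper, whose definition the abstract does not provide.
    queries: list of T vectors (length d), e.g. decoder word states
    keys:    list of N vectors (length d), e.g. grid features
    values:  list of N vectors (length d_v)
    """
    d = len(queries[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity between this query and every region.
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
                  for key in keys]
        # Indices of the k regions judged most relevant for this query.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        keep = set(ranked[:k])
        # Softmax over the kept regions only; the rest get exactly zero weight.
        top = max(scores[i] for i in keep)
        exps = [math.exp(scores[i] - top) if i in keep else 0.0
                for i in range(len(scores))]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy input: one query, three regions; only the two best-matching regions are kept.
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = topk_masked_attention(queries, keys, values, k=2)
```

In this toy input the third region has the lowest score, so it receives zero attention weight and does not contribute to the output. A hard top-k cutoff is only one of several possible ways to suppress unimportant regions; soft gating would be another.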
© (2023) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Leiyan Liang and Nan Xiang "Image captioning based on an improved attention mechanism", Proc. SPIE 12717, 3rd International Conference on Artificial Intelligence, Automation, and High-Performance Computing (AIAHPC 2023), 127173B (21 July 2023); https://doi.org/10.1117/12.2685386
KEYWORDS
Visualization, Semantics, Transformers, Visual process modeling, Feature extraction, Computer vision technology, Pattern recognition