Image caption generation based on object detection and knowledge enhancement

Jiaguo Zhong; Dongsheng Wang

doi:10.1117/12.2680966

8 June 2023

Proceedings Volume 12707, International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2023); 127070W (2023) https://doi.org/10.1117/12.2680966
Event: International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2023), 2023, Changsha, China

Abstract

Existing image caption generation methods mainly focus on identifying the contents of an image and the relationships among them, but they cannot generate descriptions that draw on fine-grained background knowledge, so they fail to describe the deeper semantics of a picture. To address this problem, this paper first presents an image caption generation method based on object detection and knowledge enhancement. In the detection phase, a Fusion target classification detector (FTCD) fuses multidimensional information to obtain tags for human faces, goods, and objects in the image. A knowledge graph is then introduced, and the tags produced by the detector are used to query it for related knowledge. Finally, the tag set and the related knowledge are jointly fed into the model for encoding. On the decoding side, an attention mechanism guides the model to select appropriate information and generate the image description. Second, because the human-written captions in the MSCOCO dataset lack commonsense knowledge, this paper presents an evaluation metric, SPICE-K, which can be used to evaluate image descriptions that contain commonsense knowledge. The experimental results show that the accuracy of the proposed method is 1.3% higher than that of the standard LBPF model. The experimental analysis shows that, compared with the benchmark model, the performance improvement comes mainly from the introduction of the knowledge graph and the proposed target classification detector.
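The data flow described in the abstract (detector tags, knowledge lookup, joint encoder input, attention over that input) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy knowledge graph `TOY_KG`, the helper names, the example tags, and the plain dot-product attention are all assumptions made for illustration, and the detector itself is replaced by a hard-coded tag list.

```python
import math

# Hypothetical toy knowledge graph: tag -> list of (relation, object) facts.
# A real system would query an external knowledge graph instead.
TOY_KG = {
    "Eiffel Tower": [("located_in", "Paris"), ("instance_of", "tower")],
    "dog": [("instance_of", "animal")],
}


def query_knowledge(tags, kg):
    """Return (subject, relation, object) triples for every tag found in kg."""
    return [(tag, rel, obj) for tag in tags for rel, obj in kg.get(tag, [])]


def build_encoder_input(tags, triples):
    """Concatenate the tags and the flattened triples into one token list."""
    tokens = list(tags)
    for subj, rel, obj in triples:
        tokens.extend([subj, rel, obj])
    return tokens


def attention_weights(query, keys):
    """Softmax over dot products between a query vector and each key vector."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Stand-in for detector output; the tags are invented for this example.
tags = ["Eiffel Tower", "dog", "bench"]
triples = query_knowledge(tags, TOY_KG)
encoder_input = build_encoder_input(tags, triples)
```

In this sketch a tag with no entry in the knowledge graph (here `"bench"`) still reaches the encoder input as a plain tag, so missing knowledge degrades the input gracefully rather than dropping the detected object.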

Citation

Jiaguo Zhong and Dongsheng Wang "Image caption generation based on object detection and knowledge enhancement", Proc. SPIE 12707, International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2023), 127070W (8 June 2023); https://doi.org/10.1117/12.2680966

PROCEEDINGS
7 PAGES

KEYWORDS

Target detection

Image enhancement

Object detection

Feature extraction

Facial recognition systems

Performance modeling

Semantics
