Paper
12 October 2022 Multi-modal transformer for video retrieval using improved sentence embeddings
Proceedings Volume 12342, Fourteenth International Conference on Digital Image Processing (ICDIP 2022); 1234221 (2022) https://doi.org/10.1117/12.2643741
Event: Fourteenth International Conference on Digital Image Processing (ICDIP 2022), 2022, Wuhan, China
Abstract
With the explosive growth in the number of online videos, video retrieval is becoming increasingly difficult. Video-text retrieval based on multi-modal visual and language understanding is one of the mainstream frameworks for solving this problem. Among such approaches, MMT (Multi-modal Transformer) is a novel and widely used model. On the language side, BERT (Bidirectional Encoder Representations from Transformers) serves as the text encoder, which encodes text into semantic embeddings; the pretrained BERT is fine-tuned during training. However, there is a mismatch at this stage: the pre-training tasks of BERT, NSP (Next Sentence Prediction) and MLM (Masked Language Model), correlate only weakly with video retrieval. On the visual side, a Transformer is used to aggregate the multi-modal experts of videos. We find that the output of the visual Transformer is not fully utilized. In this paper, a Sentence-BERT model is introduced to replace the BERT model in MMT in order to improve the efficiency of the sentence embeddings. In addition, a max-pooling layer is added after the Transformer to make better use of the model's output. Experimental results show that the proposed model outperforms MMT.
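As a rough illustration of the max-pooling idea mentioned in the abstract, the sketch below collapses a sequence of Transformer output vectors into a single video embedding and compares it with a sentence embedding. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not say over which axis pooling is applied or how similarity is scored, so pooling over the sequence axis and cosine similarity are assumptions, and the names `max_pool`, `cosine_similarity`, and all numeric values are hypothetical.

```python
import math
from typing import List


def max_pool(token_outputs: List[List[float]]) -> List[float]:
    """Element-wise maximum over the sequence axis (assumed pooling axis)."""
    if not token_outputs:
        raise ValueError("need at least one output vector")
    # zip(*rows) iterates over feature dimensions; take the max of each one.
    return [max(column) for column in zip(*token_outputs)]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-in for the visual Transformer output: 3 positions, 3 features.
video_outputs = [
    [0.1, 0.9, -0.3],
    [0.7, 0.2, 0.4],
    [0.0, 0.5, 0.8],
]
video_embedding = max_pool(video_outputs)

# Toy stand-in for a sentence embedding produced by a text encoder.
text_embedding = [0.6, 0.8, 0.7]

score = cosine_similarity(video_embedding, text_embedding)
print(video_embedding, round(score, 4))
```

In this toy setup every position of the Transformer output can contribute to the pooled vector, which is one plausible reading of "improving the utilization" of the output; the paper itself should be consulted for the actual design.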
© (2022) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Zhi Liu, Fangyuan Zhao, and Mengmeng Zhang "Multi-modal transformer for video retrieval using improved sentence embeddings", Proc. SPIE 12342, Fourteenth International Conference on Digital Image Processing (ICDIP 2022), 1234221 (12 October 2022); https://doi.org/10.1117/12.2643741
KEYWORDS
Video

Computer programming

Transformers

Video acceleration

Feature extraction

Visualization

Fourier transforms
