Open Access
12 March 2019
Spatiotemporal information deep fusion network with frame attention mechanism for video action recognition
Abstract
In deep learning-based video action recognition, the neural network must capture spatial information, motion information, and the relationships between the two over varying time spans. We propose a network that extracts the semantic information of video sequences through deep fusion of local spatial–temporal features. Convolutional neural networks (CNNs) extract local spatial information and local motion information separately; a three-dimensional convolution then combines the spatial information with the motion information of the corresponding frames to obtain the local spatial–temporal information at each moment. This local spatial–temporal information is fed into a long short-term memory (LSTM) network to model its long-range temporal context. We further add a frame-level regional attention mechanism to the context-modeling stage: the spatial features of the last convolutional layer and the features of the first fully connected layer are input into separate LSTM networks, and the outputs of the two LSTMs are merged at each time step. In this way, the fully connected layer, which is rich in categorical information, provides a frame attention signal for the spatial feature stream. Experiments on three common action recognition datasets, UCF101, UCF11, and UCFSports, show that the proposed spatial–temporal information deep fusion network achieves a high recognition rate on the action recognition task.
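The data flow described in the abstract can be sketched in a few lines. The sketch below is a simplified stand-in, not the authors' implementation: all dimensions are hypothetical, per-frame CNN features are replaced by random vectors, the 3-D convolutional fusion is approximated by feature concatenation, and a minimal hand-written LSTM cell stands in for the paper's LSTM layers. It only illustrates the two-stream structure: fused spatial–temporal features and fully connected features are run through parallel LSTMs whose per-timestep outputs are merged.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal LSTM cell (stand-in for the paper's LSTM layers)."""
    def __init__(self, in_dim, hid_dim):
        self.hid = hid_dim
        # One weight matrix covering the four gates (input, forget, output, cell).
        self.W = rng.standard_normal((4 * hid_dim, in_dim + hid_dim)) * 0.1
        self.b = np.zeros(4 * hid_dim)

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

def run_lstm(cell, seq):
    """Unroll the cell over a (T, D) sequence, returning (T, H) hidden states."""
    h, c = np.zeros(cell.hid), np.zeros(cell.hid)
    outs = []
    for x in seq:
        h, c = cell.step(x, h, c)
        outs.append(h)
    return np.stack(outs)

# Hypothetical sizes: T frames, feature dimensions chosen arbitrarily.
T, D_spat, D_mot, D_fc, H = 8, 32, 32, 24, 16
spatial = rng.standard_normal((T, D_spat))  # last-conv-layer features per frame
motion = rng.standard_normal((T, D_mot))    # motion features per frame
fc = rng.standard_normal((T, D_fc))         # first-FC-layer features per frame

# Local spatial–temporal fusion: the paper applies a 3-D convolution over
# spatial and motion features of corresponding frames; concatenation is a
# simplified stand-in here.
fused = np.concatenate([spatial, motion], axis=1)

# Two parallel LSTMs: one over fused conv features, one over FC features.
lstm_conv = LSTMCell(fused.shape[1], H)
lstm_fc = LSTMCell(fc.shape[1], H)

# Merge the two streams at every time step, so the class-rich FC stream can
# act as a frame attention signal for the spatial stream.
out = np.concatenate([run_lstm(lstm_conv, fused), run_lstm(lstm_fc, fc)], axis=1)
print(out.shape)  # (8, 32): T time steps, 2*H merged features per step
```

In the actual network the merged per-timestep features would feed a classifier; here the sketch stops at the merged sequence to keep the structure visible.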
CC BY: © The Authors. Published by SPIE under a Creative Commons Attribution 4.0 Unported License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI.
Hongshi Ou and Jifeng Sun "Spatiotemporal information deep fusion network with frame attention mechanism for video action recognition," Journal of Electronic Imaging 28(2), 023009 (12 March 2019). https://doi.org/10.1117/1.JEI.28.2.023009
Received: 25 September 2018; Accepted: 21 February 2019; Published: 12 March 2019
CITATIONS
Cited by 6 scholarly publications and 1 patent.
KEYWORDS: Video, Information fusion, Video acceleration, Network architectures, Convolution, Feature extraction, Neural networks