Open Access
12 March 2019
Spatiotemporal information deep fusion network with frame attention mechanism for video action recognition
Abstract
In deep learning-based video action recognition, the neural network must capture spatial information, motion information, and the relationships between the two over varying time spans. We propose a network that extracts the semantic information of video sequences through deep fusion of local spatial–temporal features. Convolutional neural networks (CNNs) extract local spatial information and local motion information separately; a three-dimensional convolution then combines the spatial information with the motion information of the corresponding frames to obtain the local spatial–temporal information at each moment. This local spatial–temporal information is fed into a long short-term memory (LSTM) network to model its long-range temporal context. We further add a frame-level regional attention mechanism to the context-modeling stage: the spatial features of the last convolutional layer and the features of the first fully connected layer are input into separate LSTM networks, and the outputs of the two LSTMs are merged at each time step. In this way, the fully connected layer, which is rich in categorical information, provides a frame attention signal for the spatial feature stream. Experiments on three common action recognition datasets, UCF101, UCF11, and UCFSports, show that the proposed spatial–temporal information deep fusion network achieves a high recognition rate on the action recognition task.
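The data flow described in the abstract can be sketched in a few lines. The sketch below is a simplified stand-in, not the authors' implementation: all dimensions are hypothetical, per-frame CNN features are replaced by random vectors, the 3-D convolutional fusion is approximated by feature concatenation, and a minimal hand-written LSTM cell stands in for the paper's LSTM layers. It only illustrates the two-stream structure: fused spatial–temporal features and fully connected features are run through parallel LSTMs whose per-timestep outputs are merged.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LSTMCell:
    """Minimal LSTM cell (stand-in for the paper's LSTM layers)."""
    def __init__(self, in_dim, hid_dim):
        self.hid = hid_dim
        # One weight matrix covering the four gates (input, forget, output, cell).
        self.W = rng.standard_normal((4 * hid_dim, in_dim + hid_dim)) * 0.1
        self.b = np.zeros(4 * hid_dim)

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

def run_lstm(cell, seq):
    """Unroll the cell over a (T, D) sequence, returning (T, H) hidden states."""
    h, c = np.zeros(cell.hid), np.zeros(cell.hid)
    outs = []
    for x in seq:
        h, c = cell.step(x, h, c)
        outs.append(h)
    return np.stack(outs)

# Hypothetical sizes: T frames, feature dimensions chosen arbitrarily.
T, D_spat, D_mot, D_fc, H = 8, 32, 32, 24, 16
spatial = rng.standard_normal((T, D_spat))  # last-conv-layer features per frame
motion = rng.standard_normal((T, D_mot))    # motion features per frame
fc = rng.standard_normal((T, D_fc))         # first-FC-layer features per frame

# Local spatial–temporal fusion: the paper applies a 3-D convolution over
# spatial and motion features of corresponding frames; concatenation is a
# simplified stand-in here.
fused = np.concatenate([spatial, motion], axis=1)

# Two parallel LSTMs: one over fused conv features, one over FC features.
lstm_conv = LSTMCell(fused.shape[1], H)
lstm_fc = LSTMCell(fc.shape[1], H)

# Merge the two streams at every time step, so the class-rich FC stream can
# act as a frame attention signal for the spatial stream.
out = np.concatenate([run_lstm(lstm_conv, fused), run_lstm(lstm_fc, fc)], axis=1)
print(out.shape)  # (8, 32): T time steps, 2*H merged features per step
```

In the actual network the merged per-timestep features would feed a classifier; here the sketch stops at the merged sequence to keep the structure visible.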
CC BY: © The Authors. Published by SPIE under a Creative Commons Attribution 4.0 Unported License. Distribution or reproduction of this work in whole or in part requires full attribution of the original publication, including its DOI.
Hongshi Ou and Jifeng Sun "Spatiotemporal information deep fusion network with frame attention mechanism for video action recognition," Journal of Electronic Imaging 28(2), 023009 (12 March 2019). https://doi.org/10.1117/1.JEI.28.2.023009
Received: 25 September 2018; Accepted: 21 February 2019; Published: 12 March 2019
CITATIONS
Cited by 6 scholarly publications and 1 patent.
KEYWORDS: Video, Information fusion, Video acceleration, Network architectures, Convolution, Feature extraction, Neural networks