16 September 2022 Fusion sampling networks for skeleton-based human action recognition
Guannan Chen, Shimin Wei
Abstract

The human eye needs only a few still images from a few seconds of an action to recognize it, yet action recognition networks require hundreds of input frames per action. This results in a large number of floating-point operations (ranging from 16 to 100 GFLOPs) to process a single sample, which hampers the deployment of graph convolutional network (GCN)-based action recognition methods when computational resources are restricted. A common strategy is to retain only a portion of the frames, but this loses the important information contained in the discarded frames. Furthermore, the key-frame selection process is too independent and lacks connections with other frames. To solve these two problems, we propose a fusion sampling network that generates fused frames for key-frame extraction. Temporal aggregation is used to fuse adjacent similar frames, thereby reducing information loss and redundancy. The concept of self-attention is introduced to strengthen the long-term association of key frames. The experimental results on three benchmark datasets show that the proposed method achieves performance competitive with state-of-the-art methods while using only 16.7% of the frames (∼50 of 300 frames in total). On the NTU 60 dataset, the number of FLOPs and parameters with a single-channel input are 3.776 G and 3.53 M, respectively. This would greatly reduce the excessive computational cost that action recognition incurs in practical applications because of the large amount of data it processes.
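The idea of fusing adjacent similar frames instead of discarding them can be illustrated with a minimal sketch. It is not the authors' network: the function name `fuse_adjacent_frames`, the fixed contiguous windows, the cosine-similarity weighting, and the (frames, joints, coordinates) layout with 25 joints are all assumptions made for illustration, and the self-attention step is not modeled.

```python
import numpy as np

def fuse_adjacent_frames(frames: np.ndarray, num_out: int) -> np.ndarray:
    """Fuse a (T, V, C) skeleton sequence into num_out frames.

    Hypothetical helper: each output frame is a similarity-weighted
    average over one contiguous window of input frames, so no input
    frame is dropped outright.
    """
    T = frames.shape[0]
    if not 1 <= num_out <= T:
        raise ValueError("num_out must be between 1 and the number of frames")
    flat = frames.reshape(T, -1).astype(np.float64)
    # Integer window boundaries: contiguous, non-empty, covering all frames.
    bounds = (np.arange(num_out + 1) * T) // num_out
    fused = np.empty((num_out, flat.shape[1]))
    for i in range(num_out):
        window = flat[bounds[i]:bounds[i + 1]]
        center = window[len(window) // 2]
        # Cosine similarity of each frame in the window to the center frame.
        norms = np.linalg.norm(window, axis=1) * np.linalg.norm(center) + 1e-8
        sim = window @ center / norms
        # Softmax turns similarities into weights that sum to 1.
        w = np.exp(sim - sim.max())
        w /= w.sum()
        fused[i] = w @ window
    return fused.reshape((num_out,) + frames.shape[1:])

# Assumed input: 300 frames, 25 joints, 3 coordinates per joint.
seq = np.random.default_rng(0).normal(size=(300, 25, 3))
fused = fuse_adjacent_frames(seq, 50)
print(fused.shape, f"{50 / 300:.1%}")
```

Keeping 50 fused frames out of 300 corresponds to the roughly 16.7% frame budget quoted in the abstract. Weighted averaging is only one possible aggregation; a learned module would replace the fixed windows and similarity weights used here.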

© 2022 SPIE and IS&T
Guannan Chen and Shimin Wei "Fusion sampling networks for skeleton-based human action recognition," Journal of Electronic Imaging 31(5), 053015 (16 September 2022). https://doi.org/10.1117/1.JEI.31.5.053015
Received: 3 May 2022; Accepted: 29 August 2022; Published: 16 September 2022
CITATIONS
Cited by 2 scholarly publications.
KEYWORDS
Convolution
Data modeling
Video
Bone
RGB color model
Neural networks
Cameras
