Video-based human action recognition is a challenging task in computer vision. In recent years, the convolutional neural network (CNN) and its extended versions have shown promising results for video action recognition. However, most existing methods cannot deal with global motion information effectively, especially long-term motion, which is crucial for representing complex non-periodic actions. To address this issue, a stacked trajectory energy image (STEI) is proposed by extracting trajectories from motion-saliency regions and stacking them onto one grayscale image. This results in an STEI with discriminative texture features that effectively characterize global motion across multiple consecutive frames. A three-stream CNN framework is then proposed to simultaneously capture the spatial, temporal, and global motion information of the action from RGB frames, optical flow, and the STEI. Moreover, a trajectory-aware convolution strategy is introduced that incorporates local and long-term motion information so as to learn motion features directly and effectively from three complementary action-related regions. Finally, the learned features are aggregated and classified by a linear support vector machine. Experimental results on two challenging datasets (HMDB51 and UCF101) demonstrate that our approach statistically outperforms a number of state-of-the-art methods.
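As an illustration of the stacking idea described above, the following is a minimal sketch in NumPy. It is not the paper's STEI pipeline (which extracts trajectories from motion-saliency regions via tracking); instead it uses a crude assumption, frame differencing as a motion-saliency proxy, and accumulates the per-frame motion energy of a clip into a single normalized grayscale image. The function name and the threshold parameter are hypothetical.

```python
import numpy as np

def stacked_energy_image(frames, saliency_threshold=10):
    """Accumulate per-frame motion energy into one grayscale image.

    frames: list of HxW uint8 grayscale arrays from consecutive video frames.
    Assumption: absolute frame differences above `saliency_threshold`
    approximate motion-saliency regions (a stand-in for real trajectories).
    """
    h, w = frames[0].shape
    energy = np.zeros((h, w), dtype=np.float64)
    for prev, curr in zip(frames, frames[1:]):
        # Motion energy between consecutive frames.
        diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
        mask = diff > saliency_threshold  # crude motion-saliency proxy
        # Stack (accumulate) the salient motion onto one image.
        energy[mask] += diff[mask]
    # Normalize the stacked energy to an 8-bit grayscale image.
    if energy.max() > 0:
        energy = energy / energy.max() * 255.0
    return energy.astype(np.uint8)
```

Regions traversed repeatedly by moving content accumulate higher intensity, so the resulting image carries a texture that summarizes global motion over the whole clip, which is the intuition behind feeding an STEI-like image to a dedicated CNN stream.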