High-performance activity recognition models for video data are difficult to train and deploy efficiently. We measure efficiency in terms of performance, model size, and run-time, during both training and inference. Researchers have demonstrated that 3D convolutions capture space-time dynamics well [13]; the challenge is that 3D convolutions are computationally intensive. [8] proposes the Temporal Shift Module (TSM) for training efficiency, and [5] proposes Deep Compression for inference efficiency. TSM is a simple yet effective way to approach 3D-convolution performance at 2D-convolution computation cost. We apply these efficiency techniques, via transfer learning, to a newly labeled activity recognition dataset. Our labeling strategy is designed to produce highly temporal activity classes. We benchmark against a 2D ResNet50 backbone trained on individual frames and a multi-layer 3D CNN trained on short multi-frame clips. Our contributions are:
1. A new, highly temporal activity recognition dataset based on EgoHands [1].
2. Results showing that a 3D backbone trained on videos outperforms a 2D backbone trained on individual frames.
3. With TSM, a 5x improvement in training run-time with negligible performance loss.
4. With quantization alone, a 10x reduction in model size at inference with negligible performance loss.
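To illustrate the idea behind TSM, the sketch below shifts a fraction of the channels of a frame-level feature map one step forward and one step backward along the time axis, so that an ordinary 2D convolution applied afterwards mixes information across neighboring frames at essentially no extra compute. This is a minimal sketch assuming a PyTorch (batch, time, channels, height, width) tensor layout; the `shift_div` fraction and the zero-padded shift variant follow the defaults described in the TSM paper and are not necessarily the exact module used in this work.

```python
import torch

def temporal_shift(x: torch.Tensor, shift_div: int = 8) -> torch.Tensor:
    """Shift a fraction of channels along the time axis (TSM-style).

    x: features of shape (batch, time, channels, height, width).
    1/shift_div of the channels move one step forward in time,
    another 1/shift_div move one step backward; the rest stay put.
    """
    b, t, c, h, w = x.shape
    fold = c // shift_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # untouched channels
    return out

# Usage: reshape (B*T, C, H, W) frame features to (B, T, C, H, W),
# apply the shift, reshape back, then run the usual 2D convolution.
```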
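For inference efficiency, post-training quantization stores weights as 8-bit integers instead of 32-bit floats, which already shrinks a model by roughly 4x before any further compression (pruning and Huffman coding, as in Deep Compression [5], push the ratio higher). The snippet below is a hypothetical illustration using PyTorch's dynamic quantization API on a ResNet50 backbone; the backbone choice and the dynamic-quantization scheme are assumptions for illustration, not the exact pipeline behind the 10x size reduction reported above.

```python
import torch
import torchvision

# Hypothetical backbone; the exact model and quantization scheme used in
# this work are not reproduced here.
model = torchvision.models.resnet50(weights=None)
model.eval()

# Post-training dynamic quantization: weights of the listed module types
# are stored as int8 and dequantized on the fly during inference.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

torch.save(quantized.state_dict(), "resnet50_int8.pt")  # smaller checkpoint
```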