Boyuan Jiang; Mengmeng Wang; Weihao Gan; Wei Wu; Junjie Yan

Abstract
Spatiotemporal and motion features are two complementary and crucial types of information for video action recognition. Recent state-of-the-art methods adopt a 3D CNN stream to learn spatiotemporal features and a separate flow stream to learn motion features. In this work, we aim to efficiently encode both kinds of features in a unified 2D framework. To this end, we first propose an STM block, which contains a Channel-wise SpatioTemporal Module (CSTM) to represent the spatiotemporal features and a Channel-wise Motion Module (CMM) to efficiently encode motion features. We then replace the original residual blocks in the ResNet architecture with STM blocks to form a simple yet effective STM network that introduces very limited extra computation cost. Extensive experiments demonstrate that, by encoding spatiotemporal and motion features together, the proposed STM network outperforms state-of-the-art methods on both temporal-related datasets (i.e., Something-Something v1 & v2 and Jester) and scene-related datasets (i.e., Kinetics-400, UCF-101, and HMDB-51).
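
The abstract names the two modules but does not say how they are computed, so the following is only a toy NumPy sketch of the general idea, not the authors' implementation. It assumes that "channel-wise spatiotemporal" can be approximated by a per-channel 3-tap filter along the time axis, and that "motion" can be approximated by the difference between the features of adjacent frames. The function names (`channelwise_temporal_conv`, `feature_motion`, `toy_stm_block`) are hypothetical, and the spatial convolutions and channel-reduction layers that a real ResNet-style block would contain are left out.

```python
import numpy as np


def channelwise_temporal_conv(x, kernels):
    """Apply a separate 3-tap temporal filter to each channel (assumed design).

    x:       features of shape (T, C, H, W)
    kernels: per-channel filter taps of shape (C, 3)
    """
    T, C, H, W = x.shape
    # Zero-pad one frame at each end of the time axis so the output keeps T frames.
    padded = np.pad(x, ((1, 1), (0, 0), (0, 0), (0, 0)))
    out = np.zeros_like(x, dtype=float)
    for k in range(3):
        # Tap k weights the frame at offset (k - 1) relative to the current frame.
        out += padded[k:k + T] * kernels[:, k].reshape(1, C, 1, 1)
    return out


def feature_motion(x):
    """Feature-level motion as the difference between adjacent frames (assumed design).

    The last time step is filled with zeros so the output keeps T frames.
    """
    diff = x[1:] - x[:-1]
    return np.concatenate([diff, np.zeros_like(x[:1])], axis=0)


def toy_stm_block(x, kernels):
    """Residual combination: input + spatiotemporal branch + motion branch."""
    return x + channelwise_temporal_conv(x, kernels) + feature_motion(x)


# Usage: 4 frames, 2 channels, 3x3 feature maps of a static "video".
frames = np.ones((4, 2, 3, 3))
identity_taps = np.tile([0.0, 1.0, 0.0], (2, 1))  # each channel passes the centre frame through
out = toy_stm_block(frames, identity_taps)
```

With identity taps and a static input, the motion branch contributes nothing and the temporal branch returns the input unchanged, which makes the sketch easy to sanity-check before swapping in learned filters.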
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | STM (ResNet-50) | Acc@1: 73.7 |
| action-recognition-in-videos-on-hmdb-51-1 | STM (ImageNet+Kinetics pretrain) | Average accuracy of 3 splits: 72.2 |
| action-recognition-in-videos-on-jester-1 | STM (ResNet-50, 16 frames) | Val: 96.7 |
| action-recognition-in-videos-on-something-2 | STM (16 frames, ImageNet pretraining) | Top-1 Accuracy: 50.7 |
| action-recognition-in-videos-on-something-3 | STM (16 frames, ImageNet pretraining) | Top-1 Accuracy: 64.2, Top-5 Accuracy: 89.8 |
| action-recognition-in-videos-on-ucf101-2 | STM (ImageNet+Kinetics pretrain) | 3-fold Accuracy: 96.2 |