Zhiwu Qing, Shiwei Zhang, Ziyuan Huang, Xiang Wang, Yuehuan Wang, Yiliang Lv, Changxin Gao, Nong Sang

Abstract
Standard approaches for video recognition usually operate on the full input videos, which is inefficient due to the spatio-temporal redundancy widely present in videos. Recent progress in masked video modelling, i.e., VideoMAE, has shown the ability of vanilla Vision Transformers (ViT) to complement spatio-temporal contexts given only limited visual contents. Inspired by this, we propose Masked Action Recognition (MAR), which reduces redundant computation by discarding a proportion of patches and operating only on part of the videos. MAR contains two indispensable components: cell running masking and a bridging classifier. Specifically, to enable the ViT to perceive the details beyond the visible patches easily, cell running masking is presented to preserve the spatio-temporal correlations in videos, ensuring that patches at the same spatial location can be observed in turn for easy reconstruction. Additionally, we notice that, although the partially observed features can reconstruct semantically explicit invisible patches, they fail to achieve accurate classification. To address this, a bridging classifier is proposed to bridge the semantic gap between the ViT-encoded features used for reconstruction and the features specialized for classification. Our proposed MAR reduces the computational cost of ViT by 53%, and extensive experiments show that MAR consistently outperforms existing ViT models by a notable margin. In particular, we find that a ViT-Large trained with MAR outperforms a ViT-Huge trained with a standard training scheme by convincing margins on both Kinetics-400 and Something-Something v2, while the computational overhead of our ViT-Large is only 14.5% of that of ViT-Huge.
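The cell running masking idea can be illustrated with a small sketch. This is a hypothetical implementation, not the authors' code: within each 2x2 spatial cell, the visible patch position(s) cycle frame by frame, so every spatial location is observed in turn and the invisible patches at a location can be reconstructed from neighbouring frames.

```python
import numpy as np

def cell_running_mask(num_frames, height, width, cell=2, mask_ratio=0.75):
    """Hypothetical sketch of a cell running mask.

    Within each `cell` x `cell` spatial block, the visible position(s)
    cycle over time so every spatial location is observed in turn.
    Returns a boolean array of shape (num_frames, height, width):
    True = masked (discarded), False = visible to the encoder.
    """
    positions = cell * cell                            # patch slots per cell
    visible_per_frame = int(round(positions * (1 - mask_ratio)))
    mask = np.ones((num_frames, height, width), dtype=bool)
    for t in range(num_frames):
        # a sliding window over the cycle of in-cell positions
        for k in range(visible_per_frame):
            p = (t + k) % positions
            dy, dx = divmod(p, cell)
            mask[t, dy::cell, dx::cell] = False        # unmask this slot everywhere
    return mask

# 75% masking: each frame keeps 25% of patches, and across 4 frames
# every spatial location becomes visible at least once.
mask = cell_running_mask(num_frames=4, height=4, width=4, cell=2, mask_ratio=0.75)
```

The cycling window is what distinguishes this from random (tube) masking: the per-frame ratio is the same, but the temporal schedule guarantees coverage of every location within one cycle.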
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | MAR (50% mask, ViT-B, 16x4) | Acc@1: 81.0 Acc@5: 94.4 |
| action-classification-on-kinetics-400 | MAR (75% mask, ViT-L, 16x4) | Acc@1: 83.9 Acc@5: 96.0 |
| action-classification-on-kinetics-400 | MAR (75% mask, ViT-B, 16x4) | Acc@1: 79.4 Acc@5: 93.7 |
| action-classification-on-kinetics-400 | MAR (50% mask, ViT-L, 16x4) | Acc@1: 85.3 Acc@5: 96.3 |
| action-recognition-in-videos-on-something | MAR (75% mask, ViT-B, 16x4) | GFLOPs: 41x6 Parameters: 94M Top-1 Accuracy: 69.5 Top-5 Accuracy: 91.9 |
| action-recognition-in-videos-on-something | MAR (75% mask, ViT-L, 16x4) | GFLOPs: 131x6 Parameters: 311M Top-1 Accuracy: 73.8 Top-5 Accuracy: 94.4 |
| action-recognition-in-videos-on-something | MAR (50% mask, ViT-L, 16x4) | GFLOPs: 276x6 Parameters: 311M Top-1 Accuracy: 74.7 Top-5 Accuracy: 94.9 |
| action-recognition-in-videos-on-something | MAR (50% mask, ViT-B, 16x4) | GFLOPs: 86x6 Parameters: 94M Top-1 Accuracy: 71.0 Top-5 Accuracy: 92.8 |