HyperAI

Action Classification On Kinetics 400

Metrics

Acc@1

Results

Performance results of various models on this benchmark

| Model Name | Acc@1 | Paper Title | Repository |
| --- | --- | --- | --- |
| OmniVec | 91.1 | OmniVec: Learning robust representations with cross modal sharing | - |
| X3D-L | 77.5 | X3D: Expanding Architectures for Efficient Video Recognition | - |
| ViT-B-VTN+ ImageNet-21K (84.0 [10]) | 79.8 | Video Transformer Network | - |
| MViT-B, 32x3 | 80.2 | Multiscale Vision Transformers | - |
| MTV-H (WTS 60M) | 89.9 | Multiview Transformers for Video Recognition | - |
| AdaMAE | 81.7 | AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders | - |
| ViC-MAE (ViT-L) | 85.1 | ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders | - |
| MoViNet-A4 | 80.5 | MoViNets: Mobile Video Networks for Efficient Video Recognition | - |
| SlowFast 16x8 (ResNet-101) | 78.9 | SlowFast Networks for Video Recognition | - |
| R[2+1]D-RGB (Sports-1M pretrain) | 74.3 | A Closer Look at Spatiotemporal Convolutions for Action Recognition | - |
| X-CLIP(ViT-L/14, CLIP) | 87.7 | Expanding Language-Image Pretrained Models for General Video Recognition | - |
| ip-CSN-152 (IG-65M pretraining) | 82.5 | Video Classification with Channel-Separated Convolutional Networks | - |
| MARS+RGB+Flow (64 frames) | 74.9 | MARS: Motion-Augmented RGB Stream for Action Recognition | - |
| TokenLearner 16at18 (L/10) | 85.4 | TokenLearner: What Can 8 Learned Tokens Do for Images and Videos? | - |
| VideoMamba-M800 | 85.0 | VideoMamba: State Space Model for Efficient Video Understanding | - |
| TAdaConvNeXt-T | 79.1 | TAda! Temporally-Adaptive Convolutions for Video Understanding | - |
| MAR (50% mask, ViT-B, 16x4) | 81.0 | MAR: Masked Autoencoders for Efficient Action Recognition | - |
| Swin-S (ImageNet-1k pretrain) | 80.6 | Video Swin Transformer | - |
| OMNIVORE (Swin-B) | 84.0 | Omnivore: A Single Model for Many Visual Modalities | - |
| S3D-G (Flow, ImageNet pretrained) | 68 | Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification | - |