Action Recognition In Videos On Something 1

Metrics

GFLOPs

Param.

Top 1 Accuracy

Top 5 Accuracy

Results

Performance results of various models on this benchmark

Model	GFLOPs	Param.	Top 1 Accuracy	Top 5 Accuracy	Paper Title
InternVideo	-	-	70.0	-	InternVideo: General Video Foundation Models via Generative and Discriminative Learning
VideoMAE V2-g	-	-	68.7	91.9	VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
Side4Video (EVA ViT-E/14)	-	-	67.3	88.8	Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
ATM	-	-	65.6	88.6	What Can Simple Arithmetic Operations Do for Temporal Modeling?
TAdaFormer-L/14	-	-	63.7	-	Temporally-Adaptive Models for Efficient Video Understanding
TDS-CLIP-ViT-L/14(8frames)	-	-	63.0	87.8	TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning
UniFormerV2-L	-	-	62.7	88.0	UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
StructVit-B-4-1	-	-	61.3	-	Learning Correlation Structures for Vision Transformers
UniFormer-B (IN-1K + Kinetics400)	259x3	50.1	60.9	87.3	UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning
TAdaConvNeXtV2-B	-	-	60.7	-	Temporally-Adaptive Models for Efficient Video Understanding
TPS	-	-	58.3	-	Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition
MSMA (8+16frames)	-	-	57.9	-	Multi-scale Motion-Aware Module for Video Action Recognition
UniFormer-B (IN-1K + Kinetics600)	41.8x3	21.4	57.6	84.9	UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning
SIFA	-	-	57.3	-	Stand-Alone Inter-Frame Attention in Video Models
TCM (Ensemble)	-	-	57.2	-	Motion-driven Visual Tempo Learning for Video-based Action Recognition
EAN ResNet50 (single clip, center crop,8+16 ensemble, with sparse Transformer)	-	-	57.2	83.9	EAN: Event Adaptive Network for Enhanced Action Recognition
BQNEn (ImageNet + K400 pretrained)	-	-	57.1	84.2	Busy-Quiet Video Disentangling for Video Classification
TDN ResNet101 (one clip, center crop, 8+16 ensemble, ImageNet pretrained, RGB only)	-	-	56.8	84.1	TDN: Temporal Difference Networks for Efficient Action Recognition
CT-Net Ensemble (R50, 8+12+16+24)	-	-	56.6	-	CT-Net: Channel Tensorization Network for Video Classification
MoDS (8+16frames)	-	-	56.6	-	Action Recognition With Motion Diversification and Dynamic Selection
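
The rows above are tab-separated, with "-" marking a value that was not reported. As a minimal sketch of how such rows could be loaded for sorting or filtering, the Python snippet below parses two rows copied from the table. The field names, the helper functions, and the reading of a GFLOPs cell such as "259x3" as a base cost times a number of views are assumptions made for illustration; the page itself does not define them.

```python
# Two rows copied verbatim from the table above (tab-separated).
ROWS = (
    "VideoMAE V2-g\t-\t-\t68.7\t91.9\t"
    "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking\n"
    "UniFormer-B (IN-1K + Kinetics400)\t259x3\t50.1\t60.9\t87.3\t"
    "UniFormer: Unified Transformer for Efficient Spatial-Temporal Representation Learning\n"
)

# Hypothetical field names, chosen to mirror the metric list on the page.
FIELDS = ["model", "gflops", "params", "top1", "top5", "paper"]


def to_float(cell):
    """Return the cell as a float, or None when the table shows '-'."""
    cell = cell.strip()
    return None if cell == "-" else float(cell)


def parse_gflops(cell):
    """Split a cell such as '259x3' into (259.0, 3).

    Assumption: the integer after 'x' is a multiplier (for example a
    number of views). A plain number gets a multiplier of 1.
    """
    cell = cell.strip()
    if cell == "-":
        return None
    base, _, mult = cell.partition("x")
    multiplier = int(mult) if mult else 1
    return (float(base), multiplier)


def parse_rows(text):
    """Parse tab-separated leaderboard rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        cells = line.split("\t")
        if len(cells) != len(FIELDS):
            continue  # skip lines that are not full table rows
        row = dict(zip(FIELDS, cells))
        row["gflops"] = parse_gflops(row["gflops"])
        for key in ("params", "top1", "top5"):
            row[key] = to_float(row[key])
        rows.append(row)
    return rows


def best_by_top1(rows):
    """Return the row with the highest reported Top 1 accuracy."""
    scored = [r for r in rows if r["top1"] is not None]
    return max(scored, key=lambda r: r["top1"])
```

Keeping unreported values as None rather than 0 avoids ranking a model as if it had scored zero on a metric it never reported.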
