HyperAI

Action Recognition on EPIC-KITCHENS-100

Metrics

Action@1
GFLOPs
Noun@1
Verb@1

Results

Performance results of various models on this benchmark

| Model Name | Action@1 | GFLOPs | Noun@1 | Verb@1 | Paper Title | Repository |
|---|---|---|---|---|---|---|
| MoViNet-A5 | 44.5 | 74.9x1 | 55.1 | 69.1 | MoViNets: Mobile Video Networks for Efficient Video Recognition | - |
| Avion (ViT-L) | 54.4 | - | 65.4 | 73.0 | Training a Large Video Model on a Single Machine in a Day | - |
| MeMViT-24 | 48.4 | - | 60.3 | 71.4 | MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition | - |
| SlowFast | 36.81 | - | - | - | Rescaling Egocentric Vision | - |
| MoViNet-A2 | 41.2 | 7.59x1 | 52.3 | 67.1 | MoViNets: Mobile Video Networks for Efficient Video Recognition | - |
| TSN | 33.57 | - | - | - | Rescaling Egocentric Vision | - |
| GSF | 44.48 | - | 53.18 | 69.06 | Gate-Shift-Fuse for Video Action Recognition | - |
| ViViT-L/16x2 Fact. encoder | 44.0 | - | 56.8 | 66.4 | ViViT: A Video Vision Transformer | - |
| TAdaConvNeXtV2-S | 48.9 | - | 60.2 | 71.0 | Temporally-Adaptive Models for Efficient Video Understanding | - |
| ORViT Mformer-L (ORViT blocks) | 45.7 | - | 58.7 | 68.4 | Object-Region Video Transformers | - |
| CAST-B/16 | 49.3 | - | 60.9 | 72.5 | CAST: Cross-Attention in Space and Time for Video Action Recognition | - |
| Mformer-HR | 44.5 | - | 58.5 | 67.0 | Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | - |
| TempAgg | 45.26 | - | 53.35 | 66 | Technical Report: Temporal Aggregate Representations | - |
| LaViLa (TimeSformer-L) | 51 | - | 62.9 | 72 | Learning Video Representations from Large Language Models | - |
| MMT | 47.8 | - | 61.0 | 70.1 | Multiscale Multimodal Transformer for Multimodal Action Recognition | - |
| MBT | 43.4 | - | 58 | 64.8 | Attention Bottlenecks for Multimodal Fusion | - |
| Mformer-L | 44.1 | - | 57.6 | 67.1 | Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | - |
| Mformer | 43.1 | - | 56.5 | 66.7 | Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers | - |
| M&M (WTS 60M) | 53.6 | - | 66.3 | 72.0 | M&M Mix: A Multimodal Multiview Transformer Ensemble | - |
| OMNIVORE (Swin-B, finetuned) | 49.9 | - | 61.7 | 69.5 | Omnivore: A Single Model for Many Visual Modalities | - |