ActionVLAD: Learning spatio-temporal aggregation for action classification

Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell

Abstract

In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks with learnable spatio-temporal feature aggregation. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) as well as out-performs other baselines with comparable base architectures on HMDB51, UCF101, and Charades video classification benchmarks.
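
The abstract does not spell out the aggregation formula, so the following is only a minimal sketch of what joint pooling across space and time could look like, assuming a NetVLAD-style soft assignment of local descriptors to a set of cluster centers. The function name `action_vlad`, the parameter `alpha`, and the array shapes are illustrative assumptions, not taken from the paper's code. In the trainable architecture the centers would be learned end-to-end; here they are fixed NumPy arrays.

```python
import numpy as np

def action_vlad(features, centers, alpha=1.0, eps=1e-12):
    """Aggregate local descriptors over all frames and positions (sketch).

    features: (T, N, D) array, T frames with N spatial positions each.
    centers:  (K, D) array of cluster centers (learned in a real model).
    Returns a single (K * D,) L2-normalized video descriptor.
    """
    d = centers.shape[1]
    # Pool jointly over space and time: flatten frames and positions together.
    x = features.reshape(-1, d)                       # (T*N, D)
    diff = x[:, None, :] - centers[None, :, :]        # residuals, (T*N, K, D)
    sq_dist = np.sum(diff ** 2, axis=2)               # (T*N, K)

    # Soft assignment: softmax over clusters of -alpha * squared distance.
    logits = -alpha * sq_dist
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)

    # Sum assignment-weighted residuals per cluster.
    vlad = np.einsum("mk,mkd->kd", assign, diff)      # (K, D)

    # Assumed normalization: per-cluster L2, then global L2.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + eps
    flat = vlad.ravel()
    return flat / (np.linalg.norm(flat) + eps)
```

Because the sum runs over every frame and position at once, the output does not depend on frame order. Following finding (ii), such a function would be applied separately to the appearance and motion stream features, giving one descriptor per stream rather than a single descriptor over concatenated features.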

Benchmarks

Benchmark: long-video-activity-recognition-on-breakfast
Methodology: ActionVlad (I3D-K400-Pretrain-feature)
Metrics: mAP: 60.20
