HyperAI

Video Action Transformer Network

Rohit Girdhar; João Carreira; Carl Doersch; Andrew Zisserman

Abstract

We introduce the Action Transformer model for recognizing and localizing human actions in video clips. We repurpose a Transformer-style architecture to aggregate features from the spatiotemporal context around the person whose actions we are trying to classify. We show that by using high-resolution, person-specific, class-agnostic queries, the model spontaneously learns to track individual people and to pick up on semantic context from the actions of others. Additionally, its attention mechanism learns to emphasize hands and faces, which are often crucial for discriminating an action, all without any explicit supervision beyond bounding boxes and class labels. We train and test our Action Transformer network on the Atomic Visual Actions (AVA) dataset, outperforming the state of the art by a significant margin using only raw RGB frames as input.
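The core mechanism the abstract describes, a person-specific query attending over spatiotemporal context features, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name, the use of NumPy, and the random projection weights standing in for learned parameters are all illustrative assumptions; only the attention pattern (person feature as query, context features as keys and values, residual update) follows the paper's high-level description.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def action_transformer_unit(person_query, context_feats, seed=0):
    """Sketch of one Transformer-style attention unit (illustrative only).

    person_query:  (d,)    pooled feature for the person being classified (the query)
    context_feats: (N, d)  flattened spatiotemporal context features (keys/values)
    Random weights below stand in for the learned Q/K/V projections.
    """
    d = person_query.shape[0]
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((d, d)) / np.sqrt(d)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)

    q = person_query @ Wq        # (d,)   project person feature to a query
    k = context_feats @ Wk       # (N, d) project context to keys
    v = context_feats @ Wv       # (N, d) project context to values

    # Scaled dot-product attention: one weight per context location/time.
    attn = softmax(k @ q / np.sqrt(d))   # (N,) sums to 1
    update = attn @ v                    # weighted sum of context values

    # Residual update of the person feature, as in standard Transformers.
    return person_query + update, attn
```

In the full model such units are stacked, so the query is refined repeatedly against the clip's context; the attention weights are where the reported emphasis on hands, faces, and other people would emerge during training.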

Benchmarks

Benchmark: action-recognition-in-videos-on-ava-v21

Methodology       GFlops   Params (M)   mAP (Val)
I3D Tx HighRes    39.6     19.3         27.6
I3D               6.5      16.2         23.4
