4 months ago

Action Recognition

Convolutional Neural Network

Video Processing

Method/Architecture

Computer Vision

Rohit Girdhar, Deva Ramanan, Abhinav Gupta, Josef Sivic, Bryan Russell

Abstract

In this work, we introduce a new video representation for action classification that aggregates local convolutional features across the entire spatio-temporal extent of the video. We do so by integrating state-of-the-art two-stream networks with learnable spatio-temporal feature aggregation. The resulting architecture is end-to-end trainable for whole-video classification. We investigate different strategies for pooling across space and time and for combining signals from the different streams. We find that: (i) it is important to pool jointly across space and time, but (ii) appearance and motion streams are best aggregated into their own separate representations. Finally, we show that our representation outperforms the two-stream base architecture by a large margin (13% relative) and also outperforms other baselines with comparable base architectures on the HMDB51, UCF101, and Charades video classification benchmarks.
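To make the two findings concrete, here is a minimal NumPy sketch of pooling jointly over space and time while keeping one descriptor per stream. The abstract does not specify the aggregation layer, so the soft-assignment residual pooling used here (in the style of VLAD) is an assumption, as are the function names `soft_assign_pool` and `aggregate_two_stream`, the `alpha` sharpness parameter, and the choice to concatenate the two descriptors. In a trainable network the `centers` would be learned parameters; here they are plain arrays.

```python
import numpy as np

def soft_assign_pool(features, centers, alpha=10.0):
    """Pool a (T, H, W, D) feature volume into one (K*D,) descriptor.

    Hypothetical sketch: every local feature, from every frame and every
    spatial location, is softly assigned to K centers, and the residuals
    to each center are summed jointly over space AND time.
    """
    T, H, W, D = features.shape
    x = features.reshape(-1, D)                        # (N, D), N = T*H*W
    resid = x[:, None, :] - centers[None, :, :]        # (N, K, D)
    logits = -alpha * (resid ** 2).sum(axis=-1)        # (N, K)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    assign = np.exp(logits)
    assign /= assign.sum(axis=1, keepdims=True)        # softmax over centers
    v = (assign[:, :, None] * resid).sum(axis=0)       # (K, D)
    v /= np.linalg.norm(v, axis=1, keepdims=True) + 1e-12  # per-center norm
    v = v.ravel()
    return v / (np.linalg.norm(v) + 1e-12)             # global L2 norm

def aggregate_two_stream(rgb_feats, flow_feats, rgb_centers, flow_centers):
    """Aggregate appearance and motion separately, each with its own centers.

    Concatenating the two descriptors is an illustrative choice only; the
    source says the streams get separate representations but not how they
    are combined afterwards.
    """
    rgb_desc = soft_assign_pool(rgb_feats, rgb_centers)
    flow_desc = soft_assign_pool(flow_feats, flow_centers)
    return np.concatenate([rgb_desc, flow_desc])
```

Because the reshape flattens time together with the spatial grid, the descriptor has a fixed size regardless of how many frames the video contains, which is what allows a single whole-video classifier to sit on top of it.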
