
Temporal-Relational CrossTransformers for Few-Shot Action Recognition

Toby Perrett Alessandro Masullo Tilo Burghardt Majid Mirmehdi Dima Damen

Abstract

We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set. Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos, rather than using class averages or single best matches. Video representations are formed from ordered tuples of varying numbers of frames, which allows sub-sequences of actions at different speeds and temporal offsets to be compared. Our proposed Temporal-Relational CrossTransformers (TRX) achieve state-of-the-art results on few-shot splits of Kinetics, Something-Something V2 (SSv2), HMDB51 and UCF101. Importantly, our method outperforms prior work on SSv2 by a wide margin (12%) due to its ability to model temporal relations. A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order relational CrossTransformers.
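The two ideas in the abstract, ordered frame tuples and attention-built class prototypes, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses only the Python standard library, represents frame features as plain lists of floats, and omits the learned query/key/value projections and normalisation that a real CrossTransformer would apply. All function names (`ordered_tuples`, `tuple_features`, `query_specific_prototype`, `class_distance`) are hypothetical and chosen for illustration.

```python
import math
from itertools import combinations


def ordered_tuples(num_frames, cardinality):
    """Return all frame-index tuples of the given size, in temporal order."""
    return list(combinations(range(num_frames), cardinality))


def tuple_features(frames, cardinality):
    """Concatenate per-frame feature vectors for every ordered tuple."""
    return [
        [x for i in idx for x in frames[i]]
        for idx in ordered_tuples(len(frames), cardinality)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_specific_prototype(query_tuple, support_tuples):
    """Attention-weighted average of support tuples for one query tuple.

    Assumption: raw features act as queries, keys and values; no learned
    projections are used in this sketch.
    """
    d = len(query_tuple)
    scores = [
        sum(q * s for q, s in zip(query_tuple, sup)) / math.sqrt(d)
        for sup in support_tuples
    ]
    weights = softmax(scores)
    return [
        sum(w * sup[k] for w, sup in zip(weights, support_tuples))
        for k in range(d)
    ]


def class_distance(query_frames, support_videos, cardinality=2):
    """Mean squared distance between query tuples and their prototypes.

    Tuples from all support videos of the class are pooled, so each query
    tuple can attend to sub-sequences of any support video.
    """
    q_tuples = tuple_features(query_frames, cardinality)
    s_tuples = [
        t for video in support_videos
        for t in tuple_features(video, cardinality)
    ]
    dists = []
    for q in q_tuples:
        proto = query_specific_prototype(q, s_tuples)
        dists.append(sum((a - b) ** 2 for a, b in zip(q, proto)))
    return sum(dists) / len(dists)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    same_order = [[[1.0, 0.0], [0.0, 1.0]]]   # one support video, same frame order
    reversed_order = [[[0.0, 1.0], [1.0, 0.0]]]  # same frames, reversed in time
    print(class_distance(query, same_order))      # 0.0
    print(class_distance(query, reversed_order))  # 4.0
```

In this toy case the two support classes contain identical frames and differ only in frame order, yet the query is at distance 0.0 from the class with matching order and 4.0 from the reversed one. That is the sense in which ordered tuples let the comparison depend on temporal relations rather than on frame content alone.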

Code Repositories

tobyperrett/trx

Official

pytorch

Mentioned in GitHub

tobyperrett/few-shot-action-recognition

pytorch

Mentioned in GitHub

Benchmarks

Benchmark	Methodology	Metrics
few-shot-action-recognition-on-hmdb51	TRX	1:1 Accuracy: 75.6
few-shot-action-recognition-on-kinetics-100	TRX	Accuracy: 85.9
few-shot-action-recognition-on-something	TRX	1:1 Accuracy: 64.6
few-shot-action-recognition-on-ucf101	TRX	1:1 Accuracy: 96.1
