Toby Perrett Alessandro Masullo Tilo Burghardt Majid Mirmehdi Dima Damen

Abstract
We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set. Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos, rather than using class averages or single best matches. Video representations are formed from ordered tuples of varying numbers of frames, which allows sub-sequences of actions at different speeds and temporal offsets to be compared. Our proposed Temporal-Relational CrossTransformers (TRX) achieve state-of-the-art results on few-shot splits of Kinetics, Something-Something V2 (SSv2), HMDB51 and UCF101. Importantly, our method outperforms prior work on SSv2 by a wide margin (12%) due to its ability to model temporal relations. A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order relational CrossTransformers.
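The two ideas in the abstract, ordered frame tuples and attention-built class prototypes, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`ordered_tuples`, `tuple_features`, `query_specific_prototype`, `class_distance`) are hypothetical, frame features are assumed to be plain lists of floats, and the sketch uses raw features with scaled dot-product attention and no learned parameters, whereas a trained model would learn its representations.

```python
import math
from itertools import combinations


def ordered_tuples(num_frames, cardinality):
    """All frame-index tuples of the given size, kept in temporal order."""
    return list(combinations(range(num_frames), cardinality))


def tuple_features(frames, cardinality):
    """Concatenate per-frame feature vectors for every ordered tuple."""
    return [
        [x for i in idx for x in frames[i]]
        for idx in ordered_tuples(len(frames), cardinality)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def query_specific_prototype(query_tuple, support_tuples):
    """Attention-weighted average of support tuples for one query tuple."""
    d = len(query_tuple)
    weights = softmax(
        [dot(query_tuple, s) / math.sqrt(d) for s in support_tuples]
    )
    return [
        sum(w * s[j] for w, s in zip(weights, support_tuples))
        for j in range(d)
    ]


def class_distance(query_frames, support_videos, cardinality=2):
    """Mean squared distance between query tuples and their prototypes.

    Tuples from all support videos of the class are pooled, so each
    query tuple can attend to sub-sequences of any support video.
    """
    q_tuples = tuple_features(query_frames, cardinality)
    s_tuples = [
        t for video in support_videos
        for t in tuple_features(video, cardinality)
    ]
    total = 0.0
    for q in q_tuples:
        p = query_specific_prototype(q, s_tuples)
        total += sum((a - b) ** 2 for a, b in zip(q, p))
    return total / len(q_tuples)


# Toy usage: two classes, one two-frame support video each (made-up features).
query = [[1.0, 0.0], [0.0, 1.0]]
support = {
    "class_a": [[[1.0, 0.0], [0.0, 1.0]]],
    "class_b": [[[0.0, 1.0], [1.0, 0.0]]],
}
predicted = min(support, key=lambda c: class_distance(query, support[c]))
```

Because tuples keep frame order, `class_b` (the same frames in reverse) is a poorer match than `class_a` even though both contain identical frames. Varying `cardinality` (pairs, triples, and so on) corresponds to the "varying numbers of frames" mentioned above; how the per-cardinality results would be combined is left out of this sketch.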
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| few-shot-action-recognition-on-hmdb51 | TRX | 1:1 Accuracy: 75.6 |
| few-shot-action-recognition-on-kinetics-100 | TRX | Accuracy: 85.9 |
| few-shot-action-recognition-on-something | TRX | 1:1 Accuracy: 64.6 |
| few-shot-action-recognition-on-ucf101 | TRX | 1:1 Accuracy: 96.1 |