5 months ago

Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting

Martine Toering; Ioannis Gatopoulos; Maarten Stol; Vincent Tao Hu

Abstract

Instance-level contrastive learning techniques, which rely on data augmentation and a contrastive loss function, have found great success in visual representation learning. However, they are not well suited to exploiting the rich dynamical structure of video, because they operate on many augmented instances. In this paper we propose "Video Cross-Stream Prototypical Contrasting" (ViCC), a novel method that predicts consistent prototype assignments from both RGB and optical flow views, operating on sets of samples. Specifically, we alternate the optimization process: while optimizing one of the streams, all views are mapped to one set of stream prototype vectors. Each assignment is predicted from all views except the one matching the prediction, pushing representations closer to their assigned prototypes. As a result, more efficient video embeddings with ingrained motion information are learned, without the explicit need for optical flow computation during inference. We obtain state-of-the-art results on nearest-neighbour video retrieval and action recognition, outperforming the previous best by +3.2% on UCF101 using the S3D backbone (90.5% Top-1 accuracy), and by +7.2% on UCF101 and +15.1% on HMDB51 using the R(2+1)D backbone.
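
The prediction rule in the abstract (each view's prototype assignment is predicted from every view except itself) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the loss, so a swapped-prediction style cross-entropy over softmax-normalised prototype scores is assumed here. The function names, the temperature value, and the way assignments are passed in are all hypothetical. The alternation between RGB and optical flow streams and the computation of the assignments themselves are not modelled.

```python
import math


def log_softmax(scores, temperature=0.1):
    """Numerically stable log-softmax over one list of prototype scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_norm for s in scaled]


def cross_view_prediction_loss(assignments, scores, temperature=0.1):
    """Mean cross-entropy for predicting each view's assignment from the other views.

    assignments[t]: assignment of view t over K prototypes (non-negative, sums to 1).
    scores[v]:      similarity of view v's embedding to the same K prototypes.
    Both arguments and the temperature are assumptions made for this sketch.
    """
    n_views = len(scores)
    if n_views < 2 or len(assignments) != n_views:
        raise ValueError("need at least two views, with one assignment per view")

    total, pairs = 0.0, 0
    for t in range(n_views):          # view whose assignment is the target
        for v in range(n_views):      # view used to make the prediction
            if v == t:
                continue              # a view never predicts its own assignment
            log_probs = log_softmax(scores[v], temperature)
            total -= sum(q * lp for q, lp in zip(assignments[t], log_probs))
            pairs += 1
    return total / pairs


if __name__ == "__main__":
    # Two views (say, one RGB clip and one flow clip) and two prototypes.
    agree = cross_view_prediction_loss([[1, 0], [1, 0]], [[1.0, 0.0], [1.0, 0.0]])
    clash = cross_view_prediction_loss([[1, 0], [1, 0]], [[0.0, 1.0], [0.0, 1.0]])
    print(agree < clash)
```

In this sketch the loss is small when the other views score highly on the prototype a view was assigned to, and large when they do not, which is one way to read "pushing representations closer to their assigned prototypes".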

Code Repositories

martinetoering/ViCC (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
self-supervised-action-recognition-on-hmdb51 | ViCC (R2+1D; RGB) | Frozen: false; Pre-Training Dataset: UCF101; Top-1 Accuracy: 52.4
self-supervised-action-recognition-on-hmdb51 | ViCC (S3D; R+F) | Frozen: false; Pre-Training Dataset: UCF101; Top-1 Accuracy: 62.2
self-supervised-action-recognition-on-hmdb51 | ViCC (R2+1D; R+F) | Frozen: false; Pre-Training Dataset: UCF101; Top-1 Accuracy: 61.5
self-supervised-action-recognition-on-hmdb51 | ViCC (S3D; RGB) | Frozen: true; Pre-Training Dataset: UCF101; Top-1 Accuracy: 38.5
self-supervised-action-recognition-on-hmdb51-1 | ViCC (R2+1D; RGB) | Pretraining Dataset: UCF101; Top-1 Accuracy: 52.4
self-supervised-action-recognition-on-hmdb51-1 | ViCC (S3D; RGB) | Pretraining Dataset: UCF101; Top-1 Accuracy: 47.9
self-supervised-action-recognition-on-hmdb51-1 | ViCC (S3D; R+F) | Pretraining Dataset: UCF101; Top-1 Accuracy: 62.2
self-supervised-action-recognition-on-ucf101 | ViCC (S3D; R+F) | 3-fold Accuracy: 90.5; Frozen: false; Pre-Training Dataset: UCF101
self-supervised-action-recognition-on-ucf101 | ViCC (S3D; RGB) | 3-fold Accuracy: 72.2; Frozen: true; Pre-Training Dataset: UCF101
self-supervised-action-recognition-on-ucf101 | ViCC (S3D; RGB) | 3-fold Accuracy: 88.8; Frozen: false; Pre-Training Dataset: UCF101
self-supervised-action-recognition-on-ucf101 | ViCC (R2+1D; RGB) | 3-fold Accuracy: 82.8; Frozen: false; Pre-Training Dataset: UCF101
self-supervised-action-recognition-on-ucf101 | ViCC (R2+1D; R+F) | 3-fold Accuracy: 88.8; Frozen: false; Pre-Training Dataset: UCF101
self-supervised-action-recognition-on-ucf101-1 | ViCC (R2+1D; RGB) | 3-fold Accuracy: 82.8; Pretrain: UCF101
self-supervised-action-recognition-on-ucf101-1 | ViCC (R2+1D; R+F) | 3-fold Accuracy: 88.8; Pretrain: UCF101
self-supervised-action-recognition-on-ucf101-1 | ViCC (S3D; R+F) | 3-fold Accuracy: 90.5; Pretrain: UCF101
self-supervised-action-recognition-on-ucf101-1 | ViCC (S3D; RGB) | 3-fold Accuracy: 84.3; Pretrain: UCF101
