COMEDIAN: Self-Supervised Learning and Knowledge Distillation for Action Spotting using Transformers
Denize Julien ; Liashuha Mykola ; Rabarisoa Jaonary ; Orcesi Astrid ; Hérault Romain

Abstract
We present COMEDIAN, a novel pipeline to initialize spatiotemporal transformers for action spotting, which involves self-supervised learning and knowledge distillation. Action spotting is a timestamp-level temporal action detection task. Our pipeline consists of three steps, with two initialization stages. First, we perform self-supervised initialization of a spatial transformer using short videos as input. Additionally, we initialize a temporal transformer that enhances the spatial transformer's outputs with global context through knowledge distillation from a pre-computed feature bank aligned with each short video segment. In the final step, we fine-tune the transformers to the action spotting task. The experiments, conducted on the SoccerNet-v2 dataset, demonstrate state-of-the-art performance and validate the effectiveness of COMEDIAN's pretraining paradigm. Our results highlight several advantages of our pretraining pipeline, including improved performance and faster convergence compared to non-pretrained models.
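To make the second initialization stage more concrete, the following is a minimal sketch of distilling from a pre-computed feature bank, where each student output is pulled toward the bank feature aligned with the same short video segment. The abstract does not state the distillation objective, so the cosine-distance loss used here is a stand-in and not the authors' method; the function name `cosine_distillation_loss` and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine_distillation_loss(student, teacher):
    """Mean (1 - cosine similarity) over paired student/teacher vectors.

    Assumption: a cosine-distance objective stands in for the paper's
    actual distillation loss, which the abstract does not specify.
    Assumes no vector has zero norm.
    """
    if len(student) != len(teacher):
        raise ValueError("student and teacher must have the same number of vectors")
    total = 0.0
    for s, t in zip(student, teacher):
        dot = sum(a * b for a, b in zip(s, t))
        norm_s = math.sqrt(sum(a * a for a in s))
        norm_t = math.sqrt(sum(b * b for b in t))
        total += 1.0 - dot / (norm_s * norm_t)
    return total / len(student)


# Hypothetical toy data: two segments with 3-dimensional features.
# teacher_bank plays the role of the pre-computed feature bank,
# student_out the role of the temporal transformer's outputs.
teacher_bank = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
student_out = [[2.0, 0.0, 0.0], [0.0, 0.0, 1.0]]

loss = cosine_distillation_loss(student_out, teacher_bank)
print(loss)
```

The first segment's student vector points in the same direction as its teacher feature and contributes no loss, while the second is orthogonal to its teacher feature and contributes the full penalty. In a real training loop the loss would be computed on tensors and backpropagated through the student only, with the feature bank kept fixed.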
Benchmarks
| Benchmark | Methodology | Average-mAP | Tight Average-mAP |
|---|---|---|---|
| action-spotting-on-soccernet-v2 | COMEDIAN (ViSwin T ens.) | 77.6 | 73.1 |
| action-spotting-on-soccernet-v2 | COMEDIAN (ViViT T) | 76.1 | 70.7 |
| action-spotting-on-soccernet-v2 | COMEDIAN (ViViT T ens.) | 77.1 | 72.0 |
| action-spotting-on-soccernet-v2 | COMEDIAN (ViSwin T) | 76.6 | 71.6 |