Vanderplaetse Bastien ; Dupont Stéphane

Abstract
In this paper, we propose a study on multi-modal (audio and video) action spotting and classification in soccer videos. Action spotting and classification are the tasks of finding the temporal anchors of events in a video and determining which events they are. This is an important application of general activity understanding. Here, we propose an experimental study on combining audio and video information at different stages of deep neural network architectures. We used the SoccerNet benchmark dataset, which contains annotated events for 500 soccer game videos from the Big Five European leagues. Through this work, we evaluated several ways to integrate the audio stream into video-only architectures. We observed an average absolute improvement of the mean Average Precision (mAP) metric of $7.43\%$ for the action classification task and of $4.19\%$ for the action spotting task.
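To make the idea of combining audio and video "at different stages" concrete, the sketch below contrasts two generic fusion points: early fusion (concatenating per-clip features before a shared classifier) and late fusion (averaging the class probabilities of two separate classifiers). This is an illustration only, not the architectures evaluated in the paper. The feature sizes, the number of classes, the random untrained linear heads, and all function names are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen for illustration; none are taken from the paper.
N_CLIPS, VIDEO_DIM, AUDIO_DIM, N_CLASSES = 4, 512, 128, 3

# Stand-ins for pre-extracted per-clip features from each modality.
video_feats = rng.normal(size=(N_CLIPS, VIDEO_DIM))
audio_feats = rng.normal(size=(N_CLIPS, AUDIO_DIM))


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def linear_head(in_dim, n_classes):
    """Random, untrained linear layer standing in for a trained network head."""
    w = rng.normal(scale=0.01, size=(in_dim, n_classes))
    b = np.zeros(n_classes)
    return lambda x: x @ w + b


def early_fusion(video, audio, head):
    """Concatenate both modalities, then classify the joint feature vector."""
    fused = np.concatenate([video, audio], axis=1)
    return softmax(head(fused))


def late_fusion(video, audio, video_head, audio_head):
    """Classify each modality separately, then average the probabilities."""
    return 0.5 * (softmax(video_head(video)) + softmax(audio_head(audio)))


early_head = linear_head(VIDEO_DIM + AUDIO_DIM, N_CLASSES)
video_head = linear_head(VIDEO_DIM, N_CLASSES)
audio_head = linear_head(AUDIO_DIM, N_CLASSES)

early_probs = early_fusion(video_feats, audio_feats, early_head)
late_probs = late_fusion(video_feats, audio_feats, video_head, audio_head)

print(early_probs.shape, late_probs.shape)
```

Both functions return one probability distribution per clip, so they can be scored with the same metric. Intermediate fusion points (merging hidden representations partway through the network) sit between these two extremes.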
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-spotting-on-soccernet | AudioVid (Vanderplaetse et al.) | Average-mAP: 56.0 |