K R Prajwal; Liliane Momeni; Triantafyllos Afouras; Andrew Zisserman

Abstract
In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods on the challenging LRW, LRS2, and LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
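To make the two-stream idea concrete, below is a minimal, hypothetical PyTorch sketch of a cross-modal attention spotter: frame-level visual features attend to a phoneme-ID encoding of the query keyword, and the model predicts, per frame, whether the keyword is being mouthed. The class name, dimensions, phoneme vocabulary size, and the choice of a standard Transformer decoder for the cross-attention are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a cross-modal keyword spotter (illustrative assumptions throughout).
import torch
import torch.nn as nn


class CrossModalSpotter(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=3, n_phonemes=44):
        super().__init__()
        # Phoneme tokens of the query keyword are embedded into the same
        # dimensionality as the visual frame features.
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        # A Transformer decoder layer provides self-attention over the visual
        # stream plus full cross-attention to the phonetic stream.
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.TransformerDecoder(layer, num_layers=n_layers)
        # Per-frame head: probability that the keyword is spoken at that frame.
        self.frame_head = nn.Linear(d_model, 1)

    def forward(self, visual_feats, phoneme_ids):
        # visual_feats: (B, T, d_model) features from a visual front-end
        # phoneme_ids:  (B, P) phoneme indices of the query keyword
        phon = self.phoneme_emb(phoneme_ids)                     # (B, P, d_model)
        fused = self.cross_attn(tgt=visual_feats, memory=phon)   # (B, T, d_model)
        return torch.sigmoid(self.frame_head(fused)).squeeze(-1) # (B, T)


# Toy usage: one 75-frame clip queried with a 5-phoneme keyword.
model = CrossModalSpotter()
probs = model(torch.randn(1, 75, 256), torch.randint(0, 44, (1, 5)))
print(probs.shape)  # torch.Size([1, 75])
```

The per-frame probabilities give both detection (is the keyword present?) and localization (which frames it spans), matching the temporal-location output described in the abstract.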
Benchmarks
| Benchmark | Methodology | Top-1 Accuracy | Top-5 Accuracy | mAP | mAP IoU@0.5 |
|---|---|---|---|---|---|
| visual-keyword-spotting-on-lrs2 | Transpotter | 65 | 87.1 | 69.2 | 68.3 |
| visual-keyword-spotting-on-lrs3-ted | Transpotter | 52 | 77.1 | 55.4 | 53.6 |
| visual-keyword-spotting-on-lrw | Transpotter | 85.8 | 99.6 | 64.1 | — |