Hussein Noureldien; Gavves Efstratios; Smeulders Arnold W. M.

Abstract
This paper focuses on the temporal aspect of recognizing human activities in videos, an important visual cue that has long been undervalued. We revisit the conventional definition of activity and restrict it to Complex Action: a set of one-actions with a weak temporal pattern that serves a specific purpose. Related works use spatiotemporal 3D convolutions with a fixed kernel size, which is too rigid to capture the variety in temporal extents of complex actions and too short for long-range temporal modeling. In contrast, we use multi-scale temporal convolutions, and we reduce the complexity of 3D convolutions. The outcome is Timeception convolution layers, which reason about minute-long temporal patterns, a factor of 8 longer than the best related works. As a result, Timeception achieves impressive accuracy in recognizing the human activities of Charades, Breakfast Actions, and MultiTHUMOS. Further, we demonstrate that Timeception learns long-range temporal dependencies and tolerates the temporal extents of complex actions.
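To make the idea of multi-scale temporal convolution concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It runs a single 1D feature sequence through temporal kernels of several sizes and returns one output per scale. The function names, the kernel sizes `(3, 5, 7)`, and the uniform averaging weights (standing in for learned weights) are all assumptions made for illustration; an actual layer would operate on multi-channel video feature tensors.

```python
from typing import List, Sequence


def temporal_conv(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """1D temporal convolution with zero padding so output length equals input length.

    Assumes an odd kernel length (hypothetical helper for illustration).
    """
    k = len(kernel)
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(padded[t + i] * kernel[i] for i in range(k))
        for t in range(len(signal))
    ]


def multi_scale_temporal_conv(
    signal: Sequence[float], kernel_sizes: Sequence[int] = (3, 5, 7)
) -> List[List[float]]:
    """Apply one temporal kernel per scale and collect the per-scale outputs.

    Uniform averaging weights are a placeholder for learned weights.
    """
    outputs = []
    for k in kernel_sizes:
        kernel = [1.0 / k] * k
        outputs.append(temporal_conv(signal, kernel))
    return outputs


# A single event at timestep 3 of a 7-step sequence.
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
outputs = multi_scale_temporal_conv(impulse)
# Number of timesteps each scale "sees" the event from.
spans = [sum(1 for v in out if v > 0) for out in outputs]
```

In this toy setting, larger kernels spread the single event over more timesteps, which is the intuition behind covering actions of different temporal extents with several kernel sizes rather than one fixed size.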
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-charades | Timeception (I3D) | mAP: 37.2 |
| action-classification-on-charades | Timeception (R2D) | mAP: 31.6 |
| action-classification-on-charades | Timeception (R3D) | mAP: 41.1 |
| long-video-activity-recognition-on-breakfast | Timeception (I3D-K400-Pretrain-feature) | mAP: 61.82 |
| video-classification-on-breakfast | Timeception | Accuracy (%): 71.3 |