Temporal Convolutional Networks for Action Segmentation and Detection
Colin Lea; Michael D. Flynn; Rene Vidal; Austin Reiter; Gregory D. Hager

Abstract
The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We introduce a new class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns, whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over an order of magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.
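As a concrete illustration of the two architectures, the sketch below contrasts a pooling/upsampling encoder-decoder with a stack of dilated convolutions. This is a minimal PyTorch sketch under stated assumptions, not the paper's implementation: the `EDTCN` and `DilatedTCN` names, channel widths, kernel sizes, and level counts are illustrative choices, and details such as the paper's normalized ReLU activation are omitted.

```python
import torch
import torch.nn as nn

class EDTCN(nn.Module):
    """Encoder-Decoder TCN sketch: temporal convolution + pooling on the
    way down, upsampling + convolution on the way back up. Hyperparameters
    here are illustrative, not the paper's."""
    def __init__(self, in_dim, n_classes, hidden=(64, 96), kernel=25):
        super().__init__()
        pad = kernel // 2  # odd kernel + this padding preserves length
        # Encoder: each stage halves the temporal resolution.
        self.enc1 = nn.Sequential(
            nn.Conv1d(in_dim, hidden[0], kernel, padding=pad),
            nn.ReLU(), nn.MaxPool1d(2))
        self.enc2 = nn.Sequential(
            nn.Conv1d(hidden[0], hidden[1], kernel, padding=pad),
            nn.ReLU(), nn.MaxPool1d(2))
        # Decoder: each stage doubles the temporal resolution back.
        self.dec1 = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv1d(hidden[1], hidden[0], kernel, padding=pad), nn.ReLU())
        self.dec2 = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv1d(hidden[0], hidden[0], kernel, padding=pad), nn.ReLU())
        self.classify = nn.Conv1d(hidden[0], n_classes, 1)  # per-frame logits

    def forward(self, x):            # x: (batch, features, time)
        h = self.enc2(self.enc1(x))
        h = self.dec2(self.dec1(h))
        return self.classify(h)      # (batch, n_classes, time)

class DilatedTCN(nn.Module):
    """Dilated TCN sketch: stacked convolutions with exponentially growing
    dilation enlarge the receptive field without any pooling."""
    def __init__(self, in_dim, n_classes, hidden=64, kernel=3, levels=4):
        super().__init__()
        layers, ch = [], in_dim
        for i in range(levels):
            d = 2 ** i               # dilation 1, 2, 4, 8, ...
            layers += [nn.Conv1d(ch, hidden, kernel, dilation=d,
                                 padding=d * (kernel - 1) // 2),
                       nn.ReLU()]
            ch = hidden
        self.body = nn.Sequential(*layers)
        self.classify = nn.Conv1d(hidden, n_classes, 1)

    def forward(self, x):            # x: (batch, features, time)
        return self.classify(self.body(x))

# Example: 32 frames of 128-d per-frame features, 10 action classes.
x = torch.randn(1, 128, 32)
print(EDTCN(128, 10)(x).shape)       # torch.Size([1, 10, 32])
print(DilatedTCN(128, 10)(x).shape)  # torch.Size([1, 10, 32])
```

Both variants map per-frame features to per-frame class logits. The encoder-decoder reaches long-range context by shrinking and re-expanding the time axis, while the dilated stack keeps full resolution and grows its receptive field exponentially with depth (doubling the dilation at each level).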
Benchmarks

| Benchmark | Methodology | Metrics |
|---|---|---|
| action-segmentation-on-gtea-1 | ED-TCN | Acc: 64.0; Edit: -; F1@10%: 72.2; F1@25%: 69.3; F1@50%: 56.0 |
| skeleton-based-action-recognition-on-varying | TCN | Accuracy: AV I: 43%, AV II: 64%, CS: 56%, CV I: 16%, CV II: 43% |