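Abstract
We present the Video Transformer (VidTr), a video classification model built on separable attention. Compared with the commonly used 3D convolutional networks, VidTr aggregates spatio-temporal information through stacked attention layers and delivers better performance at higher computational efficiency. We first introduce a vanilla video transformer and show that the transformer module can perform spatio-temporal modeling directly from raw pixels, although with heavy memory usage. We then propose VidTr, which reduces memory cost by 3.3× while keeping the same performance. To optimize the model further, we introduce a standard-deviation-based topK attention pooling ($\text{pool}_{\text{topK\_std}}$), which cuts computation by dropping non-informative features along the temporal dimension. VidTr achieves state-of-the-art performance on five commonly used video datasets with lower computational requirements, demonstrating both the efficiency and the effectiveness of the design. Finally, error analysis and visualizations show that VidTr is especially good at predicting actions that require long-term temporal reasoning.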
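The abstract describes the standard-deviation-based topK pooling only at a high level, so the following is a minimal sketch of one plausible reading rather than the authors' implementation. It assumes the score of each temporal position is the population standard deviation of its row in a T × T temporal attention matrix, and that the K highest-scoring positions are kept in their original temporal order. The function name `topk_std_pool`, the list-based data layout, and the absence of a class token are all illustrative assumptions.

```python
from statistics import pstdev


def topk_std_pool(attn, features, k):
    """Keep the k temporal positions whose attention rows vary the most.

    attn     -- T x T temporal attention weights as a list of rows (assumed layout)
    features -- list of T per-position feature vectors
    k        -- number of positions to keep, 1 <= k <= T
    """
    t = len(attn)
    if len(features) != t:
        raise ValueError("attn and features must cover the same T positions")
    if not 1 <= k <= t:
        raise ValueError("k must be between 1 and T")

    # Score each position by the population standard deviation of its row.
    # A near-uniform row (low deviation) is treated as non-informative.
    scores = [pstdev(row) for row in attn]

    # Pick the k highest-scoring positions, then restore temporal order.
    ranked = sorted(range(t), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:k])

    return keep, [features[i] for i in keep]


# Hypothetical toy input: 4 temporal positions, 1-dimensional features.
attn = [
    [0.25, 0.25, 0.25, 0.25],  # uniform row, deviation 0
    [0.70, 0.10, 0.10, 0.10],  # peaked row
    [0.40, 0.30, 0.20, 0.10],  # mildly varying row
    [0.10, 0.10, 0.10, 0.70],  # peaked row
]
features = [[0.0], [1.0], [2.0], [3.0]]

keep, pooled = topk_std_pool(attn, features, k=2)
print(keep, pooled)
```

In this toy input the uniform first row scores lowest and is dropped, which matches the stated intent of discarding non-informative temporal features. Halving the number of temporal positions in this way shrinks the input to every later layer.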
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| action-classification-on-charades | VidTr-L | mAP: 43.5 |
| action-classification-on-charades | En-VidTr-L | mAP: 47.3 |
| action-classification-on-kinetics-400 | En-VidTr-M | Acc@1: 79.7, Acc@5: 94.2 |
| action-classification-on-kinetics-400 | En-VidTr-S | Acc@1: 79.4, Acc@5: 94 |
| action-classification-on-kinetics-400 | En-VidTr-L | Acc@1: 80.5, Acc@5: 94.6 |
| action-classification-on-kinetics-700 | VidTr-M | Top-1 Accuracy: 69.5, Top-5 Accuracy: 88.3 |
| action-classification-on-kinetics-700 | VidTr-L | Top-1 Accuracy: 70.2, Top-5 Accuracy: 89 |
| action-classification-on-kinetics-700 | VidTr-S | Top-1 Accuracy: 67.3, Top-5 Accuracy: 87.7 |
| action-classification-on-kinetics-700 | En-VidTr-L | Top-1 Accuracy: 70.8, Top-5 Accuracy: 89.4 |
| action-recognition-in-videos-on-hmdb-51 | VidTr-L | Average accuracy of 3 splits: 74.4 |
| action-recognition-in-videos-on-something | VidTr-L | Top-1 Accuracy: 60.2 |
| action-recognition-in-videos-on-ucf101 | VidTr-L | 3-fold Accuracy: 96.7 |