Action Classification on Kinetics-600
Evaluation Metrics
Top-1 Accuracy
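Top-1 accuracy counts a clip as correct only when the class with the highest predicted score matches the ground-truth label, and reports the share of correct clips as a percentage. The following is a minimal sketch of that computation, not the evaluation code of any listed model. The function name `top1_accuracy` and the toy scores are illustrative assumptions.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return the percentage of samples whose highest-scoring class equals the label.

    `scores` holds one row of per-class scores per sample (hypothetical input layout);
    `labels` holds the ground-truth class index for each sample.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score is the top-1 prediction.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example with 4 clips and 3 classes (made-up numbers, not benchmark data).
toy_scores = [
    [0.1, 0.7, 0.2],  # predicts class 1
    [0.5, 0.3, 0.2],  # predicts class 0
    [0.2, 0.2, 0.6],  # predicts class 2
    [0.6, 0.3, 0.1],  # predicts class 0
]
toy_labels = [1, 0, 2, 1]
print(top1_accuracy(toy_scores, toy_labels))  # 75.0
```

In the toy example three of the four predictions match their labels, so the function returns 75.0.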
Evaluation Results
Performance of each model on this benchmark
Comparison Table
Model Name | Top-1 Accuracy (%) |
---|---|
d3d-distilled-3d-networks-for-video-action | 79.1 |
space-time-mixing-attention-for-video | 84.5 |
movinets-mobile-video-networks-for-efficient | 82.7 |
videomae-v2-scaling-video-masked-autoencoders | 88.8 |
movinets-mobile-video-networks-for-efficient | 77.5 |
perf-net-pose-empowered-rgb-flow-net | 82.0 |
mplug-2-a-modularized-multi-modal-foundation | 89.8 |
florence-a-new-foundation-model-for-computer | 87.8 |
movinets-mobile-video-networks-for-efficient | 83.5 |
rethinking-spatiotemporal-feature-learning | 76.6 |
merlot-reserve-neural-script-knowledge | 89.7 |
uniformer-unified-transformer-for-efficient | 84.8 |
learning-spatio-temporal-representation-with-3 | 75 |
tokenlearner-what-can-8-learned-tokens-do-for | 86.3 |
slowfast-networks-for-video-recognition | 79.9 |
unmasked-teacher-towards-training-efficient | 90.5 |
slowfast-networks-for-video-recognition | 81.8 |
eva-exploring-the-limits-of-masked-visual | 89.8 |
a-short-note-about-kinetics-600 | 73.6 |
rethinking-video-vits-sparse-video-tubes-for | 91.5 |
2103-15691 | 83.0 |
internvideo-general-video-foundation-models | 91.3 |
rethinking-video-vits-sparse-video-tubes-for | 90.9 |
coca-contrastive-captioners-are-image-text | 89.4 |
movinets-mobile-video-networks-for-efficient | 76.0 |
learning-spatio-temporal-representation-with-3 | 81.5 |
revisiting-3d-resnets-for-video-recognition | 83.1 |
co-training-transformer-with-videos-and | 86.8 |
improved-multiscale-vision-transformers-for | 87.9 |
learning-spatio-temporal-representation-with-3 | 83.1 |
movinets-mobile-video-networks-for-efficient | 81.2 |
improved-multiscale-vision-transformers-for | - |
improved-multiscale-vision-transformers-for | 85.5 |
slowfast-networks-for-video-recognition | 81.1 |
movinets-mobile-video-networks-for-efficient | 84.3 |
expanding-language-image-pretrained-models | 88.3 |
d3d-distilled-3d-networks-for-video-action | 77.9 |
slowfast-networks-for-video-recognition | 80.4 |
movinets-mobile-video-networks-for-efficient | 71.5 |
multiscale-vision-transformers | 82.1 |
slowfast-networks-for-video-recognition | 78.8 |
internvideo2-scaling-video-foundation-models | 91.6 |
videomae-v2-scaling-video-masked-autoencoders | 89.9 |
coca-contrastive-captioners-are-image-text | 88.5 |
merlot-reserve-neural-script-knowledge | 91.1 |
2103-15691 | 85.8 |
hiera-a-hierarchical-vision-transformer | 88.8 |
uniformerv2-spatiotemporal-learning-by-arming | 90.1 |
2103-15691 | 84.3 |
multiview-transformers-for-video-recognition | 90.3 |
vatt-transformers-for-multimodal-self | 83.6 |
co-training-transformer-with-videos-and | 87.9 |
multiscale-vision-transformers | 83.8 |
movinets-mobile-video-networks-for-efficient | 80.8 |
merlot-reserve-neural-script-knowledge | 89.4 |
video-swin-transformer | 86.1 |
multiscale-vision-transformers | 83.4 |
rethinking-spatiotemporal-feature-learning | 78.6 |
internvideo2-scaling-video-foundation-models | 91.9 |
rethinking-spatiotemporal-feature-learning | 69.7 |
video-swin-transformer | 84.0 |
rethinking-video-vits-sparse-video-tubes-for | 91.8 |
masked-feature-prediction-for-self-supervised | 88.3 |
merlot-reserve-neural-script-knowledge | 88.1 |
improved-multiscale-vision-transformers-for | - |