InternVideo: General Video Foundation Models via Generative and Discriminative Learning
Yi Wang; Kunchang Li; Yizhuo Li; Yinan He; Bingkun Huang; Zhiyu Zhao; Hongjie Zhang; Jilan Xu; Yi Liu; Zun Wang; Sen Xing; Guo Chen; Junting Pan; Jiashuo Yu; Yali Wang; Limin Wang; Yu Qiao

Abstract
Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus only on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill this gap, we present InternVideo, a family of general video foundation models that takes advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as its pretraining objectives, and selectively coordinates the video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets spanning a wide range of tasks, including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our methods obtain 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. These results demonstrate the generality of InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .
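The abstract names video-language contrastive learning as one of the two pretraining objectives. As a rough illustration only, the sketch below computes a symmetric InfoNCE-style loss over a small batch of paired video and text embeddings. It is not taken from the InternVideo code: the function names, the temperature value of 0.07, and the assumption that row `i` of each input forms a matched pair are all hypothetical choices made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Average of video-to-text and text-to-video cross-entropy losses.

    Assumes video_embs[i] and text_embs[i] are a matched pair and every
    other combination in the batch is a negative. The temperature is an
    illustrative default, not a value reported in the source.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    n = len(v)
    # Cosine similarities scaled by the temperature.
    sim = [
        [sum(a * b for a, b in zip(v[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Video-to-text: each row should put its mass on the diagonal entry.
    v2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Text-to-video: same computation on the transposed matrix.
    sim_t = [list(col) for col in zip(*sim)]
    t2v = -sum(log_softmax(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (v2t + t2v)


videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_contrastive_loss(videos, aligned_texts)
loss_swapped = symmetric_contrastive_loss(videos, swapped_texts)
```

With these toy inputs, the correctly paired batch yields a much smaller loss than the batch whose captions are swapped, which is the behavior the objective is meant to reward.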
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | InternVideo | Acc@1: 91.1 |
| action-classification-on-kinetics-600 | InternVideo-T | Top-1 Accuracy: 91.3 |
| action-classification-on-kinetics-700 | InternVideo-T | Top-1 Accuracy: 84.0 |
| action-recognition-in-videos-on-something | InternVideo | Top-1 Accuracy: 77.2 |
| action-recognition-in-videos-on-something-1 | InternVideo | Top-1 Accuracy: 70.0 |
| action-recognition-on-ava-v2-2 | InternVideo | mAP: 41.01 |
| open-set-action-recognition-on-ucf-hmdb | InternVideo | AUROC: 85.48 |
| open-set-action-recognition-on-ucf101-mitv2 | InternVideo | AUROC: 91.85 |
| spatio-temporal-action-localization-on-ava | InternVideo | val mAP: 41.01 |
| temporal-action-localization-on-activitynet | InternVideo | mAP: 39.00 |
| temporal-action-localization-on-fineaction | InternVideo | mAP: 17.57 |
| temporal-action-localization-on-hacs | InternVideo | Average-mAP: 41.55 |
| temporal-action-localization-on-thumos14 | ActionFormer (InternVideo features) | Avg mAP (0.3:0.7): 71.58 |
| video-question-answering-on-situated | InternVideo | Average Accuracy: 58.7 |
| video-retrieval-on-activitynet | InternVideo | text-to-video R@1: 62.2; video-to-text R@1: 62.8 |
| video-retrieval-on-didemo | InternVideo | text-to-video R@1: 57.9; video-to-text R@1: 59.1 |
| video-retrieval-on-lsmdc | InternVideo | text-to-video R@1: 34.0; video-to-text R@1: 34.9 |
| video-retrieval-on-msr-vtt | InternVideo | text-to-video R@1: 55.2; video-to-text R@1: 57.9 |
| video-retrieval-on-msvd | InternVideo | text-to-video R@1: 58.4; video-to-text R@1: 76.3 |
| video-retrieval-on-vatex | InternVideo | text-to-video R@1: 71.1; video-to-text R@1: 87.2 |
| visual-question-answering-on-msrvtt-qa-1 | InternVideo | Accuracy: 0.471 |
| visual-question-answering-on-msvd-qa-1 | InternVideo | Accuracy: 0.555 |
| visual-question-answering-on-tgif-qa | InternVideo | Accuracy: 0.722 |
| zero-shot-video-question-answer-on-egoschema-1 | InternVideo | Accuracy: 32.1 |
| zero-shot-video-question-answer-on-star | InternVideo | Accuracy: 41.6 |
| zero-shot-video-question-answer-on-tvqa | InternVideo (no speech) | Accuracy: 35.9 |
| zero-shot-video-retrieval-on-activitynet | InternVideo | text-to-video R@1: 30.7; video-to-text R@1: 31.4 |
| zero-shot-video-retrieval-on-didemo | InternVideo | text-to-video R@1: 31.5; text-to-video R@5: 57.6; text-to-video R@10: 68.2; video-to-text R@1: 33.5; video-to-text R@5: 60.3; video-to-text R@10: 71.1 |
| zero-shot-video-retrieval-on-lsmdc | InternVideo | text-to-video R@1: 17.6; text-to-video R@5: 32.4; text-to-video R@10: 40.2; video-to-text R@1: 13.2; video-to-text R@5: 27.8; video-to-text R@10: 34.9 |
| zero-shot-video-retrieval-on-msr-vtt | InternVideo | text-to-video R@1: 40.7; video-to-text R@1: 39.6 |
| zero-shot-video-retrieval-on-msvd | InternVideo | text-to-video R@1: 43.4; video-to-text R@1: 67.6 |
| zero-shot-video-retrieval-on-vatex | InternVideo | text-to-video R@1: 49.5; video-to-text R@1: 69.5 |
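
The retrieval rows above report R@1, R@5, and R@10 in both directions. As a reading aid, the sketch below shows one common way such recall values are computed from a query-by-candidate similarity matrix. It is a simplified illustration, not the evaluation code behind these numbers: the helper name `recall_at_k` and the toy matrix are hypothetical, and it assumes exactly one correct candidate per query (the one at the same index), which does not hold for every benchmark protocol.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose true match ranks in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    Assumes the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank of the true match = number of candidates scoring strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy text-to-video similarity matrix: rows are text queries, columns are videos.
text_to_video = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.1, 0.7],
]
# Video-to-text retrieval uses the transposed matrix.
video_to_text = [list(col) for col in zip(*text_to_video)]

t2v_r1 = recall_at_k(text_to_video, 1)
t2v_r2 = recall_at_k(text_to_video, 2)
v2t_r1 = recall_at_k(video_to_text, 1)
```

In this toy case the second text query ranks the wrong video first, so text-to-video R@1 is lower than video-to-text R@1. This shows how the two directions can differ on the same similarity matrix, as they do in several rows of the table.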