Microsoft’s DVD AI: Efficiently Exploring Long Videos with LLMs
Microsoft has introduced an intelligent agent called Deep Video Discovery (DVD). The agent splits a long video into shorter segments and treats the segmented video as an environment to explore. Using the reasoning capabilities of a large language model (LLM), DVD works autonomously: it analyzes the question, plans a strategy, and selects tools with specific parameters to extract information from that environment step by step until it can answer a complex question. With OpenAI's o3 reasoning model, DVD reached 74.2% accuracy on the challenging LVBench benchmark, significantly outperforming previous methods.

To promote further research and development, Microsoft plans to release the work as open source in the form of an MCP Server. The project reflects the growing importance of efficient video analysis and shows how LLMs can drive autonomous problem-solving in complex environments. Possible applications range from content moderation to advanced video search and summarization.
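The loop described above (segment the video, then let a model repeatedly pick a tool and its parameters until it can answer) can be illustrated with a minimal sketch. This is not Microsoft's implementation: the `Segment` class, the `search` and `inspect` tools, the text captions, and the `scripted_policy` function that stands in for the LLM are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Segment:
    index: int
    start: float   # seconds
    end: float     # seconds
    caption: str   # hypothetical text description of the segment


def split_video(duration: float, segment_length: float, captions: List[str]) -> List[Segment]:
    """Cut [0, duration) into fixed-length segments, pairing each with a caption."""
    segments = []
    start, i = 0.0, 0
    while start < duration:
        end = min(start + segment_length, duration)
        caption = captions[i] if i < len(captions) else ""
        segments.append(Segment(i, start, end, caption))
        start, i = end, i + 1
    return segments


# Two toy tools the agent can call (assumed for this sketch).
def search_segments(segments: List[Segment], keyword: str) -> List[int]:
    """Return indices of segments whose caption mentions the keyword."""
    return [s.index for s in segments if keyword.lower() in s.caption.lower()]


def inspect_segment(segments: List[Segment], index: int) -> str:
    """Return a time-stamped description of one segment."""
    s = segments[index]
    return f"[{s.start:.0f}s-{s.end:.0f}s] {s.caption}"


Action = Tuple[str, dict]  # (tool name, tool parameters)


def run_agent(question: str, segments: List[Segment],
              policy: Callable[[str, list], Action], max_steps: int = 5) -> str:
    """Ask the policy for an action, run the tool, record the observation, repeat."""
    tools = {"search": search_segments, "inspect": inspect_segment}
    history = []
    for _ in range(max_steps):
        name, params = policy(question, history)
        if name == "answer":
            return params["text"]
        observation = tools[name](segments, **params)
        history.append(((name, params), observation))
    return "no answer within step budget"


def scripted_policy(question: str, history: list) -> Action:
    """Stand-in for the LLM: search, inspect the first hit, then answer."""
    if not history:
        return ("search", {"keyword": "goal"})
    (last_name, _), last_obs = history[-1]
    if last_name == "search":
        if not last_obs:
            return ("answer", {"text": "not found"})
        return ("inspect", {"index": last_obs[0]})
    return ("answer", {"text": last_obs})


segments = split_video(25.0, 10.0, ["kickoff", "first goal scored", "crowd celebrates"])
result = run_agent("When is the first goal?", segments, scripted_policy)
print(result)
```

In a real system the policy would be an LLM call that reads the question and the history of observations, and the tools would operate on actual video content rather than short captions. The step budget keeps the loop bounded if the policy never produces an answer.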