
OpenAI's o3 Leads in Scientific Question-Answering, Outperforming Gemini and DeepSeek in New Benchmark

6 days ago

OpenAI's o3 has topped a new AI league table for answering scientific questions, outperforming other prominent models such as Gemini and DeepSeek. The ranking comes from SciArena, a benchmarking platform launched last week by the Allen Institute for Artificial Intelligence (Ai2) in Seattle, Washington. SciArena evaluated 23 large language models (LLMs) on their responses to scientific queries, with answers rated by 102 researchers. More than 13,000 votes were cast, and o3 ranked first in all four categories: natural sciences, health care, engineering, and humanities and social science. DeepSeek-R1, developed by DeepSeek in Hangzhou, China, placed second in natural sciences and fourth in engineering. Google’s Gemini-2.5-Pro ranked third in natural sciences and fifth in both engineering and health care.

Arman Cohan, a research scientist at Ai2, suggests that o3's detailed citations and technically nuanced responses may contribute to its strong performance. However, pinpointing why models perform differently is difficult because most AI systems are proprietary. Differences in training data and optimization goals could play a role, he notes. SciArena is one of the first platforms to rank AI models on scientific tasks using crowdsourced feedback. "SciArena is a positive effort that motivates a careful evaluation of LLM-assisted literature tasks," says Rahul Shome, a robotics and AI researcher at the Australian National University in Canberra.

To compile the rankings, SciArena invited researchers to submit scientific questions. Each user received answers from two randomly selected models, which drew their references from Semantic Scholar, an AI-powered research tool also built by Ai2. Users then voted on the quality of the responses, and only votes from verified users counted toward the leaderboard.
The platform is now publicly available, allowing anyone to ask research questions for free and participate in the evaluation process. Jonathan Kummerfeld, an AI researcher at the University of Sydney in Australia, highlights the potential impact of such a platform on the scientific community. "The ability to question LLMs on science topics and have confidence in the answers will help researchers stay current with the latest literature in their field," he says. This feature can assist scientists in discovering relevant work they might otherwise have overlooked, enhancing their research and productivity.
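
The article does not say how SciArena converts votes into rankings. As a rough illustration only, pairwise preference votes like those described above are often aggregated with an Elo-style rating. The sketch below assumes each vote names a preferred model and a less-preferred one; the model names, the starting rating of 1000, and the K-factor of 32 are all hypothetical and are not taken from SciArena.

```python
from collections import defaultdict


def elo_leaderboard(votes, k=32.0, base=1000.0):
    """Rank models from (winner, loser) vote pairs with an Elo-style update.

    Illustrative sketch only: not SciArena's actual method.
    """
    ratings = defaultdict(lambda: base)  # every model starts at the same rating
    for winner, loser in votes:
        # Probability the winner was expected to win, given current ratings.
        expected_win = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        # An unexpected win moves the ratings more than an expected one.
        delta = k * (1.0 - expected_win)
        ratings[winner] += delta
        ratings[loser] -= delta
    # Highest rating first.
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


# Hypothetical votes: each tuple is (preferred model, other model).
votes = [
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_c"),
]
leaderboard = elo_leaderboard(votes)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Because Elo updates depend on the order in which votes arrive, a leaderboard built this way can shift slightly if the same votes are processed in a different sequence; order-independent alternatives such as fitting a Bradley-Terry model exist for that reason.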
