As artificial intelligence continues to evolve, the development of AI science agents is gaining traction. However, questions linger about how effective these agents are at real scientific work. Two benchmarks created by the Allen Institute for Artificial Intelligence (Ai2), ScienceWorld and DiscoveryWorld, are designed to address these concerns by evaluating agents’ abilities to conduct scientific experiments and investigations. Launched in 2022 and 2024, respectively, the benchmarks aim to quantify how much progress AI models have made on hands-on scientific tasks.
In 2022, leading AI models demonstrated impressive scores on multiple-choice grade-school science exams. Yet, when tasked with executing experiments in controlled virtual environments like ScienceWorld, their performance plummeted to below 10%. This stark contrast highlighted the gap between theoretical knowledge and practical application of scientific principles—a disparity that continues to challenge AI developers.
Fast forward to early 2025, and the landscape has shifted. Top models now score in the low 80s on ScienceWorld, indicating significant progress but still falling short of mastering a typical 4th-grade science curriculum. On the more complex DiscoveryWorld, which requires agents to design and execute their own scientific investigations, elite models complete only about 20% of tasks at the higher difficulty level. By comparison, human scientists with advanced degrees succeed on these tasks around 70% of the time.
“So many folks are jumping on the science agent bandwagon and releasing agents,” says Ai2 researcher Peter Jansen, who spearheaded the development of these benchmarks. “But if the best systems a year ago couldn’t even solve most of the easy problems in DiscoveryWorld, how likely is it that they’re much better today?”
The Challenge of End-to-End Scientific Discovery
DiscoveryWorld simulates end-to-end scientific investigations in a fictional setting: a hypothetical space colony on Planet X. The benchmark comprises 120 challenge tasks across eight scientific disciplines, including proteomics and rocket science. Each task requires forming hypotheses, designing experiments, and analyzing results, often over hundreds of in-game actions.
What sets DiscoveryWorld apart is its evaluation criteria. It assesses not just whether an agent solves a task but also whether it adheres to scientific protocols and demonstrates a genuine understanding of its findings. This level of scrutiny distinguishes insightful conclusions from mere luck, adding a layer of complexity to the assessment process.
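To give a rough sense of what a long-horizon benchmark of this kind asks of an agent, the sketch below shows a generic observe-think-act loop against an environment that tracks separate completion, process, and knowledge scores, mirroring the evaluation criteria described above. The `DiscoveryEnv` class and `propose_action` function are hypothetical stand-ins, not the actual DiscoveryWorld API; they only convey the shape of the problem.

```python
# Hypothetical sketch of a long-horizon science-agent loop.
# "DiscoveryEnv" and "propose_action" are illustrative stand-ins,
# not the real DiscoveryWorld API.
from dataclasses import dataclass
import random

@dataclass
class DiscoveryEnv:
    """Toy environment that hands out observations and three scores."""
    max_steps: int = 300          # tasks often take hundreds of actions
    steps: int = 0
    completion: float = 0.0       # did the agent finish the task?
    process: float = 0.0          # did it follow sound scientific procedure?
    knowledge: float = 0.0        # can it explain what it found?

    def reset(self) -> str:
        self.steps = 0
        return "You arrive at the field site on Planet X."

    def step(self, action: str):
        self.steps += 1
        # A real benchmark would score the specific action taken; here we
        # nudge the scores randomly just to keep the example runnable.
        self.completion = min(1.0, self.completion + random.random() * 0.01)
        self.process = min(1.0, self.process + random.random() * 0.01)
        self.knowledge = min(1.0, self.knowledge + random.random() * 0.005)
        done = self.steps >= self.max_steps
        return f"Result of '{action}'.", done

def propose_action(observation: str) -> str:
    """Placeholder for a model call that maps an observation to an action."""
    return "measure soil sample"

env = DiscoveryEnv()
obs = env.reset()
done = False
while not done:
    obs, done = env.step(propose_action(obs))

print(f"completion={env.completion:.2f} process={env.process:.2f} "
      f"knowledge={env.knowledge:.2f}")
```

The point of separating the three scores is exactly the distinction the benchmark draws: an agent can stumble into the right answer (high completion) while scoring poorly on procedure and on explaining its findings.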
Jansen notes that while practicing scientists routinely solve problems in DiscoveryWorld, leading AI agents still struggle, failing roughly 80% of the tasks. “Knowing what a concept is and being able to apply it are different things entirely,” he explains, emphasizing the distinction between theoretical knowledge and practical application.
The growing interest in DiscoveryWorld is evidenced by its citation in nearly 80 academic papers and coverage in outlets like New Scientist. Jansen anticipates that as AI models continue to improve, benchmarks like DiscoveryWorld will gain prominence. “With models at their current price-to-performance ratio, I’d argue there’s never been a better time to test whether your agent can solve long-horizon scientific discovery tasks,” he asserts.
In contrast, ScienceWorld serves as a foundational benchmark, focusing on the basics of scientific inquiry. It places agents in a simulated world of ten interconnected locations that mimic real laboratory conditions. Agents take on 30 different task types, conducting experiments that mirror classic discoveries found in today’s science textbooks.
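Benchmarks like this are typically reported as an average over many task types and variations rather than a single episode. The snippet below is a minimal, hypothetical harness illustrating that aggregation; `load_task` and `run_episode` are placeholder helpers, not the real ScienceWorld interface, and an actual evaluation would swap in the benchmark’s own environment bindings and an agent.

```python
# Illustrative harness: average an agent's score across many task types.
# "load_task" and "run_episode" are hypothetical helpers, not the real
# ScienceWorld interface.
from statistics import mean

TASK_TYPES = [f"task-{i:02d}" for i in range(30)]   # ScienceWorld defines 30

def load_task(name: str) -> dict:
    """Stand-in for loading one simulated experiment (e.g. boiling water)."""
    return {"name": name}

def run_episode(task: dict) -> float:
    """Stand-in for an agent playing one episode; returns a score in [0, 1]."""
    return 0.0   # an agent that never acts sensibly earns nothing

scores = [run_episode(load_task(name)) for name in TASK_TYPES]
print(f"mean score across {len(TASK_TYPES)} task types: {mean(scores):.2%}")
```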
Despite improvements since its inception, the gap between theoretical understanding and practical execution remains evident. When ScienceWorld debuted, models that excelled in traditional science exams faced failure rates exceeding 90% when executing actual experiments. And although a recent benchmark suite, TALES, reported ScienceWorld scores in the low 80s for leading models, the gap has yet to close.
“We hope that in the near future, science agents will help treat diseases, create new materials, and generate other important discoveries,” Jansen remarks. He emphasizes that benchmarks like DiscoveryWorld and ScienceWorld are essential for gauging whether AI systems can navigate the complexities of scientific inquiry. “If an agent flunks basic science, what hope does it have of curing cancer?” he questions.
As AI continues to make strides, the significance of benchmarks like DiscoveryWorld and ScienceWorld cannot be overstated. They not only measure the capabilities of current AI models but also guide future developments in the field, enabling a clearer path toward effective scientific discovery through artificial intelligence.