
AI Science Agents Face Benchmark Challenges: DiscoveryWorld Scores at 20%, ScienceWorld at 80%

AI science agents struggle with practical tasks, scoring 20% in DiscoveryWorld and 80% in ScienceWorld, highlighting gaps in their scientific capabilities.

As artificial intelligence continues to evolve, the development of AI science agents is gaining traction. However, questions linger about their effectiveness in actual scientific tasks. Recent benchmarks created by the Allen Institute for Artificial Intelligence (Ai2), namely ScienceWorld and DiscoveryWorld, are designed to address these concerns by evaluating agents’ abilities to conduct scientific experiments and investigations. Launched in 2022 and 2024, respectively, these benchmarks aim to quantify the progress of AI models in performing scientific tasks.

In 2022, leading AI models demonstrated impressive scores on multiple-choice grade-school science exams. Yet, when tasked with executing experiments in controlled virtual environments like ScienceWorld, their performance plummeted to below 10%. This stark contrast highlighted the gap between theoretical knowledge and practical application of scientific principles—a disparity that continues to challenge AI developers.

Fast forward to early 2025, and the landscape has shifted. Top models now achieve scores in the low 80s on ScienceWorld, indicating significant progress but still falling short of mastering a typical 4th-grade science curriculum. On the more complex DiscoveryWorld, which requires agents to design and execute their own scientific investigations, elite models complete only about 20% of tasks at the higher difficulty level. By comparison, human scientists with advanced degrees solve these challenges around 70% of the time.

“So many folks are jumping on the science agent bandwagon and releasing agents,” says Ai2 researcher Peter Jansen, who spearheaded the development of these benchmarks. “But if the best systems a year ago couldn’t even solve most of the easy problems in DiscoveryWorld, how likely is it that they’re much better today?”

The Challenge of End-to-End Scientific Discovery

DiscoveryWorld is unique in its design, simulating end-to-end scientific investigations in a fictional setting. Set on a hypothetical space colony, Planet X, the benchmark comprises 120 challenge tasks across eight scientific disciplines, including proteomics and rocket science. Each task is crafted to require forming hypotheses, designing experiments, and analyzing results, often involving hundreds of in-game actions.

What sets DiscoveryWorld apart is its evaluation criteria. It assesses not just whether an agent solves a task but also whether it adheres to scientific protocols and demonstrates a genuine understanding of its findings. This level of scrutiny distinguishes insightful conclusions from mere luck, adding a layer of complexity to the assessment process.

Jansen notes that while practicing scientists routinely solve problems in DiscoveryWorld, leading AI agents still struggle, failing roughly 80% of the tasks. “Knowing what a concept is and being able to apply it are different things entirely,” he explains, emphasizing the distinction between theoretical knowledge and practical application.

The growing interest in DiscoveryWorld is evidenced by its citation in nearly 80 academic papers and coverage in outlets like New Scientist. Jansen anticipates that as AI models continue to improve, benchmarks like DiscoveryWorld will gain prominence. “With models at their current price-to-performance ratio, I’d argue there’s never been a better time to test whether your agent can solve long-horizon scientific discovery tasks,” he asserts.

In contrast, ScienceWorld serves as a foundational benchmark, focusing on the basics of scientific inquiry. It places agents in a simulated world of ten interconnected locations modeled on real laboratory conditions. Agents engage in 30 different task types, conducting experiments that mirror classic discoveries found in today’s science textbooks.

Despite improvements since its inception, the gap between theoretical understanding and practical execution remains evident. When ScienceWorld debuted, models that excelled in traditional science exams faced failure rates exceeding 90% in executing actual experiments. Although a recent benchmark suite, TALES, reported scores in the low 80s for leading models, the challenges persist.

“We hope that in the near future, science agents will help treat diseases, create new materials, and generate other important discoveries,” Jansen remarks. He emphasizes that benchmarks like DiscoveryWorld and ScienceWorld are essential for gauging whether AI systems can navigate the complexities of scientific inquiry. “If an agent flunks basic science, what hope does it have of curing cancer?” he questions.

As AI continues to make strides, the significance of benchmarks like DiscoveryWorld and ScienceWorld cannot be overstated. They not only measure the capabilities of current AI models but also guide future developments in the field, enabling a clearer path toward effective scientific discovery through artificial intelligence.

Written by the AiPressa Staff
