As artificial intelligence continues to evolve, the development of AI science agents is gaining traction. However, questions linger about how effective these agents are at real scientific work. Two benchmarks created by the Allen Institute for Artificial Intelligence (Ai2), ScienceWorld and DiscoveryWorld, are designed to address these concerns by evaluating agents’ abilities to conduct scientific experiments and investigations. Launched in 2022 and 2024, respectively, the benchmarks aim to quantify how much progress AI models have made on hands-on scientific tasks.
In 2022, leading AI models demonstrated impressive scores on multiple-choice grade-school science exams. Yet, when tasked with executing experiments in controlled virtual environments like ScienceWorld, their performance plummeted to below 10%. This stark contrast highlighted the gap between theoretical knowledge and practical application of scientific principles—a disparity that continues to challenge AI developers.
Fast forward to early 2025, and the landscape has shifted. Top models now score in the low 80s on ScienceWorld, indicating significant progress but still falling short of mastering a typical 4th-grade science curriculum. On the more complex DiscoveryWorld, which requires agents to design and execute their own scientific investigations, elite models complete only about 20% of tasks at the higher difficulty level. By comparison, human scientists with advanced degrees succeed on these tasks around 70% of the time.
“So many folks are jumping on the science agent bandwagon and releasing agents,” says Ai2 researcher Peter Jansen, who spearheaded the development of these benchmarks. “But if the best systems a year ago couldn’t even solve most of the easy problems in DiscoveryWorld, how likely is it that they’re much better today?”
The Challenge of End-to-End Scientific Discovery
DiscoveryWorld simulates end-to-end scientific investigations in a fictional setting: a hypothetical space colony on Planet X. The benchmark comprises 120 challenge tasks across eight scientific disciplines, including proteomics and rocket science. Each task requires forming hypotheses, designing experiments, and analyzing results, often over hundreds of in-game actions.
What sets DiscoveryWorld apart is its evaluation criteria. It assesses not just whether an agent solves a task but also whether it adheres to scientific protocols and demonstrates a genuine understanding of its findings. This level of scrutiny distinguishes insightful conclusions from mere luck, adding a layer of complexity to the assessment process.
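To give a rough sense of what a long-horizon benchmark of this kind asks of an agent, the sketch below shows a generic observe-think-act loop against an environment that tracks separate completion, process, and knowledge scores, mirroring the evaluation criteria described above. The `DiscoveryEnv` class and `propose_action` function are hypothetical stand-ins, not the actual DiscoveryWorld API; they only convey the shape of the problem.

```python
# Hypothetical sketch of a long-horizon science-agent loop.
# "DiscoveryEnv" and "propose_action" are illustrative stand-ins,
# not the real DiscoveryWorld API.
from dataclasses import dataclass
import random

@dataclass
class DiscoveryEnv:
    """Toy environment that hands out observations and three scores."""
    max_steps: int = 300          # tasks often take hundreds of actions
    steps: int = 0
    completion: float = 0.0       # did the agent finish the task?
    process: float = 0.0          # did it follow sound scientific procedure?
    knowledge: float = 0.0        # can it explain what it found?

    def reset(self) -> str:
        self.steps = 0
        return "You arrive at the field site on Planet X."

    def step(self, action: str):
        self.steps += 1
        # A real benchmark would score the specific action taken; here we
        # nudge the scores randomly just to keep the example runnable.
        self.completion = min(1.0, self.completion + random.random() * 0.01)
        self.process = min(1.0, self.process + random.random() * 0.01)
        self.knowledge = min(1.0, self.knowledge + random.random() * 0.005)
        done = self.steps >= self.max_steps
        return f"Result of '{action}'.", done

def propose_action(observation: str) -> str:
    """Placeholder for a model call that maps an observation to an action."""
    return "measure soil sample"

env = DiscoveryEnv()
obs = env.reset()
done = False
while not done:
    obs, done = env.step(propose_action(obs))

print(f"completion={env.completion:.2f} process={env.process:.2f} "
      f"knowledge={env.knowledge:.2f}")
```

The point of separating the three scores is exactly the distinction the benchmark draws: an agent can stumble into the right answer (high completion) while scoring poorly on procedure and on explaining its findings.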
Jansen notes that while practicing scientists routinely solve problems in DiscoveryWorld, leading AI agents still struggle, failing roughly 80% of the tasks. “Knowing what a concept is and being able to apply it are different things entirely,” he explains, emphasizing the distinction between theoretical knowledge and practical application.
The growing interest in DiscoveryWorld is evidenced by its citation in nearly 80 academic papers and coverage in outlets like New Scientist. Jansen anticipates that as AI models continue to improve, benchmarks like DiscoveryWorld will gain prominence. “With models at their current price-to-performance ratio, I’d argue there’s never been a better time to test whether your agent can solve long-horizon scientific discovery tasks,” he asserts.
In contrast, ScienceWorld serves as a foundational benchmark, focusing on the basics of scientific inquiry. It places agents in a simulated world of ten interconnected locations that mimic real laboratory conditions. Agents take on 30 different task types, conducting experiments that mirror classic discoveries found in today’s science textbooks.
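Benchmarks like this are typically reported as an average over many task types and variations rather than a single episode. The snippet below is a minimal, hypothetical harness illustrating that aggregation; `load_task` and `run_episode` are placeholder helpers, not the real ScienceWorld interface, and an actual evaluation would swap in the benchmark’s own environment bindings and an agent.

```python
# Illustrative harness: average an agent's score across many task types.
# "load_task" and "run_episode" are hypothetical helpers, not the real
# ScienceWorld interface.
from statistics import mean

TASK_TYPES = [f"task-{i:02d}" for i in range(30)]   # ScienceWorld defines 30

def load_task(name: str) -> dict:
    """Stand-in for loading one simulated experiment (e.g. boiling water)."""
    return {"name": name}

def run_episode(task: dict) -> float:
    """Stand-in for an agent playing one episode; returns a score in [0, 1]."""
    return 0.0   # an agent that never acts sensibly earns nothing

scores = [run_episode(load_task(name)) for name in TASK_TYPES]
print(f"mean score across {len(TASK_TYPES)} task types: {mean(scores):.2%}")
```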
Despite improvements since its inception, the gap between theoretical understanding and practical execution remains evident. When ScienceWorld debuted, models that excelled in traditional science exams faced failure rates exceeding 90% when executing actual experiments. And although a recent benchmark suite, TALES, reported ScienceWorld scores in the low 80s for leading models, the gap has yet to close.
“We hope that in the near future, science agents will help treat diseases, create new materials, and generate other important discoveries,” Jansen remarks. He emphasizes that benchmarks like DiscoveryWorld and ScienceWorld are essential for gauging whether AI systems can navigate the complexities of scientific inquiry. “If an agent flunks basic science, what hope does it have of curing cancer?” he questions.
As AI continues to make strides, the significance of benchmarks like DiscoveryWorld and ScienceWorld cannot be overstated. They not only measure the capabilities of current AI models but also guide future developments in the field, enabling a clearer path toward effective scientific discovery through artificial intelligence.