
LLMs Build Their Own Benchmarks, and 2025 Models Average 66% on the Resulting Exams

LLMs can now autonomously generate occupational benchmarks, with leading models scoring up to 79% on the resulting task assessments, pointing to a cheaper and more scalable approach to AI evaluation.

Benchmarks play a critical role in assessing the capabilities of Large Language Models (LLMs), yet their development and maintenance can be costly and resource-intensive. A recent study introduces a novel approach using Agentic AI principles, wherein LLMs themselves generate and evaluate practical examinations tailored to specific occupational tasks across sectors like Finance, Business Operations, Management, and various fields within Computer Science and Mathematics.

Exam Development Using LLMs

The research differentiates between the materials necessary for assessments, such as text, data, and images, and the tools required to solve these tasks, including function calling and web searches. By concentrating solely on text-based tasks that do not require additional tool usage, the study found that a mere 7% of the occupations examined yielded testable tasks, totaling 149 tasks across the analyzed fields.
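The paper's actual selection pipeline is not reproduced in the article, but the idea of having an LLM screen tasks for text-only, tool-free testability can be sketched in a few lines. In the sketch below, the `ask_llm` helper, the prompt wording, and the example tasks are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the paper's code): screen occupational tasks for
# text-only, tool-free testability using an LLM yes/no classifier.

def ask_llm(prompt: str) -> str:
    # Stand-in for a real chat-completion call; replace with your LLM client.
    # A canned reply is faked here so the sketch runs end to end.
    return "NO" if "physical" in prompt.lower() else "YES"

def is_text_only_task(task_description: str) -> bool:
    """Return True if the LLM judges the task testable with text alone,
    i.e. without web search, function calling, images, or other tools."""
    prompt = (
        "Can the following occupational task be turned into a written exam "
        "question that needs only text (no images, web search, code execution, "
        "or other tools)? Answer YES or NO.\n\n"
        f"Task: {task_description}"
    )
    return ask_llm(prompt).strip().upper().startswith("YES")

tasks = [
    "Reconcile monthly departmental budget figures",
    "Inspect physical equipment on the factory floor",
]
testable = [t for t in tasks if is_text_only_task(t)]
print(testable)  # only the text-only task survives the screen
```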

To evaluate these synthetic examinations, the researchers ran a range of models, including variants of GPT, Claude, and Gemini. The findings revealed that even on these relatively basic tasks, current LLMs face considerable challenges: leading models achieved median scores of only 65% to 79%, leaving significant room for improvement, particularly in data manipulation and financial calculations.
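As a rough illustration of how such results can be aggregated (again, not the study's code), per-task grades for each model can be collapsed into the median scores the article cites. The model names and numbers below are placeholders.

```python
# Hypothetical illustration: collapse per-task grades (0-100) into a median
# score per model, matching how the article reports "median scores".
from statistics import median

# Placeholder numbers only; the study evaluated GPT, Claude, and Gemini variants.
per_task_scores = {
    "model-a": [72, 65, 80, 58],
    "model-b": [61, 70, 66, 75],
}

median_scores = {name: median(scores) for name, scores in per_task_scores.items()}
for name, score in sorted(median_scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: median exam score {score:.1f}%")
```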

Rapid Model Improvement

Encouragingly, the research noted rapid improvement over time: models introduced in 2024 averaged 40.5%, while those released in 2025 averaged 66%, a gain of 25.5 percentage points in a single year. This trend suggests that, while considerable work remains to validate these benchmarks and extend them to tool-based tasks, LLM-generated assessments could offer a more cost-effective, scalable, and continuously updateable way to measure AI capabilities in workplace settings.

The study’s results advocate for extending the “LLM-as-a-judge” paradigm to occupational task assessments, marking a shift in how LLMs can be used to probe their own strengths and weaknesses in practical scenarios.
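A minimal sketch of that grading pattern, assuming each generated exam item comes with a reference solution, might look like the following; the rubric wording and the `ask_llm` helper are illustrative, not the authors' prompts.

```python
# Hypothetical sketch of the "LLM-as-a-judge" grading step for one exam item.
import re

def ask_llm(prompt: str) -> str:
    # Stand-in for a real judge-model call; a canned reply is returned so the
    # sketch runs without an API key.
    return "SCORE: 80"

def judge_answer(question: str, reference: str, candidate_answer: str) -> int:
    """Ask a judge model to grade a candidate answer from 0 to 100 against a
    reference solution, then parse the numeric score from its reply."""
    prompt = (
        "You are grading a practical workplace exam.\n"
        f"Question: {question}\n"
        f"Reference solution: {reference}\n"
        f"Candidate answer: {candidate_answer}\n"
        "Reply with 'SCORE: <0-100>' and nothing else."
    )
    reply = ask_llm(prompt)
    match = re.search(r"SCORE:\s*(\d{1,3})", reply)
    return int(match.group(1)) if match else 0

print(judge_answer(
    "Compute the quarter's operating margin from the figures provided.",
    "Operating margin = 12.4%",
    "Roughly 12 percent",
))
```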

As AI continues to evolve, the implications of these findings could be profound, potentially reshaping how we evaluate AI competency in various sectors, from finance to management. The approach not only addresses the challenges associated with traditional benchmarking methods but also sets a precedent for future research in the field.

For a detailed examination of the methodology and findings, the full working paper is available for download.

Written by AiPressa Staff
The AiPressa Staff team brings you comprehensive coverage of the artificial intelligence industry, including breaking news, research developments, business trends, and policy updates. Our mission is to keep you informed about the rapidly evolving world of AI technology.

