Benchmarks play a critical role in assessing the capabilities of large language models (LLMs), yet they are costly and resource-intensive to develop and maintain. A recent study introduces an agentic approach in which LLMs themselves generate and grade practical examinations tailored to specific occupational tasks across sectors such as Finance, Business Operations, and Management, as well as various fields within Computer Science and Mathematics.
Exam Development Using LLMs
The research distinguishes the materials a task needs in order to be posed as an assessment (text, data, images) from the tools required to solve it (function calling, web search). Restricting attention to text-based tasks that require no additional tools, the study found that only 7% of the occupations examined yielded testable tasks, for a total of 149 tasks across the analyzed fields.
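As a rough illustration of this filtering step, the sketch below keeps only tasks that can be posed as plain text and solved without external tools. It is not the paper's actual pipeline; the task fields, the example tasks, and the is_testable helper are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class OccupationalTask:
    occupation: str
    description: str
    required_materials: set[str]   # e.g. {"text"} or {"text", "images"}
    required_tools: set[str]       # e.g. {"web_search", "function_calling"}


def is_testable(task: OccupationalTask) -> bool:
    """Keep tasks that need only text as input and no external tools."""
    return task.required_materials <= {"text"} and not task.required_tools


# Illustrative placeholder tasks, not drawn from the study's data.
tasks = [
    OccupationalTask("Financial Analyst",
                     "Reconcile a quarterly revenue statement given line items as text.",
                     {"text"}, set()),
    OccupationalTask("Data Scientist",
                     "Pull current exchange rates from the web and chart them.",
                     {"text", "data"}, {"web_search"}),
]

testable = [t for t in tasks if is_testable(t)]
print(f"{len(testable)} of {len(tasks)} tasks are testable as text-only, tool-free exams")
```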
To evaluate these synthetic examinations, the researchers ran a variety of models, including models from the GPT, Claude, and Gemini families. Even on these relatively basic tasks, current LLMs struggled: leading models achieved median scores of only 65% to 79%, leaving significant room for improvement, particularly in data manipulation and financial calculations.
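A minimal sketch of how such an evaluation harness might aggregate per-task grades into the median scores reported per model is shown below. The model names and score values are placeholders for illustration only, not the study's data.

```python
import statistics
from collections import defaultdict

# Placeholder results as (model_name, task_id, score out of 100).
# These values are illustrative only, not the study's results.
results = [
    ("model-a", "task-001", 72.0),
    ("model-a", "task-002", 81.0),
    ("model-b", "task-001", 58.0),
    ("model-b", "task-002", 64.0),
]

# Group per-task scores by model.
scores_by_model: dict[str, list[float]] = defaultdict(list)
for model, _task_id, score in results:
    scores_by_model[model].append(score)

# Report the median per-task score for each model.
for model, scores in sorted(scores_by_model.items()):
    print(f"{model}: median score {statistics.median(scores):.1f}")
```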
Rapid Model Improvement
Encouragingly, the research noted rapid improvement in model performance over time. Models introduced in 2024 averaged scores of 40.5%, while those released in 2025 averaged 66%, a rise of more than 25 percentage points in a single year. This trend suggests that, although considerable work remains to validate these benchmarks and extend them to tool-based tasks, LLM-generated assessments could offer a more cost-effective, scalable, and continuously updatable method for measuring AI capabilities in workplace settings.
The study’s results support extending the “LLM-as-a-judge” paradigm to occupational task assessment, marking a shift in how LLMs can be used to probe their own strengths and weaknesses in practical scenarios.
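The grading side of that paradigm can be sketched as follows. The judge prompt, the call_llm placeholder, and the 0–100 rubric are assumptions for illustration, not the study's actual implementation.

```python
JUDGE_PROMPT = """You are grading a candidate's answer to an occupational exam task.

Task: {task}
Reference solution: {reference}
Candidate answer: {answer}

Score the answer from 0 to 100 for correctness and completeness.
Respond with the integer score only."""


def call_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call to whichever judge model is used."""
    raise NotImplementedError("Wire this to an LLM provider of your choice.")


def judge_answer(task: str, reference: str, answer: str) -> int:
    """Ask a judge model to score a candidate answer, clamped to the 0-100 range."""
    raw = call_llm(JUDGE_PROMPT.format(task=task, reference=reference, answer=answer))
    score = int(raw.strip())
    return max(0, min(100, score))
```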
As AI continues to evolve, the implications of these findings could be profound, potentially reshaping how we evaluate AI competency in various sectors, from finance to management. The approach not only addresses the challenges associated with traditional benchmarking methods but also sets a precedent for future research in the field.
For a detailed examination of the methodology and findings, the full working paper is available for download.