LLMs Build Their Own Benchmarks, with 2025 Models Reaching 66% Accuracy on the Resulting Exams

LLMs can now autonomously generate and grade occupational benchmarks, with leading models scoring up to 79% on the resulting task assessments, pointing toward a cheaper, more scalable approach to AI evaluation.

Benchmarks play a critical role in assessing the capabilities of Large Language Models (LLMs), yet their development and maintenance can be costly and resource-intensive. A recent study introduces a novel approach using Agentic AI principles, wherein LLMs themselves generate and evaluate practical examinations tailored to specific occupational tasks across sectors like Finance, Business Operations, Management, and various fields within Computer Science and Mathematics.

Exam Development Using LLMs

The research distinguishes between the materials needed for an assessment, such as text, data, and images, and the tools required to solve the tasks, such as function calling and web search. Concentrating solely on text-based tasks that require no additional tools, the study found that only 7% of the occupations examined yielded testable tasks, for a total of 149 tasks across the analyzed fields.
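The filtering step described above can be sketched in a few lines. This is a minimal illustration, not the paper's actual pipeline; the task record fields (`modality`, `tools_required`) are hypothetical names chosen for the example.

```python
def is_testable(task: dict) -> bool:
    """Keep only text-based tasks that need no external tools."""
    return task["modality"] == "text" and not task["tools_required"]

# Invented sample tasks: only the first matches the study's criteria.
tasks = [
    {"id": 1, "modality": "text", "tools_required": []},
    {"id": 2, "modality": "image", "tools_required": []},
    {"id": 3, "modality": "text", "tools_required": ["web_search"]},
]

testable = [t for t in tasks if is_testable(t)]
print(len(testable))  # 1
```

In the study, applying criteria like these left 149 testable tasks out of a much larger pool of occupational tasks.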

To evaluate these synthetic examinations, the researchers ran a range of models, including variants of GPT, Claude, and Gemini. The findings revealed that even on basic tasks, current LLMs face considerable challenges: leading models achieved median scores of only 65% to 79%, leaving significant room for improvement, particularly in data manipulation and financial calculations.
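The headline numbers above are medians over per-task exam grades. A toy aggregation, with grade values invented purely for illustration, looks like this:

```python
from statistics import median

# Hypothetical per-task grades (percent correct) for two unnamed models.
grades_by_model = {
    "model_a": [80, 65, 79, 70, 90],
    "model_b": [60, 65, 72, 55, 68],
}

medians = {m: median(g) for m, g in grades_by_model.items()}
print(medians["model_a"])  # 79
print(medians["model_b"])  # 65
```

Reporting the median rather than the mean keeps a few very easy or very hard tasks from dominating a model's headline score.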

Rapid Model Improvement

Encouragingly, the research noted a rapid enhancement in model performance over time. Models introduced in 2024 averaged scores of 40.5%, while those released in 2025 rose to 66%, a gain of 25.5 percentage points in a single year. This trend suggests that while considerable work remains in validating these benchmarks and extending them to tool-based tasks, LLM-generated assessments could offer a more cost-effective, scalable, and continuously updateable method for measuring AI capabilities in workplace settings.

The study’s results advocate extending the “LLM-as-a-judge” paradigm to occupational task assessments, a shift in how LLMs are used to probe their own strengths and weaknesses in practical scenarios.

As AI continues to evolve, the implications of these findings could be profound, potentially reshaping how we evaluate AI competency in various sectors, from finance to management. The approach not only addresses the challenges associated with traditional benchmarking methods but also sets a precedent for future research in the field.

For a detailed examination of the methodology and findings, the full working paper is available for download.

© 2025 AIPressa · Part of Buzzora Media · All rights reserved.