LLMs Generate Self-Building Benchmarks; Models Released in 2025 Average 66% on the Exams

A new study has LLMs generate and grade their own occupational exams. Leading models post median scores of 65% to 79%, and models released in 2025 average 66%.

Staff

Published

21 November, 2025

Benchmarks play a critical role in assessing the capabilities of Large Language Models (LLMs), yet they are costly to develop and maintain. A recent study applies Agentic AI principles to the problem: the LLMs themselves generate and evaluate practical examinations tailored to specific occupational tasks in Finance, Business Operations, Management, and various fields within Computer Science and Mathematics.

Exam Development Using LLMs

The research distinguishes between the materials an assessment requires, such as text, data, and images, and the tools needed to solve it, such as function calling and web search. The study restricted its exams to text-based tasks that require no tool use. Under that restriction, only 7% of the occupations examined yielded testable tasks, for a total of 149 tasks across the analyzed fields.
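The selection rule described above can be illustrated with a minimal sketch. The `Task` type, the `is_testable` helper, and the three sample tasks are hypothetical and are not taken from the study; they only show how a task might be kept when its materials are text-only and it needs no tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """A hypothetical occupational task (not the study's data model)."""
    name: str
    materials: frozenset  # e.g. {"text"}, {"text", "data"}, {"images"}
    tools: frozenset      # e.g. {"function_calling"}, {"web_search"}


def is_testable(task: Task) -> bool:
    """Keep a task only if it uses text alone and requires no tools."""
    text_only = task.materials <= {"text"}
    needs_no_tools = not task.tools
    return text_only and needs_no_tools


# Illustrative tasks invented for this sketch.
tasks = [
    Task("draft a policy memo", frozenset({"text"}), frozenset()),
    Task("reconcile a ledger export", frozenset({"text", "data"}), frozenset()),
    Task("summarize current market news", frozenset({"text"}), frozenset({"web_search"})),
]

testable = [t.name for t in tasks if is_testable(t)]
print(testable)
```

In this sketch only the first task survives the filter: the second needs data as a material and the third needs a web search tool.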

To evaluate these synthetic examinations, the researchers ran a range of models, including members of the GPT, Claude, and Gemini families. The findings show that current LLMs struggle even with basic tasks. Leading models achieved median scores of 65% to 79%, which leaves significant room for improvement, particularly in data manipulation and financial calculations.

Rapid Model Improvement

Encouragingly, the research found that model performance is improving quickly. Models introduced in 2024 averaged 40.5%, while those released in 2025 averaged 66%, a rise of roughly 26 percentage points in one year. Considerable work remains in validating these benchmarks and extending them to tool-based tasks. Even so, the trend suggests that LLM-generated assessments could offer a more cost-effective, scalable, and continuously updatable way to measure AI capabilities in workplace settings.
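The year-over-year figures quoted above can be checked with a short calculation. This is a sketch that uses only the two averages reported in the article; the function names are illustrative. It separates the percentage-point change from the relative change, which are easy to confuse.

```python
def percentage_point_change(old_pct: float, new_pct: float) -> float:
    """Absolute difference between two percentages, in percentage points."""
    return new_pct - old_pct


def relative_change(old_pct: float, new_pct: float) -> float:
    """Change expressed as a percentage of the starting value."""
    return (new_pct - old_pct) / old_pct * 100


avg_2024 = 40.5  # average score of models introduced in 2024, as reported
avg_2025 = 66.0  # average score of models released in 2025, as reported

pp = percentage_point_change(avg_2024, avg_2025)
rel = relative_change(avg_2024, avg_2025)

print(f"{pp:.1f} percentage points")   # 25.5 percentage points
print(f"{rel:.1f}% relative increase")  # 63.0% relative increase
```

On the quoted averages the gain is 25.5 percentage points, which the article rounds to 26, and that corresponds to a relative improvement of about 63% over the 2024 average.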

The study’s results support extending the “LLM-as-a-judge” paradigm to occupational task assessments, so that LLMs are used to map their own strengths and weaknesses in practical scenarios.

If the approach holds up, it could change how AI competency is evaluated in sectors from finance to management. It addresses the cost and maintenance problems of traditional benchmarking and gives future research a template to build on.

For a detailed examination of the methodology and findings, the full working paper is available for download.

AIPRESSA.COM
