As the capabilities of artificial intelligence continue to evolve, traditional methods for measuring these advancements are becoming increasingly inadequate. An international team of researchers has introduced a new assessment named Humanity’s Last Exam (HLE), designed to evaluate the limits of modern AI systems. The HLE encompasses 2,500 expert-level questions across various fields, including mathematics, natural sciences, ancient languages, and the humanities. The findings, published in the journal Nature, reveal that even leading AI models struggled with the exam.
Initial results from the assessment indicate low accuracy rates among some of the most advanced AI systems: OpenAI’s GPT-4 scored just 2.7%, while Anthropic’s Claude 3.5 Sonnet reached 4.1%. More advanced models, such as Gemini 3.1 Pro and Claude Opus 4.6, performed better, with accuracy rates between 40% and 50%. These findings underscore a significant gap between AI’s pattern-recognition strengths and its command of specialized knowledge.
Historically, standardized tests like the Massive Multitask Language Understanding (MMLU) have served as benchmarks for AI performance. However, as many advanced AI systems excel on these assessments, researchers are questioning their relevance in providing a true sense of AI’s understanding and abilities. “When AI systems start performing extremely well on human benchmarks, it’s tempting to think they’re approaching human-level understanding,” noted Dr. Tung Nguyen, an instructional associate professor at Texas A&M University and a contributor to the HLE project. “But HLE reminds us that intelligence isn’t just about pattern recognition — it’s about depth, context, and specialized expertise.”
The development of Humanity’s Last Exam involved nearly 1,000 experts from various disciplines, each contributing questions that require advanced knowledge and offer a single, verifiable answer. The team crafted questions that encompass a vast range of human knowledge, from translating ancient Palmyrene inscriptions to identifying microscopic anatomical features in birds and analyzing phonological details in Biblical Hebrew pronunciation. Only questions that leading AI models could not answer were included in the final exam, creating a benchmark designed to be beyond the reach of current AI technology.
The HLE’s initial trials revealed that even the best AI models frequently missed questions, particularly those demanding specialized knowledge or skills from diverse fields. This highlights the limitations of AI in areas requiring deep understanding rather than mere pattern recognition. “Without accurate assessment tools, policymakers, developers, and users risk misinterpreting what AI systems can actually do,” Dr. Nguyen explained. While the exam’s name may suggest a catastrophic scenario for AI, its true purpose is to illuminate both the strengths and weaknesses of current systems, not to imply that AI could replace human expertise.
Dr. Nguyen emphasized that the initiative is not a competition against AI but rather a method to elucidate where these systems excel and where they fall short. The collaborative effort among specialists from diverse academic backgrounds has resulted in one of the most ambitious attempts to benchmark the capabilities of advanced AI systems to date. To maintain the integrity of the exam as AI continues to evolve, the majority of questions have been kept confidential.
“What made this project extraordinary was the scale,” Dr. Nguyen remarked, noting that contributions came from historians, physicists, linguists, and medical researchers, among others. This interdisciplinary approach serves to illuminate the gaps in today’s AI systems, showcasing the essential role of human collaboration in understanding the future of artificial intelligence.