
AI Models Misjudge Exam Difficulty, Underestimate Human Struggles, Study Finds

A study finds that top AI models such as GPT-5 misjudge exam difficulty, correlating with human difficulty ratings at only 0.34, highlighting a crucial gap for educational AI.

Large language models (LLMs) have demonstrated remarkable abilities in answering exam questions that often perplex human students, yet a recent study reveals a critical gap in how these AI systems perceive question difficulty. Conducted by a team of researchers from various US universities, the study sought to determine LLMs’ capability to evaluate exam question difficulty from a human perspective, using over 20 models, including GPT-5, GPT-4o, and various versions of Llama and Qwen.

The researchers tasked these models with estimating how challenging exam questions would be for human test-takers, comparing their outputs against actual difficulty ratings derived from student performance on the USMLE (medical), Cambridge (English), SAT Reading/Writing, and SAT Math. The findings were sobering: AI assessments fell short of aligning with human perceptions, with an average Spearman correlation score below 0.50. Notably, even the latest models like GPT-5 scored only 0.34, while the older GPT-4.1 fared slightly better at 0.44.
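
The article does not include the study's code, but the kind of evaluation it describes can be illustrated in a few lines. The sketch below is a minimal, hypothetical example (the values are invented, not the study's data) of computing a Spearman rank correlation between one model's difficulty estimates and empirical difficulty derived from student pass rates:

```python
# Minimal sketch (not the study's code): Spearman rank correlation between
# a model's difficulty estimates and empirical, student-derived difficulty.
from scipy.stats import spearmanr

# Empirical difficulty per question, e.g. 1 - proportion of students who
# answered correctly (higher = harder). Hypothetical values.
human_difficulty = [0.18, 0.35, 0.52, 0.61, 0.74, 0.83]

# The model's predicted difficulty for the same questions, elicited via
# prompting and mapped to the same 0-1 scale. Hypothetical values.
model_estimate = [0.40, 0.45, 0.42, 0.55, 0.50, 0.48]

rho, p_value = spearmanr(human_difficulty, model_estimate)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```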

When the researchers aggregated the predictions from the 14 highest-performing models, they achieved a correlation of approximately 0.66 with human difficulty ratings. This still indicates only moderate agreement, highlighting a fundamental disconnect in how AI systems interpret question difficulty compared to actual human learners.
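
"Aggregating" here presumably means combining the models' per-question estimates before correlating them with human ratings. The sketch below assumes a simple per-question mean across models; the study's exact aggregation rule may differ:

```python
# Sketch of prediction aggregation, assuming a simple per-question mean
# across models (the paper's actual aggregation rule may differ).
import numpy as np
from scipy.stats import spearmanr

# Rows = models, columns = questions; each value is one model's
# difficulty estimate for that question. Hypothetical values.
per_model_estimates = np.array([
    [0.40, 0.45, 0.42, 0.55, 0.50, 0.48],
    [0.35, 0.50, 0.44, 0.60, 0.52, 0.55],
    [0.42, 0.41, 0.47, 0.58, 0.49, 0.51],
])

human_difficulty = np.array([0.18, 0.35, 0.52, 0.61, 0.74, 0.83])

ensemble = per_model_estimates.mean(axis=0)  # average the models' estimates
rho, _ = spearmanr(human_difficulty, ensemble)
print(f"Ensemble Spearman rho = {rho:.2f}")
```

Averaging tends to cancel out individual models' idiosyncratic errors, which may explain why the ensemble outperforms any single model while still falling well short of full agreement with human ratings.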

The study’s authors attribute this discrepancy to what they term the “curse of knowledge”: the advanced capabilities of LLMs prevent them from understanding the challenges faced by less skilled learners. For instance, while the models easily navigated tasks that frequently stumped medical students, they failed to recognize which aspects of those questions actually trip up human test-takers.

Attempts to mitigate this issue by prompting models to role-play as various types of learners yielded limited success. The models’ accuracy shifted by less than one percentage point regardless of whether they were instructed to simulate weak, average, or strong students. This lack of adaptability indicates that LLMs cannot scale down their own abilities to mimic the mistakes typical of less proficient learners.
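
The article does not reproduce the prompts the researchers used, but role-play prompting of this kind generally amounts to prefixing the difficulty question with a persona instruction. A purely hypothetical template might look like this:

```python
# Hypothetical role-play prompt template in the spirit the study describes;
# the paper's actual prompt wording is not given in this article.
def difficulty_prompt(question: str, persona: str) -> str:
    """Ask the model to judge difficulty from a given learner's perspective."""
    return (
        f"You are a {persona} student preparing for this exam.\n"
        f"On a scale from 0 (very easy) to 1 (very hard), how difficult "
        f"would the following question be for you? Reply with a number.\n\n"
        f"Question: {question}"
    )

for persona in ("weak", "average", "strong"):
    print(difficulty_prompt("Which artery supplies the SA node?", persona))
```

Per the study's finding, swapping the persona in such a template barely moved the models' estimates, suggesting the persona instruction does little to override what the model itself finds easy or hard.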

Another critical finding revealed that when LLMs rated a question as difficult, they did not necessarily struggle with it. For example, GPT-5’s performance did not correlate with its difficulty estimates, demonstrating a lack of self-awareness regarding its limitations. The models’ assessments of difficulty and their actual performance appeared largely disconnected, leading the authors to suggest that these systems lack the self-reflection needed to recognize their own constraints.

Rather than approximating human perceptions, the LLMs developed a “machine consensus” that diverged systematically from human data. This consensus often underestimated question difficulty, with model predictions clustering in a narrow range, whereas actual difficulty values spanned a much broader distribution. Previous research has similarly highlighted the tendency of AI models to converge on shared answers, regardless of whether those answers are accurate.

The implications of this study are significant for the future of AI in education. Accurately determining task difficulty is vital for educational testing, influencing curriculum design, automated test creation, and adaptive learning systems. Until now, the conventional approach has relied heavily on extensive field testing with actual students. Researchers had hoped that LLMs could assume this responsibility; however, the findings complicate that vision. Solving problems does not equate to understanding the reasons behind human struggles, suggesting that making AI effective in educational contexts will necessitate methods beyond simple prompting. One potential solution could involve training models using student error data to bridge the gap between machine capabilities and human learning.

OpenAI’s own usage data indicates the increasing role of AI in education, with “writing and editing” topping the list of popular use cases in Germany, closely followed by “tutoring and education.” Former OpenAI researcher Andrej Karpathy recently advocated for a radical transformation of the education system, suggesting that schools should assume any work completed outside the classroom involves AI assistance, given the unreliability of detection tools. Karpathy proposed a “flipped classroom” model, where exams occur at school and knowledge acquisition, supported by AI, takes place at home. His vision emphasizes the need for dual competence, enabling students to effectively collaborate with AI while also functioning independently of it.
