
AI Models Misjudge Exam Difficulty, Underestimate Human Struggles, Study Finds

A study finds that top AI models such as GPT-5 misjudge exam difficulty, correlating with human difficulty ratings at only 0.34, highlighting a crucial gap for educational AI.

Large language models (LLMs) have demonstrated remarkable abilities in answering exam questions that often perplex human students, yet a recent study reveals a critical gap in how these AI systems perceive question difficulty. Conducted by a team of researchers from various US universities, the study sought to determine LLMs’ capability to evaluate exam question difficulty from a human perspective, using over 20 models, including GPT-5, GPT-4o, and various versions of Llama and Qwen.

The researchers tasked these models with estimating how challenging exam questions would be for human test-takers, comparing their outputs against actual difficulty ratings derived from student performance on the USMLE (medical), Cambridge (English), SAT Reading/Writing, and SAT Math. The findings were sobering: AI assessments fell short of aligning with human perceptions, with an average Spearman correlation score below 0.50. Notably, even the latest models like GPT-5 scored only 0.34, while the older GPT-4.1 fared slightly better at 0.44.
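
The article does not include the study's code, but the kind of evaluation it describes can be illustrated in a few lines. The sketch below is a minimal, hypothetical example (the values are invented, not the study's data) of computing a Spearman rank correlation between one model's difficulty estimates and empirical difficulty derived from student pass rates:

```python
# Minimal sketch (not the study's code): Spearman rank correlation between
# a model's difficulty estimates and empirical, student-derived difficulty.
from scipy.stats import spearmanr

# Empirical difficulty per question, e.g. 1 - proportion of students who
# answered correctly (higher = harder). Hypothetical values.
human_difficulty = [0.18, 0.35, 0.52, 0.61, 0.74, 0.83]

# The model's predicted difficulty for the same questions, elicited via
# prompting and mapped to the same 0-1 scale. Hypothetical values.
model_estimate = [0.40, 0.45, 0.42, 0.55, 0.50, 0.48]

rho, p_value = spearmanr(human_difficulty, model_estimate)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```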

When the researchers aggregated the predictions from the 14 highest-performing models, they achieved a correlation of approximately 0.66 with human difficulty ratings. This still indicates only moderate agreement, highlighting a fundamental disconnect in how AI systems interpret question difficulty compared to actual human learners.
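
"Aggregating" here presumably means combining the models' per-question estimates before correlating them with human ratings. The sketch below assumes a simple per-question mean across models; the study's exact aggregation rule may differ:

```python
# Sketch of prediction aggregation, assuming a simple per-question mean
# across models (the paper's actual aggregation rule may differ).
import numpy as np
from scipy.stats import spearmanr

# Rows = models, columns = questions; each value is one model's
# difficulty estimate for that question. Hypothetical values.
per_model_estimates = np.array([
    [0.40, 0.45, 0.42, 0.55, 0.50, 0.48],
    [0.35, 0.50, 0.44, 0.60, 0.52, 0.55],
    [0.42, 0.41, 0.47, 0.58, 0.49, 0.51],
])

human_difficulty = np.array([0.18, 0.35, 0.52, 0.61, 0.74, 0.83])

ensemble = per_model_estimates.mean(axis=0)  # average the models' estimates
rho, _ = spearmanr(human_difficulty, ensemble)
print(f"Ensemble Spearman rho = {rho:.2f}")
```

Averaging tends to cancel out individual models' idiosyncratic errors, which may explain why the ensemble outperforms any single model while still falling well short of full agreement with human ratings.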

The study’s authors attribute this discrepancy to what they term the “curse of knowledge”: the advanced capabilities of LLMs prevent them from understanding the challenges faced by less skilled learners. For instance, while the models easily navigated tasks that frequently stumped medical students, they failed to recognize which aspects of those questions actually trip up human test-takers.

Attempts to mitigate this issue by prompting models to role-play as various types of learners yielded limited success. The models’ accuracy shifted by less than one percentage point regardless of whether they were instructed to simulate weak, average, or strong students. This lack of adaptability indicates that LLMs cannot scale down their own abilities to mimic the mistakes typical of less proficient learners.
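
The article does not reproduce the prompts the researchers used, but role-play prompting of this kind generally amounts to prefixing the difficulty question with a persona instruction. A purely hypothetical template might look like this:

```python
# Hypothetical role-play prompt template in the spirit the study describes;
# the paper's actual prompt wording is not given in this article.
def difficulty_prompt(question: str, persona: str) -> str:
    """Ask the model to judge difficulty from a given learner's perspective."""
    return (
        f"You are a {persona} student preparing for this exam.\n"
        f"On a scale from 0 (very easy) to 1 (very hard), how difficult "
        f"would the following question be for you? Reply with a number.\n\n"
        f"Question: {question}"
    )

for persona in ("weak", "average", "strong"):
    print(difficulty_prompt("Which artery supplies the SA node?", persona))
```

Per the study's finding, swapping the persona in such a template barely moved the models' estimates, suggesting the persona instruction does little to override what the model itself finds easy or hard.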

Another critical finding revealed that when LLMs rated a question as difficult, they did not necessarily struggle with it. For example, GPT-5’s performance did not correlate with its difficulty estimates, demonstrating a lack of self-awareness regarding its limitations. The models’ assessments of difficulty and their actual performance appeared largely disconnected, leading the authors to suggest that these systems lack the self-reflection needed to recognize their own constraints.

Rather than approximating human perceptions, the LLMs developed a “machine consensus” that diverged systematically from human data. This consensus often underestimated question difficulty, with model predictions clustering in a narrow range, whereas actual difficulty values spanned a much broader distribution. Previous research has similarly highlighted the tendency of AI models to converge on shared answers, regardless of whether those answers are accurate.

The implications of this study are significant for the future of AI in education. Accurately determining task difficulty is vital for educational testing, influencing curriculum design, automated test creation, and adaptive learning systems. Until now, the conventional approach has relied heavily on extensive field testing with actual students. Researchers had hoped that LLMs could assume this responsibility; however, the findings complicate that vision. Solving problems does not equate to understanding the reasons behind human struggles, suggesting that making AI effective in educational contexts will necessitate methods beyond simple prompting. One potential solution could involve training models using student error data to bridge the gap between machine capabilities and human learning.

OpenAI’s own usage data indicates the increasing role of AI in education, with “writing and editing” topping the list of popular use cases in Germany, closely followed by “tutoring and education.” Former OpenAI researcher Andrej Karpathy recently advocated for a radical transformation of the education system, suggesting that schools should assume any work completed outside the classroom involves AI assistance, given the unreliability of detection tools. Karpathy proposed a “flipped classroom” model, where exams occur at school and knowledge acquisition, supported by AI, takes place at home. His vision emphasizes the need for dual competence, enabling students to effectively collaborate with AI while also functioning independently of it.
