New research from Google Research and the Rochester Institute of Technology has called into question the conventional wisdom surrounding human evaluations of AI models, suggesting that the typical practice of using three to five raters per test example may not suffice to capture the full spectrum of human opinion. As artificial intelligence continues to advance, human ratings often play a decisive role in determining which models outperform others, particularly in tasks such as assessing the toxicity of comments or ensuring the safety of chatbot responses. However, discrepancies among evaluators can lead to inconsistent judgments, undermining the reliability of model comparisons.
Standard AI benchmarks typically rely on a majority vote to establish a “correct” answer, effectively sidelining the diversity of human assessments. In their study, the researchers sought to optimize the allocation of limited annotation budgets, posing a critical question: Is it more beneficial to evaluate a larger number of test examples or to have fewer examples rated by more people?
To illustrate their dilemma, the researchers employed a restaurant analogy. If 1,000 guests each sample one dish, they obtain a broad yet shallow understanding of the menu. Conversely, asking 20 diners to rate 50 dishes yields a deeper insight into what truly stands out. Current AI benchmarks tend to adopt the former approach, gathering numerous ratings from a limited number of evaluators.
The team developed a simulator designed to replicate human rating patterns with real datasets. This tool generates synthetic evaluation data for two models, enabling the researchers to analyze which conditions facilitate accurate differentiation between them. The simulator was calibrated against five datasets related to toxicity detection, chatbot safety, and assessments of cross-cultural offensiveness, allowing for extensive testing of various combinations of total budgets and rater counts.
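The study's simulator is not reproduced here, but the core idea can be sketched in a few lines of Python. The snippet below is a toy illustration, not the authors' implementation: it assumes each test example has a latent probability that a rater flags a model's output, draws those probabilities from Beta distributions (with a hypothetical "model B" made slightly safer than "model A"), and checks how often repeated evaluations under the same fixed annotation budget pick the same winner.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_comparison(budget, raters_per_example, trials=200):
    """Toy rater simulator: for a fixed annotation budget, estimate how often
    repeated evaluations pick the same winner between two hypothetical models.
    Assumes each example has a latent probability that a rater flags the
    output as problematic, with model B slightly better than model A."""
    n_examples = budget // raters_per_example
    b_wins = 0
    for _ in range(trials):
        # Latent per-example "flag" probabilities for each model (assumed priors).
        p_a = rng.beta(2.0, 5.0, size=n_examples)   # model A
        p_b = rng.beta(2.0, 5.5, size=n_examples)   # model B: slightly safer
        # Each example is judged by `raters_per_example` simulated raters.
        flags_a = rng.binomial(raters_per_example, p_a)
        flags_b = rng.binomial(raters_per_example, p_b)
        # The model with fewer flagged responses "wins" this evaluation run.
        b_wins += flags_b.sum() < flags_a.sum()
    return b_wins / trials

budget = 1000
for k in (1, 3, 5, 10, 20):
    rate = simulate_comparison(budget, k)
    print(f"{k:>2} raters/example, {budget // k:>4} examples: "
          f"B judged better in {rate:.0%} of runs")
```

Sweeping `raters_per_example` while holding the total budget fixed mirrors the question the study poses: many examples with few raters each, or fewer examples rated by more people.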
The findings challenge established practices, revealing that the commonly used range of one to five raters per test example often fails to yield reproducible model comparisons. For results that genuinely reflect the variety of human perspectives, the study indicates that more than ten raters per example are typically necessary. The research also determined that reliable outcomes could be attained with around 1,000 total annotations, provided the budget is effectively divided between test examples and raters.
The implications of these findings hinge on the metric used to gauge performance. If the goal is to assess accuracy—determining whether a model’s output aligns with the majority opinion—then a broader approach with numerous test examples and minimal raters is advantageous. This is because accuracy focuses solely on the most frequent response, where additional raters offer diminishing returns.
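A minimal sketch, assuming binary "toxic"/"ok" labels, shows why: accuracy against a majority-vote label only asks whether the model matches the single most common response, so a unanimous example and a narrowly split one contribute to the score identically.

```python
from collections import Counter

def majority_label(ratings):
    """Return the most common rating; ties are broken arbitrarily by Counter."""
    return Counter(ratings).most_common(1)[0][0]

def accuracy_vs_majority(model_outputs, human_ratings):
    """Fraction of examples where the model's label matches the raters'
    majority vote. Only the top label per example matters, so extra raters
    beyond a stable majority add little information for this metric."""
    hits = sum(
        pred == majority_label(ratings)
        for pred, ratings in zip(model_outputs, human_ratings)
    )
    return hits / len(model_outputs)

# Hypothetical example: two test items with very different levels of
# rater agreement end up with the same majority label ("toxic").
ratings = [
    ["toxic", "toxic", "toxic", "toxic", "toxic"],   # unanimous
    ["toxic", "toxic", "toxic", "ok", "ok"],         # split 3-2
]
preds = ["toxic", "toxic"]
print(accuracy_vs_majority(preds, ratings))  # 1.0: both items count as "correct"
```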
However, if the objective is to capture a more comprehensive range of human responses, employing metrics such as total variation necessitates a different strategy. This approach calls for fewer test examples but a significantly higher number of raters per example to map the nuances of agreement and disagreement among evaluators. The research underscored that even when different examples receive the same majority-vote label, the underlying distribution of responses can vary considerably.
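Total variation distance is one standard way to compare full rating distributions; the sketch below (again a toy with hypothetical labels, not the authors' code) shows how two examples with the same majority-vote label can still sit far apart distributionally.

```python
from collections import Counter

def rating_distribution(ratings, labels):
    """Empirical distribution of rater labels for one example."""
    counts = Counter(ratings)
    total = len(ratings)
    return [counts.get(label, 0) / total for label in labels]

def total_variation(p, q):
    """Total variation distance: half the L1 distance between two categorical
    distributions. 0 means identical, 1 means completely disjoint."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

labels = ["toxic", "ok"]
# Same majority label ("toxic"), very different underlying distributions.
unanimous = rating_distribution(["toxic"] * 5, labels)               # [1.0, 0.0]
split = rating_distribution(["toxic"] * 3 + ["ok"] * 2, labels)      # [0.6, 0.4]
print(total_variation(unanimous, split))  # 0.4
```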
As AI technologies continue to evolve, the study serves as a critical reminder that there is no one-size-fits-all method for evaluation. The approach to collecting and analyzing human judgments should be tailored to the specific metrics being employed, ensuring that AI benchmarks can adapt to the complexities of human opinion and yield more reliable results. The researchers’ work not only reshapes our understanding of AI evaluations but also sets the stage for future refinements in how these models are assessed.