New research from Google Research and the Rochester Institute of Technology has called into question the conventional wisdom surrounding human evaluations of AI models, suggesting that the typical practice of using three to five raters per test example may not suffice to capture the full spectrum of human opinion. As artificial intelligence continues to advance, human ratings often play a decisive role in determining which models outperform others, particularly in tasks such as assessing the toxicity of comments or ensuring the safety of chatbot responses. However, discrepancies among evaluators can lead to inconsistent judgments, undermining the reliability of model comparisons.
Standard AI benchmarks typically rely on a majority vote to establish a “correct” answer, effectively sidelining the diversity of human assessments. In their study, the researchers sought to optimize the allocation of limited annotation budgets, posing a critical question: Is it more beneficial to evaluate a larger number of test examples or to have fewer examples rated by more people?
To illustrate their dilemma, the researchers employed a restaurant analogy. If 1,000 guests each sample one dish, they obtain a broad yet shallow understanding of the menu. Conversely, asking 20 diners to rate 50 dishes yields a deeper insight into what truly stands out. Current AI benchmarks tend to adopt the former approach, gathering numerous ratings from a limited number of evaluators.
The team developed a simulator designed to replicate human rating patterns with real datasets. This tool generates synthetic evaluation data for two models, enabling the researchers to analyze which conditions facilitate accurate differentiation between them. The simulator was calibrated against five datasets related to toxicity detection, chatbot safety, and assessments of cross-cultural offensiveness, allowing for extensive testing of various combinations of total budgets and rater counts.
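The study's simulator is not reproduced here, but the core idea can be sketched in a few lines of Python. The snippet below is a toy illustration, not the authors' implementation: it assumes each test example has a latent probability that a rater flags a model's output, draws those probabilities from Beta distributions (with a hypothetical "model B" made slightly safer than "model A"), and checks how often repeated evaluations under the same fixed annotation budget pick the same winner.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_comparison(budget, raters_per_example, trials=200):
    """Toy rater simulator: for a fixed annotation budget, estimate how often
    repeated evaluations pick the same winner between two hypothetical models.
    Assumes each example has a latent probability that a rater flags the
    output as problematic, with model B slightly better than model A."""
    n_examples = budget // raters_per_example
    b_wins = 0
    for _ in range(trials):
        # Latent per-example "flag" probabilities for each model (assumed priors).
        p_a = rng.beta(2.0, 5.0, size=n_examples)   # model A
        p_b = rng.beta(2.0, 5.5, size=n_examples)   # model B: slightly safer
        # Each example is judged by `raters_per_example` simulated raters.
        flags_a = rng.binomial(raters_per_example, p_a)
        flags_b = rng.binomial(raters_per_example, p_b)
        # The model with fewer flagged responses "wins" this evaluation run.
        b_wins += flags_b.sum() < flags_a.sum()
    return b_wins / trials

budget = 1000
for k in (1, 3, 5, 10, 20):
    rate = simulate_comparison(budget, k)
    print(f"{k:>2} raters/example, {budget // k:>4} examples: "
          f"B judged better in {rate:.0%} of runs")
```

Sweeping `raters_per_example` while holding the total budget fixed mirrors the question the study poses: many examples with few raters each, or fewer examples rated by more people.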
The findings challenge established practices, revealing that the commonly used range of one to five raters per test example often fails to yield reproducible model comparisons. For results that genuinely reflect the variety of human perspectives, the study indicates that more than ten raters per example are typically necessary. The research also determined that reliable outcomes could be attained with around 1,000 total annotations, provided the budget is effectively divided between test examples and raters.
The implications of these findings hinge on the metric used to gauge performance. If the goal is to assess accuracy—determining whether a model’s output aligns with the majority opinion—then a broader approach with numerous test examples and minimal raters is advantageous. This is because accuracy focuses solely on the most frequent response, where additional raters offer diminishing returns.
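A minimal sketch, assuming binary "toxic"/"ok" labels, shows why: accuracy against a majority-vote label only asks whether the model matches the single most common response, so a unanimous example and a narrowly split one contribute to the score identically.

```python
from collections import Counter

def majority_label(ratings):
    """Return the most common rating; ties are broken arbitrarily by Counter."""
    return Counter(ratings).most_common(1)[0][0]

def accuracy_vs_majority(model_outputs, human_ratings):
    """Fraction of examples where the model's label matches the raters'
    majority vote. Only the top label per example matters, so extra raters
    beyond a stable majority add little information for this metric."""
    hits = sum(
        pred == majority_label(ratings)
        for pred, ratings in zip(model_outputs, human_ratings)
    )
    return hits / len(model_outputs)

# Hypothetical example: two test items with very different levels of
# rater agreement end up with the same majority label ("toxic").
ratings = [
    ["toxic", "toxic", "toxic", "toxic", "toxic"],   # unanimous
    ["toxic", "toxic", "toxic", "ok", "ok"],         # split 3-2
]
preds = ["toxic", "toxic"]
print(accuracy_vs_majority(preds, ratings))  # 1.0: both items count as "correct"
```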
However, if the objective is to capture a more comprehensive range of human responses, employing metrics such as total variation necessitates a different strategy. This approach calls for fewer test examples but a significantly higher number of raters per example to map the nuances of agreement and disagreement among evaluators. The research underscored that even when different examples receive the same majority-vote label, the underlying distribution of responses can vary considerably.
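Total variation distance is one standard way to compare full rating distributions; the sketch below (again a toy with hypothetical labels, not the authors' code) shows how two examples with the same majority-vote label can still sit far apart distributionally.

```python
from collections import Counter

def rating_distribution(ratings, labels):
    """Empirical distribution of rater labels for one example."""
    counts = Counter(ratings)
    total = len(ratings)
    return [counts.get(label, 0) / total for label in labels]

def total_variation(p, q):
    """Total variation distance: half the L1 distance between two categorical
    distributions. 0 means identical, 1 means completely disjoint."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

labels = ["toxic", "ok"]
# Same majority label ("toxic"), very different underlying distributions.
unanimous = rating_distribution(["toxic"] * 5, labels)               # [1.0, 0.0]
split = rating_distribution(["toxic"] * 3 + ["ok"] * 2, labels)      # [0.6, 0.4]
print(total_variation(unanimous, split))  # 0.4
```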
As AI technologies continue to evolve, the study serves as a critical reminder that there is no one-size-fits-all method for evaluation. The approach to collecting and analyzing human judgments should be tailored to the specific metrics being employed, ensuring that AI benchmarks can adapt to the complexities of human opinion and yield more reliable results. The researchers’ work not only reshapes our understanding of AI evaluations but also sets the stage for future refinements in how these models are assessed.