
GPT-5.1 Tops LLM Council Rankings, Surpassing Gemini 3.0 and Claude in New Experiment

OpenAI’s GPT-5.1 outperforms competitors in the LLM-Council experiment, consistently ranking highest against Gemini 3.0 and Claude, reshaping AI evaluation standards.

Andrej Karpathy, the prominent AI researcher and founder of Eureka Labs, has unveiled an experiment named “LLM-Council,” which evaluates how various language models respond to user queries. By having different models anonymously assess each other’s answers, the project aims to improve the quality of the generated responses. Initial results indicate that OpenAI’s latest model, GPT-5.1, consistently ranks as the most capable, even though other benchmarks had previously suggested that Google’s Gemini 3.0 had surpassed OpenAI in overall performance.

Karpathy described the framework as a three-step process. First, a user’s query is submitted to multiple language models, and their answers are displayed side by side without disclosing which model produced which response. Next, each model reviews the anonymized responses and ranks them by perceived accuracy and insight. Finally, a designated “chairman model” synthesizes these evaluations into a consensus response, in effect producing a collaborative answer out of the competition among the models.
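The three-step flow described above can be sketched in a few lines of Python. This is a minimal illustration, not Karpathy’s actual implementation: `ask_model` and `rank_responses` are hypothetical stand-ins for real LLM API calls, stubbed here with canned behavior so the control flow runs end to end.

```python
import random

def ask_model(model, prompt):
    # Placeholder: a real implementation would call each provider's API.
    return f"[{model}] answer to: {prompt}"

def rank_responses(model, labels):
    # Placeholder: a real implementation would prompt `model` to order
    # the anonymized answers by perceived accuracy and insight.
    order = list(labels)
    random.shuffle(order)
    return order

def council(query, models, chairman):
    # Step 1: fan the query out; hide authorship behind neutral labels.
    answers = [ask_model(m, query) for m in models]
    labels = {f"Response {i + 1}": a for i, a in enumerate(answers)}

    # Step 2: each council member ranks the anonymized responses.
    rankings = {m: rank_responses(m, labels) for m in models}

    # Step 3: the chairman synthesizes a consensus, here by a simple
    # points tally (higher rank earns more points).
    tally = {label: 0 for label in labels}
    for order in rankings.values():
        for pos, label in enumerate(order):
            tally[label] += len(order) - pos
    winner = max(tally, key=tally.get)
    return ask_model(chairman, f"Synthesize, favoring {winner}: {query}")

result = council("What is overfitting?",
                 ["model-a", "model-b", "model-c"], "chairman")
```

In the real setup the chairman would be given the full ranked answers to merge; the points tally here is just one simple way to reduce the rankings to a consensus signal.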

Notably, Karpathy pointed out that the rankings are inherently subjective and do not necessarily reflect his personal assessments. He commented, “I’m not 100% convinced this aligns with my own qualitative assessment. For example, qualitatively, I find GPT 5.1 a little too wordy and sprawled, while Gemini 3 is a bit more condensed and processed. Claude is too terse in this domain.” His comments underscore the complexity of evaluating AI performance, where personal biases can influence judgments.

The findings from the LLM-Council experiment resonate with observations made by other industry experts. Vasuman M, founder and CEO of Varick AI Agents, responded to Karpathy’s findings on social media platform X, asserting that he had developed a similar framework months prior. He noted that OpenAI’s models consistently emerged as the top performers in his evaluations as well, stating, “Even after plugging in Gemini 3.0, the winner was GPT 5.1, every single time.” Vasuman also highlighted a curious phenomenon where competing models appeared to adjust their outputs when informed that their responses came from GPT, revealing a layer of inter-model dynamics that could influence how AI interprets and generates language.

In light of this, the LLM-Council experiment raises critical questions about AI model evaluation and the potential for collaborative competition to refine output quality. The blending of anonymous feedback and ranking offers a fresh perspective on how different models perceive and assess each other, thereby potentially enhancing their responses through collective input. This could pave the way for advanced benchmarking techniques that might redefine performance standards in the fast-evolving AI landscape.

Karpathy, known for his deep involvement in AI development, crafted this project over the weekend using a ‘vibe coding’ tool, subsequently sharing the repository on GitHub. This initiative reflects a growing trend in the AI community toward open-source collaboration and experimentation, emphasizing that innovation can occur rapidly even outside traditional corporate structures.

The implications of these findings extend beyond just model performance; they could influence how users engage with AI technologies in various applications, from customer service to content generation. As models like GPT-5.1 continue to gain recognition for their performance, the competition among AI developers to refine and enhance these technologies is likely to intensify. With ongoing developments in AI model architecture and evaluation methodologies, the future promises an even more nuanced understanding of artificial intelligence capabilities.

Written by the AiPressa Staff


© 2025 AIPressa · Part of Buzzora Media · All rights reserved.