Andrej Karpathy, the prominent AI researcher and founder of Eureka Labs, has unveiled an experiment called “LLM-Council” that compares how different language models perform on the same user queries. By having the models anonymously assess each other’s answers, the project aims to improve the quality of the final response. Early results show OpenAI’s latest model, GPT-5.1, consistently ranking as the most capable, even though other benchmarks had suggested Google’s Gemini 3.0 had overtaken OpenAI in overall performance.
Karpathy describes the framework as a structured three-step process. First, a user’s query is sent to several language models, and their answers are displayed side by side without revealing which model produced which response. Next, each model reviews the anonymized responses and ranks them on perceived accuracy and insight. Finally, a designated “chairman” model synthesizes those evaluations into a consensus response, in effect a collaborative answer distilled from the competition among models.
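To make the flow concrete, the sketch below outlines those three steps in Python. It is a minimal illustration, not Karpathy’s actual implementation: the call_llm helper, the model identifiers, and the prompt wording are all hypothetical placeholders for whatever chat-completion API a reader would actually wire in.

```python
import random

COUNCIL = ["model-a", "model-b", "model-c"]  # hypothetical model IDs
CHAIRMAN = "model-c"                         # hypothetical chairman choice

def call_llm(model: str, prompt: str) -> str:
    """Stand-in for a real chat-completion call to `model`.
    Replace the body with your provider's client; here it returns a stub."""
    return f"[{model}'s reply to: {prompt[:48]}...]"

def council_answer(query: str) -> str:
    # Step 1: every council member answers the user's query.
    answers = [call_llm(m, query) for m in COUNCIL]

    # Anonymize: shuffle and relabel so reviewers cannot tell who wrote what.
    random.shuffle(answers)
    bundle = "\n\n".join(f"Response {i + 1}:\n{a}" for i, a in enumerate(answers))

    # Step 2: each member ranks the anonymized responses.
    rank_prompt = (
        f"Question: {query}\n\n{bundle}\n\n"
        "Rank these responses from best to worst on accuracy and insight."
    )
    rankings = [call_llm(m, rank_prompt) for m in COUNCIL]

    # Step 3: the chairman synthesizes answers and rankings into one reply.
    synth_prompt = (
        f"Question: {query}\n\nCandidate responses:\n{bundle}\n\n"
        "Peer rankings:\n" + "\n".join(rankings) +
        "\n\nWrite a single best final answer."
    )
    return call_llm(CHAIRMAN, synth_prompt)

if __name__ == "__main__":
    print(council_answer("Explain mixture-of-experts routing in two sentences."))
```

A production version would presumably parse the rankings into a score table before synthesis; the stub above passes them along as raw text for brevity.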
Notably, Karpathy cautioned that the rankings are inherently subjective and do not necessarily match his own judgment. “I’m not 100% convinced this aligns with my own qualitative assessment. For example, qualitatively, I find GPT 5.1 a little too wordy and sprawled, while Gemini 3 is a bit more condensed and processed. Claude is too terse in this domain,” he commented. His caveat underscores the difficulty of evaluating AI performance, where stylistic preferences can sway the judgments of models and humans alike.
The findings echo observations from other industry figures. Vasuman M, founder and CEO of Varick AI Agents, responded to Karpathy on the social media platform X, saying he had built a similar framework months earlier and that OpenAI’s models consistently came out on top in his evaluations as well: “Even after plugging in Gemini 3.0, the winner was GPT 5.1, every single time.” Vasuman also described a curious phenomenon in which competing models appeared to adjust their outputs when told a response came from GPT, exactly the kind of bias that anonymizing the responses is designed to neutralize.
The LLM-Council experiment raises broader questions about how AI models are evaluated and whether collaborative competition can refine output quality. Combining anonymous peer review with ranking offers a fresh view of how models judge one another, and it points toward benchmarking techniques that could complement static leaderboards in a fast-moving field.
Karpathy, known for his deep involvement in AI development, built the project over a weekend using a ‘vibe coding’ tool and then shared the repository on GitHub. The effort reflects a growing trend in the AI community toward open-source collaboration and experimentation, and shows that useful prototypes can emerge quickly outside traditional corporate structures.
The implications extend beyond model rankings: they could shape how users engage with AI in applications from customer service to content generation. As models like GPT-5.1 gain recognition for their performance, competition among AI developers to refine these systems is likely to intensify, and ongoing work on model architectures and evaluation methodologies promises an increasingly nuanced picture of what these systems can do.