
GPT-5.1 Tops LLM Council Rankings, Surpassing Gemini 3.0 and Claude in New Experiment

OpenAI’s GPT-5.1 outperforms competitors in the LLM-Council experiment, consistently ranking highest against Google’s Gemini 3.0 and Anthropic’s Claude in the models’ own peer evaluations.

Andrej Karpathy, the prominent AI researcher and founder of Eureka Labs, has unveiled an experiment called “LLM-Council,” which evaluates how various language models perform in response to user queries. By having different models anonymously assess each other’s answers, the project aims to improve the quality of the generated responses. Initial results indicate that OpenAI’s latest model, GPT-5.1, consistently ranks as the most capable, even though other benchmarks had previously suggested that Google’s Gemini 3.0 had overtaken OpenAI in overall performance.

Karpathy described the framework as a structured three-step process. First, a user’s query is submitted to multiple language models, and their answers are displayed side by side without disclosing which model produced which response. Next, each model reviews the anonymized responses and ranks them by perceived accuracy and insight. Finally, a designated “chairman model” synthesizes these evaluations into a consensus response, in effect a collaborative answer derived from the competition among the models.
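To make the three-step flow concrete, here is a minimal Python sketch of a council loop of this kind. It is not taken from Karpathy’s repository: the model names, the prompts, and the call_llm helper are illustrative placeholders, and in practice call_llm would wrap a real chat-completion SDK from each provider.

```python
import random

# Hypothetical stand-in for a real chat-completion call; swap in your
# provider's SDK (OpenAI, Anthropic, Google, ...) here.
def call_llm(model: str, prompt: str) -> str:
    return f"[{model}] response to: {prompt[:40]}..."

COUNCIL = ["gpt-5.1", "gemini-3.0", "claude"]   # illustrative names
CHAIRMAN = "gpt-5.1"

def llm_council(query: str) -> str:
    # Step 1: fan the query out to every council member.
    answers = {m: call_llm(m, query) for m in COUNCIL}

    # Anonymize: shuffle and relabel so rankers can't tell who wrote what.
    items = list(answers.items())
    random.shuffle(items)
    labeled = {f"Response {chr(65 + i)}": text
               for i, (_, text) in enumerate(items)}
    candidates = "\n\n".join(f"{label}:\n{text}"
                             for label, text in labeled.items())

    # Step 2: each member ranks the anonymized answers.
    ranking_prompt = (
        f"Question: {query}\n\n{candidates}\n\n"
        "Rank these responses from best to worst, "
        "with a one-line reason for each."
    )
    rankings = {m: call_llm(m, ranking_prompt) for m in COUNCIL}

    # Step 3: the chairman synthesizes answers + rankings into one reply.
    chairman_prompt = (
        f"Question: {query}\n\nCandidate responses:\n{candidates}\n\n"
        "Council rankings:\n"
        + "\n\n".join(f"{m}:\n{r}" for m, r in rankings.items())
        + "\n\nWrite the single best final answer, informed by the rankings."
    )
    return call_llm(CHAIRMAN, chairman_prompt)

if __name__ == "__main__":
    print(llm_council("Explain why the sky is blue."))
```

The anonymization step is the crux of the design: because rankers see only “Response A/B/C” labels, the peer review cannot simply defer to a model’s brand reputation.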

Notably, Karpathy pointed out that the rankings are inherently subjective and do not necessarily reflect his personal assessments. He commented, “I’m not 100% convinced this aligns with my own qualitative assessment. For example, qualitatively, I find GPT 5.1 a little too wordy and sprawled, while Gemini 3 is a bit more condensed and processed. Claude is too terse in this domain.” His comments underscore the complexity of evaluating AI performance, where personal biases can influence judgments.

The findings from the LLM-Council experiment echo observations from other industry figures. Vasuman M, founder and CEO of Varick AI Agents, responded to Karpathy’s post on X, saying he had built a similar framework months earlier and that OpenAI’s models consistently came out on top in his evaluations as well: “Even after plugging in Gemini 3.0, the winner was GPT 5.1, every single time.” Vasuman also highlighted a curious phenomenon in which models appeared to adjust their outputs when told that a response came from GPT, hinting at inter-model dynamics that shape how AI systems interpret and judge one another’s language.

The LLM-Council experiment raises broader questions about how AI models should be evaluated and whether collaborative competition can refine output quality. Combining anonymous peer review with ranking offers a fresh view of how different models perceive and assess one another, and that collective input could in turn improve their responses. It may also point the way toward new benchmarking techniques in a fast-evolving AI landscape.

Karpathy, known for his deep involvement in AI development, built the project over a weekend through “vibe coding” and subsequently shared the repository on GitHub. The initiative reflects a growing trend in the AI community toward open-source collaboration and rapid experimentation outside traditional corporate structures.

The implications of these findings extend beyond just model performance; they could influence how users engage with AI technologies in various applications, from customer service to content generation. As models like GPT-5.1 continue to gain recognition for their performance, the competition among AI developers to refine and enhance these technologies is likely to intensify. With ongoing developments in AI model architecture and evaluation methodologies, the future promises an even more nuanced understanding of artificial intelligence capabilities.


