Andrej Karpathy, the prominent AI researcher and founder of Eureka Labs, has unveiled an experiment called “LLM-Council” that compares how different language models perform on the same user queries. By having the models anonymously assess each other’s answers, the project aims to improve the quality of the final response. Early results show OpenAI’s latest model, GPT-5.1, consistently ranking as the most capable, even though other benchmarks had suggested Google’s Gemini 3.0 had overtaken OpenAI in overall performance.
Karpathy describes the framework as a structured three-step process. First, a user’s query is sent to several language models, and their answers are displayed side by side without revealing which model produced which response. Next, each model reviews the anonymized responses and ranks them on perceived accuracy and insight. Finally, a designated “chairman” model synthesizes those evaluations into a consensus response, in effect a collaborative answer distilled from the competition among models.
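To make the flow concrete, the sketch below outlines those three steps in Python. It is a minimal illustration, not Karpathy’s actual implementation: the call_llm helper, the model identifiers, and the prompt wording are all hypothetical placeholders for whatever chat-completion API a reader would actually wire in.

```python
import random

COUNCIL = ["model-a", "model-b", "model-c"]  # hypothetical model IDs
CHAIRMAN = "model-c"                         # hypothetical chairman choice

def call_llm(model: str, prompt: str) -> str:
    """Stand-in for a real chat-completion call to `model`.
    Replace the body with your provider's client; here it returns a stub."""
    return f"[{model}'s reply to: {prompt[:48]}...]"

def council_answer(query: str) -> str:
    # Step 1: every council member answers the user's query.
    answers = [call_llm(m, query) for m in COUNCIL]

    # Anonymize: shuffle and relabel so reviewers cannot tell who wrote what.
    random.shuffle(answers)
    bundle = "\n\n".join(f"Response {i + 1}:\n{a}" for i, a in enumerate(answers))

    # Step 2: each member ranks the anonymized responses.
    rank_prompt = (
        f"Question: {query}\n\n{bundle}\n\n"
        "Rank these responses from best to worst on accuracy and insight."
    )
    rankings = [call_llm(m, rank_prompt) for m in COUNCIL]

    # Step 3: the chairman synthesizes answers and rankings into one reply.
    synth_prompt = (
        f"Question: {query}\n\nCandidate responses:\n{bundle}\n\n"
        "Peer rankings:\n" + "\n".join(rankings) +
        "\n\nWrite a single best final answer."
    )
    return call_llm(CHAIRMAN, synth_prompt)

if __name__ == "__main__":
    print(council_answer("Explain mixture-of-experts routing in two sentences."))
```

A production version would presumably parse the rankings into a score table before synthesis; the stub above passes them along as raw text for brevity.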
Notably, Karpathy cautioned that the rankings are inherently subjective and do not necessarily match his own judgment. “I’m not 100% convinced this aligns with my own qualitative assessment. For example, qualitatively, I find GPT 5.1 a little too wordy and sprawled, while Gemini 3 is a bit more condensed and processed. Claude is too terse in this domain,” he commented. His caveat underscores the difficulty of evaluating AI performance, where stylistic preferences can sway the judgments of models and humans alike.
The findings echo observations from other industry figures. Vasuman M, founder and CEO of Varick AI Agents, responded to Karpathy on the social media platform X, saying he had built a similar framework months earlier and that OpenAI’s models consistently came out on top in his evaluations as well: “Even after plugging in Gemini 3.0, the winner was GPT 5.1, every single time.” Vasuman also described a curious phenomenon in which competing models appeared to adjust their outputs when told a response came from GPT, exactly the kind of bias that anonymizing the responses is designed to neutralize.
The LLM-Council experiment raises broader questions about how AI models are evaluated and whether collaborative competition can refine output quality. Combining anonymous peer review with ranking offers a fresh view of how models judge one another, and it points toward benchmarking techniques that could complement static leaderboards in a fast-moving field.
Karpathy, known for his deep involvement in AI development, built the project over a weekend using a ‘vibe coding’ tool and then shared the repository on GitHub. The effort reflects a growing trend in the AI community toward open-source collaboration and experimentation, and shows that useful prototypes can emerge quickly outside traditional corporate structures.
The implications extend beyond model rankings: they could shape how users engage with AI in applications from customer service to content generation. As models like GPT-5.1 gain recognition for their performance, competition among AI developers to refine these systems is likely to intensify, and ongoing work on model architectures and evaluation methodologies promises an increasingly nuanced picture of what these systems can do.