BeMyEyes Launches Multi-Agent Framework, Surpassing GPT-4o in Multimodal Reasoning Tasks

Researchers from USC and UC Davis launch BeMyEyes, surpassing GPT-4o’s performance by over 6% on multimodal reasoning tasks using a novel multi-agent framework.

Staff

Published

26 November, 2025

Researchers from the University of Southern California and the University of California, Davis, along with collaborators from Microsoft Research, have unveiled BeMyEyes, a framework that extends large language models (LLMs) to process images as well as text. The team, led by James Y. Huang, built the system to avoid the heavy resource demands of constructing integrated vision-language models. Instead of requiring extensive retraining, BeMyEyes orchestrates collaboration between small, adaptable vision-language models (VLMs) acting as “perceivers” and strong language models that serve as “reasoners.” This multi-agent setup allows a compact open-source perceiver paired with a text-only language model to outperform larger proprietary vision-language models on complex, knowledge-intensive tasks, paving the way for more adaptable artificial intelligence systems.

The research details a specific application of this framework, referred to as B. Prompt, designed to answer multiple-choice questions about images. It uses three roles and improves visual question-answering performance by breaking complex problems into smaller, more manageable parts. Each role is distinct: the Perceiver Agent describes the image; the Reasoner Agent coordinates the overall process and questions the Perceiver for details; and the Expert, which is part of the Reasoner, synthesizes the gathered information into the final answer. The process begins with the Reasoner working from an initial prompt that includes both the question and the image descriptions, and it ends with a response formatted as “Answer: $LETTER.”
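A response in the “Answer: $LETTER” format has to be parsed before it can be scored. The snippet below is a minimal sketch of one way to do that; the regular expression and the function name `extract_answer_letter` are illustrative assumptions, not code from the paper.

```python
import re
from typing import Optional

# Assumed pattern (not from the paper): "Answer:" followed by a single
# capital letter, optionally wrapped in parentheses. The lookahead stops
# ordinary words such as "Because" from being read as option "B".
ANSWER_RE = re.compile(r"Answer:\s*\(?([A-Z])\)?(?![A-Za-z])")


def extract_answer_letter(response: str) -> Optional[str]:
    """Return the letter of the last 'Answer: X' in a response, or None."""
    matches = ANSWER_RE.findall(response)
    # Take the last match so a revised answer overrides an earlier one.
    return matches[-1] if matches else None


if __name__ == "__main__":
    print(extract_answer_letter("The chart peaks in 2019. Answer: B."))
    print(extract_answer_letter("No final answer given"))
```

Taking the last match rather than the first is a design choice: a reasoning model may state a tentative answer before settling on a final one.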

BeMyEyes aims to expand the capabilities of LLMs into the realm of multimodal reasoning through collaborative efforts between adaptable vision-language models and powerful LLMs. This decoupling of perception from reasoning enables text-only language models to interpret visual data without the extensive retraining often required by traditional approaches. The framework features a perceiver agent implemented with a small, computationally efficient VLM, working alongside a reasoner agent that utilizes a frozen LLM with substantial knowledge and reasoning prowess. This modular architecture allows for the flexible integration of new perceiver or reasoner models into the system.

The research team established an effective conversational flow between the perceiver and reasoner agents, with the perceiver concentrating on interpreting visual input and communicating pertinent details while the reasoner actively queries for specific information. The perceiver is prompted to recognize the reasoner’s lack of direct visual perception, which encourages more detailed descriptions. A dedicated data synthesis pipeline and a supervised fine-tuning strategy were developed to enhance the perceiver agent’s ability to accurately interpret visual information and respond effectively to the reasoner’s inquiries. Experiments demonstrated the efficacy of this approach across diverse tasks, models, and domains, establishing BeMyEyes as a scalable and flexible alternative to large-scale multimodal models.
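The conversational flow described above can be sketched as a simple loop. This is a toy illustration under stated assumptions: `run_dialogue`, `toy_perceiver`, and `toy_reasoner` are hypothetical names, the agents are plain Python functions standing in for real model calls, and the stopping rule (a message starting with "Answer:") is an assumption rather than the authors' implementation.

```python
from typing import Callable, List, Tuple

Message = Tuple[str, str]  # (speaker, text)
# Hypothetical stand-ins for model calls:
Perceiver = Callable[[str, str], str]            # (image_ref, request) -> description
Reasoner = Callable[[str, List[Message]], str]   # (task, history) -> next message


def run_dialogue(task: str, image_ref: str, perceiver: Perceiver,
                 reasoner: Reasoner, max_turns: int = 3) -> List[Message]:
    """Alternate reasoner questions and perceiver replies until an answer."""
    # The perceiver is told that its partner cannot see the image.
    first = perceiver(image_ref, "Describe the image for a reader who cannot see it.")
    history: List[Message] = [("perceiver", first)]
    for _ in range(max_turns):
        message = reasoner(task, history)
        history.append(("reasoner", message))
        if message.startswith("Answer:"):
            break  # the reasoner has committed to a final answer
        history.append(("perceiver", perceiver(image_ref, message)))
    return history


def toy_perceiver(image_ref: str, request: str) -> str:
    # Canned replies in place of a small vision-language model.
    if "color" in request.lower():
        return "The tallest bar is red."
    return "A bar chart with three bars."


def toy_reasoner(task: str, history: List[Message]) -> str:
    # Canned policy in place of a frozen text-only LLM.
    if "red" in history[-1][1]:
        return "Answer: A"
    return "What color is the tallest bar?"


if __name__ == "__main__":
    for speaker, text in run_dialogue("Which bar is tallest? (A) red (B) blue",
                                      "chart.png", toy_perceiver, toy_reasoner):
        print(f"{speaker}: {text}")
```

Because the two agents only exchange text, either function can be swapped for a different model without changing the loop, which mirrors the modularity the article describes.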

When the researchers equipped the text-only DeepSeek-R1 LLM with the Qwen2.5-VL-7B perceiver agent, the combination outperformed larger proprietary vision-language models, including GPT-4o, on several knowledge-intensive multimodal tasks. On the MathVista benchmark, this configuration achieved a score of 72.7, surpassing GPT-4o’s score of 68.3. Similarly, on the MMMU-Pro benchmark, the pairing delivered a score of 57.2, outpacing GPT-4o’s score of 49.
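The size of these gaps can be checked with a few lines of arithmetic. The sketch below uses only the four scores quoted above; the helper names are illustrative, and the article does not state how its headline "over 6%" figure was derived, so the mean gap is shown only for comparison.

```python
# Scores as reported in the article: (BeMyEyes configuration, GPT-4o).
SCORES = {
    "MathVista": (72.7, 68.3),
    "MMMU-Pro": (57.2, 49.0),
}


def point_gaps(scores):
    """Absolute gap in benchmark points for each benchmark."""
    return {name: round(ours - base, 1) for name, (ours, base) in scores.items()}


def mean_gap(scores):
    """Mean of the absolute gaps across benchmarks."""
    gaps = point_gaps(scores)
    return round(sum(gaps.values()) / len(gaps), 1)


if __name__ == "__main__":
    print(point_gaps(SCORES))  # {'MathVista': 4.4, 'MMMU-Pro': 8.2}
    print(mean_gap(SCORES))    # 6.3
```

The gaps are 4.4 points on MathVista and 8.2 points on MMMU-Pro, for a mean of about 6.3 points across the two benchmarks.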

With the data synthesis pipeline, the researchers distilled perceptual and instruction-following capabilities from larger VLMs into the perceiver agent, improving how it communicates visual information to the reasoner. The results indicate that BeMyEyes substantially improves the performance of text-only LLMs on multimodal reasoning tasks, offering a modular, scalable, and flexible alternative to traditional large-scale multimodal models. The approach reduces computational costs while maintaining generalization capabilities, a meaningful step toward future multimodal reasoning systems.

As the field of artificial intelligence continues to evolve, frameworks like BeMyEyes not only demonstrate the potential for improved multimodal reasoning but also signify a shift towards more efficient and adaptable AI systems. This innovative approach may lead to broader applications across various domains, fostering advancements in how AI interprets and interacts with the world.

👉 More information
🗞 Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration
🧠 ArXiv: https://arxiv.org/abs/2511.19417

AIPRESSA.COM
