BeMyEyes Multi-Agent Framework Surpasses GPT-4o on Multimodal Reasoning Tasks

Researchers from USC and UC Davis launch BeMyEyes, a multi-agent framework that surpasses GPT-4o’s performance by over 6% on multimodal reasoning tasks.

Researchers from the University of Southern California and the University of California, Davis, along with collaborators from Microsoft Research, have unveiled a framework called BeMyEyes aimed at extending large language models (LLMs) to process both text and images. The team, led by James Y. Huang, has developed a system that sidesteps the heavy resource demands typically associated with building integrated vision-language models. Instead of requiring extensive retraining, BeMyEyes orchestrates collaboration between smaller, agile vision-language models (VLMs) acting as “perceivers” and powerful language models that serve as “reasoners.” This multi-agent setup enables multimodal reasoning, allowing a compact open-source language model to outperform larger proprietary vision-language models on complex, knowledge-intensive tasks, thereby paving the way for more adaptable artificial intelligence systems.

The research details how the framework tackles multiple-choice questions about images. Using a three-agent approach, the system improves performance on visual question-answering tasks by breaking complex questions into smaller, more manageable steps. Each agent has a distinct role: the Perceiver Agent describes the image, the Reasoner Agent coordinates the overall process while querying the Perceiver for details, and the Expert, a role within the Reasoner, synthesizes the gathered information to deliver the final answer. The dialogue begins with an initial prompt containing the question and an image description, and concludes with the Reasoner formatting its response as “Answer: $LETTER.”
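To make that flow concrete, here is a minimal sketch of the three-agent loop in Python. The model calls are stubbed out, and all names are illustrative assumptions rather than the paper’s released code; the Reasoner keeps asking the Perceiver follow-up questions until it can emit a line matching “Answer: $LETTER.”

```python
import re

# Regex for the final-answer format described in the article.
ANSWER_RE = re.compile(r"Answer:\s*([A-E])")

def perceiver_turn(image, request, vlm):
    """Perceiver: a small VLM that sees the image and answers the request."""
    return vlm(image=image, prompt=request)

def reasoner_turn(transcript, llm):
    """Reasoner: a text-only LLM that either asks a follow-up question
    or commits to a final choice formatted as 'Answer: $LETTER'."""
    return llm(prompt=transcript)

def answer_mcq(image, question, choices, vlm, llm, max_rounds=5):
    # Seed the dialogue with an initial image description plus the question.
    description = perceiver_turn(image, "Describe this image in detail.", vlm)
    transcript = (f"Image description: {description}\n"
                  f"Question: {question}\nChoices: {choices}\n")
    for _ in range(max_rounds):
        reply = reasoner_turn(transcript, llm)
        match = ANSWER_RE.search(reply)
        if match:                      # Expert step: final answer reached.
            return match.group(1)
        # Otherwise treat the reply as a question for the perceiver.
        detail = perceiver_turn(image, reply, vlm)
        transcript += f"Reasoner: {reply}\nPerceiver: {detail}\n"
    return None  # No parsable answer within the round budget.

if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs end to end.
    dummy_vlm = lambda image, prompt: "A bar chart with four bars."
    dummy_llm = lambda prompt: "Answer: C"
    print(answer_mcq("chart.png", "Which bar is tallest?",
                     "A) red B) blue C) green D) gray", dummy_vlm, dummy_llm))
```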

BeMyEyes aims to expand the capabilities of LLMs into the realm of multimodal reasoning through collaborative efforts between adaptable vision-language models and powerful LLMs. This decoupling of perception from reasoning enables text-only language models to interpret visual data without the extensive retraining often required by traditional approaches. The framework features a perceiver agent implemented with a small, computationally efficient VLM, working alongside a reasoner agent that utilizes a frozen LLM with substantial knowledge and reasoning prowess. This modular architecture allows for the flexible integration of new perceiver or reasoner models into the system.
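The modularity claim can be illustrated with a small sketch: if the perceiver and reasoner each sit behind a narrow interface, either one can be swapped without touching the other. The class and method names below are hypothetical; only the model pairing (a Qwen2.5-VL-7B perceiver with a DeepSeek-R1 reasoner) comes from the article.

```python
from typing import Protocol

class Perceiver(Protocol):
    # Any VLM backend that can describe an image on request plugs in here.
    def describe(self, image: str, request: str) -> str: ...

class Reasoner(Protocol):
    # Any frozen text-only LLM that continues a dialogue transcript plugs in here.
    def respond(self, transcript: str) -> str: ...

class QwenPerceiver:
    """Wrapper sketch for a small, efficient VLM such as Qwen2.5-VL-7B."""
    def describe(self, image: str, request: str) -> str:
        return f"[description of {image} for request: {request}]"

class DeepSeekReasoner:
    """Wrapper sketch for a frozen text-only LLM such as DeepSeek-R1."""
    def respond(self, transcript: str) -> str:
        return "Answer: A"

def build_system(perceiver: Perceiver, reasoner: Reasoner):
    # Any conforming pair composes into a working system; swapping either
    # agent requires no retraining of the other.
    return perceiver, reasoner

system = build_system(QwenPerceiver(), DeepSeekReasoner())
```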

The research team established an effective conversational flow between the perceiver and reasoner agents, with the perceiver concentrating on interpreting visual input and communicating pertinent details while the reasoner actively queries for specific information. The perceiver is prompted to recognize the reasoner’s lack of direct visual perception, which encourages more detailed descriptions. A dedicated data synthesis pipeline and a supervised fine-tuning strategy were developed to enhance the perceiver agent’s ability to accurately interpret visual information and respond effectively to the reasoner’s inquiries. Experiments demonstrated the efficacy of this approach across diverse tasks, models, and domains, establishing BeMyEyes as a scalable and flexible alternative to large-scale multimodal models.
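The article does not publish the pipeline’s data format, but a synthesized training example for the perceiver plausibly looks like a short chat transcript whose target response comes from a stronger teacher VLM. The schema below is an assumption; note the system prompt telling the perceiver that its partner cannot see the image, which the researchers report encourages more detailed descriptions.

```python
import json

# Hypothetical shape of one synthesized fine-tuning example for the
# perceiver agent. The exact schema is an assumption, not the paper's.
example = {
    "image": "samples/0001.png",
    "messages": [
        {"role": "system",
         "content": "You can see the image; your partner cannot. "
                    "Describe what they ask about in full detail."},
        {"role": "user", "content": "What text appears on the sign?"},
        # Target response generated by a larger teacher VLM, then used as
        # the supervision signal when fine-tuning the small perceiver.
        {"role": "assistant",
         "content": "The sign reads 'Lab 3: Optics' in white letters "
                    "on a blue board."},
    ],
}

print(json.dumps(example, indent=2))
```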

When the researchers equipped the text-only DeepSeek-R1 LLM with the Qwen2.5-VL-7B perceiver agent, the combination outperformed larger proprietary vision-language models, including GPT-4o, on several knowledge-intensive multimodal tasks. On the MathVista benchmark, this configuration achieved a score of 72.7, surpassing GPT-4o’s 68.3. Similarly, on the MMMU-Pro benchmark, the pairing scored 57.2 against GPT-4o’s 49.

Through this data synthesis pipeline, the researchers distilled perceptual and instruction-following capabilities from larger VLMs into the perceiver agent, improving how it communicates visual information to the reasoner. The results confirm that BeMyEyes substantially raises the performance of text-only LLMs on multimodal reasoning tasks, presenting a modular, scalable, and flexible alternative to traditional large-scale multimodal models. The approach reduces computational costs while preserving generalization, marking a notable advance toward future multimodal reasoning systems.

As the field of artificial intelligence continues to evolve, frameworks like BeMyEyes not only demonstrate the potential for improved multimodal reasoning but also signify a shift towards more efficient and adaptable AI systems. This innovative approach may lead to broader applications across various domains, fostering advancements in how AI interprets and interacts with the world.

👉 More information
🗞 Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration
🧠 ArXiv: https://arxiv.org/abs/2511.19417
