AI Models Struggle with Help-Seeking; ProactiveBench Shows 60% Drop in Accuracy

ProactiveBench reveals that 22 multimodal language models see a staggering 60% drop in accuracy, failing to seek user help when necessary.

In everyday interactions, a person asked to identify an obscured object typically requests that the obstruction be removed. Multimodal language models, in contrast, often either generate incorrect answers or decline to respond entirely. The ProactiveBench framework examines this failure systematically, assessing whether AI models can recognize when they need additional information and ask for it.

ProactiveBench repurposes seven existing datasets, transforming them into scenarios that cannot be resolved without human intervention. The tasks involve identifying hidden objects, making sense of noisy images or rough sketches, or asking for alternative camera angles. Comprising over 108,000 images across 18,000 samples, the benchmark filters out any task that models can complete correctly on their first attempt. To pass, a model must proactively request further information.
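To make that filtering step concrete, here is a minimal Python sketch of how such a pass might work. The `Sample` type, the `reference_model.answer()` call, and the exact-match check are illustrative assumptions, not the paper's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    image: bytes      # obstructed, noisy, or sketched input
    question: str     # e.g. "What object is under the box?"
    label: str        # ground-truth answer

def filter_solvable(samples, reference_model):
    """Drop tasks a model already solves on its first attempt."""
    kept = []
    for s in samples:
        # `reference_model.answer` stands in for a single-shot prediction.
        if reference_model.answer(s.image, s.question) != s.label:
            # The model fails without extra information, so the sample
            # stays in the benchmark: passing now requires a proactive
            # request such as "please remove the obstruction."
            kept.append(s)
    return kept
```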

Among the models assessed were LLaVA-OV, Qwen2.5-VL, InternVL3, GPT-4.1, GPT-5.2, and o4-mini. In a reference setting with clearly visible objects, these models achieved an average accuracy of 79.8 percent. On ProactiveBench, that figure plunged by more than 60 percent. The ROD dataset provides a stark illustration: when objects were concealed behind blocks, accuracy plummeted from 98.3 percent in the reference setting to a mere 8.2 percent. Models excel at spotting objects in plain view but rarely think to ask for help uncovering them.

Interestingly, the size of the models did not correlate with improved inquiry capabilities. For example, InternVL3-1B outperformed the larger InternVL3-8B, achieving 27.1 percent accuracy compared to 12.7 percent. Similarly, the older LLaVA-1.5-7B surpassed the newer LLaVA-OV-72B, recording 24.8 percent versus 13 percent. Closed models like GPT-4.1 exhibited the highest accuracy, although researchers noted potential data contamination in their COCO scores.

Some models initially appeared more proactive than others. Further investigation, however, showed that substituting the valid proactive suggestions with nonsensical ones, such as "Rewind the video" for a sketching task, led these models to select the meaningless options just as readily. LLaVA-NeXT Vicuna, for instance, increased its selection rate from 37 to 49 percent when given invalid choices. What looks like proactivity is often just a lower threshold for guessing rather than genuine understanding.
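The substitution probe is easy to picture in code. The sketch below, with an assumed `model.choose()` API and illustrative option lists, measures how often a model picks any proactive option from a fixed menu; comparing the rate on valid versus nonsensical menus is what separates real help-seeking from indiscriminate guessing.

```python
VALID_OPTIONS = ["Remove the object blocking the view",
                 "Show the scene from another angle"]
NONSENSE_OPTIONS = ["Rewind the video",          # meaningless for a sketch task
                    "Turn the canvas upside down"]

def proactive_rate(model, tasks, options):
    """Fraction of tasks on which the model picks any proactive option."""
    picks = 0
    for task in tasks:
        # `model.choose` is an assumed API returning either a direct
        # answer string or one of the offered proactive options.
        response = model.choose(task.image, task.question, options)
        if response in options:
            picks += 1
    return picks / len(tasks)

# Genuine understanding predicts a large gap between these two rates;
# LLaVA-NeXT Vicuna instead rose from 37 to 49 percent on nonsense menus.
# gap = proactive_rate(m, tasks, VALID_OPTIONS) - proactive_rate(m, tasks, NONSENSE_OPTIONS)
```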

Attempts to improve performance by inserting explicit hints into prompts and conversation histories were largely ineffective. Hints did marginally increase the rate of proactive suggestions, boosting accuracy to 25.8 percent, but this still fell short of chance-level performance. In some cases, models simply spammed proactive suggestions up to the maximum allowed number of steps. Including conversation histories actually hindered performance, as models tended to replicate previous proactive actions rather than learn from them.
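For reference, a hint-injection probe of this kind can be as simple as prepending an instruction and prior turns to the prompt. The wording and structure below are assumed for illustration, not quoted from the paper.

```python
HINT = ("If the image does not contain enough information to answer, "
        "you may ask the user for help, e.g. to remove an obstruction.")

def build_prompt(question, history=(), with_hint=False):
    """Assemble a prompt with an optional hint and prior conversation turns."""
    parts = [HINT] if with_hint else []
    for role, text in history:
        # Earlier turns may include past proactive requests; the study
        # found models tend to copy these verbatim rather than reason
        # about whether help is actually needed now.
        parts.append(f"{role}: {text}")
    parts.append(f"user: {question}")
    return "\n".join(parts)
```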

Reinforcement Learning as a Solution

Despite these failures, the researchers found a promising remedy in reinforcement learning. They fine-tuned LLaVA-NeXT-Mistral-7B and Qwen2.5-VL-3B with Group-Relative Policy Optimization (GRPO) on around 27,000 examples, using a reward function that prioritized correct predictions over proactive suggestions. Both fine-tuned models outperformed all 22 previously tested models, reaching 37.4 and 38.6 percent accuracy against a previous best of 34.0 percent. Notably, the learned proactivity generalized beyond the training data: Qwen2.5-VL-3B's accuracy on the ChangeIt dataset soared from 12.4 to 55.6 percent.
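A plausible shape for that reward function, with assumed weights and stand-in helper checks rather than the paper's exact values, looks like this:

```python
def is_correct(response, label):
    # Stand-in for the paper's answer matching; exact match for simplicity.
    return response.strip().lower() == label.strip().lower()

def is_proactive_request(response):
    # Crude heuristic stand-in for detecting a help request.
    text = response.strip().lower()
    return text.endswith("?") or text.startswith("please")

def reward(response, label):
    """Asymmetric reward used with GRPO-style fine-tuning (weights assumed)."""
    if is_correct(response, label):
        return 1.0   # full reward only for a correct final prediction
    if is_proactive_request(response):
        return 0.5   # partial credit for asking; rewarding this equally to
                     # a correct answer makes the model flood the dialogue
                     # with requests, which the paper reports cuts accuracy
                     # to 5.4 percent
    return 0.0       # wrong answers and refusals earn nothing
```

Keeping the help-request reward strictly below the correct-answer reward is the balance the next paragraph describes.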

Getting that balance right is crucial, however. If proactive suggestions are rewarded equally to correct answers, the model becomes overly eager, flooding the conversation with help requests and dropping accuracy to 5.4 percent. And even with these improvements, a notable gap remains between performance on ProactiveBench and the reference setting: 40.7 percent versus 75.1 percent.

Released as an open-source resource, ProactiveBench represents an initial step toward developing models capable of recognizing when they lack information and seeking assistance instead of fabricating responses. The benchmark aligns with a recurring theme in AI research: multimodal language models frequently struggle with uncertainty. Recent findings have demonstrated that even top-tier models can plateau at around 50 percent accuracy in visual object recognition, indicating a persistent issue of overconfidence.

As research continues, ProactiveBench may catalyze advancements in model designs that better navigate the complexities of uncertainty, potentially enhancing their utility in real-world applications where accurate interpretation is vital.
