AI Models Struggle with Help-Seeking; ProactiveBench Shows 60% Drop in Accuracy

ProactiveBench reveals that 22 multimodal language models see a staggering 60% drop in accuracy, failing to seek user help when necessary.

In everyday interactions, a person asked to identify an obscured object typically requests that the obstruction be removed. Multimodal language models, in contrast, often either generate incorrect answers or decline to respond entirely. The ProactiveBench framework examines this failure systematically, assessing whether AI models can recognize when they need additional information and ask for it.

ProactiveBench repurposes seven existing datasets, transforming them into scenarios that cannot be resolved without human intervention. The tasks involve identifying hidden objects, making sense of noisy images or rough sketches, or asking for alternative camera angles. Comprising over 108,000 images across 18,000 samples, the benchmark filters out any task that models can complete correctly on their first attempt. To pass, a model must proactively request further information.
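To make that filtering step concrete, here is a minimal Python sketch of how such a pass might work. The `Sample` type, the `reference_model.answer()` call, and the exact-match check are illustrative assumptions, not the paper's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    image: bytes      # obstructed, noisy, or sketched input
    question: str     # e.g. "What object is under the box?"
    label: str        # ground-truth answer

def filter_solvable(samples, reference_model):
    """Drop tasks a model already solves on its first attempt."""
    kept = []
    for s in samples:
        # `reference_model.answer` stands in for a single-shot prediction.
        if reference_model.answer(s.image, s.question) != s.label:
            # The model fails without extra information, so the sample
            # stays in the benchmark: passing now requires a proactive
            # request such as "please remove the obstruction."
            kept.append(s)
    return kept
```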

Among the models assessed were LLaVA-OV, Qwen2.5-VL, InternVL3, GPT-4.1, GPT-5.2, and o4-mini. In a reference setting with clearly visible objects, these models achieved an average accuracy of 79.8 percent. On ProactiveBench, that figure plunged by more than 60 percent. The ROD dataset provides a stark illustration: when objects were concealed behind blocks, accuracy plummeted from 98.3 percent in the reference setting to a mere 8.2 percent. Models excel at spotting objects in plain view but rarely think to ask for help uncovering them.

Interestingly, the size of the models did not correlate with improved inquiry capabilities. For example, InternVL3-1B outperformed the larger InternVL3-8B, achieving 27.1 percent accuracy compared to 12.7 percent. Similarly, the older LLaVA-1.5-7B surpassed the newer LLaVA-OV-72B, recording 24.8 percent versus 13 percent. Closed models like GPT-4.1 exhibited the highest accuracy, although researchers noted potential data contamination in their COCO scores.

Some models initially appeared more proactive than others. Further investigation, however, showed that substituting the valid proactive suggestions with nonsensical ones, such as "Rewind the video" for a sketching task, led these models to select the meaningless options just as readily. LLaVA-NeXT Vicuna, for instance, increased its selection rate from 37 to 49 percent when given invalid choices. What looks like proactivity is often just a lower threshold for guessing rather than genuine understanding.
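The substitution probe is easy to picture in code. The sketch below, with an assumed `model.choose()` API and illustrative option lists, measures how often a model picks any proactive option from a fixed menu; comparing the rate on valid versus nonsensical menus is what separates real help-seeking from indiscriminate guessing.

```python
VALID_OPTIONS = ["Remove the object blocking the view",
                 "Show the scene from another angle"]
NONSENSE_OPTIONS = ["Rewind the video",          # meaningless for a sketch task
                    "Turn the canvas upside down"]

def proactive_rate(model, tasks, options):
    """Fraction of tasks on which the model picks any proactive option."""
    picks = 0
    for task in tasks:
        # `model.choose` is an assumed API returning either a direct
        # answer string or one of the offered proactive options.
        response = model.choose(task.image, task.question, options)
        if response in options:
            picks += 1
    return picks / len(tasks)

# Genuine understanding predicts a large gap between these two rates;
# LLaVA-NeXT Vicuna instead rose from 37 to 49 percent on nonsense menus.
# gap = proactive_rate(m, tasks, VALID_OPTIONS) - proactive_rate(m, tasks, NONSENSE_OPTIONS)
```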

Attempts to improve performance by inserting explicit hints into prompts and conversation histories were largely ineffective. Hints did marginally increase the rate of proactive suggestions, boosting accuracy to 25.8 percent, but this still fell short of chance-level performance. In some cases, models simply spammed proactive suggestions up to the maximum allowed number of steps. Including conversation histories actually hindered performance, as models tended to replicate previous proactive actions rather than learn from them.
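For reference, a hint-injection probe of this kind can be as simple as prepending an instruction and prior turns to the prompt. The wording and structure below are assumed for illustration, not quoted from the paper.

```python
HINT = ("If the image does not contain enough information to answer, "
        "you may ask the user for help, e.g. to remove an obstruction.")

def build_prompt(question, history=(), with_hint=False):
    """Assemble a prompt with an optional hint and prior conversation turns."""
    parts = [HINT] if with_hint else []
    for role, text in history:
        # Earlier turns may include past proactive requests; the study
        # found models tend to copy these verbatim rather than reason
        # about whether help is actually needed now.
        parts.append(f"{role}: {text}")
    parts.append(f"user: {question}")
    return "\n".join(parts)
```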

Reinforcement Learning as a Solution

Despite these failures, the researchers found a promising remedy in reinforcement learning. They fine-tuned LLaVA-NeXT-Mistral-7B and Qwen2.5-VL-3B with Group-Relative Policy Optimization (GRPO) on around 27,000 examples, using a reward function that prioritized correct predictions over proactive suggestions. Both fine-tuned models outperformed all 22 previously tested models, reaching 37.4 and 38.6 percent accuracy against a previous best of 34.0 percent. Notably, the learned proactivity generalized beyond the training data: Qwen2.5-VL-3B's accuracy on the ChangeIt dataset soared from 12.4 to 55.6 percent.
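A plausible shape for that reward function, with assumed weights and stand-in helper checks rather than the paper's exact values, looks like this:

```python
def is_correct(response, label):
    # Stand-in for the paper's answer matching; exact match for simplicity.
    return response.strip().lower() == label.strip().lower()

def is_proactive_request(response):
    # Crude heuristic stand-in for detecting a help request.
    text = response.strip().lower()
    return text.endswith("?") or text.startswith("please")

def reward(response, label):
    """Asymmetric reward used with GRPO-style fine-tuning (weights assumed)."""
    if is_correct(response, label):
        return 1.0   # full reward only for a correct final prediction
    if is_proactive_request(response):
        return 0.5   # partial credit for asking; rewarding this equally to
                     # a correct answer makes the model flood the dialogue
                     # with requests, which the paper reports cuts accuracy
                     # to 5.4 percent
    return 0.0       # wrong answers and refusals earn nothing
```

Keeping the help-request reward strictly below the correct-answer reward is the balance the next paragraph describes.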

Getting that balance right is crucial, however. If proactive suggestions are rewarded equally to correct answers, the model becomes overly eager, flooding the conversation with help requests and dropping accuracy to 5.4 percent. And even with these improvements, a notable gap remains between performance on ProactiveBench and the reference setting: 40.7 percent versus 75.1 percent.

Released as an open-source resource, ProactiveBench represents an initial step toward developing models capable of recognizing when they lack information and seeking assistance instead of fabricating responses. The benchmark aligns with a recurring theme in AI research: multimodal language models frequently struggle with uncertainty. Recent findings have demonstrated that even top-tier models can plateau at around 50 percent accuracy in visual object recognition, indicating a persistent issue of overconfidence.

As research continues, ProactiveBench may catalyze advancements in model designs that better navigate the complexities of uncertainty, potentially enhancing their utility in real-world applications where accurate interpretation is vital.
