Alibaba’s Qwen3-VL Scans 2-Hour Videos with 99.5% Accuracy in Frame Detection

Alibaba’s Qwen3-VL locates individual frames within two-hour videos with 99.5% accuracy, according to a technical report on the 235-billion-parameter multimodal model.

Alibaba has released a detailed technical report on its Qwen3-VL multimodal model, a few months after the model’s launch. The report documents strong results on image-based mathematical tasks and on the analysis of hours-long video footage, results that make the model a serious open-weight contender among multimodal systems.

The Qwen3-VL model can process two-hour videos or hundreds of document pages within a 256,000-token context window. In “needle-in-a-haystack” tests, the flagship 235-billion-parameter model located individual frames within 30-minute videos with 100 percent accuracy. In two-hour videos containing approximately one million tokens, its accuracy remained at 99.5 percent. The test inserts a semantically significant “needle” frame at a random position in the video, and the model must find and analyze that specific frame.
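The scoring logic of such a test can be sketched in a few lines of Python. This is a minimal illustration, not the report's evaluation code: `run_needle_trials`, `oracle`, and the frame labels are hypothetical names, and the 7,200-frame count assumes a two-hour video sampled at one frame per second, a rate the article does not state.

```python
import random

def run_needle_trials(find_needle, num_frames, num_trials, seed=0):
    """Return the fraction of trials in which find_needle locates the needle.

    find_needle is a hypothetical stand-in for the model: it receives the
    list of frames and returns the index it believes holds the needle.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_trials):
        frames = ["filler"] * num_frames
        needle_pos = rng.randrange(num_frames)  # random insertion point
        frames[needle_pos] = "needle"
        if find_needle(frames) == needle_pos:
            hits += 1
    return hits / num_trials

def oracle(frames):
    # Toy baseline that reads the label directly; a real model sees pixels, not labels.
    return frames.index("needle")

# Assumption: 2 hours at 1 frame per second = 7,200 frames.
accuracy = run_needle_trials(oracle, num_frames=7200, num_trials=200)
print(f"accuracy: {accuracy:.1%}")  # accuracy: 100.0%
```

Under this scoring, a reported accuracy of 99.5 percent would correspond to roughly one miss in every 200 trials.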

In published benchmarks, Qwen3-VL frequently outperforms competitors such as Gemini 2.5 Pro, OpenAI’s GPT-5, and Claude Opus 4.1, even when the rivals use advanced reasoning features or larger processing budgets. For instance, Qwen3-VL scored 85.8 percent on the MathVista benchmark, surpassing GPT-5’s 81.3 percent, and it led on MathVision with a score of 74.6 percent, ahead of Gemini 2.5 Pro at 73.3 percent and GPT-5 at 65.8 percent.

Beyond visual mathematics, the model performs well across specialized benchmarks, achieving 96.5 percent on the DocVQA document comprehension test and 875 points on OCRBench. Its text recognition supports 39 languages, nearly four times as many as its predecessor. It also posted solid results in graphical user interface tasks, scoring 61.8 percent on the ScreenSpot Pro test and 63.7 percent on AndroidWorld, where it must operate Android apps independently.

The model also handles complex, multi-page PDF documents, scoring 56.2 percent on MMLongBench-Doc for long document analysis. On the CharXiv benchmark for scientific charts, it achieved 90.5 percent on description tasks and 66.2 percent on complex reasoning questions. The model has weaknesses, however: on the demanding MMMU-Pro test it scored 69.3 percent, short of GPT-5’s 78.4 percent, and commercial competitors generally outperform it in video question-answering benchmarks. Qwen3-VL therefore appears to be a specialist in visual mathematics and document comprehension that still lags in broader reasoning.

Technical Details

The technical report outlines three key architectural changes. First, “interleaved MRoPE” replaces the previous position embedding method. It aims to improve performance on long videos by spreading the time, height, and width position components evenly across all embedding dimensions rather than concentrating each in its own block. Second, DeepStack lets the model access intermediate results from the vision encoder, so it draws on visual features at several levels of detail rather than only the final output. Third, a simpler text-based timestamp system replaces the T-RoPE method of the previous model: video frames are marked with time indicators written as text, which improves the model’s handling of time-based video tasks.
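The first and third changes can be illustrated with a toy Python sketch. It is a simplified picture under stated assumptions, not the model's implementation: `axis_layout` and `timestamp_prefix` are hypothetical helpers, the chunked layout uses equal-sized blocks for simplicity, and the timestamp text format is illustrative only.

```python
def axis_layout(num_pairs, interleaved):
    """Assign each rotary frequency pair to an axis: t (time), h (height), w (width)."""
    axes = ["t", "h", "w"]
    if interleaved:
        # Round-robin: every axis receives low-, mid-, and high-index pairs.
        return [axes[i % 3] for i in range(num_pairs)]
    # Chunked (simplified, equal blocks): each axis owns one contiguous block.
    block = max(num_pairs // 3, 1)
    return [axes[min(i // block, 2)] for i in range(num_pairs)]

def timestamp_prefix(seconds):
    """Render a frame time as plain text placed before the frame's tokens.

    The exact text format here is an assumption made for illustration.
    """
    return f"<{seconds:.1f} seconds>"

print("chunked:    ", "".join(axis_layout(12, interleaved=False)))  # tttthhhhwwww
print("interleaved:", "".join(axis_layout(12, interleaved=True)))   # thwthwthwthw
print(timestamp_prefix(3))                                          # <3.0 seconds>
```

In the chunked layout, the time axis occupies only one end of the frequency range. In the interleaved layout, each axis is represented across the whole range, which is the even distribution the report describes.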

Alibaba trained Qwen3-VL in four phases using up to 10,000 GPUs. After an initial phase that aligned images with text, the model underwent full multimodal training on around one trillion tokens drawn from diverse sources, including web scrapes, 3 million PDFs from Common Crawl, and over 60 million STEM tasks. The context window was expanded progressively from 8,000 tokens to 32,000 and finally to about 262,000 tokens, the same window described above as 256,000 tokens. The model’s “Thinking” variants received additional training to spell out their reasoning steps explicitly, improving performance on complex problems.
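The two figures quoted for the final context window, 256,000 and 262,000 tokens, are consistent if "K" is read as 1,024 tokens. That reading is an assumption, not something the article states, and `k_to_tokens` is a hypothetical helper used only to show the arithmetic.

```python
STAGES_K = [8, 32, 256]  # context-window stages from the text, in "K" tokens

def k_to_tokens(k):
    """Convert a size in binary K (assumed 1 K = 1,024 tokens) to a token count."""
    return k * 1024

for k in STAGES_K:
    print(f"{k}K = {k_to_tokens(k):,} tokens")
# 8K = 8,192 tokens
# 32K = 32,768 tokens
# 256K = 262,144 tokens
```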

All variants of Qwen3-VL released since September are available under the Apache 2.0 license with open weights on Hugging Face. This includes dense models ranging from 2B to 32B parameters, as well as mixture-of-experts models such as the 30B-A3B and the much larger 235B-A22B. Some capabilities, such as finding frames in lengthy videos, are not entirely new: Google’s Gemini 1.5 Pro demonstrated similar capabilities in early 2024. Qwen3-VL, however, offers competitive performance with openly available weights. Given the popularity of the previous Qwen2.5-VL model in research circles, the latest iteration is likely to accelerate further open-source work in the field.
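In the mixture-of-experts names, the first number is the total parameter count and the number after "A" is the approximate count active for each token, both in billions. A short sketch makes the ratio concrete; `parse_moe_name` is a hypothetical helper written for this example, not part of any library.

```python
import re

def parse_moe_name(name):
    """Split a name like '235B-A22B' into (total, active) parameters in billions."""
    match = re.fullmatch(r"(\d+)B-A(\d+)B", name)
    if match is None:
        raise ValueError(f"not a total/active MoE name: {name!r}")
    return int(match.group(1)), int(match.group(2))

for name in ["30B-A3B", "235B-A22B"]:
    total, active = parse_moe_name(name)
    print(f"{name}: {active}B of {total}B active per token ({active / total:.1%})")
# 30B-A3B: 3B of 30B active per token (10.0%)
# 235B-A22B: 22B of 235B active per token (9.4%)
```

By these figures, each mixture-of-experts model uses roughly a tenth of its parameters for any single token.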

Staff
Written By

The AiPressa Staff team brings you comprehensive coverage of the artificial intelligence industry, including breaking news, research developments, business trends, and policy updates. Our mission is to keep you informed about the rapidly evolving world of AI technology.


© 2025 AIPressa · Part of Buzzora Media · All rights reserved. This website provides general news and educational content for informational purposes only. While we strive for accuracy, we do not guarantee the completeness or reliability of the information presented. The content should not be considered professional advice of any kind. Readers are encouraged to verify facts and consult appropriate experts when needed. We are not responsible for any loss or inconvenience resulting from the use of information on this site. Some images used on this website are generated with artificial intelligence and are illustrative in nature. They may not accurately represent the products, people, or events described in the articles.