Alibaba has released a comprehensive technical report on its Qwen3-VL multimodal model, just months after the model's launch. The report highlights two strengths in particular: image-based mathematical reasoning and the analysis of very long videos, areas where the open-weight model now competes directly with leading proprietary systems.
Qwen3-VL handles substantial inputs, fitting two-hour videos or hundreds of document pages into its native 256K-token context window, which can be extended to one million tokens. In "needle-in-a-haystack" tests, which hide a single semantically distinctive frame at a random position in a video and ask the model to find and describe it, the flagship 235-billion-parameter model achieved 100 percent retrieval accuracy on 30-minute videos. Even in two-hour videos spanning roughly one million tokens, accuracy stayed at 99.5 percent.
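To make the setup concrete, here is a toy Python sketch of such a probe. It is not Alibaba's evaluation harness; the frame labels, the frame count (30 minutes at an assumed one frame per second), and the scoring rule are illustrative assumptions.

```python
import random

def build_haystack(frames, needle):
    """Hide one distinctive 'needle' frame at a random position and
    remember where it went so retrieval can be scored later."""
    idx = random.randrange(len(frames) + 1)
    return frames[:idx] + [needle] + frames[idx:], idx

def is_hit(predicted_idx, true_idx, tolerance=0):
    # Count the answer as correct if the model points at (or near)
    # the inserted frame; the tolerance is an illustrative choice.
    return abs(predicted_idx - true_idx) <= tolerance

# 30 minutes of footage sampled at an assumed 1 frame per second.
haystack, true_idx = build_haystack(
    [f"frame_{i:04d}" for i in range(1800)], "NEEDLE"
)
assert haystack[true_idx] == "NEEDLE"
print(f"needle hidden at index {true_idx} of {len(haystack)}")
```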
In published benchmarks, Qwen3-VL frequently outperforms competitors such as Gemini 2.5 Pro, OpenAI's GPT-5, and Claude Opus 4.1, even when those rivals run with extended reasoning enabled or larger test-time compute budgets. For instance, Qwen3-VL scored 85.8 percent on the MathVista benchmark, ahead of GPT-5's 81.3 percent, and led MathVision with 74.6 percent, against 73.3 percent for Gemini 2.5 Pro and 65.8 percent for GPT-5.
Beyond visual mathematics, the model shows its range on specialized benchmarks, achieving 96.5 percent on the DocVQA document comprehension test and 875 of 1,000 points on OCRBench, where its text recognition now covers 39 languages, nearly four times as many as its predecessor. It also posts solid results on graphical user interface tasks, scoring 61.8 percent on ScreenSpot Pro and 63.7 percent on AndroidWorld, where it must operate Android apps autonomously.
Complex, multi-page PDF documents are not beyond its capabilities either, as Qwen3-VL scored 56.2 percent on MMLongBench-Doc for long document analysis. On the CharXiv benchmark for scientific charts, it achieved 90.5 percent on description tasks and 66.2 percent on complex reasoning questions. However, the model does face challenges; in the complex MMMU-Pro test, it scored 69.3 percent, falling short of GPT-5’s 78.4 percent, and commercial competitors generally outperform it in video question-answering benchmarks. Thus, Qwen3-VL appears to be a specialist in visual mathematics and document comprehension, while still lagging in broader reasoning capabilities.
Technical Details
The technical report outlines three key architectural changes. First, "interleaved MRoPE" replaces the previous position-embedding scheme: instead of assigning the temporal, height, and width axes to separate contiguous blocks of frequency dimensions, the new layout spreads each axis across the full frequency range, which is meant to improve performance on long videos. Second, DeepStack feeds features from intermediate layers of the vision encoder into the language model, giving it access to several levels of visual detail rather than only the encoder's final output. Third, a simplified text-based timestamp system replaces the previous model's T-RoPE mechanism: video frames are now marked with plain textual time indicators, which the report says improves the model's handling of time-based video tasks.
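The report gives the intuition for interleaved MRoPE rather than reference code, but the layout change can be sketched in a few lines. In the snippet below, the head dimension, frequency base, and round-robin axis assignment are illustrative assumptions rather than values from the paper; the point is only that interleaving lets each of the time, height, and width axes span both low- and high-frequency rotary dimensions, where a chunked layout confines each axis to one contiguous band.

```python
import numpy as np

def interleaved_mrope_angles(t, h, w, head_dim=64, base=10000.0):
    """Sketch of interleaved multi-axis rotary embeddings.

    Each dimension pair gets one rotary frequency; the temporal (t),
    height (h), and width (w) positions take turns claiming
    frequencies instead of splitting them into three blocks.
    """
    half = head_dim // 2                       # one angle per dimension pair
    freqs = base ** (-np.arange(half) / half)  # standard RoPE frequency ladder
    pos = np.array([(t, h, w)[i % 3] for i in range(half)], dtype=float)
    return pos * freqs                         # rotation angle for each pair

# A chunked layout, by contrast, would use something like
#   pos = np.array([t] * 22 + [h] * 21 + [w] * 21, dtype=float)
# so each axis stays inside one contiguous frequency band.
print(interleaved_mrope_angles(t=12, h=3, w=7)[:6])
```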
Alibaba trained Qwen3-VL in four phases using up to 10,000 GPUs. After an initial alignment stage linking images and text, the model went through full multimodal training on around one trillion tokens drawn from diverse sources, including web scrapes, three million PDFs from Common Crawl, and over 60 million STEM tasks. The context window was expanded progressively from 8K tokens to 32K and finally to 262,144 tokens. The "Thinking" variants received additional training to lay out their reasoning steps explicitly, improving performance on complex problems.
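As a rough picture of that progression, the staged schedule can be written down as a simple configuration. The stage names and the pairing of stages to context lengths below are assumptions for illustration; only the 8K to 32K to 262,144 progression comes from the report.

```python
# Illustrative staged-training schedule; stage names and the mapping of
# stages to context lengths are assumptions, not from the report.
SCHEDULE = [
    ("vision-language alignment", 8_192),
    ("full multimodal pretraining", 8_192),
    ("long-context extension", 32_768),
    ("ultra-long-context extension", 262_144),
]

for stage, max_len in SCHEDULE:
    print(f"{stage:>29}: sequences up to {max_len:,} tokens")
```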
All variants of Qwen3-VL released since September are available under the Apache 2.0 license with open weights on Hugging Face. This includes dense models ranging from 2B to 32B parameters, as well as mixture-of-experts models like the 30B-A3B and the massive 235B-A22B. While some features, such as extracting frames from lengthy videos, are not entirely new—Google’s Gemini 1.5 Pro demonstrated similar capabilities in early 2024—Qwen3-VL offers competitive performance in an open-source framework. Given the popularity of the previous Qwen2.5-VL model in research circles, the latest iteration is likely to accelerate further open-source innovations in the field.
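Since the weights are on Hugging Face, trying the model follows the usual transformers workflow. The snippet below is a minimal sketch assuming a recent transformers release with multimodal chat-template support; the repository id, the placeholder image URL, and the generic Auto classes are illustrative choices, so the snippet on the official model card should take precedence.

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-8B-Instruct"  # illustrative dense variant
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

# One image plus a question, in the standard chat-template format.
messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://example.com/chart.png"},  # placeholder URL
    {"type": "text", "text": "What trend does this chart show?"},
]}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, skipping the prompt.
answer = processor.decode(
    output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer)
```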