Nvidia has introduced Nemotron 3 Nano Omni, an open multimodal model that processes text, images, video, and audio within a single architecture. Beyond performance, the release stands out for how openly it documents its training data, much of which was generated with the help of other models such as Qwen, GPT-OSS, Kimi, and DeepSeek-OCR.
Nemotron 3 Nano Omni is an open multimodal model with 30 billion parameters. It uses a hybrid Mamba-Transformer architecture with Mixture-of-Experts layers, so only about three billion parameters are active for any given token. The model builds on Nvidia’s own C-RADIOv4-H vision encoder and Parakeet-TDT audio encoder and offers a context window of up to 256,000 tokens, although it officially supports only English.
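To illustrate why a Mixture-of-Experts model runs only a fraction of its parameters for a given input, here is a minimal sketch of generic top-k expert routing. It is not Nvidia’s implementation: the expert count, the value of k, the router scores, and the function names `top_k_route` and `moe_forward` are all hypothetical, chosen only to keep the example small.

```python
import math

def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and normalise their gate weights."""
    # sorted() is stable, so tied scores keep their original index order.
    chosen = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the chosen scores only, so the gate weights sum to 1.
    peak = max(router_logits[i] for i in chosen)
    exps = [math.exp(router_logits[i] - peak) for i in chosen]
    total = sum(exps)
    return chosen, [e / total for e in exps]

def moe_forward(x, experts, router_logits, k):
    """Run only the chosen experts and mix their outputs by gate weight."""
    chosen, gates = top_k_route(router_logits, k)
    output = sum(g * experts[i](x) for i, g in zip(chosen, gates))
    return output, chosen

# Hypothetical toy setup: 8 experts, each a simple scalar multiplier.
experts = [lambda x, w=w: w * x for w in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.2]

output, chosen = moe_forward(3.0, experts, router_logits, k=2)
active_fraction = len(chosen) / len(experts)
print(chosen, output, active_fraction)
```

In this toy case two of eight experts run, so a quarter of the expert parameters are touched. A real model also contains parameters shared by every input (attention or Mamba layers, embeddings), so the ratio of roughly 3 billion active to 30 billion total reported for Nemotron 3 Nano Omni cannot be read directly as a ratio of experts.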
According to Nvidia’s technical report, the model is designed for agentic applications, including document processing, computer-use agents, video and audio analysis, and voice interaction. In Nvidia’s benchmarks, Nemotron 3 Nano Omni surpasses its predecessor, Nemotron Nano V2 VL, and competes closely with Alibaba’s Qwen3-Omni. It improves on tests such as OCRBenchV2, MMLongBench-Doc, WorldSense, and VoiceBench, and the gain is especially pronounced on the OSWorld benchmark for GUI agents, where the score rose from 11.1 to 47.4 points compared to the earlier version. Nvidia also claims that throughput at equivalent interactivity levels can be up to nine times that of Qwen3-Omni.
While the performance metrics are notable, the way the training data was assembled offers further insight into the model’s development. Nvidia trained the model on approximately 717 billion tokens across seven distinct stages, progressively expanding the context window. A substantial portion of the training data is synthetic and was generated with competing models: image captions, question-answer pairs, and reasoning traces were produced by models such as Qwen3-VL-30B-A3B-Instruct, Qwen3.5-122B-A10B, gpt-oss-120b, and others. Nvidia also used GPT-4o and Gemini 3 Flash Preview for filtering.
Using other models to generate training data is common practice in the AI industry, although few companies are as transparent about it. Prominent firms like OpenAI, Anthropic, and Google have previously accused Chinese AI laboratories of large-scale distillation.
The audio data used for Nemotron 3 Nano Omni includes Nvidia’s own Granary and SIFT-50M datasets, supplemented by captions from Qwen’s Omni-Captioner. For the reinforcement learning phase, Nvidia used a five-stage pipeline across 25 distinct environments, covering tasks such as visual grounding, chart and document understanding, GUI interaction, and automatic speech recognition.
Nvidia’s release includes not only the model weights in BF16, FP8, and NVFP4 but also portions of the training data, the training pipelines for Megatron-Bridge, and reinforcement learning recipes for NeMo-RL. This sets it apart from releases that offer only model weights. The model’s reasoning mode is enabled by default, so users must disable it manually for tasks that do not need a chain of thought. The model is distributed under the NVIDIA Open Model Agreement, which permits commercial use.
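As a rough guide to what the three weight formats mean in practice, the following back-of-the-envelope sketch estimates the storage needed for the weights alone. It assumes nominal widths of 16, 8, and 4 bits per weight and a flat 30 billion parameters. It ignores the scaling metadata that block-scaled formats such as NVFP4 carry, any layers kept in higher precision, and runtime memory such as the KV cache and activations, so real checkpoint sizes will differ.

```python
# Figures from the article: 30B total parameters, about 3B active.
TOTAL_PARAMS = 30e9
ACTIVE_PARAMS = 3e9

# Nominal bits per weight (assumption: no scaling metadata, no mixed precision).
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}

def weight_gigabytes(params, bits):
    """Storage for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9

footprint = {name: weight_gigabytes(TOTAL_PARAMS, bits)
             for name, bits in BITS_PER_WEIGHT.items()}
active_share = ACTIVE_PARAMS / TOTAL_PARAMS

for name, gigabytes in footprint.items():
    print(f"{name}: about {gigabytes:.0f} GB of weights")
print(f"Active share of parameters: {active_share:.0%}")
```

Under these assumptions the weights take about 60 GB in BF16, 30 GB in FP8, and 15 GB in NVFP4. All 30 billion parameters still have to be stored even though only about 10% of them are active at a time, so the Mixture-of-Experts design reduces compute rather than storage.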
The unveiling of the Nemotron 3 Nano Omni highlights Nvidia’s commitment to advancing multimodal AI technologies, setting a new benchmark in performance and transparency for the industry. As competition intensifies among AI developers, developments such as this will continue to shape the landscape of artificial intelligence applications across diverse sectors.