On December 5, 2025, SenseTime, in collaboration with Nanyang Technological University and several other research teams, unveiled NEO, touted as the world’s first scalable, open-source native multimodal architecture (Native VLM). The release marks a break from traditional modular “assembly-style” models, heralding a move toward genuine multimodal fusion in artificial intelligence.
NEO diverges from conventional vision-language systems such as GPT-4V and Claude 3.5, which typically rely on a pipeline of a separate vision encoder, a projection layer, and a language model. Instead, it features a unified multimodal “brain” designed to process all input modalities within a single network. This design rests on three native technologies: Native Patch Embedding, which constructs high-fidelity visual representations directly from pixel data; Native 3D Rotary Position Encoding, which allocates dedicated frequency bands to spatiotemporal coordinates; and Native Multi-Head Attention, which enables shared attention patterns across text and vision to close the semantic gap at the architectural level.
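To make the position-encoding idea concrete, the sketch below shows one plausible reading of 3D rotary position encoding: each attention head’s channels are split into three frequency bands for time, height, and width, and every token carries a (t, h, w) coordinate, with text tokens advancing along the time axis only. This is an illustrative reconstruction based on the description above, not SenseTime’s released implementation; the function names and the even three-way channel split are assumptions.

```python
# Illustrative sketch of 3D rotary position encoding, NOT SenseTime's NEO code.
# Assumption: head_dim is split evenly across the (t, h, w) axes, and text
# tokens advance only along the time axis.
import torch

def rope_freqs(dim: int, base: float = 10000.0) -> torch.Tensor:
    """Inverse frequencies for one axis; `dim` must be even."""
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

def rotary_3d(x: torch.Tensor, pos: torch.Tensor) -> torch.Tensor:
    """Apply 3D rotary position encoding to queries or keys.

    x:   (seq, heads, head_dim), with head_dim divisible by 6
    pos: (seq, 3) integer (t, h, w) coordinates per token
    """
    seq, heads, head_dim = x.shape
    axis_dim = head_dim // 3                          # channels per axis
    angles = []
    for axis in range(3):                             # t, h, w get separate bands
        inv = rope_freqs(axis_dim)                    # (axis_dim / 2,)
        angles.append(pos[:, axis:axis + 1].float() * inv)  # (seq, axis_dim / 2)
    theta = torch.cat(angles, dim=-1)                 # (seq, head_dim / 2)
    cos = theta.cos()[:, None, :]                     # broadcast over heads
    sin = theta.sin()[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]               # interleaved channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin              # standard 2D rotation
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Example: 4 text tokens (time axis only) followed by a 2x2 grid of image patches.
text_pos  = torch.tensor([[t, 0, 0] for t in range(4)])
image_pos = torch.tensor([[4, h, w] for h in range(2) for w in range(2)])
pos = torch.cat([text_pos, image_pos])                # (8, 3)
q = torch.randn(8, 2, 12)                             # head_dim=12 -> 4 per axis
q_rot = rotary_3d(q, pos)                             # keys are rotated the same way
print(q_rot.shape)                                    # torch.Size([8, 2, 12])
```

Because the rotation is relative, the dot product between two rotated vectors depends only on their coordinate offsets, which is what lets a single attention operation treat text order and image geometry uniformly.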
Initial evaluations indicate that NEO is competitive with leading models such as Qwen2-VL and InternVL3, particularly on visual tasks including AI2D and DocVQA. Remarkably, NEO achieves this with only 390 million image-text pairs, roughly one-tenth of the data used by comparable multimodal models. On broader benchmarks such as MMMU and MMBench, NEO matches or surpasses existing native VLMs, underscoring its overall capability.
Further enhancing its appeal, NEO’s models, spanning roughly 2 billion to 9 billion parameters, offer strong cost-efficiency at inference, making them well suited to deployment on mobile devices, in robotics, and in other edge scenarios. SenseTime has already open-sourced the 2 billion and 9 billion parameter versions of NEO and plans to extend the architecture to video understanding and 3D interaction. This expansion aims to broaden the capabilities of multimodal AI while helping advanced artificial intelligence move from centralized cloud systems to more accessible edge devices.
As a groundbreaking framework, NEO contributes to the evolving landscape of artificial intelligence and underscores the strides Chinese researchers are making in global AI architecture innovation. Its release may catalyze further advances in multimodal technology, reshaping how machines understand and interact with the world, with potential applications across a wide range of sectors.