
SenseTime Launches NEO, First Native Multimodal Architecture, Outperforming Top Models

SenseTime unveils NEO, the world’s first open-source native multimodal architecture, achieving top performance with just 390 million image-text pairs, outpacing leading models.

On December 5, 2025, SenseTime, in collaboration with Nanyang Technological University and several other research teams, unveiled NEO, billed as the world’s first scalable, open-source native multimodal architecture (Native VLM). The release marks a break from traditional modular, “assembly-style” models and points toward genuine multimodal fusion in artificial intelligence.

NEO diverges from conventional systems such as GPT-4V and Claude 3.5, which typically rely on a pipeline of a vision encoder, a projection layer, and a language model. Instead, it uses a unified multimodal “brain” designed to integrate different input modalities seamlessly. This design rests on three native technologies: Native Patch Embedding, which builds high-fidelity visual representations directly from pixel data; Native 3D Rotary Position Encoding, which allocates dedicated frequencies to spatiotemporal information; and Native Multi-Head Attention, which lets text and vision tokens share collaborative attention patterns, closing the semantic gap at the architectural level.
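To make the position-encoding idea concrete, here is a minimal sketch of how a 3D rotary scheme can assign separate frequency bands to time, height, and width so that text tokens and image patches share one attention space. This is an illustrative toy in PyTorch, not SenseTime’s released code; the band split, frequency base, and token layout are assumptions for demonstration.

```python
import torch

def rope_freqs(dim: int, theta: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse-frequency schedule for one positional axis.
    return 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))

def rope_3d(x: torch.Tensor, positions: torch.Tensor, head_dim: int) -> torch.Tensor:
    # x: (seq, head_dim) query/key vectors; positions: (seq, 3) integer (t, h, w).
    # Split the head dimension into three equal bands (head_dim must be
    # divisible by 6) so temporal, vertical, and horizontal offsets each
    # rotate their own slice of the vector.
    band = head_dim // 3
    rotated = []
    for axis in range(3):
        part = x[:, axis * band:(axis + 1) * band]
        angles = positions[:, axis:axis + 1].float() * rope_freqs(band)  # (seq, band/2)
        cos, sin = angles.cos(), angles.sin()
        p1, p2 = part[:, 0::2], part[:, 1::2]
        pair = torch.stack([p1 * cos - p2 * sin, p1 * sin + p2 * cos], dim=-1)
        rotated.append(pair.flatten(-2))
    return torch.cat(rotated, dim=-1)

# Toy sequence: 4 text tokens stepping along t, then a 2x2 patch grid at t=4.
text_pos = torch.tensor([[t, 0, 0] for t in range(4)])
img_pos = torch.tensor([[4, h, w] for h in range(2) for w in range(2)])
positions = torch.cat([text_pos, img_pos])       # (8, 3)

q = torch.randn(8, 48)                           # 8 tokens, head_dim = 48
q_rot = rope_3d(q, positions, head_dim=48)
print(q_rot.shape)                               # torch.Size([8, 48])
```

In this toy layout the text tokens advance only along the temporal axis while the image patches share a timestamp and vary in height and width, so relative spatiotemporal offsets are encoded directly in the rotation angles rather than bolted on through a separate projection stage.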

Initial evaluations indicate that NEO is competitive with leading models such as Qwen2-VL and InternVL3, particularly on visual tasks including AI2D and DocVQA. Notably, NEO achieves this with just 390 million image-text pairs, roughly one-tenth of the data used by comparable multimodal models. On broader benchmarks such as MMMU and MMBench, NEO matches or surpasses existing native VLMs, demonstrating well-rounded capability.

Further enhancing its appeal, NEO’s models, spanning 2 billion to 9 billion parameters, offer strong cost-efficiency at inference, making them well suited to deployment on mobile devices, in robotics, and in other edge scenarios. SenseTime has already open-sourced the 2-billion and 9-billion-parameter versions of NEO and plans to extend the architecture to video understanding and 3D interaction. The expansion aims to broaden what multimodal AI can do while helping advanced AI move from centralized cloud systems to more accessible edge devices.

As a groundbreaking framework, NEO both advances the state of multimodal AI and underscores the strides Chinese researchers are making in global AI architecture innovation. Its release may catalyze further progress in multimodal technology, reshaping how machines understand and interact with the world across a wide range of sectors.


