
Beijing Academy of Artificial Intelligence Publishes Emu3 Model in Nature, Advancing Multimodal Learning

Beijing Academy of Artificial Intelligence’s Emu3 model, published in Nature, achieves state-of-the-art multimodal learning, surpassing competitors in performance benchmarks.

In a significant breakthrough for artificial intelligence, the Beijing Academy of Artificial Intelligence (BAAI) announced that its multimodal large model "Wujie·Emu3" was published in the main issue of Nature on January 29, 2026. This makes BAAI the second Chinese research team, after DeepSeek, to have a large model published in the journal, and marks China's first Nature paper in the multimodal large model domain.

The editors of Nature highlighted Emu3's capabilities, stating that it "achieves unified learning of large-scale text, images, and videos based solely on 'predicting the next token'." The model's performance on generation and perception tasks rivals that of specialized systems, an advance poised to shape the development of native multimodal assistants and embodied intelligence.

Launched in October 2024, Emu3 has shown impressive versatility, excelling at text-to-image and text-to-video generation, future prediction, and vision-language understanding. Its autoregressive approach is notable for its simplicity, offering a unified pathway for generative AI. On benchmarks it surpasses existing models: it outperformed diffusion models such as SDXL in image generation and scored 81 on VBench for video generation, edging out models like Open-Sora 1.2.

Jack Clark, co-founder of Anthropic and former head of policy at OpenAI, remarked on Emu3's architectural route, emphasizing its simplicity and potential for scaling. BAAI's president, Wang Zhongyuan, echoed this sentiment: "The simpler the architecture, the greater the potential productivity and the greater the value to the industry." The streamlined approach reduces complexity in research and development and makes models more efficient to build and maintain.

By October 2025, the Emu series had evolved into a multimodal world model, Emu3.5, which can simulate exploration and operations in virtual environments. This iteration not only achieved state-of-the-art multimodal performance but also introduced a "multimodal scaling paradigm," under which the model learns the inherent laws of world evolution. The innovation holds promise for physical AI, particularly embodied intelligence.

The journey to publish Emu3 in Nature involved overcoming significant challenges. Initiated in February 2024, the project set out to test whether the autoregressive route could unify multimodality, an open question in the field. The 50-person team undertook the ambitious task of building a high-performance, native multimodal large model on an autoregressive architecture: images, text, and videos are discretized into a shared representation space, and a single Transformer is trained jointly on the resulting multimodal sequences.
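The recipe described above, quantizing every modality into one discrete token space and then training a single model with a next-token objective, can be sketched in a few lines. Everything below is illustrative: the vocabulary sizes, the toy tokenizers, and the sentinel token are stand-ins, not BAAI's actual implementation.

```python
TEXT_VOCAB = 1000    # illustrative text vocabulary size
VISION_VOCAB = 500   # illustrative visual codebook size

def tokenize_text(words):
    # Toy stand-in for a subword tokenizer: hash words into the text id range.
    return [hash(w) % TEXT_VOCAB for w in words]

def tokenize_image(pixels):
    # Toy stand-in for a vector-quantized visual tokenizer: map pixel values
    # to codebook ids, offset past the text vocabulary so both modalities
    # share a single id space.
    return [TEXT_VOCAB + (p % VISION_VOCAB) for p in pixels]

def build_sequence(words, pixels):
    # Interleave modalities into one discrete sequence -- the key move that
    # lets a single next-token objective cover text, images, and video.
    BOI = TEXT_VOCAB + VISION_VOCAB  # "begin of image" sentinel token
    return tokenize_text(words) + [BOI] + tokenize_image(pixels)

seq = build_sequence(["a", "red", "square"], [12, 250, 99, 7])

# Training pairs for next-token prediction: predict seq[i] from seq[:i].
pairs = [(seq[:i], seq[i]) for i in range(1, len(seq))]
```

A real system replaces the toy tokenizers with a learned visual codebook and trains a Transformer with cross-entropy over these (context, target) pairs; the point is that once everything is tokens, no modality-specific training objective is needed.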

The path was fraught with technical risk, particularly the "discrete token" approach, which in effect had to reinvent a language-like token system for visual modalities. Compressing images into tokens effectively proved difficult and led to repeated setbacks. In a competitive landscape where many teams abandoned the route, BAAI stayed committed to a unified multimodal model, believing that a model capable of understanding the physical world was essential for progress toward artificial general intelligence (AGI).

Since its release, Emu3 has markedly shaped the multimodal landscape, gaining recognition across a range of tasks. In text-to-image generation it is comparable to leading diffusion models, and it holds its own in vision-language understanding without relying on a specialized pre-trained large language model. Its video generation produces competitive five-second clips at 24 frames per second, surpassing established models at that length.

As industries increasingly recognize Emu3’s contributions, its success reinforces the potential for autoregressive technology to serve as a foundation for unified multimodal learning. This development not only influences the direction of AI research but also lays the groundwork for future advancements in intelligent systems that can seamlessly integrate diverse modalities.

© 2025 AIPressa · Part of Buzzora Media · All rights reserved.