
Beijing Academy of Artificial Intelligence Publishes Emu3 Model in Nature, Advancing Multimodal Learning

Beijing Academy of Artificial Intelligence’s Emu3 model, published in Nature, achieves state-of-the-art multimodal learning, surpassing competitors in performance benchmarks.

In a significant breakthrough for artificial intelligence, the Beijing Academy of Artificial Intelligence (BAAI) has seen its multimodal large model "Wujie·Emu3" published in the main issue of Nature. This makes BAAI the second Chinese research team, following DeepSeek, to have a large model published in the prestigious journal, and marks a milestone as China's first Nature paper in the multimodal large model domain.

The editors of Nature highlighted Emu3's capabilities, stating it "achieves unified learning of large-scale text, images, and videos based solely on 'predicting the next token.'" The model demonstrates performance in generation and perception tasks that rivals specialized systems. This advancement is poised to significantly influence the development of native multimodal assistants and embodied intelligence.

Launched in October 2024, Emu3 has shown impressive versatility, excelling in text-to-image and text-to-video generation, future prediction, and visual-language understanding. Its autoregressive approach is notable for its simplicity, fostering a unified pathway for generative AI. Its performance metrics reveal it surpasses existing models on various benchmarks; for instance, it outperformed diffusion models such as SDXL in image generation and achieved a score of 81 on the VBench benchmark for video generation, edging out models like Open-Sora 1.2.

Jack Clark, co-founder of Anthropic and former head of policy at OpenAI, remarked on Emu3's architectural route, emphasizing its simplicity and scalability potential. BAAI's president, Wang Zhongyuan, echoed this sentiment, stating, "The simpler the architecture, the greater the potential productivity and the greater the value to the industry." The streamlined approach not only reduces complexity in the research and development process but also enhances efficiency in model construction and maintenance.

By October 2025, the Emu series had evolved into a multimodal world model, Emu3.5, which can simulate exploration and operations in virtual environments. This iteration not only achieved state-of-the-art performance in multimodality but also introduced the concept of a "multimodal scaling paradigm," allowing the model to learn the inherent laws of world evolution. This innovation holds promise for the future development of physical AI fields, particularly embodied intelligence.

The journey to publish Emu3 in Nature involved overcoming significant challenges. Initiated in February 2024, the project aimed to explore whether the autoregressive technology route could unify multimodality, a question that had remained elusive within the field. The team, comprising 50 researchers, undertook the ambitious task of developing a high-performance, native multimodal large model built on an autoregressive architecture. They pioneered the unified discretization of images, text, and videos into the same representation space, jointly training a single Transformer on multimodal sequence data.
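The unified-discretization idea described above — mapping text and visual content into one shared token vocabulary and training a single model with plain next-token prediction — can be sketched in toy form. Everything below (the tiny vocabularies, the `<boi>`/`<eoi>` boundary tokens, the id-offset scheme) is a hypothetical illustration for intuition, not BAAI's actual tokenizer or training code:

```python
# Toy sketch of the core idea: text tokens and discrete visual tokens
# share one id space, are concatenated into a single flat sequence,
# and training reduces to ordinary next-token prediction on that sequence.

TEXT_VOCAB = {"a": 0, "cat": 1, "<boi>": 2, "<eoi>": 3}  # hypothetical special tokens
VISUAL_VOCAB_OFFSET = len(TEXT_VOCAB)  # visual ids live after the text ids

def encode_text(words):
    return [TEXT_VOCAB[w] for w in words]

def encode_image(patch_codes):
    # A real system obtains patch_codes from a learned vision tokenizer
    # (e.g. a vector-quantized autoencoder); here we just shift
    # pre-quantized codes into the shared vocabulary.
    return [VISUAL_VOCAB_OFFSET + c for c in patch_codes]

def build_sequence(words, patch_codes):
    # Interleave modalities: caption, then the image wrapped in boundary tokens.
    return (encode_text(words)
            + [TEXT_VOCAB["<boi>"]]
            + encode_image(patch_codes)
            + [TEXT_VOCAB["<eoi>"]])

def next_token_pairs(seq):
    # Autoregressive training pairs: predict token t+1 from the prefix ending at t.
    return list(zip(seq[:-1], seq[1:]))

seq = build_sequence(["a", "cat"], [0, 5, 2])
pairs = next_token_pairs(seq)
print(seq)    # one flat token sequence spanning both modalities
print(pairs)  # the (input, target) pairs a next-token objective trains on
```

In the full-scale setting, these shifted pairs would drive a cross-entropy loss over a Transformer's output distribution; the sketch only shows how a single flat token stream can carry both modalities, which is what lets one objective cover generation and perception alike.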

Despite the team’s efforts, the path to success was fraught with technological risks, particularly in adopting a “discrete token” approach that aimed to reinvent a language system for visual modalities. The challenges included effectively compressing images based on tokens, which often led to setbacks. Additionally, amid a competitive landscape, many teams faltered, but BAAI remained resolute in its pursuit of a unified multimodal model. Team members believed that achieving a model capable of understanding the physical world was essential for advancing toward artificial general intelligence (AGI).

Since its release, Emu3 has markedly impacted the multimodal landscape, gaining recognition for its performance across various tasks. In text-to-image generation, it has shown capabilities comparable to leading diffusion models, and it has established its standing in visual-language understanding without relying on specialized pre-trained large language models. Furthermore, Emu3 can generate competitive five-second videos at 24 frames per second, outperforming established models at that task.

As industries increasingly recognize Emu3’s contributions, its success reinforces the potential for autoregressive technology to serve as a foundation for unified multimodal learning. This development not only influences the direction of AI research but also lays the groundwork for future advancements in intelligent systems that can seamlessly integrate diverse modalities.

Written By

The AiPressa Staff team brings you comprehensive coverage of the artificial intelligence industry, including breaking news, research developments, business trends, and policy updates. Our mission is to keep you informed about the rapidly evolving world of AI technology.


© 2025 AIPressa · Part of Buzzora Media · All rights reserved.