In a significant breakthrough for artificial intelligence, the Beijing Academy of Artificial Intelligence (BAAI) has seen its multimodal large model “Wujie·Emu” published in the main issue of Nature on January 29, 2024. This makes BAAI the second Chinese research team, after DeepSeek, to have a large model published in the journal, and marks China’s first Nature paper in the multimodal large model domain.
The editors of Nature highlighted Emu’s capabilities, noting that it “achieves unified learning of large-scale text, images, and videos based solely on ‘predicting the next token.’” The model delivers generation and perception performance that rivals specialized systems, an advance poised to significantly influence the development of native multimodal assistants and embodied intelligence.
Launched in October 2024, the Emu3 model has shown impressive versatility, excelling in text-to-image and text-to-video generation, future prediction, and vision-language understanding. Its autoregressive approach is notable for its simplicity, offering a single unified pathway for generative AI. On benchmarks it surpasses existing models: it outperforms diffusion models such as SDXL in image generation and scores 81 on VBench for video generation, edging out models like Open-Sora 1.2.
Jack Clark, co-founder of Anthropic and former head of policy at OpenAI, remarked on Emu3’s architectural route, highlighting its simplicity and potential to scale. BAAI’s president, Wang Zhongyuan, echoed the sentiment: “The simpler the architecture, the greater the potential productivity and the greater the value to the industry.” The streamlined approach reduces complexity in research and development and makes models easier to build and maintain.
By October 2025, the Emu series had evolved into a multimodal world model, Emu3.5, which can simulate exploration and operation in virtual environments. This iteration not only achieved state-of-the-art multimodal performance but also introduced a “multimodal scaling paradigm,” allowing the model to learn the inherent laws of how the world evolves. The direction holds promise for physical AI, particularly embodied intelligence.
The journey to publish Emu3 in Nature involved overcoming significant challenges. Initiated in February 2024, the project set out to test whether the autoregressive route could unify multimodality, a question that had remained open in the field. A team of 50 researchers took on the ambitious task of building a high-performance, natively multimodal large model on an autoregressive architecture. They pioneered the unified discretization of images, text, and video into a single representation space, jointly training one Transformer on multimodal sequence data.
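Conceptually, that setup reduces to mapping every modality to integer tokens in one shared vocabulary and training a single decoder-only Transformer with the standard next-token cross-entropy loss. The sketch below illustrates the idea in PyTorch; the vocabulary sizes, model dimensions, and interleaving scheme are illustrative assumptions, not BAAI’s actual configuration.

```python
# Minimal sketch of unified next-token training over interleaved text and
# visual tokens. Vocabulary sizes, special handling, and model dimensions are
# illustrative assumptions, not Emu3's real configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

TEXT_VOCAB = 32_000                 # assumed text vocabulary size
VISUAL_VOCAB = 32_768               # assumed visual codebook size
VOCAB = TEXT_VOCAB + VISUAL_VOCAB   # one shared vocabulary for all modalities
D_MODEL, N_HEAD, N_LAYER, MAX_LEN = 512, 8, 6, 1024


class UnifiedDecoder(nn.Module):
    """A single decoder-only Transformer over the shared token space."""

    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, D_MODEL)
        self.pos_emb = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            D_MODEL, N_HEAD, dim_feedforward=4 * D_MODEL,
            batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, N_LAYER)
        self.head = nn.Linear(D_MODEL, VOCAB, bias=False)

    def forward(self, tokens):                       # tokens: (B, T) integer ids
        B, T = tokens.shape
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        x = self.blocks(x, mask=causal)              # causal self-attention only
        return self.head(x)                          # (B, T, VOCAB) logits


def next_token_loss(model, text_ids, image_ids):
    """Interleave text and (offset) visual tokens, then apply the usual
    shifted cross-entropy used for language modeling."""
    seq = torch.cat([text_ids, image_ids + TEXT_VOCAB], dim=1)  # shared id space
    logits = model(seq[:, :-1])
    return F.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))


if __name__ == "__main__":
    model = UnifiedDecoder()
    text = torch.randint(0, TEXT_VOCAB, (2, 16))      # toy caption tokens
    image = torch.randint(0, VISUAL_VOCAB, (2, 64))   # toy discretized image tokens
    print(next_token_loss(model, text, image).item())
```

Because the same objective covers caption tokens and image tokens alike, generating in either modality is simply continued sampling from the same model, which is what makes the route attractive for a unified architecture.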
Despite the team’s efforts, the path to success was fraught with technological risk, particularly in adopting a “discrete token” approach that in effect required inventing a new language for visual modalities. Compressing images into compact token sequences without losing essential detail proved difficult and led to repeated setbacks. Amid a competitive landscape in which many teams faltered, BAAI remained resolute in its pursuit of a unified multimodal model, with team members convinced that a model able to understand the physical world is essential for progress toward artificial general intelligence (AGI).
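The “discrete token” route typically rests on a learned codebook: an encoder maps image patches to continuous feature vectors, and each vector is replaced by the index of its nearest codebook entry, turning an image into a short sequence of integers the autoregressive model can consume like words. Below is a minimal sketch of that quantization step; the patch encoder and codebook size are placeholders rather than the tokenizer described in the paper.

```python
# Sketch of vector-quantized image tokenization: continuous patch features are
# snapped to their nearest codebook entry, and the entry indices become the
# "visual words" fed to the autoregressive model. Sizes are illustrative only.
import torch
import torch.nn as nn


class VQTokenizer(nn.Module):
    def __init__(self, codebook_size=32_768, dim=256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)  # learned visual vocabulary
        # Placeholder patch encoder: a real tokenizer would use a convolutional
        # or transformer encoder trained with a reconstruction objective.
        self.encoder = nn.Conv2d(3, dim, kernel_size=16, stride=16)

    @torch.no_grad()
    def tokenize(self, images):                  # images: (B, 3, H, W)
        feats = self.encoder(images)             # (B, dim, H/16, W/16)
        B, D, h, w = feats.shape
        flat = feats.permute(0, 2, 3, 1).reshape(B, h * w, D)   # (B, N, D)
        # Nearest codebook entry per patch via Euclidean distance.
        dists = torch.cdist(flat, self.codebook.weight.unsqueeze(0))
        return dists.argmin(dim=-1)              # (B, N) integer visual tokens


if __name__ == "__main__":
    tok = VQTokenizer()
    ids = tok.tokenize(torch.randn(1, 3, 256, 256))
    print(ids.shape)  # torch.Size([1, 256]) -> a 16x16 grid of discrete tokens
```

The tension the team faced lives in this step: too small a codebook or too aggressive a downsampling ratio discards visual detail, while too large a token budget makes autoregressive training and generation expensive.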
Since its release, Emu3 has had a marked impact on the multimodal landscape, earning recognition for its performance across tasks. In text-to-image generation it is comparable to leading diffusion models, and it holds its own in vision-language understanding without relying on a specialized pre-trained large language model. Its video generation produces competitive five-second clips at 24 frames per second, outperforming established models at that clip length.
As industries increasingly recognize Emu3’s contributions, its success reinforces the potential for autoregressive technology to serve as a foundation for unified multimodal learning. This development not only influences the direction of AI research but also lays the groundwork for future advancements in intelligent systems that can seamlessly integrate diverse modalities.