
NTU and CUHK Unveil MoTok: 89% Error Reduction in Motion Generation with 1/6 Tokens

NTU and CUHK unveil MoTok, which cuts trajectory error by 89% while using only one-sixth the tokens of current state-of-the-art models for motion generation.

Research teams from Nanyang Technological University and The Chinese University of Hong Kong have developed a new approach to motion generation, addressing longstanding challenges in achieving both natural and controllable movements. Their innovative model, dubbed MoTok, seeks to resolve the trade-off between strong control, which can result in stiff movements, and naturalness, which often leads to deviations in execution.

The MoTok framework distinguishes between two critical aspects of motion generation: high-level semantic planning, which dictates “what to do,” and low-level detail reconstruction, which focuses on “how to do it.” Traditionally, these tasks have been conflated, resulting in conflicting requirements that hinder optimal performance. By separating these processes, MoTok allows for more efficient handling of each component, thereby enhancing both controllability and naturalness.

Employing a diffusion-based discrete motion tokenizer, MoTok introduces a new paradigm for conditional motion generation by effectively integrating the strengths of discrete tokens and continuous diffusion. This method compresses the token count to just one-sixth that of state-of-the-art (SOTA) models while simultaneously reducing trajectory error by 89% (from 0.72 cm to 0.08 cm) and decreasing the Fréchet Inception Distance (FID) by 65% (from 0.083 to 0.029). Under enhanced joint trajectory control, the FID sees a further reduction of 58%, achieving the model’s goal of “the more controlled, the more natural.”
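To make the tokenizer idea concrete, here is a minimal, hypothetical sketch of a discrete motion tokenizer: an encoder snaps each motion frame to its nearest codebook entry (a discrete token), and a decoder reconstructs motion from the tokens. All names, sizes, and the codebook-lookup decoder are illustrative assumptions — in MoTok the decoder is a diffusion model that restores fine detail, which a simple lookup cannot.

```python
import numpy as np

# Hypothetical sketch of a vector-quantized motion tokenizer (not the
# paper's implementation). The encoder maps frames to discrete token
# indices; the decoder here is a plain codebook lookup, standing in for
# MoTok's diffusion-based decoder.

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 16   # number of discrete motion tokens (assumed)
DIM = 4              # per-frame feature dimension (assumed)
codebook = rng.normal(size=(CODEBOOK_SIZE, DIM))

def tokenize(motion: np.ndarray) -> np.ndarray:
    """Map each motion frame to the index of its nearest codebook entry."""
    # Pairwise squared distances between frames and codebook entries.
    d = ((motion[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Coarse reconstruction: look up each token's codebook vector."""
    return codebook[tokens]

motion = rng.normal(size=(8, DIM))   # 8 frames of toy motion data
tokens = tokenize(motion)
recon = detokenize(tokens)
print(tokens.shape, recon.shape)     # (8,) (8, 4)
```

The compression claim follows from this design: because the diffusion decoder handles detail recovery, the tokens only need to carry coarse semantic information, so far fewer of them suffice.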

MoTok employs a Perception–Planning–Control three-stage framework for motion generation. Initially, the Perception phase assesses the conditions, followed by the Planning stage, where semantic planning occurs within a discrete token space. Finally, the Control stage fine-tunes motion details through the diffusion-based decoder. This structured approach allows for flexible adaptation to various inputs and tasks while optimizing performance across the different stages.
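The three-stage flow above can be sketched as follows. Every function body is a toy stand-in invented for illustration: in the actual framework, Planning would be a learned model operating over discrete tokens and Control would be the diffusion-based decoder.

```python
# Hypothetical sketch of the Perception-Planning-Control pipeline.
# The stage interfaces are the point here, not the toy logic inside them.

def perceive(condition: str) -> dict:
    """Perception: parse the conditioning input into features."""
    return {"text": condition, "length": len(condition.split())}

def plan(features: dict, num_tokens: int = 6) -> list:
    """Planning: produce a short discrete token sequence ('what to do')."""
    # Toy rule: derive token ids from word lengths of the prompt.
    words = features["text"].split()
    return [len(w) % 8 for w in words][:num_tokens]

def control(tokens: list, frames_per_token: int = 4) -> list:
    """Control: expand each coarse token into fine-grained frames ('how')."""
    return [t + i / frames_per_token
            for t in tokens
            for i in range(frames_per_token)]

features = perceive("a person walks forward")
tokens = plan(features)
frames = control(tokens)
print(len(tokens), len(frames))  # 4 16
```

The key design choice visible even in this toy version: Planning emits a short, coarse token sequence, and Control alone is responsible for expanding it into dense frames, so the two stages never compete over the same representation.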

In contrast to traditional discrete-token methods, which often struggle to balance high-level semantics with low-level detail retention, MoTok streamlines the token creation process. By capitalizing on the diffusion decoder’s detailed reconstruction capability, MoTok enables the discrete tokens to focus primarily on semantic information beneficial for planning. As a result, the Planning stage becomes more efficient, yielding better outcomes in motion generation tasks.

In comparative experiments, researchers demonstrated MoTok’s superior effectiveness. By replacing traditional decoders with the MoTok diffusion-based decoder, the reconstruction quality improved significantly, showcasing the new model’s enhanced capabilities. Furthermore, when substituting original tokens with MoTok tokens, notable advancements were observed in both text-to-motion (T2M) generation and motion-to-text (M2T) tasks, indicating improved translation accuracy.

MoTok also innovates in its handling of joint trajectory conditions, which often conflict with text-based motion generation. Previous studies indicated that increasing control over joint trajectories typically degraded the quality of motion outputs. MoTok addresses this issue by implementing a coarse-to-fine control injection system. In the Planning stage, joint trajectories are used as coarse constraints, while fine-grained constraints are applied in the Control stage through iterative optimization. This separation facilitates improved harmony between text and motion control conditions, effectively breaking the cycle of conflict seen in prior methodologies.
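A minimal sketch of the coarse-to-fine injection idea, under stated assumptions: the coarse stage pins a few keyframes of a noisy planned trajectory to the target path, and the fine stage runs iterative gradient steps on the trajectory error, standing in for guidance inside the diffusion decoder. The waypoint choice, step size, and loss are all illustrative, not the paper's.

```python
import numpy as np

# Hypothetical sketch of coarse-to-fine trajectory-control injection.
# Coarse (Planning stage): pin sparse keyframes to the target trajectory.
# Fine (Control stage): iterative optimization pulling every frame toward
# the target, a stand-in for guidance in the diffusion decoder.

rng = np.random.default_rng(1)
T = 20
target = np.linspace(0.0, 1.0, T)                # desired joint trajectory
planned = target + rng.normal(scale=0.3, size=T)  # noisy planned motion

# Coarse constraint: snap a few keyframes exactly onto the target.
keyframes = [0, T // 2, T - 1]
planned[keyframes] = target[keyframes]

# Fine constraint: gradient steps on the squared trajectory error.
motion = planned.copy()
for _ in range(50):
    motion -= 0.1 * 2 * (motion - target)  # gradient of (x - target)^2

err_before = np.abs(planned - target).mean()
err_after = np.abs(motion - target).mean()
print(err_after < err_before)  # True
```

The separation matters because each stage only enforces the constraint it is suited for: coarse waypoints shape the plan without over-constraining it, while fine refinement happens where detail is reconstructed anyway, which is how the text and trajectory conditions avoid fighting each other.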

The researchers conducted an ablation study to validate the efficacy of their dual-stream injection approach. Results indicated that merely retaining coarse constraints in the Planning phase significantly increased trajectory control errors, while applying only fine-grained constraints in the Control phase negatively impacted motion distribution. This underscores the importance of MoTok’s comprehensive strategy, which separates the nuances of high-level planning from low-level execution.

In conclusion, MoTok represents a significant advancement in motion generation, allowing high-level semantics and low-level details to operate without constraining each other. By fostering a more natural connection between planning and control, the model enhances controllability, naturalness, and general task applicability. This innovative framework holds promise for broader applications, including embodied agents and digital humans, illustrating the potential for future advancements in AI-driven motion generation.

Project homepage: https://rheallyc.github.io/projects/motok/

Paper link: https://arxiv.org/pdf/2603.19227v1

GitHub link: github.com/rheallyc/MoTok

This article is from the WeChat official account “QbitAI”, author: MoTok team. It is published by 36Kr with authorization.


