
NTU and CUHK Unveil MoTok: 89% Error Reduction in Motion Generation with 1/6 Tokens

NTU and CUHK unveil MoTok, which cuts trajectory error by 89% while using only one-sixth the tokens of current state-of-the-art models for motion generation.

Research teams from Nanyang Technological University and The Chinese University of Hong Kong have developed a new approach to motion generation, addressing longstanding challenges in achieving both natural and controllable movements. Their innovative model, dubbed MoTok, seeks to resolve the trade-off between strong control, which can result in stiff movements, and naturalness, which often leads to deviations in execution.

The MoTok framework distinguishes between two critical aspects of motion generation: high-level semantic planning, which dictates “what to do,” and low-level detail reconstruction, which focuses on “how to do it.” Traditionally, these tasks have been conflated, resulting in conflicting requirements that hinder optimal performance. By separating these processes, MoTok allows for more efficient handling of each component, thereby enhancing both controllability and naturalness.

Employing a diffusion-based discrete motion tokenizer, MoTok introduces a new paradigm for conditional motion generation by effectively integrating the strengths of discrete tokens and continuous diffusion. This method compresses the token count to just one-sixth that of state-of-the-art (SOTA) models while simultaneously reducing trajectory error by 89% (from 0.72 cm to 0.08 cm) and decreasing the Fréchet Inception Distance (FID) by 65% (from 0.083 to 0.029). Under enhanced joint trajectory control, the FID sees a further reduction of 58%, achieving the model’s goal of “the more controlled, the more natural.”
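To make the tokenizer idea concrete, here is a minimal, hypothetical sketch of a discrete motion tokenizer: an encoder snaps each motion frame to its nearest codebook entry (a discrete token), and a decoder reconstructs motion from the tokens. All names, sizes, and the codebook-lookup decoder are illustrative assumptions — in MoTok the decoder is a diffusion model that restores fine detail, which a simple lookup cannot.

```python
import numpy as np

# Hypothetical sketch of a vector-quantized motion tokenizer (not the
# paper's implementation). The encoder maps frames to discrete token
# indices; the decoder here is a plain codebook lookup, standing in for
# MoTok's diffusion-based decoder.

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 16   # number of discrete motion tokens (assumed)
DIM = 4              # per-frame feature dimension (assumed)
codebook = rng.normal(size=(CODEBOOK_SIZE, DIM))

def tokenize(motion: np.ndarray) -> np.ndarray:
    """Map each motion frame to the index of its nearest codebook entry."""
    # Pairwise squared distances between frames and codebook entries.
    d = ((motion[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

def detokenize(tokens: np.ndarray) -> np.ndarray:
    """Coarse reconstruction: look up each token's codebook vector."""
    return codebook[tokens]

motion = rng.normal(size=(8, DIM))   # 8 frames of toy motion data
tokens = tokenize(motion)
recon = detokenize(tokens)
print(tokens.shape, recon.shape)     # (8,) (8, 4)
```

The compression claim follows from this design: because the diffusion decoder handles detail recovery, the tokens only need to carry coarse semantic information, so far fewer of them suffice.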

MoTok employs a Perception–Planning–Control three-stage framework for motion generation. Initially, the Perception phase assesses the conditions, followed by the Planning stage, where semantic planning occurs within a discrete token space. Finally, the Control stage fine-tunes motion details through the diffusion-based decoder. This structured approach allows for flexible adaptation to various inputs and tasks while optimizing performance across the different stages.
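The three-stage flow above can be sketched as follows. Every function body is a toy stand-in invented for illustration: in the actual framework, Planning would be a learned model operating over discrete tokens and Control would be the diffusion-based decoder.

```python
# Hypothetical sketch of the Perception-Planning-Control pipeline.
# The stage interfaces are the point here, not the toy logic inside them.

def perceive(condition: str) -> dict:
    """Perception: parse the conditioning input into features."""
    return {"text": condition, "length": len(condition.split())}

def plan(features: dict, num_tokens: int = 6) -> list:
    """Planning: produce a short discrete token sequence ('what to do')."""
    # Toy rule: derive token ids from word lengths of the prompt.
    words = features["text"].split()
    return [len(w) % 8 for w in words][:num_tokens]

def control(tokens: list, frames_per_token: int = 4) -> list:
    """Control: expand each coarse token into fine-grained frames ('how')."""
    return [t + i / frames_per_token
            for t in tokens
            for i in range(frames_per_token)]

features = perceive("a person walks forward")
tokens = plan(features)
frames = control(tokens)
print(len(tokens), len(frames))  # 4 16
```

The key design choice visible even in this toy version: Planning emits a short, coarse token sequence, and Control alone is responsible for expanding it into dense frames, so the two stages never compete over the same representation.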

In contrast to traditional discrete-token methods, which often struggle to balance high-level semantics with low-level detail retention, MoTok streamlines the token creation process. By capitalizing on the diffusion decoder’s detailed reconstruction capability, MoTok enables the discrete tokens to focus primarily on semantic information beneficial for planning. As a result, the Planning stage becomes more efficient, yielding better outcomes in motion generation tasks.

In comparative experiments, researchers demonstrated MoTok’s superior effectiveness. By replacing traditional decoders with the MoTok diffusion-based decoder, the reconstruction quality improved significantly, showcasing the new model’s enhanced capabilities. Furthermore, when substituting original tokens with MoTok tokens, notable advancements were observed in both text-to-motion (T2M) generation and motion-to-text (M2T) tasks, indicating improved translation accuracy.

MoTok also innovates in its handling of joint trajectory conditions, which often conflict with text-based motion generation. Previous studies indicated that increasing control over joint trajectories typically degraded the quality of motion outputs. MoTok addresses this issue by implementing a coarse-to-fine control injection system. In the Planning stage, joint trajectories are used as coarse constraints, while fine-grained constraints are applied in the Control stage through iterative optimization. This separation facilitates improved harmony between text and motion control conditions, effectively breaking the cycle of conflict seen in prior methodologies.
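A minimal sketch of the coarse-to-fine injection idea, under stated assumptions: the coarse stage pins a few keyframes of a noisy planned trajectory to the target path, and the fine stage runs iterative gradient steps on the trajectory error, standing in for guidance inside the diffusion decoder. The waypoint choice, step size, and loss are all illustrative, not the paper's.

```python
import numpy as np

# Hypothetical sketch of coarse-to-fine trajectory-control injection.
# Coarse (Planning stage): pin sparse keyframes to the target trajectory.
# Fine (Control stage): iterative optimization pulling every frame toward
# the target, a stand-in for guidance in the diffusion decoder.

rng = np.random.default_rng(1)
T = 20
target = np.linspace(0.0, 1.0, T)                # desired joint trajectory
planned = target + rng.normal(scale=0.3, size=T)  # noisy planned motion

# Coarse constraint: snap a few keyframes exactly onto the target.
keyframes = [0, T // 2, T - 1]
planned[keyframes] = target[keyframes]

# Fine constraint: gradient steps on the squared trajectory error.
motion = planned.copy()
for _ in range(50):
    motion -= 0.1 * 2 * (motion - target)  # gradient of (x - target)^2

err_before = np.abs(planned - target).mean()
err_after = np.abs(motion - target).mean()
print(err_after < err_before)  # True
```

The separation matters because each stage only enforces the constraint it is suited for: coarse waypoints shape the plan without over-constraining it, while fine refinement happens where detail is reconstructed anyway, which is how the text and trajectory conditions avoid fighting each other.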

The researchers conducted an ablation study to validate the efficacy of their dual-stream injection approach. Results indicated that merely retaining coarse constraints in the Planning phase significantly increased trajectory control errors, while applying only fine-grained constraints in the Control phase negatively impacted motion distribution. This underscores the importance of MoTok’s comprehensive strategy, which separates the nuances of high-level planning from low-level execution.

In conclusion, MoTok represents a significant advancement in motion generation, allowing high-level semantics and low-level details to operate without constraining each other. By fostering a more natural connection between planning and control, the model enhances controllability, naturalness, and general task applicability. This innovative framework holds promise for broader applications, including embodied agents and digital humans, illustrating the potential for future advancements in AI-driven motion generation.

Project homepage: https://rheallyc.github.io/projects/motok/

Paper link: https://arxiv.org/pdf/2603.19227v1

GitHub link: github.com/rheallyc/MoTok

This article is from the WeChat official account “QbitAI”, author: MoTok team. It is published by 36Kr with authorization.


