TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts
Guo Chuan ; Zuo Xinxin ; Wang Sen ; Cheng Li

Abstract
Inspired by the strong ties between vision and language, the two intimate human sensing and communication modalities, our paper aims to explore the generation of 3D human full-body motions from texts, as well as its reciprocal task, abbreviated as text2motion and motion2text, respectively. To tackle the existing challenges, especially to enable the generation of multiple distinct motions from the same text, and to avoid the undesirable production of trivial motionless pose sequences, we propose the use of motion tokens, a discrete and compact motion representation. This provides a level playing field when considering both motion and text signals, as motion and text tokens, respectively. Moreover, our motion2text module is integrated into the inverse alignment process of our text2motion training pipeline, where a significant deviation of the synthesized text from the input text is penalized by a large training loss; empirically this is shown to effectively improve performance. Finally, the mappings between the two modalities of motions and texts are facilitated by adapting the neural model for machine translation (NMT) to our context. This autoregressive modeling of the distribution over discrete motion tokens further enables non-deterministic production of pose sequences, of variable lengths, from an input text. Our approach is flexible and can be used for both text2motion and motion2text tasks. Empirical evaluations on two benchmark datasets demonstrate the superior performance of our approach on both tasks over a variety of state-of-the-art methods. Project page: https://ericguo5513.github.io/TM2T/
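The abstract's point that autoregressive modeling over discrete motion tokens yields non-deterministic, variable-length outputs can be illustrated with a small sketch. This is not the authors' implementation: the function names, the vocabulary size, the end-token id, and the toy scoring rule standing in for a learned network are all assumptions made for illustration.

```python
import random

END = 0  # hypothetical end-of-motion token id (an assumption, not from the paper)


def next_token_probs(text_ids, prefix, vocab_size):
    """Toy stand-in for a learned p(token | text, prefix); NOT the TM2T network."""
    # Arbitrary but deterministic positive scores that depend on the context.
    context = sum(text_ids) * 31 + len(prefix) * 17
    weights = [1.0 + ((context + v * 7) % 5) for v in range(vocab_size)]
    # Make the end token more likely as the sequence grows.
    weights[END] *= 0.2 + 0.1 * len(prefix)
    total = sum(weights)
    return [w / total for w in weights]


def sample_motion_tokens(text_ids, vocab_size=8, max_len=20, seed=None):
    """Draw one token sequence, one token at a time, until END or max_len."""
    rng = random.Random(seed)
    tokens = []
    while len(tokens) < max_len:
        probs = next_token_probs(text_ids, tokens, vocab_size)
        tok = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        if tok == END:
            break  # stochastic stopping gives variable-length sequences
        tokens.append(tok)
    return tokens


if __name__ == "__main__":
    text = [3, 1, 4]  # hypothetical text token ids
    for s in range(3):
        print(s, sample_motion_tokens(text, seed=s))
```

Because each token is sampled from a distribution rather than chosen by a fixed rule, repeated calls with the same text but different seeds can return different sequences of different lengths. In a full system, a separate decoder would map the sampled tokens back to poses.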
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| motion-captioning-on-humanml3d | TM2T | BERTScore: 37.8; BLEU-4: 22.3 |
| motion-captioning-on-kit-motion-language | TM2T | BERTScore: 23.0; BLEU-4: 18.4 |
| motion-synthesis-on-humanml3d | TM2T | Diversity: 8.589; FID: 1.501; Multimodality: 2.424; R Precision Top3: 0.729 |
| motion-synthesis-on-humanml3d | Text2Gesture | Diversity: 6.409; FID: 5.012; R Precision Top3: 0.345 |
| motion-synthesis-on-humanml3d | Language2Pose | Diversity: 7.676; FID: 11.02; R Precision Top3: 0.486 |
| motion-synthesis-on-kit-motion-language | Text2Gesture | Diversity: 9.334; FID: 12.12; R Precision Top3: 0.338 |
| motion-synthesis-on-kit-motion-language | TM2T | Diversity: 9.473; FID: 3.599; Multimodality: 3.292; R Precision Top3: 0.587 |
| motion-synthesis-on-kit-motion-language | Language2Pose | Diversity: 9.073; FID: 6.545; R Precision Top3: 0.483 |