
Abstract

Inspired by the strong ties between vision and language, the two intimate human sensing and communication modalities, our paper aims to explore the generation of 3D human full-body motions from texts, as well as its reciprocal task, shorthanded as text2motion and motion2text, respectively. To tackle the existing challenges, especially to enable the generation of multiple distinct motions from the same text and to avoid the undesirable production of trivial motionless pose sequences, we propose the use of motion tokens, a discrete and compact motion representation. This provides a level playing field for motion and text signals, which are treated as motion tokens and text tokens, respectively. Moreover, our motion2text module is integrated into the inverse alignment process of our text2motion training pipeline, where a significant deviation of the synthesized text from the input text is penalized by a large training loss; empirically, this is shown to effectively improve performance. Finally, the mappings between the two modalities of motions and texts are facilitated by adapting the neural machine translation (NMT) model to our context. This autoregressive modeling of the distribution over discrete motion tokens further enables non-deterministic production of pose sequences, of variable lengths, from an input text. Our approach is flexible and can be used for both text2motion and motion2text tasks. Empirical evaluations on two benchmark datasets demonstrate the superior performance of our approach on both tasks over a variety of state-of-the-art methods. Project page: https://ericguo5513.github.io/TM2T/
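
As a rough illustration of why autoregressive sampling over discrete tokens yields non-deterministic, variable-length outputs, the toy sketch below draws one token at a time from a categorical distribution until an end token appears. It is not the TM2T model: `next_token_probs` is a hand-written stand-in for a trained network, and `END`, `VOCAB_SIZE`, and `sample_motion_tokens` are hypothetical names introduced only for this example.

```python
import random

END = 0         # hypothetical end-of-sequence token id (assumption)
VOCAB_SIZE = 8  # hypothetical number of token ids, 0..7 (assumption)


def next_token_probs(text, prefix):
    """Toy stand-in for a trained model: returns p(token | text, prefix).

    The probability of END grows with the prefix length, so sampling
    always terminates after at most 10 motion tokens.
    """
    p_end = min(1.0, 0.1 * len(prefix))
    # One motion token is favored, chosen from the text and prefix lengths,
    # so the distribution depends on the conditioning input.
    favored = 1 + (len(text) + len(prefix)) % (VOCAB_SIZE - 1)
    weights = [2.0 if t == favored else 1.0 for t in range(1, VOCAB_SIZE)]
    total = sum(weights)
    return [p_end] + [(1.0 - p_end) * w / total for w in weights]


def sample_motion_tokens(text, rng, max_len=50):
    """Sample tokens one at a time until END is drawn or max_len is hit."""
    tokens = []
    while len(tokens) < max_len:
        probs = next_token_probs(text, tokens)
        token = rng.choices(range(VOCAB_SIZE), weights=probs, k=1)[0]
        if token == END:
            break
        tokens.append(token)
    return tokens


if __name__ == "__main__":
    rng = random.Random()
    # The same text can produce different sequences of different lengths.
    for _ in range(3):
        print(sample_motion_tokens("a person walks forward", rng))
```

Because each token is sampled rather than chosen greedily, repeated calls with the same text generally return different sequences, and the sampled end token determines each sequence's length.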
