Jiang Biao; Chen Xin; Liu Wen; Yu Jingyi; Yu Gang; Chen Tao

Abstract
Though the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. Building upon this "motion vocabulary", we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performance on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.
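The abstract's two central ideas are (1) quantizing continuous 3D motion into discrete "motion tokens" via a learned codebook and (2) merging those tokens with a text vocabulary so one language model handles both modalities. The sketch below illustrates this in PyTorch under stated assumptions; it is not the authors' implementation, and the encoder features, codebook size, text vocabulary size, and the `<som>`/`<eom>` boundary tokens are all illustrative assumptions.

```python
# Minimal sketch of motion tokenization and a unified motion-text token stream.
# All shapes, sizes, and special-token conventions below are assumptions for
# illustration, not the MotionGPT codebase.

import torch

def quantize_motion(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map each per-frame motion feature to the index of its nearest codebook entry.

    features: (T, D) motion features from a motion encoder (assumed given).
    codebook: (K, D) codebook of a VQ-style quantizer (assumed learned).
    returns:  (T,) integer motion-token ids in [0, K).
    """
    dists = torch.cdist(features, codebook)  # (T, K) pairwise Euclidean distances
    return dists.argmin(dim=-1)              # nearest codebook index per frame

# Toy example: 16 frames of 8-dim motion features, a 512-entry codebook.
T, D, K = 16, 8, 512
motion_features = torch.randn(T, D)
codebook = torch.randn(K, D)
motion_ids = quantize_motion(motion_features, codebook)

# Treat motion as a "language": offset motion ids so they sit after the text
# vocabulary, then wrap them in hypothetical <som>/<eom> boundary tokens so a
# single language model can consume one unified token sequence.
text_vocab_size = 32000                                         # assumed
som_id, eom_id = text_vocab_size + K, text_vocab_size + K + 1   # assumed
unified_ids = [som_id] + [text_vocab_size + int(i) for i in motion_ids] + [eom_id]
print(unified_ids[:5])
```

In a full system, the codebook would come from a trained motion VQ-VAE and the unified ids would be fed to the language model alongside ordinary text tokens during the mixed motion-language pre-training and prompt-based fine-tuning stages described above.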
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| Motion Captioning on HumanML3D | MotionGPT | BERTScore: 32.4, BLEU-4: 12.47 |