Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?
Cho-Ying Wu, Chin-Cheng Hsu, Ulrich Neumann

Abstract
This work digs into a fundamental question in human perception: can face geometry be gleaned from one's voices? Previous works that study this question only adopt developments in image synthesis and convert voices into face images to show correlations, but working in the image domain unavoidably involves predicting attributes that voices cannot hint at, including facial textures, hairstyles, and backgrounds. We instead investigate the ability to reconstruct 3D faces to concentrate on geometry alone, which is much more physiologically grounded. We propose our analysis framework, Cross-Modal Perceptionist, under both supervised and unsupervised learning. First, we construct a dataset, Voxceleb-3D, which extends Voxceleb and includes paired voices and face meshes, making supervised learning possible. Second, we use a knowledge distillation mechanism to study whether face geometry can still be gleaned from voices without paired voices and 3D face data, under limited availability of 3D face scans. We break down the core question into four parts and perform visual and numerical analyses as responses to the core question. Our findings echo those in physiology and neuroscience about the correlation between voices and facial structures. The work provides future human-centric cross-modal learning with explainable foundations. See our project page: https://choyingw.github.io/works/Voice2Mesh/index.html
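To make the knowledge-distillation setting in the abstract concrete, the sketch below has a pretrained image-to-3DMM "teacher" supply pseudo shape targets while a voice "student" encoder learns to regress the same parameters, so no 3D scans are required. All module names, dimensions, and the L1 loss are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of voice-to-3D-face knowledge distillation (assumed design).
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Hypothetical student: maps a mel-spectrogram to 3DMM shape coefficients."""
    def __init__(self, n_mels=80, shape_dim=199):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),            # pool over time -> fixed-size embedding
        )
        self.head = nn.Linear(256, shape_dim)   # regress shape coefficients

    def forward(self, mel):                     # mel: (B, n_mels, T)
        return self.head(self.conv(mel).squeeze(-1))

def distillation_step(student, teacher, mel, face_img, optimizer):
    """One training step: match the teacher's shape prediction for the paired face image."""
    with torch.no_grad():
        target_shape = teacher(face_img)        # pseudo label from the image-based teacher
    pred_shape = student(mel)
    loss = nn.functional.l1_loss(pred_shape, target_shape)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The design choice this illustrates is that the teacher's predictions stand in for ground-truth 3D scans, which is what lets the student be trained without paired voice and 3D face data.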
Benchmarks
| Benchmark | Methodology | ARE-CR | ARE-ER | ARE-FR | ARE-MR | Mean ARE |
|---|---|---|---|---|---|---|
| 3D Face Modelling on Voxceleb-3D | CMP (supervised) | 0.0457 | 0.0152 | 0.0186 | 0.0169 | 0.0241 |
| 3D Face Modelling on Voxceleb-3D | CMP (unsupervised) | 0.0480 | 0.0181 | 0.0169 | 0.0174 | 0.0251 |
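As a rough illustration of how a per-region absolute relative error (ARE) like the columns above can be computed, the sketch below measures a normalized vertex error over a set of face regions and averages the per-region scores. The region index sets and the exact normalization are assumptions for illustration, not the benchmark's official definition.

```python
# Hedged sketch of a per-region absolute relative error (ARE) on face meshes.
import numpy as np

def region_are(pred_verts, gt_verts, region_idx):
    """Mean absolute relative error over one face region.

    pred_verts, gt_verts: (N, 3) arrays of mesh vertices in correspondence.
    region_idx: indices of the vertices belonging to the region.
    """
    pred = pred_verts[region_idx]
    gt = gt_verts[region_idx]
    # per-vertex Euclidean error, normalized by the ground-truth vertex norm (assumed)
    err = np.linalg.norm(pred - gt, axis=1) / np.linalg.norm(gt, axis=1)
    return float(err.mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(1000, 3))
    pred = gt + 0.01 * rng.normal(size=(1000, 3))
    # Hypothetical index sets standing in for the benchmark's face regions.
    regions = {"CR": np.arange(0, 250), "ER": np.arange(250, 500),
               "FR": np.arange(500, 750), "MR": np.arange(750, 1000)}
    scores = {name: region_are(pred, gt, idx) for name, idx in regions.items()}
    mean_are = sum(scores.values()) / len(scores)
    print(scores, mean_are)
```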