Cross-Modal Perceptionist: Can Face Geometry be Gleaned from Voices?
Cho-Ying Wu, Chin-Cheng Hsu, Ulrich Neumann

Abstract
This work digs into a fundamental question in human perception: can face geometry be gleaned from one's voices? Previous works that study this question only adopt developments in image synthesis and convert voices into face images to show correlations, but working in the image domain unavoidably involves predicting attributes that voices cannot hint at, including facial textures, hairstyles, and backgrounds. We instead investigate the ability to reconstruct 3D faces to concentrate on geometry alone, which is much more physiologically grounded. We propose our analysis framework, Cross-Modal Perceptionist, under both supervised and unsupervised learning. First, we construct a dataset, Voxceleb-3D, which extends Voxceleb and includes paired voices and face meshes, making supervised learning possible. Second, we use a knowledge distillation mechanism to study whether face geometry can still be gleaned from voices without paired voices and 3D face data, under limited availability of 3D face scans. We break down the core question into four parts and perform visual and numerical analyses as responses to the core question. Our findings echo those in physiology and neuroscience about the correlation between voices and facial structures. The work provides future human-centric cross-modal learning with explainable foundations. See our project page: https://choyingw.github.io/works/Voice2Mesh/index.html
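To make the knowledge-distillation setting in the abstract concrete, the sketch below has a pretrained image-to-3DMM "teacher" supply pseudo shape targets while a voice "student" encoder learns to regress the same parameters, so no 3D scans are required. All module names, dimensions, and the L1 loss are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of voice-to-3D-face knowledge distillation (assumed design).
import torch
import torch.nn as nn

class VoiceEncoder(nn.Module):
    """Hypothetical student: maps a mel-spectrogram to 3DMM shape coefficients."""
    def __init__(self, n_mels=80, shape_dim=199):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),            # pool over time -> fixed-size embedding
        )
        self.head = nn.Linear(256, shape_dim)   # regress shape coefficients

    def forward(self, mel):                     # mel: (B, n_mels, T)
        return self.head(self.conv(mel).squeeze(-1))

def distillation_step(student, teacher, mel, face_img, optimizer):
    """One training step: match the teacher's shape prediction for the paired face image."""
    with torch.no_grad():
        target_shape = teacher(face_img)        # pseudo label from the image-based teacher
    pred_shape = student(mel)
    loss = nn.functional.l1_loss(pred_shape, target_shape)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The design choice this illustrates is that the teacher's predictions stand in for ground-truth 3D scans, which is what lets the student be trained without paired voice and 3D face data.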
Benchmarks
| Benchmark | Methodology | ARE-CR | ARE-ER | ARE-FR | ARE-MR | Mean ARE |
|---|---|---|---|---|---|---|
| 3D Face Modelling on Voxceleb-3D | CMP (supervised) | 0.0457 | 0.0152 | 0.0186 | 0.0169 | 0.0241 |
| 3D Face Modelling on Voxceleb-3D | CMP (unsupervised) | 0.0480 | 0.0181 | 0.0169 | 0.0174 | 0.0251 |
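As a rough illustration of how a per-region absolute relative error (ARE) like the columns above can be computed, the sketch below measures a normalized vertex error over a set of face regions and averages the per-region scores. The region index sets and the exact normalization are assumptions for illustration, not the benchmark's official definition.

```python
# Hedged sketch of a per-region absolute relative error (ARE) on face meshes.
import numpy as np

def region_are(pred_verts, gt_verts, region_idx):
    """Mean absolute relative error over one face region.

    pred_verts, gt_verts: (N, 3) arrays of mesh vertices in correspondence.
    region_idx: indices of the vertices belonging to the region.
    """
    pred = pred_verts[region_idx]
    gt = gt_verts[region_idx]
    # per-vertex Euclidean error, normalized by the ground-truth vertex norm (assumed)
    err = np.linalg.norm(pred - gt, axis=1) / np.linalg.norm(gt, axis=1)
    return float(err.mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(1000, 3))
    pred = gt + 0.01 * rng.normal(size=(1000, 3))
    # Hypothetical index sets standing in for the benchmark's face regions.
    regions = {"CR": np.arange(0, 250), "ER": np.arange(250, 500),
               "FR": np.arange(500, 750), "MR": np.arange(750, 1000)}
    scores = {name: region_are(pred, gt, idx) for name, idx in regions.items()}
    mean_are = sum(scores.values()) / len(scores)
    print(scores, mean_are)
```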