Integrated Modeling of Speech, Gesture, and Face for Virtual Characters (BodyTalk)
Human communicative behavior builds on a rich set of carefully orchestrated verbal and non-verbal components. Speech, facial expressions, and gestures are fundamentally intertwined, born of a common representation of the message to be communicated and colored by context, such as the emotions and communicative situation at hand. It is well known that individual gestures, facial expressions, and prosodic inflections are synchronized with high precision. At the same time, non-verbal behaviors are highly variable, even optional, which makes it difficult to formulate exact rules for how they appear in a conversation; indeed, the variation itself appears to be a fundamental property of these behaviors.
Generating naturalistic and coherent conversational behaviors with a spontaneous quality (i.e., not scripted, repetitive, or "robotic") for virtual characters is crucial to many applications, e.g., in virtual and augmented reality, computer games, and movies, as well as in virtual assistants and social robots. It is, however, a process fraught with difficulties, and it often takes large amounts of manual work to create the illusion of life and spontaneity, an illusion that invariably breaks down as soon as the user sees or hears the exact same behavior twice.
In this project, we propose, for the first time, truly Integrated Behavior Generation (IBG), where speech, gesture, and facial expressions are produced by a single, integrated deep probabilistic model, trained on multimodal recordings of voice, body motion, and facial expression. The system will learn a mapping from input text plus contextual parameters (style, emotion, personality traits, etc.) to a coherent multimodal output behavior stream. The architecture will be designed to take full advantage of the mutual information between modalities, ensuring that prosodic, gestural, and facial events are inherently synchronized, both in timing and in content. This is in stark contrast to existing systems, where text-to-speech and gesture synthesis are invariably separate processes. Our group is in a unique position to realize the goal of naturalistic, non-repetitive, integrated behavior generation, given our recent breakthroughs in both the text-to-speech and motion generation domains.
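To make the idea of one shared model driving all three output streams concrete, the sketch below shows a minimal joint architecture in PyTorch. It is purely illustrative: the module names, dimensions, and the simple recurrent backbone are assumptions made for exposition and do not describe the project's actual model; the point is only that all modality heads read the same hidden state, so timing is shared by construction.

```python
# Illustrative sketch only: a minimal joint decoder mapping a shared
# text + context encoding to three time-aligned output streams
# (speech acoustics, body motion, facial expression).
import torch
import torch.nn as nn


class JointBehaviorModel(nn.Module):
    def __init__(self, text_dim=256, style_dim=16, hidden_dim=512,
                 speech_dim=80, gesture_dim=57, face_dim=52):
        super().__init__()
        # Shared encoder fuses linguistic features with contextual style parameters.
        self.encoder = nn.GRU(text_dim + style_dim, hidden_dim, batch_first=True)
        # One lightweight head per modality, all driven by the same hidden state.
        self.speech_head = nn.Linear(hidden_dim, speech_dim)    # e.g. mel-spectrogram frames
        self.gesture_head = nn.Linear(hidden_dim, gesture_dim)  # e.g. joint rotations
        self.face_head = nn.Linear(hidden_dim, face_dim)        # e.g. blendshape weights

    def forward(self, text_feats, style):
        # text_feats: (batch, frames, text_dim) frame-aligned linguistic features
        # style:      (batch, style_dim) contextual parameters (emotion, engagement, ...)
        style_seq = style.unsqueeze(1).expand(-1, text_feats.size(1), -1)
        shared, _ = self.encoder(torch.cat([text_feats, style_seq], dim=-1))
        return {
            "speech": self.speech_head(shared),
            "gesture": self.gesture_head(shared),
            "face": self.face_head(shared),
        }


if __name__ == "__main__":
    model = JointBehaviorModel()
    out = model(torch.randn(2, 120, 256), torch.randn(2, 16))
    print({k: v.shape for k, v in out.items()})
```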
Specifically, we will address the following research questions: (A) How can we model and synthesize speech, gesture, and facial expression from text in a joint model, in a way that is on par with the unimodal state of the art while ensuring full congruence between modalities? (B) How can efficient high-level behavioral style controls (governing, for example, the level of engagement or agitation) be implemented so that they affect all modalities appropriately and coherently?
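As a toy illustration of question (B), the snippet below shows one way a single high-level control could be mapped to the shared style vector of the sketch above, so that one knob modulates speech, gesture, and face together. The function `style_from_controls` and the "agitation" and "engagement" controls are hypothetical; a learned conditioning network would replace this mapping in practice.

```python
# Illustrative only: map hypothetical high-level controls to the shared
# style vector consumed by JointBehaviorModel above.
import torch


def style_from_controls(agitation: float, engagement: float, style_dim: int = 16):
    # Toy mapping: each control fills half of the style vector.
    half = style_dim // 2
    return torch.cat([torch.full((half,), agitation),
                      torch.full((style_dim - half,), engagement)]).unsqueeze(0)


calm = style_from_controls(agitation=0.1, engagement=0.8)
agitated = style_from_controls(agitation=0.9, engagement=0.8)
# Feeding `calm` vs. `agitated` into the joint model changes the speech,
# gesture, and face outputs together, since all heads share the conditioning.
```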
Researchers
Publications
Duration
2024-01-30 → 2027-12-22
Funding
Vetenskapsrådet grant no. 2023-05441