Modeling Turn-taking in Conversation

One of the fundamental aspects of spoken interaction is turn-taking: speakers coordinate when to speak and when to listen. In this project, we investigate how to develop computational models of turn-taking, both to improve interaction in conversational interfaces and human-robot interaction, and as an analytical tool for better understanding the underlying mechanisms of human-human interaction.

Turn-taking

Extensive research has explored the mechanisms that underlie turn-taking, including the identification of acoustic and linguistic cues at the end of a speaker's turn that signal an upcoming turn shift. In addition, speakers must begin planning their next contribution early, relying on predictions about what their conversational partner will say and when their turn will end. How this predictive process works, and which specific cues are involved, remains less well understood, partly because the signals are complex and interwoven, making their systematic identification and localization challenging. To address this, we are developing computational models that use deep learning to predict turn-taking in spoken interactions.

Turn-taking in Conversational Systems and Human-Robot Interaction

Users of conversational systems, such as virtual assistants and robots, often experience significant response delays or interruptions. Our models aim to enhance these interactions by predicting human behavior more effectively and identifying coordination cues in both speech and visual signals, such as gaze. By doing so, we can improve the fluidity of interaction, allowing systems to anticipate and respond more naturally to human conversational patterns.

Understanding Turn-taking in Conversation

While deep learning models are highly powerful, capable of learning to identify and represent complex signals across various modalities and timescales, they often lack transparency. A key focus of our project is to develop new methods and tools to analyze these models, enabling us to uncover the complex turn-taking cues that underlie prediction. These tools will not only enhance the practical application of these models but will also provide new insights into the fundamental mechanisms of inter-speaker coordination. Our research, therefore, contributes both to applied fields, like human-computer interaction, and to theoretical fields, such as linguistics and phonetics, offering a deeper understanding of how humans coordinate conversation in real time.

In this video, we have applied our Voice Activity Projection (VAP) model, which predicts who is likely to be the dominant speaker in the upcoming two-second time window:
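To illustrate the idea, the following sketch shows how a VAP-style prediction could be summarized into a "dominant speaker" estimate. It assumes the model outputs frame-level voice-activity probabilities for two speakers over the upcoming two-second window; the shapes, frame rate, and aggregation here are illustrative assumptions, not the actual VAP implementation.

```python
import numpy as np

def dominant_speaker(p_future: np.ndarray) -> tuple[int, float]:
    """Summarize a predicted voice-activity window into a speaker estimate.

    p_future: array of shape (2, n_frames) with per-frame voice-activity
    probabilities, one row per speaker, covering the upcoming window
    (e.g. 20 frames of 100 ms = 2 seconds).

    Returns (speaker index, normalized score for that speaker).
    """
    # Expected amount of speech per speaker over the window
    expected_activity = p_future.mean(axis=1)
    # Normalize into a two-way score: who dominates the window?
    score = expected_activity / expected_activity.sum()
    winner = int(np.argmax(score))
    return winner, float(score[winner])

# Example: speaker 0 is trailing off while speaker 1 ramps up,
# suggesting an upcoming turn shift to speaker 1.
p = np.stack([
    np.linspace(0.9, 0.0, 20),  # speaker 0: activity decreasing
    np.linspace(0.3, 0.9, 20),  # speaker 1: activity increasing
])
spk, prob = dominant_speaker(p)
```

In this toy example, speaker 1 is predicted to dominate the window, which a dialogue system could use as a cue that it is about to become appropriate to take the turn.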

Researchers

Gabriel Skantze, Professor
Haotian Qi, Doctoral student

Publications

[1]
K. Inoue et al., "Multilingual Turn-taking Prediction Using Voice Activity Projection," in 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, 2024, pp. 11873-11883.
[2]
E. Ekstedt et al., "Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2023, 2023, pp. 5481-5485.
[3]
B. Jiang, E. Ekstedt and G. Skantze, "Response-conditioned Turn-taking Prediction," in Findings of the Association for Computational Linguistics, ACL 2023, 2023, pp. 12241-12248.
[4]
B. Jiang, E. Ekstedt and G. Skantze, "What makes a good pause? Investigating the turn-holding effects of fillers," in Proceedings 20th International Congress of Phonetic Sciences (ICPhS), 2023, pp. 3512-3516.
[5]
E. Ekstedt and G. Skantze, "How Much Does Prosody Help Turn-taking? Investigations using Voice Activity Projection Models," in Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2022, pp. 541-551.
[6]
E. Ekstedt and G. Skantze, "Voice Activity Projection: Self-supervised Learning of Turn-taking Events," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2022, 2022, pp. 5190-5194.
[7]
E. Ekstedt and G. Skantze, "Projection of Turn Completion in Incremental Spoken Dialogue Systems," in Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL 2021, 2021, pp. 431-437.
[8]
G. Skantze, "Turn-taking in Conversational Systems and Human-Robot Interaction: A Review," Computer Speech & Language, vol. 67, 2021.
[9]
M. Roddy, G. Skantze and N. Harte, "Investigating speech features for continuous turn-taking prediction using LSTMs," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2018, pp. 586-590.
[10]
M. Roddy, G. Skantze and N. Harte, "Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs," in ICMI 2018 - Proceedings of the 2018 International Conference on Multimodal Interaction, 2018, pp. 186-190.

Funding