Modeling Turn-taking in Conversation

One of the fundamental aspects of spoken interaction is turn-taking: speakers coordinate when to speak and when to listen. In this project, we investigate how to develop computational models of turn-taking, both to improve interaction in conversational interfaces and human-robot interaction, and as an analytical tool for better understanding the underlying mechanisms of human-human interaction.

Turn-taking

Extensive research has explored the mechanisms that underlie turn-taking, including the identification of acoustic and linguistic cues at the end of a speaker's turn that signal an upcoming turn shift. In addition, speakers must begin planning their next contribution early, relying on predictions about what their conversational partner will say and when their turn will end. How this predictive process works, and which specific cues are involved, remains less well understood, partly because the signals are complex and interwoven, making their systematic identification and localization challenging. To address this, we are developing computational models that use deep learning to predict turn-taking in spoken interactions.

Turn-taking in Conversational Systems and Human-Robot Interaction

Users of conversational systems, such as virtual assistants and robots, often experience significant response delays or interruptions. Our models aim to enhance these interactions by predicting human behavior more effectively and identifying coordination cues in both speech and visual signals, such as gaze. By doing so, we can improve the fluidity of interaction, allowing systems to anticipate and respond more naturally to human conversational patterns.

Understanding Turn-taking in Conversation

While deep learning models are highly powerful, capable of learning to identify and represent complex signals across various modalities and timescales, they often lack transparency. A key focus of our project is to develop new methods and tools to analyze these models, enabling us to uncover the complex turn-taking cues that underlie prediction. These tools will not only enhance the practical application of these models but will also provide new insights into the fundamental mechanisms of inter-speaker coordination. Our research, therefore, contributes both to applied fields, like human-computer interaction, and to theoretical fields, such as linguistics and phonetics, offering a deeper understanding of how humans coordinate conversation in real time.

In this video, we have applied our Voice Activity Projection (VAP) model, which predicts who is likely to be the dominant speaker in the upcoming two-second time window:
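To illustrate the idea, the following sketch shows how a VAP-style prediction could be summarized into a "dominant speaker" estimate. It assumes the model outputs frame-level voice-activity probabilities for two speakers over the upcoming two-second window; the shapes, frame rate, and aggregation here are illustrative assumptions, not the actual VAP implementation.

```python
import numpy as np

def dominant_speaker(p_future: np.ndarray) -> tuple[int, float]:
    """Summarize a predicted voice-activity window into a speaker estimate.

    p_future: array of shape (2, n_frames) with per-frame voice-activity
    probabilities, one row per speaker, covering the upcoming window
    (e.g. 20 frames of 100 ms = 2 seconds).

    Returns (speaker index, normalized score for that speaker).
    """
    # Expected amount of speech per speaker over the window
    expected_activity = p_future.mean(axis=1)
    # Normalize into a two-way score: who dominates the window?
    score = expected_activity / expected_activity.sum()
    winner = int(np.argmax(score))
    return winner, float(score[winner])

# Example: speaker 0 is trailing off while speaker 1 ramps up,
# suggesting an upcoming turn shift to speaker 1.
p = np.stack([
    np.linspace(0.9, 0.0, 20),  # speaker 0: activity decreasing
    np.linspace(0.3, 0.9, 20),  # speaker 1: activity increasing
])
spk, prob = dominant_speaker(p)
```

In this toy example, speaker 1 is predicted to dominate the window, which a dialogue system could use as a cue that it is about to become appropriate to take the turn.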

Researchers

Gabriel Skantze, Professor
Haotian Qi, Doctoral student

Publications

[1]
K. Inoue et al., "Multilingual Turn-taking Prediction Using Voice Activity Projection," in 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation, LREC-COLING 2024 - Main Conference Proceedings, 2024, pp. 11873-11883.
[2]
E. Ekstedt et al., "Automatic Evaluation of Turn-taking Cues in Conversational Speech Synthesis," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2023, 2023, pp. 5481-5485.
[3]
B. Jiang, E. Ekstedt and G. Skantze, "Response-conditioned Turn-taking Prediction," in Findings of the Association for Computational Linguistics, ACL 2023, 2023, pp. 12241-12248.
[4]
B. Jiang, E. Ekstedt and G. Skantze, "What makes a good pause? Investigating the turn-holding effects of fillers," in Proceedings 20th International Congress of Phonetic Sciences (ICPhS), 2023, pp. 3512-3516.
[5]
E. Ekstedt and G. Skantze, "How Much Does Prosody Help Turn-taking? Investigations using Voice Activity Projection Models," in Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, 2022, pp. 541-551.
[6]
E. Ekstedt and G. Skantze, "Voice Activity Projection: Self-supervised Learning of Turn-taking Events," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH 2022, 2022, pp. 5190-5194.
[7]
E. Ekstedt and G. Skantze, "Projection of Turn Completion in Incremental Spoken Dialogue Systems," in Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue, SIGDIAL 2021, 2021, pp. 431-437.
[8]
G. Skantze, "Turn-taking in Conversational Systems and Human-Robot Interaction: A Review," Computer Speech & Language, vol. 67, 2021.
[9]
M. Roddy, G. Skantze and N. Harte, "Investigating speech features for continuous turn-taking prediction using LSTMs," in Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, 2018, pp. 586-590.
[10]
M. Roddy, G. Skantze and N. Harte, "Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs," in ICMI 2018 - Proceedings of the 2018 International Conference on Multimodal Interaction, 2018, pp. 186-190.

Funding