Understanding predictive models of turn-taking in spoken interaction
One of the most fundamental aspects of spoken interaction is the taking of turns. Many studies have investigated the mechanisms by which speakers manage this coordination, for example by identifying acoustic and linguistic signals at the end of a turn that are associated with turn shifts. However, as several researchers have pointed out, speakers in dialog must start planning their next turn early, based on a prediction of what the interlocutor will say and when their utterance will end. How this prediction is done, and which signals are involved, is much less well understood, partly because we lack the necessary tools for such analyses. Since the signals involved are so complex, ambiguous, and intertwined, they are very hard to identify and localize in a systematic fashion. At KTH, we have recently developed novel computational models for making predictions in spoken interaction, based on deep learning. While these models are extremely powerful, in that they can learn to identify and represent complex signals across different modalities and timescales, they currently suffer from a lack of transparency. In this project, we want to develop new methods and tools for analyzing these models, and use these tools to identify the complex turn-taking cues that are involved in prediction. This will open the way towards using these models to gain insight into fundamental mechanisms of inter-speaker coordination, over and above the technical applications that currently dominate the field.
Final report
PROJECT OBJECTIVES
The goal of the project has been to develop and analyze computational models for turn-taking in spoken conversation. The purpose has been both to analyze the models in order to gain insights into the fundamental mechanisms behind conversational turn-taking, and to investigate how the models can be used in human–machine interaction, for example in human–robot interaction or AI assistants.
Human conversations are characterized by rapid speaker changes, often with only about 200 milliseconds of delay and relatively little overlap. To achieve this precise coordination, conversational partners must continuously predict each other’s speech behavior and adjust their own accordingly. This happens through a combination of signals, such as prosody and gaze behavior. A central research question is therefore to understand which signals are most crucial, and how they are processed and interpreted in real time. Researchers from several fields have approached this question using methods from, among others, psycholinguistics, conversation analysis, and neuropsychology.
In this project, we used deep learning models trained to predict turn-taking patterns in large collections of recorded conversations. This task forces the models to pick up on, and form representations of, the coordination signals present in the speech signal. By analyzing the models, we can identify which signals are most important.
PROJECT IMPLEMENTATION
In the initial phase, we developed the model TurnGPT, which is based solely on the verbal component of conversations, i.e., transcriptions. The model has the same architecture as GPT-based language models, such as ChatGPT, but is further trained on conversation transcripts with special markers for speaker shifts. The analysis showed that linguistic signals relevant to speaker shifts may occur several turns before the actual shift, indicating that a longer context can be crucial for coordinating turn-taking.
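To make the setup concrete, the following is a minimal sketch in Python of how such a model can be fine-tuned and queried, assuming a standard Hugging Face GPT-2 checkpoint and a turn-shift token written here as "<ts>"; the actual TurnGPT training data, tokenization, and hyperparameters differ from this illustration.

# Minimal sketch of a TurnGPT-style setup: a GPT-2 language model fine-tuned on
# dialog transcripts in which speaker shifts are marked with a special token.
# The token name and training details are illustrative, not the exact TurnGPT
# implementation.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.add_special_tokens({"additional_special_tokens": ["<ts>"]})  # turn-shift marker

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))  # make room for the new token

# A dialog is serialized as a single token stream; <ts> marks each speaker change.
dialog = "do you want some coffee <ts> yes please <ts> milk or sugar <ts>"
inputs = tokenizer(dialog, return_tensors="pt")

# Standard causal language-model fine-tuning: the model learns to predict <ts>
# where a turn shift is appropriate, alongside ordinary next-word prediction.
loss = model(**inputs, labels=inputs["input_ids"]).loss
loss.backward()

# At inference time, the probability assigned to <ts> after the current word can
# be read off as a turn-shift prediction (meaningful only after fine-tuning on a
# dialog corpus).
with torch.no_grad():
    logits = model(**inputs).logits
ts_id = tokenizer.convert_tokens_to_ids("<ts>")
p_shift = torch.softmax(logits[0, -1], dim=-1)[ts_id].item()
print(f"P(turn shift | context) = {p_shift:.3f}")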
In the next step, we developed the model Voice Activity Projection (VAP), which instead of transcriptions uses the raw audio waveform of the conversation and continuously predicts the speakers’ voice activity during the following two seconds. Compared to TurnGPT, this means that important acoustic components, such as prosody and temporal aspects like pause length, are preserved. The model was trained on approximately 2000 hours of recorded telephone conversations between Americans (the so-called Fisher corpus).
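The prediction target can be illustrated with a small sketch, assuming a 50 Hz frame rate and four future bins per speaker (roughly the discretization described in the VAP publications); the exact bin sizes and activity criterion below are simplified.

# Sketch of a VAP-style prediction target: the next two seconds of voice activity
# for the two speakers are discretized into bins, and each possible combination
# becomes one class that the model predicts a distribution over at every frame.
# Frame rate, bin sizes, and the "active bin" criterion are simplified here.
import torch

def vap_label(va_future: torch.Tensor, bin_frames=(10, 20, 30, 40)) -> int:
    """Map a (2, 100) binary voice-activity window (2 speakers x 2 s at 50 Hz)
    to a single class index, with one bit per (speaker, bin)."""
    bits = []
    for speaker in range(2):
        start = 0
        for n in bin_frames:
            # A bin counts as active if the speaker talks in most of its frames.
            bits.append(int(va_future[speaker, start:start + n].float().mean() > 0.5))
            start += n
    # 2 speakers x 4 bins = 8 bits, i.e. 256 possible future activity states.
    return sum(bit << i for i, bit in enumerate(bits))

# Example: speaker 0 keeps talking for the whole window, speaker 1 stays silent.
va = torch.zeros(2, 100, dtype=torch.long)
va[0, :] = 1
print(vap_label(va))  # the class a trained model should assign high probability here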
Once the model was trained, we analyzed it using an approach inspired by experiments in psycholinguistics. In these experiments, participants listen to audio clips and press a button when they think a speaker change is about to occur. By manipulating the speech signal, for example by flattening the pitch contour, it is possible to study how this affects the ability to predict shifts. With similar stimuli, we examined how our models’ predictions were affected. We found that intonation was generally less important than expected, but became crucial at syntactically ambiguous decision points. We also analyzed how filled pauses (“uhm”) signal that the speaker wishes to keep the floor, and found that duration, intensity, and pitch all contribute to the signal’s strength.
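As an illustration of this kind of stimulus manipulation, the sketch below flattens the pitch contour of an audio clip to its mean F0 via the parselmouth interface to Praat; the file name and pitch-analysis settings are placeholders, and the project's actual stimuli may have been prepared differently.

# Flatten the pitch contour of a speech clip to its mean F0, so that the model's
# predictions with and without intonation can be compared. "clip.wav" is a
# placeholder file name; the pitch floor/ceiling (75-600 Hz) are generic defaults.
import parselmouth
from parselmouth.praat import call

sound = parselmouth.Sound("clip.wav")
mean_f0 = call(sound.to_pitch(), "Get mean", 0, 0, "Hertz")  # NaN if no voiced frames

manipulation = call(sound, "To Manipulation", 0.01, 75, 600)
pitch_tier = call(manipulation, "Extract pitch tier")
call(pitch_tier, "Formula", str(mean_f0))              # set every pitch point to the mean
call([pitch_tier, manipulation], "Replace pitch tier")

flat = call(manipulation, "Get resynthesis (overlap-add)")
flat.save("clip_flat.wav", "WAV")  # feed both versions to the model and compare predictions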
Furthermore, we investigated how well the model handles three languages from different language families: American English, Japanese, and Mandarin. The results showed that a model trained on one language does not work very well for another, suggesting that turn-taking signals are relatively language-specific. However, a single model can be trained on multiple languages and achieve good results for all of them.
In the final stage of the project, we applied the models in human–robot interaction to investigate whether they could improve turn-taking. We used a scenario where the human and the robot discuss ethical dilemmas, which naturally gives rise to longer thinking pauses where the robot must avoid interrupting. We compared our model with a more traditional model, where the robot waits for a silence of a certain length before taking the turn. The results showed that our model enabled shorter response times and fewer interruptions of the user. This was also reflected in users’ subjective experiences, measured through questionnaires.
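In simplified form, the difference between the two policies can be sketched as follows; the threshold values and the model interface (a turn-shift probability) are illustrative placeholders rather than the parameters of the actual system.

# Simplified sketch of the two turn-taking policies compared in the robot
# experiment. Thresholds and the p_shift interface are illustrative only.

SILENCE_THRESHOLD_S = 1.5   # baseline: take the turn after this much silence
SHIFT_PROB_THRESHOLD = 0.6  # predictive: take the turn when P(shift) is high

def baseline_take_turn(silence_s: float) -> bool:
    """Traditional policy: respond only after a fixed stretch of silence, which
    makes responses slow and still risks interrupting long thinking pauses."""
    return silence_s >= SILENCE_THRESHOLD_S

def predictive_take_turn(p_shift: float, silence_s: float) -> bool:
    """VAP-style policy: respond as soon as the model predicts that the user is
    yielding the turn, and keep waiting, even through long silences, while the
    model predicts that the user intends to continue."""
    return silence_s > 0.0 and p_shift >= SHIFT_PROB_THRESHOLD

# Example: 0.4 s into a pause, the model is confident the user has finished.
print(baseline_take_turn(0.4))          # False: the baseline is still waiting
print(predictive_take_turn(0.85, 0.4))  # True: the predictive policy responds earlier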
THREE KEY RESULTS OF THE PROJECT
1. The development of deep learning–based computational models that formed the basis for later analyses and experiments, and that have also been used by researchers worldwide in other studies.
2. The development of methods and paradigms for how such models can be analyzed to generate insights into human communication. This includes both parameter analyses and manipulation of input signals.
3. The demonstration of how the models can be applied in practical AI applications to improve interaction between users and systems.
NEW RESEARCH QUESTIONS
In the final phase of the project, we investigated demographic and individual differences in turn-taking behavior using larger datasets where the same speakers appear in multiple conversations. We found some general effects of gender and age, but even greater individual variations. Above all, it became clear that each speaker pair develops a unique behavior. This means that computational models should be adapted to the specific conversation being modeled, in order to predict the dynamics that arise. Understanding and modeling demographic and individual differences in conversation therefore constitutes an important future research question.
Another central question is how the models can be extended to more than two speakers and integrate multimodal signals, such as gaze direction and gestures.
DISSEMINATION OF RESULTS AND COLLABORATION
The project results have been presented at a number of prestigious international conferences in speech communication, phonetics, language technology, and AI. On two occasions, our work received a Best Paper Award: at SIGDIAL 2022 (the ACL Special Interest Group on Discourse and Dialogue) in Edinburgh and at HRI 2025 (Human–Robot Interaction) in Melbourne.
During the course of the project, we hosted several international guest researchers. Among others, Koji Inoue, Associate Professor at Kyoto University, collaborated with us on several publications and later built upon our work in his own research. We also hosted Yu Wang, a doctoral student at Bielefeld University, who based his studies of feedback sounds in conversation on our work.
The models we developed have been published as open source, which has enabled other researchers to use and further develop them.
Commercial actors, including American AI startups, have also shown great interest in the results. This underscores that the problems we have addressed within conversational AI are relevant not only to basic research in phonetics and linguistics, but also to industry.