Understanding predictive models of turn-taking in spoken interaction
One of the most fundamental aspects of spoken interaction is the taking of turns. Many studies have investigated the mechanisms by which speakers manage this coordination, for example by identifying acoustic and linguistic signals at the end of a turn that are associated with turn shifts. However, as several researchers have pointed out, speakers in dialogue must start planning their next turn early, based on a prediction of what the interlocutor will say and when their utterance will end. How this prediction is done, and what signals are involved, is much less understood, partly because we lack the necessary tools for such analyses. Since the signals involved are complex, ambiguous and intertwined, they are hard to identify and localize in a systematic fashion.

At KTH, we have recently developed a novel computational model, based on deep learning, for making predictions in spoken interaction. While such models are extremely powerful, in that they can learn to identify and represent complex signals across different modalities and timescales, they currently lack transparency. In this project, we will develop new methods and tools for analysing these models, and use these tools to identify the complex turn-taking cues involved in prediction. This will open the way towards using such models to gain insight into fundamental mechanisms of inter-speaker coordination, over and above the technical applications that currently dominate the field.