Timing of intonation and gestures in spoken communication
1. Main goals and research direction of the project
The main goal of the project was to examine the temporal coordination between speech and gesture. More specifically, the point of departure was to investigate the relationship between speech melody, represented by intonational movement, and co-speech gestures, using high-quality audio, video and motion-capture data which allow automatic extraction and analysis of gestural and prosodic aspects of the speech signal. The project concentrated on this goal and initially investigated the temporal coordination between head nods having a prominence-signaling function and stressed syllables. During the course of the project, the work was extended to the temporal coordination between hand gestures and syllables which functionally signal potential locations for turn-taking in dialogue. Finally, new methods were developed and tested within the project for automatic annotation of larger units of speech and gesture from motion data. These methods enable a more extensive investigation of temporal coordination and can be used for building avatars and robots with natural gesturing and gesture-recognition capabilities. The project, moreover, had a broader and more general aim of testing the hypothesis that gestures and intonational movements which occur in synchrony principally share the same communicative function.
2. The three most important results of the project and reasoning around these
The first principal result stemming from the project concerns the temporal synchronization between head nods and tonally marked stressed syllables. Head nods serving a prominence-lending “beat gesture” function on average occurred slightly ahead of the syllable, which is consistent with the literature on temporal synchronization of co-speech gestures (mainly hand and arm gestures). These results are especially interesting when compared to those presented in the literature, as the results of this project were obtained from spontaneous dialogue while the literature mainly reports results from scripted speech. The alignment of the head nods with the stressed syllable suggests that the prominence function is common to the intonational movement and the head gesture. However, the greater temporal variation of the head movement compared to the prominence-lending fundamental-frequency movement does not support the hypothesis of a common motor generation component for both head and intonational gestures.
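To make the lead/lag measure concrete, the following minimal sketch computes the lag of head-nod apexes relative to stressed-syllable onsets. It is an illustration only: the times, variable names and sample values are invented, not project data.

```python
# Minimal sketch: measuring head-nod lead/lag relative to stressed syllables.
# Assumes two parallel annotation lists (times in seconds) extracted elsewhere,
# e.g. from motion-capture and prosodic annotation tiers; values are invented.

nod_apexes = [1.42, 3.07, 5.88, 9.14]        # apex time of each head nod
syllable_onsets = [1.50, 3.10, 6.01, 9.10]   # onset of the associated stressed syllable

# Lag = nod apex minus syllable onset; negative values mean the nod leads.
lags = [nod - syl for nod, syl in zip(nod_apexes, syllable_onsets)]
mean_lag = sum(lags) / len(lags)
print(f"mean lag: {mean_lag * 1000:+.0f} ms (negative = nod precedes syllable)")
```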
The second main result concerns the temporal relationship between hand gestures and syllables which constitute potential places for turn transitions in dialogue. An important relationship between gesture offset timing and turn transition was found. When speakers gesture near a turn-boundary location, the gesture stops before the end of speech when there is a change of turn. When the speaker wants to keep the turn, the gestures tend to extend beyond the end of the acoustic speech signal. These results suggest that gesture functions as part of a prosodic system of turn-taking (along with duration and pitch) but also that gesture can function as an independent cue to turn-taking.
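The offset measure behind this result can be illustrated as follows. The data structure and event times are hypothetical stand-ins, not the project's actual annotation format.

```python
# Minimal sketch of the offset measure discussed above: does the gesture end
# before or after the end of the speaker's acoustic signal at a turn boundary?
# The dataclass and interval times (seconds) are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class BoundaryEvent:
    gesture_offset: float   # time the hand gesture retracts to rest
    speech_offset: float    # end of the acoustic speech signal
    turn_change: bool       # True if the interlocutor takes the turn

events = [
    BoundaryEvent(4.82, 5.10, True),    # gesture ends before speech: turn yielded
    BoundaryEvent(8.95, 8.40, False),   # gesture extends past speech: turn held
]

for e in events:
    rel = e.gesture_offset - e.speech_offset
    kind = "change" if e.turn_change else "hold"
    print(f"turn {kind}: gesture offset {rel:+.2f} s relative to speech offset")
```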
The third primary result also concerns hand gestures and the synchronization between gestural units and longer stretches of speech, but it also involves the development of methodology. The technical area of automatic speech and gesture detection is advancing rapidly, and during the course of the project we have seen a movement away from rule-based detection towards machine-learning methodologies. Our results in this area concern the development of methods to automatically detect and annotate gestural units in spontaneous speech. These methods allowed us to investigate the relationship between speech phrases and gestural units. Our results indicated a general tendency for the onset of speech to slightly precede longer gesture units, thus showing a timing trend opposite to that observed between head motion and the syllable.
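As a rough sketch of the rule-based starting point that such detection work typically begins from (the project's own, more advanced methods are described in the publications, not reproduced here), wrist speed computed from motion-capture frames can be thresholded into candidate gesture units. The frame rate, thresholds and function name below are illustrative assumptions.

```python
# Rough sketch of a rule-based gesture-unit detector of the kind the project
# moved away from: threshold wrist speed from motion-capture positions sampled
# at a fixed frame rate. In practice the speed signal would be smoothed first.
import numpy as np

def detect_gesture_units(positions, fps=120.0, speed_thresh=0.15, min_frames=12):
    """positions: (n_frames, 3) wrist coordinates in metres.
    Returns (start, end) frame indices of detected gesture units."""
    speed = np.linalg.norm(np.diff(positions, axis=0), axis=1) * fps  # m/s
    moving = speed > speed_thresh
    units, start = [], None
    for i, m in enumerate(moving):
        if m and start is None:
            start = i                          # movement begins
        elif not m and start is not None:
            if i - start >= min_frames:        # discard very short movements
                units.append((start, i))
            start = None
    if start is not None and len(moving) - start >= min_frames:
        units.append((start, len(moving)))     # movement runs to end of data
    return units
```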
3. New research questions generated by the project
The initial work of the project involved investigating the temporal coordination between prosody and gesture restricted to the time domain of the syllable. One of the most important and exciting new questions generated by the project relates to coordination between prosody and gesture in a longer time domain and with different functions such as turn-taking. We have found a loose temporal relationship in which both stressed syllables and turn-final syllables serve as anchor points between speech and gestures that share the same functions, but we have also found a relatively large degree of variation and a certain optionality of gestures. How to integrate gesture into a full account of the prosodic system with all of its functions remains a challenging issue.
A second area of new research generated by the project is also related to the longer time domain of the gesture unit and involves the development and testing of models of gestural flow using a Hierarchical Hidden Markov Model (HHMM) instead of a rule-based method. This type of modeling has been tested and validated within the project but needs to be extended and refined to include segmentation of gesture units into gesture phrases and gesture phases.
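To illustrate the general idea in a simplified form, the sketch below segments a synthetic speed feature with a flat two-state Gaussian HMM (rest vs. gesture) using the third-party hmmlearn package. This stands in for, rather than reproduces, the project's hierarchical model; all data and parameters are illustrative.

```python
# Simplified illustration of HMM-based segmentation of gestural flow: a flat
# two-state Gaussian HMM over a 1-D speed feature, standing in for the
# hierarchical model (HHMM) developed in the project.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
# Synthetic speed feature: alternating low-speed (rest) and high-speed
# (gesture) stretches of 200 frames each, in place of real motion features.
segments = [rng.normal(mu, 0.05, size=200) for mu in (0.05, 0.6, 0.05, 0.6)]
X = np.concatenate(segments).reshape(-1, 1)

model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=50)
model.fit(X)
states = model.predict(X)          # frame-wise decoded state sequence
# Segment boundaries are the frames where the decoded state changes.
boundaries = np.flatnonzero(np.diff(states)) + 1
print("detected boundaries at frames:", boundaries)
```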
4. The international impact of the project
The project has been presented at international conferences and received wide attention at an invited plenary presentation at Cambridge, UK (June 2015). Project results have also been presented as invited talks at research seminars in Tilburg, The Netherlands (Oct. 2015); Utrecht, The Netherlands (July 2016); and Aix-en-Provence, France (Oct. 2016).
The project has been represented at three conferences in Sweden and seven international conferences. The Swedish national conferences are Fonetik 2013 (12-13 June 2013, Linköping), The Fifth Swedish Language Technology Conference (13-14 November 2014, Uppsala) and Fonetik 2015 (8-10 June 2015, Lund). The international conferences are the Tilburg Gesture Research Meeting (19-21 June 2013, Tilburg University, The Netherlands); The 12th International Conference on Auditory-Visual Speech Processing (AVSP2013) (29 August – 1 September 2013, Annecy, France); Phonetics and Phonology in Europe 2015 (29-30 June 2015, University of Cambridge, UK); The 14th International Pragmatics Conference (26-31 July 2015, Antwerp, Belgium); Speech Prosody 2016 (31 May – 3 June 2016, Boston, USA); the Seventh Conference of the International Society for Gesture Studies (18-22 July 2016, Paris, France); and the International Workshop on Multimodal Analyses enabling Artificial Agents in Human-Machine Interaction (16 November 2016, Tokyo, Japan). Conference contributions have been accepted and will be presented at the following three upcoming international conferences: the International Conference on Multimodal Communication: Developing New Theories and Methods (9-11 June 2017, Osnabrück, Germany); Phonetics and Phonology in Europe 2017 (12-14 June 2017, Cologne, Germany); and the 15th International Pragmatics Conference (16-21 July 2017, Belfast, Northern Ireland).
In addition to generally strengthening research activities involving gesture and speech at the home department, the project has generated increased collaboration with gesture researchers at Lund University and the University of Copenhagen, particularly within the project “Multimodal levels of prominence”, supported by Stiftelsen Marcus och Amalia Wallenbergs Minnesfond, in which David House is a participant.
5. Dissemination of information outside the scientific community
Project results concerning speech and gesture have been presented at popular science events in Gothenburg organized by the SweClarin initiative. The project has also been in contact with Disney Research, USA, regarding animation technology.
6. The two most important publications of the project and some reflections
Zellers, House & Alexanderson (2016) is the most important publication concerning the temporal relationship between hand gestures and turn transitions. An important finding is that when speakers gesture in the vicinity of a potential turn-boundary location, the gesture ends before the offset of speech when there is a change of turn. When the speaker wants to keep the turn, the gesture tends to continue after the end of speech, but with a fairly constant end time of about half a second into the pause. The paper also reports results showing a higher and more variable pitch at the end of speech in connection with gestures. These results suggest that gesture functions as part of a prosodic system of turn-taking.
Alexanderson, House and Beskow (2016) is the most important publication involving the development and testing of models of gesture dynamics using a Hierarchical Hidden Markov Model (HHMM) instead of a rule-based method. The model is trained on labels of complete gesture units and is tested and validated on two datasets differing in genre and in method of capturing motion. The method outperforms a state-of-the-art classifier on a publicly available dataset. The results have implications for automatic classification of gesture units and for building avatars and robots with natural gesturing and gesture-recognition capabilities.
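As one illustration of how such validation can be scored (not the paper's actual evaluation protocol), frame-wise segmentation output can be compared against labelled gesture units with an F1 measure; the labels below are invented.

```python
# Minimal sketch: scoring frame-wise gesture segmentation against reference
# labels with scikit-learn's f1_score. All label sequences are illustrative.
from sklearn.metrics import f1_score

reference = [0, 0, 1, 1, 1, 1, 0, 0, 1, 1]   # 1 = frame inside a labelled gesture unit
predicted = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1]   # 1 = frame decoded as gesture

print(f"frame-wise F1: {f1_score(reference, predicted):.2f}")
```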
7. Publication strategy of the project
The publication strategy has been to publish full-paper refereed conference papers (4), refereed short conference papers (5) and full-paper non-refereed conference contributions (2). All these papers are open access and freely available on the project webpage and the personal webpages of the authors. Two journal publications have been submitted. If accepted, these publications will also be open access and made available on the webpages of the project and the authors.