CAPTivating: Comparative Analysis of Public speaking with Text-to-speech

Captivating an audience means attracting and holding listeners' attention by being very interesting, exciting or pleasant. Thanks to a paradigm shift in speech synthesis, combined with our research group's own achievements in leveraging real-life speech data, it is now possible to mimic a captivating speaker's characteristics using realistic-sounding text-to-speech (TTS). At Interspeech 2019, our paper "Off the Cuff: Exploring Extemporaneous Speech Delivery with TTS" received the Best Demo Award. It demonstrated the capabilities of one of our spontaneous synthetic voices via an interactive interface for navigating through different versions of resynthesized utterances from two keynote speeches. The aim of this project is to employ this tool for research in linguistics and speech analysis, specifically to study public speaking. The proposed method uses comparative perceptual experiments with spontaneous speech synthesis to systematically vary speech features and measure their direct and combined perceptual impact. We will control breathing, vocal effort, prosody and hesitations in our TTS in order to study their effect on listeners' perception, memory, recall and cognitive load, using multimodal sensors. Lastly, we will compare and contrast the impact of these variations in public speaking between Swedish and English; to make this possible, we will create the first TTS built from Swedish spontaneous speech.

Final report

Purpose and development of the project
The purpose of CAPTivating (Comparative Analysis of Public speaking with Text-to-speech) has been to establish and validate a research methodology in which controllable neural text-to-speech synthesis (TTS), trained on spontaneous speech, is used as a tool for experimental research in linguistics and speech science. The project focuses on perceptual consequences of speech delivery characteristics that are relevant to public speaking. This includes prosodic and articulatory variation as well as spontaneous speech phenomena such as repetitions, filled pauses (“uh/um”), and other pause-internal particles. The project was carried out in accordance with the original project plan. A practical circumstance affecting the timeline was the PI’s parental leave (Aug 2021–Jun 2022), which delayed some activities but did not alter the scientific scope.

Brief description of implementation
The project has used neural spontaneous TTS to generate controlled speech stimuli for listening experiments. This approach combines ecological validity (speech patterns learned from spontaneous corpora) with experimental control (systematic manipulation of selected acoustic-prosodic and spontaneous speech variables while holding linguistic content constant). Controlled perception studies were conducted online, enabling the quantification of how specific delivery cues influence listener judgements of speaker stance and social attributes. In the later phase of the project, the availability of large language models has also made it possible to generate controlled textual variants around naturally occurring examples, reducing reliance on corpus frequency for rare phenomena and enabling more systematic stimulus design.
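As an illustration of what systematic manipulation with constant linguistic content can look like in practice, the following is a minimal sketch of a full factorial stimulus grid. The factor names and levels are hypothetical and chosen only for illustration; the report does not list the actual control values or tooling used in the project.

```python
from itertools import product

# Hypothetical factor levels (assumption): the report does not specify
# which levels were used for any of these delivery controls.
FACTORS = {
    "speech_rate": ["slow", "normal", "fast"],
    "vocal_effort": ["low", "high"],
    "filled_pause": ["none", "initial", "medial"],
}


def build_stimulus_grid(text, factors):
    """Return one stimulus specification per combination of factor levels."""
    names = list(factors)
    grid = []
    for levels in product(*(factors[name] for name in names)):
        spec = {"text": text}  # the sentence itself is the same in every cell
        spec.update(zip(names, levels))
        grid.append(spec)
    return grid


# Example sentence is invented for this sketch.
grid = build_stimulus_grid("Could you send me the report by Friday?", FACTORS)
print(len(grid))  # 3 * 2 * 3 = 18 stimulus specifications
```

Each specification would then be passed to a controllable TTS system to render one stimulus, so that any difference in listener judgements between cells can be attributed to the manipulated delivery cues rather than to the wording.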

The three most important results and conclusions
(1) Disfluencies and prosodic delivery jointly shape perceived certainty and competence. A series of studies demonstrated that disfluency characteristics and prosodic parameters contribute systematically to listener impressions. In particular, filled pause placement, speech rate, and mean f0 contribute additively to perceived speaker certainty, with non-initial filled pauses being a major contributor. Further work showed that false starts and the overall number of disfluencies negatively affect perceived speaker competence, whereas the negative effect of repetitions varies in degree depending on what is repeated. These findings support the conclusion that the perceptual cost of disfluency is not uniform: it depends on the type of event and its interaction with prosodic delivery, with implications for both public speaker training and the design of naturalistic synthetic speech.

(2) Controlled prosodic variation influences pragmatic interpretation and behavioural intention. Using prosody-controllable spontaneous TTS, we showed that naturalistic modifications of speech rate and vocal effort affect perceived politeness and the likelihood that listeners would comply with indirect requests. Faster speech rate and higher vocal effort increased willingness to comply, and these effects differed depending on speaker profile. The results demonstrate that prosodic delivery influences not only impression formation in terms of confidence or competence, but also pragmatic outcomes that are central to audience-directed speech.

(3) Data-driven prosodic pattern discovery enables experimental testing of subtle pragmatic functions. The project contributed evidence that prosodic variation on discourse markers can alter perceived meaning even when lexical context is held constant. A data-driven method was developed to identify common prosodic patterns of the discourse marker “well” from unlabelled found data via clustering, followed by synthesis of representative realisations using controllable TTS. Perceptual evaluation showed that prosodic characteristics such as duration and f0 contour properties systematically shift perceived agreement/disagreement strength. This supports the conclusion that spontaneous TTS can be used to experimentally isolate fine-grained pragmatic signals that are difficult to study with traditional stimulus creation methods.
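The clustering step described in (3) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual method: the report does not say which features or clustering algorithm were used, so plain k-means over time-normalised f0 contours (in semitones relative to each token's mean) stands in here, and the contour values are invented toy data. Duration, which the report also names as perceptually relevant, is left out for brevity and could be appended as an extra feature.

```python
import math


def to_semitones(contour_hz):
    """Express an f0 contour in semitones relative to the token's own mean."""
    mean = sum(contour_hz) / len(contour_hz)
    return [12 * math.log2(f / mean) for f in contour_hz]


def resample(contour, n_points=5):
    """Linearly resample a contour to a fixed number of points."""
    if len(contour) == 1:
        return [contour[0]] * n_points
    out = []
    for i in range(n_points):
        pos = i * (len(contour) - 1) / (n_points - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append(contour[lo] * (1 - frac) + contour[hi] * frac)
    return out


def kmeans(points, k, n_iter=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(math.dist(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each token to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(v) / len(members) for v in zip(*members)]
    return labels, centroids


# Toy f0 contours in Hz for five tokens of "well" (invented for illustration).
tokens = [
    [180, 190, 205, 220],
    [175, 185, 200, 215, 230],
    [220, 205, 190, 180],
    [230, 215, 200, 185, 170],
    [200, 210, 225, 240],
]
features = [resample(to_semitones(t)) for t in tokens]
labels, centroids = kmeans(features, k=2)
print(labels)
```

On this toy data the rising and falling contours end up in separate clusters. In a workflow like the one described, each centroid would serve as the representative realisation to synthesise with controllable TTS and evaluate perceptually.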

New research questions generated by the project
The project generated new questions in three main directions. First, results indicate that the perceptual and pragmatic impact of delivery cues depends on speaker variation, including differences linked to perceived speaker identity; this motivates systematic study of how prosodic and spontaneous speech controls generalise across speakers and how identity perception mediates listener judgements. Second, the project raises questions about how prosodic cues combine with sequential and contextual information to shape pragmatic inference, and how such pragmatic functions should be represented in controllable synthesis systems. Third, our position paper motivates the question of whether increasingly prevalent synthetic voices may influence human speaking style over time through accommodation and entrainment mechanisms, with potential socioindexical and ethical implications.

Dissemination and collaboration
Results have been disseminated primarily through international peer-reviewed conferences in speech technology and speech science, with publications released in open-access venues. The project has involved collaboration within the department with doctoral researchers and faculty working on neural TTS, as well as international collaboration, including with Prof. Bernd Möbius’ lab (Saarbrücken) and with Dr. Ilaria Torre, resulting in joint publications. The project contributes to the broader research environment by providing empirically grounded insight into which controllable dimensions of neural spontaneous TTS are perceptually relevant, supporting both fundamental research and applied development of speech technologies.

Grant administrator: KTH Royal Institute of Technology

Reference number: P20-0298

Amount: SEK 4,352,000

Funding: RJ Projects

Subject: Specific Languages

Year: 2020
