Transition Duration in Turn-Taking
Smooth turn-taking is vital for natural conversation, ensuring mutual comprehensibility. In human communication, timing between utterances is normatively constrained, with deviations conveying social information. However, spoken dialogue systems struggle with this, highlighting the need for robust turn-taking models. This paper examines human interaction data to develop an evidence-based model for turn-timing. Two models are proposed: a speaker-agnostic model and a speaker-sensitive model. The latter predicts listeners' turn-taking propensity, improving dialogue system naturalness.
Project Dates: 3/1/22 - 7/21/22
Date Published: 6/23/24
Overview
This project was published at SIGDIAL 2022. The code repository is accessible on GitHub. I would like to extend my gratitude to J.P. de Ruiter and Charles Threlkeld for their assistance with this project.
Introduction
Turn-taking is crucial in natural conversations, allowing for smooth exchanges and mutual understanding. This paper addresses the challenge of implementing smooth turn-taking in spoken dialogue systems by examining human conversation data to develop a robust turn-timing model.
Motivation and Related Work
The study builds on two primary models of turn-taking:
- Duncan’s Signal-based Model: Suggests speakers use turn-yielding signals, such as intonation changes and gesture endings, to indicate they have finished speaking.
- Sacks et al.’s Simplest Systematics Model: Proposes that listeners can predict when a speaker will finish and take the turn at Transition Relevance Places (TRPs), where a speaker change is likely.
Data and Methods
The study uses the Switchboard corpus, a dataset of dyadic telephone conversations, to analyze the duration of TRPs between Turn Construction Units (TCUs). Two models are developed:
- Speaker-agnostic Model: Assumes a single distribution of TRP durations, not influenced by speaker identity.
- Speaker-sensitive Model: Differentiates between TRP durations depending on whether the same speaker continues or a new speaker takes the turn.
Findings
Speaker-Agnostic Model
Under the speaker-agnostic model, TRP durations fit a truncated normal distribution with a mean of 374 ms. This model does not account for speaker identity: it assumes either participant is equally likely to speak next after a pause.
Speaker-Sensitive Model
The speaker-sensitive model revealed two distinct distributions:
- Speaker Switch: Mean TRP duration of 315 ms, indicating faster transitions when the speaker changes.
- Speaker Continuation: Mean TRP duration of 459 ms, showing longer pauses when the same speaker continues.
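The two distributions can be combined into a turn-taking propensity: the probability that a silence of a given length ends in a speaker switch. The following is a minimal sketch of that idea, not the paper's implementation. It assumes both distributions are normal and truncated at 0 ms, and it uses the reported means as location parameters. The standard deviation (SIGMA_MS) and the prior probability of a switch (PRIOR_SWITCH) are invented for illustration, since neither is reported above.

```python
import math

# Reported mean TRP durations (ms), used here as location parameters.
MU_SWITCH_MS = 315.0
MU_CONTINUE_MS = 459.0
# Hypothetical values: the summary above does not report these.
SIGMA_MS = 250.0
PRIOR_SWITCH = 0.5


def normal_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def truncated_normal_logpdf(t_ms: float, mu: float, sigma: float,
                            lower: float = 0.0) -> float:
    """Log-density of a normal(mu, sigma) truncated to t_ms >= lower."""
    if t_ms < lower:
        raise ValueError("gap duration lies below the truncation point")
    z = (t_ms - mu) / sigma
    log_density = -0.5 * z * z - math.log(sigma * math.sqrt(2.0 * math.pi))
    # Renormalise by the probability mass that survives truncation.
    log_mass = math.log(1.0 - normal_cdf((lower - mu) / sigma))
    return log_density - log_mass


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def switch_probability(t_ms: float) -> float:
    """Posterior P(speaker switch | gap of t_ms) via Bayes' rule."""
    log_odds = (
        math.log(PRIOR_SWITCH / (1.0 - PRIOR_SWITCH))
        + truncated_normal_logpdf(t_ms, MU_SWITCH_MS, SIGMA_MS)
        - truncated_normal_logpdf(t_ms, MU_CONTINUE_MS, SIGMA_MS)
    )
    return sigmoid(log_odds)


for gap_ms in (100.0, 300.0, 600.0):
    print(gap_ms, round(switch_probability(gap_ms), 3))
```

With these placeholder parameters the switch probability falls as the silence grows, which matches the qualitative pattern above: short gaps favour a speaker switch, long gaps favour continuation. The exact values depend entirely on the assumed spread and prior.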
Model Comparison
Using Bayesian model comparison, the speaker-sensitive model was found to be significantly better at predicting TRP durations, supporting the Sacks et al. model of turn-taking.
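The summary does not say which Bayesian comparison procedure was used. As an illustration only, the sketch below swaps in the Bayesian information criterion (BIC), a common large-sample approximation to the log marginal likelihood, and compares a one-distribution model against a two-distribution model. The data are synthetic, the distributions are plain (untruncated) normals, and the 150 ms spread and sample sizes are invented; only the two means come from the findings above.

```python
import math
import random
import statistics


def gaussian_log_likelihood(samples):
    """Log-likelihood of samples under a normal fitted by maximum likelihood."""
    n = len(samples)
    variance = statistics.pvariance(samples)  # MLE variance (divides by n)
    return -0.5 * n * (math.log(2.0 * math.pi * variance) + 1.0)


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


# Synthetic gap durations (ms). The means follow the reported values;
# the spread and sample sizes are hypothetical.
rng = random.Random(0)
switch_gaps = [rng.gauss(315.0, 150.0) for _ in range(2000)]
continue_gaps = [rng.gauss(459.0, 150.0) for _ in range(2000)]
all_gaps = switch_gaps + continue_gaps
n = len(all_gaps)

# Speaker-agnostic: one normal for every gap (2 parameters).
bic_agnostic = bic(gaussian_log_likelihood(all_gaps), 2, n)

# Speaker-sensitive: one normal per transition type (4 parameters),
# with the transition type of each gap treated as known.
log_lik_sensitive = (
    gaussian_log_likelihood(switch_gaps)
    + gaussian_log_likelihood(continue_gaps)
)
bic_sensitive = bic(log_lik_sensitive, 4, n)

print(f"BIC agnostic:  {bic_agnostic:.1f}")
print(f"BIC sensitive: {bic_sensitive:.1f}")
```

Because the synthetic data are generated from two different distributions, the speaker-sensitive model scores lower here by construction. The example shows the mechanics of the comparison, not the paper's result.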
Practical Implications
The findings suggest that spoken dialogue systems should:
- Minimize Gaps: Ensure quick responses to avoid the current speaker taking the silence as a cue to continue.
- Optimal Response Time: Respond within 394 ms after a turn ends, ideally around 150-200 ms, where the probability of a speaker change is high.
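One way to apply these targets is to check a system's planned response gap against them. The sketch below is a hypothetical helper, not part of the paper: the thresholds come from the recommendations above, while the label names, the "early" category, and the per-stage latencies are assumptions made for illustration.

```python
# Timing targets taken from the recommendations above (ms).
IDEAL_WINDOW_MS = (150.0, 200.0)
LATEST_RESPONSE_MS = 394.0


def classify_response_gap(gap_ms: float) -> str:
    """Label a planned response gap against the timing targets."""
    if gap_ms < 0:
        raise ValueError("gap_ms must be non-negative")
    if gap_ms < IDEAL_WINDOW_MS[0]:
        return "early"  # assumed label for gaps shorter than the ideal window
    if gap_ms <= IDEAL_WINDOW_MS[1]:
        return "ideal"
    if gap_ms <= LATEST_RESPONSE_MS:
        return "acceptable"
    return "late"


# Hypothetical per-stage latencies of a dialogue pipeline (ms).
stage_latency_ms = {
    "end_of_turn_detection": 60.0,
    "response_selection": 50.0,
    "speech_onset": 70.0,
}
planned_gap_ms = sum(stage_latency_ms.values())
print(planned_gap_ms, classify_response_gap(planned_gap_ms))
```

A budget like this makes it easy to see which pipeline stage has to shrink when the total drifts past the point where the current speaker is likely to resume.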
Proposed Implementation
The paper proposes integrating the turn-taking propensity function into the continuous dialogue system architecture by:
- Incremental Processing: Using Incremental Units (IUs) and Incremental Modules (IMs) to process and predict turn-taking in real-time.
- Module Extensions: Incorporating additional cues like intonation and semantic completeness to refine turn-taking predictions.
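The incremental design can be pictured as a module that receives small updates about the ongoing silence and decides, update by update, whether to take the turn. The sketch below is a toy illustration under stated assumptions: the class names (SilenceIU, TakeTurnIU, TurnTakingModule), the threshold, and the placeholder propensity curve are all hypothetical and do not correspond to the API of any incremental-processing toolkit or to the paper's fitted model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class SilenceIU:
    """Hypothetical incremental unit: silence observed so far after a TCU."""
    elapsed_ms: float


@dataclass(frozen=True)
class TakeTurnIU:
    """Hypothetical incremental unit telling downstream modules to speak."""
    at_ms: float
    propensity: float


class TurnTakingModule:
    """Hypothetical incremental module: consumes SilenceIUs and emits
    at most one TakeTurnIU per transition relevance place."""

    def __init__(self, propensity: Callable[[float], float], threshold: float):
        self.propensity = propensity
        self.threshold = threshold
        self.committed = False

    def process(self, iu: SilenceIU) -> Optional[TakeTurnIU]:
        if self.committed:
            return None  # the turn was already taken for this TRP
        score = self.propensity(iu.elapsed_ms)
        if score >= self.threshold:
            self.committed = True
            return TakeTurnIU(at_ms=iu.elapsed_ms, propensity=score)
        return None


def toy_propensity(elapsed_ms: float) -> float:
    """Placeholder curve peaking at 175 ms; not the fitted model."""
    return max(0.0, 1.0 - abs(elapsed_ms - 175.0) / 175.0)


module = TurnTakingModule(toy_propensity, threshold=0.8)
# Feed one silence update every 50 ms and collect any decisions.
decisions = [module.process(SilenceIU(float(t))) for t in range(0, 400, 50)]
emitted = [d for d in decisions if d is not None]
print(emitted)
```

Additional cues such as intonation or semantic completeness could be added as further inputs to the propensity callable without changing the module's interface.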
Conclusion
This study provides an evidence-based model for improving turn-taking in conversational agents by leveraging TRP duration. Future work involves implementing this model in spoken dialogue systems and evaluating conversational naturalness through human-subject experiments.
Acknowledgments
This research was funded by the Data Intensive Studies Center at Tufts University.