Transition Duration in Turn-Taking

Smooth turn-taking is vital for natural conversation because it keeps speakers mutually comprehensible. In human communication, the timing between utterances is normatively constrained, and deviations from that timing convey social information. Spoken dialogue systems still struggle with this, which highlights the need for robust turn-taking models. This paper examines human interaction data to develop an evidence-based model for turn-timing. Two models are proposed: a speaker-agnostic model and a speaker-sensitive model. The latter predicts listeners' turn-taking propensity and can be used to make dialogue systems sound more natural.

Project Dates: 3/1/22 - 7/21/22

Date Published: 6/23/24

Overview

This project was published at SIGDIAL 2022. The code repository is accessible on GitHub. I would like to extend my gratitude to J.P. de Ruiter and Charles Threlkeld for their assistance with this project.

Introduction

Turn-taking is crucial in natural conversations, allowing for smooth exchanges and mutual understanding. This paper addresses the challenge of implementing smooth turn-taking in spoken dialogue systems by examining human conversation data to develop a robust turn-timing model.

The study builds on two primary models of turn-taking:

  1. Duncan’s Signal-based Model: Suggests speakers use turn-yielding signals, such as intonation changes and gesture endings, to indicate they have finished speaking.
  2. Sacks et al.’s Simplest Systematics Model: Proposes that listeners can predict when a speaker will finish and take the turn at Transition Relevance Places (TRPs), where a speaker change is likely.

Data and Methods

The study uses the Switchboard corpus, a dataset of dyadic telephone conversations, to analyze the duration of TRPs between Turn Construction Units (TCUs). Two models are developed:

  1. Speaker-agnostic Model: Assumes a single distribution of TRP durations, not influenced by speaker identity.
  2. Speaker-sensitive Model: Differentiates between TRP durations depending on whether the same speaker continues or a new speaker takes the turn.
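To make the distinction between the two models concrete, here is a minimal sketch of how TRP durations could be extracted and split by transition type. The `TCU` record layout, the field names, and the decision to skip overlapping speech are assumptions made for illustration; they are not taken from the Switchboard annotations or from the paper's code.

```python
from dataclasses import dataclass


@dataclass
class TCU:
    """One Turn Construction Unit (hypothetical record layout)."""
    speaker: str
    start_s: float  # onset in seconds
    end_s: float    # offset in seconds


def split_trp_durations(tcus):
    """Return (all_gaps, switch_gaps, continuation_gaps) in milliseconds.

    all_gaps feeds a speaker-agnostic model; the other two lists feed a
    speaker-sensitive model.
    """
    all_gaps, switch_gaps, continuation_gaps = [], [], []
    for prev, nxt in zip(tcus, tcus[1:]):
        gap_ms = (nxt.start_s - prev.end_s) * 1000.0
        if gap_ms < 0:
            continue  # assumption: overlapping speech is skipped
        all_gaps.append(gap_ms)
        if nxt.speaker != prev.speaker:
            switch_gaps.append(gap_ms)
        else:
            continuation_gaps.append(gap_ms)
    return all_gaps, switch_gaps, continuation_gaps


# Toy input: A speaks, B takes over, then B continues after a pause.
tcus = [TCU("A", 0.0, 1.0), TCU("B", 1.3, 2.0), TCU("B", 2.5, 3.0)]
all_gaps, switch_gaps, continuation_gaps = split_trp_durations(tcus)
```

The speaker-agnostic model would be fit to `all_gaps` alone, while the speaker-sensitive model fits one distribution to `switch_gaps` and another to `continuation_gaps`.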

Findings

Speaker-Agnostic Model

Under the speaker-agnostic model, TRP durations have a mean of 374 ms and are fit with a truncated normal distribution. This model does not account for speaker identity: it assumes that either participant is equally likely to speak next after a pause.

Speaker-Sensitive Model

The speaker-sensitive model revealed two distinct distributions:

  • Speaker Switch: Mean TRP duration of 315 ms, indicating faster transitions when the speaker changes.
  • Speaker Continuation: Mean TRP duration of 459 ms, showing longer pauses when the same speaker continues.
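One way to turn these two distributions into a turn-taking propensity is Bayes' rule: given the silence elapsed so far, how likely is it that the gap belongs to the speaker-switch distribution? The sketch below is illustrative only. The means 315 ms and 459 ms come from the findings above, but using them as location parameters of zero-truncated normals is an assumption, and the standard deviations (200 ms) and the prior (0.5) are placeholders, not values reported here.

```python
import math


def truncated_normal_pdf(x, mu, sigma, lower=0.0):
    """Density of a normal(mu, sigma) truncated below at `lower`."""
    if x < lower:
        return 0.0
    z = (x - mu) / sigma
    density = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    # Probability mass of the untruncated normal at or above `lower`.
    kept_mass = 0.5 * math.erfc((lower - mu) / (sigma * math.sqrt(2.0)))
    return density / kept_mass


def switch_propensity(silence_ms,
                      mu_switch=315.0, mu_cont=459.0,        # from the findings
                      sigma_switch=200.0, sigma_cont=200.0,  # placeholder values
                      prior_switch=0.5):                     # placeholder value
    """P(speaker switch | observed silence) via Bayes' rule."""
    w_switch = prior_switch * truncated_normal_pdf(
        silence_ms, mu_switch, sigma_switch)
    w_cont = (1.0 - prior_switch) * truncated_normal_pdf(
        silence_ms, mu_cont, sigma_cont)
    total = w_switch + w_cont
    # Fall back to the prior if both densities are zero (negative input)
    # or underflow to zero (extremely long silences).
    return w_switch / total if total > 0.0 else prior_switch
```

With these placeholder parameters the propensity falls as the silence grows, which matches the qualitative finding that short gaps favor a speaker switch and long gaps favor continuation.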

Model Comparison

Bayesian model comparison favored the speaker-sensitive model as the better predictor of TRP durations, supporting the Sacks et al. model of turn-taking.
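The comparison procedure is not detailed here, so the following is only a rough stand-in: it scores the two models with the Bayesian Information Criterion (BIC), a common large-sample approximation to Bayesian model comparison. For brevity it fits ordinary (untruncated) Gaussians, and the data are synthetic draws with an assumed spread of 100 ms, not Switchboard measurements.

```python
import math
import random
from statistics import pstdev


def max_gaussian_loglik(xs):
    """Log-likelihood of xs under a Gaussian at its maximum-likelihood fit."""
    n = len(xs)
    variance = pstdev(xs) ** 2  # population variance is the ML estimate
    return -0.5 * n * (math.log(2.0 * math.pi * variance) + 1.0)


def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion; lower is better."""
    return n_params * math.log(n_obs) - 2.0 * loglik


def compare_models(switch_gaps, continuation_gaps):
    """Return (bic_agnostic, bic_sensitive) for labeled gap durations."""
    pooled = switch_gaps + continuation_gaps
    n = len(pooled)
    # Speaker-agnostic: one Gaussian for every gap (2 parameters).
    bic_agnostic = bic(max_gaussian_loglik(pooled), 2, n)
    # Speaker-sensitive: one Gaussian per transition type (4 parameters).
    loglik_sensitive = (max_gaussian_loglik(switch_gaps)
                        + max_gaussian_loglik(continuation_gaps))
    bic_sensitive = bic(loglik_sensitive, 4, n)
    return bic_agnostic, bic_sensitive


# Synthetic data for illustration only (assumed spread of 100 ms).
rng = random.Random(0)
switch_gaps = [rng.gauss(315.0, 100.0) for _ in range(500)]
continuation_gaps = [rng.gauss(459.0, 100.0) for _ in range(500)]
bic_agnostic, bic_sensitive = compare_models(switch_gaps, continuation_gaps)
```

The extra parameters of the speaker-sensitive model are penalized by the `n_params * log(n_obs)` term, so it only wins when splitting by transition type improves the fit enough to pay for them.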

Practical Implications

The findings suggest that spoken dialogue systems should:

  1. Minimize Gaps: Ensure quick responses to avoid the current speaker taking the silence as a cue to continue.
  2. Optimal Response Time: Respond within 394 ms after a turn ends, ideally around 150-200 ms, where the probability of a speaker change is high.
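These two recommendations can be read as a simple scheduling rule. The sketch below is a hypothetical helper, not part of the paper: `processing_ms` stands for however long the system needs before it can start speaking, the 175 ms target is simply the midpoint of the 150-200 ms range, and 394 ms is the limit quoted above.

```python
def plan_response(processing_ms, target_ms=175.0, deadline_ms=394.0):
    """Return (extra_wait_ms, onset_ms, within_deadline).

    If the system is ready before the target onset, it waits out the
    difference; if it is ready later, it should start speaking at once.
    """
    extra_wait_ms = max(0.0, target_ms - processing_ms)
    onset_ms = processing_ms + extra_wait_ms
    return extra_wait_ms, onset_ms, onset_ms <= deadline_ms
```

A system that needs 100 ms to prepare its reply would wait a further 75 ms, while one that needs 500 ms has already missed the window and risks the user resuming their turn.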

Proposed Implementation

The paper proposes integrating the turn-taking propensity function into the continuous dialogue system architecture by:

  1. Incremental Processing: Using Incremental Units (IUs) and Incremental Modules (IMs) to process input and predict turn-taking in real time.
  2. Module Extensions: Incorporating additional cues like intonation and semantic completeness to refine turn-taking predictions.
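As a rough illustration of how a propensity function might sit inside an incremental architecture, the sketch below defines a toy IU type and a turn-taking module. All class names, fields, and the 0.5 threshold are hypothetical; this does not reproduce any existing incremental dialogue framework or the paper's implementation, and it assumes that "silence" IUs are only published once a TRP has been reached.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class IncrementalUnit:
    kind: str       # "speech", "silence", or "take_turn" (hypothetical labels)
    value_ms: float  # for "silence": silence elapsed so far, in ms


class IncrementalModule:
    """Base module: receives IUs and forwards new IUs to subscribers."""

    def __init__(self) -> None:
        self.subscribers: List["IncrementalModule"] = []

    def subscribe(self, module: "IncrementalModule") -> None:
        self.subscribers.append(module)

    def publish(self, iu: IncrementalUnit) -> None:
        for module in self.subscribers:
            module.process(iu)

    def process(self, iu: IncrementalUnit) -> None:
        raise NotImplementedError


class TurnTakingModule(IncrementalModule):
    """Emits one "take_turn" IU per silence while propensity is high."""

    def __init__(self, propensity: Callable[[float], float],
                 threshold: float = 0.5) -> None:
        super().__init__()
        self.propensity = propensity
        self.threshold = threshold
        self._fired = False

    def process(self, iu: IncrementalUnit) -> None:
        if iu.kind == "speech":
            self._fired = False  # a new utterance resets the decision
        elif iu.kind == "silence" and not self._fired:
            if self.propensity(iu.value_ms) >= self.threshold:
                self._fired = True
                self.publish(IncrementalUnit("take_turn", iu.value_ms))


class Recorder(IncrementalModule):
    """Stand-in for a downstream module (e.g. speech output)."""

    def __init__(self) -> None:
        super().__init__()
        self.received: List[IncrementalUnit] = []

    def process(self, iu: IncrementalUnit) -> None:
        self.received.append(iu)
```

Additional cues such as intonation or semantic completeness could be added as further modules that publish their own IUs to `TurnTakingModule`, or folded into the `propensity` callable it is constructed with.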

Conclusion

This study provides an evidence-based model for improving turn-taking in conversational agents by leveraging TRP duration. Future work involves implementing this model in spoken dialogue systems and evaluating conversational naturalness through human-subject experiments.

Acknowledgments

This research was funded by the Data Intensive Studies Center at Tufts University.