Transition Duration in Turn-Taking

Project Dates: 3/1/22 - 7/21/22

Date Published: 6/23/24

Overview

This project was published at SIGDIAL 2022. The code repository is accessible on GitHub. I would like to thank J.P. de Ruiter and Charles Threlkeld for their assistance with this project.

Introduction

Turn-taking is crucial in natural conversations, allowing for smooth exchanges and mutual understanding. This paper addresses the challenge of implementing smooth turn-taking in spoken dialogue systems by examining human conversation data to develop a robust turn-timing model.

The study builds on two primary models of turn-taking:

Duncan’s Signal-based Model: Suggests speakers use turn-yielding signals, such as intonation changes and gesture endings, to indicate they have finished speaking.
Sacks et al.’s Simplest Systematics Model: Proposes that listeners can predict when a speaker will finish and take the turn at Transition Relevance Places (TRPs), where a speaker change is likely.

Data and Methods

The study uses the Switchboard corpus, a dataset of dyadic telephone conversations, to analyze the duration of TRPs between Turn Construction Units (TCUs). Two models are developed:

Speaker-agnostic Model: Assumes a single distribution of TRP durations, not influenced by speaker identity.
Speaker-sensitive Model: Differentiates between TRP durations depending on whether the same speaker continues or a new speaker takes the turn.
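
As a rough illustration of the distinction the two models draw, the sketch below labels each gap between consecutive TCUs as a speaker switch or a continuation. The TCU class, its field names, and the example timings are hypothetical; they are not taken from the Switchboard annotations or from the project repository, and overlapping speech (negative gaps) is not given any special treatment here.

```python
from dataclasses import dataclass


@dataclass
class TCU:
    speaker: str
    start: float  # seconds from the start of the conversation
    end: float    # seconds from the start of the conversation


def transitions(tcus):
    """Return (gap_ms, kind) for each pair of consecutive TCUs."""
    ordered = sorted(tcus, key=lambda t: t.start)
    labelled = []
    for prev, nxt in zip(ordered, ordered[1:]):
        gap_ms = round((nxt.start - prev.end) * 1000.0)
        kind = "switch" if nxt.speaker != prev.speaker else "continuation"
        labelled.append((gap_ms, kind))
    return labelled


# Made-up example: speaker A pauses and continues, then speaker B takes over.
example = [TCU("A", 0.0, 1.20), TCU("A", 1.65, 2.40), TCU("B", 2.70, 3.50)]
result = transitions(example)
```

A speaker-agnostic analysis would pool every gap_ms value into one sample, while a speaker-sensitive analysis would split the sample by kind before fitting.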

Findings

Speaker-Agnostic Model

The speaker-agnostic model found a mean TRP duration of 374 ms, with durations fitting a truncated normal distribution. This model does not account for speaker identity: after a pause, either participant is treated as equally likely to speak next.
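
For readers unfamiliar with the truncated normal, here is a minimal sketch of its density with truncation at zero. It treats the 374 ms figure as the location parameter, which is an assumption (the mean of a truncated normal generally differs from its location parameter), and the spread SIGMA_MS is a hypothetical value that is not reported in this summary.

```python
import math


def truncnorm_pdf(x, mu, sigma, lower=0.0):
    """Density of a normal(mu, sigma) truncated below at `lower`."""
    if x < lower:
        return 0.0
    z = (x - mu) / sigma
    density = math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    # Probability mass the untruncated normal places at or above `lower`.
    cdf_at_lower = 0.5 * (1.0 + math.erf((lower - mu) / (sigma * math.sqrt(2.0))))
    return density / (1.0 - cdf_at_lower)


MU_MS = 374.0     # reported value, used as the location parameter (assumption)
SIGMA_MS = 300.0  # hypothetical spread, for illustration only

# Sanity check: summing the density in 1 ms steps should come out close to 1.
total = sum(truncnorm_pdf(t, MU_MS, SIGMA_MS) for t in range(0, 5000))
```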

Speaker-Sensitive Model

The speaker-sensitive model revealed two distinct distributions:

Speaker Switch: Mean TRP duration of 315 ms, indicating faster transitions when the speaker changes.
Speaker Continuation: Mean TRP duration of 459 ms, showing longer pauses when the same speaker continues.
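
One way to read the two distributions together is as a posterior probability that a gap of a given length ends in a speaker switch. The sketch below applies Bayes' rule to two zero-truncated normals located at the reported 315 ms and 459 ms. The shared spread (sigma) and the equal prior are hypothetical placeholders, and this is not necessarily how the paper defines its propensity function.

```python
from statistics import NormalDist


def truncated_pdf(x, mu, sigma):
    """Density of a normal(mu, sigma) truncated to x >= 0."""
    if x < 0:
        return 0.0
    dist = NormalDist(mu, sigma)
    return dist.pdf(x) / (1.0 - dist.cdf(0.0))


def p_switch(gap_ms, prior_switch=0.5, sigma=250.0):
    """Posterior probability that a gap of `gap_ms` precedes a speaker switch.

    `prior_switch` and `sigma` are illustrative assumptions.
    """
    if gap_ms < 0:
        raise ValueError("gap_ms must be non-negative")
    switch = prior_switch * truncated_pdf(gap_ms, 315.0, sigma)
    cont = (1.0 - prior_switch) * truncated_pdf(gap_ms, 459.0, sigma)
    return switch / (switch + cont)


short_gap = p_switch(200.0)  # shorter gaps lean towards a switch
long_gap = p_switch(600.0)   # longer gaps lean towards a continuation
```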

Model Comparison

Under Bayesian model comparison, the speaker-sensitive model predicted TRP durations substantially better than the speaker-agnostic model, supporting the Sacks et al. model of turn-taking.
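
The summary does not say which comparison procedure was used, so the sketch below substitutes the Bayesian Information Criterion (BIC), a common large-sample approximation, purely to show the shape of such a comparison. It fits plain (untruncated) normals by maximum likelihood to made-up gap durations; none of the numbers come from Switchboard.

```python
import math
from statistics import NormalDist, fmean, pstdev


def log_likelihood(xs):
    """Log-likelihood of xs under a normal fitted to xs by maximum likelihood."""
    dist = NormalDist(fmean(xs), pstdev(xs))
    return sum(math.log(dist.pdf(x)) for x in xs)


def bic(log_lik, n_params, n_obs):
    """Bayesian Information Criterion; lower values indicate a preferred model."""
    return n_params * math.log(n_obs) - 2.0 * log_lik


# Made-up gap durations in milliseconds, for illustration only.
switch_gaps = [250, 300, 320, 350, 280, 310]
continuation_gaps = [420, 480, 500, 450, 470, 440]
pooled = switch_gaps + continuation_gaps
n = len(pooled)

# Speaker-agnostic: one normal (2 parameters) for all gaps.
bic_agnostic = bic(log_likelihood(pooled), 2, n)
# Speaker-sensitive: one normal per transition type (4 parameters in total).
bic_sensitive = bic(
    log_likelihood(switch_gaps) + log_likelihood(continuation_gaps), 4, n
)
```

On this toy data the speaker-sensitive model has the lower BIC, because the improvement in fit outweighs the penalty for its two extra parameters.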

Practical Implications

The findings suggest that spoken dialogue systems should:

Minimize Gaps: Ensure quick responses to avoid the current speaker taking the silence as a cue to continue.
Optimal Response Time: Respond within 394 ms after a turn ends, ideally around 150-200 ms, where the probability of a speaker change is high.
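
The two recommendations above can be sketched as a small timing policy. The function name, its parameters, and the choice of 175 ms (the midpoint of the 150-200 ms range) as the target are assumptions made for illustration; only the 394 ms limit and the 150-200 ms range come from the findings.

```python
def plan_response(ready_ms, target_ms=175.0, deadline_ms=394.0):
    """Decide when to start speaking, measured from the end of the user's turn.

    ready_ms: time at which the system's response is ready to be spoken.
    Returns (speak_at_ms, on_time).
    """
    if ready_ms < 0:
        raise ValueError("ready_ms must be non-negative")
    # If the response is ready early, hold it until the target offset;
    # otherwise speak as soon as it is ready.
    speak_at_ms = max(ready_ms, target_ms)
    return speak_at_ms, speak_at_ms <= deadline_ms


early = plan_response(50.0)   # held until the target offset
late = plan_response(500.0)   # misses the deadline
```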

Proposed Implementation

The paper proposes integrating a turn-taking propensity function into a continuous (incremental) dialogue system architecture by:

Incremental Processing: Using Incremental Units (IUs) and Incremental Modules (IMs) to process and predict turn-taking in real time.
Module Extensions: Incorporating additional cues like intonation and semantic completeness to refine turn-taking predictions.
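
To make the incremental idea concrete, here is a minimal sketch of a module that consumes IUs one at a time, tracks how much silence has accumulated, and signals when a propensity function crosses a threshold. The class names, the "speech"/"silence" labels, the threshold, and the toy propensity function are all hypothetical; this is not the API of any existing incremental dialogue framework.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class IncrementalUnit:
    kind: str           # "speech" or "silence" (hypothetical labels)
    duration_ms: float


class TurnTakingModule:
    """Tracks accumulated silence and decides when to take the turn."""

    def __init__(self, propensity: Callable[[float], float], threshold: float = 0.5):
        self.propensity = propensity
        self.threshold = threshold
        self.silence_ms = 0.0

    def process(self, iu: IncrementalUnit) -> Optional[str]:
        if iu.kind == "speech":
            self.silence_ms = 0.0  # any speech resets the silence counter
            return None
        self.silence_ms += iu.duration_ms
        if self.propensity(self.silence_ms) >= self.threshold:
            return "take_turn"
        return None


# Toy propensity: high only inside a fixed window of elapsed silence.
def toy_propensity(ms):
    return 1.0 if 150.0 <= ms <= 394.0 else 0.0


module = TurnTakingModule(toy_propensity)
stream = [
    IncrementalUnit("speech", 200.0),
    IncrementalUnit("silence", 100.0),
    IncrementalUnit("silence", 100.0),
]
decisions = [module.process(iu) for iu in stream]
```

Additional cues such as intonation or semantic completeness could be fed in as further IU types that adjust the propensity value before it is compared with the threshold.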

Conclusion

This study provides an evidence-based model for improving turn-taking in conversational agents by leveraging TRP duration. Future work involves implementing this model in spoken dialogue systems and evaluating conversational naturalness through human-subject experiments.

Acknowledgments

This research was funded by the Data Intensive Studies Center at Tufts University.