Turn-taking is a fundamental aspect of human communication, essential for smooth and comprehensible verbal interactions. While recent advances in Large Language Models (LLMs) have shown promise in enhancing Spoken Dialogue Systems (SDS), existing models often falter in natural, unscripted conversation because they are trained mostly on written language and attend only to turn-final Transition Relevance Places (TRPs). This paper addresses these limitations by evaluating the ability of state-of-the-art LLMs to predict within-turn TRPs, which are crucial for natural dialogue but challenging to predict. We introduce a unique dataset of participant-labeled within-turn TRPs and evaluate the accuracy of TRP prediction by state-of-the-art LLMs. Our experiments demonstrate the limitations of LLMs in modeling spoken language dynamics and pave the way for developing more responsive and naturalistic spoken dialogue systems.
CogSci
Can language models trained on written monologue learn to predict spoken dialogue?
Muhammad Umair, Julia Beret Mertens, Lena Warnke, and 1 more author
Transformer-based Large Language Models (LLMs) have recently increased in popularity due to their impressive performance on numerous language tasks. While LLMs can produce human-like writing, their ability to predict spoken language in natural interaction remains unclear. This study investigates whether LLMs trained on monologue-based data can learn normative structures of interactive spoken language, focusing on speaker identity’s impact on response predictability in dialogue. Through fine-tuning GPT-2 on English dialogue transcripts, the study evaluates LLMs’ surprisal values in two-turn sequences with different speaker responses. Results show that while models incorporate speaker identity, they do not fully replicate human patterns in dialogue prediction, highlighting limitations in modeling spoken interaction norms.
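Surprisal, the measure used in the study above, can be illustrated with a minimal sketch. The probabilities below are made-up toy values, not outputs of the fine-tuned GPT-2 model from the paper: the surprisal of a token is simply the negative log probability a model assigns to it given its preceding context, so a response the model finds unexpected yields a higher value.

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal in bits: -log2 of the probability a model
    assigns to a token given its preceding context."""
    return -math.log2(prob)

# Hypothetical model probabilities for the first token of a response
# to "A: How are you?" when the response comes from speaker B versus
# the same speaker A continuing (toy values for illustration only).
p_response_speaker_b = 0.20   # model treats a B response as likely
p_response_speaker_a = 0.05   # same-speaker continuation less likely

print(round(surprisal(p_response_speaker_b), 2))  # lower surprisal
print(round(surprisal(p_response_speaker_a), 2))  # higher surprisal
```

A human-like model would show the same asymmetry found in natural dialogue: lower surprisal for normatively expected speaker changes, higher surprisal for dispreferred continuations.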
2022
D&D
GailBot: An automatic transcription system for Conversation Analysis
Muhammad Umair, Julia Beret Mertens, Saul Albert, and 1 more author
Researchers studying human interaction, such as conversation analysts, psychologists, and linguists, all rely on detailed transcriptions of language use. Ideally, these should include so-called paralinguistic features of talk, such as overlaps, prosody, and intonation, as they convey important information. However, creating conversational transcripts that include these features by hand requires substantial time from trained transcribers. There are currently no Speech-to-Text (STT) systems able to integrate these features into the generated transcript. To reduce the resources needed to create detailed conversation transcripts that represent paralinguistic features, we developed a program called GailBot. GailBot combines STT services with plugins to automatically generate first drafts of transcripts that largely follow the transcription standards common in the field of Conversation Analysis. It also enables researchers to add new plugins to transcribe additional features, or to improve the plugins it currently uses. We describe GailBot’s architecture and its use of computational heuristics and machine learning. We also evaluate its output in relation to transcripts produced by both human transcribers and comparable automated transcription systems. We argue that despite its limitations, GailBot represents a substantial improvement over existing dialogue transcription software.
SigDial
Using Transition Duration to Improve Turn-taking in Conversational Agents
Charles Threlkeld, Muhammad Umair, and JP de Ruiter
In Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue, Sep 2022
Smooth turn-taking is an important aspect of natural conversation that allows interlocutors to maintain adequate mutual comprehensibility. In human communication, the timing between utterances is normatively constrained, and deviations convey socially relevant paralinguistic information. However, for spoken dialogue systems, smooth turn-taking continues to be a challenge. This motivates the need for spoken dialogue systems to employ a robust model of turn-taking to ensure that messages are exchanged smoothly and without transmitting unintended paralinguistic information. In this paper, we examine dialogue data from natural human interaction to develop an evidence-based model for turn-timing in spoken dialogue systems. First, we use timing between turns to develop two models of turn-taking: a speaker-agnostic model and a speaker-sensitive model. From the latter model, we derive the propensity of listeners to take the next turn given TRP duration. Finally, we outline how this measure may be incorporated into a spoken dialogue system to improve the naturalness of conversation.
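The speaker-sensitive measure described above can be read as an empirical propensity: the proportion of TRPs of a given silence duration at which the listener actually took the next turn. A minimal sketch with made-up observations (the bin width and data are hypothetical, not the corpus or model from the paper):

```python
from collections import defaultdict

def turn_taking_propensity(observations, bin_ms=200):
    """Estimate P(listener takes the turn | TRP silence duration)
    by binning durations (in ms) and computing the take-rate per bin.
    `observations` is a list of (duration_ms, turn_taken) pairs."""
    counts = defaultdict(lambda: [0, 0])  # bin index -> [takes, total]
    for duration, taken in observations:
        b = duration // bin_ms
        counts[b][0] += int(taken)
        counts[b][1] += 1
    # Key each bin by its lower edge in ms.
    return {b * bin_ms: takes / total
            for b, (takes, total) in sorted(counts.items())}

# Hypothetical data: (silence duration in ms, did the listener take the turn?)
obs = [(100, False), (150, False), (250, True), (300, False),
       (450, True), (500, True), (650, True), (700, True)]
print(turn_taking_propensity(obs))
# {0: 0.0, 200: 0.5, 400: 1.0, 600: 1.0}
```

A spoken dialogue system could consult such a duration-conditioned propensity to decide when a lengthening silence licenses it to take the turn, rather than using a fixed silence threshold.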