Datasets for Conversational AI

Project Dates: 8/21/22 - 8/21/23

Date Published: 6/23/24

Overview

I started this project to develop data pipelines that enable users to load multiple datasets useful for research in conversational AI, spoken dialogue systems, and linguistics. The primary contributions include:

Leveraging the Hugging Face datasets infrastructure to download and provide efficient access to variations of the same datasets.
Maintaining a single copy of the raw data while using minimal disk space to store dataset variations.
Providing access to a features package that can extract commonly required features from these datasets, such as voice activity.

This was inspired by Erik’s datasets turn-taking project.

About

The data pipelines project allows users to load multiple datasets useful for research in conversational AI, spoken dialogue systems, and linguistics. It leverages the Hugging Face datasets infrastructure to download and provide efficient access to variations of the same datasets, maintaining a single copy of the raw data and using minimal disk space to store dataset variations. Additionally, it provides access to a features package that can be used to extract commonly required features from these datasets (e.g., voice activity).

The goal is to abstract away the process of loading datasets that are useful for conversation research so that researchers can focus on model development. Its distinguishing feature is that it provides access to tools and dataset variations that might not otherwise be publicly available.

Datasets and Variants

Dataset Name	Variant	Description
Callfriend	default	Provides access to text features per conversation.
Callfriend	audio	Provides access to raw audio per conversation.
Callhome	default	Contains text features of telephone conversations.
Callhome	audio	Contains raw audio of telephone conversations.
Fisher	default	Includes time-aligned transcripts of conversations.
Fisher	audio	Includes audio paths of conversations.
Maptask	default	Provides access to dialogue transcripts.
Maptask	audio	Provides access to audio recordings of dialogues.
Switchboard	isip-aligned	Contains ISIP-aligned transcripts of conversations.
Switchboard	swda	Contains SWDA-aligned transcripts of conversations.
Switchboard	ldc-audio	Contains raw audio of conversations.
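
The table above can be read as a small registry of valid (dataset, variant) pairs. The following is a minimal sketch of how a request could be checked against it before loading. The names VARIANTS and resolve_variant are hypothetical and not part of the project, and falling back to the first listed variant when none is given is an assumption made for illustration.

```python
from typing import Dict, Tuple

# Dataset names and variants copied from the table above (lowercased).
# This registry is illustrative only; it is not the project's own API.
VARIANTS: Dict[str, Tuple[str, ...]] = {
    "callfriend": ("default", "audio"),
    "callhome": ("default", "audio"),
    "fisher": ("default", "audio"),
    "maptask": ("default", "audio"),
    "switchboard": ("isip-aligned", "swda", "ldc-audio"),
}


def resolve_variant(dataset: str, variant: str = "") -> Tuple[str, str]:
    """Validate a (dataset, variant) pair against the table.

    Assumption: when no variant is given, the first one listed is used.
    """
    key = dataset.strip().lower()
    if key not in VARIANTS:
        raise KeyError(f"Unknown dataset: {dataset!r}")
    chosen = variant or VARIANTS[key][0]
    if chosen not in VARIANTS[key]:
        raise ValueError(
            f"{dataset!r} has no variant {chosen!r}; "
            f"available: {', '.join(VARIANTS[key])}"
        )
    return key, chosen
```

The resolved pair could then be handed to whatever loading call the project exposes; that call is not documented here, so it is left out of the sketch.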

Features Package

This project aims to make it easy to download and parse datasets commonly used in conversational AI research. The load_dataset method described earlier provides access to common features for each dataset. However, depending on the application, additional features may need to be extracted, for example audio feature sets (such as GeMAPS) or voice activity derived from the transcripts.

The Features package provides access to methods that may be used for these purposes. Some key features include:

Audio Feature Extraction: Methods to extract audio features such as GeMAPS using tools like openSMILE.
Voice Activity Detection: Algorithms to detect and segment voice activity within audio recordings.
Text Features: Extraction of linguistic features from transcripts, such as part-of-speech tagging and syntactic parsing.
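
As an illustration of the voice activity feature, the sketch below turns time-aligned utterances (such as those in a time-aligned transcript) into a frame-level binary vector. It is not the Features package's implementation. The function name voice_activity_frames, the (start, end) tuple format in seconds, and the frame_hz parameter are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def voice_activity_frames(
    utterances: Sequence[Tuple[float, float]],
    duration: float,
    frame_hz: int = 100,
) -> List[int]:
    """Return one 0/1 value per frame; 1 where any utterance is active.

    utterances: (start, end) times in seconds for a single speaker.
    duration:   total length of the conversation in seconds.
    frame_hz:   number of frames per second (assumed default: 100).
    """
    n_frames = int(round(duration * frame_hz))
    frames = [0] * n_frames
    for start, end in utterances:
        # Clip to the valid frame range so out-of-bounds times are ignored.
        first = max(0, int(round(start * frame_hz)))
        last = min(n_frames, int(round(end * frame_hz)))
        for i in range(first, last):
            frames[i] = 1
    return frames


# Two utterances in a one-second conversation, at 10 frames per second.
va = voice_activity_frames([(0.0, 0.2), (0.5, 0.7)], duration=1.0, frame_hz=10)
print(va)  # [1, 1, 0, 0, 0, 1, 1, 0, 0, 0]
```

Running this once per speaker yields one vector per channel, which is a common input format for turn-taking models.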

The goal is to map these methods onto the dataset and save the results for efficient access later, enabling researchers to focus on their core research without worrying about the preprocessing steps.
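
The map-and-save idea can be sketched in plain Python as follows. This is a simplified stand-in, not the project's code: map_and_cache, the JSON cache file, and the example feature function are hypothetical. The Hugging Face datasets library offers comparable behaviour through Dataset.map, which caches its results on disk.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Union

Example = Dict[str, Any]


def map_and_cache(
    examples: Iterable[Example],
    feature_fn: Callable[[Example], Example],
    cache_path: Union[str, Path],
) -> List[Example]:
    """Apply feature_fn to every example and store the result as JSON.

    If cache_path already exists, the stored result is returned and
    feature_fn is not called again.
    """
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text())

    # Merge the newly computed features into a copy of each example.
    results = [{**ex, **feature_fn(ex)} for ex in examples]

    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_text(json.dumps(results))
    return results


def word_count(example: Example) -> Example:
    """Hypothetical text feature: number of whitespace-separated tokens."""
    return {"n_words": len(example["text"].split())}
```

A first call such as map_and_cache(conversations, word_count, "cache/word_count.json") computes and saves the features; later calls with the same path read them back without recomputation. A real pipeline would also need to invalidate the cache when the data or the feature function changes, which this sketch does not attempt.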

NOTE: Further documentation is forthcoming.