Datasets for Conversational AI

This project allows users to load multiple datasets useful for research in conversational AI, spoken dialogue systems, and linguistics. It leverages the Hugging Face datasets infrastructure to download and provide efficient access to variations of the same datasets, maintaining a single copy of the raw data and using minimal disk space to store dataset variations.

Project Dates: 8/21/22 - 8/21/23

Date Published: 6/23/24

Overview

I started this project to develop data pipelines that enable users to load multiple datasets useful for research in conversational AI, spoken dialogue systems, and linguistics. The primary contributions include:

  • Leveraging the Hugging Face datasets infrastructure to download and provide efficient access to variations of the same datasets.
  • Maintaining a single copy of the raw data while using minimal disk space to store dataset variations.
  • Providing access to a features package that can extract commonly required features from these datasets, such as voice activity.

This was inspired by Erik’s datasets turn-taking project.

About

The data pipelines project allows users to load multiple datasets useful for research in conversational AI, spoken dialogue systems, and linguistics. It leverages the Hugging Face datasets infrastructure to download and provide efficient access to variations of the same datasets, maintaining a single copy of the raw data and using minimal disk space to store dataset variations. It also provides a features package for extracting commonly required features from these datasets (e.g., voice activity).

The goal is to abstract away the process of loading datasets used in conversation research so that researchers can focus on model development. What sets the project apart is that it provides access to tools and dataset variations that might not otherwise be publicly available.

Datasets and Variants

Dataset Name   Variant        Description
Callfriend     default        Provides access to text features per conversation.
Callfriend     audio          Provides access to raw audio per conversation.
Callhome       default        Contains text features of telephone conversations.
Callhome       audio          Contains raw audio of telephone conversations.
Fisher         default        Includes time-aligned transcripts of conversations.
Fisher         audio          Includes audio paths of conversations.
Maptask        default        Provides access to dialogue transcripts.
Maptask        audio          Provides access to audio recordings of dialogues.
Switchboard    isip-aligned   Contains ISIP-aligned transcripts of conversations.
Switchboard    swda           Contains SWDA-aligned transcripts of conversations.
Switchboard    ldc-audio      Contains raw audio of conversations.
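
The (dataset, variant) pairs in the table above can be captured in a small lookup. The following is a minimal sketch, not this project's actual API: VARIANTS, resolve_variant, load_variant, and the script_dir layout are hypothetical names introduced for illustration. It assumes each dataset is exposed as a loading script whose configuration names match the variant names, so that a validated pair can be passed to load_dataset(path, name=variant) from the Hugging Face datasets library.

```python
# Variants from the table above, keyed by lowercase dataset name.
VARIANTS = {
    "callfriend": ("default", "audio"),
    "callhome": ("default", "audio"),
    "fisher": ("default", "audio"),
    "maptask": ("default", "audio"),
    "switchboard": ("isip-aligned", "swda", "ldc-audio"),
}


def resolve_variant(dataset, variant=None):
    """Validate a (dataset, variant) pair and return it in normalized form."""
    key = dataset.lower()
    if key not in VARIANTS:
        raise ValueError(f"Unknown dataset {dataset!r}; expected one of {sorted(VARIANTS)}")
    available = VARIANTS[key]
    if variant is None:
        # Switchboard has no "default" variant, so it must be named explicitly.
        if "default" not in available:
            raise ValueError(f"{dataset!r} has no default variant; choose one of {available}")
        variant = "default"
    if variant not in available:
        raise ValueError(f"{dataset!r} has no variant {variant!r}; choose one of {available}")
    return key, variant


def load_variant(dataset, variant=None, script_dir="path/to/loading/scripts"):
    """Hypothetical wrapper: script_dir is an assumed location, not a real path."""
    # Imported lazily so the lookup above works without the library installed.
    from datasets import load_dataset

    key, name = resolve_variant(dataset, variant)
    return load_dataset(f"{script_dir}/{key}", name=name)


if __name__ == "__main__":
    print(resolve_variant("Callhome"))
    print(resolve_variant("Switchboard", "swda"))
```

Validating the pair before calling the loader keeps a typo in the variant name from triggering a download attempt.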

Features Package

The goal of this project is to make it easy to download and parse datasets commonly used in conversational AI research. The load_dataset method provides access to common features for each dataset. However, depending on the application, additional features may need to be extracted, for example audio feature sets (such as GeMAPS) or voice activity derived from the transcripts.
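
As an illustration of deriving voice activity from transcripts, the sketch below turns time-aligned utterances into frame-level 0/1 activity per speaker. It is not the Features package's implementation: the function name voice_activity, the record fields (speaker, start, end, in seconds), and the frame_hz parameter are assumptions made for this example.

```python
def voice_activity(utterances, duration, frame_hz=100):
    """Return {speaker: [0/1 per frame]} from time-aligned utterances.

    Assumed input: dicts with "speaker", "start", and "end" (seconds).
    Frame i covers the interval [i / frame_hz, (i + 1) / frame_hz).
    """
    n_frames = round(duration * frame_hz)
    speakers = sorted({u["speaker"] for u in utterances})
    activity = {speaker: [0] * n_frames for speaker in speakers}
    for u in utterances:
        # Convert times to frame indices and clip them to the recording length.
        first = max(0, round(u["start"] * frame_hz))
        last = min(n_frames, round(u["end"] * frame_hz))
        for i in range(first, last):
            activity[u["speaker"]][i] = 1
    return activity


if __name__ == "__main__":
    utterances = [
        {"speaker": "A", "start": 0.0, "end": 0.3},
        {"speaker": "B", "start": 0.2, "end": 0.6},
    ]
    va = voice_activity(utterances, duration=1.0, frame_hz=10)
    print(va["A"])  # [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
    print(va["B"])  # [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
```

Frame 2 is active for both speakers, which is how overlapping speech shows up in this representation.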

The Features package provides access to methods that may be used for these purposes. Some key features include:

  • Audio Feature Extraction: Methods to extract audio features such as GeMAPS using tools like openSMILE.
  • Voice Activity Detection: Algorithms to detect and segment voice activity within audio recordings.
  • Text Features: Extraction of linguistic features from transcripts, such as part-of-speech tagging and syntactic parsing.

The goal is to map these methods onto the dataset and save the results for efficient access later, enabling researchers to focus on their core research without worrying about the preprocessing steps.
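
The map-and-save idea above can be sketched with the standard library alone. This is a stand-in under stated assumptions, not the project's implementation: map_and_cache, utterance_duration, the record fields, and the JSON cache file are all hypothetical. In the Hugging Face datasets library, Dataset.map plays a comparable role by caching its results on disk; whether the Features package uses it in this way is an assumption.

```python
import json
import tempfile
from pathlib import Path


def map_and_cache(records, feature_fn, cache_path):
    """Apply feature_fn to every record once; reuse saved results afterwards."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # An earlier call already saved the results: load instead of recomputing.
        with cache_path.open() as f:
            return json.load(f)
    # Merge each record with the new features computed from it.
    results = [{**record, **feature_fn(record)} for record in records]
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w") as f:
        json.dump(results, f)
    return results


def utterance_duration(record):
    """Hypothetical feature: utterance duration in seconds."""
    return {"duration": record["end"] - record["start"]}


if __name__ == "__main__":
    records = [{"text": "hello", "start": 0.0, "end": 0.5}]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "features.json"
        first = map_and_cache(records, utterance_duration, path)   # computed
        second = map_and_cache(records, utterance_duration, path)  # read from disk
        print(first == second)
```

A real cache would also need to be invalidated when the feature function or the input data changes; this sketch keys the cache on the file path only.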

NOTE: Further documentation is forthcoming.