Datasets for Conversational AI
This project allows users to load multiple datasets useful for research in conversational AI, spoken dialogue systems, and linguistics. It leverages the Hugging Face datasets infrastructure to download and provide efficient access to variations of the same datasets, maintaining a single copy of the raw data and using minimal disk space to store dataset variations.
Project Dates: 8/21/22 - 8/21/23
Date Published: 6/23/24
Overview
I started this project to develop data pipelines that enable users to load multiple datasets useful for research in conversational AI, spoken dialogue systems, and linguistics. The primary contributions include:
- Leveraging the Hugging Face datasets infrastructure to download and provide efficient access to variations of the same datasets.
- Maintaining a single copy of the raw data while using minimal disk space to store dataset variations.
- Providing access to a features package that can extract commonly required features from these datasets, such as voice activity.
This was inspired by Erik’s datasets turn-taking project.
About
The data pipelines project allows users to load multiple datasets useful for research in conversational AI, spoken dialogue systems, and linguistics. It leverages the Hugging Face datasets infrastructure to download and provide efficient access to variations of the same datasets, keeping a single copy of the raw data and using minimal disk space for each variation. It also provides a features package for extracting commonly required features from these datasets (e.g., voice activity).
The goal is to abstract the process of loading datasets that are useful for conversation research and allow researchers to focus on model development. It is unique because it provides access to tools and variations of datasets that might not be publicly available.
Datasets and Variants
Dataset Name | Variant | Description |
---|---|---|
Callfriend | default | Provides access to text features per conversation. |
Callfriend | audio | Provides access to raw audio per conversation. |
Callhome | default | Contains text features of telephone conversations. |
Callhome | audio | Contains raw audio of telephone conversations. |
Fisher | default | Includes time-aligned transcripts of conversations. |
Fisher | audio | Includes audio paths of conversations. |
Maptask | default | Provides access to dialogue transcripts. |
Maptask | audio | Provides access to audio recordings of dialogues. |
Switchboard | isip-aligned | Contains ISIP-aligned transcripts of conversations. |
Switchboard | swda | Contains SWDA-aligned transcripts of conversations. |
Switchboard | ldc-audio | Contains raw audio of conversations. |
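As a rough illustration of how the dataset and variant names in the table could be checked before loading, here is a minimal sketch. The variant names come from the table; the `VARIANTS` mapping and the `resolve_variant` helper are hypothetical and not part of the project's documented API, and treating the first listed variant as the fallback is an assumption.

```python
# Variant names are copied from the table above; everything else is illustrative.
VARIANTS = {
    "callfriend": ("default", "audio"),
    "callhome": ("default", "audio"),
    "fisher": ("default", "audio"),
    "maptask": ("default", "audio"),
    "switchboard": ("isip-aligned", "swda", "ldc-audio"),
}


def resolve_variant(dataset, variant=None):
    """Return a validated (dataset, variant) pair.

    The dataset name is matched case-insensitively. If no variant is given,
    the first variant listed for that dataset is used (an assumption).
    """
    key = dataset.lower()
    if key not in VARIANTS:
        raise KeyError(f"Unknown dataset {dataset!r}; choose from {sorted(VARIANTS)}")
    options = VARIANTS[key]
    chosen = options[0] if variant is None else variant
    if chosen not in options:
        raise ValueError(
            f"{dataset!r} has no variant {chosen!r}; choose from {list(options)}"
        )
    return key, chosen


if __name__ == "__main__":
    print(resolve_variant("Switchboard", "swda"))
    print(resolve_variant("Fisher"))
```

With the Hugging Face datasets library, a validated pair like this would typically be passed as `load_dataset(path, name=variant)`, where `path` points at the project's loading script. That path is not given here, so it is left out of the sketch.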
Features Package
The goal of this project is to make it easy to download and parse commonly used datasets in conversational AI research. The load_dataset method described in previous sections provides access to common features for each dataset. However, depending on the application, there may be a need to extract additional features. For example, audio feature sets (such as GeMAPS) may be required, or voice activity may need to be derived from the transcripts.
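To make the transcript-based voice activity idea concrete, the following sketch converts time-aligned utterances into per-speaker binary frame vectors. It is not the Features package's implementation: the function name, the `(speaker, start, end)` tuple layout, and the default frame rate are all assumptions chosen for illustration.

```python
def voice_activity_from_utterances(utterances, duration, frame_hz=100, n_speakers=2):
    """Convert time-aligned utterances into per-speaker binary frame vectors.

    utterances: iterable of (speaker_index, start_seconds, end_seconds).
    duration:   length of the conversation in seconds.
    Returns a list with one list per speaker; each inner list has
    round(duration * frame_hz) entries, where 1 marks a frame in which
    that speaker is talking.
    """
    n_frames = int(round(duration * frame_hz))
    activity = [[0] * n_frames for _ in range(n_speakers)]
    for speaker, start, end in utterances:
        if end <= start:
            continue  # skip empty or malformed segments
        # Clamp the segment to the conversation boundaries.
        first = max(0, int(round(start * frame_hz)))
        last = min(n_frames, int(round(end * frame_hz)))
        for frame in range(first, last):
            activity[speaker][frame] = 1
    return activity


if __name__ == "__main__":
    # Two overlapping utterances in a one-second conversation at 10 frames/s.
    demo = [(0, 0.0, 0.5), (1, 0.4, 1.0)]
    for row in voice_activity_from_utterances(demo, duration=1.0, frame_hz=10):
        print(row)
```

Frames where both speakers are marked active correspond to overlapping speech, which is often of interest in turn-taking research.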
The Features package provides access to methods that may be used for these purposes. Some key features include:
- Audio Feature Extraction: Methods to extract audio features such as GeMAPS using tools like openSMILE.
- Voice Activity Detection: Algorithms to detect and segment voice activity within audio recordings.
- Text Features: Extraction of linguistic features from transcripts, such as part-of-speech tagging and syntactic parsing.
The goal is to map these methods onto the dataset and save the results for efficient access later, enabling researchers to focus on their core research without worrying about the preprocessing steps.
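The map-and-save idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's mechanism: `map_with_cache`, the JSON cache layout, and the toy `word_count` feature are hypothetical names introduced here.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def map_with_cache(records, feature_fn, cache_dir):
    """Apply feature_fn to each record, reusing results saved by earlier runs.

    Each record must be a JSON-serializable dict; the hash of its content,
    together with the function name, forms the cache key.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    results = []
    for record in records:
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        cache_file = cache_dir / f"{feature_fn.__name__}-{key}.json"
        if cache_file.exists():
            # Cache hit: load the stored features instead of recomputing.
            features = json.loads(cache_file.read_text(encoding="utf-8"))
        else:
            features = feature_fn(record)
            cache_file.write_text(json.dumps(features), encoding="utf-8")
        # Merge the new feature columns into a copy of the record.
        results.append({**record, **features})
    return results


def word_count(record):
    """Toy text feature: number of whitespace-separated tokens."""
    return {"n_words": len(record["text"].split())}


if __name__ == "__main__":
    rows = [{"text": "hello there"}, {"text": "how are you"}]
    with tempfile.TemporaryDirectory() as tmp:
        print(map_with_cache(rows, word_count, tmp))  # computes and saves
        print(map_with_cache(rows, word_count, tmp))  # served from the cache
```

The Hugging Face datasets library provides a comparable facility through `Dataset.map`, which caches its output on disk, so a real pipeline would more likely rely on that than on a hand-rolled cache like this one.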
NOTE: Further documentation is forthcoming.