
# Note: these are only SAMPLED dialogues for ACL Rolling Review Submission.

## DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection and Instruction-Aware Models for Conversational AI


## Contents

- [Introduction](#introduction)
- [Datasets](#datasets)
- [License](#license)


## Introduction

DialogStudio is a large collection of unified dialog datasets.
The figure below summarizes the general statistics of DialogStudio. DialogStudio unifies each dataset while preserving its original information, which supports research on both individual datasets and Large Language Model (LLM) training. The full list of available datasets is [here](./Dataset_Stats.csv).

## Datasets

The datasets in this GitHub repository are split into several categories. See the [table of datasets](./Dataset_Stats.csv) for more information, or open each folder to view a few examples:

- [Knowledge-Grounded-Dialogues](./knowledge-grounded-dialogues/)
- [Natural-Language-Understanding](./natural-language-understanding/)
- [Open-Domain-Dialogues](./open-domain-dialogues/)
- [Task-Oriented-Dialogues](./task-oriented-dialogues/)
- [Dialogue-Summarization](./dialogue-summarization/)
- [Conversational-Recommendation-Dialogues](./conversational-recommendation-dialogues/)



## License

Licensing for this project is structured as follows:

1. For all the modified datasets in DialogStudio: 
   - A portion of these datasets is under the Apache License 2.0.
   - Some retain their original licenses even after modification.
   - For a few datasets that lacked a license, we have cited the relevant papers.
2. Original dataset licenses: For reference, we also include the originally available license for each dataset in its respective dataset folder.
3. Code: Our codebase is under the Apache License 2.0.

For detailed licensing information, please refer to the specific licenses accompanying the original datasets. It is important to familiarize yourself with these terms as we do not assume responsibility for licensing issues.

Check [Dataset_Stats.csv](./Dataset_Stats.csv) for the license of each dataset.  
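
As a convenience, the license lookup described above can be scripted. The following is a minimal sketch, not part of the DialogStudio codebase: the helper names `load_dataset_stats` and `count_by_column` are hypothetical, and the column names in `Dataset_Stats.csv` are not assumed, so the column to inspect is passed in by the caller.

```python
import csv
from collections import Counter
from pathlib import Path


def load_dataset_stats(csv_path="Dataset_Stats.csv"):
    """Read the stats table into a list of dicts, one per dataset row.

    The default path assumes the script runs from the repository root.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def count_by_column(rows, column):
    """Count how many rows share each value of `column`.

    `column` is whatever header the CSV actually uses (for example, the
    header of its license column); it is not hard-coded here because the
    real header names are an unknown in this sketch.
    """
    if rows and column not in rows[0]:
        raise KeyError(f"{column!r} not found; available columns: {list(rows[0])}")
    # Missing cells come back as None from DictReader, so normalize to "".
    return Counter((row[column] or "").strip() for row in rows)
```

To use it, call `load_dataset_stats()` from the repository root, inspect `list(rows[0])` to see the actual column headers, and then pass the license column's header to `count_by_column`.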

