Benjamin Maurice
2022
Bazinga! A Dataset for Multi-Party Dialogues Structuring
Paul Lerner | Juliette Bergoënd | Camille Guinaudeau | Hervé Bredin | Benjamin Maurice | Sharleyne Lefevre | Martin Bouteiller | Aman Berhe | Léo Galmant | Ruiqing Yin | Claude Barras
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We introduce a dataset built around a large collection of TV (and movie) series. These series are filled with challenging multi-party dialogues. Moreover, TV series come with a very active fan base, which makes it possible to collect metadata and accelerates annotation. With 16 TV and movie series, Bazinga! amounts to 400+ hours of speech and 8M+ tokens, including 500K+ tokens annotated with speaker, addressee, and entity linking information. Along with the dataset, we also provide a baseline for speaker diarization, punctuation restoration, and person entity recognition. The results demonstrate the difficulty of these tasks and of transfer learning from models trained on mono-speaker audio or written text, for which data is more widely available. This work is a step towards better multi-party dialogue structuring and understanding. Bazinga! is available at hf.co/bazinga. Because a large part of Bazinga! is only partially annotated, we also expect this dataset to foster research on self- or weakly-supervised learning methods.
Co-authors
- Paul Lerner 1
- Juliette Bergoënd 1
- Camille Guinaudeau 1
- Hervé Bredin 1
- Sharleyne Lefevre 1
Venues
- LREC 1