Bazinga! A Dataset for Multi-Party Dialogues Structuring

Paul Lerner; Juliette Bergoënd; Camille Guinaudeau; Hervé Bredin; Benjamin Maurice; Sharleyne Lefevre; Martin Bouteiller; Aman Berhe; Léo Galmant; Ruiqing Yin; Claude Barras

Abstract

We introduce a dataset built around a large collection of TV (and movie) series, which are filled with challenging multi-party dialogues. Moreover, TV series come with a very active fan base, which makes it possible to collect metadata and accelerates annotation. With 16 TV and movie series, Bazinga! amounts to 400+ hours of speech and 8M+ tokens, including 500K+ tokens annotated with speaker, addressee, and entity linking information. Along with the dataset, we also provide baselines for speaker diarization, punctuation restoration, and person entity recognition. The results demonstrate the difficulty of these tasks, and of transfer learning from models trained on mono-speaker audio or written text, which are more widely available. This work is a step towards better multi-party dialogue structuring and understanding. Bazinga! is available at hf.co/bazinga. Because a large part of Bazinga! is only partially annotated, we also expect this dataset to foster research on self- or weakly-supervised learning methods.

Anthology ID: 2022.lrec-1.367
Volume: Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month: June
Year: 2022
Address: Marseille, France
Editors: Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue: LREC
Publisher: European Language Resources Association
Pages: 3434–3441
URL: https://aclanthology.org/2022.lrec-1.367
Cite (ACL): Paul Lerner, Juliette Bergoënd, Camille Guinaudeau, Hervé Bredin, Benjamin Maurice, Sharleyne Lefevre, Martin Bouteiller, Aman Berhe, Léo Galmant, Ruiqing Yin, and Claude Barras. 2022. Bazinga! A Dataset for Multi-Party Dialogues Structuring. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3434–3441, Marseille, France. European Language Resources Association.
Cite (Informal): Bazinga! A Dataset for Multi-Party Dialogues Structuring (Lerner et al., LREC 2022)
PDF: https://preview.aclanthology.org/nschneid-patch-1/2022.lrec-1.367.pdf
Data: Serial Speakers
