Serial Speakers: a Dataset of TV Series

Xavier Bost, Vincent Labatut, Georges Linares


Abstract
For over a decade, TV series have been drawing increasing interest, both from audiences and from various academic fields. But while most viewers are hooked on the continuous plots of TV serials, the few annotated datasets available to researchers focus on standalone episodes of classical TV series. We aim to fill this gap by providing the multimedia and speech processing communities with “Serial Speakers”, an annotated dataset of 155 episodes from three popular American TV serials: “Breaking Bad”, “Game of Thrones” and “House of Cards”. “Serial Speakers” is suitable both for investigating multimedia retrieval in realistic use-case scenarios and for addressing lower-level speech-related tasks in especially challenging conditions. We publicly release annotations for every speech turn (boundaries, speaker) and scene boundary, along with annotations for shot boundaries, recurring shots, and interacting speakers in a subset of episodes. Because of copyright restrictions, the textual content of the speech turns is encrypted in the public version of the dataset, but we provide users with a simple online tool to recover the plain text from their own subtitle files.
Anthology ID:
2020.lrec-1.525
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4256–4264
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.525
Cite (ACL):
Xavier Bost, Vincent Labatut, and Georges Linares. 2020. Serial Speakers: a Dataset of TV Series. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4256–4264, Marseille, France. European Language Resources Association.
Cite (Informal):
Serial Speakers: a Dataset of TV Series (Bost et al., LREC 2020)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2020.lrec-1.525.pdf
Code:
bostxavier/Serial-Speakers
Data:
Serial Speakers