TVD: A Reproducible and Multiply Aligned TV Series Dataset

Anindya Roy; Camille Guinaudeau; Hervé Bredin; Claude Barras

Abstract

We introduce a new dataset built around two TV series from different genres: The Big Bang Theory, a situation comedy, and Game of Thrones, a fantasy drama. The dataset has multiple tracks extracted from diverse sources, including dialogue (manual and automatic transcripts, multilingual subtitles), crowd-sourced textual descriptions (brief episode summaries, longer episode outlines) and various metadata (speakers, shots, scenes). The paper describes the dataset and provides tools to reproduce it for research purposes, provided one has legally acquired the DVD set of the series. Tools are also provided to temporally align a major subset of the dialogue and description tracks, in order to combine the complementary information present in these tracks for enhanced accessibility. For alignment, we consider tracks as comparable corpora and first apply an existing algorithm for aligning such corpora, based on dynamic time warping and TF-IDF-based similarity scores. We improve this baseline algorithm using contextual information, WordNet-based word similarity and scene location information. We report the performance of these algorithms on a manually aligned subset of the data. To highlight the interest of the dataset, we report a use case involving rich speech retrieval and propose other uses.
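
The baseline alignment idea mentioned in the abstract (dynamic time warping over TF-IDF-based similarity scores) can be illustrated with a minimal sketch. This is not the authors' tool: the function names, the whitespace tokenization, the smoothed IDF formula and the toy strings are all assumptions made for illustration, and the contextual, WordNet-based and scene-location improvements are not covered.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per tokenized document (smoothed IDF, an assumption)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def dtw_align(a_segments, b_segments):
    """Align two ordered lists of text segments with dynamic time warping.

    The local cost is 1 - cosine similarity of TF-IDF vectors.
    Returns a monotonic path of (index_in_a, index_in_b) pairs.
    """
    docs = [s.lower().split() for s in a_segments + b_segments]
    vecs = tfidf_vectors(docs)
    va, vb = vecs[:len(a_segments)], vecs[len(a_segments):]
    n, m = len(va), len(vb)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 1.0 - cosine(va[i - 1], vb[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    # Backtrack from the last cell to recover the lowest-cost path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
    path.reverse()
    return path


# Invented toy data standing in for a transcript track and an outline track.
transcript = [
    "sheldon knocks on the door",
    "leonard orders some food",
    "howard builds a robot arm",
]
outline = [
    "sheldon knocks repeatedly on the door",
    "the robot arm built by howard breaks",
]
path = dtw_align(transcript, outline)
```

Each pair in `path` links a segment of the first track to a segment of the second, and the path only ever moves forward in both tracks, which is what makes dynamic time warping a natural fit for two descriptions of the same episode told in the same order.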

Anthology ID: L14-1588
Volume: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month: May
Year: 2014
Address: Reykjavik, Iceland
Venue: LREC
Publisher: European Language Resources Association (ELRA)
Pages: 418–425
URL: http://www.lrec-conf.org/proceedings/lrec2014/pdf/751_Paper.pdf
Cite (ACL): Anindya Roy, Camille Guinaudeau, Hervé Bredin, and Claude Barras. 2014. TVD: A Reproducible and Multiply Aligned TV Series Dataset. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 418–425, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal): TVD: A Reproducible and Multiply Aligned TV Series Dataset (Roy et al., LREC 2014)
PDF: http://www.lrec-conf.org/proceedings/lrec2014/pdf/751_Paper.pdf
