Xavier Bost
2026
Overcoming Copyright Barriers in Corpus Distribution Through Non-Reversible Hashing
Arthur Amalvy | Vincent Labatut | Xavier Bost | Hen-Hsen Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Arthur Amalvy | Vincent Labatut | Xavier Bost | Hen-Hsen Huang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While annotated corpora are crucial in the field of natural language processing (NLP), those containing copyrighted material are difficult to exchange among researchers. Yet, such corpora are necessary to fully represent the diversity of data found in the wild in the context of NLP tasks. We tackle this issue by proposing a method to lawfully share the annotations of any sequential copyrighted corpus. The corpus creator shares the annotations in clear, along with a non-reversible hashed version of the source material. The corpus user must own the source material, and apply the same hash function to their own tokens, in order to match them to the shared annotations. Crucially, our method is robust to reasonable divergences in the version of the copyrighted data owned by the user. As an illustration, we present alignment experiments on different editions of novels. Our results show that our method is able to correctly correctly align 98.7 to 99.79% of tokens depending on the novel, provided the user version is sufficiently close to the corpus creator’s version. We publicly release novelties-bookshare, a Python implementation of our method.
2020
Serial Speakers: a Dataset of TV Series
Xavier Bost | Vincent Labatut | Georges Linares
Proceedings of the Twelfth Language Resources and Evaluation Conference
Xavier Bost | Vincent Labatut | Georges Linares
Proceedings of the Twelfth Language Resources and Evaluation Conference
For over a decade, TV series have been drawing increasing interest, both from the audience and from various academic fields. But while most viewers are hooked on the continuous plots of TV serials, the few annotated datasets available to researchers focus on standalone episodes of classical TV series. We aim at filling this gap by providing the multimedia/speech processing communities with “Serial Speakers”, an annotated dataset of 155 episodes from three popular American TV serials: “Breaking Bad”, “Game of Thrones” and “House of Cards”. “Serial Speakers” is suitable both for investigating multimedia retrieval in realistic use case scenarios, and for addressing lower level speech related tasks in especially challenging conditions. We publicly release annotations for every speech turn (boundaries, speaker) and scene boundary, along with annotations for shot boundaries, recurring shots, and interacting speakers in a subset of episodes. Because of copyright restrictions, the textual content of the speech turns is encrypted in the public version of the dataset, but we provide the users with a simple online tool to recover the plain text from their own subtitle files.