BehanceCC: A ChitChat Detection Dataset For Livestreaming Video Transcripts

Viet Lai, Amir Pouran Ben Veyseh, Franck Dernoncourt, Thien Nguyen


Abstract
Livestreaming videos have become an effective broadcasting method for both video sharing and educational purposes. However, livestreaming videos contain a considerable amount of off-topic content (up to 50%), which introduces significant noise and data load for downstream applications. This paper presents BehanceCC, a new human-annotated benchmark dataset for off-topic detection (also called chitchat detection) in livestreaming video transcripts. We describe the challenges of the dataset, and our extensive experiments with various baselines reveal the complexity of chitchat detection for livestreaming videos and suggest potential future research directions for this task. The dataset will be made publicly available to foster research in this area.
Anthology ID:
2022.lrec-1.791
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
7284–7290
URL:
https://aclanthology.org/2022.lrec-1.791
Cite (ACL):
Viet Lai, Amir Pouran Ben Veyseh, Franck Dernoncourt, and Thien Nguyen. 2022. BehanceCC: A ChitChat Detection Dataset For Livestreaming Video Transcripts. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 7284–7290, Marseille, France. European Language Resources Association.
Cite (Informal):
BehanceCC: A ChitChat Detection Dataset For Livestreaming Video Transcripts (Lai et al., LREC 2022)
PDF:
https://preview.aclanthology.org/emnlp22-frontmatter/2022.lrec-1.791.pdf