A New Dataset for Topic-Based Paragraph Classification in Genocide-Related Court Transcripts

Miriam Schirmer, Udo Kruschwitz, Gregor Donabauer


Abstract
Recent progress in natural language processing has been impressive in many areas, with transformer-based approaches setting new benchmarks for a wide range of applications. This development has also lowered the barriers for people outside the NLP community to tap into tools and resources for a variety of domain-specific applications. The bottleneck, however, remains the lack of annotated gold-standard collections as soon as one’s research or professional interest falls outside the scope of what is readily available. One such area is genocide-related research, which includes the work of experts with a professional interest in accessing, exploring and searching large-scale document collections on the topic, such as lawyers. We present GTC (Genocide Transcript Corpus), the first annotated corpus of genocide-related court transcripts, which serves three purposes: (1) to provide a first reference corpus for the community, (2) to establish benchmark performances (using state-of-the-art transformer-based approaches) for the new classification task of identifying paragraphs that contain violence-related witness statements, and (3) to explore first steps towards transfer learning within the domain. We consider our contribution to address in particular this year’s hot topic of Language Technology for All.
Anthology ID:
2022.lrec-1.479
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
4504–4512
URL:
https://aclanthology.org/2022.lrec-1.479
Cite (ACL):
Miriam Schirmer, Udo Kruschwitz, and Gregor Donabauer. 2022. A New Dataset for Topic-Based Paragraph Classification in Genocide-Related Court Transcripts. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 4504–4512, Marseille, France. European Language Resources Association.
Cite (Informal):
A New Dataset for Topic-Based Paragraph Classification in Genocide-Related Court Transcripts (Schirmer et al., LREC 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-1/2022.lrec-1.479.pdf
Code:
miriamschirmer/genocide-transcript-corpus
Data:
Genocide Transcript Corpus (GTC): Topic-Based Paragraph Classification in Genocide-Related Court Transcripts