A Comprehensive Evaluation and Correction of the TimeBank Corpus

Mustafa Ocal, Antonela Radas, Jared Hummer, Karine Megerdoomian, Mark Finlayson


Abstract
TimeML is an annotation scheme for capturing temporal information in text. The developers of TimeML built the TimeBank corpus both to validate the scheme and to provide a rich dataset of events, temporal expressions, and temporal relationships for training and testing temporal analysis systems. In our own work we have been developing methods that operate on TimeML graphs to detect (and eventually automatically correct) temporal inconsistencies, extract timelines, and assess temporal indeterminacy. In the course of this investigation we identified numerous previously unrecognized issues in the TimeBank corpus, including multiple violations of the TimeML annotation guidelines, incorrectly disconnected temporal graphs, and inconsistent, redundant, missing, or otherwise incorrect annotations. We describe our methods for detecting and correcting these problems, which include: (a) automatic guideline checking (109 violations); (b) automatic inconsistency checking (65 inconsistent files); (c) automatic disconnectivity checking (625 incorrect breakpoints); and (d) manual comparison with the output of state-of-the-art automatic annotators to identify missing annotations (317 events, 52 temporal expressions). We provide our code as well as a set of patch files that can be applied to the TimeBank corpus to produce a corrected version for use by other researchers in the field.
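As a rough illustration of what a disconnectivity check over a temporal graph can look like, the sketch below groups event and time-expression identifiers into connected components, treating each temporal link as an undirected edge. It is not the authors' released code: the function name `connected_components`, the triple format `(source, relation, target)`, and the toy identifiers are all assumptions made for this example. Deciding whether a given break between components is actually an annotation error is a separate step that the sketch does not attempt.

```python
from collections import defaultdict


def connected_components(nodes, links):
    """Group node IDs into connected components, ignoring link direction.

    `nodes` is an iterable of event/time IDs; `links` is an iterable of
    (source, relation, target) triples. Both formats are hypothetical.
    """
    adjacency = defaultdict(set)
    for source, _relation, target in links:
        adjacency[source].add(target)
        adjacency[target].add(source)

    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # Iterative depth-first search from an unvisited node.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


# Hypothetical toy document: two events linked to each other, and a third
# event linked only to a time expression, so the graph has two components.
nodes = ["e1", "e2", "e3", "t1"]
links = [("e1", "BEFORE", "e2"), ("e3", "INCLUDES", "t1")]
components = connected_components(nodes, links)
print(len(components))
```

A document whose graph yields more than one component is only a candidate for review; some breaks may be legitimate.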
Anthology ID:
2022.lrec-1.313
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2919–2927
URL:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2022.lrec-1.313/
Cite (ACL):
Mustafa Ocal, Antonela Radas, Jared Hummer, Karine Megerdoomian, and Mark Finlayson. 2022. A Comprehensive Evaluation and Correction of the TimeBank Corpus. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2919–2927, Marseille, France. European Language Resources Association.
Cite (Informal):
A Comprehensive Evaluation and Correction of the TimeBank Corpus (Ocal et al., LREC 2022)
PDF:
https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2022.lrec-1.313.pdf