Martin Yamalov
2020
The MARCELL Legislative Corpus
Tamás Váradi | Svetla Koeva | Martin Yamalov | Marko Tadić | Bálint Sass | Bartłomiej Nitoń | Maciej Ogrodniczuk | Piotr Pęzik | Verginica Barbu Mititelu | Radu Ion | Elena Irimia | Maria Mitrofan | Vasile Păiș | Dan Tufiș | Radovan Garabík | Simon Krek | Andraz Repar | Matjaž Rihtar | Janez Brank
Proceedings of the Twelfth Language Resources and Evaluation Conference
This article presents the current outcomes of the MARCELL CEF Telecom project, which aims to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the complete body of the respective national legislative documents. These sub-corpora are automatically sentence-split, tokenized, lemmatized, and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are provided uniformly for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency annotation, the corpus is enriched with IATE and EUROVOC labels. The files are in the CoNLL-U Plus format, which contains the ten columns of the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
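As an illustration of the file format described above, the following is a minimal sketch of reading a CoNLL-U Plus file in Python. It relies only on the `# global.columns` header line that CoNLL-U Plus files declare, so it does not hard-code the names of the four extra columns. The function name `read_conllu_plus`, the `EX:`-prefixed column names, and the two-token sample sentence are all invented for illustration and are not taken from the corpus.

```python
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns.
STANDARD_COLUMNS = "ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC".split()


def read_conllu_plus(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dicts (hypothetical helper)."""
    columns = None
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # The header names every column, standard and extra, in order.
            columns = line.split("=", 1)[1].split()
            continue
        if line.startswith("#"):
            continue  # other comment lines (sentence id, text, ...)
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if columns is None:
            raise ValueError("missing '# global.columns' header")
        values = line.split("\t")
        if len(values) != len(columns):
            raise ValueError(f"expected {len(columns)} columns, got {len(values)}")
        sentence.append(dict(zip(columns, values)))
    if sentence:
        yield sentence


# Invented sample: ten standard columns plus four placeholder extra columns.
SAMPLE_COLUMNS = STANDARD_COLUMNS + ["EX:NE", "EX:NP", "EX:IATE", "EX:EUROVOC"]
sample = [
    "# global.columns = " + " ".join(SAMPLE_COLUMNS),
    "# text = Laws apply",
    "\t".join(["1", "Laws", "law", "NOUN", "_", "_", "2", "nsubj", "_", "_",
               "O", "B-NP", "_", "_"]),
    "\t".join(["2", "apply", "apply", "VERB", "_", "_", "0", "root", "_", "_",
               "O", "O", "_", "_"]),
    "",
]

sentences = list(read_conllu_plus(sample))
for token in sentences[0]:
    print(token["FORM"], token["UPOS"], token["EX:NP"])
```

Reading the column names from the header rather than fixing them in code keeps the sketch usable for any CoNLL-U Plus file, whatever its extra columns are called.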