Estonian Dependency Treebank: from Constraint Grammar tagset to Universal Dependencies

Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen


Abstract
This paper presents the first version of the Estonian Universal Dependencies Treebank, which was semi-automatically derived from the Estonian Dependency Treebank and comprises ca. 400,000 words (ca. 30,000 sentences) representing the genres of fiction, newspapers and scientific writing. The article analyses the differences between the two annotation schemes and describes the conversion procedure to the Universal Dependencies format. The conversion was carried out with manually created Constraint Grammar transfer rules. Because the rules can consider unbounded context and combine lexical information with both flat and tree-structure features at the same time, the method has proved reliable and flexible enough to handle most of the transformations. The automatic conversion procedure achieved LAS 95.2%, UAS 96.3% and LA 98.4%. With punctuation marks excluded from the calculations, we observed LAS 96.4%, UAS 97.7% and LA 98.2%. Still, the guidelines and methodology need refinement in order to re-annotate some syntactic phenomena, e.g. inter-clausal relations. Although the automatic rules usually make quite a good guess even in obscure conditions, some relations should be checked and annotated manually after the main conversion.
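The scores reported in the abstract can be illustrated with a minimal sketch. It assumes the standard definitions: UAS is the share of tokens with the correct head, LA the share with the correct relation label, and LAS the share with both correct. The function name `attachment_scores`, the `(upos, head, deprel)` token layout and the toy sentence are hypothetical and are not taken from the paper; the authors' actual evaluation script is not shown here.

```python
from typing import Dict, List, Tuple

# Hypothetical token layout for this sketch: (UPOS tag, head index, relation label).
Token = Tuple[str, int, str]


def attachment_scores(
    gold: List[Token], pred: List[Token], skip_punct: bool = False
) -> Dict[str, float]:
    """Compute UAS, LAS and LA over two aligned token sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be aligned")

    total = uas = las = la = 0
    for (g_pos, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        # Optionally leave punctuation out, as in the second set of scores.
        if skip_punct and g_pos == "PUNCT":
            continue
        total += 1
        head_ok = g_head == p_head
        label_ok = g_rel == p_rel
        uas += head_ok               # correct head
        la += label_ok               # correct label
        las += head_ok and label_ok  # correct head and label

    if total == 0:
        raise ValueError("no tokens left to score")
    return {"UAS": uas / total, "LAS": las / total, "LA": la / total}


# Toy four-token sentence (invented for illustration).
gold = [("PRON", 2, "nsubj"), ("VERB", 0, "root"), ("NOUN", 2, "obj"), ("PUNCT", 2, "punct")]
pred = [("PRON", 2, "nsubj"), ("VERB", 0, "root"), ("NOUN", 2, "nmod"), ("PUNCT", 3, "punct")]

with_punct = attachment_scores(gold, pred)
without_punct = attachment_scores(gold, pred, skip_punct=True)
```

On this toy input, excluding punctuation raises UAS (the only head error is on the punctuation token) but lowers LA (the punctuation label was correct), which is one way LA can drop while LAS and UAS rise.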
Anthology ID:
L16-1247
Volume:
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Month:
May
Year:
2016
Address:
Portorož, Slovenia
Editors:
Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
1558–1565
URL:
https://aclanthology.org/L16-1247
Cite (ACL):
Kadri Muischnek, Kaili Müürisep, and Tiina Puolakainen. 2016. Estonian Dependency Treebank: from Constraint Grammar tagset to Universal Dependencies. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 1558–1565, Portorož, Slovenia. European Language Resources Association (ELRA).
Cite (Informal):
Estonian Dependency Treebank: from Constraint Grammar tagset to Universal Dependencies (Muischnek et al., LREC 2016)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/L16-1247.pdf