Abstract
Multilingual inflectional corpora are a scarce resource in the NLP community, especially corpora with annotated morpheme boundaries. We evaluate a multilingual inflectional corpus with morpheme boundaries, generated from the English Wiktionary (Metheniti and Neumann, 2018), against the largest multilingual, high-quality inflectional corpus, that of the UniMorph project (Kirov et al., 2018). We confirm that the generated Wikinflection corpus does not match UniMorph in quality, but we were able to extract a significant number of words from the intersection of the two corpora. Our Wikinflection corpus benefits from the morpheme segmentations of Wiktionary/Wikinflection and from the manually evaluated morphological feature tags of the UniMorph project, and has 216K lemmas and 5.4M word forms in a total of 68 languages.
- Anthology ID:
- 2020.lrec-1.481
- Volume:
- Proceedings of the Twelfth Language Resources and Evaluation Conference
- Month:
- May
- Year:
- 2020
- Address:
- Marseille, France
- Editors:
- Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association
- Pages:
- 3905–3912
- Language:
- English
- URL:
- https://aclanthology.org/2020.lrec-1.481
- Cite (ACL):
- Eleni Metheniti and Guenter Neumann. 2020. Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3905–3912, Marseille, France. European Language Resources Association.
- Cite (Informal):
- Wikinflection Corpus: A (Better) Multilingual, Morpheme-Annotated Inflectional Corpus (Metheniti & Neumann, LREC 2020)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-2/2020.lrec-1.481.pdf