Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB-2.0

Clémentine Fourrier, Benoît Sagot


Abstract
Diachronic lexical information is not only important in the field of historical linguistics, but is also increasingly used in NLP, most recently for machine translation of low-resource languages. Therefore, there is a need for fine-grained, large-coverage and accurate etymological lexical resources. In this paper, we propose a set of guidelines to generate such resources, covering each step of the life cycle of an etymological lexicon: creation, update, evaluation, dissemination, and exploitation. To illustrate the guidelines, we introduce EtymDB 2.0, an etymological database automatically generated from Wiktionary, which contains 1.8 million lexemes, linked by more than 700,000 fine-grained etymological relations, across 2,536 living and dead languages. We also introduce use cases for which EtymDB 2.0 could represent a key resource, such as phylogenetic tree generation, low-resource machine translation, or the study of medieval languages.
Anthology ID:
2020.lrec-1.392
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3207–3216
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.392
Cite (ACL):
Clémentine Fourrier and Benoît Sagot. 2020. Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB-2.0. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3207–3216, Marseille, France. European Language Resources Association.
Cite (Informal):
Methodological Aspects of Developing and Managing an Etymological Lexical Resource: Introducing EtymDB-2.0 (Fourrier & Sagot, LREC 2020)
PDF:
https://preview.aclanthology.org/paclic-22-ingestion/2020.lrec-1.392.pdf
Code:
clefourrier/EtymDB
Data:
EtymDB 2.0