Creating and enriching a repository of 177k interlinearized examples in 1611 mostly lesser-resourced languages

Sebastian Nordhoff


Abstract
Much of NLP is concerned with languages for which dictionaries, thesauri, word nets or treebanks are available. This contribution focuses on languages for which all we have might be some isolated examples with word-to-word translation. We detail the collection, aggregation, storage and querying of a database of 177k examples from 1611 languages, with a special eye on enrichment via Named Entity Recognition and links to the Wikidata ontology. We also discuss pitfalls of the approach and the legal status of interlinear examples.
Anthology ID:
2025.ldk-1.20
Volume:
Proceedings of the 5th Conference on Language, Data and Knowledge
Month:
September
Year:
2025
Address:
Naples, Italy
Editors:
Mehwish Alam, Andon Tchechmedjiev, Jorge Gracia, Dagmar Gromann, Maria Pia di Buono, Johanna Monti, Maxim Ionov
Venues:
LDK | WS
Publisher:
Unior Press
Pages:
186–196
URL:
https://preview.aclanthology.org/ldl-25-ingestion/2025.ldk-1.20/
Cite (ACL):
Sebastian Nordhoff. 2025. Creating and enriching a repository of 177k interlinearized examples in 1611 mostly lesser-resourced languages. In Proceedings of the 5th Conference on Language, Data and Knowledge, pages 186–196, Naples, Italy. Unior Press.
Cite (Informal):
Creating and enriching a repository of 177k interlinearized examples in 1611 mostly lesser-resourced languages (Nordhoff, LDK 2025)
PDF:
https://preview.aclanthology.org/ldl-25-ingestion/2025.ldk-1.20.pdf