Working Towards Digital Documentation of Uralic Languages With Open-Source Tools and Modern NLP Methods
Mika Hämäläinen, Jack Rueter, Khalid Alnajjar, Niko Partanen
Abstract
We present our work towards building an infrastructure for documenting endangered languages with the focus on Uralic languages in particular. Our infrastructure consists of tools to write dictionaries so that entries are structured in XML format. These dictionaries are the foundation for rule-based NLP tools such as FSTs. We also work actively towards enhancing these dictionaries and tools by using the latest state-of-the-art neural models by generating training data through rules and lexica.
- Anthology ID:
- 2023.bigpicture-1.2
- Volume:
- Proceedings of the Big Picture Workshop
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Yanai Elazar, Allyson Ettinger, Nora Kassner, Sebastian Ruder, Noah A. Smith
- Venue:
- BigPicture
- Publisher:
- Association for Computational Linguistics
- Pages:
- 18–27
- URL:
- https://aclanthology.org/2023.bigpicture-1.2
- DOI:
- 10.18653/v1/2023.bigpicture-1.2
- Cite (ACL):
- Mika Hämäläinen, Jack Rueter, Khalid Alnajjar, and Niko Partanen. 2023. Working Towards Digital Documentation of Uralic Languages With Open-Source Tools and Modern NLP Methods. In Proceedings of the Big Picture Workshop, pages 18–27, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- Working Towards Digital Documentation of Uralic Languages With Open-Source Tools and Modern NLP Methods (Hämäläinen et al., BigPicture 2023)
- PDF:
- https://preview.aclanthology.org/ingest-bitext-workshop/2023.bigpicture-1.2.pdf