LinguaMeta: Unified Metadata for Thousands of Languages
Sandy Ritchie, Daan van Esch, Uche Okonkwo, Shikhar Vashishth, Emily Drummond
Abstract
We introduce LinguaMeta, a unified resource for language metadata for thousands of languages, including language codes, names, number of speakers, writing systems, countries, official status, coordinates, and language varieties. The resources are drawn from various existing repositories and supplemented with our own research. Each data point is tagged for its origin, allowing us to easily trace back to and improve existing resources with more up-to-date and complete metadata. The resource is intended for use by researchers and organizations who aim to extend technology to thousands of languages.
- Anthology ID:
- 2024.lrec-main.921
- Original:
- 2024.lrec-main.921v1
- Version 2:
- 2024.lrec-main.921v2
- Volume:
- Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
- Venues:
- LREC | COLING
- Publisher:
- ELRA and ICCL
- Pages:
- 10530–10538
- URL:
- https://aclanthology.org/2024.lrec-main.921
- Cite (ACL):
- Sandy Ritchie, Daan van Esch, Uche Okonkwo, Shikhar Vashishth, and Emily Drummond. 2024. LinguaMeta: Unified Metadata for Thousands of Languages. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 10530–10538, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- LinguaMeta: Unified Metadata for Thousands of Languages (Ritchie et al., LREC-COLING 2024)
- PDF:
- https://preview.aclanthology.org/landing_page/2024.lrec-main.921.pdf