Mason Shipton
2026
Simple Additions, Substantial Gains: Expanding Scripts, Languages, and Lineage Coverage in URIEL+
Mason Shipton | York Hay Ng | Aditya Khan | Phuong H. Hoang | Xiang Lu | A. Seza Dogruoz | Annie En-Shiun Lee
Proceedings of the Fifteenth Language Resources and Evaluation Conference
The URIEL+ linguistic knowledge base supports multilingual research by encoding languages through geographic, genetic, and typological vectors. However, data sparsity (e.g., missing feature types, incomplete language entries, and limited genealogical coverage) remains prevalent. This limits the usefulness of URIEL+ in cross-lingual transfer, particularly for supporting low-resource languages. To address this sparsity, we extend URIEL+ by introducing script vectors to represent writing system properties for 7,488 languages, integrating Glottolog to add 18,710 additional languages, and expanding lineage imputation for 26,449 languages by propagating typological and script features across genealogies. These improvements reduce feature sparsity by 14% for script vectors, increase language coverage by up to 19,015 languages (1,007%), and boost imputation quality metrics by up to 35%. Our benchmark on cross-lingual transfer tasks (oriented around low-resource languages) shows occasionally divergent performance compared to URIEL+, with gains of up to 6% in certain setups.
2025
URIEL+: Enhancing Linguistic Inclusion and Usability in a Typological and Multilingual Knowledge Base
Aditya Khan | Mason Shipton | David Anugraha | Kaiyao Duan | Phuong H. Hoang | Eric Khiu | A. Seza Doğruöz | En-Shiun Annie Lee
Proceedings of the 31st International Conference on Computational Linguistics
URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To tackle these challenges, we introduce URIEL+, an enhanced version of URIEL and lang2vec that addresses these limitations. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves the user experience with robust, customizable distance calculations to better suit the needs of users. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.
2024
Empowering the Future with Multilinguality and Language Diversity
En-Shiun Annie Lee | Kosei Uemura | Syed Mekael Wasti | Mason Shipton
Proceedings of the Sixth Workshop on Teaching NLP
The rapid advancement and widespread transformative effect of Large Language Models have made it necessary to incorporate these cutting-edge techniques into Natural Language Processing (NLP) curricula, even with limited computing resources. This paper presents an applied NLP course designed for upper-year computer science undergraduate students, covering state-of-the-art techniques with an emphasis on multilinguality and language diversity. We hope to empower learners to advance their language community while preparing for industry.