Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages
C. Downey, Shannon Drizin, Levon Haroutunian, Shivin Thukral
Abstract
We show that unsupervised sequence-segmentation performance can be transferred to extremely low-resource languages by pre-training a Masked Segmental Language Model (Downey et al., 2021) multilingually. Further, we show that this transfer can be achieved by training over a collection of low-resource languages that are typologically similar (but phylogenetically unrelated) to the target language. In our experiments, we transfer from a collection of 10 Indigenous American languages (AmericasNLP, Mager et al., 2021) to K’iche’, a Mayan language. We compare our multilingual model to a monolingual (from-scratch) baseline, as well as a model pre-trained on Quechua only. We show that the multilingual pre-trained approach yields consistent segmentation quality across target dataset sizes, exceeding the monolingual baseline in 6/10 experimental settings. Our model yields especially strong results at small target sizes, including a zero-shot performance of 20.6 F1. These results have promising implications for low-resource NLP pipelines involving human-like linguistic units, such as the sparse transcription framework proposed by Bird (2020).
- Anthology ID:
- 2022.acl-long.366
- Volume:
- Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
- Month:
- May
- Year:
- 2022
- Address:
- Dublin, Ireland
- Editors:
- Smaranda Muresan, Preslav Nakov, Aline Villavicencio
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5331–5346
- URL:
- https://aclanthology.org/2022.acl-long.366
- DOI:
- 10.18653/v1/2022.acl-long.366
- Cite (ACL):
- C. Downey, Shannon Drizin, Levon Haroutunian, and Shivin Thukral. 2022. Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5331–5346, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal):
- Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages (Downey et al., ACL 2022)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2022.acl-long.366.pdf
- Code:
- cmdowney88/xlslm