Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages

C. Downey, Shannon Drizin, Levon Haroutunian, Shivin Thukral


Abstract
We show that unsupervised sequence-segmentation performance can be transferred to extremely low-resource languages by pre-training a Masked Segmental Language Model (Downey et al., 2021) multilingually. Further, we show that this transfer can be achieved by training over a collection of low-resource languages that are typologically similar (but phylogenetically unrelated) to the target language. In our experiments, we transfer from a collection of 10 Indigenous American languages (AmericasNLP, Mager et al., 2021) to K’iche’, a Mayan language. We compare our multilingual model to a monolingual (from-scratch) baseline, as well as a model pre-trained on Quechua only. We show that the multilingual pre-trained approach yields consistent segmentation quality across target dataset sizes, exceeding the monolingual baseline in 6/10 experimental settings. Our model yields especially strong results at small target sizes, including a zero-shot performance of 20.6 F1. These results have promising implications for low-resource NLP pipelines involving human-like linguistic units, such as the sparse transcription framework proposed by Bird (2020).
Anthology ID:
2022.acl-long.366
Volume:
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
May
Year:
2022
Address:
Dublin, Ireland
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
5331–5346
URL:
https://aclanthology.org/2022.acl-long.366
DOI:
10.18653/v1/2022.acl-long.366
Cite (ACL):
C. Downey, Shannon Drizin, Levon Haroutunian, and Shivin Thukral. 2022. Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5331–5346, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Multilingual unsupervised sequence segmentation transfers to extremely low-resource languages (Downey et al., ACL 2022)
PDF:
https://preview.aclanthology.org/auto-file-uploads/2022.acl-long.366.pdf
Video:
https://preview.aclanthology.org/auto-file-uploads/2022.acl-long.366.mp4
Code:
cmdowney88/xlslm