How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages
Rachit Bansal, Himanshu Choudhary, Ravneet Punia, Niko Schenk, Émilie Pagé-Perron, Jacob Dahl
Abstract
Despite the recent advancements of attention-based deep learning architectures across a majority of Natural Language Processing tasks, their application remains limited in a low-resource setting because of a lack of pre-trained models for such languages. In this study, we make the first attempt to investigate the challenges of adapting these techniques to an extremely low-resource language – Sumerian cuneiform – one of the world’s oldest written languages attested from at least the beginning of the 3rd millennium BC. Specifically, we introduce the first cross-lingual information extraction pipeline for Sumerian, which includes part-of-speech tagging, named entity recognition, and machine translation. We introduce InterpretLR, an interpretability toolkit for low-resource NLP and use it alongside human evaluations to gauge the trained models. Notably, all our techniques and most components of our pipeline can be generalised to any low-resource language. We publicly release all our implementations including a novel data set with domain-specific pre-processing to promote further research in this domain.
- Anthology ID:
- 2021.acl-srw.5
- Volume:
- Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop
- Month:
- August
- Year:
- 2021
- Address:
- Online
- Editors:
- Jad Kabbara, Haitao Lin, Amandalynne Paullada, Jannis Vamvas
- Venues:
- ACL | IJCNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 44–59
- URL:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2021.acl-srw.5/
- DOI:
- 10.18653/v1/2021.acl-srw.5
- Cite (ACL):
- Rachit Bansal, Himanshu Choudhary, Ravneet Punia, Niko Schenk, Émilie Pagé-Perron, and Jacob Dahl. 2021. How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop, pages 44–59, Online. Association for Computational Linguistics.
- Cite (Informal):
- How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages (Bansal et al., ACL-IJCNLP 2021)
- PDF:
- https://preview.aclanthology.org/build-pipeline-with-new-library/2021.acl-srw.5.pdf
- Code:
- cdli-gh/Semi-Supervised-NMT-for-Sumerian-English