One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks

Sebastian Nehrdich; Oliver Hellwig; Kurt Keutzer

doi:10.18653/v1/2024.findings-emnlp.805

Abstract

Morphologically rich languages are notoriously challenging to process for downstream NLP applications. This paper presents a new pretrained language model, ByT5-Sanskrit, designed for NLP applications involving the morphologically rich language Sanskrit. We evaluate ByT5-Sanskrit on established Sanskrit word segmentation tasks, where it outperforms previous data-driven approaches by a considerable margin and matches the performance of the current best lexicon-based model. It is easier to deploy and more robust to data not covered by external linguistic resources. It also achieves new state-of-the-art results in Vedic Sanskrit dependency parsing and OCR post-correction tasks. Additionally, based on the Digital Corpus of Sanskrit, we introduce a novel multitask dataset for the joint training of Sanskrit word segmentation, lemmatization, and morphosyntactic tagging tasks. We fine-tune ByT5-Sanskrit on this dataset, creating a versatile multitask model for various downstream Sanskrit applications. We have used this model in Sanskrit linguistic annotation projects, in information retrieval setups, and as a preprocessing step in a Sanskrit machine translation pipeline. We also show that our approach yields new best scores for lemmatization and dependency parsing of other morphologically rich languages. We thus demonstrate that byte-level pretrained language models can achieve excellent performance for morphologically rich languages, outperforming tokenizer-based models and presenting an important vector of exploration when constructing NLP pipelines for such languages.
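To make the contrast between byte-level and tokenizer-based input concrete, the following is a minimal sketch of the byte-to-ID mapping generally used by ByT5-style models: each UTF-8 byte is shifted by three reserved IDs (pad, end-of-sequence, unknown). It is not taken from the paper. The helper names `encode_bytes` and `decode_bytes` are hypothetical, and the example word is only an illustration of how a diacritic in romanized Sanskrit expands to several bytes.

```python
# Illustrative sketch only: mimics the byte-to-ID scheme commonly used by
# ByT5-style models (UTF-8 byte value + 3). Not code from the paper.

SPECIAL_OFFSET = 3   # assumption: IDs 0, 1, 2 are reserved for pad, EOS, unk
EOS_ID = 1           # assumption: end-of-sequence ID
NUM_BYTES = 256      # one ID per possible byte value


def encode_bytes(text: str) -> list[int]:
    """Map a string to byte-level IDs and append the EOS ID (hypothetical helper)."""
    ids = [b + SPECIAL_OFFSET for b in text.encode("utf-8")]
    return ids + [EOS_ID]


def decode_bytes(ids: list[int]) -> str:
    """Invert encode_bytes, skipping special IDs and anything outside the byte range."""
    raw = bytes(
        i - SPECIAL_OFFSET
        for i in ids
        if SPECIAL_OFFSET <= i < SPECIAL_OFFSET + NUM_BYTES
    )
    return raw.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    word = "dharmakṣetre"  # 12 characters; "ṣ" takes 3 bytes in UTF-8
    ids = encode_bytes(word)
    print(len(word), len(ids))  # character count vs. ID count (including EOS)
    print(decode_bytes(ids) == word)
```

Because the input is just bytes, no vocabulary has to be learned or shipped for the script in question, which is one reason such models can be applied to text that a subword tokenizer or an external lexicon does not cover. The cost is longer input sequences, as the example word shows.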

Anthology ID: 2024.findings-emnlp.805
Volume: Findings of the Association for Computational Linguistics: EMNLP 2024
Month: November
Year: 2024
Address: Miami, Florida, USA
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 13742–13751
URL: https://preview.aclanthology.org/add-emnlp-2024-awards/2024.findings-emnlp.805/
DOI: 10.18653/v1/2024.findings-emnlp.805
Cite (ACL): Sebastian Nehrdich, Oliver Hellwig, and Kurt Keutzer. 2024. One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 13742–13751, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal): One Model is All You Need: ByT5-Sanskrit, a Unified Model for Sanskrit NLP Tasks (Nehrdich et al., Findings 2024)
PDF: https://preview.aclanthology.org/add-emnlp-2024-awards/2024.findings-emnlp.805.pdf
