Abstract
The conversion of romanized texts back to the native scripts is a challenging task because of the inconsistent romanization conventions and non-standard language use. This problem is compounded by code-mixing, i.e., using words from more than one language within the same discourse. In this paper, we propose a novel approach for handling these two problems together in a single system. Our approach combines three components: language identification, back-transliteration, and sequence prediction. The results of our experiments on Bengali and Hindi datasets establish the state of the art for the task of deromanization of code-mixed texts.
- Anthology ID:
- W19-1403
- Volume:
- Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects
- Month:
- June
- Year:
- 2019
- Address:
- Ann Arbor, Michigan
- Editors:
- Marcos Zampieri, Preslav Nakov, Shervin Malmasi, Nikola Ljubešić, Jörg Tiedemann, Ahmed Ali
- Venue:
- VarDial
- Publisher:
- Association for Computational Linguistics
- Pages:
- 26–34
- URL:
- https://aclanthology.org/W19-1403
- DOI:
- 10.18653/v1/W19-1403
- Cite (ACL):
- Rashed Rubby Riyadh and Grzegorz Kondrak. 2019. Joint Approach to Deromanization of Code-mixed Texts. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 26–34, Ann Arbor, Michigan. Association for Computational Linguistics.
- Cite (Informal):
- Joint Approach to Deromanization of Code-mixed Texts (Riyadh & Kondrak, VarDial 2019)
- PDF:
- https://preview.aclanthology.org/autopr/W19-1403.pdf