Joint Approach to Deromanization of Code-mixed Texts

Rashed Rubby Riyadh, Grzegorz Kondrak


Abstract
The conversion of romanized texts back to the native scripts is a challenging task because of the inconsistent romanization conventions and non-standard language use. This problem is compounded by code-mixing, i.e., using words from more than one language within the same discourse. In this paper, we propose a novel approach for handling these two problems together in a single system. Our approach combines three components: language identification, back-transliteration, and sequence prediction. The results of our experiments on Bengali and Hindi datasets establish the state of the art for the task of deromanization of code-mixed texts.
Anthology ID:
W19-1403
Volume:
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects
Month:
June
Year:
2019
Address:
Ann Arbor, Michigan
Editors:
Marcos Zampieri, Preslav Nakov, Shervin Malmasi, Nikola Ljubešić, Jörg Tiedemann, Ahmed Ali
Venue:
VarDial
Publisher:
Association for Computational Linguistics
Pages:
26–34
URL:
https://aclanthology.org/W19-1403
DOI:
10.18653/v1/W19-1403
Cite (ACL):
Rashed Rubby Riyadh and Grzegorz Kondrak. 2019. Joint Approach to Deromanization of Code-mixed Texts. In Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects, pages 26–34, Ann Arbor, Michigan. Association for Computational Linguistics.
Cite (Informal):
Joint Approach to Deromanization of Code-mixed Texts (Riyadh & Kondrak, VarDial 2019)
PDF:
https://preview.aclanthology.org/autopr/W19-1403.pdf