Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance

Soumil Mandal; Karthick Nanmaran

doi:10.18653/v1/W18-6107


Abstract

Building tools for code-mixed data is rapidly gaining popularity in the NLP research community as such data is rising exponentially on social media. Working with code-mixed data poses several challenges, especially due to grammatical inconsistencies and spelling variations, in addition to all the previously known challenges of social media text. In this article, we present a novel architecture focused on normalizing phonetic typing variations, which are commonly seen in code-mixed data. One of the main features of our architecture is that, in addition to normalization, it can also be used for back-transliteration and word identification in some cases. Our model achieved an accuracy of 90.27% on the test data.
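The title names Levenshtein distance as one component of the architecture, but the abstract does not say how it is applied. As an illustration only, the following is a minimal sketch of the standard edit-distance computation, together with a hypothetical helper `closest_entry` that picks the nearest word from a lexicon. The helper, the example words, and the idea of matching a candidate string against a lexicon are assumptions made for this sketch, not details taken from the paper.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Standard edit distance: minimum number of single-character
    insertions, deletions, and substitutions turning `a` into `b`."""
    # Keep the shorter string on the inner loop so each row is small.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def closest_entry(candidate: str, lexicon: Iterable[str]) -> str:
    """Hypothetical helper (an assumption, not from the paper): return the
    lexicon word with the smallest edit distance to `candidate`.
    Ties go to the first such word in iteration order."""
    return min(lexicon, key=lambda word: levenshtein(candidate, word))


if __name__ == "__main__":
    # Illustrative romanized spellings; the lexicon here is made up.
    lexicon = ["khoob", "bhalo"]
    print(closest_entry("khub", lexicon))
```

In a sketch like this, a noisy spelling is mapped to whichever lexicon entry requires the fewest character edits; how the paper actually combines this measure with the Seq2Seq model is described in the full text, not here.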

Anthology ID: W18-6107
Volume: Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text
Month: November
Year: 2018
Address: Brussels, Belgium
Editors: Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi
Venue: WNUT
Publisher: Association for Computational Linguistics
Pages: 49–53
URL: https://preview.aclanthology.org/jlcl-multiple-ingestion/W18-6107/
DOI: 10.18653/v1/W18-6107
Cite (ACL): Soumil Mandal and Karthick Nanmaran. 2018. Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance. In Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, pages 49–53, Brussels, Belgium. Association for Computational Linguistics.
Cite (Informal): Normalization of Transliterated Words in Code-Mixed Data Using Seq2Seq Model & Levenshtein Distance (Mandal & Nanmaran, WNUT 2018)
PDF: https://preview.aclanthology.org/jlcl-multiple-ingestion/W18-6107.pdf
