Abstract
With the growing popularity of code-mixed data, there is an increasing need for better handling of this type of data, which poses a number of challenges, such as dealing with spelling variations, multiple languages, different scripts, and a lack of resources. Current language models face difficulty in effectively handling code-mixed data as they primarily focus on the semantic representation of words and ignore the auditory phonetic features. This leads to difficulties in handling spelling variations in code-mixed text. In this paper, we propose an effective approach for creating language models for handling code-mixed textual data using auditory information of words from SOUNDEX. Our approach includes a pre-training step based on masked-language-modelling, which includes SOUNDEX representations (SAMLM) and a new method of providing input data to the pre-trained model. Through experimentation on various code-mixed datasets (of different languages) for sentiment, offensive and aggression classification tasks, we establish that our novel language modeling approach (SAMLM) results in improved robustness towards adversarial attacks on code-mixed classification tasks. Additionally, our SAMLM based approach also results in better classification results over the popular baselines for code-mixed tasks. We use the explainability technique, SHAP (SHapley Additive exPlanations) to explain how the auditory features incorporated through SAMLM assist the model to handle the code-mixed text effectively and increase robustness against adversarial attacks.- Anthology ID:
- 2023.emnlp-main.987
- Volume:
- Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
- Month:
- December
- Year:
- 2023
- Address:
- Singapore
- Editors:
- Houda Bouamor, Juan Pino, Kalika Bali
- Venue:
- EMNLP
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 15918–15932
- Language:
- URL:
- https://aclanthology.org/2023.emnlp-main.987
- DOI:
- 10.18653/v1/2023.emnlp-main.987
- Cite (ACL):
- Mamta Mamta, Zishan Ahmad, and Asif Ekbal. 2023. Elevating Code-mixed Text Handling through Auditory Information of Words. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 15918–15932, Singapore. Association for Computational Linguistics.
- Cite (Informal):
- Elevating Code-mixed Text Handling through Auditory Information of Words (Mamta et al., EMNLP 2023)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2023.emnlp-main.987.pdf