CoMix: Guide Transformers to Code-Mix using POS structure and Phonetics

Gaurav Arora, Srujana Merugu, Vivek Sembium


Abstract
Code-mixing is ubiquitous in multilingual societies, which makes it vital to build models for code-mixed data to power human language interfaces. Existing multilingual transformer models trained on pure corpora lack the ability to intermix words of one language into the structure of another. These models are also not robust to orthographic variations. We propose CoMix, a pretraining approach to improve representation of code-mixed data in transformer models by incorporating phonetic signals, a modified attention mechanism, and weakly supervised generation guided by part-of-speech constraints. (CoMix is not a trademark; the name is used only to refer to our models for code-mixed data, for presentational brevity.) We show that CoMix improves performance across four code-mixed tasks: machine translation, sequence classification, named entity recognition (NER), and abstractive summarization. It also achieves new state-of-the-art (SOTA) performance for English-Hinglish translation and NER on the LINCE leaderboard and generalizes better on out-of-domain translation. Motivated by variations in human annotations, we also propose a new family of metrics based on phonetics and demonstrate that the phonetic variant of BLEU correlates better with human judgement than BLEU on code-mixed text.
Anthology ID:
2023.findings-acl.506
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
7985–8002
URL:
https://aclanthology.org/2023.findings-acl.506
DOI:
10.18653/v1/2023.findings-acl.506
Cite (ACL):
Gaurav Arora, Srujana Merugu, and Vivek Sembium. 2023. CoMix: Guide Transformers to Code-Mix using POS structure and Phonetics. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7985–8002, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
CoMix: Guide Transformers to Code-Mix using POS structure and Phonetics (Arora et al., Findings 2023)
PDF:
https://preview.aclanthology.org/ingest-2024-clasp/2023.findings-acl.506.pdf