Abstract
Code-mixing is ubiquitous in multilingual societies, which makes it vital to build models for code-mixed data to power human language interfaces. Existing multilingual transformer models trained on pure corpora lack the ability to intermix words of one language into the structure of another. These models are also not robust to orthographic variations. We propose CoMix, a pretraining approach that improves the representation of code-mixed data in transformer models by incorporating phonetic signals, a modified attention mechanism, and weakly supervised generation guided by part-of-speech constraints. We show that CoMix improves performance across four code-mixed tasks: machine translation, sequence classification, named entity recognition (NER), and abstractive summarization. It also achieves new SOTA performance for English-Hinglish translation and NER on the LINCE Leaderboard and provides better generalization on out-of-domain translation. Motivated by variations in human annotations, we also propose a new family of metrics based on phonetics and demonstrate that the phonetic variant of BLEU correlates better with human judgement than BLEU on code-mixed text.
Note: CoMix is not a trademark; the name is used only to refer to our models for code-mixed data, for presentational brevity.
- Anthology ID:
- 2023.findings-acl.506
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2023
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 7985–8002
- URL:
- https://aclanthology.org/2023.findings-acl.506
- DOI:
- 10.18653/v1/2023.findings-acl.506
- Cite (ACL):
- Gaurav Arora, Srujana Merugu, and Vivek Sembium. 2023. CoMix: Guide Transformers to Code-Mix using POS structure and Phonetics. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7985–8002, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- CoMix: Guide Transformers to Code-Mix using POS structure and Phonetics (Arora et al., Findings 2023)
- PDF:
- https://preview.aclanthology.org/ingest-2024-clasp/2023.findings-acl.506.pdf