Abstract
Text generation is a highly active area of research in the computational linguistics community. Evaluating generated text is a challenging task, and multiple theories and metrics have been proposed over the years. Unfortunately, text generation and evaluation are relatively understudied in code-mixed languages, where words and phrases from multiple languages are mixed in a single utterance of text or speech, due to the scarcity of high-quality resources. To address this challenge, we present a corpus (HinGE) for a widely popular code-mixed language, Hinglish (code-mixing of Hindi and English). HinGE contains Hinglish sentences generated by humans as well as by two rule-based algorithms, corresponding to parallel Hindi-English sentences. In addition, we demonstrate the inefficacy of widely used evaluation metrics on code-mixed data. The HinGE dataset will facilitate the progress of natural language generation research in code-mixed languages.

- Anthology ID: 2021.eval4nlp-1.20
- Volume: Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems
- Month: November
- Year: 2021
- Address: Punta Cana, Dominican Republic
- Editors: Yang Gao, Steffen Eger, Wei Zhao, Piyawat Lertvittayakumjorn, Marina Fomicheva
- Venue: Eval4NLP
- Publisher: Association for Computational Linguistics
- Pages: 200–208
- URL: https://preview.aclanthology.org/icon-24-ingestion/2021.eval4nlp-1.20/
- DOI: 10.18653/v1/2021.eval4nlp-1.20
- Cite (ACL): Vivek Srivastava and Mayank Singh. 2021. HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text. In Proceedings of the 2nd Workshop on Evaluation and Comparison of NLP Systems, pages 200–208, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal): HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text (Srivastava & Singh, Eval4NLP 2021)
- PDF: https://preview.aclanthology.org/icon-24-ingestion/2021.eval4nlp-1.20.pdf