Assessing Back-Translation as a Corpus Generation Strategy for non-English Tasks: A Study in Reading Comprehension and Word Sense Disambiguation
Fabricio Monsalve, Kervy Rivas Rojas, Marco Antonio Sobrevilla Cabezudo, Arturo Oncevay
Abstract
Corpora curated by experts have sustained Natural Language Processing mainly in English, but the expensiveness of corpora creation is a barrier for the development in further languages. Thus, we propose a corpus generation strategy that only requires a machine translation system between English and the target language in both directions, where we filter the best translations by computing automatic translation metrics and the task performance score. By studying Reading Comprehension in Spanish and Word Sense Disambiguation in Portuguese, we identified that a more quality-oriented metric has high potential in the corpora selection without degrading the task performance. We conclude that it is possible to systematise the building of quality corpora using machine translation and automatic metrics, besides some prior effort to clean and process the data.
- Anthology ID:
- W19-4010
- Volume:
- Proceedings of the 13th Linguistic Annotation Workshop
- Month:
- August
- Year:
- 2019
- Address:
- Florence, Italy
- Editors:
- Annemarie Friedrich, Deniz Zeyrek, Jet Hoek
- Venue:
- LAW
- SIG:
- SIGANN
- Publisher:
- Association for Computational Linguistics
- Pages:
- 81–89
- URL:
- https://preview.aclanthology.org/add_missing_videos/W19-4010/
- DOI:
- 10.18653/v1/W19-4010
- Cite (ACL):
- Fabricio Monsalve, Kervy Rivas Rojas, Marco Antonio Sobrevilla Cabezudo, and Arturo Oncevay. 2019. Assessing Back-Translation as a Corpus Generation Strategy for non-English Tasks: A Study in Reading Comprehension and Word Sense Disambiguation. In Proceedings of the 13th Linguistic Annotation Workshop, pages 81–89, Florence, Italy. Association for Computational Linguistics.
- Cite (Informal):
- Assessing Back-Translation as a Corpus Generation Strategy for non-English Tasks: A Study in Reading Comprehension and Word Sense Disambiguation (Monsalve et al., LAW 2019)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/W19-4010.pdf
- Data
- SQuAD, Word Sense Disambiguation: a Unified Evaluation Framework and Empirical Comparison