On the Role of Parallel Data in Cross-lingual Transfer Learning

Machel Reid, Mikel Artetxe


Abstract
While prior work has established that the use of parallel data is conducive to cross-lingual learning, it is unclear whether the improvements come from the data itself or from the modeling of parallel interactions. To explore this, we examine the use of unsupervised machine translation to generate synthetic parallel data, and compare it to supervised machine translation and gold parallel data. We find that even model-generated parallel data can be useful for downstream tasks, in both a general setting (continued pretraining) and a task-specific setting (translate-train), although our best results are still obtained using real parallel data. Our findings suggest that existing multilingual models do not exploit the full potential of monolingual data, and prompt the community to reconsider the traditional categorization of cross-lingual learning approaches.
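As a rough illustration of the translate-train recipe mentioned in the abstract, the sketch below machine-translates an English training set into a target language while carrying the labels over unchanged; a multilingual model would then be fine-tuned on the translated data. The MT checkpoint and the toy dataset are illustrative assumptions, not the authors' exact setup, and any MT system (gold data aside, supervised or unsupervised, as the paper compares) could fill the translator's role.

    # Minimal translate-train sketch: translate the training text into the
    # target language, keep the task labels, then fine-tune on the result.
    # The MT model and toy examples below are assumptions for illustration.
    from transformers import pipeline

    # Any MT system can supply the translations; the paper contrasts gold
    # parallel data with supervised and unsupervised MT as sources.
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")

    english_train = [
        {"text": "The movie was wonderful.", "label": 1},
        {"text": "I would not recommend this product.", "label": 0},
    ]

    # Labels transfer unchanged; only the text is translated.
    french_train = [
        {"text": translator(ex["text"])[0]["translation_text"],
         "label": ex["label"]}
        for ex in english_train
    ]

    print(french_train)
    # The translated set would then be used to fine-tune a multilingual
    # encoder (e.g., XLM-R) for the downstream task in the target language.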
Anthology ID: 2023.findings-acl.372
Volume: Findings of the Association for Computational Linguistics: ACL 2023
Month: July
Year: 2023
Address: Toronto, Canada
Editors: Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 5999–6006
URL: https://aclanthology.org/2023.findings-acl.372
DOI: 10.18653/v1/2023.findings-acl.372
Cite (ACL): Machel Reid and Mikel Artetxe. 2023. On the Role of Parallel Data in Cross-lingual Transfer Learning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5999–6006, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal): On the Role of Parallel Data in Cross-lingual Transfer Learning (Reid & Artetxe, Findings 2023)
PDF: https://preview.aclanthology.org/naacl-24-ws-corrections/2023.findings-acl.372.pdf