xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages
Mingda Chen, Kevin Heffernan, Onur Çelebi, Alexandre Mourachko, Holger Schwenk
Abstract
We introduce a new proxy score for evaluating bitext mining based on similarity in a multilingual embedding space: xsim++. In comparison to xsim, this improved proxy leverages rule-based approaches to extend English sentences in any evaluation set with synthetic, hard-to-distinguish examples which more closely mirror the scenarios we encounter during large-scale mining. We validate this proxy by running a significant number of bitext mining experiments for a set of low-resource languages, and subsequently train NMT systems on the mined data. In comparison to xsim, we show that xsim++ is better correlated with the downstream BLEU scores of translation systems trained on mined bitexts, providing a reliable proxy of bitext mining performance without needing to run expensive bitext mining pipelines. xsim++ also reports performance for different error types, offering more fine-grained feedback for model development.
- Anthology ID:
- 2023.acl-short.10
- Volume:
- Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 101–109
- URL:
- https://preview.aclanthology.org/add_missing_videos/2023.acl-short.10/
- DOI:
- 10.18653/v1/2023.acl-short.10
- Cite (ACL):
- Mingda Chen, Kevin Heffernan, Onur Çelebi, Alexandre Mourachko, and Holger Schwenk. 2023. xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 101–109, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- xSIM++: An Improved Proxy to Bitext Mining Performance for Low-Resource Languages (Chen et al., ACL 2023)
- PDF:
- https://preview.aclanthology.org/add_missing_videos/2023.acl-short.10.pdf
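The xsim error rate described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes plain cosine-similarity nearest-neighbor retrieval (the actual xsim uses a margin-based score), and all function and variable names are hypothetical:

```python
import numpy as np

def xsim_error_rate(src_emb, tgt_emb):
    """Fraction of source sentences whose nearest target embedding
    (by cosine similarity) is not the gold-aligned translation.
    Row i of src_emb is assumed aligned with row i of tgt_emb."""
    # L2-normalize rows so the dot product equals cosine similarity
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                # pairwise cosine similarities
    nearest = sims.argmax(axis=1)     # index of the most similar target
    gold = np.arange(src.shape[0])    # row i should retrieve row i
    return float((nearest != gold).mean())
```

Under this view, xsim++ keeps the same error-rate computation but enlarges the target side with synthetic, hard-to-distinguish candidates, so a retrieval mistake against a near-duplicate also counts as an error.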