Multilingual Data Filtering using Synthetic Data from Large Language Models

Jonas Waldendorf, Barry Haddow, Alexandra Birch, Mateusz Klimaszewski


Abstract
Filtering data, particularly data scraped from the internet, has long been recognised as a means to improve model performance. Recent studies have shown that effective filters can be created by utilising Large Language Models (LLMs) to synthetically label data, which is then used to train smaller neural models for filtering purposes. However, this approach has been tested mainly on English. Our paper extends this approach to languages beyond English, including languages not officially supported by the LLM. We validate our results on the downstream task of neural machine translation (NMT) and demonstrate that our approach is effective at both filtering parallel text for translation quality and filtering for domain specificity. For training the filtering model, we experiment with two different objectives for finetuning pre-trained transformers, as well as an efficient approach based on *n*-gram language models.
Anthology ID:
2025.findings-emnlp.495
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
9317–9334
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.495/
DOI:
10.18653/v1/2025.findings-emnlp.495
Cite (ACL):
Jonas Waldendorf, Barry Haddow, Alexandra Birch, and Mateusz Klimaszewski. 2025. Multilingual Data Filtering using Synthetic Data from Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 9317–9334, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Multilingual Data Filtering using Synthetic Data from Large Language Models (Waldendorf et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.495.pdf
Checklist:
 2025.findings-emnlp.495.checklist.pdf