Unsupervised Feature Selection for Effective Parallel Corpus Filtering

Mikko Aulamo, Ona de Gibert, Sami Virpioja, Jörg Tiedemann


Abstract
This work presents an unsupervised method for selecting filters and threshold values for the OpusFilter parallel corpus cleaning toolbox. The method clusters sentence pairs into noisy and clean categories and uses the features of the noisy cluster center as filtering parameters. Our approach utilizes feature importance analysis to disregard filters that do not differentiate between clean and noisy data. A randomly sampled subset of a given corpus is used for filter selection, and ineffective filters are not run on the full corpus. We use a set of automatic evaluation metrics to assess the quality of translation models trained with data filtered by our method and data filtered with OpusFilter’s default parameters. The trained models cover English-German and English-Ukrainian in both directions. The proposed method outperforms the default parameters in all translation directions for almost all evaluation metrics.
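The idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use the OpusFilter API. It assumes that each sentence pair has already been scored by several filters, that every score is scaled so that higher means cleaner, and that a simple gap between the two cluster centers can stand in for the feature importance analysis mentioned in the abstract. The feature names, the sample scores, and the `min_gap` parameter are hypothetical.

```python
from statistics import mean


def two_means(rows, iterations=20):
    """Plain 2-means clustering with a deterministic initialisation."""
    # Seed the centers with the rows that have the lowest and highest mean score.
    ordered = sorted(rows, key=mean)
    centers = [list(ordered[0]), list(ordered[-1])]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assign each row to the nearest center (squared Euclidean distance).
        labels = [
            min((0, 1), key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centers[c])))
            for row in rows
        ]
        # Recompute each center as the mean of its members.
        for c in (0, 1):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [mean(col) for col in zip(*members)]
    return centers, labels


def select_thresholds(rows, names, min_gap=0.1):
    """Use the noisy cluster center as thresholds; drop uninformative features.

    Assumption: the gap between the clean and noisy centers is a crude proxy
    for feature importance. The paper's actual analysis may differ.
    """
    centers, _ = two_means(rows)
    # The center with the lower mean score is treated as the noisy cluster.
    noisy, clean = sorted(centers, key=mean)
    thresholds = {}
    for name, low, high in zip(names, noisy, clean):
        if high - low >= min_gap:  # keep only features that separate the clusters
            thresholds[name] = low
    return thresholds


# Hypothetical filter scores for a small random sample of sentence pairs.
feature_names = ["length_ratio_score", "lang_id_confidence", "alphabet_ratio"]
sample_rows = [
    [0.90, 0.95, 1.00],
    [0.80, 0.90, 1.00],
    [0.85, 0.92, 0.98],
    [0.20, 0.30, 1.00],
    [0.30, 0.20, 0.99],
    [0.25, 0.25, 1.00],
]

thresholds = select_thresholds(sample_rows, feature_names)
print(thresholds)
```

In this sample the third feature takes nearly the same value in both clusters, so it is dropped and would not need to be computed for the full corpus. The two remaining features receive the noisy center's values as their thresholds.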
Anthology ID:
2023.eamt-1.4
Volume:
Proceedings of the 24th Annual Conference of the European Association for Machine Translation
Month:
June
Year:
2023
Address:
Tampere, Finland
Editors:
Mary Nurminen, Judith Brenner, Maarit Koponen, Sirkku Latomaa, Mikhail Mikhailov, Frederike Schierl, Tharindu Ranasinghe, Eva Vanmassenhove, Sergi Alvarez Vidal, Nora Aranberri, Mara Nunziatini, Carla Parra Escartín, Mikel Forcada, Maja Popovic, Carolina Scarton, Helena Moniz
Venue:
EAMT
Publisher:
European Association for Machine Translation
Pages:
31–38
URL:
https://aclanthology.org/2023.eamt-1.4
Cite (ACL):
Mikko Aulamo, Ona de Gibert, Sami Virpioja, and Jörg Tiedemann. 2023. Unsupervised Feature Selection for Effective Parallel Corpus Filtering. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 31–38, Tampere, Finland. European Association for Machine Translation.
Cite (Informal):
Unsupervised Feature Selection for Effective Parallel Corpus Filtering (Aulamo et al., EAMT 2023)
PDF:
https://preview.aclanthology.org/ingest-acl-2023-videos/2023.eamt-1.4.pdf