MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora

Lifeng Han, Gareth Jones, Alan Smeaton


Abstract
Multi-word expressions (MWEs) are an active research topic in natural language processing (NLP), covering MWE detection, MWE decomposition, and the exploitation of MWEs in other NLP fields such as Machine Translation (MT). However, the availability of bilingual or multi-lingual MWE corpora is very limited. The only bilingual MWE corpus that we are aware of is from the PARSEME (PARSing and Multi-word Expressions) EU Project. It is a small collection of only 871 pairs of English-German MWEs. In this paper, we present multi-lingual and bilingual MWE corpora that we have extracted from root parallel corpora. After filtering, our collections contain 3,159,226 and 143,042 bilingual MWE pairs for German-English and Chinese-English respectively. We examine the quality of these extracted bilingual MWEs in MT experiments. Our initial experiments applying MWEs in MT show improved translation performance on MWE terms in qualitative analysis and better general evaluation scores in quantitative analysis, on both German-English and Chinese-English language pairs. We follow a standard experimental pipeline to create our MultiMWE corpora, which are available online. Researchers can use these free corpora for their own models or use them in a knowledge base as model features.
Anthology ID:
2020.lrec-1.363
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
2970–2979
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.363
Cite (ACL):
Lifeng Han, Gareth Jones, and Alan Smeaton. 2020. MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 2970–2979, Marseille, France. European Language Resources Association.
Cite (Informal):
MultiMWE: Building a Multi-lingual Multi-Word Expression (MWE) Parallel Corpora (Han et al., LREC 2020)
PDF:
https://preview.aclanthology.org/naacl24-info/2020.lrec-1.363.pdf
Code:
poethan/MWE4MT