Marilena Malli
2022
Evaluating Corpus Cleanup Methods in the WMT’22 News Translation Task
Marilena Malli | George Tambouratzis
Proceedings of the Seventh Conference on Machine Translation (WMT)
This submission to the WMT22 General MT Task consists of translations produced by a series of NMT models for two language pairs: German-to-English and German-to-French. All models are trained using only the parallel training data specified by WMT22; no monolingual training data was used. The models follow the transformer architecture, with 8 attention heads and 6 layers in both the encoder and the decoder. To limit the computational resources used during training, we limited the training of most models to 21 epochs. The translations submitted to WMT22 were produced from the test data released by WMT22. The aim of our experiments has been to evaluate methods for cleaning up a parallel corpus, in order to determine whether cleaning leads to a translation model that produces more accurate translations. For each language pair, the base NMT models have been trained on the raw parallel training corpora, while the additional NMT models have been trained on corpora subjected to a cleaning process using two tools: Bifixer and Bicleaner. The Bicleaner repository does not provide pre-trained classifiers for these language pairs, so we trained probabilistic dictionaries in order to produce new models. The NMT models differ mainly in the quality and quantity of their training data, with very few differences in the training parameters. To complete this work, we used three software packages: (i) MARIAN NMT (version v1.11.5), which was used to train the neural machine translation models, and (ii) Bifixer and (iii) Bicleaner, which were used to correct and clean the parallel training data.
Concerning the Bifixer and Bicleaner tools, we meticulously followed all the steps described in the following article: “Ramírez-Sánchez, G., Zaragoza-Bernabeu, J., Bañón, M., & Rojas, S.O. (2020). Bifixer and Bicleaner: two open-source tools to clean your parallel data. EAMT.” and on the official GitHub pages: https://github.com/bitextor/bifixer and https://github.com/bitextor/bicleaner.
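As an illustration of the cleaning step, the sketch below keeps only the sentence pairs whose classifier score reaches a threshold. It is a minimal sketch under stated assumptions, not the authors' pipeline: it assumes tab-separated input with the score in the last column, and the function name `filter_by_score` and the default threshold of 0.5 are hypothetical choices that do not come from the abstract.

```python
from typing import Iterable, Iterator


def filter_by_score(rows: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
    """Yield sentence pairs whose score (assumed last column) is >= threshold.

    Assumption: each row is tab-separated and ends with a numeric score.
    The threshold value is illustrative only.
    """
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        try:
            score = float(fields[-1])
        except ValueError:
            # Skip rows whose last column is not a number (e.g. malformed lines).
            continue
        if score >= threshold:
            # Drop the score column and keep the remaining fields.
            yield "\t".join(fields[:-1])


# Hypothetical scored rows, invented for illustration.
scored = [
    "Guten Morgen\tGood morning\t0.91",
    "Impressum\tClick here to subscribe\t0.07",
    "Danke\tThank you\t0.66",
]
kept = list(filter_by_score(scored))
```

With the invented rows above, the second pair falls below the threshold and is dropped, while the other two are kept without their score column.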
2020
VMWE discovery: a comparative analysis between Literature and Twitter Corpora
Vivian Stamou | Artemis Xylogianni | Marilena Malli | Penny Takorou | Stella Markantonatou
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons
We manually evaluate five lexical association measures with regard to the discovery of Modern Greek verb multiword expressions (VMWEs) with two or more lexicalised components, using mwetoolkit3 (Ramisch et al., 2010). We use Twitter corpora and compare our findings with previous work on fiction corpora. The results of LL, MLE and T-score were found to overlap significantly in both the fiction and the Twitter corpora, while the results of PMI and Dice do not. We find that MWEs with two lexicalised components are more frequent in Twitter than in fiction corpora and that lean syntactic patterns help retrieve them more efficiently than richer ones. Our work (i) supports the enrichment of the lexicographical database for Modern Greek MWEs, ‘IDION’ (Markantonatou et al., 2019), and (ii) highlights aspects of the usage of five association measures on specific text genres for best MWE discovery results.
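For readers unfamiliar with the five measures named above, the following minimal sketch computes them for a two-component candidate from raw counts. It uses common textbook definitions and is not a reproduction of the mwetoolkit3 implementation; the function name `association_scores` and the example counts are hypothetical. It assumes the candidate occurs at least once (`c12 > 0`).

```python
import math


def association_scores(c12: int, c1: int, c2: int, n: int) -> dict:
    """Textbook association scores for a two-word candidate.

    c12: count of the pair, c1 / c2: counts of each word, n: corpus size.
    Assumes c12 > 0 and 0 < c1, c2 < n.
    """
    expected = c1 * c2 / n  # pair count expected if the words were independent

    mle = c12 / n                                # relative frequency of the pair
    pmi = math.log2(c12 / expected)              # pointwise mutual information
    dice = 2 * c12 / (c1 + c2)                   # Dice coefficient
    t_score = (c12 - expected) / math.sqrt(c12)  # t-score

    # Log-likelihood ratio over the 2x2 contingency table of the pair.
    observed = [c12, c1 - c12, c2 - c12, n - c1 - c2 + c12]
    expected_cells = [
        c1 * c2 / n,
        c1 * (n - c2) / n,
        (n - c1) * c2 / n,
        (n - c1) * (n - c2) / n,
    ]
    ll = 2 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected_cells) if o > 0
    )

    return {"mle": mle, "pmi": pmi, "dice": dice, "t": t_score, "ll": ll}


# Hypothetical counts, invented for illustration.
scores = association_scores(c12=10, c1=20, c2=50, n=1000)
```

Ranking the same candidate list by each of these scores and comparing the top-ranked items is one simple way to see how far the measures overlap.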