T4T Solution: WMT21 Similar Language Task for the Spanish-Catalan and Spanish-Portuguese Language Pair

Miguel Canals, Marc Raventós Tato


Abstract
The main idea of this solution has been to focus on corpus cleaning and preparation and after that, use an out of box solution (OpenNMT) with its default published transformer model. To prepare the corpus, we have used set of standard tools (as Moses scripts or python packages), but also, among other python scripts, a python custom tokenizer with the ability to replace numbers for variables, solve the upper/lower case issue of the vocabulary and provide good segmentation for most of the punctuation. We also have started a line to clean corpus based on statistical probability estimation of source-target corpus, with unclear results. Also, we have run some tests with syllabical word segmentation, again with unclear results, so at the end, after word sentence tokenization we have used BPE SentencePiece for subword units to feed OpenNMT.
Anthology ID:
2021.wmt-1.28
Volume:
Proceedings of the Sixth Conference on Machine Translation
Month:
November
Year:
2021
Address:
Online
Venue:
WMT
SIG:
SIGMT
Publisher:
Association for Computational Linguistics
Note:
Pages:
279–283
Language:
URL:
https://aclanthology.org/2021.wmt-1.28
DOI:
Bibkey:
Cite (ACL):
Miguel Canals and Marc Raventós Tato. 2021. T4T Solution: WMT21 Similar Language Task for the Spanish-Catalan and Spanish-Portuguese Language Pair. In Proceedings of the Sixth Conference on Machine Translation, pages 279–283, Online. Association for Computational Linguistics.
Cite (Informal):
T4T Solution: WMT21 Similar Language Task for the Spanish-Catalan and Spanish-Portuguese Language Pair (Canals & Raventós Tato, WMT 2021)
Copy Citation:
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2021.wmt-1.28.pdf
Data
WMT 2020