2023
A Fast Method to Filter Noisy Parallel Data WMT2023 Shared Task on Parallel Data Curation
Nguyen-Hoang Minh-Cong | Nguyen Van Vinh | Nguyen Le-Minh
Proceedings of the Eighth Conference on Machine Translation
The effectiveness of a machine translation (MT) system is closely tied to the quality of its training data. Websites now offer a large repository of translations, such as movie subtitles, stories, and TED Talks, so the fundamental challenge is identifying the sentence pairs or documents that are accurate translations of each other. This paper presents the results of our submission to the WMT2023 shared task (Sloto et al., 2023), which aimed to evaluate parallel data curation methods for improving MT systems. The task involved aligning and filtering data to create high-quality parallel corpora for training and evaluating MT models. Our approach combined dictionary-based and rule-based methods to ensure data quality and consistency. We achieved an improvement of up to 1.6 BLEU over the baseline system. Notably, our approach showed consistent improvements across all test sets, suggesting its efficiency.
2022
A Simple and Fast Strategy for Handling Rare Words in Neural Machine Translation
Nguyen-Hoang Minh-Cong | Vinh Thi Ngo | Van Vinh Nguyen
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing: Student Research Workshop
Neural Machine Translation (NMT) currently achieves state-of-the-art results in machine translation. However, handling rare words remains a major challenge. Rare words are often translated using a manual dictionary or copied unchanged from the source to the target. In this paper, we propose a simple and fast strategy for integrating constraints during training and decoding to improve the translation of rare words. The effectiveness of our proposal is demonstrated on both high- and low-resource translation tasks, covering the language pairs English → Vietnamese, Chinese → Vietnamese, Khmer → Vietnamese, and Lao → Vietnamese. We show improvements of up to +1.8 BLEU over the baseline systems.
2020
The UET-ICTU Submissions to the VLSP 2020 News Translation Task
Ngo Thi-Vinh | Nguyen Minh-Thuan | Nguyen Hoang Minh Cong | Nguyen Hoang-Quan | Nguyen Phuong-Thai | Nguyen Van-Vinh
Proceedings of the 7th International Workshop on Vietnamese Language and Speech Processing