Xiaohua Jia


2024

Recovery Should Never Deviate from Ground Truth: Mitigating Exposure Bias in Neural Machine Translation
Jianfei He | Shichao Sun | Xiaohua Jia | Wenjie Li
Proceedings of the 25th Annual Conference of the European Association for Machine Translation (Volume 1)

In Neural Machine Translation, models are often trained with teacher forcing and suffer from exposure bias due to the discrepancy between training and inference. Current token-level solutions, such as scheduled sampling, aim to maximize the model’s capability to recover from errors. Their loss functions have a side effect: a sequence with errors may have a larger probability than the ground truth. The consequence is that the generated sequences may recover too much and deviate from the ground truth. This side effect is verified in our experiments. To address this issue, we propose using token-level contrastive learning to coordinate three training objectives: the usual MLE objective, an objective for recovery from errors, and a new objective to explicitly constrain the recovery to a scope that does not impact the ground truth. Our empirical analysis shows that this method effectively achieves these objectives in training and reduces the frequency with which the third objective is violated. We conduct experiments on three language pairs: German-English, Russian-English, and English-Russian. Results show that our method outperforms the vanilla Transformer and other methods addressing exposure bias.
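As a rough illustration of how three such objectives could be combined at the token level, the PyTorch-style sketch below couples an MLE term, a recovery term computed on error-injected prefixes, and a hinge-style constraint that keeps the error-injected continuation from scoring higher than the teacher-forced ground truth. The function name, tensor shapes, margin, and equal weighting are assumptions for illustration, not the loss defined in the paper.

```python
import torch
import torch.nn.functional as F

def token_level_contrastive_loss(logits_gt, logits_recov, targets, margin=0.0):
    """Illustrative combination of MLE, recovery, and a no-overshoot constraint.

    logits_gt:    (batch, seq, vocab) logits from a teacher-forced (ground-truth) prefix
    logits_recov: (batch, seq, vocab) logits from an error-injected (sampled) prefix
    targets:      (batch, seq) reference token ids
    """
    logp_gt = F.log_softmax(logits_gt, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    logp_recov = F.log_softmax(logits_recov, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    mle = -logp_gt.mean()          # usual MLE objective on the teacher-forced pass
    recovery = -logp_recov.mean()  # still predict the reference token after an erroneous prefix
    # Constraint: the error-injected continuation, taken as a whole, should not become
    # more probable than the ground-truth continuation (hinge/margin form).
    constraint = torch.clamp(
        logp_recov.sum(dim=-1) - logp_gt.sum(dim=-1) + margin, min=0.0
    ).mean()
    return mle + recovery + constraint
```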

Contrastive Preference Learning for Neural Machine Translation
Jianfei He | Shichao Sun | Sen Peng | Jie Xu | Xiaohua Jia | Wenjie Li
Findings of the Association for Computational Linguistics: NAACL 2024

There exists a discrepancy between the token-level objective during training and the overall sequence-level quality that is expected from the model. This discrepancy leads to issues like exposure bias. To align the model with human expectations, sequence-level objectives are often used to fine-tune pre-trained models. In this paper, we introduce a contrastive preference model that enhances the traditional Plackett-Luce model by incorporating an indicator function. Building upon this novel preference model, we propose Contrastive Preference Learning (CPL), which uses offline samples with list-wise preferences to fine-tune a pre-trained model in Neural Machine Translation. Our experiments, conducted on three language pairs, demonstrate that CPL outperforms not only the vanilla Transformer model but also other token-level and sequence-level baselines. Furthermore, the ablation study highlights the essential role of the proposed indicator function in achieving this improvement.
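For reference, the standard Plackett-Luce likelihood of a list-wise preference (ranking) $\pi$ over $K$ candidates with model scores $s_1,\dots,s_K$ is shown below; the indicator-function augmentation proposed in the paper is not reproduced here, and this notation is only an assumed baseline form.

```latex
% Standard Plackett-Luce likelihood of a list-wise ranking \pi over K candidates;
% the paper's indicator-augmented variant is not reproduced here.
P(\pi \mid s) \;=\; \prod_{k=1}^{K}
  \frac{\exp\!\bigl(s_{\pi(k)}\bigr)}{\sum_{j=k}^{K} \exp\!\bigl(s_{\pi(j)}\bigr)}
```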

2023

Empirical Analysis of Beam Search Curse and Search Errors with Model Errors in Neural Machine Translation
Jianfei He | Shichao Sun | Xiaohua Jia | Wenjie Li
Proceedings of the 24th Annual Conference of the European Association for Machine Translation

Beam search is the most popular decoding method for Neural Machine Translation (NMT) and is still a strong baseline compared with the newly proposed sampling-based methods. To better understand beam search, we investigate its two well-recognized issues, the beam search curse and search errors, at the sentence level. We find that fewer than 30% of the sentences in the test set experience these issues. Meanwhile, there is a related phenomenon: for the majority of sentences, the gold references have lower probabilities than the predictions from beam search. We also test with different levels of model error, including a special test using training samples and models without regularization. We find that these phenomena still exist, although mitigated, even for a model with an accuracy of 95%. These findings show that it is not promising to improve beam search by seeking higher-probability hypotheses and further reducing its search errors. The relationship between the quality and the probability of predictions at the sentence level in our results provides useful information for finding new ways to improve NMT.
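For context, beam search keeps the top-k partial hypotheses by cumulative log-probability at each decoding step. The minimal sketch below illustrates the idea; `step_logprobs`, the token ids, and the length limit are hypothetical stand-ins, not the decoder used in the experiments.

```python
def beam_search(step_logprobs, bos, eos, beam_size=4, max_len=50):
    """Minimal beam search sketch.

    step_logprobs(prefix) is assumed to return a dict {token: log_prob}
    over next tokens; bos/eos are begin/end-of-sequence token ids.
    """
    beams = [([bos], 0.0)]   # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for tok, lp in step_logprobs(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        # Keep only the top `beam_size` expansions by cumulative log-probability.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])  # highest-scoring hypothesis
```

Under this scheme, enlarging `beam_size` explores more hypotheses and can surface translations of higher probability but lower quality, which is the beam search curse the paper examines at the sentence level.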

Data Selection Curriculum for Abstractive Text Summarization
Shichao Sun | Ruifeng Yuan | Jianfei He | Ziqiang Cao | Wenjie Li | Xiaohua Jia
Findings of the Association for Computational Linguistics: EMNLP 2023

Abstractive Text Summarization (ATS) models are commonly trained on large-scale data that is randomly shuffled. However, the impact of data selection and data ordering on ATS models remains a relatively unexplored research area, and a significant challenge lies in accurately assessing the learning difficulty of each training instance. This study introduces a Data Selection Curriculum (DSC) scoring system that incorporates both the difficulty of improving the ATS model via an instance and the expected performance on that instance. By selectively excluding excessively simple and overly complex instances, training efficiency can be improved. Furthermore, curriculum learning is integrated to accelerate convergence and improve performance by gradually increasing the learning difficulty, inspired by human learners. Experimental results on the CNN/DailyMail dataset demonstrate that our approach surpasses strong baselines while using only 20% of the available instances.
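As a sketch of the general recipe of difficulty-based selection plus easy-to-hard ordering, the snippet below drops the extremes of a per-instance difficulty score and presents the remaining instances from easier to harder; `score_fn`, the cut-off fractions, and the function name are hypothetical placeholders, not the paper's DSC scoring system.

```python
def select_and_order(instances, score_fn, drop_low=0.1, drop_high=0.1):
    """Generic difficulty-based selection plus easy-to-hard ordering (illustrative sketch).

    instances: list of training examples
    score_fn:  maps an instance to a learning-difficulty score (higher = harder)
    """
    scored = sorted(((score_fn(x), x) for x in instances), key=lambda p: p[0])
    n = len(scored)
    # Exclude excessively simple and overly complex instances, keep the middle band,
    # and present the remaining instances from easier to harder.
    middle = scored[int(drop_low * n): n - int(drop_high * n)]
    return [x for _, x in middle]
```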