Yu Yu


Probabilistic Robustness for Data Filtering
Yu Yu | Abdul Khan | Shahram Khadivi | Jia Xu
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

We introduce our probabilistic robustness rewarded data optimization (PRoDO) approach as a framework to enhance the model's generalization power by selecting training data that optimizes our probabilistic robustness metrics. We use proximal policy optimization (PPO) reinforcement learning to approximately solve the computationally intractable training subset selection problem. The PPO reward is defined as our ($\alpha, \epsilon, \gamma$)-Robustness, which measures performance consistency over multiple domains by simulating unknown test sets in real-world scenarios using a leave-one-out strategy. We demonstrate that PRoDO effectively filters data, leading to significantly higher prediction accuracy and robustness on unknown-domain test sets. Our experiments achieve up to a +17.2% increase in accuracy (+25.5% relative) in sentiment analysis and a -28.05 decrease in perplexity (-32.1% relative) in language modeling. In addition, our probabilistic ($\alpha, \epsilon, \gamma$)-Robustness definition serves as an evaluation metric with higher levels of agreement with human annotations than typical performance-based metrics.
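
The leave-one-out strategy mentioned in this abstract can be illustrated with a minimal sketch. It assumes that "leave-one-out" is applied at the level of domains, so that each domain in turn plays the role of the unknown test set; the abstract does not spell this out. The function name `leave_one_domain_out` and the toy data are hypothetical, and the sketch does not compute the paper's robustness reward.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_domain_out(
    domains: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_name, train_examples, test_examples) per domain.

    Assumption: each domain is held out once as a simulated unknown
    test set, and the remaining domains are pooled for training.
    """
    names = list(domains)
    for held_out in names:
        # Pool every example that does not belong to the held-out domain.
        train = [ex for name in names if name != held_out for ex in domains[name]]
        yield held_out, train, list(domains[held_out])


# Hypothetical toy corpus with three domains.
corpus = {
    "news": ["n1", "n2"],
    "reviews": ["r1"],
    "social_media": ["s1", "s2"],
}

for held_out, train, test in leave_one_domain_out(corpus):
    print(held_out, len(train), len(test))
```

A per-split score (for example, accuracy on each held-out domain) could then be collected and summarized by a consistency measure.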


Measuring Robustness for NLP
Yu Yu | Abdul Rafae Khan | Jia Xu
Proceedings of the 29th International Conference on Computational Linguistics

The quality of Natural Language Processing (NLP) models is typically measured by the accuracy or error rate on a predefined test set. Because the evaluation and optimization of these measures are narrowed down to a specific domain like news and cannot be generalized to other domains like Twitter, we often observe that a system reported with human parity results generates surprising errors in real-life use scenarios. We address this weakness with a new approach that uses an NLP quality measure based on robustness. Unlike previous work that has defined robustness using Minimax to bound worst cases, we measure robustness based on the consistency of cross-domain accuracy and introduce the coefficient of variation and (epsilon, gamma)-Robustness. Our measures demonstrate higher agreement with human evaluation than accuracy scores like BLEU on ranking Machine Translation (MT) systems. Our experiments on sentiment analysis and MT tasks show that incorporating our robustness measures into learning objectives significantly enhances the final NLP prediction accuracy over various domains, such as biomedical and social media.
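
As a rough illustration of measuring the consistency of cross-domain accuracy, the sketch below computes the standard coefficient of variation (standard deviation divided by mean) over per-domain scores. It assumes the population standard deviation; the paper may use a different variant, and the (epsilon, gamma)-Robustness measure is not reproduced here. The domain names and accuracy values are hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def coefficient_of_variation(scores: Sequence[float]) -> float:
    """Return the population standard deviation divided by the mean.

    Lower values mean the scores are more consistent across domains.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    mu = mean(scores)
    if mu == 0:
        raise ValueError("coefficient of variation is undefined for a zero mean")
    return pstdev(scores) / mu


# Hypothetical per-domain accuracies for two systems.
system_a = {"news": 0.90, "social_media": 0.60, "biomedical": 0.75}
system_b = {"news": 0.78, "social_media": 0.74, "biomedical": 0.73}

print(coefficient_of_variation(list(system_a.values())))
print(coefficient_of_variation(list(system_b.values())))
```

In this toy comparison the two systems have the same mean accuracy, but the second varies less across domains and therefore gets the lower coefficient of variation.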

Can Data Diversity Enhance Learning Generalization?
Yu Yu | Shahram Khadivi | Jia Xu
Proceedings of the 29th International Conference on Computational Linguistics

This paper introduces our Diversity Advanced Actor-Critic reinforcement learning (A2C) framework (DAAC) to improve the generalization and accuracy of Natural Language Processing (NLP). We show that the diversification of training samples alleviates overfitting and improves model generalization and accuracy. We quantify the diversity of a set of samples using the max dispersion, convex hull volume, and graph entropy based on sentence embeddings in a high-dimensional metric space. We also introduce A2C to select such a diversified training subset efficiently. Our experiments achieve up to a +23.8 accuracy increase (38.0% relative) in sentiment analysis, a -44.7 perplexity decrease (37.9% relative) in language modeling, and consistent improvements in named entity recognition over various domains. In particular, our method outperforms both domain adaptation and generalization baselines without using any target domain knowledge.
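
To make the idea of scoring diversity from embeddings concrete, here is a minimal sketch of one dispersion-style score: the smallest pairwise Euclidean distance within a subset, where a larger value means the points are more spread out. This is an assumed stand-in, not necessarily the paper's max dispersion definition, and it covers neither the convex hull volume, the graph entropy, nor the A2C selection step. The two-dimensional vectors are hypothetical placeholders for sentence embeddings.

```python
import math
from itertools import combinations
from typing import List, Sequence


def pairwise_distances(points: Sequence[Sequence[float]]) -> List[float]:
    """Return the Euclidean distance for every unordered pair of points."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]


def min_pairwise_distance(points: Sequence[Sequence[float]]) -> float:
    """Dispersion-style score: the distance between the two closest points."""
    distances = pairwise_distances(points)
    if not distances:
        raise ValueError("need at least two points")
    return min(distances)


# Hypothetical 2-D "embeddings" for two candidate training subsets.
clustered = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
spread_out = [(0.0, 0.0), (3.0, 4.0), (-4.0, 3.0)]

print(min_pairwise_distance(clustered))
print(min_pairwise_distance(spread_out))
```

Under this score the spread-out subset ranks as more diverse than the clustered one.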