Yulong Wu
2026
Beyond Static Synthetic Noise: Assessing the Robustness of Large Language Models to Natural Context Variation in the Real World
Yulong Wu | Viktor Schlegel | Riza Batista-Navarro
Findings of the Association for Computational Linguistics: ACL 2026
Robustness evaluation in Question Answering (QA) has predominantly relied on synthetic perturbations that poorly capture natural text evolution in real-world settings, a limitation that becomes more pronounced with the widespread deployment of Large Language Models (LLMs) in dynamic, user-facing environments. In this work, we address this gap by proposing a framework for automatically evaluating QA models under naturally occurring textual perturbations, replacing context passages with revised counterparts from Wikipedia edit histories. Through extensive evaluation on SQuAD across diverse encoder architectures, we construct two challenging sets where human performance remains stable, yet state-of-the-art LLMs exhibit significant degradation, with performance drops of up to 28.28%. These robustness gaps further generalize to more complex QA scenarios, such as DROP and HotpotQA. To mitigate these errors, we show that robustness to natural perturbations can be improved via adversarial training for encoder-only models and in-context demonstrations of perturbed instances for LLMs, though a more generalizable and effective defense strategy remains an open challenge.
2025
Natural Context Drift Undermines the Natural Language Understanding of Large Language Models
Yulong Wu | Viktor Schlegel | Riza Batista-Navarro
Findings of the Association for Computational Linguistics: EMNLP 2025
How does the natural evolution of context paragraphs affect Question Answering (QA) in generative Large Language Models (LLMs)? To address this, we propose a framework for curating naturally evolved, human-edited variants of reading passages from contemporary QA benchmarks and for analysing LLM performance across a range of semantic similarity scores, which quantify how closely each variant aligns with Wikipedia content on the same article topic that the LLM saw during pretraining. Using this framework, we evaluate 6 QA datasets and 8 LLMs with publicly available training data. Our experiments reveal that LLM performance declines as reading passages naturally diverge from the versions encountered during pretraining, even when the question and all necessary information remain present at inference time. For instance, average accuracy on BoolQ drops by over 30% from the highest to lowest similarity bins. This finding suggests that natural text evolution may pose a significant challenge to the language understanding capabilities of fully open-source LLMs.
SR-LLM: Rethinking the Structured Representation in Large Language Model
Jiahuan Zhang | Tianheng Wang | Hanqing Wu | Ziyi Huang | Yulong Wu | Dongbai Chen | Linfeng Song | Yue Zhang | Guozheng Rao | Kaicheng Yu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Structured representations, exemplified by Abstract Meaning Representation (AMR), have long been pivotal in computational linguistics. However, their role remains ambiguous in the era of Large Language Models (LLMs). Initial attempts to integrate structured representations into LLMs in a zero-shot setting yielded inferior performance. We hypothesize that this decline stems from the structural information being passed to LLMs in a code format unfamiliar from LLMs’ training corpora. Consequently, we propose SR-LLM, a framework with two settings that explores a better way of integrating structured representations with LLMs from training-free and training-dependent perspectives. The former integrates structural information through natural language descriptions in LLM prompts, whereas the latter augments the model’s inference capability through fine-tuning on linguistically described structured representations. Performance improvements were observed on widely used downstream datasets, with particularly notable gains of 3.17% and 12.38% on PAWS. To the best of our knowledge, this work is the first demonstration that leveraging structural representations can substantially enhance LLMs’ inference capability. We hope that our work sheds light on, and encourages future research into, enhancing the reasoning and interoperability of LLMs with structured data.
2023
MMT’s Submission for the WMT 2023 Quality Estimation Shared Task
Yulong Wu | Viktor Schlegel | Daniel Beck | Riza Batista-Navarro
Proceedings of the Eighth Conference on Machine Translation
This paper presents our submission to the WMT 2023 Quality Estimation (QE) shared task 1 (sentence-level subtask). We propose a straightforward training data augmentation approach aimed at improving the correlation between QE model predictions and human quality assessments. Utilising eleven data augmentation approaches and six distinct language pairs, we systematically create augmented training sets by individually applying each method to the original training set of each respective language pair. We assess the effectiveness of each augmentation method by comparing the model's performance on the development set before and after training on the augmented dataset. Experimental results reveal that synonym replacement via the Paraphrase Database (PPDB) yields the most substantial performance boost for the language pairs English-German, English-Marathi and English-Gujarati, while for the remaining language pairs, methods such as contextual word embedding-based word insertion, back translation, and direct paraphrasing prove to be more effective. Training the model on a larger and more diverse set of samples confers further performance improvements for certain language pairs, albeit only marginally, and this effect does not hold for all pairs. At the time of submission, we select the model trained on the augmented dataset constructed using the respective most effective method to generate predictions for the test set in each language pair, except for English-German. Despite not being highly competitive, our system consistently surpasses the baseline performance on most language pairs and secures a third-place ranking in English-Marathi.
Are Machine Reading Comprehension Systems Robust to Context Paraphrasing?
Yulong Wu | Viktor Schlegel | Riza Batista-Navarro
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
2021
Is the Understanding of Explicit Discourse Relations Required in Machine Reading Comprehension?
Yulong Wu | Viktor Schlegel | Riza Batista-Navarro
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
An in-depth analysis of the level of language understanding required by existing Machine Reading Comprehension (MRC) benchmarks can provide insight into the reading capabilities of machines. In this paper, we propose an ablation-based methodology to assess the extent to which MRC datasets evaluate the understanding of explicit discourse relations. We define seven MRC skills which require the understanding of different discourse relations. We then introduce ablation methods that verify whether these skills are required to succeed on a dataset. By observing the drop in performance of neural MRC models between the original and the modified dataset, we can measure the degree to which the dataset requires these skills. Experiments on three large-scale datasets with the BERT-base and ALBERT-xxlarge models show that the relative changes for all skills are small (less than 6%). These results imply that most of the answered questions in the examined datasets do not require understanding the discourse structure of the text. To specifically probe for natural language understanding, there is a need to design more challenging benchmarks that can correctly evaluate the intended skills.