Longfei Zuo
2026
EVADE: LLM-Based Explanation Generation and Validation for Error Detection in NLI
Longfei Zuo | Barbara Plank | Siyao Peng
Findings of the Association for Computational Linguistics: ACL 2026
High-quality datasets are critical for training and evaluating reliable NLP models. In tasks like natural language inference (NLI), human label variation (HLV) arises when multiple labels are valid for the same instance, making it difficult to separate annotation errors from plausible variation. An earlier framework, VariErr (Weber-Genzel et al., 2024), asks multiple annotators to explain their label decisions in the first round and flags errors through validity judgments in the second round. However, conducting two rounds of manual annotation is costly and may limit the coverage of plausible labels or explanations. Our study proposes a new framework, EVADE, for generating and validating explanations to detect errors using large language models (LLMs). We perform a comprehensive analysis comparing human- and LLM-detected errors for NLI across distribution comparison, validation overlap, and impact on model fine-tuning. Our experiments demonstrate that LLM validation refines generated explanation distributions to more closely align with human annotations, and that removing LLM-detected errors from training data yields larger improvements in fine-tuning performance than removing errors identified by human annotators. This highlights the potential to scale error detection, reducing human effort while improving dataset quality under label variation.
2025
Evaluating Large Language Models for Cross-Lingual Retrieval
Longfei Zuo | Pingjun Hong | Oliver Kraus | Barbara Plank | Robert Litschko
Findings of the Association for Computational Linguistics: EMNLP 2025
Multi-stage information retrieval (IR) has become a widely adopted paradigm in search. While Large Language Models (LLMs) have been extensively evaluated as second-stage reranking models for monolingual IR, a systematic large-scale comparison is still lacking for cross-lingual IR (CLIR). Moreover, while prior work shows that LLM-based rerankers improve CLIR performance, their evaluation setup relies on machine translation (MT) for the first stage. This is not only prohibitively expensive but also prone to error propagation across stages. Our evaluation on passage-level and document-level CLIR reveals that this setup, which we term noisy monolingual IR, is favorable for LLMs. However, LLMs still fail to improve the first-stage ranking if it is instead produced by multilingual bi-encoders. We further show that pairwise rerankers based on instruction-tuned LLMs perform competitively with listwise rerankers. To the best of our knowledge, we are the first to study the interaction between retrievers and rerankers in two-stage CLIR with LLMs. Our findings reveal that, without MT, current state-of-the-art rerankers fall severely short when directly applied in CLIR.
2024
LMU-BioNLP at SemEval-2024 Task 2: Large Diverse Ensembles for Robust Clinical NLI
Zihang Sun | Danqi Yan | Anyi Wang | Tanalp Agustoslu | Qi Feng | Chengzhi Hu | Longfei Zuo | Shijia Zhou | Hermine Kleiner | Pingjun Hong | Suteera Seeha | Sebastian Loftus | Anna Susanna Barwig | Oliver Kraus | Jona Voholonsky | Yang Sun | Leopold Martin | Lena Altinger | Jing Wang | Leon Weber-Genzel
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)
In this paper, we describe our submission for the NLI4CT 2024 shared task on robust Natural Language Inference over clinical trial reports. Our system is an ensemble of nine diverse models which we aggregate via majority voting. The models use a large spectrum of different approaches ranging from a straightforward Convolutional Neural Network through fine-tuned Large Language Models to few-shot-prompted language models using chain-of-thought reasoning. Surprisingly, we find that some individual ensemble members are not only more accurate than the final ensemble model but also more robust.
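As a rough illustration of the majority-voting aggregation mentioned in this abstract, the following is a minimal sketch only, not the authors' implementation. The function names, the label strings, and the input layout (one list of predictions per model, aligned by instance) are assumptions made for the example.

```python
from collections import Counter


def majority_vote(votes):
    """Return the label chosen by the most ensemble members for one instance.

    `votes` is a sequence of label strings, one per model (hypothetical layout).
    With an odd number of models and two labels, ties cannot occur; otherwise
    the label that appears first among the tied ones is returned.
    """
    if not votes:
        raise ValueError("need at least one vote")
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(per_model_predictions):
    """Aggregate predictions from several models, instance by instance.

    `per_model_predictions` holds one list per model; all lists are assumed
    to have the same length and the same instance order.
    """
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]


if __name__ == "__main__":
    # Nine hypothetical models voting on a single instance.
    votes = ["Entailment"] * 5 + ["Contradiction"] * 4
    print(majority_vote(votes))
```

An odd ensemble size, such as the nine models described above, avoids ties when there are only two candidate labels.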
MultiClimate: Multimodal Stance Detection on Climate Change Videos
Jiawen Wang | Longfei Zuo | Siyao Peng | Barbara Plank
Proceedings of the Third Workshop on NLP for Positive Impact
Climate change (CC) has attracted increasing attention in NLP in recent years. However, detecting the stance on CC in multimodal data is understudied and remains challenging due to a lack of reliable datasets. To improve the understanding of public opinions and communication strategies, this paper presents MultiClimate, the first open-source manually annotated stance detection dataset with 100 CC-related YouTube videos and 4,209 frame-transcript pairs. We deploy state-of-the-art vision and language models, as well as multimodal models, for MultiClimate stance detection. Results show that text-only BERT significantly outperforms image-only ResNet50 and ViT. Combining both modalities achieves state-of-the-art performance, 0.747/0.749 in accuracy/F1. Our 100M-sized fusion models also beat CLIP and BLIP, as well as the much larger 9B-sized multimodal IDEFICS and text-only Llama3 and Gemma2, indicating that multimodal stance detection remains challenging for large language models. Our code, dataset, and supplementary materials are available at https://github.com/werywjw/MultiClimate.