Ning Li
2026
Controllable Contamination Detection for Reliable LLM Evaluation with Statistical Guarantees
Zheng Zhang | Qi Liu | Siyuan Liang | Ning Li | Zirui Hu | Weibo Gao | Rui Li | Zhenya Huang | Leszek Rutkowski | Baosheng Yu | Dacheng Tao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Zheng Zhang | Qi Liu | Siyuan Liang | Ning Li | Zirui Hu | Weibo Gao | Rui Li | Zhenya Huang | Leszek Rutkowski | Baosheng Yu | Dacheng Tao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) have achieved remarkable performance across diverse tasks, largely driven by large-scale pretraining. However, this data abundance introduces test data contamination, where benchmark datasets overlap with pretraining corpora, undermining the reliability of model evaluation by confounding memorization with genuine generalization. To mitigate this issue, existing training data detectors attempt to identify clean (unseen) samples from contaminated test sets, but often suffer from residual contamination due to the black-box nature of LLMs. As a result, contaminated data may be mistakenly retained, leading to unreliable evaluation.To address this challenge, we propose FTD (FDR-controlled Training Data detection), a principled framework that detects and filters contaminated evaluation data while providing a statistical guarantee: the proportion of contaminated samples mistakenly retained as clean, the false discovery rate (FDR), is provably controlled below a user-specified threshold. FTD combines multiple complementary detectors via an adaptive weighting strategy, and we theoretically show it achieves high statistical power under valid FDR control. Extensive experiments on real-world benchmarks demonstrate that FTD significantly reduces residual contamination compared to existing methods while preserving evaluation consistency.
LongTutor: Benchmarking Large Language Models for Long-term Personalized Tutoring
Ning Li | Zheng Zhang | Zhenya Huang | Rui Li | Yi Zhan | Yinbo Luo | Qi Liu | Enhong Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Ning Li | Zheng Zhang | Zhenya Huang | Rui Li | Yi Zhan | Yinbo Luo | Qi Liu | Enhong Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rapid advancement of large language models (LLMs) has driven the deployment of LLM-based AI tutors on online learning platforms. This widespread adoption highlights an urgent need for systematic benchmarks to evaluate their tutoring capabilities. However, existing evaluations predominantly focus on isolated, short-term interactions, overlooking the inherently long-term nature of learning. To bridge this gap, we introduce LongTutor, a benchmark for long-term personalized tutoring grounded in formative assessment theory. Built from expert-annotated real-world learning logs, LongTutor evaluates LLMs across three progressive tasks: historical evidence acquisition, knowledge state diagnosis, and adaptive teaching action. Our experiments reveal a critical capability mismatch: while LLMs excel at evidence acquisition, they struggle to effectively leverage long-term history for accurate diagnosis and adaptive teaching. To enable scalable benchmark expansion, we further propose an automated generator–verifier pipeline, paving the way toward truly long-term AI tutoring systems.
2025
YNU-HPCC at SemEval-2025 Task 10: A Two-Stage Approach to Solving Multi-Label and Multi-Class Role Classification Based on DeBERTa
Ning Li | You Zhang | Jin Wang | Dan Xu | Xuejie Zhang
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
Ning Li | You Zhang | Jin Wang | Dan Xu | Xuejie Zhang
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
A two-stage role classification model based on DeBERTa is proposed for the Entity Framework task in SemEval 2025 Task 10. The task is confronted with challenges such as multi-labeling, multi-category, and category imbalance, particularly in the semantic overlap and data sparsity of fine-grained roles. Existing methods primarily rely on rules, traditional machine learning, or deep learning, but the accurate classification of fine-grained roles is difficult to achieve. To address this, the proposed model integrates the deep semantic representation of the DeBERTa pre-trained language model through two sub-models: main role classification and sub-role classification, and utilizes Focal Loss to optimize the category imbalance issue. Experimental results indicate that the model achieves an accuracy of 75.32% in predicting the main role, while the exact matching rate for the sub-role is 8.94%. This is mainly limited by the strict matching standard and semantic overlap of fine-grained roles in the multi-label task. Compared to the baseline’s sub-role exact matching rate of 3.83%, the proposed model significantly improves this metric. The model ultimately ranked 23rd on the leaderboard. The code of this paper is available at:https://github.com/jiyuaner/YNU-HPCC-at-SemEval-2025-Task10.
An LLM-based Framework for Domain-Specific Information Extraction: A Case Study in Computer Science and Chemistry
Xungang Gu | Yangjie Tian | Ning Li | Meng Liu | Ruohua Xu | He Zhang | Hanqiu Liu | Yongpan Sheng | Ming Liu
Proceedings of the 23rd Annual Workshop of the Australasian Language Technology Association
Xungang Gu | Yangjie Tian | Ning Li | Meng Liu | Ruohua Xu | He Zhang | Hanqiu Liu | Yongpan Sheng | Ming Liu
Proceedings of the 23rd Annual Workshop of the Australasian Language Technology Association
Information extraction (IE) in specialized domains like computer science and chemistry is challenged by the poor generalization of traditional models and the knowledge deficits of general-purpose Large Language Models (LLMs). We introduce a robust, LLM-based framework featuring two core contributions: an end-to-end training and inference paradigm that combines continual pre-training (CPT) for knowledge injection, supervised fine-tuning (SFT) for task alignment, and retrieval-augmented generation (RAG) for inference-time enhancement; and a novel LLM-assisted data annotation pipeline for the efficient creation of high-quality training data. Comprehensive experiments demonstrate that while fine-tuning alone yields strong in-domain performance, our complete framework exhibits superior robustness and generalization. It consistently achieves state-of-the-art results in challenging domain-shift and novel-schema scenarios, validating our integrated approach for building adaptable and high-performance domain-specific IE systems.