Shiwen Ni
2026
CoTJudger: A Graph-Driven Framework for Automatic Evaluation of Chain-of-Thought Efficiency and Redundancy in LRMs
Siyi Li | Jiajun Shi | Shiwen Ni | Ge Zhang | Shuaimin Li | Shijian Wang | Zhoufutu Wen | Yizhi LI | Hamid Alinejad-Rokny | Jiaheng Liu | Min Yang | Wenhao Huang
Findings of the Association for Computational Linguistics: ACL 2026
Large Reasoning Models (LRMs) have demonstrated strong performance by producing extended Chain-of-Thought (CoT) traces before answering. However, this paradigm often induces over-reasoning: redundant calculations and circular self-verification that increase computational cost without improving outcomes. Existing evaluations largely emphasize final accuracy or coarse token counts, and lack automated tools to separate essential logic from structural redundancy. We introduce CoTJudger, a graph-driven framework that quantifies reasoning efficiency by converting free-form CoTs into directed dependency graphs and extracting the Shortest Effective Path (SEP) needed to reach a correct solution. This yields an interpretable efficiency signal – how much of a CoT is necessary versus structurally redundant – that is comparable across models and tasks. Evaluating 21 LRMs, CoTJudger reveals pervasive redundancy and surfaces recurring failure modes, including verification obsession and compensatory redundancy. These results provide a practical metric for disentangling reasoning ability from computational waste, enabling more targeted evaluation and diagnosis of LRM efficiency.
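As a rough illustration of the shortest-effective-path idea, the toy sketch below treats reasoning steps as nodes in a directed dependency graph and uses breadth-first search to find the fewest steps linking the problem statement to the final answer. The graph, the node names, and the ratio used as an efficiency score are assumptions made for illustration only; the abstract does not specify how CoTJudger builds its graphs or defines its metric.

```python
from collections import deque

def shortest_effective_path(graph, start, goal):
    """Return the shortest node path from start to goal via BFS, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None

# Hypothetical CoT: an edge points from a step to the steps that build on it.
cot_graph = {
    "problem": ["setup"],
    "setup": ["compute", "recheck_setup"],
    "recheck_setup": ["compute"],
    "compute": ["verify_1", "answer"],
    "verify_1": ["verify_2"],
    "verify_2": ["answer"],
}

path = shortest_effective_path(cot_graph, "problem", "answer")
all_steps = set(cot_graph) | {n for targets in cot_graph.values() for n in targets}
# Assumed score: share of all steps that lie on the shortest path.
efficiency = len(path) / len(all_steps)
print(path)
print(f"{efficiency:.2f}")
```

In this toy graph the re-check and the two verification steps lie off the shortest path, so they would count as structurally redundant under the assumed score.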
SrDetection: A Self-Referential Framework for Data Leakage Detection in Code Large Language Models
Shuaimin Li | Liyang Fan | Zeyang li | Zhuoyue Wan | Yufang Lin | Shiwen Ni | Feiteng Fang | Hamid Alinejad-Rokny | Yuanfeng Song | Kun Jing | Chen Jason Zhang | Min Yang
Findings of the Association for Computational Linguistics: ACL 2026
Evaluating code large language models (Code LLMs) requires reliable detection of data leakage, where benchmark performance is artificially inflated by exposure to benchmark data during pre-training. Existing approaches either assume access to proprietary training corpora, rely on brittle heuristics such as timestamp filtering, or use external reference sets with manually tuned, non-generalizable thresholds. To address these limitations, we introduce SrDetection, a unified self-referential leakage detection framework for both gray-box (access to model logits) and black-box (access to model outputs) settings. SrDetection generates semantically equivalent variants of a benchmark sample and detects leakage by contrasting the model’s behavior on the original versus its variants, flagging cases where the original is disproportionately easier for the model. We further design a controlled leakage detection testbed and evaluate SrDetection in this environment. Across different models and training stages, SrDetection improves average F1 by 21.52 points in the gray-box setting and 14.46 points in the black-box setting over strong baselines, demonstrating robust, threshold-independent leakage detection. Finally, a gray-box study of 15 widely used Code LLMs on four popular benchmarks reveals benchmark-specific leakage patterns beyond prior overlap-based analyses. Source code and data are available at https://github.com/SMinL/SrDetectionCode.
Towards IP Intelligence: Benchmarking Large Language Models on Intellectual Property Knowledge and Practice
Qiyao Wang | Guhong Chen | Hongbo Wang | Huaren Liu | Minghui Zhu | Zhifei Qin | Li Linwei | Yilin Yue | Shiqiang Wang | Jiayan Li | Wu Yihang | Ziqiang Liu | Longze Chen | Run Luo | Liyang Fan | Jiaming Li | Lei Zhang | Kan Xu | Hamid Alinejad-Rokny | Chengming Li | Shiwen Ni | Yuan Lin | Min Yang
Findings of the Association for Computational Linguistics: ACL 2026
Intellectual Property (IP) is a highly specialized domain that integrates technical and legal knowledge, making it inherently complex and knowledge-intensive. Recent advancements in LLMs have demonstrated their potential to handle IP tasks, enabling more efficient analysis, understanding, and generation of IP-related content. However, existing datasets and benchmarks focus narrowly on patents or cover limited aspects of the IP field, lacking alignment with real-world scenarios. To bridge this gap, we introduce **IPBench**, the first comprehensive IP task taxonomy and a large-scale bilingual benchmark encompassing **8 IP mechanisms and 20 distinct tasks**, designed to evaluate LLMs in real-world IP practice. We benchmark **19 main LLMs**, ranging from general-purpose to domain-specific, including chat-oriented and reasoning-focused models, under zero-shot, few-shot, and chain-of-thought settings. Our results show that even the top-performing model, DeepSeek-V3, achieves only 75.8% accuracy, indicating significant room for improvement. Notably, open-source IP and law-oriented models lag behind closed-source general-purpose models. To foster future research, we publicly release IPBench and will expand it with additional tasks to better reflect real-world complexities and support model advancements in the IP domain. We provide the data and code in the supplementary materials.
Beyond Quantity: Trajectory Diversity Scaling for Code Agents
Guhong Chen | Chenghao Sun | Cheng Fu | Qiyao Wang | Zhihong Huang | ChaoPeng Wei | Guangxu Chen | Feiteng Fang | Ahmadreza Argha | Bing Zhao | Xander Xu | Qi Han | Hamid Alinejad-Rokny | Qiang Qu | Binhua Li | Shiwen Ni | Min Yang | HU Wei | Yongbin Li
Findings of the Association for Computational Linguistics: ACL 2026
As code large language models (LLMs) evolve into tool-interactive agents via the Model Context Protocol (MCP), their generalization is increasingly limited by low-quality synthetic data and the diminishing returns of quantity scaling; moreover, quantity-centric scaling exhibits an early bottleneck that underutilizes trajectory data. We propose TDScaling, a Trajectory Diversity Scaling-based data synthesis framework for code agents that scales performance through diversity rather than raw volume. TDScaling is also more data-efficient: under a fixed training budget, increasing trajectory diversity yields larger gains than adding more trajectories, improving the performance-cost trade-off for agent training. TDScaling integrates four innovations: (1) a Business Cluster mechanism that captures real-service logical dependencies; (2) a Blueprint-driven multi-agent paradigm that enforces trajectory coherence; (3) an adaptive evolution mechanism that steers synthesis toward long-tail scenarios using Domain Entropy, Reasoning Mode Entropy, and Cumulative Action Complexity to prevent mode collapse; and (4) a sandboxed code tool that mitigates catastrophic forgetting of intrinsic coding capabilities. Experiments on general tool-use benchmarks (BFCL, τ²-Bench) and code agent tasks (RebenchT, CodeCI, BIRD) demonstrate a win-win outcome: TDScaling improves both tool-use generalization and inherent coding proficiency. Crucially, we show that trajectory diversity scaling attains a substantially higher performance ceiling than quantity scaling, establishing a resource-efficient paradigm for training robust code agents under data bottlenecks.
A Multilingual Dataset and Empirical Validation for the Mutual Reinforcement Effect in Information Extraction
Chengguang Gan | Sunbowen Lee | Qingyu Yin | Yunhao Liang | Xinyang He | Hanjun Wei | Younghun Lim | Shijian Wang | Hexiang Huang | QingHao Zhang | Shiwen Ni | Tatsunori Mori
Findings of the Association for Computational Linguistics: ACL 2026
The Mutual Reinforcement Effect (MRE) describes a phenomenon in information extraction where word-level and sentence-level tasks can mutually improve each other when jointly modeled. While prior work has reported MRE in Japanese, its generality across languages and task settings has not been empirically validated, largely due to the lack of multilingual MRE datasets. To address this limitation, we introduce the Multilingual MRE Mix dataset (MMM), which consists of 21 sub-datasets covering English, Japanese, and Chinese. We propose an LLM-assisted dataset translation and alignment framework that significantly reduces manual annotation effort while preserving the structural requirements of MRE tasks. Building on MMM, we adopt a unified input-output framework to train an open-domain information extraction model and conduct extensive empirical studies, including full fine-tuning ablations and the construction of knowledgeable verbalizers based on MRE-mix data. Experimental results show that 76 percent of the MMM sub-datasets consistently exhibit the Mutual Reinforcement Effect across languages. These findings provide systematic empirical validation of MRE in multilingual settings and demonstrate its practical value for information extraction.
2025
LIME: Less Is More for MLLM Evaluation
King Zhu | Qianbo Zang | Shian Jia | Siwei Wu | Feiteng Fang | Yizhi Li | Shuyue Guo | Tianyu Zheng | Jiawei Guo | Bo Li | Haoning Wu | Xingwei Qu | Jian Yang | Ruibo Liu | Xiang Yue | Jiaheng Liu | Chenghua Lin | Hamid Alinejad-Rokny | Min Yang | Shiwen Ni | Wenhao Huang | Ge Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Multimodal Large Language Models (MLLMs) are evaluated on numerous benchmarks covering tasks such as image captioning, visual question answering, and reasoning. However, these benchmarks often include overly simple or uninformative samples, making it difficult to effectively distinguish the performance of different MLLMs. Additionally, evaluating models across many benchmarks creates a significant computational burden. To address these issues, we propose LIME (Less Is More for MLLM Evaluation), a refined and efficient benchmark curated using a semi-automated pipeline. This pipeline filters out uninformative samples and eliminates answer leakage by focusing on tasks that require image-based understanding. Our experiments show that LIME reduces the number of samples by 76% and evaluation time by 77%, while more effectively distinguishing different models’ abilities. Notably, we find that traditional automatic metrics like CIDEr are insufficient for evaluating MLLMs’ captioning performance, and excluding the caption task score yields a more accurate reflection of overall model performance. All code and data are available at https://anonymous.4open.science/r/LIME-49CD.
Quantification of Large Language Model Distillation
Sunbowen Lee | Junting Zhou | Chang Ao | Kaige Li | Xeron Du | Sirui He | Haihong Wu | Tianci Liu | Jiaheng Liu | Hamid Alinejad-Rokny | Min Yang | Yitao Liang | Zhoufutu Wen | Shiwen Ni
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Model distillation is a fundamental technique in building large language models (LLMs), transferring knowledge from a teacher model to a student model. However, distillation can lead to model homogenization, reducing diversity among models and impairing their ability to robustly handle complex or novel tasks. These limitations underscore the need to systematically quantify the distillation process and its impact. In this work, we propose a framework to evaluate and quantify model distillation. Our method addresses two key aspects: (1) Identifying identity cognition contradictions to assess discrepancies in how models perceive and represent identity-related information, and (2) Analyzing multi-granularity response similarities across models to measure the extent of homogenization. Experimental results demonstrate two key insights: (1) Well-known closed-source and open-source LLMs usually exhibit high distillation degrees, except for Claude, Doubao, and Gemini. (2) Base LLMs show higher distillation degrees compared to aligned LLMs. By offering a systematic approach to improve the transparency of LLM data distillation, we call for LLMs with more independent development and more transparent technical reports to improve LLMs’ robustness and safety. The code and data are available at https://github.com/Aegis1863/LLMs-Distillation-Quantification.
COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning
Yuelin Bai | Xeron Du | Yiming Liang | Leo Jin | Junting Zhou | Ziqiang Liu | Feiteng Fang | Mingshan Chang | Tianyu Zheng | Xincheng Zhang | Nuo Ma | Zekun Moore Wang | Ruibin Yuan | Haihong Wu | Hongquan Lin | Wenhao Huang | Jiajun Zhang | Chenghua Lin | Jie Fu | Min Yang | Shiwen Ni | Ge Zhang
Findings of the Association for Computational Linguistics: NAACL 2025
Remarkable progress on large language models (LLMs), particularly in English, has facilitated impressive capabilities in following human instructions. However, there remains a noticeable gap in instruction fine-tuning for Chinese, where the complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well-aligned with Chinese users’ interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world data resources and subjected to comprehensive human verification. We conduct extensive experiments on COIG-CQIA and compare the resulting models with strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance in diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese instruction-tuning datasets and data mixing strategies. Our dataset is available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.
AgentCourt: Simulating Court with Adversarial Evolvable Lawyer Agents
Guhong Chen | Liyang Fan | Zihan Gong | Nan Xie | Zixuan Li | Ziqiang Liu | Chengming Li | Qiang Qu | Hamid Alinejad-Rokny | Shiwen Ni | Min Yang
Findings of the Association for Computational Linguistics: ACL 2025
Current research in LLM-based simulation systems lacks comprehensive solutions for modeling real-world court proceedings, while existing legal language models struggle with dynamic courtroom interactions. We present **AgentCourt**, a comprehensive legal simulation framework that addresses these challenges through adversarial evolution of LLM-based agents. Our AgentCourt introduces a new adversarial evolutionary approach for agents called **AdvEvol**, which performs dynamic knowledge learning and evolution through structured adversarial interactions in a simulated courtroom program, breaking the limitations of the traditional reliance on static knowledge bases or manual annotations. By simulating 1,000 civil cases, we construct an evolving knowledge base that enhances the agents’ legal reasoning abilities. The evolved lawyer agents demonstrated outstanding performance on our newly introduced **CourtBench** benchmark, achieving a 12.1% improvement in performance compared to the original lawyer agents. Evaluations by professional lawyers confirm the effectiveness of our approach across three critical dimensions: cognitive agility, professional knowledge, and logical rigor. Beyond outperforming specialized legal models in interactive reasoning tasks, our findings emphasize the importance of adversarial learning in legal AI and suggest promising directions for extending simulation-based legal reasoning to broader judicial and regulatory contexts.
Can MLLMs Understand the Deep Implication Behind Chinese Images?
Chenhao Zhang | Xi Feng | Yuelin Bai | Xeron Du | Jinchang Hou | Kaixin Deng | Guangzeng Han | Qinrui Li | Bingli Wang | Jiaheng Liu | Xingwei Qu | Yifei Zhang | Qixuan Zhao | Yiming Liang | Ziqiang Liu | Feiteng Fang | Min Yang | Wenhao Huang | Chenghua Lin | Ge Zhang | Shiwen Ni
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As the capabilities of Multimodal Large Language Models (MLLMs) improve, the need for higher-order evaluation of them is increasing. However, there is a lack of work evaluating MLLMs for higher-order perception and understanding of Chinese visual content. To address this, we introduce CII-Bench, which aims to assess these capabilities of MLLMs on Chinese images. To ensure the authenticity of the Chinese context, images in CII-Bench are sourced from the Chinese Internet and manually reviewed, with corresponding answers also manually crafted. Additionally, CII-Bench incorporates images that represent Chinese traditional culture, such as famous Chinese traditional paintings, which can deeply reflect the model’s understanding of Chinese traditional culture. Experiments on multiple MLLMs using CII-Bench yield several significant findings. There is a large gap between MLLMs and humans in performance: the highest MLLM accuracy is 64.4%, while the human average is 78.2% and the peak is 81.0%. MLLMs perform poorly on traditional culture images, indicating limitations in understanding high-level semantics and the lack of a deep knowledge base of Chinese traditional culture. Moreover, most models have higher accuracy when image emotion hints are added to the prompts. We believe CII-Bench will help MLLMs better understand Chinese semantics and specific images, and advance the development of expert artificial general intelligence (AGI). Our project is publicly available at https://cii-bench.github.io.
Expanding before Inferring: Enhancing Factuality in Large Language Models through Premature Layers Interpolation
Dingwei Chen | Ziqiang Liu | Feiteng Fang | Chak Tou Leong | Shiwen Ni | Ahmadreza Argha | Hamid Alinejad-Rokny | Min Yang | Chengming Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) demonstrate remarkable capabilities in text understanding and generation. However, their tendency to produce factually inconsistent outputs—commonly referred to as “hallucinations”—remains a critical challenge. Existing approaches, such as retrieval-based and inference-time correction methods, primarily address this issue at the input or output level, often overlooking the intrinsic information refinement process and the role of premature layers. Meanwhile, alignment- and fine-tuning-based methods are resource-intensive. In this paper, we propose **PLI** (**P**remature **L**ayers **I**nterpolation), a novel, training-free, and plug-and-play intervention designed to enhance factuality. PLI mitigates hallucinations by inserting premature layers formed through mathematical interpolation with adjacent layers. Inspired by stable diffusion and sampling steps, PLI extends the depth of information processing and transmission in LLMs, improving factual coherence. Experiments on four publicly available datasets demonstrate that PLI effectively reduces hallucinations while outperforming existing baselines in most cases. Further analysis suggests that the success of layer interpolation is closely linked to LLMs’ internal mechanisms. To promote reproducibility, we will release our code and data upon acceptance.
2024
Assessing Essay Fluency with Large Language Models
Haihong Wu | Chang Ao | Shiwen Ni
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)
With the development of education and the widespread use of the internet, the scale of essay evaluation has increased, making the cost and efficiency of manual grading a significant challenge. To address this, the Twenty-third China National Conference on Computational Linguistics (CCL2024) established an evaluation contest for essay fluency. This competition has three tracks corresponding to three sub-tasks. This paper conducts a detailed analysis of the different tasks, employing the BERT model as well as the recent large language model Qwen to address these sub-tasks. As a result, our overall scores for the three tasks reached 37.26, 42.48, and 47.64.
Enhancing Noise Robustness of Retrieval-Augmented Language Models with Adaptive Adversarial Training
Feiteng Fang | Yuelin Bai | Shiwen Ni | Min Yang | Xiaojun Chen | Ruifeng Xu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) exhibit substantial capabilities yet encounter challenges including hallucination, outdated knowledge, and untraceable reasoning processes. Retrieval-augmented generation (RAG) has emerged as a promising solution, integrating knowledge from external databases to mitigate these challenges. However, inappropriate retrieved passages can potentially hinder the LLMs’ capacity to generate comprehensive and high-quality responses. Prior RAG studies on robustness to retrieval noise often confine themselves to a limited set of noise types, deviating from real-world retrieval environments and limiting practical applicability. In this study, we first investigate retrieval noises and categorize them into three distinct types, reflecting real-world environments. We analyze the impact of these various retrieval noises on the robustness of LLMs. Subsequently, we propose a novel RAG approach known as Retrieval-augmented Adaptive Adversarial Training (RAAT). RAAT leverages adaptive adversarial training to dynamically adjust the model’s training process in response to retrieval noises. Concurrently, it employs multi-task learning to ensure the model’s capacity to internally recognize noisy contexts. Extensive experiments demonstrate that the LLaMA-2 7B model trained using RAAT exhibits significant improvements in F1 and EM scores under diverse noise conditions. For reproducibility, we will release our code and data upon acceptance.
Forgetting before Learning: Utilizing Parametric Arithmetic for Knowledge Updating in Large Language Models
Shiwen Ni | Dingwei Chen | Chengming Li | Xiping Hu | Ruifeng Xu | Min Yang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advancements in Large Language Models (LLMs) have showcased their remarkable capabilities in text understanding and generation. However, even stronger LLMs are susceptible to acquiring erroneous or obsolete information from the training corpus. Direct secondary fine-tuning with data containing new knowledge may be ineffective in updating knowledge due to the conflict between old and new knowledge. In this paper, we propose a new paradigm for fine-tuning called F-Learning (Forgetting before Learning), which employs parametric arithmetic to facilitate the forgetting of old knowledge and learning of new knowledge. Experimental results on two publicly available datasets demonstrate that our proposed F-Learning can clearly improve the knowledge updating performance of both full fine-tuning and LoRA fine-tuning, while outperforming the existing baselines in most cases. Moreover, we have also discovered that forgetting old knowledge by subtracting the parameters of LoRA can yield a similar effect to subtracting the parameters of full fine-tuning, and occasionally even surpass it significantly.
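The forgetting step can be pictured with a small sketch of parametric arithmetic. It assumes, for illustration only, that forgetting subtracts a scaled fine-tuning delta (weights tuned on the old knowledge minus the base weights) from the base weights before new knowledge is learned. The function name `forget`, the `scale` coefficient, and the toy weights are hypothetical and are not taken from the paper.

```python
def forget(base, tuned_on_old, scale=1.0):
    """Subtract the scaled delta (tuned_on_old - base) from the base weights.

    Both arguments map parameter names to flat lists of floats.
    """
    return {
        name: [b - scale * (t - b) for b, t in zip(base[name], tuned_on_old[name])]
        for name in base
    }

# Toy weights; real models hold tensors, but the arithmetic is the same.
base = {"layer.weight": [0.5, -0.25, 1.0]}
tuned_on_old = {"layer.weight": [0.75, -0.25, 0.5]}

forgotten = forget(base, tuned_on_old, scale=0.5)
print(forgotten)
```

Under this assumption, the model would then be fine-tuned on the new knowledge starting from `forgotten` rather than from `base`.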
Layer-wise Regularized Dropout for Neural Language Models
Shiwen Ni | Min Yang | Ruifeng Xu | Chengming Li | Xiping Hu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Among the various pre-trained neural language models that are popular today, dropout is already an indispensable regularization technique. To solve the inconsistency between training and inference caused by the randomness of dropout, some studies use consistency training to regularize dropout at the output layer. In this paper, we propose a novel Layer-wise Regularized Dropout (LR-Drop), which is specially designed for Transformer-based language models. Specifically, LR-Drop regularizes each Transformer layer using the consistency training strategy. Each training sample passes through the two siamese sub-models sampled by dropout, and then LR-Drop forces the hidden states, multi-head attention matrices, and output distribution of the two siamese sub-models to be consistent. The proposed LR-Drop can be regarded as a “self-distillation” framework, in which each sub-model generated by dropout is the other’s “teacher” model and “student” model. Through extensive experiments on 8 natural language understanding datasets, 6 neural machine translation datasets, and 1 abstractive summarization dataset (a total of 15 datasets), we show that LR-Drop achieves superior performance, including state-of-the-art results.
E-EVAL: A Comprehensive Chinese K-12 Education Evaluation Benchmark for Large Language Models
Jinchang Hou | Chang Ao | Haihong Wu | Xiangtao Kong | Zhigang Zheng | Daijia Tang | Chengming Li | Xiping Hu | Ruifeng Xu | Shiwen Ni | Min Yang
Findings of the Association for Computational Linguistics: ACL 2024
The rapid development of Large Language Models (LLMs) has led to their increasing utilization in Chinese K-12 education. Despite the growing integration of LLMs and education, the absence of a dedicated benchmark for evaluating LLMs within this domain presents a pressing concern. Consequently, there is an urgent need for a comprehensive natural language processing benchmark to precisely assess the capabilities of various LLMs in Chinese K-12 education. In response, we introduce E-EVAL, the first comprehensive evaluation benchmark specifically tailored for Chinese K-12 education. E-EVAL comprises 4,351 multiple-choice questions spanning primary, middle, and high school levels, covering a diverse array of subjects. Through meticulous evaluation, we find that Chinese-dominant models often outperform English-dominant ones, with many exceeding GPT 4.0. However, most struggle with complex subjects like mathematics. Additionally, our analysis indicates that most Chinese-dominant LLMs do not achieve higher scores at the primary school level compared to the middle school level, highlighting the nuanced relationship between proficiency in higher-order and lower-order knowledge domains. Furthermore, experimental results highlight the effectiveness of the Chain of Thought (CoT) technique in scientific subjects and Few-shot prompting in liberal arts. Through E-EVAL, we aim to conduct a rigorous analysis delineating the strengths and limitations of LLMs in educational applications, thereby contributing significantly to the advancement of Chinese K-12 education and LLMs.
MoZIP: A Multilingual Benchmark to Evaluate Large Language Models in Intellectual Property
Shiwen Ni | Minghuan Tan | Yuelin Bai | Fuqiang Niu | Min Yang | Bowen Zhang | Ruifeng Xu | Xiaojun Chen | Chengming Li | Xiping Hu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Large language models (LLMs) have demonstrated impressive performance in various natural language processing (NLP) tasks. However, there is limited understanding of how well LLMs perform in specific domains (e.g., the intellectual property (IP) domain). In this paper, we contribute a new benchmark, the first Multilingual-oriented quiZ on Intellectual Property (MoZIP), for the evaluation of LLMs in the IP domain. The MoZIP benchmark includes three challenging tasks: IP multiple-choice quiz (IPQuiz), IP question answering (IPQA), and patent matching (PatentMatch). In addition, we also develop a new IP-oriented multilingual large language model (called MoZi), which is a BLOOMZ-based model that has been supervised fine-tuned with multilingual IP-related text data. We evaluate our proposed MoZi model and four well-known LLMs (i.e., BLOOMZ, BELLE, ChatGLM and ChatGPT) on the MoZIP benchmark. Experimental results demonstrate that MoZi outperforms BLOOMZ, BELLE and ChatGLM by a noticeable margin, while scoring lower than ChatGPT. Notably, the performance of current LLMs on the MoZIP benchmark has much room for improvement, and even the most powerful ChatGPT does not reach the passing level. Our source code, data, and models are available at https://github.com/AI-for-Science/MoZi.
2022
R-AT: Regularized Adversarial Training for Natural Language Understanding
Shiwen Ni | Jiawen Li | Hung-Yu Kao
Findings of the Association for Computational Linguistics: EMNLP 2022
Currently, adversarial training has become a popular and powerful regularization method in the natural language domain. In this paper, we propose Regularized Adversarial Training (R-AT) via dropout, which forces the output probability distributions of different sub-models generated by dropout to be consistent under the same adversarial samples. Specifically, we generate adversarial samples by perturbing the word embeddings. For each adversarial sample fed to the model, R-AT minimizes both the adversarial risk and the bidirectional KL-divergence between the adversarial output distributions of two sub-models sampled by dropout. Through extensive experiments on 13 public natural language understanding datasets, we found that R-AT improves many models (e.g., RNN-based, CNN-based, and transformer-based models). For the GLUE benchmark, when R-AT is only applied to the fine-tuning stage, it is able to improve the overall test score of the BERT-base model from 78.3 to 79.6 and the RoBERTa-large model from 88.1 to 88.6. Theoretical analysis reveals that R-AT has potential gradient regularization during the training process. Furthermore, R-AT can reduce the inconsistency between training and testing of models with dropout.
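The objective described in this abstract can be illustrated with a small numeric sketch. It is not the authors' implementation: the function names, the 0.5 averaging of the two KL directions and of the two cross-entropy terms, and the weight `alpha` are assumptions made for illustration, and the adversarial perturbation of the word embeddings is left out. The inputs `p1` and `p2` stand for the output distributions of two dropout sub-models on the same adversarial sample.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def bidirectional_kl(p, q):
    """Average of KL(p || q) and KL(q || p); the 0.5 factor is an assumption."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))

def adversarial_risk(p1, p2, label):
    """Mean cross-entropy of the two sub-model outputs against the gold label."""
    return -0.5 * (math.log(p1[label]) + math.log(p2[label]))

def r_at_style_loss(p1, p2, label, alpha=1.0):
    """Adversarial risk plus a weighted consistency term (alpha is hypothetical)."""
    return adversarial_risk(p1, p2, label) + alpha * bidirectional_kl(p1, p2)

# Two sub-models that disagree slightly on the same adversarial sample.
p1 = [0.7, 0.2, 0.1]
p2 = [0.6, 0.3, 0.1]
loss = r_at_style_loss(p1, p2, label=0)
```

The consistency term is zero when the two sub-models agree exactly and grows as their distributions diverge, which is the behavior the abstract attributes to the regularizer.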
2021
Co-authors
- Min Yang 15
- Hamid Alinejad-Rokny 8
- Feiteng Fang 7
- Chengming Li 7
- Ziqiang Liu 5
- Ruifeng Xu (徐睿峰) 5
- Yuelin Bai 4
- Wenhao Huang 4
- Jiaheng Liu 4
- Haihong Wu 4
- Ge Zhang 4
- Chang Ao 3
- Guhong Chen 3
- Xeron Du 3
- Liyang Fan 3
- Xiping Hu 3
- Chenghua Lin 3
- Ahmadreza Argha 2
- Xiaojun Chen 2
- Dingwei Chen 2
- Jinchang Hou 2
- Hung-Yu Kao 2
- Sunbowen Lee 2
- Yizhi Li 2
- Jiawen Li 2
- Shuaimin Li 2
- Yiming Liang 2
- Xingwei Qu 2
- Qiang Qu 2
- Shijian Wang 2
- Qiyao Wang 2
- Zhoufutu Wen 2
- Tianyu Zheng 2
- Junting Zhou 2
- Mingshan Chang 1
- Longze Chen 1
- Guangxu Chen 1
- Kaixin Deng 1
- Xi Feng 1
- Jie Fu 1
- Cheng Fu 1
- Chengguang Gan 1
- Zihan Gong 1
- Shuyue Guo 1
- Jiawei Guo 1
- Qi Han 1
- Guangzeng Han 1
- Sirui He 1
- Xinyang He 1
- Zhihong Huang 1
- Hexiang Huang 1
- Shian Jia 1
- Leo Jin 1
- Kun Jing 1
- Xiangtao Kong 1
- Chak Tou Leong 1
- Bo Li 1
- Kaige Li 1
- Siyi Li 1
- Zeyang Li 1
- Jiayan Li 1
- Jiaming Li 1
- Binhua Li 1
- Yongbin Li 1
- Zixuan Li 1
- Qinrui Li 1
- Yitao Liang 1
- Yunhao Liang 1
- Younghun Lim 1
- Hongquan Lin 1
- Yufang Lin 1
- Yuan Lin 1
- Li Linwei 1
- Ruibo Liu 1
- Tianci Liu 1
- Huaren Liu 1
- Run Luo 1
- Nuo Ma 1
- Tatsunori Mori 1
- Fuqiang Niu 1
- Zhifei Qin 1
- Jiajun Shi 1
- Yuanfeng Song 1
- Chenghao Sun 1
- Minghuan Tan 1
- Daijia Tang 1
- Zhuoyue Wan 1
- Zekun Moore Wang 1
- Hongbo Wang 1
- Shiqiang Wang 1
- Bingli Wang 1
- ChaoPeng Wei 1
- HU Wei 1
- Hanjun Wei 1
- Siwei Wu 1
- Haoning Wu 1
- Nan Xie 1
- Xiping Xiping Hu 1
- Kan Xu 1
- Xander Xu 1
- Jian Yang 1
- Wu Yihang 1
- Qingyu Yin 1
- Ruibin Yuan 1
- Xiang Yue 1
- Yilin Yue 1
- Qianbo Zang 1
- Xincheng Zhang 1
- Jiajun Zhang 1
- Chen Jason Zhang 1
- Lei Zhang 1
- Qinghao Zhang 1
- Bowen Zhang 1
- Chenhao Zhang 1
- Yifei Zhang 1
- Bing Zhao 1
- Qixuan Zhao 1
- Zhigang Zheng 1
- King Zhu 1
- Minghui Zhu 1