Cheng Gao
2026
ImCoref-CeS: An Improved Lightweight Pipeline for Coreference Resolution with LLM-based Checker-Splitter Refinement
Kangyang Luo | Yuzhuo Bai | Shuzheng Si | Cheng Gao | Zhitong Wang | Yingli Shen | Wenhao Li | Zhu Liu | Yufeng Han | Jiayi Wu | Cunliang Kong | Maosong Sun
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Coreference Resolution (CR) is a critical task in Natural Language Processing (NLP). Current research faces a key dilemma: whether to further explore the potential of supervised neural methods based on small language models, whose detect-then-cluster pipeline still delivers top performance, or embrace the powerful capabilities of Large Language Models (LLMs). However, effectively combining their strengths remains underexplored. To this end, we propose ImCoref-CeS, a novel framework that integrates an enhanced supervised model with LLM-based reasoning. First, we present an improved CR method (ImCoref) to push the performance boundaries of the supervised neural method by introducing a lightweight bridging module to enhance long-text encoding capability, devising a biaffine scorer to comprehensively capture positional information, and invoking a hybrid mention regularization to improve training efficiency. Importantly, we employ an LLM acting as a multi-role Checker-Splitter agent to validate candidate mentions (filtering out invalid ones) and coreference results (splitting erroneous clusters) predicted by ImCoref. Extensive experiments demonstrate the effectiveness of ImCoref-CeS, which achieves superior performance compared to existing state-of-the-art (SOTA) methods.
MEIC-DT: Memory-Efficient Incremental Clustering for Long-Text Coreference Resolution with Dual-Threshold Constraints
Kangyang Luo | Shuzheng Si | Yuzhuo Bai | Cheng Gao | Zhitong Wang | Cheng Huang | Yingli Shen | Yufeng Han | Wenhao Li | Cunliang Kong | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2026
In the era of large language models (LLMs), supervised neural methods remain the state-of-the-art (SOTA) for Coreference Resolution. Yet, their full potential is underexplored, particularly in incremental clustering, which faces the critical challenge of balancing efficiency with performance for long texts. To address the limitation, we propose MEIC-DT, a novel dual-threshold, memory-efficient incremental clustering approach based on a lightweight Transformer. MEIC-DT features a dual-threshold constraint mechanism designed to precisely control the Transformer’s input scale within a predefined memory budget. This mechanism incorporates two key components: a Statistics-Aware Eviction Strategy (SAES) and an Internal Regularization Policy (IRP). The SAES utilizes distinct statistical profiles from the training and inference phases for intelligent cache management. The IRP strategically condenses clusters by selecting the most representative mentions, thereby preserving semantic integrity. Extensive experiments on common benchmarks demonstrate that MEIC-DT achieves highly competitive coreference performance under stringent memory constraints.
2025
GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion
Kangyang Luo | Yuzhuo Bai | Cheng Gao | Shuzheng Si | Zhu Liu | Yingli Shen | Zhitong Wang | Cunliang Kong | Wenhao Li | Yufei Huang | Ye Tian | Xuantang Xiong | Lei Han | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2025
Knowledge Graph Completion (KGC), which aims to infer missing or incomplete facts, is a crucial task for KGs. However, integrating the vital structural information of KGs into Large Language Models (LLMs) and outputting predictions deterministically remains challenging. To address this, we propose a new method called GLTW, which encodes the structural information of KGs and merges it with LLMs to enhance KGC performance. Specifically, we introduce an improved Graph Transformer (iGT) that effectively encodes subgraphs with both local and global structural information and inherits the characteristics of language models, bypassing training from scratch. We also develop a subgraph-based multi-classification training objective, using all entities within the KG as classification objects, to boost learning efficiency. Importantly, we combine iGT with an LLM that takes KG language prompts as input. Our extensive experiments on various KG datasets show that GLTW achieves significant performance gains compared to SOTA baselines.
Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering
Shuzheng Si | Haozhe Zhao | Gang Chen | Cheng Gao | Yuzhuo Bai | Zhitong Wang | Kaikai An | Kangyang Luo | Chen Qian | Fanchao Qi | Baobao Chang | Maosong Sun
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Training LLMs on data containing unfamiliar knowledge during the instruction tuning stage can encourage hallucinations. To address this challenge, we introduce NOVA, a novel framework designed to identify high-quality data that aligns well with the LLM’s learned knowledge to reduce hallucinations. NOVA includes Internal Consistency Probing (ICP) and Semantic Equivalence Identification (SEI) to measure how familiar the LLM is with instruction data. Specifically, ICP evaluates the LLM’s understanding of the given instruction by calculating the tailored consistency among multiple self-generated responses. SEI further assesses the familiarity of the LLM with the target response by comparing it to the generated responses, using the proposed semantic clustering and well-designed voting strategy. Finally, to ensure the quality of selected samples, we introduce an expert-aligned reward model, considering characteristics beyond just familiarity. By considering data quality and avoiding unfamiliar data, we can utilize the selected data to effectively align LLMs to follow instructions and hallucinate less. Experiments show that NOVA significantly reduces hallucinations while maintaining a competitive ability to follow instructions.
Document Segmentation Matters for Retrieval-Augmented Generation
Zhitong Wang | Cheng Gao | Chaojun Xiao | Yufei Huang | Shuzheng Si | Kangyang Luo | Yuzhuo Bai | Wenhao Li | Tangjian Duan | Chuancheng Lv | Guoshan Lu | Gang Chen | Fanchao Qi | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2025
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by integrating external knowledge. A critical yet underexplored challenge in RAG is document segmentation, also known as document chunking. Widely used rule-based chunking methods usually lead to suboptimal splits, where overly large chunks introduce irrelevant information and small chunks lack semantic coherence. Existing semantic-based approaches either require costly LLM calls or fail to adaptively group contextually related sentences. To address these limitations, we propose PIC (Pseudo-Instruction for document Chunking), a simple yet effective method that leverages document summaries as pseudo-instructions to guide chunking. By computing semantic similarity between sentences and the summary, PIC dynamically groups sentences into chunks that align with the document’s key themes, ensuring semantic completeness and relevance to potential user instructions. Experiments on multiple open-domain question-answering benchmarks demonstrate that PIC can significantly improve retrieval accuracy (Hits@k) and end-to-end QA performance (Exact Match) without any additional training.
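The idea of summary-guided chunking described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation. The bag-of-words cosine similarity, the mean-similarity boundary rule, and the names `bow`, `cosine`, and `chunk_by_summary` are all assumptions made here for illustration; the abstract does not specify the similarity model or the grouping rule.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector (a stand-in for a real sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk_by_summary(sentences, summary):
    """Group consecutive sentences using their similarity to the summary.

    Assumed rule (hypothetical, not taken from the paper): a new chunk starts
    whenever a sentence's similarity crosses the document-level mean relative
    to the previous sentence.
    """
    if not sentences:
        return []
    summary_vec = bow(summary)
    sims = [cosine(bow(s), summary_vec) for s in sentences]
    mean = sum(sims) / len(sims)
    chunks = [[sentences[0]]]
    for prev, cur, sent in zip(sims, sims[1:], sentences[1:]):
        if (prev >= mean) != (cur >= mean):
            chunks.append([sent])    # similarity crossed the mean: new chunk
        else:
            chunks[-1].append(sent)  # same side of the mean: extend the chunk
    return [" ".join(c) for c in chunks]


sentences = [
    "Cats purr when happy.",
    "Cats sleep a lot.",
    "The stock market fell today.",
    "Investors were worried.",
]
chunks = chunk_by_summary(sentences, "Cats purr and sleep.")
```

In this toy input, the first two sentences share words with the summary and the last two do not, so the sketch yields two chunks. A realistic setup would replace `bow` with a sentence encoder and obtain the summary from a summarization model.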
2024
Enhancing Legal Case Retrieval via Scaling High-quality Synthetic Query-Candidate Pairs
Cheng Gao | Chaojun Xiao | Zhenghao Liu | Huimin Chen | Zhiyuan Liu | Maosong Sun
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Legal case retrieval (LCR) aims to provide similar cases as references for a given fact description. This task is crucial for promoting consistent judgments in similar cases, effectively enhancing judicial fairness and improving work efficiency for judges. However, existing work faces two main challenges in real-world applications: first, it focuses mainly on case-to-case retrieval using lengthy queries, which does not match real-world scenarios; second, the data scale is limited, with current datasets containing only hundreds of queries, which is insufficient to train existing data-hungry neural models. To address these issues, we introduce an automated method to construct synthetic query-candidate pairs and build the largest LCR dataset to date, LEAD, which is hundreds of times larger than existing datasets. This data construction method can provide ample training signals for LCR models. Experimental results demonstrate that models trained on our constructed data achieve state-of-the-art results on two widely used LCR benchmarks. In addition, the construction method can be applied to civil cases with promising results. The data and code can be found at https://github.com/thunlp/LEAD.
Co-authors
- Maosong Sun (孙茂松) 6
- Yuzhuo Bai 5
- Kangyang Luo 5
- Shuzheng Si 5
- Zhitong Wang 5
- Cunliang Kong (孔存良) 3
- Yingli Shen 3
- Yufeng Han 2
- Yufei Huang 2
- Wenhao Li 2
- Wenhao Li 2
- Zhu Liu 2
- Fanchao Qi 2
- Chaojun Xiao 2
- Kaikai An 1
- Baobao Chang (常宝宝) 1
- Huimin Chen 1
- Gang Chen 1
- Gang Chen 1
- Tangjian Duan 1
- Lei Han 1
- Cheng Huang 1
- Zhenghao Liu (刘正皓) 1
- Zhiyuan Liu 1
- Guoshan Lu 1
- Chuancheng Lv 1
- Chen Qian 1
- Ye Tian 1
- Jiayi Wu 1
- Xuantang Xiong 1
- Haozhe Zhao 1