Yingli Shen
2026
MEIC-DT: Memory-Efficient Incremental Clustering for Long-Text Coreference Resolution with Dual-Threshold Constraints
Kangyang Luo | Shuzheng Si | Yuzhuo Bai | Cheng Gao | Zhitong Wang | Cheng Huang | Yingli Shen | Yufeng Han | Wenhao Li | Cunliang Kong | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2026
Kangyang Luo | Shuzheng Si | Yuzhuo Bai | Cheng Gao | Zhitong Wang | Cheng Huang | Yingli Shen | Yufeng Han | Wenhao Li | Cunliang Kong | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2026
In the era of large language models (LLMs), supervised neural methods remain the state-of-the-art (SOTA) for Coreference Resolution. Yet, their full potential is underexplored, particularly in incremental clustering, which faces the critical challenge of balancing efficiency with performance for long texts. To address the limitation, we propose MEIC-DT, a novel dual-threshold, memory-efficient incremental clustering approach based on a lightweight Transformer. MEIC-DT features a dual-threshold constraint mechanism designed to precisely control the Transformer’s input scale within a predefined memory budget. This mechanism incorporates two key components: a Statistics-Aware Eviction Strategy (SAES) and an Internal Regularization Policy (IRP). The SAES utilizes distinct statistical profiles from the training and inference phases for intelligent cache management. The IRP strategically condenses clusters by selecting the most representative mentions, thereby preserving semantic integrity. Extensive experiments on common benchmarks demonstrate that MEIC-DT achieves highly competitive coreference performance under stringent memory constraints.
ImCoref-CeS: An Improved Lightweight Pipeline for Coreference Resolution with LLM-based Checker-Splitter Refinement
Kangyang Luo | Yuzhuo Bai | Shuzheng Si | Cheng Gao | Zhitong Wang | Yingli Shen | Wenhao Li | Zhu Liu | Yufeng Han | Jiayi Wu | Cunliang Kong | Maosong Sun
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Kangyang Luo | Yuzhuo Bai | Shuzheng Si | Cheng Gao | Zhitong Wang | Yingli Shen | Wenhao Li | Zhu Liu | Yufeng Han | Jiayi Wu | Cunliang Kong | Maosong Sun
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Coreference Resolution (CR) is a critical task in Natural Language Processing (NLP). Current research faces a key dilemma: whether to further explore the potential of supervised neural methods based on small language models, whose detect-then-cluster pipeline still delivers top performance, or embrace the powerful capabilities of Large Language Models (LLMs). However, effectively combining their strengths remains underexplored. To this end, we propose ImCoref-CeS, a novel framework that integrates an enhanced supervised model with LLM-based reasoning. First, we present an improved CR method (ImCoref) to push the performance boundaries of the supervised neural method by introducing a lightweight bridging module to enhance long-text encoding capability, devising a biaffine scorer to comprehensively capture positional information, and invoking a hybrid mention regularization to improve training efficiency. Importantly, we employ an LLM acting as a multi-role Checker-Splitter agent to validate candidate mentions (filtering out invalid ones) and coreference results (splitting erroneous clusters) predicted by ImCoref. Extensive experiments demonstrate the effectiveness of ImCoref-CeS, which achieves superior performance compared to existing state-of-the-art (SOTA) methods.
2025
GLTW: Joint Improved Graph Transformer and LLM via Three-Word Language for Knowledge Graph Completion
Kangyang Luo | Yuzhuo Bai | Cheng Gao | Shuzheng Si | Zhu Liu | Yingli Shen | Zhitong Wang | Cunliang Kong | Wenhao Li | Yufei Huang | Ye Tian | Xuantang Xiong | Lei Han | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2025
Kangyang Luo | Yuzhuo Bai | Cheng Gao | Shuzheng Si | Zhu Liu | Yingli Shen | Zhitong Wang | Cunliang Kong | Wenhao Li | Yufei Huang | Ye Tian | Xuantang Xiong | Lei Han | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2025
Knowledge Graph Completion (KGC), which aims to infer missing or incomplete facts, is a crucial task for KGs. However, integrating the vital structural information of KGs into Large Language Models (LLMs) and outputting predictions deterministically remains challenging. To address this, we propose a new method called GLTW, which encodes the structural information of KGs and merges it with LLMs to enhance KGC performance. Specifically, we introduce an improved Graph Transformer (iGT) that effectively encodes subgraphs with both local and global structural information and inherits the characteristics of language model, bypassing training from scratch. Also, we develop a subgraph-based multi-classification training objective, using all entities within KG as classification objects, to boost learning efficiency. Importantly, we combine iGT with an LLM that takes KG language prompts as input. Our extensive experiments on various KG datasets show that GLTW achieves significant performance gains compared to SOTA baselines.
From Unaligned to Aligned: Scaling Multilingual LLMs with Multi-Way Parallel Corpora
Yingli Shen | Wen Lai | Shuo Wang | Ge Gao | Kangyang Luo | Alexander Fraser | Maosong Sun
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Yingli Shen | Wen Lai | Shuo Wang | Ge Gao | Kangyang Luo | Alexander Fraser | Maosong Sun
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multi-way parallel data consistently outperform those trained on unaligned multilingual data.