Ling Feng


2025

LongReward: Improving Long-context Large Language Models with AI Feedback
Jiajie Zhang | Zhongni Hou | Xin Lv | Shulin Cao | Zhenyu Hou | Yilin Niu | Lei Hou | Yuxiao Dong | Ling Feng | Juanzi Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Though significant advancements have been achieved in developing long-context large language models (LLMs), the compromised quality of LLM-synthesized data for supervised fine-tuning (SFT) often affects the long-context performance of SFT models and leads to inherent limitations. In principle, reinforcement learning (RL) with appropriate reward signals can further enhance models' capacities. However, how to obtain reliable rewards in long-context scenarios remains unexplored. To this end, we propose LongReward, a novel method that utilizes an off-the-shelf LLM to provide rewards for long-context model responses along four human-valued dimensions: helpfulness, logicality, faithfulness, and completeness, each with a carefully designed assessment pipeline. By combining LongReward with the offline RL algorithm DPO, we are able to effectively improve long-context SFT models. Our experiments indicate that LongReward not only significantly improves models' long-context performance but also enhances their ability to follow short instructions. We also find that long-context DPO with LongReward and conventional short-context DPO can be used together without hurting either one's performance.
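To make the idea concrete, here is a minimal, hypothetical sketch of the recipe as the abstract describes it: an off-the-shelf judge LLM rates each sampled response on the four dimensions, the averaged score serves as the reward, and the highest- and lowest-scoring responses form a DPO preference pair. The function names, prompt template, and 0-10 scale are illustrative assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch of LongReward-style scoring; `call_judge_llm`
# stands in for an API call to an off-the-shelf judge LLM.
from statistics import mean

DIMENSIONS = ["helpfulness", "logicality", "faithfulness", "completeness"]

def call_judge_llm(prompt: str) -> float:
    """Placeholder: query a judge LLM and parse a numeric rating
    (e.g., 0-10) from its output."""
    raise NotImplementedError

def long_reward(context: str, question: str, response: str) -> float:
    """Average the judge's ratings over the four dimensions."""
    scores = []
    for dim in DIMENSIONS:
        prompt = (
            f"Rate the following answer's {dim} on a 0-10 scale.\n"
            f"Context: {context}\nQuestion: {question}\n"
            f"Answer: {response}\nScore:"
        )
        scores.append(call_judge_llm(prompt))
    return mean(scores)

def build_dpo_pair(context, question, sampled_responses):
    """Turn several sampled responses into one (chosen, rejected) pair."""
    ranked = sorted(sampled_responses,
                    key=lambda r: long_reward(context, question, r))
    return ranked[-1], ranked[0]  # highest- vs. lowest-reward response
```

In the actual method, each dimension has its own carefully designed assessment pipeline rather than a single generic rating prompt as above.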

LongCite: Enabling LLMs to Generate Fine-grained Citations in Long-Context QA
Jiajie Zhang | Yushi Bai | Xin Lv | Wanjun Gu | Danqing Liu | Minhao Zou | Shulin Cao | Lei Hou | Yuxiao Dong | Ling Feng | Juanzi Li
Findings of the Association for Computational Linguistics: ACL 2025

Though current long-context large language models (LLMs) have demonstrated impressive capacities in answering various questions based on extensive text, the lack of citations in their responses makes user verification difficult, leading to concerns about their trustworthiness due to potential hallucinations. In this work, we aim to enable long-context LLMs to generate responses with fine-grained sentence-level citations on the fly, thereby improving their faithfulness and verifiability. We first introduce LongBench-Cite, an automated benchmark for assessing current LLMs' performance in long-context question answering with citations (LQAC), which reveals considerable room for improvement. To this end, we propose CoF (Coarse to Fine), a novel pipeline that utilizes off-the-shelf LLMs to automatically construct long-context QA instances with precise sentence-level citations, and we leverage this pipeline to construct LongCite-45k, a large-scale SFT dataset for LQAC. Finally, we train LongCite-8B and LongCite-9B using the constructed dataset, successfully enabling the generation of accurate responses and fine-grained citations in one pass. The evaluation results on LongBench-Cite show that our trained models achieve state-of-the-art citation quality, surpassing advanced proprietary models including GPT-4o. We also discover that SFT with citation information can further improve the correctness of model responses compared with standard long-context SFT.
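A rough sketch, under stated assumptions, of what a coarse-to-fine citation pass could look like: the context is split into numbered sentences, a coarse step retrieves the relevant chunks for a given answer statement, and an LLM then narrows each chunk down to the individual supporting sentences. The callables `retrieve_chunks` and `llm_select_sentences` are placeholders for illustration, not the paper's implementation.

```python
# Hypothetical coarse-to-fine citation alignment for one answer statement.

def split_sentences(text: str) -> list[str]:
    # naive splitter for illustration; the pipeline numbers context sentences
    return [s.strip() for s in text.split(".") if s.strip()]

def cite_coarse_to_fine(context: str, statement: str,
                        retrieve_chunks, llm_select_sentences,
                        chunk_size: int = 5) -> list[int]:
    sentences = split_sentences(context)
    # Coarse step: group sentences into chunks and retrieve the ones
    # most relevant to the statement.
    chunks = [sentences[i:i + chunk_size]
              for i in range(0, len(sentences), chunk_size)]
    coarse_ids = retrieve_chunks(chunks, statement)  # e.g., top-k chunk ids
    # Fine step: ask an LLM which sentences inside each retrieved chunk
    # actually support the statement.
    citations = []
    for cid in coarse_ids:
        start = cid * chunk_size
        for offset in llm_select_sentences(chunks[cid], statement):
            citations.append(start + offset)  # global sentence-level ids
    return sorted(citations)
```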

2024

KB-Plugin: A Plug-and-play Framework for Large Language Models to Induce Programs over Low-resourced Knowledge Bases
Jiajie Zhang | Shulin Cao | Linmei Hu | Ling Feng | Lei Hou | Juanzi Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Program induction (PI) has become a promising paradigm for using knowledge bases (KBs) to help large language models (LLMs) answer complex knowledge-intensive questions. Nonetheless, PI typically relies on a large number of parallel question-program pairs to make the LLM aware of the schema of a given KB, and is thus challenging for the many low-resourced KBs that lack annotated data. To this end, we propose KB-Plugin, a plug-and-play framework that enables LLMs to induce programs over any low-resourced KB. First, KB-Plugin adopts self-supervised learning to encode the detailed schema information of a given KB into a pluggable module, namely the schema plugin. Second, KB-Plugin utilizes abundant annotated data from a rich-resourced KB to train another pluggable module, namely the PI plugin, which helps the LLM extract question-relevant schema information from the schema plugin of any KB and use that information to induce programs over the KB. Experiments show that KB-Plugin outperforms SoTA low-resourced PI methods with a 25x smaller backbone LLM on both large-scale and domain-specific KBs, and even approaches the performance of supervised methods.
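The plug-and-play design can be pictured as a frozen backbone with two swappable parameter slots. The sketch below assumes LoRA-style low-rank adapters and invented class names purely for illustration; the actual plugin architecture and training procedure are those described in the paper.

```python
# Hypothetical sketch of a frozen backbone with two pluggable adapters.
import torch
import torch.nn as nn

class LoRAAdapter(nn.Module):
    """A low-rank residual adapter; one per plugin slot."""
    def __init__(self, dim: int, rank: int = 8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as a no-op

    def forward(self, h):
        return h + self.up(self.down(h))

class PluginLLM(nn.Module):
    def __init__(self, backbone: nn.Module, dim: int):
        super().__init__()
        self.backbone = backbone            # kept frozen
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.schema_plugin = LoRAAdapter(dim)  # swapped per target KB
        self.pi_plugin = LoRAAdapter(dim)      # trained once, reused

    def forward(self, x):
        h = self.backbone(x)
        # The PI plugin reads schema information exposed by the schema
        # plugin's transformation to induce a program over this KB.
        return self.pi_plugin(self.schema_plugin(h))

    def load_schema_plugin(self, state_dict):
        # Plugging in a new KB means swapping only the schema plugin's weights.
        self.schema_plugin.load_state_dict(state_dict)
```

Under this picture, adapting to a new KB amounts to training only the small schema plugin via self-supervision and loading it with `load_schema_plugin`, while the PI plugin stays fixed.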

2019

Latent Suicide Risk Detection on Microblog via Suicide-Oriented Word Embeddings and Layered Attention
Lei Cao | Huijun Zhang | Ling Feng | Zihan Wei | Xin Wang | Ningyun Li | Xiaohao He
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Although detection of suicidal ideation on social media has made great progress in recent years, posts whose suicidal intent is expressed implicitly or contrary to the author's real feelings remain an obstacle that keeps detectors from achieving more satisfactory performance. Inspired by the hidden "tree holes" phenomenon on microblogs, where people at suicide risk tend to disclose their real inner feelings and thoughts in the comment space of microblogs whose authors have died by suicide, we explore the use of tree holes to enhance microblog-based suicide risk detection from two perspectives. (1) We build suicide-oriented word embeddings based on tree hole contents to strengthen sensitivity to suicide-related lexicons and contexts. (2) We deploy a two-layered attention mechanism to capture the intermittent change points in an individual's open blog stream, which partially reveal their inner emotional world. Our experimental results show that with suicide-oriented word embeddings and attention, microblog-based suicide risk detection can achieve over 91% accuracy. A large-scale, well-labelled suicide data set is also reported in the paper.
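For intuition, here is a minimal PyTorch sketch of a two-layered (hierarchical) attention classifier in the spirit of the description above: word-level attention builds one vector per post, and post-level attention aggregates a user's blog stream into a single representation. The dimensions, GRU encoders, and layer choices are illustrative assumptions rather than the paper's exact model.

```python
# Hypothetical two-layered attention over a user's stream of posts.
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (batch, seq, dim)
        weights = torch.softmax(self.score(x), dim=1)
        return (weights * x).sum(dim=1)          # weighted sum -> (batch, dim)

class SuicideRiskDetector(nn.Module):
    def __init__(self, vocab_size: int, emb_dim: int = 128, hid: int = 64):
        super().__init__()
        # Embeddings would be initialized from the suicide-oriented
        # word vectors trained on "tree hole" contents.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.word_rnn = nn.GRU(emb_dim, hid, batch_first=True,
                               bidirectional=True)
        self.word_attn = Attention(2 * hid)      # layer 1: words -> post vector
        self.post_rnn = nn.GRU(2 * hid, hid, batch_first=True,
                               bidirectional=True)
        self.post_attn = Attention(2 * hid)      # layer 2: posts -> user vector
        self.classify = nn.Linear(2 * hid, 2)    # at-risk vs. not at-risk

    def forward(self, posts):                    # posts: (batch, n_posts, n_words)
        b, n_posts, n_words = posts.shape
        words = self.embed(posts.view(b * n_posts, n_words))
        post_vecs = self.word_attn(self.word_rnn(words)[0])
        stream = post_vecs.view(b, n_posts, -1)
        user_vec = self.post_attn(self.post_rnn(stream)[0])
        return self.classify(user_vec)
```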