2025
Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements
Guangxiang Zhao | Saier Hu | Xiaoqi Jian | Wu Jinzhu | Yuhan Wu | Lin Sun | Xiangzheng Zhang
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
In this paper, we propose a “Generalization Stress Test” to assess Large Language Models’ (LLMs) generalization ability under slight, controlled perturbations of option length, problem type, and irrelevant noun replacement. We find that, despite high benchmark scores, LLMs exhibit severe accuracy drops and unexpected biases (e.g., a preference for longer distractors) when faced with these minor but content-preserving modifications. For example, Qwen 2.5 1.5B’s MMLU score rises from 60 to 89 and drops from 89 to 36 when option lengths are changed without altering the question. Even GPT4o experiences a 25-point accuracy loss when problem types are changed, with a 6-point drop across all three modification categories. These analyses suggest that LLMs rely heavily on superficial cues rather than forming robust, abstract representations that generalize across formats, lexical variations, and shifts in irrelevant content.
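The paper's released materials are not reproduced here; as a rough Python sketch of the kind of content-preserving perturbation the abstract describes (lengthening distractor options while leaving the question and the correct answer untouched), one might write something like the following. The function name, item format, and padding clause are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch of an option-length perturbation for a multiple-choice item.
# The padding clause and item format are assumptions for illustration only.

PADDING = ", which is a statement that does not change the option's meaning"

def lengthen_distractors(options: dict, answer_key: str) -> dict:
    """Return a copy of the options where every *incorrect* option is padded
    with a semantically irrelevant clause, leaving the correct option untouched."""
    perturbed = {}
    for key, text in options.items():
        if key == answer_key:
            perturbed[key] = text            # keep the correct answer as-is
        else:
            perturbed[key] = text + PADDING  # make distractors longer, not different
    return perturbed

if __name__ == "__main__":
    item = {
        "question": "Which planet is closest to the Sun?",
        "options": {"A": "Mercury", "B": "Venus", "C": "Earth", "D": "Mars"},
        "answer": "A",
    }
    print(lengthen_distractors(item["options"], item["answer"]))
```

Comparing a model's accuracy before and after such a perturbation is the spirit of the stress test: the answer content is unchanged, and only a superficial cue (option length) moves.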
2024
Dataflow-Guided Retrieval Augmentation for Repository-Level Code Completion
Wei Cheng | Yuhan Wu | Wei Hu
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent years have witnessed the deployment of code language models (LMs) in various code intelligence tasks such as code completion. Yet, it is challenging for pre-trained LMs to generate correct completions in private repositories. Previous studies retrieve cross-file context based on import relations or text similarity, which is insufficiently relevant to completion targets. In this paper, we propose a dataflow-guided retrieval augmentation approach, called DraCo, for repository-level code completion. DraCo parses a private repository into code entities and establishes their relations through an extended dataflow analysis, forming a repo-specific context graph. Whenever code completion is triggered, DraCo precisely retrieves relevant background knowledge from the repo-specific context graph and generates well-formed prompts to query code LMs. Furthermore, we construct a large Python dataset, ReccEval, with more diverse completion targets. Our experiments demonstrate the superior accuracy and practical efficiency of DraCo, which improves code exact match by 3.43% and identifier F1-score by 3.27% on average over the state-of-the-art approach.
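DraCo's extended dataflow analysis is more involved than anything shown here; the following is a minimal, assumed Python sketch of the general idea described above — index a repository as a graph of code entities, then pull background knowledge for identifiers that appear near the completion point into the prompt. Entity extraction is reduced to top-level definitions and retrieval to simple name matching, so this is a conceptual illustration, not DraCo's implementation.

```python
import ast
from collections import defaultdict
from pathlib import Path

def build_context_graph(repo_root: str) -> dict:
    """Map each function/class name in the repo to where it is defined and its docstring."""
    graph = defaultdict(list)
    for py_file in Path(repo_root).rglob("*.py"):
        try:
            tree = ast.parse(py_file.read_text(encoding="utf-8"), filename=str(py_file))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be parsed
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                graph[node.name].append(
                    {"file": str(py_file), "doc": ast.get_docstring(node) or ""}
                )
    return graph

def retrieve_context(graph: dict, unfinished_code: str, limit: int = 5) -> str:
    """Collect background snippets for identifiers mentioned in the unfinished code."""
    snippets = []
    for name, entries in graph.items():
        if name in unfinished_code:
            for entry in entries:
                snippets.append(f"# {name} (defined in {entry['file']}): {entry['doc']}")
    return "\n".join(snippets[:limit])
```

A prompt for the code LM would then concatenate the retrieved snippets with the unfinished code, so that cross-file knowledge relevant to the completion target is available at generation time.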
RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models
Noah Wang | Z.y. Peng | Haoran Que | Jiaheng Liu | Wangchunshu Zhou | Yuhan Wu | Hongcheng Guo | Ruitong Gan | Zehao Ni | Jian Yang | Man Zhang | Zhaoxiang Zhang | Wanli Ouyang | Ke Xu | Wenhao Huang | Jie Fu | Junran Peng
Findings of the Association for Computational Linguistics: ACL 2024
The advent of Large Language Models (LLMs) has paved the way for complex tasks such as role-playing, which enhances user interactions by enabling models to imitate various characters. However, the closed-source nature of state-of-the-art LLMs and their general-purpose training limit role-playing optimization. In this paper, we introduce RoleLLM, a framework to benchmark, elicit, and enhance role-playing abilities in LLMs. RoleLLM comprises four stages: (1) Role Profile Construction for 100 roles; (2) Context-Based Instruction Generation (Context-Instruct) for role-specific knowledge extraction; (3) Role Prompting using GPT (RoleGPT) for speaking style imitation; and (4) Role-Conditioned Instruction Tuning (RoCIT) for fine-tuning open-source models along with role customization. Using Context-Instruct and RoleGPT, we create RoleBench, the first systematic and fine-grained character-level benchmark dataset for role-playing, with 168,093 samples. Moreover, RoCIT on RoleBench yields RoleLLaMA (English) and RoleGLM (Chinese), significantly enhancing role-playing abilities and even achieving results comparable to RoleGPT (which uses GPT-4).
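As an illustration of what role-conditioning can look like in practice, here is a small, assumed Python sketch of a role-profile prompt in the spirit of the role prompting stage described above; the class, field names, and template are assumptions for illustration, not RoleLLM's actual format.

```python
# Illustrative role-conditioned prompt construction; not the RoleLLM framework's code.
from dataclasses import dataclass, field

@dataclass
class RoleProfile:
    name: str
    description: str                                   # who the character is
    catchphrases: list = field(default_factory=list)   # speaking-style examples

def build_role_prompt(profile: RoleProfile, user_query: str) -> str:
    """Condition the model on a role profile before the user's query."""
    style = "\n".join(f"- {line}" for line in profile.catchphrases)
    return (
        f"You are {profile.name}. {profile.description}\n"
        f"Stay in character and imitate this speaking style:\n{style}\n\n"
        f"User: {user_query}\n{profile.name}:"
    )

print(build_role_prompt(
    RoleProfile("Sherlock Holmes", "A consulting detective in Victorian London.",
                ["Elementary, my dear Watson.", "You see, but you do not observe."]),
    "What do you make of this muddy boot?",
))
```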
2023
DiaASQ: A Benchmark of Conversational Aspect-based Sentiment Quadruple Analysis
Bobo Li | Hao Fei | Fei Li | Yuhan Wu | Jinsong Zhang | Shengqiong Wu | Jingye Li | Yijiang Liu | Lizi Liao | Tat-Seng Chua | Donghong Ji
Findings of the Association for Computational Linguistics: ACL 2023
The rapid development of aspect-based sentiment analysis (ABSA) over recent decades shows great potential for real-world applications. Current ABSA work, however, is mostly limited to single text pieces, leaving sentiment analysis in dialogue contexts unexplored. To bridge the gap between fine-grained sentiment analysis and conversational opinion mining, we introduce a novel task of conversational aspect-based sentiment quadruple analysis, namely DiaASQ, which aims to detect target-aspect-opinion-sentiment quadruples in a dialogue. We manually construct a large-scale, high-quality DiaASQ dataset in both Chinese and English. We also develop a neural model to benchmark the task; it performs effective end-to-end quadruple prediction and incorporates rich dialogue-specific and discourse feature representations for better cross-utterance quadruple extraction. We hope the new benchmark will spur more advancements in the sentiment analysis community.
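To make the task format concrete, here is a small, assumed Python sketch of the target-aspect-opinion-sentiment quadruple described above, over a toy dialogue; the class and field names are illustrative, not the dataset's actual schema.

```python
# Toy illustration of the DiaASQ quadruple format; elements of a quadruple may
# come from different utterances of the dialogue (cross-utterance extraction).
from dataclasses import dataclass
from typing import List

@dataclass
class Quadruple:
    target: str     # e.g., a product mentioned in the dialogue
    aspect: str     # e.g., "battery life"
    opinion: str    # the opinion expression
    sentiment: str  # "positive" / "negative" / "neutral"

dialogue: List[str] = [
    "A: Just got the new phone yesterday.",
    "B: How's the battery holding up?",
    "A: Honestly, the battery life is amazing.",
]

# A gold annotation for this toy dialogue might be:
gold = [Quadruple(target="the new phone", aspect="battery life",
                  opinion="amazing", sentiment="positive")]
print(gold)
```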