Xuan Luo


2025

pdf bib
Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models
Mingyang Song | Mao Zheng | Xuan Luo
Proceedings of the 31st International Conference on Computational Linguistics

Despite recent efforts to develop large language models with robust long-context capabilities, the lack of long-context benchmarks means that relatively little is known about their performance. To alleviate this gap, in this paper, we propose Counting-Stars, a multi-evidence, position-aware, and scalable benchmark designed to evaluate the multi-evidence retrieval capabilities of long-context LLMs. Counting-Stars comprises two counting-based multiple pieces of evidence retrieval tasks: searching and reasoning. Using Counting-Stars, we conducted experiments to evaluate several long-context LLMs, including GPT-4 Turbo, Gemini 1.5 Pro, Claude3 Opus, GLM-4, and Moonshot-v1. Extensive experimental results demonstrate that Gemini 1.5 Pro achieves the best overall results, while GPT-4 Turbo exhibits the most stable performance across various tasks. Furthermore, our analysis of these LLMs, which have been extended to handle long-context scenarios, indicates that significant room for improvement remains as the length of the input context and the complexity of the tasks increase.

pdf bib
Can Many-Shot In-Context Learning Help LLMs as Evaluators? A Preliminary Empirical Study
Mingyang Song | Mao Zheng | Xuan Luo
Proceedings of the 31st International Conference on Computational Linguistics

Utilizing Large Language Models (LLMs) as evaluators to assess the performance of other LLMs has garnered attention. However, this evaluation approach is affected by potential biases within LLMs, raising concerns about the accuracy and reliability of the evaluation results of LLMs. To address this issue, we propose and explore two many-shot In-Context Learning (ICL) prompt templates to help LLM evaluators mitigate potential biases: Many-Shot with Reference (MSwR) and Many-Shot without Reference (MSoR). Specifically, the former utilizes in-context examples with model-generated rationales as references, while the latter does not include these references. Using these prompt designs, we investigate the impact of increasing the number of in-context examples on the consistency and quality of the evaluation results. Experimental results show that advanced LLMs, such as GPT-4, perform better in the many-shot regime than in the zero-shot regime. Furthermore, in most cases, MSwR performs significantly better than MSoR.

pdf bib
DiffSkip: Differential Layer Skipping in Large Language Models
Xuan Luo | Weizhi Wang | Xifeng Yan
Findings of the Association for Computational Linguistics: ACL 2025

Existing Large Language Models (LLMs) enforce uniform computation across all tokens. We analyze the correlation between the input-output difference of self-attention block and Feed-Forward Network (FFN) within the same transformer layer, and find that these two differential vectors are highly correlated. Thus, we propose to dynamically skip the FFN blocks based on the self-attention difference and introduce Diffential Layer Skipping (DiffSkip) to show that LLMs are inherently dynamic-depth models, capable of adjusting computational depth when generating different tokens. DiffSkip employs a lightweight router module to dynamically skip a set of FFN blocks in LLMs and only requires efficient fine-tuning while keeping the whole LLM frozen. Experimental results demonstrate that DiffSkip effectively enables dynamic FFN skipping in decoder-only language models, even in continuous token generation tasks where many layer-skipping methods struggle.

pdf bib
FastCuRL: Curriculum Reinforcement Learning with Stage-wise Context Scaling for Efficient Training R1-like Reasoning Models
Mingyang Song | Mao Zheng | Zheng Li | Wenjie Yang | Xuan Luo
Findings of the Association for Computational Linguistics: EMNLP 2025

Improving training efficiency continues to be one of the primary challenges in large-scale Reinforcement Learning (RL). In this paper, we investigate how context length and the complexity of training data influence the RL scaling training process of R1-distilled reasoning models, e.g., DeepSeek-R1-Distill-Qwen-1.5B.Our experimental results reveal that: text-green(1) simply controlling the context length and selecting the training data based on the input prompt length can effectively improve the training efficiency of RL scaling, achieving better performance with more concise CoT; text-blue(2) properly scaling the context length helps mitigate entropy collapse; text-redand (3) carefully choosing the context length facilitates achieving efficient LLM training and reasoning. Inspired by these insights, we propose FastCuRL, a curriculum RL framework with stage-wise context scaling to achieve efficient LLM training and reasoning. Extensive experimental results demonstrate that FastCuRL-1.5B-V3 significantly outperforms state-of-the-art reasoning models on five competition-level benchmarks and achieves 49.6% accuracy on AIME 2024. Furthermore, FastCuRL-1.5B-Preview surpasses DeepScaleR-1.5B-Preview on five benchmarks while only using a single node with 8 GPUs and a total of 50% of training steps.

pdf bib
Large Language Models as Reader for Bias Detection
Xuan Luo | Jing Li | Zhong Wenzhong | Geng Tu | Ruifeng Xu
Findings of the Association for Computational Linguistics: EMNLP 2025

Detecting bias in media content is crucial for maintaining information integrity and promoting inclusivity. Traditional methods analyze text from the writer’s perspective, which analyzes textual features directly from the writer’s intent, leaving the reader’s perspective underexplored. This paper investigates whether Large Language Models (LLMs) can be leveraged as readers for bias detection by generating reader-perspective comments. Experiments are conducted on the BASIL (news bias) and BeyondGender (gender bias) datasets with LLMs Gemma-7B, Phi-3-3.8B, Llama3.1-8B, Llama3.1-70B, and GPT4. The results demonstrate the effectiveness of reader-perspective comments for open-source LLMs, achieving performance comparable to GPT4’s. The findings highlight the significance of emotion-related comments, which are generally more beneficial than value-related ones in bias detection. In addition, experiments on Llamas show that comment selection ensures consistent performance regardless of model sizes and comment combinations. This study is particularly beneficial for small-size open-source LLMs.

2022

pdf bib
Masked Language Models Know Which are Popular: A Simple Ranking Strategy for Commonsense Question Answering
Xuan Luo | Chuang Fan | Yice Zhang | Wanguo Jiang | Bing Qin | Ruifeng Xu
Findings of the Association for Computational Linguistics: EMNLP 2022

We propose a simple ranking strategy to solve a generative commonsense question answering (QA) problem. Compared with multiple-choice QA, it is challenging because the answers to a question are not unique and they are supposed to be popular and diverse. Our strategy exploits the dataset itself and negative samples that we collect from WordNet to train a ranker that picks out the most popular answers for commonsense questions. The effectiveness of our strategy is verified on different pre-trained masked language models (MLMs) in a pipeline framework, where an MLM reranks the generated answers. Further, we explore an end-to-end framework where MLMs are utilized to guide the generation of generative language models (GLMs). Taking advantage of reinforcement learning, we apply policy gradient to train a GLM with the rewards fed back by an MLM. Empirical results on ProtoQA dataset demonstrate that MLMs can acquire the ability to distinguish the popular answers and improve the typical answer generation of GLMs as well.