Yuxuan Li
2026
AutoReproduce: Automatic AI Experiment Reproduction with Paper Lineage
Xuanle Zhao | Zilin Sang | Yuxuan Li | Qi Shi | Weilun Zhao | Shuo Wang | Duzhen Zhang | Xu Han | Zhiyuan Liu | Maosong Sun
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Efficient reproduction of research papers is pivotal to accelerating scientific progress. However, the increasing complexity of proposed methods often renders reproduction a labor-intensive endeavor, necessitating profound domain expertise. To address this, we introduce the paper lineage, which systematically mines implicit knowledge from the cited literature. This algorithm serves as the backbone of our proposed AutoReproduce, a multi-agent framework designed to autonomously reproduce experimental code in a complete, end-to-end manner. To ensure code executability, AutoReproduce incorporates a sampling-based unit testing strategy for rapid validation. To assess reproduction capabilities, we introduce a benchmark featuring verified implementations, alongside comprehensive metrics for evaluating both reproduction and execution fidelity. Extensive evaluations on PaperBench and this benchmark demonstrate that AutoReproduce consistently surpasses existing baselines across all metrics. Notably, it yields substantial improvements in reproduction fidelity and final execution performance. The code is available at https://github.com/AI9Stars/AutoReproduce.
ArcLight: A Lightweight LLM Inference Architecture for Many-Core CPUs
Yuzhuang Xu | Xu Han | Yuxuan Li | Wanxiang Che
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Although existing frameworks for large language model (LLM) inference on CPUs are mature, they fail to fully exploit the computational potential of many-core CPU platforms. Many-core CPUs are widely deployed in web servers and high-end networking devices, and are typically organized into multiple NUMA nodes that group cores and memory. Current frameworks largely overlook the substantial overhead of cross-NUMA memory access, limiting inference scalability and intelligence enabling on such platforms. To address this limitation, we build ArcLight, a lightweight LLM inference architecture designed from the ground up for many-core CPUs. ArcLight integrates efficient memory management and thread scheduling, and introduces finely controlled tensor parallelism to mitigate the cross-node memory access wall. Experimental results show that ArcLight significantly surpasses the performance ceiling of mainstream frameworks, achieving up to 46% higher inference throughput. Moreover, ArcLight maintains compatibility with arbitrary CPU devices. ArcLight is publicly available at https://github.com/OpenBMB/ArcLight.
2025
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators
Jianling Li | Shangzhan Li | Zhenye Gao | Qi Shi | Yuxuan Li | Zefan Wang | Jiacheng Huang | Haojie Wang | Jianrong Wang | Xu Han | Zhiyuan Liu | Maosong Sun
Findings of the Association for Computational Linguistics: ACL 2025
Triton, a high-level Python-like language designed for building efficient GPU kernels, is widely adopted in deep learning frameworks due to its portability, flexibility, and accessibility. However, programming and parallel optimization still require considerable trial and error from Triton developers. Despite advances in large language models (LLMs) for conventional code generation, these models struggle to generate accurate, performance-optimized Triton code, as they lack awareness of its specifications and the complexities of GPU programming. More critically, there is an urgent need for systematic evaluations tailored to Triton. In this work, we introduce TritonBench, the first comprehensive benchmark for Triton operator generation. TritonBench features two evaluation channels: a curated set of 184 real-world operators from GitHub and a collection of operators aligned with PyTorch interfaces. Unlike conventional code benchmarks prioritizing functional correctness, TritonBench also profiles efficiency performance on widely deployed GPUs aligned with industry applications. Our study reveals that current state-of-the-art code LLMs struggle to generate efficient Triton operators, highlighting a significant gap in high-performance code generation.
On LLM-Based Scientific Inductive Reasoning Beyond Equations
Brian S. Lin | Jiaxin Yuan | Zihan Zhou | Shouli Wang | Shuo Wang | Cunliang Kong | Qi Shi | Yuxuan Li | Liner Yang | Zhiyuan Liu | Maosong Sun
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
As large language models (LLMs) increasingly exhibit human-like capabilities, a fundamental question emerges: How can we enable LLMs to learn the underlying patterns from limited examples in entirely novel environments and apply them effectively? This question is central to the ability of LLMs in inductive reasoning. Existing research on LLM-based inductive reasoning can be broadly categorized based on whether the underlying rules are expressible via explicit mathematical equations. However, many recent studies in the beyond-equations category have emphasized rule design without grounding them in specific scenarios. Inspired by the parallels between inductive reasoning and human scientific discovery, we propose the task of LLM-Based Scientific Inductive Reasoning Beyond Equations and introduce a new benchmark, SIRBench-V1, to evaluate the inductive reasoning abilities of LLMs in scientific settings. Our experimental results show that current LLMs still struggle with this task, underscoring its difficulty and the need for further advancement in this area.
Spontaneous Giving and Calculated Greed in Language Models
Yuxuan Li | Hirokazu Shirado
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models demonstrate strong problem-solving abilities through reasoning techniques such as chain-of-thought prompting and reflection. However, it remains unclear whether these reasoning capabilities extend to a form of social intelligence: making effective decisions in cooperative contexts. We examine this question using economic games that simulate social dilemmas. First, we apply chain-of-thought and reflection prompting to GPT-4o in a Public Goods Game. We then evaluate multiple off-the-shelf models across six cooperation and punishment games, comparing those with and without explicit reasoning mechanisms. We find that reasoning models consistently reduce cooperation and norm enforcement, favoring individual rationality. In repeated interactions, groups with more reasoning agents exhibit lower collective gains. These behaviors mirror human patterns of “spontaneous giving and calculated greed.” Our findings underscore the need for LLM architectures that incorporate social intelligence alongside reasoning, to help address—rather than reinforce—the challenges of collective action.
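As a rough illustration of the social dilemma in a Public Goods Game, the sketch below computes per-player payoffs under the standard textbook formulation: each player keeps whatever they do not contribute, and the pooled contributions are multiplied and split equally. The function name `public_goods_payoffs`, the endowment of 100, and the multiplier of 2 are assumptions chosen for the example, not parameters reported in the abstract.

```python
def public_goods_payoffs(contributions, endowment=100.0, multiplier=2.0):
    """Payoffs for one round of a standard Public Goods Game.

    Each player keeps (endowment - contribution) and receives an equal
    share of the multiplied pool. All parameter values are illustrative.
    """
    n = len(contributions)
    if n == 0:
        raise ValueError("at least one player is required")
    if any(c < 0 or c > endowment for c in contributions):
        raise ValueError("contributions must lie in [0, endowment]")
    share = multiplier * sum(contributions) / n
    return [endowment - c + share for c in contributions]


# Two cooperators and two free riders: free riders earn more individually,
# although everyone would earn more under full cooperation than under
# full defection.
mixed = public_goods_payoffs([100, 100, 0, 0])
full_cooperation = public_goods_payoffs([100, 100, 100, 100])
full_defection = public_goods_payoffs([0, 0, 0, 0])
```

With these assumed values, a free rider in the mixed group earns more than a cooperator in the same group, while full cooperation still beats full defection for everyone, which is the tension the abstract refers to as individual rationality versus collective gains.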
FR-Spec: Accelerating Large-Vocabulary Language Models via Frequency-Ranked Speculative Sampling
Weilin Zhao | Tengyu Pan | Xu Han | Yudi Zhang | Ao Sun | Yuxiang Huang | Kaihuo Zhang | Weilun Zhao | Yuxuan Li | Jie Zhou | Hao Zhou | Jianyong Wang | Zhiyuan Liu | Maosong Sun
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Speculative sampling has emerged as an important technique for accelerating the auto-regressive generation process of large language models (LLMs) by utilizing a draft-then-verify mechanism to produce multiple tokens per forward pass. While state-of-the-art speculative sampling methods use only a single layer and a language modeling (LM) head as the draft model to achieve impressive layer compression, their efficiency gains are substantially reduced for large-vocabulary LLMs, such as Llama-3-8B with a vocabulary of 128k tokens. To address this, we present FR-Spec, a frequency-ranked speculative sampling framework that optimizes draft candidate selection through vocabulary space compression. By constraining the draft search to a frequency-prioritized token subset, our method reduces LM Head computation overhead by 75% while ensuring the equivalence of the final output distribution. Experiments across multiple datasets demonstrate an average of 1.12× speedup over the state-of-the-art speculative sampling method EAGLE-2. Code is available at https://github.com/thunlp/FR-Spec.
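The idea of constraining the draft search to a frequency-prioritized token subset can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the FR-Spec implementation: the helper names (`build_frequent_subset`, `draft_logits_subset`, `draft_next_token`), the toy vocabulary size, and the random token counts are all hypothetical. It covers only the draft-side LM head; the verification step with the full vocabulary, which is what keeps the final output distribution unchanged, is omitted.

```python
import numpy as np


def build_frequent_subset(token_counts, keep):
    """Return the ids of the `keep` most frequent tokens, sorted ascending."""
    top = np.argsort(-token_counts, kind="stable")[:keep]
    return np.sort(top)


def draft_logits_subset(hidden, lm_head, subset_ids):
    """Compute draft logits only for the LM head rows listed in subset_ids."""
    return hidden @ lm_head[subset_ids].T


def draft_next_token(hidden, lm_head, subset_ids):
    """Greedy draft token, mapped back from subset position to full-vocab id."""
    logits = draft_logits_subset(hidden, lm_head, subset_ids)
    return int(subset_ids[int(np.argmax(logits))])


# Toy setup (all sizes and counts are made up for illustration).
rng = np.random.default_rng(0)
vocab_size, hidden_dim = 1000, 16
lm_head = rng.standard_normal((vocab_size, hidden_dim))
token_counts = rng.integers(0, 10_000, size=vocab_size)

# Keeping a quarter of the vocabulary cuts the head matmul to 25% of its
# original size, matching the 75% reduction quoted in the abstract.
subset_ids = build_frequent_subset(token_counts, keep=vocab_size // 4)
hidden = rng.standard_normal(hidden_dim)
token = draft_next_token(hidden, lm_head, subset_ids)
```

The subset logits are exactly the corresponding entries of the full-vocabulary logits, so the saving comes purely from skipping rows of the LM head that correspond to rarely used tokens.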
LLMs Trust Humans More, That’s a Problem! Unveiling and Mitigating the Authority Bias in Retrieval-Augmented Generation
Yuxuan Li | Xinwei Guo | Jiashi Gao | Guanhua Chen | Xiangyu Zhao | Jiaxin Zhang | Quanying Liu | Haiyan Wu | Xin Yao | Xuetao Wei
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Retrieval-Augmented Generation (RAG) has been proven to be an effective approach to address the hallucination problem in large language models (LLMs). In current RAG systems, LLMs typically need to synthesize knowledge provided by two main external sources (user prompts and an external database) to generate a final answer. When the knowledge provided by the user conflicts with that retrieved from the database, a critical question arises: Does the LLM favor one knowledge source over the other when generating the answer? In this paper, we are the first to unveil a new phenomenon, Authority Bias, where the LLMs tend to favor the knowledge provided by the user even when it deviates from the facts; this new phenomenon is rigorously evidenced via our novel and comprehensive characterization of Authority Bias in six widely used LLMs and across diverse task scenarios. We propose a novel dataset specifically designed for detecting Authority Bias, called the Authority Bias Detection Dataset (ABDD), and introduce new, detailed metrics to measure Authority Bias. To mitigate Authority Bias, we finally propose the Conflict Detection Enhanced Query (CDEQ) framework. We identify the sentences and atomic information that generate conflicts, perform a credibility assessment on the conflicting paragraphs, and ultimately enhance the query to detect perturbed text, thereby reducing Authority Bias. Comparative experiments with widely used mitigation methods demonstrate that CDEQ exhibits both effectiveness and advancement, significantly enhancing the robustness of RAG systems.
Co-authors
- Maosong Sun (孙茂松) 4
- Zhiyuan Liu 3
- Qi Shi 3
- Xu Han 2
- Xu Han 2
- Shuo Wang 2
- Weilun Zhao 2
- Wanxiang Che (车万翔) 1
- Guanhua Chen 1
- Jiashi Gao 1
- Zhenye Gao 1
- Xinwei Guo 1
- Jiacheng Huang 1
- Yuxiang Huang 1
- Cunliang Kong (孔存良) 1
- Jianling Li 1
- Shangzhan Li 1
- Brian S. Lin 1
- Quanying Liu 1
- Zhiyuan Liu 1
- Tengyu Pan 1
- Zilin Sang 1
- Hirokazu Shirado 1
- Ao Sun 1
- Jianrong Wang 1
- Jianyong Wang 1
- Shouli Wang 1
- Zefan Wang 1
- Haojie Wang 1
- Xuetao Wei 1
- Haiyan Wu 1
- Yuzhuang Xu 1
- Liner Yang 1
- Xin Yao 1
- Jiaxin Yuan 1
- Duzhen Zhang 1
- Jiaxin Zhang 1
- Kaihuo Zhang 1
- Yudi Zhang 1
- Weilin Zhao 1
- Xiangyu Zhao 1
- Xuanle Zhao 1
- Hao Zhou 1
- Jie Zhou 1
- Zihan Zhou 1