Shangqing Tu
2026
SimPBL: A Multi-Agent Framework for Project-Based Learning
Daniel Zhang-Li | Joy Jia Yin Lim | Binglin Liu | Shangqing Tu | Zijun Yao | Hao Peng | Jifan Yu | Haoxuan Li | Zhanxin Hao | Ye He | Zekun Li | Jiangyi Wang | Lei Hou | Bin Xu | Xin Cong | Zhiyuan Liu | Huiqin Liu | Yu Zhang | Juanzi Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Project-Based Learning (PBL) is an important learning method that promotes understanding and the acquisition of practical skills by training learners through a project. However, effective PBL often requires sustained orchestration and collaboration. Existing LLM-based learning tools provide partial assistance without explicitly modeling these roles, and overly comprehensive help from an LLM can reduce learner autonomy. We propose SimPBL, a multi-agent framework with an orchestrator agent that provides adaptive scaffolding from interaction logs and collaborator agents that support project work through boundary-aware collaboration. We conduct a comprehensive evaluation of the effectiveness of SimPBL and observe a 14% improvement in learner examination scores. Results from extensive studies further highlight the ability of SimPBL to manage learning behavior and improve the learning experience. Code and materials are available at https://anonymous.4open.science/r/SimPBL-D5B8.
Beyond Self-Report: Bridging the Intention-Behavior Gap in Critical Thinking Assessment via Interpretable Multi-Agent System
Zekun Li | Jifan Yu | Haoxuan Li | Ye He | Daniel Zhang-Li | Shangqing Tu | Joy Jia Yin Lim | Yikun Jiang | Jiaxin Yuan | Yu Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Accurate assessment of critical thinking has historically been limited by the Intention-Behavior Gap in psychology: the disconnect between individuals' self-reported dispositions and their actual behaviors. We try to bridge this gap with MASA (Multi-Agent Scenario-based Assessment), a framework that operationalizes cognitive assessment as an interpretable and interactive multi-agent workflow with Assessment Chain-of-Thought (AsCoT). Validating on both large-scale simulations (N=1,161) and human participants (N=70), we find that MASA aligns better with human expert ratings (r=0.882) than traditional gold-standard inventories (r=0.720), with an average cost of only 0.41 per participant. These results suggest that by shifting from self-report inventories to behavior-grounded dialogue, MASA offers a more accurate, cost-effective, and transparent solution for real-world cognitive evaluation.
DeepPrune: Parallel Scaling without Inter-trace Redundancy
Shangqing Tu | Yaxuan Li | Yushi Bai | Lei Hou | Juanzi Li
Findings of the Association for Computational Linguistics: ACL 2026
Parallel scaling has emerged as a powerful paradigm to enhance reasoning capabilities in large language models (LLMs) by generating multiple Chain-of-Thought (CoT) traces simultaneously. However, this approach introduces significant computational inefficiency due to *inter-trace redundancy*: our analysis reveals that over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. To address this efficiency bottleneck, we propose **DeepPrune**, a novel framework that enables efficient parallel scaling through dynamic pruning. Our method features a specialized judge model trained with oversampling techniques to accurately predict answer equivalence from partial reasoning traces, achieving 0.7072 AUROC on equivalence prediction across unseen reasoning models. This is combined with an online greedy clustering algorithm that dynamically prunes redundant paths while preserving answer diversity. Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune achieves token reductions ranging from 65.73% to 88.50% compared to conventional consensus sampling, while maintaining competitive accuracy within 3.4 percentage points. Our work establishes a new standard for efficient parallel reasoning. Our code and data are available at https://github.com/THU-KEG/DeepPrune/
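The online greedy clustering idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: in the paper the equivalence decision comes from a trained judge model applied to partial traces, whereas `toy_judge` below is a hypothetical stand-in that simply compares the last token of two strings, and the names `greedy_cluster` and `survivors` are likewise assumptions made for illustration.

```python
from typing import Callable, List


def toy_judge(trace_a: str, trace_b: str) -> bool:
    """Hypothetical stand-in for a learned judge: two traces are treated
    as equivalent when their last whitespace-separated token matches."""
    return trace_a.split()[-1] == trace_b.split()[-1]


def greedy_cluster(
    traces: List[str], same_answer: Callable[[str, str], bool]
) -> List[List[str]]:
    """Process traces one at a time (online). Each trace joins the first
    cluster whose representative (its first member) is judged equivalent;
    otherwise it opens a new cluster."""
    clusters: List[List[str]] = []
    for trace in traces:
        for cluster in clusters:
            if same_answer(cluster[0], trace):
                cluster.append(trace)
                break
        else:
            # No existing cluster matched, so this trace is a new candidate answer.
            clusters.append([trace])
    return clusters


def survivors(clusters: List[List[str]]) -> List[str]:
    """Keep one representative per cluster; the rest would be pruned."""
    return [cluster[0] for cluster in clusters]


if __name__ == "__main__":
    demo = ["path one gives 42", "path two gives 42", "path three gives 7"]
    print(survivors(greedy_cluster(demo, toy_judge)))
```

Keeping one representative per cluster is the simplest pruning policy consistent with "preserving answer diversity"; the actual system may retain more than one trace per cluster.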
2025
Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis
Kejian Zhu | Shangqing Tu | Zhuoran Jin | Lei Hou | Juanzi Li | Jun Zhao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The development of large language models (LLMs) depends on **trustworthy evaluation**. However, most current evaluations rely on public benchmarks, which are prone to data contamination issues that significantly compromise fairness. Previous research has focused on constructing dynamic benchmarks to address contamination. However, continuously building new benchmarks is costly and cyclical. In this work, we aim to tackle contamination by analyzing the mechanisms of contaminated models themselves. Through our experiments, we discover that the overestimation of contaminated models is likely due to parameters acquiring shortcut solutions in training. We further propose a novel method for identifying shortcut neurons through **comparative and causal analysis**. Building on this, we introduce an evaluation method called **shortcut neuron patching** to suppress shortcut neurons. Experiments validate the effectiveness of our approach in mitigating contamination. Additionally, our evaluation results exhibit a strong correlation with MixEval, a recently released trustworthy benchmark, achieving a Spearman coefficient (𝜌) exceeding 0.95. This high correlation indicates that our method closely reveals the true capabilities of the models and is trustworthy. We conduct further experiments to demonstrate the generalizability of our method across various benchmarks and hyperparameter settings. **Code**: https://github.com/GaryStack/Trustworthy-Evaluation.
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
Yushi Bai | Shangqing Tu | Jiajie Zhang | Hao Peng | Xiaozhi Wang | Xin Lv | Shulin Cao | Jiazheng Xu | Lei Hou | Yuxiao Dong | Jie Tang | Juanzi Li
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure breadth and practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when directly answering the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4 percentage points. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2.
2024
WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models
Shangqing Tu | Yuliang Sun | Yushi Bai | Jifan Yu | Lei Hou | Juanzi Li
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
To mitigate the potential misuse of large language models (LLMs), recent research has developed watermarking algorithms, which restrict the generation process to leave an invisible trace for watermark detection. Due to the two-stage nature of the task, most studies evaluate the generation and detection separately, thereby presenting a challenge in unbiased, thorough, and applicable evaluations. In this paper, we introduce WaterBench, the first comprehensive benchmark for LLM watermarks, in which we design three crucial factors: (1) For benchmarking procedure, to ensure an apples-to-apples comparison, we first adjust each watermarking method’s hyper-parameter to reach the same watermarking strength, then jointly evaluate their generation and detection performance. (2) For task selection, we diversify the input and output length to form a five-category taxonomy, covering 9 tasks. (3) For evaluation metric, we adopt the GPT4-Judge for automatically evaluating the decline of instruction-following abilities after watermarking. We evaluate 4 open-source watermarks on 2 LLMs under 2 watermarking strengths and observe the common struggles for current methods on maintaining the generation quality. The code and data are available at https://github.com/THU-KEG/WaterBench.
2022
UPER: Boosting Multi-Document Summarization with an Unsupervised Prompt-based Extractor
Shangqing Tu | Jifan Yu | Fangwei Zhu | Juanzi Li | Lei Hou | Jian-Yun Nie
Proceedings of the 29th International Conference on Computational Linguistics
Multi-Document Summarization (MDS) commonly employs the 2-stage extract-then-abstract paradigm, which first extracts a relatively short meta-document, then feeds it into the deep neural networks to generate an abstract. Previous work usually takes the ROUGE score as the label for training a scoring model to evaluate source documents. However, the trained scoring model is prone to under-fitting for low-resource settings, as it relies on the training data. To extract documents effectively, we construct prompting templates that invoke the underlying knowledge in Pre-trained Language Model (PLM) to calculate the document and keyword’s perplexity, which can assess the document’s semantic salience. Our unsupervised approach can be applied as a plug-in to boost other metrics for evaluating a document’s salience, thus improving the subsequent abstract generation. We get positive results on 2 MDS datasets, 2 data settings, and 2 abstractive backbone models, showing our method’s effectiveness. Our code is available at https://github.com/THU-KEG/UPER
2021
TWAG: A Topic-Guided Wikipedia Abstract Generator
Fangwei Zhu | Shangqing Tu | Jiaxin Shi | Juanzi Li | Lei Hou | Tong Cui
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Wikipedia abstract generation aims to distill a Wikipedia abstract from web sources and has met significant success by adopting multi-document summarization techniques. However, previous works generally view the abstract as plain text, ignoring the fact that it is a description of a certain entity and can be decomposed into different topics. In this paper, we propose a two-stage model TWAG that guides the abstract generation with topical information. First, we detect the topic of each input paragraph with a classifier trained on existing Wikipedia articles to divide input documents into different topics. Then, we predict the topic distribution of each abstract sentence, and decode the sentence from topic-aware representations with a Pointer-Generator network. We evaluate our model on the WikiCatSum dataset, and the results show that TWAG outperforms various existing baselines and is capable of generating comprehensive abstracts.
Co-authors
- Lei Hou 7
- Juanzi Li 7
- Jifan Yu 4
- Yushi Bai 3
- Ye He 2
- Haoxuan Li 2
- Zekun Li 2
- Joy Jia Yin Lim 2
- Hao Peng 2
- Yu Zhang 2
- Daniel Zhang-Li 2
- Fangwei Zhu 2
- Shulin Cao 1
- Xin Cong 1
- Tong Cui 1
- Yuxiao Dong 1
- Zhanxin Hao 1
- Yikun Jiang 1
- Zhuoran Jin 1
- Yaxuan Li 1
- Binglin Liu 1
- Zhiyuan Liu 1
- Huiqin Liu 1
- Xin Lv 1
- Jian-Yun Nie 1
- Jiaxin Shi 1
- Yuliang Sun 1
- Jie Tang 1
- Xiaozhi Wang 1
- Jiangyi Wang 1
- Jiazheng Xu 1
- Bin Xu 1
- Zijun Yao 1
- Jiaxin Yuan (袁佳欣) 1
- Jiajie Zhang 1
- Jun Zhao 1
- Kejian Zhu 1