Qi Han
2026
CrosSing: Cross-Scale Reasoning Evaluation on LLMs against Humans
Qi Han | Yifan Wu | Marten Van Schijndel
Proceedings of the Society for Computation in Linguistics 2026
While many studies have shown LLMs perform well in various reasoning tasks, few have examined their capacity on semantic reasoning tasks. As LLMs reason with language, it is crucial to understand how well they grasp and use the underlying scalar relationships in language. In this study, we introduced a new dataset CrosSing (Cross-Scale reasoning), providing a human baseline against which to evaluate LLMs’ ability to reason across lexical scales in gradable adjectives. We further probed how their understanding is influenced by overinformative contexts. We evaluated ten high-performing LLMs and found that some outperformed humans when no extra information was provided, but that LLM performance declined in certain overinformative contexts while human performance improved significantly. This contrast reveals a fundamental difference between recent LLMs and humans in understanding adjectives’ scalar relationships and how such understanding behaves in overinformative contexts.
Beyond Quantity: Trajectory Diversity Scaling for Code Agents
Guhong Chen | Chenghao Sun | Cheng Fu | Qiyao Wang | Zhihong Huang | ChaoPeng Wei | Guangxu Chen | Feiteng Fang | Ahmadreza Argha | Bing Zhao | Xander Xu | Qi Han | Hamid Alinejad-Rokny | Qiang Qu | Binhua Li | Shiwen Ni | Min Yang | HU Wei | Yongbin Li
Findings of the Association for Computational Linguistics: ACL 2026
As code large language models (LLMs) evolve into tool-interactive agents via the Model Context Protocol (MCP), their generalization is increasingly limited by low-quality synthetic data and the diminishing returns of quantity scaling; moreover, quantity-centric scaling exhibits an early bottleneck that underutilizes trajectory data. We propose TDScaling, a Trajectory Diversity Scaling-based data synthesis framework for code agents that scales performance through diversity rather than raw volume. TDScaling is also more data-efficient: under a fixed training budget, increasing trajectory diversity yields larger gains than adding more trajectories, improving the performance-cost trade-off for agent training. TDScaling integrates four innovations: (1) a Business Cluster mechanism that captures real-service logical dependencies; (2) a Blueprint-driven multi-agent paradigm that enforces trajectory coherence; (3) an adaptive evolution mechanism that steers synthesis toward long-tail scenarios using Domain Entropy, Reasoning Mode Entropy, and Cumulative Action Complexity to prevent mode collapse; and (4) a sandboxed code tool that mitigates catastrophic forgetting of intrinsic coding capabilities. Experiments on general tool-use benchmarks (BFCL, τ²-Bench) and code agent tasks (RebenchT, CodeCI, BIRD) demonstrate a win-win outcome: TDScaling improves both tool-use generalization and inherent coding proficiency. Crucially, we show that trajectory diversity scaling attains a substantially higher performance ceiling than quantity scaling, establishing a resource-efficient paradigm for training robust code agents under data bottlenecks.
PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering
Xiangfeng Wang | Hangyu Guo | Yanlin Lai | Mitt Huang | Liang Zhao | Chengyuan Yao | Yinmin Zhang | Qi Han | Xiaoxiaoren | Chun Yuan | Tong Xu | Zheng Ge | Xiangyu Zhang | Daxin Jiang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
While model-based verifiers are essential for scaling Reinforcement Learning with Verifiable Rewards (RLVR), current outcome-centric verification paradigms primarily focus on the consistency between the final result and the ground truth, often neglecting potential errors in the derivation process. This leads to assigning positive rewards to correct answers produced from incorrect derivations. To bridge this gap, we introduce **PRIME**, a benchmark for evaluating verifiers on **PR**ocess-outcome alignment verification **I**n **M**athematics and **E**ngineering. Curated from a comprehensive collection of college-level STEM problems, **PRIME** comprises 2,530 high-difficulty samples through a consistency-based filtering pipeline. Through extensive evaluation, we find that current verifiers frequently fail to detect derivation flaws. Furthermore, we propose a process-aware RLVR training paradigm utilizing verifiers selected via **PRIME**. This approach substantially outperforms the outcome-only verification baseline, achieving absolute performance gains of **8.29%**, **9.12%**, and **7.31%** on AIME24, AIME25, and Beyond-AIME, respectively, for the Qwen3-14B-Base model. Finally, we demonstrate a strong linear correlation (R² > 0.92) between verifier accuracy on **PRIME** and RLVR training effectiveness, validating **PRIME** as a reliable predictor for verifier selection.
PaCoRe: Learning to Scale Test-Time Compute with Parallel Coordinated Reasoning
Jingcheng Hu | Yinmin Zhang | Shijie Shang | Xiaobo Yang | Yue Peng | Zhewei Huang | Hebin Zhou | Xin Wu | Jie Cheng | Fanqi Wan | Xiangwen Kong | Chengyuan Yao | Kaiwen Yan | Ailin Huang | Hongyu Zhou | Qi Han | Zheng Ge | Xiangyu Zhang | Heung-Yeung Shum
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce Parallel Coordinated Reasoning (PaCoRe), a training-and-inference framework designed to overcome a central limitation of contemporary language models: their inability to scale test-time compute (TTC) far beyond sequential reasoning under a fixed context window. PaCoRe departs from the traditional sequential paradigm by driving TTC through massive parallel exploration coordinated via a message-passing architecture in multiple rounds. Each round launches many parallel reasoning trajectories, compacts their findings into context-bounded messages, and synthesizes these messages to guide the next round and ultimately produce the final answer. Trained end-to-end with large-scale, outcome-based reinforcement learning, the model masters the synthesis abilities required by PaCoRe and scales to multi-million-token effective TTC without exceeding context limits. The approach yields strong improvements across diverse domains and notably pushes reasoning beyond frontier systems in mathematics: an 8B model reaches 94.5% on HMMT 2025, surpassing GPT-5’s 93.2% by scaling effective TTC to roughly two million tokens. We open-source model checkpoints, training data, and the full inference pipeline to accelerate follow-up work.
2016
Visualisation and Exploration of High-Dimensional Distributional Features in Lexical Semantic Classification
Maximilian Köper | Melanie Zaiß | Qi Han | Steffen Koch | Sabine Schulte im Walde
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
Vector space models and distributional information are widely used in NLP. The models typically rely on complex, high-dimensional objects. We present an interactive visualisation tool to explore salient lexical-semantic features of high-dimensional word objects and word similarities. Most visualisation tools provide only one low-dimensional map of the underlying data, so they are not capable of retaining the local and the global structure. We overcome this limitation by providing an additional trust-view to obtain a more realistic picture of the actual object distances. Additional tool options include the reference to a gold standard classification, the reference to a cluster analysis as well as listing the most salient (common) features for a selected subset of the words.
2014
A tunable language model for statistical machine translation
Junfei Guo | Juan Liu | Qi Han | Andreas Maletti
Proceedings of the 11th Conference of the Association for Machine Translation in the Americas: MT Researchers Track
A novel variation of the modified Kneser-Ney model using monomial discounting is presented and integrated into the Moses statistical machine translation toolkit. The language model is trained on a large training set as usual, but its new discount parameters are tuned to the small development set. An in-domain and cross-domain evaluation of the language model is performed based on perplexity, in which sizable improvements are obtained. Additionally, the performance of the language model is evaluated in several major machine translation tasks including Chinese-to-English. In those tests, the test data is from a (slightly) different domain than the training data. The experimental results indicate that the new model significantly outperforms a baseline model using SRILM in those domain adaptation scenarios. The new language model is thus ideally suited for domain adaptation without sacrificing performance on in-domain experiments.
2013
CodeX: Combining an SVM Classifier and Character N-gram Language Models for Sentiment Analysis on Twitter Text
Qi Han | Junfei Guo | Hinrich Schuetze
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)
Co-authors
- Zheng Ge 2
- Junfei Guo 2
- Chengyuan Yao 2
- Xiangyu Zhang 2
- Yinmin Zhang 2
- Hamid Alinejad-Rokny 1
- Ahmadreza Argha 1
- Guangxu Chen 1
- Guhong Chen 1
- Jie Cheng 1
- Feiteng Fang 1
- Cheng Fu 1
- Hangyu Guo 1
- Jingcheng Hu 1
- Ailin Huang 1
- Mitt Huang 1
- Zhewei Huang 1
- Zhihong Huang 1
- Daxin Jiang 1
- Steffen Koch 1
- Xiangwen Kong 1
- Maximilian Köper 1
- Yanlin Lai 1
- Binhua Li 1
- Yongbin Li 1
- Juan Liu 1
- Andreas Maletti 1
- Shiwen Ni 1
- Yue Peng 1
- Qiang Qu 1
- Sabine Schulte im Walde 1
- Hinrich Schütze 1
- Shijie Shang 1
- Heung Yeung Shum 1
- Chenghao Sun 1
- Fanqi Wan 1
- Qiyao Wang 1
- Xiangfeng Wang 1
- ChaoPeng Wei 1
- HU Wei 1
- Xin Wu 1
- Yifan Wu 1
- Xiaoxiaoren 1
- Tong Xu 1
- Xander Xu 1
- Kaiwen Yan 1
- Min Yang 1
- Xiaobo Yang 1
- Chun Yuan 1
- Melanie Zaiß 1
- Bing Zhao 1
- Liang Zhao (赵亮) 1
- Hebin Zhou 1
- Hongyu Zhou 1
- Marten van Schijndel 1