Yiqian Yang
2026
HSCodeComp: A Realistic and Expert-level Agent Benchmark for Hierarchical Rule Application
Tian Lan | Yiqian Yang | Qianghuai Jia | Li Zhu | Hui Jiang | Hang Zhu | Weihua Luo | Longyue Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite recent progress, existing agent benchmarks neglect a fundamental real-world capability: hierarchical rule application, a critical requirement in fields such as law and medicine where agents must reason from broad categories down to specific exceptions to reach rule-compliant decisions. This introduces significant challenges in resolving logical dependencies and disambiguating vague boundaries. To bridge this gap, we introduce HSCodeComp, a novel benchmark derived from e-commerce, requiring agents to assign a unique 10-digit Harmonized System (HS) Code to products by aligning their fuzzy attributes with strict tariff classification rules. HSCodeComp comprises 632 realistic products across 32 categories, featuring detailed yet noisy product information (titles, attributes, and images). The HS Codes are annotated by a panel of 26 tariff experts, strictly adhering to official rules and an empirical knowledge base, both of which we jointly open-source. Through a comprehensive evaluation of 23 LLMs, VLMs, and agents on HSCodeComp, we demonstrate that: 1) a substantial performance gap remains between state-of-the-art agents and human experts (46.8% vs. 95.0%); and 2) test-time scaling fails to close this gap. Further analysis reveals that 1) excessive reasoning steps frequently induce “reasoning drift,” which degrades accuracy; and 2) agents are prone to premature decisions on high-level categories and reasoning hallucinations that lack factual grounding.
2025
Analyzing and Modeling LLM Response Lengths with Extreme Value Theory: Anchoring Effects and Hybrid Distributions
Liuxuan Jiao | Chen Gao | Yiqian Yang | Chenliang Zhou | YiXian Huang | Xinlei Chen | Yong Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
We present a statistical framework for modeling and controlling large language model (LLM) response lengths using extreme value theory. Analyzing 14,301 GPT-4o responses across temperature and prompting conditions, with cross-validation on Qwen and DeepSeek architectures, we demonstrate that verbosity follows Weibull-type generalized extreme value (GEV) distributions with heavier tails under stochastic generation. Our key contributions include: (1) development of a novel GEV-generalized Pareto (GPD) hybrid model that improves tail fit (R²_CDF = 0.9993 vs. 0.998 for standalone GEV) while maintaining architectural generalizability; (2) quantitative characterization of prompt anchoring effects across models, showing reduced dispersion but increased outliers under randomization; and (3) identification of temperature-dependent response patterns that persist across architectures, with higher temperatures amplifying length variability while preserving extreme-value mechanisms. The hybrid model’s threshold selection method enables precise verbosity control in production systems regardless of model choice. While validated on multiple architectures, generalizability to emerging model families requires further study.
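As a rough illustration of what a GEV body with a GPD tail can look like, the sketch below fits both pieces with SciPy and scores the fit against the empirical CDF. It is not the authors' implementation: the synthetic data, the fixed 95th-percentile threshold, the way the two pieces are spliced, and the R² definition are all assumptions made for this example, and the function names (fit_hybrid, hybrid_cdf, r2_cdf) are hypothetical. Note that SciPy's genextreme uses the shape convention c = -xi, so c > 0 corresponds to the Weibull-type (bounded upper tail) case.

```python
import numpy as np
from scipy import stats


def fit_hybrid(lengths, tail_quantile=0.95):
    """Fit a GEV to all data and a GPD to exceedances over a threshold.

    The fixed-quantile threshold is an assumption of this sketch; the
    abstract does not describe the paper's threshold selection method.
    """
    x = np.asarray(lengths, dtype=float)
    gev_params = stats.genextreme.fit(x)           # (c, loc, scale)
    u = np.quantile(x, tail_quantile)              # tail threshold
    excess = x[x > u] - u
    gpd_shape, _, gpd_scale = stats.genpareto.fit(excess, floc=0.0)
    return gev_params, u, (gpd_shape, gpd_scale)


def hybrid_cdf(x, gev_params, u, gpd_params):
    """GEV CDF up to u, then a GPD tail rescaled to start at F_GEV(u)."""
    x = np.asarray(x, dtype=float)
    body = stats.genextreme.cdf(x, *gev_params)
    f_u = stats.genextreme.cdf(u, *gev_params)
    gpd = stats.genpareto.cdf(
        np.maximum(x - u, 0.0), gpd_params[0], loc=0.0, scale=gpd_params[1]
    )
    tail = f_u + (1.0 - f_u) * gpd
    return np.where(x <= u, body, tail)


def r2_cdf(lengths, model_cdf):
    """R^2 between the empirical CDF and a model CDF (assumed definition)."""
    x = np.sort(np.asarray(lengths, dtype=float))
    empirical = np.arange(1, x.size + 1) / (x.size + 1)
    fitted = model_cdf(x)
    ss_res = np.sum((empirical - fitted) ** 2)
    ss_tot = np.sum((empirical - empirical.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in for response lengths (NOT data from the paper).
lengths = stats.genextreme.rvs(c=0.2, loc=300, scale=80, size=2000, random_state=0)
gev_params, u, gpd_params = fit_hybrid(lengths)
r2 = r2_cdf(lengths, lambda x: hybrid_cdf(x, gev_params, u, gpd_params))
print(f"threshold u = {u:.1f}, R^2 (CDF) = {r2:.4f}")
```

Splicing at F_GEV(u) keeps the combined CDF continuous at the threshold; other constructions (for example, weighting by the empirical exceedance rate) are equally plausible readings of "hybrid".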