HSCodeComp: A Realistic and Expert-level Agent Benchmark for Hierarchical Rule Application

Tian Lan; Yiqian Yang; Qianghuai Jia; Li Zhu; Hui Jiang; Hang Zhu; Weihua Luo; Longyue Wang

Abstract

Despite recent progress, existing agent benchmarks neglect a fundamental real-world capability: hierarchical rule application, a critical requirement in fields such as law and medicine where agents must reason from broad categories down to specific exceptions to reach rule-compliant decisions. This introduces significant challenges in resolving logical dependencies and disambiguating vague boundaries. To bridge this gap, we introduce HSCodeComp, a novel benchmark derived from e-commerce, requiring agents to assign a unique 10-digit Harmonized System (HS) Code to products by aligning their fuzzy attributes with strict tariff classification rules. HSCodeComp comprises 632 realistic products across 32 categories, featuring detailed yet noisy product information (titles, attributes, and images). The HS Codes are annotated by a panel of 26 tariff experts, strictly adhering to official rules and an empirical knowledge base, both of which we jointly open-source. Through a comprehensive evaluation of 23 LLMs, VLMs, and agents on HSCodeComp, we demonstrate that: 1) a substantial performance gap remains between state-of-the-art agents and human experts (46.8% vs. 95.0%); and 2) test-time scaling fails to close this gap. Further analysis reveals that 1) excessive reasoning steps frequently induce “reasoning drift,” which degrades accuracy; and 2) agents are prone to premature decisions on high-level categories and reasoning hallucinations that lack factual grounding.

Anthology ID: 2026.acl-long.937
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 20458–20489
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.937/
Cite (ACL): Tian Lan, Yiqian Yang, Qianghuai Jia, Li Zhu, Hui Jiang, Hang Zhu, Weihua Luo, and Longyue Wang. 2026. HSCodeComp: A Realistic and Expert-level Agent Benchmark for Hierarchical Rule Application. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 20458–20489, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): HSCodeComp: A Realistic and Expert-level Agent Benchmark for Hierarchical Rule Application (Lan et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.937.pdf
Checklist: 2026.acl-long.937.checklist.pdf
