2025
pdf
bib
abs
RealHiTBench: A Comprehensive Realistic Hierarchical Table Benchmark for Evaluating LLM-Based Table Analysis
Pengzuo Wu
|
Yuhang Yang
|
Guangcheng Zhu
|
Chao Ye
|
Hong Gu
|
Xu Lu
|
Ruixuan Xiao
|
Bowen Bao
|
Yijing He
|
Liangyu Zha
|
Wentao Ye
|
Junbo Zhao
|
Haobo Wang
Findings of the Association for Computational Linguistics: ACL 2025
With the rapid advancement of Large Language Models (LLMs), there is an increasing need for challenging benchmarks to evaluate their capabilities in handling complex tabular data. However, existing benchmarks are either based on outdated data setups or focus solely on simple, flat table structures. In this paper, we introduce **RealHiTBench**, a comprehensive benchmark designed to evaluate the performance of both LLMs and Multimodal LLMs (MLLMs) across a variety of input formats for complex tabular data, including LaTeX, HTML, and PNG. RealHiTBench also includes a diverse collection of tables with intricate structures, spanning a wide range of task types. Our experimental results, using **25** state-of-the-art LLMs, demonstrate that RealHiTBench is indeed a challenging benchmark. Moreover, we also develop TreeThinker, a tree-based agent that organizes hierarchical headers into a tree structure for enhanced tabular reasoning, validating the importance of improving LLMs’ perception of table hierarchies. We hope that our work will inspire further research on tabular data reasoning and the development of more robust models. The code and data are available at https://github.com/cspzyy/RealHiTBench.
2024
pdf
bib
abs
Navigate Complex Physical Worlds via Geometrically Constrained LLM
Yongqiang Huang
|
Wentao Ye
|
Liyao Li
|
Junbo Zhao
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)
This study investigates the potential of Large Language Models (LLMs) for reconstructing and understanding the physical world based solely on textual knowledge. It explores the impact of model performance on spatial understanding abilities by introducing a set of geometric conventions and developing a workflow based on multi-layer graphs and multi-agent systems. The study examines how LLMs achieve multi-step and multi-objective geometric inference in a spatial environment, using unified geometric conventions and a graph-driven framework. A genetic algorithm, inspired by large-scale model knowledge, is employed to solve geometric constraint problems, enhancing the spatial reasoning capabilities of LLMs. This work innovatively explores the feasibility of using text-based LLMs as builders of the physical world and designs a workflow to enhance their spatial comprehension and construction capabilities.
pdf
bib
abs
Data Contamination Calibration for Black-box LLMs
Wentao Ye
|
Jiaqi Hu
|
Liyao Li
|
Haobo Wang
|
Gang Chen
|
Junbo Zhao
Findings of the Association for Computational Linguistics: ACL 2024
The rapid advancements of Large Language Models (LLMs) tightly associate with the expansion of the training data size. However, the unchecked ultra-large-scale training sets introduce a series of potential risks like data contamination, i.e. the benchmark data is used for training. In this work, we propose a holistic method named Polarized Augment Calibration (PAC) along with a new to-be-released dataset to detect the contaminated data and diminish the contamination effect. PAC extends the popular MIA (Membership Inference Attack) — from machine learning community — by forming a more global target at detecting training data to Clarify invisible training data. As a pioneering work, PAC is very much plug-and-play that can be integrated with most (if not all) current white- and black-box LLMs. By extensive experiments, PAC outperforms existing methods by at least 4.5%, towards data contamination detection on more 4 dataset formats, with more than 10 base LLMs. Besides, our application in real-world scenarios highlights the prominent presence of contamination and related issues.