Liyao Li

2025

We introduce **LongTableBench**, a benchmark for evaluating long-context reasoning over semi-structured tables across diverse formats, tasks, and domains. It comprises 5,950 QA instances spanning 7 table formats (e.g., Markdown, HTML, SQL), 18 domains, and input lengths up to 128K tokens, including multi-turn and multi-table settings. To ensure data quality, we combine symbolic supervision, cross-model validation, and human review. Evaluating 52 LLMs—including general-purpose, table-specific, and reasoning-enhanced models—reveals that only the strongest models maintain robust performance under increasing context lengths and format diversity. We further show that end-to-end models outperform compression-based approaches, especially on tasks requiring semantic integration. LongTableBench provides a rigorous, scalable testbed for advancing long-context tabular understanding and highlights key limitations in current LLMs’ structural and reasoning capabilities.

2024

pdf bib abs
Navigate Complex Physical Worlds via Geometrically Constrained LLM
Yongqiang Huang | Wentao Ye | Liyao Li | Junbo Zhao
Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)

This study investigates the potential of Large Language Models (LLMs) for reconstructing and understanding the physical world based solely on textual knowledge. It explores the impact of model performance on spatial understanding abilities by introducing a set of geometric conventions and developing a workflow based on multi-layer graphs and multi-agent systems. The study examines how LLMs achieve multi-step and multi-objective geometric inference in a spatial environment, using unified geometric conventions and a graph-driven framework. A genetic algorithm, inspired by large-scale model knowledge, is employed to solve geometric constraint problems, enhancing the spatial reasoning capabilities of LLMs. This work innovatively explores the feasibility of using text-based LLMs as builders of the physical world and designs a workflow to enhance their spatial comprehension and construction capabilities.

The rapid advancements of Large Language Models (LLMs) tightly associate with the expansion of the training data size. However, the unchecked ultra-large-scale training sets introduce a series of potential risks like data contamination, i.e. the benchmark data is used for training. In this work, we propose a holistic method named Polarized Augment Calibration (PAC) along with a new to-be-released dataset to detect the contaminated data and diminish the contamination effect. PAC extends the popular MIA (Membership Inference Attack) — from machine learning community — by forming a more global target at detecting training data to Clarify invisible training data. As a pioneering work, PAC is very much plug-and-play that can be integrated with most (if not all) current white- and black-box LLMs. By extensive experiments, PAC outperforms existing methods by at least 4.5%, towards data contamination detection on more 4 dataset formats, with more than 10 base LLMs. Besides, our application in real-world scenarios highlights the prominent presence of contamination and related issues.

Co-authors

Xing Fu 1

Chao Ye 1

Venues

Fix author