Jorge Osés Grijalba

2026

Enhancing and Evaluating Tabular Models on the Fly via Synthetic Question–Answer Generation
Jorge Osés Grijalba | Eugenio Martínez Cámara | L. Alfonso Ureña-López | Jose Camacho-Collados
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Question Answering (QA) over Tabular Data has traditionally been a challenging task, but LLMs have recently shown the ability to answer questions about this type of structured data. However, current tabular QA datasets consist of human-crafted question–answer pairs and are skewed toward Wikipedia tables and SQL-style answers. This limits the evaluation of LLMs on this task to a narrow genre of data and language, while also requiring extensive human effort for dataset or benchmark creation. To address this, we introduce SynTabQA, a methodology for the automatic generation of synthetic question–answer pairs from any unannotated table. SynTabQA defines a detailed question typology, enabling fine-grained evaluation and facilitating the creation of diverse QA datasets. Our approach not only provides an automated test bed for any tabular dataset but can also be used in few-shot settings to supply LLMs with tailored examples, improving their focus and accuracy. We validate SynTabQA on two large, manually constructed tabular QA benchmarks of distinct nature.

The Problem of Ambiguity in Table Question Answering
Jorge Osés Grijalba | L. Alfonso Ureña | Eugenio Martínez-Cámara | Jose Camacho-Collados
Findings of the Association for Computational Linguistics: EACL 2026

Question Answering on Tabular Data (or Table Question Answering) has seen tremendous advances with the arrival of new-generation Large Language Models (LLMs). Despite this, significant challenges remain to be solved if we are to develop approaches robust enough for general use. One of these is ambiguity in question answering, which historically has not received much attention due to the previously limited capabilities of LLMs. In this work, we outline the main types of ambiguity inherent to tabular data. Then, we discuss how they are influenced by the way our models interact with the information stored in the tables, and we test the capabilities of some LLMs in detecting them. This work provides an initial basis for a deeper discussion on how to approach ambiguity in Tabular Data in the age of LLMs.

2025

SemEval-2025 Task 8: Question Answering over Tabular Data
Jorge Osés Grijalba | L. Alfonso Ureña-López | Eugenio Martínez Cámara | Jose Camacho-Collados
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

We present the findings and results of SemEval-2025 Task 8: Question Answering over Tabular Data. The task featured two subtasks, DataBench and DataBench Lite. DataBench consists of question answering over tabular data, while DataBench Lite comprises smaller datasets that may be easier for current models to handle, for example by fitting them into a prompt. The task was open to any approach, but answers had to conform to a required typing format. In this paper we present the task, analyze a number of system submissions and discuss the results. The results show that approaches leveraging LLMs dominated the task, with larger models performing considerably better than smaller ones.

2024

Question Answering over Tabular Data with DataBench: A Large-Scale Empirical Evaluation of LLMs
Jorge Osés Grijalba | L. Alfonso Ureña-López | Eugenio Martínez Cámara | Jose Camacho-Collados
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large Language Models (LLMs) are showing emerging abilities, and one of the most recently recognized is their ability to reason over and answer questions from tabular data. Although there are some available datasets to assess question answering systems on tabular data, they are not large and diverse enough to properly assess the capabilities of LLMs. To this end, we propose DataBench, a benchmark composed of 65 real-world datasets over several domains, including 20 human-generated questions per dataset, totaling 1300 questions and answers overall. Using this benchmark, we perform a large-scale empirical comparison of several open and closed source models, including both code-generating and in-context learning models. The results highlight the current gap between open-source and closed-source models, with all types of model having room for improvement even on simple boolean questions or questions involving a single column.
