Hemolix.TabGen: Optimized Table Generation from Documents

Gyanendra Shrestha, Todor Ivanov, Karthik Vemireddy, Anna Pyayt, Michael Gubanov


Abstract
Modern data lakes contain vast and heterogeneous document collections, making table generation from documents a persistent and nontrivial challenge. Traditional approaches are often rigid: they are domain-specific, require extensive supervision, or are limited to a set of predefined schemas. LLM-based approaches are more flexible, but typically suffer from hallucinations, non-determinism, and high computational costs. To overcome these limitations, we introduce Hemolix.TabGen, a novel, scalable LLM-based table generation system that comprehends documents and generates bi-dimensional tables based on the entire document content. We evaluated TabGen on four publicly available datasets spanning multiple domains and observed an Average Precision delta of up to 30% compared to vanilla LLMs.
Anthology ID:
2026.acl-industry.73
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Month:
July
Year:
2026
Address:
San Diego, California, USA
Editors:
Yunyao Li, Georg Rehm, Mei Tu
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1055–1066
URL:
https://preview.aclanthology.org/ingestion-form-platform/2026.acl-industry.73/
Cite (ACL):
Gyanendra Shrestha, Todor Ivanov, Karthik Vemireddy, Anna Pyayt, and Michael Gubanov. 2026. Hemolix.TabGen: Optimized Table Generation from Documents. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pages 1055–1066, San Diego, California, USA. Association for Computational Linguistics.
Cite (Informal):
Hemolix.TabGen: Optimized Table Generation from Documents (Shrestha et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingestion-form-platform/2026.acl-industry.73.pdf