Hemolix.TabGen: Optimized Table Generation from Documents
Gyanendra Shrestha, Todor Ivanov, Karthik Vemireddy, Anna Pyayt, Michael Gubanov
Abstract
Modern data lakes contain vast and heterogeneous document collections, making table generation from documents a persistent and nontrivial challenge. Traditional approaches are often rigid: they are domain-specific, require extensive supervision, or are limited to a set of predefined schemas. LLM-based approaches are more flexible, but typically suffer from hallucinations, non-determinism, and high computational costs. To overcome these limitations, we introduce Hemolix.TabGen, a novel scalable LLM-based table generation system that comprehends documents and generates bi-dimensional tables based on the entire document content. We evaluated TabGen on 4 publicly available datasets spanning multiple domains and observed an Average Precision delta of up to 30% compared to vanilla LLMs.
- Anthology ID:
- 2026.acl-industry.73
- Volume:
- Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, USA
- Editors:
- Yunyao Li, Georg Rehm, Mei Tu
- Venue:
- ACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1055–1066
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.acl-industry.73/
- Cite (ACL):
- Gyanendra Shrestha, Todor Ivanov, Karthik Vemireddy, Anna Pyayt, and Michael Gubanov. 2026. Hemolix.TabGen: Optimized Table Generation from Documents. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 1055–1066, San Diego, California, USA. Association for Computational Linguistics.
- Cite (Informal):
- Hemolix.TabGen: Optimized Table Generation from Documents (Shrestha et al., ACL 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.acl-industry.73.pdf