TeCoFeS: Text Column Featurization using Semantic Analysis

Ananya Singha, Mukul Singh, Ashish Tiwari, Sumit Gulwani, Vu Le, Chris Parnin


Abstract
Extracting insights from text columns can be challenging and time-intensive. Existing methods for topic modeling and feature extraction are based on syntactic features and often overlook the semantics. We introduce the semantic text column featurization problem, and present a scalable approach for automatically solving it. We extract a small sample smartly, use a large language model (LLM) to label only the sample, and then lift the labeling to the whole column using text embeddings. We evaluate our approach by turning existing text classification benchmarks into semantic categorization benchmarks. Our approach performs better than baselines and naive use of LLMs.
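The sample, label, and lift steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the bag-of-words `embed`, the greedy farthest-point `pick_sample`, the nearest-neighbor lifting, and the keyword-based `stub_labeler` (a stand-in for the LLM call) are all assumptions chosen so that the example runs offline with only the standard library.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a text embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_sample(vectors, k):
    """Greedy farthest-point sampling (assumed strategy for a diverse sample)."""
    chosen = [0]
    while len(chosen) < min(k, len(vectors)):
        # Take the row that is least similar to everything chosen so far.
        best = min(
            (i for i in range(len(vectors)) if i not in chosen),
            key=lambda i: max(cosine(vectors[i], vectors[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen


def featurize(column, label_fn, k):
    """Label only k sampled rows, then lift labels to every row by similarity."""
    vectors = [embed(t) for t in column]
    sample = pick_sample(vectors, k)
    sample_labels = {i: label_fn(column[i]) for i in sample}
    labels = []
    for i, v in enumerate(vectors):
        if i in sample_labels:
            labels.append(sample_labels[i])
        else:
            # Copy the label of the most similar labeled row.
            nearest = max(sample, key=lambda j: cosine(v, vectors[j]))
            labels.append(sample_labels[nearest])
    return labels


def stub_labeler(text):
    """Hypothetical placeholder for an LLM call; a keyword rule keeps it offline."""
    lowered = text.lower()
    return "billing" if "refund" in lowered or "charge" in lowered else "shipping"


column = [
    "refund for double charge",
    "package arrived late",
    "charge appeared twice, need refund",
    "late package delivery",
]
labels = featurize(column, stub_labeler, k=2)
print(labels)
```

In this sketch the labeler is invoked only for the `k` sampled rows, which is the property that would keep the cost of a real LLM call independent of column length; the remaining rows receive labels purely from embedding similarity.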
Anthology ID:
2025.findings-naacl.392
Volume:
Findings of the Association for Computational Linguistics: NAACL 2025
Month:
April
Year:
2025
Address:
Albuquerque, New Mexico
Editors:
Luis Chiruzzo, Alan Ritter, Lu Wang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
7055–7061
URL:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.392/
Cite (ACL):
Ananya Singha, Mukul Singh, Ashish Tiwari, Sumit Gulwani, Vu Le, and Chris Parnin. 2025. TeCoFeS: Text Column Featurization using Semantic Analysis. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 7055–7061, Albuquerque, New Mexico. Association for Computational Linguistics.
Cite (Informal):
TeCoFeS: Text Column Featurization using Semantic Analysis (Singha et al., Findings 2025)
PDF:
https://preview.aclanthology.org/fix-sig-urls/2025.findings-naacl.392.pdf