Ashish Tiwari


2026

We present TEN, a neurosymbolic approach for extracting tabular data from semistructured text such as copy-pasted content from PDFs, emails, or OCR-flattened outputs. This task poses real-world challenges in domains like finance and healthcare, where manual copy-paste into spreadsheets introduces errors and OCR distortions compromise data integrity, leading to financial losses and flawed decisions. Purely neural methods suffer from hallucinations and structural inconsistencies, which hinders robust deployment. TEN addresses this via a novel triadic feedback loop that iteratively refines table hypotheses to enforce constraints and achieve verifiable convergence. Experiments show that TEN outperforms neural baselines, with higher exact-match accuracy and lower hallucination rates. In a 21-participant user study, TEN's tables were rated more accurate and were preferred in over 60% of pairwise comparisons, though verification and correction effort did not differ significantly between conditions.

2025

Extracting insights from text columns can be challenging and time-intensive. Existing methods for topic modeling and feature extraction are based on syntactic features and often overlook the semantics. We introduce the semantic text column featurization problem, and present a scalable approach for automatically solving it. We extract a small sample smartly, use a large language model (LLM) to label only the sample, and then lift the labeling to the whole column using text embeddings. We evaluate our approach by turning existing text classification benchmarks into semantic categorization benchmarks. Our approach performs better than baselines and naive use of LLMs.
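
The sample, label, and lift steps described in the abstract above can be illustrated with a minimal sketch. The abstract does not specify the sampling strategy, the LLM, or the embedding model, so everything here is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for a real text embedding, `pick_sample` uses greedy farthest-point selection as one plausible way to sample "smartly", `fake_llm_label` is a hypothetical stub in place of an LLM call, and labels are lifted by nearest labeled neighbor.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a text embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def pick_sample(vectors, k):
    """Assumed sampling strategy: greedy farthest-point selection for diversity."""
    chosen = [0]
    while len(chosen) < min(k, len(vectors)):
        best = max(
            (i for i in range(len(vectors)) if i not in chosen),
            key=lambda i: min(1 - cosine(vectors[i], vectors[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen


def featurize(column, label_fn, k):
    """Label only k sampled rows, then lift labels to every row by nearest neighbor."""
    vectors = [embed(t) for t in column]
    sample = pick_sample(vectors, k)
    sample_labels = {i: label_fn(column[i]) for i in sample}  # the only labeler calls
    return [
        sample_labels[max(sample, key=lambda j: cosine(vectors[i], vectors[j]))]
        for i in range(len(column))
    ]


def fake_llm_label(text):
    """Hypothetical stub standing in for an LLM labeling call."""
    return "billing" if "refund" in text or "charge" in text else "shipping"


column = [
    "refund my charge please",
    "package arrived late",
    "charge appeared twice refund",
    "late package delivery",
]
labels = featurize(column, fake_llm_label, k=2)
print(labels)
```

In this sketch the labeler is called on only two of the four rows; the other two receive the label of their most similar sampled row, which is the cost-saving idea the abstract describes.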