Thesis Proposal: A Normalization-First Framework for Sound, Complete, and Utility-Ready Open Information Extraction

Chandan Prakash, Pavan Kumar Chittimalli, Arnab Bhattacharya


Abstract
Open Information Extraction (OIE) has largely focused on extracting relational tuples from text, yet its output in its current form remains unsuitable for downstream systems because it lacks standardized, semantically sound representations. This thesis argues that the field has treated extraction as a surface-level prediction problem, which yields outputs that are semantically incomplete and logically ambiguous, particularly in the presence of modality, negation, conditionality, quantification, and attribution. We propose a normalization-first framework that reframes OIE as a structured semantic transformation pipeline: raw text is first converted into a lossless, canonical form of declarative, active-voice, irreducible sentence units, and extraction is then constrained to atomic unary and binary relations augmented with explicit semantic annotations. Adopting a Probably Approximately Correct (PAC) learning perspective, we formalize soundness, completeness, and usefulness as approximate yet verifiable guarantees on extraction quality, acknowledging the inherent undecidability of full semantic interpretation. The thesis outlines a feasible research program to develop the theoretical foundations, models, and evaluation protocols required to produce system-ready OIE representations, establishing a principled and executable path toward making OIE directly usable for downstream reasoning and machine interpretability.
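To make the target representation in the abstract concrete, the following is a minimal Python sketch of an atomic unary or binary relation carrying explicit semantic annotations. It is an illustration under stated assumptions, not the authors' schema: the names `Annotations`, `AtomicRelation`, and `hoeffding_sample_size`, and all field names, are hypothetical. The last function shows one standard way a "probably approximately correct" guarantee on soundness could be checked by sampling, using the Hoeffding bound; the proposal may formalize this differently.

```python
import math
from dataclasses import dataclass, field
from typing import Optional, Tuple


@dataclass(frozen=True)
class Annotations:
    """Hypothetical annotation slots for the phenomena named in the abstract."""
    negated: bool = False
    modality: Optional[str] = None       # e.g. "possible", "obligatory"
    condition: Optional[str] = None      # antecedent of a conditional, if any
    quantifier: Optional[str] = None     # e.g. "all", "some"
    attributed_to: Optional[str] = None  # source the claim is attributed to


@dataclass(frozen=True)
class AtomicRelation:
    """A relation restricted to one or two arguments (unary or binary)."""
    predicate: str
    args: Tuple[str, ...]
    annotations: Annotations = field(default_factory=Annotations)

    def __post_init__(self) -> None:
        # Enforce the arity constraint described in the abstract.
        if len(self.args) not in (1, 2):
            raise ValueError("only unary or binary relations are allowed")


def hoeffding_sample_size(epsilon: float, delta: float) -> int:
    """Number of i.i.d. judged extractions needed so that the observed
    soundness rate is within epsilon of the true rate with probability
    at least 1 - delta (two-sided Hoeffding bound). Assumed illustration."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


# "According to the report, the committee might not approve the budget."
example = AtomicRelation(
    predicate="approve",
    args=("committee", "budget"),
    annotations=Annotations(
        negated=True, modality="possible", attributed_to="the report"
    ),
)
```

In this sketch the negation, modality, and attribution are kept outside the predicate-argument core, so a downstream system can read the bare relation and its qualifiers separately.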
Anthology ID:
2026.acl-srw.116
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
1291–1304
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-srw.116/
Cite (ACL):
Chandan Prakash, Pavan Kumar Chittimalli, and Arnab Bhattacharya. 2026. Thesis Proposal: A Normalization-First Framework for Sound, Complete, and Utility-Ready Open Information Extraction. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026), pages 1291–1304, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Thesis Proposal: A Normalization-First Framework for Sound, Complete, and Utility-Ready Open Information Extraction (Prakash et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-srw.116.pdf