From Documents to Segments: A Contextual Reformulation for Topic Assignment

Hoonsang Yoon, Takyoung Kim, Wonkee Lee, Ilmin Cho, Dilek Hakkani-Tür, Stanley Jungkyu Choi


Abstract
Traditional topic modeling treats each document as a single, coherent topical unit, which can cause topic contamination when a document covers multiple topics. This is especially problematic when stakeholders want to identify documents that focus on a specific topic. We introduce segment-based topic allocation (SBTA), a novel paradigm that redefines topic assignment at the level of segments: coherent textual spans that each convey distinct topical content. This granularity improves topic purity, interpretability, and applicability to multi-theme corpora such as reviews or survey responses. To support this paradigm, we construct SemEval-STM, a benchmark derived from aspect-based sentiment datasets, in which segments are automatically extracted using large language models (LLMs) and post-processed under human supervision. We further propose the segment intrusion task (SIT), a novel evaluation method that extends word intrusion to the span level, enabling human-centric assessment of topical coherence. Empirical results across diverse metrics and models demonstrate that SBTA significantly outperforms traditional document-based methods in clustering and interpretability. Our framework provides a practical and scalable solution for fine-grained topic analysis in heterogeneous text corpora.
Anthology ID:
2026.findings-acl.1278
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
25586–25624
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1278/
Cite (ACL):
Hoonsang Yoon, Takyoung Kim, Wonkee Lee, Ilmin Cho, Dilek Hakkani-Tür, and Stanley Jungkyu Choi. 2026. From Documents to Segments: A Contextual Reformulation for Topic Assignment. In Findings of the Association for Computational Linguistics: ACL 2026, pages 25586–25624, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
From Documents to Segments: A Contextual Reformulation for Topic Assignment (Yoon et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1278.pdf
Checklist:
2026.findings-acl.1278.checklist.pdf