From Documents to Segments: A Contextual Reformulation for Topic Assignment
Hoonsang Yoon, Takyoung Kim, Wonkee Lee, Ilmin Cho, Dilek Hakkani-Tür, Stanley Jungkyu Choi
Abstract
Traditional topic modeling treats each document as a single, coherent topical unit, which can cause topic contamination when documents cover multiple topics. This becomes especially problematic when stakeholders are interested in identifying documents that focus on a specific topic. We introduce segment-based topic allocation (SBTA), a novel paradigm that redefines topic assignment at the level of segments: coherent textual spans conveying distinct topical content. This granularity improves topic purity, interpretability, and applicability to multi-theme corpora such as reviews or survey responses. To support this paradigm, we construct SemEval-STM, a benchmark derived from aspect-based sentiment datasets, where segments are automatically extracted using large language models (LLMs) and post-processed with human supervision. We further propose the segment intrusion task (SIT), a novel evaluation method extending word intrusion to the span level, enabling human-centric assessment of topical coherence. Empirical results across diverse metrics and models demonstrate that SBTA significantly outperforms traditional document-based methods in clustering and interpretability. Our framework provides a practical and scalable solution for fine-grained topic analysis in heterogeneous text corpora.
- Anthology ID:
- 2026.findings-acl.1278
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2026
- Month:
- July
- Year:
- 2026
- Address:
- San Diego, California, United States
- Editors:
- Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 25586–25624
- URL:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1278/
- Cite (ACL):
- Hoonsang Yoon, Takyoung Kim, Wonkee Lee, Ilmin Cho, Dilek Hakkani-Tür, and Stanley Jungkyu Choi. 2026. From Documents to Segments: A Contextual Reformulation for Topic Assignment. In Findings of the Association for Computational Linguistics: ACL 2026, pages 25586–25624, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal):
- From Documents to Segments: A Contextual Reformulation for Topic Assignment (Yoon et al., Findings 2026)
- PDF:
- https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1278.pdf