From Noise to Clarity: Filtering Real and LLM-Generated Samples for Enhanced Intent Detection

Junbao Huang; Weizhen Li; Peijie Huang (黄沛杰); Yuhong Xu (徐禹洪)

doi:10.18653/v1/2025.findings-emnlp.1186

Abstract

In dialogue intent detection, the challenge of acquiring sufficient corpora and the high cost of manual annotation often lead to incorrectly labeled or unrepresentative samples, which can hinder the generalization ability of classification models. Additionally, as using large language models to generate synthetic samples for data augmentation becomes more common, these synthetic samples may exacerbate the problem by introducing additional noise due to the models’ limited prior knowledge. To address this challenge, this paper proposes an interpretable Sample Filter by Topic Modeling (SFTM) framework. By evaluating the diversity and authenticity of the samples, SFTM effectively reduces the quantity of real and synthetic samples while improving the performance of the classification models. Our code is publicly available at https://github.com/gumbouh/SFTM.

Anthology ID: 2025.findings-emnlp.1186
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 21736–21746
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1186/
DOI: 10.18653/v1/2025.findings-emnlp.1186
Cite (ACL): Junbao Huang, Weizhen Li, Peijie Huang, and Yuhong Xu. 2025. From Noise to Clarity: Filtering Real and LLM-Generated Samples for Enhanced Intent Detection. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 21736–21746, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): From Noise to Clarity: Filtering Real and LLM-Generated Samples for Enhanced Intent Detection (Huang et al., Findings 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.1186.pdf
Checklist: 2025.findings-emnlp.1186.checklist.pdf
