Generative Deduplication For Social Media Data Selection

Xianming Li, Jing Li


Abstract
Social media data exhibits severe redundancy caused by its noisy nature, which increases training time and introduces model bias during processing. To address this issue, we propose a novel Generative Deduplication framework for social media data selection that removes semantically duplicate data. While related work performs data selection during task-specific training, our model functions as an efficient pre-processing method to universally enhance social media NLP pipelines. Specifically, we train a generative model via self-supervised learning to predict keywords, capturing the semantics of noisy social media text for deduplication. Meanwhile, time-dimensional Gaussian noise is added to increase training complexity and avoid learning trivial features. Extensive experiments suggest that our model reduces training samples more effectively than baselines while also improving performance. The results show our model's potential to broadly advance social media language understanding in both effectiveness and efficiency.
Anthology ID:
2024.findings-emnlp.330
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2024
Month:
November
Year:
2024
Address:
Miami, Florida, USA
Editors:
Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5765–5776
URL:
https://preview.aclanthology.org/Add-Cong-Liu-Florida-Atlantic-University-author-id/2024.findings-emnlp.330/
DOI:
10.18653/v1/2024.findings-emnlp.330
Cite (ACL):
Xianming Li and Jing Li. 2024. Generative Deduplication For Social Media Data Selection. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 5765–5776, Miami, Florida, USA. Association for Computational Linguistics.
Cite (Informal):
Generative Deduplication For Social Media Data Selection (Li & Li, Findings 2024)
PDF:
https://preview.aclanthology.org/Add-Cong-Liu-Florida-Atlantic-University-author-id/2024.findings-emnlp.330.pdf
Software:
 2024.findings-emnlp.330.software.zip