Abstract
In this paper, we present our approach to addressing the binary classification tasks, Tasks 5 and 6, as part of the Social Media Mining for Health (SMM4H) text classification challenge. Both tasks involved working with imbalanced datasets that featured a scarcity of positive examples. To mitigate this imbalance, we employed a Large Language Model to generate synthetic texts with positive labels, aiming to augment the training data for our text classification models. Unfortunately, this method did not significantly improve model performance. Through clustering analysis using text embeddings, we discovered that the generated texts significantly lacked diversity compared to the raw data. This finding highlights the challenges of using synthetic text generation for enhancing model efficacy in real-world applications, specifically in the context of health-related social media data.- Anthology ID:
- 2024.smm4h-1.10
- Volume:
- Proceedings of The 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks
- Month:
- August
- Year:
- 2024
- Address:
- Bangkok, Thailand
- Editors:
- Dongfang Xu, Graciela Gonzalez-Hernandez
- Venues:
- SMM4H | WS
- SIG:
- Publisher:
- Association for Computational Linguistics
- Note:
- Pages:
- 42–47
- Language:
- URL:
- https://aclanthology.org/2024.smm4h-1.10
- DOI:
- Cite (ACL):
- Yosuke Yamagishi and Yuta Nakamura. 2024. UTRad-NLP at #SMM4H 2024: Why LLM-Generated Texts Fail to Improve Text Classification Models. In Proceedings of The 9th Social Media Mining for Health Research and Applications (SMM4H 2024) Workshop and Shared Tasks, pages 42–47, Bangkok, Thailand. Association for Computational Linguistics.
- Cite (Informal):
- UTRad-NLP at #SMM4H 2024: Why LLM-Generated Texts Fail to Improve Text Classification Models (Yamagishi & Nakamura, SMM4H-WS 2024)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-4/2024.smm4h-1.10.pdf