HU at SemEval-2025 Task 9: Leveraging LLM-Based Data Augmentation for Class Imbalance

Muhammad Saad; Meesum Abbas; Sandesh Kumar; Abdul Samad

Abstract

This paper presents a solution to the food hazard detection challenge in SemEval-2025 Task 9, focusing on overcoming class imbalance through data augmentation. We use large language models (LLMs) such as GPT-4o, Gemini Flash 1.5, and T5 to generate synthetic data, alongside other methods such as synonym replacement, back-translation, and paraphrasing. The augmented datasets are used to fine-tune transformer-based models such as DistilBERT, improving their performance in detecting food hazards and categorizing products. Our approach achieves notable improvements in macro-F1 scores for both subtasks, although challenges remain in detecting implicit hazards and handling extreme class imbalance. The paper also discusses class weighting and ensemble modeling as part of the training process. Further work is needed to refine hazard detection, particularly for rare and implicit categories.

Anthology ID: 2025.semeval-1.210
Volume: Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Sara Rosenthal, Aiala Rosá, Debanjan Ghosh, Marcos Zampieri
Venues: SemEval | WS
Publisher: Association for Computational Linguistics
Pages: 1593–1601
URL: https://preview.aclanthology.org/transition-to-people-yaml/2025.semeval-1.210/
Cite (ACL): Muhammad Saad, Meesum Abbas, Sandesh Kumar, and Abdul Samad. 2025. HU at SemEval-2025 Task 9: Leveraging LLM-Based Data Augmentation for Class Imbalance. In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pages 1593–1601, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): HU at SemEval-2025 Task 9: Leveraging LLM-Based Data Augmentation for Class Imbalance (Saad et al., SemEval 2025)
PDF: https://preview.aclanthology.org/transition-to-people-yaml/2025.semeval-1.210.pdf
