Muhammad Saad


2025

HU at SemEval-2025 Task 9: Leveraging LLM-Based Data Augmentation for Class Imbalance
Muhammad Saad | Meesum Abbas | Sandesh Kumar | Abdul Samad
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper presents a solution to the food hazard detection challenge of SemEval-2025 Task 9, focusing on overcoming class imbalance through data augmentation. We use large language models (LLMs) such as GPT-4o, Gemini Flash 1.5, and T5 to generate synthetic data, alongside other methods such as synonym replacement, back-translation, and paraphrasing. The augmented datasets are used to fine-tune transformer-based models such as DistilBERT, improving their performance in detecting food hazards and categorizing products. Our approach improves macro-F1 scores on both subtasks, although challenges remain in detecting implicit hazards and handling extreme class imbalance. The paper also discusses class weighting and ensemble modeling as part of the training process. Further work is needed to refine hazard detection, particularly for rare and implicit categories.
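To make two of the terms in the abstract concrete, the following is a minimal sketch of class weighting and macro-F1 in plain Python. It is not taken from the paper: the function names and toy labels are hypothetical. The weighting uses the common inverse-frequency heuristic (the same formula as scikit-learn's "balanced" class weights), which is an assumption about how class weighting might be set up, and the macro-F1 shown is the standard unweighted mean of per-class F1, which may differ from the task's official scoring script.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * k) for cls, k in counts.items()}


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for cls in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never correct scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels: one frequent class and one rare class.
train_labels = ["biological"] * 8 + ["migration"] * 2
weights = inverse_frequency_weights(train_labels)

# A classifier that always predicts the frequent class.
y_true = ["biological", "biological", "biological", "migration"]
y_pred = ["biological"] * 4
score = macro_f1(y_true, y_pred)
```

In this toy setup the rare class receives a weight four times that of the frequent class, and a classifier that ignores the rare class scores well below its 75% accuracy on macro-F1, which is why the metric is sensitive to class imbalance.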