Overcoming Data Scarcity in Named Entity Recognition: Synthetic Data Generation with Large Language Models

An Dao, Hiroki Teranishi, Yuji Matsumoto, Florian Boudin, Akiko Aizawa


Abstract
Named Entity Recognition (NER) is crucial for extracting domain-specific entities from text, particularly in biomedical and chemical fields. Developing high-quality NER models in specialized domains is challenging due to the limited availability of annotated data, with manual annotation being a key method of data construction. However, manual annotation is time-consuming and requires domain expertise, making it difficult in specialized domains. Traditional data augmentation (DA) techniques also rely on annotated data to some extent, further limiting their effectiveness. In this paper, we propose a novel approach to synthetic data generation for NER using large language models (LLMs) to generate sentences based solely on a set of example entities. This method simplifies the augmentation process and is effective even with a limited set of entities. We evaluate our approach using BERT-based models on the BC4CHEMD, BC5CDR, and TDMSci datasets, demonstrating that synthetic data significantly improves model performance and robustness, particularly in low-resource settings. This work provides a scalable solution for enhancing NER in specialized domains, overcoming the limitations of manual annotation and traditional augmentation methods.
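As a rough illustration of the idea in the abstract (sentences generated from a set of example entities, then used as NER training data), the sketch below shows one way such a pipeline could be wired up. It is not the authors' implementation: the prompt wording, the function names `build_prompt` and `bio_tag`, and the exact-match BIO labeling step are all assumptions made for this example, and the call to an actual LLM is deliberately left out.

```python
import re
from typing import List, Tuple

TOKEN_RE = r"\w+|[^\w\s]"  # words and single punctuation marks


def build_prompt(entities: List[str], entity_type: str) -> str:
    """Compose a plain-text instruction asking for one sentence per entity.

    The wording is hypothetical; the paper's actual prompts may differ.
    """
    lines = [f"Write one sentence for each {entity_type} entity below."]
    lines += [f"- {e}" for e in entities]
    return "\n".join(lines)


def bio_tag(sentence: str, entities: List[str], entity_type: str) -> List[Tuple[str, str]]:
    """Label tokens with BIO tags by exact, case-insensitive entity matching.

    This is an assumed labeling step for generated sentences, not a method
    confirmed by the source.
    """
    tokens = re.findall(TOKEN_RE, sentence)
    lowered = [t.lower() for t in tokens]
    tags = ["O"] * len(tokens)
    # Match longer entities first so multi-token mentions are not split.
    for ent in sorted(entities, key=len, reverse=True):
        ent_toks = [t.lower() for t in re.findall(TOKEN_RE, ent)]
        n = len(ent_toks)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            span_free = all(t == "O" for t in tags[i:i + n])
            if lowered[i:i + n] == ent_toks and span_free:
                tags[i] = f"B-{entity_type}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{entity_type}"
    return list(zip(tokens, tags))


if __name__ == "__main__":
    seeds = ["aspirin", "acetic acid"]
    print(build_prompt(seeds, "Chemical"))
    # A sentence such as an LLM might return for the prompt above (hand-written here).
    generated = "Aspirin inhibits cyclooxygenase, unlike acetic acid."
    for token, tag in bio_tag(generated, seeds, "Chemical"):
        print(token, tag)
```

Because the seed entities are known before generation, labels can be recovered by matching rather than by manual annotation, which is the property the abstract relies on. Exact matching is the simplest choice and would miss inflected or paraphrased mentions.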
Anthology ID:
2025.bionlp-1.28
Volume:
ACL 2025
Month:
August
Year:
2025
Address:
Vienna, Austria
Editors:
Dina Demner-Fushman, Sophia Ananiadou, Makoto Miwa, Junichi Tsujii
Venues:
BioNLP | WS
Publisher:
Association for Computational Linguistics
Pages:
328–340
URL:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.bionlp-1.28/
Cite (ACL):
An Dao, Hiroki Teranishi, Yuji Matsumoto, Florian Boudin, and Akiko Aizawa. 2025. Overcoming Data Scarcity in Named Entity Recognition: Synthetic Data Generation with Large Language Models. In ACL 2025, pages 328–340, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
Overcoming Data Scarcity in Named Entity Recognition: Synthetic Data Generation with Large Language Models (Dao et al., BioNLP 2025)
PDF:
https://preview.aclanthology.org/acl25-workshop-ingestion/2025.bionlp-1.28.pdf
Supplementary material:
2025.bionlp-1.28.SupplementaryMaterial.txt
2025.bionlp-1.28.SupplementaryMaterial.zip