Abstract
State-of-the-art Named Entity Recognition (NER) methods require high-quality labeled datasets. Issues such as scarcity of labeled data, under-representation of entities, and privacy concerns with using sensitive data for training can be significant barriers. Generating synthetic data to train models is a promising solution to mitigate these problems. We propose ECG-QALM, a contextual question-answering approach using pre-trained language models to synthetically generate entity-controlled text. Generated text is then used to augment small labeled datasets for downstream NER tasks. We evaluate our method on two publicly available datasets. We find ECG-QALM is capable of producing full text samples with desired entities appearing in a controllable way, while retaining sentence coherence closest to real-world data. Evaluations on NER tasks show significant improvements (75%–140%) in low-labeled data regimes.
- Anthology ID:
- 2023.findings-acl.349
- Volume:
- Findings of the Association for Computational Linguistics: ACL 2023
- Month:
- July
- Year:
- 2023
- Address:
- Toronto, Canada
- Editors:
- Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
- Venue:
- Findings
- Publisher:
- Association for Computational Linguistics
- Pages:
- 5649–5660
- URL:
- https://aclanthology.org/2023.findings-acl.349
- DOI:
- 10.18653/v1/2023.findings-acl.349
- Cite (ACL):
- Karan Aggarwal, Henry Jin, and Aitzaz Ahmad. 2023. ECG-QALM: Entity-Controlled Synthetic Text Generation using Contextual Q&A for NER. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5649–5660, Toronto, Canada. Association for Computational Linguistics.
- Cite (Informal):
- ECG-QALM: Entity-Controlled Synthetic Text Generation using Contextual Q&A for NER (Aggarwal et al., Findings 2023)
- PDF:
- https://preview.aclanthology.org/dois-2013-emnlp/2023.findings-acl.349.pdf