ECG-QALM: Entity-Controlled Synthetic Text Generation using Contextual Q&A for NER

Karan Aggarwal, Henry Jin, Aitzaz Ahmad


Abstract
State-of-the-art Named Entity Recognition (NER) methods require high-quality labeled datasets. Issues such as scarcity of labeled data, under-representation of entities, and privacy concerns with using sensitive data for training can be significant barriers. Generating synthetic data to train models is a promising way to mitigate these problems. We propose ECG-QALM, a contextual question-answering approach that uses pre-trained language models to synthetically generate entity-controlled text. The generated text is then used to augment small labeled datasets for downstream NER tasks. We evaluate our method on two publicly available datasets. We find that ECG-QALM is capable of producing full text samples in which the desired entities appear in a controllable way, while retaining sentence coherence closest to that of real-world data. Evaluations on NER tasks show significant improvements (75%–140%) in low-labeled data regimes.
Anthology ID:
2023.findings-acl.349
Volume:
Findings of the Association for Computational Linguistics: ACL 2023
Month:
July
Year:
2023
Address:
Toronto, Canada
Editors:
Anna Rogers, Jordan Boyd-Graber, Naoaki Okazaki
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
5649–5660
URL:
https://aclanthology.org/2023.findings-acl.349
DOI:
10.18653/v1/2023.findings-acl.349
Cite (ACL):
Karan Aggarwal, Henry Jin, and Aitzaz Ahmad. 2023. ECG-QALM: Entity-Controlled Synthetic Text Generation using Contextual Q&A for NER. In Findings of the Association for Computational Linguistics: ACL 2023, pages 5649–5660, Toronto, Canada. Association for Computational Linguistics.
Cite (Informal):
ECG-QALM: Entity-Controlled Synthetic Text Generation using Contextual Q&A for NER (Aggarwal et al., Findings 2023)
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/2023.findings-acl.349.pdf