Entity Extraction in Low Resource Domains with Selective Pre-training of Large Language Models

Aniruddha Mahapatra, Sharmila Reddy Nangi, Aparna Garimella, Anandhavelu N


Abstract
Transformer-based language models trained on large natural language corpora have been very useful in downstream entity extraction tasks. However, they often perform poorly when applied to domains different from those they were pretrained on. Continued pretraining using unlabeled data from target domains can help improve the performance of these language models on downstream tasks. However, using all of the available unlabeled data for pretraining can be time-intensive; it can also hurt downstream performance if the unlabeled data is not aligned with the data distribution of the target tasks. Previous works employed external supervision in the form of ontologies to select appropriate data samples for pretraining, but such supervision can be quite hard to obtain in low-resource domains. In this paper, we introduce effective ways to select data from unlabeled corpora of target domains for language model pretraining to improve performance on target entity extraction tasks. Our data selection strategies do not require any external supervision. We conduct extensive experiments on the task of named entity recognition (NER) across seven different domains and show that language models pretrained on target-domain unlabeled data obtained using our data selection strategies outperform those pretrained on data selected by previous strategies that rely on external supervision. We also show that language models pretrained using our data selection strategies outperform those pretrained on all of the available unlabeled target-domain data.
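The abstract does not spell out the specific selection strategies, so the sketch below is only an illustration of the general setup it describes: ranking unlabeled target-domain sentences by their similarity to the small labeled NER training set, without any external supervision, and keeping the top-scoring subset for continued pretraining. The TF-IDF similarity heuristic and the function name are assumptions for illustration, not the authors' method.

```python
# Illustrative sketch only: one generic, supervision-free data-selection heuristic.
# Unlabeled target-domain sentences are scored by TF-IDF cosine similarity to the
# labeled NER training sentences, and the top_k are kept for continued pretraining.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def select_pretraining_sentences(labeled_sentences, unlabeled_sentences, top_k=10000):
    """Return the top_k unlabeled sentences most similar to the labeled set."""
    vectorizer = TfidfVectorizer(lowercase=True, min_df=2)
    # Fit the vocabulary on both pools so the vectors share the same space.
    vectorizer.fit(labeled_sentences + unlabeled_sentences)
    labeled_vecs = vectorizer.transform(labeled_sentences)
    unlabeled_vecs = vectorizer.transform(unlabeled_sentences)

    # Score each unlabeled sentence by its maximum similarity to any labeled sentence.
    sims = cosine_similarity(unlabeled_vecs, labeled_vecs)
    scores = sims.max(axis=1)

    top_idx = np.argsort(-scores)[: min(top_k, len(unlabeled_sentences))]
    return [unlabeled_sentences[i] for i in top_idx]


# Usage (hypothetical variable names): the selected subset would then be used for
# continued masked-LM pretraining before fine-tuning on the downstream NER task.
# selected = select_pretraining_sentences(ner_train_sentences, domain_corpus)
```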
Anthology ID:
2022.emnlp-main.61
Volume:
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
942–951
URL:
https://aclanthology.org/2022.emnlp-main.61
DOI:
10.18653/v1/2022.emnlp-main.61
Cite (ACL):
Aniruddha Mahapatra, Sharmila Reddy Nangi, Aparna Garimella, and Anandhavelu N. 2022. Entity Extraction in Low Resource Domains with Selective Pre-training of Large Language Models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 942–951, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Entity Extraction in Low Resource Domains with Selective Pre-training of Large Language Models (Mahapatra et al., EMNLP 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2022.emnlp-main.61.pdf