@inproceedings{boulanger-lavergne-and-sophie-rosset-2022-generating,
    title = "Generating unlabelled data for a tri-training approach in a low resourced {NER} task",
    author = "Boulanger, Hugo  and
      Lavergne, Thomas  and
      Rosset, Sophie",
    editor = "Cherry, Colin  and
      Fan, Angela  and
      Foster, George  and
      Haffari, Gholamreza (Reza)  and
      Khadivi, Shahram  and
      Peng, Nanyun (Violet)  and
      Ren, Xiang  and
      Shareghi, Ehsan  and
      Swayamdipta, Swabha",
    booktitle = "Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing",
    month = jul,
    year = "2022",
    address = "Hybrid",
    publisher = "Association for Computational Linguistics",
    url = "https://preview.aclanthology.org/ingest-emnlp/2022.deeplo-1.4/",
    doi = "10.18653/v1/2022.deeplo-1.4",
    pages = "30--37",
    abstract = "Training a tagger for Named Entity Recognition (NER) requires a substantial amount of labeled data in the task domain. Manual labeling is a tedious and complicated task. Semisupervised learning methods can reduce the quantity of labeled data necessary to train a model. However, these methods require large quantities of unlabeled data, which remains an issue in many cases.We address this problem by generating unlabeled data. Large language models have proven to be powerful tools for text generation. We use their generative capacity to produce new sentences and variations of the sentences of our available data. This generation method, combined with a semi-supervised method, is evaluated on CoNLL and I2B2. We prepare both of these corpora to simulate a low resource setting. We obtain significant improvements for semisupervised learning with synthetic data against supervised learning on natural data."
}