A Named Entity Recognition Corpus for Vietnamese Biomedical Texts to Support Tuberculosis Treatment

Uyen Phan, Phuong N.V Nguyen, Nhung Nguyen


Abstract
Named Entity Recognition (NER) is an important task in information extraction. However, due to the lack of labelled corpora, biomedical NER has scarcely been studied in Vietnamese compared to English. To address this situation, we have constructed VietBioNER, a labelled NER corpus of Vietnamese academic biomedical text. The corpus focuses specifically on supporting tuberculosis surveillance, and was constructed by collecting scientific papers and grey literature related to tuberculosis symptoms and diagnostics. We manually annotated a small set of the collected documents with five categories of named entities: Organisation, Location, Date and Time, Symptom and Disease, and Diagnostic Procedure. Inter-annotator agreement ranges from 70.59% to 95.89% F-score, depending on the entity category. In this paper, we make available two splits of the corpus, corresponding to traditional supervised learning and few-shot learning settings. We also provide baseline results for both of these settings, in addition to a dictionary-based approach, as a means to stimulate further research into Vietnamese biomedical NER. Although supervised methods produce results that are far superior to the other two approaches, the fact that even one-shot learning can outperform the dictionary-based method provides evidence that further research into few-shot learning on this text type would be worthwhile.
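The abstract reports inter-annotator agreement as a per-category F-score. As a rough illustration of how such a figure can be computed, the sketch below treats one annotator as the reference and the other as the prediction, and counts a span as agreed only when its boundaries and category match exactly. The exact-match criterion, the `(start, end, category)` tuple layout, the label strings, and the sample spans are all assumptions made for this example; the abstract does not state the matching scheme the authors used.

```python
from collections import defaultdict


def agreement_f1(ann_a, ann_b):
    """Per-category F-score between two annotators.

    Each annotation is a (start, end, category) tuple. Annotator A is
    treated as the reference and annotator B as the prediction; a span
    counts as agreed only on an exact boundary and category match
    (an assumption, not necessarily the paper's criterion).
    """
    spans_a = defaultdict(set)
    spans_b = defaultdict(set)
    for start, end, cat in ann_a:
        spans_a[cat].add((start, end))
    for start, end, cat in ann_b:
        spans_b[cat].add((start, end))

    scores = {}
    for cat in sorted(set(spans_a) | set(spans_b)):
        matched = len(spans_a[cat] & spans_b[cat])
        precision = matched / len(spans_b[cat]) if spans_b[cat] else 0.0
        recall = matched / len(spans_a[cat]) if spans_a[cat] else 0.0
        denom = precision + recall
        scores[cat] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Hypothetical character-offset annotations from two annotators.
annotator_a = [
    (0, 8, "SYMPTOM_AND_DISEASE"),
    (15, 22, "SYMPTOM_AND_DISEASE"),
    (30, 41, "LOCATION"),
    (50, 60, "DIAGNOSTIC_PROCEDURE"),
]
annotator_b = [
    (0, 8, "SYMPTOM_AND_DISEASE"),
    (15, 24, "SYMPTOM_AND_DISEASE"),  # boundary disagreement
    (30, 41, "LOCATION"),
]

scores = agreement_f1(annotator_a, annotator_b)
for category, f1 in scores.items():
    print(f"{category}: {f1:.2%}")
```

Because the F-score is symmetric in precision and recall, swapping which annotator is treated as the reference leaves the result unchanged.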
Anthology ID:
2022.lrec-1.385
Volume:
Proceedings of the Thirteenth Language Resources and Evaluation Conference
Month:
June
Year:
2022
Address:
Marseille, France
Editors:
Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Jan Odijk, Stelios Piperidis
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3601–3609
URL:
https://aclanthology.org/2022.lrec-1.385
Cite (ACL):
Uyen Phan, Phuong N.V Nguyen, and Nhung Nguyen. 2022. A Named Entity Recognition Corpus for Vietnamese Biomedical Texts to Support Tuberculosis Treatment. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 3601–3609, Marseille, France. European Language Resources Association.
Cite (Informal):
A Named Entity Recognition Corpus for Vietnamese Biomedical Texts to Support Tuberculosis Treatment (Phan et al., LREC 2022)
PDF:
https://preview.aclanthology.org/naacl-24-ws-corrections/2022.lrec-1.385.pdf
Code:
ptpuyen1511/vietbioner