Nhung Nguyen


A Named Entity Recognition Corpus for Vietnamese Biomedical Texts to Support Tuberculosis Treatment
Uyen Phan | Phuong N.V Nguyen | Nhung Nguyen
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Named Entity Recognition (NER) is an important task in information extraction. However, due to the lack of labelled corpora, biomedical NER has scarcely been studied in Vietnamese compared to English. To address this situation, we have constructed VietBioNER, a labelled NER corpus of Vietnamese academic biomedical text. The corpus focuses specifically on supporting tuberculosis surveillance, and was constructed by collecting scientific papers and grey literature related to tuberculosis symptoms and diagnostics. We manually annotated a small set of the collected documents with five categories of named entities: Organisation, Location, Date and Time, Symptom and Disease, and Diagnostic Procedure. Inter-annotator agreement ranges from 70.59% to 95.89% F-score, depending on the entity category. In this paper, we make available two splits of the corpus, corresponding to traditional supervised learning and few-shot learning settings. We also provide baseline results for both of these settings, in addition to a dictionary-based approach, as a means to stimulate further research into Vietnamese biomedical NER. Although supervised methods produce results that are far superior to the other two approaches, the fact that even one-shot learning can outperform the dictionary-based method provides evidence that further research into few-shot learning on this text type would be worthwhile.
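
The abstract above mentions a dictionary-based baseline and reports agreement as an F-score. As a rough illustration only, a greedy longest-match dictionary tagger and an exact-match span F-score could be sketched as follows. The function names, the label strings, the toy lexicon, and the `max_len` parameter are assumptions made for this sketch; the abstract does not describe how the authors' baseline or agreement computation is implemented.

```python
from typing import Dict, List, Sequence, Tuple

# A span is (start_token, end_token_exclusive, label).
Span = Tuple[int, int, str]


def dictionary_tag(
    tokens: Sequence[str],
    lexicon: Dict[Tuple[str, ...], str],
    max_len: int = 5,
) -> List[Span]:
    """Greedy left-to-right longest-match lookup against a lowercased lexicon."""
    spans: List[Span] = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + n])
            if key in lexicon:
                match = (i, i + n, lexicon[key])
                break
        if match is not None:
            spans.append(match)
            i = match[1]  # continue after the matched phrase
        else:
            i += 1
    return spans


def span_f1(gold: Sequence[Span], pred: Sequence[Span]) -> float:
    """Exact-match F-score: a span counts only if boundaries and label agree."""
    gold_set, pred_set = set(gold), set(pred)
    true_pos = len(gold_set & pred_set)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical, not taken from VietBioNER).
example_tokens = ["bệnh", "nhân", "ho", "ra", "máu", "tại", "Hà", "Nội"]
example_lexicon = {
    ("ho",): "SYMPTOM_AND_DISEASE",
    ("ho", "ra", "máu"): "SYMPTOM_AND_DISEASE",
    ("hà", "nội"): "LOCATION",
}
predicted = dictionary_tag(example_tokens, example_lexicon)
print(predicted)
```

Because the longest candidate is tried first, "ho ra máu" is tagged as a single span rather than as "ho" alone. The same `span_f1` function can compare two annotators' spans by treating one of them as the reference.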

Simple Semantic-based Data Augmentation for Named Entity Recognition in Biomedical Texts
Uyen Phan | Nhung Nguyen
Proceedings of the 21st Workshop on Biomedical Language Processing

Data augmentation is important in addressing data sparsity and low resources in NLP. Unlike data augmentation for other tasks, such as sentence-level and sentence-pair ones, data augmentation for named entity recognition (NER) requires preserving the semantics of entities. To that end, in this paper we propose a simple semantic-based data augmentation method for biomedical NER. Our method leverages semantic information from pre-trained language models at both the entity level and the sentence level. Experimental results on two datasets, i2b2-2010 (English) and VietBioNER (Vietnamese), show that the proposed method can improve NER performance.

UoM&MMU at TSAR-2022 Shared Task: Prompt Learning for Lexical Simplification
Laura Vásquez-Rodríguez | Nhung Nguyen | Matthew Shardlow | Sophia Ananiadou
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

We present PromptLS, a method for fine-tuning large pre-trained Language Models (LM) to perform the task of Lexical Simplification. We use a predefined template to attain appropriate replacements for a term, and fine-tune an LM using this template on language-specific datasets. We filter candidate lists in post-processing to improve accuracy. We demonstrate that our model can work in a) a zero-shot setting (where we only require a pre-trained LM), b) a fine-tuned setting (where language-specific data is required), and c) a multilingual setting (where the model is pre-trained across multiple languages and fine-tuned in a specific language). Experimental results show that, although the zero-shot setting is competitive, its performance is still far from that of the fine-tuned setting. Also, the multilingual model is unsurprisingly worse than the fine-tuned model. Among all TSAR-2022 Shared Task participants, our team was ranked second in Spanish and third in English.
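
To make the template-plus-filtering idea in the abstract above concrete, here is a minimal sketch. The template wording, the `<mask>` token, and the filtering rules (drop the original word, duplicates, and non-alphabetic strings) are all assumptions chosen for illustration; the abstract does not state PromptLS's actual template or filters, and the step that queries a masked language model for candidates is deliberately left out.

```python
from typing import Iterable, List


def build_prompt(sentence: str, complex_word: str, mask_token: str = "<mask>") -> str:
    """Append a hypothetical cloze-style template to the original sentence."""
    return f"{sentence} A simpler word for {complex_word} is {mask_token}."


def filter_candidates(
    candidates: Iterable[str], complex_word: str, top_k: int = 10
) -> List[str]:
    """Post-process model candidates, keeping their original ranking order."""
    target = complex_word.lower()
    seen = set()
    kept: List[str] = []
    for cand in candidates:
        norm = cand.strip().lower()
        # Drop empty strings, the complex word itself, and repeats.
        if not norm or norm == target or norm in seen:
            continue
        # Drop sub-word pieces and anything containing non-letters.
        if not all(part.isalpha() for part in norm.split()):
            continue
        seen.add(norm)
        kept.append(norm)
    return kept[:top_k]


prompt = build_prompt("Attendance is compulsory.", "compulsory")
# Hypothetical raw candidates, as a masked LM might return them.
raw = ["Compulsory", "mandatory", "required", " required ", "##ed", "needed"]
substitutes = filter_candidates(raw, "compulsory")
print(prompt)
print(substitutes)
```

In a zero-shot setting the prompt would be passed to a pre-trained masked LM as is; in a fine-tuned setting the same template would be used to build training examples.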


Coreference Resolution in Full Text Articles with BERT and Syntax-based Mention Filtering
Hai-Long Trieu | Anh-Khoa Duong Nguyen | Nhung Nguyen | Makoto Miwa | Hiroya Takamura | Sophia Ananiadou
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

This paper describes our system developed for the coreference resolution task of the CRAFT Shared Tasks 2019. The CRAFT corpus is more challenging than other existing corpora because it contains full text articles. We have employed an existing span-based state-of-the-art neural coreference resolution system as a baseline system. We enhance the system with two different techniques to capture long-distance coreferent pairs. Firstly, we filter noisy mentions based on parse trees while increasing the number of antecedent candidates. Secondly, instead of relying on the LSTMs, we integrate the highly expressive language model BERT into our model. Experimental results show that our proposed systems significantly outperform the baseline. The best performing system obtained F-scores of 44%, 48%, 39%, 49%, 40%, and 57% on the test set with B3, BLANC, CEAFE, CEAFM, LEA, and MUC metrics, respectively. Additionally, the proposed model is able to detect coreferent pairs over long distances, even with a distance of more than 200 sentences.


A New Corpus to Support Text Mining for the Curation of Metabolites in the ChEBI Database
Matthew Shardlow | Nhung Nguyen | Gareth Owen | Claire O’Donovan | Andrew Leach | John McNaught | Steve Turner | Sophia Ananiadou
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

An Empirical Investigation of Error Types in Vietnamese Parsing
Quy Nguyen | Yusuke Miyao | Hiroshi Noji | Nhung Nguyen
Proceedings of the 27th International Conference on Computational Linguistics

Syntactic parsing plays a crucial role in improving the quality of natural language processing tasks. Although there have been several research projects on syntactic parsing in Vietnamese, the parsing quality has been far inferior to that reported for major languages, such as English and Chinese. In this work, we evaluated representative constituency parsing models on a Vietnamese Treebank to look for the most suitable parsing method for Vietnamese. We then combined the advantages of automatic and manual analysis to investigate errors produced by the evaluated parsers and find the reasons for them. Our analysis focused on three possible sources of parsing errors, namely limited training data, part-of-speech (POS) tagging errors, and ambiguous constructions. As a result, we found that the last two sources, which frequently appear in Vietnamese text, contributed significantly to the poor performance of Vietnamese parsing.


Proactive Learning for Named Entity Recognition
Maolin Li | Nhung Nguyen | Sophia Ananiadou
BioNLP 2017

The goal of active learning is to minimise the cost of producing an annotated dataset, in which annotators are assumed to be perfect, i.e., they always choose the correct labels. However, in practice, annotators are not infallible, and they are likely to assign incorrect labels to some instances. Proactive learning is a generalisation of active learning that can model different kinds of annotators. Although proactive learning has been applied to certain labelling tasks, such as text classification, there is little work on its application to named entity (NE) tagging. In this paper, we propose a proactive learning method for producing NE annotated corpora, using two annotators with different levels of expertise, and who charge different amounts based on their levels of experience. To optimise both cost and annotation quality, we also propose a mechanism to present multiple sentences to annotators at each iteration. Experimental results for several corpora show that our method facilitates the construction of high-quality NE labelled datasets at minimal cost.
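
One ingredient of the setting described above, choosing between a cheaper and a more expensive annotator for each sentence in a batch, might be sketched as follows. This is a simplified illustration and not the paper's method: the `Annotator` fields, the linear `p_correct` reliability model, the `min_p_correct` threshold, and the use of a single model-uncertainty score per sentence are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Annotator:
    name: str
    cost: float         # price charged per sentence
    reliability: float  # assumed P(correct label) on the hardest sentences


def p_correct(annotator: Annotator, uncertainty: float) -> float:
    """Assumed model: everyone is right on easy sentences (uncertainty 0);
    accuracy falls linearly towards `reliability` as uncertainty approaches 1."""
    return (1.0 - uncertainty) + uncertainty * annotator.reliability


def choose_annotator(
    uncertainty: float, annotators: Sequence[Annotator], min_p_correct: float = 0.9
) -> Annotator:
    """Pick the cheapest annotator expected to be accurate enough;
    fall back to the most accurate one if nobody reaches the threshold."""
    adequate = [a for a in annotators if p_correct(a, uncertainty) >= min_p_correct]
    if adequate:
        return min(adequate, key=lambda a: a.cost)
    return max(annotators, key=lambda a: p_correct(a, uncertainty))


def select_batch(
    uncertainties: Sequence[float], annotators: Sequence[Annotator], batch_size: int
) -> List[Tuple[int, str]]:
    """Select the `batch_size` most uncertain sentences and assign each an annotator."""
    ranked = sorted(
        range(len(uncertainties)), key=lambda i: uncertainties[i], reverse=True
    )[:batch_size]
    return [(i, choose_annotator(uncertainties[i], annotators).name) for i in ranked]


# Hypothetical annotators and per-sentence model uncertainties.
pool = [Annotator("expert", cost=3.0, reliability=0.95),
        Annotator("novice", cost=1.0, reliability=0.6)]
assignments = select_batch([0.2, 0.8, 0.5, 0.1], pool, batch_size=3)
print(assignments)
```

Under these assumptions, hard sentences are routed to the expert and easier ones to the cheaper annotator, which is the cost and quality trade-off the abstract refers to.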