2023
Comparative evaluation of boundary-relaxed annotation for Entity Linking performance
Gabriel Herman Bernardim Andrade | Shuntaro Yada | Eiji Aramaki
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Entity Linking performance relies strongly on the availability of a large quantity of high-quality annotated training data. Yet, manual annotation of named entities, especially their boundaries, is ambiguous, error-prone, and raises many inconsistencies between annotators. While imprecise boundary annotation can degrade a model’s performance, there are applications where accurate extraction of entities’ surface form is not necessary. For those cases, a lenient annotation guideline could relieve the annotators’ workload and speed up the process. This paper presents a case study designed to verify the feasibility of such an annotation process and evaluate the impact of boundary-relaxed annotation in an Entity Linking pipeline. We first generate a set of noisy versions of the widely used AIDA CoNLL-YAGO dataset by expanding the boundaries of subsets of annotated entity mentions. We then train three Entity Linking models on this data and evaluate the relative impact of imprecise annotation on entity recognition and disambiguation performance. We demonstrate that the magnitude of effects caused by noise in the Named Entity Recognition phase is dependent on both model complexity and noise ratio, while Entity Disambiguation components are susceptible to entity boundary imprecision due to strong vocabulary dependency.
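The boundary-expansion step described in this abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the function name `relax_boundaries`, the parameters `noise_ratio` and `max_expand`, and the rule of clamping expansions so that mentions never overlap are all assumptions made for illustration. Spans are token offsets with an exclusive end.

```python
import random
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def relax_boundaries(
    spans: List[Span],
    n_tokens: int,
    noise_ratio: float,
    max_expand: int = 1,   # hypothetical cap on tokens added per side
    seed: int = 0,
) -> List[Span]:
    """Expand the boundaries of a random subset of mention spans.

    Assumes `spans` is sorted and non-overlapping. A chosen span may stay
    unchanged when there is no free token on either side of it.
    """
    rng = random.Random(seed)
    n_noisy = round(len(spans) * noise_ratio)
    chosen = set(rng.sample(range(len(spans)), n_noisy))

    relaxed: List[Span] = []
    for i, (start, end) in enumerate(spans):
        if i in chosen:
            # Never grow into the previous (already relaxed) or next mention.
            lower = relaxed[-1][1] if relaxed else 0
            upper = spans[i + 1][0] if i + 1 < len(spans) else n_tokens
            start = max(lower, start - rng.randint(0, max_expand))
            end = min(upper, end + rng.randint(0, max_expand))
        relaxed.append((start, end))
    return relaxed
```

Calling `relax_boundaries(spans, n_tokens, noise_ratio=0.5)` would perturb about half of the mentions in a sentence; sweeping `noise_ratio` gives a family of progressively noisier training sets in the spirit of the study.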
2022
PICO Corpus: A Publicly Available Corpus to Support Automatic Data Extraction from Biomedical Literature
Faith Mutinda | Kongmeng Liew | Shuntaro Yada | Shoko Wakamiya | Eiji Aramaki
Proceedings of the first Workshop on Information Extraction from Scientific Publications
We present a publicly available corpus with detailed annotations describing the core elements of clinical trials: Participants, Intervention, Control, and Outcomes. The corpus consists of 1011 abstracts of breast cancer randomized controlled trials extracted from the PubMed database. The corpus improves on previous corpora by providing detailed annotations for outcomes to identify numeric texts that report the number of participants that experience specific outcomes. The corpus will be helpful for the development of systems for automatic extraction of data from randomized controlled trial literature to support evidence-based medicine. Additionally, we demonstrate the feasibility of the corpus by using two strong baselines for the named entity recognition task. Most of the entities achieve F1 scores greater than 0.80, demonstrating the quality of the dataset.
JaMIE: A Pipeline Japanese Medical Information Extraction System with Novel Relation Annotation
Fei Cheng | Shuntaro Yada | Ribeka Tanaka | Eiji Aramaki | Sadao Kurohashi
Proceedings of the Thirteenth Language Resources and Evaluation Conference
In the field of Japanese medical information extraction, few analysis tools are available and relation extraction is still an under-explored topic. In this paper, we first propose a novel relation annotation schema for investigating the medical and temporal relations between medical entities in Japanese medical reports. We experiment with practical annotation scenarios by separately annotating two different types of reports. We design a pipeline system with three components for recognizing medical entities, classifying entity modalities, and extracting relations. The empirical results show accurate analysis performance and suggest satisfactory annotation quality, the superiority of the latest contextual embedding models, and a feasible annotation strategy for high-accuracy demand.
2021
End-to-end Biomedical Entity Linking with Span-based Dictionary Matching
Shogo Ujiie | Hayate Iso | Shuntaro Yada | Shoko Wakamiya | Eiji Aramaki
Proceedings of the 20th Workshop on Biomedical Language Processing
Disease name recognition and normalization is a fundamental process in biomedical text mining. Recently, neural joint learning of both tasks has been proposed to utilize the mutual benefits. While this approach achieves high performance, disease concepts that do not appear in the training dataset cannot be accurately predicted. This study introduces a novel end-to-end approach that combines span representations with dictionary-matching features to address this problem. Our model handles unseen concepts by referring to a dictionary while maintaining the performance of neural network-based models. Experiments using two major datasets demonstrate that our model achieved competitive results with strong baselines, especially for concepts unseen during training.
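As a rough illustration of the dictionary-matching side of this idea, the sketch below enumerates candidate spans and looks each one up in a concept dictionary. It covers only the lookup feature, not the neural span representations, and it is not the authors' model: the function name `span_dictionary_features`, the `max_len` limit, the lowercased exact-match rule, and the concept identifier used in the example are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (start, end, concept_id); end is exclusive, concept_id is None when no match
SpanFeature = Tuple[int, int, Optional[str]]


def span_dictionary_features(
    tokens: List[str],
    dictionary: Dict[str, str],  # lowercased surface form -> concept ID
    max_len: int = 3,            # hypothetical cap on span length in tokens
) -> List[SpanFeature]:
    """Enumerate all spans up to `max_len` tokens and attach a dictionary hit."""
    features: List[SpanFeature] = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            surface = " ".join(tokens[start:end]).lower()
            features.append((start, end, dictionary.get(surface)))
    return features


# Example with a made-up concept identifier.
example = span_dictionary_features(
    ["Patients", "with", "breast", "cancer"],
    {"breast cancer": "CONCEPT:0001"},
)
```

In a full system, a feature like the third tuple element would be combined with a learned span representation, so that a dictionary hit can support a concept that never occurred in the training data.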
2020
Offensive Language Detection on Video Live Streaming Chat
Zhiwei Gao | Shuntaro Yada | Shoko Wakamiya | Eiji Aramaki
Proceedings of the 28th International Conference on Computational Linguistics
This paper presents a prototype of a chat room that detects offensive expressions in a video live streaming chat in real time. Focusing on Twitch, one of the most popular live streaming platforms, we created a dataset for the task of detecting offensive expressions. We collected 2,000 chat posts across four popular game titles with genre diversity (e.g., competitive, violent, peaceful). To make use of the similarity in offensive expressions among different social media platforms, we adopted state-of-the-art models trained on offensive expressions from Twitter for our Twitch data (i.e., transfer learning). We investigated two similarity measurements to predict the transferability: textual similarity and game-genre similarity. Our results show that the transfer of features from social media to live streaming is effective. However, the two measurements show less correlation in the transferability prediction.
Towards a Versatile Medical-Annotation Guideline Feasible Without Heavy Medical Knowledge: Starting From Critical Lung Diseases
Shuntaro Yada | Ayami Joh | Ribeka Tanaka | Fei Cheng | Eiji Aramaki | Sadao Kurohashi
Proceedings of the Twelfth Language Resources and Evaluation Conference
Applying natural language processing (NLP) to medical and clinical texts can bring important social benefits by mining valuable information from unstructured text. A popular application for that purpose is named entity recognition (NER), but the annotation policies of existing clinical corpora have not been standardized across clinical texts of different types. This paper presents an annotation guideline aimed at covering medical documents of various types such as radiography interpretation reports and medical records. Furthermore, the annotation was designed to avoid burdensome requirements related to medical knowledge, thereby enabling corpus development without medical specialists. To achieve these design features, we specifically focus on critical lung diseases to stabilize linguistic patterns in corpora. After annotating around 1100 electronic medical records following the annotation scheme, we demonstrated its feasibility using an NER task. Results suggest that our guideline is applicable to large-scale clinical NLP projects.