Camilo Thorne


Stress Test Evaluation of Biomedical Word Embeddings
Vladimir Araujo | Andrés Carvallo | Carlos Aspillaga | Camilo Thorne | Denis Parra
Proceedings of the 20th Workshop on Biomedical Language Processing

The success of pretrained word embeddings has motivated their use in the biomedical domain, with contextualized embeddings yielding remarkable results in several biomedical NLP tasks. However, there is a lack of research on quantifying their behavior under severe “stress” scenarios. In this work, we systematically evaluate three language models with adversarial examples – automatically constructed tests that allow us to examine how robust the models are. We propose two types of stress scenarios focused on the biomedical named entity recognition (NER) task, one inspired by spelling errors and another based on the use of synonyms for medical terms. Our experiments on three benchmarks show that the performance of the original models decreases considerably, and they reveal the models' weaknesses and strengths. Finally, we show that adversarial training improves the models' robustness and, in some cases, even lets them exceed their original performance.


Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings
Zenan Zhai | Dat Quoc Nguyen | Saber Akhondi | Camilo Thorne | Christian Druckenbrodt | Trevor Cohn | Michelle Gregory | Karin Verspoor
Proceedings of the 18th BioNLP Workshop and Shared Task

Chemical patents are an important resource for chemical information. However, few chemical Named Entity Recognition (NER) systems have been evaluated on patent documents, due in part to their structural and linguistic complexity. In this paper, we explore the NER performance of a BiLSTM-CRF model utilising pre-trained word embeddings, character-level word representations and contextualized ELMo word representations for chemical patents. We compare word embeddings pre-trained on biomedical and chemical patent corpora. We also explore the effect of tokenizers optimized for the chemical domain on NER performance in chemical patents. The results on two patent corpora show that contextualized word representations generated from ELMo substantially improve chemical NER performance over the current state of the art. We also show that domain-specific resources, such as word embeddings trained on chemical patents and chemical-specific tokenizers, have a positive impact on NER performance.

Detecting Chemical Reactions in Patents
Hiyori Yoshikawa | Dat Quoc Nguyen | Zenan Zhai | Christian Druckenbrodt | Camilo Thorne | Saber A. Akhondi | Timothy Baldwin | Karin Verspoor
Proceedings of the 17th Annual Workshop of the Australasian Language Technology Association

Extracting chemical reactions from patents is a crucial task for chemists working on chemical exploration. In this paper we introduce the novel task of detecting the textual spans that describe or refer to chemical reactions within patents. We formulate this task as a paragraph-level sequence tagging problem, where the system is required to return a sequence of paragraphs which contain a description of a reaction. To address this new task, we construct an annotated dataset from an existing proprietary database of chemical reactions manually extracted from patents. We introduce several baseline methods for the task and evaluate them over our dataset. Through error analysis, we discuss what makes the task complex and challenging, and suggest possible directions for future research.


Towards Confidence Estimation for Typed Protein-Protein Relation Extraction
Camilo Thorne | Roman Klinger
Proceedings of the Biomedical NLP Workshop associated with RANLP 2017

Systems which build on top of information extraction are typically challenged to extract knowledge that, while correct, is not yet well-known. We hypothesize that a good confidence measure for relational information has the property that such interesting information is found between information extracted with very high confidence and very low confidence. We discuss confidence estimation for protein-protein relation discovery in biomedical literature. As facts reported in papers take some time to be validated and recorded in biomedical databases, this task gives rise to large quantities of unknown but potentially true candidate relations. It is thus important to rank them based on supporting evidence rather than discard them. In this paper, we discuss this task and propose different approaches for confidence estimation and a pipeline to evaluate such methods. We show that the most straightforward approach, a combination of different confidence measures from pipeline modules, does not seem to work well. We discuss this negative result and pinpoint potential future research directions.


Spanish NER with Word Representations and Conditional Random Fields
Jenny Linet Copara Zea | Jose Eduardo Ochoa Luna | Camilo Thorne | Goran Glavaš
Proceedings of the Sixth Named Entity Workshop


Semantic Complexity of Quantifiers and Their Distribution in Corpora
Jakub Szymanik | Camilo Thorne
Proceedings of the 11th International Conference on Computational Semantics


The VERICLIG Project: Extraction of Computer Interpretable Guidelines via Syntactic and Semantic Annotation
Camilo Thorne | Marco Montali | Diego Calvanese | Elena Cardillo | Claudio Eccher
Proceedings of the IWCS 2013 Workshop on Computational Semantics in Clinical Text (CSCT 2013)

Automated Activity Recognition in Clinical Documents
Camilo Thorne | Marco Montali | Diego Calvanese | Elena Cardillo | Claudio Eccher
Proceedings of the Sixth International Joint Conference on Natural Language Processing