Clinical Natural Language Processing Workshop (2020)


bib (full) Proceedings of the 3rd Clinical Natural Language Processing Workshop

pdf bib
Proceedings of the 3rd Clinical Natural Language Processing Workshop
Anna Rumshisky | Kirk Roberts | Steven Bethard | Tristan Naumann

pdf bib
Various Approaches for Predicting Stroke Prognosis using Magnetic Resonance Imaging Text Records
Tak-Sung Heo | Chulho Kim | Jeong-Myeong Choi | Yeong-Seok Jeong | Yu-Seop Kim

Stroke is one of the leading causes of death and disability worldwide. Stroke is treatable, but it is prone to disability after treatment and must be prevented. To grasp the degree of disability caused by stroke, we use magnetic resonance imaging text records to predict stroke and measure the performance according to the document-level and sentence-level representation. As a result of the experiment, the document-level representation shows better performance.

pdf bib
Multiple Sclerosis Severity Classification From Clinical Text
Alister D’Costa | Stefan Denkovski | Michal Malyska | Sae Young Moon | Brandon Rufino | Zhen Yang | Taylor Killian | Marzyeh Ghassemi

Multiple Sclerosis (MS) is a chronic, inflammatory and degenerative neurological disease, which is monitored by a specialist using the Expanded Disability Status Scale (EDSS) and recorded in unstructured text in the form of a neurology consult note. An EDSS measurement contains an overall ‘EDSS’ score and several functional subscores. Typically, expert knowledge is required to interpret consult notes and generate these scores. Previous approaches used limited context length Word2Vec embeddings and keyword searches to predict scores given a consult note, but often failed when scores were not explicitly stated. In this work, we present MS-BERT, the first publicly available transformer model trained on real clinical data other than MIMIC. Next, we present MSBC, a classifier that applies MS-BERT to generate embeddings and predict EDSS and functional subscores. Lastly, we explore combining MSBC with other models through the use of Snorkel to generate scores for unlabelled consult notes. MSBC achieves state-of-the-art performance on all metrics and prediction tasks and outperforms the models generated from the Snorkel ensemble. We improve Macro-F1 by 0.12 (to 0.88) for predicting EDSS and on average by 0.29 (to 0.63) for predicting functional subscores over previous Word2Vec CNN and rule-based approaches.

BERT-XML: Large Scale Automated ICD Coding Using BERT Pretraining
Zachariah Zhang | Jingshu Liu | Narges Razavian

ICD coding is the task of classifying and cod-ing all diagnoses, symptoms and proceduresassociated with a patient’s visit. The process isoften manual, extremely time-consuming andexpensive for hospitals as clinical interactionsare usually recorded in free text medical notes.In this paper, we propose a machine learningmodel, BERT-XML, for large scale automatedICD coding of EHR notes, utilizing recentlydeveloped unsupervised pretraining that haveachieved state of the art performance on a va-riety of NLP tasks. We train a BERT modelfrom scratch on EHR notes, learning with vo-cabulary better suited for EHR tasks and thusoutperform off-the-shelf models. We furtheradapt the BERT architecture for ICD codingwith multi-label attention. We demonstratethe effectiveness of BERT-based models on thelarge scale ICD code classification task usingmillions of EHR notes to predict thousands ofunique codes.

Incorporating Risk Factor Embeddings in Pre-trained Transformers Improves Sentiment Prediction in Psychiatric Discharge Summaries
Xiyu Ding | Mei-Hua Hall | Timothy Miller

Reducing rates of early hospital readmission has been recognized and identified as a key to improve quality of care and reduce costs. There are a number of risk factors that have been hypothesized to be important for understanding re-admission risk, including such factors as problems with substance abuse, ability to maintain work, relations with family. In this work, we develop Roberta-based models to predict the sentiment of sentences describing readmission risk factors in discharge summaries of patients with psychosis. We improve substantially on previous results by a scheme that shares information across risk factors while also allowing the model to learn risk factor-specific information.

Information Extraction from Swedish Medical Prescriptions with Sig-Transformer Encoder
John Pougué Biyong | Bo Wang | Terry Lyons | Alejo Nevado-Holgado

Relying on large pretrained language models such as Bidirectional Encoder Representations from Transformers (BERT) for encoding and adding a simple prediction layer has led to impressive performance in many clinical natural language processing (NLP) tasks. In this work, we present a novel extension to the Transformer architecture, by incorporating signature transform with the self-attention model. This architecture is added between embedding and prediction layers. Experiments on a new Swedish prescription data show the proposed architecture to be superior in two of the three information extraction tasks, comparing to baseline models. Finally, we evaluate two different embedding approaches between applying Multilingual BERT and translating the Swedish text to English then encode with a BERT model pretrained on clinical notes.

Evaluation of Transfer Learning for Adverse Drug Event (ADE) and Medication Entity Extraction
Sankaran Narayanan | Kaivalya Mannam | Sreeranga P Rajan | P Venkat Rangan

We evaluate several biomedical contextual embeddings (based on BERT, ELMo, and Flair) for the detection of medication entities such as Drugs and Adverse Drug Events (ADE) from Electronic Health Records (EHR) using the 2018 ADE and Medication Extraction (Track 2) n2c2 data-set. We identify best practices for transfer learning, such as language-model fine-tuning and scalar mix. Our transfer learning models achieve strong performance in the overall task (F1=92.91%) as well as in ADE identification (F1=53.08%). Flair-based embeddings out-perform in the identification of context-dependent entities such as ADE. BERT-based embeddings out-perform in recognizing clinical terminology such as Drug and Form entities. ELMo-based embeddings deliver competitive performance in all entities. We develop a sentence-augmentation method for enhanced ADE identification benefiting BERT-based and ELMo-based models by up to 3.13% in F1 gains. Finally, we show that a simple ensemble of these models out-paces most current methods in ADE extraction (F1=55.77%).

BioBERTpt - A Portuguese Neural Language Model for Clinical Named Entity Recognition
Elisa Terumi Rubel Schneider | João Vitor Andrioli de Souza | Julien Knafou | Lucas Emanuel Silva e Oliveira | Jenny Copara | Yohan Bonescki Gumiel | Lucas Ferro Antunes de Oliveira | Emerson Cabrera Paraiso | Douglas Teodoro | Cláudia Maria Cabral Moro Barra

With the growing number of electronic health record data, clinical NLP tasks have become increasingly relevant to unlock valuable information from unstructured clinical text. Although the performance of downstream NLP tasks, such as named-entity recognition (NER), in English corpus has recently improved by contextualised language models, less research is available for clinical texts in low resource languages. Our goal is to assess a deep contextual embedding model for Portuguese, so called BioBERTpt, to support clinical and biomedical NER. We transfer learned information encoded in a multilingual-BERT model to a corpora of clinical narratives and biomedical-scientific papers in Brazilian Portuguese. To evaluate the performance of BioBERTpt, we ran NER experiments on two annotated corpora containing clinical narratives and compared the results with existing BERT models. Our in-domain model outperformed the baseline model in F1-score by 2.72%, achieving higher performance in 11 out of 13 assessed entities. We demonstrate that enriching contextual embedding models with domain literature can play an important role in improving performance for specific NLP tasks. The transfer learning process enhanced the Portuguese biomedical NER model by reducing the necessity of labeled data and the demand for retraining a whole new model.

Dilated Convolutional Attention Network for Medical Code Assignment from Clinical Text
Shaoxiong Ji | Erik Cambria | Pekka Marttinen

Medical code assignment, which predicts medical codes from clinical texts, is a fundamental task of intelligent medical information systems. The emergence of deep models in natural language processing has boosted the development of automatic assignment methods. However, recent advanced neural architectures with flat convolutions or multi-channel feature concatenation ignore the sequential causal constraint within a text sequence and may not learn meaningful clinical text representations, especially for lengthy clinical notes with long-term sequential dependency. This paper proposes a Dilated Convolutional Attention Network (DCAN), integrating dilated convolutions, residual connections, and label attention, for medical code assignment. It adopts dilated convolutions to capture complex medical patterns with a receptive field which increases exponentially with dilation size. Experiments on a real-world clinical dataset empirically show that our model improves the state of the art.

Classification of Syncope Cases in Norwegian Medical Records
Ildiko Pilan | Pål H. Brekke | Fredrik A. Dahl | Tore Gundersen | Haldor Husby | Øystein Nytrø | Lilja Øvrelid

Loss of consciousness, so-called syncope, is a commonly occurring symptom associated with worse prognosis for a number of heart-related diseases. We present a comparison of methods for a diagnosis classification task in Norwegian clinical notes, targeting syncope, i.e. fainting cases. We find that an often neglected baseline with keyword matching constitutes a rather strong basis, but more advanced methods do offer some improvement in classification performance, especially a convolutional neural network model. The developed pipeline is planned to be used for quantifying unregistered syncope cases in Norway.

Comparison of Machine Learning Methods for Multi-label Classification of Nursing Education and Licensure Exam Questions
John Langton | Krishna Srihasam | Junlin Jiang

In this paper, we evaluate several machine learning methods for multi-label classification of text questions. Every nursing student in the United States must pass the National Council Licensure Examination (NCLEX) to begin professional practice. NCLEX defines a number of competencies on which students are evaluated. By labeling test questions with NCLEX competencies, we can score students according to their performance in each competency. This information helps instructors measure how prepared students are for the NCLEX, as well as which competencies they may need help with. A key challenge is that questions may be related to more than one competency. Labeling questions with NCLEX competencies, therefore, equates to a multi-label, text classification problem where each competency is a label. Here we present an evaluation of several methods to support this use case along with a proposed approach. While our work is grounded in the nursing education domain, the methods described here can be used for any multi-label, text classification use case.

Clinical XLNet: Modeling Sequential Clinical Notes and Predicting Prolonged Mechanical Ventilation
Kexin Huang | Abhishek Singh | Sitong Chen | Edward Moseley | Chih-Ying Deng | Naomi George | Charolotta Lindvall

Clinical notes contain rich information, which is relatively unexploited in predictive modeling compared to structured data. In this work, we developed a new clinical text representation Clinical XLNet that leverages the temporal information of the sequence of the notes. We evaluated our models on prolonged mechanical ventilation prediction problem and our experiments demonstrated that Clinical XLNet outperforms the best baselines consistently. The models and scripts are made publicly available.

Automatic recognition of abdominal lymph nodes from clinical text
Yifan Peng | Sungwon Lee | Daniel C. Elton | Thomas Shen | Yu-xing Tang | Qingyu Chen | Shuai Wang | Yingying Zhu | Ronald Summers | Zhiyong Lu

Lymph node status plays a pivotal role in the treatment of cancer. The extraction of lymph nodes from radiology text reports enables large-scale training of lymph node detection on MRI. In this work, we first propose an ontology of 41 types of abdominal lymph nodes with a hierarchical relationship. We then introduce an end-to-end approach based on the combination of rules and transformer-based methods to detect these abdominal lymph node mentions and classify their types from the MRI radiology reports. We demonstrate the superior performance of a model fine-tuned on MRI reports using BlueBERT, called MriBERT. We find that MriBERT outperforms the rule-based labeler (0.957 vs 0.644 in micro weighted F1-score) as well as other BERT-based variations (0.913 - 0.928). We make the code and MriBERT publicly available at, with the hope that this method can facilitate the development of medical report annotators to produce labels from scratch at scale.

How You Ask Matters: The Effect of Paraphrastic Questions to BERT Performance on a Clinical SQuAD Dataset
Sungrim (Riea) Moon | Jungwei Fan

Reading comprehension style question-answering (QA) based on patient-specific documents represents a growing area in clinical NLP with plentiful applications. Bidirectional Encoder Representations from Transformers (BERT) and its derivatives lead the state-of-the-art accuracy on the task, but most evaluation has treated the data as a pre-mixture without systematically looking into the potential effect of imperfect train/test questions. The current study seeks to address this gap by experimenting with full versus partial train/test data consisting of paraphrastic questions. Our key findings include 1) training with all pooled question variants yielded best accuracy, 2) the accuracy varied widely, from 0.74 to 0.80, when trained with each single question variant, and 3) questions of similar lexical/syntactic structure tended to induce identical answers. The results suggest that how you ask questions matters in BERT-based QA, especially at the training stage.

Relative and Incomplete Time Expression Anchoring for Clinical Text
Louise Dupuis | Nicol Bergou | Hegler Tissot | Sumithra Velupillai

Extracting and modeling temporal information in clinical text is an important element for developing timelines and disease trajectories. Time information in written text varies in preciseness and explicitness, posing challenges for NLP approaches that aim to accurately anchor temporal information on a timeline. Relative and incomplete time expressions (RI-Timexes) are expressions that require additional information for their temporal anchor to be resolved, but few studies have addressed this challenge specifically. In this study, we aimed to reproduce and verify a classification approach for identifying anchor dates and relations in clinical text, and propose a novel relation classification approach for this task.

MeDAL: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining
Zhi Wen | Xing Han Lu | Siva Reddy

One of the biggest challenges that prohibit the use of many current NLP methods in clinical settings is the availability of public datasets. In this work, we present MeDAL, a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. We pre-trained several models of common architectures on this dataset and empirically showed that such pre-training leads to improved performance and convergence speed when fine-tuning on downstream medical tasks.

Knowledge Grounded Conversational Symptom Detection with Graph Memory Networks
Hongyin Luo | Shang-Wen Li | James Glass

In this work, we propose a novel goal-oriented dialog task, automatic symptom detection. We build a system that can interact with patients through dialog to detect and collect clinical symptoms automatically, which can save a doctor’s time interviewing the patient. Given a set of explicit symptoms provided by the patient to initiate a dialog for diagnosing, the system is trained to collect implicit symptoms by asking questions, in order to collect more information for making an accurate diagnosis. After getting the reply from the patient for each question, the system also decides whether current information is enough for a human doctor to make a diagnosis. To achieve this goal, we propose two neural models and a training pipeline for the multi-step reasoning task. We also build a knowledge graph as additional inputs to further improve model performance. Experiments show that our model significantly outperforms the baseline by 4%, discovering 67% of implicit symptoms on average with a limited number of questions.

Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art
Patrick Lewis | Myle Ott | Jingfei Du | Veselin Stoyanov

A large array of pretrained models are available to the biomedical NLP (BioNLP) community. Finding the best model for a particular task can be difficult and time-consuming. For many applications in the biomedical and clinical domains, it is crucial that models can be built quickly and are highly accurate. We present a large-scale study across 18 established biomedical and clinical NLP tasks to determine which of several popular open-source biomedical and clinical NLP models work well in different settings. Furthermore, we apply recent advances in pretraining to train new biomedical language models, and carefully investigate the effect of various design choices on downstream performance. Our best models perform well in all of our benchmarks, and set new State-of-the-Art in 9 tasks. We release these models in the hope that they can help the community to speed up and increase the accuracy of BioNLP and text mining applications.

Assessment of DistilBERT performance on Named Entity Recognition task for the detection of Protected Health Information and medical concepts
Macarious Abadeer

Bidirectional Encoder Representations from Transformers (BERT) models achieve state-of-the-art performance on a number of Natural Language Processing tasks. However, their model size on disk often exceeds 1 GB and the process of fine-tuning them and using them to run inference consumes significant hardware resources and runtime. This makes them hard to deploy to production environments. This paper fine-tunes DistilBERT, a lightweight deep learning model, on medical text for the named entity recognition task of Protected Health Information (PHI) and medical concepts. This work provides a full assessment of the performance of DistilBERT in comparison with BERT models that were pre-trained on medical text. For Named Entity Recognition task of PHI, DistilBERT achieved almost the same results as medical versions of BERT in terms of F1 score at almost half the runtime and consuming approximately half the disk space. On the other hand, for the detection of medical concepts, DistilBERT’s F1 score was lower by 4 points on average than medical BERT variants.

Distinguishing between Dementia with Lewy bodies (DLB) and Alzheimer’s Disease (AD) using Mental Health Records: a Classification Approach
Zixu Wang | Julia Ive | Sinead Moylett | Christoph Mueller | Rudolf Cardinal | Sumithra Velupillai | John O’Brien | Robert Stewart

While Dementia with Lewy Bodies (DLB) is the second most common type of neurodegenerative dementia following Alzheimer’s Disease (AD), it is difficult to distinguish from AD. We propose a method for DLB detection by using mental health record (MHR) documents from a (3-month) period before a patient has been diagnosed with DLB or AD. Our objective is to develop a model that could be clinically useful to differentiate between DLB and AD across datasets from different healthcare institutions. We cast this as a classification task using Convolutional Neural Network (CNN), an efficient neural model for text classification. We experiment with different representation models, and explore the features that contribute to model performances. In addition, we apply temperature scaling, a simple but efficient model calibration method, to produce more reliable predictions. We believe the proposed method has important potential for clinical applications using routine healthcare records, and for generalising to other relevant clinical record datasets. To the best of our knowledge, this is the first attempt to distinguish DLB from AD using mental health records, and to improve the reliability of DLB predictions.

Weakly Supervised Medication Regimen Extraction from Medical Conversations
Dhruvesh Patel | Sandeep Konam | Sai Prabhakar

Automated Medication Regimen (MR) extraction from medical conversations can not only improve recall and help patients follow through with their care plan, but also reduce the documentation burden for doctors. In this paper, we focus on extracting spans for frequency, route and change, corresponding to medications discussed in the conversation. We first describe a unique dataset of annotated doctor-patient conversations and then present a weakly supervised model architecture that can perform span extraction using noisy classification data. The model utilizes an attention bottleneck inside a classification model to perform the extraction. We experiment with several variants of attention scoring and projection functions and propose a novel transformer-based attention scoring function (TAScore). The proposed combination of TAScore and Fusedmax projection achieves a 10 point increase in Longest Common Substring F1 compared to the baseline of additive scoring plus softmax projection.

Extracting Relations between Radiotherapy Treatment Details
Danielle Bitterman | Timothy Miller | David Harris | Chen Lin | Sean Finan | Jeremy Warner | Raymond Mak | Guergana Savova

We present work on extraction of radiotherapy treatment information from the clinical narrative in the electronic medical records. Radiotherapy is a central component of the treatment of most solid cancers. Its details are described in non-standardized fashions using jargon not found in other medical specialties, complicating the already difficult task of manual data extraction. We examine the performance of several state-of-the-art neural methods for relation extraction of radiotherapy treatment details, with a goal of automating detailed information extraction. The neural systems perform at 0.82-0.88 macro-average F1, which approximates or in some cases exceeds the inter-annotator agreement. To the best of our knowledge, this is the first effort to develop models for radiotherapy relation extraction and one of the few efforts for relation extraction to describe cancer treatment in general.

Cancer Registry Information Extraction via Transfer Learning
Yan-Jie Lin | Hong-Jie Dai | You-Chen Zhang | Chung-Yang Wu | Yu-Cheng Chang | Pin-Jou Lu | Chih-Jen Huang | Yu-Tsang Wang | Hui-Min Hsieh | Kun-San Chao | Tsang-Wu Liu | I-Shou Chang | Yi-Hsin Connie Yang | Ti-Hao Wang | Ko-Jiunn Liu | Li-Tzong Chen | Sheau-Fang Yang

A cancer registry is a critical and massive database for which various types of domain knowledge are needed and whose maintenance requires labor-intensive data curation. In order to facilitate the curation process for building a high-quality and integrated cancer registry database, we compiled a cross-hospital corpus and applied neural network methods to develop a natural language processing system for extracting cancer registry variables buried in unstructured pathology reports. The performance of the developed networks was compared with various baselines using standard micro-precision, recall and F-measure. Furthermore, we conducted experiments to study the feasibility of applying transfer learning to rapidly develop a well-performing system for processing reports from different sources that might be presented in different writing styles and formats. The results demonstrate that the transfer learning method enables us to develop a satisfactory system for a new hospital with only a few annotations and suggest more opportunities to reduce the burden of cancer registry curation.

PHICON: Improving Generalization of Clinical Text De-identification Models via Data Augmentation
Xiang Yue | Shuang Zhou

De-identification is the task of identifying protected health information (PHI) in the clinical text. Existing neural de-identification models often fail to generalize to a new dataset. We propose a simple yet effective data augmentation method PHICON to alleviate the generalization issue. PHICON consists of PHI augmentation and Context augmentation, which creates augmented training corpora by replacing PHI entities with named-entities sampled from external sources, and by changing background context with synonym replacement or random word insertion, respectively. Experimental results on the i2b2 2006 and 2014 de-identification challenge datasets show that PHICON can help three selected de-identification models boost F1-score (by at most 8.6%) on cross-dataset test setting. We also discuss how much augmentation to use and how each augmentation method influences the performance.

Where’s the Question? A Multi-channel Deep Convolutional Neural Network for Question Identification in Textual Data
George Michalopoulos | Helen Chen | Alexander Wong

In most clinical practice settings, there is no rigorous reviewing of the clinical documentation, resulting in inaccurate information captured in the patient medical records. The gold standard in clinical data capturing is achieved via “expert-review”, where clinicians can have a dialogue with a domain expert (reviewers) and ask them questions about data entry rules. Automatically identifying “real questions” in these dialogues could uncover ambiguities or common problems in data capturing in a given clinical setting. In this study, we proposed a novel multi-channel deep convolutional neural network architecture, namely Quest-CNN, for the purpose of separating real questions that expect an answer (information or help) about an issue from sentences that are not questions, as well as from questions referring to an issue mentioned in a nearby sentence (e.g., can you clarify this?), which we will refer as “c-questions”. We conducted a comprehensive performance comparison analysis of the proposed multi-channel deep convolutional neural network against other deep neural networks. Furthermore, we evaluated the performance of traditional rule-based and learning-based methods for detecting question sentences. The proposed Quest-CNN achieved the best F1 score both on a dataset of data entry-review dialogue in a dialysis care setting, and on a general domain dataset.

Learning from Unlabelled Data for Clinical Semantic Textual Similarity
Yuxia Wang | Karin Verspoor | Timothy Baldwin

Domain pretraining followed by task fine-tuning has become the standard paradigm for NLP tasks, but requires in-domain labelled data for task fine-tuning. To overcome this, we propose to utilise domain unlabelled data by assigning pseudo labels from a general model. We evaluate the approach on two clinical STS datasets, and achieve r= 0.80 on N2C2-STS. Further investigation reveals that if the data distribution of unlabelled sentence pairs is closer to the test data, we can obtain better performance. By leveraging a large general-purpose STS dataset and small-scale in-domain training data, we obtain further improvements to r= 0.90, a new SOTA.

Joint Learning with Pre-trained Transformer on Named Entity Recognition and Relation Extraction Tasks for Clinical Analytics
Miao Chen | Ganhui Lan | Fang Du | Victor Lobanov

In drug development, protocols define how clinical trials are conducted, and are therefore of paramount importance. They contain key patient-, investigator-, medication-, and study-related information, often elaborated in different sections in the protocol texts. Granular-level parsing on large quantity of existing protocols can accelerate clinical trial design and provide actionable insights into trial optimization. Here, we report our progresses in using deep learning NLP algorithms to enable automated protocol analytics. In particular, we combined a pre-trained BERT transformer model with joint-learning strategies to simultaneously identify clinically relevant entities (i.e. Named Entity Recognition) and extract the syntactic relations between these entities (i.e. Relation Extraction) from the eligibility criteria section in protocol texts. When comparing to standalone NER and RE models, our joint-learning strategy can effectively improve the performance of RE task while retaining similarly high NER performance, likely due to the synergy of optimizing toward both tasks’ objectives via shared parameters. The derived NLP model provides an end-to-end solution to convert unstructured protocol texts into structured data source, which will be embedded into a comprehensive clinical analytics workflow for downstream trial design missions such like patient population extraction, patient enrollment rate estimation, and protocol amendment prediction.

Extracting Semantic Aspects for Structured Representation of Clinical Trial Eligibility Criteria
Tirthankar Dasgupta | Ishani Mondal | Abir Naskar | Lipika Dey

Eligibility criteria in the clinical trials specify the characteristics that a patient must or must not possess in order to be treated according to a standard clinical care guideline. As the process of manual eligibility determination is time-consuming, automatic structuring of the eligibility criteria into various semantic categories or aspects is the need of the hour. Existing methods use hand-crafted rules and feature-based statistical machine learning methods to dynamically induce semantic aspects. However, in order to deal with paucity of aspect-annotated clinical trials data, we propose a novel weakly-supervised co-training based method which can exploit a large pool of unlabeled criteria sentences to augment the limited supervised training data, and consequently enhance the performance. Experiments with 0.2M criteria sentences show that the proposed approach outperforms the competitive supervised baselines by 12% in terms of micro-averaged F1 score for all the aspects. Probing deeper into analysis, we observe domain-specific information boosts up the performance by a significant margin.

An Ensemble Approach for Automatic Structuring of Radiology Reports
Morteza Pourreza Shahri | Amir Tahmasebi | Bingyang Ye | Henghui Zhu | Javed Aslam | Timothy Ferris

Automatic structuring of electronic medical records is of high demand for clinical workflow solutions to facilitate extraction, storage, and querying of patient care information. However, developing a scalable solution is extremely challenging, specifically for radiology reports, as most healthcare institutes use either no template or department/institute specific templates. Moreover, radiologists’ reporting style varies from one to another as sentences are written in a telegraphic format and do not follow general English grammar rules. In this work, we present an ensemble method that consolidates the predictions of three models, capturing various attributes of textual information for automatic labeling of sentences with section labels. These three models are: 1) Focus Sentence model, capturing context of the target sentence; 2) Surrounding Context model, capturing the neighboring context of the target sentence; and finally, 3) Formatting/Layout model, aimed at learning report formatting cues. We utilize Bi-directional LSTMs, followed by sentence encoders, to acquire the context. Furthermore, we define several features that incorporate the structure of reports. We compare our proposed approach against multiple baselines and state-of-the-art approaches on a proprietary dataset as well as 100 manually annotated radiology notes from the MIMIC-III dataset, which we are making publicly available. Our proposed approach significantly outperforms other approaches by achieving 97.1% accuracy.

Utilizing Multimodal Feature Consistency to Detect Adversarial Examples on Clinical Summaries
Wenjie Wang | Youngja Park | Taesung Lee | Ian Molloy | Pengfei Tang | Li Xiong

Recent studies have shown that adversarial examples can be generated by applying small perturbations to the inputs such that the well- trained deep learning models will misclassify. With the increasing number of safety and security-sensitive applications of deep learn- ing models, the robustness of deep learning models has become a crucial topic. The robustness of deep learning models for health- care applications is especially critical because the unique characteristics and the high financial interests of the medical domain make it more sensitive to adversarial attacks. Among the modalities of medical data, the clinical summaries have higher risks to be attacked because they are generated by third-party companies. As few works studied adversarial threats on clinical summaries, in this work we first apply adversarial attack to clinical summaries of electronic health records (EHR) to show the text-based deep learning systems are vulnerable to adversarial examples. Secondly, benefiting from the multi-modality of the EHR dataset, we propose a novel defense method, MATCH (Multimodal feATure Consistency cHeck), which leverages the consistency between multiple modalities in the data to defend against adversarial examples on a single modality. Our experiments demonstrate the effectiveness of MATCH on a hospital readmission prediction task comparing with baseline methods.

Advancing Seq2seq with Joint Paraphrase Learning
So Yeon Min | Preethi Raghavan | Peter Szolovits

We address the problem of model generalization for sequence to sequence (seq2seq) architectures. We propose going beyond data augmentation via paraphrase-optimized multi-task learning and observe that it is useful in correctly handling unseen sentential paraphrases as inputs. Our models greatly outperform SOTA seq2seq models for semantic parsing on diverse domains (Overnight - up to 3.2% and emrQA - 7%) and Nematus, the winning solution for WMT 2017, for Czech to English translation (CzENG 1.6 - 1.5 BLEU).

On the diminishing return of labeling clinical reports
Jean-Baptiste Lamare | Oloruntobiloba Olatunji | Li Yao

Ample evidence suggests that better machine learning models may be steadily obtained by training on increasingly larger datasets on natural language processing (NLP) problems from non-medical domains. Whether the same holds true for medical NLP has by far not been thoroughly investigated. This work shows that this is indeed not always the case. We reveal the somehow counter-intuitive observation that performant medical NLP models may be obtained with small amount of labeled data, quite the opposite to the common belief, most likely due to the domain specificity of the problem. We show quantitatively the effect of training data size on a fixed test set composed of two of the largest public chest x-ray radiology report datasets on the task of abnormality classification. The trained models not only make use of the training data efficiently, but also outperform the current state-of-the-art rule-based systems by a significant margin.

The Chilean Waiting List Corpus: a new resource for clinical Named Entity Recognition in Spanish
Pablo Báez | Fabián Villena | Matías Rojas | Manuel Durán | Jocelyn Dunstan

In this work we describe the Waiting List Corpus consisting of de-identified referrals for several specialty consultations from the waiting list in Chilean public hospitals. A subset of 900 referrals was manually annotated with 9,029 entities, 385 attributes, and 284 pairs of relations with clinical relevance. A trained medical doctor annotated these referrals, and then together with other three researchers, consolidated each of the annotations. The annotated corpus has nested entities, with 32.2% of entities embedded in other entities. We use this annotated corpus to obtain preliminary results for Named Entity Recognition (NER). The best results were achieved by using a biLSTM-CRF architecture using word embeddings trained over Spanish Wikipedia together with clinical embeddings computed by the group. NER models applied to this corpus can leverage statistics of diseases and pending procedures within this waiting list. This work constitutes the first annotated corpus using clinical narratives from Chile, and one of the few for the Spanish language. The annotated corpus, the clinical word embeddings, and the annotation guidelines are freely released to the research community.

Exploring Text Specific and Blackbox Fairness Algorithms in Multimodal Clinical NLP
John Chen | Ian Berlot-Attwell | Xindi Wang | Safwan Hossain | Frank Rudzicz

Clinical machine learning is increasingly multimodal, collected in both structured tabular formats and unstructured forms such as free text. We propose a novel task of exploring fairness on a multimodal clinical dataset, adopting equalized odds for the downstream medical prediction tasks. To this end, we investigate a modality-agnostic fairness algorithm - equalized odds post processing - and compare it to a text-specific fairness algorithm: debiased clinical word embeddings. Despite the fact that debiased word embeddings do not explicitly address equalized odds of protected groups, we show that a text-specific approach to fairness may simultaneously achieve a good balance of performance classical notions of fairness. Our work opens the door for future work at the critical intersection of clinical NLP and fairness.