David Harris

2026

Temporal information extraction is the task of identifying temporal entities in a text and relating them to each other. In medicine, electronic health records (EHRs) contain text that documents the sequence of events during an encounter with a patient, and sometimes the events prior to the encounter (e.g., social history). Temporality is especially important for the specialty of psychiatry. In this work, we describe the updates to the guidelines that allowed us to create a corpus of temporally annotated psychiatric discharge summaries and progress notes. These updated guidelines were used to create a corpus of over 18,000 events, 2,200 time expressions, and 13,000 temporal relations. A baseline system trained on non-psychiatric data obtains an F1 score of 0.152 on relation extraction, indicating the importance of this new dataset for making progress on temporal information extraction in the psychiatric domain.

2024

Development of a Benchmark Corpus for Medical Device Adverse Event Detection
Susmitha Wunnava | David Harris | Florence T. Bourgeois | Timothy A. Miller
Proceedings of the First Workshop on Patient-Oriented Language Processing (CL4Health) @ LREC-COLING 2024

The U.S. Food and Drug Administration (FDA) collects real-world adverse events, including device-associated deaths, injuries, and malfunctions, through passive reporting to the agency’s Manufacturer and User Facility Device Experience (MAUDE) database. However, this system’s full potential remains untapped given the extensive use of unstructured text in medical device adverse event reports and the lack of FDA resources and expertise to properly analyze all available data. In this work, we address this limitation by developing an annotated benchmark corpus to support the design and development of state-of-the-art NLP approaches for automatic extraction of device-related adverse event information from FDA Medical Device Adverse Event Reports. We develop a dataset of labeled medical device reports, drawn from a diverse set of high-risk device types, that can be used for supervised machine learning. We develop annotation guidelines and manually annotate for nine entity types. The resulting dataset contains 935 annotated adverse event reports with 12,252 annotated spans across the nine entity types. The dataset developed in this work will be made publicly available upon publication.

2023

We explore temporal dependency graph (TDG) parsing in the clinical domain. We leverage existing annotations on the THYME dataset to semi-automatically construct a TDG corpus. We then propose a new natural language inference (NLI) approach to TDG parsing and evaluate it on both general-domain TDGs from Wikinews and the newly constructed clinical TDG corpus. We achieve competitive performance on general-domain TDGs with a much simpler model than prior work. On the clinical TDGs, our method establishes the first result of TDG parsing on clinical data with 0.79/0.88 micro/macro F1.

2020

We present work on the extraction of radiotherapy treatment information from the clinical narrative in electronic medical records. Radiotherapy is a central component of the treatment of most solid cancers. Its details are described in non-standardized ways using jargon not found in other medical specialties, complicating the already difficult task of manual data extraction. We examine the performance of several state-of-the-art neural methods for relation extraction of radiotherapy treatment details, with the goal of automating detailed information extraction. The neural systems perform at 0.82-0.88 macro-average F1, which approximates or in some cases exceeds the inter-annotator agreement. To the best of our knowledge, this is the first effort to develop models for radiotherapy relation extraction and one of the few efforts on relation extraction to describe cancer treatment in general.