Matthew M. Engelhard
Also published as: Matthew Engelhard
2025
IRIS: Interpretable Retrieval-Augmented Classification for Long Interspersed Document Sequences
Fengnan Li | Elliot D. Hill | Jiang Shu | Jiaxin Gao | Matthew M. Engelhard
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Fengnan Li | Elliot D. Hill | Jiang Shu | Jiaxin Gao | Matthew M. Engelhard
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Transformer-based models have achieved state-of-the-art performance in document classification but struggle with long-text processing due to the quadratic computational complexity in the self-attention module. Existing solutions, such as sparse attention, hierarchical models, and key sentence extraction, partially address the issue but still fall short when the input sequence is exceptionally lengthy. To address this challenge, we propose **IRIS** (**I**nterpretable **R**etrieval-Augmented Classification for long **I**nterspersed Document **S**equences), a novel, lightweight framework that utilizes retrieval to efficiently classify long documents while enhancing interpretability. IRIS segments documents into chunks, stores their embeddings in a vector database, and retrieves those most relevant to a given task using learnable query vectors. A linear attention mechanism then aggregates the retrieved embeddings for classification, allowing the model to process arbitrarily long documents without increasing computational cost and remaining trainable on a single GPU. Our experiments across six datasets show that IRIS achieves comparable performance to baseline models on standard benchmarks, and excels in three clinical note disease risk prediction tasks where documents are extremely long and key information is sparse. Furthermore, IRIS provides global interpretability by revealing a clear summary of key risk factors identified by the model. These findings highlight the potential of IRIS as an efficient and interpretable solution for long-document classification, particularly in healthcare applications where both performance and explainability are crucial.
2021
SpanPredict: Extraction of Predictive Document Spans with Neural Attention
Vivek Subramanian | Matthew Engelhard | Sam Berchuck | Liqun Chen | Ricardo Henao | Lawrence Carin
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Vivek Subramanian | Matthew Engelhard | Sam Berchuck | Liqun Chen | Ricardo Henao | Lawrence Carin
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
In many natural language processing applications, identifying predictive text can be as important as the predictions themselves. When predicting medical diagnoses, for example, identifying predictive content in clinical notes not only enhances interpretability, but also allows unknown, descriptive (i.e., text-based) risk factors to be identified. We here formalize this problem as predictive extraction and address it using a simple mechanism based on linear attention. Our method preserves differentiability, allowing scalable inference via stochastic gradient descent. Further, the model decomposes predictions into a sum of contributions of distinct text spans. Importantly, we require only document labels, not ground-truth spans. Results show that our model identifies semantically-cohesive spans and assigns them scores that agree with human ratings, while preserving classification performance.