Bahar İlgen

Also published as: Bahar Ilgen

2026

VaxScope: Document-Level Structured Evidence Extraction from Immunization Systematic Reviews
Bahar Ilgen | Ebenezer Awotoro | Georges Hattab
BioNLP 2026

Systematic reviews are fundamental to evidence-based medicine, but the clinical evidence they contain is primarily expressed in unstructured text, making large-scale extraction and reuse difficult. Existing biomedical NLP methods have achieved strong performance on span-level extraction from clinical trials and abstracts; however, these approaches are insufficient for systematic reviews, where evidence is often distributed across multiple studies, sentences, and sections and must be aggregated into normalized document-level attributes. We introduce VaxScope, a benchmark dataset for document-level structured evidence extraction from immunization-related systematic reviews. VaxScope is constructed through an expert-guided semi-automatic annotation pipeline that combines automatic candidate generation with domain expert validation to ensure consistency and annotation quality. We formalize the task as document-level structured extraction, where target labels are defined at the review level and require aggregating evidence beyond isolated textual spans. We further establish baselines for document-level structured extraction using abstract-level input representations and evaluate how access to evidence-grounded contextual input improves performance over abstract-only settings. Baseline experiments show that PubMedBERT achieves the best overall performance (Avg F1: 0.850), with evidence-grounded input improving performance particularly for fields requiring distributed contextual reasoning.

2025

Toward Human-Centered Readability Evaluation
Bahar İlgen | Georges Hattab
Proceedings of the Fourth Workshop on Bridging Human-Computer Interaction and Natural Language Processing (HCI+NLP)

Text simplification is essential for making public health information accessible to diverse populations, including those with limited health literacy. However, commonly used evaluation metrics in Natural Language Processing (NLP)—such as BLEU, FKGL, and SARI—mainly capture surface-level features and fail to account for human-centered qualities like clarity, trustworthiness, tone, cultural relevance, and actionability. This limitation is particularly critical in high-stakes health contexts, where communication must be not only simple but also usable, respectful, and trustworthy. To address this gap, we propose the Human-Centered Readability Score (HCRS), a five-dimensional evaluation framework grounded in Human-Computer Interaction (HCI) and health communication research. HCRS integrates automatic measures with structured human feedback to capture the relational and contextual aspects of readability. We outline the framework, discuss its integration into participatory evaluation workflows, and present a protocol for empirical validation. This work aims to advance the evaluation of health text simplification beyond surface metrics, enabling NLP systems that align more closely with diverse users’ needs, expectations, and lived experiences.

Reasoning Under Distress: Mining Claims and Evidence in Mental Health Narratives
Jannis Köckritz | Bahar İlgen | Georges Hattab
Proceedings of the 12th Argument mining Workshop

This paper explores the application of argument mining to mental health narratives using zero‐shot transfer learning. We fine‐tune a BERT‐based sentence classifier on ~15k essays from the Persuade dataset—achieving 69.1% macro‐F1 on its test set—and apply it without domain adaptation to the CAMS dataset, which consists of anonymized mental health–related Reddit posts. On a manually annotated gold‐standard set of 150 CAMS sentences, our model attains 54.7% accuracy and 48.9% macro‐F1, with evidence detection (F1 = 63.4%) transferring more effectively than claim identification (F1 = 32.0%). Analysis across expert‐annotated causal factors of distress shows that personal narratives heavily favor experiential evidence (65–77% of sentences) compared to academic writing. The prevalence of evidence sentences, many of which appear to be grounded in lived experiences, such as descriptions of emotional states or personal events, suggests that personal narratives favor descriptive recollection over formal, argumentative reasoning. These findings underscore the unique challenges of argument mining in affective contexts and offer recommendations for enhancing argument mining tools within clinical and digital mental health support systems.
