Social Media Mining for Health Applications Workshop (2026)

Volumes

Proceedings of the 11th Social Media Mining for Health Research and Applications (SMM4H-HeaRD 2026) Workshop and Shared Tasks 54 papers

Proceedings of the 11th Social Media Mining for Health Research and Applications (SMM4H-HeaRD 2026) Workshop and Shared Tasks
Guillermo Lopez-Garcia | Graciela Gonzalez-Hernandez

A3S@C-DAC at #SMM4H-HeaRD 2026: Reasoning Meets Evidence: LLMs for Interpretable Insomnia Detection with Evidence Extraction in Clinical Notes
Abhishek Maity | Amol Shinde | Abhishek Suresh Kushare | Swapnil Pawar

Detecting insomnia from clinical narratives requires both accurate classification and clinically grounded reasoning with interpretable evidence. We present our systems for the SMM4H-HeaRD 2026 shared task, which leverages MIMIC-III notes annotated with rule-based insomnia criteria and supporting evidence spans. We explore two complementary approaches: parameter-efficient fine-tuning of lightweight models using QLoRA and LoRA, and few-shot prompting of large language models for joint reasoning and evidence extraction. Our best system achieves an F1-score of 0.7333 on binary classification and a micro-F1 of 0.6535 on multi-label rule prediction, with up to 0.5192 partial-match F1 for evidence extraction. Results show that lightweight fine-tuned models can outperform larger models in classification, while larger models demonstrate stronger reasoning but struggle with precise span localization, highlighting a key gap in clinically interpretable NLP systems.

Gladiators at #SMM4H–HeaRD 2026: Multi-Seed XLM-RoBERTa Ensemble with Focal Loss and Per-Language Threshold Optimization for Multilingual Adverse Drug Event Detection
Ankit Kumar Singh

This paper describes the Gladiators system for Task 1 of the SMM4H 2026 shared task on binary classification of adverse drug event (ADE) mentions in multilingual social media posts. Our system fine-tunes three XLM-RoBERTa large models with different random seeds using focal loss (α=0.75, γ=2.0) and 3× positive oversampling, then averages their predicted probabilities and applies per-language threshold optimization. On the development set, our ensemble achieves a pooled binary F1 of 0.7505. On the official test set, which introduced surprise Farsi data comprising 35.5% of samples, our system achieves F1 = 0.6039, above the competition mean (0.5465) and median (0.5798). We evaluated eleven approaches and document key negative results. Post-evaluation, a six-model cross-regime ensemble improved dev F1 to 0.7585.
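
The three ingredients named in this abstract (focal loss, probability averaging, per-language thresholds) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes α weights the positive class, and the function names and threshold values are hypothetical.

```python
import math

def binary_focal_loss(p, y, alpha=0.75, gamma=2.0, eps=1e-7):
    """Focal loss for one example; p is the predicted positive-class probability."""
    p = min(max(p, eps), 1.0 - eps)               # avoid log(0)
    p_t = p if y == 1 else 1.0 - p                # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha    # assumption: alpha weights positives
    # (1 - p_t)^gamma down-weights examples the model already gets right.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def average_probs(per_model_probs):
    """Average positive-class probabilities across models (one list per model)."""
    n_models = len(per_model_probs)
    return [sum(col) / n_models for col in zip(*per_model_probs)]

def predict(probs, langs, thresholds, default=0.5):
    """Apply a language-specific decision threshold to each averaged probability."""
    return [int(p >= thresholds.get(lang, default)) for p, lang in zip(probs, langs)]
```

In practice the per-language thresholds would be tuned on development data; a language unseen at tuning time, such as the surprise Farsi data, falls back to the default here.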

LSI_UNED at #SMM4H–HeaRD 2026: Grid-Based Biomedical Named Entity Recognition Across Languages and Entity Types
Alicia Ramirez-Arrabe | Juan Martinez-Romo | Andres Duque

This paper describes the participation of the LSI_UNED team in the first sub-task of MultiClinAI at the #SMM4H-HeaRD 2026 Workshop, which focuses on multilingual clinical named entity recognition in seven languages. The task requires identifying mentions of diseases, procedures, and symptoms in clinical case reports. We propose a set of systems based on the W2NER architecture, with a separate model trained for each language and entity type. For Spanish, we use a RoBERTa-based model with data augmentation from additional NER resources, while English and Italian systems are based on different biomedical BERT variants. Results show consistent performance across languages, with the best overall results obtained for Spanish. Data augmentation improves recall and F1, while English and Italian models achieve competitive but slightly lower scores. Symptom recognition remains the most challenging entity type across all languages.

SINAI at #SMM4H–HeaRD 2026: Multilingual Clinical NER with MrBERT-biomed and Optuna Hyperparameter Optimization
Lucas Molino Piñar | Manuel Carlos Diaz-Galiano | María-Teresa Martín-Valdivia

This paper describes the system submitted by our team to the MultiClinAI shared task at the 11th SMM4H-HeaRD Workshop (ACL 2026). The task addresses multilingual clinical Named Entity Recognition (NER) for three entity types (Disease, Procedure, and Symptom) in Spanish clinical texts. Our approach fine-tunes MrBERT-biomed, a domain-adapted ModernBERT model pre-trained on biomedical corpora, using multilingual clinical data from seven European languages. We train independent entity-specific models, each optimized via Bayesian hyperparameter search with Optuna, and apply a deterministic post-processing step that aligns predicted spans to word boundaries. On the official test set, our system achieves overall strict micro-F1 scores of 0.7453, 0.7107, and 0.6603 for Disease, Procedure, and Symptom, respectively.
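
A deterministic step that aligns predicted spans to word boundaries might look like the sketch below. The abstract does not say whether spans are expanded or trimmed; this version assumes expansion, and the function name is hypothetical.

```python
def snap_to_word_boundaries(text, start, end):
    """Expand a [start, end) character span so it never cuts through a word.

    Assumes 0 <= start < end <= len(text).
    """
    # Move the left edge back while it sits between two word characters.
    while start > 0 and text[start - 1].isalnum() and text[start].isalnum():
        start -= 1
    # Move the right edge forward while it sits between two word characters.
    while end < len(text) and text[end - 1].isalnum() and text[end].isalnum():
        end += 1
    return start, end
```

For example, a prediction covering "eumonía bil" in "Paciente con neumonía bilateral." would be widened to "neumonía bilateral".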

Prestige at #SMM4H-HeaRD 2026: Binary Insomnia Classification from Clinical Notes Using LLMs with Chain-of-Thought Reasoning
Oyindolapo O. Komolafe

This paper describes our system for Subtask 1 of the SMM4H HeaRD 2026 Task 2, which is an LLM-based system for binary insomnia classification from MIMIC-III clinical notes using OpenAI GPT-5.2 with chain-of-thought (CoT) prompting. Our approach implements three strategies: baseline fixed 8-shot prompting, dynamic retrieval using semantic embeddings, and self-consistency voting. The system applies rule-based criteria combining symptom patterns (difficulty sleeping and daytime impairment) with medication indicators (primary and secondary insomnia medications). Our best configuration (Self-Consistency Voting) achieved 95.67% weighted F1 on validation and 82.35% F1 on the official test set, outperforming the Baseline (81.25% F1). Notably, our test F1-score of 82.35% substantially exceeded the task mean (68.05%) and median (70.37%) across all participating teams. Key contributions include explicit comorbidity exclusion prompting, context-aware nursing note handling, logical constraint enforcement for prediction consistency, and a comparative analysis demonstrating that self-consistency improves recall at moderate computational cost.
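
Self-consistency voting, in its usual form, samples several independent answers and keeps the majority label. The sketch below assumes a caller-supplied `sample_fn` that wraps one stochastic LLM call; that function, the label strings, and the sample count are all hypothetical.

```python
from collections import Counter

def self_consistency_vote(sample_fn, note, n_samples=5):
    """Query a stochastic classifier n_samples times and return the majority label.

    Returns (label, agreement), where agreement is the share of samples
    that voted for the winning label.
    """
    votes = Counter(sample_fn(note) for _ in range(n_samples))
    label, count = votes.most_common(1)[0]
    return label, count / n_samples
```

An odd number of samples avoids ties for a binary label. The cost grows linearly with `n_samples`, which is the trade-off the abstract refers to.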

Team Gazoo! at #SMM4H-HeaRD 2026: Zero-Training NER via Iterative LLM Prompt Self-Optimization for Opioid Impact Span Detection
Diego Estuar

This paper describes the system submitted by Team Gazoo! for Task 7 of the #SMM4H-HeaRD 2026 shared task on detecting self-reported clinical and social impacts of nonmedical opioid use in social media text. We present a zero-training, prompt-only approach that uses a large language model (GPT-5.4) with structured few-shot prompting and autonomous, iterative rule optimization. Our system encodes a domain-specific entity ontology, three core decision rules, and 65 cognitively organized few-shot examples into a single prompt, with BIO constraint enforcement applied as post-processing. Crucially, the prompt itself is refined by the LLM: at each iteration the model analyzes its own errors and proposes targeted edits to its rules and examples. Through 18 such self-refinement cycles, our system achieved an F1-Strict of 0.53 and F1-Relaxed of 0.60 on the test set, ranking first among all participating teams under both evaluation criteria.
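
BIO constraint enforcement as a post-processing step can be illustrated with a small repair pass. This is one common convention (promote an orphan `I-` tag to `B-`), assumed here rather than taken from the paper; the tag names follow the Task 7 entity types mentioned elsewhere in these proceedings.

```python
def enforce_bio(tags):
    """Repair a tag sequence so every I-X continues a B-X or I-X of the same type."""
    fixed = []
    prev = "O"
    for tag in tags:
        if tag.startswith("I-"):
            etype = tag[2:]
            # An I- tag with no matching predecessor starts a new span instead.
            if prev not in (f"B-{etype}", f"I-{etype}"):
                tag = f"B-{etype}"
        fixed.append(tag)
        prev = tag
    return fixed
```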

DNT at #SMM4H–HeaRD 2026: Leveraging BERT-based Encoders and LLMs for Medical Information Extraction
Doan Nhat Tien | Thìn Đặng Văn

This paper presents our systems for two tasks at #SMM4H-HeaRD 2026. For Task 1 (multilingual Adverse Drug Event detection), we fine-tune BERT-based multilingual models (InfoXLM and XLM-RoBERTa) and Qwen3.5-9B with ensemble methods, achieving 0.8584 macro F1 on the development set and 0.5304 F1 on unseen Farsi. For Task 7 (span detection of ClinicalImpacts and SocialImpacts in opioid narratives), DeBERTa-Large with simplified labeling achieves the best test performance (0.583 relaxed F1, 0.500 strict F1). Our analysis shows that LLMs excel on known languages in Task 1, while transformer-based models with simplified labeling generalize better for NER tasks.

BIT.UA at #SMM4H–HeaRD 2026: Towards Multi-Class Multilingual Clinical Entity Recognition with Multi-Head CRF Ensembles
Richard A. A. Jonker | Sérgio Matos

This paper describes the BIT.UA system for the MultiClinNER shared task at #SMM4H–HeaRD 2026, targeting multilingual clinical named entity recognition across seven languages for three entity types (Disease, Procedure, Symptom). We extend the Multi-Head CRF architecture, originally developed for multi-class NER on Spanish clinical text, to the multilingual setting. To enable joint multi-entity training despite per-entity text variations in the dataset, we develop an adaptive text consolidation pipeline that preserves over 94% of annotations. Our central finding is that a single xlm-roberta-large model, trained jointly on all seven languages and three entity types, achieves competition rank 2 for five of seven languages, outperforming dedicated monolingual models by up to +6.94 F1 points, while requiring only a single set of weights. Ensembling multiple seeds of this model achieves rank 1 for those five languages, and combining it with monolingual models yields rank 1 for the remaining two. Code and models are publicly available at https://github.com/ieeta-pt/Multi-Head-CRF/tree/MultiClinNER and https://huggingface.co/collections/IEETA/multiclinner-models.

Bhramastra at #SMM4H-HeaRD 2026: A Multi-Stage Hunter-Judge Pipeline using DSPy-Optimized LLMs for Multilingual ADE Detection
Bhaarat Pachori

This paper describes the submission by Team Bhramastra for the #SMM4H-HeaRD 2026 Shared Task 1, focused on personal Adverse Drug Event (ADE) detection in multilingual social media. A decoupled architecture, Hunter-Judge, is proposed to handle extreme class imbalance and linguistic variance across seven languages, including a surprise zero-shot Farsi set. The system employs a fine-tuned multilingual mDeBERTa-v3 model as a high-recall filter (Hunter), followed by a Gemini-2.5-Flash model (Judge) optimized via the DSPy framework for precision-oriented agentic adjudication. By implementing a reasoning protocol grounded in clinical RAG evidence and universal ingredient mapping, the pipeline achieved the highest average F1-score (0.6653) among all teams. Strong zero-shot generalizability on Farsi (F1: 0.5863) was demonstrated, highlighting the effectiveness of medically-grounded adjudication in low-resource contexts.

LLATMU at #SMM4H-HeaRD 2026: Clinical Text Structuring with QLoRA-based Generation and Partial-Label TNM Classification
Eric Hsiao | Min-Hsuan Ku | Hsuan-Lei Shao

We describe the LLATMU systems submitted to the #SMM4H-HeaRD 2026 shared tasks, covering two related clinical text structuring problems: dialogue-to-SOAP note generation (Task 4) and TNM staging classification from pathology reports (Task 6). Although the two tasks differ in modeling paradigm (text generation versus supervised classification), both require transforming unstructured clinical narratives into structured representations. For Task 4, we instruction-tuned LLMs with parameter-efficient adaptation and submitted a QLoRA-based Ministral-3B system, achieving an official blind test average score of 0.53 and outperforming the task-wide mean and median. For Task 6, we formulate TNM prediction as a three-head classification problem using BioClinical-ModernBERT-large with long-context encoding, class-weighted loss, and normalized partial-label training. The model achieves a validation average macro-F1 of 0.9196 and continues to outperform the official baseline on the more challenging tie-break test set. Across both tasks, our results suggest that robust data handling, stable fine-tuning, and task-appropriate supervision are important for practical clinical NLP under constrained and imperfect shared-task settings.

Patient2Paper at #SMM4H-HeaRD 2026: Retrieval-Augmented Few-Shot Generation for Clinical Note Synthesis
Ioan-Tudor-Alexandru Anghel | Timotei Andrei | Comârdici Marian Bogdan | Carina Sâicu

We present a retrieval-augmented few-shot system for the MedSynth Dial2Note shared task at SMM4H-HeaRD 2026, placing 3rd on the official leaderboard (0.51 avg). Across 28 configurations, we find that retrieval design (hybrid BM25 + medical-domain dense retrieval, fused via RRF) and prompt presentation format (few-shot examples as conversation turns) are the primary quality drivers, while model scale has surprisingly limited impact: Llama 3.2:3B, Llama 3.1:8B and GPT-4o mini remain within a narrow band on our locally computed scores. Our final submission used GPT-4o mini with k=3 few-shot examples retrieved by RRF over BioLORD-2023 embeddings. We report a full ablation, including negative results, to show where the gains come from and where further engineering stops paying off.
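
Reciprocal rank fusion (RRF) merges ranked lists by summing 1 / (k + rank) for each document across the lists. The sketch below is a generic version: the smoothing constant k=60 is the commonly used default, not a value reported in the abstract, and the document ids are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical BM25 and dense-retriever rankings; keep the top 3 as few-shot examples.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
few_shot_ids = reciprocal_rank_fusion([bm25_ranking, dense_ranking])[:3]
```

Because RRF uses only ranks, it needs no score normalization between the lexical and dense retrievers.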

In2Lab-TNT at #SMM4H-HeaRD 2026: An Application of QTT’s Terminological Entanglement to Leverage Insomnia Detection in Clinical Notes
Antonio Jesus Tamayo Herrera | Giovanny Díaz-Laínes | Carlos Mario Perez Perez | Diego A Burgos

We present a lightweight, deterministic post-processing approach for clinical text classification based on entanglement between clinically meaningful concepts. Our system was developed for the SMM4H 2026 shared task on insomnia detection and related information extraction from clinical notes. For Subtask 1, we introduce an entanglement-based rescue layer that models dependencies between sleep disturbance, daytime impairment, and sleep-targeted medication evidence. Applied as a false-negative correction on top of an LLM baseline, this approach improves recall while preserving precision. On the official test set, the rescue layer increases F1 by 25% without degrading precision (1.00). Local experiments show larger gains on weaker runs, suggesting a stabilizing effect on variable LLM outputs. For Subtask 2, we implement an LLM-based system for rule-based evidence and span extraction. Results highlight the effectiveness of modeling clinically grounded dependencies and suggest directions for improving evidence extraction and span matching.

blue at SMM4H-HeaRD 2026: Class-Weighted Transformer Ensembles with Structured Decoding and Chain-of-Thought Blending across Six Health NLP Shared Tasks
Krish Sharma | Rhea Singhal | Jatin Bedi

We describe team blue’s participation across six SMM4H-HeaRD 2026 shared tasks spanning multilingual adverse drug event detection (Task 1), influenza vaccine effectiveness estimation (Task 3), patient metadata classification (Task 5), TNM cancer staging (Task 6), opioid impact span detection (Task 7), and multilingual clinical NER with cross-lingual annotation projection (Task 8). Despite the heterogeneity of these tasks (binary, multi-class, multi-label, and sequence-labelling), our systems share three recurring design principles: (i) inverse-frequency class weighting to handle severe imbalance, (ii) multi-seed and/or multi-backbone ensembling to reduce variance, and (iii) post-hoc calibration of decision boundaries. Key results include micro-F1 of 0.990 on TNM staging (Task 6), 0.872/0.918 on flu vaccination/test classification, surpassing the 70B CoT baseline on vaccination (Task 3), F1 of 0.764 on patient metadata, approaching the fine-tuning benchmark of 0.776 (Task 5), and competitive performance on ADE detection (Task 1, F1 = 0.580), opioid spans (Task 7, relaxed F1 = 0.59), and multilingual clinical NER (Task 8, strict F1 0.20–0.41 across 7 languages).

DT4H.nl at #SMM4H-HeaRD 2026: Multilingual Clinical NER with multilingual and monolingual models
Bram van Es

We describe the setup we used to complete the MultiClinAI-NER task in the SMM4H-HeaRD workshop 2026. In this work we employed a dedicated multilingual encoder model (EuroBERT-610m), two Dutch encoder models trained from scratch on clinical corpora (MedRoBERTa.nl and CardioDeBERTa.nl) and a generic Dutch encoder model (RobBERT2023-large), all finetuned with a 3-layer DNN head. We find that the use of multilingual datasets is potentially beneficial in augmenting the training corpora of monolingual models.

SMMTech at #SMM4H-HeaRD 2026: Detection of Insomnia in Clinical Notes
Emilia-Ioana Cristea

This paper describes the participation of team SMMTech in the SMM4H-HeaRD 2026 Shared Task 2: Detection of Insomnia in Clinical Notes. We present a comparative architectural study exploring the friction between extractive token-classification models and generative Large Language Models (LLMs) in clinical span extraction, on the MIMIC-III Clinical Database. During the validation phase we established baselines using encoder-only transformers such as BERT, ClinicalBERT, BigBird and Clinical BigBird. For the official test phase, we deployed a 4-bit quantized generative hybrid pipeline using Llama3-Med42-8B to evaluate its multi-hop reasoning capabilities. While the generative pipeline achieved an F1-score of 0.4783 on Subtask 1 (Classification), it struggled with exact span matching on Subtask 2. In this paper we present the mechanical limitations of zero-shot JSON extraction and the necessity of decoupling clinical reasoning from character-level span extraction.

Beyond Lexical Similarity: Evaluating Faithfulness in LLM-Based Medical Question Reformulation
Md Rabiul Hasan | Aleka Melese Ayalew | Mourad Oussalah

Medical query rewriting transforms verbose consumer health questions into concise clinical queries, a critical step in health information retrieval. Large language models (LLMs) perform well on this task by standard metrics, yet high ROUGE or BERTScore does not guarantee preservation of clinical content. To address this issue, we introduce MedFaith-F1, a category-level faithfulness metric over four clinically salient categories: diagnoses, medications, procedures, and follow-up intent. We further propose EKG-RAG (Evidence and Knowledge-Grounded Retrieval-Augmented Generation), a framework that combines hybrid retrieval over PubMed and MedlinePlus resources with UMLS (Unified Medical Language System)-aligned ontology grounding. Evaluating the large language models LLaMA-3 and Qwen2.5 across zero-shot, few-shot, and QLoRA settings on the MeQSum and medical question-pair (MQP) datasets revealed that base models exhibit category-level hallucination rates exceeding 40%, invisible to standard metrics, while EKG-RAG with QLoRA reduces this rate to 26.75%, achieving MedFaith-F1 of 0.73. Our findings call for faithfulness-aware evaluation in clinical query rewriting, and MedFaith-F1 provides a reproducible step in that direction.

NU_DeepHealthNLP at #SMM4H-HeaRD 2026: Entity-Conditioned Generation and a Four-Stage Pipeline for Automated SOAP Note Generation
Thanya Mysore Santhosh | Deahan Yu

We describe two system submissions to Task 4 of the SMM4H-HeaRD 2026 Shared Task on automated SOAP note generation from doctor–patient dialogues. Our first submission is a standalone entity-conditioned generation model: Mistral-7B-Instruct-v0.1 fine-tuned with QLoRA on 8,529 MedSynth training dialogues, where both training and inference prompts include clinical entities extracted and grouped by SOAP section. Our second submission is a four-stage modular pipeline that additionally incorporates a hybrid retrieval stage and a rule-based verification stage. The key finding of this work is that incorporating structured clinical domain knowledge, in the form of NER entities grouped by SOAP section, directly into the generation prompt produces consistent and reliable improvements over dialogue-only generation. Our four-stage pipeline submission achieved an average score of 0.54 on the official test set, ranking first on the shared task leaderboard.

GoBlueInformatics at #SMM4H-HeaRD 2026: Long-Context Encoders and Generative Biomedical LLMs for Pathological TNM Stage Prediction
Shangqing Wei

We describe our systems for #SMM4H-HeaRD 2026 Task 6, which requires predicting the T, N, and M components of pathological TNM stage from TCGA pathology reports. We explored both discriminative long-context encoders and generative biomedical LLMs. For the first test set, our BioClinical-ModernBERT-large ensemble achieved 0.993 micro-F1 and 0.915 macro-F1, improving over the BB-TEN baseline scoring-log result of 0.947 micro-F1 and 0.780 macro-F1. For the harder second test set, our OpenBioLLM-8B LoRA extractor improved component macro-F1 over the organizer baseline from 0.454 to 0.626 for T, from 0.591 to 0.758 for N, and from 0.554 to 1.000 for M. These results suggest that long-context encoders are strong for explicit T and N evidence, while constrained generative LLM extraction can be effective for harder reports. The main remaining weakness is rare-class T4 recognition.

pdf bib abs

This paper addresses the MultiClinAI challenge, subtask MultiClinNER, which focuses on clinical Named Entity Recognition (NER) across seven languages: Czech, Dutch, English, Italian, Romanian, Spanish, and Swedish. The main goal of MultiClinNER is to identify and extract clinical terms specifically related to diseases, procedures, and symptoms from discharge summaries. The paper explores a variety of state-of-the-art methods, both monolingual and multilingual, ranging from pretrained, zero-shot, domain-adapted transformers to fine-tuned transformer models, and demonstrates the benefits of ensemble modeling. Data augmentation through external resources significantly enhanced the models’ ability to recognize clinical entities. Both monolingual and multilingual approaches showed complementary strengths depending on the language and entity type. The average F1 score achieved across the best models for each language and category is 0.6502.

RACAI at #SMM4H-HeaRD: Named Entity Recognition for Detecting the Impacts of Drug Abuse in Social Media Posts: Zero-Shot and Fine-Tuning Approaches
Tiberiu Boros | Radu-Gabriel Chivereanu

In this work, we address the detection of drug abuse repercussions in Reddit posts, as part of SMM4H-HeaRD Task 7: Extraction of Social and Clinical Impacts of Substance Use from Social Media Posts. We evaluate multiple approaches, including fine-tuning and zero-shot inference, across several deep learning architectures. Our best result is obtained using an adapter-based fine-tuning approach on the DeBERTaV3 model. In addition, we explore text-based evolutionary optimization for Gemma 4 workflows and show that, on this task, they achieve competitive performance with the supervised DeBERTaV3 setup.

ICB-UMA at #SMM4H–HeaRD 2026: Hybrid Clinical Entity Projection for MultiClinAI: Adaptive Candidate Windows, XGBoost, and LLM Refinement
Alvaro Rey-Blanes | Sara Giménez-Gómez | Francisco J. Veredas | Francisco J. Moreno-Barea

This paper presents our submission to the MultiClinAI Shared Task (Gallego-Donoso et al., 2026) on cross-lingual clinical entity annotation projection from Spanish to English. Our system transfers expert annotations for Disease, Symptom, and Procedure entities. The approach integrates three core components: adaptive candidate window generation, an XGBoost classifier leveraging surface and semantic features, and an LLM-based post-processing stage to resolve complex misalignments. Our highest-performing run ranked 3rd on the official leaderboard, achieving strict F1 scores of 0.737, 0.549, and 0.538 for Diseases, Symptoms, and Procedures, respectively. These results show that combining supervised candidate scoring with targeted LLM refinement provides a robust strategy for clinical entity projection.

URJC-Team at #SMM4H-HeaRD 2026: TNM Stage Extraction with a Regex-LLM Workflow
Natalia Madrueño | Jose Walter Hernández Pérez | Rubén R. Fernández | Soto Montalvo

TNM cancer staging is a critical process for characterizing tumor burden and guiding clinical decisions. Nevertheless, its automated extraction remains challenging due to the unstructured and heterogeneous nature of free-text pathology reports. This paper describes the participation of the URJC-Team in Task 6 of the Social Media Mining for Health/Health Real-World Data (#SMM4H-HeaRD) 2026 Shared Tasks. It focuses on predicting TNM staging from pathology reports. The proposed workflow combines hand-crafted regular expressions with a Large Language Model (LLM). First, explicit TNM mentions are extracted using rule-based patterns. Then, any stage not recovered by these rules is inferred by an LLM. Overall, the proposal provides competitive results across all official shared-task phases.
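
The regex-first, LLM-fallback workflow described in this abstract can be sketched as below. The patterns are deliberately simple and hypothetical (they are not the team's hand-crafted rules), and `llm_fallback` stands in for whatever model call fills the stages the rules miss.

```python
import re

# Hypothetical patterns for explicit mentions such as "pT2a N0 MX" or "pT3N1".
# The lookbehind stops a match from starting in the middle of a word.
TNM_PATTERNS = {
    "T": re.compile(r"(?<![A-Za-z])[pc]?T([0-4]|is|[xX])"),
    "N": re.compile(r"(?<![A-Za-z])[pc]?N([0-3]|[xX])"),
    "M": re.compile(r"(?<![A-Za-z])[pc]?M([01]|[xX])"),
}

def extract_tnm(report, llm_fallback=None):
    """Return {"T": ..., "N": ..., "M": ...} using rules first, then the fallback.

    llm_fallback(report, axis) is only called for axes the rules did not recover;
    without a fallback, unrecovered axes are set to None.
    """
    result = {}
    for axis, pattern in TNM_PATTERNS.items():
        match = pattern.search(report)
        if match:
            result[axis] = f"{axis}{match.group(1)}"
        elif llm_fallback is not None:
            result[axis] = llm_fallback(report, axis)
        else:
            result[axis] = None
    return result
```

Keeping the rules first means the model is only consulted for reports without an explicit stage mention, which limits both cost and the room for hallucinated stages.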

LotusOrchid at #SMM4H–HeaRD 2026: Fitting pretrained encoders for Dutch medical data
Sophie Arnoult | Shutao Chen | Piek Vossen

This paper presents our submission to MultiClinAI’s NER subtask for #SMM4H-HeaRD 2026. We focus on two questions: (1) which language model best represents the clinical notes, and (2) which annotations help in training these models. To answer them, we follow a token-based classification approach with pretrained encoder language models, comparing models pretrained on generic data against those pretrained on medical data, and models pretrained on a single language, Dutch, against multilingual ones. In addition, we present two data-augmented systems: one with data from the other languages of the workshop for multilingual training, and one with synthetic annotations.

PEI at #SMM4H-HeaRD 2026: Enhancing Patient Metadata Detection via Hypothesis-Conditioned Classification and Paraphrase-Based Data Augmentation
Farnaz Zeidi | Roman Christof | Farnoush Zeidi | Renate König | Liam Childs

This paper presents our approach to Task 5 of the #SMM4H-HeaRD 2026 Workshop, which focuses on detecting patient metadata in SARS-CoV-2 sequencing articles as a binary classification task. We explore both encoder-based and large language model (LLM) approaches, using BioM-BERT as a baseline and Mistral-Nemo as the LLM. To improve performance, we propose a paraphrase-based data augmentation pipeline using Qwen3, where paraphrased training and validation instances are added for fine-tuning. For the LLM, we perform prompt refinement and error analysis, while for the encoder-based model, we reformulate the task as a hypothesis-conditioned classification task inspired by Natural Language Inference (NLI). Our methods improve both models: Mistral-Nemo increases from 0.423 to 0.750 F1, and BioM-BERT from 0.801 to 0.821 on the validation set. Although Mistral-Nemo does not surpass BioM-BERT, our best BioM-BERT model achieves an F1-score of 0.786 on the test set, outperforming the mean and median of competing systems. To support reproducibility, we release our best-performing model on Hugging Face.

Dr-BERT-NL at #SMM4H–HeaRD 2026: DOKTERBERT – Ontology-Grounded Contextual Representations for Dutch Clinical NLP
Gijs Danoe | Andreas Voss | Axel Hamprecht | Matthijs S. Berends

We describe our submission to SMM4H-HeaRD 2026 Task 7, which asks systems to label ClinicalImpacts and SocialImpacts spans in Reddit posts about non-medical substance use. We compare four pipeline shapes built on the same DeBERTa-v3-base backbone: (i) a direct 5-class encoder with a linear-chain CRF head, (ii) a two-stage detect-then-classify pipeline that delegates span typing to an instruction-tuned LLM (Qwen2.5-7B or Gemma-3-12B, 4-bit NF4), (iii) an audit pipeline in which the same LLM verifies the encoder’s predictions, and (iv) a classical-ML variant that replaces the LLM with an SVM trained on encoder span embeddings. Across 16 configurations, the encoder-only DeBERTa-v3 + CRF configuration is the strongest single system on the official test split, reaching 45.4% strict and 54.2% relaxed F1, +8.6 / +5.3 points above a mental-roberta-base baseline. LLM audits give a small dev gain that does not transfer to test.

Vasudev Awatramani at #SMM4H-HeaRD 2026: A Two-Pass LLM Pipeline with Deterministic Rule Derivation for Interpretable Insomnia Detection in Clinical Notes
Vasudev Awatramani

We describe our system for Shared Task 2 of #SMM4H–HeaRD 2026, which targets the detection of insomnia in MIMIC-III clinical notes. We frame the task as evidence extraction followed by deterministic rule application, rather than end-to-end label prediction. Our system operates in two passes: (1) a Gemini 2.5 Flash large language model (LLM), invoked through typed prompts written in BAML, extracts structured evidence (sleep difficulties, daytime impairment, hypnotic medications) with verbatim character-level citations from each note; (2) a small Python rule engine deterministically applies the task’s published Insomnia rules (Definition 1, Definition 2, and Rules B and C) to derive the binary patient-level label, the rule-component labels, and their evidence spans. We submitted two test-set systems: a zero-shot variant and a retrieval-augmented few-shot variant that selects nearest-neighbor training notes via FAISS over a sentence-embedding index. Our zero-shot variant achieved F1 = 0.8108 on Subtask 1 (binary classification) and a label-classification micro-F1 of 0.7126 with partial-match span F1 = 0.6621 on Subtask 2, both above the across-team mean. We additionally evaluate a GEPA-optimized prompt variant on the validation split. We discuss two findings of methodological interest: (i) the few-shot variant improves Subtask 1 precision but not F1, and does not change the multi-label or span metrics on Subtask 2 in our submission; and (ii) having the deterministic rule engine consume LLM-extracted evidence (rather than asking the LLM to emit labels directly) gives strong, easily auditable behavior on a small test set.

Parallia at #SMM4H-HeaRD 2026: ClinicalAligner26AM: A Cross-Lingual Aligner for Dataset Translation; Evidences from the MultiClinCorpus Shared Task
François Remy

Word-level cross-lingual alignment is central to annotation projection, translation auditing, and cross-lingual faithfulness estimation, yet existing neural aligners are rarely adapted to specialized domains. In this paper, we introduce ClinicalAligner26AM, a large-context multilingual aligner model for biomedical and clinical text initialized from ClinicalEncoder26AM. Our training recipe is inspired by AWESoME Align. We build our soft alignment target by sharpening with Sinkhorn–Knopp optimal transport a cost matrix established for parallel clinical texts and conversations through the fusion of sentence-level, phrase-level, and token-level signals. We distill this sharpened alignment matrix directly into our student aligner, by encouraging its naive cosine-based token similarity scores to match this target. At inference time, we project source-span scores through the learned token alignment matrix and decode the longest valid high-scoring span in the target text, optionally supported by MultiClinNER predictions. We evaluate CA26AM on the MultiClinCorpus shared task, which projects Spanish clinical entity annotations into six target languages. Our two submitted systems ranked respectively first and second across all languages and entity types, with character-weighted F1 scores above 0.95 in nearly all settings.
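
Sinkhorn–Knopp iteration, the generic procedure behind the sharpening step, alternately normalizes the rows and columns of a positive kernel derived from the cost matrix. The sketch below assumes a square matrix with uniform marginals; the temperature `epsilon` and the iteration count are illustrative, not values from the paper.

```python
import math

def sinkhorn_knopp(cost, epsilon=1.0, n_iters=50):
    """Turn a square cost matrix into an approximately doubly stochastic matrix.

    Smaller epsilon gives a sharper (more one-to-one) result but slower convergence.
    """
    # Gibbs kernel: low cost becomes high affinity.
    plan = [[math.exp(-c / epsilon) for c in row] for row in cost]
    for _ in range(n_iters):
        # Normalise each row to sum to 1.
        plan = [[v / sum(row) for v in row] for row in plan]
        # Normalise each column to sum to 1.
        col_sums = [sum(col) for col in zip(*plan)]
        plan = [[v / s for v, s in zip(row, col_sums)] for row in plan]
    return plan
```

Each row of the result can be read as a soft distribution over target tokens for one source token, which is the kind of target a student aligner could be trained to match.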

Discovery@FI at #SMM4H–HeaRD 2026: Ensemble Character Classifier for Multilingual Clinical NER
Petr Zelina | Vit Novacek

We present a system for multilingual clinical named entity recognition (NER) submitted to the MultiClinNER subtask of MultiClinAI 2026, covering all seven languages and three entity classes (disease, symptom, procedure). Our approach trains one binary token classifier ensemble per entity class using cross-lingual fine-tuning of XLM-RoBERTa-large, with all languages handled jointly. We apply character-level ensembling over six models (two encoder variants × three cross-validation folds). This ensembling method provides more granular probability estimates than single-model classifiers, allowing for more flexible precision-recall trade-off tuning. The system achieves character-level F1 scores of 0.70–0.88 on the official test set.

pdf bib abs

IITPatna_ADE at #SMM4H-HeaRD 2026: Multilingual Adverse Drug Event Detection with LoRA-XLM-RoBERTa, Cross-Fold Ensembles, and Post-hoc Calibration
Sofia Jamil | Manish Singh | Harshal Dharpure | Sriparna Saha | Rajiv Misra

We describe our submission to Task 1 of #SMM4H-HeaRD 2026: multilingual binary classification of adverse drug event (ADE) mentions in social media. Our system fine-tunes xlm-roberta-large with LoRA adapters and learned language embeddings, using two-stage training (CADEC translated domain adaptation, then five-fold cross-validation on the official training set). We ensemble the five fold checkpoints by mean logits, apply temperature scaling on the development set, and tune decision thresholds to maximize the official metric. On development, the final ensemble reaches macro-F₁ 0.788 with a global threshold and 0.796 with per-language thresholds; our best official test submission achieves macro-F₁ 0.616 (ID 678990).
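
The post-processing chain named in the abstract above (mean-logit ensembling, temperature scaling on development data, threshold tuning) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the team's code: it assumes one positive-class logit per example, fits the temperature by grid search rather than gradient descent, and uses positive-class F1 as a stand-in for the official metric. All function names and the toy values are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mean_logits(fold_logits):
    """Average per-example logits across fold checkpoints."""
    n_folds = len(fold_logits)
    return [sum(values) / n_folds for values in zip(*fold_logits)]


def nll(logits, labels, temperature):
    """Mean binary negative log-likelihood of temperature-scaled logits."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = sigmoid(z / temperature)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps)
    return total / len(labels)


def fit_temperature(logits, labels):
    """Pick the temperature that minimizes development-set NLL (grid search)."""
    grid = [0.25 * k for k in range(1, 41)]  # 0.25, 0.50, ..., 10.0
    return min(grid, key=lambda t: nll(logits, labels, t))


def f1_score(preds, labels):
    """Positive-class F1 (assumed stand-in for the official metric)."""
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def tune_threshold(probs, labels):
    """Pick the decision threshold that maximizes F1 on development data."""
    grid = [k / 100 for k in range(1, 100)]
    return max(grid, key=lambda th: f1_score([int(p >= th) for p in probs], labels))


# Toy development data: three fold checkpoints, four examples (assumed values).
fold_logits = [
    [2.0, -1.0, 0.5, -3.0],
    [1.0, -2.0, 1.5, -1.0],
    [3.0, 0.0, -0.5, -2.0],
]
dev_labels = [1, 0, 1, 0]

avg = mean_logits(fold_logits)
temperature = fit_temperature(avg, dev_labels)
probs = [sigmoid(z / temperature) for z in avg]
threshold = tune_threshold(probs, dev_labels)
dev_f1 = f1_score([int(p >= threshold) for p in probs], dev_labels)
```

Per-language thresholds, as reported in the abstract, would amount to calling `tune_threshold` once per language subset instead of once globally.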

pdf bib abs

CUET_DiagNLP at #SMM4H-HeaRD 2026: Per-Axis TNM Staging from Pathology Reports and Opioid Impact Span Detection from Social Media
Shuva Dey | Priyangshu Barua | Mohammad Ashfak Habib

In this paper, we describe systems for two #SMM4H-HeaRD 2026 shared tasks. Task 6 asks for per-axis TNM cancer staging from free-text TCGA pathology reports under severe label imbalance and long-document constraints. We fine-tune GatorTron-base separately on each axis using Focal loss with class weights and a pooled [CLS]–mean representation, reaching macro F1 of 0.700 (T), 0.774 (N), and 0.640 (M) on test set 2 against a baseline of 0.454, 0.591, and 0.554 respectively. Task 7 asks for span-level detection of opioid-related ClinicalImpacts and SocialImpacts in first-person Reddit posts. We combine DeBERTa-large and PubMedBERT (two seeds each) in a uniform-weight ensemble with boundary-aware loss, entity-replacement augmentation, and a first-person post filter, achieving strict F1 of 0.51 and relaxed F1 of 0.60, above both the task mean (0.46 / 0.55) and median (0.48 / 0.58).
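
For readers unfamiliar with the loss mentioned in the abstract above, here is a minimal sketch of class-weighted focal loss for a single multi-class example. The focusing parameter `gamma=2.0`, the class weights, and the logits are assumed values for illustration; the abstract does not state the settings the team used.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, class_weights, gamma=2.0):
    """Class-weighted focal loss for one example.

    loss = -w[target] * (1 - p_target)^gamma * log(p_target)
    With gamma = 0 and unit weights this reduces to cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


unit_weights = [1.0, 1.0, 1.0]
easy_logits = [2.0, 0.0, 0.0]  # confident and correct for class 0
hard_logits = [0.0, 2.0, 0.0]  # confident and wrong for class 0

cross_entropy_easy = focal_loss(easy_logits, 0, unit_weights, gamma=0.0)
focal_easy = focal_loss(easy_logits, 0, unit_weights, gamma=2.0)
focal_hard = focal_loss(hard_logits, 0, unit_weights, gamma=2.0)
# Upweighting a rare class (assumed weight 3.0) scales its loss linearly.
focal_hard_weighted = focal_loss(hard_logits, 0, [3.0, 1.0, 1.0], gamma=2.0)
```

The factor `(1 - p_t) ** gamma` shrinks the contribution of well-classified examples, so training focuses on hard and minority-class cases, which is why this loss is a common choice under severe label imbalance.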

pdf bib abs

MedMind AI at #SMM4H-HeaRD 2026: Data Extraction and Generation Using Prompt Engineering and Structured Outputs (Tasks 1–6)
Aatish Pradhan | Brian M. Habersberger

Six tasks from the SMM4H–HeaRD 2026 workshop were addressed with task-specific large-language-model (LLM) pipelines relying on prompt engineering, strict structured (JSON) responses, and deterministic rule sets. The pipelines utilize no task-specific fine-tuning and can be adapted across diverse clinical and social media data. This study demonstrates that general-purpose LLMs (gpt-5.4-mini and gpt-5.4) can accurately extract and classify crucial health information when constrained by strict output schemas. Notably, our hybrid approach achieved the best overall performance among all participating systems for Task 2 (Insomnia Detection).

pdf bib abs

CaresAI at SMM4H-HeaRD 2026: Predicting TNM Staging
Joseph Itopa Abubakar | Jorge Jarme | Favour Igwezeke | Mary Adewunmi

The Tumor, Node, and Metastasis (TNM) staging system is critical to cancer treatment. This study aims to predict TNM stage labels independently from The Cancer Genome Atlas (TCGA) pathology reports, as posed in the sixth shared task of SMM4H-HeaRD 2026. The problem is framed as three multi-label classification tasks. We explore both classical and deep learning approaches using Term Frequency-Inverse Document Frequency (TF-IDF) features and embeddings from ClinicalBERT, BioBERT, and PubMedBERT. These representations are used with Logistic Regression (LR), Light Gradient Boosting Machine (LightGBM), Feed-Forward Neural Networks (FFNN), and Wide Residual Networks (WRN). Our results show that individual embeddings perform similarly on TNM label classification, while their combination improves predictive ability. WRN achieves AUROC scores of 0.839 (T), 0.8502 (N), and 0.803 (M) with F1-scores of 0.622, 0.702, and 0.9337, respectively, in the training phase. LightGBM with TF-IDF performs best, with AUROC scores of 0.9368 (T), 0.9524 (N), and 0.8311 (M) and F1-scores of 0.7559 (T), 0.7384 (N), and 0.7017 (M) during the training phase. Furthermore, the Codabench results show Macro-F1 scores of 0.978, 0.957, and 0.879 for the T, N, and M categories, respectively, on test set 1, and 0.807, 0.767, and 1.0 on test set 2. However, performance declined between the test sets, with the Macro-F1 score across all stages dropping from 0.938 on test set 1 to 0.858 on test set 2, suggesting limitations in model generalizability, sensitivity to class imbalance, and challenges in processing lengthy clinical documents. Although this study provides an efficient baseline model and a reproducible pipeline, further optimization and validation are required before it can be considered suitable for use in a real-world clinical setting.

pdf bib abs

Vinland_Vector at #SMM4H-HeaRD 2026: Multilingual ADE Detection and Query-Augmented Clinical NER for English
Nirjhar Das | Rathijit Aich | Mahfuzulhoq Chowdhury

In this paper, we address Task 1 on adverse drug event (ADE) detection and Task 8 on MultiClinNER at SMM4H-HeaRD 2026. ADE detection is formulated as a multilingual binary classification problem over social media posts spanning German, French, Russian, English, Mandarin and Japanese, with zero-shot on Farsi. Using XLM-RoBERTa-Large with a dual-pooling head, combined with stratified sampling, language-conditioned inputs, translation-based augmentation, and calibrated ensembling, our model achieves a macro F1 score of 0.6088, surpassing both the competition mean (0.5465) and median (0.5798). Our work in MultiClinNER targets clinical NER for English text. Using GLiNER-large with sliding-window inference, query augmentation, and calibrated thresholds, it achieves strict F1 scores of 0.7591 (Disease), 0.7263 (Procedure), and 0.6733 (Symptom), outperforming a PubMedBERT baseline across all entities.

pdf bib abs

SIEMENS at #SMM4H–HeaRD 2026: The Impact of Training Strategy and Backbone Selection on BERT-based Multilingual Clinical NER
Manuela Daniela Danu

This paper describes our participation in the MultiClinNER subtask of the MultiClinAI shared task, part of the #SMM4H-HeaRD Workshop at ACL 2026. The task requires identifying DISEASE, SYMPTOM, and PROCEDURE mentions in clinical case reports across seven languages: Czech, Dutch, English, Italian, Romanian, Spanish, and Swedish. We compare two BERT-based sequence labeling methods: (i) sentence-level token classification with a fixed train/validation split, and (ii) paragraph-level chunking with 5-fold cross-validation and checkpoint merging, using language-specific BERT models and multilingual XLM-RoBERTa-large as backbones. Our results show that 5-fold training with checkpoint merging consistently outperforms the fixed split strategy, with further analysis suggesting that the gains are primarily driven by improved training-set coverage rather than by differences in input granularity. Language-specific BERT encoders prove most effective for Spanish and English, while XLM-RoBERTa-large yields the strongest results for the remaining five languages through cross-lingual transfer.

pdf bib abs

HALELab-NITK at #SMM4H-HeaRD2026: Inclusion of Feature Engineering for Detection of Patient Metadata in SARS-CoV2 Sequencing Articles
Aakarsh Bansal | Abhishek Srinivas | Sowmya Kamath S.

This article presents a system description for our work as part of Task 5 of the SMM4H-HeaRD 2026 workshop. We fine-tune pretrained BERT and BiomedBERT models and further enhance them using custom feature augmentation techniques. Incorporating these engineered features results in improved performance, with the best model achieving a validation F1 score of 0.8419 and an evaluation phase F1 score of 0.753.

pdf bib abs

Cuet_Data_Wizards at #SMM4H-HeaRD 2026: Multilingual ADE Detection and Influenza Vaccine Effectiveness Estimation from Social Media
Abir Dey | Mohammed Omar Faiaz | Muhammad Ibrahim Khan

We present our systems for Task 1 and Task 3 of the #SMM4H-HeaRD 2026 shared tasks. Task 1 focuses on binary classification of adverse drug event (ADE) mentions across seven languages, including a zero-shot Persian setting without labeled training data. We fine-tune XLM-RoBERTa-large using weighted cross-entropy loss and augment low-resource settings with additional CADEC data and machine translation-based Persian augmentation. Our system achieves a macro F1 score of 0.582, outperforming the shared task average of 0.547. Task 3 addresses influenza vaccine effectiveness estimation through classification of vaccination status and flu-test results from X posts. We fine-tune twitter-roberta-large, achieving micro F1 scores of 0.845 for vaccination status and 0.883 for flu-test classification on the official test set. Post-evaluation experiments with focal loss, test-time augmentation, and head-tail truncation further improve performance. These results highlight the effectiveness of robust transformer adaptation for health-related social media classification.

pdf bib abs

Limics at #SMM4H-HeaRD 2026: Uncertainty-Driven Prediction for ADE Detection in Social Media
Nour Allam

This paper describes our system for the SMM4H-HeaRD 2026 Task 1: Detection of Adverse Drug Events in Multilingual and Multi-platform Social Media Posts. We developed a two-stage pipeline combining a fine-tuned XLM-RoBERTa-large encoder-only model with a large language model for final decision on ambiguous cases. To handle complex linguistic boundaries, we explore explicitly training the encoder to treat ambiguity as a discrete third label to delegate those cases to the generative model. Although introducing the third label was associated with lower performance than relying on a binary model, when using the encoder as a preliminary filter for classifying 78.62% of posts as negatives, we achieved a global F₁ score of 0.614 (+0.034 over task median).

pdf bib abs

FU-HU-P5 at #SMM4H-HeaRD 2026: MedSynth Dialogue-to-Note Generation
Jessica Ying En Wong

This paper describes our system for Shared Task 4 of the #SMM4H-HeaRD 2026 Workshop, in which a given doctor-patient dialogue is summarized into a clinical note in the corresponding SOAP format. Our proposed solution combines semi-supervised learning with parameter-efficient fine-tuning (PEFT) applied to a lightweight pre-trained QWEN3.5 model. Our model delivers competitive performance relative to its parameter count and generalizes to the unseen test dataset.

pdf bib abs

ACSS-PSL at #SMM4H-HeaRD 2026: An LLM-Driven Autoresearch Loop for Opioid-Impact NER
Olivier Caron | Bruno Chaves Ferreira | Christophe Benavent

We apply an LLM-driven autoresearch protocol to Task 7 of #SMM4H-HeaRD 2026, which requires extracting ClinicalImpacts and SocialImpacts spans from Reddit posts about non-medical opioid use. A coding agent iteratively proposes a hypothesis, modifies the training configuration, and evaluates against the held-out validation set. Across 79 runs, only 9 improved strict F1, indicating a narrow viable search space on this small dataset (842 training examples). The submitted ensemble combines DeBERTa-large, MC Dropout blending, and a constrained multi-LLM consensus layer, reaching 0.46 strict and 0.52 relaxed F1 on test, though single-seed evaluation limits the reliability of run-level comparisons. The run log provides a reproducible case study of autonomous experimentation, including failure modes and guardrails for small-data NER.

pdf bib abs

Creative Catalysts at #SMM4H-HeaRD 2026: XLM-RoBERTa for Task 1 Binary Classification of Social Media Posts Containing Adverse Drug Events
Radja Afren | Hichem Rahab | Imane Guellil

Automatic detection of adverse drug events (ADEs) from social media posts has become an important task for healthcare systems working with real-world, patient-collected data. This work, by team Creative Catalysts, addresses ADE detection in user-generated content for Task 1 of the Social Media Mining for Health Research and Applications Workshop (SMM4H 2026). We fine-tuned XLM-RoBERTa, a pre-trained model chosen for its robustness in handling the multilingual content and linguistic diversity common in social media text. To better handle the class imbalance, we subsequently implemented a class-weighting strategy to increase the model’s focus on the underrepresented positive class. This adjusted model improved the validation F1-score to 65%. Our results demonstrate the effectiveness of transformer-based architectures for ADE detection while highlighting the critical need for robust class-balancing techniques and multilingual generalization to handle real-world, imbalanced social media data.

pdf bib abs

BioNLP at #SMM4H-HeaRD 2026 Task 3 Estimating Flu Vaccine Effectiveness: A Temporal-Aware Fine-Tuning and Similarity-Based Few-Shot Prompting Approach
Irina Patularu

This paper presents our systems for the SMM4H 2026 shared task on flu-related tweet classification across two subtasks: flu vaccination status and flu test outcome classification. For each subtask, we evaluate two approaches: fine-tuning BERTweet-large with a temporal-aware architecture, cross-validation ensembling, and regularization techniques, and a GPT-4o few-shot prompting system with similarity-based dynamic example retrieval, chain-of-thought reasoning and contrastive label ranking. Fine-tuning proves superior for the flu vaccination subtask (micro-F1: 87.90%), where sufficient and relatively balanced training data is available, while few-shot prompting performs better for the flu test subtask (micro-F1: 95.74%), where limited and heavily imbalanced training data renders fine-tuning less effective.

pdf bib abs

Infimobius at #SMM4H-HeaRD 2026: Multi-Seed DeBERTa Ensemble for Flu Vaccination and Testing Status Classification
Pradyumn Kejriwal | Suhani Singh Charan | Raksha Sharma | Rudra Murthy

This paper describes FluENS (Flu ENsemble System), our submission to the Social Media Mining for Health (SMM4H) 2026 Shared Task 3, which targets fine-grained classification of flu vaccination and flu testing statuses from tweets. FluENS builds on the microsoft/deberta-v2-xlarge pre-trained language model and employs a multi-seed ensemble strategy in which five models, each initialized with a different random seed and trained on the full training set, are aggregated through soft-voting over averaged softmax probabilities. We additionally incorporate balanced class weights to mitigate severe label imbalance and apply a two-stage learning rate schedule that separately controls the encoder and classification head. On the development set, FluENS achieves a macro F1 of 79.64% and micro F1 of 85.56% on the flu vaccination sub-task, and a macro F1 of 96.35% and micro F1 of 97.04% on the flu testing sub-task, substantially outperforming a roberta-base baseline across all metrics.

pdf bib abs

Thunderbolts at #SMM4H-HeaRD 2026: Detection of Insomnia in Clinical Notes using Transformers
Guddanti Venkata Sree Charan | Nama_Ss@Cs.Iitr.Ac.In Nama_Ss@Cs.Iitr.Ac.In | Raksha Sharma | Rudra Murthy

We present the SuSh system for Subtask 1 of the MultiClinAI shared task at the 11th SMM4H and HeaRD Workshop (ACL 2026), which addresses multilingual clinical named entity recognition (NER) across seven languages. Our system adopts a fully zero-shot approach using GLiNER-biomed-large-v1.0, a span-based NER model pre-trained on biomedical text, requiring no task-specific fine-tuning or labeled data in target languages. We apply a character-level sliding window strategy to handle long clinical documents that exceed the model’s token limit and incorporate a post-processing pipeline including threshold optimization via F1-max sweep, entity-specific gazetteer lookup derived from DisTEMIST and SympTEMIST terminology lists, span boundary correction, and negation filtering. Our official submission achieves a Strict F1 of 0.5175, Strict Precision of 0.5536, Strict Recall of 0.4859, and CHR F1 of 0.6130 on the English disease subtask, demonstrating that domain-adapted zero-shot biomedical NER models can serve as competitive baselines for multilingual clinical entity recognition without any task-specific training data.
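
The character-level sliding window strategy mentioned in the abstract above can be sketched as follows. The window and stride sizes, the helper names, and the `find_term` stand-in predictor are hypothetical; a real system would call an NER model on each chunk instead of a string search.

```python
def sliding_windows(text, window=1000, stride=800):
    """Yield (offset, chunk) pairs that cover text with overlapping windows."""
    if not 0 < stride <= window:
        raise ValueError("stride must be positive and no larger than window")
    start = 0
    while True:
        yield start, text[start:start + window]
        if start + window >= len(text):
            break
        start += stride


def find_term(chunk, term, label):
    """Hypothetical stand-in for an NER model: exact matches of one term."""
    spans, pos = [], chunk.find(term)
    while pos != -1:
        spans.append((pos, pos + len(term), label))
        pos = chunk.find(term, pos + 1)
    return spans


def merge_spans(window_predictions):
    """Shift window-local spans to document offsets and drop duplicates
    produced by overlapping windows."""
    merged = set()
    for offset, spans in window_predictions:
        for start, end, label in spans:
            merged.add((offset + start, offset + end, label))
    return sorted(merged)


text = "a" * 10 + "fever" + "b" * 10
predictions = [
    (offset, find_term(chunk, "fever", "SYMPTOM"))
    for offset, chunk in sliding_windows(text, window=12, stride=4)
]
entities = merge_spans(predictions)
```

The overlap (window minus stride) matters: with no overlap, a mention that straddles a window boundary would be cut in two and missed, while with overlap the same mention may be found twice and must be de-duplicated after offsets are shifted back to document coordinates.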

pdf bib abs

Team TIET at #SMM4H-HeaRD 2026: Fine-tuned Biomedical Transformers with Language-Balanced Sampling for Patient Metadata and Multilingual ADE Detection
Divrose Kaur | Jatin Bedi | Jasmeet Singh

We present Team TIET’s systems for two shared tasks at #SMM4H-HeaRD 2026: Task 5 (detection of patient metadata in SARS-CoV-2 sequencing papers) and Task 1 (multilingual adverse drug event detection across six languages plus an unseen Farsi subset). For Task 5 we explore iterative LLM prompting followed by fine-tuning BiomedBERT-base with weighted cross-entropy loss and probability threshold optimization, achieving F1 = 0.760 on the official test set (above the competition mean of 0.729). For Task 1 we fine-tune XLM-RoBERTa-base with a combined language- and class-balanced sampling strategy and per-language threshold tuning, achieving macro F1 = 0.497 overall (0.608 excluding the unseen Farsi subset). We report empirical findings on BERT+LLM ensemble failure with bimodal probability distributions, the superiority of base over large model variants under limited data, and the importance of language-balanced gradient contribution in multilingual classification.

pdf bib abs

MetaMiners at SMM4H-HeaRD 2026: A Semantic-Structural Knowledge-Enriched Ensemble for SARS-CoV-2 Metadata Identification
Claudia-Alexandra Ursu | Alecsandru-Florin Soare

This paper presents a hybrid solution for binary classification of PubMed articles, aimed at identifying reports that associate clinical metadata with SARS-CoV-2 genomic sequences. The system is designed to capture the subtle distinction between reports of sequence-associated patient metadata and sentences where such metadata is unrelated, irrelevant, or linked to previous studies. The biggest challenge is that the dataset is highly imbalanced, with only 13.3% of medical reports labeled positive. Our hybrid system combines four approaches: dual-evidence tagging, negation-aware suppression, semantic frame extraction, and adversarial training. These approaches were tested on multiple models (BiomedBERT-base-abstract, BioLinkBERT-large, PubMedBERT-base-fulltext), followed by a best-subset ensemble search that obtained an F1 score of 0.792, setting a new benchmark and placing the solution first in the competition.

pdf bib abs

No_gmail at #SMM4H-HeaRD 2026: Detecting Patient Metadata in COVID-19 Scientific Literature: A Comparative Study of Encoder-Only and Autoregressive Language Models
Stefanescu Anastasia

Identifying sentences in COVID-19 literature that report patient metadata is an important step in genomic epidemiology, currently requiring costly manual curation. We compare fine-tuned encoder-only models (BERT, BioLinkBERT) and autoregressive LLMs (Llama, Gemma, GPT-OSS) under prompting and fine-tuning regimes, using Focal Loss and undersampling to address severe class imbalance. Encoder-only models substantially outperform autoregressive models: BioLinkBERT-base with Focal Loss achieves macro F1 of 0.76, versus 0.54 for the best fine-tuned autoregressive model.

pdf bib abs

Understanding the Sociocultural Dimensions of Mental Health Discourse in Arabic X Communities
Amal Abdullah Alqahtani | Rana Aref Salama | Mona T. Diab

Computational mental health research has predominantly centered on English-speaking populations, leaving Arabic-language discourse comparatively under-examined. We present an exploratory computational study of 8,147 tweets from 607 users classified by a GPT-4.1 personal-disclosure pipeline as likely lived-experience authors in three condition-specific Arabic-language X (formerly Twitter) Communities. We focus on discourse related to borderline personality disorder (BPD), bipolar disorder, and ADHD, and characterize community-associated linguistic patterns using a multi-domain cultural keyword framework. The results suggest that in this corpus, Bipolar tweets contain more religious and medical vocabulary, BPD tweets contain more relational, identity, and emotional-distress vocabulary, and ADHD tweets more often focus on practical symptoms and medication management. We treat these patterns as hypothesis-generating rather than confirmatory because the corpus is imbalanced across conditions, some subcorpora are temporally concentrated, and the keyword framework is an initial operationalization rather than a validated measurement instrument. The paper contributes a reusable LLM-assisted personal-disclosure pipeline and an exploratory cultural keyword framework for Arabic mental health discourse.

pdf bib abs

Team Paradise at #SMM4H-HeaRD 2026: Multi-Task Approaches for Social Media Health Mining
Dhruv Goyal | Ishita Gupta | Jatin Bedi

We present Team Paradise’s systems for three tasks in the SMM4H-HeaRD 2026 shared task: multilingual adverse drug event detection (Task 1), influenza vaccine effectiveness estimation via two-subtask classification (Task 3), and opioid impact span extraction (Task 7). For Task 1, threshold-only ablation on XLM-RoBERTa-large achieves a macro-F1 of 0.597, exceeding the field mean (0.547) by +0.050. For Task 3, a three-stage hybrid pipeline combining twitter-RoBERTa-base-2022 with rule-based post-processing achieves Micro-F1 0.8434 (Subtask 1: vaccination status) and 0.8936 (Subtask 2: test results). For Task 7, RoBERTa-large with CRF decoding and sliding-window inference obtains relaxed F1 0.60 despite severe train-test distributional shift. Across tasks, we identify class imbalance, temporal ambiguity, and platform heterogeneity as central challenges.

pdf bib abs

The MultiClinAI Shared Task on Multilingual Clinical Corpus Construction and Concept Extraction: Systems, Evaluation, and Datasets
Fernando Gallego Donoso | Salvador Lima-Lopez | Judith Rosell | Eulàlia Farré-Maduel | Martin Krallinger

We present an overview of the MultiClinAI shared task, which focuses on multilingual clinical entity extraction and automatic corpus generation through annotation projection. It addresses two key challenges in clinical natural language processing (NLP): (i) developing comparable multilingual named entity recognition (NER) systems and (ii) automatically constructing multilingual clinical corpora through annotation projection. The MultiClinAI task provides a unified benchmark for evaluating multilingual and cross-lingual clinical NLP approaches that cover diseases, symptoms, and procedures in Spanish, English, Dutch, Italian, Romanian, Swedish, and Czech. A total of 21 teams from 13 countries participated, submitting 531 runs across the different subtasks. The top runs obtained very competitive results, close to human expert annotation quality. The results highlight both the challenges and opportunities of multilingual clinical information extraction. All resources, including a corpus of over 738,201 manually revised entity mentions across seven languages, are publicly available on Zenodo at: https://zenodo.org/records/19334278.

pdf bib abs

Overview of #SMM4H-HeaRD 2026 – Task 6: Predicting TNM staging from pathology reports
Jose Miguel Acitores Cortina | Jacob S. Berkowitz | Nadine A. Friedrich | Nicholas P Tatonetti

This paper provides an overview of Task 6 from the Social Media Mining for Health/Health Real-World Data shared task (#SMM4H-HeaRD 2026), which focused on predicting TNM staging from pathology reports from TCGA. Seven teams submitted systems spanning fine-tuned clinical encoders, open-source generative LLMs, and closed-source API models. On a straightforward test set, most teams achieved near-perfect F1 scores (average 0.993, 0.972, and 0.957 for T, N, and M). However, on a harder tiebreak set where explicit TNM notation was removed and staging had to be inferred from clinical descriptions, performance dropped substantially (average 0.725, 0.783, and 0.846). Notably, the two teams using large closed-source API models generalized best to the harder set, achieving the highest T and N scores despite not leading on the easy set. These results suggest that while fine-tuned domain-specific encoders excel at surface-level extraction, larger general-purpose LLMs may be more robust when staging must be inferred from contextual clinical findings. All teams surpassed baseline overall performance on both test sets.

pdf bib abs

NoviceTrio in #SMM4H-HeaRD 2026: Hybrid Clinical Transformer Ensembles for Insomnia Detection and Evidence Extraction from Clinical Notes
Abir Naskar | Mike Conway

We present two systems for the #SMM4H-HeaRD 2026 Task 2 shared task of automated insomnia detection from clinical notes. Our system addresses both subtasks: (1) binary insomnia classification and (2) multi-label rule prediction with evidence span extraction. For Subtask 1, we employ an ensemble architecture combining Qwen3-4B-Instruct and Bio_ClinicalBERT to capture both general semantic reasoning and domain-specific clinical representations. The framework utilizes chunk-based processing with overlapping token windows to handle long clinical notes efficiently. For Subtask 2, we develop a dual-head multi-task transformer model that jointly predicts insomnia labels and token-level evidence spans using BIO tagging. To improve clinical relevance, we additionally incorporate sentence-level filtering using sentence-transformer embeddings and similarity-based retrieval of insomnia-related contexts. Experimental results suggest competitive performance relative to the shared task mean and median scores across both subtasks. Our best Subtask 1 system achieves a recall of 0.9474, surpassing the shared-task mean and median recall, while our Subtask 2 system exceeds the mean and median scores across label classification, exact match, and partial match metrics. The end-to-end implementation is publicly available on GitHub.

pdf bib abs

Overview of #SMM4H-HeaRD 2026 - Task 2: Detection of Insomnia in Clinical Notes
Joey Chan | Lauren D. Gryboski | Guillermo Lopez-Garcia | Graciela Gonzalez-Hernandez

This paper provides an overview of Task 2 from the Social Media Mining for Health and Health Real-World Data (#SMM4H-HeaRD) 2026 Workshop and Shared Tasks, which focused on the detection of insomnia in clinical notes derived from the MIMIC-III dataset. The task consisted of two subtasks: binary text classification to determine whether a patient is likely experiencing insomnia (Subtask 1), and multi-label classification combined with character-level evidence extraction to identify supporting evidence for specific insomnia criteria (Subtask 2). Eight teams participated, using approaches ranging from large language model (LLM) prompting and fine-tuned encoder models to hybrid rule-based pipelines. Results demonstrated that structured LLM pipelines with deterministic post-processing achieved the strongest overall performance, while character-level span extraction remained substantially harder than classification across all systems. These findings highlight both the promise of NLP for identifying underdiagnosed conditions in electronic health records and the ongoing difficulty of producing interpretable, evidence-grounded clinical predictions.

The aim of the Social Media Mining for Health Applications and Health Real-World Data (#SMM4H-HeaRD) shared tasks is to foster the development and evaluation of natural language processing, machine learning, and artificial intelligence methods for analyzing health-related text from social media and other real-world data sources. For the 11th iteration, held online and co-located with ACL 2026, the workshop continued the expanded #SMM4H-HeaRD platform initiated in 2025, broadening its scope beyond social media to include additional health real-world data sources such as clinical narratives and biomedical literature. The 8 shared tasks covered diverse data sources, health domains (e.g., adverse drug events, insomnia, influenza vaccine effectiveness, cancer staging, substance use), and task formulations (e.g., classification, named entity recognition, span extraction, and text generation). In total, 110 teams registered, representing 31 countries. In this paper, we present an overview of the datasets, participant systems, and performance results, providing insights into current methods for mining social media and health real-world data for biomedical and clinical applications.