Nicholas P Tatonetti


2026

This paper provides an overview of Task 6 from the Social Media Mining for Health/Health Real-World Data shared task (#SMM4H-HeaRD 2026), which focused on predicting TNM staging from pathology reports from TCGA. Seven teams submitted systems spanning fine-tuned clinical encoders, open-source generative LLMs, and closed-source API models. On a straightforward test set, most teams achieved near-perfect F1 scores (average 0.993, 0.972, and 0.957 for T, N, and M). However, on a harder tiebreak set where explicit TNM notation was removed and staging had to be inferred from clinical descriptions, performance dropped substantially (average 0.725, 0.783, and 0.846). Notably, the two teams using large closed-source API models generalized best to the harder set, achieving the highest T and N scores despite not leading on the easy set. These results suggest that while fine-tuned domain-specific encoders excel at surface-level extraction, larger general-purpose LLMs may be more robust when staging must be inferred from contextual clinical findings. All teams surpassed baseline overall performance on both test sets.

2024

SMM4H 2024 Task 1 is focused on the identification of standardized Adverse Drug Events (ADEs) in tweets. We introduce a novel Retrieval-Augmented Generation (RAG) method, leveraging the capabilities of Llama 3, GPT-4, and the SFR-embedding-mistral model, along with few-shot prompting techniques, to map colloquial tweet language to MedDRA Preferred Terms (PTs) without relying on extensive training datasets. Our method achieved competitive performance, with an F1 score of 0.359 in the normalization task and 0.392 in the named entity recognition (NER) task. Notably, our model demonstrated robustness in identifying previously unseen MedDRA PTs (F1=0.363) greatly surpassing the median task score of 0.141 for such terms.