Rubén R. Fernández
2026
URJC-Team at #SMM4H-HeaRD 2026: TNM Stage Extraction with a Regex-LLM Workflow
Natalia Madrueño | Jose Walter Hernández Pérez | Rubén R. Fernández | Soto Montalvo
Proceedings of the 11th Social Media Mining for Health Research and Applications (SMM4H-HeaRD 2026) Workshop and Shared Tasks
Natalia Madrueño | Jose Walter Hernández Pérez | Rubén R. Fernández | Soto Montalvo
Proceedings of the 11th Social Media Mining for Health Research and Applications (SMM4H-HeaRD 2026) Workshop and Shared Tasks
TNM cancer staging is a critical process for characterizing tumor burden and guiding clinical decisions. Nevertheless, its automated extraction remains challenging due to the unstructured and heterogeneous nature of free-text pathology reports. This paper describes the participation of the URJC-Team in Task 6 of the Social Media Mining for Health/Health Real-World Data (#SMM4H-HeaRD) 2026 Shared Tasks. It focuses on predicting TNM staging from pathology reports. The proposed workflow combines hand-crafted regular expressions with a Large Language Model (LLM). First, explicit TNM mentions are extracted using rule-based patterns. Then, any stage not recovered by these rules is inferred by an LLM. Overall, the proposal provides competitive results across all official shared-task phases.