A Multi-Agent Open-Source LLM for Structured Cancer Registry Information Extraction from Pathology and Medical Reports

Abdulrahman Aal Abdulsalam; Adhari Al Zaabi; Riham Jeeballah; Habiba El Keraby

Abstract

Extracting structured cancer registry information from pathology and medical reports is challenging due to heterogeneous reporting styles and implicit clinical reasoning. We propose a modular multi-agent framework that decomposes registry abstraction into semantic chunking, retrieval, field-specific extraction, validation, evaluation, and aggregation stages. The dataset includes 818 annotated cancer cases from Sultan Qaboos University Hospital. Evaluation in this study focuses on breast (n=454) and colorectal (n=174) reports across grade, morphology, TNM staging, and laterality extraction tasks. The framework is compared against prompt-based LLaMA 3.3 baselines using accuracy and weighted/macro F1-score metrics. The proposed framework improved performance in context-dependent tasks, particularly grade extraction, where weighted F1-score increased from 0.71 to 0.78 for breast cancer and from 0.56 to 0.67 for colorectal cancer. Improvements were also observed for colorectal laterality extraction. For other extraction tasks, particularly highly structured tasks such as TNM staging and morphology extraction, the multi-agent framework achieved performance comparable to direct prompting. Although the baseline achieved slightly higher average weighted F1-scores overall, the proposed framework provides improved modularity, traceability, and pipeline-level interpretability through explicit intermediate reasoning stages, supporting error analysis and future clinician-guided refinement.
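The abstract reports both weighted and macro F1-scores. As a minimal sketch of how these two averages differ, the following computes them from scratch on toy grade labels. The labels and function names here are illustrative assumptions and are not taken from the paper's data or code.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's support in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy labels (hypothetical, for illustration only).
y_true = ["G1", "G2", "G2", "G2", "G3", "G3"]
y_pred = ["G1", "G2", "G2", "G3", "G3", "G2"]

print(round(macro_f1(y_true, y_pred), 3))     # 0.722
print(round(weighted_f1(y_true, y_pred), 3))  # 0.667
```

Macro F1 treats rare and frequent classes alike, while weighted F1 is dominated by the frequent ones, which is why reporting both is informative when class frequencies are imbalanced.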

Anthology ID: 2026.bionlp-1.43
Volume: BioNLP 2026
Month: July
Year: 2026
Address: San Diego, California
Editors: Dina Demner-Fushman, Sophia Ananiadou, Kirk Roberts, Junichi Tsujii
Venues: BioNLP | WS
Publisher: Association for Computational Linguistics
Pages: 531–551
URL: https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.43/
Cite (ACL): Abdulrahman Aal Abdulsalam, Adhari Al Zaabi, Riham Jeeballah, and Habiba El Keraby. 2026. A Multi-Agent Open-Source LLM for Structured Cancer Registry Information Extraction from Pathology and Medical Reports. In BioNLP 2026, pages 531–551, San Diego, California. Association for Computational Linguistics.
Cite (Informal): A Multi-Agent Open-Source LLM for Structured Cancer Registry Information Extraction from Pathology and Medical Reports (Aal Abdulsalam et al., BioNLP 2026)
PDF: https://preview.aclanthology.org/ingest-acl-workshops/2026.bionlp-1.43.pdf
