Stefanescu Anastasia

2026

No_gmail at #SMM4H-HeaRD 2026: Detecting Patient Metadata in COVID-19 Scientific Literature: A Comparative Study of Encoder-Only and Autoregressive Language Models
Stefanescu Anastasia
Proceedings of the 11th Social Media Mining for Health Research and Applications (SMM4H-HeaRD 2026) Workshop and Shared Tasks

Identifying sentences in COVID-19 literature that report patient metadata is an important step in genomic epidemiology, but it currently requires costly manual curation. We compare fine-tuned encoder-only models (BERT, BioLinkBERT) with autoregressive LLMs (Llama, Gemma, GPT-OSS) under prompting and fine-tuning regimes, using Focal Loss and undersampling to address severe class imbalance. Encoder-only models substantially outperform autoregressive models: BioLinkBERT-base with Focal Loss achieves a macro F1 of 0.76, versus 0.54 for the best fine-tuned autoregressive model.
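For readers unfamiliar with Focal Loss, the following is a minimal sketch of the standard binary form, FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t), written in plain Python. It is an illustration only, not the authors' implementation: the function name `focal_loss` and the default values `gamma=2.0` and `alpha=0.25` are assumptions and are not taken from the paper.

```python
import math


def focal_loss(p_pos, label, gamma=2.0, alpha=0.25):
    """Binary focal loss for a single example (illustrative sketch).

    p_pos : predicted probability of the positive class, in [0, 1]
    label : 1 for the positive class, 0 for the negative class
    gamma : focusing parameter (assumed default, not from the paper)
    alpha : weight on the positive class (assumed default, not from the paper)
    """
    # Probability the model assigned to the true class.
    p_t = p_pos if label == 1 else 1.0 - p_pos
    # Class weight for the true class.
    alpha_t = alpha if label == 1 else 1.0 - alpha
    # Clamp to avoid log(0).
    p_t = min(max(p_t, 1e-12), 1.0)
    # The (1 - p_t)^gamma factor shrinks the loss on well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def mean_focal_loss(probs, labels, gamma=2.0, alpha=0.25):
    """Average focal loss over a batch of (probability, label) pairs."""
    losses = [focal_loss(p, y, gamma, alpha) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy. Larger `gamma` values down-weight confidently correct predictions, which is why the loss is commonly used when one class heavily outnumbers the other.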

