Bruno Chaves Ferreira


2026

We apply an LLM-driven autoresearch protocol to Task 7 of #SMM4H-HeaRD 2026, which requires extracting ClinicalImpacts and SocialImpacts spans from Reddit posts about non-medical opioid use. In each iteration, a coding agent proposes a hypothesis, modifies the training configuration, and evaluates the result on the held-out validation set. Only 9 of 79 runs improved strict F1, which suggests a narrow viable search space on this small dataset (842 training examples). The submitted ensemble combines DeBERTa-large, MC Dropout blending, and a constrained multi-LLM consensus layer, and reaches 0.46 strict and 0.52 relaxed F1 on the test set. Because each run was evaluated with a single seed, run-level comparisons should be read with caution. The run log provides a reproducible case study of autonomous experimentation, including failure modes and guardrails for small-data NER.
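
To make the gap between the strict and relaxed scores concrete, the following minimal sketch computes span-level F1 both ways. It rests on assumptions that are not taken from the task's official scorer: spans are hypothetical (start, end, label) character offsets with an exclusive end, strict matching requires identical boundaries and label, and relaxed matching counts any overlap between spans with the same label. The function name span_f1 and the example offsets are illustrative only.

```python
from typing import List, Tuple

# Hypothetical span format: (start, end_exclusive, label).
Span = Tuple[int, int, str]


def _overlaps(a: Span, b: Span) -> bool:
    """True if two spans share a label and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def span_f1(gold: List[Span], pred: List[Span], relaxed: bool = False) -> float:
    """Span-level F1 under an assumed strict or relaxed matching rule."""
    gold = list(set(gold))  # drop duplicate annotations
    pred = list(set(pred))
    if relaxed:
        # A predicted span is correct if it overlaps any gold span;
        # a gold span is recovered if any predicted span overlaps it.
        tp_pred = sum(any(_overlaps(p, g) for g in gold) for p in pred)
        tp_gold = sum(any(_overlaps(g, p) for p in pred) for g in gold)
    else:
        # Strict: boundaries and label must match exactly.
        tp_pred = tp_gold = len(set(gold) & set(pred))
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative offsets, not taken from the dataset.
gold = [(10, 25, "ClinicalImpacts"), (40, 52, "SocialImpacts")]
pred = [
    (10, 25, "ClinicalImpacts"),  # exact match
    (42, 52, "SocialImpacts"),    # boundary off by two characters
    (60, 70, "SocialImpacts"),    # spurious span
]

strict_score = span_f1(gold, pred)
relaxed_score = span_f1(gold, pred, relaxed=True)
```

In this toy case the boundary error counts as a miss under strict matching and as a hit under relaxed matching, so the relaxed score is higher. The same effect would explain a relaxed F1 above the strict F1 when predictions locate the right region but not the exact boundaries.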