VariantBench: A Framework for Evaluating LLMs on Justifications for Genetic Variant Interpretation

Humair Basharat, Simon Plotkin, Charlotte Le, Kevin Zhu, Michael Pink, Isabella Alfaro


Abstract
Accurate classification in high-stakes domains requires not only correct predictions but transparent, traceable reasoning. We instantiate this need in clinical genomics and present VariantBench, a reproducible benchmark and scoring harness that evaluates both the final American College of Medical Genetics and Genomics/Association for Molecular Pathology (ACMG/AMP) labels and criterion-level reasoning fidelity for missense single-nucleotide variants (SNVs). Each case pairs a variant with deterministic, machine-readable evidence aligned to five commonly used criteria (PM2, PP3, PS1, BS1, BA1), enabling consistent evaluation of large language models (LLMs). Unlike prior work that reports only final labels, our framework scores the correctness and faithfulness of per-criterion justifications against numeric evidence. On a balanced 100-variant freeze, Gemini 2.5 Flash and GPT-4o outperform Claude 3 Opus on label accuracy and criterion detection, and both improve materially when the decisive PS1 cue is provided explicitly. Error analyses show models master population-frequency cues yet underuse high-impact rules unless evidence is unambiguous. VariantBench delivers a substrate to track such improvements and compare prompting, calibration, and aggregation strategies in genomics and other rule-governed, safety-critical settings.
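The abstract describes scoring per-criterion justifications against deterministic, machine-readable evidence. A minimal sketch of how such a harness might work is below; the function names, evidence keys, and thresholds are illustrative ACMG-style cutoffs chosen for this example, not the paper's exact rules or code.

```python
# Illustrative sketch: map numeric evidence to the five criteria the
# benchmark uses (PM2, PP3, PS1, BS1, BA1), then score a model's
# claimed criteria against that evidence-derived ground truth.
# All thresholds and field names are assumptions for this sketch.

def derive_criteria(evidence):
    """Derive the set of criteria supported by machine-readable evidence."""
    met = set()
    af = evidence.get("popmax_af", 0.0)
    if af == 0.0:
        met.add("PM2")   # absent from population databases
    if af > 0.05:
        met.add("BA1")   # stand-alone benign: very high population frequency
    elif af > 0.01:
        met.add("BS1")   # frequency higher than expected for the disorder
    if evidence.get("in_silico_score", 0.0) >= 0.7:
        met.add("PP3")   # computational evidence supports a deleterious effect
    if evidence.get("same_aa_change_as_known_pathogenic", False):
        met.add("PS1")   # same amino-acid change as an established pathogenic variant
    return met

def score_justification(model_criteria, evidence):
    """Score the criteria a model invoked against evidence-derived truth."""
    truth = derive_criteria(evidence)
    claimed = set(model_criteria)
    tp = len(claimed & truth)
    precision = tp / len(claimed) if claimed else 1.0
    recall = tp / len(truth) if truth else 1.0
    return {"precision": precision, "recall": recall}
```

For example, a novel variant (`popmax_af == 0.0`) with a high in-silico score and a known pathogenic amino-acid match yields the truth set `{PM2, PP3, PS1}`; a model that cites only PM2 and PP3 would score perfect precision but miss the decisive PS1 cue, matching the underuse of high-impact rules the abstract reports.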
Anthology ID: 2025.ijcnlp-srw.26
Volume: The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Month: December
Year: 2025
Address: Mumbai, India
Editors: Santosh T.y.s.s, Shuichiro Shimizu, Yifan Gong
Venue: IJCNLP
Publisher: Association for Computational Linguistics
Pages: 314–321
URL: https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-srw.26/
Cite (ACL):
Humair Basharat, Simon Plotkin, Charlotte Le, Kevin Zhu, Michael Pink, and Isabella Alfaro. 2025. VariantBench: A Framework for Evaluating LLMs on Justifications for Genetic Variant Interpretation. In The 14th International Joint Conference on Natural Language Processing and The 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 314–321, Mumbai, India. Association for Computational Linguistics.
Cite (Informal):
VariantBench: A Framework for Evaluating LLMs on Justifications for Genetic Variant Interpretation (Basharat et al., IJCNLP 2025)
PDF: https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.ijcnlp-srw.26.pdf