Evaluating Health Question Answering Under Readability-Controlled Style Perturbations

Md Mushfiqur Rahman, Kevin Lybarger


Abstract
Patients often ask semantically similar medical questions in linguistically diverse ways that vary in readability, tone, and background knowledge. A robust question answering (QA) system should both provide semantically consistent answers across stylistic differences and adapt its response style to match the user's input; however, existing QA evaluations rarely test this capability, creating critical gaps in QA evaluation that undermine accessibility and health literacy. We introduce SPQA, an evaluation framework and benchmark that applies controlled stylistic perturbations to consumer health questions while preserving semantic intent, then measures how model answers change across correctness, completeness, coherence, fluency, and linguistic adaptability using a human-validated LLM-based judge. The style axes include reading level, formality, and patient background knowledge; all perturbations are grounded in human annotations to ensure fidelity and alignment with human judgments. Our contributions include a readability-aware evaluation methodology, a style-diverse benchmark with human-grounded perturbations, and an automated evaluation pipeline validated against expert judgments. Evaluation results across multiple health QA models indicate that stylistic perturbations lead to measurable performance degradation even when semantic intent is preserved during perturbation. The largest performance drops occur in answer correctness and completeness, while models also show limited ability to adapt their style to match the input. These findings underscore the risk of inequitable information delivery and highlight the need for accessibility-aware QA evaluation.
Anthology ID:
2025.tsar-1.5
Volume:
Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025)
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Matthew Shardlow, Fernando Alva-Manchego, Kai North, Regina Stodden, Horacio Saggion, Nouran Khallaf, Akio Hayakawa
Venues:
TSAR | WS
Publisher:
Association for Computational Linguistics
Pages:
70–86
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.tsar-1.5/
Cite (ACL):
Md Mushfiqur Rahman and Kevin Lybarger. 2025. Evaluating Health Question Answering Under Readability-Controlled Style Perturbations. In Proceedings of the Fourth Workshop on Text Simplification, Accessibility and Readability (TSAR 2025), pages 70–86, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Evaluating Health Question Answering Under Readability-Controlled Style Perturbations (Rahman & Lybarger, TSAR 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.tsar-1.5.pdf