Proceedings of The First Workshop on Human–LLM Collaboration for Ethical and Responsible Science Production (SciProdLLM)
Wei Zhao | Jennifer D’Souza | Steffen Eger | Anne Lauscher | Yufang Hou | Nafise Sadat Moosavi | Tristan Miller | Chenghua Lin
Bridging Health Literacy Gaps in Indian Languages: Multilingual LLMs for Clinical Text Simplification
R S Pavithra
We demonstrate how open multilingual LLMs (mT5, IndicTrans2) can simplify complex medical documents into culturally sensitive, patient-friendly text in Indian languages, advancing equitable healthcare communication and multilingual scientific accessibility. Clinical documents such as discharge summaries, consent forms, and medication instructions are essential for patient care but are often written in complex, jargon-heavy language. This barrier is intensified in multilingual and low-literacy contexts like India, where linguistic diversity meets limited health literacy. We present a multilingual clinical text simplification pipeline using open large language models (mT5 and IndicTrans2) to automatically rewrite complex medical text into accessible, culturally appropriate, and patient-friendly versions in English, Hindi, Tamil, and Telugu. Using a synthetic dataset of 2,000 discharge summaries, our models achieve up to a 42% readability improvement while maintaining factual accuracy. The framework demonstrates how open, reproducible LLMs can bridge linguistic inequities in healthcare communication and support inclusive, patient-centric digital health access in India.
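As a rough illustration of the simplification pipeline described in this abstract, the sketch below runs a single clinical sentence through an mT5-style seq2seq model with Hugging Face transformers. The checkpoint name (google/mt5-base), the prompt prefix, and the generation settings are illustrative assumptions; the paper's own fine-tuned models and prompts are not shown here, and a base checkpoint will not simplify well without fine-tuning.

```python
# Minimal sketch of a clinical text simplification step with an mT5-style
# seq2seq model. The checkpoint, prompt prefix, and generation settings are
# illustrative assumptions, not the authors' released configuration; a base
# (un-fine-tuned) checkpoint only illustrates the shape of the pipeline.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "google/mt5-base"  # assumed base model; the paper fine-tunes its own

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def simplify(clinical_text: str, target_language: str = "Hindi") -> str:
    """Rewrite a complex clinical sentence as plain, patient-friendly text."""
    prompt = f"simplify to {target_language}: {clinical_text}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(simplify(
    "The patient exhibited an acute exacerbation of COPD requiring nebulized bronchodilators."
))
```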
Human-Centered Disability Bias Detection in Large Language Models
Habiba Chakour | Fatiha Sadat
To promote a more just and inclusive society, developers and researchers are strongly encouraged to design Language Models (LMs) with ethical considerations at the forefront, ensuring that the benefits and opportunities of AI are accessible to all users and communities. Incorporating humans in the loop is one approach recognized for mitigating general AI biases. Consequently, the development of new design guidelines and datasets is essential to help AI systems realize their full potential for the benefit of people with disabilities. This study aims to identify disability-related bias in large Masked Language Models (MLMs), focusing on ELECTRA. A participatory and collaborative research approach was employed, involving three disability organizations to collect information on deaf and hard-of-hearing individuals. Our initial analysis reveals that the studied MLM is highly sensitive to the various identity references used to describe deaf and hard-of-hearing people.
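One simple way to probe the kind of sensitivity reported here is to compare masked-token predictions across prompts that differ only in the identity term used. The sketch below uses the ELECTRA generator checkpoint (which supports masked-token prediction), a single template, and a small set of identity terms; all of these are illustrative assumptions rather than the study's actual protocol or data.

```python
# Illustrative probe of an MLM's sensitivity to identity references, in the
# spirit of the study. The checkpoint, template, and identity terms are
# assumptions for demonstration; they are not the paper's protocol or data.
from transformers import pipeline

# ELECTRA's generator head supports masked-token prediction ("fill-mask").
fill_mask = pipeline("fill-mask", model="google/electra-small-generator")

identity_terms = ["deaf", "hard-of-hearing", "hearing-impaired"]
template = "A {} person applied for the job and was [MASK]."

for term in identity_terms:
    predictions = fill_mask(template.format(term), top_k=5)
    completions = [p["token_str"] for p in predictions]
    print(f"{term:>16}: {completions}")

# Systematically diverging completions across otherwise identical prompts are
# one signal that the model treats the identity references differently.
```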
TransLaTeX: Exposing the Last-Mile Execution Gap in LLM-Agent for Scientific Formatting
Jiawen Lyn | Yvette Graham
Large Language Models (LLMs) have achieved remarkable progress in tasks such as survey writing and language polishing, yet the final stage of LaTeX formatting and template adaptation remains a neglected and error-prone bottleneck. We identify an execution illusion, where LLMs produce linguistically fluent but unexecutable LaTeX code. To address this, we introduce TransLaTeX, the first reasoning-and-control framework that converts documents between scholarly templates with compiler-level verifiability. TransLaTeX offers three key innovations: (1) structure–content separation via placeholder masking, ensuring privacy and lower token consumption; (2) SafeFormatBench, the first benchmark dedicated to executable LaTeX generation and template conversion; and (3) execution-grounded verification across compilation, policy compliance, and visual consistency. TransLaTeX outperforms Pandoc and full-text LLM baselines on SafeFormatBench in compilation rate, ACL policy compliance, and layout fidelity, effectively mitigating the execution illusion.
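Two of the ideas named above, placeholder masking and execution-grounded verification, can be sketched in a few lines. The regular expressions, placeholder scheme, and pdflatex invocation below are simplifying assumptions for illustration; they are not the TransLaTeX implementation.

```python
# Sketch of placeholder masking (hide author content, keep structure) and a
# compiler-level executability check. Simplifying assumptions throughout:
# only a few commands are masked, and pdflatex must be on the PATH.
import re
import subprocess
import tempfile
from pathlib import Path

def mask_content(latex_source: str):
    """Replace the text arguments of common commands with opaque placeholders."""
    store = {}

    def repl(match):
        key = f"XSEG{len(store):04d}"
        store[key] = match.group(2)
        return f"\\{match.group(1)}{{{key}}}"

    masked = re.sub(r"\\(title|section|subsection|caption)\{([^{}]*)\}", repl, latex_source)
    return masked, store

def unmask(masked_source: str, store: dict) -> str:
    """Restore the original content after the template has been converted."""
    for key, text in store.items():
        masked_source = masked_source.replace(key, text)
    return masked_source

def compiles(latex_source: str) -> bool:
    """Execution-grounded check: does the document actually build?"""
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "doc.tex").write_text(latex_source)
        result = subprocess.run(
            ["pdflatex", "-interaction=nonstopmode", "-halt-on-error", "doc.tex"],
            cwd=tmp, capture_output=True,
        )
        return result.returncode == 0
```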
MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning
Rajarshi Ghosh | Abhay Gupta | Hudson McBride | Anurag Jayant Vaidya | Faisal Mahmood
Large language models (LLMs) are increasingly deployed in clinical decision support, yet subtle demographic cues can influence their reasoning. Prior work has documented disparities in outputs across patient groups, but little is known about how internal reasoning shifts under controlled demographic changes. We introduce MEDEQUALQA, a counterfactual benchmark that perturbs only patient pronouns (he/him, she/her, they/them) while holding critical symptoms and conditions (CSCs) constant. Each vignette is expanded into single-CSC ablations, producing three parallel datasets of approximately 23k items each (69k total). We evaluate a frontier LLM and compute Semantic Textual Similarity (STS) between reasoning traces to measure stability across pronoun variants. Our results show overall high similarity (mean STS > 0.80) but reveal consistent localized divergences in cited risk factors, guideline anchors, and differential ordering—even when final diagnoses remain unchanged. Error analysis identifies specific cases where reasoning shifts occur, highlighting clinically relevant bias loci that may cascade into inequitable care. MEDEQUALQA provides a controlled diagnostic setting for auditing reasoning stability in medical AI.
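The stability measurement itself is easy to reproduce in outline: embed each pronoun variant's reasoning trace and compare the pairs with cosine similarity. The embedding model and the toy traces below are assumptions for illustration; the benchmark's own prompts, model outputs, and STS scorer may differ.

```python
# Sketch of the STS-based stability check described in the abstract: compare
# an LLM's reasoning traces across pronoun-swapped vignettes. The embedding
# model and the toy traces are illustrative assumptions, not benchmark data.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# Toy reasoning traces for one vignette, one per pronoun variant.
traces = {
    "he/him":    "Chest pain with diaphoresis: prioritize ACS, order ECG and troponin.",
    "she/her":   "Chest pain with diaphoresis: consider ACS and anxiety, order ECG.",
    "they/them": "Chest pain with diaphoresis: prioritize ACS, order ECG and troponin.",
}

embeddings = {k: model.encode(v, convert_to_tensor=True) for k, v in traces.items()}

pairs = [("he/him", "she/her"), ("he/him", "they/them"), ("she/her", "they/them")]
for a, b in pairs:
    sts = util.cos_sim(embeddings[a], embeddings[b]).item()
    print(f"STS({a}, {b}) = {sts:.3f}")
# Pairwise scores near 1.0 indicate stable reasoning; lower scores flag
# variants whose traces diverge and deserve manual error analysis.
```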
Reasoning-Enhanced Retrieval for Misconception Prediction: A RAG-Inspired Approach with LLMs
Chaudhary Divya | Chang Xue | Shaorui Sun