Piyush Patel
2026
IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages
Shubham Nigam | Suparnojit Sarkar | Piyush Patel
BioNLP 2026
We present IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages (Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu). The dataset extends the MDDial corpus with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors introduced during automatic translation. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation (LoRA) of a quantized small language model, incorporating an optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate IndicMedLM against zero-shot multilingual baselines across ten languages and conduct systematic error analysis, identifying five failure modes: Instruction Drift, Label Collapse, Cross-Domain Confusion, Tokenization Failure, and Paraphrase-over-Label Generation. Results show strong post-processed diagnostic accuracy in Hindi, Marathi, and Bengali, while Assamese, Tamil, and Telugu remain in an extreme failure tier attributable to base-model tokenizer gaps, a finding with direct patient safety implications. Medical expert evaluation confirms the clinical plausibility and safety of the generated consultations.