Kshitij Sharad Jadhav
Also published as: Kshitij Jadhav
Other people with similar names: Kshitij Jadhav
2026
Rad-Flamingo: A Multimodal Prompt driven Radiology Report Generation Framework with Patient-Centric Explanations
Md. Tousin Akhter | Devansh Lalwani | Kshitij Sharad Jadhav | Pushpak Bhattacharyya
Findings of the Association for Computational Linguistics: EACL 2026
In modern healthcare, radiology plays a pivotal role in diagnosing and managing diseases. However, the complexity of medical imaging data and the variability in interpretation can lead to inconsistencies and a lack of patient-centered insight in radiology reports. To address this challenge, we developed Rad-Flamingo, a novel multimodal prompt-driven report generation framework that integrates diverse data modalities, such as medical images and clinical notes, to produce comprehensive and context-aware radiology reports. Our framework leverages prompt engineering techniques to guide vision-language models in generating relevant information, ensuring that the generated reports are not only accurate but also understandable to individual patients. A key feature of our framework is its ability to provide patient-centric explanations, offering clear and personalized insights into diagnostic findings and their implications. Additionally, we demonstrate a synthetic data generation pipeline that augments the findings and impressions of any existing benchmark dataset with patient-centric explanations. Experimental results demonstrate the framework's effectiveness in enhancing report quality and improving understandability, which could foster better patient-doctor communication. This approach represents a significant step towards human-centered medical AI systems.
MedBench: Deliberative Evaluation of Medical Language Models
Pratik Jalan | Mukul Joshi | Akhilesh Magotra | Kshitij Jadhav
BioNLP 2026
We introduce MedBench, a benchmark for evaluating medical language models as deliberating agents rather than isolated predictors. MedBench evaluates eight models (4B–32B) on 19,625 questions from six medical QA datasets using Consensus-Aware Model Panel (CAMP), a two-tier protocol in which five 4B–8B models answer independently, revise after observing peer reasoning, and escalate persistent disagreements to larger 20B–32B models. Compared with zero-shot, few-shot, and chain-of-thought baselines, CAMP shows that deliberation is not uniformly accuracy-improving, but reveals interaction-driven behaviors hidden by single-model evaluation. On PubMedQA without external context, the 4B–8B panel outperforms the evaluated 20B–32B individual zero-shot models (54.1% vs. 33.9%), and achieves the best evaluated result with context (75.7%), suggesting that structured interaction can sometimes complement scale. Across five datasets, initial inter-model agreement is positively associated with correctness and serves as a useful difficulty signal. However, on MedXpertQA, unanimous agreement yields only 6.6% accuracy despite 14.4% overall accuracy, suggesting correlated ignorance, where shared biases make consensus misleading. Error analysis shows that most failures are debate-insufficient cases, where incorrect majorities persist despite interaction (93–97%), while debate-harmful cases account for 3–7%. MedBench positions deliberative evaluation as a complement to accuracy-centric benchmarking, measuring when model interaction corrects errors, reinforces shared mistakes, or signals the need for stronger evidence and human review.
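The two-tier flow described in this abstract (independent answers, revision after seeing peers, escalation on persistent disagreement) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `camp`, `majority`, `always` and `follower`, the `threshold` parameter, the majority-vote aggregation, and the reduction of "peer reasoning" to peer answers are all assumptions made for illustration, and the models are stand-in callables rather than real language models.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# Hypothetical model interface: (question, peer_answers) -> answer.
Model = Callable[[str, Sequence[str]], str]


def majority(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of voters giving it."""
    top, count = Counter(answers).most_common(1)[0]
    return top, count / len(answers)


def camp(question: str, panel: List[Model], escalation: List[Model],
         threshold: float = 1.0) -> Tuple[str, str]:
    """Two-tier deliberation sketch; returns (answer, tier that decided)."""
    # Round 1: every panel model answers without seeing its peers.
    first = [m(question, ()) for m in panel]
    # Round 2: every panel model may revise after seeing the other answers.
    revised = [m(question, first[:i] + first[i + 1:])
               for i, m in enumerate(panel)]
    answer, agreement = majority(revised)
    if agreement >= threshold:  # assumed rule: consensus reached
        return answer, "panel"
    # Tier 2: persistent disagreement goes to the larger models.
    answer, _ = majority([m(question, revised) for m in escalation])
    return answer, "escalated"


def always(ans: str) -> Model:
    """Stand-in model that never changes its answer."""
    return lambda question, peers: ans


def follower(question: str, peers: Sequence[str]) -> str:
    """Stand-in model that adopts the peer majority once it sees peers."""
    return Counter(peers).most_common(1)[0][0] if peers else "maybe"


print(camp("q", [always("yes"), always("yes"), follower], [always("no")]))
print(camp("q", [always("yes"), always("no"), always("no")], [always("yes")]))
```

In the first call the follower converges to the panel's answer, so no escalation happens. In the second call the panel stays split and the stand-in larger model decides.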
2024
Few shot chain-of-thought driven reasoning to prompt LLMs for open-ended medical question answering
Saeel Sandeep Nachane | Ojas Gramopadhye | Prateek Chanda | Ganesh Ramakrishnan | Kshitij Sharad Jadhav | Yatin Nandwani | Dinesh Raghu | Sachindra Joshi
Findings of the Association for Computational Linguistics: EMNLP 2024
In this paper, we propose a modified version of the MedQA-USMLE dataset, named MEDQA-OPEN, which contains open-ended medical questions without options to mimic clinical scenarios, along with clinician-approved reasoned answers. Additionally, we implement a prompt driven by Chain of Thought (CoT) reasoning, CLINICR, to mirror the prospective process of incremental reasoning, reaching a correct response to medical questions. We empirically demonstrate how CLINICR outperforms the state-of-the-art 5-shot CoT-based prompt (Liévin et al., 2022). We also present an approach that mirrors real-life clinical practice by first exploring multiple differential diagnoses through MCQ-CLINICR and subsequently narrowing down to a final diagnosis using MCQ-ELIMINATIVE. Finally, emphasizing the importance of response verification in medical settings, we utilize a reward model mechanism, replacing the elimination process performed by MCQ-ELIMINATIVE.
2023
Replace and Report: NLP Assisted Radiology Report Generation
Kaveri Kale | Pushpak Bhattacharyya | Kshitij Jadhav
Findings of the Association for Computational Linguistics: ACL 2023
Clinical practice frequently uses medical imaging for diagnosis and treatment. A significant challenge for automatic radiology report generation is that the radiology reports are long narratives consisting of multiple sentences for both abnormal and normal findings. Therefore, applying conventional image captioning approaches to generate the whole report proves to be insufficient, as these are designed to briefly describe images with short sentences. We propose a template-based approach to generate radiology reports from radiographs. Our approach involves the following: i) using a multilabel image classifier, produce the tags for the input radiograph; ii) using a transformer-based model, generate pathological descriptions (a description of abnormal findings seen on radiographs) from the tags generated in step (i); iii) using a BERT-based multi-label text classifier, find the spans in the normal report template to replace with the generated pathological descriptions; and iv) using a rule-based system, replace the identified span with the generated pathological description. We performed experiments with the two most popular radiology report datasets, IU Chest X-ray and MIMIC-CXR, and demonstrated that the BLEU-1, ROUGE-L, METEOR, and CIDEr scores exceed those of state-of-the-art models by 25%, 36%, 44% and 48%, respectively, on the IU X-RAY dataset. To the best of our knowledge, this is the first attempt to generate chest X-ray radiology reports by first creating small sentences for abnormal findings and then replacing them in the normal report template.
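The four steps listed in this abstract can be sketched as a small pipeline. This is a simplified illustration under stated assumptions, not the authors' system: the three learned components are passed in as plain callables, the template is reduced to a dictionary of named spans, and `NORMAL_TEMPLATE`, `generate_report`, and the example sentences are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical normal-report template: span name -> normal sentence.
NORMAL_TEMPLATE: Dict[str, str] = {
    "heart": "The heart is normal in size.",
    "lungs": "The lungs are clear.",
    "pleura": "No pleural effusion or pneumothorax.",
}


def generate_report(image: object,
                    tagger: Callable[[object], List[str]],
                    describer: Callable[[str], str],
                    span_finder: Callable[[str], str],
                    template: Dict[str, str] = NORMAL_TEMPLATE) -> str:
    """Fill a normal template with descriptions of abnormal findings."""
    report = dict(template)  # copy so the shared template is not mutated
    for tag in tagger(image):            # step i: tags for the radiograph
        description = describer(tag)     # step ii: pathological description
        span = span_finder(description)  # step iii: which span to replace
        if span in report:               # step iv: rule-based replacement
            report[span] = description
    return " ".join(report.values())


# Stand-ins for the image classifier, the text generator and the
# span classifier; a real system would use trained models here.
report = generate_report(
    image=None,
    tagger=lambda img: ["cardiomegaly"],
    describer=lambda tag: {"cardiomegaly": "The heart is enlarged."}[tag],
    span_finder=lambda text: "heart",
)
print(report)
```

In this sketch a later description silently overwrites an earlier one that maps to the same span; how the actual system handles several findings for one span is not stated in the abstract.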