Rafiq Ali
2025
Truth, Trust, and Trouble: Medical AI on the Edge
Mohammad Anas Azeez, Rafiq Ali, Ebad Shabbir, Zohaib Hasan Siddiqui, Gautam Siddharth Kashyap, Jiechao Gao, Usman Naseem
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. However, ensuring these models meet critical industry standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. We present a rigorous benchmarking framework built on a dataset of over 1,000 health questions, assessing model performance across honesty, helpfulness, and harmlessness. Our results highlight trade-offs between factual reliability and safety among the evaluated models: Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. AlpaCare-13B achieves the highest accuracy (91.7%) and harmlessness (0.92), while domain-specific tuning in BioMistral-7B-DARE boosts safety (0.90) despite its smaller scale. Few-shot prompting improves accuracy from 78% to 85%, and all models show reduced helpfulness on complex queries, underscoring persistent challenges in clinical QA. Our code is available at: https://github.com/AnasAzeez/TTT
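The 78% to 85% gain the abstract reports comes from few-shot prompting. As a concrete illustration, here is a minimal sketch of that setup using the Hugging Face transformers pipeline API; the exemplar questions and the choice of the Instruct variant of Mistral-7B are assumptions for illustration, not the paper's released prompts (those are in the linked repo).

```python
from transformers import pipeline

# Hypothetical few-shot exemplars; the paper's actual prompt set is in its repo.
FEW_SHOT = (
    "Q: Should ibuprofen be taken with food?\n"
    "A: Taking ibuprofen with food or milk can reduce stomach irritation.\n\n"
    "Q: Can adults develop asthma?\n"
    "A: Yes, adult-onset asthma is well documented and warrants medical follow-up.\n\n"
)

generator = pipeline(
    "text-generation",
    model="mistralai/Mistral-7B-Instruct-v0.2",  # one of the evaluated model families
)

def answer(question: str) -> str:
    """Prepend exemplars so the model imitates the Q/A format (few-shot prompting)."""
    prompt = FEW_SHOT + f"Q: {question}\nA:"
    out = generator(prompt, max_new_tokens=128, do_sample=False, return_full_text=False)
    return out[0]["generated_text"].strip()

print(answer("What are common side effects of statins?"))
```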
LLMs on a Budget? Say HOLA
Zohaib Hasan Siddiqui, Jiechao Gao, Ebad Shabbir, Mohammad Anas Azeez, Rafiq Ali, Gautam Siddharth Kashyap, Usman Naseem
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Running Large Language Models (LLMs) on edge devices is constrained by high compute and memory demands, posing a barrier for real-time applications in industries like healthcare, education, and embedded systems. Current solutions such as quantization, pruning, and Retrieval-Augmented Generation (RAG) offer only partial optimizations and often compromise speed or accuracy. We introduce HOLA, an end-to-end optimization framework for efficient LLM deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD) for faster inference without quality loss. Externally, AdaComp-RAG adjusts retrieval complexity based on context needs. Together with Lo-Bi, which blends structured pruning (LoRA) and quantization, HOLA delivers significant gains: +17.6% EMA on GSM8K, +10.5% MCA on ARC, and reduced latency and memory use on edge devices like the Jetson Nano, proving it both scalable and production-ready. Our code is available at: https://github.com/zohaibhasan066/HOLA_Codebase
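Of HOLA's three components, the Lo-Bi compression stage maps most directly onto public tooling. The sketch below shows the general low-bit-plus-LoRA recipe using the Hugging Face transformers, peft, and bitsandbytes APIs; the model name and hyperparameters are illustrative assumptions, and HOLA's actual HSD, AdaComp-RAG, and Lo-Bi implementations live in the linked repository.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit weight quantization shrinks the memory footprint for edge devices
# such as a Jetson Nano (illustrative of, not identical to, Lo-Bi).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # hypothetical base model; the paper's may differ
    quantization_config=bnb,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)  # standard QLoRA-style preparation

# Low-rank adapters keep further fine-tuning cheap on top of the frozen 4-bit base.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"])
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the adapter weights are trainable
```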