Ebad Shabbir
2026
Do Large Language Models Reflect Demographic Pluralism in Safety?
Usman Naseem | Gautam Siddharth Kashyap | Sushant Kumar Ray | Rafiq Ali | Ebad Shabbir | Abdullah Mohammad
Findings of the Association for Computational Linguistics: EACL 2026
Large Language Model (LLM) safety is inherently pluralistic, reflecting variations in moral norms, cultural expectations, and demographic contexts. Yet, existing alignment datasets such as Anthropic-HH and DICES rely on demographically narrow annotator pools, overlooking variation in safety perception across communities. Demo-SafetyBench addresses this gap by modeling demographic pluralism directly at the prompt level, decoupling value framing from responses. In Stage I, prompts from DICES are reclassified into 14 safety domains (adapted from BeaverTails) using Mistral-7B-Instruct-v0.3, retaining demographic metadata and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based deduplication, yielding 43,050 samples. In Stage II, pluralistic sensitivity is evaluated using LLMs-as-Raters—Gemma-7B, GPT-4o, and LLaMA-2-7B—under zero-shot inference. Balanced thresholds (delta = 0.5, tau = 10) achieve high reliability (ICC = 0.87) and low demographic sensitivity (DS = 0.12), confirming that pluralistic safety evaluation can be both scalable and demographically robust. Code and data available at: https://github.com/usmaann/Demo-SafetyBench
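As an illustration of the SimHash-based deduplication mentioned in the abstract above, the following is a minimal sketch and not the authors' implementation. The tokenizer, the 64-bit fingerprint size, the use of MD5 as the per-token hash, and the `max_distance` threshold are all assumptions chosen for the example; the function names are hypothetical.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint of `text` (assumed: word tokens, MD5 per token)."""
    tokens = re.findall(r"\w+", text.lower())
    weights = [0] * bits
    for tok in tokens:
        # Take the first 8 bytes of the MD5 digest as a 64-bit token hash.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def deduplicate(prompts, max_distance: int = 3):
    """Keep a prompt only if its fingerprint is farther than `max_distance`
    bits from every prompt kept so far (hypothetical threshold)."""
    kept, fingerprints = [], []
    for prompt in prompts:
        fp = simhash(prompt)
        if all(hamming(fp, seen) > max_distance for seen in fingerprints):
            kept.append(prompt)
            fingerprints.append(fp)
    return kept
```

Because the fingerprint depends only on the lowercased word tokens, prompts that differ only in case or punctuation collapse to a single entry. The pairwise comparison is quadratic in the number of kept prompts, so a larger corpus would likely need an index over fingerprint bands.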
Revealing the Truth with ConLLM for Detecting Multi-Modal Deepfakes
Gautam Siddharth Kashyap | Harsh Joshi | Niharika Jain | Ebad Shabbir | Jiechao Gao | Nipun Joshi | Usman Naseem
Findings of the Association for Computational Linguistics: EACL 2026
The rapid rise of deepfake technology poses a severe threat to social and political stability by enabling hyper-realistic synthetic media capable of manipulating public perception. However, existing detection methods struggle with two core limitations: (1) modality fragmentation, which leads to poor generalization across diverse and adversarial deepfake modalities; and (2) shallow inter-modal reasoning, resulting in limited detection of fine-grained semantic inconsistencies. To address these, we propose ConLLM (Contrastive Learning with Large Language Models), a hybrid framework for robust multimodal deepfake detection. ConLLM employs a two-stage architecture: stage 1 uses Pre-Trained Models (PTMs) to extract modality-specific embeddings; stage 2 aligns these embeddings via contrastive learning to mitigate modality fragmentation, and refines them using LLM-based reasoning to address shallow inter-modal reasoning by capturing semantic inconsistencies. ConLLM demonstrates strong performance across audio, video, and audio-visual modalities. It reduces audio deepfake EER by up to 50%, improves video accuracy by up to 8%, and achieves approximately 9% accuracy gains in audio-visual tasks. Ablation studies confirm that PTM-based embeddings contribute consistent improvements of 9%–10% across modalities. Our code and data are available at: https://github.com/gskgautam/ConLLM/tree/main
FROST: Factual Reasoning via Optimized Stochastic Trajectories in Large Language Models during Inference
Soumedhik Bharati | Ebad Shabbir | Jiechao Gao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Large language models face a trade-off between factual consistency and reasoning diversity: deterministic decoding prioritizes reliability but may miss alternative solution paths, while high-temperature sampling increases exploration at the cost of accuracy. We present FROST (Factual Reasoning via Optimized Stochastic Trajectories), an inference-time framework that balances exploration and exploitation without additional training or context augmentation. FROST combines deterministic inference from a large model with targeted stochastic sampling from a smaller model, selecting outputs via multi-criteria validation over coherence, factual grounding, and semantic novelty. Across HotpotQA, CommonsenseQA, and MMLU, FROST achieves 2–5 percentage point improvements over standard chain-of-thought prompting and reduces unsupported outputs by 40% relative to Standard CoT. Compared to Self-Consistency ensembles, FROST delivers comparable accuracy at 28% lower inference cost through strategic delegation to smaller models. On an adversarial subset with unanswerable queries, FROST abstains on 34% of cases versus 8% for standard chain-of-thought, reducing false positives by 45%. Task-stratified evaluation shows that exploration benefits scale with problem ambiguity. Generalization to mathematical reasoning, code generation, and multimodal domains remains future work.
Are Large Language Models Economically Viable for Industry Deployment?
Abdullah Mohammad | Sushant Kumar Ray | Pushkar Arora | Rafiq Ali | Ebad Shabbir | Gautam Siddharth Kashyap | Jiechao Gao | Usman Naseem
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Generative AI, powered by Large Language Models (LLMs), is increasingly deployed in industry across healthcare decision support, financial analytics, enterprise retrieval, and conversational automation, where reliability, efficiency, and cost control are critical. In such settings, models must satisfy strict constraints on energy, latency, and hardware utilization, not accuracy alone. Yet prevailing evaluation pipelines remain accuracy-centric, creating a Deployment–Evaluation Gap: the absence of operational and economic criteria in model assessment. To address this gap, we present EDGE-EVAL, an industry-oriented benchmarking framework that evaluates LLMs across their full lifecycle on legacy NVIDIA Tesla T4 GPUs. Benchmarking LLaMA and Qwen variants across three industrial tasks, we introduce five deployment metrics: Economic Break-Even (N_break), Intelligence-Per-Watt (IPW), System Density (ρ_sys), Cold-Start Tax (C_tax), and Quantization Fidelity (Q_ret). These capture profitability, energy efficiency, hardware scaling, serverless feasibility, and compression safety. Our results reveal a clear efficiency frontier: models in the <2B parameter class dominate larger baselines across economic and ecological dimensions. LLaMA-3.2-1B (INT4) achieves ROI break-even in 14 requests (median), delivers 3× higher energy-normalized intelligence than 7B models, and exceeds 6,900 tokens/s/GB under 4-bit quantization. We further uncover an efficiency anomaly: while QLoRA reduces memory footprint, it increases adaptation energy by up to 7× for small models, challenging prevailing assumptions about quantization-aware training in edge deployment.
2025
LLMs on a Budget? Say HOLA
Zohaib Hasan Siddiqui | Jiechao Gao | Ebad Shabbir | Mohammad Anas Azeez | Rafiq Ali | Gautam Siddharth Kashyap | Usman Naseem
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Running Large Language Models (LLMs) on edge devices is constrained by high compute and memory demands—posing a barrier for real-time applications in industries like healthcare, education, and embedded systems. Current solutions such as quantization, pruning, and Retrieval-Augmented Generation (RAG) offer only partial optimizations and often compromise on speed or accuracy. We introduce HOLA, an end-to-end optimization framework for efficient LLM deployment. Internally, it leverages Hierarchical Speculative Decoding (HSD) for faster inference without quality loss. Externally, AdaComp-RAG adjusts retrieval complexity based on context needs. Together with Lo-Bi, which blends structured pruning (LoRA) and quantization, HOLA delivers significant gains: +17.6% EMA on GSM8K, +10.5% MCA on ARC, and reduced latency and memory on edge devices like Jetson Nano—proving both scalable and production-ready. Our code is available at: https://github.com/zohaibhasan066/HOLA_Codebase
Truth, Trust, and Trouble: Medical AI on the Edge
Mohammad Anas Azeez | Rafiq Ali | Ebad Shabbir | Zohaib Hasan Siddiqui | Gautam Siddharth Kashyap | Jiechao Gao | Usman Naseem
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Large Language Models (LLMs) hold significant promise for transforming digital health by enabling automated medical question answering. However, ensuring these models meet critical industry standards for factual accuracy, usefulness, and safety remains a challenge, especially for open-source solutions. We present a rigorous benchmarking framework via a dataset of over 1,000 health questions. We assess model performance across honesty, helpfulness, and harmlessness. Our results highlight trade-offs between factual reliability and safety among evaluated models—Mistral-7B, BioMistral-7B-DARE, and AlpaCare-13B. AlpaCare-13B achieves the highest accuracy (91.7%) and harmlessness (0.92), while domain-specific tuning in BioMistral-7B-DARE boosts safety (0.90) despite smaller scale. Few-shot prompting improves accuracy from 78% to 85%, and all models show reduced helpfulness on complex queries, highlighting challenges in clinical QA. Our code is available at: https://github.com/AnasAzeez/TTT
TSR@CASE 2025: Low Dimensional Multimodal Fusion Using Multiplicative Fine Tuning Modules
Sushant Kr. Ray | Rafiq Ali | Abdullah Mohammad | Ebad Shabbir | Samar Wazir
Proceedings of the 8th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Texts
This study describes our submission to the CASE 2025 shared task on multimodal hate event detection, which frames hate detection, hate target identification, stance determination, and humour detection on text-embedded images as classification challenges. Our submission contains entries in all of the subtasks. We propose FIMIF, a lightweight and efficient classification model that leverages frozen CLIP encoders. We utilise a feature interaction module that allows the model to exploit multiplicative interactions between features without any manual engineering. Our results demonstrate that the model achieves comparable or superior performance to larger models, despite having a significantly smaller parameter count.