Arka Dutta


2026

Large Language Models (LLMs) can exhibit imbalanced biases against vulnerable groups, but how they rationalize stereotypes and rights restrictions targeting mental health entities remains underexplored. We audit a broad suite of open-weight LLMs on stereotype-justification prompts tied to mental health identities. We find that several widely used models endorse harmful stereotypes when explicitly asked to justify them, with endorsement varying across model families, versions, and mental health conditions. Finally, we show that widely used harmful-content evaluation and moderation frameworks often miss these nuanced, discriminatory responses, highlighting a gap in current AI safety evaluation for mental health groups.
In Dravidian languages, political memes progressively shape public opinion and political discourse, influencing digital conversations andpublic narratives. Our paper proposes a multilevel multimodal framework for political meme classification in Tamil and Malayalam as part of the Multi Level Political Meme ClassificationDravidianLangTech@ACL 2026 shared task. The task has involved two levels: Level 1 has identified whether a meme expresses Troll/Oppose or Support/Praise, while Level 2 has determined the specific target category (Individual, Party, or Intersection). We have evaluated unimodal and multimodal architectures to analyze the impact of textual and visual representation. Experimental results have highlighted the importance of a multimodal approach over unimodal approaches. This workconfirms the effectiveness of combining image and text features in meme understanding. Among the evaluated models, the mBERT+ViTarchitecture has achieved the best overall performance across both languages and classification levels. According to the evaluation of shared task we achieved average F1 score of 0.72 securing the 2nd rank in Malayalam task and F1 score of 0.76 in Tamil task securing the 6th rank. However after our experimental evaluation we got best average F1 score of 0.62 for Tamil and 0.49 for Malayalam. Despite moderate results, challenges have remained mainly due to the dataset size, class imbalance, and noisy text extraction from images.
Hallucinations pose a critical challenge to the real-world deployment of large language models (LLMs) in high-stakes domains. In this paper, we present a framework for stress testing factual fidelity in LLMs in the presence of adversarial nudge. Our framework consists of three steps. First, we instruct the LLM to produce sets of truths and lies consistent with the closed domain in question. Next, we instruct the LLM to verify the same set of assertions as truths and lies consistent with the same closed domain. Finally, we test the robustness of the LLM against the lies generated (and verified) by itself. Our extensive evaluation, conducted using five widely known proprietary and six open LLMs across two closed domains of popular movies and novels, reveals a wide range of susceptibility to adversarial nudges: even among the strongest proprietary LLMs, Claude exhibits strong resilience, GPT and Grok demonstrate moderate resilience, while Gemini and DeepSeek show weak resilience and open models fall short significantly.