Proceedings of the 9th Widening NLP Workshop

Chen Zhang, Emily Allaway, Hua Shen, Lesly Miculicich, Yinqiao Li, Meryem M'hamdi, Peerat Limkonchotiwat, Richard He Bai, Santosh T.y.s.s., Sophia Simeng Han, Surendrabikram Thapa, Wiem Ben Rim (Editors)


Anthology ID:
2025.winlp-main
Month:
November
Year:
2025
Address:
Suzhou, China
Venues:
WiNLP | WS
SIG:
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.winlp-main/
DOI:
ISBN:
979-8-89176-351-7
Bib Export formats:
BibTeX
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.winlp-main.pdf

Proceedings of the 9th Widening NLP Workshop
Chen Zhang | Emily Allaway | Hua Shen | Lesly Miculicich | Yinqiao Li | Meryem M'hamdi | Peerat Limkonchotiwat | Richard He Bai | Santosh T.y.s.s. | Sophia Simeng Han | Surendrabikram Thapa | Wiem Ben Rim

Seeing Symbols, Missing Cultures: Probing Vision-Language Models’ Reasoning on Fire Imagery and Cultural Meaning
Haorui Yu | Yang Zhao | Yijia Chu | Qiufeng Yi

Vision-Language Models (VLMs) often appear culturally competent but rely on superficial pattern matching rather than genuine cultural understanding. We introduce a diagnostic framework to probe VLM reasoning on fire-themed cultural imagery through both classification and explanation analysis. Testing multiple models on Western festivals, non-Western traditions, and emergency scenes reveals systematic biases: models correctly identify prominent Western festivals but struggle with underrepresented cultural events, frequently offering vague labels or dangerously misclassifying emergencies as celebrations. These failures expose the risks of symbolic shortcuts and highlight the need for cultural evaluation beyond accuracy metrics to ensure interpretable and fair multimodal systems.

GPT4AMR: Does LLM-based Paraphrasing Improve AMR-to-text Generation Fluency?
Jiyuan Ji | Shira Wein

Abstract Meaning Representation (AMR) is a graph-based semantic representation that has been incorporated into numerous downstream tasks, in particular due to substantial efforts developing text-to-AMR parsing and AMR-to-text generation models. However, there still exists a large gap between fluent, natural sentences and texts generated from AMR-to-text generation models. Prompt-based Large Language Models (LLMs), on the other hand, have demonstrated an outstanding ability to produce fluent text in a variety of languages and domains. In this paper, we investigate the extent to which LLMs can improve the AMR-to-text generated output fluency post-hoc via prompt engineering. We conduct automatic and human evaluations of the results, and ultimately have mixed findings: LLM-generated paraphrases generally do not exhibit improvement in automatic evaluation, but outperform baseline texts according to our human evaluation. Thus, we provide a detailed error analysis of our results to investigate the complex nature of generating highly fluent text from semantic representations.

Probing Gender Bias in Multilingual LLMs: A Case Study of Stereotypes in Persian
Ghazal Kalhor | Behnam Bahrak

Multilingual Large Language Models (LLMs) are increasingly used worldwide, making it essential to ensure they are free from gender bias to prevent representational harm. While prior studies have examined such biases in high-resource languages, low-resource languages remain understudied. In this paper, we propose a template-based probing methodology, validated against real-world data, to uncover gender stereotypes in LLMs. As part of this framework, we introduce the Domain-Specific Gender Skew Index (DS-GSI), a metric that quantifies deviations from gender parity. We evaluate four prominent models, GPT-4o mini, DeepSeek R1, Gemini 2.0 Flash, and Qwen QwQ 32B, across four semantic domains, focusing on Persian, a low-resource language with distinct linguistic features. Our results show that all models exhibit gender stereotypes, with greater disparities in Persian than in English across all domains. Among these, sports reflect the most rigid gender biases. This study underscores the need for inclusive NLP practices and provides a framework for assessing bias in other low-resource languages.

Whose Palestine Is It? A Topic Modelling Approach to National Framing in Academic Research
Maida Aizaz | Taegyoon Kim | Lanu Kim

In this study, we investigate how author affiliation shapes academic discourse, proposing it as an effective proxy for author perspective in understanding what topics are studied, how nations are framed, and whose realities are prioritised. Using Palestine as a case study, we apply BERTopic and Structural Topic Modelling (STM) to 29,536 English-language academic articles collected from the OpenAlex database. We find that domestic authors focus on practical, local issues like healthcare, education, and the environment, while foreign authors emphasise legal, historical, and geopolitical discussions. These differences, in our interpretation, reflect lived proximity to war and crisis. We also note that while BERTopic captures greater lexical nuance, STM enables covariate-aware comparisons, offering deeper insight into how affiliation correlates with thematic emphasis. We propose extending this framework to other underrepresented countries, including a future study focused on Gaza post-October 7.

Fine-tuning XLM-RoBERTa for Named Entity Recognition in Kurmanji Kurdish
Hossein Hassani

Named Entity Recognition (NER) is the information extraction task of identifying predefined named entities such as person names, location names, organization names and more. High-resource languages have made significant progress in NER tasks. However, low-resource languages such as Kurmanji Kurdish have not seen the same advancements, due to these languages having less available data online. This research aims to close this gap by developing an NER system via fine-tuning XLM-RoBERTa on a manually annotated dataset for Kurmanji. The dataset used for fine-tuning consists of 7,919 annotated sentences, which were manually annotated by three native Kurmanji speakers. The classes labeled in the dataset are Person (PER), Organization (ORG), and Location (LOC). A web-based application has also been developed using Streamlit to make the model more accessible. The model achieved an F1 score of 0.8735, precision of 0.8668, and recall of 0.8803, demonstrating the effectiveness of fine-tuning transformer-based models for NER tasks in low-resource languages. This work establishes a methodology that can be applied to other low-resource languages and Kurdish varieties.

Human-AI Moral Judgment Congruence on Real-World Scenarios: A Cross-Lingual Analysis
Nan Li | Bo Kang | Tijl De Bie

As Large Language Models (LLMs) are deployed in every aspect of our lives, understanding how they reason about moral issues becomes critical for AI safety. We investigate this using a dataset we curated from Reddit’s r/AmItheAsshole, comprising real-world moral dilemmas with crowd-sourced verdicts. Through experiments on five state-of-the-art LLMs across 847 posts, we find a significant and systematic divergence where LLMs are more lenient than humans. Moreover, we find that translating the posts into another language changes LLMs’ verdicts, indicating their judgments lack cross-lingual stability.

Transfer learning for dependency parsing of Vedic Sanskrit
Abhiram Vinjamuri | Weiwei Sun

This paper focuses on data-driven dependency parsing for Vedic Sanskrit. We propose and evaluate a transfer learning approach that benefits from syntactic analyses of typologically related languages, including Ancient Greek and Latin, and of a descendant language, Classical Sanskrit. Experiments on the Vedic TreeBank demonstrate the effectiveness of cross-lingual transfer, with improvements over the biaffine baseline and over the current state-of-the-art benchmark, the deep contextualised self-training algorithm, across a wide range of experimental setups.

Debiasing Large Language Models in Thai Political Stance Detection via Counterfactual Calibration
Kasidit Sermsri | Teerapong Panboonyuen

Political stance detection in low-resource and culturally complex settings poses a critical challenge for large language models (LLMs). In the Thai political landscape—rich with indirect expressions, polarized figures, and sentiment-stance entanglement—LLMs often exhibit systematic biases, including sentiment leakage and entity favoritism. These biases not only compromise model fairness but also degrade predictive reliability in real-world applications. We introduce ThaiFACTUAL, a lightweight, model-agnostic calibration framework that mitigates political bias without fine-tuning LLMs. ThaiFACTUAL combines counterfactual data augmentation with rationale-based supervision to disentangle sentiment from stance and neutralize political preferences. We curate and release the first high-quality Thai political stance dataset with stance, sentiment, rationale, and bias markers across diverse political entities and events. Our results show that ThaiFACTUAL substantially reduces spurious correlations, improves zero-shot generalization, and enhances fairness across multiple LLMs. This work underscores the need for culturally grounded bias mitigation and offers a scalable blueprint for debiasing LLMs in politically sensitive, underrepresented languages.

ECCC: Edge Code Cloak Coder for Privacy Code Agent
Haoqi He | Wenzhi Xu | Ruoying Liu | Jiarui Tang | Bairu Li | Xiaokai Lin

Large language models (LLMs) have significantly advanced automated code generation and debugging, facilitating powerful multi-agent coding frameworks. However, deploying these sophisticated models on resource-constrained edge devices remains challenging due to high computational demands, limited adaptability, and significant privacy risks associated with cloud-based processing. Motivated by these constraints, we propose Edge Code Cloak Coder (ECCC), a novel edge-cloud hybrid framework integrating a lightweight quantized LLM with robust AST-based anonymization and edge-side privacy validation. ECCC enables high-performance, privacy-preserving LLM capabilities on consumer GPUs, anonymizing user code before securely delegating abstracted tasks to cloud LLMs. Experimental evaluations demonstrate that ECCC achieves competitive correctness (within 4–5 pp of GPT-4-based frameworks) and a perfect privacy score of 10/10, effectively balancing functionality and security for sensitive and proprietary code applications.

ValueCompass: A Framework for Measuring Contextual Value Alignment Between Human and LLMs
Hua Shen | Tiffany Knearem | Reshmi Ghosh | Yu-Ju Yang | Nicholas Clark | Tanu Mitra | Yun Huang

As AI advances, aligning it with diverse human and societal values grows critical. But how do we define these values and measure AI’s adherence to them? We present ValueCompass, a framework grounded in psychological theories, to assess human-AI alignment. Applying it to five diverse LLMs and 112 humans from seven countries across four scenarios—collaborative writing, education, public sectors, and healthcare—we uncover key misalignments. For example, humans prioritize national security, while LLMs often reject it. Values also shift across contexts, demanding scenario-specific alignment strategies. This work advances AI design by mapping how systems can better reflect societal ethics.

ASR Under Noise: Exploring Robustness for Sundanese and Javanese
Salsabila Zahirah Pranida | Rifo Ahmad Genadi | Muhammad Cendekia Airlangga | Shady Shehata

We investigate the robustness of Whisper-based automatic speech recognition (ASR) models for two major Indonesian regional languages: Javanese and Sundanese. While recent work has demonstrated strong ASR performance under clean conditions, their effectiveness in noisy environments remains unclear. To address this, we experiment with multiple training strategies, including synthetic noise augmentation and SpecAugment, and evaluate performance across a range of signal-to-noise ratios (SNRs). Our results show that noise-aware training substantially improves robustness, particularly for larger Whisper models. A detailed error analysis further reveals language-specific challenges, highlighting avenues for future improvements.
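
The noise-aware training described above relies on mixing noise into clean speech at controlled signal-to-noise ratios. As an illustrative aside, a minimal sketch of mixing at a target SNR is given below; it is not the authors' code, and the NumPy usage, function name, and variable names are assumptions.

import numpy as np

def mix_at_snr(clean, noise, snr_db):
    # Tile or truncate the noise so it matches the clean utterance length.
    reps = int(np.ceil(len(clean) / len(noise)))
    noise = np.tile(noise, reps)[:len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise_scaled) equals snr_db.
    p_clean = np.mean(clean.astype(np.float64) ** 2)
    p_noise = np.mean(noise.astype(np.float64) ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

Sweeping snr_db over a range of values (e.g. 0 to 20 dB) yields the kind of noisy evaluation conditions the abstract refers to.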

A Simple Data Augmentation Strategy for Text-in-Image Scientific VQA
Belal Shoer | Yova Kementchedjhieva

Scientific visual question answering poses significant challenges for vision-language models due to the complexity of scientific figures and their multimodal context. Traditional approaches treat the figure and accompanying text (e.g., questions and answer options) as separate inputs. EXAMS-V introduced a new paradigm by embedding both visual and textual content into a single image. However, even state-of-the-art proprietary models perform poorly on this setup in zero-shot settings, underscoring the need for task-specific fine-tuning. To address the scarcity of training data in this “text-in-image” format, we synthesize a new dataset by converting existing separate image-text pairs into unified images. Fine-tuning a small multilingual multimodal model on a mix of our synthetic data and EXAMS-V yields notable gains across 13 languages, demonstrating strong average improvements and cross-lingual transfer.

Hybrid Fact-Checking that Integrates Knowledge Graphs, Large Language Models, and Search-Based Retrieval Agents Improves Interpretable Claim Verification
Shaghayeghkolli | Richard Rosenbaum | Timo Cavelius | Lasse Strothe | Andrii Lata | Jana Diesner

Large language models (LLMs) excel at generating fluent utterances but can lack reliable grounding in verified information. At the same time, knowledge-graph-based fact-checkers deliver precise and interpretable evidence, yet suffer from limited coverage or latency. By integrating LLMs with knowledge graphs and real-time search agents, we introduce a hybrid fact-checking approach that leverages the individual strengths of each component. Our system comprises three autonomous steps: 1) Knowledge Graph (KG) retrieval for rapid one-hop lookups in DBpedia, 2) LM-based classification guided by a task-specific labeling prompt, producing outputs with internal rule-based logic, and 3) a Web Search Agent invoked only when KG coverage is insufficient. Our pipeline achieves an F1 score of 0.93 on the Supported/Refuted split of the FEVER benchmark without task-specific fine-tuning. To address Not Enough Information (NEI) cases, we conduct a targeted reannotation study showing that our approach frequently uncovers valid evidence for claims originally labeled NEI, as confirmed by both expert annotators and LLM reviewers. With this paper, we present a modular, open-source fact-checking pipeline with fallback strategies and generalization across datasets.

Insights from a Disaggregated Analysis of Kinds of Biases in a Multicultural Dataset
Guido Ivetta | Hernán Maina | Luciana Benotti

Warning: This paper contains explicit statements of offensive stereotypes which may be upsetting. Stereotypes vary across cultural contexts, making it essential to understand how language models encode social biases. MultiLingualCrowsPairs is a dataset of culturally adapted stereotypical and anti-stereotypical sentence pairs across nine languages. While prior work has primarily reported average fairness metrics on masked language models, this paper analyzes social biases in generative models by disaggregating results across specific bias types. We find that although most languages show an overall preference for stereotypical sentences, this masks substantial variation across different types of bias, such as gender, religion, and socioeconomic status. Our findings underscore that relying solely on aggregated metrics can obscure important patterns, and that fine-grained, bias-specific analysis is critical for meaningful fairness evaluation.

That Ain’t Right: Assessing LLM Performance on QA in African American and West African English Dialects
William Coggins | Jasmine McKenzie | Sangpil Youm | Pradham Mummaleti | Juan Gilbert | Eric Ragan | Bonnie J Dorr

As Large Language Models (LLMs) gain mainstream public usage, understanding how users interact with them becomes increasingly important. Limited variety in training data raises concerns about LLM reliability across different language inputs. To explore this, we test several LLMs using functionally equivalent prompts expressed in different English sublanguages. We frame this analysis using Question-Answer (QA) pairs, which allow us to detect and evaluate appropriate and anomalous model behavior. We contribute a cross-LLM testing method and a new QA dataset translated into AAVE and WAPE variants. Early results reveal a notable drop in accuracy for one sublanguage relative to the baseline.

Amharic News Topic Classification: Dataset and Transformer-Based Model Benchmarks
Dagnachew Mekonnen Marilign | Eyob Nigussie Alemu

News classification is a downstream task in Natural Language Processing (NLP) that involves the automatic categorization of news articles into predefined thematic categories. Although notable advancements have been made for high-resource languages, low-resource languages such as Amharic continue to encounter significant challenges, largely due to the scarcity of annotated corpora and the limited availability of language-specific, state-of-the-art model adaptations. To address these limitations, this study significantly expands an existing Amharic news dataset, increasing its size from 50,000 to 144,000 articles, thus enriching the linguistic and topical diversity available for the model training and evaluation. Using this expanded dataset, we systematically evaluated the performance of five transformer-based models: mBERT, XLM-R, DistilBERT, AfriBERTa, and AfroXLM in the context of Amharic news classification. Among these, AfriBERTa and XLM-R achieved the highest F1-scores of 90.25% and 90.11%, respectively, establishing a new performance baseline for the task. These findings underscore the efficacy of advanced multilingual and Africa-centric transformer architectures when applied to under-resourced languages, and further emphasize the critical importance of large-scale, high-quality datasets in enabling robust model generalization. This study offers a robust empirical foundation for advancing NLP research in low-resource languages, which remain underrepresented in current NLP resources and methodologies.

Is this Chatbot Trying to Sell Something? Towards Oversight of Chatbot Sales Tactics
Simrat Deol | Jack Luigi Henry Contro | Martim Brandao

This research investigates the detection of covert sales tactics in human-chatbot interactions, with a focus on the classification of solicited and unsolicited product recommendations. A custom dataset of 630 conversations was generated using a Large Language Model (LLM) to simulate chatbot-user interactions in various contexts, such as interacting with users from different age groups, recommending different types of products, and using different sales tactics. We then employ various approaches, including BiLSTM-based classification with sentence- and word-level embeddings, as well as zero-shot, few-shot, and CoT classification on large state-of-the-art LLMs. Our results show that few-shot GPT-4 (86.44%) is the most accurate model on our dataset, followed by our compact SBERT+BiLSTM model (78.63%), despite its small size. Our work demonstrates the feasibility of implementing oversight algorithms for monitoring chatbot conversations for undesired practices, and shows that such monitoring could potentially be implemented locally on-device to mitigate privacy concerns. This research thus lays the groundwork for the development of auditing and oversight methods for virtual assistants such as chatbots, allowing consumer protection agencies to monitor the ethical use of conversational AI.

Sarc7: Evaluating Sarcasm Detection and Generation with Seven Types and Emotion-Informed Techniques
Lang Xiong | Raina Gao | Alyssa Jeong

Sarcasm is a complex linguistic and pragmatic phenomenon where expressions convey meanings that contrast with their literal interpretations, requiring sensitivity to the speaker’s intent and context. Misinterpreting sarcasm in collaborative human–AI settings can lead to under- or overreliance on LLM outputs, with consequences ranging from breakdowns in communication to critical safety failures. We introduce Sarc7, a benchmark for fine-grained sarcasm evaluation based on the MUStARD dataset, annotated with seven pragmatically defined sarcasm types: self-deprecating, brooding, deadpan, polite, obnoxious, raging, and manic. These categories are adapted from prior linguistic work and used to create a structured dataset suitable for LLM evaluation. For classification, we evaluate multiple prompting strategies—zero-shot, few-shot, chain-of-thought (CoT), and a novel emotion-based technique—across five major LLMs. Emotion-based prompting yields the highest macro-averaged F1 score of 0.3664 (Gemini 2.5), outperforming CoT for several models and demonstrating its effectiveness in sarcasm type recognition. For sarcasm generation, we design structured prompts using fixed values across four sarcasm-relevant dimensions: incongruity, shock value, context dependency, and emotion. Using Claude 3.5 Sonnet, this approach produces more subtype-aligned outputs, with human evaluators preferring emotion-based generations 38.46% more often than zero-shot baselines. Sarc7 offers a foundation for evaluating nuanced sarcasm understanding and controllable generation in LLMs, pushing beyond binary classification toward interpretable, emotion-informed language modeling.
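
For reference, the macro-averaged F1 reported above is the unweighted mean of per-class F1 over the seven sarcasm types. A minimal sketch using scikit-learn follows; the gold labels and predictions shown are hypothetical examples, not data from the paper.

from sklearn.metrics import f1_score

SARCASM_TYPES = ["self-deprecating", "brooding", "deadpan", "polite",
                 "obnoxious", "raging", "manic"]
# Hypothetical gold labels and model predictions for a handful of examples.
y_true = ["deadpan", "polite", "raging", "manic"]
y_pred = ["deadpan", "obnoxious", "raging", "brooding"]
# Macro-averaging weights every sarcasm type equally, regardless of frequency.
macro_f1 = f1_score(y_true, y_pred, labels=SARCASM_TYPES,
                    average="macro", zero_division=0)
print(f"macro-F1 = {macro_f1:.4f}")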

Emotionally Aware or Tone-Deaf? Evaluating Emotional Alignment in LLM-Based Conversational Recommendation Systems
Darshna Parmar | Pramit Mazumdar

Recent advances in Large Language Models (LLMs) have enhanced the fluency and coherence of Conversational Recommendation Systems (CRSs), yet emotional intelligence remains a critical gap. In this study, we systematically evaluate the emotional behavior of six state-of-the-art LLMs in CRS settings using the ReDial and INSPIRED datasets. We propose an emotion-aware evaluation framework incorporating metrics such as Emotion Alignment, Emotion Flatness, and per-emotion F1-scores. Our analysis shows that most models frequently default to emotionally flat or mismatched responses, often misaligning with user affect (e.g., joy misread as neutral). We further examine patterns of emotional misalignment and their impact on user-centric qualities such as personalization, justification, and satisfaction. Through qualitative analysis, we demonstrate that emotionally aligned responses enhance user experience, while misalignments lead to loss of trust and relevance. This work highlights the need for emotion-aware design in CRS and provides actionable insights for improving affective sensitivity in LLM-generated recommendations.

MULBERE: Multilingual Jailbreak Robustness Using Targeted Latent Adversarial Training
Anastasia Dunca | Maanas Kumar Sharma | Olivia Munoz | Victor Rosales

Jailbreaking, the phenomenon where specific prompts cause LLMs to assist with harmful requests, remains a critical challenge in NLP, particularly in non-English and lower-resourced languages. To address this, we introduce MULBERE, a method that extends Targeted Latent Adversarial Training (T-LAT) to a multilingual context. We first create and share a multilingual jailbreak dataset spanning high-, medium-, and low-resource languages, and then fine-tune LLaMA-2-7b-chat with T-LAT for jailbreak robustness interleaved with chat examples to preserve model performance. Our evaluations show that MULBERE reduces average multilingual jailbreak success rates by 75% compared to the base LLaMA safety training and by 71% compared to English-only T-LAT, while maintaining or improving standard LLM performance.

Investigating Motivated Inference in Large Language Models
Nutchanon Yongsatianchot | Stacy Marsella

Our desires often influence our beliefs and expectations. Humans tend to think good things are more likely to happen than they actually are, while believing bad things are less likely. This tendency has been referred to as wishful thinking in research on coping strategies. With large language models (LLMs) increasingly being considered as computational models of human cognition, we investigate whether they can simulate this distinctly human bias. We conducted two systematic experiments across multiple LLMs, manipulating outcome desirability and information uncertainty across multiple scenarios including probability games, natural disasters, and sports events. Our experiments revealed limited wishful thinking in LLMs. In Experiment 1, only two models showed the bias, and only in sports-related scenarios when role-playing characters. Models exhibited no wishful thinking in mathematical contexts. Experiment 2 found that explicit prompting about emotional states (being hopeful) was necessary to elicit wishful thinking in logical domains. These findings reveal a significant gap between human cognitive biases and LLMs’ default behavior patterns, suggesting that current models require explicit guidance to simulate wishful thinking influences on belief formation.

Large Language Models as Detectors or Instigators of Hate Speech in Low-resource Ethiopian Languages
Nuhu Ibrahim | Felicity Mulford | Riza Batista-Navarro

We introduce a multilingual benchmark for evaluating large language models (LLMs) on hate speech detection and generation in low-resource Ethiopian languages: Afaan Oromo, Amharic and Tigrigna, and English (both monolingual and code-mixed). Using a balanced and expert-annotated dataset, we assess five state-of-the-art LLM families across both tasks. Our results show that while LLMs perform well on English detection, their performance on low-resource languages is significantly weaker, revealing that increasing model size alone does not ensure multilingual robustness. More critically, we find that all models, including closed and open-source variants, can be prompted to generate profiled hate speech with minimal resistance. These findings underscore the dual risk of exclusion and exploitation: LLMs fail to protect low-resource communities while enabling scalable harm against them. We make our evaluation framework available to facilitate future research on multilingual model safety and ethical robustness.

Brown Like Chocolate: How Vision-Language Models Associate Skin Tone with Food Colors
Nutchanon Yongsatianchot | Pachaya Sailamul

We investigate how Vision-Language Models (VLMs) leverage visual features when making analogical comparisons about people. Using synthetic images of individuals varying in skin tone and nationality, we prompt GPT and Gemini models to make analogical associations with desserts and drinks. Results reveal that VLMs systematically associate darker-skinned individuals with brown-colored food items, with GPT showing stronger associations than Gemini. These patterns are amplified in Thai versus English prompts, suggesting language-dependent encoding of visual stereotypes. The associations persist across manipulation checks including position swapping and clothing changes, though presenting individuals alone yields divergent language-specific patterns. This work reveals concerning associations in VLMs’ visual reasoning that vary by language, with important implications for multilingual deployment.

Improving BGE-M3 Multilingual Dense Embeddings for Nigerian Low Resource Languages
Abdulmatin Omotoso | Habeeb Shopeju | Adejumobi Monjolaoluwa Joshua | Shiloh Oni

Multilingual dense embedding models such as Multilingual E5, LaBSE, and BGE-M3 have shown promising results on diverse benchmarks for information retrieval in low-resource languages, but their performance on low-resource languages still lags behind that on high-resource languages. This work improves the performance of BGE-M3 through contrastive fine-tuning; the model was selected because of its superior performance over other multilingual embedding models across the MIRACL, MTEB, and SEB benchmarks. To fine-tune this model, we curated a comprehensive dataset comprising Yorùbá (32.9k rows), Igbo (18k rows), and Hausa (85k rows), drawn mainly from news sources. We further augmented our multilingual dataset with English queries mapped to the Yorùbá, Igbo, and Hausa documents, enabling cross-lingual semantic training. We evaluate in two settings: the Wura test set and the MIRACL benchmark. On Wura, the fine-tuned BGE-M3 raises mean reciprocal rank (MRR) to 0.9201 for Yorùbá, 0.8638 for Igbo, 0.9230 for Hausa, and 0.8617 for English queries matched to local documents, surpassing the BGE-M3 baselines of 0.7846, 0.7566, 0.8575, and 0.7377, respectively. On the MIRACL Yorùbá subset, the fine-tuned model attains an MRR of 0.5996, slightly surpassing base BGE-M3 (0.5952) and outperforming ML-E5-large (0.5632) and LaBSE (0.4468).
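
For clarity, mean reciprocal rank (MRR) as used above averages, over queries, the reciprocal of the rank at which the first relevant document is retrieved. A minimal illustrative sketch follows; the function name and document ids are hypothetical, not from the paper.

def mean_reciprocal_rank(rankings, relevant):
    # rankings: per-query lists of retrieved doc ids, best first.
    # relevant: per-query sets of gold-relevant doc ids.
    total = 0.0
    for ranked_ids, gold in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(rankings)

# e.g. two queries whose first relevant documents appear at ranks 1 and 2:
print(mean_reciprocal_rank([["d3", "d7"], ["d1", "d9"]], [{"d3"}, {"d9"}]))  # 0.75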

Challenges in Processing Chinese Texts Across Genres and Eras
Minghao Zheng | Sarah Moeller

Pre-trained Chinese Natural Language Processing (NLP) tools show reduced performance when analyzing poetry compared to prose. This study investigates the discrepancies between tools trained on either Classical or Modern Chinese prose when handling Classical Chinese prose and Classical Chinese poetry. Three experiments reveal error patterns indicating that the weaker performance on Classical Chinese poems is due to challenges in identifying word boundaries. Specifically, tools trained on Classical prose struggle to recognize word boundaries within Classical poetic structures, and tools trained on Modern prose have difficulty with word segmentation in both Classical Chinese genres. These findings provide valuable insights into the limitations of current NLP tools for studying Classical Chinese literature.

The Gemma Sutras: Fine-Tuning Gemma 3 for Sanskrit Sandhi Splitting
Samarth P | Sanjay Balaji Mahalingam

Sandhi, the phonological merging of morphemes, is a central feature of Sanskrit grammar. While Sandhi formation is well-defined by Pāṇini’s Aṣṭādhyāyī, the reverse task—Sandhi splitting—is substantially more complex due to inherent ambiguity and context-sensitive transformations. Accurate splitting is a critical precursor to tokenization in Sanskrit, which lacks explicit word boundaries and presents densely fused compounds. In this work, we present a data-driven approach, fine-tuning the Gemma-3 4B large language model on a dataset of over 49,000 training and 2,000 test examples of compound words and their morpheme-level decompositions. Leveraging the Unsloth framework with low-rank adaptation (LoRA) and 4-bit quantization, we train the model to predict these splits. Our work yields a scalable, Sandhi-aware system designed to enhance modern NLP pipelines for classical Sanskrit, demonstrating an effective application of LLMs to this linguistic challenge.

Evaluation Sheet for Deep Research: A Use Case for Academic Survey Writing
Israel Abebe Azime | Tadesse Destaw Belay | Atnafu Lambebo Tonja

Large Language Models (LLMs) powered with agentic capabilities are able to perform knowledge-intensive tasks without human involvement. A prime example of such a tool is Deep Research, with the capability to browse the web, extract information, and generate multi-page reports. In this work, we introduce an evaluation sheet that can be used to assess the capability of Deep Research tools. In addition, we selected academic survey writing as a use-case task and evaluated the output reports against the evaluation sheet we introduced. Our findings show the need for carefully crafted evaluation standards. The evaluation of OpenAI's Deep Search and Google's Deep Search on generating an academic survey showed the large gap between search engines and standalone Deep Research tools, as well as shortcomings in representing the targeted area.

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA
Sher Badshah | Hassan Sajjad

The emergence of Large Language Models (LLMs) as chat assistants capable of generating human-like conversations has amplified the need for robust evaluation methods, particularly for open-ended tasks. Conventional metrics such as EM and F1, while useful, are inadequate for capturing the full semantics and contextual depth of such generative outputs. We propose a reference-guided verdict method that automates the evaluation process by leveraging multiple LLMs as judges. Through experiments on free-form question-answering tasks, we demonstrate that combining multiple models improves the reliability and accuracy of evaluations, especially in tasks where a single model may struggle. The results indicate a strong correlation with human evaluations, establishing the proposed method as a reliable alternative to traditional metrics.
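
For context, the EM and F1 metrics referred to above are conventionally computed SQuAD-style, by normalizing answer strings and then scoring exact match and token overlap. A minimal sketch of that conventional computation is given below; it is not the authors' code.

import re
import string
from collections import Counter

def normalize(text):
    # Lowercase, drop punctuation and English articles, collapse whitespace.
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    p, g = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

These surface-level scores illustrate why the paper turns to LLM judges for free-form answers: semantically correct answers phrased differently from the reference can still receive low EM and F1.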

No for Some, Yes for Others: Persona Prompts and Other Sources of False Refusal in Language Models
Flor Miriam Plaza-del-Arco | Paul Röttger | Nino Scherrer | Emanuele Borgonovo | Elmar Plischke | Dirk Hovy

Large language models (LLMs) are increasingly integrated into our daily lives and personalized. However, LLM personalization might also increase unintended side effects. Recent work suggests that persona prompting can lead models to falsely refuse user requests. However, no work has fully quantified the extent of this issue. To address this gap, we measure the impact of 15 sociodemographic personas (based on gender, race, religion, and disability) on false refusal. To control for other factors, we also test 16 different models, three tasks (Natural Language Inference, politeness, and offensiveness classification), and nine prompt paraphrases. We propose a Monte Carlo-based method to quantify this issue in a sample-efficient manner. Our results show that as models become more capable, personas affect the refusal rate less. However, we find that the choice of model significantly influences false refusals, especially in sensitive content tasks. The impact of certain sociodemographic personas further increases the false refusal effect in some models, which suggests that there are underlying biases in the alignment strategies or safety mechanisms.