Other Workshops and Events (2025)


Volumes


Proceedings of the 1st Workshop on NLP for Languages Using Arabic Script

Mo El-Haj

The Best of Both Worlds: Exploring Wolofal in the Context of NLP
Ngoc Tan Le | Ali Mijiyawa | Abdoulahat Leye | Fatiha Sadat

This paper examines the three writing systems used for the Wolof language: the Latin script, the Ajami script (Wolofal), and the Garay script. Although the Latin alphabet is now the official standard for writing Wolof in Senegal, Garay and Ajami still play an important cultural and religious role, especially the latter. This article focuses specifically on Ajami, a system based on the Arabic script, and describes its history, its use, and its modern writings. We also analyze the challenges and prospects of these systems from the perspective of language preservation.

MultiProp Framework: Ensemble Models for Enhanced Cross-Lingual Propaganda Detection in Social Media and News using Data Augmentation, Text Segmentation, and Meta-Learning
Farizeh Aldabbas | Shaina Ashraf | Rafet Sifa | Lucie Flek

Propaganda, a pervasive tool for influencing public opinion, demands robust automated detection systems, particularly for under-resourced languages. Current efforts largely focus on well-resourced languages like English, leaving significant gaps in languages such as Arabic. This research addresses these gaps by introducing the MultiProp Framework, a cross-lingual meta-learning framework designed to enhance propaganda detection across multiple languages, including Arabic, German, Italian, French, and English. We constructed a multilingual dataset using data translation techniques, beginning with Arabic data from the PTC and WANLP shared tasks, and expanded it with translations into German, Italian, and French, further enriched by the SemEval23 dataset. Our proposed framework encompasses three distinct models: MultiProp-Baseline, which combines ensembles of pre-trained models such as GPT-2, mBART, and XLM-RoBERTa; MultiProp-ML, designed to handle languages with minimal or no training data by utilizing advanced meta-learning techniques; and MultiProp-Chunk, which overcomes the challenges of processing longer texts that exceed the token limits of pre-trained models. Together, they deliver superior performance compared to state-of-the-art methods, representing a significant advancement in the field of cross-lingual propaganda detection.
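
The chunking strategy behind MultiProp-Chunk can be illustrated with a short sketch: split an over-long document into overlapping windows, classify each window, and average the probabilities. This is a generic illustration of the technique (the checkpoint, stride, and binary label set are assumptions), not the authors' code.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)

def classify_long_text(text, max_len=512, stride=384):
    # Tokenize once, then slide an overlapping window over the token ids.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    chunk_probs = []
    for start in range(0, max(len(ids) - 1, 1), stride):
        chunk_text = tokenizer.decode(ids[start:start + max_len - 2])
        inputs = tokenizer(chunk_text, return_tensors="pt",
                           truncation=True, max_length=max_len)
        with torch.no_grad():
            chunk_probs.append(model(**inputs).logits.softmax(dim=-1))
    # Average chunk-level probabilities into one document-level decision.
    return torch.cat(chunk_probs).mean(dim=0)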

Towards Unified Processing of Perso-Arabic Scripts for ASR
Srihari Bandarupalli | Bhavana Akkiraju | Sri Charan Devarakonda | Harinie Sivaramasethu | Vamshiraghusimha Narasinga | Anil Vuppala

Automatic Speech Recognition (ASR) systems for morphologically complex languages like Urdu, Persian, and Arabic face unique challenges due to the intricacies of Perso-Arabic scripts. Conventional data processing methods often fall short in effectively handling these languages’ phonetic and morphological nuances. This paper introduces a unified data processing pipeline tailored specifically for Perso-Arabic languages, addressing the complexities inherent in these scripts. The proposed pipeline encompasses comprehensive steps for data cleaning, tokenization, and phonemization, each of which has been meticulously evaluated and validated by expert linguists. Through expert-driven refinements, our pipeline presents a robust foundation for advancing ASR performance across Perso-Arabic languages, supporting the development of more accurate and linguistically informed multilingual ASR systems in future.
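
A flavor of the script-level cleaning such a pipeline performs can be sketched in a few lines: mapping visually identical but differently encoded Perso-Arabic code points to one canonical form before tokenization. The mapping below is a tiny illustrative subset chosen by us, not the paper's actual rules.

import unicodedata

CANONICAL = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> KEHEH
}

def normalize(text: str) -> str:
    # Compose combining marks first, then unify letter variants.
    text = unicodedata.normalize("NFC", text)
    return "".join(CANONICAL.get(ch, ch) for ch in text)

print(normalize("علي"))  # the Arabic Yeh is replaced by its Farsi twin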

In-Depth Analysis of Arabic-Origin Words in the Turkish Morpholex
Mounes Zaval | Abdullah İhsanoğlu | Asım Ersoy | Olcay Taner Yıldız

MorphoLex is a resource that analyzes the roots, prefixes, and suffixes of words. Turkish MorphoLex, for example, analyzes 48,472 Turkish words. Unfortunately, it lacks an in-depth analysis of Arabic-origin words and does not include their correct roots. This study analyzes Arabic-origin words in the Turkish MorphoLex, annotating their roots, morphological patterns, and semantic categories. The methodology developed for this work is adaptable to other languages influenced by Arabic, such as Urdu and Persian, offering broader implications for studying loanword integration across linguistic contexts.

DadmaTools V2: an Adapter-Based Natural Language Processing Toolkit for the Persian Language
Sadegh Jafari | Farhan Farsi | Navid Ebrahimi | Mohamad Bagher Sajadi | Sauleh Eetemadi

DadmaTools V2 is a comprehensive repository designed to enhance NLP capabilities for the Persian language, catering to industry practitioners seeking practical and efficient solutions. The toolkit provides extensive code examples demonstrating the integration of its models with popular NLP frameworks such as Trankit and Transformers, as well as deep learning frameworks like PyTorch. Additionally, DadmaTools supports widely used Persian embeddings and datasets, ensuring robust language processing capabilities. The latest version of DadmaTools introduces an adapter-based technique, significantly reducing memory usage by employing a shared pre-trained model across various tasks, supplemented with task-specific adapter layers. This approach eliminates the need to maintain multiple pre-trained models and optimizes resource utilization. Enhancements in this version include new modules such as a sentiment detector, an informal-to-formal text converter, and a spell checker, further expanding the toolkit's functionality. DadmaTools V2 thus represents a powerful, efficient, and versatile resource for advancing Persian NLP applications.
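
The adapter idea, a single frozen backbone shared across tasks with small trainable task-specific layers, can be sketched in plain PyTorch. This is an illustration of the general technique under assumed dimensions, not DadmaTools' implementation.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # A small residual bottleneck inserted after a transformer layer.
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states):
        return hidden_states + self.up(self.act(self.down(hidden_states)))

# One frozen backbone, one tiny adapter per task: only adapters are trained,
# so adding a task costs kilobytes instead of a full model copy.
backbone = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False
adapters = {"sentiment": BottleneckAdapter(), "spellcheck": BottleneckAdapter()}

x = torch.randn(2, 16, 768)
out = adapters["sentiment"](backbone(x))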

Developing an Informal-Formal Persian Corpus: Highlighting the Differences between Two Writing Styles
Vahide Tajalli | Mehrnoush Shamsfard | Fateme Kalantari

Informal language is a style of spoken or written language frequently used in casual conversations, social media, weblogs, emails, and text messages. In informal writing, the language undergoes lexical and/or syntactic changes that vary across languages. Persian is one of the languages with many differences between its formal and informal styles of writing, so developing informal language processing tools for it seems necessary. In the present paper, we describe the methodology for building a parallel corpus of 50,000 sentence pairs with alignments at the word/phrase level. The resulting corpus has about 530,000 alignments and a dictionary containing 49,397 word and phrase pairs. The observed differences between formal and informal writing are explained in detail.

Boosting Sentiment Analysis in Persian through a GAN-Based Synthetic Data Augmentation Method
Masoumeh Mohammadi | Mohammad Ruhul Amin | Shadi Tavakoli

This paper presents a novel Sentiment Analysis (SA) dataset in the low-resource Persian language, together with a data augmentation technique that uses Generative Adversarial Networks (GANs) to generate synthetic data, boosting the volume and variety of data and achieving state-of-the-art performance. We propose a novel annotated SA dataset, called Senti-Persian, made of 67,743 public comments on movie reviews from Iranian websites (Namava, Filimo and Aparat) and social media (YouTube, Twitter and Instagram). These reviews are labeled with one of the polarity labels: positive, negative, or neutral. Our study includes a novel text augmentation model based on GANs. The generator was designed following the linguistic properties of Persian, while the discriminator was based on the cosine similarity of the vectorized original and generated sentences, i.e., the CLS-embeddings of BERT. An SA task was applied to both the collected and augmented datasets, and we observed a significant improvement in accuracy, from 88.4% on the original dataset to 96% when augmented with synthetic data. The Senti-Persian dataset, including both the original and augmented portions, will be available on GitHub.
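
The discriminator's similarity signal, cosine similarity between the [CLS] embeddings of an original and a generated sentence, can be sketched as follows. A generic multilingual BERT and English toy sentences stand in for the authors' Persian setup.

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")

def cls_embedding(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # The first token of the last hidden layer is the [CLS] vector.
        return model(**inputs).last_hidden_state[:, 0]

orig = cls_embedding("The movie was fantastic.")
gen = cls_embedding("The film was wonderful.")
print(f"cosine similarity: {torch.cosine_similarity(orig, gen).item():.3f}")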

Psychological Health Chatbot, Detecting and Assisting Patients in their Path to Recovery
Sadegh Jafari | Mohammad Erfan Zare | Amireza Vishte | Mirzae Melike | Zahra Amiri | Sima Mohammadparast | Sauleh Eetemadi

Mental health disorders such as stress, anxiety, and depression are increasingly prevalent globally, yet access to care remains limited due to barriers like geographic isolation, financial constraints, and stigma. Conversational agents or chatbots have emerged as viable digital tools for personalized mental health support. This paper presents the development of a psychological health chatbot designed specifically for Persian-speaking individuals, offering a culturally sensitive tool for emotion detection and disorder identification. The chatbot integrates several advanced natural language processing (NLP) modules, leveraging the ArmanEmo dataset to identify emotions, assess psychological states, and ensure safe, appropriate responses. Our evaluation of various models, including ParsBERT and XLM-RoBERTa, demonstrates effective emotion detection with accuracy up to 75.39%. Additionally, the system incorporates a Large Language Model (LLM) to generate messages. This chatbot serves as a promising solution for addressing the accessibility gap in mental health care and provides a scalable, language-inclusive platform for psychological support.

A Derivational ChainBank for Modern Standard Arabic
Reham Marzouk | Sondos Krouna | Nizar Habash

We introduce the new concept of an Arabic Derivational Chain Bank (CHAINBANK) to leverage the relationship between form and meaning in modeling Arabic derivational morphology. We constructed a knowledge-graph network of abstract patterns and their derivational relations, and aligned it with the lemmas of the CAMELMORPH morphological analyzer database. This process produced chains of derived words' lemmas linked to their corresponding lemma bases through derivational relations, encompassing 23,333 derivational connections. The CHAINBANK is publicly available.
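
Structurally, a chain bank is a rooted graph in which every derived lemma points to its base. A minimal sketch using the well-known k-t-b derivational family (examples chosen by us for illustration, not taken from the CHAINBANK):

import networkx as nx

chains = nx.DiGraph()
chains.add_edge("kātib", "katab")      # active participle  <- base verb
chains.add_edge("maktūb", "katab")     # passive participle <- base verb
chains.add_edge("maktab", "katab")     # noun of place      <- base verb
chains.add_edge("maktabah", "maktab")  # 'library'          <- 'office'

def chain_to_base(lemma):
    # Follow derivational links until a lemma has no recorded base.
    path = [lemma]
    while True:
        bases = list(chains.successors(path[-1]))
        if not bases:
            return path
        path.append(bases[0])

print(chain_to_base("maktabah"))  # ['maktabah', 'maktab', 'katab']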

Sentiment Analysis of Arabic Tweets Using Large Language Models
Pankaj Dadure | Ananya Dixit | Kunal Tewatia | Nandini Paliwal | Anshika Malla

In the digital era, sentiment analysis has become an indispensable tool for understanding public sentiment, optimizing market strategies, and enhancing customer engagement across diverse sectors. While significant advancements have been made in sentiment analysis for high-resource languages such as English and French, this study focuses on Arabic, a low-resource language, to address its unique challenges, such as morphological complexity, diverse dialects, and limited linguistic resources. Existing work in Arabic sentiment analysis has utilized deep learning architectures like LSTM, BiLSTM, and CNN-LSTM, alongside embedding techniques such as Word2Vec and contextualized models like AraBERT. Building on this foundation, our research investigates sentiment classification of Arabic tweets, categorizing them as positive or negative, using embeddings derived from three large language models (LLMs): Universal Sentence Encoder (USE), XLM-RoBERTa base (XLM-R base), and MiniLM-L12-v2. Experimental results demonstrate that incorporating emojis in the dataset and using the MiniLM embeddings yields an accuracy of 85.98%. In contrast, excluding emojis and using embeddings from the XLM-R base resulted in a lower accuracy of 78.98%. These findings highlight the impact of both dataset composition and embedding techniques on Arabic sentiment analysis performance.
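
The embed-then-classify recipe is straightforward to sketch with an off-the-shelf MiniLM encoder and a linear classifier. The checkpoint name, toy tweets, and classifier choice are assumptions for illustration, not the paper's exact configuration.

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

tweets = ["أحببت هذا المنتج 😍", "خدمة سيئة جدا 😡"]  # toy examples
labels = [1, 0]                                      # 1 = positive, 0 = negative

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
X = encoder.encode(tweets)  # one fixed-size vector per tweet

clf = LogisticRegression().fit(X, labels)
print(clf.predict(encoder.encode(["تجربة رائعة 👍"])))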

Evaluating Large Language Models on Health-Related Claims Across Arabic Dialects
Abdulsalam obaid Alharbi | Abdullah Alsuhaibani | Abdulrahman Abdullah Alalawi | Usman Naseem | Shoaib Jameel | Salil Kanhere | Imran Razzak

While Large Language Models (LLMs) have proven popular across different tasks, their capability to handle health-related claims in diverse linguistic and cultural contexts, such as the Saudi, Egyptian, Lebanese, and Moroccan Arabic dialects, has not been thoroughly explored. To this end, we develop a comprehensive evaluation framework to assess how LLMs, particularly GPT-4, respond to health-related claims. Our framework focuses on measuring factual accuracy, consistency, and cultural adaptability. It introduces a new metric, the "Cultural Sensitivity Score", to evaluate the model's ability to adjust responses based on dialectal differences. Additionally, the reasoning patterns used by the models are analyzed to assess their effectiveness in engaging with claims across these dialects. Our findings highlight that while LLMs excel at recognizing true claims, they encounter difficulties with mixed and ambiguous claims, especially in underrepresented dialects. This work underscores the importance of dialect-specific evaluations to ensure accurate, contextually appropriate, and culturally sensitive responses from LLMs in real-world applications.

Can LLMs Verify Arabic Claims? Evaluating the Arabic Fact-Checking Abilities of Multilingual LLMs
Ayushman Gupta | Aryan Singhal | Thomas Law | Veekshith Rao | Evan Duan | Ryan Luo Li

Large language models (LLMs) have demonstrated potential in fact-checking claims, yet their capabilities in verifying claims in multilingual contexts remain largely understudied. This paper investigates the efficacy of various prompting techniques, viz. Zero-Shot, English Chain-of-Thought, Self-Consistency, and Cross-Lingual Prompting, in enhancing the fact-checking and claim-verification abilities of LLMs for Arabic claims. We utilize 771 Arabic claims sourced from the X-fact dataset to benchmark the performance of four LLMs. To the best of our knowledge, ours is the first study to benchmark the inherent Arabic fact-checking abilities of LLMs stemming from their knowledge of Arabic facts, using a variety of prompting methods. Our results reveal significant variations in accuracy across different prompting methods. Our findings suggest that Cross-Lingual Prompting outperforms other methods, leading to notable performance gains.
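
Two of the prompting techniques named above compose naturally: cross-lingual prompting (reason in English about the Arabic claim) wrapped in self-consistency (a majority vote over several sampled verdicts). A minimal sketch, where ask_llm is a hypothetical stand-in for any chat-completion client:

from collections import Counter

def ask_llm(prompt: str, temperature: float = 0.8) -> str:
    raise NotImplementedError("plug in an LLM client here")

def self_consistent_verdict(claim: str, n_samples: int = 5) -> str:
    prompt = (
        "Think about the claim in English, then answer with exactly "
        f"one word, 'true' or 'false'.\nClaim (Arabic): {claim}"
    )
    # Sample several independent verdicts and keep the majority answer.
    votes = [ask_llm(prompt) for _ in range(n_samples)]
    return Counter(votes).most_common(1)[0][0]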

Can LLMs Translate Cultural Nuance in Dialects? A Case Study on Lebanese Arabic
Silvana Yakhni | Ali Chehab

Machine Translation (MT) of Arabic-script languages presents unique challenges due to their vast linguistic diversity and lack of standardization. This paper focuses on the Lebanese dialect, investigating the effectiveness of Large Language Models (LLMs) in handling culturally aware translations. We identify critical limitations in existing Lebanese-English parallel datasets, particularly their non-native nature and lack of cultural context. To address these gaps, we introduce a new culturally rich dataset derived from the Language Wave (LW) podcast. We evaluate the performance of LLMs (Jais, AceGPT, Cohere, and GPT-4) against Neural Machine Translation (NMT) systems (NLLB-200 and Google Translate). Our findings reveal that while both architectures perform similarly on non-native datasets, LLMs demonstrate superior capabilities in preserving cultural nuances when handling authentic Lebanese content. Additionally, we validate xCOMET as a reliable metric for evaluating the quality of Arabic dialect translation, showing a strong correlation with human judgment. This work contributes to the growing field of Culturally-Aware Machine Translation and highlights the importance of authentic, culturally representative datasets in advancing low-resource translation systems.
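
Scoring a candidate translation with xCOMET takes only a few lines with the unbabel-comet package; the checkpoint below is gated on Hugging Face and the example sentences are invented, so treat this as a sketch of the evaluation step rather than the paper's setup.

from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/XCOMET-XL"))
data = [{
    "src": "كيفك؟ شو في ما في؟",       # Lebanese Arabic source
    "mt": "How are you? What's up?",   # system output
    "ref": "How are you? What's new?", # human reference
}]
print(model.predict(data, batch_size=8, gpus=0).scores)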

Automated Generation of Arabic Verb Conjugations with Multilingual Urdu Translation: An NLP Approach
Haq Nawaz | Manal Elobaid | Ali Al-Laith | Saif Ullah

This paper presents a rule-based automated system for generating both Arabic verb conjugations and their corresponding Urdu translations. The system processes triliteral, non-weak Arabic roots across key tenses: Past Simple, Past Simple Negative, Present Simple, and Present Simple Negative. Addressing the challenges posed by Arabic morphology, our rule-based approach applies patterns and morphological rules to accurately produce verb conjugations, capturing essential grammatical variations in gender, number, and person. Simultaneously, the system generates Urdu translations using predefined patterns that are aligned with the grammatical nuances of Arabic, ensuring semantic consistency. As the first system of its kind, it uniquely provides a cross-lingual resource that bridges two linguistically similar but distinct languages. By focusing on rule-based precision and dual-language outputs, it addresses critical gaps in NLP resources, serving as a valuable tool for linguists, educators, and NLP researchers in academic and religious contexts where Arabic and Urdu coexist.
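
The flavor of such pattern rules can be shown with a toy conjugator for the Past Simple of a sound triliteral root. The suffix table covers only three cells and is our own illustrative reduction, not the paper's rule set.

# Person/gender/number -> suffix appended to the CaCaC past stem.
PAST_SUFFIXES = {
    ("3", "masc", "sg"): "\u064E",             # fatha:  kataba
    ("3", "fem", "sg"): "\u064E\u062A\u0652",  # -at:    katabat
    ("1", "-", "sg"): "\u0652\u062A\u064F",    # -tu:    katabtu
}

def past_simple(root: str, person: str, gender: str, number: str) -> str:
    c1, c2, c3 = root                          # e.g. "كتب" -> ك, ت, ب
    stem = c1 + "\u064E" + c2 + "\u064E" + c3  # CaCaC pattern with fathas
    return stem + PAST_SUFFIXES[(person, gender, number)]

print(past_simple("كتب", "3", "masc", "sg"))  # كَتَبَ (kataba)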

Evaluation of Large Language Models on Arabic Punctuation Prediction
Asma Ali Al Wazrah | Afrah Altamimi | Hawra Aljasim | Waad Alshammari | Rawan Al-Matham | Omar Elnashar | Mohamed Amin | Abdulrahman AlOsaimy

The linguistic inclusivity of Large Language Models (LLMs) such as ChatGPT, Gemini, JAIS, and AceGPT has not been sufficiently explored, particularly in their handling of low-resource languages like Arabic compared to English. While these models have shown impressive performance across various tasks, their effectiveness in Arabic remains under-examined. Punctuation, critical for sentence structure and comprehension in tasks like speech analysis, synthesis, and machine translation, requires precise prediction. This paper assesses seven LLMs, GPT-4o, Gemini 1.5, JAIS, AceGPT, SILMA, ALLaM, and Command R+, for Arabic punctuation prediction. Additionally, the performance of fine-tuned AraBERT is compared with these models in zero-shot and few-shot settings using a proposed Arabic punctuation prediction corpus of 10,044 sentences. The experiments demonstrate that while AraBERT performs well for specific punctuation marks, LLMs show significant promise in zero-shot learning, with further improvements in few-shot scenarios. These findings highlight the potential of LLMs to enhance the automation and accuracy of Arabic text processing.

Evaluating RAG Pipelines for Arabic Lexical Information Retrieval: A Comparative Study of Embedding and Generation Models
Raghad Al-Rasheed | Abdullah Al Muaddi | Hawra Aljasim | Rawan Al-Matham | Muneera Alhoshan | Asma Al Wazrah | Abdulrahman AlOsaimy

This paper investigates the effectiveness of retrieval-augmented generation (RAG) pipelines, focusing on Arabic lexical information retrieval. Specifically, it analyzes how embedding models affect the recall of Arabic lexical information and evaluates the ability of large language models (LLMs) to produce accurate and contextually relevant answers within RAG pipelines. We examine a dataset of over 88,000 words from the Riyadh dictionary and evaluate the models using metrics such as Top-K Recall, Mean Reciprocal Rank (MRR), F1 Score, Cosine Similarity, and Accuracy. The research assesses the capabilities of several embedding models, including E5-large, BGE, AraBERT, CAMeLBERT, and AraELECTRA, highlighting a disparity in performance between sentence embeddings and word embeddings. Sentence embedding with E5 achieved the best results, with a Top-5 Recall of 0.88 and an MRR of 0.48. For the generation models, we evaluated GPT-4, GPT-3.5, SILMA-9B, Gemini-1.5, Aya-8B, and AceGPT-13B based on their ability to generate accurate and contextually appropriate responses. GPT-4 demonstrated the best performance, achieving an F1 score of 0.90, an accuracy of 0.82, and a cosine similarity of 0.87. Our results emphasize the strengths and limitations of both embedding and generation models in Arabic tasks.
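
The two ranking metrics used above are easy to state precisely in code; the ranked lists below are invented stand-ins for retrieved dictionary entries.

def top_k_recall(ranked, relevant, k=5):
    # Fraction of relevant items that appear in the top-k of the ranking.
    return sum(1 for r in relevant if r in ranked[:k]) / len(relevant)

def mean_reciprocal_rank(results):
    # `results` holds (ranked list, gold item) pairs, one per query.
    total = 0.0
    for ranked, gold in results:
        total += 1.0 / (ranked.index(gold) + 1) if gold in ranked else 0.0
    return total / len(results)

queries = [(["قاموس", "معجم", "كتاب"], "معجم"),
           (["جذر", "وزن", "لفظ"], "جذر")]
print(top_k_recall(queries[0][0], [queries[0][1]], k=2))  # 1.0
print(mean_reciprocal_rank(queries))  # (1/2 + 1/1) / 2 = 0.75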

Proceedings of the 1st Workshop on AI and Scientific Discovery: Directions and Opportunities

Peter Jansen | Bhavana Dalvi Mishra | Harsh Trivedi | Bodhisattwa Prasad Majumder | Tom Hope | Tushar Khot | Doug Downey | Eric Horvitz

Variable Extraction for Model Recovery in Scientific Literature
Chunwei Liu | Enrique Noriega-Atala | Adarsh Pyarelal | Clayton T Morrison | Mike Cafarella

Due to the increasing productivity in the scientific community, it is difficult to keep up with the literature without the assistance of AI methods. This paper evaluates various methods for extracting mathematical model variables from epidemiological studies, such as "infection rate (α)", "recovery rate (γ)", and "mortality rate (μ)". Variable extraction appears to be a basic task, but plays a pivotal role in recovering models from scientific literature. Once extracted, we can use these variables for automatic mathematical modeling, simulation, and replication of published results. We also introduce a benchmark dataset comprising manually-annotated variable descriptions and variable values extracted from scientific papers. Our analysis shows that LLM-based solutions perform the best. Despite the incremental benefits of combining rule-based extraction outputs with LLMs, the leap in performance attributed to the transfer-learning and instruction-tuning capabilities of LLMs themselves is far more significant. This investigation demonstrates the potential of LLMs to enhance automatic comprehension of scientific artifacts and for automatic model recovery and simulation.

How Well Do Large Language Models Extract Keywords? A Systematic Evaluation on Scientific Corpora
Nacef Ben Mansour | Hamed Rahimi | Motasem Alrahabi

Automatic keyword extraction from scientific articles is pivotal for organizing scholarly archives, powering semantic search engines, and mapping interdisciplinary research trends. However, existing methods, including statistical and graph-based approaches, struggle to handle domain-specific challenges such as technical terminology, cross-disciplinary ambiguity, and dynamic scientific jargon. This paper presents an empirical comparison of traditional keyword extraction methods (e.g., TextRank and YAKE) with approaches based on Large Language Models. We introduce a novel evaluation framework that combines fuzzy semantic matching based on Levenshtein distance with exact-match metrics (F1, precision, recall) to address inconsistencies in keyword normalization across scientific corpora. Through an extensive ablation study across nine different LLMs, we analyze their performance and associated costs. Our findings reveal that LLM-based methods consistently achieve superior precision and relevance compared to traditional approaches. This performance advantage suggests significant potential for improving scientific search systems and information retrieval in academic contexts.
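
The fuzzy-matching idea, counting a predicted keyword as correct when it lies within a small Levenshtein distance of a gold keyword, can be sketched in pure Python. The edit-distance threshold here is an assumption, not the paper's value.

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def fuzzy_f1(predicted, gold, max_dist=2):
    tp = sum(any(levenshtein(p, g) <= max_dist for g in gold) for p in predicted)
    hit = sum(any(levenshtein(p, g) <= max_dist for p in predicted) for g in gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = hit / len(gold) if gold else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(fuzzy_f1(["key-word extraction"], ["keyword extraction"]))  # 1.0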

A Human-LLM Note-Taking System with Case-Based Reasoning as Framework for Scientific Discovery
Douglas B Craig

Scientific discovery is an iterative process that requires transparent reasoning, empirical validation, and structured problem-solving. This work presents a novel human-in-the-loop AI system that leverages case-based reasoning to facilitate structured scientific inquiry. The system is designed to be note-centric, using the Obsidian note-taking application as the primary interface where all components, including user inputs, system cases, and tool specifications, are represented as plain-text notes. This approach ensures that every step of the research process is visible, editable, and revisable by both the user and the AI. The system dynamically retrieves relevant cases from past experience, refines hypotheses, and structures research workflows in a transparent and iterative manner. The methodology is demonstrated through a case study investigating the role of TLR4 in sepsis, illustrating how the system supports problem framing, literature review, hypothesis formulation, and empirical validation. The results highlight the potential of AI-assisted scientific workflows to enhance research efficiency while preserving human oversight and interpretability.

Towards AI-assisted Academic Writing
Daniel J. Liebling | Malcolm Kane | Madeleine Grunde-McLaughlin | Ian Lang | Subhashini Venugopalan | Michael Brenner

We present components of an AI-assisted academic writing system including citation recommendation and introduction writing. The system recommends citations by considering the user’s current document context to provide relevant suggestions. It generates introductions in a structured fashion, situating the contributions of the research relative to prior work. We demonstrate the effectiveness of the components through quantitative evaluations. Finally, the paper presents qualitative research exploring how researchers incorporate citations into their writing workflows. Our findings indicate that there is demand for precise AI-assisted writing systems and simple, effective methods for meeting those needs.

Evaluating and Enhancing Large Language Models for Novelty Assessment in Scholarly Publications
Ethan Lin | Zhiyuan Peng | Yi Fang

Recent studies have evaluated the creativity of large language models (LLMs), of which novelty is an important aspect, primarily from a semantic perspective, using benchmarks from cognitive science. However, assessing novelty in scholarly publications, a critical facet of evaluating LLMs as scientific discovery assistants, remains underexplored, despite its potential to accelerate research cycles and prioritize high-impact contributions in scientific workflows. We introduce SchNovel, a benchmark to evaluate LLMs' ability to assess novelty in scholarly papers, a task central to streamlining the discovery pipeline. SchNovel consists of 15,000 pairs of papers across six fields sampled from the arXiv dataset, with publication dates spanning 2 to 10 years apart. In each pair, the more recently published paper is assumed to be more novel. Additionally, we propose RAG-Novelty, a retrieval-augmented method that mirrors human peer review by grounding novelty assessment in retrieved context. Extensive experiments provide insights into the capabilities of different LLMs to assess novelty and demonstrate that RAG-Novelty outperforms recent baseline models, highlighting LLMs' promise as tools for automating novelty detection in scientific workflows.

LLM-Assisted Translation of Legacy FORTRAN Codes to C++: A Cross-Platform Study
Nishath Rajiv Ranasinghe | Shawn M. Jones | Michal Kucer | Ayan Biswas | Daniel O’Malley | Alexander Most | Selma Liliane Wanna | Ajay Sreekumar

Large Language Models (LLMs) are increasingly being leveraged for generating and translating scientific computer codes by both domain experts and non-domain experts. Fortran has served as one of the go-to programming languages in legacy high-performance computing (HPC) for scientific discoveries. Despite growing adoption, LLM-based code translation of legacy code-bases has not been thoroughly assessed or quantified for its usability. Here, we studied the applicability of LLM-based translation of Fortran to C++ as a step towards building an agentic workflow using open-weight LLMs on two different computational platforms. We statistically quantified the compilation accuracy of the translated C++ codes, measured the similarity of the LLM-translated code to the human-translated C++ code, and statistically quantified the output similarity of the Fortran to C++ translation.

FlavorDiffusion: Modeling Food-Chemical Interactions with Diffusion
Junpyo Seo | Dongwan Kim | Jaewook Jeong | Inkyu Park | Junho Min

The study of food pairing has evolved beyond subjective expertise with the advent of machine learning. This paper presents FlavorDiffusion, a novel framework leveraging diffusion models to predict food-chemical interactions and ingredient pairings without relying on chromatography. By integrating graph-based embeddings, diffusion processes, and chemical property encoding, FlavorDiffusion addresses data imbalances and enhances clustering quality. Using a heterogeneous graph derived from datasets like Recipe1M and FlavorDB, our model demonstrates superior performance in reconstructing ingredient-ingredient relationships. The addition of a Chemical Structure Prediction (CSP) layer further refines the embedding space, achieving state-of-the-art NMI scores and enabling meaningful discovery of novel ingredient combinations. The proposed framework represents a significant step forward in computational gastronomy, offering scalable, interpretable, and chemically informed solutions for food science.

Proceedings of the Second Workshop on Ancient Language Processing

Adam Anderson | Shai Gordin | Bin Li | Yudong Liu | Marco C. Passarotti | Rachele Sprugnoli

Automatic Text Segmentation of Ancient and Historic Hebrew
Elisha Rosensweig | Benjamin Resnick | Hillel Gershuni | Joshua Guedalia | Nachum Dershowitz | Avi Shmidman

Ancient texts often lack punctuation marks, making it challenging to determine sentence and clause boundaries. Texts may contain sequences of hundreds of words without any period or indication of a full stop. Determining such boundaries is a crucial step in various NLP pipelines, especially regarding language models such as BERT that have context window constraints, and regarding machine translation models, which may become far less accurate when fed too much text at a time. In this paper, we consider several novel approaches to the automatic segmentation of unpunctuated ancient texts into grammatically complete or semi-complete units. Our work focuses on ancient and historical Hebrew and Aramaic texts, but the tools developed can be applied equally to similar languages. We explore several approaches to this task: masked language models (MLM) to predict the next token; few-shot completions via an open-source foundational LLM; and the "Segment-Any-Text" (SaT) tool of Frohmann et al. (2024). These are then compared to instruct-based flows using commercial (closed, managed) LLMs, to be used as a benchmark. To evaluate these approaches, we also introduce a new ground truth (GT) dataset of manually segmented texts and explore the performance of our different approaches on it. We release both our segmentation tools and the dataset to support further research into the computational processing and analysis of ancient texts, at https://github.com/ERC-Midrash/rabbinic_chunker.

Integrating Semantic and Statistical Features for Authorial Clustering of Qumran Scrolls
Yonatan Lourie | Jonathan Ben-Dov | Roded Sharan

We present a novel framework for authorial classification and clustering of the Qumran Dead Sea Scrolls (DSS). Our approach combines modern Hebrew BERT embeddings with traditional natural language processing features in a graph neural network (GNN) architecture. Our results outperform baseline methods on both the Dead Sea Scrolls and a validation dataset of the Hebrew Bible. In particular, we leverage our model to provide significant insights into long-standing debates, including the classification of sectarian and non-sectarian texts and the division of the Hodayot collection of hymns.

Assignment of account type to proto-cuneiform economic texts with Multi-Class Support Vector Machines
Piotr Zadworny | Shai Gordin

We investigate the use of machine learning for classifying proto-cuneiform economic texts (3,500-3,000 BCE), leveraging Multi-Class Support Vector Machines (MSVM) to assign text type based on content. Proto-cuneiform presents unique challenges, as it does not encode spoken language, yet is transcribed into linear formats that obscure original structural elements. We address this by reformatting transcriptions, experimenting with different tokenization strategies, and optimizing feature extraction. Our workflow achieves high labeling reliability and enables significant metadata enrichment. In addition to improving digital corpus organization, our approach opens the chance to identify economic institutions in ancient Mesopotamian archives, providing a new tool for Assyriological research.
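
The classification setup, bag-of-features text type assignment with a multi-class SVM, can be sketched with scikit-learn. The toy transliterations and the character n-gram features are stand-ins for the paper's tokenization and feature-extraction choices.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["barley ration worker", "beer delivery temple", "barley field area"]
labels = ["ration list", "delivery note", "field survey"]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),  # one-vs-rest gives the multi-class behaviour
)
clf.fit(texts, labels)
print(clf.predict(["barley ration temple"]))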

Using Cross-Linguistic Data Formats to Enhance the Annotation of Ancient Chinese Documents Written on Bamboo Slips
Michele Pulini | Johann-Mattis List

Ancient Chinese documents written on bamboo slips more than 2000 years ago offer a rich resource for research in linguistics, paleography, and historiography. However, since most documents are only available in the form of scans, additional steps of analysis are needed to turn them into interactive digital editions, amenable both for manual and computational exploration. Here, we present a first attempt to establish a workflow for the annotation of ancient bamboo slips. Based on a recently rediscovered dialogue on warfare, we illustrate how a digital edition amenable for manual and computational exploration can be created by integrating standards originally designed for cross-linguistic data collections.

Accessible Sanskrit: A Cascading System for Text Analysis and Dictionary Access
Giacomo De Luca

Sanskrit text processing presents unique computational challenges due to its complex morphology, frequent compound formation, and the phenomenon of Sandhi. While several approaches to Sanskrit word segmentation exist, the field lacks integrated tools that make texts accessible while maintaining high accuracy. We present a hybrid approach combining rule-based and statistical methods that achieves reliable Sanskrit text analysis through a cascade mechanism, in which deterministic matching using inflection tables is used for simple cases and statistical approaches are used for the more complex ones. The goal of the system is to provide automatic text annotation and inflected dictionary search, returning for each word its root forms, comprehensive grammatical analysis, inflection tables, and dictionary entries from multiple sources. The system is evaluated on 300 randomly selected compounds from the GRETIL corpus across different length categories and maintains 90% accuracy regardless of compound length, with 91% accuracy on compounds longer than 40 characters. The approach is also tested on the complete text of the Yoga Sutra, demonstrating 96% accuracy in the practical use case. This approach is implemented both as an open-source Python library and a web application, making Sanskrit text analysis accessible to scholars and interested readers while retaining state-of-the-art accuracy.
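
The cascade can be stated in a few lines: a deterministic inflection-table lookup handles simple forms, and only unresolved words fall through to the statistical path. The table entries and function names below are hypothetical.

INFLECTION_TABLE = {
    "devaḥ": ("deva", "nom. sg. masc."),
    "devau": ("deva", "nom. du. masc."),
}

def statistical_analyzer(word):
    # Stand-in for the statistical segmentation used on hard cases.
    raise NotImplementedError("handle sandhi and compounds here")

def analyze(word):
    if word in INFLECTION_TABLE:              # fast deterministic path
        root, gram = INFLECTION_TABLE[word]
        return {"root": root, "analysis": gram, "via": "table"}
    return statistical_analyzer(word)         # fallback for complex forms

print(analyze("devaḥ"))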

Towards an Integrated Methodology of Dating Biblical Texts: The Case of the Book of Jeremiah
Martijn Naaijer | Aren Wilson-Wright

In this paper we describe our research project on dating the language of the Book of Jeremiah using a combination of traditional biblical scholarship and machine learning. Jeremiah is a book with a long history of composition and editing, and the historical background of many of its sections is unclear. Moreover, redaction criticism and historical linguistics are mostly separate fields within the discipline of Biblical Studies. With our approach we want to integrate these areas of research and make new strides in uncovering the compositional history of the Book of Jeremiah.

The Development of Hebrew in Antiquity – A Computational Linguistic Study
Hallel Baitner | Dimid Duchovny | Lee-Ad Gottlieb | Amir Yorav | Nachum Dershowitz | Eshbal Ratzon

The linguistic nature of Qumran Hebrew (QH) remains a central debate in the study of the Dead Sea Scrolls (DSS). Although some scholars view QH as an artificial imitation of Biblical Hebrew (BH), others argue that it represents a spoken dialect of ancient Judea. The present study employs computational linguistic techniques (clustering, classification, and machine learning) to analyze the relationship of QH with Biblical and Mishnaic Hebrew. Preliminary findings confirm existing scholarly conclusions regarding the linguistic affinity of certain texts, demonstrating that our methodology has a fundamental capacity to identify linguistic relationships. The findings also contribute new leads, and we are now working to refine and enhance our analytical methods so as to provide well-founded insights into the historical development of Hebrew and the process of DSS textual composition.

A Dataset of Ancient Chinese Math Word Problems and an Application for Research in Historic Mathematics
Florian Keßler

Solving math word problems, i.e., mathematical problems stated in natural language, has received much attention in the Artificial Intelligence (AI) community over the last years. Unsurprisingly, research has focused on problems stated in contemporary languages. In contrast to this, in this article, we introduce a dataset of math word problems that is extracted from ancient Chinese mathematical texts. The dataset is made available. We report a baseline performance for GPT-4o solving the problems in the dataset using a Program-of-Thought paradigm that translates the mathematical procedures in the original texts into Python code, giving acceptable performance but showing that the model often struggles with understanding the pre-modern language. Finally, we describe how the generated code can be used for research into the history of mathematics, by offering a way to search the texts by abstract operations instead of specific lexemes.
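
A Program-of-Thought loop of the kind described can be sketched as: ask the model for a small Python program, execute it, and read off the answer. ask_llm is a hypothetical client, and running untrusted generated code would of course need sandboxing in practice.

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")

def solve(problem: str):
    code = ask_llm(
        "Translate the procedure in this problem into a Python function "
        f"named `answer` that returns the result:\n{problem}"
    )
    scope = {}
    exec(code, scope)        # run the generated program
    return scope["answer"]()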

Evaluating Evaluation Metrics for Ancient Chinese to English Machine Translation
Eric R. Bennett | HyoJung Han | Xinchen Yang | Andrew Schonebaum | Marine Carpuat

Evaluation metrics are an important driver of progress in Machine Translation (MT), but they have been primarily validated on high-resource modern languages. In this paper, we conduct an empirical evaluation of metrics commonly used to evaluate MT from Ancient Chinese into English. Using LLMs, we construct a contrastive test set, pairing high-quality MT and purposefully flawed MT of the same Pre-Qin texts. We then evaluate the ability of each metric to discriminate between accurate and flawed translations.

From Clay to Code: Transforming Hittite Texts for Machine Learning
Emma Yavasan | Shai Gordin

This paper presents a comprehensive methodology for transforming XML-encoded Hittite cuneiform texts into computationally accessible formats for machine learning applications. Drawing from a corpus of 8,898 texts (558,349 tokens in total) encompassing 145 cataloged genres and compositions, we develop a structured approach that preserves both linguistic and philological annotations while enabling computational analysis. Our methodology addresses key challenges in ancient language processing, including the handling of fragmentary texts, multiple language layers, and complex annotation systems. We demonstrate the application of our corpus through experiments with T5 models, achieving significant improvements in Hittite-to-German translation (ROUGE-1: 0.895) while identifying limitations in morphological glossing tasks. This work establishes a standardized, machine-readable dataset of Hittite cuneiform that balances state-of-the-art machine readability with philological accuracy.

Towards Ancient Meroitic Decipherment: A Computational Approach
Joshua N. Otten | Antonios Anastasopoulos

The discovery of the Rosetta Stone was one of the keys that helped unlock the secrets of Ancient Egypt and its hieroglyphic language. But what about languages with no such "Rosetta Stone"? Meroitic is an ancient language from what is now present-day Sudan, but even though it is connected to Egyptian in many ways, much of its grammar and vocabulary remains undeciphered. In this work, we introduce the challenge of Meroitic decipherment as a computational task, and present the first Meroitic machine-readable corpus. We then train embeddings and perform intrinsic evaluations, as well as cross-lingual alignment experiments between Meroitic and Late Egyptian. We conclude by outlining open problems and potential research directions.

Neural Models for Lemmatization and POS-Tagging of Earlier and Late Egyptian (Supporting Hieroglyphic Input) and Demotic
Aleksi Sahala | Eliese-Sophia Lincke

We present updated models for BabyLemmatizer for lemmatizing and POS-tagging Demotic, Late Egyptian, and Earlier Egyptian, with support for using hieroglyphs as input. In this paper, we also use data that has not been cleaned of breakages. We achieve a consistent UPOS tagging accuracy of 94% or higher and an XPOS tagging accuracy of 93% or higher for all languages. For lemmatization, which is challenging in all of our test languages due to extensive ambiguity, we demonstrate accuracies from 77% up to 92%, depending on the language and the input script.

Bringing Suzhou Numerals into the Digital Age: A Dataset and Recognition Study on Ancient Chinese Trade Records
Ting-Lin Wu | Zih-Ching Chen | Chen-Yuan Chen | Pi-Jhong Chen | Li-Chiao Wang

Suzhou numerals, a specialized numerical notation system historically used in Chinese commerce and accounting, played a pivotal role in financial transactions from the Song Dynasty to the early 20th century. Despite their historical significance, they remain largely absent from modern OCR benchmarks, limiting computational access to archival trade documents. This paper presents a curated dataset of 773 expert-annotated Suzhou numeral samples extracted from late Qing-era trade ledgers. We provide a statistical analysis of character distributions, offering insights into their real-world usage in historical bookkeeping. Additionally, we evaluate baseline performance with a handwritten text recognition (HTR) model, highlighting the challenges of recognizing low-resource brush-written numerals. By introducing this dataset and initial benchmark results, we aim to facilitate research in historical documentation in ancient Chinese characters, advancing the digitization of early Chinese financial records. The dataset is publicly available on our Hugging Face hub, and our codebase can be accessed in our GitHub repository.

Detecting Honkadori based on Waka Embeddings
Hayato Ogawa | Kaito Horio | Daisuke Kawahara

We develop an embedding model specifically designed for Waka poetry and use it to build a model for detecting Honkadori. Waka is a traditional form of old Japanese poetry that has been composed since ancient times. Honkadori is a sophisticated poetic technique in Japanese classical literature whereby poets incorporate words or poetic sentiments from old Wakas (Honka) into their own work. First, we fine-tune a pre-trained language model using contrastive learning to construct a Waka-specialized embedding model. Then, using the embedding vectors obtained from this model and features extracted from them, we train a machine learning model to detect the Honka (original poem) of Wakas that employ the Honkadori technique. Using paired data of Honka and Wakas that are considered to use Honkadori, we evaluated the Honka detection model and demonstrated that it can detect Honka with reasonable accuracy.

The Historian’s Fingerprint: A Computational Stylometric Study of the Zuo Commentary and Discourses of the States
Wenjie Hua

Previous studies suggest that authorship can be inferred through stylistic features like function word usage and grammatical patterns, yet such analyses remain limited for Old Chinese texts with disputed authorship. Computational methods enable a more nuanced exploration of these texts. This study applies stylometric analysis to examine the authorship controversy between the Zuo Commentary and the Discourses of the States. Using PoS 4-grams, Kullback-Leibler divergence, and multidimensional scaling (MDS), we systematically compare their stylistic profiles. Results show that the Zuo Commentary exhibits high internal consistency, especially in the later eight Dukes chapters, supporting its integration by a single scholarly tradition. In contrast, the Discourses of the States displays greater stylistic diversity, aligning with the multiple-source compilation theory. Further analysis reveals partial stylistic similarities among the Lu, Jin, and Chu-related chapters, suggesting shared influences. These findings provide quantitative support for Tong Shuye's arguments and extend statistical validation of Bernhard Karlgren's assertion on the textual unity of the Zuo Commentary.
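
The core comparison, Kullback-Leibler divergence between smoothed PoS 4-gram distributions, can be stated directly in code; the toy tag sequences and the smoothing constant are assumptions.

import math
from collections import Counter

def pos_ngrams(tags, n=4):
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def kl_divergence(p_counts, q_counts, alpha=1e-6):
    # Additive smoothing keeps q > 0 so the divergence stays finite.
    vocab = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + alpha * len(vocab)
    q_total = sum(q_counts.values()) + alpha * len(vocab)
    kl = 0.0
    for g in vocab:
        p = (p_counts[g] + alpha) / p_total
        q = (q_counts[g] + alpha) / q_total
        kl += p * math.log(p / q)
    return kl

zuo = ["N", "V", "N", "P", "N", "V", "N", "C", "V", "N"]
guoyu = ["N", "V", "P", "N", "N", "V", "C", "N", "V", "N"]
print(kl_divergence(pos_ngrams(zuo), pos_ngrams(guoyu)))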

Incorporating Lexicon-Aligned Prompting in Large Language Model for Tangut–Chinese Translation
Yuxi Zheng | Jingsong Yu

This paper proposes a machine translation approach for Tangut–Chinese using a large language model (LLM) enhanced with lexical knowledge. We fine-tune a Qwen-based LLM using Tangut–Chinese parallel corpora and dictionary definitions. Experimental results demonstrate that incorporating single-character dictionary definitions leads to the best BLEU-4 score of 72.33 for literal translation. Additionally, applying a chain-of-thought prompting strategy significantly boosts free translation performance to 64.20. The model also exhibits strong few-shot learning abilities, with performance improving as the training dataset size increases. Our approach effectively translates both simple and complex Tangut sentences, offering a robust solution for low-resource language translation and contributing to the digital preservation of Tangut texts.

ParsiPy: NLP Toolkit for Historical Persian Texts in Python
Farhan Farsi | Parnian Fazel | Sepand Haghighi | Sadra Sabouri | Farzaneh Goshtasb | Nadia Hajipour | Ehsaneddin Asgari | Hossein Sameti

The study of historical languages presents unique challenges due to their complex orthographic systems, fragmentary textual evidence, and the absence of standardized digital representations of text in those languages. Tackling these challenges requires special NLP digital tools to handle phonetic transcriptions and analyze ancient texts. This work introduces ParsiPy, an NLP toolkit designed to facilitate the analysis of historical Persian languages by offering modules for tokenization, lemmatization, part-of-speech tagging, phoneme-to-transliteration conversion, and word embedding. We demonstrate the utility of our toolkit through the processing of Parsig (Middle Persian) texts, highlighting its potential for expanding computational methods in the study of historical languages. Through this work, we contribute to the field of computational philology, offering tools that can be adapted for the broader study of ancient texts and their digital preservation.

Exploring the Application of 7B LLMs for Named Entity Recognition in Chinese Ancient Texts
Chenrui Zheng | Yicheng Zhu | Han Bi

This paper explores the application of fine-tuning methods based on 7B large language models (LLMs) for named entity recognition (NER) tasks in Chinese ancient texts. Targeting the complex semantics and domain-specific characteristics of ancient texts, particularly in Traditional Chinese Medicine (TCM) texts, we propose a comprehensive fine-tuning and pre-training strategy. By introducing multi-task learning, domain-specific pre-training, and efficient fine-tuning techniques based on LoRA, we achieved significant performance improvements in ancient text NER tasks. Experimental results show that the pre-trained and fine-tuned 7B model achieved an F1 score of 0.93, significantly outperforming general-purpose large language models.
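
LoRA-style efficient fine-tuning for token classification takes only a few lines with the peft library; the backbone checkpoint, rank, and target modules below are illustrative assumptions, not the paper's configuration.

from transformers import AutoModelForTokenClassification
from peft import LoraConfig, TaskType, get_peft_model

model = AutoModelForTokenClassification.from_pretrained(
    "hfl/chinese-roberta-wwm-ext", num_labels=13)
lora = LoraConfig(task_type=TaskType.TOKEN_CLS, r=8, lora_alpha=16,
                  target_modules=["query", "value"])
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only the low-rank adapters train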

Overview of EvaHan2025: The First International Evaluation on Ancient Chinese Named Entity Recognition
Bin Li | Bolin Chang | Ruilin Liu | Xue Zhao | Si Shen | Lihong Liu | Yan Zhu | Zhixing Xu | Weiguang Qu | Dongbo Wang

Ancient Chinese books have great value for history and cultural studies. Named entities like persons, locations, and times are crucial elements, so automatic Named Entity Recognition (NER) is considered a basic task in ancient Chinese text processing. This paper introduces EvaHan2025, the first international ancient Chinese Named Entity Recognition bake-off. The evaluation introduces a rigorous benchmark for assessing NER performance across historical and medical texts, covering 12 named entity types. A total of 13 teams participated in the competition, submitting 77 system runs. In the closed modality, where participants were restricted to using only the training data, the highest F1 scores reached 85.04% on TestA and 90.28% on TestB, both derived from historical texts, while performance on medical texts (TestC) reached 84.49%. The results indicate that text genre significantly impacts model performance, with historical texts generally yielding higher scores. Additionally, the intrinsic characteristics of named entities also influence recognition performance. These findings highlight the challenges and opportunities in ancient Chinese NER and underscore the importance of domain adaptation and entity type diversity in future research.

Construction of NER Model in Ancient Chinese: Solution of EvaHan 2025 Challenge
Yi Lu | Minyi Lei

This paper introduces the system submitted for EvaHan 2025, focusing on the Named Entity Recognition (NER) task for ancient Chinese texts. Our solution is built upon two specified pre-trained BERT models, namely GujiRoBERTa_jian_fan and GujiRoBERTa_fan, and further enhanced by a deep BiLSTM network with a Conditional Random Field (CRF) decoding layer. Extensive experiments on three test dataset splits demonstrate that our system's performance, 84.58% F1 in the closed-modality track and 82.78% F1 in the open-modality track, significantly outperforms the official baseline, achieving notable improvements in F1 score.
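
The BiLSTM-plus-CRF head that sits on top of the encoder can be sketched with the pytorch-crf package; dimensions and the tag count are assumptions, and random tensors stand in for GujiRoBERTa outputs.

import torch
import torch.nn as nn
from torchcrf import CRF

class BiLstmCrfHead(nn.Module):
    def __init__(self, hidden=768, lstm_hidden=256, num_tags=13):
        super().__init__()
        self.lstm = nn.LSTM(hidden, lstm_hidden, bidirectional=True,
                            batch_first=True)
        self.proj = nn.Linear(2 * lstm_hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, encoder_states, tags=None):
        emissions = self.proj(self.lstm(encoder_states)[0])
        if tags is not None:
            return -self.crf(emissions, tags)  # negative log-likelihood
        return self.crf.decode(emissions)      # best tag path per sentence

head = BiLstmCrfHead()
print(head(torch.randn(2, 10, 768)))  # two decoded tag sequences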

LLM’s Weakness in NER Doesn’t Stop It from Enhancing a Stronger SLM
Weilu Xu | Renfei Dang | Shujian Huang

Large Language Models (LLMs) demonstrate strong semantic understanding and extensive knowledge, but struggle with Named Entity Recognition (NER) due to hallucination and high training costs. Meanwhile, supervised Small Language Models (SLMs) efficiently provide structured predictions but lack adaptability to unseen entities and complex contexts. In this study, we investigate how a relatively weaker LLM can effectively support a supervised model in NER tasks. We first improve the LLM using LoRA-based fine-tuning and similarity-based prompting, achieving performance comparable to an SLM baseline. To further improve results, we propose a fusion strategy that integrates both models: prioritizing the SLM's predictions while using LLM guidance in low-confidence cases. Our hybrid approach outperforms both baselines on three classic Chinese NER datasets.
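
The fusion rule is simple enough to state as code: keep the SLM's label wherever its confidence is high, and defer to the LLM otherwise. Everything here (names, threshold) is a hypothetical sketch of that strategy, not the paper's implementation.

CONF_THRESHOLD = 0.7  # assumed value; the paper does not publish a threshold here

def fuse(tokens, slm_predict, llm_predict):
    slm_tags, confidences = slm_predict(tokens)
    fused = []
    for i, (tag, conf) in enumerate(zip(slm_tags, confidences)):
        if conf >= CONF_THRESHOLD:
            fused.append(tag)                     # trust the supervised model
        else:
            fused.append(llm_predict(tokens, i))  # low confidence: ask the LLM
    return fused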

Named Entity Recognition in Context: Edit_Dunhuang team Technical Report for Evahan2025 NER Competition
Colin Brisson | Ayoub Kahfy | Marc Bui | Frédéric Constant

We present the Named Entity Recognition system developed by the Edit Dunhuang team for the EvaHan2025 competition. Our approach integrates three core components: (1) Pindola, a modern transformer-based bidirectional encoder pretrained on a large corpus of Classical Chinese texts; (2) a retrieval module that fetches relevant external context for each target sequence; and (3) a generative reasoning step that summarizes retrieved context in Classical Chinese for more robust entity disambiguation. Using this approach, we achieve an average F1 score of 85.58, improving upon the competition baseline by nearly 5 points.

Make Good Use of GujiRoBERTa to Identify Entities in Ancient Chinese
Lihan Lin | Yiming Wang | Jiachen Li | Huan Ouyang | Si Li

This report describes our model submitted for the EvaHan 2025 shared task on named entity recognition for ancient Chinese literary works. Since we participated in the closed-modality task, our method is based on the appointed pre-trained language model GujiRoBERTa_jian_fan, and we used the appointed datasets. We carried out experiments on decoding strategies and schedulers to verify the effect of our method. In the final test, our method outperformed the official baseline, demonstrating its effectiveness. Finally, the report analyzes the results from the perspective of data composition.

GRoWE: A GujiRoBERTa-Enhanced Approach to Ancient Chinese NER via Word-Word Relation Classification and Model Ensembling
Tian Xia | Yilin Wang | Xinkai Wang | Yahe Yang | Qun Zhao | Menghui Yang

Named entity recognition is a fundamental task in ancient Chinese text analysis. Based on a pre-trained language model for ancient Chinese texts, this paper proposes a new named entity recognition method, GRoWE. It uses the ancient Chinese pre-trained language model GujiRoBERTa as the base model, and a word-word relation prediction model is superposed upon the base model to construct a superposition model. Ensemble strategies are then applied to multiple superposition models. On the EvaHan 2025 public test set, the F1 value of the proposed method reaches 86.79%, which is 6.18% higher than that of the mainstream BERT_LSTM_CRF baseline model, indicating that the model architecture and ensemble strategy play an important role in improving the recognition of named entities in ancient Chinese texts.

When Less Is More: Logits-Constrained Framework with RoBERTa for Ancient Chinese NER
Wenjie Hua | Shenghan Xu

This report presents our team's work on ancient Chinese Named Entity Recognition (NER) for EvaHan 2025. We propose a two-stage framework combining GujiRoBERTa with a Logits-Constrained (LC) mechanism. The first stage generates contextual embeddings using GujiRoBERTa, followed by dynamically masked decoding to enforce valid BMES transitions. Experiments on EvaHan 2025 datasets demonstrate the framework's effectiveness. Key findings include the LC framework's superiority over CRFs in high-label scenarios and the detrimental effect of BiLSTM modules. We also establish empirical model selection guidelines based on label complexity and dataset size.
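
Dynamically masked BMES decoding can be sketched greedily: at each position, logits for tags that cannot legally follow the previous tag are set to minus infinity. The sketch simplifies to a single entity type and greedy decoding; a full system would also constrain the final position and handle per-type labels.

import torch

TAGS = ["B", "M", "E", "S"]
# ALLOWED[i][j] is True when tag j may follow tag i under the BMES scheme.
ALLOWED = torch.tensor([
    # B  M  E  S
    [0, 1, 1, 0],  # after B: continue inside or end the entity
    [0, 1, 1, 0],  # after M
    [1, 0, 0, 1],  # after E: start a new entity or a singleton
    [1, 0, 0, 1],  # after S
], dtype=torch.bool)

def constrained_decode(logits):        # logits: (seq_len, 4)
    first = logits[0].clone()
    first[[1, 2]] = float("-inf")      # a sequence cannot open with M or E
    path = [int(first.argmax())]
    for t in range(1, logits.size(0)):
        step = logits[t].masked_fill(~ALLOWED[path[-1]], float("-inf"))
        path.append(int(step.argmax()))
    return [TAGS[i] for i in path]

print(constrained_decode(torch.randn(6, 4)))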

Lemmatization of Cuneiform Languages Using the ByT5 Model
Pengxiu Lu | Yonglong Huang | Jing Xu | Minxuan Feng | Chao Xu

Lemmatization of cuneiform languages presents a unique challenge due to their complex writing system, which combines syllabic and logographic elements. In this study, we investigate the effectiveness of the ByT5 model in addressing this challenge by developing and evaluating a ByT5-based lemmatization system. Experimental results demonstrate that ByT5 outperforms mT5 in this task, achieving an accuracy of 80.55% on raw lemmas and 82.59% on generalized lemmas, where sense numbers are removed. These findings highlight the potential of ByT5 for lemmatizing cuneiform languages and provide useful insights for future work on ancient text lemmatization.
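
Because ByT5 operates on raw bytes, the lemmatization setup needs no custom vocabulary. A sketch of the inference step follows; the untuned base checkpoint, prompt prefix, and Akkadian example are illustrative, since the paper fine-tunes on its task data first.

from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

inputs = tokenizer("lemmatize: šarrātum", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(out[0], skip_special_tokens=True))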

Simple Named Entity Recognition (NER) System with RoBERTa for Ancient Chinese
Yunmeng Zhang | Meiling Liu | Hanqi Tang | Shige Lu | Lang Xue

Named Entity Recognition (NER) is a fundamental task in Natural Language Processing (NLP), particularly in the analysis of Chinese historical texts. In this work, we propose an innovative NER model based on GujiRoBERTa, incorporating Conditional Random Fields (CRF) and a Long Short-Term Memory (LSTM) network to enhance sequence labeling performance. Our model is evaluated on three datasets from the EvaHan2025 competition, demonstrating superior performance over the baseline model, SikuRoBERTa-BiLSTM-CRF. The proposed approach effectively captures contextual dependencies and improves entity boundary recognition. Experimental results show that our method achieves consistent improvements across almost all evaluation metrics, highlighting its robustness and effectiveness in handling ancient Chinese texts.

Multi-Strategy Named Entity Recognition System for Ancient Chinese
Wenxuan Dong | Meiling Liu

We present a multi-strategy Named Entity Recognition (NER) system for ancient Chinese texts in EvaHan2025. Addressing dataset heterogeneity, we use a Conditional Random Field (CRF) for Tasks A and C, to handle the complex dependencies of their six entity types, and a lightweight Softmax classifier for Task B's simpler three-entity tagset. Ablation studies on the training data confirm the CRF's superiority in capturing sequence dependencies and the Softmax classifier's computational advantage on simpler tasks. On blind tests, our system achieves F1-scores of 83.94%, 88.31%, and 82.15% for Tests A, B, and C, outperforming the baselines by 2.46%, 0.81%, and 9.75%. With an overall F1 improvement of 4.30%, it excels across historical and medical domains. This adaptability enhances knowledge extraction from ancient texts, offering a scalable NER framework for low-resource, complex languages.

pdf bib
Finetuning LLMs for EvaCun 2025 token prediction shared task
Josef Jon | Ondřej Bojar

In this paper, we present our submission for the token prediction task of EvaCun 2025. Our systems are based on LLMs (Command-R, Mistral, and Aya Expanse) fine-tuned on the task data provided by the organizers. As we only possess a very superficial knowledge of the subject field and the languages of the task, we simply used the training data without any task-specific adjustments, preprocessing, or filtering. We compare three different approaches (based on three different prompts) for obtaining the predictions, and we evaluate them on a held-out part of the data.

pdf bib
Beyond Base Predictors: Using LLMs to Resolve Ambiguities in Akkadian Lemmatization
Frederick Riemenschneider

We present a hybrid approach for Akkadian lemmatization in the EvaCun 2025 Shared Task that combines traditional NLP techniques with large language models (LLMs). Our system employs three Base Predictors (a dictionary lookup and two T5 models) to establish initial lemma candidates. For cases where these predictors disagree (18.72% of instances), we implement an LLM Resolution module, enhanced with direct access to the electronic Babylonian Library (eBL) dictionary entries. This module includes a Predictor component that generates initial lemma predictions based on dictionary information, and a Validator component that refines these predictions through contextual reasoning. Error analysis reveals that the system struggles most with small differences (like capitalization) and certain ambiguous logograms (like BI). Our work demonstrates the benefits of combining traditional NLP approaches with the reasoning capabilities of LLMs when provided with appropriate domain knowledge.

pdf bib
A Low-Shot Prompting Approach to Lemmatization in the EvaCun 2025 Shared Task
John Sbur | Brandi Wilkins | Elizabeth Paul | Yudong Liu

This study explores the use of low-shot prompting techniques for the lemmatization of ancient cuneiform languages using Large Language Models (LLMs). To structure the input data and systematically design effective prompt templates, we employed a hierarchical clustering approach based on Levenshtein distance. The prompt design followed established engineering patterns, incorporating instructional and response-guiding elements to enhance model comprehension. We employed the In-Context Learning (ICL) prompting strategy, selecting example words primarily based on lemma frequency, ensuring a balance between commonly occurring words and rare cases to improve generalization. During testing on the development set, prompts included structured examples and explicit formatting rules, with accuracy assessed by comparing model predictions to ground truth lemmas. The results showed that model performance varied significantly across different configurations, with accuracy reaching approximately 90% in the best case for in-vocabulary words and around 9% in the best case for out-of-vocabulary (OOV) words. Despite resource constraints and the lack of input from a language expert, our findings suggest that prompt engineering strategies hold promise for improving LLM performance in cuneiform language lemmatization.
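
Hierarchical clustering over Levenshtein distances, as used above for structuring prompt examples, can be sketched in a few lines; the word forms here are invented placeholders and the threshold is an assumption.

```python
# Sketch: cluster word forms by edit distance, e.g. to group similar forms
# when assembling few-shot prompt examples.
from itertools import combinations
from scipy.cluster.hierarchy import linkage, fcluster

def lev(a, b):
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

words = ["szarru", "szarrum", "szarratum", "ilum", "ilu"]  # hypothetical forms
# Condensed pairwise distance vector, as expected by scipy's linkage().
dists = [lev(a, b) for a, b in combinations(words, 2)]
clusters = fcluster(linkage(dists, method="average"), t=2, criterion="distance")
print(dict(zip(words, clusters)))
```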

pdf bib
Multi-Domain Ancient Chinese Named Entity Recognition Based on Attention-Enhanced Pre-trained Language Model
Qi Zhang | Zhiya Duan | Shijie Ma | Shengyu Liu | Zibo Yuan | RuiMin Ma

Recent advancements in digital humanities have intensified the demand for intelligent processing of ancient Chinese texts, particularly across specialized domains such as historical records and ancient medical literature. Among related research areas, Named Entity Recognition (NER) plays a crucial role, serving as the foundation for knowledge graph construction and deeper humanities computing studies. In this paper, we introduce an architecture specifically designed for multi-domain ancient Chinese NER tasks based on a pre-trained language model (PLM). Building upon the GujiRoberta backbone, we propose the GujiRoberta-BiLSTM-Attention-CRF model. Experimental results on three distinct domain-specific datasets demonstrate that our approach significantly outperforms the official baselines across all three datasets, highlighting the particular effectiveness of integrating an attention mechanism within our architecture.

pdf bib
EvaCun 2025 Shared Task: Lemmatization and Token Prediction in Akkadian and Sumerian using LLMs
Shai Gordin | Aleksi Sahala | Shahar Spencer | Stav Klein

The EvaCun 2025 Shared Task, organized as part of the ALP 2025 workshop and co-located with NAACL 2025, explores how Large Language Models (LLMs) and transformer-based models can be used to improve lemmatization and token prediction for low-resource ancient cuneiform texts. This year our datasets focused on the best-attested ancient Near Eastern languages written in cuneiform, namely Akkadian and Sumerian. We also took advantage of datasets never before used at scale in NLP tasks, primarily first-millennium literature (i.e. “Canonical”) provided by the Electronic Babylonian Library (eBL), and Old Babylonian letters and archival texts provided by Archibab. We aim to encourage the development of new computational methods to better analyze and reconstruct cuneiform inscriptions, pushing NLP forward for ancient and low-resource languages. Three teams competed in the lemmatization subtask and one in the token prediction subtask. Each subtask was evaluated alongside a baseline model provided by the organizers.

up

pdf (full)
bib (full)
Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)

pdf bib
Proceedings of the Fifth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)
Manuel Mager | Abteen Ebrahimi | Robert Pugh | Shruti Rijhwani | Katharina Von Der Wense | Luis Chiruzzo | Rolando Coto-Solano | Arturo Oncevay

pdf bib
Text-to-speech system for low-resource languages: A case study in Shipibo-Konibo (a Panoan language from Peru)
Daniel Menendez | Hector Gomez

This paper presents the design and development of a Text-to-Speech (TTS) model for Shipibo-Konibo, a low-resource indigenous language spoken mainly in the Peruvian Amazon. Despite the challenge posed by the scarcity of data, the model was trained with over 4 hours of recordings and 3,025 meticulously collected written sentences. The test results demonstrated an intelligibility rate (IR) exceeding 88% and a mean opinion score (MOS) of 4.01, confirming the quality of the audio generated by the model, which comprises the Tacotron 2 spectrogram predictor and the HiFi-GAN vocoder. Furthermore, the potential of this model to be trained on other indigenous languages spoken in Peru is highlighted, opening a promising avenue for the documentation and revitalization of these languages.

pdf bib
Does a code-switching dialogue system help users learn conversational fluency in Choctaw?
Jacqueline Brixey | David Traum

We investigate the learning outcomes and user response to a chatbot for practicing conversational Choctaw, an endangered American Indigenous language. Conversational fluency is a goal for many language learners; however, for learners of endangered languages in North America, access to fluent speakers may be limited. Chatbots are potentially ideal dialogue partners, as this kind of dialogue system fulfills a non-authoritative role by focusing on carrying on a conversation as an equal conversational partner. The goal of the chatbot investigated in this work is to serve as a conversational partner in the absence of a fluent Choctaw-speaking human interlocutor. We investigate the impact of code-switching in the interaction, comparing a bilingual chatbot against a monolingual Choctaw version. We evaluate the systems for user engagement and enjoyment, as well as gains in conversational fluency from interacting with the system.

pdf bib
A hybrid Approach to low-resource machine translation for Ojibwe verbs
Minh Nguyen | Christopher Hammerly | Miikka Silfverberg

Machine translation is a tool that can help teachers, learners, and users of low-resourced languages. However, there are significant challenges in developing these tools, such as the lack of large-scale parallel corpora and complex morphology. We propose a novel hybrid system that combines LLM and rule-based methods in two distinct stages to translate inflected Ojibwe verbs into English. We use an LLM to automatically annotate dictionary data to build translation templates. Then, our rule-based module performs translation using inflection and slot-filling processes built on top of an FST-based analyzer. We test the system with a set of automated tests. Thanks to the ahead-of-time nature of the template-building process and the lightweight rule-based translation module, the end-to-end translation process has an average translation speed of 70 milliseconds per word. The system achieved an average ChrF score of 0.82 and a semantic similarity score of 0.93 among the successfully translated verbs in a test set. The approach has the potential to be extended to other low-resource Indigenous languages with dictionary data.
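
The template-plus-slot-filling stage can be pictured as mapping an FST-style analysis onto an English template. The toy sketch below uses invented tags and glosses; the real system derives its templates from dictionary data and handles far richer Ojibwe morphology.

```python
# Toy slot-filling sketch: an analysis string (hypothetical tag format) fills
# an English template chosen from the verb's person and tense features.
SUBJECT = {"1Sg": "I", "2Sg": "you", "3Sg": "he/she"}
TENSE = {"Prs": "{verb}", "Pst": "{verb}ed"}  # naive regular morphology only

def translate(analysis, gloss):
    """analysis like 'bimose+VAI+Ind+Pst+1Sg' (invented for illustration)."""
    tags = analysis.split("+")
    person = next(t for t in tags if t in SUBJECT)
    tense = next(t for t in tags if t in TENSE)
    return f"{SUBJECT[person]} {TENSE[tense].format(verb=gloss)}"

print(translate("bimose+VAI+Ind+Pst+1Sg", "walk"))  # -> "I walked"
```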

pdf bib
Advancing Uto-Aztecan Language Technologies: A Case Study on the Endangered Comanche Language
Jesus Alvarez C | Daua Karajeanes | Ashley Prado | John Ruttan | Ivory Yang | Sean O’brien | Vasu Sharma | Kevin Zhu

The digital exclusion of endangered languages remains a critical challenge in NLP, limiting both linguistic research and revitalization efforts. This study introduces the first computational investigation of Comanche, an Uto-Aztecan language on the verge of extinction, demonstrating how minimal-cost, community-informed NLP interventions can support language preservation. We present a manually curated dataset of 412 phrases, a synthetic data generation pipeline, and an empirical evaluation of GPT-4o and GPT-4o-mini for language identification. Our experiments reveal that while LLMs struggle with Comanche in zero-shot settings, few-shot prompting significantly improves performance, achieving near-perfect accuracy with just five examples. Our findings highlight the potential of targeted NLP methodologies in low-resource contexts and emphasize that visibility is the first step toward inclusion. By establishing a foundation for Comanche in NLP, we advocate for computational approaches that prioritize accessibility, cultural sensitivity, and community engagement.

pdf bib
Py-Elotl: A Python NLP package for the languages of Mexico
Ximena Gutierrez-Vasques | Robert Pugh | Victor Mijangos | Diego Barriga Martínez | Paul Aguilar | Mikel Segura | Paola Innes | Javier Santillan | Cynthia Montaño | Francis Tyers

This work presents Py-elotl, a suite of tools and resources in Python for processing text in several indigenous languages spoken in Mexico. These resources include parallel corpora, linguistic taggers/analyzers, and orthographic normalization tools. This work aims to develop essential resources to support language pre-processing and linguistic research, and the future creation of more complete downstream applications that could be useful for the speakers and enhance the visibility of these languages. The current version supports language groups such as Nahuatl, Otomi, Mixtec, and Huave. This project is open-source and freely available for use and collaboration.

pdf bib
Analyzing and generating English phrases with finite-state methods to match and translate inflected Plains Cree word-forms
Antti Arppe

This paper presents two finite-state transducer tools, which can be used to analyze or generate simple English verb and noun phrases, that can be mapped with inflected Plains Cree (nêhiyawêwin) verb and noun forms. These tools support fetching an inflected Cree word-form directly with an appropriate plain English phrase, and conversely providing a rough translation of an inflected Cree word-form. Such functionalities can be used to improve the user friendliness of on-line dictionaries. The tools are extendable to other similarly morphologically complex languages.

pdf bib
Unsupervised, Semi-Supervised and LLM-Based Morphological Segmentation for Bribri
Carter Anderson | Mien Nguyen | Rolando Coto-Solano

Morphological segmentation is a major task in Indigenous language documentation. In this paper we (a) introduce a novel statistical algorithm called Morphemo to split words into their constituent morphemes. We also (b) study how large language models perform on this task. We use these tools to analyze Bribri, an under-resourced Indigenous language from Costa Rica. Morphemo has better performance than the LLMs when splitting multimorphemic words, mainly because the LLMs are more conservative, which also gives them an advantage when splitting monomorphemic words. In future work we will use these tools to tag Bribri language corpora, which currently lack morphological segmentation.

pdf bib
FUSE : A Ridge and Random Forest-Based Metric for Evaluating MT in Indigenous Languages
Rahul Raja | Arpita Vats

This paper presents the winning submission of the RaaVa team to the AmericasNLP 2025 Shared Task 3 on Automatic Evaluation Metrics for Machine Translation (MT) into Indigenous Languages of America, where our system ranked first overall based on average Pearson correlation with the human annotations. We introduce the Feature-Union Scorer (FUSE) for evaluation; FUSE integrates Ridge regression and Gradient Boosting to model translation quality. In addition to FUSE, we explore five alternative approaches leveraging different combinations of linguistic similarity features and learning paradigms. FUSE highlights the effectiveness of combining lexical, phonetic, semantic, and fuzzy token similarity with learning-based modeling to improve MT evaluation for morphologically rich and low-resource languages. MT into Indigenous languages poses unique challenges due to polysynthesis, complex morphology, and non-standardized orthography. Conventional automatic metrics such as BLEU, TER, and ChrF often fail to capture deeper aspects like semantic adequacy and fluency. Our proposed framework, FUSE, incorporates multilingual sentence embeddings and phonological encodings to better align with human evaluation. We train supervised models on human-annotated development sets and evaluate on held-out test data. Results show that FUSE consistently achieves higher Pearson and Spearman correlations with human judgments, offering a robust and linguistically informed solution for MT evaluation in low-resource settings.
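
The feature-union idea can be illustrated in miniature: hand-crafted similarity features feed a learned regressor that predicts human quality scores. The features, data, and model choice below are simplified assumptions, not the paper's exact configuration.

```python
# Rough sketch of feature-union MT evaluation: similarity features -> regressor.
import numpy as np
from difflib import SequenceMatcher
from sklearn.linear_model import Ridge

def features(hyp, ref):
    h, r = set(hyp.split()), set(ref.split())
    lexical = len(h & r) / max(len(h | r), 1)        # token Jaccard overlap
    fuzzy = SequenceMatcher(None, hyp, ref).ratio()  # character-level fuzziness
    return [lexical, fuzzy]

# Hypothetical dev data: (hypothesis, reference, human score in [0, 1]).
data = [("the cat sat", "the cat sat", 1.0), ("a dog ran", "the cat sat", 0.1)]
X = np.array([features(h, r) for h, r, _ in data])
y = np.array([s for _, _, s in data])
model = Ridge().fit(X, y)
print(model.predict(X))
```

The actual FUSE system adds phonetic and embedding-based features and combines Ridge with tree-based learners.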

pdf bib
UCSP Submission to the AmericasNLP 2025 Shared Task
Jorge Asillo Congora | Julio Santisteban | Ricardo Lazo Vasquez

Quechua is a low-resource language spoken by more than 7 million people in South America. While Quechua is primarily an oral language, several orthographic standards exist; there is no universally adopted writing standard, and variations exist across dialects and regions, with current writing based on how words are uttered and how the sounds are transcribed. Quechua is a family of languages with similarities among its seven variants. The lack of a parallel dataset has reduced the opportunities for developing machine translation. We investigated whether extending the current Quechua parallel dataset with synthetic sentences and using a pre-trained large language model improves the performance of Quechua machine translation. A large language model was used to generate synthetic sentences to extend the current parallel dataset. We fine-tuned the mT5 model to develop machine translation from Quechua to Spanish and vice versa. Our survey identified the gaps in the state of the art of Quechua machine translation, and our BLEU/ChrF++ results show an improvement over the state of the art.

pdf bib
Machine Translation Using Grammar Materials for LLM Post-Correction
Jonathan Hus | Antonios Anastasopoulos | Nathaniel Krasner

This paper describes George Mason University’s submission to the AmericasNLP 2025 Shared Task on Machine Translation into Indigenous Languages. We prompt a large language model (LLM) with grammar reference materials to correct the translations produced by a finetuned Encoder-Decoder machine translation system. This system leads to improvements when translating from the indigenous languages into Spanish indicating that LLMs are capable of using grammar materials to decipher an unseen language.

pdf bib
Machine Translation Metrics for Indigenous Languages Using Fine-tuned Semantic Embeddings
Nathaniel Krasner | Justin Vasselli | Belu Ticona | Antonios Anastasopoulos | Chi-Kiu Lo

This paper describes the Tekio submission to the AmericasNLP 2025 shared task on machine translation metrics for Indigenous languages. We developed two primary metric approaches leveraging multilingual semantic embeddings. First, we fine-tuned the Language-agnostic BERT Sentence Encoder (LaBSE) specifically for Guarani, Bribri, and Nahuatl, significantly enhancing semantic representation quality. Next, we integrated our fine-tuned LaBSE into the semantic similarity metric YiSi-1, exploring the effectiveness of averaging multiple layers. Additionally, we trained regression-based COMET metrics (COMET-DA) using the fine-tuned LaBSE embeddings as a semantic backbone, comparing Mean Absolute Error (MAE) and Mean Squared Error (MSE) loss functions. Our YiSi-1 metric, using layer-averaged embeddings selected per language for the best development-set performance, achieved the highest average correlation across languages among our submitted systems, and our COMET models demonstrated competitive performance for Guarani.
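
A YiSi-1-style similarity from layer-averaged encoder states might look like the sketch below; the layer choice and mean pooling are assumptions, and the submission fine-tunes LaBSE before this step.

```python
# Sketch: layer-averaged LaBSE embeddings scored by cosine similarity.
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
model = AutoModel.from_pretrained("sentence-transformers/LaBSE")

def embed(text, layers=(9, 10, 11, 12)):
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        states = model(**enc, output_hidden_states=True).hidden_states
    # Mean-pool tokens within each chosen layer, then average across layers.
    pooled = torch.stack([states[l].mean(dim=1).squeeze(0) for l in layers])
    return pooled.mean(dim=0)

a, b = embed("translation hypothesis"), embed("reference sentence")
print(f"semantic similarity: {torch.cosine_similarity(a, b, dim=0).item():.3f}")
```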

pdf bib
JHU’s Submission to the AmericasNLP 2025 Shared Task on the Creation of Educational Materials for Indigenous Languages
Tom Lupicki | Lavanya Shankar | Kaavya Chaparala | David Yarowsky

This paper presents JHU’s submission to the AmericasNLP shared task on the creation of educational materials for Indigenous languages. The task involves transforming a base sentence given one or more tags that correspond to grammatical features, such as negation or tense. The task also spans four languages: Bribri, Maya, Guaraní, and Nahuatl. We experiment with augmenting prompts to large language models with different information, chain of thought prompting, ensembling large language models by majority voting, and training a pointer-generator network. Our System 1, an ensemble of large language models, achieves the best performance on Maya and Guaraní, building upon the previous successes in leveraging large language models for this task and highlighting the effectiveness of ensembling large language models.

pdf bib
Leveraging Dictionaries and Grammar Rules for the Creation of Educational Materials for Indigenous Languages
Justin Vasselli | Haruki Sakajo | Arturo Martínez Peguero | Frederikus Hudi | Taro Watanabe

This paper describes the NAIST submission to the AmericasNLP 2025 shared task on the creation of educational materials for Indigenous languages. We implement three systems to tackle the unique challenges of each language. The first system, used for Maya and Guarani, employs a straightforward GPT-4o few-shot prompting technique, enhanced by synthetically generated examples to ensure coverage of all grammatical variations encountered. The second system, used for Bribri, integrates dictionary-based alignment and linguistic rules to systematically manage linguistic and lexical transformations. Finally, we developed a specialized rule-based system for Nahuatl that systematically reduces sentences to their base form, simplifying the generation of correct morphology variants.

pdf bib
Harnessing NLP for Indigenous Language Education: Fine-Tuning Large Language Models for Sentence Transformation
Mahshar Yahan | Dr. Mohammad Islam

Indigenous languages face significant challenges due to their endangered status and limited resources, which makes their integration into NLP systems difficult. This study investigates the use of Large Language Models (LLMs) for sentence transformation tasks in Indigenous languages, focusing on Bribri, Guarani, and Maya. Here, the dataset from the AmericasNLP 2025 Shared Task 2 is used to explore sentence transformations in Indigenous languages. The goal is to create educational tools by modifying sentences based on linguistic instructions, such as changes in tense, aspect, voice, person, and other grammatical features. The methodology involves preprocessing data, simplifying transformation tags, and designing zero-shot and few-shot prompts to guide LLMs in sentence rewriting. Fine-tuning techniques like LoRA and Bits-and-Bytes quantization were employed to optimize model performance while reducing computational costs. Among the tested models, Llama 3.2 (3B-Instruct) demonstrated superior performance across all languages with high BLEU and ChrF++ scores, particularly excelling in few-shot settings. The Llama 3.2 model achieved BLEU scores of 19.51 for Bribri, 13.67 for Guarani, and 55.86 for Maya in test settings. Additionally, ChrF++ scores reached 50.29 for Bribri, 58.55 for Guarani, and 80.12 for Maya, showcasing its effectiveness in handling sentence transformation. These results highlight the potential of LLMs to improve NLP tools for Indigenous languages and help preserve linguistic diversity.
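
The LoRA-plus-quantization recipe mentioned above follows a now-standard pattern; a hedged sketch is below, where the model id, rank, and target modules are placeholders rather than the paper's exact configuration.

```python
# Sketch: 4-bit quantized base model with LoRA adapters via peft/bitsandbytes.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True)  # Bits-and-Bytes quantization
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B-Instruct",  # placeholder model id
    quantization_config=bnb,
)
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# Training would then proceed with a standard supervised fine-tuning loop over
# (transformation instruction, rewritten sentence) pairs.
```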

pdf bib
Leveraging Large Language Models for Spanish-Indigenous Language Machine Translation at AmericasNLP 2025
Mahshar Yahan | Dr. Mohammad Islam

This paper presents our approach to machine translation between Spanish and 13 Indigenous languages of the Americas as part of the AmericasNLP 2025 shared task. Addressing the challenges of low-resource translation, we fine-tuned advanced multilingual models, including NLLB-200 (Distilled-600M), Llama 3.1 (8B-Instruct) and XGLM 1.7B, using techniques such as dynamic batching, token adjustments, and embedding initialization. Data preprocessing steps like punctuation removal and tokenization refinements were employed to improve generalization. While our models demonstrated strong performance for Awajun and Quechua translations, they struggled with morphologically complex languages like Nahuatl and Otomí. Our approach achieved competitive ChrF++ scores for Awajun (35.16) and Quechua (31.01) in the Spanish-to-Indigenous translation track (Es→Xx). Similarly, in the Indigenous-to-Spanish track (Xx→Es), we obtained ChrF++ scores of 33.70 for Awajun and 31.71 for Quechua. These results underscore the potential of tailored methodologies in preserving linguistic diversity while advancing machine translation for endangered languages.

pdf bib
Findings of the AmericasNLP 2025 Shared Tasks on Machine Translation, Creation of Educational Material, and Translation Metrics for Indigenous Languages of the Americas
Ona De Gibert | Robert Pugh | Ali Marashian | Raul Vazquez | Abteen Ebrahimi | Pavel Denisov | Enora Rice | Edward Gow-Smith | Juan Prieto | Melissa Robles | Rubén Manrique | Oscar Moreno | Angel Lino | Rolando Coto-Solano | Aldo Alvarez | Marvin Agüero-Torales | John E. Ortega | Luis Chiruzzo | Arturo Oncevay | Shruti Rijhwani | Katharina Von Der Wense | Manuel Mager

This paper presents the findings of the AmericasNLP 2025 Shared Tasks: (1) machine translation for truly low-resource languages, (2) morphological adaptation for generating educational examples, and (3) developing metrics for machine translation in Indigenous languages. The shared tasks cover 14 diverse Indigenous languages of the Americas. A total of 11 teams participated, submitting 26 systems across all tasks, languages, and models. We describe the shared tasks, introduce the datasets and evaluation metrics used, summarize the baselines and submitted systems, and report our findings.

up

pdf (full)
bib (full)
Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC)

pdf bib
Proceedings of the 18th Workshop on Building and Using Comparable Corpora (BUCC)
Serge Sharoff | Ayla Rigouts Terryn | Pierre Zweigenbaum | Reinhard Rapp

pdf bib
Bilingual resources for Moroccan Sign Language Generation and Standard Arabic Skills Improvement of Deaf Children
Abdelhadi Soudi | Corinne Vinopol | Kristof Van Laerhoven

This paper presents a set of bilingual Standard Arabic (SA)-Moroccan Sign Language (MSL) tools and resources to improve Moroccan Deaf children’s SA skills. An MSL Generator based on rule-based machine translation (MT) is described that enables users and educators of Deaf children, in particular, to enter Arabic text and generate its corresponding MSL translation in both graphic and video format. The generated graphics can be printed and imported into an Arabic reading passage. We have also developed MSL Clip and Create software that includes a bilingual database of 3,000 MSL signs and SA words, a Publisher for the incorporation of MSL graphic support into SA reading passages, and six Templates that create customized bilingual crossword puzzles, word searches, Bingo cards, matching games, flashcards, and fingerspelling scrambles. A crowdsourcing platform for MSL data collection is also described. A major social benefit of developing these resources relates to equity and the status of deaf people in Moroccan society. More appropriate resources for the bilingual education of Deaf children (in MSL and SA) will lead to improved quality of educational services.

pdf bib
Harmonizing Annotation of Turkic Postverbial Constructions: A Comparative Study of UD Treebanks
Arofat Akhundjanova

As the number of treebanks within the same language family continues to grow, the importance of establishing consistent annotation practices has become increasingly evident. In this paper, we evaluate various approaches to annotating Turkic postverbial constructions across UD treebanks. Our comparative analysis reveals that none of the existing methods fully capture the unique semantic and syntactic characteristics of these complex constructions. This underscores the need to adopt a balanced approach that can achieve broad consensus and be implemented consistently across Turkic treebanks. By examining the phenomenon and the available annotation strategies, our study aims to improve the consistency of Turkic UD treebanks and enhance their utility for cross-linguistic research.

pdf bib
Towards Truly Open, Language-Specific, Safe, Factual, and Specialized Large Language Models
Preslav Nakov

First, we will argue for the need for fully transparent open-source large language models (LLMs), and we will describe the efforts of MBZUAI’s Institute on Foundation Models (IFM) towards that based on the LLM360 initiative. Second, we will argue for the need for language-specific LLMs, and we will share our experience from building Jais, the world’s leading open Arabic-centric foundation and instruction-tuned large language model, Nanda, our recently released open Hindi LLM, and some other models. Third, we will argue for the need for safe LLMs, and we will present Do-Not-Answer, a dataset for evaluating the guardrails of LLMs, which is at the core of the safety mechanisms of our LLMs. Fourth, we will argue for the need for factual LLMs, and we will discuss the factuality challenges that LLMs pose. We will then present some recent relevant tools for addressing these challenges developed at MBZUAI: (i) OpenFactCheck, a framework for fact-checking LLM output, for building customized fact-checking systems, and for benchmarking LLMs for factuality, (ii) LM-Polygraph, a tool for predicting an LLM’s uncertainty in its output using cheap and fast uncertainty quantification techniques, and (iii) LLM-DetectAIve, a tool for machine-generated text detection. Finally, we will argue for the need for specialized models, and we will present the zoo of LLMs currently being developed at MBZUAI’s IFM.

pdf bib
Make Satire Boring Again: Reducing Stylistic Bias of Satirical Corpus by Utilizing Generative LLMs
Asli Umay Ozturk | Recep Firat Cekinel | Pinar Karagoz

Satire detection is essential for accurately extracting opinions from textual data and combating misinformation online. However, the lack of diverse corpora for satire leads to the problem of stylistic bias, which impacts models’ detection performance. This study proposes a debiasing approach for satire detection, focusing on reducing biases in training data by utilizing generative large language models. The approach is evaluated in both cross-domain (irony detection) and cross-lingual (English) settings. Results show that the debiasing method enhances the robustness and generalizability of the models for satire and irony detection tasks in Turkish and English. However, its impact on causal language models, such as Llama-3.1, is limited. Additionally, this work curates and presents the Turkish Satirical News Dataset with detailed human annotations, along with case studies on classification, debiasing, and explainability.

pdf bib
BEIR-NL: Zero-shot Information Retrieval Benchmark for the Dutch Language
Ehsan Lotfi | Nikolay Banar | Walter Daelemans

Zero-shot evaluation of information retrieval (IR) models is often performed using BEIR, a large and heterogeneous benchmark composed of multiple datasets, covering different retrieval tasks across various domains. Although BEIR has become a standard benchmark for the zero-shot setup, its exclusively English content reduces its utility for underrepresented languages in IR, including Dutch. To address this limitation and encourage the development of Dutch IR models, we introduce BEIR-NL by automatically translating the publicly accessible BEIR datasets into Dutch. Using BEIR-NL, we evaluated a wide range of multilingual dense ranking and reranking models, as well as the lexical BM25 method. Our experiments show that BM25 remains a competitive baseline, and is only outperformed by the larger dense models trained for retrieval. When combined with reranking models, BM25 achieves performance on par with the best dense ranking models. In addition, we explored the impact of translation on the data by back-translating a selection of datasets to English, and observed a performance drop for both dense and lexical methods, indicating the limitations of translation for creating benchmarks. BEIR-NL is publicly available on the Hugging Face hub.

pdf bib
Refining Dimensions for Improving Clustering-based Cross-lingual Topic Models
Chia-Hsuan Chang | Tien Yuan Huang | Yi-Hang Tsai | Chia-Ming Chang | San-Yih Hwang

Recent works in clustering-based topic models perform well in monolingual topic identification by introducing a pipeline to cluster the contextualized representations. However, the pipeline is suboptimal in identifying topics across languages due to the presence of language-dependent dimensions (LDDs) generated by multilingual language models. To address this issue, we introduce a novel, SVD-based dimension refinement component into the pipeline of the clustering-based topic model. This component effectively neutralizes the negative impact of LDDs, enabling the model to accurately identify topics across languages. Our experiments on three datasets demonstrate that the updated pipeline with the dimension refinement component generally outperforms other state-of-the-art cross-lingual topic models.
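
Projecting out dominant singular directions, which often encode language identity in multilingual embeddings, is the core of the refinement step described above. The sketch below shows one plausible form of it; the number of removed components is an assumption.

```python
# Illustrative SVD-based dimension refinement before topic clustering.
import numpy as np

def refine(X, k=2):
    """Center the embeddings, then project out the top-k singular directions,
    which tend to capture language-dependent variation."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc - (Xc @ Vt[:k].T) @ Vt[:k]

emb = np.random.randn(100, 768)  # stand-in for multilingual document embeddings
refined = refine(emb)
# `refined` would then feed the usual clustering pipeline (e.g. UMAP + HDBSCAN).
```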

pdf bib
The Role of Handling Attributive Nouns in Improving Chinese-To-English Machine Translation
Adam Meyers | Rodolfo Joel Zevallos | John E. Ortega | Lisa Wang

Translating between languages with drastically different grammatical conventions poses significant challenges, not just for human interpreters but also for machine translation systems. In this work, we specifically target the translation challenges posed by attributive nouns in Chinese, which frequently cause ambiguities in English translation. By manually inserting the omitted particle ‘DE’ in news article titles from the Penn Chinese Discourse Treebank, we developed a targeted dataset to fine-tune Hugging Face Chinese to English translation models, specifically improving how this critical function word is handled. This focused approach not only complements the broader strategies suggested by previous studies but also offers a practical enhancement by specifically addressing a common error type in Chinese-English translation.

pdf bib
Can a Neural Model Guide Fieldwork? A Case Study on Morphological Data Collection
Aso Mahmudi | Borja Herce | Demian Inostroza Améstica | Andreas Scherbakov | Eduard H. Hovy | Ekaterina Vylomova

Linguistic fieldwork is an important component in language documentation and the creation of comprehensive linguistic corpora. Despite its significance, the process is often lengthy, exhaustive, and time-consuming. This paper presents a novel model that guides a linguist during the fieldwork and accounts for the dynamics of linguist-speaker interactions. We introduce a novel framework that evaluates the efficiency of various sampling strategies for obtaining morphological data and assesses the effectiveness of state-of-the-art neural models in generalising morphological structures. Our experiments highlight two key strategies for improving the efficiency: (1) increasing the diversity of annotated data by uniform sampling among the cells of the paradigm tables, and (2) using model confidence as a guide to enhance positive interaction by providing reliable predictions during annotation.

pdf bib
Comparable Corpora: Opportunities for New Research Directions
Kenneth Ward Church

Most conference papers present new results, but this paper will focus more on opportunities for the audience to make their own contributions. This paper is intended to challenge the community to think more broadly about what we can do with comparable corpora. We will start with a review of the history, and then suggest new directions for future research.

pdf bib
SELEXINI – a large and diverse automatically parsed corpus of French
Manon Scholivet | Agata Savary | Louis Estève | Marie Candito | Carlos Ramisch

The annotation of large text corpora is essential for many tasks. We present here a large automatically annotated corpus for French. This corpus is separated into two parts: the first from BigScience, and the second from HPLT. The annotated documents from HPLT were selected in order to optimise the lexical diversity of the final corpus SELEXINI. An analysis of the impact of this selection was carried out on syntactic diversity, as well as on the quality of the new words resulting from the HPLT part of SELEXINI. We have shown that despite the introduction of interesting new words, the texts extracted from HPLT are very noisy. Furthermore, increasing lexical diversity did not increase syntactic diversity.

up

pdf (full)
bib (full)
Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)

pdf bib
Proceedings of the 3rd Workshop on Cross-Cultural Considerations in NLP (C3NLP 2025)
Vinodkumar Prabhakaran | Sunipa Dev | Luciana Benotti | Daniel Hershcovich | Yong Cao | Li Zhou | Laura Cabello | Ife Adebara

pdf bib
LLM Alignment for the Arabs: A Homogenous Culture or Diverse Ones
Amr Keleg

Large Language Models (LLMs) have the potential of being a useful tool that can automate tasks, and assist humans. However, these models are more fluent in English and more aligned with Western cultures, norms, and values. Arabic-specific LLMs are being developed to better capture the nuances of the Arabic language, and the views of the Arabs. However, Arabs are sometimes assumed to share the same culture. In this position paper, we discuss the limitations of this assumption and provide our recommendations for how to curate better alignment data that models the cultural diversity within the Arab world.

pdf bib
Multi-Step Reasoning in Korean and the Emergent Mirage
Guijin Son | Hyunwoo Ko | Dasol Choi

pdf bib
Fair Summarization: Bridging Quality and Diversity in Extractive Summaries
Sina Bagheri Nezhad | Sayan Bandyapadhyay | Ameeta Agrawal

Fairness in multi-document summarization of user-generated content remains a critical challenge in natural language processing (NLP). Existing summarization methods often fail to ensure equitable representation across different social groups, leading to biased outputs. In this paper, we introduce two novel methods for fair extractive summarization: FairExtract, a clustering-based approach, and FairGPT, which leverages GPT-3.5-turbo with fairness constraints. We evaluate these methods using the Divsumm summarization dataset of White-aligned, Hispanic, and African-American dialect tweets and compare them against relevant baselines. The results obtained using a comprehensive set of summarization quality metrics such as SUPERT, BLANC, SummaQA, BARTScore, and UniEval, as well as a fairness metric F, demonstrate that FairExtract and FairGPT achieve superior fairness while maintaining competitive summarization quality. Additionally, we introduce composite metrics (e.g., SUPERT+F, BLANC+F) that integrate quality and fairness into a single evaluation framework, offering a more nuanced understanding of the trade-offs between these objectives. Our code is available online.
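
As a toy illustration of a composite quality-plus-fairness metric in the spirit of SUPERT+F (the paper's exact combination is not given here; equal weighting and pre-normalized scores are assumptions):

```python
# Toy composite metric: weighted combination of a quality and a fairness score.
def composite(quality, fairness, w=0.5):
    """Both inputs assumed normalized to [0, 1]; w trades quality vs fairness."""
    return w * quality + (1 - w) * fairness

print(composite(quality=0.72, fairness=0.90))  # -> 0.81
```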

pdf bib
InspAIred: Cross-cultural Inspiration Detection and Analysis in Real and LLM-generated Social Media Data
Oana Ignat | Gayathri Ganesh Lakshmy | Rada Mihalcea

Inspiration is linked to various positive outcomes, such as increased creativity, productivity, and happiness. Although inspiration has great potential, there has been limited effort toward identifying content that is inspiring, as opposed to just engaging or positive. Additionally, most research has concentrated on Western data, with little attention paid to other cultures. This work is the first to study cross-cultural inspiration through machine learning methods. We aim to identify and analyze real and AI-generated cross-cultural inspiring posts. To this end, we compile and make publicly available the InspAIred dataset, which consists of 2,000 real inspiring posts, 2,000 real non-inspiring posts, and 2,000 generated inspiring posts evenly distributed across India and the UK. The real posts are sourced from Reddit, while the generated posts are created using the GPT-4 model. Using this dataset, we conduct extensive computational linguistic analyses to (1) compare inspiring content across cultures, (2) compare AI-generated inspiring posts to real inspiring posts, and (3) determine if detection models can accurately distinguish between inspiring content across cultures and data sources.

pdf bib
DaKultur: Evaluating the Cultural Awareness of Language Models for Danish with Native Speakers
Max Müller-Eberstein | Mike Zhang | Elisa Bassignana | Peter Brunsgaard Trolle | Rob Van Der Goot

Large Language Models (LLMs) have seen widespread societal adoption. However, while they are able to interact with users in languages beyond English, they have been shown to lack cultural awareness, providing anglocentric or inappropriate responses for underrepresented language communities. To investigate this gap and disentangle linguistic versus cultural proficiency, we conduct the first cultural evaluation study for the mid-resource language of Danish, in which native speakers prompt different models to solve tasks requiring cultural awareness. Our analysis of the resulting 1,038 interactions from 63 demographically diverse participants highlights open challenges to cultural adaptation: Particularly, how currently employed automatically translated data are insufficient to train or measure cultural adaptation, and how training on native-speaker data can more than double response acceptance rates. We release our study data as DaKultur - the first native Danish cultural awareness dataset.

pdf bib
Korean Stereotype Content Model: Translating Stereotypes Across Cultures
Michelle YoungJin Kim | Kristen Johnson

To address bias in language models, researchers are leveraging established social psychology research on stereotyping. This interdisciplinary approach uses frameworks like the Stereotype Content Model (SCM) to understand how stereotypes about social groups are formed and perpetuated. The SCM posits that stereotypes are based on two dimensions: warmth (intent to harm) and competence (ability to harm). This framework has been applied in NLP for various tasks, including stereotype identification, bias mitigation, and hate speech detection. While the SCM has been extensively studied in English language models and Western cultural contexts, its applicability as a cross-cultural measure of stereotypes remains an open research question. This paper explores the cross-cultural validity of the SCM by developing a Korean Stereotype Content Model (KoSCM). We create a Korean warmth-competence lexicon through machine translation of existing English lexicons, validated by an expert translator, and utilize this lexicon to develop a labeled training dataset of Korean sentences. This work presents the first extension of SCM lexicons to a non-English language (Korean), aiming to broaden understanding of stereotypes and cultural dynamics.

pdf bib
LLM-C3MOD: A Human-LLM Collaborative System for Cross-Cultural Hate Speech Moderation
Junyeong Park | Seogyeong Jeong | Seyoung Song | Yohan Lee | Alice Oh

Content moderation platforms concentrate resources on English content despite serving predominantly non-English speaking users. Also, given the scarcity of native moderators for low-resource languages, non-native moderators must bridge this gap in moderation tasks such as hate speech moderation. Through a user study, we identify that non-native moderators struggle with understanding culturally-specific knowledge, sentiment, and internet culture in hate speech. To assist non-native moderators, we present LLM-C3MOD, a human-LLM collaborative pipeline with three steps: (1) RAG-enhanced cultural context annotations; (2) initial LLM-based moderation; and (3) targeted human moderation for cases lacking LLM consensus. Evaluated on a Korean hate speech dataset with Indonesian and German participants, our system achieves 78% accuracy (surpassing GPT-4o’s 71% baseline) while reducing human workload by 83.6%. In addition, cultural context annotations improved non-native moderator accuracy from 22% to 61%, with humans notably excelling at nuanced tasks where LLMs struggle. Our findings demonstrate that non-native moderators, when properly supported by LLMs, can effectively contribute to cross-cultural hate speech moderation.

pdf bib
One world, one opinion? The superstar effect in LLM responses
Sofie Goethals | Lauren Rhue

As large language models (LLMs) are shaping the way information is shared and accessed online, their opinions have the potential to influence a wide audience. This study examines who is predicted by the studied LLMs as the most prominent figures across various fields, while using prompts in ten different languages to explore the influence of linguistic diversity. Our findings reveal low diversity in responses, with a small number of figures dominating recognition across languages (also known as the “superstar effect”). These results highlight the risk of narrowing global knowledge representation when LLMs are used to retrieve subjective information.

pdf bib
Towards Region-aware Bias Evaluation Metrics
Angana Borah | Aparna Garimella | Rada Mihalcea

When exposed to human-generated data, language models are known to learn and amplify societal biases. While previous works introduced metrics that can be used to assess the bias in these models, they rely on assumptions that may not be universally true. For instance, a gender bias dimension commonly used by these metrics is that of family–career, but this may not be the only common bias in certain regions of the world. In this paper, we identify topical differences in gender bias across different regions and propose a region-aware bottom-up approach for bias assessment. Several of our proposed region-aware gender bias dimensions are found to be aligned with the human perception of gender biases in these regions.

pdf bib
Cross-Cultural Differences in Mental Health Expressions on Social Media
Sunny Rai | Khushi Shelat | Devansh Jain | Ashwin Kishen | Young Min Cho | Maitreyi Redkar | Samindara Hardikar-Sawant | Lyle Ungar | Sharath Chandra Guntuku

Culture moderates the way individuals perceive and express mental distress. Current understandings of mental health expressions on social media, however, are predominantly derived from WEIRD (Western, Educated, Industrialized, Rich, and Democratic) contexts. To address this gap, we examine mental health posts on Reddit made by individuals geolocated in India, to identify variations in social media language specific to the Indian context compared to users from Western nations. Our experiments reveal significant psychosocial variations in emotions and temporal orientation. This study demonstrates the potential of social media platforms for identifying cross-cultural differences in mental health expressions (e.g. seeking advice in India vs seeking support by Western users). Significant linguistic variations in online mental health-related language emphasize the importance of developing precision-targeted interventions that are culturally appropriate.

pdf bib
WHEN TOM EATS KIMCHI: Evaluating Cultural Awareness of Multimodal Large Language Models in Cultural Mixture Contexts
Jun Seong Kim | Kyaw Ye Thu | Javad Ismayilzada | Junyeong Park | Eunsu Kim | Huzama Ahmad | Na Min An | James Thorne | Alice Oh

In a highly globalized world, it is important for multi-modal large language models (MLLMs) to recognize and respond correctly to mixed-cultural inputs. For example, a model should correctly identify kimchi (Korean food) in an image both when an Asian woman is eating it and when an African man is eating it. However, current MLLMs show an over-reliance on the visual features of the person, leading to misclassification of the entities. To examine the robustness of MLLMs to different ethnicities, we introduce MIXCUBE, a cross-cultural bias benchmark, and study elements from five countries and four ethnicities. Our findings reveal that MLLMs achieve both higher accuracy and lower sensitivity to such perturbations for high-resource cultures, but not for low-resource cultures. GPT-4o, the best-performing model overall, shows up to a 58% difference in accuracy between the original and perturbed cultural settings in low-resource cultures.

up

pdf (full)
bib (full)
Proceedings of the 7th Workshop on Computational Approaches to Linguistic Code-Switching

pdf bib
Proceedings of the 7th Workshop on Computational Approaches to Linguistic Code-Switching
Genta Indra Winata | Sudipta Kar | Marina Zhukova | Thamar Solorio | Xi Ai | Injy Hamed | Mahardika Krisna Krisna Ihsani | Derry Tanti Wijaya | Garry Kuwanto

pdf bib
EuskañolDS: A Naturally Sourced Corpus for Basque-Spanish Code-Switching
Maite Heredia | Jeremy Barnes | Aitor Soroa

Code-switching (CS) remains a significant challenge in Natural Language Processing (NLP), mainly due to a lack of relevant data. In the context of the contact between the Basque and Spanish languages in the north of the Iberian Peninsula, CS frequently occurs in both formal and informal spontaneous interactions. However, resources to analyse this phenomenon and support the development and evaluation of models capable of understanding and generating code-switched language for this language pair are almost non-existent. We introduce a first approach to develop a naturally sourced corpus for Basque-Spanish code-switching. Our methodology consists of identifying CS texts from previously available corpora using language identification models, which are then manually validated to obtain a reliable subset of CS instances. We present the properties of our corpus and make it available under the name EuskañolDS.

pdf bib
The Impact of Code-switched Synthetic Data Quality is Task Dependent: Insights from MT and ASR
Injy Hamed | Thang Vu | Nizar Habash

Code-switching, the act of alternating between languages, emerged as a prevalent global phenomenon that needs to be addressed for building user-friendly language technologies. A main bottleneck in this pursuit is data scarcity, motivating research in the direction of code-switched data augmentation. However, current literature lacks comprehensive studies that enable us to understand the relation between the quality of synthetic data and improvements on NLP tasks. We extend previous research conducted in this direction on machine translation (MT) with results on automatic speech recognition (ASR) and cascaded speech translation (ST) to test generalizability of findings. Our experiments involve a wide range of augmentation techniques, covering lexical replacements, linguistic theories, and back-translation. Based on the results of MT, ASR, and ST, we draw conclusions and insights regarding the efficacy of various augmentation techniques and the impact of quality on performance.

pdf bib
Beyond Monolingual Limits: Fine-Tuning Monolingual ASR for Yoruba-English Code-Switching
Oreoluwa Boluwatife Babatunde | Victor Tolulope Olufemi | Emmanuel Bolarinwa | Kausar Yetunde Moshood | Chris Chinenye Emezue

Code-switching (CS) presents a significant challenge for Automatic Speech Recognition (ASR) systems, particularly in low-resource settings. While multilingual ASR models like OpenAI Whisper Large v3 are designed to handle multiple languages, their high computational demands make them less practical for real-world deployment in resource-constrained environments. In this study, we investigate the effectiveness of fine-tuning both monolingual and multilingual ASR models for Yoruba-English CS speech. Our results show that unadapted monolingual ASR models outperform Whisper Large v3 in a zero-shot setting on CS speech. Fine-tuning significantly reduces WER for both monolingual and multilingual models, with monolingual models achieving over a 20% WER reduction on CS and Yoruba speech while maintaining lower computational costs. However, we observe a trade-off, as fine-tuning leads to some degradation in English recognition, particularly for multilingual models. Our findings highlight that while multilingual models benefit from fine-tuning, monolingual models provide a computationally efficient and competitive alternative for CS-ASR, making them a viable choice for resource-constrained environments.
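
For context, the word error rate (WER) reductions reported above are typically computed with a package such as jiwer; the sentences below are invented stand-ins for Yoruba-English code-switched data.

```python
# Quick WER computation sketch over reference/hypothesis transcript pairs.
from jiwer import wer

refs = ["mo fe lo si the market today"]   # invented code-switched reference
hyps = ["mo fe lo si market today"]       # invented ASR hypothesis
print(f"WER: {wer(refs, hyps):.2%}")
```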

pdf bib
Where and How Do Languages Mix? A Study of Spanish-Guaraní Code-Switching in Paraguay
Olga Kellert | Nemika Tyagi

Code-switching, the alternating use of multiple languages within a single utterance, is a widespread linguistic phenomenon that poses unique challenges for both sociolinguistic analysis and Natural Language Processing (NLP). While prior research has explored code-switching from either a syntactic or geographic perspective, few studies have integrated both aspects, particularly for underexplored language pairs like Spanish-Guaraní. In this paper, we analyze Spanish-Guaraní code-switching using a dataset of geotagged tweets from Asunción, Paraguay, collected from 2017 to 2021. We employ a differential distribution method to map the geographic distribution of code-switching across urban zones and analyze its syntactic positioning within sentences. Our findings reveal distinct spatial patterns, with Guaraní-dominant tweets concentrated in the western and southwestern areas, while Spanish-only tweets are more prevalent in central and eastern regions. Syntactic analysis shows that code-switching occurs most frequently in the middle of sentences, often involving verbs, pronouns, and adjectives. These results provide new insights into the interaction between linguistic, social, and geographic factors in bilingual communication. Our study contributes to both sociolinguistic research and NLP applications, offering a framework for analyzing mixed-language data in digital communication.

pdf bib
Tongue-Tied: Breaking LLMs Safety Through New Language Learning
Bibek Upadhayay | Vahid Behzadan

The safety mechanisms of large language models (LLMs) have been shown to be fragile, as attackers can exploit prompts to generate harmful responses. Low-cost jailbreak attacks, such as those utilizing low-resource languages and code-switching, demonstrate that LLM safety mechanisms are vulnerable to low-resource languages. This indicates that safety training is particularly ineffective in low-resource languages. Furthermore, research has shown that fine-tuning LLMs with a small number of adversarial samples can compromise their safety training, implying that safety mechanism objectives can be overridden with the latest fine-tuning objectives. Based on the aforementioned statements, we hypothesize that the safety training of LLMs is language-dependent, and LLMs can potentially be compromised by fine-tuning them with new languages, even when using only harmless data. In this work, we used the low-resource language Newari and created two fake languages to LoRA-finetune LLMs with non-harmful data. Our results show that simply fine-tuning LLMs with new languages, even without the presence of harmful data, will jailbreak LLMs. Furthermore, we demonstrate that as we introduce English-to-and-from new language translation pairs in the training dataset, the attack success rate increases with harmful responses becoming more coherent. Additionally, we show the transferability of the attack by jailbreaking GPT-4 through finetuning with only 4,000 data points, and demonstrate that higher-capability models such as Claude-3.5-Sonnet can be compelled to learn to write in new languages through few-shot examples from in-context learning and can be jailbroken with new languages without fine-tuning. We furthermore investigate the fine-tuned LLMs’ latents with logit lens and find that the new language fine-tuning weakens safety mechanisms by prioritizing new language fidelity over alignment, enabling jailbreaks via late-layer pivots to new language tokens that bypass English-centric safeguards. We have publicly released our trained model weights, dataset, and artifacts at this URL: https://github.com/UNHSAILLab/tongue-tied-breaking-llms-safety-through-new-language-learning

pdf bib
LexiLogic@CALCS 2025: Predicting Preferences in Generated Code-Switched Text
Pranav Gupta | Souvik Bhattacharyya | Niranjan Kumar M | Billodal Roy

Code-switched generation is an emerging application in NLP systems, as code-switched text and speech are common and natural forms of conversation in multilingual communities worldwide. While monolingual generation has matured significantly with advances in large language models, code-switched generation still remains challenging, especially for languages and domains with less representation in pre-training datasets. In this paper, we describe our submission to the shared task of predicting human preferences for code-switched text in English-Malayalam, English-Tamil, and English-Hindi. We discuss our various approaches and report on the accuracy scores for each approach.

up

bib (full)
Proceedings of the 9th Workshop on Constraint Grammar and Finite State NLP

pdf bib
Proceedings of the 9th Workshop on Constraint Grammar and Finite State NLP
Trond Trosterud | Linda Wiechetek | Flammie Pirinen

pdf bib
An Annotated Error Corpus for Esperanto
Eckhard Bick

This paper presents and evaluates a new multi-genre error corpus for (written) Esperanto, EspEraro, building on learner, news, and internet data and covering both ordinary spelling errors and real-word errors such as grammatical and word choice errors. Because the corpus has been annotated not only for errors, error types and corrections, but also with Constraint Grammar (CG) tags for part-of-speech, inflection, affixation, syntactic function, dependency and semantic class, it allows users to linguistically contextualize errors and to craft and test CG rules aiming at the recognition and/or correction of the various error types covered in the corpus. The resource was originally created for regression-testing a newly developed spell- and grammar checker, and contains about 75,000 tokens (roughly 4,000 sentences), with 3,330 tokens annotated for one or more errors and a combined correction suggestion. We discuss the different error types and evaluate their weight in the corpus. Where relevant, we explain the role of Constraint Grammar (CG) in the identification and correction of the individual error types.

pdf bib
Rule-based Surface Realization of Romanian Weak Pronouns
Ciprian Gerstenberger

Due to its reliance on context and intricate grammatical rules, the Romanian weak pronoun system presents a challenge not only for language learners – both native and non-native speakers – but also for linguistic description and computational processing. The present work addresses the challenges of Romanian weak pronouns from a computational processing perspective. Accordingly, it has three main goals: (1) to present the implementation of a rule-based model for generating contextually accurate surface forms of Romanian weak pronouns, (2) to describe the compilation of a database of relevant inputs for testing surface realization, and (3) to test the effectiveness of the model. This serves as a proof of concept, demonstrating both the transparency and the effectiveness of the model when based on an appropriate linguistic description.

pdf bib
Drawing Blue Lines - What can Constraint Grammar do for GEC?
Linda Wiechetek | Kevin Brubeck Unhammer

This paper presents the application of rule-based methods for Grammatical Error Correction (GEC) across multiple low-resource languages. We describe new functionality using the Constraint Grammar (CG) formalism, designed for detecting and correcting different types of complex grammatical errors in a range of morphologically complex languages. These errors require transformations such as reordering, word additions/deletions, and alternative choices for multiword suggestions. New perspectives are gained from end-to-end-testing – this work aims to clarify the relationship between the command-line interface used by developers and the user interfaces of our grammar checker plug-in for common word processors. We discuss challenges and solutions in correcting complex errors, with examples from languages like Lule Sámi, Irish, and Greenlandic, enabling linguists to adapt these methods in order to provide accurate and context-aware proofing tools for their own languages in mainstream word processors like Microsoft Word, Google Docs or LibreOffice.

pdf bib
Towards Natural Language Explanations of Constraint Grammar Rules
Daniel Swanson

This paper presents a general-purpose parser for static analysis of Constraint Grammar rules (that is, examining only the rules, not potential inputs and outputs) and applies it to the task of translating rules into comprehensible explanations of behavior. An interactive interface for exploring how individual components of each rule contribute to these translations is also presented.

pdf bib
A Mansi FST and spellchecker
Jack Rueter | Csilla Horváth | Trond Trosterud

The article presents a finite state transducer and spellchecker for Mansi, an Ob-Ugric Uralic language spoken in northwestern Siberia. Mansi has a rich but mostly agglutinative morphology, with a morphophonology dominated by sandhi phenomena. With a small set of morphophonological rules (32 twolc rules) and a lexicon consisting of 12,000 Mansi entries and a larger set of proper nouns, we were able to build a transducer covering 98.9% of a large (700k) newspaper corpus. Being part of the GiellaLT infrastructure, the transducer was turned into a spellchecker. The most common spelling error in Mansi is the omission of length marks on vowels; for the 1,000 most common words containing long vowels, the spellchecker gave a correct suggestion among its top five in 98.3% of cases, and as the first suggestion in 91.3% of cases.

pdf bib
A grammatical analyser for Tokelau
Trond Trosterud | Arnfinn Muruvik Vonen

This article presents a grammatical analyser, disambiguator, and dependency analyser for Tokelau. The grammatical analyser is written as a finite-state transducer (FST), whereas the disambiguator and dependency analyser are written in Constraint Grammar (CG), both within the GiellaLT infrastructure. Contrary to most languages analysed within this framework, Tokelau, being a Polynesian language, is predominantly isolating, with reduplication and affixation as the main morphological processes. The article discusses how FST and CG deal with Polynesian languages.

pdf bib
A Grammar-Based Method for Instilling Empirical Dependency Structure in LLMs
Olle Torstensson | Oskar Holmström

We investigate whether synthetic pretraining data generated from a formal grammar modeling syntactic dependencies can improve English language models. Building upon the structured pretraining data approach of Papadimitriou and Jurafsky (2023), we develop a grammar that more closely mirrors empirical dependency structures. Our results are negative – this type of pretraining significantly degrades model performance, with both our and their pretraining approach performing worse than no pretraining at all. We analyze potential explanations for these findings and discuss implications for future work on structured-data pretraining.

pdf bib
Case error corrections for noun phrases containing deverbal attributive nouns in Greenlandic
Judithe Denbæk

This paper contains preliminary findings on using Constraint Grammar (CG) for semantic annotation of a specific type of noun phrase in Greenlandic, in which the attributive noun is a nominalized predicative verbal stem. The annotation is used in a grammar checker pipeline for the purpose of making case error correction suggestions.

pdf bib
Divvunspell—Finite-State Spell-Checking and Correction on Modern Platforms
Flammie A Pirinen | Sjur Nørstebø Moshagen

Spell-checking and correction is one of the key applications of natural language support. Historically, for the largest and less morphologically complex languages, spell-checking and correction could be implemented by relatively simple means; however, for morphologically complex and low-resource languages, the solutions were often suboptimal. Finite-state methods are the state of the art in rule-based natural language processing, and they have also been used effectively for spell-checking and correction. In this article, we show some recent developments of a finite-state spell-checker implementation that works with modern operating systems and platforms.

up

pdf (full)
bib (full)
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)

pdf bib
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)
Kengatharaiyer Sarveswaran | Ashwini Vaidya | Bal Krishna Bal | Sana Shams | Surendrabikram Thapa

pdf bib
A Brief Overview of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL)
Kengatharaiyer Sarveswaran | Surendrabikram Thapa | Sana Shams | Ashwini Vaidya | Bal Krishna Bal

In this paper, we provide a brief summary of the inaugural workshop on Challenges in Processing South Asian Languages (CHiPSAL) held as part of COLING 2025. The workshop included regular papers, invited keynotes, and shared task papers, fostering a collaborative platform for exploring challenges in processing South Asian languages. The shared task focused on Devanagari-script language understanding, encompassing subtasks on language identification, hate speech detection, and target classification. This workshop series aims to address linguistic and cultural nuances, resource constraints, and orthographic complexities in low-resource South Asian languages while advancing NLP research and promoting multilingual inclusivity.

pdf bib
Development of Pre-Trained Transformer-based Models for the Nepali Language
Prajwal Thapa | Jinu Nyachhyon | Mridul Sharma | Bal Krishna Bal

Transformer-based pre-trained language models have dominated the field of Natural Language Processing (NLP) for quite some time now. However, the Nepali language, spoken by approximately 32 million people worldwide, remains significantly underrepresented in this domain. This underrepresentation is primarily attributed to the scarcity of monolingual data corpora and limited available resources for the Nepali language. While existing efforts have predominantly concentrated on basic encoder-based models, there is a notable gap in the exploration of decoder-based architectures. To address this gap, we have collected 27.5 GB of Nepali text data, approximately 2.4x larger than any previously available Nepali language corpus. Leveraging this data, we pre-trained three different models, i.e., BERT, RoBERTa, and GPT-2, exclusively for the Nepali language. Furthermore, we performed instruction tuning and explored its potential for monolingual Nepali data, providing a foundation for future research. Our models outperformed the existing best model by 2 points on the Nep-gLUE benchmark, scoring 95.60, and also outperformed existing models on text generation tasks, demonstrating improvements in both understanding and generating Nepali text.

pdf bib
Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks
Munief Hassan Tahir | Sana Shams | Layba Fiaz | Farah Adeeba | Sarmad Hussain

Large Language Models (LLMs) pre-trained on multilingual data have revolutionized natural language processing research by transitioning from language- and task-specific model pipelines to a single model adapted to a variety of tasks. However, the majority of existing multilingual NLP benchmarks for LLMs provide evaluation data in only a few languages, with little linguistic diversity. In addition, these benchmarks lack quality assessment against the respective state-of-the-art models. This study presents an in-depth examination of 7 prominent LLMs: GPT-3.5-turbo, Llama 2-7B-Chat, Llama 3.1-8B, Bloomz 3B, Bloomz 7B1, Ministral-8B, and Whisper (large, medium, and small variants), across 17 tasks using 22 datasets and 13.8 hours of speech in a zero-shot setting, and compares and analyzes their performance against state-of-the-art (SOTA) models. Our experiments show that SOTA models currently outperform encoder-decoder models in the majority of Urdu NLP tasks under zero-shot settings. However, comparing Llama 3.1-8B with its predecessor Llama 2-7B-Chat, we can deduce that with improved language coverage, LLMs can surpass these SOTA models. Our results emphasize that models with fewer parameters but richer language-specific data, like Llama 3.1-8B, often outperform larger models with lower language diversity, such as GPT-3.5, in several tasks.

pdf bib
Bengali ChartSumm: A Benchmark Dataset and study on feasibility of Large Language Models on Bengali Chart to Text Summarization
Nahida Akter Tanjila | Afrin Sultana Poushi | Sazid Abdullah Farhan | Abu Raihan Mostofa Kamal | Md. Azam Hossain | Md. Hamjajul Ashmafee

In today’s data-driven world, effectively organizing and presenting data is challenging, particularly for non-experts. While tabular formats structure data, they often lack intuitive insights; charts, however, offer accessible and impactful visual summaries. Although recent advancements in NLP, powered by large language models (LLMs), have primarily benefited high-resource languages like English, low-resource languages such as Bengali—spoken by millions globally—still face significant data limitations. This research addresses this gap by introducing “Bengali ChartSumm,” a benchmark dataset with 4,100 Bengali chart images, metadata, and summaries. This dataset facilitates the analysis of LLMs (mT5, BanglaT5, Gemma) in Bengali chart-to-text summarization, offering essential baselines and evaluations that enhance NLP research for low-resource languages.

pdf bib
DweshVaani: An LLM for Detecting Religious Hate Speech in Code-Mixed Hindi-English
Varad Srivastava

Traditional language models have been used extensively for hate speech detection in NLP. With the growth of social media, content in regional languages has grown exponentially, yet the use of language models and LLMs for hate speech detection in code-mixed Hindi-English remains under-explored. Our work addresses this gap by investigating both cutting-edge LLMs from Meta, Google, OpenAI, and Nvidia, as well as Indic LLMs like Sarvam, Indic-Gemma, and Airavata, on hate speech detection in code-mixed Hindi-English, in a comprehensive set of few-shot scenarios with examples selected randomly as well as via retrieval-augmented generation (RAG) based on the MuRIL language model. We observed that Indic LLMs instruction-tuned on Indian content fall behind on the task. We also experimented with fine-tuning approaches, using knowledge-distillation-based fine-tuning that incorporates extracted rationales behind hate speech into the fine-tuning process. Finally, we propose Dwesh-Vaani, an LLM based on fine-tuned Gemma-2, which outperforms all other approaches at religious hate speech detection as well as targeted-religion identification in code-mixed Hindi-English.

pdf bib
Improving Accuracy of Low-resource ASR using Rule-Based Character Constituency Loss (RBCCL)
Rupak Raj Ghimire | Prakash Poudyal | Bal Krishna Bal

Modern general-purpose speech recognition systems are more robust in high-resource languages. However, achieving state-of-the-art accuracy for low-resource languages is still challenging. A popular way to deal with this challenge is to fine-tune a pre-trained model on the low-resource setting. Nevertheless, pre-trained or fine-tuned models fail to capture the complex character and word constituency of Devanagari-script transcriptions. We propose a complementary loss function, called Rule-Based Character Constituency Loss (RBCCL), designed to force the model to learn the character constituency of the Devanagari script; it penalizes incorrect transcriptions and updates the overall loss during training. This loss function can be combined with CTC loss or cross-entropy loss, both of which are widely used in ASR training. Our experiments show that combining the existing cross-entropy loss with the new complementary loss (RBCCL) improves the Word Error Rate (WER), reducing it from 47.1% to 23.41%, a promising result.
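
To make the idea of combining a rule-based penalty with a standard ASR loss concrete, here is a minimal PyTorch sketch. The single ordering rule and the per-sample weighting scheme are our illustrative assumptions, not the paper’s actual RBCCL formulation:

```python
# Sketch: weight per-sample CTC loss by a rule-violation rate computed
# on decoded hypotheses, so gradients still flow through the CTC term.
import torch
import torch.nn.functional as F

# Toy rule: a Devanagari dependent vowel sign should not start a string
# or follow another vowel sign (a rough proxy for "must follow a consonant").
VOWEL_SIGNS = set("\u093e\u093f\u0940\u0941\u0942\u0947\u0948\u094b\u094c")

def constituency_violations(text):
    """Fraction of characters violating the toy ordering rule."""
    bad = sum(
        1 for i, ch in enumerate(text)
        if ch in VOWEL_SIGNS and (i == 0 or text[i - 1] in VOWEL_SIGNS)
    )
    return bad / max(len(text), 1)

def rbccl_weighted_ctc(log_probs, targets, in_lens, tgt_lens, decoded, lam=0.5):
    # log_probs: (T, N, C) as expected by F.ctc_loss; decoded: list of
    # greedy-decoded hypothesis strings, one per sample.
    ctc = F.ctc_loss(log_probs, targets, in_lens, tgt_lens, reduction="none")
    penalty = torch.tensor([constituency_violations(d) for d in decoded])
    return (ctc * (1.0 + lam * penalty)).mean()
```

The same weighting pattern works with a cross-entropy base loss; only the per-sample loss term changes.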

pdf bib
Natural Language Understanding of Devanagari Script Languages: Language Identification, Hate Speech and its Target Detection
Surendrabikram Thapa | Kritesh Rauniyar | Farhan Ahmad Jafri | Surabhi Adhikari | Kengatharaiyer Sarveswaran | Bal Krishna Bal | Hariram Veeramani | Usman Naseem

The growing use of Devanagari-script languages such as Hindi, Nepali, Marathi, Sanskrit, and Bhojpuri on social media presents unique challenges for natural language understanding (NLU), particularly in language identification, hate speech detection, and target classification. To address these challenges, we organized a shared task with three subtasks: (i) identifying the language of Devanagari-script text, (ii) detecting hate speech, and (iii) classifying hate speech targets into individual, community, or organization. A curated dataset combining multiple corpora was provided, with splits for training, evaluation, and testing. The task attracted 113 participants, with 32 teams submitting models evaluated on accuracy, precision, recall, and macro F1-score. Participants applied innovative methods, including large language models, transformer models, and multilingual embeddings, to tackle the linguistic complexities of Devanagari-script languages. This paper summarizes the shared task, datasets, and results, and aims to contribute to advancing NLU for low-resource languages and fostering inclusive, culturally aware natural language processing (NLP) solutions.

pdf bib
SiTa - Sinhala and Tamil Speaker Diarization Dataset in the Wild
Uthayasanker Thayasivam | Thulasithan Gnanenthiram | Shamila Jeewantha | Upeksha Jayawickrama

The dynamic field of speaker diarization continues to present significant challenges despite notable advancements in recent years, and the rising focus on complex acoustic scenarios emphasizes the importance of sustained research efforts in this area. While speech resources for speaker diarization are expanding rapidly, aided by semi-automated techniques, many existing datasets remain outdated and lack authentic real-world conversational data. This challenge is particularly acute for low-resource South Asian languages, due to limited public media data and reduced research efforts. Sinhala and Tamil are two such languages with limited speaker diarization datasets. To address this gap, we introduce a new speaker diarization dataset for these languages and evaluate multiple existing models to assess their performance. This work provides essential resources, namely a novel dataset and valuable insights from model benchmarks, to advance speaker diarization for low-resource languages, particularly Sinhala and Tamil.

pdf bib
Sandhi Splitting in Tamil and Telugu: A Sequence-to-Sequence Approach Leveraging Transformer Models
Priyanka Dasari | Mupparapu Sohan Gupta | Nagaraju Vuppala | Pruthwik Mishra | Parameswari Krishnamurthy

Dravidian languages like Tamil and Telugu are agglutinative: they form word forms by combining two or more elements into a single string, with morpho-phonemic changes at the point of concatenation, known as sandhi. This linguistic feature adds complexity to automatic language processing, making the pre-processing of sandhi words essential for NLP applications. We developed extensive sandhi-annotated corpora of 15K for Telugu and Tamil, focusing on the systematic application of sandhi rules, which explain word-formation patterns by showing how lexical and functional categories combine to create composite non-compound words. We implemented compact sequence-to-sequence transformer networks for automatic sandhi processing. To evaluate our models, we manually annotated the Telugu and Tamil IN22-Conv benchmark datasets with sandhi annotations. Our experiments aim to enhance language processing tasks like machine translation in morphologically rich languages.
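
As a sketch of how sandhi splitting can be framed for a sequence-to-sequence model, the snippet below formats fused words and their splits as character-level source/target pairs; the transliterated example pair is a toy, not drawn from the paper’s corpus:

```python
# Toy framing of sandhi splitting as character-level seq2seq: the source
# is the fused surface form, the target its sandhi-split form.
def to_chars(s: str) -> str:
    return " ".join(s)  # space-separated characters act as "tokens"

pairs = [("ramayanam", "rama + ayanam")]  # (fused form, split form)
sources = [to_chars(fused) for fused, _ in pairs]
targets = [to_chars(split) for _, split in pairs]
# Any compact encoder-decoder transformer can then be trained on
# (sources, targets) exactly like a translation task.
print(sources[0], "->", targets[0])
```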

pdf bib
Bridge the GAP: Multi-lingual Models For Ambiguous Pronominal Coreference Resolution in South Asian Languages
Rahothvarman P | Adith John Rajeev | Kaveri Anuranjana | Radhika Mamidi

Coreference resolution, the process of determining what a referring expression (a pronoun or a noun phrase) refers to in discourse, is a critical aspect of natural language understanding. However, the development of computational models for coreference resolution in low-resource languages, such as the Dravidian (and more broadly all South Asian) languages, remains a significant challenge due to the scarcity of annotated corpora in these languages. To address this data scarcity, we adopt a pipeline that translates the English GAP dataset into various South Asian languages, creating a multi-lingual coreference dataset, mGAP. Our research leverages this dataset to develop two novel models, namely a joint embedding model and a cross-attention model, for coreference resolution with Dravidian languages in mind. We demonstrate that cross-attention captures pronoun-candidate relations better, leading to improved coreference resolution. We also harness the similarity across South Asian languages via transfer learning, using high-resource languages to learn coreference for low-resource languages.

pdf bib
A Dual Contrastive Learning Framework for Enhanced Hate Speech Detection in Low-Resource Languages
Krishan Chavinda | Uthayasanker Thayasivam

Hate speech on social media platforms is a critical issue, especially in low-resource languages such as Sinhala and Tamil, where the lack of annotated datasets and linguistic tools hampers the development of effective detection systems. This research introduces a novel framework for detecting hate speech in low-resource languages by leveraging Multilingual Large Language Models (MLLMs) integrated with a Dual Contrastive Learning (DCL) strategy. Our approach enhances detection by capturing the nuances of hate speech in low-resource settings, applying both self-supervised and supervised contrastive learning techniques. We evaluate our framework using datasets from Facebook and Twitter, demonstrating its superior performance compared to traditional deep learning models like CNN, LSTM, and BiGRU. The results highlight the efficacy of DCL models, particularly when fine-tuned on domain-specific data, with the best performance achieved using the Twitter/twhin-bert-base model. This study underscores the potential of advanced machine learning techniques in improving hate speech detection for under-resourced languages, paving the way for further research in this domain.
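
For readers unfamiliar with the supervised half of such an objective, here is a minimal sketch of a standard SupCon-style loss; the temperature and normalization choices are ours, and the paper’s exact DCL formulation is not assumed:

```python
# Supervised contrastive loss sketch: pull same-label embeddings
# together and push different-label embeddings apart.
import torch
import torch.nn.functional as F

def supcon_loss(emb, labels, temperature=0.07):
    emb = F.normalize(emb, dim=1)                 # unit-length embeddings
    sim = emb @ emb.t() / temperature             # pairwise similarities
    n = emb.size(0)
    mask_self = torch.eye(n, dtype=torch.bool)
    sim = sim.masked_fill(mask_self, float("-inf"))   # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~mask_self
    # mean log-probability of positives per anchor (anchors with no
    # positive contribute zero)
    denom = pos.sum(1).clamp(min=1)
    return -(log_prob.masked_fill(~pos, 0).sum(1) / denom).mean()
```

The self-supervised counterpart replaces the label-based positive mask with augmented views of the same sentence.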

pdf bib
Abstractive Summarization of Low resourced Nepali language using Multilingual Transformers
Prakash Dhakal | Daya Sagar Baral

Nepali, one of the prominent languages of South Asia, remains underrepresented in natural language processing (NLP) research, particularly in the domain of abstractive summarization. While significant progress has been made in extractive summarization, the complexity of generating coherent, human-like summaries from low-resource languages like Nepali is still largely unexplored. This paper introduces the first comprehensive study on applying multilingual transformer-based models, specifically mBART and mT5, to the task of generating headlines for Nepali news articles through abstractive summarization. Given the absence of large-scale datasets for this task, a new Nepali news headline summarization corpus was created by scraping data from multiple online news portals. The models were fine-tuned with this novel dataset using Low-Rank Adaptation (LoRA) and quantization techniques, allowing for more computationally efficient training while preserving performance. The models’ effectiveness was evaluated using ROUGE scores and a human evaluation approach that focused on relevance, fluency, conciseness, informativeness, factual accuracy, and coverage. The findings demonstrate that a 4-bit quantized mBART model achieves superior performance, offering significant potential for improving digital content summarization for Nepali. This study highlights key challenges in processing Nepali, particularly its orthographic and resource limitations, while providing a path forward for advancing NLP tools for South Asian languages.
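
For context, the sketch below shows a typical way to combine 4-bit quantization with LoRA adapters for an mBART summarizer using Hugging Face transformers and peft; the model name, rank, and target modules are common defaults, not the paper’s reported configuration:

```python
# Load mBART in 4-bit and attach LoRA adapters so only a small set of
# low-rank matrices is trained during fine-tuning.
from transformers import AutoModelForSeq2SeqLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForSeq2SeqLM.from_pretrained(
    "facebook/mbart-large-50", quantization_config=bnb, device_map="auto"
)

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # mBART attention projections
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # only adapter weights are trainable
```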

pdf bib
Structured Information Extraction from Nepali Scanned Documents using Layout Transformer and LLMs
Aayush Neupane | Aayush Lamichhane | Ankit Paudel | Aman Shakya

Despite growing global interest in information extraction from scanned documents, there is still a significant research gap concerning Nepali documents. This study seeks to address this gap by focusing on methods for extracting information from texts in Nepali typeface or Devanagari characters. The primary focus is on the performance of the Language Independent Layout Transformer (LiLT), which was employed as a token classifier to extract information from Nepali texts; LiLT achieved an F1 score of approximately 0.87. Complementing this approach, large language models (LLMs), including OpenAI’s proprietary GPT-4o and the open-source Llama 3.1 8B, were also evaluated. GPT-4o exhibited promising performance, with around 55-80% accuracy for a complete match, varying among fields, while Llama 3.1 8B achieved only 20-40% accuracy. At a 90% match threshold, both GPT-4o and Llama 3.1 8B reached higher accuracy by varying amounts across fields, but Llama 3.1 8B still performed particularly poorly compared to the LiLT model. These results aim to provide a foundation for future work on the digitization of Nepali documents.

pdf bib
Domain-adaptative Continual Learning for Low-resource Tasks: Evaluation on Nepali
Sharad Duwal | Suraj Prasai | Suresh Manandhar

Continual learning has emerged as an important research direction due to the infeasibility of retraining large language models (LLMs) from scratch in the event of new data availability. Of great interest is the domain-adaptive pre-training (DAPT) paradigm, which focuses on continually training a pre-trained language model to adapt it to a domain it wasn’t originally trained on. In this work, we evaluate the feasibility of DAPT in a low-resource setting, namely the Nepali language. We use synthetic data to continue training Llama 3 8B to adapt it to the Nepali language in a 4-bit QLoRA setting. We evaluate the adapted model on its performance, catastrophic forgetting, and knowledge acquisition. We compare the base model and the final model on their Nepali generation abilities, their performance on popular benchmarks, and run case studies to probe their linguistic knowledge in Nepali. We use GPT-4o as an evaluator to establish that the final model has learned to generate Nepali. We see some unsurprising forgetting in the final model, but also surprisingly find that increasing the number of shots during evaluation yields greater percent increases in the final model (as high as a 19.29% increase) compared to the base model (4.98%), suggesting latent retention. We also explore layer-head self-attention heatmaps to establish the dependency resolution abilities of the final model in Nepali. We open-source the model and the code.

pdf bib
POS-Aware Neural Approaches for Word Alignment in Dravidian Languages
Antony Alexander James | Parameswari Krishnamurthy

This research explores word alignment in low-resource languages, specifically focusing on Telugu and Tamil, two languages within the Dravidian language family. Traditional statistical models such as FastAlign, GIZA++, and Eflomal serve as baselines but are often limited in low-resource settings. Neural methods, including SimAlign and AWESOME-align, which leverage multilingual BERT, show promising results by achieving alignment without extensive parallel data. Applying these neural models to Telugu-Tamil and Tamil-Telugu alignments, we found that fine-tuning with POS-tagged data significantly improves alignment accuracy compared to untagged data, achieving an improvement of 6-7%. However, our combined embeddings approach, which merges word embeddings with POS tags, did not yield additional gains. Expanding the study, we included Tamil, Telugu, and English alignments to explore linguistic mappings between Dravidian and Indo-European languages. Results demonstrate the comparative performance across models and language pairs, emphasizing both the benefits of POS-tag fine-tuning and the complexities of cross-linguistic alignment.

pdf bib
neDIOM: Dataset and Analysis of Nepali Idioms
Rhitabrat Pokharel | Ameeta Agrawal

Idioms, integral to any language, convey nuanced meanings and cultural references. However, beyond English, few resources exist to support any meaningful exploration of this unique linguistic phenomenon. To facilitate such an inquiry in a low resource language, we introduce a novel dataset of Nepali idioms and the sentences in which these naturally appear. We describe the methodology of creating this resource as well as discuss some of the challenges we encountered. The results of our empirical analysis under various settings using four distinct multilingual models consistently highlight the difficulties these models face in processing Nepali figurative language. Even fine-tuning the models yields limited benefits. Interestingly, the larger models from the BLOOM family of models failed to consistently outperform the smaller models. Overall, we hope that this new resource will facilitate further development of models that can support processing of idiomatic expressions in low resource languages such as Nepali.

pdf bib
Bridging the Bandwidth Gap: A Mixed Band Telephonic Urdu ASR Approach with Domain Adaptation for Banking Applications
Ayesha Khalid | Farah Adeeba | Najm Ul Sehar | Sarmad Hussain

The accuracy of Automatic Speech Recognition (ASR) systems is influenced by the quality and context of speech signals, particularly in telephonic environments prone to errors like channel drops and noise, leading to higher Word Error Rates (WER). This paper presents the development of a large-vocabulary Urdu ASR system for telephonic speech, based on a corpus of 445 speakers from diverse domains. The corpus, annotated at the sentence level, is used to train and evaluate GMM-HMM and chain Time-Delay Neural Network (TDNN) models on a 10-hour test set. Results show that the TDNN model outperforms GMM-HMM. Mixing narrowband and wideband speech further reduces WER. The test sets are also evaluated with the pre-trained Whisper model for performance comparison. Additionally, system adaptation for the banking domain with a specialized lexicon and language model demonstrates the system’s potential for domain-specific applications.

pdf bib
Impacts of Vocoder Selection on Tacotron-based Nepali Text-To-Speech Synthesis
Ganesh Dhakal Chhetri | Kiran Chandra Dahal | Prakash Poudyal

Text-to-speech (TTS) technology enhances human-computer interaction and increases content accessibility. Tacotron and other deep learning models have enhanced the naturalness of text-to-speech systems. The vocoder, which transforms mel-spectrograms into audio waveforms, significantly influences voice quality. This study evaluates Tacotron2 vocoders for Nepali text-to-speech synthesis. While vocoders for English have been thoroughly examined, vocoders for Nepali remain underexplored. The study utilizes the WaveNet and MelGAN vocoders to generate speech from mel-spectrograms produced by Tacotron2 for Nepali text. To assess the quality of voice synthesis, this paper studies the mel-cepstral distortion (MCD) and Mean Opinion Score (MOS) of speech produced by both vocoders. The comparative investigation of the Tacotron2 + MelGAN and Tacotron2 + WaveNet models, utilizing the Nepali OpenSLR and News male voice datasets, consistently reveals the advantage of Tacotron2 + MelGAN in terms of naturalness and accuracy. The Tacotron2 + MelGAN model achieved an average MOS score of 4.245 on the Nepali OpenSLR dataset and 2.885 on the male voice dataset.

pdf bib
EmoTa: A Tamil Emotional Speech Dataset
Jubeerathan Thevakumar | Luxshan Thavarasa | Thanikan Sivatheepan | Sajeev Kugarajah | Uthayasanker Thayasivam

This paper introduces EmoTa, the first emotional speech dataset in Tamil, designed to reflect the linguistic diversity of Sri Lankan Tamil speakers. EmoTa comprises 936 recorded utterances from 22 native Tamil speakers (11 male, 11 female), each articulating 19 semantically neutral sentences across five primary emotions: anger, happiness, sadness, fear, and neutrality. To ensure quality, inter-annotator agreement was assessed using Fleiss’ Kappa, resulting in a substantial agreement score of 0.74. Initial evaluations using machine learning models, including XGBoost and Random Forest, yielded high F1-scores of 0.91 and 0.90, respectively, on emotion classification. By releasing EmoTa, we aim to encourage further exploration of Tamil language processing and the development of innovative models for Tamil Speech Emotion Recognition.

pdf bib
Benchmarking Whisper for Low-Resource Speech Recognition: An N-Shot Evaluation on Pashto, Punjabi, and Urdu
Najm Ul Sehar | Ayesha Khalid | Farah Adeeba | Sarmad Hussain

Whisper, a large-scale multilingual model, has demonstrated strong performance in speech recognition benchmarks, but its effectiveness on low-resource languages remains under-explored. This paper evaluates Whisper’s performance on Pashto, Punjabi, and Urdu, three underrepresented languages. While Automatic Speech Recognition (ASR) has advanced for widely spoken languages, low-resource languages still face challenges due to limited data. Whisper’s zero-shot performance was benchmarked and then its small variant was fine-tuned to improve transcription accuracy. Significant reductions in Word Error Rate (WER) were achieved through few-shot fine-tuning, which helped the model better handle challenges such as complex phonetic structures, compared to zero-shot performance. This study contributes to improving multilingual ASR for low-resource languages and highlights Whisper’s adaptability and potential for further enhancement.

pdf bib
Leveraging Machine-Generated Data for Joint Intent Detection and Slot Filling in Bangla: A Resource-Efficient Approach
A H M Rezaul Karim | Özlem Uzuner

Natural Language Understanding (NLU) is crucial for conversational AI, yet low-resource languages lag behind in essential tasks like intent detection and slot-filling. To address this gap, we converted the widely-used English SNIPS dataset to Bangla using LLaMA 3, creating a dataset that captures the linguistic complexities of the language. With this translated dataset for model training, our experimental evaluation compares both independent and joint modeling approaches using transformer architecture. Results demonstrate that a joint approach based on multilingual BERT (mBERT) achieves superior performance, with 97.83% intent accuracy and 91.03% F1 score for slot filling. This work advances NLU capabilities for Bangla and provides insights for developing robust models in other low-resource languages.
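
A minimal sketch of a joint architecture of the kind described, with an intent head on the [CLS] vector and a token-level slot head over one shared mBERT encoder (label counts and sizes are placeholders, not the paper’s):

```python
# Joint intent detection and slot filling: one shared encoder, two heads.
import torch.nn as nn
from transformers import AutoModel

class JointIntentSlot(nn.Module):
    def __init__(self, n_intents, n_slots,
                 name="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        hidden = self.encoder.config.hidden_size
        self.intent_head = nn.Linear(hidden, n_intents)
        self.slot_head = nn.Linear(hidden, n_slots)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        intent_logits = self.intent_head(out.last_hidden_state[:, 0])  # [CLS]
        slot_logits = self.slot_head(out.last_hidden_state)            # per token
        return intent_logits, slot_logits
```

Training would typically sum a cross-entropy loss over intents with a token-level cross-entropy over slot labels.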

pdf bib
Challenges in Adapting Multilingual LLMs to Low-Resource Languages using LoRA PEFT Tuning
Omkar Khade | Shruti Jagdale | Abhishek Phaltankar | Gauri Takalikar | Raviraj Joshi

Large Language Models (LLMs) have demonstrated remarkable multilingual capabilities, yet challenges persist in adapting these models for low-resource languages. In this study, we investigate the effects of Low-Rank Adaptation (LoRA) Parameter-Efficient Fine-Tuning (PEFT) on multilingual Gemma models for Marathi, a language with limited resources. Using a translated Alpaca dataset with 52,000 instruction-response pairs, our findings reveal that while evaluation metrics often show a performance decline post-fine-tuning, manual assessments frequently suggest that the fine-tuned models outperform their original counterparts. The observations indicate improvements in target language generation capabilities but a reduction in reasoning abilities following language adaptation. These results underscore the need for improved evaluation methodologies and the creation of high-quality native datasets to accurately assess language-specific model performance in low-resource settings.

pdf bib
1-800-SHARED-TASKS@NLU of Devanagari Script Languages 2025: Detection of Language, Hate Speech, and Targets using LLMs
Jebish Purbey | Siddartha Pullakhandam | Kanwal Mehreen | Muhammad Arham | Drishti Sharma | Ashay Srivastava | Ram Mohan Rao Kadiyala

This paper presents a detailed system description of our entry for the CHiPSAL 2025 challenge, focusing on language detection, hate speech identification, and target detection in Devanagari script languages. We experimented with a combination of large language models and their ensembles, including MuRIL, IndicBERT, and Gemma-2, and leveraged unique techniques like focal loss to address challenges in the natural understanding of Devanagari languages, such as multilingual processing and class imbalance. Our approach achieved competitive results across all tasks: F1 of 0.9980, 0.7652, and 0.6804 for Sub-tasks A, B, and C respectively. This work provides insights into the effectiveness of transformer models in tasks with domain-specific and linguistic challenges, as well as areas for potential improvement in future iterations.
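
For reference, focal loss, mentioned above as a remedy for class imbalance, down-weights easy examples so training focuses on hard, minority-class instances; a standard multi-class sketch follows (the gamma value and optional class weights are illustrative):

```python
# Focal loss: cross-entropy scaled by (1 - p_true)^gamma.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=None):
    # alpha: optional per-class weight tensor, as in weighted cross-entropy
    ce = F.cross_entropy(logits, targets, weight=alpha, reduction="none")
    pt = torch.exp(-ce)                 # probability of the true class
    return ((1 - pt) ** gamma * ce).mean()
```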

pdf bib
AniSan@NLU of Devanagari Script Languages 2025: Optimizing Language Identification with Ensemble Learning
Anik Mahmud Shanto | Mst. Sanjida Jamal Priya | Mohammad Shamsul Arefin

Identifying languages written in Devanagari script, including Hindi, Marathi, Nepali, Bhojpuri, and Sanskrit, is essential in multilingual contexts but challenging due to the high overlap between these languages. To address this, a shared task on “Devanagari Script Language Identification” has been organized, with a dataset available for subtask A to test language identification models. This paper introduces an ensemble-based approach that combines mBERT, XLM-R, and IndicBERT models through majority voting to improve language identification accuracy across these languages. Our ensemble model achieved an impressive accuracy of 99.68%, outperforming individual models by capturing a broader range of language features and reducing model biases that often arise from closely related linguistic patterns. Additionally, we fine-tuned other transformer models as part of a comparative analysis, providing further validation of the ensemble’s effectiveness. The results highlight the ensemble model’s ability to distinguish similar languages within the Devanagari script, offering a promising approach for accurate language identification in complex multilingual contexts.
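
A minimal sketch of majority voting over per-model predictions, as in the described ensemble; the tie-breaking rule (fall back to the first model’s vote) is our assumption:

```python
# Majority-vote ensembling of label predictions from several models.
from collections import Counter

def majority_vote(*model_preds):
    ensembled = []
    for votes in zip(*model_preds):
        top, freq = Counter(votes).most_common(1)[0]
        ensembled.append(top if freq > 1 else votes[0])  # tie -> first model
    return ensembled

# e.g. final = majority_vote(mbert_preds, xlmr_preds, indicbert_preds)
```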

pdf bib
byteSizedLLM@NLU of Devanagari Script Languages 2025: Hate Speech Detection and Target Identification Using Customized Attention BiLSTM and XLM-RoBERTa Base Embeddings
Rohith Gowtham Kodali | Durga Prasad Manukonda | Daniel Iglesias

This paper presents a novel approach to hate speech detection and target identification across Devanagari-script languages, with a focus on Hindi and Nepali. Leveraging an Attention BiLSTM-XLM-RoBERTa architecture, our model effectively captures language-specific features and sequential dependencies crucial for multilingual natural language understanding (NLU). In Task B (Hate Speech Detection), our model achieved a Macro F1 score of 0.7481, demonstrating its robustness in identifying hateful content across linguistic variations. For Task C (Target Identification), it reached a Macro F1 score of 0.6715, highlighting its ability to classify targets into “individual,” “organization,” and “community” with high accuracy. Our work addresses the gap in Devanagari-scripted multilingual hate speech analysis and sets a benchmark for future research in low-resource language contexts.

pdf bib
byteSizedLLM@NLU of Devanagari Script Languages 2025: Language Identification Using Customized Attention BiLSTM and XLM-RoBERTa base Embeddings
Durga Prasad Manukonda | Rohith Gowtham Kodali

This study explores the challenges of natural language understanding (NLU) in multilingual contexts, focusing on Devanagari-scripted languages such as Nepali, Marathi, Sanskrit, Bhojpuri, and Hindi. Language identification within these languages is complex due to their structural and lexical similarities. We present a hybrid Attention BiLSTM-XLM-RoBERTa model, achieving a state-of-the-art F1 score of 0.9974 on the test set, despite limited resources. Our model effectively distinguishes between closely related Devanagari-scripted languages, providing a solid foundation for context-aware NLU systems that enhance language-specific processing and promote inclusive digital interactions across diverse linguistic communities.

pdf bib
CUET_Big_O@NLU of Devanagari Script Languages 2025: Identifying Script Language and Detecting Hate Speech Using Deep Learning and Transformer Model
Md. Refaj Hossan | Nazmus Sakib | Md. Alam Miah | Jawad Hossain | Mohammed Moshiul Hoque

Text-based hate speech is prevalent and is usually used to incite hostility and violence. Detecting this content is imperative, yet challenging, particularly for low-resource languages in the Devanagari script, which lack the extensive labeled datasets required for effective machine learning. To address this, a shared task was organized for identifying hate speech targets in Devanagari-script text. The task involves classifying targets such as individuals, organizations, and communities, and identifying the different languages within the script. We explored several machine learning methods such as LR, SVM, MNB, and Random Forest; deep learning models using CNN, BiLSTM, GRU, and CNN+BiLSTM; and transformer-based models like Indic-BERT, m-BERT, Verta-BERT, XLM-R, and MuRIL. The CNN with BiLSTM yielded the best performance (F1-score of 0.9941), placing the team 13th in the competition for script identification. Furthermore, the fine-tuned MuRIL-BERT model resulted in an F1 score of 0.6832, ranking us 4th for detecting hate speech targets.

pdf bib
CUET_HateShield@NLU of Devanagari Script Languages 2025: Transformer-Based Hate Speech Detection in Devanagari Script Languages
Sumaiya Rahman Aodhora | Shawly Ahsan | Mohammed Moshiul Hoque

Social media has become a vital platform for information exchange and free expression, yet its open nature also contributes to the spread of harmful content, including hate speech, cyberbullying, and offensive language, posing serious risks to societal well-being. Such content is linked to adverse impacts, including mental health issues. This study aims to develop an automated system for detecting hate speech in Devanagari-script languages, enabling efficient moderation and prompt intervention. Our approach utilizes a fine-tuned transformer model to classify offensive content. We experimented with various machine learning (Logistic Regression, SVM, Ensemble methods) and deep learning architectures (CNN, BiLSTM, CNN-BiLSTM) alongside transformer-based models (Indic-SBERT, m-BERT, MuRIL, XLM-R). Notably, the fine-tuned XLM-RoBERTa model achieved the highest performance, reaching a macro-average F1-score of 0.74, demonstrating its efficacy in detecting hate speech in Devanagari-script languages. However, the model we submitted achieved a macro-average F1-score of 0.73, securing 13th place in the subtask.

pdf bib
CUET_INSights@NLU of Devanagari Script Languages 2025: Leveraging Transformer-based Models for Target Identification in Hate Speech
Farjana Alam Tofa | Lorin Tasnim Zeba | Md Osama | Ashim Dey

Hate speech detection in multilingual content is a challenging problem, especially when it comes to understanding the specific targets of hateful expressions. Identifying the targets of hate speech, whether directed at individuals, organizations, or communities, is crucial for effective content moderation and for understanding the context. A shared task on hate speech detection in Devanagari-script languages organized at CHIPSAL@COLING 2025 allowed us to address the challenge of identifying the target of hate speech in Devanagari-script text. For this task, we experimented with various machine learning (ML) and deep learning (DL) models, including Logistic Regression, Decision Trees, Random Forest, SVM, CNN, LSTM, BiLSTM, and transformer-based models like MiniLM, m-BERT, and Indic-BERT. Our experiments demonstrated that Indic-BERT achieved the highest F1-score of 0.69, ranking 3rd in the shared task. This research contributes to advancing hate speech detection and natural language processing in low-resource languages.

pdf bib
CUFE@NLU of Devanagari Script Languages 2025: Language Identification using fastText
Michael Ibrahim

Language identification is a critical area of research within natural language processing (NLP), particularly in multilingual contexts where accurate language detection can enhance the performance of various applications, such as machine translation, content moderation, and user interaction systems. This paper presents a language identification system developed using fastText. In the CHIPSAL@COLING 2025 Task on Devanagari Script Language Identification, the proposed method achieved first place, with an F1 score of 0.9997.
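
For context, a supervised fastText language identifier of this kind can be trained in a few lines; the file names and hyperparameters below are placeholders, not the submission’s settings:

```python
# Supervised fastText classifier; train.txt holds one example per line
# in fastText's label convention, e.g. "__label__nepali <tweet text>".
import fasttext

model = fasttext.train_supervised(
    input="train.txt", lr=0.5, epoch=25, wordNgrams=2, dim=100
)
labels, probs = model.predict("यो नेपाली वाक्य हो")  # top label + confidence
print(labels[0], probs[0])
```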

pdf bib
Dll5143A@NLU of Devanagari Script Languages 2025: Detection of Hate Speech and Targets Using Hierarchical Attention Network
Ashok Yadav | Vrijendra Singh

Hate speech poses a significant challenge on social networks, particularly in Devanagari-scripted languages, where subtle expressions can lead to harmful narratives. This paper details our participation in the “Shared Task on Natural Language Understanding of Devanagari Script Languages” at CHIPSAL@COLING 2025, addressing hate speech detection and target identification. In Sub-task B, we focused on classifying text as either hate or non-hate to determine the presence of hate speech, while Sub-task C focused on identifying targets, such as individuals, organizations, or communities. We utilized the XLM-RoBERTa model as our base and explored various adaptations, including Adaptive Weighting and Gated Adaptive Weighting methods. Our results demonstrated that the Hierarchical Gated adaptive weighting model achieved 86% accuracy in hate speech detection with a macro F1 score of 0.72, particularly improving performance for minority-class detection. For target detection, the same model achieved 75% accuracy and a 0.69 macro F1 score. Our proposed architecture demonstrated competitive performance, ranking 8th in Subtask B and 11th in Subtask C among all participants.

pdf bib
DSLNLP@NLU of Devanagari Script Languages 2025: Leveraging BERT-based Architectures for Language Identification, Hate Speech Detection and Target Classification
Shraddha Chauhan | Abhinav Kumar

The rapid rise of social media has amplified the spread of harmful and hateful content, making its identification challenging. Contextual semantics matters: prior studies show that context-level semantics is a more trustworthy indicator of hatefulness than word-level semantics for detecting hate speech. This paper examines the usability of transformer-based models for identifying hate speech in code-mixed datasets, including Google MuRIL, LaBSE, XLM-RoBERTa-base, mBERT, and distil-mBERT, largely because of their ability to build high-level representations of complex and context-dense meaning. In addition, we experiment with an ensemble approach covering all of the above models to reach an even higher level of detection performance. The experimental results show that MuRIL achieves the best macro F1-scores in comparison to the other implemented models.

pdf bib
IITR-CIOL@NLU of Devanagari Script Languages 2025: Multilingual Hate Speech Detection and Target Identification in Devanagari-Scripted Languages
Siddhant Gupta | Siddh Singhal | Azmine Toushik Wasi

This work focuses on two subtasks related to hate speech detection and target identification in Devanagari-scripted languages, specifically Hindi, Marathi, Nepali, Bhojpuri, and Sanskrit. Subtask B involves detecting hate speech in online text, while Subtask C requires identifying the specific targets of hate speech, such as individuals, organizations, or communities. We develop a deep neural network built on the pretrained multilingual transformer model ‘ia-multilingual-transliterated-roberta’ by IBM, optimized for classification tasks in multilingual and transliterated contexts. The model leverages contextualized embeddings to handle linguistic diversity, with a classifier head for binary classification. We achieved 88.40% accuracy in Subtask B and 66.11% in Subtask C on the test set.

pdf bib
LLMsAgainstHate@NLU of Devanagari Script Languages 2025: Hate Speech Detection and Target Identification in Devanagari Languages via Parameter Efficient Fine-Tuning of LLMs
Rushendra Sidibomma | Pransh Patwa | Parth Patwa | Aman Chadha | Vinija Jain | Amitava Das

The detection of hate speech has become increasingly important in combating online hostility and its real-world consequences. Despite recent advancements, there is limited research addressing hate speech detection in Devanagari-scripted languages, where resources and tools are scarce. While large language models (LLMs) have shown promise in language-related tasks, traditional fine-tuning approaches are often infeasible given the size of the models. In this paper, we propose a Parameter Efficient Fine tuning (PEFT) based solution for hate speech detection and target identification. We evaluate multiple LLMs on the Devanagari dataset provided by Thapa et al. (2025), which contains annotated instances in 2 languages - Hindi and Nepali. The results demonstrate the efficacy of our approach in handling Devanagari-scripted content. Code will be made publicly available on GitHub following acceptance.

pdf bib
MDSBots@NLU of Devanagari Script Languages 2025: Detection of Language, Hate Speech, and Targets using MURTweet
Prabhat Ale | Anish Thapaliya | Suman Paudel

In multilingual contexts, an automated system for accurate language identification, followed by hate speech detection and target identification, plays a critical role in processing low-resource hate speech data and mitigating its negative impact. This paper presents our approach to the three subtasks in the Shared Task on Natural Language Understanding of Devanagari Script Languages at CHIPSAL@COLING 2025: (i) Language Identification, (ii) Hate Speech Detection, and (iii) Target Identification. Both classical machine learning and multilingual transformer models were explored, where MuRIL Large, trained on undersampled data for subtasks A and B, outperformed the classical models. For subtask C, a hybrid model trained on augmented data achieved superior performance over classical and transformer-based approaches. The top-performing models, named MURTweet for subtasks A and B and NER-MURTweet for subtask C, secured sixth, third, and first rank, respectively, in the competition.

pdf bib
Nepali Transformers@NLU of Devanagari Script Languages 2025: Detection of Language, Hate Speech and Targets
Pilot Khadka | Ankit Bk | Ashish Acharya | Bikram K.c. | Sandesh Shrestha | Rabin Thapa

The Devanagari script, an Indic script used by a diverse range of South Asian languages, presents a significant challenge in Natural Language Processing (NLP) research. The dialect and language variation, complex script features, and limited language-specific tools make development difficult. This shared task aims to address this challenge by bringing together researchers and practitioners to solve three key problems: Language identification, Hate speech detection, and Targets of Hate speech identification. The selected languages- Hindi, Nepali, Marathi, Sanskrit, and Bhojpuri- are widely used in South Asia and represent distinct linguistic structures. In this work, we explore the effectiveness of both machine-learning models and transformer-based models on all three sub-tasks. Our results demonstrate strong performance of the multilingual transformer model, particularly one pre-trained on domain-specific social media data, across all three tasks. The multilingual RoBERTa model, trained on the Twitter dataset, achieved a remarkable accuracy and F1-score of 99.5% on language identification (Task A), 88.3% and 72.5% on Hate Speech detection (Task B), and 68.6% and 61.8% on Hate Speech Target Classification (Task C).

pdf bib
NLPineers@ NLU of Devanagari Script Languages 2025: Hate Speech Detection using Ensembling of BERT-based models
Nadika Poudel | Anmol Guragain | Rajesh Piryani | Bishesh Khanal

This paper explores hate speech detection in Devanagari-scripted languages, focusing on Hindi and Nepali, for Subtask B of the CHIPSAL@COLING 2025 Shared Task. Using a range of transformer-based models such as XLM-RoBERTa, MuRIL, and IndicBERT, we examine their effectiveness in navigating the nuanced boundary between hate speech and free expression. Our best performing model, implemented as an ensemble of multilingual BERT models, achieves a recall of 0.7762 (rank 3/31 in terms of recall) and an F1 score of 0.6914 (rank 17/31). To address class imbalance, we used back-translation for data augmentation and cosine similarity to preserve label consistency after augmentation. This work emphasizes the need for hate speech detection in Devanagari-scripted languages and presents a foundation for further research. We plan to release the code upon acceptance.
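
A sketch of back-translation augmentation with a cosine-similarity consistency filter, roughly as described; the MarianMT checkpoints, sentence encoder, and threshold are illustrative choices, not the paper’s:

```python
# Back-translate Hindi -> English -> Hindi, keep the paraphrase only if
# it stays semantically close to the original (so the label still holds).
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

hi_en = pipeline("translation", model="Helsinki-NLP/opus-mt-hi-en")
en_hi = pipeline("translation", model="Helsinki-NLP/opus-mt-en-hi")
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def backtranslate(text, threshold=0.8):
    en = hi_en(text)[0]["translation_text"]
    back = en_hi(en)[0]["translation_text"]
    sim = util.cos_sim(encoder.encode(text), encoder.encode(back)).item()
    return back if sim >= threshold else None  # drop drifted paraphrases
```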

pdf bib
One_by_zero@ NLU of Devanagari Script Languages 2025: Target Identification for Hate Speech Leveraging Transformer-based Approach
Dola Chakraborty | Jawad Hossain | Mohammed Moshiul Hoque

People often use written words to spread hate aimed at different groups, and such content cannot practically be detected manually. Therefore, developing an automatic system capable of identifying hate speech is crucial. However, creating such a system for a low-resource-language (LRL) script like Devanagari is challenging. Hence, a shared task targeting hate speech identification in the Devanagari script was organized. This work proposes a pre-trained transformer-based model to identify the target of hate speech, classifying it as directed toward an individual, organization, or community. We performed extensive experiments, exploring various machine learning (LR, SVM, and ensemble), deep learning (CNN, LSTM, CNN+BiLSTM), and transformer-based models (IndicBERT, mBERT, MuRIL, XLM-R). Experimental results indicate that the IndicBERT model achieved the highest performance among all models, obtaining a macro F1-score of 0.6785, which placed the team 6th in the task.

pdf bib
Paramananda@NLU of Devanagari Script Languages 2025: Detection of Language, Hate Speech and Targets using FastText and BERT
Darwin Acharya | Sundeep Dawadi | Shivram Saud | Sunil Regmi

This paper presents a comparative analysis of FastText and BERT-based approaches for Natural Language Understanding (NLU) tasks in Devanagari-script languages. We evaluate these models on three critical tasks: language identification, hate speech detection, and target identification across five languages: Nepali, Marathi, Sanskrit, Bhojpuri, and Hindi. Our experiments, conducted on a raw tweet dataset from which only Devanagari-script text was extracted, demonstrate that while both models achieve exceptional performance in language identification (F1 scores > 0.99), they show varying effectiveness in hate speech detection and target identification tasks. FastText with augmented data outperforms BERT in hate speech detection (F1 score: 0.8552 vs 0.5763), while BERT shows superior performance in target identification (F1 score: 0.5785 vs 0.4898). These findings contribute to the growing body of research on NLU for low-resource languages and provide insights into model selection for specific tasks in Devanagari script processing.

pdf bib
SKPD Emergency @ NLU of Devanagari Script Languages 2025: Devanagari Script Classification using CBOW Embeddings with Attention-Enhanced BiLSTM
Shubham Shakya | Saral Sainju | Subham Krishna Shrestha | Prekshya Dawadi | Shreya Khatiwada

Devanagari script, encompassing languages such as Nepali, Marathi, Sanskrit, Bhojpuri, and Hindi, poses identification challenges due to its overlapping character sets and lexical characteristics. To address this, we propose a method that utilizes Continuous Bag of Words (CBOW) embeddings integrated with an attention-enhanced Bidirectional Long Short-Term Memory (BiLSTM) network. Our methodology involves meticulous data preprocessing and the generation of word embeddings to improve the model’s ability. The proposed method achieves an overall accuracy of 99%, significantly outperforming character-level identification approaches. The results reveal high precision across most language pairs, though minor classification confusions persist between closely related languages. Our findings demonstrate the robustness of the CBOW-BiLSTM model for Devanagari script classification and highlight the importance of accurate language identification in preserving linguistic diversity in multilingual environments. Keywords: Language Identification, Devanagari Script, Natural Language Processing, Neural Networks
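
A compact sketch of a classifier of this shape: CBOW vectors (e.g., gensim’s Word2Vec with sg=0) feed an attention-weighted BiLSTM whose pooled output is classified; all dimensions are placeholders, not the paper’s configuration:

```python
# Attention-enhanced BiLSTM over pretrained CBOW embeddings.
import torch
import torch.nn as nn

class AttnBiLSTM(nn.Module):
    def __init__(self, emb_matrix, n_classes, hidden=128):
        super().__init__()
        # emb_matrix: FloatTensor (vocab_size, emb_dim) from CBOW training
        self.emb = nn.Embedding.from_pretrained(emb_matrix, freeze=False)
        self.lstm = nn.LSTM(emb_matrix.size(1), hidden,
                            bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, ids):
        h, _ = self.lstm(self.emb(ids))            # (B, T, 2H)
        w = torch.softmax(self.attn(h), dim=1)     # per-token attention
        return self.fc((w * h).sum(dim=1))         # weighted sum -> logits
```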

up

pdf (full)
bib (full)
Proceedings of the 1st Workshop on Computational Humor (CHum)

pdf bib
Proceedings of the 1st Workshop on Computational Humor (CHum)
Christian F. Hempelmann | Julia Rayz | Tiansi Dong | Tristan Miller

pdf bib
The Exception of Humor: Iconicity, Phonemic Surprisal, Memory Recall, and Emotional Associations
Alexander Kilpatrick | Maria Flaksman

This meta-study explores the relationships between humor, phonemic bigram surprisal, emotional valence, and memory recall. Prior research indicates that words with higher phonemic surprisal are more readily remembered, suggesting that unpredictable phoneme sequences promote long-term memory recall. Emotional valence is another well-documented factor influencing memory, with negative experiences and stimuli typically being remembered more easily than positive ones. Building on existing findings, this study highlights that words with negative associations often exhibit greater surprisal and are easier to recall. Humor, however, presents an exception: while associated with positive emotions, humorous words also display heightened surprisal and enhanced memorability.

pdf bib
Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor
Ashwin Baluja

While Large Language Models (LLMs) have demonstrated impressive natural language understanding capabilities across various text-based tasks, understanding humor has remained a persistent challenge. Humor is frequently multimodal, relying not only on the meaning of the words, but also their pronunciations, and even the speaker’s intonations. In this study, we explore a simple multimodal prompting approach to humor understanding and explanation. We present an LLM with both the text and the spoken form of a joke, generated using an off-the-shelf text-to-speech (TTS) system. Using multimodal cues improves the explanations of humor compared to textual prompts across all tested datasets.

pdf bib
Rule-based Approaches to the Automatic Generation of Puns Based on Given Names in French
Mathieu Dehouck | Marine Delaborde

Humor is a cornerstone of human interactions. Because puns and word plays lie in the margins of phonology, syntax, and semantics, large language models struggle with their generation. In this paper, we present two versions of a tool designed to create a typical kind of French joke known as “Monsieur et Madame” jokes. We then discuss the main challenges and limitations rule-based systems face when creating this kind of pun.

pdf bib
Homophonic Pun Generation in Code Mixed Hindi English
Yash Raj Sarrof

In this study, we investigate Hinglish—a blend of Hindi and English commonly found in informal online communication—with a particular focus on automated pun generation. Our work examines the applicability and adaptability of existing English pun generation pipelines to Hinglish. We assess the pun generation capabilities of Large Language Models (LLMs), particularly GPT-3.5. By employing Chain of Thought prompting and Self-Refine techniques, we identify cross-linguistic homophone detection as a central difficulty. To address this, we propose a novel algorithm for cross-lingual homophone identification and develop a Latin-to-Devanagari transliteration module to leverage the widespread use of Latin-script Hindi in online settings. Building on existing frameworks for pun generation, we incorporate our homophone and transliteration modules to improve output quality. Crowd-sourced human evaluations validate the effectiveness of our approach.

pdf bib
Bridging Laughter Across Languages: Generation of Hindi-English Code-mixed Puns
Likhith Asapu | Prashant Kodali | Ashna Dua | Kapil Rajesh Kavitha | Manish Shrivastava

Puns, as a linguistic phenomenon, hold significant importance in both humor and language comprehension. While extensive research has been conducted in the realm of pun generation in English, there exists a notable gap in the exploration of pun generation within code-mixed text, particularly in Hindi-English code-mixed text. This study addresses this gap by offering a computational method specifically designed to create puns in Hindi-English code-mixed text. In our investigation, we delve into three distinct methodologies aimed at pun generation utilizing pun-alternate word pairs. Furthermore, we introduce HECoP, a novel dataset comprising 2,000 human-annotated sentences, which serves as a foundational resource for training diverse pun detection models. Additionally, we developed a structured pun generation pipeline capable of generating puns from a single input word without relying on predefined word pairs. Through rigorous human evaluations, our study demonstrates the efficacy of our proposed models in generating code-mixed puns. The findings presented herein lay a solid groundwork for future endeavours in pun generation and computational humor within diverse linguistic contexts.

pdf bib
Testing Humor Theory Using Word and Sentence Embeddings
Stephen Skalicky | Salvatore Attardo

A basic prediction of incongruity theory is that semantic scripts in verbal humor should be in a state of incongruity. We test this prediction using a dataset of 1,182 word/phrase pairs extracted from a set of imperfect puns. Incongruity was defined as the cosine distance between the pairs’ word vector representations. We compare these pun distances against similarity metrics computed between the pun words and their synonyms, extracted from WordNet. Results indicate a significantly lower degree of similarity between pun words when compared to their synonyms. Our findings support the basic predictions of incongruity theory and provide computational researchers with a baseline metric to model humorous incongruity.
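The incongruity measure is simple to restate in code. The sketch below, with random stand-in vectors and example word pairs of our own choosing, computes the cosine distance for a pun pair and for a pun word paired with a synonym; the prediction above is that the former should be larger.

```python
# Sketch of the incongruity metric: cosine distance between word vectors.
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Random vectors stand in for real embeddings; the words are illustrative.
rng = np.random.default_rng(0)
vec = {w: rng.normal(size=300) for w in ("pain", "pane", "ache")}

pun_distance = cosine_distance(vec["pain"], vec["pane"])      # pun pair
synonym_distance = cosine_distance(vec["pain"], vec["ache"])  # WordNet synonym
print(pun_distance, synonym_distance)  # theory predicts pun > synonym
```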

pdf bib
Pragmatic Metacognitive Prompting Improves LLM Performance on Sarcasm Detection
Joshua Lee | Wyatt Fong | Alexander Le | Sur Shah | Kevin Han | Kevin Zhu

Sarcasm detection is a significant challenge in sentiment analysis due to the nuanced and context-dependent nature of language. We introduce Pragmatic Metacognitive Prompting (PMP) to improve the performance of Large Language Models (LLMs) in sarcasm detection; PMP leverages principles from pragmatics and reflection, helping LLMs interpret implied meanings, consider contextual cues, and reflect on discrepancies to identify sarcasm. Using state-of-the-art LLMs such as LLaMA-3-8B, GPT-4o, and Claude 3.5 Sonnet, PMP achieves state-of-the-art performance with GPT-4o on the MUStARD and SemEval-2018 benchmarks. This study demonstrates that integrating pragmatic reasoning and metacognitive strategies into prompting significantly enhances LLMs’ ability to detect sarcasm, offering a promising direction for future research in sentiment analysis.
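The abstract does not give the prompt verbatim, but the PMP idea of layering pragmatic analysis and self-reflection before a verdict can be sketched as a template; the wording below is an illustration, not the authors’ prompt.

```python
# Schematic PMP-style prompt; the staging follows the abstract's description
# (pragmatics -> reflection on discrepancies -> decision), wording is assumed.
PMP_TEMPLATE = """You are analysing the following utterance for sarcasm.

Context: {context}
Utterance: {utterance}

Step 1 (pragmatics): State the literal meaning and the meaning likely implied
by the speaker, given the contextual cues.
Step 2 (metacognition): Reflect on any discrepancy between the literal and
implied meanings. Could it signal sarcasm?
Step 3 (decision): Answer "sarcastic" or "not sarcastic" with a one-line
justification."""

def build_pmp_prompt(utterance: str, context: str) -> str:
    return PMP_TEMPLATE.format(utterance=utterance, context=context)

print(build_pmp_prompt("Oh great, another Monday.", "Said while sighing at 7am."))
```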

pdf bib
Can AI Make Us Laugh? Comparing Jokes Generated by Witscript and a Human Expert
Joe Toplyn | Ori Amir

This study compares the funniness of AI-generated jokes and those written by a professional human joke writer, using audience laughter as a direct measure. Prior research has typically relied on numerical ratings, which have limitations. Our findings show that AI-generated jokes elicited as much laughter as human-crafted ones, indicating that advanced AI joke generators can now produce original jokes on par with those of a professional human comedy writer.

pdf bib
Evaluating Human Perception and Bias in AI-Generated Humor
Narendra Nath Joshi

This paper explores human perception of AI-generated humor, examining biases and the ability to distinguish between human and AI-created jokes. Through a between-subjects user study involving 174 participants, we tested hypotheses on quality perception, source identification, and demographic influences. Our findings reveal that AI-generated jokes are rated comparably to human-generated ones, with source blindness improving AI humor ratings. Participants struggled to identify AI-generated jokes accurately, and repeated exposure led to increased appreciation. Younger participants showed more favorable perceptions, while technical background had no significant impact. These results challenge preconceptions about AI’s humor capabilities and highlight the importance of addressing biases in AI content evaluation. We also suggest pathways for enhancing human-AI creative collaboration and underscore the need for transparency and ethical considerations in AI-generated content.

pdf bib
The Theater Stage as Laboratory: Review of Real-Time Comedy LLM Systems for Live Performance
Piotr Mirowski | Kory Mathewson | Boyd Branch

In this position paper, we review the eclectic recent history of academic and artistic works involving computational systems for humor generation, and focus specifically on live performance. We make the case that AI comedy should be evaluated in live conditions, in front of audiences sharing either physical or online spaces, and under real-time constraints. We further suggest that improvised comedy is therefore the perfect substrate for deploying and assessing computational humor systems. Using examples of successful AI-infused shows, we demonstrate that live performance raises three sets of challenges for computational humor generation: 1) questions around robotic embodiment, anthropomorphism and competition between humans and machines, 2) questions around comedic timing and the nature of audience interaction, and 3) questions about the human interpretation of seemingly absurd AI-generated humor. We argue that these questions impact the choice of methodologies for evaluating computational humor, as any such method needs to work around the constraints of live audiences and performance spaces. These interrogations also highlight different types of collaborative relationships between human comedians and AI tools.

pdf bib
The Algorithm is the Message: Computing as a Humor-Generating Mode
Vittorio Marone

This position paper starts from the examination of the “Universal Handbook for Political Speeches,” a satirical manual created during communist Poland as a modular tool to parody propaganda’s rigid linguistic patterns and its absence of meaning, humorously revealing the absurdity of totalitarian “newspeak.” Presented here in English for the first time, the “Handbook” is explored as an analog precursor to computational humor systems. More importantly, this artifact shows that humor, rather than being the product of computing, can also arise from a computationalized, combinatorial structure and process. This shifts the focus to computational algorithms and processes as a mode of humor generation, rather than a tool. That is, computing itself—with its processes, structure, iteration, and combinatorial logic—can be a source of humor, rather than an instrument to fabricate it. The very workings of the machine are what can make us laugh, regardless of what the machine carries or produces. The “Handbook” functions here as a spark for reflection, and hopefully a broader discussion, on how this alternative view may impact the evolution of computational humor and its applications at the dawn of the era of artificial general intelligence.

up

pdf (full)
bib (full)
Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health)

pdf bib
Proceedings of the Second Workshop on Patient-Oriented Language Processing (CL4Health)
Sophia Ananiadou | Dina Demner-Fushman | Deepak Gupta | Paul Thompson

pdf bib
PatientDx: Merging Large Language Models for Protecting Data-Privacy in Healthcare
Jose G. Moreno | Jesus Lovon-Melgarejo | M’rick Robin-Charlet | Christine Damase-Michel | Lynda Tamine

pdf bib
Synthetic Documents for Medical Tasks: Bridging Privacy with Knowledge Injection and Reward Mechanism
Simon Meoni | Éric De La Clergerie | Théo Ryffel

pdf bib
Prefix-Enhanced Large Language Models with Reused Training Data in Multi-Turn Medical Dialogue
Suxue Ma | Zhicheng Yang | Ruei-Sung Lin | Youbao Tang | Ning Zhang | Zhenjie Cao | Yuan Ni | Jing Xiao | Jieke Hou | Peng Chang

Large Language Models have made impressive progress in the medical field. In medical dialogue scenarios, unlike traditional single-turn question-answering tasks, multi-turn doctor-patient dialogue tasks require AI doctors to interact with patients over multiple rounds, where the quality of each response impacts the overall model performance. In this paper, we propose PERT to re-explore the value of multi-turn dialogue training data after the supervised fine-tuning phase by integrating a prefix learning strategy, further enhancing response quality. Our preliminary results show that PERT achieves notable improvements on gynecological data, with an increase of up to 0.22 on a 5-point rating scale.

pdf bib
SpecialtyScribe: Enhancing SOAP note Scribing for Medical Specialties using LLM’s
Sagar Goyal | Eti Rastogi | Fen Zhao | Dong Yuan | Andrew Beinstein

The healthcare industry has accumulated vast amounts of clinical data, much of which has traditionally been unstructured, including medical records, clinical data, patient communications, and visit notes. Clinician-patient conversations form a crucial part of medical records, with the resulting medical note serving as the ground truth for future interactions and treatment plans. Generating concise and accurate SOAP notes is critical for quality patient care and is especially challenging in specialty care, where relevance, clarity, and adherence to clinician preferences are paramount. These requirements make general-purpose LLMs unsuitable for producing high-quality specialty notes. While recent LLMs like GPT-4 and Sonnet 3.5 have shown promise, their high cost, size, latency, and privacy issues remain barriers for many healthcare providers. We introduce SpecialtyScribe, a modular pipeline for generating specialty-specific medical notes. It features three components: an Information Extractor to capture relevant data, a Context Retriever to verify and augment content from transcripts, and a Note Writer to produce high-quality notes. Our framework and in-house models outperform similarly sized open-source models by over 12% on ROUGE metrics. Additionally, these models match top closed-source LLMs’ performance while being under 1% of their size. We specifically evaluate our framework for oncology, with the potential for adaptation to other specialties.

pdf bib
Explainability for NLP in Pharmacovigilance: A Study on Adverse Event Report Triage in Swedish
Luise Dürlich | Erik Bergman | Maria Larsson | Hercules Dalianis | Seamus Doyle | Gabriel Westman | Joakim Nivre

In fields like healthcare and pharmacovigilance, explainability has been raised as one way of approaching regulatory compliance with machine learning and automation. This paper explores two feature attribution methods to explain the predictions of four different classifiers trained to assess the seriousness of adverse event reports. On a global level, we analyse differences between models and how well the features that are important for predicting seriousness align with regulatory criteria for what constitutes serious adverse reactions. In addition, explanations of reports with incorrect predictions are manually explored to find systematic features explaining the misclassification. We find that while all models seemingly learn the importance of relevant concepts for adverse event report triage, the priority of these concepts varies from model to model and between explanation methods, and the analysis of misclassified reports indicates that reporting style may affect prediction outcomes.

pdf bib
When Multilingual Models Compete with Monolingual Domain-Specific Models in Clinical Question Answering
Vojtech Lanz | Pavel Pecina

This paper explores the performance of general-domain multilingual models on the clinical Question Answering (QA) task, to assess the medical support they could offer for languages that lack clinically trained models. To improve model performance, we exploit multilingual data augmentation by translating an English clinical QA dataset into six other languages. We propose a translation pipeline that includes projection of the evidences (answers) into the target languages, and we thoroughly evaluate several multilingual models fine-tuned on the augmented data, in both mono- and multilingual settings. We find that the translation itself and the subsequent QA experiments pose challenges that differ for each of the languages. Finally, we compare the performance of multilingual models with pretrained medical domain-specific English models on the original clinical English test set. Contrary to expectations, we find that monolingual domain-specific pretraining is not always superior to general-domain multilingual pretraining. The source code is available at https://github.com/lanzv/Multilingual-emrQA

pdf bib
Mining Social Media for Barriers to Opioid Recovery with LLMs
Vinu Ekanayake | Md Sultan Al Nahian | Ramakanth Kavuluru

Opioid abuse and addiction remain a major public health challenge in the US. At a broad level, barriers to recovery often take the form of individual, social, and structural issues. However, it is crucial to know the specific barriers patients face to help design better treatment interventions and healthcare policies. Researchers typically discover barriers through focus groups and surveys. While scientists can exercise better control over these strategies, such methods are both expensive and time-consuming, needing repeated studies across time as new barriers emerge. We believe this traditional approach can be complemented by automatically mining social media to determine high-level trends in both well-known and emerging barriers. In this paper, we report on such an effort by mining messages from the r/OpiatesRecovery subreddit to extract, classify, and examine barriers to opioid recovery, with special attention to the COVID-19 pandemic’s impact. Our methods involve multi-stage prompting to arrive at barriers from each post and map them to existing barriers or identify new ones. The new barriers are refined into coherent categories using embedding-based similarity measures and hierarchical clustering. Temporal analysis shows that some stigma-related barriers declined (relative to pre-pandemic), whereas systemic obstacles—such as treatment discontinuity and exclusionary practices—rose significantly during the pandemic. Our method is general enough to be applied to barrier extraction for other substance abuse scenarios (e.g., alcohol or stimulants).

pdf bib
Multimodal Transformers for Clinical Time Series Forecasting and Early Sepsis Prediction
Jinghua Xu | Michael Staniek

Sepsis is a leading cause of death in Intensive Care Units (ICU). Early detection of sepsis is crucial to patient survival. Existing works in the clinical domain focus mainly on directly predicting a ground truth label that is the outcome of a medical syndrome or condition such as sepsis. In this work, we primarily focus on clinical time series forecasting as an intermediate means of solving downstream predictive tasks. We base our work on a strong monomodal baseline and propose multimodal transformers using set functions, fusing both physiological features and texts in electronic health record (EHR) data. Furthermore, we propose hierarchical transformers to effectively represent clinical document time series via an attention mechanism and continuous time encoding. Our multimodal models significantly outperform the baseline on MIMIC-III data by notable margins. Our ablation analysis shows that our atomic approaches to multimodal fusion and hierarchical transformers for document series embedding are effective in forecasting. We further fine-tune the forecasting models with labelled data and find that some of the multimodal models consistently outperform the baseline on the downstream sepsis prediction task.

pdf bib
Comparing representations of long clinical texts for the task of patient-note identification
Safa Alsaidi | Marc Vincent | Olivia Boyer | Nicolas Garcelon | Miguel Couceiro | Adrien Coulet

In this paper, we address the challenge of patient-note identification, which involves accurately matching an anonymized clinical note to its corresponding patient, represented by a set of related notes. This task has broad applications, including duplicate records detection and patient similarity analysis, which require robust patient-level representations. We explore various embedding methods, including Hierarchical Attention Networks (HAN), three-level Hierarchical Transformer Networks (HTN), LongFormer, and advanced BERT-based models, focusing on their ability to process medium-to-long clinical texts effectively. Additionally, we evaluate different pooling strategies (mean, max, and mean_max) for aggregating word-level embeddings into patient-level representations, and we examine the impact of sliding windows on model performance. Our results indicate that BERT-based embeddings outperform traditional and hierarchical models, particularly in processing lengthy clinical notes and capturing nuanced patient representations. Among the pooling strategies, mean_max pooling consistently yields the best results, highlighting its ability to capture critical features from clinical notes. Furthermore, the reproduction of our results on both the MIMIC dataset and the Necker hospital data warehouse illustrates the generalizability of these approaches to real-world applications, emphasizing the importance of both embedding methods and aggregation strategies in optimizing patient-note identification and enhancing patient-level modeling.
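The pooling strategies compared above reduce a sequence of word-level embeddings to one fixed-size vector, and mean_max is simply the concatenation of the other two. A minimal NumPy sketch (dimensions are illustrative):

```python
# Pooling word-level embeddings (seq_len x dim) into one representation.
import numpy as np

def mean_pool(emb):      # emb: (seq_len, dim)
    return emb.mean(axis=0)

def max_pool(emb):       # element-wise max over the sequence
    return emb.max(axis=0)

def mean_max_pool(emb):  # concatenation, yielding a 2*dim vector
    return np.concatenate([mean_pool(emb), max_pool(emb)])

emb = np.random.default_rng(0).normal(size=(120, 768))  # dummy encoder output
print(mean_max_pool(emb).shape)  # (1536,)
```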

pdf bib
MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters
Amin Dada | Osman Koras | Marie Bauer | Amanda Butler | Kaleb Smith | Jens Kleesiek | Julian Friedrich

While increasing patients’ access to medical documents improves medical care, this benefit is limited by varying health literacy levels and complex medical terminology. Large language models (LLMs) offer solutions by simplifying medical information. However, evaluating LLMs for safe and patient-friendly text generation is difficult due to the lack of standardized evaluation resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset created from MIMIC-IV discharge summaries through an automated pipeline combining LLM-based question-answer generation with manual quality checks. We use this dataset to evaluate various LLMs on patient-oriented question-answering. Our findings reveal that general-purpose LLMs frequently surpass biomedical-adapted models, while automated metrics correlate with human judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the development of LLMs to enhance patient understanding and ultimately improve care outcomes.

pdf bib
Using LLMs to improve RL policies in personalized health adaptive interventions
Karine Karine | Benjamin Marlin

Reinforcement learning (RL) is increasingly used in the healthcare domain, particularly for the development of personalized adaptive health interventions. However, RL methods are often applied to this domain using small state spaces to mitigate data scarcity. In this paper, we aim to use Large Language Models (LLMs) to incorporate text-based user preferences and constraints into the updating of the RL policy. The LLM acts as a filter in the action selection. To evaluate our method, we develop a novel simulation environment that generates text-based user preferences and incorporates corresponding constraints that impact behavioral dynamics. We show that our method can take text-based user preferences into account while improving the RL policy, thus improving personalization in adaptive interventions.
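One way to read “the LLM acts as a filter in the action selection” is the loop below: the policy scores candidate actions, and candidates that conflict with the user’s free-text preferences are dropped before selection. The `llm_allows` helper is hypothetical and stands in for an actual LLM call; the keyword check merely keeps the sketch runnable.

```python
# Sketch of LLM-filtered action selection (an assumed reading of the abstract).
def llm_allows(action: str, user_preferences: str) -> bool:
    # Real system: prompt an LLM with the candidate action and the user's
    # stated preferences/constraints. Faked here with a keyword check.
    return not ("evening" in user_preferences and "evening" in action)

def select_action(candidates, q_values, user_preferences):
    allowed = [a for a in candidates if llm_allows(a, user_preferences)]
    if not allowed:          # fall back to the unfiltered policy
        allowed = candidates
    return max(allowed, key=lambda a: q_values[a])

actions = ["walk reminder (morning)", "walk reminder (evening)"]
q = {actions[0]: 0.4, actions[1]: 0.7}
print(select_action(actions, q, "please do not message me in the evening"))
# -> "walk reminder (morning)", despite the lower Q-value
```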

pdf bib
LLM Based Efficient CSR Summarization using Structured Fact Extraction and Feedback
Kunwar Zaid | Amit Sangroya | Lovekesh Vig

Summarizing clinical trial data poses a significant challenge due to the structured, voluminous, and domain-specific nature of clinical tables. While large language models (LLMs) such as ChatGPT, Llama, and DeepSeek demonstrate potential in table-to-text generation, they struggle with raw clinical tables that exceed context length, leading to incomplete, inconsistent, or imprecise summaries. These challenges stem from the structured nature of clinical tables, complex study designs, and the necessity for precise medical terminology. To address these limitations, we propose an end-to-end pipeline that enhances the summarization process by integrating fact selection, ensuring that only the most relevant data points are extracted for summary generation. Our approach also incorporates a feedback-driven refinement mechanism, allowing for iterative improvements based on domain-specific requirements and external expert input. By systematically filtering critical information and refining outputs, our method enhances the accuracy, completeness, and clinical reliability of generated summaries while reducing irrelevant or misleading content. This pipeline significantly improves the usability of LLM-generated summaries for medical professionals, regulators, and researchers, facilitating more efficient interpretation of clinical trial results. Our findings suggest that targeted preprocessing and iterative refinement strategies within the proposed pipeline can mitigate LLM limitations, offering a scalable solution for summarizing complex clinical trial tables.

pdf bib
On Large Foundation Models and Alzheimer’s Disease Detection
Chuyuan Li | Giuseppe Carenini | Thalia Field

Large Foundation Models have displayed incredible capabilities in a wide range of domains and tasks. However, it is unclear whether these models match specialist capabilities without special training or fine-tuning. In this paper, we investigate the innate ability of foundation models to act as neurodegenerative disease specialists. Specifically, we use a language model, Llama-3.1, and a visual language model, Llama3-LLaVA-NeXT, to detect differences in language use between Alzheimer’s Disease patients and healthy controls through the well-known Picture Description task. Results show that Llama is comparable to supervised classifiers, while LLaVA, despite its additional “vision”, lags behind.

pdf bib
Benchmarking IsiXhosa Automatic Speech Recognition and Machine Translation for Digital Health Provision
Abby Blocker | Francois Meyer | Ahmed Biyabani | Joyce Mwangama | Mohammed Ishaaq Datay | Bessie Malila

As digital health becomes more ubiquitous, people from different geographic regions are connected and there is thus a need for accurate language translation services. South Africa presents opportunity and need for digital health innovation, but implementing indigenous translation systems for digital health is difficult due to a lack of language resources. Understanding the accuracy of current models for use in medical translation of indigenous languages is crucial for designers looking to build quality digital health solutions. This paper presents a new dataset with audio and text of primary health consultations for automatic speech recognition and machine translation in South African English and the indigenous South African language of isiXhosa. We then evaluate the performance of well-established pretrained models on this dataset. We found that isiXhosa had limited support in speech recognition models and showed high, variable character error rates for transcription (26-70%). For translation tasks, Google Cloud Translate and ChatGPT outperformed the other evaluated models, indicating large language models can have similar performance to dedicated machine translation models for low-resource language translation.

pdf bib
Preliminary Evaluation of an Open-Source LLM for Lay Translation of German Clinical Documents
Tabea Pakull | Amin Dada | Hendrik Damm | Anke Fleischhauer | Sven Benson | Noëlle Bender | Nicola Prasuhn | Katharina Kaminski | Christoph Friedrich | Peter Horn | Jens Kleesiek | Dirk Schadendorf | Ina Pretzell

Clinical documents are essential to patient care, but their complexity often makes them inaccessible to patients. Large Language Models (LLMs) are a promising solution to support the creation of lay translations of these documents, addressing the infeasibility of manually creating these translations in busy clinical settings. However, the integration of LLMs into medical practice in Germany is challenging due to data scarcity and privacy regulations. This work evaluates an open-source LLM for lay translation in this data-scarce environment using datasets of German synthetic clinical documents and real tumor board protocols. The evaluation framework used combines readability, semantic, and lexical measures with the G-Eval framework. Preliminary results show that zero-shot prompts significantly improve readability (e.g., FREde: 21.4 → 39.3) and few-shot prompts improve semantic and lexical fidelity. However, the results also reveal G-Eval’s limitations in distinguishing between intentional omissions and factual inaccuracies. These findings underscore the need for manual review in clinical applications to ensure both accessibility and accuracy in lay translations. Furthermore, the effectiveness of prompting highlights the need for future work to develop applications that use predefined prompts in the background to reduce clinician workload.

pdf bib
Leveraging External Knowledge Bases: Analyzing Presentation Methods and Their Impact on Model Performance
Hui-Syuan Yeh | Thomas Lavergne | Pierre Zweigenbaum

Integrating external knowledge into large language models has demonstrated potential for performance improvement across a wide range of tasks. This approach is particularly appealing in domain-specific applications, such as in the biomedical field. However, the strategies for effectively presenting external knowledge to these models remain underexplored. This study investigates the impact of different knowledge presentation methods and their influence on model performance. Our results show that inserting knowledge between demonstrations helps the models perform better and enables smaller LLMs (7B) to perform on par with larger LLMs (175B). Our further investigation indicates that the performance improvement, however, comes more from the effect of additional tokens and their positioning than from the relevance of the knowledge.
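The best-performing presentation format, knowledge inserted between demonstrations, amounts to interleaving a snippet before each few-shot example rather than placing all knowledge in one block. A sketch of that layout (the paper compares several variants; this is one of them, and the example content is invented):

```python
# Assembling a prompt with knowledge interleaved between demonstrations.
def build_prompt(demos, knowledge_snippets, query):
    parts = []
    for demo, snippet in zip(demos, knowledge_snippets):
        parts.append(f"Knowledge: {snippet}")            # inserted snippet
        parts.append(f"Q: {demo['q']}\nA: {demo['a']}")  # demonstration
    parts.append(f"Q: {query}\nA:")                      # the actual query
    return "\n\n".join(parts)

demos = [{"q": "What is the function of BRCA1?", "a": "DNA repair."}]
knowledge = ["BRCA1 is a tumour-suppressor gene involved in repairing DNA."]
print(build_prompt(demos, knowledge, "Which gene encodes p53?"))
```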

pdf bib
LT3: Generating Medication Prescriptions with Conditional Transformer
Samuel Belkadi | Nicolo Micheletti | Lifeng Han | Warren Del-Pinto | Goran Nenadic

pdf bib
Explainable ICD Coding via Entity Linking
Leonor Barreiros | Isabel Coutinho | Gonçalo Correia | Bruno Martins

Clinical coding is a critical task in healthcare, although traditional methods for automating clinical coding may not provide sufficient explicit evidence for coders in production environments. This evidence is crucial, as medical coders have to make sure there exists at least one explicit passage in the input health record that justifies the attribution of a code. We therefore propose to reframe the task as an entity linking problem, in which each document is annotated with its set of codes and respective textual evidence, enabling better human-machine collaboration. By leveraging parameter-efficient fine-tuning of Large Language Models (LLMs), together with constrained decoding, we introduce three approaches to solve this problem that prove effective at disambiguating clinical mentions and that perform well in few-shot scenarios.

pdf bib
Will Gen Z users look for evidence to verify QA System-generated answers?
Souma Gayen | Dina Demner-Fushman | Deepak Gupta

The remarkable results shown by medical question-answering systems lead to their adoption in real-life applications. The systems, however, may misinform the users, even when drawing on scientific evidence to ground the results. The quality of the answers may be verified by the users if they analyze the evidence provided by the systems. User interfaces play an important role in engaging the users. While studies of the user interfaces for biomedical literature search and clinical decision support are abundant, little is known about users’ interactions with medical question answering systems and the impact of these systems on health-related decisions. In a study of several different user interface layouts, we found that only a small number of participants followed the links to verify automatically generated answers, independently of the interface design. The users who followed the links made better health-related decisions.

pdf bib
Predicting Chronic Kidney Disease Progression from Stage III to Stage V using Language Models
Zainab Awan | Rafael Henkin | Nick Reynolds | Michael Barnes

pdf bib
Am I eligible? Natural Language Inference for Clinical Trial Patient Recruitment: the Patient’s Point of View
Mathilde Aguiar | Pierre Zweigenbaum | Nona Naderi

Recruiting patients to participate in clinical trials can be challenging and time-consuming. Usually, participation in a clinical trial is initiated by a healthcare professional and proposed to the patient. Promoting clinical trials directly to patients via online recruitment might help to reach them more efficiently. In this study, we address the case where a patient is initiating their own recruitment process and wants to determine whether they are eligible for a given clinical trial, using their own language to describe their medical profile. To study whether this creates difficulties in the patient-trial matching process, we design a new dataset and task, Natural Language Inference for Patient Recruitment (NLI4PR), in which patient-language profiles must be matched to clinical trials. We create it by adapting the TREC 2022 Clinical Trial Track dataset, which provides patients’ medical profiles, and rephrasing them manually using patient language. We also use the associated clinical trial reports where the patients are either eligible or excluded. We prompt several open-source Large Language Models on our task and achieve F1 scores from 56.5 to 71.8 using patient language, against 64.7 to 73.1 for the same task using medical language. When using patient language, we observe only a small loss in performance for the best model, suggesting that having the patient as a starting point could be adopted to help recruit patients for clinical trials. The corpus and code bases are all freely available on our GitHub and HuggingFace repositories.

pdf bib
Towards Understanding LLM-Generated Biomedical Lay Summaries
Rohan Charudatt Salvi | Swapnil Panigrahi | Dhruv Jain | Shweta Yadav | Md. Shad Akhtar

In this paper, we investigate using large language models to generate accessible lay summaries of medical abstracts, targeting non-expert audiences. We assess the ability of models like GPT-4 and LLaMA 3-8B-Instruct to simplify complex medical information, focusing on layness, comprehensiveness, and factual accuracy. Utilizing both automated and human evaluations, we discover that automatic metrics do not always align with human judgments. Our analysis highlights the potential benefits of developing clear guidelines for consistent evaluations conducted by non-expert reviewers. It also points to areas for improvement in the evaluation process and the creation of lay summaries for future research.

pdf bib
Bridging the Gap in Health Literacy: Harnessing the Power of Large Language Models to Generate Plain Language Summaries from Biomedical Texts
Andrés Arias-Russi | Carolina Salazar-Lara | Rubén Manrique

pdf bib
Towards Knowledge-Guided Biomedical Lay Summarization using Large Language Models
Shufan Ming | Yue Guo | Halil Kilicoglu

The massive size, continual growth, and technical jargon in biomedical publications make it difficult for laypeople to stay informed about the latest scientific advances, motivating research on lay summarization of biomedical literature. Large language models (LLMs) are increasingly used for this task. Unlike typical automatic summarization, lay summarization requires incorporating background knowledge not found in a paper and explanations of technical jargon. This study explores the use of MeSH terms (Medical Subject Headings), which represent an article’s main topics, to enhance background information generation in biomedical lay summarization. Furthermore, we introduced a multi-turn dialogue approach that more effectively leverages MeSH terms in the instruction-tuning of LLMs to enhance the quality of lay summaries. The best model improved the state-of-the-art on the eLife test set in terms of the ROUGE-1 score by nearly 2%, with competitive scores in other metrics. These results indicate that MeSH terms can guide LLMs to generate more relevant background information for laypeople. Additionally, evaluation on a held-out dataset, one that was not used during model pre-training, shows that this capability generalizes well to unseen data, further demonstrating the effectiveness of our approach.

pdf bib
A Preliminary Study on NLP-Based Personalized Support for Type 1 Diabetes Management
Sandra Mitrović | Federico Fontana | Andrea Zignoli | Felipe Mattioni Maturana | Christian Berchtold | Daniele Malpetti | Sam Scott | Laura Azzimonti

The proliferation of wearable devices and sports monitoring apps has made tracking physical activity more accessible than ever. For individuals with Type 1 diabetes, regular exercise is essential for managing the condition, making personalized feedback particularly valuable. By leveraging data from physical activity sessions, NLP-generated messages can offer tailored guidance to help users optimize their workouts and make informed decisions. In this study, we assess several open-source pre-trained NLP models for this purpose. Contrary to expectations, our findings reveal that models fine-tuned on medical data or excelling in medical benchmarks do not necessarily produce high-quality messages.

pdf bib
Medication Extraction and Entity Linking using Stacked and Voted Ensembles on LLMs
Pablo Romero | Lifeng Han | Goran Nenadic

pdf bib
Bias in Danish Medical Notes: Infection Classification of Long Texts Using Transformer and LSTM Architectures Coupled with BERT
Mehdi Parviz | Rudi Agius | Carsten Niemann | Rob Van Der Goot

Medical notes contain a wealth of information related to diagnosis, prognosis, and overall patient care that can be used to help physicians make informed decisions. However, like any other dataset drawn from diverse demographics, they may be biased toward certain subgroups or subpopulations. Consequently, any bias in the data will be reflected in the output of the machine learning models trained on them. In this paper, we investigate the existence of such biases in Danish medical notes related to three types of blood cancer, with the goal of classifying whether the medical notes indicate severe infection. By employing a hierarchical architecture that combines a sequence model (Transformer and LSTM) with a BERT model to classify long notes, we uncover biases related to demographics and cancer types. Furthermore, we observe performance differences between hospitals. These findings underscore the importance of investigating bias in critical settings such as healthcare and the urgency of monitoring and mitigating it when developing AI-based systems.

pdf bib
Capturing Patients’ Lived Experiences with Chronic Pain through Motivational Interviewing and Information Extraction
Hadeel R A Elyazori | Rusul Abdulrazzaq | Hana Al Shawi | Isaac Amouzou | Patrick King | Syleah Manns | Mahdia Popal | Zarna Patel | Secili Destefano | Jay Shah | Naomi Gerber | Siddhartha Sikdar | Seiyon Lee | Samuel Acuna | Kevin Lybarger

Chronic pain affects millions, yet traditional assessments often fail to capture patients’ lived experiences comprehensively. In this study, we used a Motivational Interviewing framework to conduct semi-structured interviews with eleven adults experiencing chronic pain and then applied Natural Language Processing (NLP) to their narratives. We developed an annotation schema that integrates the International Classification of Functioning, Disability, and Health (ICF) with Aspect-Based Sentiment Analysis (ABSA) to convert unstructured narratives into structured representations of key patient experience dimensions. Furthermore, we evaluated whether Large Language Models (LLMs) can automatically extract information using this schema. Our findings advance scalable, patient-centered approaches to chronic pain assessment, paving the way for more effective, data-driven management strategies.

pdf bib
Medifact at PerAnsSumm 2025: Leveraging Lightweight Models for Perspective-Specific Summarization of Clinical Q&A Forums
Nadia Saeed

The PerAnsSumm 2025 challenge focuses on perspective-aware healthcare answer summarization (Agarwal et al., 2025). This work proposes a few-shot learning framework using a Snorkel-BART-SVM pipeline for classifying and summarizing open-ended healthcare community question-answering (CQA). An SVM model is trained with weak supervision via Snorkel, enhancing zero-shot learning. Extractive classification identifies perspective-relevant sentences, which are then summarized using a pretrained BART-CNN model. The approach achieved 12th place among 100 teams in the shared task, demonstrating computational efficiency and contextual accuracy. By leveraging pretrained summarization models, this work advances medical CQA research and contributes to clinical decision support systems.

pdf bib
The Manchester Bees at PerAnsSumm 2025: Iterative Self-Prompting with Claude and o1 for Perspective-aware Healthcare Answer Summarisation
Pablo Romero | Libo Ren | Lifeng Han | Goran Nenadic

pdf bib
MNLP at PerAnsSumm: A Classifier-Refiner Architecture for Improving the Classification of Consumer Health User Responses
Jooyeon Lee | Luan Pham | Özlem Uzuner

Community question-answering (CQA) platforms provide a crucial space for users to share experiences, seek medical advice, and exchange health-related information. However, the user-generated content of these platforms, together with the complexity and subjectivity of natural language, poses a significant challenge for the automatic classification of diverse perspectives. The PerAnsSumm shared task involves extracting perspective spans from community users’ answers, classifying them into specific perspective categories (Task A), and then using these perspectives and spans to generate structured summaries (Task B). Our focus is on Task A. To address this challenge, we propose a Classifier-Refiner Architecture (CRA), a two-stage framework designed to enhance classification accuracy. The first stage employs a Classifier to segment user responses into self-contained snippets and assign initial perspective labels along with a binary confidence value. If the classifier is not confident, a secondary Refiner stage is triggered, incorporating retrieval-augmented generation to enhance classification through contextual examples. Our methodology integrates instruction-driven classification, tone definitions, and Chain-of-Thought (CoT) prompting, leading to improved F1 scores compared to single-pass approaches. Experimental evaluations on the Perspective Summarization Dataset (PUMA) demonstrate that our framework improves classification performance by leveraging multi-stage decision-making. Our submission ranked among the top-performing teams, achieving an overall score of 0.6090, with high precision and recall in perspective classification.
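The two-stage control flow of the CRA can be summarised in a few lines: run the classifier, and only invoke the refiner when the confidence flag is low. The `classify` and `refine` helpers below are hypothetical stand-ins for the team’s LLM calls, hard-coded so the sketch runs.

```python
# Sketch of the Classifier-Refiner control flow (helpers are stand-ins).
def classify(snippet: str):
    # Stage 1: instruction-driven classification with tone definitions and
    # CoT prompting; returns (perspective_label, is_confident).
    return "EXPERIENCE", False

def refine(snippet: str, retrieved_examples):
    # Stage 2: re-classification with retrieval-augmented contextual examples.
    return "SUGGESTION"

def classify_with_refinement(snippet, retrieve):
    label, confident = classify(snippet)
    return label if confident else refine(snippet, retrieve(snippet))

print(classify_with_refinement("Try ice packs, it helped me.", lambda s: []))
```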

pdf bib
WisPerMed @ PerAnsSumm 2025: Strong Reasoning Through Structured Prompting and Careful Answer Selection Enhances Perspective Extraction and Summarization of Healthcare Forum Threads
Tabea Pakull | Hendrik Damm | Henning Schäfer | Peter Horn | Christoph Friedrich

Healthcare community question-answering (CQA) forums provide multi-perspective insights into patient experiences and medical advice. Summarizations of these threads must account for these perspectives, rather than relying on a single “best” answer. This paper presents the participation of the WisPerMed team in the PerAnsSumm shared task 2025, which consists of two sub-tasks: (A) span identification and classification, and (B) perspective-based summarization. For Task A, encoder models, decoder-based LLMs, and reasoning-focused models are evaluated under fine-tuning, instruction-tuning, and prompt-based paradigms. The experimental evaluations employing automatic metrics demonstrate that DeepSeek-R1 attains a high proportional recall (0.738) and F1-Score (0.676) in zero-shot settings, though strict boundary alignment remains challenging (F1-Score: 0.196). For Task B, filtering answers by labeling them with perspectives prior to summarization with Mistral-7B-v0.3 enhances summarization. This approach ensures that the model is trained exclusively on relevant data, while discarding non-essential information, leading to enhanced relevance (ROUGE-1: 0.452) and balanced factuality (SummaC: 0.296). The analysis uncovers two key limitations: data imbalance and hallucinations of decoder-based LLMs, with underrepresented perspectives exhibiting suboptimal performance. The WisPerMed team’s approach secured the highest overall ranking in the shared task.

pdf bib
DataHacks at PerAnsSumm 2025: LoRA-Driven Prompt Engineering for Perspective Aware Span Identification and Summarization
Vansh Nawander | Chaithra Reddy Nerella

This paper presents the approach of the DataHacks team in the PerAnsSumm Shared Task at CL4Health 2025, which focuses on perspective-aware summarization of healthcare community question-answering (CQA) forums. Unlike traditional CQA summarization, which relies on the best-voted answer, this task captures diverse perspectives, including ‘cause,’ ‘suggestion,’ ‘experience,’ ‘question,’ and ‘information.’ The task is divided into two subtasks: (1) identifying and classifying perspective-specific spans, and (2) generating perspective-specific summaries. We addressed these tasks using a Large Language Model (LLM), fine-tuning it with different low-rank adaptation (LoRA) configurations to balance performance and computational efficiency under resource constraints. In addition, we experimented with various prompt strategies and analyzed their impact on performance. Our approach achieved a combined average score of 0.42, demonstrating the effectiveness of fine-tuned LLMs with adaptive LoRA configurations for perspective-aware summarization.
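For readers unfamiliar with LoRA, the kind of configuration sweep the abstract describes looks roughly like the snippet below, using the Hugging Face `peft` library; the rank, alpha, target modules, and model name are illustrative choices, not the team’s reported settings.

```python
# Illustrative LoRA setup with peft; values here are assumptions to vary.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # example model
config = LoraConfig(
    r=16,                                 # adapter rank: lower = cheaper, less capacity
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # only a small fraction is trainable
```

Varying `r` and `target_modules` is the usual lever for trading output quality against GPU memory, which appears to be the balance the team explored under its resource constraints.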

pdf bib
LMU at PerAnsSumm 2025: LlaMA-in-the-loop at Perspective-Aware Healthcare Answer Summarization Task 2.2 Factuality
Tanalp Ağustoslu

In this paper, we describe our submission for the shared task on Perspective-aware Healthcare Answer Summarization. Our system consists of two quantized models of the LlaMA family, applied across fine-tuning and few-shot settings. Additionally, we adopt the SumCoT prompting technique to improve the factual correctness of the generated summaries. We show that SumCoT yields more factually accurate summaries, even though this improvement comes at the expense of lower performance on lexical overlap and semantic similarity metrics such as ROUGE and BERTScore. Our work highlights an important trade-off when evaluating summarization models.

pdf bib
Lightweight LLM Adaptation for Medical Summarisation: Roux-lette at PerAnsSumm Shared Task
Anson Antony | Peter Vickers | Suzanne Wendelken

The PerAnsSumm Shared Task at CL4Health@NAACL 2025 focused on Perspective-Aware Summarization of Healthcare Q/A forums, requiring participants to extract and summarize spans based on predefined perspective categories. Our approach leveraged LLM-based zero-shot prompting enhanced by semantically-similar In-Context Learning (ICL) examples. Using Qwen-Turbo with 20 exemplar samples retrieved through NV-Embed-v2 embeddings, we achieved a mean score of 0.58 on Task A (span identification) and Task B (summarization) mean scores of 0.36 in Relevance and 0.28 in Factuality, finishing 12th on the final leaderboard. Notably, our system achieved higher precision in strict matching (0.20) than the top-performing system, demonstrating the effectiveness of our post-processing techniques. In this paper, we detail our ICL approach for adapting Large Language Models to Perspective-Aware Medical Summarization, analyze the improvements across development iterations, and finally discuss both the limitations of the current evaluation framework and future challenges in modeling this task. We release our code for reproducibility.
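Selecting semantically similar ICL examples reduces to nearest-neighbour search in embedding space; the sketch below does this with plain NumPy and random stand-in embeddings (the abstract’s system uses NV-Embed-v2 embeddings and 20 exemplars).

```python
# Sketch: pick the k training examples closest to the query by cosine similarity.
import numpy as np

def top_k_exemplars(query_emb, train_embs, train_examples, k=20):
    sims = train_embs @ query_emb / (
        np.linalg.norm(train_embs, axis=1) * np.linalg.norm(query_emb)
    )
    best = np.argsort(-sims)[:k]          # indices of the k most similar
    return [train_examples[i] for i in best]

rng = np.random.default_rng(0)
train_embs = rng.normal(size=(500, 1024))             # stand-in embeddings
examples = [f"annotated example {i}" for i in range(500)]
print(top_k_exemplars(rng.normal(size=1024), train_embs, examples, k=3))
```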

pdf bib
AICOE at PerAnsSumm 2025: An Ensemble of Large Language Models for Perspective-Aware Healthcare Answer Summarization
Rakshith R | Mohammed Sameer Khan | Ankush Chopra

The PerAnsSumm 2025 shared task at the CL4Health workshop focuses on generating structured, perspective-specific summaries to enhance the accessibility of health-related information. Given a healthcare community QA dataset containing a question, context, and multiple user answers, the task involves identifying relevant perspective categories, extracting spans for these perspectives, and generating concise summaries of the extracted spans. We fine-tuned open-source models such as Llama-3.2 3B, Llama-3.1 8B, and Gemma-2 9B, while also experimenting with proprietary models including GPT-4o, o1, Gemini-1.5 Pro, and Gemini-2 Flash Experimental using few-shot prompting. Our best-performing approach leveraged an ensemble strategy, combining span outputs from o1 (CoT) and Gemini-2 Flash Experimental. For overlapping perspectives, we prioritized Gemini. The final spans were summarized using Gemini, preserving the higher classification accuracy of o1 while leveraging Gemini’s superior span extraction and summarization capabilities. This hybrid method secured fourth place on the final leaderboard among 100 participants and 206 submissions.

pdf bib
LTRC-IIITH at PerAnsSumm 2025: SpanSense - Perspective-specific span identification and Summarization
Sushvin Marimuthu | Parameswari Krishnamurthy

Healthcare community question-answering (CQA) forums have become popular for users seeking medical advice, offering answers that range from personal experiences to factual information. Traditionally, CQA summarization relies on the best-voted answer as a reference summary. However, this approach overlooks the diverse perspectives across multiple responses. Structuring summaries by perspective could better meet users’ informational needs. The PerAnsSumm shared task addresses this by identifying and classifying perspective-specific spans (Task A) and generating perspective-specific summaries from question-answer threads (Task B). In this paper, we present our work on the PerAnsSumm shared task 2025 at the CL4Health Workshop, NAACL 2025. Our system leverages the RoBERTa-large model for identifying perspective-specific spans and the BART-large model for summarization. We achieved a Macro-F1 score of 0.90 and a Weighted-F1 score of 0.92 for classification. For span matching, our strict matching F1 score was 0.21, while proportional matching reached 0.68, resulting in an average Task A score of 0.60. For Task B, we achieved a ROUGE-1 score of 0.40, ROUGE-2 of 0.18, and ROUGE-L of 0.36. Additionally, we obtained a BERTScore of 0.84, METEOR of 0.37, and BLEU of 0.13, resulting in an average Task B score of 0.38. Combining both tasks, our system achieved an overall average score of 49% and ranked 6th on the official leaderboard for the shared task.

pdf bib
YaleNLP @ PerAnsSumm 2025: Multi-Perspective Integration via Mixture-of-Agents for Enhanced Healthcare QA Summarization
Dongsuk Jang | Haoxin Li | Arman Cohan

pdf bib
Abdelmalak at PerAnsSumm 2025: Leveraging a Domain-Specific BERT and LLaMA for Perspective-Aware Healthcare Answer Summarization
Abanoub Abdelmalak

The PerAnsSumm Shared Task - CL4Health@NAACL 2025 aims to enhance healthcare community question-answering (CQA) by summarizing diverse user perspectives. It consists of two tasks: identifying and classifying perspective-specific spans (Task A) and generating structured, perspective-specific summaries from question-answer threads (Task B). The dataset used for this task is the PUMA dataset. For Task A, a COVID-Twitter-BERT model pre-trained on COVID-related text from Twitter was employed, improving the model’s understanding of relevant vocabulary and context. For Task B, LLaMA was utilized in a prompt-based fashion. The proposed approach achieved 9th place in Task A and 16th place overall, with the best proportional classification F1-score of 0.74.

pdf bib
UMB@PerAnsSumm 2025: Enhancing Perspective-Aware Summarization with Prompt Optimization and Supervised Fine-Tuning
Kristin Qi | Youxiang Zhu | Xiaohui Liang

We present our approach to the PerAnsSumm Shared Task, which involves perspective span identification and perspective-aware summarization in community question-answering (CQA) threads. For span identification, we adopt ensemble learning that integrates three transformer models through averaging to exploit individual model strengths, achieving an 82.91% F1-score on test data. For summarization, we design a suite of Chain-of-Thought (CoT) prompting strategies that incorporate keyphrases and guidance to structure summary generation into manageable steps. To further enhance summary quality, we apply prompt optimization using the DSPy framework and supervised fine-tuning (SFT) on Llama-3 to adapt the model to domain-specific data. Experimental results on validation and test sets show that structured prompts with keyphrases and guidance improve the summaries’ alignment with references, while the combination of prompt optimization and fine-tuning yields significant improvements in both relevance and factuality evaluation metrics.
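The span-classification ensemble integrates models “through averaging”, which presumably means averaging class probabilities and taking the argmax; a sketch with random stand-in logits:

```python
# Sketch of probability-averaging over three classifiers (stand-in logits).
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
logits = [rng.normal(size=(4, 5)) for _ in range(3)]  # 3 models, 4 spans, 5 classes

avg_probs = np.mean([softmax(l) for l in logits], axis=0)
print(avg_probs.argmax(axis=-1))  # ensemble prediction per span
```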

pdf bib
Overview of the PerAnsSumm 2025 Shared Task on Perspective-aware Healthcare Answer Summarization
Siddhant Agarwal | Md. Shad Akhtar | Shweta Yadav

This paper presents an overview of the Perspective-aware Answer Summarization (PerAnsSumm) Shared Task on summarizing healthcare answers in Community Question Answering forums, hosted at the CL4Health Workshop at NAACL 2025. In this shared task, we approach healthcare answer summarization with two subtasks: (a) perspective span identification and classification and (b) perspective-based answer summarization (summaries focused on one of the perspective classes). We define a benchmarking setup for comprehensive evaluation of predicted spans and generated summaries. We encouraged participants to explore novel solutions to the proposed problem and received high interest in the task with 23 participating teams and 155 submissions. This paper describes the task objectives, the dataset, the evaluation metrics and our findings. We share the results of the novel approaches adopted by task participants, especially emphasizing the applicability of Large Language Models in this perspective-based answer summarization task.

pdf bib
Bridging the Gap: Inclusive Artificial Intelligence for Patient-Oriented Language Processing in Conversational Agents in Healthcare
Kerstin Denecke

Conversational agents (CAs), such as medical interview assistants, are increasingly used in healthcare settings due to their potential for intuitive user interaction. Ensuring the inclusivity of these systems is critical to providing equitable and effective digital health support. However, the underlying technology, models and data can foster inequalities and exclude certain individuals. This paper explores key principles of inclusivity in patient-oriented language processing (POLP) for healthcare CAs to improve accessibility, cultural sensitivity, and fairness in patient interactions. We outline how considering the six facets of inclusive Artificial Intelligence (AI) shapes POLP within healthcare CAs. Key considerations include leveraging diverse datasets, incorporating gender-neutral and inclusive language, supporting varying levels of health literacy, and ensuring culturally relevant communication. To address these issues, future research in POLP should focus on optimizing conversation structure, enhancing the adaptability of CAs’ language and content, integrating cultural awareness, improving explainability, managing cognitive load, and addressing bias and fairness concerns.

up

pdf (full)
bib (full)
Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2025)

pdf bib
Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2025)
Ayah Zirikly | Andrew Yates | Bart Desmet | Molly Ireland | Steven Bedrick | Sean MacAvaney | Kfir Bar | Yaakov Ophir

pdf bib
Assessing the Reliability and Validity of GPT-4 in Annotating Emotion Appraisal Ratings
Deniss Ruder | Andero Uusberg | Kairit Sirts

Appraisal theories suggest that emotions arise from subjective evaluations of events, referred to as appraisals. The taxonomy of appraisals is quite diverse, and they are usually annotated with Likert-scale ratings in an experiencer-annotator or reader-annotator paradigm. This paper studies GPT-4 as a reader-annotator of 21 specific appraisal ratings in different prompt settings, aiming to evaluate and improve its performance compared to human annotators. We found that GPT-4 is an effective reader-annotator that performs close to or even slightly better than human annotators, and that its results can be significantly improved by majority voting over five completions. GPT-4 also effectively predicts appraisal ratings and emotion labels using a single prompt, but adding instruction complexity results in poorer performance. We also found that longer event descriptions lead to more accurate annotations for both model and human annotator ratings. This work contributes to the growing usage of LLMs in psychology and to strategies for improving GPT-4’s performance in annotating appraisals.

pdf bib
AutoPsyC: Automatic Recognition of Psychodynamic Conflicts from Semi-structured Interviews with Large Language Models
Sayed Hossain | Simon Ostermann | Patrick Gebhard | Cord Benecke | Josef van Genabith | Philipp Müller

Psychodynamic conflicts are persistent, often unconscious themes that shape a person’s behaviour and experiences. Accurate diagnosis of psychodynamic conflicts is crucial for effective patient treatment and is commonly done via long, manually scored semi-structured interviews. Existing automated solutions for psychiatric diagnosis tend to focus on the recognition of broad disorder categories such as depression, and it is unclear to what extent psychodynamic conflicts, which even the patient themselves may not have conscious access to, could be automatically recognised from conversation. In this paper, we propose AutoPsyC, the first method for recognising the presence and significance of psychodynamic conflicts from full-length Operationalized Psychodynamic Diagnostics (OPD) interviews using Large Language Models (LLMs). Our approach combines recent advances in parameter-efficient fine-tuning and Retrieval-Augmented Generation (RAG) with a summarisation strategy to effectively process entire 90-minute-long conversations. In evaluations on a dataset of 141 diagnostic interviews, we show that AutoPsyC consistently outperforms all baselines and ablation conditions on the recognition of four highly relevant psychodynamic conflicts.

pdf bib
The Emotional Spectrum of LLMs: Leveraging Empathy and Emotion-Based Markers for Mental Health Support
Alessandro De Grandi | Federico Ravenda | Andrea Raballo | Fabio Crestani

The increasing demand for mental health services has highlighted the need for innovative solutions, particularly in the realm of psychological conversational AI, where the availability of sensitive data is scarce. In this work, we explored the development of a system tailored for mental health support with a novel approach to psychological assessment based on explainable emotional profiles in combination with empathetic conversational models, offering a promising tool for augmenting traditional care, particularly where immediate expertise is unavailable. Our work can be divided into two main parts, intrinsically connected to each other. First, we present RACLETTE, a conversational system that demonstrates superior emotional accuracy compared to the considered benchmarks in both understanding users’ emotional states and generating empathetic responses during conversations, while progressively building an emotional profile of the user through their interactions. Second, we show how the emotional profiles of a user can be used as interpretable markers for mental health assessment. These profiles can be compared with characteristic emotional patterns associated with different mental disorders, providing a novel approach to preliminary screening and support.

pdf bib
Enhancing Depression Detection via Question-wise Modality Fusion
Aishik Mandal | Dana Atzil-Slonim | Thamar Solorio | Iryna Gurevych

Depression is a highly prevalent and disabling condition that incurs substantial personal and societal costs. Current depression diagnosis involves determining the depression severity of a person through self-reported questionnaires or interviews conducted by clinicians. This often leads to delayed treatment and involves substantial human resources. Thus, several works try to automate the process using multimodal data. However, they usually overlook the following: i) The variable contribution of each modality for each question in the questionnaire and ii) Using ordinal classification for the task. This results in sub-optimal fusion and training methods. In this work, we propose a novel Question-wise Modality Fusion (QuestMF) framework trained with a novel Imbalanced Ordinal Log-Loss (ImbOLL) function to tackle these issues. The performance of our framework is comparable to the current state-of-the-art models on the E-DAIC dataset and enhances interpretability by predicting scores for each question. This will help clinicians identify an individual’s symptoms, allowing them to customise their interventions accordingly. We also make the code for the QuestMF framework publicly available.
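The ImbOLL function itself is defined in the paper; purely as a hedged sketch of what a distance- and frequency-weighted ordinal log-loss can look like, the code below combines ordinal distance weighting with per-class weights. The exponent, weighting scheme, and all names are assumptions, not the authors' definition.

import torch

def ordinal_log_loss(probs, target, class_weights, alpha=1.5):
    # probs: (B, K) softmax outputs over K severity bins
    # target: (B,) integer true bins; class_weights: (K,) inverse-frequency weights
    K = probs.shape[1]
    dist = (torch.arange(K).unsqueeze(0) - target.unsqueeze(1)).abs().float() ** alpha
    # penalise probability mass on wrong bins, more heavily when far from the truth
    loss = -(dist * class_weights * torch.log1p(-probs + 1e-8)).sum(dim=1)
    return loss.mean()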

pdf bib
Linking Language-based Distortion Detection to Mental Health Outcomes
Vasudha Varadarajan | Allison Lahnala | Sujeeth Vankudari | Akshay Raghavan | Scott Feltman | Syeda Mahwish | Camilo Ruggero | Roman Kotov | H. Andrew Schwartz

Recent work has suggested detection of cognitive distortions as an impactful task for NLP in the clinical space, but the connection between language-detected distortions and validated mental health outcomes has been elusive. In this work, we evaluate the co-occurrence of (a) 10 distortions derived from language-based detectors trained over two common distortion datasets with (b) 12 mental health outcomes contained within two new language-to-mental-health datasets: DS4UD and iHiTOP. We find higher rates of distortions for those with greater mental health condition severity (ranging from r = 0.16 for thought disorders to r = 0.46 for depressed mood), and that the specific distortions of should statements and fortune telling were associated with a depressed mood and being emotionally drained, respectively. This suggests that language-based assessments of cognitive distortion could play a significant role in the detection and monitoring of mental health conditions.

pdf bib
Measuring Mental Health Variables in Computational Research: Toward Validated, Dimensional, and Transdiagnostic Approaches
Chen Shani | Elizabeth Stade

Computational mental health research develops models to predict and understand psychological phenomena, but often relies on inappropriate measures of psychopathology constructs, undermining validity. We identify three key issues: (1) reliance on unvalidated measures (e.g., self-declared diagnosis) over validated ones (e.g., diagnosis by clinician); (2) treating mental health constructs as categorical rather than dimensional; and (3) focusing on disorder-specific constructs instead of transdiagnostic ones. We outline the benefits of using validated, dimensional, and transdiagnostic measures and offer practical recommendations for practitioners. Using valid measures that reflect the nature and structure of psychopathology is essential for computational mental health research.

pdf bib
Automatic Scoring of an Open-Response Measure of Advanced Mind-Reading Using Large Language Models
Yixiao Wang | Russel Dsouza | Robert Lee | Ian Apperly | Rory Devine | Sanne van der Kleij | Mark Lee

A rigorous psychometric approach is crucial for the accurate measurement of mind-reading abilities. Traditional scoring methods for such tests, which involve lengthy free-text responses, require considerable time and human effort. This study investigates the use of large language models (LLMs) to automate the scoring of psychometric tests. Data were collected from participants aged 13 to 30 years and scored by trained human coders to establish a benchmark. We evaluated multiple LLMs against human assessments, exploring various prompting strategies to optimize performance and fine-tuning the models using a subset of the collected data to enhance accuracy. Our results demonstrate that LLMs can assess advanced mind-reading abilities with over 90% accuracy on average. Notably, in most test items, the LLMs achieved higher Kappa agreement with the lead coder than two trained human coders, highlighting their potential to reliably score open-response psychometric tests.
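For reference, the Kappa agreement statistic the study reports can be computed with scikit-learn; a minimal sketch with made-up item scores follows (the paper may use a weighted variant).

from sklearn.metrics import cohen_kappa_score

lead_coder = [2, 1, 0, 2, 1, 1, 0, 2]   # illustrative item scores
llm_scores = [2, 1, 0, 2, 1, 0, 0, 2]
# weights="quadratic" would penalise larger ordinal disagreements more
print(cohen_kappa_score(lead_coder, llm_scores))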

pdf bib
Bigger But Not Better: Small Neural Language Models Outperform LLMs in Detection of Thought Disorder
Changye Li | Weizhe Xu | Serguei Pakhomov | Ellen Bradley | Dror Ben-Zeev | Trevor Cohen

Disorganized thinking is a key diagnostic indicator of schizophrenia-spectrum disorders. Recently, clinical estimates of the severity of disorganized thinking have been shown to correlate with measures of how difficult speech transcripts would be for large language models (LLMs) to predict. However, LLMs’ deployment challenges – including privacy concerns, computational and financial costs, and lack of transparency of training data – limit their clinical utility. We investigate whether smaller neural language models can serve as effective alternatives for detecting positive formal thought disorder, using the same sliding window based perplexity measurements that proved effective with larger models. Surprisingly, our results show that smaller models are more sensitive to linguistic differences associated with formal thought disorder than their larger counterparts. Detection capability declines beyond a certain model size and context length, challenging the common assumption of “bigger is better” for LLM-based applications. Our findings generalize across audio diaries and clinical interview speech samples from individuals with psychotic symptoms, suggesting a promising direction for developing efficient, cost-effective, and privacy-preserving screening tools that can be deployed in both clinical and naturalistic settings.
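A minimal sketch of sliding-window perplexity with a small causal LM, assuming the Hugging Face transformers API; the window and stride values are illustrative, not the paper's settings.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def windowed_perplexities(text: str, window: int = 128, stride: int = 64):
    # One perplexity score per overlapping token window of the transcript.
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    scores = []
    for start in range(0, max(len(ids) - window, 1), stride):
        chunk = ids[start:start + window].unsqueeze(0)
        with torch.no_grad():
            loss = model(chunk, labels=chunk).loss  # mean NLL over the window
        scores.append(torch.exp(loss).item())
    return scores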

pdf bib
CFiCS: Graph-Based Classification of Common Factors and Microcounseling Skills
Fabian Schmidt | Karin Hammerfald | Henrik Haaland Jahren | Vladimir Vlassov

Common factors and microcounseling skills are critical to the effectiveness of psychotherapy. Understanding and measuring these elements provides valuable insights into therapeutic processes and outcomes. However, automatic identification of these change principles from textual data remains challenging due to the nuanced and context-dependent nature of therapeutic dialogue. This paper introduces CFiCS, a hierarchical classification framework integrating graph machine learning with pre-trained contextual embeddings. We represent common factors, intervention concepts, and microcounseling skills as a heterogeneous graph, where textual information from ClinicalBERT enriches each node. This structure captures both the hierarchical relationships (e.g., skill-level nodes linking to broad factors) and the semantic properties of therapeutic concepts. By leveraging graph neural networks, CFiCS learns inductive node embeddings that generalize to unseen text samples lacking explicit connections. Our results demonstrate that integrating ClinicalBERT node features and graph structure significantly improves classification performance, especially in fine-grained skill prediction. CFiCS achieves substantial gains in both micro and macro F1 scores across all tasks compared to baselines, including random forests, BERT-based multi-task models, and graph-based methods.

pdf bib
Datasets for Depression Modeling in Social Media: An Overview
Ana-Maria Bucur | Andreea Moldovan | Krutika Parvatikar | Marcos Zampieri | Ashiqur Khudabukhsh | Liviu Dinu

Depression is the most common mental health disorder, and its prevalence increased during the COVID-19 pandemic. As one of the most extensively researched psychological conditions, recent research has increasingly focused on leveraging social media data to enhance traditional methods of depression screening. This paper addresses the growing interest in interdisciplinary research on depression, and aims to support early-career researchers by providing a comprehensive and up-to-date list of datasets for analyzing and predicting depression through social media data. We present an overview of datasets published between 2019 and 2024. We also make the comprehensive list of datasets available online as a continuously updated resource, with the hope that it will facilitate further interdisciplinary research into the linguistic expressions of depression on social media.

pdf bib
Exploratory Study into Relations between Cognitive Distortions and Emotional Appraisals
Navneet Agarwal | Kairit Sirts

In recent years, there has been growing interest in studying cognitive distortions and emotional appraisals from both computational and psychological perspectives. Despite considerable similarities between emotional reappraisal and cognitive reframing as emotion regulation techniques, these concepts have largely been examined in isolation. This research explores the relationship between cognitive distortions and emotional appraisal dimensions, examining their potential connections and relevance for future interdisciplinary studies. To this end, we conduct an exploratory computational study aimed at investigating the relationship between cognitive distortions and emotional appraisals. We show that the patterns of statistically significant relationships between cognitive distortions and appraisal dimensions vary across different distortion categories, giving rise to distinct appraisal profiles for individual distortion classes. Additionally, we analyze the impact of cognitive restructuring on appraisal dimensions, exemplifying the emotion regulation aspect of cognitive restructuring.

pdf bib
Socratic Reasoning Improves Positive Text Rewriting
Anmol Goel | Nico Daheim | Christian Montag | Iryna Gurevych

Reframing a negative into a positive thought is at the crux of several cognitive approaches to mental health and psychotherapy that could be made more accessible by large language model-based solutions. Such reframing is typically non-trivial and requires multiple rationalization steps to uncover the underlying issue of a negative thought and transform it to be more positive. However, this rationalization process is currently neglected by both datasets and models which reframe thoughts in one step. In this work, we address this gap by augmenting open-source datasets for positive text rewriting with synthetically-generated Socratic rationales using a novel framework called SOCRATICREFRAME. SOCRATICREFRAME uses a sequence of question-answer pairs to rationalize the thought rewriting process. We show that such Socratic rationales significantly improve positive text rewriting for different open-source LLMs according to both automatic and human evaluations guided by criteria from psychotherapy research. We validate our framework and the synthetic rationalizations with judgements from domain experts and psychology students in an IRB-approved annotation study. Our findings highlight the potential of utilizing the synergy between LLM reasoning and established psychotherapy techniques to build assistive solutions for reframing negative thoughts.

pdf bib
Synthetic Empathy: Generating and Evaluating Artificial Psychotherapy Dialogues to Detect Empathy in Counseling Sessions
Daniel Cabrera Lozoya | Eloy Hernandez Lua | Juan Alberto Barajas Perches | Mike Conway | Simon D’Alfonso

Natural language processing (NLP) holds potential for analyzing psychotherapy transcripts. Nonetheless, gathering the necessary data to train NLP models for clinical tasks is a challenging process due to patient confidentiality regulations that restrict data sharing. To overcome this obstacle, we propose leveraging large language models (LLMs) to create synthetic psychotherapy dialogues that can be used to train NLP models for downstream clinical tasks. To evaluate the quality of our synthetic data, we trained three multi-task RoBERTa-based bi-encoder models, originally developed by Sharma et al., to detect empathy in dialogues. These models, initially trained on Reddit data, were developed alongside EPITOME, a framework designed to characterize empathetic communication in conversations. We collected and annotated 579 therapeutic interactions between therapists and patients using the EPITOME framework. Additionally, we generated 10,464 synthetic therapeutic dialogues using various LLMs and prompting techniques, all of which were annotated following the EPITOME framework. We conducted two experiments: one where we augmented the original dataset with synthetic data and another where we replaced the Reddit dataset with synthetic data. Our first experiment showed that incorporating synthetic data can improve the F1 score of empathy detection by up to 10%. The second experiment revealed no substantial differences between organic and synthetic data, as their performance remained on par when substituted.

pdf bib
A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG
Arshia Kermani | Veronica Perez-Rosas | Vangelis Metsis

This study presents a systematic comparison of three approaches for the analysis of mental health text using large language models (LLMs): prompt engineering, retrieval augmented generation (RAG), and fine-tuning. Using LLaMA 3, we evaluate these approaches on emotion classification and mental health condition detection tasks across two datasets. Fine-tuning achieves the highest accuracy (91% for emotion classification, 80% for mental health conditions) but requires substantial computational resources and large training sets, while prompt engineering and RAG offer more flexible deployment with moderate performance (40-68% accuracy). Our findings provide practical insights for implementing LLM-based solutions in mental health applications, highlighting the trade-offs between accuracy, computational requirements, and deployment flexibility.

pdf bib
Using LLMs to Aid Annotation and Collection of Clinically-Enriched Data in Bipolar Disorder and Schizophrenia
Ankit Aich | Avery Quynh | Pamela Osseyi | Amy Pinkham | Philip Harvey | Brenda Curtis | Colin Depp | Natalie Parde

Natural Language Processing (NLP) in mental health has largely focused on social media data or classification problems, often shifting focus away from the high caseloads and domain-specific needs of real-world practitioners. This study utilizes a dataset of 644 participants, including those with Bipolar Disorder, Schizophrenia, and Healthy Controls, who completed tasks from a standardized mental health instrument. Clinical annotators labeled this dataset on five clinical variables. These expert annotations demonstrated that contemporary language models, particularly smaller, fine-tuned models, can enhance data collection and annotation with greater accuracy and trust than larger commercial models. We show that these models can effectively capture nuanced clinical variables, offering a powerful tool for advancing mental health research. We also show that for clinically advanced tasks such as domain-specific annotation, large LLMs provide incorrect labels compared to a fine-tuned smaller model.

pdf bib
Overview of the CLPsych 2025 Shared Task: Capturing Mental Health Dynamics from Social Media Timelines
Talia Tseriotou | Jenny Chim | Ayal Klein | Aya Shamir | Guy Dvir | Iqra Ali | Cian Kennedy | Guneet Singh Kohli | Anthony Hills | Ayah Zirikly | Dana Atzil-Slonim | Maria Liakata

We provide an overview of the CLPsych 2025 Shared Task, which focuses on capturing mental health dynamics from social media timelines. Building on CLPsych 2022’s longitudinal modeling approach, this work combines monitoring mental states with evidence and summary generation through four subtasks: (A.1) Evidence Extraction, highlighting text spans reflecting adaptive or maladaptive self-states; (A.2) Well-Being Score Prediction, assigning posts a 1 to 10 score based on social, occupational, and psychological functioning; (B) Post-level Summarization of the interplay between adaptive and maladaptive states within individual posts; and (C) Timeline-level Summarization capturing temporal dynamics of self-states over posts in a timeline. We describe key findings and future directions.

pdf bib
A baseline for self-state identification and classification in mental health data: CLPsych 2025 Task
Laerdon Kim

We present a baseline for the CLPsych 2025 A.1 task: classifying self-states in mental health data taken from Reddit. We use few-shot learning with a 4-bit quantized Gemma 2 9B model (Gemma Team, 2024; Brown et al., 2020; Daniel Han and team, 2023) and a data preprocessing step which first identifies relevant sentences indicating self-state evidence, and then performs a binary classification to determine whether the sentence is evidence of an adaptive or maladaptive self-state. This system outperforms our other method which relies on an LLM to highlight spans of variable length independently. We attribute the performance of our model to the benefits of this sentence chunking step for two reasons: partitioning posts into sentences 1) broadly matches the granularity at which self-states were human-annotated and 2) simplifies the task for our language model to a binary classification problem. Our system placed third out of fourteen systems submitted for Task A.1, earning a test-time recall of 0.579.
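A rough sketch of the two-step preprocessing as described; the splitter and few-shot examples below are stand-ins, and the resulting prompts would be sent to the quantized Gemma 2 9B model.

import re

FEW_SHOT = (
    "Decide whether the sentence reflects an ADAPTIVE or MALADAPTIVE "
    "self-state. Answer with one word.\n"
    "Sentence: I reached out to a friend for help.\nAnswer: ADAPTIVE\n"
    "Sentence: Nothing I do will ever matter.\nAnswer: MALADAPTIVE\n"
)

def split_sentences(post: str) -> list[str]:
    # Simple rule-based splitting; the system's exact segmenter is not shown here.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", post) if s.strip()]

def build_prompts(post: str) -> list[str]:
    # One binary-classification prompt per candidate evidence sentence.
    return [f"{FEW_SHOT}Sentence: {s}\nAnswer:" for s in split_sentences(post)]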

pdf bib
Capturing the Dynamics of Mental Well-Being: Adaptive and Maladaptive States in Social Media
Anastasia Sandu | Teodor Mihailescu | Ana Sabina Uban | Ana-Maria Bucur

This paper describes the contributions of the BLUE team in the CLPsych 2025 Shared Task on Capturing Mental Health Dynamics from Social Media Timelines. We participate in all tasks with three submissions, for which we use two sets of approaches: an unsupervised approach using prompting of various large language models (LLM) with no fine-tuning for this task or domain, and a supervised approach based on several lightweight machine learning models trained to classify sentences for evidence extraction, based on an augmented training dataset sourced from public psychological questionnaires. We obtain the best results for summarization Tasks B and C in terms of consistency, and the best F1 score in Task A.2.

pdf bib
CIOL at CLPsych 2025: Using Large Language Models for Understanding and Summarizing Clinical Texts
Md. Iqramul Hoque | Mahfuz Ahmed Anik | Azmine Toushik Wasi

The increasing prevalence of mental health discourse on social media has created a need for automated tools to assess psychological wellbeing. In this study, we propose a structured framework for evidence extraction, well-being scoring, and summary generation, developed as part of the CLPsych 2025 shared task. Our approach integrates feature-based classification with context-aware language modeling to identify self-state indicators, predict well-being scores, and generate clinically relevant summaries. Our system achieved a recall of 0.56 for evidence extraction, an MSE of 3.89 in well-being scoring, and high consistency scores (0.612 post-level, 0.801 timeline-level) in summary generation, ensuring strong alignment with extracted evidence. Ranking well overall, our framework demonstrates robustness in social media-based mental health monitoring. By providing interpretable assessments of psychological states, our work contributes to early detection and intervention strategies, assisting researchers and mental health professionals in understanding online well-being trends and enhancing digital mental health support systems.

pdf bib
From Evidence Mining to Meta-Prediction: a Gradient of Methodologies for Task-Specific Challenges in Psychological Assessment
Federico Ravenda | Fawzia-Zehra Kara-Isitt | Stephen Swift | Antonietta Mira | Andrea Raballo

Large Language Models are increasingly used in the medical field, particularly in psychiatry, where language plays a fundamental role in diagnosis. This study explores the use of open-source LLMs within the MIND framework. Specifically, we implemented a mixed-methods approach for the CLPsych 2025 shared task: (1) we used a combination of retrieval and few-shot learning approaches to highlight evidence of mental states within the text and to generate comprehensive summaries for post-level and timeline-level analysis, allowing for effective tracking of psychological state fluctuations over time; (2) we developed different types of ensemble methods for well-being score prediction, combining Machine Learning and Optimization approaches on top of zero-shot LLM predictions. Notably, for the latter task, our approach demonstrated the best performance within the competition.

pdf bib
From Posts to Timelines: Modeling Mental Health Dynamics from Social Media Timelines with Hybrid LLMs
Zimu Wang | Hongbin Na | Rena Gao | Jiayuan Ma | Yining Hua | Ling Chen | Wei Wang

Social media data is recognized for its usefulness in the early detection of mental disorders; however, there is a lack of research focused on modeling individuals’ longitudinal mental health dynamics. Moreover, fine-tuning large language models (LLMs) on large-scale, annotated datasets presents challenges due to privacy concerns and the difficulty of data collection and annotation. In this paper, we propose a novel approach for modeling mental health dynamics using hybrid LLMs, where we first apply both classification-based and generation-based models to identify adaptive and maladaptive evidence from individual posts. This evidence is then used to predict well-being scores and generate post-level and timeline-level summaries. Experimental results on the CLPsych 2025 shared task demonstrate the effectiveness of our method, with the generation-based model showing a marked advantage in evidence identification.

pdf bib
Prompt Engineering for Capturing Dynamic Mental Health Self States from Social Media Posts
Callum Chan | Sunveer Khunkhun | Diana Inkpen | Juan Antonio Lossio-Ventura

With the advent of modern Computational Linguistic techniques and the growing societal mental health crisis, we contribute to the field of Clinical Psychology by participating in the CLPsych 2025 shared task. This paper describes the methods and results obtained by the uOttawa team’s submission (which included a researcher from the National Institutes of Health in the USA, in addition to three researchers from the University of Ottawa, Canada). The task consists of four subtasks focused on modeling longitudinal changes in social media users’ mental states and generating accurate summaries of these dynamic self-states. Through prompt engineering of a modern large language model (Llama-3.3-70B-Instruct), the uOttawa team placed first, sixth, fifth, and second, respectively, for each subtask, amongst the other submissions. This work demonstrates the capacity of modern large language models to recognize nuances in the analysis of mental states and to generate summaries through carefully crafted prompting.

pdf bib
Retrieval-Enhanced Mental Health Assessment: Capturing Self-State Dynamics from Social Media Using In-Context Learning
Anson Antony | Annika Schoene

This paper presents our approach to the CLPsych 2025 (Tseriotou et al., 2025) shared task, where our proposed system implements a comprehensive solution using In-Context Learning (ICL) with vector similarity to retrieve relevant examples that guide Large Language Models (LLMs) without task-specific fine-tuning. We leverage ICL to analyze self-states and mental health indicators across three tasks. We developed a pipeline architecture using Ollama, running Llama 3.3 70B locally, with specialized vector databases for post- and timeline-level examples. We experimented with different numbers of retrieved examples (k=5 and k=10) to optimize performance. Our results demonstrate the effectiveness of ICL for clinical assessment tasks, particularly when dealing with limited training data in sensitive domains. The system shows strong performance across all tasks, with particular strength in capturing self-state dynamics.
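A hedged sketch of the retrieval step, using a sentence-transformers encoder as a stand-in for whatever embedding model backs the vector databases.

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder

def top_k_examples(query: str, examples: list[str], k: int = 5) -> list[str]:
    # Rank annotated examples by cosine similarity to the query post.
    vecs = encoder.encode([query] + examples)
    q, ex = vecs[0], vecs[1:]
    sims = ex @ q / (np.linalg.norm(ex, axis=1) * np.linalg.norm(q))
    return [examples[i] for i in np.argsort(-sims)[:k]]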

pdf bib
Self-State Evidence Extraction and Well-Being Prediction from Social Media Timelines
Suchandra Chakraborty | Sudeshna Jana | Manjira Sinha | Tirthankar Dasgupta

This study explores the application of Large Language Models (LLMs) and supervised learning to analyze social media posts from Reddit users, addressing two key objectives: first, to extract adaptive and maladaptive self-state evidence that supports psychological assessment (Task A1); and second, to predict a well-being score that reflects the user’s mental state (Task A2). We propose i) a fine-tuned RoBERTa (Liu et al., 2019) model for Task A1 to identify self-state evidence spans and ii) evaluate two approaches for Task A2: a retrieval-augmented DeepSeek-7B (DeepSeek-AI et al., 2025) model and a Random Forest regression model trained on sentence embeddings. While LLM-based prompting utilizes contextual reasoning, our findings indicate that supervised learning provides more reliable numerical predictions. The RoBERTa model achieves the highest recall (0.602) for Task A1, and Random Forest regression outperforms DeepSeek-7B for Task A2 (MSE: 2.994 vs. 6.610). These results highlight the strengths and limitations of generative vs. supervised methods in mental health NLP, contributing to the development of privacy-conscious, resource-efficient approaches for psychological assessment. This work is part of the CLPsych 2025 shared task (Tseriotou et al., 2025).
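A minimal sketch of the better-performing Task A2 pipeline as described, feeding sentence embeddings to a Random Forest regressor; the encoder and tiny dataset are illustrative stand-ins.

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.ensemble import RandomForestRegressor

encoder = SentenceTransformer("all-MiniLM-L6-v2")

posts = ["I finally went outside today.", "I can't get out of bed anymore."]
scores = np.array([7.0, 2.0])  # made-up well-being labels on a 1-10 scale

X = encoder.encode(posts)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, scores)
print(reg.predict(encoder.encode(["Talking to my therapist helped a lot."])))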

pdf bib
Team ISM at CLPsych 2025: Capturing Mental Health Dynamics from Social Media Timelines using A Pretrained Large Language Model with In-Context Learning
Vu Tran | Tomoko Matsui

We tackle the task by using a pretrained large language model (LLM) and in-context learning with template-based instructions to guide the LLM. To improve generation quality, we employ a two-step procedure: sampling and selection. For the sampling step, we randomly sample a subset of the provided training data for the context of LLM prompting. Next, for the selection step, we map the LLM-generated outputs into a vector space and employ Gaussian kernel density estimation to select the most likely output. The results show that the approach achieves a reasonable level of performance, with room for improvement remaining.
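A sketch of the sample-then-select idea under stated assumptions: the embedding model and the dimensionality-reduction step are mine, since the abstract does not specify the vector mapping.

import numpy as np
from scipy.stats import gaussian_kde
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def select_most_typical(generations: list[str]) -> str:
    # Embed each sampled output, reduce dimensionality so the KDE covariance
    # stays non-singular, and keep the output lying in the densest region.
    emb = encoder.encode(generations)
    low = PCA(n_components=2).fit_transform(emb)
    kde = gaussian_kde(low.T)
    return generations[int(np.argmax(kde(low.T)))]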

pdf bib
Transformer-Based Analysis of Adaptive and Maladaptive Self-States in Longitudinal Social Media Data
Abhin B | Renukasakshi V Patil

The CLPsych workshop, held annually since 2014, promotes the application of computational linguistics to behavioral analysis and neurological health assessment. The CLPsych 2025 shared task, extending the framework of the 2022 iteration, leverages the MIND framework to model temporal fluctuations in mental states. This shared task comprises three sub-tasks, each presenting substantial challenges to natural language processing (NLP) systems, requiring sensitive and precise outcomes in analyzing adaptive and maladaptive behaviors. In this study, we employed a range of modeling strategies tailored to the requirements and expected outputs of each subtask. Our approach mostly utilized traditional language models like BERT, Longformer, and Pegasus, diverging from the prevalent trend of prompt-tuned large language models. We achieved an overall ranking of 13th, with subtask rankings of 8th in Task 1a, 13th in Task 1b, 8th in Task 2, and 7th in Task 3. These results highlight the efficacy of our methods while underscoring areas for further refinement in handling complex behavioral data.

pdf bib
Who We Are, Where We Are: Mental Health at the Intersection of Person, Situation, and Large Language Models
Nikita Soni | August Håkan Nilsson | Syeda Mahwish | Vasudha Varadarajan | H. Andrew Schwartz | Ryan L. Boyd

Mental health is not a fixed trait but a dynamic process shaped by the interplay between individual dispositions and situational contexts. Building on interactionist and constructionist psychological theories, we develop interpretable models to predict well-being and identify adaptive and maladaptive self-states in longitudinal social media data. Our approach integrates person-level psychological traits (e.g., resilience, cognitive distortions, implicit motives) with language-inferred situational features derived from the Situational 8 DIAMONDS framework. We compare these theory-grounded features to embeddings from a psychometrically-informed language model that captures temporal and individual-specific patterns. Results show that our principled, theory-driven features provide competitive performance while offering greater interpretability. Qualitative analyses further highlight the psychological coherence of features most predictive of well-being. These findings underscore the value of integrating computational modeling with psychological theory to assess dynamic mental states in contextually sensitive and human-understandable ways.

up

pdf (full)
bib (full)
Proceedings of the New Horizons in Computational Linguistics for Religious Texts

pdf bib
Proceedings of the New Horizons in Computational Linguistics for Religious Texts
Sane Yagi | Majdi Sawalha | Bayan Abu Shawar | Abdallah T. AlShdaifat | Norhan Abbas | Organizers

pdf bib
Comparative Analysis of Religious Texts: NLP Approaches to the Bible, Quran, and Bhagavad Gita
Mahit Nandan A D | Ishan Godbole | Pranav M Kapparad | Shrutilipi Bhattacharjee

Religious texts have long influenced cultural, moral, and ethical systems, and have shaped societies for generations. Scriptures like the Bible, the Quran, and the Bhagavad Gita offer insights into fundamental human values and societal norms. Analyzing these texts with advanced methods can help improve our understanding of their significance and the similarities or differences between them. This study uses Natural Language Processing (NLP) techniques to examine these religious texts. Latent Dirichlet Allocation (LDA) is used for topic modeling to explore key themes, while GloVe embeddings and Sentence Transformers are used to compare topics between the texts. Sentiment analysis using Valence Aware Dictionary and sEntiment Reasoner (VADER) assesses the emotional tone of the verses, and corpus distance measurement is done to analyze semantic similarities and differences. The findings reveal unique and shared themes and sentiment patterns across the Bible, the Quran, and the Bhagavad Gita, offering new perspectives in computational religious studies.
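The two off-the-shelf pieces of the pipeline, LDA topic modeling and VADER sentiment, can be sketched as follows; the three-verse corpus is a toy stand-in for the scriptures.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

verses = [
    "love thy neighbour as thyself",
    "do not be afraid for I am with you",
    "give to the poor and you will have treasure in heaven",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(verses)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
terms = vec.get_feature_names_out()
for topic in lda.components_:           # top words characterising each theme
    print([terms[i] for i in topic.argsort()[-3:]])

analyzer = SentimentIntensityAnalyzer()
print([analyzer.polarity_scores(v)["compound"] for v in verses])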

pdf bib
Messages from the Quran and the Bible in Mandarin through Factor Analysis with Syntactic and Semantic Tags
Kuanlin Liu

This paper tries to decipher messages from the Mandarin translations of the Quran and the Bible using the multidimensional factor analysis (MDA) approach. Part-of-speech and word-meaning annotations were employed for data tagging. Seven syntactic and six semantic factors derived from the tagging systems demonstrated how the two scriptures are interpreted on the factor score scales. The analyses indicated that both holy books uphold a “persuade” and “preach” style with higher frequencies of imperative, advocative, and explanatory expressions. In addition, both favor the “interpersonal, non-numeric, and indicative” strategies to impress followers and practitioners alike with more elaborative wordings. The factor analysis approach also revealed that the Bible differs from the Quran by adopting more “motion, direction, and transportation” information, reflecting the deviation in their historical and religious backgrounds.

pdf bib
Semantic Analysis of Jurisprudential Zoroastrian Texts in Pahlavi: A Word Embedding Approach for an Extremely Under-Resourced, Extinct Language
Rashin Rahnamoun | Ramin Rahnamoun

Zoroastrianism, one of the earliest known religions, reached its height of influence during the Sassanian period, embedding itself within the governmental structure before the rise of Islam in the 7th century led to a significant shift. Subsequently, a substantial body of Zoroastrian literature in Middle Persian (Pahlavi) emerged, primarily addressing religious, ethical, and legal topics and reflecting Zoroastrian responses to evolving Islamic jurisprudence. The text Šāyist nē šāyist (Licit and Illicit), which is central to this study, provides guidance on purity and pollution, offering insights into Zoroastrian legal principles during the late Sassanian period. This study marks the first known application of machine processing to Book Pahlavi texts, focusing on a jurisprudential Zoroastrian text. A Pahlavi corpus was compiled, and word embedding techniques were applied to uncover semantic relationships within the selected text. Given the lack of digital resources and data standards for Pahlavi, a unique dataset of vocabulary pairs was created for evaluating embedding models, allowing for the selection of optimal methods and hyperparameter settings. By constructing a complex network using these embeddings, and leveraging the scarcity of texts in this field, we used complex network analysis to extract additional information about the features of the text. We applied this approach to the chapters of the Šāyist nē šāyist book, uncovering more insights from each chapter. This approach facilitated the initial semantic analysis of Pahlavi legal concepts, contributing to the computational exploration of Middle Persian religious literature.

pdf bib
Multi-stage Training of Bilingual Islamic LLM for Neural Passage Retrieval
Vera Pavlova

This study examines the use of Natural Language Processing (NLP) technology within the Islamic domain, focusing on developing an Islamic neural retrieval model. By leveraging the robust XLM-R base model, the research employs a language reduction technique to create a lightweight bilingual large language model (LLM). Our approach for domain adaptation addresses the unique challenges faced in the Islamic domain, where substantial in-domain corpora exist only in Arabic while limited in other languages, including English. The work utilizes a multi-stage training process for retrieval models, incorporating large retrieval datasets, such as MS MARCO, and smaller, in-domain datasets to improve retrieval performance. Additionally, we have curated an in-domain retrieval dataset in English by employing data augmentation techniques and involving a reliable Islamic source. This approach enhances the domain-specific dataset for retrieval, leading to further performance gains. The findings suggest that combining domain adaptation and a multi-stage training method for the bilingual Islamic neural retrieval model enables it to outperform monolingual models on downstream retrieval tasks.

pdf bib
Automated Translation of Islamic Literature Using Large Language Models: Al-Shamela Library Application
Mohammad Mohammad Khair | Majdi Sawalha

Large Language Models (LLMs) can be useful tools for translating Islamic literature written in Arabic into several languages, making this complex task technologically feasible and providing high-quality translations at low cost and high speed, enabled by parallel computing. We applied LLM-driven translation automation to a diverse corpus of Islamic scholarly works including the Qur’an, Quranic exegesis (Tafseer), Hadith, and Jurisprudence from the Al-Shamela library. More than 250,000 pages have been translated into English, emphasizing the potential of LLMs to cross language barriers and increase global access to Islamic knowledge. OpenAI’s gpt-4o-mini model was used for the forward translation from Arabic to English with acceptable translation quality. Translation quality validation was achieved by reproducing the Arabic text via back-translation from English using both the OpenAI LLM and an independent Anthropic LLM. Correlating the original source Arabic text and the back-translated Arabic text using a vector-embedding cosine similarity metric demonstrated comparable translation quality between the two models.
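A hedged sketch of the validation step: embed the source and back-translated Arabic and compare with cosine similarity. The multilingual encoder named here is an assumption; the abstract does not commit to one.

import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def back_translation_score(source_ar: str, back_translated_ar: str) -> float:
    # Cosine similarity between the original and round-tripped Arabic text.
    a, b = encoder.encode([source_ar, back_translated_ar])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))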

pdf bib
Automated Authentication of Quranic Verses Using BERT (Bidirectional Encoder Representations from Transformers) based Language Models
Khubaib Amjad Alam | Maryam Khalid | Syed Ahmed Ali | Haroon Mahmood | Qaisar Shafi | Muhammad Haroon | Zulqarnain Haider

The proliferation of Quranic content on digital platforms, including websites and social media, has brought about significant challenges in verifying the authenticity of Quranic verses. The inherent complexity of the Arabic language, with its rich morphology, syntax, and semantics, makes traditional text-processing techniques inadequate for robust authentication. This paper addresses this problem by leveraging state-of-the-art transformer-based language models tailored for Arabic text processing. Our approach involves fine-tuning three transformer architectures, BERT-Base-Arabic, AraBERT, and MarBERT, on a curated dataset containing both authentic and non-authentic verses. Non-authentic examples were created using sentence-BERT, which applies cosine similarity to introduce subtle modifications. Comprehensive experiments were conducted to evaluate the performance of the models. Among the three candidate models, MarBERT, which is specifically designed for handling Arabic dialects, demonstrated superior performance, achieving an F1-score of 93.80%. BERT-Base-Arabic also showed a competitive F1 score of 92.90%, reflecting its robust understanding of Arabic text. The findings underscore the potential of transformer-based models in addressing linguistic complexities inherent in Quranic text and pave the way for developing automated, reliable tools for Quranic verse authentication in the digital era.

pdf bib
MASAQ Parser: A Fine-grained MorphoSyntactic Analyzer for the Quran
Majdi Sawalha | Faisal Alshargi | Sane Yagi | Abdallah T. AlShdaifat | Bassam Hammo

This paper introduces a morphological and syntactic analysis of the Quranic text. In this research, we have constructed the MASAQ dataset, a comprehensive resource designed to address the scarcity of annotated Quranic Arabic corpora and facilitate the development of advanced Natural Language Processing (NLP) models. The Quran, being a cornerstone of classical Arabic, presents unique challenges for NLP due to its sacred nature and complex linguistic features. MASAQ provides a detailed syntactic and morphological annotation of the entire Quranic text, including more than 131K morphological entries and 123K instances of syntactic functions, covering a wide range of grammatical roles and relationships. MASAQ’s unique features include a comprehensive tagset of 72 syntactic roles, detailed morphological analysis, and context-specific annotations. This dataset is particularly valuable for tasks such as dependency parsing, grammar checking, machine translation, and text summarization. The potential applications of MASAQ are vast, ranging from pedagogical uses in teaching Arabic grammar to developing sophisticated NLP tools. By providing a high-quality, syntactically annotated dataset, MASAQ aims to advance the field of Arabic NLP, enabling more accurate and more efficient language processing tools. The dataset is made available under the Creative Commons Attribution 3.0 License, ensuring compliance with ethical guidelines and respecting the integrity of the Quranic text.

pdf bib
Leveraging AI to Bridge Classical Arabic and Modern Standard Arabic for Text Simplification
Shatha Altammami

This paper introduces the Hadith Simplification Dataset, a novel resource comprising 250 pairs of Classical Arabic (CA) Hadith texts and their simplified Modern Standard Arabic (MSA) equivalents. Addressing the lack of resources for simplifying culturally and religiously significant texts, this dataset bridges linguistic and accessibility gaps while preserving theological integrity. The simplifications were generated using a large language model and rigorously verified by an Islamic Studies expert to ensure precision and cultural sensitivity. By tackling the unique lexical, syntactic, and cultural challenges of CA-to-MSA transformation, this resource advances Arabic text simplification research. Beyond religious texts, the methodology developed is adaptable to other domains, such as poetry and historical literature. This work underscores the importance of ethical AI applications in preserving the integrity of religious texts while enhancing their accessibility to modern audiences.

pdf bib
Word boundaries and the morphology-syntax trade-off
Pablo Mosteiro | Damián Blasi

This paper investigates the relationship between syntax and morphology in natural languages, focusing on the relation between the amount of information stored by word structure on the one hand, and word order on the other. In previous work, a trade-off between these was observed in a large corpus covering over a thousand languages, suggesting a dynamic ‘division of labor’ between syntax and morphology, as well as yielding proof for the efficient coding of information in language. In contrast, we find that the trade-off can be explained by differing conventions in orthographic word boundaries. We do so by redefining word boundaries within languages either by increasing or decreasing the domain of wordhood implied by orthographic words. Namely, we paste frequent word-pairs together and split words into their frequently occurring component parts. These interventions yield the same trade-off within languages across word domains as what is observed across languages in the orthographic word domain. This allows us to conclude that the original claims on syntax-morphology trade-offs were spurious and that, more importantly, there does not seem to exist a privileged wordhood domain where within- and across-word regularities yield an optimal or optimized amount of information.
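One reading of the "pasting" intervention, sketched with a made-up frequency threshold (the paper's exact criterion is not reproduced here):

from collections import Counter

def paste_frequent_pairs(tokens: list[str], min_count: int = 100) -> list[str]:
    # Fuse adjacent word pairs that co-occur often, enlarging the word domain;
    # the inverse intervention would split words into frequent component parts.
    pairs = Counter(zip(tokens, tokens[1:]))
    frequent = {p for p, c in pairs.items() if c >= min_count}
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in frequent:
            out.append(tokens[i] + "_" + tokens[i + 1])  # one fused "word"
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out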

up

pdf (full)
bib (full)
Proceedings of the 5th Celtic Language Technology Workshop

pdf bib
Proceedings of the 5th Celtic Language Technology Workshop
Brian Davis | Theodorus Fransen | Elaine Uí Dhonnchadha | Abigail Walsh

pdf bib
An Assessment of Word Separation Practices in Old Irish Text Resources and a Universal Method for Tokenising Old Irish Text
Adrian Doyle | John P. McCrae

The quantity of Old Irish text which survives in contemporary manuscripts is relatively small by comparison to what is available for well-resourced modern languages. Moreover, as it is a historical language, no more text will ever be generated by native speakers of Old Irish. This makes the text which has survived particularly valuable, and ideally, all of it would be annotated using a single, common annotation standard, thereby ensuring compatibility between text resources. At present, Old Irish text repositories separate words or sub-word morphemes in accordance with different methodologies, and each uses a different style of lexical annotation. This makes it difficult to utilise content from more than any one repository in NLP applications. This paper provides an assessment of distinctions between existing annotated corpora, showing that the primary point of divergence is at the token level. For this reason, this paper also describes a new method for tokenising Old Irish text. This method can be applied even to diplomatic editions, and has already been utilised in various text resources.

pdf bib
Synthesising a Corpus of Gaelic Traditional Narrative with Cross-Lingual Text Expansion
William Lamb | Dongge Han | Ondrej Klejch | Beatrice Alex | Peter Bell

Advances in large language modelling have disproportionately benefited high-resource languages due to their vastly greater training data reserves. This paper proposes a novel cross-lingual text expansion (XLTE) technique using multilingual large language models (MLLMs) to mitigate data sparsity in low-resource languages. We apply XLTE to the domain of traditional Scottish Gaelic storytelling to generate a training corpus suitable for language modelling, for example as part of an automatic speech recognition system. The effectiveness of this technique is demonstrated using OpenAI’s GPT-4o, with supervised fine-tuning (SFT) providing decreased neologism rates and a 57.2% reduction in perplexity over the baseline model. Despite these promising results, qualitative analyses reveal important stylistic divergences between synthesised and genuine data. Nevertheless, XLTE offers a promising, scalable method for synthesising training sets in other languages and domains, opening avenues for further improvements in low-resource language modelling.

pdf bib
A Pragmatic Approach to Using Artificial Intelligence and Virtual Reality in Digital Game-Based Language Learning
Monica Ward | Liang Xu | Elaine Uí Dhonnchadha

Computer-Assisted Language Learning (CALL) applications have many benefits for language learning. However, they can be difficult to develop for low-resource languages such as Irish and the other Celtic languages. It can be difficult to assemble the multidisciplinary team needed to develop CALL resources, and there are fewer language resources available for the language. This paper provides an overview of a pragmatic approach to using Artificial Intelligence (AI) and Virtual Reality (VR) in developing a Digital Game-Based Language Learning (DGBLL) app for Irish. This pragmatic approach was used to develop Cipher, a DGBLL app for Irish (Xu et al., 2022b), where a number of existing resources including text repositories and NLP tools were used. In this paper the focus is on the incorporation of AI technologies, including AI image generation, text-to-speech (TTS), and VR, in a pedagogically informed manner to support language learning in a way that is both challenging and enjoyable. Cipher has been designed to be language independent and can be adapted for various cohorts of learners and for other languages. Cipher has been played and tested in a number of schools in Dublin, and the feedback from teachers and students has been very positive. This paper outlines how AI and VR technologies have been utilised in Cipher and how it could be adapted to other Celtic languages and low-resource languages in general.

pdf bib
Fotheidil: an Automatic Transcription System for the Irish Language
Liam Lonergan | Ibon Saratxaga | John Sloan | Oscar Maharg Bravo | Mengjie Qian | Neasa Ní Chiaráin | Christer Gobl | Ailbhe Ní Chasaide

This paper sets out the first web-based transcription system for the Irish language - Fotheidil, a system that utilises speech-related AI technologies as part of the ABAIR initiative. The system includes both off-the-shelf pre-trained voice activity detection and speaker diarisation models and models trained specifically for Irish automatic speech recognition and capitalisation and punctuation restoration. Semi-supervised learning is explored to improve the acoustic model of a modular TDNN-HMM ASR system, yielding substantial improvements for out-of-domain test sets and dialects that are underrepresented in the supervised training set. A novel approach to capitalisation and punctuation restoration involving sequence-to-sequence models is compared with the conventional approach using a classification model. Experimental results here also show substantial improvements in performance. It is intended that the system will be made freely available for public use; it represents an important resource for researchers and others who transcribe Irish language materials. Human-corrected transcriptions will be collected and included in the training dataset as the system is used, which should lead to incremental improvements to the ASR model in a cyclical, community-driven fashion.

pdf bib
Gaeilge Bhriste ó Shamhlacha Cliste: How Clever Are LLMs When Translating Irish Text?
Teresa Clifford | Abigail Walsh | Brian Davis | Mícheál J. Ó Meachair

Large Language Models have been widely adopted in NLP tasks and applications; however, their ability to accurately process Irish and other minority languages has not been fully explored. In this paper we describe preliminary experiments examining the capacity of publicly-available machine translation engines (Google Translate, Microsoft Bing, and eTranslation) and prompt-based AI systems (ChatGPT 3.5, Llama 2) for translating and handling challenging language features of Irish. A hand-crafted selection of challenging Irish language features was incorporated into translation prompts, and the output from each model was examined by a human evaluator. The results of these experiments indicate that these LLM-based models still struggle with translating rare linguistic phenomena and ambiguous constructions. This preliminary analysis helps to inform further research in this field, providing a simple ranking of publicly-available models and indicating which language features require particular attention when evaluating model capacity.

up

pdf (full)
bib (full)
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

pdf bib
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Tatsuki Kuribayashi | Giulia Rambelli | Ece Takmaz | Philipp Wicke | Jixing Li | Byung-Doh Oh

pdf bib
Linguistic Blind Spots of Large Language Models
Jiali Cheng | Hadi Amiri

Large language models (LLMs) serve as the foundation of numerous AI applications today. However, despite their remarkable proficiency in generating coherent text, questions linger regarding their ability to perform fine-grained linguistic annotation tasks, such as detecting nouns or verbs, or identifying more complex syntactic structures like clauses or T-units in input texts. These tasks require precise syntactic and semantic understanding of input text, and when LLMs underperform on specific linguistic structures, it raises concerns about their reliability for detailed linguistic analysis and whether their (even correct) outputs truly reflect an understanding of the inputs. In this paper, we empirically study the performance of recent LLMs across fine-grained linguistic annotation tasks. Through a series of experiments, we find that recent LLMs show limited efficacy in addressing linguistic queries and often struggle with linguistically complex inputs. We show that the most capable LLM (Llama3-70b) makes notable errors in detecting linguistic structures, such as misidentifying embedded clauses, failing to recognize verb phrases, and confusing complex nominals with clauses. Our study provides valuable insights to inform future endeavors in LLM design and development.

pdf bib
ParaBLoCC: Parallel Basic Locative Constructions Corpus
Peter Viechnicki | Anthony Kostacos

We introduce ParaBLoCC, the Parallel Basic Locative Construction Corpus, the first multilingual compendium of this important grammatico-functional construction, and particularly the first such corpus containing semantically equivalent BLCs in source/target language pairs. The data – taken from bitext corpora in English paired with twenty-six typologically diverse languages – are likely to prove useful for studying questions of cognitive underpinnings and cross-linguistic usage patterns of spatial expressions, as well as for improving multilingual spatial relation extraction and related tasks. The data are being made available at https://github.com/pviechnicki/parablocc.

pdf bib
Capturing Online SRC/ORC Effort with Memory Measures from a Minimalist Parser
Aniello De Santo

A parser for Minimalist grammars (Stabler, 2013) has been shown to successfully model sentence processing preferences across an array of languages and phenomena when combined with complexity metrics that relate parsing behavior to memory usage (Gerth, 2015; Graf et al., 2017; De Santo, 2020, a.o.). This model provides a quantifiable theory of the effects of fine-grained grammatical structure on cognitive cost, and can help strengthen the link between generative syntactic theory and sentence processing. However, work on it has focused on offline asymmetries. Here, we extend this approach by showing how memory-based measures of effort that explicitly consider minimalist-like structure-building operations improve our ability to account for word-by-word (online) behavioral data.

pdf bib
From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy
Adrianna Romanowski | Pedro H. V. Valois | Kazuhiro Fukui

Comedy serves as a profound reflection of the times we live in and is a staple element of human interactions. In light of the widespread adoption of Large Language Models (LLMs), the intersection of humor and AI has become no laughing matter. Advancements in the naturalness of human-computer interaction correlate with improvements in AI systems’ abilities to understand humor. In this study, we assess the ability of models to accurately identify humorous quotes from a stand-up comedy transcript. Stand-up comedy’s unique comedic narratives make it an ideal dataset for improving the overall naturalness of comedic understanding. We propose a novel humor detection metric designed to evaluate LLMs amongst various prompts on their capability to extract humorous punchlines. The metric has a modular structure that offers three different scoring methods - fuzzy string matching, sentence embedding, and subspace similarity - to provide an overarching assessment of a model’s performance. The model’s results are compared against those of human evaluators on the same task. Our metric reveals that regardless of prompt engineering, leading models, ChatGPT, Claude, and DeepSeek, achieve scores of at most 51% in humor detection. Notably, this performance surpasses that of humans, who achieve a score of 41%. The analysis of human evaluators and LLMs reveals variability in agreement, highlighting the subjectivity inherent in humor and the complexities involved in extracting humorous quotes from live performance transcripts.
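Of the metric's three scoring modes, the fuzzy-string-matching one is the simplest to sketch; difflib below is a stand-in for whichever matcher the authors used, and the threshold is an assumption.

from difflib import SequenceMatcher

def fuzzy_hit(predicted: str, gold: str, threshold: float = 0.8) -> bool:
    # Character-level similarity between a predicted and a gold punchline.
    return SequenceMatcher(None, predicted.lower(), gold.lower()).ratio() >= threshold

def punchline_recall(predictions: list[str], gold_quotes: list[str]) -> float:
    # Fraction of gold punchlines matched by at least one prediction.
    hits = sum(any(fuzzy_hit(p, g) for p in predictions) for g in gold_quotes)
    return hits / len(gold_quotes) if gold_quotes else 0.0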

pdf bib
Profiling neural grammar induction on morphemically tokenised child-directed speech
Mila Marcheva | Theresa Biberauer | Weiwei Sun

We investigate the performance of state-of-the-art (SotA) neural grammar induction (GI) models on a morphemically tokenised English dataset based on the CHILDES treebank (Pearl and Sprouse, 2013). Using implementations from Yang et al. (2021a), we train models and evaluate them with the standard F1 score. We introduce novel evaluation metrics—depth-of-morpheme and sibling-of-morpheme—which measure phenomena around bound morpheme attachment. Our results reveal that models with the highest F1 scores do not necessarily induce linguistically plausible structures for bound morpheme attachment, highlighting a key challenge for cognitively plausible GI.

pdf bib
Exploring the Integration of Eye Movement Data on Word Embeddings
Fermín Travi | Gabriel Aimé Leclercq | Diego Fernandez Slezak | Bruno Bianchi | Juan E Kamienkowski

Reading, while structured, is a non-linear process. Readers may skip some words, linger on others, or revisit earlier text. Emerging work has started exploring the incorporation of reading behaviour through eye-tracking into the training of specific language tasks. In this work, we investigate the broader question of how gaze data can shape word embeddings by using text as read by human participants and predicting gaze measures from them. To that end, we conducted an eye-tracking experiment with 76 participants reading 20 short stories in Spanish and fine-tuned Word2Vec and LSTM models on the collected data. Evaluations with representational similarity analysis and word pair similarities showed a limited, but largely consistent, gain from gaze incorporation, suggesting future work should expand linguistic diversity and use cognitively aligned evaluations to better understand its role in bridging computational and human language representations.

pdf bib
Unzipping the Causality of Zipf’s Law and Other Lexical Trade-offs
Amanda Doucette | Timothy J. O’Donnell | Morgan Sonderegger

There are strong constraints on the structure of a possible lexicon. For example, the negative correlation between word frequency and length known as Zipf’s law, and the negative correlation between word length and phonotactic complexity, appear to hold across languages. While lexical trade-offs like these have been examined individually, it is unclear how they interact as a system. In this paper, we propose causal discovery as a method for identifying lexical biases and their interactions in a set of variables. We represent the lexicon as a causal model, and apply the Fast Causal Inference (FCI) algorithm (Spirtes et al., 1995) to identify both causal relationships between measured variables and the existence of possible unmeasured confounding variables. We apply this method to lexical data including measures of word length, frequency, phonotactic complexity, and morphological irregularity for 25 languages. We find evidence of universal associations involving word length that are highly likely to involve an unmeasured confounder, suggesting that additional variables need to be measured to determine how they are related. We also find evidence of variation across languages in the relationships between the remaining variables, and suggest that, given a larger dataset, causal discovery algorithms can be a useful tool for assessing the universality of lexical biases.
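
For readers who want to experiment with this kind of analysis, a minimal sketch follows. It assumes the open-source causal-learn package's FCI implementation and a placeholder CSV of per-language lexical measures; the abstract does not name the authors' actual tooling, so treat this as one plausible setup:

```python
# Illustrative FCI run with the `causal-learn` package (an assumption).
import numpy as np
from causallearn.search.ConstraintBased.FCI import fci

# Rows = lexical samples; columns = measured variables, e.g.
# [length, frequency, phonotactic_complexity, morph_irregularity].
data = np.loadtxt("lexicon_measures.csv", delimiter=",", skiprows=1)

# FCI returns a partial ancestral graph (PAG); circle endpoints in the
# PAG signal possible unmeasured confounding between variables.
g, edges = fci(data, alpha=0.05)
for edge in edges:
    print(edge)
```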

pdf bib
Quantifying Semantic Functional Specialization in the Brain Using Encoding Models of Natural Language
Jiaqi Chen | Richard Antonello | Kaavya Chaparala | Coen Arrow | Nima Mesgarani

Although functional specialization in the brain - a phenomenon where different regions process different types of information - is well documented, we still lack precise mathematical methods with which to measure it. This work proposes a technique to quantify how brain regions respond to distinct categories of information. Using a topic encoding model, we identify brain regions that respond strongly to specific semantic categories while responding minimally to all others. We then use a language model to characterize the common themes across each region’s preferred categories. Our technique successfully identifies previously known functionally selective regions and reveals consistent patterns across subjects while also highlighting new areas of high specialization worthy of further study.
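
The encoding-model logic described above can be sketched compactly. The snippet below is a simplified stand-in, assuming per-timepoint topic features and voxel responses are already extracted; the file names and the selectivity index are illustrative assumptions, not the authors' code:

```python
# Minimal sketch of a topic encoding model with a selectivity index.
import numpy as np
from sklearn.linear_model import RidgeCV

X = np.load("topic_features.npy")   # (n_timepoints, n_topics)
Y = np.load("brain_responses.npy")  # (n_timepoints, n_voxels)

model = RidgeCV(alphas=np.logspace(-2, 4, 13)).fit(X, Y)
W = model.coef_                     # (n_voxels, n_topics): topic weights per voxel

# Selectivity: a voxel responds strongly to one category, minimally to others.
top = W.max(axis=1)
rest = (W.sum(axis=1) - top) / (W.shape[1] - 1)
selectivity = top - rest            # high values = candidate specialized regions
print("most selective voxels:", np.argsort(selectivity)[-10:])
```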

pdf bib
“Is There Anything Else?”: Examining Administrator Influence on Linguistic Features from the Cookie Theft Picture Description Cognitive Test
Changye Li | Zhecheng Sheng | Trevor Cohen | Serguei V. S. Pakhomov

Alzheimer’s Disease (AD) dementia is a progressive neurodegenerative disease that negatively impacts patients’ cognitive ability. Previous studies have demonstrated that changes in naturalistic language samples can be useful for early screening of AD dementia. However, the nature of language deficits often requires test administrators to use various speech elicitation techniques during spontaneous language assessments to obtain enough propositional utterances from dementia patients. This could lead to an “observer’s effect” on the downstream analysis that has not been fully investigated. Our study seeks to quantify the influence of test administrators on linguistic features in dementia assessment using two English “Cookie Theft” picture description corpora, collected at different locations and whose test administrators show different levels of involvement. Our results show that the level of test administrator involvement significantly impacts observed linguistic features in patient speech. These results suggest that many of the significant linguistic features in the downstream classification task may be partially attributable to differences in test administration practices rather than solely to participants’ cognitive status. Variations in test administrator behavior can lead to systematic biases in linguistic data, potentially confounding research outcomes and clinical assessments. Our study suggests the need for a more standardized test administration protocol in the development of responsible clinical speech analytics frameworks.

pdf bib
Cross-Framework Generalizable Discourse Relation Classification Through Cognitive Dimensions
Yingxue Fu

Existing discourse corpora annotated under different frameworks adopt distinct but somewhat related taxonomies of relations. How to integrate discourse frameworks has been an open research question. Previous studies on this topic are mainly theoretical, although such research is typically performed with the hope of benefiting computational applications. In this paper, we show how the proposal by Sanders et al. (2018) based on the Cognitive approach to Coherence Relations (CCR) (Sanders et al., 1992, 1993) can be used effectively to facilitate cross-framework discourse relation (DR) classification. To address the challenges of using predicted unified dimensions (UDims) for DR classification, we adopt the Bayesian learning framework based on Monte Carlo dropout (Gal and Ghahramani, 2016) to obtain more robust predictions. Data augmentation enabled by our proposed method yields strong performance (55.75 for RST and 55.01 for PDTB implicit DR classification in macro-averaged F1). We compare four model designs and analyze the experimental results from different perspectives. Our study shows an effective and cross-framework generalizable approach for DR classification, filling a gap in existing studies.
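
Monte Carlo dropout, the robustness technique cited above, can be sketched in a few lines: dropout stays active at inference time, and predictions are averaged over repeated stochastic forward passes. The snippet below is a generic PyTorch illustration under that assumption, not the paper's implementation:

```python
# Sketch of Monte Carlo dropout (Gal and Ghahramani, 2016) for robust
# classification; `model` is any classifier with dropout layers.
import torch

def mc_dropout_predict(model, x, n_samples: int = 30):
    model.eval()
    # Re-enable dropout layers only, keeping e.g. batch norm in eval mode.
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(n_samples)])
    return probs.mean(0), probs.std(0)  # predictive mean and uncertainty
```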

pdf bib
Distinct social-linguistic processing between humans and large audio-language models: Evidence from model-brain alignment
Hanlin Wu | Xufeng Duan | Zhenguang Cai

Voice-based AI development faces unique challenges in processing both linguistic and paralinguistic information. This study compares how large audio-language models (LALMs) and humans integrate speaker characteristics during speech comprehension, asking whether LALMs process speaker-contextualized language in ways that parallel human cognitive mechanisms. We compared two LALMs’ (Qwen2-Audio and Ultravox 0.5) processing patterns with human EEG responses. Using surprisal and entropy metrics from the models, we analyzed their sensitivity to speaker-content incongruency across social stereotype violations (e.g., a man claiming to regularly get manicures) and biological knowledge violations (e.g., a man claiming to be pregnant). Results revealed that Qwen2-Audio exhibited increased surprisal for speaker-incongruent content and its surprisal values significantly predicted human N400 responses, while Ultravox 0.5 showed limited sensitivity to speaker characteristics. Importantly, neither model replicated the human-like processing distinction between social violations (eliciting N400 effects) and biological violations (eliciting P600 effects). These findings reveal both the potential and limitations of current LALMs in processing speaker-contextualized language, and suggest differences in social-linguistic processing mechanisms between humans and LALMs.
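
The surprisal and entropy metrics used in this comparison can be read off any autoregressive model's next-token distribution. Below is a minimal text-only sketch using GPT-2 as a stand-in, since the audio front-ends of the LALMs studied are beyond a short example; the probe sentence is illustrative:

```python
# Per-token surprisal and entropy (in bits) from a causal LM.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

ids = tok("I regularly get manicures", return_tensors="pt").input_ids
with torch.no_grad():
    logits = lm(ids).logits                      # (1, seq_len, vocab)

log2 = torch.log(torch.tensor(2.0))
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
targets = ids[0, 1:]
surprisal = -log_probs[torch.arange(len(targets)), targets] / log2
entropy = -(log_probs.exp() * log_probs).sum(-1) / log2
print(surprisal, entropy)
```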

pdf bib
SPACER: A Parallel Dataset of Speech Production And Comprehension of Error Repairs
Shiva Upadhye | Jiaxuan Li | Richard Futrell

Speech errors are a natural part of communication, yet they rarely lead to complete communicative failure because both speakers and comprehenders can detect and correct errors. Although prior research has examined error monitoring and correction in production and comprehension separately, integrated investigation of both systems has been impeded by the scarcity of parallel data. In this study, we present SPACER, a parallel dataset that captures how naturalistic speech errors are corrected by both speakers and comprehenders. We focus on single-word substitution errors extracted from the Switchboard speech corpus, accompanied by speakers’ self-repairs and comprehenders’ responses from an offline text-editing experiment. Our exploratory analysis suggests asymmetries in error correction strategies: speakers are more likely to repair errors that introduce greater semantic and phonemic deviations, whereas comprehenders tend to correct errors that are phonemically similar to more plausible alternatives or do not fit prior contexts. Our dataset enables future research on an integrated approach to language production and comprehension.

pdf bib
Are Larger Language Models Better at Disambiguation?
Ziyuan Cao | William Schuler

Humans deal with temporary syntactic ambiguity all the time in incremental sentence processing. Sentences with temporary ambiguity that causes processing difficulties, often reflected by an increase in reading time, are referred to as garden-path sentences. Garden-path theories of sentence processing attribute the increases in reading time to reanalysis of the previously ambiguous syntactic structure to make it consistent with the new disambiguating text. It is unknown whether transformer-based language models successfully resolve the temporary ambiguity after encountering the disambiguating text. We investigated this question by analyzing completions generated by language models for a type of garden-path sentence with ambiguity between a complement clause interpretation and a relative clause interpretation. We found that larger language models are worse at resolving such ambiguity.

pdf bib
Towards a Bayesian hierarchical model of lexical processing
Cassandra L Jacobs | Loïc Grobol

In cases of pervasive uncertainty, cognitive systems benefit from heuristics or from committing to more general hypotheses. Here we present a hierarchical cognitive model of lexical processing that synthesizes advances in early rational cognitive models with modern-day neural architectures. Probabilities of higher-order categories, derived from representations extracted from the middle layers of an encoder language model, have predictive power in accounting for several reading measures for both predicted and unpredicted words, and influence even early first fixation duration behavior. The results suggest that lexical processing can take place within a latent, but nevertheless discrete, space in cases of uncertainty.

pdf bib
Modeling Chinese L2 Writing Development: The LLM-Surprisal Perspective
Jingying Hu | Yan Cong

LLM-surprisal is a computational measure of how unexpected a word or character is given the preceding context, as estimated by large language models (LLMs). This study investigated the effectiveness of LLM-surprisal in modeling second language (L2) writing development, focusing on Chinese L2 writing as a case to test its cross-linguistic generalizability. We selected three types of LLMs with different pretraining settings: a multilingual model trained on various languages, a Chinese-general model trained on both Simplified and Traditional Chinese, and a Traditional-Chinese-specific model. This comparison allowed us to explore how model architecture and training data affect LLM-surprisal estimates of learners’ essays written in Traditional Chinese, which in turn influence the modeling of L2 proficiency and development. We also correlated LLM-surprisals with 16 classic linguistic complexity indices (e.g., character sophistication, lexical diversity, syntactic complexity, and discourse coherence) to evaluate their interpretability and validity as measures for L2 writing assessment. Our findings demonstrate the potential of LLM-surprisal as a robust, interpretable, cross-linguistically applicable metric for automatic writing assessment, and contribute to bridging computational and linguistic approaches in understanding and modeling L2 writing development. All analysis scripts are available at https://github.com/JingyingHu/ChineseL2Writing-Surprisals.

pdf bib
Beyond Binary Animacy: A Multi-Method Investigation of LMs’ Sensitivity in English Object Relative Clauses
Yue Li | Yan Cong | Elaine J. Francis

Animacy is a well-documented factor affecting language production, but its influence on Language Models (LMs) in complex structures like Object Relative Clauses (ORCs) remains underexplored. This study examines LMs’ sensitivity to animacy in English ORC structure choice (passive vs. active) using surprisal-based and prompting-based analyses, alongside human baselines. In surprisal-based analysis, DistilGPT-2 best mirrored human preferences, while GPT-Neo and BERT-base showed rigid biases, diverging from human patterns. Prompting-based analysis expanded testing to GPT-4o-mini, Gemini models, and DeepSeek-R1, revealing GPT-4o-mini’s stronger human alignment but limited animacy sensitivity in Gemini models and DeepSeek-R1. Some LMs exhibited inconsistencies between analyses, reinforcing that prompting alone is unreliable for assessing linguistic competence. Corpus analysis confirmed that training data alone cannot fully explain animacy sensitivity, suggesting emergent animacy-aware representations. These findings underscore the interaction between training data, model architecture, and linguistic generalization, highlighting the need for integrating structured linguistic knowledge into LMs to enhance their alignment with human sentence processing mechanisms.

pdf bib
An Empirical Study of Language Syllabification using Syllabary and Lexical Networks
Rusali Saha | Yannick Marchand

Language syllabification is the separation of a word into written or spoken syllables. The study of syllabification plays a pivotal role in morphology, and there have been previous attempts to study this phenomenon using graphs or networks. Previous approaches have claimed, through visual estimation, that the degree distribution of language networks follows a power-law distribution; however, no empirically grounded metrics have been used to verify this claim. In our study, we implement two kinds of language networks, namely syllabary and lexical networks, investigate the syllabification of four European languages (English, French, German, and Spanish) using network analysis, and examine the networks’ small-world, random, and scale-free nature. We additionally show empirically that, contrary to claims in previous works, although the degree distributions of these networks appear to follow a power-law distribution, they are in fact in better agreement with a log-normal distribution when numerically grounded curve fitting is applied. Finally, we explore how syllabary and lexical networks for the English language change over time using a database of words with age-of-acquisition ratings. Our analysis further shows that the preferential attachment mechanism appears to be a well-grounded explanation for the degree distribution of the syllabary network.
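
The power-law versus log-normal comparison can be made numerically grounded with a likelihood-ratio test, for instance via the powerlaw package (an assumption; the abstract does not name the tooling). The sketch below runs the test on a synthetic preferential-attachment graph standing in for a syllabary network:

```python
# Likelihood-ratio comparison of power-law vs. log-normal degree fits.
import networkx as nx
import powerlaw

# `network` stands in for a syllabary or lexical network; here a random
# Barabási-Albert graph, whose degrees grow by preferential attachment.
network = nx.barabasi_albert_graph(n=5000, m=2)
degrees = [d for _, d in network.degree() if d > 0]

fit = powerlaw.Fit(degrees, discrete=True)
# R > 0 favors the first distribution, R < 0 the second; p gives significance.
R, p = fit.distribution_compare("power_law", "lognormal")
print(f"log-likelihood ratio R={R:.3f}, p={p:.3f}")
```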

pdf bib
Creolization versus code-switching: An agent-based cognitive model for bilingual strategies in language contact
Charles John Torres | Weijie Xu | Yanting Li | Richard Futrell

Creolization and code-switching are closely related contact-induced linguistic phenomena, yet little attention has been paid to the connection between them. In this paper, we propose an agent-based cognitive model that provides a linkage between these two phenomena, focusing on the statistical regularization of language use. That is, we show that creolization as a conventionalization process and code-switching as flexible language choice can emerge from the same cognitive model in different social environments. Our model postulates a social structure of bilingual and monolingual populations, in which a set of agents seeks an optimal communicative strategy shaped by multiple cognitive constraints. The simulation results show that our model successfully captures both phenomena as two ends of a continuum, characterized by varying degrees of regularization in the use of linguistic constructions from multiple source languages. The model also reveals a subtle dynamic between social structure and individual-level cognitive constraints.

pdf bib
When Men Bite Dogs: Testing Good-Enough Parsing in Turkish with Humans and Large Language Models
Onur Keleş | Nazik Dinctopal Deniz

This paper investigates good-enough parsing in Turkish by comparing human self-paced reading performance to the surprisal and attention patterns of three Turkish Large Language Models (LLMs): GPT-2-Base, GPT-2-Large, and LLaMA-3. The results show that Turkish speakers rely on good-enough parsing for implausible but grammatically permissible sentences (e.g., interpreting sentences such as ‘the man bit the dog’ as ‘the dog bit the man’). Although the smaller LLMs (e.g., GPT-2) were better predictors of human RTs, they seem to have relied more heavily on semantic plausibility than humans did. By comparison, larger LLMs (e.g., LLaMA-3) tended to parse more probabilistically based on word order, exhibiting less good-enough parsing behavior. We therefore conclude that LLMs take syntactic and semantic constraints into account when processing thematic roles, but not to the same extent as human parsers.

pdf bib
Transformers Can Model Human Hyperprediction in Buzzer Quiz
Yoichiro Yamashita | Yuto Harada | Yohei Oseki

Humans tend to predict the next words during sentence comprehension, but under unique circumstances they demonstrate an ability to predict longer coherent word sequences. In this paper, we investigate whether Transformers can model such hyperprediction observed in humans during sentence processing, specifically in the context of Japanese buzzer quizzes. We conducted eye-tracking experiments in which participants read the first half of buzzer quiz questions and predicted the second half, while we modeled their reading times using GPT-2. By modeling the reading times of each word in the first half of the question using GPT-2 surprisal, we examined under what conditions fine-tuned language models can better predict reading times. We found that GPT-2 surprisal effectively explains the reading times of quiz experts as they read the first half of the question while predicting the latter half. When the language model was fine-tuned on quiz questions, the perplexity value decreased, and lower perplexity generally corresponded to higher psychometric predictive power; however, with excessive fine-tuning data, perplexity continued to decrease while the fine-tuned model exhibited low psychometric predictive power. Overall, our findings suggest that a moderate amount of data is required for fine-tuning in order to model human hyperprediction.
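
The core analysis step, linking model surprisal to reading times, reduces to a regression. A hedged sketch follows, assuming aligned per-word arrays rather than the authors' exact pipeline; the coefficient and model fit for the surprisal predictor index psychometric predictive power:

```python
# Regressing per-word reading times on per-word GPT-2 surprisal.
import numpy as np
import statsmodels.api as sm

surprisal = np.load("word_surprisal.npy")  # per-word surprisal from GPT-2
rt = np.load("reading_times.npy")          # aligned eye-tracking measures

X = sm.add_constant(surprisal)             # intercept + surprisal predictor
result = sm.OLS(rt, X).fit()
print(result.summary())
```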

pdf bib
What to Predict? Exploring How Sentence Structure Influences Contrast Predictions in Humans and Large Language Models
Shuqi Wang | Xufeng Duan | Zhenguang Cai

This study examines how sentence structure shapes contrast predictions in both humans and large language models (LLMs). Using Mandarin ditransitive constructions — double object (DO, “She gave the girl the candy, but not...”) vs. prepositional object (PO, “She gave the candy to the girl, but not...”) as a testbed, we employed a sentence continuation task involving three human groups (written, spoken, and prosodically normalized spoken stimuli) and three LLMs (GPT-4o, LLaMA-3, and Qwen-2.5). Two principal findings emerged: (1) Although human participants predominantly focused on the theme (e.g., “the candy”), contrast predictions were significantly modulated by sentence structure—particularly in spoken contexts, where the sentence-final element drew more attention. (2) While LLMs showed a similar reliance on structure, they displayed a larger effect size and more closely resembled human spoken data than written data, indicating a stronger emphasis on linear order in generating contrast predictions. By adopting a unified psycholinguistic paradigm, this study advances our understanding of predictive language processing for both humans and LLMs and informs research on human–model alignment in linguistic tasks.

pdf bib
Investigating noun-noun compound relation representations in autoregressive large language models
Saffron Kendrick | Mark Ormerod | Hui Wang | Barry Devereux

This paper uses autoregressive large language models to explore at which points in a given input sentence semantic information is decodable. Using representational similarity analysis and probing, the results show that autoregressive models are capable of extracting semantic relation information from a dataset of noun-noun compounds. When considering the effect of processing the head and modifier nouns in context, the extracted representations show greater correlation after processing both constituent nouns in the same sentence. The linguistic properties of the head nouns may influence the ability of LLMs to extract relation information when the head and modifier words are processed separately. Probing suggests that Phi-1 and LLaMA-3.2 are exposed to relation information during training, as they are able to predict the relation vectors for compounds from separate word representations to a similar degree as when using compositional compound representations. However, the difference between processing conditions for GPT-2 and DeepSeek-R1 indicates that these models actively process the contextual semantic relation information of the compound.

up

pdf (full)
bib (full)
Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation

pdf bib
Proceedings of Context and Meaning: Navigating Disagreements in NLP Annotation
Michael Roth | Dominik Schlechtweg

pdf bib
Is a bunch of words enough to detect disagreement in hateful content?
Giulia Rizzi | Paolo Rosso | Elisabetta Fersini

The complexity of the annotation process when adopting crowdsourcing platforms for labeling hateful content can be linked to the presence of textual constituents that can be ambiguous, misinterpreted, or characterized by a reduced surrounding context. In this paper, we address the problem of perspectivism in hateful speech by leveraging contextualized embedding representations of its constituents and weighted probability functions. The effectiveness of the proposed approach is assessed using four datasets provided for the SemEval 2023 Task 11 shared task. The results emphasize that a few elements can serve as a proxy to identify sentences that may be perceived differently by multiple readers, without necessarily needing to exploit complex Large Language Models.

pdf bib
On Crowdsourcing Task Design for Discourse Relation Annotation
Frances Yung | Vera Demberg

Interpreting implicit discourse relations involves complex reasoning, requiring the integration of semantic cues with background knowledge, as overt connectives like “because” or “then” are absent. These relations often allow multiple interpretations, best represented as distributions. In this study, we compare two established methods that crowdsource implicit discourse relation annotation by connective insertion: a free-choice approach, which allows annotators to select any suitable connective, and a forced-choice approach, which asks them to select among a set of predefined options. Specifically, we re-annotate the whole DiscoGeM 1.0 corpus - initially annotated with the free-choice method - using the forced-choice approach. The free-choice approach allows for flexible and intuitive insertion of various connectives, which are context-dependent. Comparison among over 130,000 annotations, however, shows that the free-choice strategy produces less diverse annotations, often converging on common labels. Analysis of the results reveals the interplay between task design and the annotators’ abilities to interpret and produce discourse relations.

pdf bib
Sources of Disagreement in Data for LLM Instruction Tuning
Russel Dsouza | Venelin Kovatchev

In this paper we study the patterns of label disagreement in data used for instruction tuning Large Language models (LLMs). Specifically, we focus on data used for Reinforcement Learning from Human Feedback (RLHF). Our objective is to determine what is the primary source of disagreement: the individual data points, the choice of annotators, or the task formulation. We annotate the same dataset multiple times under different conditions and compare the overall agreement and the patterns of disagreement. For task formulation, we compare “single” format where annotators rate LLM responses individually with “preference” format where annotators select one of two possible responses. For annotators, we compare data from human labelers with automatic data labeling using LLMs. Our results indicate that: (1) there are very few “universally ambiguous” instances. The label disagreement depends largely on the task formulation and the choice of annotators; (2) the overall agreement remains consistent across experiments. We find no evidence that “preference” data is of higher quality than “single” data; and (3) the change of task formulation and annotators impacts the resulting instance-level labels. The labels obtained in different experiments are correlated, but not identical.

pdf bib
CoMeDi Shared Task: Median Judgment Classification & Mean Disagreement Ranking with Ordinal Word-in-Context Judgments
Dominik Schlechtweg | Tejaswi Choppa | Wei Zhao | Michael Roth

We asked task participants to solve two subtasks given a pair of word usages: Ordinal Graded Word-in-Context Classification (OGWiC) and Disagreement in Word-in-Context Ranking (DisWiC). The tasks take a different view on modeling of word meaning by (i) treating WiC as an ordinal classification task, and (ii) making disagreement the explicit detection aim (instead of removing it). OGWiC is solved with relatively high performance while DisWiC proves to be a challenging task. In both tasks, the dominating model architecture uses independently optimized binary Word-in-Context models.

pdf bib
Deep-change at CoMeDi: the Cross-Entropy Loss is not All You Need
Mikhail Kuklin | Nikolay Arefyev

Manual annotation of edges in Diachronic Word Usage Graphs is a critical step in creation of datasets for Lexical Semantic Change Detection tasks, but a very labour-intensive one. Annotators estimate if two senses of an ambiguous word expressed in two usages of this word are related and how. This is a variation of the Word-in-Context (WiC) task with some peculiarities, including diachronic data, an ordinal scale for annotations consisting of 4 values with pre-defined meanings (e.g. homonymy, polysemy), and special attention to the degree of disagreement between annotators which affects the further processing of the graph. CoMeDi is a shared task aiming at automating this annotation process. Participants are asked to predict the median annotation for a pair of usages in the first subtask, and estimate the disagreement between annotators in the second subtask. Together this gives some idea about the distribution of annotations we can get from humans for a given pair of usages. For the first subtask we tried several ways of adapting a binary WiC model to this 4 class problem. We discovered that further fine-tuning the model as a 4 class classifier on the training data of the shared task works significantly worse than thresholding the original binary model. For the second subtask our best results were achieved by building a model that predicts the whole multinomial distribution of annotations and calculating the disagreement from this distribution. Our solutions for both subtasks have outperformed all other participants of the shared task.
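
The thresholding idea for subtask 1, and a distribution-based disagreement estimate in the spirit of subtask 2, can be sketched as below. The cut points and function names are placeholders, not the team's tuned values:

```python
# Mapping a binary WiC model's continuous score to the 4-point ordinal
# scale, plus a disagreement estimate from a predicted label distribution.
import numpy as np

def to_ordinal(scores: np.ndarray, cuts=(0.3, 0.5, 0.7)) -> np.ndarray:
    """scores in [0, 1] -> labels 1..4 (1 = unrelated, 4 = identical);
    cut points would be tuned on training data."""
    return np.digitize(scores, cuts) + 1

def disagreement(label_probs: np.ndarray) -> np.ndarray:
    """Std. dev. of a (n_items, 4) multinomial distribution over labels,
    one way to turn a predicted annotation distribution into disagreement."""
    labels = np.arange(1, label_probs.shape[1] + 1)
    mean = label_probs @ labels
    return np.sqrt(label_probs @ labels**2 - mean**2)
```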

pdf bib
Predicting Median, Disagreement and Noise Label in Ordinal Word-in-Context Data
Tejaswi Choppa | Michael Roth | Dominik Schlechtweg

The quality of annotated data is crucial for Machine Learning models, particularly in word sense annotation in context (Word-in-Context, WiC). WiC datasets often show significant annotator disagreement, and information is lost when creating gold labels through majority or median aggregation. Recent work has addressed this by incorporating disagreement data through new label aggregation methods. Modeling disagreement is important since real-world scenarios often lack clean data and require predictions on inherently difficult samples. Disagreement prediction can help detect complex cases or reflect inherent data ambiguity. We aim to model different aspects of ordinal Word-in-Context annotations necessary to build a more human-like model: (i) the aggregated label, which has traditionally been the modeling aim, (ii) the disagreement between annotators, and (iii) the aggregated noise label, by which annotators can choose to exclude data points from annotation. We find that disagreement and noise are impacted by various properties of the data, such as ambiguity, which in turn points to data uncertainty.

pdf bib
GRASP at CoMeDi Shared Task: Multi-Strategy Modeling of Annotator Behavior in Multi-Lingual Semantic Judgments
David Alfter | Mattias Appelgren

This paper presents the GRASP team’s systems for the CoMeDi 2025 shared task on disagreement prediction in semantic annotation. The task comprises two subtasks: predicting median similarity scores and mean disagreement scores for word usage across multiple languages including Chinese, English, German, Norwegian, Russian, Spanish, and Swedish. For subtask 1, we implement three approaches: Prochain, a probabilistic chain model predicting sequential judgments; FARM, an ensemble of five fine-tuned XLM-RoBERTa models; and THAT, a task-specific model using XL-Lexeme with adaptive thresholds. For subtask 2, we develop three systems: LAMP, combining language-agnostic and monolingual models; BUMBLE, using optimal language combinations; and DRAMA, leveraging disagreement patterns from FARM’s outputs. Our results show strong performance across both subtasks, ranking second overall among participating teams. The probabilistic Prochain model demonstrates surprisingly robust performance when given accurate initial judgments, while our task-specific approaches show varying effectiveness across languages.

pdf bib
Funzac at CoMeDi Shared Task: Modeling Annotator Disagreement from Word-In-Context Perspectives
Olufunke O. Sarumi | Charles Welch | Lucie Flek | Jörg Schlötterer

In this work, we evaluate annotator disagreement in Word-in-Context (WiC) tasks, exploring the relationship between contextual meaning and disagreement as part of the CoMeDi shared task competition. While prior studies have modeled disagreement by analyzing annotator attributes with single-sentence inputs, this shared task incorporates WiC to bridge the gap between sentence-level semantic representation and annotator judgment variability. We describe three different methods that we developed for the shared task: a feature enrichment approach that extends contextual embedding representations by combining concatenation, element-wise differences, element-wise products, and cosine, Euclidean, and Manhattan distances; a transformation by Adapter blocks to obtain task-specific representations of contextual embeddings; and classifiers of varying complexities, including ensembles. The comparison of our methods demonstrates improved performance for methods that include enriched and task-specific features. While the performance of our method falls short of the best system in subtask 1 (OGWiC), it is competitive with the official evaluation results in subtask 2 (DisWiC).
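
One plausible reading of the feature-enrichment scheme is sketched below; the helper name and the exact ordering of features are illustrative assumptions:

```python
# Pairwise feature enrichment from two contextual embeddings u, v.
import numpy as np
from scipy.spatial.distance import cosine, euclidean, cityblock

def enrich(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Concatenation, difference, product, and distance features."""
    pair = [1 - cosine(u, v),    # cosine similarity
            euclidean(u, v),     # L2 distance
            cityblock(u, v)]     # L1 (Manhattan) distance
    return np.concatenate([u, v, u - v, u * v, np.array(pair)])

# The enriched vector then feeds a downstream classifier (MLP, ensemble, ...).
```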

pdf bib
FuocChuVIP123 at CoMeDi Shared Task: Disagreement Ranking with XLM-Roberta Sentence Embeddings and Deep Neural Regression
Phuoc Duong Huy Chu

This paper presents the results of our system for the CoMeDi Shared Task, focusing on Subtask 2: Disagreement Ranking. Our system leverages sentence embeddings generated by the paraphrase-xlm-r-multilingual-v1 model, combined with a deep neural regression model incorporating batch normalization and dropout for improved generalization. By predicting the mean of pairwise judgment differences between annotators, our method explicitly targets disagreement ranking, diverging from traditional “gold label” aggregation approaches. We optimized our system with a tailored architecture and training procedure, achieving competitive performance in Spearman correlation against the mean disagreement labels. Our results highlight the importance of robust embeddings, effective model architecture, and careful handling of judgment differences for ranking disagreement in multilingual contexts. These findings provide insights into leveraging contextualized representations for ordinal judgment tasks and open avenues for further refinement of disagreement prediction models.

pdf bib
JuniperLiu at CoMeDi Shared Task: Models as Annotators in Lexical Semantics Disagreements
Zhu Liu | Zhen Hu | Ying Liu

We present the results of our system for the CoMeDi Shared Task, which predicts majority votes (Subtask 1) and annotator disagreements (Subtask 2). Our approach combines model ensemble strategies with MLP-based and threshold-based methods trained on pretrained language models. Treating individual models as virtual annotators, we simulate the annotation process by designing aggregation measures that incorporate continuous relatedness scores and discrete classification labels to capture both majority and disagreement. Additionally, we employ anisotropy removal techniques to enhance performance. Experimental results demonstrate the effectiveness of our methods, particularly for Subtask 2. Notably, we find that the standard deviation of continuous relatedness scores across different model manipulations correlates better with human disagreement annotations than metrics computed on aggregated discrete labels. The code will be published at https://github.com/RyanLiut/CoMeDi_Solution

pdf bib
MMLabUIT at CoMeDi Shared Task: Text Embedding Techniques versus Generation-Based NLI for Median Judgment Classification
Tai Duc Le | Thin Dang Van

This paper presents our approach to the COLING 2025 CoMeDi task in 7 languages, focusing on subtask 1: Median Judgment Classification with Ordinal Word-in-Context Judgments (OGWiC). Specifically, the task is to determine the meaning relation of one word in two different contexts and classify the input into one of 4 labels. To address subtask 1, we implement and investigate various solutions, including (1) stacking and averaged-embedding techniques with a multilingual BERT-based model, and (2) a Natural Language Inference approach instead of a regular classification process. All experiments were conducted on a P100 GPU on the Kaggle platform. To enrich the input context, we apply Known Data Rate improvement and Text Expansion in some languages, and custom tokens are used in the data processing pipeline to focus the model. Our best official results on the test set are 0.515, 0.518, and 0.524 in terms of Krippendorff’s α on task 1. Our system achieved a Top 3 ranking in task 1. Beyond the official results, our best approach also achieved a Krippendorff’s α of 0.596 on task 1.

pdf bib
ABDN-NLP at CoMeDi Shared Task: Predicting the Aggregated Human Judgment via Weighted Few-Shot Prompting
Ying Xuan Loke | Dominik Schlechtweg | Wei Zhao

Human annotation is notorious for being subjective and expensive. Recently, the CoMeDi shared task was introduced to address this issue by predicting human annotations of the semantic proximity between word uses and estimating the variation of those annotations. However, distinguishing the proximity between word uses can be challenging when their semantic difference is subtle. In this work, we focus on predicting the aggregated annotator judgment of semantic proximity by using a large language model fine-tuned on 20 examples with various proximity classes. To distinguish nuanced proximity, we propose a weighted few-shot approach that pays greater attention to the proximity classes identified as important during fine-tuning. We evaluate our approach in the CoMeDi shared task across 7 languages. Our results demonstrate the superiority of our approach over zero-shot and standard few-shot counterparts. While useful, the weighted few-shot approach should be applied with caution, given that it relies on development sets to compute the importance of proximity classes and thus may not generalize well to real-world scenarios where the distribution of class importance differs.

pdf bib
Automating Annotation Guideline Improvements using LLMs: A Case Study
Adrien Bibal | Nathaniel Gerlek | Goran Muric | Elizabeth Boschee | Steven C. Fincke | Mike Ross | Steven N. Minton

Annotating texts can be a tedious task, especially when texts are noisy. At the root of the issue, guidelines are not always optimized enough for the required annotation task. In difficult cases, complex workflows are designed to reach the best possible guidelines. However, crowdsourced workers are commonly recruited to go through these complex workflows, and their slow speed and high cost limit the number of iterations over the workflows and, therefore, the possible results. In this paper, our case study, based on the entity recognition problem, suggests that LLMs can help produce guidelines of high quality (inter-annotator agreement going from 0.593 to 0.84 when improving WNUT-17’s guidelines), while being faster and cheaper than crowdsourced workers.

pdf bib
Ambiguity and Disagreement in Abstract Meaning Representation
Shira Wein

Abstract Meaning Representation (AMR) is a graph-based semantic formalism which has been incorporated into a number of downstream tasks related to natural language understanding. Recent work has highlighted the key, yet often ignored, role of ambiguity and implicit information in natural language understanding. As such, in order to effectively leverage AMR in downstream applications, it is imperative to understand to what extent and in what ways ambiguity affects AMR graphs and causes disagreement in AMR annotation. In this work, we examine the role of ambiguity in AMR graph structure by employing a taxonomy of ambiguity types and producing AMRs affected by each type. Additionally, we investigate how various AMR parsers handle the presence of ambiguity in sentences. Finally, we quantify the impact of ambiguity on AMR using disambiguating paraphrases at a larger scale, and compare this to the measurable impact of ambiguity in vector semantics.

pdf bib
Disagreement in Metaphor Annotation of Mexican Spanish Science Tweets
Alec Sánchez-Montero | Gemma Bel-Enguix | Sergio-Luis Ojeda-Trueba | Gerardo Sierra

Traditional linguistic annotation methods often strive for a gold standard with hard labels as input for natural language processing models, assuming an underlying objective truth for all tasks. However, disagreement among annotators is a common scenario, even for seemingly objective linguistic tasks, and is particularly prominent in figurative language annotation, since multiple valid interpretations can sometimes coexist. This study presents the annotation process for identifying metaphorical tweets within a corpus of 3733 Public Communication of Science texts written in Mexican Spanish, emphasizing inter-annotator disagreement. Using Fleiss’ and Cohen’s Kappa alongside agreement percentages, we evaluated metaphorical language detection through binary classification in three situations: two subsets of the corpus labeled by three different non-expert annotators each, and a subset of disagreement tweets, identified in the non-expert annotation phase, re-labeled by three expert annotators. Our results suggest that expert annotation may improve agreement levels but does not eliminate disagreement, likely due to factors such as the relative novelty of the genre, the presence of multiple scientific topics, and the blending of specialized and non-specialized discourse. Going further, we propose adopting a learning-from-disagreement approach for capturing diverse annotation perspectives to enhance computational metaphor detection in Mexican Spanish.
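
For reference, agreement statistics like Fleiss' kappa can be computed with statsmodels; the toy matrix below stands in for the actual annotation data:

```python
# Fleiss' kappa over an (n_items, n_annotators) matrix of binary judgments.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

labels = np.array([[1, 1, 0],
                   [0, 0, 0],
                   [1, 0, 1]])          # toy data: 3 tweets, 3 annotators
counts, _ = aggregate_raters(labels)    # -> (n_items, n_categories) counts
print("Fleiss' kappa:", fleiss_kappa(counts))
```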

up

pdf (full)
bib (full)
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages

pdf bib
Proceedings of the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
Bharathi Raja Chakravarthi | Ruba Priyadharshini | Anand Kumar Madasamy | Sajeetha Thavareesan | Elizabeth Sherly | Saranya Rajiakodi | Balasubramanian Palani | Malliga Subramanian | Subalalitha Cn | Dhivya Chinnappa

pdf bib
Incepto@DravidianLangTech 2025: Detecting Abusive Tamil and Malayalam Text Targeting Women on YouTube
Luxshan Thavarasa | Sivasuthan Sukumar | Jubeerathan Thevakumar

This study introduces a novel multilingual model designed to effectively address the challenges of detecting abusive content in low-resource, code-mixed languages, where limited data availability and the interplay of mixed languages, leading to complex linguistic phenomena, create significant hurdles in developing robust machine learning models. By leveraging transfer learning techniques and employing multi-head attention mechanisms, our model demonstrates impressive performance in detecting abusive content in both Tamil and Malayalam datasets. On the Tamil dataset, our team achieved a macro F1 score of 0.7864, while for the Malayalam dataset, a macro F1 score of 0.7058 was attained. These results highlight the effectiveness of our multilingual approach, delivering strong performance in Tamil and competitive results in Malayalam.

pdf bib
Eureka-CIOL@DravidianLangTech 2025: Using Customized BERTs for Sentiment Analysis of Tamil Political Comments
Enjamamul Haque Eram | Anisha Ahmed | Sabrina Afroz Mitu | Azmine Toushik Wasi

Sentiment analysis on social media platforms plays a crucial role in understanding public opinion and the decision-making process on political matters. As a significant number of individuals express their views on social media, analyzing these opinions is essential for monitoring political trends and assessing voter sentiment. However, sentiment analysis for low-resource languages, such as Tamil, presents considerable challenges due to the limited availability of annotated datasets and linguistic complexities. To address this gap, we utilize a novel dataset encompassing seven sentiment classes, offering a unique opportunity to explore sentiment variations in Tamil political discourse. In this study, we evaluate multiple pre-trained models from the Hugging Face library and experiment with various hyperparameter configurations to optimize model performance. Our findings aim to contribute to the development of more effective sentiment analysis tools tailored for low-resource languages, ultimately empowering Tamil-speaking communities by providing deeper insights into their political sentiments. Our full experimental codebase is publicly available at: ciol-researchlab/NAACL25-Eureka-Sentiment-Analysis-Tamil

pdf bib
Akatsuki-CIOL@DravidianLangTech 2025: Ensemble-Based Approach Using Pre-Trained Models for Fake News Detection in Dravidian Languages
Mahfuz Ahmed Anik | Md. Iqramul Hoque | Wahid Faisal | Azmine Toushik Wasi | Md Manjurul Ahsan

The widespread dissemination of fake news on social media poses significant challenges, particularly for low-resource languages like Malayalam. The accessibility of social platforms accelerates misinformation, leading to societal polarization and poor decision-making. Detecting fake news in Malayalam is complex due to its linguistic diversity, code-mixing, and dialectal variations, compounded by the lack of large labeled datasets and tailored models. To address these challenges, we developed a fine-tuned transformer-based model for binary and multiclass fake news detection. The binary classifier achieved a macro F1 score of 0.814, while the multiclass model, using multimodal embeddings, achieved a score of 0.1978. Our system ranked 14th and 11th in the shared task competition, highlighting the need for specialized techniques in underrepresented languages. Our full experimental codebase is publicly available at: ciol-researchlab/NAACL25-Akatsuki-Fake-News-Detection.

pdf bib
RMKMavericks@DravidianLangTech 2025: Tackling Abusive Tamil and Malayalam Text Targeting Women: A Linguistic Approach
Sandra Johnson | Boomika E | Lahari P

Social media abuse of women is a widespread problem, especially in regional languages like Tamil and Malayalam, where there are few tools for automated identification. This work examines the use of machine learning methods to detect abusive messages in several languages. An external dataset was used to train a Support Vector Machine (SVM) model for Tamil, which produced an F1 score of 0.6196. Using the given dataset, a Multinomial Naive Bayes (MNB) model was trained for Malayalam, obtaining an F1 score of 0.6484. Both models processed and analyzed textual input efficiently by using TF-IDF vectorization for feature extraction. This approach demonstrates the ability to address the linguistic diversity and complexity of abusive language identification by utilizing language-specific datasets and customized algorithms. The results highlight how crucial it is to use focused machine learning techniques to make online spaces safer for women, especially for speakers of minority languages.
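
A minimal version of the described pipeline, with placeholder file and column names, might look like this in scikit-learn; swapping LinearSVC for MultinomialNB gives the Malayalam variant:

```python
# TF-IDF features into a linear SVM for abusive-comment classification.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train = pd.read_csv("tamil_train.csv")   # assumed columns: text, label
test = pd.read_csv("tamil_test.csv")

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), LinearSVC())
clf.fit(train.text, train.label)
print("macro F1:", f1_score(test.label, clf.predict(test.text), average="macro"))
```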

pdf bib
RMKMavericks@DravidianLangTech 2025: Emotion Mining in Tamil and Tulu Code-Mixed Text: Challenges and Insights
Gladiss Merlin N.r | Boomika E | Lahari P

Sentiment analysis in code-mixed social media comments written in Tamil and Tulu presents unique challenges due to grammatical inconsistencies, code-switching, and the use of non-native scripts. To address these complexities, we employ pre-processing techniques for text cleaning and evaluate machine learning models tailored for sentiment detection. Traditional machine learning methods combined with feature extraction strategies, such as TF-IDF, are utilized. While logistic regression demonstrated reasonable performance on the Tamil dataset, achieving a macro F1 score of 0.44, support vector machines (SVM) outperformed logistic regression on the Tulu dataset with a macro F1 score of 0.54. These results demonstrate the effectiveness of traditional approaches, particularly SVM, in handling low-resource, multilingual data, while also highlighting the need for further refinement to improve performance across underrepresented sentiment classes.

pdf bib
JAS@DravidianLangTech 2025: Abusive Tamil Text targeting Women on Social Media
B Saathvik | Janeshvar Sivakumar | Thenmozhi Durairaj

This paper presents our submission for Abusive Comment Detection in Tamil - DravidianLangTech@NAACL 2025. The aim is to classify whether a given comment is abusive towards women. Google’s MuRIL (Khanuja et al., 2021), a transformer-based multilingual model, is fine-tuned using the provided dataset to build the classification model. The dataset is preprocessed, tokenised, and formatted for model training. The model is trained and evaluated using accuracy, F1-score, precision, and recall. Our approach achieved an evaluation accuracy of 77.76% and an F1-score of 77.65%. The lack of large, high-quality datasets for low-resource languages has also been acknowledged.

pdf bib
Team-Risers@DravidianLangTech 2025: AI-Generated Product Review Detection in Dravidian Languages Using Transformer-Based Embeddings
Sai Sathvik | Muralidhar Palli | Keerthana NNL | Balasubramanian Palani | Jobin Jose | Siranjeevi Rajamanickam

Online product reviews influence customer choices and company reputations. However, companies can counter negative reviews by generating fake reviews that portray their products positively. These fake reviews lead to legal disputes and concerns, particularly because AI detection tools are limited in low-resource languages such as Tamil and Malayalam. To address this, we use machine learning and deep learning techniques to identify AI-generated reviews. We utilize Tamil BERT and Malayalam BERT in the embedding layer to extract contextual features. These features are sent to a Feedforward Neural Network (FFN) with softmax to classify reviews as AI-generated or not. The performance of the model is evaluated on the dataset. The results show that the transformer-based embedding achieves a better accuracy of 95.68% on Tamil data and an accuracy of 88.75% on Malayalam data.

pdf bib
NLPopsCIOL@DravidianLangTech 2025: Classification of Abusive Tamil and Malayalam Text Targeting Women Using Pre-trained Models
Abdullah Al Nahian | Mst Rafia Islam | Azmine Toushik Wasi | Md Manjurul Ahsan

Hate speech detection in multilingual and code-mixed contexts remains a significant challenge due to linguistic diversity and overlapping syntactic structures. This paper presents a study on the detection of hate speech in Tamil and Malayalam using transformer-based models. Our goal is to address underfitting and develop effective models for hate speech classification. We evaluate several pre-trained models, including MuRIL and XLM-RoBERTa, and show that fine-tuning is crucial for better performance. The test results show a Macro-F1 score of 0.7039 for Tamil and 0.6402 for Malayalam, highlighting the promise of these models with further improvements in fine-tuning. We also discuss data preprocessing techniques, model implementations, and experimental findings. Our full experimental codebase is publicly available at: github.com/ciol-researchlab/NAACL25-NLPops-Classification-Abusive-Text.

pdf bib
AiMNLP@DravidianLangTech 2025: Unmask It! AI-Generated Product Review Detection in Dravidian Languages
Somsubhra De | Advait Vats

The rise of Generative AI has led to a surge in AI-generated reviews, often posing a serious threat to the credibility of online platforms. Reviews serve as the primary source of information about products and services. Authentic reviews play a vital role in consumer decision-making. The presence of fabricated content misleads consumers, undermines trust and facilitates potential fraud in digital marketplaces. This study focuses on detecting AI-generated product reviews in Tamil and Malayalam, two low-resource languages where research in this domain is relatively under-explored. We worked on a range of approaches - from traditional machine learning methods to advanced transformer-based models such as Indic-BERT, IndicSBERT, MuRIL, XLM-RoBERTa and Malayalam-BERT. Our findings highlight the effectiveness of leveraging the state-of-the-art transformers in accurately identifying AI-generated content, demonstrating the potential in enhancing the detection of fake reviews in low-resource language settings.

pdf bib
byteSizedLLM@DravidianLangTech 2025: Fake News Detection in Dravidian Languages Using Transliteration-Aware XLM-RoBERTa and Transformer Encoder-Decoder
Durga Prasad Manukonda | Rohith Gowtham Kodali

This study addresses the challenge of fake news detection in code-mixed and transliterated text, focusing on a multilingual setting with significant linguistic variability. A novel approach is proposed, leveraging a fine-tuned multilingual transformer model trained using Masked Language Modeling on a dataset that includes original, fully transliterated, and partially transliterated text. The fine-tuned embeddings are integrated into a custom transformer classifier designed to capture complex dependencies in multilingual sequences. The system achieves state-of-the-art performance, demonstrating the effectiveness of combining transliteration-aware fine-tuning with robust transformer architectures to handle code-mixed and resource-scarce text, providing a scalable solution for multilingual natural language processing tasks.
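
The Masked Language Modeling fine-tuning stage might be sketched with Hugging Face tooling as below; the corpus file and training settings are illustrative assumptions, not the authors' configuration:

```python
# MLM fine-tuning of XLM-RoBERTa on a mixed-script corpus (a sketch).
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# One text per line: original, fully transliterated, partially transliterated.
ds = load_dataset("text", data_files="mixed_script_corpus.txt")["train"]
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=128),
            batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments("mlm-ckpt", per_device_train_batch_size=16,
                           num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
trainer.train()
```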

pdf bib
byteSizedLLM@DravidianLangTech 2025: Fake News Detection in Dravidian Languages Using Transliteration-Aware XLM-RoBERTa and Attention-BiLSTM
Rohith Gowtham Kodali | Durga Prasad Manukonda

This research introduces an innovative Attention BiLSTM-XLM-RoBERTa model for tackling the challenge of fake news detection in Malayalam datasets. By fine-tuning XLM-RoBERTa with Masked Language Modeling (MLM) on transliteration-aware data, the model effectively bridges linguistic and script diversity, seamlessly integrating native, Romanized, and mixed-script text. Although most of the training data is monolingual, the proposed approach demonstrates robust performance in handling diverse script variations. Achieving a macro F1-score of 0.5775 and securing top rankings in the shared task, this work highlights the potential of multilingual models in addressing resource-scarce language challenges and sets a foundation for future advancements in fake news detection.

pdf bib
byteSizedLLM@DravidianLangTech 2025: Multimodal Hate Speech Detection in Malayalam Using Attention-Driven BiLSTM, Malayalam-Topic-BERT, and Fine-Tuned Wav2Vec 2.0
Durga Prasad Manukonda | Rohith Gowtham Kodali | Daniel Iglesias

This research presents a robust multimodal framework for hate speech detection in Malayalam, combining fine-tuned Wav2Vec 2.0, Malayalam-Doc-Topic-BERT, and an Attention-Driven BiLSTM architecture. The proposed approach effectively integrates acoustic and textual features, achieving a macro F1-score of 0.84 on the Malayalam test set. Fine-tuning Wav2Vec 2.0 on Malayalam speech data and leveraging Malayalam-Doc-Topic-BERT significantly improved performance over prior methods using openly available models. The results highlight the potential of language-specific models and advanced multimodal fusion techniques for addressing nuanced hate speech categories, setting the stage for future work on Dravidian languages like Tamil and Telugu.

pdf bib
byteSizedLLM@DravidianLangTech 2025: Detecting AI-Generated Product Reviews in Dravidian Languages Using XLM-RoBERTa and Attention-BiLSTM
Rohith Gowtham Kodali | Durga Prasad Manukonda | Maharajan Pannakkaran

This study presents a hybrid model integrating TamilXLM-RoBERTa and MalayalamXLM-RoBERTa with BiLSTM and attention mechanisms to classify AI-generated and human-written product reviews in Tamil and Malayalam. The model employs a transliteration-based fine-tuning strategy, effectively handling native, Romanized, and mixed-script text. Despite being trained on a relatively small portion of data, our approach demonstrates strong performance in distinguishing AI-generated content, achieving competitive macro F1 scores in the DravidianLangTech 2025 shared task. The proposed method showcases the effectiveness of multilingual transformers and hybrid architectures in tackling low-resource language challenges.

pdf bib
byteSizedLLM@DravidianLangTech 2025: Abusive Tamil and Malayalam Text targeting Women on Social Media Using XLM-RoBERTa and Attention-BiLSTM
Rohith Gowtham Kodali | Durga Prasad Manukonda | Maharajan Pannakkaran

This research investigates abusive comment detection in Tamil and Malayalam, focusing on code-mixed, multilingual social media text. A hybrid Attention BiLSTM-XLM-RoBERTa model was utilized, combining fine-tuned embeddings, sequential dependencies, and attention mechanisms. Despite computational constraints limiting fine-tuning to a subset of the AI4Bharath dataset, the model achieved competitive macro F1-scores, ranking 6th for both Tamil and Malayalam datasets with minor performance differences. The results emphasize the potential of multilingual transformers and the need for further advancements, particularly in addressing linguistic diversity, transliteration complexity, and computational limitations.

pdf bib
byteSizedLLM@DravidianLangTech 2025: Multimodal Misogyny Meme Detection in Low-Resource Dravidian Languages Using Transliteration-Aware XLM-RoBERTa, ResNet-50, and Attention-BiLSTM
Durga Prasad Manukonda | Rohith Gowtham Kodali

Detecting misogyny in memes is challenging due to their multimodal nature, especially in low-resource languages like Tamil and Malayalam. This paper presents our work in the Misogyny Meme Detection task, utilizing both textual and visual features. We propose an Attention-Driven BiLSTM-XLM-RoBERTa-ResNet model, combining a transliteration-aware fine-tuned XLM-RoBERTa for text analysis and ResNet-50 for image feature extraction. Our model achieved Macro-F1 scores of 0.8805 for Malayalam and 0.8081 for Tamil, demonstrating competitive performance. However, challenges such as class imbalance and domain-specific image representation persist. Our findings highlight the need for better dataset curation, task-specific fine-tuning, and advanced fusion techniques to enhance multimodal hate speech detection in Dravidian languages.
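
A compact sketch of the text-image fusion idea follows, simplifying the attention-driven BiLSTM head to a linear classifier for brevity; the model choices mirror those named above, but the exact wiring is an assumption:

```python
# Late fusion of XLM-R text features and ResNet-50 image features.
import torch
import torch.nn as nn
from torchvision.models import resnet50
from transformers import AutoModel

class MemeClassifier(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        self.text = AutoModel.from_pretrained("xlm-roberta-base")
        cnn = resnet50(weights="IMAGENET1K_V2")
        self.image = nn.Sequential(*list(cnn.children())[:-1])  # drop final fc
        self.head = nn.Linear(self.text.config.hidden_size + 2048, n_classes)

    def forward(self, input_ids, attention_mask, pixel_values):
        t = self.text(input_ids=input_ids,
                      attention_mask=attention_mask
                      ).last_hidden_state[:, 0]        # [CLS]-style pooling
        v = self.image(pixel_values).flatten(1)        # (batch, 2048)
        return self.head(torch.cat([t, v], dim=-1))
```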

pdf bib
byteSizedLLM@DravidianLangTech 2025: Sentiment Analysis in Tamil Using Transliteration-Aware XLM-RoBERTa and Attention-BiLSTM
Durga Prasad Manukonda | Rohith Gowtham Kodali

This study investigates sentiment analysis in code-mixed Tamil-English text using an Attention BiLSTM-XLM-RoBERTa model, combining multilingual embeddings with sequential context modeling to enhance classification performance. The model was fine-tuned using masked language modeling and trained with an attention-based BiLSTM classifier to capture sentiment patterns in transliterated and informal text. Despite computational constraints limiting pretraining, the approach achieved a macro F1 of 0.5036 and ranked first in the competition. The model performed best on the Positive class, while Mixed Feelings and Unknown State showed lower recall due to class imbalance and ambiguity. Error analysis reveals challenges in handling non-standard transliterations, sentiment shifts, and informal language variations in social media text. These findings demonstrate the effectiveness of transformer-based multilingual embeddings and sequential modeling for sentiment classification in code-mixed text.

pdf bib
SSNCSE@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages
Sreeja K | Bharathi B

Hate speech detection is a serious challenge due to the diversity of digital media communication, particularly in low-resource languages. This research addresses multimodal hate speech detection by incorporating both textual and audio modalities. On social media platforms, hate speech is conveyed not only through text but also through audio, which may further amplify harmful content. To manage this issue, we provide a multiclass classification model that leverages both text and audio features to detect and categorize hate speech in low-resource languages. The model uses machine learning models for text analysis and audio processing, allowing it to efficiently capture the complex relationships between the two modalities. A class-weighting mechanism is used to mitigate overfitting, and final predictions are obtained with a majority fusion technique. Performance is measured using the macro average F1 score. The model achieved F1-scores of 0.59, 0.52, and 0.33 for Tamil, Malayalam, and Telugu, respectively.
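
A minimal sketch of the class-weighted, majority-fusion idea is shown below, assuming integer class labels and precomputed text/audio feature matrices; the specific classifiers are placeholders, since the abstract does not name them.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC

    def train_and_fuse(X_text, X_audio, y, X_text_te, X_audio_te):
        # class_weight="balanced" plays the role of the class-weighting
        # mechanism mentioned in the abstract, countering label imbalance
        models = [
            (LogisticRegression(class_weight="balanced", max_iter=1000),
             X_text, X_text_te),
            (SVC(class_weight="balanced"), X_text, X_text_te),
            (SVC(class_weight="balanced"), X_audio, X_audio_te),
        ]
        votes = np.stack([m.fit(X_tr, y).predict(X_te)
                          for m, X_tr, X_te in models])   # (3, n_samples)
        # majority fusion: most frequent label among the per-model votes
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)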

pdf bib
Bridging Linguistic Complexity: Sentiment Analysis of Tamil Code-Mixed Text Using Meta-Model
Anusha M D Gowda | Deepthi Vikram | Parameshwar R Hegde

Sentiment analysis in code-mixed languages poses significant challenges due to the complex nature of mixed-language text. This study explores sentiment analysis on Tamil code-mixed text using deep learning models such as Long Short-Term Memory (LSTM), hybrid models like Convolutional Neural Network (CNN) + Gated Recurrent Unit (GRU) and LSTM + GRU, along with meta-models including Logistic Regression, Random Forest, and Decision Tree. The LSTM+GRU hybrid model achieved an accuracy of 0.31, while the CNN+GRU hybrid model reached 0.28. The Random Forest meta-model demonstrated exceptional performance on the development set with an accuracy of 0.99. However, its performance dropped significantly on the test set, achieving an accuracy of 0.1333. The study results emphasize the potential of meta-model-based classification for improving performance in NLP tasks.
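
The large gap between development accuracy (0.99) and test accuracy (0.1333) reported above is a pattern often caused by fitting the meta-model on predictions made over data the base models have already seen. A minimal sketch of leakage-free stacking with out-of-fold base predictions follows; the base models are abstracted as scikit-learn estimators, standing in for the paper's LSTM/GRU hybrids.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_predict

    def fit_stacked(base_models, X, y):
        # out-of-fold probabilities prevent the meta-model from seeing
        # predictions made on the bases' own training data
        meta_features = np.hstack([
            cross_val_predict(m, X, y, cv=5, method="predict_proba")
            for m in base_models
        ])
        meta = RandomForestClassifier(n_estimators=200, random_state=0)
        meta.fit(meta_features, y)
        bases = [m.fit(X, y) for m in base_models]   # refit bases on all data
        return bases, meta

    def predict_stacked(bases, meta, X_new):
        feats = np.hstack([m.predict_proba(X_new) for m in bases])
        return meta.predict(feats)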

pdf bib
YenCS@DravidianLangTech 2025: Integrating Hybrid Architectures for Fake News Detection in Low-Resource Dravidian Languages
Anusha M D Gowda | Parameshwar R Hegde

Detecting fake news in under-resourced Dravidian languages is a challenging task due to the scarcity of annotated datasets and the intricate nature of code-mixed text. This study tackles these issues by employing advanced machine learning techniques for two key classification tasks. The first task involves binary classification, achieving a macro-average F1-score of 0.792 with a hybrid fusion model that integrates a Bidirectional Recurrent Neural Network (Bi-RNN) and a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) with weighted averaging. The second task focuses on fine-grained classification of news, where an LSTM-GRU hybrid model attained a macro-average F1-score of 0.26. These findings highlight the effectiveness of hybrid models in improving fake news detection for under-resourced languages. Additionally, this study provides a foundational framework that can be adapted to similar challenges in other under-resourced languages, emphasizing the need for further research in this area.

pdf bib
Overview of the Shared Task on Multimodal Hate Speech Detection in Dravidian languages: DravidianLangTech@NAACL 2025
Jyothish Lal G | Premjith B | Bharathi Raja Chakravarthi | Saranya Rajiakodi | Bharathi B | Rajeswari Natarajan | Ratnavel Rajalakshmi

Detecting hate speech on social media platforms is crucial these days due to its adverse impact on mental health, social harmony, and online safety. This paper presents an overview of the shared task on Multimodal Hate Speech Detection in Dravidian Languages organized as part of DravidianLangTech@NAACL 2025. The task emphasizes detecting hate speech in social media content that combines speech and text. Here, we focus on three low-resource Dravidian languages: Malayalam, Tamil, and Telugu. Participants were required to classify hate speech in three sub-tasks, each corresponding to one of these languages. The dataset was curated by collecting speech and corresponding text from YouTube videos. Various machine learning and deep learning-based models, including transformer-based architectures and multimodal frameworks, were employed by the participants. The submissions were evaluated using the macro F1 score. Experimental results underline the potential of multimodal approaches in advancing hate speech detection for low-resource languages. Team SSNTrio achieved the highest F1 scores in Malayalam and Tamil, 0.7511 and 0.7332 respectively, and team lowes achieved the best F1 score of 0.3817 in the Telugu sub-task.

pdf bib
Overview of the Shared Task on Detecting AI Generated Product Reviews in Dravidian Languages: DravidianLangTech@NAACL 2025
Premjith B | Nandhini Kumaresh | Bharathi Raja Chakravarthi | Thenmozhi Durairaj | Balasubramanian Palani | Sajeetha Thavareesan | Prasanna Kumar Kumaresan

The detection of AI-generated product reviews is critical due to the increased use of large language models (LLMs) and their capability to generate convincing sentences. AI-generated reviews can affect consumers and businesses, as they influence trust and decision-making. This paper presents an overview of the shared task on “Detecting AI-generated Product Reviews in Dravidian Languages” organized as part of DravidianLangTech@NAACL 2025. The task involves two subtasks, one in Malayalam and another in Tamil, both binary classifications in which a review is to be classified as human-generated or AI-generated. The dataset was curated by collecting comments from YouTube videos. Participants employed various machine learning and deep learning-based models, ranging from SVM to transformer-based architectures.

pdf bib
Girma@DravidianLangTech 2025: Detecting AI Generated Product Reviews
Girma Yohannis Bade | Muhammad Tayyab Zamir | Olga Kolesnikova | José Luis Oropeza | Grigori Sidorov | Alexander Gelbukh

The increasing prevalence of AI-generated content, including fake product reviews, poses significant challenges in maintaining authenticity and trust in e-commerce systems. While much work has focused on detecting such reviews in high-resource languages, limited attention has been given to low-resource languages like Malayalam and Tamil. This study aims to address this gap by developing a robust framework to identify AI-generated product reviews in these languages. We explore a BERT-based approach for this task. Our methodology involves fine-tuning a BERT-based model specifically on Malayalam and Tamil datasets. The experiments are conducted using labeled datasets that contain a mix of human-written and AI-generated reviews. Performance is evaluated using the macro F1 score. The results show that the BERT-based model achieved a macro F1 score of 0.6394 for Tamil and 0.8849 for Malayalam, indicating significantly better performance for Malayalam than for Tamil and reflecting the model's ability to capture the complex linguistic features of these languages. Finally, we release the source code of the implementation in the GitHub repository: AI-Generated-Product-Review-Code

pdf bib
Beyond_Tech@DravidianLangTech 2025: Political Multiclass Sentiment Analysis using Machine Learning and Neural Network
Kogilavani Shanmugavadivel | Malliga Subramanian | Sanjai R | Mohammed Sameer | Motheeswaran K

Research on political sentiment is essential for comprehending public opinion in the digital age, as social media and news platforms are often the sites of discussions. To categorize political remarks into sentiments such as positive, negative, neutral, opinionated, substantiated, and sarcastic, this study offers a multiclass sentiment analysis approach. After preprocessing and feature extraction from a large dataset of political texts using Natural Language Processing approaches, we trained models such as Random Forest and a Feedforward Neural Network. The Random Forest model, which excelled at identifying more complex attitudes like sarcasm and opinionated utterances, had the greatest accuracy of 84%, followed closely by the Feedforward Neural Network at 83%. These results highlight how well political discourse can be analyzed by combining deep learning and traditional machine learning techniques. There is also room for improvement by adding external metadata and using sophisticated models like BERT for better sentiment classification.

pdf bib
Misogynistic Meme Detection in Dravidian Languages Using Kolmogorov Arnold-based Networks
Manasha Arunachalam | Navneet Krishna Chukka | Harish Vijay V | Premjith B | Bharathi Raja Chakravarthi

The prevalence of misogynistic content online poses significant challenges to ensuring a safe and inclusive digital space for women. This study presents a pipeline to classify online memes as misogynistic or non-misogynistic. The pipeline combines contextual image embeddings generated using the Vision Transformer Encoder (ViTE) model with text embeddings extracted from the memes using ModernBERT. These multimodal embeddings were fused and trained using three advanced types of Kolmogorov-Arnold Networks (KAN): PyKAN, FastKAN, and Chebyshev KAN. The models were evaluated based on their F1 scores, demonstrating their effectiveness in addressing this issue. This research marks an important step towards reducing offensive online content, promoting safer and more respectful interactions in the digital world.
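
Of the three KAN variants named above, the Chebyshev one is the simplest to sketch: each input-output edge learns a small Chebyshev polynomial expansion instead of applying a fixed activation. The layer below is a minimal illustration under invented sizes (a degree-4 expansion, 768-d ViT and text embeddings), not the authors' implementation.

    import torch
    import torch.nn as nn

    class ChebyKANLayer(nn.Module):
        def __init__(self, in_dim, out_dim, degree=4):
            super().__init__()
            self.degree = degree
            # one set of Chebyshev coefficients per (input, output) edge
            self.coeffs = nn.Parameter(torch.randn(in_dim, out_dim, degree + 1) * 0.1)

        def forward(self, x):
            x = torch.tanh(x)                     # squash inputs into [-1, 1]
            T = [torch.ones_like(x), x]           # Chebyshev recurrence: T0, T1
            for _ in range(2, self.degree + 1):
                T.append(2 * x * T[-1] - T[-2])   # T_k = 2x*T_{k-1} - T_{k-2}
            basis = torch.stack(T, dim=-1)        # (B, in_dim, degree + 1)
            # learnable 1-D functions per edge, summed over inputs
            return torch.einsum("bid,iod->bo", basis, self.coeffs)

    # fused image + text embeddings would feed stacked layers like these
    head = nn.Sequential(ChebyKANLayer(768 + 768, 64), ChebyKANLayer(64, 2))
    logits = head(torch.randn(8, 1536))           # toy batch of fused embeddings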

pdf bib
HTMS@DravidianLangTech 2025: Fusing TF-IDF and BERT with Dimensionality Reduction for Abusive Language Detection in Tamil and Malayalam
Bachu Naga Sri Harini | Kankipati Venkata Meghana | Kondakindi Supriya | Tara Samiksha | Premjith B

Detecting abusive and similarly toxic content posted on a social media platform is challenging due to the complexities of the language, data imbalance, and the code-mixed nature of the text. In this paper, we present our submissions for the shared task on abusive Tamil and Malayalam texts targeting women on social media at DravidianLangTech@NAACL 2025. We propose a hybrid embedding model that integrates embeddings generated using term frequency-inverse document frequency (TF-IDF) and BERT. To reconcile the differences in embedding dimensions, we applied a dimensionality reduction method to the TF-IDF embeddings. We submitted two more runs to the shared task, one based on TF-IDF embeddings alone and another based on BERT-based embeddings alone. The code for the submissions is available at https://github.com/Tarrruh/NLP_HTMS.
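
A minimal sketch of the hybrid embedding idea follows: sparse TF-IDF vectors are reduced before being concatenated with dense BERT sentence vectors. TruncatedSVD as the reducer, the mBERT checkpoint, and the toy sizes are assumptions here; the abstract only states that a dimensionality reduction method was applied to the TF-IDF side.

    import numpy as np
    import torch
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from transformers import AutoModel, AutoTokenizer

    texts = ["comment one", "comment two", "another comment", "yet more text"]

    tfidf = TfidfVectorizer().fit_transform(texts)   # sparse, vocabulary-sized
    # n_components would match the BERT width (768) on real data;
    # it is tiny here only because the toy corpus is tiny
    reduced = TruncatedSVD(n_components=2).fit_transform(tfidf)

    tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    bert = AutoModel.from_pretrained("bert-base-multilingual-cased")
    enc = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls = bert(**enc).last_hidden_state[:, 0].numpy()   # (n, 768) vectors

    hybrid = np.hstack([reduced, cls])   # fused features for a classifier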

pdf bib
Team_Catalysts@DravidianLangTech 2025: Leveraging Political Sentiment Analysis using Machine Learning Techniques for Classifying Tamil Tweets
Kogilavani Shanmugavadivel | Malliga Subramanian | Subhadevi K | Sowbharanika Janani Sivakumar | Rahul K

This work proposes a methodology for assessing political sentiment in Tamil tweets using machine learning models. The approach addresses linguistic challenges in Tamil text through a robust preprocessing pipeline covering cleaning, normalization, tokenization, and class-imbalance handling. Various models, including Random Forest, Logistic Regression, and CatBoost, were applied, with Random Forest achieving a macro F1-score of 0.2933 and securing 8th rank among 153 participants in the Codalab competition. This accomplishment highlights the effectiveness of machine learning models in handling the complexities of multilingual, code-mixed, and unstructured data in Tamil political discourse. The study also emphasizes the importance of tailored preprocessing techniques for improving model accuracy and performance, and demonstrates the potential of computational linguistics and machine learning for understanding political discourse in low-resource languages like Tamil, contributing to advancements in regional sentiment analysis.

pdf bib
InnovationEngineers@DravidianLangTech 2025: Enhanced CNN Models for Detecting Misogyny in Tamil Memes Using Image and Text Classification
Kogilavani Shanmugavadivel | Malliga Subramanian | Pooja Sree M | Palanimurugan V | Roshini Priya K

The rise of misogynistic memes on social media poses challenges to civil discourse. This paper aims to detect misogyny in Dravidian-language memes using a multimodal deep learning approach. We integrated Bidirectional Encoder Representations from Transformers (BERT), Long Short-Term Memory (LSTM), EfficientNet, and a Vision Language Model (VLM) to analyze textual and visual information. EfficientNet extracted image features, LSTM captured sequential text patterns, and BERT learned language-specific embeddings. Among these, the VLM achieved the highest accuracy of 85.0% and an F1-score of 70.8, effectively capturing visual-textual relationships. Validated on a curated dataset, our method outperformed baselines in precision, recall, and F1-score. Our approach ranked 12th out of 118 participants for the Tamil language, highlighting its competitive performance. This research emphasizes the importance of multimodal models in detecting harmful content. Future work can explore improved feature-fusion techniques to enhance classification accuracy.

pdf bib
MysticCIOL@DravidianLangTech 2025: A Hybrid Framework for Sentiment Analysis in Tamil and Tulu Using Fine-Tuned SBERT Embeddings and Custom MLP Architectures
Minhaz Chowdhury | Arnab Laskar | Taj Ahmad | Azmine Toushik Wasi

Sentiment analysis is a crucial NLP task used to analyze opinions in various domains, including marketing, politics, and social media. While transformer-based models like BERT and SBERT have significantly improved sentiment classification, their effectiveness in low-resource languages remains limited. Tamil and Tulu, despite their widespread use, suffer from data scarcity, dialectal variations, and code-mixing challenges, making sentiment analysis difficult. Existing methods rely on traditional classifiers or word embeddings, which struggle to generalize in these settings. To address this, we propose a hybrid framework that integrates fine-tuned SBERT embeddings with a Multi-Layer Perceptron (MLP) classifier, enhancing contextual representation and classification robustness. Our framework achieves validation F1-scores of 0.4218 for Tamil and 0.3935 for Tulu, and test F1-scores of 0.4299 for Tamil and 0.1546 for Tulu, demonstrating its effectiveness. This research provides a scalable solution for sentiment classification in low-resource languages, with future improvements planned through data augmentation and transfer learning. Our full experimental codebase is publicly available at: github.com/ciol-researchlab/NAACL25-Mystic-Tamil-Sentiment-Analysis.
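
The pipeline reduces to two stages: encode with SBERT, classify with an MLP. A minimal sketch under assumed components (checkpoint, MLP shape, toy labels) is shown below; the authors' exact setup is in their repository.

    from sentence_transformers import SentenceTransformer
    from sklearn.neural_network import MLPClassifier

    sbert = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    train_texts = ["padam nalla irukku", "padam mosam", "super movie", "waste film"]
    train_labels = [1, 0, 1, 0]          # toy sentiment labels

    X = sbert.encode(train_texts)        # fixed-size sentence embeddings
    clf = MLPClassifier(hidden_layer_sizes=(256, 64), max_iter=500)
    clf.fit(X, train_labels)
    print(clf.predict(sbert.encode(["semma padam"])))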

pdf bib
KEC_AI_DATA_DRIFTERS@DravidianLangTech 2025: Fake News Detection in Dravidian Languages
Kogilavani Shanmugavadivel | Malliga Subramanian | Vishali K S | Priyanka B | Naveen Kumar K

Detecting fake news in Malayalam poses significant challenges due to linguistic diversity, code-mixing, and the limited availability of structured datasets. We participated in the Fake News Detection in Dravidian Languages shared task, classifying news and social media posts into binary and multi-class categories. Our experiments used traditional ML models (Support Vector Machine (SVM), Random Forest, Logistic Regression, and Naive Bayes) and transfer-learning models (multilingual BERT (mBERT) and XLNet). In binary classification, SVM achieved the highest macro-F1 score of 0.97, while in multi-class classification it also outperformed other models with a macro-F1 score of 0.98. Random Forest ranked second in both tasks. Despite their advanced capabilities, mBERT and XLNet exhibited lower precision due to data limitations. Our approach enhances fake news detection and NLP solutions for low-resource languages.

pdf bib
KECEmpower@DravidianLangTech 2025: Abusive Tamil and Malayalam Text targeting Women on Social Media
Malliga Subramanian | Kogilavani Shanmugavadivel | Indhuja V S | Kowshik P | Jayasurya S

The detection of abusive text targeting women, especially in Dravidian languages like Tamil and Malayalam, presents a unique challenge due to linguistic complexities and code-mixing on social media. This paper evaluates machine learning models such as Support Vector Machines (SVM), Logistic Regression (LR), and Random Forest Classifiers (RFC) for identifying abusive content. Code-mixed datasets sourced from platforms like YouTube are used to train and test the models. Performance is evaluated using accuracy, precision, recall, and F1-score metrics. Our findings show that SVM outperforms the other classifiers in accuracy and recall. However, challenges persist in detecting implicit abuse and addressing informal, culturally nuanced language. Future work will explore transformer-based models like BERT for better context understanding, along with data augmentation techniques to enhance model performance. Additionally, efforts will focus on expanding labeled datasets to improve abuse detection in these low-resource languages.

pdf bib
KEC_AI_GRYFFINDOR@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian languages
Kogilavani Shanmugavadivel | Malliga Subramanian | ShahidKhan S | Shri Sashmitha.s | Yashica S

It is difficult to detect hate speech in code-mixed Dravidian languages because the data is multilingual and unstructured. We took part in the shared task to detect hate speech in text and audio data for Tamil, Malayalam, and Telugu. We tested different machine learning and deep learning models, such as Logistic Regression, Ridge Classifier, Random Forest, and CNN. For Tamil, Logistic Regression gave the best macro-F1 score of 0.97 for text, whereas Ridge Classifier was the best for audio with a score of 0.75. For Malayalam, Random Forest gave the best F1-score of 0.97 for text, and CNN was the best for audio (F1-score: 0.69). For Telugu, Ridge Classifier gave the best F1-score of 0.89 for text, whereas CNN was the best for audio (F1-score: 0.87). Our findings show that a multimodal solution efficiently tackles the intricacy of hate speech detection in Dravidian languages. In this shared task, out of 145 teams, we attained 12th rank for Tamil and 7th rank for both Malayalam and Telugu.

pdf bib
KECLinguAIsts@DravidianLangTech 2025: Detecting AI-generated Product Reviews in Dravidian Languages
Malliga Subramanian | Rojitha R | Mithun Chakravarthy Y | Renusri R V | Kogilavani Shanmugavadivel

With the surge of AI-generated content in online spaces, ensuring the authenticity of product reviews has become a critical challenge. This paper addresses the task of detecting AI-generated product reviews in Dravidian languages, specifically Tamil and Malayalam, which present unique hurdles due to their complex morphology, rich syntactic structures, and code-mixed nature. We introduce a novel methodology combining machine learning classifiers with advanced multilingual transformer models to identify AI-generated reviews. Our approach not only accounts for the linguistic intricacies of these languages but also leverages domain-specific datasets to improve detection accuracy. For Tamil, we evaluate Logistic Regression, Random Forest, and XGBoost, while for Malayalam, we explore Logistic Regression, Multinomial Naive Bayes (MNB), and Support Vector Machines (SVM). Transformer-based models significantly outperform these traditional classifiers, demonstrating superior performance across multiple metrics.

pdf bib
Dll5143@DravidianLangTech 2025: Majority Voting-Based Framework for Misogyny Meme Detection in Tamil and Malayalam
Sarbajeet Pattanaik | Ashok Yadav | Vrijendra Singh

Misogyny memes pose a significant challenge on social networks, particularly in Dravidian-scripted languages, where subtle expressions can propagate harmful narratives against women. This paper presents our approach for the “Shared Task on Misogyny Meme Detection,” organized as part of DravidianLangTech@NAACL 2025, focusing on misogyny meme detection in Tamil and Malayalam. To tackle this problem, we proposed a multi-model framework that integrates three distinct models: M1 (ResNet-50 + google/muril-large-cased), M2 (openai/clip-vit-base-patch32 + ai4bharat/indic-bert), and M3 (ResNet-50 + ai4bharat/indic-bert). The final classification is determined using a majority voting mechanism, ensuring robustness by leveraging the complementary strengths of these models. This approach enhances classification performance by reducing biases and improving generalization. Our model achieved an F1 score of 0.77 for Tamil, significantly improving misogyny detection in the language. For Malayalam, the framework achieved an F1 score of 0.84, demonstrating strong performance. Overall, our method ranked 5th in Tamil and 4th in Malayalam, highlighting its competitive effectiveness in misogyny meme detection.

pdf bib
KEC_AI_VSS_run2@DravidianLangTech 2025: Abusive Tamil and Malayalam Text targeting Women on Social Media
Kogilavani Shanmugavadivel | Malliga Subramanian | Sathiyaseelan S | Suresh Babu K | Vasikaran S

The increasing instances of abusive language against women on social media platforms have brought to the fore the need for effective content moderation systems, especially in low-resource languages like Tamil and Malayalam. This paper addresses the challenge of detecting gender-based abuse in YouTube comments using annotated datasets in these languages. Comments are classified into abusive and non-abusive categories. We applied the following machine learning algorithms for classification: Random Forest, Support Vector Machine, K-Nearest Neighbor, Gradient Boosting, and AdaBoost. SVM achieved a micro F1 score of 0.95 for Tamil, and Random Forest achieved 0.72 for Malayalam. Our system participated in the shared task on abusive comment detection, ranking 13th for Malayalam and 34th for Tamil out of 160 teams; these results indicate both the challenges and the potential of our approach in low-resource language processing. Our findings highlight the significance of tailored approaches to language-specific abuse detection.

pdf bib
The_Deathly_Hallows@DravidianLangTech 2025: AI Content Detection in Dravidian Languages
Kogilavani Shanmugavadivel | Malliga Subramanian | Vasantharan K | Prethish G A | Vijayakumaran S

The DravidianLangTech@NAACL 2025 shared task focused on Detecting AI-generated Product Reviews in Dravidian Languages, aiming to address the challenge of distinguishing AI-generated content from human-written reviews in Tamil and Malayalam. As AI-generated text becomes more prevalent, ensuring the authenticity of online product reviews is crucial for maintaining consumer trust and preventing misinformation. In this study, we explore various feature extraction techniques, including TF-IDF, Count Vectorizer, and transformer-based embeddings such as BERT-Base-Multilingual-Cased and XLM-RoBERTa-Large, to build a robust classification model. Our approach achieved F1-scores of 0.9298 for Tamil and 0.8797 for Malayalam, ranking 8th in Tamil and 11th in Malayalam among all participants. The results highlight the effectiveness of transformer-based embeddings in differentiating AI-generated and human-written content. This research contributes to the growing body of work on AI-generated content detection, particularly in underrepresented Dravidian languages, and provides insights into the challenges unique to these languages.

pdf bib
SSN_MMHS@DravidianLangTech 2025: A Dual Transformer Approach for Multimodal Hate Speech Detection in Dravidian Languages
Jahnavi Murali | Rajalakshmi Sivanaiah

The proliferation of the Internet and social media platforms has resulted in an alarming increase in online hate speech, negatively affecting individuals and communities worldwide. While most research focuses on text-based detection in English, there is an increasing demand for multilingual and multimodal approaches to address hate speech more effectively. This paper presents a methodology for multiclass hate speech classification in low-resource Indian languages, namely Malayalam, Telugu, and Tamil, as part of the shared task at DravidianLangTech 2025. Our proposed approach employs a dual transformer-based framework that integrates audio and text modalities, facilitating cross-modal learning to enhance detection capabilities. Our model achieved macro-F1 scores of 0.348, 0.1631, and 0.1271 in the Malayalam, Telugu, and Tamil subtasks, respectively. Although the framework’s performance is modest, it provides valuable insights into the complexities of multimodal hate speech detection in low-resource settings and highlights areas for future improvement, including data augmentation and alternate fusion and feature extraction techniques.

pdf bib
InnovateX@DravidianLangTech 2025: Detecting AI-Generated Product Reviews in Dravidian Languages
Moogambigai A | Pandiarajan D | Bharathi B

This paper presents our approach to the Shared Task on Detecting AI-Generated Product Reviews in Dravidian Languages as part of DravidianLangTech@NAACL 2025. The task focuses on distinguishing between human-written and AI-generated reviews in Tamil and Malayalam, languages rich in linguistic complexities. Using the provided datasets, we implemented machine learning and deep learning models, including Logistic Regression (LR), Support Vector Machine (SVM), and BERT. Through preprocessing techniques like tokenization and TF-IDF vectorization, we achieved competitive results, with our SVM and BERT models demonstrating superior performance in Tamil and Malayalam respectively. Our findings underscore the unique challenges of working with Dravidian languages in this domain and highlight the importance of robust feature extraction.

pdf bib
KSK@DravidianLangTech 2025: Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments Using Incremental Learning
Kalaivani K S | Sanjay R | Thissyakkanna S M | Nirenjhanram S K

The introduction of Jio in India has significantly increased the number of social media users, particularly on platforms like X (Twitter), Facebook, and Instagram. While this growth is positive, it has also led to a rise in native-language content, making social media analysis more complex. In this study, we focus on Tamil, a Dravidian language, and aim to classify social media comments from X (Twitter) into seven different categories. Tamil-speaking users often communicate using a mix of Tamil and English, creating unique challenges for analysis and tracking. This surge in diverse language usage on social media highlights the need for robust sentiment analysis tools to keep the platform accessible and user-friendly for people of all political opinions. We trained four machine learning models, an SGD Classifier, a Random Forest Classifier, a Decision Tree, and a Multinomial Naive Bayes classifier, to identify and classify the comments. Among these, the SGD Classifier achieved the best performance, with a training accuracy of 83.67% and a validation accuracy of 80.43%.
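
Incremental learning, as in the title, means the classifier is updated batch by batch rather than refit from scratch. A minimal sketch with scikit-learn's partial_fit follows; the hashed features and the toy comment stream are illustrative assumptions.

    import numpy as np
    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    classes = np.arange(7)                  # the seven sentiment categories
    vec = HashingVectorizer(n_features=2**18)   # stateless across batches
    clf = SGDClassifier(loss="log_loss")

    def stream_batches():
        # stand-in for reading the tweet corpus in chunks
        yield ["oru nalla comment", "very bad politics"], [0, 1]
        yield ["semma speech", "waste of time"], [0, 1]

    for texts, labels in stream_batches():
        # partial_fit updates the model incrementally on each batch
        clf.partial_fit(vec.transform(texts), labels, classes=classes)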

pdf bib
BlueRay@DravidianLangTech-2025: Fake News Detection in Dravidian Languages
Kogilavani Shanmugavadivel | Malliga Subramanian | Aiswarya M | Aruna T | Jeevaananth S

The rise of fake news presents significant issues, particularly for underrepresented languages. This study tackles fake news identification in Dravidian languages with two subtasks: binary classification of YouTube comments and multi-class classification of Malayalam news into five groups. The methodology covers text preprocessing, vectorization, and transformer-based embeddings, including baseline comparisons using classic machine learning, deep learning, and transfer-learning models. In Task 1, our solution placed 17th, displaying acceptable binary classification performance. In Task 2, we finished in eighth place by effectively identifying nuanced categories of Malayalam news, demonstrating the efficacy of transformer-based models.

pdf bib
KEC_AI_ZEROWATTS@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian languages
Kogilavani Shanmugavadivel | Malliga Subramanian | Naveenram C E | Vishal Rs | Srinesh S

Hate speech detection in code-mixed Dravidian languages presents significant challenges due to the multilingual and unstructured nature of the data. In this work, we participated in the shared task to detect hate speech in Tamil, Malayalam, and Telugu using both text and audio data. We explored various machine learning models, including Logistic Regression, Ridge Classifier, Random Forest, and Convolutional Neural Networks (CNN). For Tamil text data, Logistic Regression achieved the highest macro-F1 score of 0.97, while Ridge Classifier performed best for audio with 0.75. In Malayalam, Random Forest excelled for text with 0.97, and CNN for audio with 0.69. For Telugu, Ridge Classifier achieved 0.89 for text and CNN 0.87 for audio. These results demonstrate the efficacy of our multimodal approach in addressing the complexity of hate speech detection across the Dravidian languages. Among 145 teams, we ranked 11th for Tamil, 6th for Malayalam, and 8th for Telugu.

pdf bib
MNLP@DravidianLangTech 2025: A Deep Multimodal Neural Network for Hate Speech Detection in Dravidian Languages
Shraddha Chauhan | Abhinav Kumar

Social media hate speech is a significant issue because it may incite violence, discrimination, and social unrest. The anonymity and reach of such platforms enable the rapid spread of harmful content targeting individuals or communities based on race, gender, religion, or other attributes. Detecting hate speech is very important for creating safe online environments, protecting marginalized groups, and complying with legal and ethical standards. This paper analyzes complex social media content using a combination of textual and audio features. The experimental results establish the effectiveness of the proposed approach, with F1-scores reaching 72% for Tamil, 77% for Malayalam, and 36% for Telugu. These results strongly indicate that multimodal methodologies have significant room for improvement in hate speech detection for resource-constrained languages and underscore the need for continued research in this critical area.

pdf bib
MSM_CUET@DravidianLangTech 2025: XLM-BERT and MuRIL Based Transformer Models for Detection of Abusive Tamil and Malayalam Text Targeting Women on Social Media
Md Mizanur Rahman | Srijita Dhar | Md Mehedi Hasan | Hasan Murad

Social media has evolved into an excellent platform for presenting ideas, viewpoints, and experiences in modern society. However, this large domain has also brought alarming problems, including internet misuse. Abusive language targeted at specific groups, such as women, is pervasive on social media, and detecting abusive text is always difficult for low-resource languages like Tamil, Malayalam, and other Dravidian languages. It is crucial to address this issue seriously, especially for Dravidian languages. This paper presents a novel approach to detecting abusive Tamil and Malayalam texts targeting women on social media. A shared task on Abusive Tamil and Malayalam Text Targeting Women on Social Media Detection has been organized by DravidianLangTech at NAACL 2025, and the organizers provided an annotated dataset with two labels: Abusive and Non-Abusive. We experimented with different transformer-based models, including XLM-R, MuRIL, IndicBERT, and mBERT, as well as ensemble methods with SVM and Random Forest. We selected XLM-RoBERTa for Tamil text and MuRIL for Malayalam text due to their superior performance compared to the other models. After developing our model, we tested and evaluated it on the DravidianLangTech@NAACL 2025 shared task dataset. XLM-R provided the best result for abusive Tamil text detection with an F1 score of 0.7873 on the test set, ranking 2nd among all participants, while MuRIL provided the best result for abusive Malayalam text detection with an F1 score of 0.6812, ranking 10th.

pdf bib
MNLP@DravidianLangTech 2025: Transformer-based Multimodal Framework for Misogyny Meme Detection
Shraddha Chauhan | Abhinav Kumar

A meme is essentially an artefact of content, usually an amalgamation of a picture, text, or video, that spreads like wildfire on the internet, usually shared for amusement, cultural expression, or commentary. Memes are much like an inside joke or a cultural snapshot reflecting shared ideas, emotions, or social commentary, remodulated and reformed by communities. Some of them carry harmful content, such as misogyny. A misogynistic meme is social commentary that espouses negative stereotypes, prejudice, or hatred against women. Detecting and addressing such content helps make the online space inclusive and respectful. This work focuses on developing a multimodal approach for categorizing misogynistic and non-misogynistic memes, using a pretrained XLM-RoBERTa to extract text features and a Vision Transformer to extract image features. The combined text and image features are processed by machine learning and deep learning models, which attained F1-scores of 0.77 and 0.88 for Tamil and Malayalam, respectively, on the Misogyny Meme dataset.

pdf bib
Code_Conquerors@DravidianLangTech 2025: Deep Learning Approach for Sentiment Analysis in Tamil and Tulu
Harish Vijay V | Ippatapu Venkata Srichandra | Pathange Omkareshwara Rao | Premjith B

In this paper, we propose a novel approach to sentiment analysis in code-mixed Dravidian languages, specifically Tamil-English and Tulu-English social media text. We introduce a hybrid deep learning architecture that combines convolutional and recurrent neural networks to effectively capture both local patterns and long-term dependencies in code-mixed text. Our model addresses critical challenges in low-resource language processing through a comprehensive preprocessing pipeline and specialized handling of class imbalance and out-of-vocabulary words. Evaluated on a substantial dataset of social media comments, our approach achieved competitive macro F1 scores of 0.3357 for Tamil (ranked 18th) and 0.3628 for Tulu (ranked 13th).

pdf bib
KEC_TECH_TITANS@DravidianLangTech 2025: Abusive Text Detection in Tamil and Malayalam Social Media Comments Using Machine Learning
Malliga Subramanian | Kogilavani Shanmugavadivel | Deepiga P | Dharshini S | Ananthakumar S | Praveenkumar C

Social media platforms have become a breeding ground for hostility and toxicity, with abusive language targeting women becoming a pervasive issue. This paper addresses the detection of abusive content in Tamil and Malayalam social media comments using machine learning models. We experimented with GRU, LSTM, Bidirectional LSTM, CNN, FastText, and XGBoost models, evaluating their performance on a code-mixed dataset of Tamil and Malayalam comments collected from YouTube. Our findings demonstrate that FastText and CNN models yielded the best performance among the evaluated classifiers, achieving F1-scores of 0.73 each. This study contributes to the ongoing research on abusive text detection for under-resourced languages and highlights the need for robust, scalable solutions to combat online toxicity.
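
For the FastText side of such a comparison, a minimal supervised-training sketch is shown below; the file path, label scheme, and hyperparameters are illustrative assumptions rather than the authors' settings.

    import fasttext

    # train.txt: one comment per line, prefixed with its label, e.g.
    # "__label__Abusive <comment text>" or "__label__Non-Abusive <comment text>"
    model = fasttext.train_supervised(
        input="train.txt",
        lr=0.5, epoch=25,
        wordNgrams=2,      # bigrams help with code-mixed word order
        minn=2, maxn=5,    # subword n-grams for Tamil/Malayalam morphology
    )
    labels, probs = model.predict("oru sample comment")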

pdf bib
F2 (FutureFiction): Detection of Fake News on Futuristic Technology
Msvpj Sathvik | Venkatesh Velugubantla | Ravi Teja Potla

Misinformation about futuristic technology and society is widespread, and accurately detecting such news requires up-to-date knowledge. Large Language Models excel at NLP but cannot retrieve ongoing events or innovations; for example, GPT and its variants are limited to knowledge available up to 2021. We introduce a new methodology for identifying fake news pertaining to futuristic technology and society. Leveraging the power of Google Knowledge, we enhance the capabilities of the GPT-3.5 language model, thereby elevating its performance in the detection of misinformation. The proposed framework exhibits superior efficacy compared to established baselines, with an accuracy of 81.04%. Moreover, we propose a novel dataset of around 21,000 fake news items in three languages (English, Telugu, and Tenglish) drawn from various sources.

pdf bib
JustATalentedTeam@DravidianLangTech 2025: A Study of ML and DL approaches for Sentiment Analysis in Code-Mixed Tamil and Tulu Texts
Ponsubash Raj R | Paruvatha Priya B | Bharathi B

The growing prevalence of code-mixed text on social media presents unique challenges for sentiment analysis, particularly in low-resource languages like Tamil and Tulu. This paper explores sentiment classification in Tamil-English and Tulu-English code-mixed datasets using both machine learning (ML) and deep learning (DL) approaches. The ML model utilizes TF-IDF feature extraction combined with a Logistic Regression classifier, while the DL model employs FastText embeddings and a BiLSTM network enhanced with an attention mechanism. Experimental results reveal that the ML model outperforms the DL model in terms of macro F1-score for both languages. Specifically, for Tamil, the ML model achieves a macro F1-score of 0.46, surpassing the DL model’s score of 0.43. For Tulu, the ML model significantly outperforms the DL model, achieving 0.60 compared to 0.48. This performance disparity is more pronounced in Tulu due to its smaller dataset size of 13,308 samples compared to Tamil’s 31,122 samples, highlighting the data efficiency of ML models in low-resource settings. The study provides insights into the strengths and limitations of each approach, demonstrating that traditional ML techniques remain competitive for code-mixed sentiment analysis when data is limited. These findings contribute to ongoing research in multilingual NLP and offer practical implications for applications such as social media monitoring, customer feedback analysis, and conversational AI in Dravidian languages.

pdf bib
KEC_TECH_TITANS@DravidianLangTech 2025:Sentiment Analysis for Low-Resource Languages: Insights from Tamil and Tulu using Deep Learning and Machine Learning Models
Malliga Subramanian | Kogilavani Shanmugavadivel | Dharshini S | Deepiga P | Praveenkumar C | Ananthakumar S

Sentiment analysis in Dravidian languages like Tamil and Tulu presents significant challenges due to their linguistic diversity and limited resources for natural language processing (NLP). This study explores sentiment classification for Tamil and Tulu, focusing on the complexities of handling both languages, which differ in script, grammar, and vocabulary. We employ a variety of machine learning and deep learning techniques, including traditional models like Support Vector Machines (SVM), and K-Nearest Neighbors (KNN), as well as advanced transformer-based models like BERT and multilingual BERT (mBERT). A key focus of this research is to evaluate the performance of these models on sentiment analysis tasks, considering metrics such as accuracy, precision, recall, and F1-score. The results show that transformer-based models, particularly mBERT, significantly outperform traditional machine learning models in both Tamil and Tulu sentiment classification. This study also highlights the need for further research into addressing challenges like language-specific nuances, dataset imbalance, and data augmentation techniques for improved sentiment analysis in under-resourced languages like Tamil and Tulu.

pdf bib
Code_Conquerors@DravidianLangTech 2025: Multimodal Misogyny Detection in Dravidian Languages Using Vision Transformer and BERT
Pathange Omkareshwara Rao | Harish Vijay V | Ippatapu Venkata Srichandra | Neethu Mohan | Sachin Kumar S

This research focuses on misogyny detection in Dravidian languages using multimodal techniques. It leverages advanced machine learning models, including Vision Transformers (ViT) for image analysis and BERT-based transformers for text processing. The study highlights the challenges of working with regional datasets and addresses these with innovative preprocessing and model training strategies. The evaluation reveals significant improvements in detection accuracy, showcasing the potential of multimodal approaches in combating online abuse in underrepresented languages.

pdf bib
YenLP_CS@DravidianLangTech 2025: Sentiment Analysis on Code-Mixed Tamil-Tulu Data Using Machine Learning and Deep Learning Models
Raksha Adyanthaya | Rathnakara Shetty P

Sentiment analysis in code-mixed Dravidian languages such as Tamil-English and Tulu-English is the focus of this study, because these languages present difficulties for conventional techniques. In this work, we used ensembles, multilingual BERT (mBERT), Bidirectional Long Short-Term Memory (BiLSTM), Random Forest (RF), and Support Vector Machine (SVM) models, with preprocessing and feature extraction via Term Frequency-Inverse Document Frequency (TF-IDF) and Word2Vec. mBERT obtained accuracies of 64% for Tamil and 68% for Tulu on the development datasets. On the test sets, the ensemble model gave Tamil a macro F1-score of 0.4117, while mBERT gave Tulu a macro F1-score of 0.5511. These results demonstrate the approach’s potential, with regularization and data augmentation as avenues for further improvement.

pdf bib
LinguAIsts@DravidianLangTech 2025: Abusive Tamil and Malayalam Text targeting Women on Social Media
Dhanyashree G | Kalpana K | Lekhashree A | Arivuchudar K | Arthi R | Bommineni Sahitya | Pavithra J | Sandra Johnson

Social media sites have become crucial platforms for communication and interaction, yet they are increasingly being utilized to commit gender-based abuse, with horrific, harassing, and degrading comments targeted at women. This paper addresses the pervasive issue of women being subjected to abusive language in two South Indian languages, Malayalam and Tamil. To find explicit abuse, implicit bias, preconceptions, and coded language, we were given a set of YouTube comments labeled Abusive and Non-Abusive. To solve this problem, we applied and compared different machine learning models, i.e., Support Vector Machines (SVM), Logistic Regression (LR), and Naive Bayes classifiers, to classify comments into the given categories. The models were trained and validated on the given dataset to achieve the best performance with respect to accuracy and macro F1 score. The proposed solutions aim to build robust content moderation systems that can detect and prevent abusive language, ensuring safer online environments for women.

pdf bib
KEC-Elite-Analysts@DravidianLangTech 2025: Deciphering Emotions in Tamil-English and Code-Mixed Social Media Tweets
Malliga Subramanian | Aruna A | Anbarasan T | Amudhavan M | Jahaganapathi S | Kogilavani Shanmugavadivel

Sentiment analysis in code-mixed languages, particularly Tamil-English, is a growing challenge in natural language processing (NLP) due to the prevalence of multilingual communities on social media. This paper explores various machine learning and transformer-based models, including Logistic Regression, Support Vector Machines (SVM), K-Nearest Neighbors (KNN), BERT, and mBERT, for sentiment classification of Tamil-English code-mixed text. The models are evaluated on a shared task dataset provided by DravidianLangTech@NAACL 2025, with performance measured through accuracy, precision, recall, and F1-score. Our results demonstrate that transformer-based models, particularly mBERT, outperform traditional classifiers in identifying sentiment polarity. Future work aims to address the challenges posed by code-switching and class imbalance through advanced model architectures and data augmentation techniques.

pdf bib
Cyber Protectors@DravidianLangTech 2025: Abusive Tamil and Malayalam Text Targeting Women on Social Media using FastText
Rohit Vp | Madhav M | Ippatapu Venkata Srichandra | Neethu Mohan | Sachin Kumar S

Social media has transformed communication, but it has also opened new avenues for abuse against women. Because of the complex morphology, large vocabulary, and frequent code-mixing of Tamil and Malayalam, identifying discriminatory text in such linguistically diverse settings is especially challenging. Traditional moderation systems frequently miss these linguistic subtleties, so gendered abuse in many forms, from outright threats to character insults and body shaming, continues. In addition to examining the sociocultural characteristics of this type of harassment on social media, this study compares the effectiveness of several Natural Language Processing (NLP) models, namely FastText, transformer-based architectures, and BiLSTM. Our results show that FastText achieved a macro F1 score of 0.74 on the Tamil dataset and 0.64 on the Malayalam dataset, outperforming the Transformer model (macro F1 of 0.62) and BiLSTM (0.57). By addressing the limitations of existing moderation techniques, this research underscores the urgent need for language-specific AI solutions to foster safer digital spaces for women.

pdf bib
LinguAIsts@DravidianLangTech 2025: Misogyny Meme Detection using multimodel Approach
Arthi R | Pavithra J | Dr G Manikandan | Lekhashree A | Dhanyashree G | Bommineni Sahitya | Arivuchudar K | Kalpana K

Memes often disseminate misogynistic material, which nurtures gender discrimination and stereotyping. While social media is an effective tool of communication, it has also provided fertile ground for online abuse. The Misogyny Meme Detection Shared Task tackles this vital issue in a multilingual and multimodal setting. Our method employs advanced NLP techniques and machine learning models to classify memes in Malayalam and Tamil, two low-resource languages. Text preprocessing includes tokenization, lemmatization, and stop-word removal, after which features are extracted using TF-IDF. With the best achievable hyperparameters, our SVM-based system produced very promising outcomes, ranking 9th among competing systems in the Tamil task with a 0.71259 F1-score and 15th in the Malayalam task with an F1-score of 0.68186. This work demonstrates the importance of AI-based solutions in stopping online harassment and developing secure online spaces.

pdf bib
CUET_Agile@DravidianLangTech 2025: Fine-tuning Transformers for Detecting Abusive Text Targeting Women from Tamil and Malayalam Texts
Tareque Md Hanif | Md Rashadur Rahman

As social media has grown, so has online abuse, with women often facing harmful online behavior. This discourages their free participation and expression online. This paper outlines the approach adopted by our team for detecting abusive comments in Tamil and Malayalam. The task focuses on classifying whether a given comment contains abusive language towards women. We experimented with transformer-based models by fine-tuning Tamil-BERT for Tamil and Malayalam-BERT for Malayalam. Additionally, we fine-tuned IndicBERT v2 on both the Tamil and Malayalam datasets. To evaluate the effect of pre-processing, we also conducted experiments using non-preprocessed text. Results demonstrate that IndicBERT v2 outperformed the language-specific BERT models in both languages. Pre-processing the data showed mixed results, with a slight improvement on the Tamil dataset but no significant benefit for the Malayalam dataset. Our approach secured first place in Tamil with a macro F1-score of 0.7883 and second place in Malayalam with a macro F1-score of 0.7234. The implementation details of the task can be found in the GitHub repository.
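
A minimal sketch of this kind of fine-tuning recipe with the Hugging Face Trainer follows; the checkpoint id, toy dataset, and hyperparameters are assumptions for illustration, not the authors' configuration.

    from datasets import Dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    # checkpoint id assumed for illustration
    ckpt = "ai4bharat/IndicBERTv2-MLM-only"
    tok = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)

    # toy stand-in for the annotated abusive/non-abusive comments
    ds = Dataset.from_dict({"text": ["comment one", "comment two"],
                            "label": [0, 1]})
    ds = ds.map(lambda b: tok(b["text"], truncation=True,
                              padding="max_length", max_length=128),
                batched=True)

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out", num_train_epochs=3,
                               per_device_train_batch_size=16),
        train_dataset=ds,
    )
    trainer.train()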

pdf bib
Necto@DravidianLangTech 2025: Fine-tuning Multilingual MiniLM for Text Classification in Dravidian Languages
Livin Nector Dhasan

This paper explores the application of a fine-tuned Multilingual MiniLM model to various binary text classification tasks, including AI-generated product review detection, detection of abusive text targeting women, and fake news detection, in the Dravidian languages Tamil and Malayalam. This work was done as part of submissions to shared tasks organized by DravidianLangTech@NAACL 2025. The model was fine-tuned using both Tamil and Malayalam datasets, and its performance was evaluated across the different tasks using the macro F1-score. The results indicate that this model performs very close to the best F1 scores reported by other teams. An investigation of the AI-generated product review dataset is also conducted and the findings are reported.

pdf bib
CUET-823@DravidianLangTech 2025: Shared Task on Multimodal Misogyny Meme Detection in Tamil Language
Arpita Mallik | Ratnajit Dhar | Udoy Das | Momtazul Arefin Labib | Samia Rahman | Hasan Murad

Misogynous content on social media, especially in memes, presents challenges due to the complex interplay of text and images that carry offensive messages. This difficulty mostly arises from the lack of direct alignment between modalities and from biases in large-scale visio-linguistic models. In this paper, we present our system for the Shared Task on Misogyny Meme Detection - DravidianLangTech@NAACL 2025. We implemented various unimodal models, such as mBERT and IndicBERT for text data, and ViT, ResNet, and EfficientNet for image data. Moreover, we tried combining these models and finally adopted a multimodal approach that fuses mBERT for text and EfficientNet for image features, both fine-tuned to better interpret subtle language and detailed visuals. The fused features are processed through a dense neural network for classification. Our approach achieved an F1 score of 0.78120, securing 4th place and demonstrating the potential of transformer-based architectures and state-of-the-art CNNs for this task.

pdf bib
Hermes@DravidianLangTech 2025: Sentiment Analysis of Dravidian Languages using XLM-RoBERTa
Emmanuel George P | Ashiq Firoz | Madhav Murali | Siranjeevi Rajamanickam | Balasubramanian Palani

Sentiment analysis, the task of identifying subjective opinions or emotional responses, has become increasingly significant with the rise of social media. However, analysing sentiment in Dravidian languages such as Tamil-English and Tulu-English presents unique challenges due to linguistic code-switching (where people tend to mix multiple languages) and non-native scripts. Traditional monolingual sentiment analysis models struggle to address these complexities effectively. This research explores a fine-tuned transformer based on XLM-RoBERTa for sentiment detection, using the XLM-RoBERTa tokenizer for text preprocessing. Additionally, the performance of the XLM-RoBERTa model was compared with traditional machine learning models such as Logistic Regression (LR) and Random Forest (RF), as well as other transformer-based models like BERT and RoBERTa. This work was carried out for the Sentiment Analysis in Tamil and Tulu competition at DravidianLangTech@NAACL 2025, where we received a macro F1-score of 59% for the Tulu dataset and 49% for the Tamil dataset, placing third in the competition.

pdf bib
SSNTrio@DravidianLangTech 2025: Identification of AI Generated Content in Dravidian Languages using Transformers
J Bhuvana | Mirnalinee T T | Rohan R | Diya Seshan | Avaneesh Koushik

The increasing prevalence of AI-generated content has raised concerns about the authenticity and reliability of online reviews, particularly in resource-limited languages like Tamil and Malayalam. This paper presents an approach to the Shared Task on Detecting AI-generated Product Reviews in Dravidian Languages at NAACL 2025, which focuses on distinguishing AI-generated reviews from human-written ones in Tamil and Malayalam. Several transformer-based models, including IndicBERT, RoBERTa, mBERT, and XLM-R, were evaluated, with language-specific BERT models for Tamil and Malayalam demonstrating the best performance. The chosen methodologies were evaluated using the Macro Average F1 score. In the rank list released by the organizers, team SSNTrio achieved 3rd and 29th place for the Malayalam and Tamil datasets with Macro Average F1 scores of 0.914 and 0.598, respectively.

pdf bib
SSNTrio@DravidianLangTech 2025: Sentiment Analysis in Dravidian Languages using Multilingual BERT
J Bhuvana | Mirnalinee T T | Diya Seshan | Rohan R | Avaneesh Koushik

This paper presents an approach to sentiment analysis for code-mixed Tamil-English and Tulu-English datasets as part of the DravidianLangTech@NAACL 2025 shared task. Sentiment analysis, the process of determining the emotional tone or subjective opinion in text, has become a critical tool in analyzing public sentiment on social media platforms. The approach discussed here uses multilingual BERT (mBERT) fine-tuned on the provided datasets to classify sentiment polarity into various predefined categories: for Tulu, the categories were positive, negative, not_tulu, mixed, and neutral; for Tamil, the categories were positive, negative, unknown, mixed_feelings, and neutral. The mBERT model demonstrates its effectiveness in handling sentiment analysis for codemixed and resource-constrained languages by achieving an F1-score of 0.44 for Tamil, securing the 6th position in the ranklist; and 0.56 for Tulu, ranking 5th in the respective task.

pdf bib
NLP_goats@DravidianLangTech 2025: Detecting Fake News in Dravidian Languages: A Text Classification Approach
Srihari V K | Vijay Karthick Vaidyanathan | Thenmozhi Durairaj

The advent and expansion of social media have transformed global communication. Despite its numerous advantages, it has also created an avenue for the rapid spread of fake news, which can impact people’s decision-making and judgment. This study explores detecting fake news as part of the DravidianLangTech@NAACL 2025 shared task, focusing on two key tasks. Task 1 classifies Malayalam social media posts as either original or fake, and Task 2 categorizes Malayalam-language news articles into five levels of truthfulness: False, Half True, Mostly False, Partly False, and Mostly True. We accomplished the tasks using transformer models, e.g., M-BERT, and classifiers like Naive Bayes. Our results were promising, with M-BERT achieving the best performance. We achieved a macro-F1 score of 0.83 for distinguishing between fake and original content in Task 1 and a score of 0.54 for classifying news articles in Task 2, ranking us 11th and 4th, respectively.

pdf bib
NLP_goats@DravidianLangTech 2025: Towards Safer Social Media: Detecting Abusive Language Directed at Women in Dravidian Languages
Vijay Karthick Vaidyanathan | Srihari V K | Thenmozhi Durairaj

Social media in the present world is an essential communication platform for information sharing, but its emergence has also led to an increase in online abuse, particularly against women, in the form of abusive and offensive messages. Reflecting broader social inequalities, such abuse has a profound psychological and social impact on its victims, underscoring the importance of detecting abusive language. This work for DravidianLangTech@NAACL 2025 aims at developing an automated system for detecting abusive content directed at women in Tamil and Malayalam, two of the Dravidian languages. Based on a dataset of YouTube comments about sensitive issues, the study uses multilingual BERT (mBERT) to distinguish abusive comments from non-abusive ones. We achieved F1 scores of 0.75 in Tamil and 0.68 in Malayalam, placing us 13th and 9th respectively.

pdf bib
HerWILL@DravidianLangTech 2025: Ensemble Approach for Misogyny Detection in Memes Using Pre-trained Text and Vision Transformers
Neelima Monjusha Preeti | Trina Chakraborty | Noor Mairukh Khan Arnob | Saiyara Mahmud | Azmine Toushik Wasi

Misogynistic memes on social media perpetuate gender stereotypes, contribute to harassment, and suppress feminist activism. However, most existing misogyny detection models focus on high-resource languages, leaving a gap in low-resource settings. This work addresses that gap by focusing on misogynistic memes in Tamil and Malayalam, two Dravidian languages with limited resources. We combine computer vision and natural language processing for multi-modal detection, using CLIP embeddings for the vision component and BERT models trained on code-mixed hate speech datasets for the text component. Our results show that this integrated approach effectively captures the unique characteristics of misogynistic memes in these languages, achieving competitive performance with a Macro F1 Score of 0.7800 for the Tamil test set and 0.8748 for the Malayalam test set. These findings highlight the potential of multimodal models and the adaptation of pre-trained models to specific linguistic and cultural contexts, advancing misogyny detection in low-resource settings. Code available at https://github.com/HerWILL-Inc/NAACL-2025
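
A minimal sketch of the two encoders' feature extraction follows; the exact checkpoints and the concatenation-based fusion are assumptions here (the authors' actual models are in the linked repository).

    import torch
    from PIL import Image
    from transformers import AutoModel, AutoTokenizer, CLIPModel, CLIPProcessor

    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    bert = AutoModel.from_pretrained("bert-base-multilingual-cased")
    bert_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

    image = Image.new("RGB", (224, 224))          # stand-in for a meme image
    img_inputs = clip_proc(images=image, return_tensors="pt")
    txt_inputs = bert_tok("meme overlay text", return_tensors="pt")

    with torch.no_grad():
        img_emb = clip.get_image_features(**img_inputs)        # (1, 512)
        txt_emb = bert(**txt_inputs).last_hidden_state[:, 0]   # (1, 768)

    fused = torch.cat([img_emb, txt_emb], dim=-1)  # input to a classifier head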

pdf bib
Cognitext@DravidianLangTech2025: Fake News Classification in Malayalam Using mBERT and LSTM
Shriya Alladi | Bharathi B

Fake news detection is a crucial task in combating misinformation, particularly in underrepresented languages such as Malayalam. This paper focuses on detecting fake news in Dravidian languages through two tasks: Social Media Text Classification and News Classification. We employ a fine-tuned multilingual BERT (mBERT) model for classifying a given social media text as original or fake, and an LSTM-based architecture for accurately detecting and classifying fake news articles in the Malayalam language into different categories. Extensive preprocessing techniques, such as tokenization and text cleaning, were used to ensure data quality. Our experiments achieved significant accuracy rates and F1-scores. The study's contributions include applying advanced machine learning techniques to the Malayalam language, addressing the lack of research on low-resource languages, and highlighting the challenges of fake news detection in multilingual and code-mixed environments.

pdf bib
NLP_goats@DravidianLangTech 2025: Detecting AI Written Reviews for Consumer Trust
Srihari V K | Vijay Karthick Vaidyanathan | Mugilkrishna D U | Thenmozhi Durairaj

The rise of AI-generated content has introduced challenges in distinguishing machine-generated text from human-written text, particularly in low-resource languages. The identification of artificial intelligence (AI)-based reviews is of significant importance to preserve trust and authenticity on online platforms. The Shared Task on Detecting AI-Generated Product Reviews in Dravidian languages deals with the task of detecting AI-generated and human-written reviews in Tamil and Malayalam. To solve this problem, we specifically fine-tuned mBERT for binary classification. Our system achieved 10th place in Tamil with a macro F1-score of 0.90 and 28th place in Malayalam with a macro F1-score of 0.68, as reported by the NAACL 2025 organizers. The findings demonstrate the complexity involved in the separation of AI-derived text from human-authored writing, with a call for continued advances in detection methods.

pdf bib
RATHAN@DravidianLangTech 2025: Annaparavai - Separate the Authentic Human Reviews from AI-generated ones
Jubeerathan Thevakumar | Luheerathan Thevakumar

Detecting AI-generated reviews is crucial for maintaining the authenticity of online feedback in low-resource languages like Tamil and Malayalam. We propose a transfer learning-based approach using embeddings from XLM-RoBERTa, IndicBERT, mT5, and Sentence-BERT, validated with five-fold cross-validation via XGBoost. These embeddings are used to train deep neural networks (DNNs), refined through a weighted ensemble model. Our method achieves 90% F1-score for Malayalam and 73% for Tamil, demonstrating the effectiveness of transfer learning and ensembling for review detection. The source code is publicly available to support further research and improve online review systems in multilingual settings.

pdf bib
DLRG@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages
Ratnavel Rajalakshmi | Ramesh Kannan | Meetesh Saini | Bitan Mallik

Social media is a powerful communication tool, rich in diverse content that requires innovative approaches to understand the nuances of the languages involved. Addressing challenges like hate speech necessitates multimodal analysis that integrates textual and other cues to capture its context and intent effectively. This paper proposes a multimodal hate speech detection system for Tamil, which uses textual and audio features for classification. Our proposed system uses a fine-tuned Indic-BERT model for text-based hate speech detection and a Wav2Vec2 model for audio-based hate speech detection. The fine-tuned Indic-BERT model with Whisper achieved an F1 score of 0.25 with the multimodal approach. Our proposed approach ranked 10th in the shared task on Multimodal Hate Speech Detection in Dravidian languages at the NAACL 2025 Workshop DravidianLangTech.

pdf bib
Team ML_Forge@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages
Adnan Faisal | Shiti Chowdhury | Sajib Bhattacharjee | Udoy Das | Samia Rahman | Momtazul Arefin Labib | Hasan Murad

Ensuring a safe and inclusive online environment requires effective hate speech detection on social media. While detection systems have significantly advanced for English, many regional languages, including Malayalam, Tamil, and Telugu, remain underrepresented, creating challenges in identifying harmful content accurately. These languages present unique challenges due to their complex grammar, diverse dialects, and frequent code-mixing with English. The rise of multimodal content, including text and audio, adds further complexity to detection tasks. The shared task "Multimodal Hate Speech Detection in Dravidian Languages: DravidianLangTech@NAACL 2025" aimed to address these challenges. A YouTube-sourced dataset was provided, labeled into five categories: Gender (G), Political (P), Religious (R), Personal Defamation (C), and Non-Hate (NH). In our approach, we used mBERT and T5 for text, and Wav2Vec2 and Whisper for audio. T5 performed poorly compared to mBERT, which achieved the highest F1 scores on the test dataset. For audio, Wav2Vec2 was chosen over Whisper because it processes raw audio effectively using self-supervised learning. In the hate speech detection task, we achieved a macro F1 score of 0.2005 for Malayalam (ranking 15th), 0.1356 for Tamil, and 0.1465 for Telugu (both ranking 16th).

pdf bib
codecrackers@DravidianLangTech 2025: Sentiment Classification in Tamil and Tulu Code-Mixed Social Media Text Using Machine Learning
Lalith Kishore V P | Dr G Manikandan | Mohan Raj M A | Keerthi Vasan A | Aravindh M

Sentiment analysis of code-mixed Dravidian languages has become a major area of concern with the increasing volume of multilingual and code-mixed content across social media. However, it has received little attention to date due to challenges such as class imbalance, small sample sizes, and the informal nature of code-mixed text. This paper presents our participation in the "Seventh Shared Task on Sentiment Analysis in Code-mixed Tamil and Tulu," held as part of DravidianLangTech (NAACL-2025). This study applied an SVM-based approach for sentiment classification in both Tamil and Tulu. The SVM model achieved competitive macro-average F1 scores of 0.54 for Tulu and 0.438 for Tamil, demonstrating that traditional machine learning methods can effectively tackle sentiment categorization in code-mixed languages under low-resource settings.

pdf bib
CUET_Ignite@DravidianLangTech 2025: Detection of Abusive Comments in Tamil Text Using Transformer Models
MD.Mahadi Rahman | Mohammad Minhaj Uddin | Mohammad Shamsul Arefin

Abusive comment detection in low-resource languages is a challenging task, particularly when addressing gender-based abuse. Identifying abusive language targeting women is crucial for effective content moderation and fostering safer online spaces. A shared task on abusive comment detection in Tamil text organized by DravidianLangTech@NAACL 2025 allowed us to address this challenge using a curated dataset. For this task, we experimented with various machine learning (ML) and deep learning (DL) models, including Logistic Regression, Random Forest, SVM, CNN, LSTM, and BiLSTM, as well as transformer-based models such as mBERT, IndicBERT, XLM-RoBERTa, and many more. The dataset comprised Tamil YouTube comments annotated with binary labels, Abusive and Non-Abusive, capturing explicit abuse, implicit biases, and stereotypes. Our experiments demonstrated that XLM-RoBERTa achieved the highest macro F1-score (0.80), highlighting its effectiveness in handling Tamil text. This research contributes to advancing abusive language detection and natural language processing in low-resource languages, particularly for addressing gender-based abuse online.

pdf bib
CUET_Absolute_Zero@DravidianLangTech 2025: Detecting AI-Generated Product Reviews in Malayalam and Tamil Language Using Transformer Models
Anindo Barua | Sidratul Muntaha | Momtazul Arefin Labib | Samia Rahman | Udoy Das | Hasan Murad

Artificial Intelligence (AI) is opening new doors for learning and interaction, but it also has its share of problems. One major issue is the ability of AI to generate text that resembles human-written text. So, how can we tell human-written text apart from AI-generated text? With this in mind, we have worked on detecting AI-generated product reviews in Dravidian languages, mainly Malayalam and Tamil. The "Shared Task on Detecting AI-Generated Product Reviews in Dravidian Languages," held as part of the DravidianLangTech Workshop at NAACL 2025, provided a dataset with two categories: human-written reviews and AI-generated reviews. We implemented four machine learning models (Random Forest, Support Vector Machine, Decision Tree, and XGBoost), four deep learning models (Long Short-Term Memory, Bidirectional Long Short-Term Memory, Gated Recurrent Unit, and Recurrent Neural Network), and three transformer-based models (AI-Human-Detector, Detect-AI-Text, and E5-Small-Lora-AI-Generated-Detector). We conducted a comparative study by training and evaluating each model on the dataset. We found that the transformer model E5-Small-Lora-AI-Generated-Detector provided the best result, with an F1 score of 0.8994 on the test set, ranking 7th in the Malayalam language. Tamil has a higher token overlap and richer morphology than Malayalam; accordingly, we obtained a lower F1 score of 0.5877, ranking 28th among all participants in the Tamil language.

pdf bib
MNLP@DravidianLangTech 2025: Transformers vs. Traditional Machine Learning: Analyzing Sentiment in Tamil Social Media Posts
Abhay Vishwakarma | Abhinav Kumar

Sentiment analysis in Natural Language Processing (NLP) aims to categorize opinions in text. In the political domain, understanding public sentiment is crucial for influencing policymaking. Social media platforms like X (Twitter) provide abundant sources of real-time political discourse. This study focuses on political multiclass sentiment analysis of Tamil comments from X, classifying sentiments into seven categories: substantiated, sarcastic, opinionated, positive, negative, neutral, and none of the above. A number of traditional machine learning models, such as Naive Bayes and a Voting Classifier (an ensemble of Decision Tree, SVM, Naive Bayes, K-Nearest Neighbors, and Logistic Regression), and deep learning models, such as LSTM, deBERTa, and a hybrid approach combining deBERTa embeddings with an LSTM layer, were implemented. The proposed ensemble-based voting classifier achieved the best performance among all implemented models, with an accuracy of 0.3750, precision of 0.3387, recall of 0.3250, and macro-F1-score of 0.3227.

pdf bib
shimig@DravidianLangTech2025: Stratification of Abusive content on Women in Social Media
Gersome Shimi | Jerin Mahibha C | Thenmozhi Durairaj

The social network is a trending medium for interaction and sharing content globally. Such content is sensitive, since it can create an impact and change the thoughts and behavior of stakeholders. When content is targeted towards women, it may be abusive or non-abusive, and its identification is a tedious task. Content posted on social networks can be in English, code-mixed text, or any low-resource language. The shared task Abusive Tamil and Malayalam Text Targeting Women on Social Media was conducted as part of DravidianLangTech@NAACL 2025. The task is to identify whether content given in Tamil, Malayalam, or code-mixed text is abusive or non-abusive. The task was accomplished for the South Indian languages Tamil and Malayalam using the pretrained transformer model BERT base multilingual cased, achieving accuracies of 0.765 and 0.677, respectively.

pdf bib
SSNTrio@DravidianLangTech2025: LLM Based Techniques for Detection of Abusive Text Targeting Women
Mirnalinee T T | J Bhuvana | Avaneesh Koushik | Diya Seshan | Rohan R

This study focuses on developing a solution for detecting abusive texts against women on social media in Tamil and Malayalam, two low-resource Dravidian languages of South India. As the use of social media for communication and idea sharing has increased significantly, these platforms are also being used to target and victimize women. An automated solution therefore becomes necessary to screen the huge volume of content generated. This work is part of the shared task on Abusive Tamil and Malayalam Text Targeting Women on Social Media at DravidianLangTech@NAACL 2025. The approach used to tackle this problem involves utilizing LLM-based techniques for classifying abusive text. The macro-average F1-score for the Tamil BERT model was 0.76, securing the 11th position, while the Malayalam BERT model obtained a score of 0.30 and secured the 33rd rank. The proposed solution can be extended to other regional languages using similar techniques.

pdf bib
CUET-NLP_MP@DravidianLangTech 2025: A Transformer and LLM-Based Ensemble Approach for Fake News Detection in Dravidian
Md Minhazul Kabir | Md. Mohiuddin | Kawsar Ahmed | Mohammed Moshiul Hoque

pdf bib
CUET-NLP_Big_O@DravidianLangTech 2025: A Multimodal Fusion-based Approach for Identifying Misogyny Memes
Md. Refaj Hossan | Nazmus Sakib | Md. Alam Miah | Jawad Hossain | Mohammed Moshiul Hoque

Memes have become one of the main mediums for expressing ideas, humor, and opinions through visual-textual content on social media. The same medium has been used to propagate harmful ideologies, such as misogyny, that undermine gender equality and perpetuate harmful stereotypes. Identifying misogynistic memes is particularly challenging in low-resource languages (LRLs), such as Tamil and Malayalam, due to the scarcity of annotated datasets and sophisticated tools. Therefore, DravidianLangTech@NAACL 2025 launched a Shared Task on Misogyny Meme Detection. For this task, this work explored an extensive array of models: machine learning (LR, RF, SVM, and XGBoost) and deep learning (CNN, BiLSTM+CNN, CNN+GRU, and LSTM) models were used to extract textual features, while CNN, BiLSTM+CNN, ResNet50, and DenseNet121 were utilized for visual features. Furthermore, we explored feature-level and decision-level fusion techniques with several model combinations, such as MuRIL with ResNet50, MuRIL with BiLSTM+CNN, T5+MuRIL with ResNet50, and mBERT with ResNet50. The evaluation results demonstrated that BERT + ResNet50 performed best for Tamil, obtaining an F1 score of 0.81716 and a 2nd-place rank in the task, while the early fusion of MuRIL+ResNet50 showed the highest F1 score of 0.82531 for Malayalam, receiving a 9th rank.

pdf bib
LexiLogic@DravidianLangTech 2025: Detecting Misogynistic Memes and Abusive Tamil and Malayalam Text Targeting Women on Social Media
Niranjan Kumar M | Pranav Gupta | Billodal Roy | Souvik Bhattacharyya

Social media platforms have become a significant medium for communication and expression, but they are also plagued by misogynistic content targeting women. This study focuses on detecting misogyny in memes and abusive textual content in the Tamil and Malayalam languages, which are underrepresented in natural language processing research. Leveraging advanced machine learning and deep learning techniques, we developed a system capable of identifying misogynistic memes and abusive text. By addressing cultural and linguistic nuances, our approach enhances detection accuracy and contributes to safer online spaces for women. This work also serves as a foundation for expanding misogyny detection to other low-resource languages, fostering inclusivity and combating online abuse effectively. This paper presents our work on detecting misogynistic memes and abusive Tamil and Malayalam text targeting women on social media platforms. Leveraging the pretrained models l3cube-pune/tamil-bert and l3cube-pune/malayalam-bert, we explored various data cleaning and augmentation strategies to enhance detection performance. The models were fine-tuned on curated datasets and evaluated using accuracy, F1-score, precision, and recall. The results demonstrated significant improvements with our cleaning and augmentation techniques, yielding robust performance in detecting nuanced and culturally-specific abusive content. Our model achieved macro F1 scores of 77.83/78.24 on L3Cube-Bert-Tamil and 78.16/77.01 on L3Cube-Bert-Malayalam, ranking 3rd and 4th on the leaderboard. For the misogyny task, we obtained 83.58/82.94 on L3Cube-Bert-Malayalam and 73.16/73.8 on L3Cube-Bert-Tamil, placing 9th in both. These results highlight our model's effectiveness in low-resource language classification.

pdf bib
CUET-NLP_Big_O@DravidianLangTech 2025: A BERT-based Approach to Detect Fake News from Malayalam Social Media Texts
Nazmus Sakib | Md. Refaj Hossan | Alamgir Hossain | Jawad Hossain | Mohammed Moshiul Hoque

The rapid growth of digital platforms and social media has significantly contributed to spreading fake news, posing serious societal challenges. While extensive research has been conducted on detecting fake news in high-resource languages (HRLs) such as English, relatively little attention has been given to low-resource languages (LRLs) like Malayalam due to insufficient data and computational tools. To address this challenge, the DravidianLangTech 2025 workshop organized a shared task on fake news detection in Dravidian languages. The task was divided into two sub-tasks, and our team participated in Task 1, which focused on classifying social media texts as original or fake. We explored a range of machine learning (ML) techniques, including Logistic Regression (LR), Multinomial Naïve Bayes (MNB), and Support Vector Machines (SVM), as well as deep learning (DL) models such as CNN, BiLSTM, and a hybrid CNN+BiLSTM. Additionally, this work examined several transformer-based models, including m-BERT, Indic-BERT, XLM-Roberta, and MuRIL-BERT, to tackle the task. Our team achieved 6th place in Task 1, with MuRIL-BERT delivering the best performance, achieving an F1 score of 0.874.

pdf bib
LexiLogic@DravidianLangTech 2025: Detecting Fake News in Malayalam and AI-Generated Product Reviews in Tamil and Malayalam
Souvik Bhattacharyya | Pranav Gupta | Niranjan Kumar M | Billodal Roy

Fake news and hard-to-detect AI-generated content are pressing issues in online media, which are expected to exacerbate due to the recent advances in generative AI. Moreover, tools to keep such content under check are less accurate for languages with less available online data. In this paper, we describe our submissions to two shared tasks at the NAACL Dravidian Language Tech workshop, namely detecting fake news in Malayalam and detecting AI-generated product reviews in Malayalam and Tamil. We obtained test macro F1 scores of 0.29 and 0.82 in the multi-class and binary classification sub-tasks within the Malayalam fake news task, and test macro F1 scores of 0.9 and 0.646 in the task of detecting AI-generated product reviews in Malayalam and Tamil respectively.

pdf bib
SSNTrio @ DravidianLangTech 2025: Hybrid Approach for Hate Speech Detection in Dravidian Languages with Text and Audio Modalities
J Bhuvana | Mirnalinee T T | Rohan R | Diya Seshan | Avaneesh Koushik

This paper presents the approach and findings from the Multimodal Social Media Data Analysis in Dravidian Languages (MSMDA-DL) shared task at DravidianLangTech@NAACL 2025. The task focuses on detecting multimodal hate speech in Tamil, Malayalam, and Telugu, requiring models to analyze both text and speech components from social media content. The proposed methodology uses language-specific BERT models for the provided text transcripts, followed by multimodal feature extraction techniques, and classification using a Random Forest classifier to enhance performance across the three languages. The models achieved a macro-F1 score of 0.7332 (Rank 1) in Tamil, 0.7511 (Rank 1) in Malayalam, and 0.3758 (Rank 2) in Telugu, demonstrating the effectiveness of the approach in multilingual settings. The models performed well despite the challenges posed by limited resources, highlighting the potential of language-specific BERT models and multimodal techniques in hate speech detection for Dravidian languages.

pdf bib
Fired_from_NLP@DravidianLangTech 2025: A Multimodal Approach for Detecting Misogynistic Content in Tamil and Malayalam Memes
Md. Sajid Alam Chowdhury | Mostak Mahmud Chowdhury | Anik Mahmud Shanto | Jidan Al Abrar | Hasan Murad

In the context of online platforms, identifying misogynistic content in memes is crucial for maintaining a safe and respectful environment. While most research has focused on high-resource languages, there is limited work on languages like Tamil and Malayalam. To address this gap, we have participated in the Misogyny Meme Detection task organized by DravidianLangTech@NAACL 2025, utilizing the provided dataset named MDMD (Misogyny Detection Meme Dataset), which consists of Tamil and Malayalam memes. In this paper, we have proposed a multimodal approach combining visual and textual features to detect misogynistic content. Through a comparative analysis of different model configurations, combining various deep learning-based CNN architectures and transformer-based models, we have developed fine-tuned multimodal models that effectively identify misogynistic memes in Tamil and Malayalam. We have achieved an F1 score of 0.678 for Tamil memes and 0.803 for Malayalam memes.

pdf bib
One_by_zero@DravidianLangTech 2025: Fake News Detection in Malayalam Language Leveraging Transformer-based Approach
Dola Chakraborty | Shamima Afroz | Jawad Hossain | Mohammed Moshiul Hoque

The rapid spread of misinformation in the digital era presents critical challenges for fake news detection, especially in low-resource languages (LRLs) like Malayalam, which lack the extensive datasets and pre-trained models available for widely spoken languages. This gap in resources makes it harder to build robust systems for combating misinformation, despite the significant societal and political consequences it can have. To address these challenges, this work proposes a transformer-based approach for Task 1 of the Fake News Detection in Dravidian Languages shared task (DravidianLangTech@NAACL 2025), which focuses on classifying Malayalam social media texts as either original or fake. The experiments involved a range of ML techniques (Logistic Regression (LR), Support Vector Machines (SVM), and Decision Trees (DT)) and DL architectures (BiLSTM, BiLSTM-LSTM, and BiLSTM-CNN). This work also explored transformer-based models, including IndicBERT, MuRiL, XLM-RoBERTa, and Malayalam BERT. Among these, Malayalam BERT achieved the best performance, with the highest macro F1-score of 0.892, securing 3rd rank in the competition.

pdf bib
CUET_Novice@DravidianLangTech 2025: A Multimodal Transformer-Based Approach for Detecting Misogynistic Memes in Malayalam Language
Khadiza Sultana Sayma | Farjana Alam Tofa | Md Osama | Ashim Dey

Memes, combining images and text, are a popular social media medium that can spread humor or harmful content, including misogyny—hatred or discrimination against women. Detecting misogynistic memes in Malayalam is challenging due to their multimodal nature, requiring analysis of both visual and textual elements. A Shared Task on Misogyny Meme Detection, organized as part of DravidianLangTech@NAACL 2025, aimed to address this issue by promoting the advancement of multimodal machine learning models for classifying Malayalam memes as misogynistic or non-misogynistic. In this work, we explored visual, textual, and multimodal approaches for meme classification. CNN, ResNet50, Vision Transformer (ViT), and Swin Transformer were used for visual feature extraction, while mBERT, IndicBERT, and MalayalamBERT were employed for textual analysis. Additionally, we experimented with multimodal fusion models, including IndicBERT+ViT, MalayalamBERT+ViT, and MalayalamBERT+Swin. Among these, our MalayalamBERT+Swin Transformer model performed best, achieving the highest weighted F1-score of 0.87631, securing 1st place in the competition. Our results highlight the effectiveness of multimodal learning in detecting misogynistic Malayalam memes and the need for robust AI models in low-resource languages.

pdf bib
teamiic@DravidianLangTech2025-NAACL 2025: Transformer-Based Multimodal Feature Fusion for Misogynistic Meme Detection in Low-Resource Dravidian Language
Harshita Sharma | Simran Simran | Vajratiya Vajrobol | Nitisha Aggarwal

Misogyny has become a pervasive issue in digital spaces. Misleading gender stereotypes are communicated through digital content, often displayed as text-and-image memes. With the growing prevalence of online content, it is essential to develop automated systems capable of detecting such harmful content to ensure safer online environments. This study focuses on the detection of misogynistic memes in two Dravidian languages, Tamil and Malayalam. The proposed model utilizes a pre-trained XLM-RoBERTa (XLM-R) model for text analysis and a Vision Transformer (ViT) for image feature extraction. A custom neural network classifier was trained to integrate the outputs of both modalities into a unified representation and predict whether a meme is misogynistic or not. This follows an early-fusion strategy, since the features of both modalities are combined before being fed into the classification model. This approach achieved promising results, with a macro F1-score of 0.84066 on the Malayalam test dataset and 0.68830 on the Tamil test dataset. In addition, it is worth noting that this approach secured Rank 7 and Rank 11 in the Malayalam and Tamil classification, respectively, in the shared task on Misogyny Meme Detection (MMD). The findings demonstrate that the multimodal approach significantly enhances the accuracy of detecting misogynistic content compared to text-only or image-only models.

pdf bib
CUET_Novice@DravidianLangTech 2025: Abusive Comment Detection in Malayalam Text Targeting Women on Social Media Using Transformer-Based Models
Farjana Alam Tofa | Khadiza Sultana Sayma | Md Osama | Ashim Dey

Social media has become a widely used platform for communication and entertainment, but it has also become a space where abuse and harassment can thrive. Women, in particular, face hateful and abusive comments that reflect gender inequality. This paper discusses our participation in the Abusive Text Targeting Women in Dravidian Languages shared task at DravidianLangTech@NAACL 2025, which focuses on detecting abusive text targeting women in Malayalam social media comments. The shared task provided a dataset of YouTube comments in Tamil and Malayalam, focusing on sensitive and controversial topics where abusive behavior is prevalent. Our participation focused on the Malayalam dataset, where the goal was to classify comments into these categories accurately. Malayalam-BERT achieved the best performance on the subtask, securing 3rd place with a macro F1-score of 0.7083, highlighting the effectiveness of transformer models for low-resource languages. These results contribute to tackling gender-based abuse and improving online content moderation for underrepresented languages.

pdf bib
SemanticCuetSync@DravidianLangTech 2025: Multimodal Fusion for Hate Speech Detection - A Transformer Based Approach with Cross-Modal Attention
Md. Sajjad Hossain | Symom Hossain Shohan | Ashraful Islam Paran | Jawad Hossain | Mohammed Moshiul Hoque

The rise of social media has significantly facilitated the rapid spread of hate speech. Detecting hate speech for content moderation is challenging, especially in low-resource languages (LRLs) like Telugu. Although some progress has been made in recent years in hate speech detection for Telugu using unimodal content (text or image), there is a lack of research on hate speech detection based on multimodal content (specifically combining audio and text). In this regard, DravidianLangTech arranged a shared task to address this challenge. This work explored three machine learning (ML), three deep learning (DL), and seven transformer-based models that integrate text and audio modalities using cross-modal attention for hate speech detection. The evaluation results demonstrate that mBERT achieved the highest F1 score of 49.68% using text alone. However, the proposed multimodal attention-based approach with Whisper-small+TeluguBERT-3 achieved an F1 score of 43.68%, which helped us achieve 3rd rank in the shared task competition.
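The cross-modal attention idea described above can be sketched as follows; this is a hedged illustration in which text token states attend over audio frame states before pooling and classification, with dimensions and encoders assumed rather than taken from the paper.

```python
# Minimal sketch of cross-modal attention fusion: text states (query) attend
# over audio states (key/value); the residual + layer norm and mean pooling
# are common choices here, not confirmed details of the authors' system.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8, n_classes: int = 2):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, text_states, audio_states):
        # text_states: (B, T_text, d), e.g. TeluguBERT hidden states
        # audio_states: (B, T_audio, d), e.g. projected Whisper features
        fused, _ = self.attn(query=text_states, key=audio_states,
                             value=audio_states)
        fused = self.norm(fused + text_states)   # residual connection
        return self.head(fused.mean(dim=1))      # mean-pool, then classify

model = CrossModalFusion()
logits = model(torch.randn(4, 32, 768), torch.randn(4, 128, 768))
```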

pdf bib
CUET_Novice@DravidianLangTech 2025: A Bi-GRU Approach for Multiclass Political Sentiment Analysis of Tamil Twitter (X) Comments
Arupa Barua | Md Osama | Ashim Dey

Political sentiment analysis of multilingual content poses significant challenges in capturing the subtle variations of the diverse sentiments expressed in complex, low-resourced languages. Accurately classifying sentiments, whether positive, negative, or neutral, is crucial for understanding public discourse. A shared task on Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments, organized by DravidianLangTech@NAACL 2025, provided an opportunity to tackle these challenges. For this task, we implemented two data augmentation techniques, synonym replacement and back translation, and then explored various machine learning (ML) algorithms, including Logistic Regression, Decision Tree, Random Forest, SVM, and Multinomial Naive Bayes. To capture semantic meaning more efficiently, we experimented with deep learning (DL) models, including GRU, BiLSTM, BiGRU, and a hybrid CNN-BiLSTM. The Bidirectional Gated Recurrent Unit (BiGRU) achieved the best macro-F1 (MF1) score of 0.33, securing 17th position in the shared task. These findings underscore the challenges of political sentiment analysis in low-resource languages and the need for advanced language-specific models for improved classification.

pdf bib
CIC-NLP@DravidianLangTech 2025: Detecting AI-generated Product Reviews in Dravidian Languages
Tewodros Achamaleh | Tolulope Olalekan Abiola | Lemlem Eyob Kawo | Mikiyas Mebraihtu | Grigori Sidorov

AI-generated text now matches human writing so well that telling them apart is very difficult. Our CIC-NLP team submitted results to the DravidianLangTech@NAACL 2025 shared task on revealing AI-generated product reviews in Dravidian languages. We performed binary classification with XLM-RoBERTa-Base using the datasets offered by the event organizers. After fine-tuning, the model distinguished human-written from AI-generated reviews with scores of 0.96 for Tamil and 0.88 for Malayalam on the evaluation test set. This paper presents detailed information about preprocessing, model architecture, hyperparameter fine-tuning settings, the experimental process, and the results. The source code is available on GitHub.
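For readers unfamiliar with this setup, a minimal sketch of binary fine-tuning with XLM-RoBERTa-Base of the kind this abstract reports might look as follows; the hyperparameters and sample inputs are assumptions, not the authors' settings.

```python
# Hedged sketch: fine-tuning XLM-RoBERTa-Base with two output labels for
# human vs. AI-generated review classification. Label mapping is assumed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=2)  # 0 = human-written, 1 = AI-generated
optim = torch.optim.AdamW(model.parameters(), lr=2e-5)

def train_step(texts, labels):
    enc = tok(texts, padding=True, truncation=True, max_length=256,
              return_tensors="pt")
    out = model(**enc, labels=torch.tensor(labels))
    out.loss.backward()       # standard cross-entropy from the model head
    optim.step()
    optim.zero_grad()
    return out.loss.item()

# Illustrative Malayalam and Tamil inputs (placeholder strings):
loss = train_step(["ഉദാഹരണ അവലോകനം", "மாதிரி விமர்சனம்"], [0, 1])
```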

pdf bib
One_by_zero@DravidianLangTech 2025: A Multimodal Approach for Misogyny Meme Detection in Malayalam Leveraging Visual and Textual Features
Dola Chakraborty | Shamima Afroz | Jawad Hossain | Mohammed Moshiul Hoque

Misogyny memes are a form of online content that spreads harmful and damaging ideas about women. By combining images and text, they often aim to mock, disrespect, or insult women, sometimes overtly and other times in more subtle, insidious ways. Detecting misogyny memes is crucial for fostering safer and more respectful online communities. While extensive research has been conducted on high-resource languages (HRLs) like English, low-resource languages (LRLs), such as the Dravidian languages Tamil and Malayalam, remain largely overlooked. The shared task on Misogyny Meme Detection, organized as part of DravidianLangTech@NAACL 2025, provided a platform to tackle the challenge of identifying misogynistic content in memes, specifically in Malayalam. We participated in the competition and adopted a multimodal approach to contribute to this effort. For image analysis, we employed a ResNet18 model to extract visual features, while for text analysis, we utilized the IndicBERT model. Our system achieved an impressive F1-score of 0.87, earning us the 3rd rank in the task.

pdf bib
CUET-NLP_MP@DravidianLangTech 2025: A Transformer-Based Approach for Bridging Text and Vision in Misogyny Meme Detection in Dravidian Languages
Md. Mohiuddin | Md Minhazul Kabir | Kawsar Ahmed | Mohammed Moshiul Hoque

Misogyny memes, a form of digital content, reflect societal prejudices by discriminating against women through shaming and stereotyping. In this study, we present a multimodal approach combining Indic-BERT and ViT-base-patch16-224 to address misogyny memes. We explored various machine learning, deep learning, and transformer models for unimodal and multimodal classification using the provided Tamil and Malayalam meme datasets. Our findings highlight the challenges traditional ML and DL models face in understanding the nuances of Dravidian languages, while emphasizing the importance of transformer models in capturing these complexities. Our multimodal method achieved F1-scores of 77.18% and 84.11% in Tamil and Malayalam, respectively, securing 6th place for both languages among the participants.

pdf bib
CUET_NetworkSociety@DravidianLangTech 2025: A Transformer-Based Approach for Detecting AI-Generated Product Reviews in Low-Resource Dravidian Languages
Sabik Aftahee | Tofayel Ahmmed Babu | MD Musa Kalimullah Ratul | Jawad Hossain | Mohammed Moshiul Hoque

E-commerce platforms face growing challenges to consumer trust and review authenticity because of the growing number of AI-generated product reviews. Low-resource languages (LRLs) such as Tamil and Malayalam have seen limited investigation of AI-detection techniques, as they are constrained by sparse data sources and complex linguistic structures. The CUET_NetworkSociety team took part in the AI-Generated Review Detection shared task at DravidianLangTech@NAACL 2025 to fill this gap. Using a combination of machine learning, deep learning, and transformer-based models, we detected AI-generated and human-written reviews in both Tamil and Malayalam. The developed method employed DistilBERT, with an advanced preprocessing pipeline and hyperparameter optimization using the Transformers library. This approach achieved a macro F1-score of 0.81 for Tamil (Subtask 1), securing 18th place, and a score of 0.7287 for Malayalam (Subtask 2), ranking 25th.

pdf bib
CUET_NetworkSociety@DravidianLangTech 2025: A Multimodal Framework to Detect Misogyny Meme in Dravidian Languages
MD Musa Kalimullah Ratul | Sabik Aftahee | Tofayel Ahmmed Babu | Jawad Hossain | Mohammed Moshiul Hoque

Memes are commonly used for communication on social media platforms, and some of them can propagate misogynistic content, spreading harmful messages. Detecting such misogynistic memes has become a significant challenge, especially for low-resource languages like Tamil and Malayalam, due to their complex linguistic structures. To tackle this issue, a shared task on detecting misogynistic memes was organized at DravidianLangTech@NAACL 2025. This paper proposes a multimodal deep learning approach for detecting misogynistic memes in Tamil and Malayalam. The proposed model combines fine-tuned ResNet18 for visual feature extraction and indicBERT for analyzing textual content. The fused model was applied to the test dataset, achieving macro F1 scores of 76.32% for Tamil and 80.35% for Malayalam. Our approach led to 7th and 12th positions for Tamil and Malayalam, respectively.

pdf bib
CUET_NetworkSociety@DravidianLangTech 2025: A Transformer-Driven Approach to Political Sentiment Analysis of Tamil X (Twitter) Comments
Tofayel Ahmmed Babu | MD Musa Kalimullah Ratul | Sabik Aftahee | Jawad Hossain | Mohammed Moshiul Hoque

Social media has become an established medium for public communication and opinion on every aspect of life, especially politics. This has resulted in a growing need for tools that can process the large amount of unstructured data produced on these platforms, providing actionable insights in domains such as social trends and political opinion. Low-resource languages like Tamil present challenges due to limited tools and annotated data, highlighting the need for NLP work on understudied languages. To address this, DravidianLangTech@NAACL 2025 organized a shared task on political sentiment analysis for low-resource languages, with a specific focus on Tamil. In this task, we explored several machine learning methods (SVM, AdaBoost, GB), deep learning methods (CNN, LSTM, GRU, BiLSTM, and ensembles of different deep learning models), and transformer-based methods (mBERT, T5, XLM-R). The mBERT model performed best, achieving a macro F1 score of 0.2178 and placing our team 22nd on the rank list.

pdf bib
cantnlp@DravidianLangTech-2025: A Bag-of-Sounds Approach to Multimodal Hate Speech Detection
Sidney Wong | Andrew Li

This paper presents the systems and results for the Multimodal Social Media Data Analysis in Dravidian Languages (MSMDA-DL) shared task at the Fifth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech-2025). We took a ‘bag-of-sounds’ approach by training our hate speech detection system on the speech (audio) data using transformed Mel spectrogram measures. While our candidate model performed poorly on the test set, our approach offered promising results during training and development for Malayalam and Tamil. With sufficient and well-balanced training data, our results show that it is feasible to use both text and speech (audio) data in the development of multimodal hate speech detection systems.
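One plausible reading of the "bag-of-sounds" feature extraction (an illustrative sketch, not the authors' exact pipeline) is to summarize transformed Mel spectrogram bands into a fixed-length vector for a linear classifier:

```python
# Hedged sketch of a bag-of-sounds-style audio featurizer: per-band summary
# statistics of the log-Mel spectrogram feed a simple linear classifier.
# Sample rate, band count, and classifier choice are assumptions.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression

def mel_features(path: str, sr: int = 16000) -> np.ndarray:
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    logmel = librosa.power_to_db(mel)                  # (64, frames)
    # Bag-of-sounds-style summary: per-band mean and std over time.
    return np.concatenate([logmel.mean(axis=1), logmel.std(axis=1)])

# Usage (paths and labels are placeholders):
# X = np.stack([mel_features(p) for p in audio_paths])
# clf = LogisticRegression(max_iter=1000).fit(X, labels)
```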

pdf bib
LexiLogic@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian languages
Billodal Roy | Pranav Gupta | Souvik Bhattacharyya | Niranjan Kumar M

This paper describes our participation in the DravidianLangTech@NAACL 2025 shared task on hate speech detection in Dravidian languages. While the task provided both text transcripts and audio data, we demonstrate that competitive results can be achieved using text features alone. We employed fine-tuned Bidirectional Encoder Representations from Transformers (BERT) models from l3cube-pune for Malayalam, Tamil, and Telugu languages. Our system achieved notable results, securing second position for Tamil and Malayalam tasks, and first position for Telugu in the official leaderboard.

pdf bib
LexiLogic@DravidianLangTech 2025: Political Multiclass Sentiment Analysis of Tamil X(Twitter) Comments and Sentiment Analysis in Tamil and Tulu
Billodal Roy | Souvik Bhattacharyya | Pranav Gupta | Niranjan Kumar M

We present our approach and findings for two sentiment analysis shared tasks as part of DravidianLangTech@NAACL 2025. The first task involved a seven-class political sentiment classification for Tamil tweets, while the second addressed code-mixed sentiment analysis in Tamil-English and Tulu-English social media texts. We employed language-specific BERT models fine-tuned on the respective tasks, specifically utilizing the L3Cube-Tamil-BERT for Tamil classification and a Telugu-based BERT model for Tulu classification. Our system achieved notable results, particularly securing the first position in the Tulu code-mixed sentiment analysis track. The experiments demonstrate the effectiveness of language-specific pre-trained models for Dravidian language sentiment analysis, while also highlighting the challenges in handling political discourse and code-mixed content.

pdf bib
Detection of Religious Hate Speech During Elections in Karnataka
Msvpj Sathvik | Raj Sonani | Ravi Teja Potla

We propose a novel dataset for detecting religious hate speech in the context of elections in Karnataka, with a particular focus on Kannada and Kannada-English code-mixed text. The data was collected during the Karnataka state elections and includes 3,000 labeled samples that reflect various forms of online discourse related to religion. This dataset aims to address the growing concern of religious intolerance and hate speech during election periods; it captures multilingual, code-mixed language. To evaluate the effectiveness of this dataset, we benchmarked it using the latest state-of-the-art algorithms, achieving an accuracy of 78.61%.

pdf bib
DLTCNITPY@DravidianLangTech 2025: Abusive Code-mixed Text Detection System Targeting Women for Tamil and Malayalam Languages using Deep Learning Technique
Habiba A | Dr G Aghila

The growing use of social communication platforms has seen women facing higher degrees of online violence than ever before. This paper presents a deep learning abuse detection system applied to inappropriate text directed at women on social media. Because of the diversity of languages, the casual nature of online communication, and cultural diversity around the world, the detection of such content is often severely lacking. This research utilized Long Short-Term Memory (LSTM) networks for abusive text detection in the Malayalam and Tamil languages. The model delivers high F1 scores of 0.75 for Malayalam and 0.72 for Tamil, achieving the desired balance between identifying abusive and non-abusive content while maintaining high performance. The model, trained on the dataset provided in the DravidianLangTech@NAACL 2025 shared task, comprising code-mixed abusive and non-abusive social media posts in Malayalam and Tamil, shows high detection accuracy and indicates the likely success of deep learning-based models for abusive text detection in resource-constrained languages.
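A compact sketch of an LSTM classifier of the kind described (vocabulary size, dimensions, and tokenization are assumptions, not the authors' configuration) might look like this:

```python
# Hedged sketch: an embedding + LSTM binary classifier for abusive vs.
# non-abusive text. Token ids would come from a tokenizer fit on the
# code-mixed training corpus (not shown here).
import torch
import torch.nn as nn

class AbuseLSTM(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)   # single logit: abusive vs. not

    def forward(self, token_ids):
        emb = self.embed(token_ids)           # (B, T, E)
        _, (h_n, _) = self.lstm(emb)          # final hidden state
        return self.out(h_n[-1]).squeeze(-1)  # one logit per example

logits = AbuseLSTM()(torch.randint(1, 30000, (4, 50)))  # dummy batch
```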

pdf bib
TSD: Towards Computational Processing of Tamil Similes - A Tamil Simile Dataset
Aathavan Nithiyananthan | Jathushan Raveendra | Uthayasanker Thayasivam

A simile is a powerful figure of speech that makes a comparison between two different things via shared properties, often using words like “like” or “as” to create vivid imagery, convey emotions, and enhance understanding. However, computational research on similes is limited in low-resource languages like Tamil due to the lack of simile datasets. This work introduces a manually annotated Tamil Simile Dataset (TSD) comprising around 1.5k simile sentences drawn from various sources. Our data annotation guidelines ensure that all the simile sentences are annotated with the three components, namely tenor, vehicle, and context. We benchmark our dataset for simile interpretation and simile generation tasks using chosen pre-trained language models (PLMs) and present the results. Our findings highlight the challenges of simile tasks in Tamil, suggesting areas for further improvement. We believe that TSD will drive progress in computational simile processing for Tamil and other low-resource languages, further advancing simile related tasks in Natural Language Processing.
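To make the annotation schema concrete, a hypothetical record with the three components named above (tenor, vehicle, context) could be represented as follows; the example sentence and spans are illustrative, not drawn from TSD.

```python
# Hypothetical illustration of the three-component simile annotation schema;
# field names mirror the paper's terms, the example is invented.
from dataclasses import dataclass

@dataclass
class SimileAnnotation:
    sentence: str   # the full Tamil simile sentence
    tenor: str      # the thing being described
    vehicle: str    # the thing it is compared to
    context: str    # the shared property grounding the comparison

example = SimileAnnotation(
    sentence="அவள் முகம் நிலா போல ஒளிர்ந்தது.",  # "Her face glowed like the moon."
    tenor="முகம்", vehicle="நிலா", context="ஒளிர்ந்தது",
)
```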

pdf bib
Hydrangea@DravidianLangTech2025: Abusive language Identification from Tamil and Malayalam Text using Transformer Models
Shanmitha Thirumoorthy | Thenmozhi Durairaj | Ratnavel Rajalakshmi

Abusive language toward women on the Internet has long been perceived as a danger to free speech and safe online spaces. In this paper, we discuss three transformer-based models, BERT, XLM-RoBERTa, and DistilBERT, for identifying gender-abusive comments in Tamil and Malayalam YouTube content. We fine-tune and compare these models using a dataset provided by the DravidianLangTech 2025 shared task on identifying abusive content on social media. Among these models, XLM-RoBERTa performed best, reaching F1 scores of 0.7708 for Tamil and 0.6876 for Malayalam. BERT followed with scores of 0.7658 (Tamil) and 0.6671 (Malayalam). DistilBERT's performance varied considerably between the two languages. The large difference in performance between the models, especially for Malayalam, indicates that working with low-resource languages remains difficult and that model choice is critical for abusive language detection. The findings provide useful guidance for building effective content moderation systems in linguistically diverse contexts and, more broadly, for promoting safe online spaces for women in South Indian language communities.

pdf bib
Towards Effective Emotion Analysis in Low-Resource Tamil Texts
Priyatharshan Balachandran | Uthayasanker Thayasivam | Randil Pushpananda | Ruvan Weerasinghe

Emotion analysis plays a significant role in understanding human behavior and communication, yet research in the Tamil language remains limited. This study focuses on building an emotion classifier for Tamil texts using machine learning (ML) and deep learning (DL), along with creating an emotion-annotated Tamil corpus for Ekman's basic emotions. Our dataset combines publicly available data with re-annotation and translations. Alongside traditional ML models, we investigated the use of transfer learning (TL) with state-of-the-art models, such as BERT- and Electra-based models. Experiments were conducted on unbalanced and balanced datasets using data augmentation techniques. The results indicate that Multinomial Naive Bayes (MNB) and Support Vector Machine (SVM) performed well with TF-IDF and BoW representations, while among the transfer learning models, LaBSE achieved the highest accuracy (63% balanced, 69% unbalanced), followed by TamilBERT and IndicBERT.

pdf bib
CUET_NLP_FiniteInfinity@DravidianLangTech 2025: Exploring Large Language Models for AI-Generated Product Review Classification in Malayalam
Md. Zahid Hasan | Safiul Alam Sarker | MD Musa Kalimullah Ratul | Kawsar Ahmed | Mohammed Moshiul Hoque

pdf bib
NAYEL@DravidianLangTech-2025: Character N-gram and Machine Learning Coordination for Fake News Detection in Dravidian Languages
Hamada Nayel | Mohammed Aldawsari | Hosahalli Lakshmaiah Shashirekha

This paper introduces a detailed description of the model submitted by team NAYEL to the Fake News Detection in Dravidian Languages shared task. The proposed model uses simple character n-gram TF-IDF features integrated with an ensemble of classical machine learning classification algorithms. Despite the simplicity of its structure, the proposed model outperforms other, more complex models, as the shared task results show. The proposed model achieved an F1-score of 87.5% and secured the 5th rank.
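A minimal scikit-learn sketch of the described coordination, character n-gram TF-IDF features feeding a voting ensemble of classical classifiers, is shown below; the n-gram range and the member classifiers are assumptions, not the team's exact choices.

```python
# Hedged sketch: character n-gram TF-IDF + hard-voting ensemble of classical
# classifiers for fake news detection.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", LinearSVC()),
        ("rf", RandomForestClassifier(n_estimators=200)),
    ],
    voting="hard",  # majority vote; LinearSVC exposes no predict_proba
)
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),  # char n-grams
    ensemble,
)
# model.fit(train_texts, train_labels); preds = model.predict(test_texts)
```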

pdf bib
AnalysisArchitects@DravidianLangTech 2025: BERT Based Approach For Detecting AI Generated Product Reviews In Dravidian Languages
Abirami Jayaraman | Aruna Devi Shanmugam | Dharunika Sasikumar | Bharathi B

The shared task on Detecting AI-generated Product Reviews in Dravidian Languages is aimed at addressing the growing concern of AI-generated product reviews, specifically in Malayalam and Tamil. As AI tools become more advanced, the ability to distinguish between human-written and AI-generated content has become increasingly crucial, especially in the domain of online reviews where authenticity is essential for consumer decision-making. In our approach, we used the ALBERT, IndicBERT, and Support Vector Machine (SVM) models to classify the reviews. The results of our experiments demonstrate the effectiveness of our methods in detecting AI-generated content.

pdf bib
AnalysisArchitects@DravidianLangTech 2025: Machine Learning Approach to Political Multiclass Sentiment Analysis of Tamil
Abirami Jayaraman | Aruna Devi Shanmugam | Dharunika Sasikumar | Bharathi B

Sentiment analysis is recognized as an important area in Natural Language Processing (NLP) that aims at understanding and classifying opinions or emotions in text. In the political field, public sentiment is analyzed to gain insight into opinions, address issues, and shape better policies. Social media platforms like Twitter (now X) are widely used to express thoughts and have become a valuable source of real-time political discussions. In this paper, the shared task of Political Multiclass Sentiment Analysis of Tamil tweets is examined, where the objective is to classify tweets into specific sentiment categories. The proposed approach is explained, which involves preprocessing Tamil text, extracting useful features, and applying machine learning and deep learning models for classification. The effectiveness of the methods is demonstrated through experimental results and the challenges encountered while working on the analysis of Tamil political sentiment are discussed.

pdf bib
TEAM_STRIKERS@DravidianLangTech2025: Misogyny Meme Detection in Tamil Using Multimodal Deep Learning
Kogilavani Shanmugavadivel | Malliga Subramanian | Mohamed Arsath H | Ramya K | Ragav R

This study focuses on detecting misogynistic content in memes under the title Misogynistic Meme Detection Using Multimodal Deep Learning. Through an analysis of both textual and visual components of memes, specifically in Tamil, the study seeks to detect misogynistic rhetoric directed towards women. The textual analysis involves preprocessing and vectorizing text data using methods like TF-IDF, GloVe, Word2Vec, and transformer-based embeddings like BERT. For the visual component, deep learning models like ResNet and EfficientNet are used to extract significant image attributes. To improve classification performance, these features are then combined in a multimodal framework employing hybrid architectures such as CNN-LSTM, GRU-EfficientNet, and ResNet-BERT. The classification of memes as misogynistic or non-misogynistic is done using sophisticated machine learning and deep learning approaches. Model performance is evaluated using metrics like Accuracy, Precision, Recall, F1-Score, and Macro Average F1-Score. This study shows how multimodal deep learning can effectively detect and counteract negative narratives about women in digital media by combining natural language processing with image classification.

pdf bib
KCRL@DravidianLangTech 2025: Multi-Pooling Feature Fusion with XLM-RoBERTa for Malayalam Fake News Detection and Classification
Fariha Haq | Md. Tanvir Ahammed Shawon | Md Ayon Mia | Golam Sarwar Md. Mursalin | Muhammad Ibrahim Khan

The rapid spread of misinformation on social media platforms necessitates robust detection mechanisms, particularly for languages with limited computational resources. This paper presents our system for the DravidianLangTech 2025 shared task on Fake News Detection in Malayalam YouTube comments, addressing both binary and multiclass classification challenges. We propose a Multi-Pooling Feature Fusion (MPFF) architecture that leverages [CLS] + Mean + Max pooling strategy with transformer models. Our system demonstrates strong performance across both tasks, achieving a macro-averaged F1 score of 0.874, ranking 6th in binary classification, and 0.628, securing 1st position in multiclass classification. Experimental results show that our MPFF approach with XLM-RoBERTa significantly outperforms traditional machine learning and deep learning baselines, particularly excelling in the more challenging multiclass scenario. These findings highlight the effectiveness of our methodology in capturing nuanced linguistic features for fake news detection in Malayalam, contributing to the advancement of automated verification systems for Dravidian languages.
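The [CLS] + Mean + Max pooling fusion can be sketched on top of XLM-RoBERTa as follows; this is an assumed reconstruction for illustration, with the classification head sized arbitrarily.

```python
# Hedged sketch of multi-pooling feature fusion: concatenate the [CLS] vector
# with masked mean- and max-pooled token states before classification.
import torch
import torch.nn as nn
from transformers import AutoModel

class MultiPoolingClassifier(nn.Module):
    def __init__(self, n_classes: int):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("xlm-roberta-base")
        hidden = self.encoder.config.hidden_size          # 768 for the base model
        self.head = nn.Linear(hidden * 3, n_classes)      # three pooled views

    def forward(self, input_ids, attention_mask):
        h = self.encoder(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()       # (B, T, 1)
        cls = h[:, 0]                                     # [CLS] token state
        mean = (h * mask).sum(1) / mask.sum(1).clamp(min=1e-9)  # masked mean
        maxp = h.masked_fill(mask == 0, -1e9).max(dim=1).values # masked max
        return self.head(torch.cat([cls, mean, maxp], dim=-1))
```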

pdf bib
KCRL@DravidianLangTech 2025: Multi-View Feature Fusion with XLM-R for Tamil Political Sentiment Analysis
Md Ayon Mia | Fariha Haq | Md. Tanvir Ahammed Shawon | Golam Sarwar Md. Mursalin | Muhammad Ibrahim Khan

Political discourse on social media platforms significantly influences public opinion, necessitating accurate sentiment analysis for understanding societal perspectives. This paper presents a system developed for the shared task of Political Multiclass Sentiment Analysis in Tamil tweets. The task aims to classify tweets into seven distinct sentiment categories: Substantiated, Sarcastic, Opinionated, Positive, Negative, Neutral, and None of the above. We propose a Multi-View Feature Fusion (MVFF) architecture that leverages XLM-R with a CLS-Attention-Mean mechanism for sentiment classification. Our experimental results demonstrate the effectiveness of our approach, achieving a macro-average F1-score of 0.37 on the test set and securing the 2nd position in the shared task. Through comprehensive error analysis, we identify specific classification challenges and demonstrate how our model effectively navigates the linguistic complexities of Tamil political discourse while maintaining robust classification performance across multiple sentiment categories.

pdf bib
TensorTalk@DravidianLangTech 2025: Sentiment Analysis in Tamil and Tulu using Logistic Regression and SVM
K Anishka | Anne Jacika J

Words are powerful; they shape thoughts that influence actions and reveal emotions. On social media, where billions of people share their opinions daily, comments are the key to understanding how users feel about a video, an image, or even an idea. But what happens when these comments are messy, riddled with code-mixed language, emojis, and informal text? The challenge becomes even greater when analyzing low-resource languages like Tamil and Tulu. To tackle this, TensorTalk deployed machine learning techniques, namely logistic regression for the Tamil language and an SVM for the Tulu language, to breathe life into unstructured data. By balancing, cleaning, and processing comments, TensorTalk broke through barriers like transliteration and tokenization, unlocking the emotions buried in the language.

pdf bib
TeamVision@DravidianLangTech 2025: Detecting AI generated product reviews in Dravidian Languages
Shankari S R | Sarumathi P | Bharathi B

Recent advancements in natural language processing (NLP) have enabled artificial intelligence (AI) models to generate product reviews that are indistinguishable from those written by humans. To address these concerns, this study proposes an effective AI-detector model capable of differentiating between AI-generated and human-written product reviews. Our methodology incorporates various machine learning techniques, including Naive Bayes, Random Forest, Logistic Regression, and SVM, as well as deep learning approaches based on the BERT architecture. Our findings reveal that BERT outperforms the other models in detecting AI-generated content in both Tamil and Malayalam product reviews.

pdf bib
CIC-NLP@DravidianLangTech 2025: Fake News Detection in Dravidian Languages
Tewodros Achamaleh | Nida Hafeez | Mikiyas Mebraihtu | Fatima Uroosa | Grigori Sidorov

Misinformation is a growing problem for technology companies and for society. Although there exists a large body of work on identifying fake news in predominantly high-resource languages, there is unfortunately a lack of such studies in low-resource languages (LRLs). Because corpora and annotated data are scarce in LRLs, the identification of false information remains at an exploratory stage. Fake news detection is critical in this digital era to avoid spreading misleading information. This research presents an approach to detect fake news in Dravidian languages. Our team CIC-NLP primarily targeted Task 1, which involves identifying whether a given social media post is original or fake. For the fake news detection (FND) problem, we used an mBERT model and the dataset provided by the workshop organizers. In this work, we describe our findings and the results of the proposed method. Our mBERT model achieved an F1 score of 0.853.

pdf bib
CoreFour_IIITK@DravidianLangTech 2025: Abusive Content Detection Against Women Using Machine Learning And Deep Learning Models
Varun Balaji S | Bojja Revanth Reddy | Vyshnavi Reddy Battula | Suraj Nagunuri | Balasubramanian Palani

The rise in the use of social media platforms has significantly increased user-generated content, including negative comments about women in Tamil and Malayalam. While these platforms encourage communication and engagement, they have also become a medium for the spread of abusive language, which poses challenges to maintaining a safe online environment for women. The main focus of this research is preventing, as far as possible, the use of abusive content against women. It examines the detection of abusive language against women in Tamil and Malayalam social media comments using computational models such as Logistic Regression, Support Vector Machines (SVM), Random Forest, multilingual BERT, XLM-RoBERTa, and IndicBERT. These models were trained and tested on a specifically curated dataset containing labeled comments in both languages. Among all the approaches, IndicBERT achieved the highest macro F1-score of 0.75. The findings emphasize the significance of employing a combination of traditional and advanced computational techniques to address challenges in Abusive Content Detection (ACD) specific to regional languages.

pdf bib
The_Deathly_Hallows@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages
Kogilavani Shanmugavadivel | Malliga Subramanian | Vasantharan K | Prethish G A | Santhosh S

The DravidianLangTech@NAACL 2025 shared task focused on multimodal hate speech detection in Tamil, Telugu, and Malayalam using social media text and audio. Our approach integrated advanced preprocessing, feature extraction, and deep learning models. For text, preprocessing steps included normalization, tokenization, stopword removal, and data augmentation. Feature extraction was performed using TF-IDF, Count Vectorizer, BERT-base-multilingual-cased, XLM-Roberta-Base, and XLM-Roberta-Large, with the latter achieving the best performance. The models attained training accuracies of 83% (Tamil), 88% (Telugu), and 85% (Malayalam). For audio, Mel Frequency Cepstral Coefficients (MFCCs) were extracted and enhanced with augmentation techniques such as noise addition, time-stretching, and pitch-shifting. A CNN-based model achieved training accuracies of 88% (Tamil), 88% (Telugu), and 93% (Malayalam). Macro F1 scores ranked Tamil 3rd (0.6438), Telugu 15th (0.1559), and Malayalam 12th (0.3016). Our study highlights the effectiveness of text-audio fusion in hate speech detection and underscores the importance of preprocessing, multimodal techniques, and feature augmentation in addressing hate speech on social media.
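The audio side of this pipeline can be sketched with librosa; the augmentation strengths and the 13-coefficient setting below are illustrative assumptions, not the paper's exact values.

```python
# Hypothetical sketch of the audio pipeline described above: MFCC features
# with noise, time-stretch, and pitch-shift augmentation via librosa.
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int) -> list[np.ndarray]:
    noisy = y + 0.005 * np.random.randn(len(y))         # additive noise
    stretched = librosa.effects.time_stretch(y, rate=1.1)
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)
    return [y, noisy, stretched, shifted]

def mfcc_features(y: np.ndarray, sr: int) -> np.ndarray:
    # 13 coefficients averaged over time -> fixed-length vector per clip
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

y, sr = librosa.load("clip.wav", sr=16000)
features = np.stack([mfcc_features(v, sr) for v in augment(y, sr)])
```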

pdf bib
SSN_IT_NLP@DravidianLangTech 2025: Abusive Tamil and Malayalam Text targeting Women on Social Media
Maria Nancy C | Radha N | Swathika R

The proliferation of social media platforms has resulted in increased instances of online abuse, particularly targeting marginalized groups such as women. This study focuses on the classification of abusive comments in Tamil and Malayalam, two Dravidian languages widely spoken in South India. Leveraging a multilingual BERT model, this paper provides an effective approach for detecting and categorizing abusive and non-abusive text. Using labeled datasets comprising social media comments, our model demonstrates its ability to identify targeted abuse with promising accuracy. This paper outlines the dataset preparation, model architecture, training methodology, and the evaluation of results, providing a foundation for combating online abuse in low-resource languages. The methodology is unique for its integration of multilingual BERT and weighted loss functions to address class imbalance, showcasing a pathway for effective abuse detection in other underrepresented languages. The BERT model achieved an F1-score of 0.6519 for Tamil and 0.6601 for Malayalam. The code for this work is available on GitHub (Abusive-Text-targeting-women).
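A minimal sketch of the weighted-loss idea for class imbalance, assuming PyTorch and inverse-frequency class weights (the paper's exact weighting scheme is not specified here):

```python
# Hypothetical sketch: inverse-frequency class weights fed to
# cross-entropy to counter class imbalance (illustrative values only).
import torch
import torch.nn as nn

labels = torch.tensor([0, 0, 0, 0, 1, 0, 1, 0])    # toy imbalanced batch
counts = torch.bincount(labels, minlength=2).float()
weights = counts.sum() / (2 * counts)               # inverse-frequency weights

criterion = nn.CrossEntropyLoss(weight=weights)
logits = torch.randn(len(labels), 2)                # stand-in for BERT logits
loss = criterion(logits, labels)
```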

pdf bib
Findings of the Shared Task on Abusive Tamil and Malayalam Text Targeting Women on Social Media: DravidianLangTech@NAACL 2025
Saranya Rajiakodi | Bharathi Raja Chakravarthi | Shunmuga Priya Muthusamy Chinnan | Ruba Priyadharshini | Raja Meenakshi J | Kathiravan Pannerselvam | Rahul Ponnusamy | Bhuvaneswari Sivagnanam | Paul Buitelaar | Bhavanimeena K | Jananayagan Jananayagan | Kishore Kumar Ponnusamy

This overview paper presents the findings of the Shared Task on Abusive Tamil and Malayalam Text Targeting Women on Social Media, organized as part of DravidianLangTech@NAACL 2025. The task aimed to encourage the development of robust systems to detect abusive content targeting women in Tamil and Malayalam, two low-resource Dravidian languages. Participants were provided with annotated datasets containing abusive and non-abusive text curated from YouTube comments. We present an overview of the approaches and analyse the results of the shared task submissions. We believe the findings presented in this paper will be useful to researchers working in Dravidian language technology.

pdf bib
LinguAIsts@DravidianLangTech 2025: Abusive Tamil and Malayalam Text targeting Women on Social Media
Dhanyashree G | Kalpana K | Lekhashree A | Arivuchudar K | Arthi R | Bommineni Sahitya | Pavithra J | Sandra Johnson

Social media sites have become crucial sites for communication and interaction, yet they are increasingly being utilized to commit gender-based abuse, with horrific, harassing, and degrading comments targeted at women. This paper addresses the common issue of women being subjected to abusive language in two South Indian languages, Malayalam and Tamil. To find explicit abuse, implicit bias, preconceptions, and coded language, we were given a set of YouTube comments labeled Abusive and Non-Abusive. To solve this problem, we applied and compared different machine learning models, namely Support Vector Machines (SVM), Logistic Regression (LR), and Naive Bayes classifiers, to classify comments into the given categories. The models were trained and validated on the given dataset to achieve the best performance with respect to accuracy and macro F1 score. The proposed solutions aim to support robust content moderation systems that can detect and prevent abusive language, ensuring safer online environments for women.

pdf bib
Celestia@DravidianLangTech 2025: Malayalam-BERT and m-BERT based transformer models for Fake News Detection in Dravidian Languages
Syeda Alisha Noor | Sadia Anjum | Syed Ahmad Reza | Md Rashadur Rahman

Fake news detection in Malayalam is difficult due to limited data and language challenges. This study compares machine learning, deep learning, and transformer models for the classification task. The dataset is balanced and divided into training, development, and test sets. Machine learning models (SVM, Random Forest, Naive Bayes) used TF-IDF features, and deep learning models (LSTM, BiLSTM, CNN) worked with tokenized sequences. We fine-tuned transformer models including IndicBERT, MuRIL, mBERT, and Malayalam-BERT. Among them, Malayalam-BERT performed best overall, achieving an F1 score of 86%, while mBERT was strongest at spotting the fake-news class. The models struggled with mixed-language text and complex writing; despite these challenges, transformer models proved the most effective for detecting fake news in Malayalam.

pdf bib
DravLingua@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages using Late Fusion of Muril and Wav2Vec Models
Aishwarya Selvamurugan

Detecting hate speech on social media is increasingly difficult, particularly in low-resource Dravidian languages such as Tamil, Telugu and Malayalam. Traditional approaches primarily rely on text-based classification, often overlooking the multimodal nature of online communication, where speech plays a pivotal role in spreading hate speech. We propose a multimodal hate speech detection model using a late fusion technique that integrates Wav2Vec 2.0 for speech processing and Muril for text analysis. Our model is evaluated on the DravidianLangTech@NAACL 2025 dataset, which contains speech and text data in Telugu, Tamil, and Malayalam scripts. The dataset is categorized into six classes: Non-Hate, Gender Hate, Political Hate, Religious Hate, Religious Defamation, and Personal Defamation. To address class imbalance, we incorporate class weighting and data augmentation techniques. Experimental results demonstrate that the late fusion approach effectively captures patterns of hate speech that may be missed when analyzing a single modality. This highlights the importance of multimodal strategies in enhancing hate speech detection, particularly for low-resource languages.
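The late-fusion step can be illustrated as follows, assuming per-modality logits from the text (MuRIL) and audio (Wav2Vec 2.0) encoders; the weighting scheme below is a placeholder, not the paper's reported setting.

```python
# Hypothetical sketch of late fusion: each modality is classified
# independently and class probabilities are combined afterwards.
import torch
import torch.nn.functional as F

def late_fusion(text_logits: torch.Tensor,
                audio_logits: torch.Tensor,
                w_text: float = 0.5) -> torch.Tensor:
    p_text = F.softmax(text_logits, dim=-1)     # e.g. from MuRIL
    p_audio = F.softmax(audio_logits, dim=-1)   # e.g. from Wav2Vec 2.0
    return w_text * p_text + (1.0 - w_text) * p_audio

# predictions = late_fusion(text_logits, audio_logits).argmax(dim=-1)
```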

pdf bib
Trio Innovators @ DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian Languages
Radha N | Swathika R | Farha Afreen I | Annu G | Apoorva A

This paper presents an in-depth study on multimodal hate speech detection in the Dravidian languages Tamil, Telugu, and Malayalam by leveraging both audio and text modalities. Detecting hate speech in these languages is particularly challenging due to factors such as code-mixing, limited linguistic resources, and diverse cultural contexts. Our approach integrates advanced techniques for audio feature extraction and XLM-Roberta for text representation, with feature alignment and fusion to develop a robust multimodal framework. The dataset is carefully categorized into labeled classes: gender-based, political, religious, and personal defamation hate speech, along with a non-hate category. Experimental results indicate that our model achieves a macro F1-score of 0.76 and an accuracy of approximately 85%.

pdf bib
Wictory@DravidianLangTech 2025: Political Sentiment Analysis of Tamil X(Twitter) Comments using LaBSE and SVM
Nithish Ariyha K | Eshwanth Karti T R | Yeshwanth Balaji A P | Vikash J | Sachin Kumar S

Political sentiment analysis has become an essential area of research in Natural Language Processing (NLP), driven by the rapid rise of social media as a key platform for political discourse. This study focuses on sentiment classification in Tamil political tweets, addressing the linguistic and cultural complexities inherent in low-resource languages. To overcome data scarcity challenges, we develop a system that integrates embeddings with advanced Machine Learning techniques, ensuring effective sentiment categorization. Our approach leverages deep learning-based models and transformer architectures to capture nuanced expressions, contributing to improved sentiment classification. This work enhances NLP methodologies for low-resource languages and provides valuable insights into Tamil political discussions, aiding policymakers and researchers in understanding public sentiment more accurately. Notably, our system secured Rank 5 in the NAACL shared task, demonstrating its effectiveness in real-world sentiment classification challenges.

pdf bib
ANSR@DravidianLangTech 2025: Detection of Abusive Tamil and Malayalam Text Targeting Women on Social Media using RoBERTa and XGBoost
Nishanth S | Shruthi Rengarajan | S Ananthasivan | Burugu Rahul | Sachin Kumar S

Abusive language directed at women on social media, often characterized by crude slang, offensive terms, and profanity, is not just harmful communication but also acts as a tool for serious and widespread cyber violence. It is imperative that this pressing issue be addressed in order to establish safer online spaces and provide efficient methods for detecting and minimising this kind of abuse. However, the intentional masking of abusive language, especially in regional languages like Tamil and Malayalam, presents significant obstacles, making detection and prevention more difficult. The proposed system effectively identifies abusive sentences using supervised machine learning techniques based on RoBERTa embeddings. The method aims to improve upon current abusive language detection systems, which are essential for various online platforms, including social media and online gaming services. The proposed method ranked 8th in Malayalam and 20th in Tamil in terms of F1 score.
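A hedged sketch of the RoBERTa-plus-XGBoost recipe named in the title: mean-pooled transformer embeddings as features for a gradient-boosted classifier. The xlm-roberta-base checkpoint and the pooling strategy are illustrative assumptions.

```python
# Hypothetical sketch: mean-pooled RoBERTa embeddings fed to XGBoost.
import torch
from transformers import AutoModel, AutoTokenizer
from xgboost import XGBClassifier

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
enc = AutoModel.from_pretrained("xlm-roberta-base").eval()

@torch.no_grad()
def embed(texts: list[str]) -> torch.Tensor:
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = enc(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)     # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)      # masked mean pooling

# clf = XGBClassifier().fit(embed(train_texts).numpy(), train_labels)
```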

pdf bib
Synapse@DravidianLangTech 2025: Multiclass Political Sentiment Analysis in Tamil X (Twitter) Comments: Leveraging Feature Fusion of IndicBERTv2 and Lexical Representations
Suriya Kp | Durai Singh K | Vishal A S | Kishor S | Sachin Kumar S

Social media platforms like X (Twitter) have gained popularity for political debates and election campaigns in the last decade. This creates the need to moderate and understand the sentiment of tweets in order to gauge the state of digital campaigns. This paper focuses on political sentiment classification of Tamil X (Twitter) comments, which proves challenging because of informal expressions, code-switching, and limited annotated datasets. The study categorizes comments into seven classes: substantiated, sarcastic, opinionated, positive, negative, neutral, and none of the above. We propose a solution to the Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments shared task at DravidianLangTech@NAACL 2025 that incorporates the IndicBERTv2-MLM-Back-Translation model and TF-IDF vectors in a custom model. Further, we explore preprocessing techniques that enrich hashtags and emojis with their context. Our approach achieved Rank 1 with a macro F1 average of 0.38 in the shared task.
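The feature-fusion idea can be sketched as a small custom head over concatenated dense (transformer) and sparse (TF-IDF) features. The layer sizes and dropout below are assumptions; only the seven-class output follows from the task description.

```python
# Hypothetical sketch: fusing a contextual embedding with a TF-IDF vector
# in a small custom classification head (dimensions are illustrative).
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, emb_dim: int, tfidf_dim: int, n_classes: int = 7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(emb_dim + tfidf_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(256, n_classes),
        )

    def forward(self, emb: torch.Tensor, tfidf: torch.Tensor) -> torch.Tensor:
        # Concatenate dense and sparse views of the same comment
        return self.head(torch.cat([emb, tfidf], dim=-1))

# model = FusionClassifier(emb_dim=768, tfidf_dim=5000)
```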

pdf bib
Findings of the Shared Task on Misogyny Meme Detection: DravidianLangTech@NAACL 2025
Bharathi Raja Chakravarthi | Rahul Ponnusamy | Saranya Rajiakodi | Shunmuga Priya Muthusamy Chinnan | Paul Buitelaar | Bhuvaneswari Sivagnanam | Anshid K A

The rapid expansion of social media has facilitated communication but also enabled the spread of misogynistic memes, reinforcing gender stereotypes and toxic online environments. Detecting such content is challenging due to the multimodal nature of memes, where meaning emerges from the interplay of text and images. The Misogyny Meme Detection shared task at DravidianLangTech@NAACL 2025 focused on Tamil and Malayalam, encouraging the development of multimodal approaches. With 114 teams registered and 23 submitting predictions, participants leveraged various pretrained language models and vision models through fusion techniques. The best models achieved high macro F1 scores (0.83682 for Tamil, 0.87631 for Malayalam), highlighting the effectiveness of multimodal learning. Despite these advances, challenges such as bias in the data set, class imbalance, and cultural variations persist. Future research should refine multimodal detection methods to improve accuracy and adaptability, fostering safer and more inclusive online spaces.

pdf bib
Overview of the Shared Task on Sentiment Analysis in Tamil and Tulu
Thenmozhi Durairaj | Bharathi Raja Chakravarthi | Asha Hegde | Hosahalli Lakshmaiah Shashirekha | Rajeswari Natarajan | Sajeetha Thavareesan | Ratnasingam Sakuntharaj | Krishnakumari K | Charmathi Rajkumar | Poorvi Shetty | Harshitha S Kumar

Sentiment analysis is an essential task for interpreting subjective opinions and emotions in textual data, with significant implications across commercial and societal applications. This paper provides an overview of the shared task on Sentiment Analysis in Tamil and Tulu, organized as part of DravidianLangTech@NAACL 2025. The task comprises two components, one addressing Tamil and the other focusing on Tulu, both designed as multi-class classification challenges wherein the sentiment of a given text must be categorized as positive, negative, neutral, or unknown. The dataset was diligently organized by aggregating user-generated content from social media platforms such as YouTube and Twitter, ensuring linguistic diversity and real-world applicability. Participants applied a variety of computational approaches, ranging from traditional machine learning models to deep learning models and pre-trained language models, together with various feature representation techniques, to tackle the challenges posed by linguistic code-mixing, orthographic variations, and resource scarcity in these low-resource languages.

pdf bib
cuetRaptors@DravidianLangTech 2025: Transformer-Based Approaches for Detecting Abusive Tamil Text Targeting Women on Social Media
Md. Mubasshir Naib | Md. Saikat Hossain Shohag | Alamgir Hossain | Jawad Hossain | Mohammed Moshiul Hoque

With the exponential growth of social media usage, the prevalence of abusive language targeting women has become a pressing issue, particularly in low-resource languages (LRLs) like Tamil and Malayalam. This study is part of the shared task at DravidianLangTech@NAACL 2025, which focuses on detecting abusive comments in Tamil social media content. The provided dataset consists of binary-labeled comments (Abusive or Non-Abusive) gathered from YouTube, reflecting explicit abuse, implicit bias, stereotypes, and coded language. We developed and evaluated multiple models for this task, including traditional machine learning algorithms (Logistic Regression, Support Vector Machine, Random Forest, and Multinomial Naive Bayes), deep learning models (CNN, BiLSTM, and CNN+BiLSTM), and transformer-based architectures (DistilBERT, Multilingual BERT, and XLM-RoBERTa) along with fine-tuned variants of these models. Our best-performing model, Multilingual BERT, achieved a weighted F1-score of 0.7203, ranking 19th in the competition.

pdf bib
Overview on Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments: DravidianLangTech@NAACL 2025
Bharathi Raja Chakravarthi | Saranya Rajiakodi | Thenmozhi Durairaj | Sathiyaraj Thangasamy | Ratnasingam Sakuntharaj | Prasanna Kumar Kumaresan | Kishore Kumar Ponnusamy | Arunaggiri Pandian Karunanidhi | Rohan R

Political multiclass sentiment analysis is the task of classifying comments into seven predefined political sentiment classes. In this paper, we report an overview of the findings of the “Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments” shared task conducted at the DravidianLangTech@NAACL 2025 workshop. The participants were provided with annotated Twitter comments split into training, development, and unlabelled test datasets. A total of 139 participants registered for this shared task, and 25 teams finally submitted their results. The performance of the submitted systems was evaluated and ranked in terms of the macro-F1 score.

pdf bib
KEC_AI_BRIGHTRED@DravidianLangTech 2025: Multimodal Hate Speech Detection in Dravidian languages
Kogilavani Shanmugavadivel | Malliga Subramanian | Nishdharani P | Santhiya E | Yaswanth Raj E

Hate speech detection in multilingual settings presents significant challenges due to linguistic variations and speech patterns across different languages. This study proposes a fusion-based approach that integrates audio and text features to enhance classification accuracy in Tamil, Telugu, and Malayalam. We extract Mel-Frequency Cepstral Coefficients and their delta variations for speech representation, while text-based features contribute additional linguistic insights. Several models were evaluated, including BiLSTM, Capsule Networks with Attention, Capsule-GRU, ConvLSTM-BiLSTM, and Multinomial Naïve Bayes, to determine the most effective architecture. Experimental results demonstrate that Random Forest performs best for text classification, while CNN achieves the highest accuracy for audio classification. The model was evaluated using the Macro F1 score and ranked ninth in Tamil with a score of 0.3018, ninth in Telugu with a score of 0.251, and thirteenth in Malayalam with a score of 0.2782 in the Multimodal Social Media Data Analysis in Dravidian Languages shared task at DravidianLangTech@NAACL 2025. By leveraging feature fusion and optimized model selection, this approach provides a scalable and effective framework for multilingual hate speech detection, contributing to improved content moderation on social media platforms.

pdf bib
Overview of the Shared Task on Fake News Detection in Dravidian Languages-DravidianLangTech@NAACL 2025
Malliga Subramanian | Premjith B | Kogilavani Shanmugavadivel | Santhiya Pandiyan | Balasubramanian Palani | Bharathi Raja Chakravarthi

Detecting and mitigating fake news on social media is critical for preventing misinformation, protecting democratic processes, preventing public distress, mitigating hate speech, reducing financial fraud, maintaining information reliability, etc. This paper summarizes the findings of the shared task “Fake News Detection in Dravidian Languages—DravidianLangTech@NAACL 2025.” The goal of this task is to detect fake content in social media posts in Malayalam. It consists of two subtasks: the first focuses on binary classification (Fake or Original), while the second categorizes the fake news into five types—False, Half True, Mostly False, Partly False, and Mostly True. In Task 1, 22 teams submitted machine learning techniques like SVM, Naïve Bayes, and SGD, as well as BERT-based architectures. Among these, XLM-RoBERTa had the highest macro F1 score of 89.8%. For Task 2, 11 teams submitted models using LSTM, GRU, XLM-RoBERTa, and SVM. XLM-RoBERTa once again outperformed other models, attaining the highest macro F1 score of 68.2%.

up

pdf (full)
bib (full)
Proceedings of the First Workshop of Evaluation of Multi-Modal Generation

pdf bib
Proceedings of the First Workshop of Evaluation of Multi-Modal Generation
Wei Emma Zhang | Xiang Dai | Desmond Elliot | Byron Fang | Mongyuan Sim | Haojie Zhuang | Weitong Chen

pdf bib
A Dataset for Programming-based Instructional Video Classification and Question Answering
Sana Javaid Raja | Adeel Zafar | Aqsa Shoaib

This work aims to develop an understanding of the rapidly emerging field of VideoQA, particularly in the context of instructional programming videos. It also encourages the design of systems that can produce visual answers to programming-based natural language questions. We introduce two datasets: CodeVidQA, with 2,104 question-answer pair links with timestamps taken from programming videos on Stack Overflow for the Programming Visual Answer Localization task, and CodeVidCL, with 4,331 videos (1,751 programming, 2,580 non-programming) for the Programming Video Classification task. In addition, we propose a framework that adapts BigBird and SVM for video classification. The proposed approach achieves a high accuracy of 99.61% for video classification.

pdf bib
CVT5: Using Compressed Video Encoder and UMT5 for Dense Video Captioning
Mohammad Javad Pirhadi | Motahhare Mirzaei | Sauleh Eetemadi

The dense video captioning task aims to detect all events occurring in a video and describe each event using natural language. Unlike most other video processing tasks, where it is typically assumed that videos contain only a single main event, this task deals with long, untrimmed videos. Consequently, the speed of processing videos in dense video captioning is a critical aspect of the system. To the best of our knowledge, all published work on this task uses RGB frames to encode input videos. In this work, we introduce the use of compressed videos for the first time in this task. Our experiments on the SoccerNet challenge demonstrate significant improvements in both processing speed and GPU memory footprint while achieving competitive results. Additionally, we leverage multilingual transcripts, which seems to be effective. The encoder in our proposed method achieves approximately 5.4× higher speed and 5.1× lower GPU memory usage during training, and 4.7× higher speed and 7.8× lower GPU memory usage during inference, compared to its RGB-based counterpart. The code is publicly available at https://github.com/mohammadjavadpirhadi/CVT5.

pdf bib
If I feel smart, I will do the right thing: Combining Complementary Multimodal Information in Visual Language Models
Yuyu Bai | Sandro Pezzelle

Generative visual language models (VLMs) have recently shown potential across various downstream language-and-vision tasks. At the same time, it is still an open question whether, and to what extent, these models can properly understand a multimodal context where language and vision provide complementary information—a mechanism routinely in place in human language communication. In this work, we test various VLMs on the task of generating action descriptions consistent with both an image’s visual content and an intention or attitude (not visually grounded) conveyed by a textual prompt. Our results show that BLIP-2 is not far from human performance when the task is framed as a generative multiple-choice problem, while other models struggle. Furthermore, the actions generated by BLIP-2 in an open-ended generative setting are better than those by the competitors; indeed, human annotators judge most of them as plausible continuations for the multimodal context. Our study reveals substantial variability among VLMs in integrating complementary multimodal information, yet BLIP-2 demonstrates promising trends across most evaluations, paving the way for seamless human-computer interaction.

pdf bib
LLaVA-RE: Binary Image-Text Relevancy Evaluation with Multimodal Large Language Model
Tao Sun | Oliver Liu | JinJin Li | Lan Ma

Multimodal generative AI usually involves generating image or text responses given inputs in another modality. The evaluation of image-text relevancy is essential for measuring the response quality or ranking candidate responses. In particular, binary relevancy evaluation, i.e., “Relevant” vs. “Not Relevant”, is a fundamental problem. However, this is a challenging task considering that texts have diverse formats and the definition of relevancy varies in different scenarios. We find that Multimodal Large Language Models (MLLMs) are an ideal choice to build such evaluators, as they can flexibly handle complex text formats and take in additional task information. In this paper, we present LLaVA-RE, a first attempt for binary image-text relevancy evaluation with MLLM. It follows the LLaVA architecture and adopts detailed task instructions and multimodal in-context samples. Further, we propose a novel binary relevancy dataset covering diverse tasks. Experimental results validate the effectiveness of our framework.

pdf bib
Persian in a Court: Benchmarking VLMs In Persian Multi-Modal Tasks
Farhan Farsi | Shahriar Shariati Motlagh | Shayan Bali | Sadra Sabouri | Saeedeh Momtazi

This study introduces a novel framework for evaluating Large Language Models (LLMs) and Vision-Language Models (VLMs) in Persian, a low-resource language. We develop comprehensive datasets to assess reasoning, linguistic understanding, and multimodal capabilities. Our datasets include Persian-OCR-QA for optical character recognition, Persian-VQA for visual question answering, Persian world-image puzzle for multimodal integration, Visual-Abstraction-Reasoning for abstract reasoning, and Iran-places for visual knowledge of Iranian figures and locations. We evaluate models like GPT-4o, Claude 3.5 Sonnet, and Llama 3.2 90B Vision, revealing their strengths and weaknesses in processing Persian. This research contributes to inclusive language processing by addressing the unique challenges of low-resource language evaluation.

pdf bib
TaiwanVQA: A Benchmark for Visual Question Answering for Taiwanese Daily Life
Hsin-Yi Hsieh | Shang Wei Liu | Chang Chih Meng | Shuo-Yueh Lin | Chen Chien-Hua | Hung-Ju Lin | Hen-Hsen Huang | I-Chen Wu

We introduce TaiwanVQA, a novel visual question answering benchmark designed to evaluate vision language models’ (VLMs) ability to recognize and reason about Taiwan-specific multimodal content. TaiwanVQA comprises 2,000 image-question pairs covering diverse topics relevant to Taiwanese culture and daily life. We categorize the questions into recognition and reasoning tasks, further sub-classifying reasoning questions based on the level of external knowledge required. We conduct extensive experiments on state-of-the-art VLMs, including GPT-4o, Llama-3.2, LLaVA, Qwen2-VL, and InternVL2 models. Our findings reveal significant limitations in current VLMs when handling culturally specific content. The performance gap widens between recognition tasks (top score 73.60%) and reasoning tasks (top score 49.80%), indicating challenges in cultural inference and contextual understanding. These results highlight the need for more culturally diverse training data and improved model architectures that can better integrate visual and textual information within specific cultural contexts. By providing TaiwanVQA, we aim to contribute to the development of more inclusive and culturally aware AI models, facilitating their deployment in diverse real-world settings. TaiwanVQA can be accessed on our GitHub page.

pdf bib
Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types
Neelabh Sinha | Vinija Jain | Aman Chadha

Visual Question-Answering (VQA) has become key to user experience, particularly after improved generalization capabilities of Vision-Language Models (VLMs). But evaluating VLMs for an application requirement using a standardized framework in practical settings is still challenging. This paper aims to solve that using an end-to-end framework. We present VQA360 - a novel dataset derived from established VQA benchmarks, annotated with task types, application domains, and knowledge types, for a comprehensive evaluation. We also introduce GoEval, a multimodal evaluation metric developed using GPT-4o, achieving a correlation factor of 56.71% with human judgments. Our experiments with state-of-the-art VLMs reveal that no single model excels universally, making the right choice of model a key design decision. Proprietary models such as Gemini-1.5-Pro and GPT-4o-mini generally outperform others, but open-source models like InternVL-2-8B and CogVLM-2-Llama-3-19B also demonstrate competitive strengths, while providing additional advantages. Our framework can also be extended to other tasks.

up

pdf (full)
bib (full)
Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal)

pdf bib
Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal)
Chung-Chi Chen | Antonio Moreno-Sandoval | Jimin Huang | Qianqian Xie | Sophia Ananiadou | Hsin-Hsi Chen

pdf bib
Chat Bankman-Fried: an Exploration of LLM Alignment in Finance
Claudia Biancotti | Carolina Camassa | Andrea Coletta | Oliver Giudice | Aldo Glielmo

Advancements in large language models (LLMs) have renewed concerns about AI alignment—the consistency between human and AI goals and values. As various jurisdictions enact legislation on AI safety, the concept of alignment must be defined and measured across different domains. This paper proposes an experimental framework to assess whether LLMs adhere to ethical and legal standards in the relatively unexplored context of finance. We prompt ten LLMs to impersonate the CEO of a financial institution and test their willingness to misuse customer assets to repay outstanding corporate debt. Beginning with a baseline configuration, we adjust preferences, incentives and constraints, analyzing the impact of each adjustment with logistic regression. Our findings reveal significant heterogeneity in the baseline propensity for unethical behavior of LLMs. Factors such as risk aversion, profit expectations, and regulatory environment consistently influence misalignment in ways predicted by economic theory, although the magnitude of these effects varies across LLMs. This paper highlights the benefits and limitations of simulation-based, ex-post safety testing. While it can inform financial authorities and institutions aiming to ensure LLM safety, there is a clear trade-off between generality and cost.

pdf bib
GraphRAG Analysis for Financial Narrative Summarization and A Framework for Optimizing Domain Adaptation
Neelesh Kumar Shukla | Prabhat Prabhakar | Sakthivel Thangaraj | Sandeep Singh | Weiyi Sun | C Prasanna Venkatesan | Viji Krishnamurthy

Large Language Models (LLMs) have shown promise in summarizing complex documents, but their limitations in handling lengthy documents and capturing global information hinder their performance in tasks like Query-Focused Summarization (QFS). We explore GraphRAG, a retrieval-augmented generation approach that utilizes a globally summarized knowledge graph derived from an LLM. We apply GraphRAG to the Financial Narrative Summarization (FNS) dataset, which consists of lengthy financial reports. Our results show that a naive RAG approach outperforms GraphRAG in terms of comprehensiveness, directness, conciseness and completeness. However, we demonstrate that optimizing entity and relation extraction using an LLM as an optimizer can enhance GraphRAG’s performance. Our study highlights the need for domain-specific optimization to improve GraphRAG’s capabilities for summarization tasks in facts-heavy domains like finance. We propose an optimization framework that extends GraphRAG’s original domain adaptation strategy by incorporating entity and relations optimization, leading to improved performance in capturing relevant entities and relationships. Our findings contribute to the development of more effective summarization models for complex documents in finance and other domains.

pdf bib
BuDDIE: A Business Document Dataset for Multi-task Information Extraction
Dongsheng Wang | Ran Zmigrod | Mathieu J. Sibue | Yulong Pei | Petr Babkin | Ivan Brugere | Xiaomo Liu | Nacho Navarro | Antony Papadimitriou | William Watson | Zhiqiang Ma | Armineh Nourbakhsh | Sameena Shah

The field of visually rich document understanding (VRDU) aims to solve a multitude of well-researched NLP tasks in the multi-modal domain. Several datasets exist for research on specific tasks of VRDU, such as document classification (DC), key entity extraction (KEE), entity linking, visual question answering (VQA), inter alia. These datasets cover documents like invoices and receipts with sparse annotations such that they support one or two co-related tasks (e.g., entity extraction and entity linking). Unfortunately, only focusing on a single specific type of documents or task is not representative of how documents often need to be processed in the wild – where variety in style and requirements is expected. In this paper, we introduce BuDDIE: Business Document Dataset for Information Extraction, the first multi-task dataset of 1665 real-world business documents that contains rich and dense annotations for DC, KEE, and VQA. Our dataset consists of publicly available business entity documents from US state government websites. The documents are structured and vary in their style and layout across states and types (e.g., forms, certificates, reports, etc.). We provide data variety and quality metrics for BuDDIE as well as a series of baselines for each task. Our baselines cover traditional textual, multi-modal, and large language model approaches to VRDU.

pdf bib
FinMoE: A MoE-based Large Chinese Financial Language Model
Xuanyu Zhang | Qing Yang

Large-scale language models have demonstrated remarkable success, achieving strong performance across a variety of general tasks. However, when applied to domain-specific fields, such as finance, these models face challenges due to the need for both specialized knowledge and robust general capabilities. In this paper, we introduce FinMoE, an MoE-based large-scale Chinese financial language model that bridges the gap between general language models and domain-specific requirements. FinMoE employs a dense MoE architecture, where all expert networks are simultaneously activated and dynamically combined to effectively integrate general linguistic understanding with domain-specific financial expertise. Experimental results demonstrate that FinMoE achieves state-of-the-art performance on both general-purpose and financial benchmarks at a comparable scale, validating its ability to balance domain specialization with general knowledge and reasoning.
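A minimal sketch of a dense MoE layer in the sense described, where every expert is activated and the outputs are combined by softmax gate weights. The expert architecture and count below are illustrative, not FinMoE's actual configuration.

```python
# Hypothetical sketch of a dense MoE layer: all experts run on every
# token and are mixed by learned gate weights (no sparse routing).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, n_experts: int = 4):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = F.softmax(self.gate(x), dim=-1)                   # (B, T, E)
        outs = torch.stack([e(x) for e in self.experts], -1)  # (B, T, D, E)
        return (outs * w.unsqueeze(-2)).sum(-1)               # weighted mix
```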

pdf bib
Bridging the Gap: Efficient Cross-Lingual NER in Low-Resource Financial Domain
Sunisth Kumar | Mohammed ElKholy | Davide Liu | Alexandre Boulenger

We present an innovative and efficient modeling framework for cross-lingual named entity recognition (NER), leveraging the strengths of knowledge distillation and consistency training. Our approach distills knowledge from an XLM-RoBERTa model pre-trained on a high-resource source language (English) to a student model, which then undergoes semi-supervised consistency training with KL divergence loss on a low-resource target language (Arabic). We focus our application on the financial domain, using datasets of SMS messages in English and Arabic containing financial transaction information, and aim to transfer NER capabilities from English to Arabic with minimal labeled Arabic samples. The framework generalizes named entity recognition from English to Arabic, achieving F1 scores of 0.74 on the Arabic financial transaction dataset and 0.61 on the WikiANN dataset, surpassing or closely competing with models that have 1.7× and 5.3× more parameters, respectively, while training efficiently on a single T4 GPU. Our experiments show that using a small amount of labeled data for low-resource cross-lingual NER applications is a wiser choice than zero-shot techniques while also consuming fewer resources. This framework holds significant potential for developing multilingual applications, particularly in regions where digital interactions span English and low-resource languages.
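The two training signals, distillation from the teacher's soft labels and KL-divergence consistency between clean and perturbed views of unlabeled text, might look as follows; the temperature and the way the losses are combined are assumptions.

```python
# Hypothetical sketch of the distillation and consistency objectives.
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T: float = 2.0):
    # Soft-label KL between temperature-softened student and teacher
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T

def consistency_loss(clean_logits, perturbed_logits):
    # Predictions on a perturbed input should match those on the clean one
    return F.kl_div(
        F.log_softmax(perturbed_logits, dim=-1),
        F.softmax(clean_logits, dim=-1),
        reduction="batchmean",
    )
```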

pdf bib
Evaluating Financial Literacy of Large Language Models through Domain Specific Languages for Plain Text Accounting
Alexei Gustavo Figueroa Rosero | Paul Grundmann | Julius Freidank | Wolfgang Nejdl | Alexander Loeser

Large language models (LLMs) have proven highly effective for a wide range of tasks, including code generation. Recently, advancements in their capabilities have shown promise in areas like mathematical reasoning, chain-of-thought processes and self-reflection. However, their effectiveness in domains requiring nuanced understanding of financial contexts, such as accounting, remains unclear. In this study, we evaluate how well LLMs perform in generating code for domain-specific languages (DSLs) in accounting, using Beancount as a case study. We create a set of tasks based on common financial ratios, to evaluate the numeracy and financial literacy of LLMs. Our findings reveal that while LLMs are state-of-the art in generative tasks, they struggle severely with accounting, often producing inaccurate calculations and misinterpreting financial scenarios. We characterize these shortcomings through a comprehensive evaluation, shedding light on the limitations of LLMs in understanding and handling money-related tasks.

pdf bib
Synthetic Data Generation Using Large Language Models for Financial Question Answering
Chetan Harsha | Karmvir Singh Phogat | Sridhar Dasaratha | Sai Akhil Puranam | Shashishekar Ramakrishna

Recent research has shown excellent performance of large language models (LLMs) for answering questions requiring multi-step financial reasoning. While the larger models have been used with zero-shot or few-shot prompting, the smaller variants need fine-tuning on training data containing questions and the corresponding answers that includes detailed reasoning demonstrations. To alleviate the significant cost of creating a data set with complex questions and corresponding answers, we explore the use of synthetic data for financial question answering using a multi-step LLM based approach to generate question as well as the answers with reasoning steps. We consider standard as well as conversational financial question answering scenarios. We experiment with synthetic data generation for three different real financial reasoning problems that already have manually collected data sets created with the help of financial experts. Using the same document sources, we use the proposed LLM based approach to generate synthetic questions and answers. To measure the effectiveness, we train multiple small language models (SLMs) on these synthetic data and compare the performance with that of the same SLMs trained on the real data. We further perform extensive experimental analysis generating important evidence on the potential of using synthetic data in financial reasoning tasks.

pdf bib
Concept-Based RAG Models: A High-Accuracy Fact Retrieval Approach
Cheng-Yu Lin | Jyh-Shing Jang

This study introduces a concept-based methodology to optimize Retrieval-Augmented Generation (RAG) tasks by assessing dataset certainty using entropy-based metrics and concept extraction techniques. Unlike traditional methods focused on reducing LLM hallucinations or modifying data structures, this approach evaluates inherent knowledge uncertainty from an LLM perspective. By pre-processing documents with LLMs, the concept-based method significantly enhances precision in tasks demanding high accuracy, such as legal, finance, or formal document responses.
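The entropy-based certainty idea can be illustrated with a simple score over an answer distribution: lower entropy is read as higher certainty. The candidate probabilities below are toy values, and the paper's actual metric may be defined differently.

```python
# Hypothetical sketch of an entropy-based certainty score.
import numpy as np

def entropy(probs: np.ndarray) -> float:
    p = probs[probs > 0]                      # ignore zero-mass answers
    return float(-(p * np.log(p)).sum())

# e.g. probabilities an LLM assigns to candidate answers for one question
confident = entropy(np.array([0.90, 0.05, 0.05]))  # ~0.39 nats: high certainty
uncertain = entropy(np.array([0.34, 0.33, 0.33]))  # ~1.10 nats: low certainty
```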

pdf bib
Training LayoutLM from Scratch for Efficient Named-Entity Recognition in the Insurance Domain
Benno Uthayasooriyar | Antoine Ly | Franck Vermet | Caio Corro

Generic pre-trained neural networks may struggle to produce good results in specialized domains like finance and insurance. This is due to a domain mismatch between training data and downstream tasks, as in-domain data are often scarce due to privacy constraints. In this work, we compare different pre-training strategies for LayoutLM. We show that using domain-relevant documents improves results on a named-entity recognition (NER) problem using a novel dataset of anonymized insurance-related financial documents called PAYSLIPS. Moreover, we show that we can achieve competitive results using a smaller and faster model.

pdf bib
AveniBench: Accessible and Versatile Evaluation of Finance Intelligence
Mateusz Klimaszewski | Pinzhen Chen | Liane Guillou | Ioannis Papaioannou | Barry Haddow | Alexandra Birch

Over the last few years, there has been great interest in applying large language models (LLMs) to problems in the finance industry, and the field needs a robust LLM benchmark to support this work. Current financial LLM benchmarks contain simple tasks which are not representative of real use cases and have test sets with licences that do not allow commercial use. In response, we release AveniBench, a permissively licensed benchmark that tests a group of six key finance-related skills: tabular reasoning, numerical reasoning, question answering, long context modelling, summarisation and dialogue. We refactor the test sets to ensure that metrics are comparable, providing a unified framework. Furthermore, AveniBench introduces two task difficulty modes, easy and hard, enabling scalable evaluation based on real-world deployment needs. We use our benchmark to evaluate a diverse set of 20 widely used LLMs, from small open-weight models to proprietary systems like GPT-4. This evaluation initiates our public leaderboard, providing valuable insights for future academic research and commercial development.

pdf bib
Forecasting Credit Ratings: A Case Study where Traditional Methods Outperform Generative LLMs
Felix Drinkall | Janet B. Pierrehumbert | Stefan Zohren

Large Language Models (LLMs) have been shown to perform well for many downstream tasks. Transfer learning can enable LLMs to acquire skills that were not targeted during pre-training. In financial contexts, LLMs can sometimes beat well-established benchmarks. This paper investigates how well LLMs perform at forecasting corporate credit ratings. We show that while LLMs are very good at encoding textual information, traditional methods are still very competitive when it comes to encoding numeric and multimodal data. For our task, current LLMs perform worse than a more traditional XGBoost architecture that combines fundamental and macroeconomic data with high-density text-based embedding features. We investigate the degree to which the text encoding methodology affects performance and interpretability.

pdf bib
Investigating the effectiveness of length based rewards in DPO for building Conversational Financial Question Answering Systems
Anushka Yadav | Sai Krishna Rallabandi | Parag Pravin Dakle | Preethi Raghavan

In this paper, we address the numerical reasoning challenges of financial question-answering systems. We propose a two-stage approach in which models first generate intermediate calculations and then produce the final answer. We perform two sets of experiments to evaluate the performance of our approach. In the first, we compare single-step and multi-step approaches, demonstrating that incorporating intermediate calculations significantly improves numerical accuracy. In the second, we compare traditional DPO and iterative DPO (iDPO) with length-regularized DPO. We show that while traditional DPO reduces parsing errors, it introduces verbosity; iDPO improves reasoning iteratively but faces diminishing returns. Length-regularized DPO, on the other hand, reduces the verbosity of intermediate calculations and enhances numerical accuracy across all models. These results highlight the potential of combining intermediate reasoning steps with domain-specific optimizations to build robust financial question-answering systems.
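One published formulation of length-regularized DPO penalizes the DPO margin by the response-length difference, so the policy is not rewarded for verbosity. Whether the paper uses exactly this form, and the alpha/beta values, are assumptions.

```python
# Hypothetical sketch of a length-regularized DPO objective.
import torch.nn.functional as F

def length_regularized_dpo(pi_chosen, pi_rejected, ref_chosen, ref_rejected,
                           len_chosen, len_rejected,
                           beta: float = 0.1, alpha: float = 0.01):
    # Sequence log-probs under the policy and the frozen reference model
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    # Longer preferred responses earn a smaller effective reward margin
    length_penalty = alpha * (len_chosen - len_rejected)
    return -F.logsigmoid(beta * margin - length_penalty).mean()
```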

pdf bib
CreditLLM: Constructing Financial AI Assistant for Credit Products using Financial LLM and Few Data
Sixing Yan | Ting Zhu

The use of large language models (LLMs) in financial technology has been developing rapidly in recent years. To address the challenges of one of the world's biggest markets, China, Chinese-expertise financial LLMs have also been studied. Related works focus on conventional NLP tasks in finance, while developing LLMs for specific tasks is also required. Moreover, in the credit loan business, existing AI-based approaches largely address credit-related problems such as credit rating and fraud prediction, while credit product customization is still missing. In China, Inclusive Finance and Rural Finance have become two hot topics that raise critical challenges in flexibly customizing credit products to meet the variable funding requirements of small and micro businesses, individual businesses, and agricultural businesses of local character. In this paper, credit product customization is studied by developing an LLM-based financial AI assistant for the credit loan business, designed to support customer counseling, recommendation, and question answering regarding credit loans. The proposed LLM is developed with Chinese prompt data automatically constructed from a small set of real-world credit products. The experiments demonstrate its effectiveness on credit loan-related abilities while maintaining comparable performance on conventional financial NLP tasks.

pdf bib
Modeling Interactions Between Stocks Using LLM-Enhanced Graphs for Volume Prediction
Zhiyu Xu | Yi Liu | Yuchi Wang | Ruihan Bao | Keiko Harimoto | Xu Sun

Accurate trading volume prediction is essential for portfolio optimization, market regulation, and financial risk control. An effective method for predicting trading volume involves building a graph to model relations between stocks. Recent research has enhanced these models by integrating stock news to improve forecasting ability. However, existing approaches primarily integrate news data as auxiliary features for nodes in Graph Neural Networks (GNNs), overlooking the relational information between stocks embedded in news. To address this, we propose LLM-Enhanced Dynamic Graph Neural Network (LED-GNN), a framework that constructs dynamic graphs using inter-stock relationships extracted from news via a large language model (LLM)-centered pipeline, combined with graphs learned from historical price-volume data. A dynamic GNN then processes these graphs to generate predictions. Evaluated on a real-world dataset, TOPIX, with Reuters Financial News, LED-GNN consistently outperformed all baseline models, achieving a 2% improvement over the strongest baseline.

pdf bib
Financial Named Entity Recognition: How Far Can LLM Go?
Yi-Te Lu | Yintong Huo

The surge of large language models (LLMs) has revolutionized the extraction and analysis of crucial information from a growing volume of financial statements, announcements, and business news. Recognizing named entities to construct structured data poses a significant challenge in analyzing financial documents and is a foundational task for intelligent financial analytics. However, how effective these generic LLMs are, and how their performance varies under different prompts, is not yet well understood. To fill this gap, we present a systematic evaluation of state-of-the-art LLMs and prompting methods on the financial Named Entity Recognition (NER) problem. Specifically, our experimental results highlight their strengths and limitations, identify five representative failure types, and provide insights into their potential and challenges for domain-specific tasks.

pdf bib
Proxy Tuning for Financial Sentiment Analysis: Overcoming Data Scarcity and Computational Barriers
Yuxiang Wang | Yuchi Wang | Yi Liu | Ruihan Bao | Keiko Harimoto | Xu Sun

Financial sentiment analysis plays a pivotal role in the financial domain. However, the task remains challenging due to the nuanced nature of financial sentiment, the need for high interpretability, and the scarcity of high-quality datasets. To address these issues, we leverage recent advancements in large language models (LLMs) and propose to adapt proxy tuning for financial sentiment analysis. Proxy tuning efficiently transfers knowledge from a pre-trained expert model to a controllable base model by incorporating logit differences, steering the base model toward the desired sentiment representation. Our method offers significant advantages: (1) it is training-free, reducing computational demands and data dependency; (2) it achieves promising performance, with a 36.67% improvement over the base model and over 90% of the tuned model’s performance; and (3) it is highly adaptable, functioning in a plug-and-play manner without requiring access to model architectures or weights. These results demonstrate the potential of proxy tuning as an efficient and practical solution for financial sentiment analysis in data-scarce scenarios.
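Following the general proxy-tuning recipe, a decoding step shifts the base model's logits by the difference between a tuned expert and its untuned counterpart (anti-expert). The function below is a hedged sketch assuming Hugging Face causal LMs, not the authors' implementation.

```python
# Hypothetical sketch of one proxy-tuned decoding step: no training,
# only logit arithmetic at inference time.
import torch
import torch.nn.functional as F

@torch.no_grad()
def proxy_tuned_next_token(base, expert, anti_expert, input_ids):
    base_logits = base(input_ids).logits[:, -1, :]
    expert_logits = expert(input_ids).logits[:, -1, :]
    anti_logits = anti_expert(input_ids).logits[:, -1, :]
    # Steer the base model toward the expert's sentiment behavior
    steered = base_logits + (expert_logits - anti_logits)
    return steered.argmax(dim=-1)   # or sample from F.softmax(steered, -1)
```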

pdf bib
The contribution of LLMs to relation extraction in the economic field
Mohamed Ettaleb | Mouna Kamel | Nathalie Aussenac-Gilles | Véronique Moriceau

Relation Extraction (RE) is a fundamental task in natural language processing, aimed at deducing semantic relationships between entities in a text. Traditional supervised relation extraction methods involve training models to annotate tokens representing entity mentions, followed by predicting the relationship between these entities. However, recent advancements have transformed this task into a sequence-to-sequence problem, in which relationships between entities are converted into target strings that are then generated from the input text. Thus, language models now appear as a solution to this task and have already been used in numerous studies, with various levels of refinement, across different domains. The objective of the present study is to evaluate the contribution of large language models (LLMs) to the task of relation extraction in a specific domain (in this case, the economic domain), compared to smaller language models. To do this, we considered as a baseline a model based on the BERT architecture, trained in this domain, and four LLMs, namely FinGPT, which is specific to the financial domain, and XLNet, ChatGLM, and Llama3, which are generalists. All these models were evaluated on the same extraction task, with zero-shot prompting for the general-purpose LLMs, as well as refinements through few-shot learning and fine-tuning. The experiments showed that the best performance in terms of F-score was achieved with fine-tuned LLMs, with Llama3 achieving the highest performance.

pdf bib
Generating Financial News Articles from Factors of Stock Price Rise / Decline by LLMs
Shunsuke Nishida | Takehito Utsuro

In this paper, we study the task of generating financial news articles related to stock price fluctuations. Traditionally, reporters manually write these articles by identifying the causes behind significant stock price volatility. However, this process is time-consuming, limiting the number of articles produced. To address this, the study explores the use of generative AI to automatically generate such articles. The AI system, similar to human reporters, would analyze stock price volatility and determine the underlying factors contributing to these fluctuations. To support this approach, we introduce a Japanese dataset called JFinSR, which includes stock price fluctuation rankings from “Kabutan” and related financial information regarding factors of stock price rise / decline from “Nihon Keizai Shimbun (Nikkei).” Using this dataset, we implement the few-shot learning technique on large language models (LLMs) to enable automatic generation of high-quality articles from the factors of stock price rise / decline available in Nikkei. In the evaluation, we compare zero-shot and few-shot learning approaches, where few-shot learning achieved higher F1 scores in terms of ROUGE-1/ROUGE-L metrics.

pdf bib
Can Large language model analyze financial statements well?
Xinlin Wang | Mats Brorsson

Since GPT-3.5’s release, large language models (LLMs) have made significant advancements, including in financial analysis. However, their effectiveness in financial calculations and predictions is still uncertain. This study examines LLMs’ ability to analyze financial reports, focusing on three questions: their accuracy in calculating financial ratios, the use of these metrics in DuPont analysis and the Z-score model for bankruptcy prediction, and their effectiveness in predicting financial indicators with limited knowledge. We used various methods, including zero-shot and few-shot learning, retrieval-augmented generation (RAG), and fine-tuning, with three advanced LLMs, and compared their outputs to ground truth and expert predictions to assess their calculation and predictive abilities. The results highlight both the potential and limitations of LLMs in processing numerical data and performing complex financial analyses.
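For concreteness, the arithmetic behind two of the analyses probed here can be written out directly: the DuPont identity and the original Altman Z-score, both with their standard published coefficients (the toy numbers are illustrative).

```python
# DuPont identity: ROE = net margin x asset turnover x equity multiplier.
def dupont_roe(net_income, revenue, assets, equity):
    return (net_income / revenue) * (revenue / assets) * (assets / equity)

# Altman Z-score (original 1968 coefficients for public manufacturers).
def altman_z(working_capital, retained_earnings, ebit,
             market_value_equity, total_liabilities, sales, total_assets):
    return (1.2 * working_capital / total_assets
            + 1.4 * retained_earnings / total_assets
            + 3.3 * ebit / total_assets
            + 0.6 * market_value_equity / total_liabilities
            + 1.0 * sales / total_assets)

# dupont_roe(10, 100, 200, 80) == 0.125, i.e. a 12.5% return on equity
```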

pdf bib
AMWAL: Named Entity Recognition for Arabic Financial News
Muhammad S. Abdo | Yash Hatekar | Damir Cavar

Financial Named Entity Recognition (NER) presents a pivotal task in extracting structured information from unstructured financial data, especially when extending its application to languages beyond English. In this paper, we present AMWAL, a named entity recognition system for Arabic financial news. Our approach centered on building a specialized corpus compiled from three major Arabic financial newspapers spanning from 2000 to 2023. Entities were extracted from this corpus using a semi-automatic process that included manual annotation and review to ensure accuracy. The total number of entities identified amounts to 17.1k tokens, distributed across 20 categories, providing a comprehensive coverage of financial entities. To standardize the identified entities, we adopt financial concepts from the Financial Industry Business Ontology (FIBO, 2020), aligning our framework with industry standards. The significance of our work lies not only in the creation of the first customized NER system for Arabic financial data but also in its potential to streamline information extraction processes in the financial domain. Our NER system achieves a Precision score of 96.08, a Recall score of 95.87, and an F1 score of 95.97, which outperforms state-of-the-art general Arabic NER systems as well as other systems for financial NER in other languages.

pdf bib
The Financial Document Causality Detection Shared Task (FinCausal 2025)
Antonio Moreno-Sandoval | Jordi Porta | Blanca Carbajo-Coronado | Yanco Torterolo | Doaa Samy

We present the Financial Document Causality Detection Task (FinCausal 2025), a multilingual challenge to identify causal relationships within financial texts. This task comprises English and Spanish subtasks, with datasets compiled from British and Spanish annual reports. Participants were tasked with identifying and generating answers to questions about causes or effects within specific text segments. The dataset combines extractive and generative question-answering (QA) methods, with abstractly formulated questions and directly extracted answers from the text. System performance is evaluated using exact matching and semantic similarity metrics. The challenge attracted submissions from 10 teams for the English subtask and 10 teams for the Spanish subtask. FinCausal 2025 is part of the 6th Financial Narrative Processing Workshop (FNP 2025), hosted at COLING 2025 in Abu Dhabi.
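The two evaluation signals can be approximated as below: exact match after light normalization, and a semantic-similarity score from multilingual sentence embeddings. The embedding model is an assumption; the task's official SAS implementation may use a different model.

```python
# Hypothetical sketch of the two evaluation signals used by FinCausal.
from sentence_transformers import SentenceTransformer, util

# A multilingual model covers both the English and Spanish subtasks.
model = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

def exact_match(pred: str, gold: str) -> int:
    return int(pred.strip().lower() == gold.strip().lower())

def semantic_similarity(pred: str, gold: str) -> float:
    emb = model.encode([pred, gold], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```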

pdf bib
KULFi Framework: Knowledge Utilization for Optimizing Large Language Models for Financial Causal Reasoning
Neelesh Kumar Shukla | Sandeep Singh | Prabhat Kumar Prabhakar | Sakthivel Thangaraj | Weiyi Sun | C Prasanna Venkatesan | Viji Krishnamurthy

This paper presents our contribution to the Financial Document Causality Detection (FinCausal) task 2025. The FinCausal challenge centers on the extraction of cause-and-effect relationships from financial texts written in both English and Spanish. We introduce KULFi, a novel Knowledge Utilization framework designed to augment the capabilities of Large Language Models (LLMs) by leveraging the expertise of more advanced reasoning models. Through the utilization of Teacher LLMs to generate task-specific instructions, KULFi optimizes the performance of Student LLMs via automated prompt optimization. We evaluate the efficacy of KULFi on the Financial Document Causality Detection Task, where Student LLM achieves a similarity score comparable to human-guided prompt optimization for the same LLM, demonstrating significant improvements in causal reasoning performance. Our results demonstrate that KULFi enables effective knowledge transfer from more robust models to less capable ones, as well as efficient learning from training data, minimizing the need for human input in prompt design and enabling more precise causal analysis in financial contexts. Our system attained SAS and Exact Match scores of 0.92 and 0.35 on the English dataset, and 0.92 and 0.09 on the Spanish dataset, respectively. This framework has far-reaching implications, with potential applications in enhancing decision-making across complex financial environments.

pdf bib
Exploring the Effectiveness of Multilingual and Generative Large Language Models for Question Answering in Financial Texts
Ali Al-Laith

This paper investigates the use of large language models (LLMs) for financial causality detection in the FinCausal 2025 shared task, focusing on generative and multilingual question answering (QA) tasks. Our study employed both generative and discriminative approaches, utilizing GPT-4o for generative QA and BERT-base-multilingual-cased, XLM-RoBERTa-large, and XLM-RoBERTa-base for multilingual QA across English and Spanish datasets. The datasets consist of financial disclosures where questions reflect causal relationships, paired with extractive answers derived directly from the text. Evaluation was conducted using Semantic Answer Similarity (SAS) and Exact Match (EM) metrics. While the discriminative XLM-RoBERTa-large model achieved the best overall performance, ranking 5th in English (SAS: 0.9598, EM: 0.7615) and 4th in Spanish (SAS: 0.9756, EM: 0.8084) among 11 team submissions, our results also highlight the effectiveness of the generative GPT-4o approach. Notably, GPT-4o achieved promising results in few-shot settings, with SAS scores approaching those of fine-tuned discriminative models, demonstrating that the generative approach can provide competitive performance despite lacking task-specific fine-tuning. This comparison underscores the potential of generative LLMs as robust, versatile alternatives for complex QA tasks like financial causality detection.

pdf bib
CLRG@FinCausal2025: Cause-Effect Extraction in Finance Domain
Vibhavkrishnan K S | Pattabhi RK Rao | Sobha Lalitha Devi

This paper presents our work on cause-effect information extraction in the financial domain. Cause and effect information is much needed for expert decision making; in particular, fund managers and financial analysts need cause-effect information for their work. Natural Language Processing (NLP) techniques help in the automatic extraction of cause and effect from a given text. In this work, we build various cause-effect text span detection models using pre-trained transformer-based language models and fine-tune these models using the data provided by the FinCausal 2025 task organizers. We used only the FinCausal 2025 datasets to train our models; no external data was used. Our ensemble of sequence tagging models based on the fine-tuned RoBERTa-Large language model achieves an SAS score of 0.9604 and an Exact Match score of 0.7214 for English. Similarly, for Spanish, we obtain an SAS score of 0.9607 and an Exact Match score of 0.7166. This is our first participation in the FinCausal task.

pdf bib
Sarang at FinCausal 2025: Contextual QA for Financial Causality Detection Combining Extractive and Generative Models
Avinash Trivedi | Gauri Toshniwal | Sangeetha S | S R. Balasundaram

This paper describes our approach for the FinCausal 2025 English Shared Task, aimed at detecting and extracting causal relationships from financial text. The task involved answering context-driven questions to identify causes or effects within specified text segments. Our method utilized a consciousAI RoBERTa-base encoder model, fine-tuned on the SQuADx dataset, which we further fine-tuned on the FinCausal 2025 development set. To enhance the quality and contextual relevance of the answers, we passed the outputs of the extractive model through Gemma2-9B, a generative large language model, for answer refinement. This hybrid approach effectively addressed the task’s requirements, showcasing the strength of combining extractive and generative models. We (team Sarang) achieved outstanding results, securing 3rd rank with a Semantic Answer Similarity (SAS) score of 96.74% and an Exact Match (EM) score of 70.14%.

pdf bib
Enhancing Causal Relationship Detection Using Prompt Engineering and Large Language Models
Pulkit Chatwal | Amit Agarwal | Ankush Mittal

This paper explores the use of large language models (LLMs) and prompt engineering to detect causal relationships in financial disclosures. The task was part of the FinCausal 2025 shared competition, which focuses on identifying cause-and-effect relationships in financial texts across languages. The study demonstrates the effectiveness of LLMs, specifically LLaMA 3.2, in tackling causality detection in English and Spanish financial reports. The paper introduces various prompt engineering techniques, including zero-shot, few-shot, and chain-of-thought (CoT) prompting, to improve performance. For English, the best results were achieved using the Few-Shot + CoT approach, while for Spanish, the Few-Shot method provided strong semantic alignment despite lower exact match accuracy. The evaluation used two metrics: Exact Match (EM) and Semantic Alignment Score (SAS). The results showed high SAS scores for both languages, indicating good semantic understanding, with English performing particularly well. The study emphasizes the importance of tailored prompt engineering techniques to handle language-specific nuances in financial contexts and suggests future research directions, including fine-tuning LLaMA 3.2 and testing additional LLM architectures to enhance multilingual causality detection in financial texts.
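
Since the abstract above names zero-shot, few-shot, and chain-of-thought prompting without giving the exact prompts, the following is a generic few-shot + CoT template for cause/effect questions; all wording is illustrative, not the authors' actual prompts.

```python
# Illustrative prompt construction only: a generic few-shot + chain-of-thought
# template for financial cause/effect extraction. The example text is invented.
FEW_SHOT_COT = """You extract the cause or effect asked about in a financial text.

Example:
Text: "Revenue fell 8% because store traffic declined during the quarter."
Question: What caused revenue to fall 8%?
Reasoning: The word "because" links the revenue drop to declining store traffic.
Answer: store traffic declined during the quarter

Now solve:
Text: "{text}"
Question: {question}
Reasoning:"""

prompt = FEW_SHOT_COT.format(
    text="Margins improved as the company cut logistics costs.",
    question="What caused margins to improve?",
)
print(prompt)
```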

pdf bib
Addressing Hallucination in Causal Q&A: The Efficacy of Fine-tuning over Prompting in LLMs
Georg Niess | Houssam Razouk | Stasa Mandic | Roman Kern

This paper presents our approach and findings for participating in the FinCausal 2025 competition, which addresses causal question answering derived from financial documents, specifically English and Spanish annual reports. We investigate the effectiveness of generative models, such as Llama, in contrast to common extractive methods like BERT-based token classification. While prompt optimization and few-shot learning offer some improvements, they were insufficient for consistently outperforming extractive methods in FinCausal, suffering from hallucinations. In contrast, fine-tuning generative models was shown to be essential for minimizing hallucinations and achieving superior performance. Using our fine-tuned multilingual model for both tasks, we outperform our extractive and monolingual approaches, achieving top results for Spanish and second-best for English in the competition. Our findings indicate that fine-tuned large language models are well-suited for causal Q&A from complex financial narratives, offering robust multilingual capabilities and effectively mitigating hallucinations.

pdf bib
PresiUniv at FinCausal 2025 Shared Task: Applying Fine-tuned Language Models to Explain Financial Cause and Effect with Zero-shot Learning
Medha Jeenoor | Madiha Aziz | Saipriya Dipika Vaidyanathan | Avijit Samantraya | Sandeep Mathias

Transformer-based multilingual question-answering models are used to detect causality in financial text data. This study employs BERT for English text and XLM-RoBERTa for Spanish data, both fine-tuned on the SQuAD datasets. We design a system that uses these pre-trained models to extract answers to the targeted questions, based on the given context. The results validate the effectiveness of the systems in understanding nuanced financial language and offer a tool for multilingual text analysis. Our system achieves SAS scores of 0.75 in Spanish and 0.82 in English.

pdf bib
Extracting Financial Causality through QA: Insights from FinCausal 2025 Spanish Subtask
Marcelo Jose Moreno Aviles | Alejandro Vaca

Our methodology tested both span-extraction and generative approaches, with generative models ultimately proving more effective. SuperLenia, a private generative model combining public models ranging from 7B to 8B parameters, was the best-performing model. SuperLenia was fine-tuned using QLoRA in a chat-based framework, and hyperparameter tuning during inference, including adjustments to temperature and sampling, further enhanced its performance.
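
For readers unfamiliar with QLoRA, here is a minimal 4-bit quantized LoRA setup sketch with Hugging Face transformers and peft; the base checkpoint and every hyperparameter are assumptions, since the abstract does not disclose SuperLenia's components.

```python
# Minimal QLoRA sketch: 4-bit base weights with trainable LoRA adapters.
# All values are illustrative, not SuperLenia's actual configuration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit quantization (the "Q" in QLoRA)
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2",  # assumed stand-in base model
    quantization_config=bnb,
    device_map="auto",
)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # adapters on attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)         # only the adapter weights are trained
model.print_trainable_parameters()
```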

pdf bib
FinNLP-FNP-LLMFinLegal-2025 Shared Task: Financial Misinformation Detection Challenge Task
Zhiwei Liu | Keyi Wang | Zhuo Bao | Xin Zhang | Jiping Dong | Kailai Yang | Mohsinul Kabir | Polydoros Giannouris | Rui Xing | Seongchan Park | Jaehong Kim | Dong Li | Qianqian Xie | Sophia Ananiadou

Despite the promise of large language models (LLMs) in finance, their capabilities for financial misinformation detection (FMD) remain largely unexplored. To evaluate the capabilities of LLMs on the FMD task, we introduce the financial misinformation detection shared task featured at COLING FinNLP-FNP-LLMFinLegal-2025, the FMD Challenge. This challenge aims to evaluate the ability of LLMs to verify financial misinformation while generating plausible explanations. In this paper, we provide an overview of the task and dataset, summarize participants’ methods, and present their experimental evaluations, highlighting the effectiveness of LLMs in addressing the FMD task. To the best of our knowledge, the FMD Challenge is one of the first challenges for assessing LLMs in the field of FMD. Therefore, we provide detailed observations and draw conclusions for the future development of this field.

pdf bib
FMD-Mllama at the Financial Misinformation Detection Challenge Task: Multimodal Reasoning and Evidence Generation
Zheyang Luo | Guangbin Zhang | Jiahao Xiao | Xuankang Zhang | Yulin Dou | Jiangming Liu

This paper presents our system for the Financial Misinformation Detection Challenge Task. We utilize multimodal reasoning, incorporating textual and image information, to address the task. Our system demonstrates the capability to detect financial misinformation while providing comprehensive explanations. Experimental results show that our final system significantly outperforms the baselines and ranks second on the task leaderboard.

pdf bib
Ask Asper at the Financial Misinformation Detection Challenge Task: Enhancing Financial Decision-Making: A Dual Approach Using Explainable LLMs for Misinformation Detection
Sonal Singh | Rahul Mehta | Yadunath Gupta | Soudip Roy Chowdhury

The integrity of the market and investor confidence are seriously threatened by the proliferation of financial misinformation via digital media. Existing approaches such as fact checking, lineage detection, and others have demonstrated significant progress in detecting financial misinformation. In this paper, we present a novel two-stage framework leveraging large language models (LLMs) to identify and explain financial misinformation. The framework first employs a GPT-4 model fine-tuned on financial datasets to classify claims as “True,” “False,” or “Not Enough Information” by analyzing relevant financial context. To enhance classification reliability, a second LLM serves as a verification layer, examining and refining the initial model’s predictions. This dual-model approach ensures greater accuracy in misinformation detection through cross-validation. Beyond classification, our methodology emphasizes generating clear, concise, and actionable explanations that enable users to understand the reasoning behind each determination. By combining robust misinformation detection with interpretability, our paradigm advances AI system transparency and accountability, providing valuable support to investors, regulators, and financial stakeholders in mitigating misinformation risks.
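
The classify-then-verify flow described above can be pictured with a short sketch; the `chat` helper and the prompts below are hypothetical stand-ins, not the authors' fine-tuned GPT-4 setup.

```python
# Sketch of the dual-model idea only. `chat` is a hypothetical placeholder for
# any LLM API call; the prompts are illustrative, not the authors' actual ones.
LABELS = {"True", "False", "Not Enough Information"}

def chat(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def classify(claim: str, context: str) -> str:
    out = chat(f"Claim: {claim}\nContext: {context}\n"
               f"Label as True, False, or Not Enough Information. Label:")
    return out.strip()

def verify(claim: str, context: str, label: str) -> str:
    out = chat(f"A first model labeled the claim below as '{label}'.\n"
               f"Claim: {claim}\nContext: {context}\n"
               f"Confirm or correct the label, then explain briefly. Label:")
    corrected = out.split("\n")[0].strip()
    return corrected if corrected in LABELS else label  # fall back to stage 1

# label = verify(claim, context, classify(claim, context))
```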

pdf bib
Team FMD LLM at the Financial Misinformation Detection Challenge Task: Exploring Task Structuring and Metadata Impact on Performance
Ken Kawamura

The detection of financial misinformation (FMD) is a growing challenge. In this paper, we investigate how task structuring and metadata integration impact the performance of large language models (LLMs) on FMD tasks. We compare two approaches: predicting the label before generating an explanation, and generating the explanation first. Our results reveal that prediction-first models achieve higher F1 scores. We also assess the effect of auxiliary metadata, which surprisingly degraded performance despite its correlation with the labels. Our findings highlight the importance of task order and the need to carefully consider whether to use metadata in limited data settings.

pdf bib
Dunamu ML at the Financial Misinformation Detection Challenge Task: Improving Supervised Fine-Tuning with LLM-based Data Augmentation
Dongjun Lee | Heesoo Park

In this paper, we describe Dunamu ML’s submission to the Financial Misinformation Detection (FMD) 2025 shared task. To address the low-resource challenge in FMD, we augmented a general domain misinformation detection dataset for training. We first collected claims, contexts, and misinformation labels from a public dataset. Then, we generated evidence for each label based on a closed LLM with few-shot examples extracted from the FMD training dataset. Finally, we oversampled the training data specific to the financial domain and augmented it with the generated data to perform supervised fine-tuning (SFT) on the LLM. When evaluated on the blind test dataset, our model achieved an F1 score of 84.67 in misinformation classification and a ROUGE-1 score of 81.21 in evidence generation, ranking first on the leaderboard in both aspects.

pdf bib
1-800-SHARED-TASKS at the Financial Misinformation Detection Challenge Task: Sequential Learning for Claim Verification and Explanation Generation in Financial Domains
Jebish Purbey | Siddhant Gupta | Nikhil Manali | Siddartha Pullakhandam | Drishti Sharma | Ashay Srivastava | Ram Mohan Rao Kadiyala

This paper presents the system description of our entry for the COLING 2025 FMD challenge, focusing on misinformation detection in financial domains. We experimented with a combination of large language models, including Qwen, Mistral, and Gemma-2, and leveraged pre-processing and sequential learning not only to identify fraudulent financial content but also to generate coherent and concise explanations that clarify the rationale behind the classifications. Our approach achieved competitive results with an F1-score of 0.8283 for classification and a ROUGE-1 of 0.7253 for explanations. This work highlights the transformative potential of LLMs in financial applications, offering insights into their capabilities for combating misinformation and enhancing transparency while identifying areas for future improvement in robustness and domain adaptation.

pdf bib
GMU-MU at the Financial Misinformation Detection Challenge Task: Exploring LLMs for Financial Claim Verification
Alphaeus Dmonte | Roland R. Oruche | Marcos Zampieri | Eunmi Ko | Prasad Calyam

This paper describes the team GMU-MU submission to the Financial Misinformation Detection challenge. The goal of this challenge is to identify financial misinformation and generate explanations justifying the predictions by developing or adapting LLMs. The participants were provided with a dataset of financial claims categorized into six financial domain categories. We experiment with the Llama model using two approaches: instruction-tuning the model with the training dataset, and a prompting approach that directly evaluates the off-the-shelf model. Our best system was placed 5th among the 12 systems, achieving an overall evaluation score of 0.6682.

pdf bib
Deloitte (Drocks) at the Financial Misinformation Detection Challenge Task: Enhancing Misinformation Detection through Instruction-Tuned Models
Harika Abburi | Alex Chandler | Edward Bowen | Sanmitra Bhattacharya | Nirmala Pudota

Large Language Models (LLMs) are capable of producing highly fluent and convincing text; however, they can sometimes include factual errors and misleading information. Consequently, LLMs have emerged as tools for the rapid and cost-effective generation of financial misinformation, enabling bad actors to harm individual investors and attempt to manipulate markets. In this study, we instruction-tune Generative Pre-trained Transformers (GPT-4o-mini) to detect financial misinformation and produce concise explanations for why a given claim or statement is classified as misinformation, leveraging the contextual information provided. Our model achieved fourth place in Financial Misinformation Detection (FMD) shared task with a micro F1 score of 0.788 and a ROUGE-1 score of 0.743 on the private test set of FACT-checking within the FINancial domain (FIN-FACT) dataset provided by the shared task organizers.

pdf bib
Capybara at the Financial Misinformation Detection Challenge Task: Chain-of-Thought Enhanced Financial Misinformation Detection
Yupeng Cao | Haohang Li | Yangyang Yu | Shashidhar Reddy Javaji

Financial misinformation poses a significant threat to investment decisions and market stability. Recently, the application of Large Language Models (LLMs) for detecting financial misinformation has gained considerable attention within the natural language processing (NLP) community. The Financial Misinformation Detection (FMD) challenge @ COLING 2025 serves as a valuable platform for collaboration and innovation. This paper presents our solution to the FMD challenge. Our approach involves using search engines to retrieve summarized high-quality information as supporting evidence and designing a financial domain-specific chain-of-thought to enhance the reasoning capabilities of LLMs. We evaluated our method on both commercial closed-source LLMs (the GPT family) and open-source models (Llama-3.1-8B and Qwen). The experimental results demonstrate that the proposed method improves veracity prediction performance; however, the quality of the generated explanations remains relatively poor. In the paper, we present the experimental findings and provide an in-depth analysis of these results.

pdf bib
A Scalable Framework for Legal Text Understanding in Regulatory and Financial Contexts
Santiago Martínez | Juan Manuel Castañeda | Ruben Manrique

This study presents a comprehensive approach to developing a domain-specific large language model (LLM) for regulatory and financial text interpretation. A specialized corpus was constructed through large-scale scraping of financial and regulatory documents across domains such as compliance, licensing, and financial reporting. The data was preprocessed using GPT-4o-mini with prompt engineering to retain critical information and remove noise. We further pre-trained a LLaMA-3.1-8B model on the curated corpus and fine-tuned it using an instruction dataset covering nine tasks from the COLING 2025 Regulations Challenge, including acronym expansion, regulatory question-answering, and XBRL-based financial analytics, employing QLoRA to reduce memory requirements. The model exhibits a slight improvement over the baseline in answering complex regulatory questions (detailed QA) and expanding acronyms. This study demonstrates the potential of domain-specific LLMs in regulatory text interpretation and lays the groundwork for future research in specialized NLP evaluation methodologies.

pdf bib
Audit-FT at the Regulations Challenge Task: An Open-Source Large Language Model for Audit
Jiajia Huang | Maowei Jiang | Haoran Zhu

Intelligent auditing represents a crucial advancement in modern audit practices, enhancing both the quality and efficiency of audits within the realm of artificial intelligence. With the rise of large language models (LLMs), there is enormous potential for intelligent models to contribute to the audit domain. However, general LLMs applied in the audit domain face the challenges of lacking specialized knowledge and the presence of data biases. To overcome these challenges, this study introduces AuditWen, an open-source audit LLM built by fine-tuning Qwen on instruction data constructed from the audit domain. We first outline the application scenarios for LLMs in auditing and extract the requirements that shape the development of LLMs tailored for audit purposes. We then build AuditWen by fine-tuning Qwen on a 30k-instruction dataset constructed from 15 audit tasks across 3 layers. In the evaluation stage, we propose a benchmark with 5k instructions that covers a set of critical audit tasks derived from the application scenarios. With this benchmark, we compare AuditWen with other existing LLMs on information extraction, question answering, and document generation. The experimental results demonstrate the superior performance of AuditWen in both question understanding and answer generation, making it an immediately valuable tool for auditing.

pdf bib
FinMind-Y-Me at the Regulations Challenge Task: Financial Mind Your Meaning based on THaLLE
Pantid Chantangphol | Pornchanan Balee | Kantapong Sucharitpongpan | Chanatip Saetia | Tawunrat Chalothorn

This paper presents our submission to the COLING 2025 regulation challenge, focusing on nine tasks in the regulatory and financial domains. The challenge aims to advance large language models beyond general-purpose capabilities, adapting them for regulatory and financial tasks using a unified framework of task-specific prompts and input templates. We propose a sequential fine-tuning approach that integrates reasoning-based training, tailored system prompts, and Chain-of-Thought (CoT) inference to optimize task-specific performance. This method improves accuracy and reliability across diverse tasks. Notably, CoT inference demonstrates exceptional effectiveness in handling complex scenarios and tasks requiring specific answer patterns, such as named entity recognition and financial calculations. Our model achieved an overall score of 54.801%, ranking 1st among all teams and becoming the top performer in the challenge. These results highlight the effectiveness of sequential fine-tuning, advanced reasoning techniques, and fine-tuned prompts in improving performance and scalability for complex regulatory and financial applications.

pdf bib
FinNLP-FNP-LLMFinLegal-2025 Shared Task: Regulations Challenge
Keyi Wang | Jaisal Patel | Charlie Shen | Daniel Kim | Andy Zhu | Alex Lin | Luca Borella | Cailean Osborne | Matt White | Steve Yang | Kairong Xiao | Xiao-Yang Liu

Financial large language models (FinLLMs) have been applied to various tasks in business, finance, accounting, and auditing. Complex financial regulations and standards are critical to financial services, which LLMs must comply with. However, FinLLMs’ performance in understanding and interpreting financial regulations has rarely been studied. Therefore, we organize the Regulations Challenge, a shared task at COLING FinNLP-FNP-LLMFinLegal-2025. It encourages the academic community to explore the strengths and limitations of popular LLMs. We create 9 novel tasks and corresponding question sets. In this paper, we provide an overview of these tasks and summarize participants’ approaches and results. We aim to raise awareness of FinLLMs’ professional capability in financial regulations and industry standards.

pdf bib
IntelliChain Stars at the Regulations Challenge Task: A Large Language Model for Financial Regulation
Shijia Jiang | Yongfu Dai | Haochen Jia | Yuxin Wang | Hao Wang

We present our approach to the COLING-2025 Regulations Challenge, which evaluates large language models (LLMs) on nine regulatory tasks, such as abbreviation recognition and financial data extraction. To address challenges like domain-specific terminologies and dynamic regulatory contexts, we developed a robust data construction pipeline, integrating proprietary Chinese regulatory data, Fin-GPT datasets, and financial Q&A data. The pipeline applied language filtering, semantic screening, and deduplication, among other steps, resulting in a 30,000-example dataset combining financial regulations and general financial data. Using this dataset, we fine-tuned Llama 3.2-3B-Instruct to create Reg-LLaMA, a specialized model that outperformed baselines on the Regulations Challenge and PIXIU datasets. These results demonstrate the effectiveness of domain-specific data construction in advancing LLMs for regulatory tasks, paving the way for reliable and interpretable AI in regulated industries.

pdf bib
Fin-DBQA Shared-task: Database Querying and Reasoning
Rungsiman Nararatwong | Natthawut Kertkeidkachorn | Hiroya Takamura | Ryutaro Ichise

This paper presents the results of the Fin-DBQA shared task based on a question-answering dataset, focusing on database querying and reasoning. The dataset, consisting of 400 questions grouped into 40 conversations, evaluates language models’ abilities to answer sequential questions with complex reasoning and multi-hop queries in a multi-turn conversational question-answering setting. Each sample includes the question, answer, database queries, querying result (tables), and a program (series of operations) that produces the answer from the result. We received 52 submissions from three participants, with scores significantly surpassing the baselines. One participant submitted a paper detailing a prompt-based solution using large language models with additional data preprocessing that helps improve the overall performance.

pdf bib
Adapt LLM for Multi-turn Reasoning QA using Tidy Data
Jan Strich

This paper presents our submission to the Fin-DBQA shared task at the 9th FinNLP workshop. The task involves answering finance-focused questions in a multi-turn environment, requiring step-by-step reasoning and Python code generation. We propose a novel approach to tackle this multidimensional problem by pre-processing the data into the tidy data format, so that each column represents a variable and each row an observation. Our experiments demonstrate that using the tidy data format allows all models to surpass the SOTA, with GPT-4o achieving 50.62% accuracy on the DBQR-QA benchmark and securing second place on the shared task leaderboard. These findings suggest that transforming data into the tidy data format enhances reasoning capabilities, reduces syntax errors, and improves performance on table-reasoning QA tasks. The code is available online.
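
To make the tidy-data idea concrete, here is a small pandas sketch that melts a wide financial table into one-observation-per-row form; the column names and values are invented for the demonstration.

```python
# Tidy-data reshaping demo: a wide table (one column per year) is melted so
# each row is a single (company, year, revenue) observation. Toy data only.
import pandas as pd

wide = pd.DataFrame({
    "company": ["Acme", "Beta"],
    "revenue_2022": [120, 80],
    "revenue_2023": [135, 95],
})
tidy = wide.melt(id_vars="company", var_name="year", value_name="revenue")
tidy["year"] = tidy["year"].str.removeprefix("revenue_").astype(int)
print(tidy)  # one variable per column, one observation per row
```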

pdf bib
FinNLP-FNP-LLMFinLegal @ COLING 2025 Shared Task: Agent-Based Single Cryptocurrency Trading Challenge
Yangyang Yu | Haohang Li | Yupeng Cao | Keyi Wang | Zhiyang Deng | Zhiyuan Yao | Yuechen Jiang | Dong Li | Ruey-Ling Weng | Jordan W. Suchow

Despite the promise of large language model (LLM)-based agent frameworks in stock trading, their capabilities for comprehensive analysis of other financial assets, such as cryptocurrencies, remain largely unexplored. To evaluate the capabilities of LLM-based agent frameworks in cryptocurrency trading, we introduce an LLM-based financial shared task featured at the COLING 2025 FinNLP-FNP-LLMFinLegal workshop, named the Agent-based Single Cryptocurrency Trading Challenge. This challenge includes two cryptocurrencies: Bitcoin and Ethereum. In this paper, we provide an overview of the tasks and datasets, summarize participants’ methods, and present their experimental evaluations, highlighting the effectiveness of LLMs in addressing cryptocurrency trading challenges. To the best of our knowledge, the Agent-based Single Cryptocurrency Trading Challenge is one of the first challenges for assessing LLMs in the financial area. We therefore provide detailed observations and takeaway conclusions for future development in this area.

pdf bib
Sam’s Fans at the Crypto Trading Challenge Task: A Threshold-Based Decision Approach Based on FinMem Framework
You Wang | Jingyi Wei | Mingsong Ye

The advancements of large language models (LLMs) demonstrate the value of pre-training on diverse datasets, enabling these models to excel across a wide range of tasks while adapting effectively to specialized applications. This study presents an approach to enhance LLMs’ ability to process and trade based on cryptocurrency data across different time horizons. We fine-tuned two established language models, Llama-3.1-8b and Qwen2.5-7b, to effectively interpret and utilize temporal market data provided by the FinMem framework. Our methodology enables these models to analyze multi-period market data from FinMem, including price movements and momentum indicators, to execute effective cryptocurrency trading decisions. Results show that this fine-tuning approach improves the models’ capacity to analyze market conditions and inform trading decisions based on multi-period market dynamics.

pdf bib
300k/ns team at the Crypto Trading Challenge Task: Enhancing the justification of accurate trading decisions through parameter-efficient fine-tuning of reasoning models
Artem Agarkov | Mihail Kulik | Leonid Shmyrkov

In this paper, we address the Agent-Based Single Cryptocurrency Trading Challenge, focusing on decision-making for trading Bitcoin and Ethereum. Our approach fine-tunes a Mistral AI model on a dataset comprising summarized cryptocurrency news, enabling it to make informed “buy,” “sell,” or “hold” decisions and articulate its reasoning. The model integrates textual sentiment analysis and contextual reasoning with real-time market trends, demonstrating the potential of Large Language Models (LLMs) in high-stakes financial decision-making. The model achieved notable accuracy, highlighting its capacity to manage risk while optimizing returns. This work contributes to advancing AI-driven solutions for cryptocurrency markets and offers insights into the practical deployment of LLMs in real-time trading environments. We made our model publicly available.

up

pdf (full)
bib (full)
Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect)

pdf bib
Proceedings of the 1st Workshop on GenAI Content Detection (GenAIDetect)
Firoj Alam | Preslav Nakov | Nizar Habash | Iryna Gurevych | Shammur Chowdhury | Artem Shelmanov | Yuxia Wang | Ekaterina Artemova | Mucahid Kutlu | George Mikros

pdf bib
SilverSpeak: Evading AI-Generated Text Detectors using Homoglyphs
Aldan Creo | Shushanta Pudasaini

The advent of Large Language Models (LLMs) has enabled the generation of text that increasingly exhibits human-like characteristics. As the detection of such content is of significant importance, substantial research has been conducted with the objective of developing reliable AI-generated text detectors. These detectors have demonstrated promising results on test data, but recent research has revealed that they can be circumvented by employing different techniques. In this paper, we present homoglyph-based attacks (‘A’ → Cyrillic ‘А’) as a means of circumventing existing detectors. We conduct a comprehensive evaluation to assess the effectiveness of these attacks on seven detectors, including ArguGPT, Binoculars, DetectGPT, Fast-DetectGPT, Ghostbuster, OpenAI’s detector, and watermarking techniques, on five different datasets. Our findings demonstrate that homoglyph-based attacks can effectively circumvent state-of-the-art detectors, leading them to classify all texts as either AI-generated or human-written (decreasing the average Matthews Correlation Coefficient from 0.64 to -0.01). Through further examination, we extract the technical justification underlying the success of the attacks, which varies across detectors. Finally, we discuss the implications of these findings and potential defenses against such attacks.
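
As a toy illustration of the homoglyph substitution described above, the snippet below swaps a few Latin letters for visually identical Cyrillic ones; the mapping is a small sample for demonstration, not the paper's full attack.

```python
# Homoglyph substitution demo: Cyrillic look-alikes replace Latin letters,
# leaving the text unchanged to human eyes but altering its code points,
# which can disrupt the tokenization that detectors rely on.
HOMOGLYPHS = {"A": "\u0410", "a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441"}

def homoglyph_rewrite(text: str) -> str:
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)

original = "A coherent essay on economics."
attacked = homoglyph_rewrite(original)
print(attacked)              # renders identically to the original
print(original == attacked)  # False: the underlying code points differ
```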

pdf bib
Human vs. AI: A Novel Benchmark and a Comparative Study on the Detection of Generated Images and the Impact of Prompts
Philipp Moeßner | Heike Adel

With the advent of publicly available AI-based text-to-image systems, the process of creating photorealistic but fully synthetic images has been largely democratized. This can pose a threat to the public through a simplified spread of disinformation. Machine detectors and human media expertise can help to differentiate between AI-generated (fake) and real images and counteract this danger. Although AI generation models are highly prompt-dependent, the impact of the prompt on the fake detection performance has rarely been investigated yet. This work therefore examines the influence of the prompt’s level of detail on the detectability of fake images, both with an AI detector and in a user study. For this purpose, we create a novel dataset, COCOXGEN, which consists of real photos from the COCO dataset as well as images generated with SDXL and Fooocus using prompts of two standardized lengths. Our user study with 200 participants shows that images generated with longer, more detailed prompts are detected significantly more easily than those generated with short prompts. Similarly, an AI-based detection model achieves better performance on images generated with longer prompts. However, humans and AI models seem to pay attention to different details, as we show in a heat map analysis.

pdf bib
Mirror Minds : An Empirical Study on Detecting LLM-Generated Text via LLMs
Josh Baradia | Shubham Gupta | Suman Kundu

The use of large language models (LLMs) is inevitable in text generation. LLMs are intelligent and are slowly replacing search engines, having become the de facto choice for conversation, knowledge extraction, and brainstorming. This study focuses on one question: ‘Can we utilize the generative capabilities of LLMs to detect AI-generated content?’ We present a methodology and empirical results on four publicly available datasets. The results show that, with 90% accuracy, it is possible to detect AI-generated content using a zero-shot detector that utilizes multiple LLMs.

pdf bib
Benchmarking AI Text Detection: Assessing Detectors Against New Datasets, Evasion Tactics, and Enhanced LLMs
Shushanta Pudasaini | Luis Miralles | David Lillis | Marisa Llorens Salvador

The rapid advancement of Large Language Models (LLMs), such as GPT-4, has sparked concerns regarding academic misconduct, misinformation, and the erosion of originality. Despite the growing number of AI detection tools, their effectiveness is often undermined by sophisticated evasion tactics and the continuous evolution of LLMs. This research benchmarks the performance of leading AI detectors, including OpenAI Detector, RADAR, and ArguGPT, across a variety of text domains, evaded content, and text generated by cutting-edge LLMs. Our experiments reveal that current detection models show considerable unreliability in real-world scenarios, particularly when tested against diverse data domains and novel evasion strategies. The study underscores the need for enhanced robustness in detection systems and provides valuable insights into areas of improvement for these models. Additionally, this work lays the groundwork for future research by offering a comprehensive evaluation of existing detectors under challenging conditions, fostering a deeper understanding of their limitations. The experimental code and datasets are publicly available for further benchmarking on GitHub.

pdf bib
Cross-table Synthetic Tabular Data Detection
G. Charbel N. Kindji | Lina M. Rojas Barahona | Elisa Fromont | Tanguy Urvoy

Detecting synthetic tabular data is essential to prevent the distribution of false or manipulated datasets that could compromise data-driven decision-making. This study explores whether synthetic tabular data can be reliably identified “in the wild”—meaning across different generators, domains, and table formats. This challenge is unique to tabular data, where structures (such as number of columns, data types, and formats) can vary widely from one table to another. We propose three cross-table baseline detectors and four distinct evaluation protocols, each corresponding to a different level of “wildness”. Our very preliminary results confirm that cross-table adaptation is a challenging task.

pdf bib
Your Large Language Models are Leaving Fingerprints
Hope Elizabeth McGovern | Rickard Stureborg | Yoshi Suhara | Dimitris Alikaniotis

It has been shown that fine-tuned transformers and other supervised detectors are effective for distinguishing between human and machine-generated texts in non-adversarial settings, but we find that even simple classifiers on top of n-gram and part-of-speech features can achieve very robust performance on both in- and out-of-domain data. To understand how this is possible, we analyze machine-generated output text in four datasets, finding that LLMs possess unique fingerprints that manifest as slight differences in the frequency of certain lexical and morphosyntactic features. We show how to visualize such fingerprints, describe how they can be used to detect machine-generated text and find that they are even robust across text domains. We find that fingerprints are often persistent across models in the same model family (e.g. 13B parameter LLaMA’s fingerprint is similar to that of 65B parameter LLaMA) and that while a detector trained on text from one model can easily recognize text generated by a model in the same family, it struggles to detect text generated by an unrelated model.
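
In the spirit of the n-gram fingerprints discussed above, here is a minimal detector sketch: character n-gram TF-IDF features feeding a logistic regression classifier. The toy texts and labels are placeholders, not the paper's data.

```python
# Hedged sketch of a simple n-gram fingerprint detector: character n-gram
# TF-IDF plus logistic regression. Texts and labels below are toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["I reckon the match was a bit of a letdown, honestly.",
         "The match, overall, was a compelling and well-structured event."]
labels = [0, 1]  # 0 = human, 1 = machine (toy labels)

detector = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # lexical fingerprint
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)
print(detector.predict(["A thoroughly engaging and well-paced encounter."]))
```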

pdf bib
GPT-4 is Judged More Human than Humans in Displaced and Inverted Turing Tests
Ishika M. Rathi | Sydney Taylor | Benjamin Bergen | Cameron Jones

Everyday AI detection requires differentiating between humans and AI in informal, online conversations. At present, human users most often do not interact directly with bots but instead read their conversations with other humans. We measured how well humans and large language models can discriminate using two modified versions of the Turing test: inverted and displaced. GPT-3.5, GPT-4, and displaced human adjudicators judged whether an agent was human or AI on the basis of a Turing test transcript. We found that both AI and displaced human judges were less accurate than interactive interrogators, with below chance accuracy overall. Moreover, all three judged the best-performing GPT-4 witness to be human more often than human witnesses. This suggests that both humans and current LLMs struggle to distinguish between the two when they are not actively interrogating the person, underscoring an urgent need for more accurate tools to detect AI in conversations.

pdf bib
The Consistent Lack of Variance of Psychological Factors Expressed by LLMs and Spambots
Vasudha Varadarajan | Salvatore Giorgi | Siddharth Mangalik | Nikita Soni | Dave M. Markowitz | H. Andrew Schwartz

In recent years, the proliferation of chatbots like ChatGPT and Claude has led to an increasing volume of AI-generated text. While the text itself is convincingly coherent and human-like, the variety of expressed human attributes may still be limited. Using theoretical individual differences, the fundamental psychological traits which distinguish people, this study reveals a distinctive characteristic of such content: AI generations exhibit remarkably limited variation in inferrable psychological traits compared to human-authored texts. We present a review and study across multiple datasets spanning various domains. We find that AI-generated text consistently models the authorship of an “average” human with so little variation that, on aggregate, it is clearly distinguishable from human-written texts using unsupervised methods (i.e., without using ground truth labels). Our results show that (1) fundamental human traits are able to accurately distinguish human- and machine-generated text, and (2) current generation capabilities fail to capture a diverse range of human traits.

pdf bib
DAMAGE: Detecting Adversarially Modified AI Generated Text
Elyas Masrour | Bradley N. Emi | Max Spero

AI humanizers are a new class of online software tools meant to paraphrase and rewrite AI-generated text in a way that allows them to evade AI detection software. We study 19 AI humanizer and paraphrasing tools and qualitatively assess their effects and faithfulness in preserving the meaning of the original text. We show that many existing AI detectors fail to detect humanized text. Finally, we demonstrate a robust model that can detect humanized AI text while maintaining a low false positive rate using a data-centric augmentation approach. We attack our own detector, training our own fine-tuned model optimized against our detector’s predictions, and show that our detector’s cross-humanizer generalization is sufficient to remain robust to this attack.

pdf bib
Text Graph Neural Networks for Detecting AI-Generated Content
Andric Valdez-Valenzuela | Helena Gómez-Adorno | Manuel Montes-y-Gómez

The widespread availability of Large Language Models (LLMs) such as GPT-4 and Llama-3, among others, has led to a surge in machine-generated content across various platforms, including social media, educational tools, and academic settings. While these models demonstrate remarkable capabilities in generating coherent text, their misuse raises significant concerns. For this reason, detecting machine-generated text has become a pressing need to mitigate these risks. This research proposes a novel classification method combining text-graph representations with Graph Neural Networks (GNNs) and different node feature initialization strategies to distinguish between human-written and machine-generated content. Experimental results demonstrate that the proposed approach outperforms traditional machine learning classifiers, highlighting the effectiveness of integrating structural and semantic relationships in text.

pdf bib
I Know You Did Not Write That! A Sampling Based Watermarking Method for Identifying Machine Generated Text
Kaan Efe Keleş | Ömer Kaan Gürbüz | Mucahid Kutlu

Potential harms of Large Language Models such as mass misinformation and plagiarism can be partially mitigated if there exists a reliable way to detect machine generated text. In this paper, we propose a new watermarking method to detect machine-generated texts. Our method embeds a unique pattern within the generated text, ensuring that while the content remains coherent and natural to human readers, it carries distinct markers that can be identified algorithmically. Specifically, we intervene with the token sampling process in a way which enables us to trace back our token choices during the detection phase. We show how watermarking affects textual quality and compare our proposed method with a state-of-the-art watermarking method in terms of robustness and detectability. Through extensive experiments, we demonstrate the effectiveness of our watermarking scheme in distinguishing between watermarked and non-watermarked text, achieving high detection rates while maintaining textual quality.
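
The abstract does not spell out the sampling intervention, so the toy sketch below shows one generic way a sampling-stage watermark can work: seeding a "preferred" token subset from the previous token and counting hits at detection time. It illustrates the general idea, not the authors' scheme.

```python
# Generic sampling-stage watermark toy (an illustration, not the paper's
# method): each step derives a preferred half of the vocabulary from the
# previous token; detection measures how often tokens land in that half.
import random

VOCAB = list(range(1000))

def preferred_set(prev_token: int) -> set:
    rng = random.Random(prev_token)            # keyed by context, reproducible
    return set(rng.sample(VOCAB, k=len(VOCAB) // 2))

def generate(length: int, seed_token: int = 0) -> list:
    tokens, prev = [], seed_token
    for _ in range(length):
        tok = random.choice(sorted(preferred_set(prev)))  # sample only preferred
        tokens.append(tok)
        prev = tok
    return tokens

def watermark_score(tokens: list, seed_token: int = 0) -> float:
    prev, hits = seed_token, 0
    for tok in tokens:
        hits += tok in preferred_set(prev)
        prev = tok
    return hits / len(tokens)                  # ~1.0 watermarked, ~0.5 otherwise

print(watermark_score(generate(200)))                               # near 1.0
print(watermark_score([random.choice(VOCAB) for _ in range(200)]))  # near 0.5
```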

pdf bib
DCBU at GenAI Detection Task 1: Enhancing Machine-Generated Text Detection with Semantic and Probabilistic Features
Zhaowen Zhang | Songhao Chen | Bingquan Liu

This paper presents our approach to the MGT Detection Task 1, which focuses on detecting AI-generated content. The objective of this task is to classify texts as either machine-generated or human-written. We participated in Subtask A, which concentrates on English-only texts. We utilized the RoBERTa model for semantic feature extraction and the LLaMA3 model for probabilistic feature analysis. By integrating these features, we aimed to enhance the system’s classification accuracy. Our approach achieved strong results, with an F1 score of 0.7713 on Subtask A, ranking ninth among 36 teams. These results demonstrate the effectiveness of our feature integration strategy.

pdf bib
L3i++ at GenAI Detection Task 1: Can Label-Supervised LLaMA Detect Machine-Generated Text?
Hanh Thi Hong Tran | Nguyen Tien Nam

The widespread use of large language models (LLMs) influences different social media and educational contexts through the overwhelming generated text with a certain degree of coherence. To mitigate their potential misuse, this paper explores the feasibility of finetuning LLaMA with label supervision (named LS-LLaMA) in unidirectional and bidirectional settings, to discriminate the texts generated by machines and humans in monolingual and multilingual corpora. Our findings show that unidirectional LS-LLaMA outperformed the sequence language models as the benchmark by a large margin. Our code is publicly available at https://github.com/honghanhh/llama-as-a-judge.

pdf bib
TechExperts(IPN) at GenAI Detection Task 1: Detecting AI-Generated Text in English and Multilingual Contexts
Gull Mehak | Amna Qasim | Abdul Gafar Manuel Meque | Nisar Hussain | Grigori Sidorov | Alexander Gelbukh

The ever-increasing spread of AI-generated text, driven by the considerable progress in large language models, entails a real problem for all digital platforms: how to ensure content authenticity. The team TechExperts(IPN) presents a method for detecting AI-generated content in English and multilingual contexts, using the google/gemma-2b model fine-tuned for COLING 2025 Shared Task 1 in both English and multilingual settings. Training results show peak F1 scores of 97.63% for English and 97.87% for multilingual detection, highlighting the model’s effectiveness in supporting content integrity across platforms.

pdf bib
SzegedAI at GenAI Detection Task 1: Beyond Binary - Soft-Voting Multi-Class Classification for Binary Machine-Generated Text Detection Across Diverse Language Models
Mihaly Kiss | Gábor Berend

This paper describes the participation of the SzegedAI team in Subtask A of Task 1 at the COLING 2025 Workshop on Detecting AI-Generated Content. Our solutions investigate the effectiveness of combining multi-class approaches with ensemble methods for detecting machine-generated text. This approach groups models into multiple classes based on properties such as model size or generative capabilities. Additionally, we employ a length-based method, utilizing specialized expert models designed for specific text length ranges. During inference, we condense multi-class predictions into a binary outcome, categorizing any label other than human as AI-generated. The effectiveness of both standard and snapshot ensemble techniques is evaluated. Although not all multi-class configurations outperformed the binary setup, our findings indicate that the combination of multi-class training and ensemble methods can enhance performance over single-method or binary approaches.
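
A compact sketch of condensing multi-class soft votes into a binary verdict, as described above: averaged class probabilities from several ensemble members, with every non-human class pooled into "AI-generated". The class grouping and numbers are invented for illustration.

```python
# Soft-voting then binary condensation: class probabilities from several
# ensemble members are averaged, and all non-human classes count toward "AI".
import numpy as np

CLASSES = ["human", "small_llm", "large_llm", "chat_llm"]  # illustrative grouping

# One probability row per ensemble member for a single input text (toy values).
member_probs = np.array([
    [0.40, 0.30, 0.20, 0.10],
    [0.35, 0.25, 0.25, 0.15],
    [0.45, 0.20, 0.20, 0.15],
])
avg = member_probs.mean(axis=0)   # soft voting across members
p_ai = avg[1:].sum()              # any label other than human -> AI-generated
print("AI-generated" if p_ai > avg[0] else "human", round(p_ai, 3))
```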

pdf bib
Team Unibuc - NLP at GenAI Detection Task 1: Qwen it detect machine-generated text?
Claudiu Creanga | Teodor-George Marchitan | Liviu P. Dinu

We explored both masked language models and causal models. For Subtask A, our best model achieved first place out of 36 teams on F1 Micro (auxiliary score, 0.8333) and second place on F1 Macro (main score, 0.8301). For causal models, our best model was a fine-tuned version of Qwen; for masked models, our best model was a fine-tuned version of XLM-RoBERTa-base.

pdf bib
Fraunhofer SIT at GenAI Detection Task 1: Adapter Fusion for AI-generated Text Detection
Karla Schaefer | Martin Steinebach

The detection of AI-generated content is becoming increasingly important with the growing prevalence of tools such as ChatGPT. This paper presents our results in the GenAI Content Detection Task 1, focusing on binary English and multilingual AI-generated text detection. We trained and tested transformers, adapters and adapter fusion. In the English setting (Subtask A), the combination of our own adapter on AI-generated text detection based on RoBERTa with a task adapter on multi-genre NLI yielded a macro F1 score of 0.828 on the challenge test set, ranking us third out of 35 teams. In the multilingual setting (Subtask B), adapter fusion resulted in a deterioration of the results. Consequently, XLM-RoBERTa, fine-tuned on the training set, was employed for the final evaluation, attaining a macro F1 score of 0.7258 and ranking tenth out of 25 teams.

pdf bib
OSINT at GenAI Detection Task 1: Multilingual MGT Detection: Leveraging Cross-Lingual Adaptation for Robust LLMs Text Identification
Shifali Agrahari | Sanasam Ranbir Singh

Detecting AI-generated text has become increasingly prominent. This paper presents our solution for the DAIGenC Task 1 Subtask 2, where we address the challenge of distinguishing human-authored text from machine-generated content, especially in multilingual contexts. We introduce Multi-Task Detection (MLDet), a model that leverages Cross-Lingual Adaptation and Model Generalization strategies for Multilingual Machine-Generated Text (MGT) detection. By combining language-specific embeddings with fusion techniques, MLDet creates a unified, language-agnostic feature representation, enhancing its ability to generalize across diverse languages and models. Our approach demonstrates strong performance, achieving macro and micro F1 scores of 0.7067 and 0.7187, respectively, and ranking 15th in the competition. We also evaluate our model across datasets generated by distinct models in many languages, showcasing its robustness in multilingual and cross-model scenarios.

pdf bib
Nota AI at GenAI Detection Task 1: Unseen Language-Aware Detection System for Multilingual Machine-Generated Text
Hancheol Park | Jaeyeon Kim | Geonmin Kim | Tae-Ho Kim

Recently, large language models (LLMs) have demonstrated unprecedented capabilities in language generation, yet they still often produce incorrect information. Therefore, determining whether a text was generated by an LLM has become one of the factors that must be considered when evaluating its reliability. In this paper, we discuss methods to determine whether texts written in various languages were authored by humans or generated by LLMs. We have discovered that classification accuracy significantly decreases for texts written in languages not observed during the training process, and we aim to address this issue. We propose a method to improve performance for unseen languages by using token-level predictive distributions extracted from various LLMs and text embeddings from a multilingual pre-trained language model. With the proposed method, we achieved third place out of 25 teams in Subtask B (binary multilingual machine-generated text detection) of Shared Task 1, with an F1 macro score of 0.7532.

pdf bib
CNLP-NITS-PP at GenAI Detection Task 1: AI-Generated Text Using Transformer-Based Approaches
Annepaka Yadagiri | Sai Teja Lekkala | Mandadoddi Srikar Vardhan | Partha Pakray | Reddi Mohana Krishna

In the current digital landscape, distinguishing between text generated by humans and that created by large language models has become increasingly complex. This challenge is exacerbated by advanced LLMs such as Gemini, ChatGPT, GPT-4, and LLaMA, which can produce highly sophisticated, human-like text. This indistinguishability introduces a range of challenges across different sectors: in cybersecurity, it increases the risk of social engineering and misinformation; on social media, it aids the spread of biased or false content; the educational sector faces issues of academic integrity; and within large, multi-team environments, these models add complexity to managing interactions between human and AI agents. To address these challenges, we approached the problem as a binary classification task using an English-language benchmark COLING dataset. We employed transformer-based neural network models, including BERT, DistilBERT, and RoBERTa, fine-tuning each model with optimized hyperparameters to maximize classification accuracy. Our team CNLP-NITS-PP achieved the 23rd rank in Subtask 1 at COLING-2025 for machine-generated text detection in English, with a main-score macro-F1 of 0.6502 and a micro-F1 of 0.6876.

pdf bib
LuxVeri at GenAI Detection Task 1: Inverse Perplexity Weighted Ensemble for Robust Detection of AI-Generated Text across English and Multilingual Contexts
MD. Kamrujjaman Mobin | Md Saiful Islam

This paper presents a system developed for Task 1 of the COLING 2025 Workshop on Detecting AI-Generated Content, focusing on the binary classification of machine-generated versus human-written text. Our approach utilizes an ensemble of models, with weights assigned according to each model’s inverse perplexity, to enhance classification accuracy. For the English text detection task, we combined RoBERTa-base, RoBERTa-base with the OpenAI detector, and BERT-base-cased, achieving a Macro F1-score of 0.7458, which ranked us 12th out of 35 teams. We ensembled RemBERT, XLM-RoBERTa-base, and BERT-base-multilingual-cased for the multilingual text detection task, employing the same inverse perplexity weighting technique. This resulted in a Macro F1-score of 0.7513, positioning us 4th out of 25 teams. Our results demonstrate the effectiveness of inverse perplexity weighting in improving the robustness of machine-generated text detection across both monolingual and multilingual settings, highlighting the potential of ensemble methods for this challenging task.
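
The inverse-perplexity weighting named in the abstract reduces to a few lines; the sketch below uses made-up perplexities and per-model machine probabilities to show the normalization and the weighted vote.

```python
# Inverse-perplexity ensemble weighting sketch: each model's vote is scaled by
# 1/perplexity (normalized), then the weighted P(machine) values are combined.
# All numbers below are made up for illustration.
import numpy as np

perplexities = np.array([12.0, 18.0, 25.0])  # one per ensemble member
p_machine = np.array([0.82, 0.64, 0.55])     # each model's P(machine) for a text

weights = 1.0 / perplexities
weights /= weights.sum()                     # normalize to sum to 1
ensemble_p = float(weights @ p_machine)      # weighted soft vote
print(round(ensemble_p, 3), "-> machine" if ensemble_p >= 0.5 else "-> human")
```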

pdf bib
Grape at GenAI Detection Task 1: Leveraging Compact Models and Linguistic Features for Robust Machine-Generated Text Detection
Nhi Hoai Doan | Kentaro Inui

In this project, we aim to address two subtasks of Task 1: Binary Multilingual Machine-Generated Text (MGT) Detection (Human vs. Machine) as part of the COLING 2025 Workshop on MGT Detection (Wang et al., 2025) using different approaches. The first method involves separately fine-tuning small language models tailored to the specific subtask. The second approach builds on this methodology by incorporating linguistic, syntactic, and semantic features, leveraging ensemble learning to integrate these features with model predictions for more robust classification. By evaluating and comparing these approaches, we aim to identify the most effective techniques for detecting machine-generated content across languages, providing insights into improving automated verification tools amidst the rapid growth of LLM-generated text in digital spaces.

pdf bib
AAIG at GenAI Detection Task 1: Exploring Syntactically-Aware, Resource-Efficient Small Autoregressive Decoders for AI Content Detection
Avanti Bhandarkar | Ronald Wilson | Damon Woodard

This paper presents a lightweight and efficient approach to AI-generated content detection using small autoregressive fine-tuned decoders (AFDs) for secure, on-device deployment. Motivated by resource-efficiency, syntactic awareness, and bias mitigation, our model employs small language models (SLMs) with autoregressive pre-training and loss fusion to accurately distinguish between human and AI-generated content while significantly reducing computational demands. The system achieved highest macro-F1 score of 0.8186, with the submitted model scoring 0.7874—both significantly outperforming the task baseline while reducing model parameters by ~60%. Notably, our approach mitigates biases, improving recall for human-authored text by over 60%. Ranking 8th out of 36 participants, these results confirm the feasibility and competitiveness of small AFDs in challenging, adversarial settings, making them ideal for privacy-preserving, on-device deployment suitable for real-world applications.

pdf bib
TurQUaz at GenAI Detection Task 1: Dr. Perplexity or: How I Learned to Stop Worrying and Love the Finetuning
Kaan Efe Keleş | Mucahid Kutlu

This paper details our methods for addressing Task 1 of the GenAI Content Detection shared tasks, which focus on distinguishing AI-generated text from human-written content. The task comprises two subtasks: Subtask A, centered on English-only datasets, and Subtask B, which extends the challenge to multilingual data. Our approach uses a fine-tuned XLM-RoBERTa model for classification, complemented by features including perplexity and TF-IDF. While perplexity is commonly regarded as a useful indicator for identifying machine-generated text, our findings suggest its limitations in multi-model and multilingual contexts. Our approach ranked 6th in Subtask A, but a submission issue left our Subtask B unranked, where it would have placed 23rd.

pdf bib
AI-Monitors at GenAI Detection Task 1: Fast and Scalable Machine Generated Text Detection
Azad Singh | Vishnu Tripathi | Ravindra Kumar Pandey | Pragyanand Saho | Prakhar Joshi | Neel Mani | Richa Alagh | Pallaw Mishra | Piyush Arora

We describe the work carried out by our team, AI-Monitors, on the Binary Multilingual Machine-Generated Text Detection (Human vs. Machine) task at COLING 2025. This task aims to determine whether a given text is generated by a machine or authored by a human. We propose a lightweight, simple, and scalable approach using encoder models such as RoBERTa and XLM-R, and provide an in-depth analysis based on our experiments. Our study found that carefully exploring fine-tuning parameters, such as (i) the number of training epochs, (ii) the maximum input size, and (iii) the handling of class imbalance, plays an important role in building an effective system: the optimal settings of these parameters can lead to a difference of about 5-6% in absolute terms for measures such as accuracy and F1. The paper presents crucial insights into optimal parameter selection for fine-tuning RoBERTa- and XLM-R-based models to detect whether a given text is generated by a machine or a human.

pdf bib
Advacheck at GenAI Detection Task 1: AI Detection Powered by Domain-Aware Multi-Tasking
German Gritsai | Anastasia Voznyuk | Ildar Khabutdinov | Andrey Grabovoy

The paper describes a system designed by the Advacheck team to recognise machine-generated and human-written texts in the monolingual subtask of the GenAI Detection Task 1 competition. Our system is a multi-task architecture with a shared Transformer encoder serving several classification heads. One head is responsible for binary classification between human-written and machine-generated texts, while the other heads are auxiliary multiclass classifiers for texts of different domains from particular datasets. As the multiclass heads were trained to distinguish the domains present in the data, they provide a better understanding of the samples. This approach led us to first place in the official ranking, with an 83.07% macro F1-score on the test set, surpassing the baseline by 10%. We further study the resulting system through ablation, error, and representation analyses, finding that multi-task learning outperforms the single-task mode and that the jointly trained tasks form a cluster structure in the embedding space.
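
A minimal sketch of the shared-encoder, multi-head design described above; the encoder choice and the number of domain classes are illustrative assumptions.

```python
# Sketch of a shared encoder with one binary head and one auxiliary domain head.
# The encoder name and the domain count are illustrative assumptions.
import torch.nn as nn
from transformers import AutoModel

class MultiTaskDetector(nn.Module):
    def __init__(self, n_domains: int = 10, encoder_name: str = "xlm-roberta-base"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)  # shared by all heads
        dim = self.encoder.config.hidden_size
        self.binary_head = nn.Linear(dim, 2)          # human vs. machine
        self.domain_head = nn.Linear(dim, n_domains)  # auxiliary domain classifier

    def forward(self, input_ids, attention_mask):
        # The [CLS]-position representation feeds every head.
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.binary_head(cls), self.domain_head(cls)
```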

pdf bib
GenAI Content Detection Task 1: English and Multilingual Machine-Generated Text Detection: AI vs. Human
Yuxia Wang | Artem Shelmanov | Jonibek Mansurov | Akim Tsvigun | Vladislav Mikhailov | Rui Xing | Zhuohan Xie | Jiahui Geng | Giovanni Puccetti | Ekaterina Artemova | Jinyan Su | Minh Ngoc Ta | Mervat Abassy | Kareem Ashraf Elozeiri | Saad El Dine Ahmed El Etter | Maiya Goloburda | Tarek Mahmoud | Raj Vardhan Tomar | Nurkhan Laiyk | Osama Mohammed Afzal | Ryuto Koike | Masahiro Kaneko | Alham Fikri Aji | Nizar Habash | Iryna Gurevych | Preslav Nakov

We present the GenAI Content Detection Task 1 – a shared task on binary machine generated text detection, conducted as a part of the GenAI workshop at COLING 2025. The task consists of two subtasks: Monolingual (English) and Multilingual. The shared task attracted many participants: 36 teams made official submissions to the Monolingual subtask during the test phase and 27 teams – to the Multilingual. We provide a comprehensive overview of the data, a summary of the results – including system rankings and performance scores – detailed descriptions of the participating systems, and an in-depth analysis of submissions.

pdf bib
CIC-NLP at GenAI Detection Task 1: Advancing Multilingual Machine-Generated Text Detection
Tolulope Olalekan Abiola | Tewodros Achamaleh Bizuneh | Fatima Uroosa | Nida Hafeez | Grigori Sidorov | Olga Kolesnikova | Olumide Ebenezer Ojo

Machine-written texts are gradually becoming indistinguishable from human-generated texts, leading to the need for sophisticated methods to detect them. Team CIC-NLP presents its work in Gen-AI Content Detection Task 1 at the COLING 2025 Workshop: the focus of our work is Subtask B of Task 1, the classification of text written by machines versus human authors, framed as a multilingual binary classification problem. Using mBERT, we addressed the binary classification task with the dataset provided by the GenAI Detection Task team. mBERT achieved a macro-average F1-score of 0.72 and an accuracy score of 0.73.

pdf bib
CIC-NLP at GenAI Detection Task 1: Leveraging DistilBERT for Detecting Machine-Generated Text in English
Tolulope Olalekan Abiola | Tewodros Achamaleh Bizuneh | Oluwatobi Joseph Abiola | Temitope Olasunkanmi Oladepo | Olumide Ebenezer Ojo | Grigori Sidorov | Olga Kolesnikova

As machine-generated texts (MGT) become increasingly similar to human writing, the two are becoming harder to distinguish. In this paper, we as the CIC-NLP team present our submission to the Gen-AI Content Detection Workshop at COLING 2025 for Task 1 Subtask A, which involves distinguishing between text generated by LLMs and text authored by humans, with an emphasis on detecting English-only MGT. We applied the DistilBERT model to this binary classification task using the dataset provided by the organizers. Fine-tuning the model effectively differentiated between the classes, resulting in a micro-average F1-score of 0.70 on the evaluation test set. We provide a detailed explanation of the fine-tuning parameters and steps involved in our analysis.

pdf bib
nits_teja_srikar at GenAI Detection Task 2: Distinguishing Human and AI-Generated Essays Using Machine Learning and Transformer Models
Sai Teja Lekkala | Annepaka Yadagiri | Mangadoddi Srikar Vardhan | Partha Pakray

This paper presents models to differentiate between human-written and AI-generated essays, addressing challenges posed by advanced AI models like ChatGPT and Claude. Using a structured dataset, we train multiple machine learning models, including XGBoost and Logistic Regression, along with ensemble learning and k-fold cross-validation. The dataset is cleaned, lemmatized, stemmed, and part-of-speech tagged, then processed through TF-IDF vectorization before training. Our team nits_teja_srikar achieves high accuracy, with DistilBERT performing at 77.3% accuracy, standing at 20th position for English, and XLM-RoBERTa excelling in Arabic at 92.2%, standing at 14th position on the official leaderboard, demonstrating the models’ potential for real-world applications.
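
A minimal sketch of a classical pipeline of this shape (TF-IDF features, a linear classifier, k-fold cross-validation); the toy data, feature sizes, and hyperparameters are placeholders, not the team's settings.

```python
# Sketch of a TF-IDF + linear-classifier pipeline scored by k-fold cross-validation.
# Toy texts, feature sizes, and hyperparameters are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = [f"human essay {i}" for i in range(10)] + [f"machine essay {i}" for i in range(10)]
labels = [0] * 10 + [1] * 10  # 0 = human, 1 = machine

pipeline = make_pipeline(
    TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),  # applied after cleaning
    LogisticRegression(max_iter=1000),
)
scores = cross_val_score(pipeline, texts, labels, cv=5, scoring="f1_macro")
print(scores.mean())
```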

pdf bib
IntegrityAI at GenAI Detection Task 2: Detecting Machine-Generated Academic Essays in English and Arabic Using ELECTRA and Stylometry
Mohammad AL-Smadi

We present a robust system for detecting machine-generated academic essays, leveraging pre-trained, transformer-based models specifically tailored for both English and Arabic texts. Our primary approach utilizes ELECTRA-Small for English and AraELECTRA-Base for Arabic, fine-tuned to deliver high performance while balancing computational efficiency. By incorporating stylometric features, such as word count, sentence length, and vocabulary richness, our models excel at distinguishing between human-written and AI-generated content. The proposed models achieved excellent results, with an F1-score of 99.7%, ranking second out of 26 teams in the English subtask, and 98.4%, finishing first out of 23 teams in the Arabic subtask. Our main contributions are: (1) we develop lightweight and efficient models using ELECTRA-Small and AraELECTRA-Base, achieving an impressive F1-score of 98.5% on the English dataset and 98.4% on the Arabic dataset, demonstrating the power of combining transformer-based architectures with stylometric analysis; (2) we optimize our system to maintain high performance while being computationally efficient, making it suitable for deployment on GPUs with moderate memory capacity; and (3) we tested larger models, such as ELECTRA-Large, achieving an even higher F1-score of 99.7% on the English dataset, highlighting the potential for further accuracy gains when using more computationally intensive models.
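
A minimal sketch of the stylometric features named in the abstract (word count, sentence length, vocabulary richness); the tokenization here is deliberately simplified.

```python
# Sketch of the named stylometric features; tokenization is deliberately simple.
import re

def stylometric_features(text: str) -> dict:
    words = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "word_count": len(words),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
        # Type-token ratio as a simple proxy for vocabulary richness.
        "vocabulary_richness": len(set(words)) / max(len(words), 1),
    }

# Such features are typically concatenated with transformer outputs before the
# final classification layer.
```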

pdf bib
CMI-AIGCX at GenAI Detection Task 2: Leveraging Multilingual Proxy LLMs for Machine-Generated Text Detection in Academic Essays
Kaijie Jiao | Xingyu Yao | Shixuan Ma | Sifan Fang | Zikang Guo | Benfeng Xu | Licheng Zhang | Quan Wang | Yongdong Zhang | Zhendong Mao

This paper presents the approach we proposed for GenAI Detection Task 2, which aims to classify a given text as either machine-generated or human-written, with a particular emphasis on academic essays. We participated in subtasks A and B, which focus on detecting English and Arabic essays, respectively. We propose a simple and efficient method for detecting machine-generated essays, where we use Llama-3.1-8B as a proxy to capture the essence of each token in the text. These essences are processed and classified using a refined feature classification network. Our approach does not require fine-tuning the LLM. Instead, we leverage its extensive multilingual knowledge acquired during pretraining to significantly enhance detection performance. The results validate the effectiveness of our approach and demonstrate that leveraging a proxy model with diverse multilingual knowledge can significantly enhance the detection of machine-generated text across multiple languages, regardless of model size. In Subtask A, we achieved an F1 score of 99.9%, ranking first out of 26 teams. In Subtask B, we achieved an F1 score of 96.5%, placing fourth out of 22 teams, with the same score as the third-place team.

pdf bib
EssayDetect at GenAI Detection Task 2: Guardians of Academic Integrity: Multilingual Detection of AI-Generated Essays
Shifali Agrahari | Subhashi Jayant | Saurabh Kumar | Sanasam Ranbir Singh

Detecting AI-generated text in the field of academia is becoming very prominent. This paper presents a solution for Task 2: AI vs. Human – Academic Essay Authenticity Challenge in the COLING 2025 DAIGenC Workshop. The rise of Large Language Models (LLMs) like ChatGPT has posed significant challenges to academic integrity, particularly in detecting AI-generated essays. To address this, we propose a fusion model that combines pre-trained language model embeddings with stylometric and linguistic features. Our approach, tested on both English and Arabic, utilizes adaptive training and attention mechanisms to enhance F1 scores, address class imbalance, and capture linguistic nuances across languages. This work advances multilingual solutions for detecting AI-generated text in academia.

pdf bib
CNLP-NITS-PP at GenAI Detection Task 2: Leveraging DistilBERT and XLM-RoBERTa for Multilingual AI-Generated Text Detection
Annepaka Yadagiri | Reddi Mohana Krishna | Partha Pakray

In today’s digital landscape, distinguishing between human-authored essays and content generated by advanced Large Language Models such as ChatGPT, GPT-4, Gemini, and LLaMa has become increasingly complex. This differentiation is essential across sectors like academia, cybersecurity, social media, and education, where the authenticity of written material is often crucial. Addressing this challenge, the COLING 2025 competition introduced Task 2, a binary classification task to separate AI-generated text from human-authored content. Using a benchmark dataset for English and Arabic, we developed a methodology that fine-tuned various neural networks, including CNN-LSTM, RNN, Bi-GRU, and transformer-based models such as BERT, DistilBERT, GPT-2, and RoBERTa. Our team CNLP-NITS-PP achieved competitive performance through meticulous hyperparameter optimization, reaching a Recall score of 0.825. Specifically, we ranked 18th in the English sub-task A with an accuracy of 0.77 and 20th in the Arabic sub-task B with an accuracy of 0.59. These results underscore the potential of transformer-based models in academic settings to detect AI-generated content effectively, laying a foundation for more advanced methods in essay authenticity verification.

pdf bib
RA at GenAI Detection Task 2: Fine-tuned Language Models For Detection of Academic Authenticity, Results and Thoughts
Rana Gharib | Ahmed Elgendy

This paper assesses the performance of “RA” in the Academic Essay Authenticity Challenge, which saw nearly 30 teams participating in each subtask. We employed cutting-edge transformer-based models to achieve our results. Our models consistently exceeded both the mean and median scores across the tasks. Notably, we achieved an F1-score of 0.969 in classifying AI-generated essays in English and an F1-score of 0.957 for classifying AI-generated essays in Arabic. Additionally, this paper offers insights into the current state of AI-generated models and argues that the benchmarking methods currently in use do not accurately reflect real-world scenarios.

pdf bib
Tesla at GenAI Detection Task 2: Fast and Scalable Method for Detection of Academic Essay Authenticity
Vijayasaradhi Indurthi | Vasudeva Varma

This paper describes a simple yet effective method to identify whether academic essays in English were written by students or generated by language models. We extract a set of style, language complexity, bias and subjectivity, and emotion-based features that can be used to distinguish human-written essays from machine-generated essays. Our methods rank 6th on the leaderboard, achieving an impressive F1-score of 0.986.

pdf bib
GenAI Content Detection Task 2: AI vs. Human – Academic Essay Authenticity Challenge
Shammur Absar Chowdhury | Hind Almerekhi | Mucahid Kutlu | Kaan Efe Keleş | Fatema Ahmad | Tasnim Mohiuddin | George Mikros | Firoj Alam

This paper presents a comprehensive overview of the first edition of the Academic Essay Authenticity Challenge, organized as part of the GenAI Content Detection shared tasks collocated with COLING 2025. This challenge focuses on detecting machine-generated vs human-authored essays for academic purposes. The task is defined as follows: “Given an essay, identify whether it is generated by a machine or authored by a human.” The challenge involves two languages: English and Arabic. During the evaluation phase, 25 teams submitted systems for English and 21 teams for Arabic, reflecting substantial interest in the task. Finally, five teams submitted system description papers. The majority of submissions utilized fine-tuned transformer-based models, with one team employing Large Language Models (LLMs) such as Llama 2 and Llama 3. This paper outlines the task formulation, details the dataset construction process, and explains the evaluation framework. Additionally, we present a summary of the approaches adopted by participating teams. Nearly all submitted systems outperformed the n-gram-based baseline, with the top-performing systems achieving F1 scores exceeding 0.98 for both languages, indicating significant progress in the detection of machine-generated text.

pdf bib
CNLP-NITS-PP at GenAI Detection Task 3: Cross-Domain Machine-Generated Text Detection Using DistilBERT Techniques
Sai Teja Lekkala | Annepaka Yadagiri | Mangadoddi Srikar Vardhan | Partha Pakray

This paper presents a Cross-domain Machine-Generated Text Detection model developed for the COLING 2025 Workshop on Detecting AI-generated Content (DAIGenC). As large language models evolve, detecting machine-generated text becomes increasingly challenging, particularly in contexts like misinformation and academic integrity. While current detectors perform well on unseen data, they remain vulnerable to adversarial strategies, including paraphrasing, homoglyphs, misspellings, synonyms, and whitespace manipulations. We introduce a framework that counters these adversarial tactics, which are designed to bypass detection systems, through adversarial training. Our team’s DistilBERT-NITS detector placed 7th in the Non-Adversarial Attacks category, and Adversarial-submission-3 achieved 17th in the Adversarial Attacks category.

pdf bib
Leidos at GenAI Detection Task 3: A Weight-Balanced Transformer Approach for AI Generated Text Detection Across Domains
Abishek R. Edikala | Gregorios A. Katsios | Noelie Creaghe | Ning Yu

Advancements in Large Language Models (LLMs) blur the distinction between human and machine-generated text (MGT), raising concerns about misinformation and academic dishonesty. Existing MGT detection methods often fail to generalize across domains and generator models. We address this by framing MGT detection as a text classification task using transformer-based models. Utilizing Distil-RoBERTa-Base, we train four classifiers (binary and multi-class, with and without class weighting) on the RAID dataset (Dugan et al., 2024). Our systems placed first to fourth in the COLING 2025 MGT Detection Challenge Task 3 (Dugan et al., 2025). Internal in-domain and zero-shot evaluations reveal that applying class weighting improves detector performance, especially with multi-class classification training. Our best model effectively generalizes to unseen domains and generators, demonstrating that transformer-based models are robust detectors of machine-generated text.

pdf bib
Pangram at GenAI Detection Task 3: An Active Learning Approach to Machine-Generated Text Detection
Bradley N. Emi | Max Spero | Elyas Masrour

We pretrain an autoregressive LLM-based detector on a wide variety of datasets, domains, languages, prompt schemes, and LLMs used to generate the AI portion of the dataset. We aggressively employ several augmentation and preprocessing strategies to improve robustness. We then mine the RAID train set for the AI examples with the largest error under the original classifier, and mix those examples and their human-written counterparts back into the training set. We then retrain the detector until convergence.
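
A minimal sketch of the described error-mining loop; `detector`, `raid_train`, `score_machine`, and the human-counterpart pairing are hypothetical placeholders rather than the authors' code.

```python
# Sketch of hard-example mining; all names here are hypothetical placeholders.
def mine_hard_examples(detector, raid_train, k: int = 10_000):
    # Score every AI-generated example; the lowest "machine" scores are the
    # detector's largest errors.
    scored = [(detector.score_machine(ex.text), ex) for ex in raid_train if ex.is_ai]
    hardest = [ex for _, ex in sorted(scored, key=lambda t: t[0])[:k]]
    # Mix the hard AI examples and their human-written counterparts back into
    # the training set, then retrain until convergence.
    return hardest + [ex.human_counterpart for ex in hardest]
```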

pdf bib
LuxVeri at GenAI Detection Task 3: Cross-Domain Detection of AI-Generated Text Using Inverse Perplexity-Weighted Ensemble of Fine-Tuned Transformer Models
MD. Kamrujjaman Mobin | Md Saiful Islam

This paper presents our approach for Task 3 of the GenAI content detection workshop at COLING-2025, focusing on Cross-Domain Machine-Generated Text (MGT) Detection. We propose an ensemble of fine-tuned transformer models, enhanced by inverse perplexity weighting, to improve classification accuracy across diverse text domains. For Subtask A (Non-Adversarial MGT Detection), we combined a fine-tuned RoBERTa-base model with an OpenAI detector-integrated RoBERTa-base model, achieving an aggregate TPR score of 0.826, ranking 10th out of 23 detectors. In Subtask B (Adversarial MGT Detection), our fine-tuned RoBERTa-base model achieved a TPR score of 0.801, securing 8th out of 22 detectors. Our results demonstrate the effectiveness of inverse perplexity-based weighting for enhancing generalization and performance in both non-adversarial and adversarial MGT detection, highlighting the potential for transformer models in cross-domain AI-generated content detection.
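
A minimal sketch of inverse-perplexity weighting for an ensemble, assuming each member exposes class probabilities and a perplexity score on the input; the member interfaces and the normalization are assumptions based on the abstract.

```python
# Sketch of inverse-perplexity weighting; predict_proba and perplexity are
# hypothetical member interfaces, not a real library API.
import numpy as np

def ensemble_predict(members, text):
    probs, weights = [], []
    for m in members:
        probs.append(m.predict_proba(text))       # e.g., [p_human, p_machine]
        weights.append(1.0 / m.perplexity(text))  # lower perplexity -> more weight
    weights = np.asarray(weights) / np.sum(weights)
    return np.average(np.asarray(probs), axis=0, weights=weights)
```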

pdf bib
BBN-U.Oregon’s ALERT system at GenAI Content Detection Task 3: Robust Authorship Style Representations for Cross-Domain Machine-Generated Text Detection
Hemanth Kandula | Chak Fai Li | Haoling Qiu | Damianos Karakos | Hieu Man | Thien Huu Nguyen | Brian Ulicny

This paper presents BBN-U.Oregon’s system, ALERT, submitted to the Shared Task 3: Cross-Domain Machine-Generated Text Detection. Our approach uses robust authorship-style representations to distinguish between human-authored and machine-generated text (MGT) across various domains. We employ an ensemble-based authorship attribution (AA) system that integrates stylistic embeddings from two complementary subsystems: one that focuses on cross-genre robustness with hard positive and negative mining strategies and another that captures nuanced semantic-lexical-authorship contrasts. This combination enhances cross-domain generalization, even under domain shifts and adversarial attacks. Evaluated on the RAID benchmark, our system demonstrates strong performance across genres and decoding strategies, with resilience against adversarial manipulation, achieving 91.8% TPR at FPR=5% on standard test sets and 82.6% on adversarial sets.

pdf bib
Random at GenAI Detection Task 3: A Hybrid Approach to Cross-Domain Detection of Machine-Generated Text with Adversarial Attack Mitigation
Shifali Agrahari | Prabhat Mishra | Sujit Kumar

Machine-generated text (MGT) detection has gained critical importance in the era of large language models, especially for maintaining trust in multilingual and cross-domain applications. This paper presents our submission to Task 3 Subtask B: Adversarial Cross-Domain MGT Detection in the COLING 2025 DAIGenC Workshop. Task 3 emphasizes the complexity of detecting AI-generated text across eight domains, eleven generative models, and four decoding strategies, with the added challenge of adversarial manipulation. We propose a robust detection framework that combines transformer embeddings with Domain-Adversarial Neural Networks (DANN) to address domain variability and adversarial robustness. Our model demonstrates strong performance in identifying AI-generated text under adversarial conditions while highlighting the scope for future improvement.
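
At the core of DANN is a gradient reversal layer; the sketch below shows the standard construction, with its integration into a transformer-based detector left as an assumption.

```python
# Standard gradient reversal layer: identity on the forward pass, negated
# (scaled) gradients on the backward pass, so the encoder learns features the
# domain classifier cannot separate. Wiring into the detector is an assumption.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients flowing back into the encoder.
        return -ctx.lambd * grad_output, None

def domain_logits(encoder_output, domain_classifier, lambd: float = 1.0):
    # The label head sees encoder_output directly; the domain head sees it
    # through the reversal, creating the adversarial training signal.
    return domain_classifier(GradReverse.apply(encoder_output, lambd))
```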

pdf bib
MOSAIC at GENAI Detection Task 3: Zero-Shot Detection Using an Ensemble of Models
Matthieu Dubois | François Yvon | Pablo Piantanida

MOSAIC introduces a new ensemble approach that combines several detector models to spot AI-generated texts. The method enhances the reliability of detection by integrating insights from multiple models, thus addressing the limitations of using a single detector model which often results in performance brittleness. This approach also involves using a theoretically grounded algorithm to minimize the worst-case expected encoding size across models, thereby optimizing the detection process. In this submission, we report evaluation results on the RAID benchmark, a comprehensive English-centric testbed for machine-generated texts. These results were obtained in the context of the “Cross-domain Machine-Generated Text Detection” shared task. We show that our model can be competitive for a variety of domains and generator models, but that it can be challenged by adversarial attacks and by changes in the text generation strategy.
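
A much-simplified sketch of the ensemble idea: per-token log-probabilities from several LMs are combined into one mixture, whose total negative log-likelihood (encoding size) serves as a detection score. The paper's worst-case weight optimization is replaced here by fixed uniform weights for illustration.

```python
# Simplified sketch: a uniform mixture of several LMs' per-token probabilities
# yields one encoding size (total NLL) used as a detection score; the paper
# instead optimizes the weights to minimize the worst-case expected size.
import numpy as np

def mixture_encoding_size(token_logprobs_per_model, weights=None):
    lp = np.asarray(token_logprobs_per_model)  # shape: (n_models, n_tokens)
    n = lp.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights)
    # Per-token log of the weighted mixture probability.
    mix = np.logaddexp.reduce(np.log(w)[:, None] + lp, axis=0)
    return -mix.sum()  # total NLL in nats; thresholded to decide human vs. machine
```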

pdf bib
GenAI Content Detection Task 3: Cross-Domain Machine Generated Text Detection Challenge
Liam Dugan | Andrew Zhu | Firoj Alam | Preslav Nakov | Marianna Apidianaki | Chris Callison-Burch

Recently there have been many shared tasks targeting the detection of generated text from Large Language Models (LLMs). However, these shared tasks tend to focus either on cases where text is limited to one particular domain or cases where text can be from many domains, some of which may not be seen during test time. In this shared task, using the newly released RAID benchmark, we aim to answer whether or not models can detect generated text from a large, yet fixed, number of domains and LLMs, all of which are seen during training. Over the course of three months, our task was attempted by 9 teams with 23 detector submissions. We find that multiple participants were able to obtain accuracies of over 99% on machine-generated text from RAID while maintaining a 5% False Positive Rate—suggesting that detectors are able to robustly detect text from many domains and models simultaneously. We discuss potential interpretations of this result and provide directions for future research.

up

pdf (full)
bib (full)
Proceedings of the Workshop on Generative AI and Knowledge Graphs (GenAIK)

pdf bib
Proceedings of the Workshop on Generative AI and Knowledge Graphs (GenAIK)
Genet Asefa Gesese | Harald Sack | Heiko Paulheim | Albert Merono-Penuela | Lihu Chen

pdf bib
Effective Modeling of Generative Framework for Document-level Relational Triple Extraction
Pratik Saini | Tapas Nayak

Document-level relation triple extraction (DocRTE) is a complex task that involves three key sub-tasks: entity mention extraction, entity clustering, and relation triple extraction. Past work has applied discriminative models to address these three sub-tasks, either by training them sequentially in a pipeline fashion or jointly training them. However, while end-to-end discriminative or generative models have proven effective for sentence-level relation triple extraction, they cannot be trivially extended to the document level, as they only handle relation extraction without addressing the remaining two sub-tasks, entity mention extraction or clustering. In this paper, we propose a three-stage generative framework leveraging a pre-trained BART model to address all three tasks required for document-level relation triple extraction. Tested on the widely used DocRED dataset, our approach outperforms previous generative methods and achieves competitive performance against discriminative models.

pdf bib
Learn Together: Joint Multitask Finetuning of Pretrained KG-enhanced LLM for Downstream Tasks
Anastasia Martynova | Vladislav Tishin | Natalia Semenova

Recent studies have shown that a knowledge graph (KG) can enhance text data by providing structured background knowledge, which can significantly improve the language understanding skills of an LLM. Moreover, finetuning such models shows solid results on commonsense reasoning benchmarks. In this work, we introduce an expandable Joint Multitask Finetuning approach for pretrained KG-enhanced LLMs, covering Question Answering (QA), Machine Reading Comprehension (MRC), and Knowledge Graph Question Answering (KGQA) tasks. Extensive experiments show competitive performance of joint QA+MRC+KGQA finetuning over the single-task approach, with a maximum gain of 30% accuracy.

pdf bib
GNET-QG: Graph Network for Multi-hop Question Generation
Samin Jamshidi | Yllias Chali

Multi-hop question generation is a challenging task in natural language processing (NLP) that requires synthesizing information from multiple sources. We propose GNET-QG, a novel approach that integrates Graph Attention Networks (GAT) with sequence-to-sequence models, enabling structured reasoning over multiple information sources to generate complex questions. Our experiments demonstrate that GNET-QG outperforms previous state-of-the-art models across several evaluation metrics, particularly excelling in METEOR, showing its effectiveness in enhancing machine reasoning capabilities.

pdf bib
SKETCH: Structured Knowledge Enhanced Text Comprehension for Holistic Retrieval
Aakash Mahalingam | Vinesh Kumar Gande | Aman Chadha | Vinija Jain | Divya Chaudhary

Retrieval-Augmented Generation (RAG) systems have become pivotal in leveraging vast corpora to generate informed and contextually relevant responses, notably reducing hallucinations in Large Language Models. Despite significant advancements, these systems struggle to efficiently process and retrieve information from large datasets while maintaining a comprehensive understanding of the context. This paper introduces SKETCH, a novel methodology that enhances the RAG retrieval process by integrating semantic text retrieval with knowledge graphs, thereby merging structured and unstructured data for a more holistic comprehension. SKETCH demonstrates substantial improvements in retrieval performance and maintains superior context integrity compared to traditional methods. Evaluated across four diverse datasets (QuALITY, QASPER, NarrativeQA, and Italian Cuisine), SKETCH consistently outperforms baseline approaches on key RAGAS metrics such as answer relevancy, faithfulness, context precision and context recall. Notably, on the Italian Cuisine dataset, SKETCH achieved an answer relevancy of 0.94 and a context precision of 0.99, representing the highest performance across all evaluated metrics. These results highlight SKETCH’s capability in delivering more accurate and contextually relevant responses, setting new benchmarks for future retrieval systems.

pdf bib
On Reducing Factual Hallucinations in Graph-to-Text Generation Using Large Language Models
Dmitrii Iarosh | Alexander Panchenko | Mikhail Salnikov

Recent work in Graph-to-Text generation has achieved impressive results, but it still suffers from hallucinations in some cases, despite extensive pretraining stages and various methods for working with graph data. The commonly used metrics for evaluating the quality of Graph-to-Text models show almost perfect results, making it challenging to compare different approaches. This paper demonstrates the challenges of recent Graph-to-Text systems in terms of hallucinations and proposes a simple yet effective approach using a general LLM, which has shown state-of-the-art results and reduced the number of factual hallucinations. We provide step-by-step instructions on how to develop prompts for language models and a detailed analysis of potential factual errors in the generated text.

pdf bib
GraphRAG: Leveraging Graph-Based Efficiency to Minimize Hallucinations in LLM-Driven RAG for Finance Data
Mariam Barry | Gaetan Caillaut | Pierre Halftermeyer | Raheel Qader | Mehdi Mouayad | Fabrice Le Deit | Dimitri Cariolaro | Joseph Gesnouin

This study explores the integration of graph-based methods into Retrieval-Augmented Generation (RAG) systems to enhance efficiency, reduce hallucinations, and improve explainability, with a particular focus on financial and regulatory document retrieval. We propose two strategies, FactRAG and HybridRAG, which leverage knowledge graphs to improve RAG performance. Experiments conducted using Finance Bench, a benchmark for AI in finance, demonstrate that these approaches achieve a 6% reduction in hallucinations and an 80% decrease in token usage compared to conventional RAG methods. Furthermore, we evaluate HybridRAG by comparing the Digital Operational Resilience Act (DORA) from the European Union with the Federal Financial Institutions Examination Council (FFIEC) guidelines from the United States. The results reveal a significant improvement in computational efficiency, reducing contradiction detection complexity from O(n²) to O(k·n), where n is the number of chunks, and a remarkable 734-fold decrease in token consumption. Graph-based retrieval methods can improve the efficiency and cost-effectiveness of large language model (LLM) applications, though their performance and token usage depend on the dataset, knowledge graph design, and retrieval task.

pdf bib
Structured Knowledge meets GenAI: A Framework for Logic-Driven Language Models
Farida Helmy Eldessouky | Nourhan Ehab | Carolin Schindler | Mervat Abuelkheir | Wolfgang Minker

Large Language Models (LLMs) excel at generating fluent text but struggle with context sensitivity, logical reasoning, and personalization without extensive fine-tuning. This paper presents a logical modulator: an adaptable communication layer between Knowledge Graphs (KGs) and LLMs as a way to address these limitations. Unlike direct KG-LLM integrations, our modulator is domain-agnostic and incorporates logical dependencies and commonsense reasoning in order to achieve contextual personalization. By enhancing KG interaction, this method will produce linguistically coherent and logically sound outputs, increasing interpretability and reliability in generative AI.

pdf bib
Performance and Limitations of Fine-Tuned LLMs in SPARQL Query Generation
Thamer Mecharnia | Mathieu d’Aquin

Generative AI has simplified information access by enabling natural language-driven interactions between users and automated systems. In particular, Question Answering (QA) has emerged as a key application of AI, facilitating efficient access to complex information through dialogue systems and virtual assistants. The Large Language Models (LLMs) combined with Knowledge Graphs (KGs) have further enhanced QA systems, allowing them to not only correctly interpret natural language but also retrieve precise answers from structured data sources such as Wikidata and DBpedia. However, enabling LLMs to generate machine-readable SPARQL queries from natural language questions (NLQs) remains challenging, particularly for complex questions. In this study, we present experiments in fine-tuning LLMs for the task of NLQ-to-SPARQL transformation. We rely on benchmark datasets for training and testing the fine-tuned models, generating queries directly from questions written in English (without further processing of the input or output). By conducting an analytical study, we examine the effectiveness of each model, as well as the limitations associated with using fine-tuned LLMs to generate SPARQL.

pdf bib
Refining Noisy Knowledge Graph with Large Language Models
Na Dong | Natthawut Kertkeidkachorn | Xin Liu | Kiyoaki Shirai

Knowledge graphs (KGs) represent structured real-world information composed of triplets of head entity, relation, and tail entity. These graphs can be constructed automatically from text or manually curated. However, regardless of the construction method, KGs often suffer from misinformation, incompleteness, and noise, which hinder their reliability and utility. This study addresses the challenge of noisy KGs, where incorrect or misaligned entities and relations degrade graph quality. Leveraging recent advancements in large language models (LLMs) with strong capabilities across diverse tasks, we explore their potential to detect and refine noise in KGs. Specifically, we propose a novel method, LLM_sim, to enhance the detection and refinement of noisy triples. Our results confirm the effectiveness of this approach in elevating KG quality in noisy environments. Additionally, we apply our proposed method to Knowledge Graph Completion (KGC), a downstream KG task that aims to predict missing links and improve graph completeness. Traditional KGC methods assume that KGs are noise-free, which is unrealistic in practical scenarios. Our experiments analyze the impact of varying noise levels on KGC performance, revealing that LLMs can mitigate noise by identifying and refining incorrect entries, thus enhancing KG quality.

pdf bib
Can LLMs be Knowledge Graph Curators for Validating Triple Insertions?
André Gomes Regino | Julio Cesar dos Reis

As Knowledge Graphs (KGs) become central to modern applications, automated methods for validating RDF triples before insertion into these graphs are essential. The complexity and scalability challenges in manual validation processes have led researchers to explore Large Language Models (LLMs) as potential automated validators. This study investigates the feasibility of using LLMs to validate RDF triples by focusing on four distinct and complementary validation tasks: class and property alignment, URI standardization, semantic consistency, and syntactic correctness. We propose a systematic validation method that uses prompts to guide LLMs through each stage of RDF triple evaluation. In our experiments, four models are evaluated across these tasks. Our results reveal that more advanced models like Llama-3-70B-Instruct offer superior accuracy and consistency. Our findings emphasize the practical open challenges of deploying LLMs in real-world RDF validation scenarios, including domain generalization, semantic drift, and the need for human-in-the-loop interventions. This investigation advances the research on the refinement and integration of LLM-based RDF validation techniques into KG management workflows.

pdf bib
Text2Cypher: Bridging Natural Language and Graph Databases
Makbule Gulcin Ozsoy | Leila Messallem | Jon Besga | Gianandrea Minneci

Knowledge graphs use nodes, relationships, and properties to represent arbitrarily complex data. When stored in a graph database, the Cypher query language enables efficient modeling and querying of knowledge graphs. However, using Cypher requires specialized knowledge, which can present a challenge for non-expert users. Our work Text2Cypher aims to bridge this gap by translating natural language queries into the Cypher query language, extending the utility of knowledge graphs to non-technical users. While large language models (LLMs) can be used for this purpose, they often struggle to capture complex nuances, resulting in incomplete or incorrect outputs. Fine-tuning LLMs on domain-specific datasets has proven to be a more promising approach, but the limited availability of high-quality, publicly available Text2Cypher datasets makes this challenging. In this work, we show how we combined, cleaned and organized several publicly available datasets into a total of 44,387 instances, enabling effective fine-tuning and evaluation. Models fine-tuned on this dataset showed significant performance gains, with improvements in Google-BLEU and Exact Match scores over baseline models, highlighting the importance of high-quality datasets and fine-tuning in improving Text2Cypher performance.
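
A small illustrative natural-language/Cypher pair of the kind such a dataset collects (this example is invented, not drawn from the released data):

```python
# An invented (question, query) pair of the kind the dataset pairs up; a
# fine-tuned LLM learns the mapping generate(question) -> cypher.
question = "Which actors appeared in movies released after 2020?"
cypher = """
MATCH (a:Actor)-[:ACTED_IN]->(m:Movie)
WHERE m.released > 2020
RETURN DISTINCT a.name
"""
```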

pdf bib
KGFakeNet: A Knowledge Graph-Enhanced Model for Fake News Detection
Anuj Kumar | Pardeep Kumar | Abhishek Yadav | Satyadev Ahlawat | Yamuna Prasad

The proliferation of fake news on social media has intensified the spread of misinformation, promoting societal biases, hate, and violence. While recent advancements in Generative AI (GenAI), particularly large language models (LLMs), have shown promise, these models often lack the structured representations needed for accurate verification, as they rely on pre-trained data patterns without access to real-time or validated information. This study presents a framework that utilizes Open Information Extractor 6 (OpenIE6) to extract triplet relationships (subject-predicate-object) from statements and justifications, and computes the cosine similarity between the Knowledge Graphs (KGs) of a statement and its supporting justification to precisely measure their relevance and alignment. This similarity feature is integrated with an attention mechanism over GenAI-generated embeddings to enhance the model’s ability to capture semantic features accurately. In addition, a Multi-Layer Perceptron (MLP) classifier is employed to integrate all features, resulting in a 4% improvement in accuracy and a 5% increase in F1-score over state-of-the-art LLM-based approaches.
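
A minimal sketch of the statement-justification similarity feature, assuming a generic sentence-embedding model; the OpenIE6 extraction step is omitted and the triples are taken as given.

```python
# Sketch of the similarity feature; the embedding model is an assumption, and
# the OpenIE6 extraction is omitted (triples are taken as given).
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def kg_similarity(statement_triples, justification_triples) -> float:
    # Flatten each (subject, predicate, object) triple to text, embed, and
    # compare the mean vectors of the two graphs by cosine similarity.
    s = encoder.encode([" ".join(t) for t in statement_triples]).mean(axis=0)
    j = encoder.encode([" ".join(t) for t in justification_triples]).mean(axis=0)
    return float(np.dot(s, j) / (np.linalg.norm(s) * np.linalg.norm(j)))
```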

pdf bib
Style Knowledge Graph: Augmenting Text Style Transfer with Knowledge Graphs
Martina Toshevska | Slobodan Kalajdziski | Sonja Gievska

Text style transfer is the task of modifying the stylistic attributes of a given text while preserving its original meaning. This task has also gained interest with the advent of large language models. Although knowledge graph augmentation has been explored in various tasks, its potential for enhancing text style transfer has received limited attention. This paper proposes a method to create a Style Knowledge Graph (SKG) to facilitate and improve text style transfer. The SKG captures words, their attributes, and relations in a particular style, serving as a knowledge resource to augment text style transfer. We conduct baseline experiments to evaluate the effectiveness of the SKG for augmenting text style transfer by incorporating relevant parts of the SKG in the prompt. The preliminary results demonstrate its potential for enhancing content preservation and style transfer strength in text style transfer tasks, while the results on fluency indicate promising outcomes with some room for improvement. We hope that the proposed SKG and the initial experiments will inspire further research in the field.

pdf bib
Entity Quality Enhancement in Knowledge Graphs through LLM-based Question Answering
Morteza Kamaladdini Ezzabady | Farah Benamara

Most models for triple extraction from texts primarily focus on named entities. However, real-world applications often comprise non-named entities that pose serious challenges for entity linking and disambiguation. We focus on these entities and propose the first LLM-based entity revision framework to improve the quality of extracted triples via a multi-choice question-answering mechanism. When evaluated on two benchmark datasets, our results show a significant improvement, thereby generating more reliable triples for knowledge graphs.

pdf bib
Multilingual Skill Extraction for Job Vacancy–Job Seeker Matching in Knowledge Graphs
Hamit Kavas | Marc Serra-Vidal | Leo Wanner

In the modern labor market, accurate matching of job vacancies with suitable candidate CVs is critical. We present a novel multilingual knowledge graph-based framework designed to enhance the matching by accurately extracting the skills requested by a job and provided by a job seeker in a multilingual setting and aligning them via the standardized skill labels of the European Skills, Competences, Qualifications and Occupations (ESCO) taxonomy. The proposed framework employs a combination of state-of-the-art techniques to extract relevant skills from job postings and candidate experiences. These extracted skills are then filtered and mapped to the ESCO taxonomy and integrated into a multilingual knowledge graph that incorporates hierarchical relationships and cross-linguistic variations through embeddings. Our experiments demonstrate a significant improvement of the matching quality compared to the state of the art.

up

pdf (full)
bib (full)
Proceedings of the Fourth Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2025)

pdf bib
Proceedings of the Fourth Workshop on Intelligent and Interactive Writing Assistants (In2Writing 2025)
Vishakh Padmakumar | Katy Gero | Thiemo Wambsganss | Sarah Sterman | Ting-Hao Huang | David Zhou | John Chung

pdf bib
Understanding Writing Assistants for Scientific Figure Captions: A Thematic Analysis
Ho Yin Sam Ng | Ting-Yao Hsu | Jiyoo Min | Sungchul Kim | Ryan A. Rossi | Tong Yu | Hyunggu Jung | Ting-Hao Kenneth Huang

Scientific figure captions are essential for communicating complex data but are often overlooked, leading to unclear or redundant descriptions. While many studies focus on generating captions as an ‘output’, little attention has been given to the writer’s process of crafting captions for scientific figures. This study examines how researchers use AI-generated captions to support caption writing. Through thematic analysis of interviews and video recordings with 18 participants from diverse disciplines, we identified four key themes: (1) integrating captions with figures and text, (2) bridging gaps between language proficiency and domain expertise, (3) leveraging multiple AI-generated suggestions, and (4) adapting to diverse writing norms. These findings provide actionable design insights for developing AI writing assistants that better support researchers in creating effective scientific figure captions.

pdf bib
ARWI: Arabic Write and Improve
Kirill Chirkunov | Bashar Alhafni | Chatrine Qwaider | Nizar Habash | Ted Briscoe

Although Arabic is spoken by over 400 million people, advanced Arabic writing assistance tools remain limited. To address this gap, we present ARWI, a new writing assistant that helps learners improve essay writing in Modern Standard Arabic. ARWI is the first publicly available Arabic writing assistant to include a prompt database for different proficiency levels, an Arabic text editor, state-of-the-art grammatical error detection and correction, and automated essay scoring aligned with the Common European Framework of Reference standards for language attainment (https://arwi.mbzuai.ac.ae/). Moreover, ARWI can be used to gather a growing auto-annotated corpus, facilitating further research on Arabic grammar correction and essay scoring, as well as profiling patterns of errors made by native speakers and non-native learners. A preliminary user study shows that ARWI provides actionable feedback, helping learners identify grammatical gaps, assess language proficiency, and guide improvement.

pdf bib
ReadCtrl: Personalizing text generation with readability-controlled instruction learning
Hieu Tran | Zonghai Yao | Lingxi Li | Hong Yu

Content generation conditioning on users’ readability is an important application for personalization. In an era of large language models (LLMs), readability-controlled text generation based on LLMs has become increasingly important. This paper introduces a novel methodology called “Readability-Controlled Instruction Learning (ReadCtrl),” which aims to instruction-tune LLMs to tailor outputs to users’ readability levels. Unlike traditional methods, which primarily focused on categorical readability adjustments—typically classified as high, medium, and low or expert and layperson levels—with limited success, ReadCtrl introduces a dynamic framework that enables LLMs to generate content at various (near continuous level) complexity levels, thereby enhancing their versatility across different applications. Our results show that the ReadCtrl-Mistral-7b models significantly outperformed strong baseline models such as GPT-4 and Claude-3, with a win rate of 52.1%:35.7% against GPT-4 in human evaluations. Furthermore, ReadCtrl has shown significant improvements in automatic evaluations, as evidenced by better readability metrics (e.g., FOG, FKGL) and generation quality metrics (e.g., BLEU, SARI, SummaC-Factuality, UniEval-Consistency and Coherence). These results underscore ReadCtrl’s effectiveness and tenacity in producing high-quality, contextually appropriate outputs that closely align with targeted readability levels, marking a significant advancement in personalized content generation using LLMs.
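
A minimal sketch of the automatic readability checks mentioned above, using the textstat package to compute FKGL and FOG on a generated text; this illustrates only the metrics, not the authors' evaluation harness.

```python
# Illustration of the readability metrics only, via the textstat package.
import textstat

generated = "The mitochondrion is the part of the cell that makes energy."
print(textstat.flesch_kincaid_grade(generated))  # FKGL: approximate US grade level
print(textstat.gunning_fog(generated))           # FOG index
```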

pdf bib
AI Writing Assistants in Tanzanian Universities: Adoption Trends, Challenges, and Opportunities
Alfred Malengo Kondoro

This study examines the adoption, challenges, and impact of AI writing assistants in Tanzanian universities, with a focus on their role in supporting academic writing, enhancing accessibility, and accommodating low-resource languages such as Swahili. Through a structured survey of 1,005 university students, we analyze AI usage patterns, key barriers to adoption, and the improvements needed to make AI writing assistants more inclusive and effective. Findings reveal that limited Swahili integration, affordability constraints, and ethical concerns hinder AI adoption, disproportionately affecting students in resource-constrained settings. To address these challenges, we propose strategies for adapting AI models to diverse linguistic, academic, and infrastructural contexts, emphasizing Swahili-language support, AI literacy initiatives, and accessibility-focused AI development. By bridging these gaps, this study contributes to the development of AI-driven educational tools that are more equitable, contextually relevant, and effective for students in Tanzania and beyond.

pdf bib
From Crafting Text to Crafting Thought: Grounding Intelligent Writing Support to Writing Center Pedagogy
Yijun Liu | Tal August

Intelligent writing support tools have evolved from solving surface-level issues to collaborating and creating language with writers. Along with these new capabilities come concerns that generated fluent text can impact writers’ processes in unintended ways, especially for students. In this workshop paper, we look to a similar transition that writing centers experienced over the last century, which shifted focus from fixing surface-level issues to maintaining student writer voices. We interviewed 10 current writing tutors and grounded their described practices with ideas proposed in writing center literature. We employed these strategies in developing an intelligent writing tool prototype. We describe the design of our tool and discuss potential evaluations along with how to foster deeper relationships between writers and writing centers using intelligent writing tools.

pdf bib
Interaction-Required Suggestions for Control, Ownership, and Awareness in Human-AI Co-Writing
Kenneth C. Arnold | Jiho Kim

This paper explores interaction designs for generative AI interfaces that necessitate human involvement throughout the generation process. We argue that such interfaces can promote cognitive engagement, agency, and thoughtful decision-making. Through a case study in text revision, we present and analyze two interaction techniques: (1) using a predictive-text interaction to type the agent’s response to a revision request, and (2) highlighting potential edit opportunities in a document. Our implementations demonstrate how these approaches reveal the landscape of writing possibilities and enable fine-grained control. We discuss implications for human-AI writing partnerships and future interaction design directions.

pdf bib
Voice Interaction With Conversational AI Could Facilitate Thoughtful Reflection and Substantive Revision in Writing
Jiho Kim | Philippe Laban | Xiang Chen | Kenneth C. Arnold

Writing well requires not only expressing ideas but also refining them through revision, a process facilitated by reflection. Prior research suggests that feedback delivered through dialogues, such as those in writing center tutoring sessions, can help writers reflect more thoughtfully on their work compared to static feedback. Recent advancements in multi-modal large language models (LLMs) now offer new possibilities for supporting interactive and expressive voice-based reflection in writing. In particular, we propose that LLM-generated static feedback can be repurposed as conversation starters, allowing writers to seek clarification, request examples, and ask follow-up questions, thereby fostering deeper reflection on their writing. We argue that voice-based interaction can naturally facilitate this conversational exchange, encouraging writers’ engagement with higher-order concerns, facilitating iterative refinement of their reflections, and reducing cognitive load compared to text-based interactions. To investigate these effects, we propose a formative study exploring how text vs. voice input influences writers’ reflection and subsequent revisions. Findings from this study will inform the design of intelligent and interactive writing tools, offering insights into how voice-based interactions with LLM-powered conversational agents can support reflection and revision.

pdf bib
RONA: Pragmatically Diverse Image Captioning with Coherence Relations
Aashish Anantha Ramakrishnan | Aadarsh Anantha Ramakrishnan | Dongwon Lee

Writing Assistants (e.g., Grammarly, Microsoft Copilot) traditionally generate diverse image captions by employing syntactic and semantic variations to describe image components. However, human-written captions prioritize conveying a central message alongside visual descriptions using pragmatic cues. To enhance caption diversity, it is essential to explore alternative ways of communicating these messages in conjunction with visual content. We propose RONA, a novel prompting strategy for Multi-modal Large Language Models (MLLM) that leverages Coherence Relations as a controllable axis for pragmatic variations. We demonstrate that RONA generates captions with better overall diversity and ground-truth alignment, compared to MLLM baselines across multiple domains. Our code is available at: https://github.com/aashish2000/RONA

pdf bib
Multi-Agent Based Character Simulation for Story Writing
Tian Yu | Ken Shi | Zixin Zhao | Gerald Penn

This work proposes a novel multi-agent story-generation system that writes stories from a narrative plan. Traditional approaches tend to generate a section of text directly from its outline. Our system, by contrast, divides this elaboration process into role-play and rewrite steps, where the former step enacts the story in chronological order with LLM-backed character agents, and the latter step refines the role-play result to align with a narrative plan. We show that the stories produced by our system are preferable to two other LLM-based story-generation approaches. We attribute this advancement to the benefits of incorporating a character-based simulation strategy.

pdf bib
An Analysis of Scoring Methods for Reranking in Large Language Model Story Generation
Megan Deering | Gerald Penn

Outline-conditioned story generation using Large Language Models (LLMs) offers a promising approach for automating narrative creation. Some outline-conditioned story generation methods use automatic scoring during the generation process in order to improve the story quality. However, current research has shown that automatic scoring is not ideal for assessing story quality. This paper evaluates three proposed automatic story-scoring methods to improve the reranking of outputs during the generation process. These scoring methods leverage different prompting strategies and fine-tuning techniques to enhance the accuracy and relevance of the assessments. By experimenting with these approaches within a beam search framework, we aim to identify the most effective methods for optimizing story-generation outcomes. While we have found no significant overall difference between these methods in terms of their agreement with human ratings during story generation, the overall story ratings by human evaluators are average. These findings motivate the need for improved automatic scoring techniques and datasets while also indicating that simpler, more easily implementable scoring methods for reranking perform comparably to more complex approaches.

up

pdf (full)
bib (full)
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages

pdf bib
Proceedings of the First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages
Ruvan Weerasinghe | Isuri Anuradha | Deshan Sumanathilaka

pdf bib
Hindi Reading Comprehension: Do Large Language Models Exhibit Semantic Understanding?
Daisy Monika Lal | Paul Rayson | Mo El-Haj

In this study, we explore the performance of four advanced Generative AI models—GPT-3.5, GPT-4, Llama3, and HindiGPT—for the Hindi reading comprehension task. Using a zero-shot, instruction-based prompting strategy, we assess model responses through a comprehensive triple evaluation framework using the HindiRC dataset. Our framework combines (1) automatic evaluation using ROUGE, BLEU, BLEURT, METEOR, and Cosine Similarity; (2) rating-based assessments focussing on correctness, comprehension depth, and informativeness; and (3) preference-based selection to identify the best responses. Human ratings indicate that GPT-4 outperforms the other LLMs on all parameters, followed by HindiGPT, GPT-3.5, and then Llama3. Preference-based evaluation similarly placed GPT-4 (80%) as the best model, followed by HindiGPT (74%). However, automatic evaluation showed GPT-4 to be the lowest performer on n-gram metrics, yet the best performer on semantic metrics, suggesting it captures deeper meaning and semantic alignment over direct lexical overlap, which aligns with its strong human evaluation scores. This study also highlights that even though the models mostly address literal factual recall questions with high precision, they still face the challenge of specificity and interpretive bias at times.

pdf bib
Machine Translation and Transliteration for Indo-Aryan Languages: A Systematic Review
Sandun Sameera Perera | Deshan Koshala Sumanathilaka

This systematic review paper provides an overview of recent machine translation and transliteration developments for Indo-Aryan languages spoken by a large population across South Asia. The paper examines advancements in translation and transliteration systems for a few language pairs which appear in recently published papers. The review summarizes the current state of these technologies, providing a valuable resource for anyone doing research in these fields to understand and build on existing systems and techniques for translation and transliteration.

pdf bib
BERTopic for Topic Modeling of Hindi Short Texts: A Comparative Study
Atharva Mutsaddi | Anvi Jamkhande | Aryan Shirish Thakre | Yashodhara Haribhakta

As short text data in native languages like Hindi increasingly appear in modern media, robust methods for topic modeling on such data have gained importance. This study investigates the performance of BERTopic in modeling Hindi short texts, an area that has been under-explored in existing research. Using contextual embeddings, BERTopic can capture semantic relationships in data, making it potentially more effective than traditional models, especially for short and diverse texts. We evaluate BERTopic using 6 different document embedding models and compare its performance against 8 established topic modeling techniques, such as Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), Latent Semantic Indexing (LSI), Additive Regularization of Topic Models (ARTM), Probabilistic Latent Semantic Analysis (PLSA), Embedded Topic Model (ETM), Combined Topic Model (CTM), and Top2Vec. The models are assessed using coherence scores across a range of topic counts. Our results reveal that BERTopic consistently outperforms other models in capturing coherent topics from short Hindi texts.
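
A minimal sketch of running BERTopic with a multilingual sentence-embedding backbone over Hindi short texts; the embedding model named here is one plausible choice (not necessarily the paper's best performer), and the document loader is a placeholder.

```python
# Sketch of BERTopic over Hindi short texts; the embedding model is one
# plausible multilingual choice, and load_hindi_short_texts is a placeholder.
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer

docs = load_hindi_short_texts()  # placeholder: thousands of short Hindi texts
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
topic_model = BERTopic(embedding_model=embedder)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())
```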

pdf bib
Evaluating Structural and Linguistic Quality in Urdu DRS Parsing and Generation through Bidirectional Evaluation
Muhammad Saad Amin | Luca Anselma | Alessandro Mazzei

Evaluating Discourse Representation Structure (DRS)-based systems for semantic parsing (Text-to-DRS) and generation (DRS-to-Text) poses unique challenges, particularly in low-resource languages like Urdu. Traditional metrics often fall short, focusing either on structural accuracy or linguistic quality, but rarely capturing both. To address this limitation, we introduce two complementary evaluation methodologies—Parse-Generate (PARS-GEN) and Generate-Parse (GEN-PARS)—designed for a more comprehensive assessment of DRS-based systems. PARS-GEN evaluates the parsing process by converting DRS outputs back to text, revealing linguistic nuances often missed by structure-focused metrics like SMATCH. Conversely, GEN-PARS assesses text generation by converting generated text into DRS, providing a semantic perspective that complements surface-level metrics such as BLEU, METEOR, and BERTScore. Using the Parallel Meaning Bank (PMB) dataset, we demonstrate our methodology on Urdu, uncovering unique insights into Urdu’s structural and linguistic interplay. Findings show that traditional metrics frequently overlook the complexity of linguistic and semantic fidelity, especially in low-resource languages. Our dual approach offers a robust framework for evaluating DRS-based systems, enhancing semantic parsing and text generation quality.
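
A minimal sketch of the bidirectional protocol: parse then generate back (PARS-GEN), and generate then parse back (GEN-PARS); `parse`, `generate`, and the metric functions are placeholders for the systems and metrics under study.

```python
# Sketch of the two round-trip evaluation directions; all callables here are
# placeholders for the systems and metrics under evaluation.
def pars_gen_score(text, parse, generate, text_metric):
    drs = parse(text)              # Text-to-DRS
    back = generate(drs)           # DRS-to-Text round trip
    return text_metric(text, back)     # e.g., BLEU / METEOR / BERTScore

def gen_pars_score(drs, generate, parse, drs_metric):
    text = generate(drs)           # DRS-to-Text
    back = parse(text)             # Text-to-DRS round trip
    return drs_metric(drs, back)       # e.g., SMATCH
```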

pdf bib
Studying the Effect of Hindi Tokenizer Performance on Downstream Tasks
Rashi Goel | Fatiha Sadat

This paper studies the effect of training data size and tokenizer performance for the Hindi language on eventual downstream model performance and comprehension. Multiple monolingual Hindi tokenizers are trained for large language models such as BERT, and intrinsic and extrinsic evaluations are performed on multiple Hindi datasets. The objective of this study is to understand the precise effects of tokenizer performance on downstream task performance, in order to gain insight into how to develop better models for low-resource languages.
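
As a sketch of what training and intrinsically evaluating a monolingual tokenizer can look like, the snippet below trains a Hindi WordPiece tokenizer with the Hugging Face `tokenizers` library and measures fertility (subwords per word), a common intrinsic metric. The corpus file name and vocabulary size are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch: training a monolingual Hindi WordPiece tokenizer and
# measuring fertility (tokens per word). File name and vocab size are
# illustrative assumptions.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=32000,
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(["hindi_corpus.txt"], trainer)

sentence = "यह एक उदाहरण वाक्य है"
encoding = tokenizer.encode(sentence)
fertility = len(encoding.tokens) / len(sentence.split())
print(encoding.tokens, fertility)
```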

pdf bib
Adapting Multilingual LLMs to Low-Resource Languages using Continued Pre-training and Synthetic Corpus: A Case Study for Hindi LLMs
Raviraj Joshi | Kanishk Singla | Anusha Kamath | Raunak Kalani | Rakesh Paul | Utkarsh Vaidya | Sanjay Singh Chauhan | Niranjan Wartikar | Eileen Long

Multilingual LLMs support a variety of languages; however, their performance is suboptimal for low-resource languages. In this work, we emphasize the importance of continued pre-training of multilingual LLMs and the use of translation-based synthetic pre-training corpora for improving LLMs in low-resource languages. We conduct our study in the context of the low-resource Indic language Hindi. We introduce Nemotron-Mini-Hindi 4B, a bilingual SLM supporting both Hindi and English, based on Nemotron-Mini 4B. The model is trained using a mix of real and synthetic Hindi + English tokens, with continued pre-training performed on 400B tokens. We demonstrate that both the base and instruct models achieve state-of-the-art results on Hindi benchmarks while remaining competitive on English tasks. Additionally, we observe that the continued pre-training approach enhances the model’s overall factual accuracy.

pdf bib
OVQA: A Dataset for Visual Question Answering and Multimodal Research in Odia Language
Shantipriya Parida | Shashikanta Sahoo | Sambit Sekhar | Kalyanamalini Sahoo | Ketan Kotwal | Sonal Khosla | Satya Ranjan Dash | Aneesh Bose | Guneet Singh Kohli | Smruti Smita Lenka | Ondřej Bojar

This paper introduces OVQA, the first multimodal dataset designed for visual question-answering (VQA), visual question elicitation (VQE), and multimodal research for the low-resource Odia language. The dataset was created by manually translating 6,149 English question-answer pairs, each associated with one of 6,149 unique images from the Visual Genome dataset. This effort resulted in 27,809 English-Odia parallel sentences, ensuring a semantic match with the corresponding visual information. Several baseline experiments were conducted on the dataset, including visual question answering and visual question elicitation. The dataset is the first VQA dataset for the low-resource Odia language and will be released for multimodal research purposes; it will also help researchers extend this work to other low-resource languages.

pdf bib
Advancing Multilingual Speaker Identification and Verification for Indo-Aryan and Dravidian Languages
Braveenan Sritharan | Uthayasanker Thayasivam

Multilingual speaker identification and verification is a challenging task, especially for languages with diverse acoustic and linguistic features such as Indo-Aryan and Dravidian languages. Previous models have struggled to generalize across multilingual environments, leading to significant performance degradation when applied to multiple languages. In this paper, we propose an advanced approach to multilingual speaker identification and verification, specifically designed for Indo-Aryan and Dravidian languages. Empirical results on the Kathbath dataset show that our approach significantly improves speaker identification accuracy, reducing the performance gap between monolingual and multilingual systems from 15% to just 1%. Additionally, our model reduces the equal error rate for speaker verification from 15% to 5% in noisy conditions. Our method demonstrates strong generalization capabilities across diverse languages, offering a scalable solution for multilingual voice-based biometric systems.

pdf bib
Sentiment Analysis of Sinhala News Comments Using Transformers
Isuru Bandaranayake | Hakim Usoof

Sentiment analysis has witnessed significant advancements with the emergence of deep learning models such as transformer models. Transformer models adopt the mechanism of self-attention and have achieved state-of-the-art performance across various natural language processing (NLP) tasks, including sentiment analysis. However, few studies have explored the application of these recent advancements to sentiment analysis of Sinhala text. This study addresses this research gap by employing transformer models such as BERT, DistilBERT, RoBERTa, and XLM-RoBERTa (XLM-R) for sentiment analysis of Sinhala news comments. The study was conducted for 4 classes (positive, negative, neutral, and conflict) as well as for 3 classes (positive, negative, and neutral). It revealed that the XLM-R-large model outperformed the other four models, as well as the transformer models used in previous studies for the Sinhala language. The XLM-R-large model achieved an accuracy of 65.84% and a macro-F1 score of 62.04% for sentiment analysis with four classes, and an accuracy of 75.90% and a macro-F1 score of 72.31% for three classes.

pdf bib
ExMute: A Context-Enriched Multimodal Dataset for Hateful Memes
Riddhiman Swanan Debnath | Nahian Beente Firuj | Abdul Wadud Shakib | Sadia Sultana | Md Saiful Islam

In this paper, we introduce ExMute, an extended dataset for classifying hateful memes that incorporates critical contextual information, addressing a significant gap in existing resources. Building on a previous dataset of 4,158 memes without contextual annotations, ExMute expands the collection by adding 2,041 new memes and providing comprehensive annotations for all 6,199 memes. Each meme is labeled across six defined contexts with language markers indicating code-mixing, code-switching, and Bengali captions, enhancing its value for linguistic and cultural research. These memes are systematically labeled across six contexts: religion, politics, celebrity, male, female, and others, facilitating a more nuanced understanding of meme content and intent. To evaluate ExMute, we apply state-of-the-art textual, visual, and multimodal approaches, leveraging models including BanglaBERT, Visual Geometry Group (VGG), Inception, ResNet, and Vision Transformer (ViT). Our experiments show that our custom attention-based LSTM textual model achieves an accuracy of 0.60, while VGG-based visual models reach up to 0.63. Multimodal models, which combine visual and textual features, consistently achieve accuracy scores of around 0.64, demonstrating the dataset’s robustness for advancing multimodal classification tasks. ExMute establishes a valuable benchmark for future NLP research, particularly in low-resource language settings, highlighting the importance of context-aware labeling in improving classification accuracy and reducing bias.

pdf bib
Studying the capabilities of Large Language Models in solving Combinatorics Problems posed in Hindi
Yash Kumar | Subhajit Roy

There are serious attempts at improving the mathematical acumen of LLMs on questions posed in English. In India, where a large fraction of students study in regional languages, there is a need to assess and improve these state-of-the-art LLMs in their reasoning abilities in regional languages as well. As Hindi is a language predominantly used in India, this study proposes a new dataset of mathematical combinatorics problems consisting of a parallel corpus of problems in English and Hindi collected from NCERT textbooks. We evaluate the “raw” single-shot capabilities of these LLMs in solving problems posed in Hindi. Then we apply a chain-of-thought approach to evaluate the improvement in the abilities of the LLMs at solving combinatorics problems posed in Hindi. Our study reveals that while smaller LLMs like LLaMa3-8B show a significant drop in performance when questions are posed in Hindi versus in English, larger LLMs like GPT4-turbo show excellent capabilities at solving problems posed in Hindi, almost on par with their abilities in English. We make two primary inferences from our study: (1) large models like GPT4 can be readily deployed in schools where Hindi is the primary language of study, especially in rural India; (2) there is a need to improve the multilingual capabilities of smaller models. As these smaller open-source models can be deployed on inexpensive GPUs, it is easier for schools to provide them to students; hence, the latter is an important direction for future research.

pdf bib
From Scarcity to Capability: Empowering Fake News Detection in Low-Resource Languages with LLMs
Hrithik Majumdar Shibu | Shrestha Datta | Md. Sumon Miah | Nasrullah Sami | Mahruba Sharmin Chowdhury | Md Saiful Islam

The rapid spread of fake news presents a significant global challenge, particularly in low-resource languages like Bangla, which lack adequate datasets and detection tools. Although manual fact-checking is accurate, it is expensive and too slow to prevent the dissemination of fake news. Addressing this gap, we introduce BanFakeNews-2.0, a robust dataset to enhance Bangla fake news detection. This version includes 11,700 additional, meticulously curated fake news articles validated from credible sources, creating a proportional dataset of 47,000 authentic and 13,000 fake news items across 13 categories. In addition, we created a manually curated independent test set of 460 fake and 540 authentic news items for rigorous evaluation. We invested significant effort in collecting fake news from credible sources and verifying it manually while preserving its linguistic richness. We develop a benchmark system utilizing transformer-based architectures, including fine-tuned Bidirectional Encoder Representations from Transformers variants (F1: 87%) and Large Language Models with Quantized Low-Rank Approximation (F1: 89%), that significantly outperforms traditional methods. BanFakeNews-2.0 offers a valuable resource to advance research and application in fake news detection for low-resource languages. We publicly release our dataset and model on GitHub to foster research in this direction.

pdf bib
Enhancing Participatory Development Research in South Asia through LLM Agents System: An Empirically-Grounded Methodological Initiative from Field Evidence in Sri Lanka
Xinjie Zhao | Hao Wang | Shyaman Maduranga Sriwarnasinghe | Jiacheng Tang | Shiyun Wang | Sayaka Sugiyama | So Morikawa

The integration of artificial intelligence into development research methodologies offers unprecedented opportunities to address persistent challenges in participatory research, particularly in linguistically diverse regions like South Asia. Drawing on empirical implementation in Sri Lanka’s Sinhala-speaking communities, this study presents a methodological framework designed to transform participatory development research in the multilingual context of Sri Lanka’s flood-prone Nilwala River Basin. Moving beyond conventional translation and data collection tools, the proposed framework leverages a multi-agent system architecture to redefine how data collection, analysis, and community engagement are conducted in linguistically and culturally complex research settings. This structured, agent-based approach facilitates participatory research that is both scalable and adaptive, ensuring that community perspectives remain central to research outcomes. Field experiences underscore the immense potential of LLM-based systems in addressing long-standing issues in development research across resource-limited regions, delivering both quantitative efficiencies and qualitative improvements in inclusivity. At a broader methodological level, this research advocates for AI-driven participatory research tools that prioritize ethical considerations, cultural sensitivity, and operational efficiency. It highlights strategic pathways for deploying AI systems to reinforce community agency and equitable knowledge generation, offering insights that could inform broader research agendas across the Global South.

pdf bib
Identifying Aggression and Offensive Language in Code-Mixed Tweets: A Multi-Task Transfer Learning Approach
Bharath Kancharla | Prabhjot Singh | Lohith Bhagavan Kancharla | Yashita Chama | Raksha Sharma

The widespread use of social media has contributed to the increase in hate speech and offensive language, impacting people of all ages. This issue is particularly difficult to address when the text is in a code-mixed language. Twitter is commonly used to express opinions in code-mixed language. In this paper, we introduce a novel Multi-Task Transfer Learning (MTTL) framework to detect aggression and offensive language. By focusing on the dual facets of cyberbullying, aggressiveness and offensiveness, our model leverages the MTTL approach to enhance performance on aggression and offensive language detection. Results show that our MTTL setup significantly enhances the performance of state-of-the-art pretrained language models, BERT, RoBERTa, and Hing-RoBERTa, on Hindi-English code-mixed data from Twitter.

pdf bib
Team IndiDataMiner at IndoNLP 2025: Hindi Back Transliteration - Roman to Devanagari using LLaMa
Saurabh Kumar | Dhruvkumar Babubhai Kakadiya | Sanasam Ranbir Singh

The increasing use of Romanized typing for Indo-Aryan languages on social media poses challenges due to its lack of standardization and loss of linguistic richness. To address this, we propose a sentence-level back-transliteration approach using the LLaMa 3.1 model for Hindi. Leveraging fine-tuning with the Dakshina dataset, our approach effectively resolves ambiguities in Romanized Hindi text, offering a robust solution for converting it into the native Devanagari script.

pdf bib
IndoNLP 2025 Shared Task: Romanized Sinhala to Sinhala Reverse Transliteration Using BERT
Sandun Sameera Perera | Lahiru Prabhath Jayakodi | Deshan Koshala Sumanathilaka | Isuri Anuradha

Romanized text has become popular with the growth of digital communication platforms, largely due to familiarity with English keyboards. In Sri Lanka, Romanized Sinhala, commonly referred to as “Singlish,” is widely used in digital communications. This paper introduces a novel context-aware back-transliteration system designed to address the ad-hoc typing patterns and lexical ambiguity inherent in Singlish. The proposed system combines dictionary-based mapping for Singlish words, rule-based transliteration for out-of-vocabulary words, and a BERT-based language model for addressing lexical ambiguities. Evaluation results demonstrate the robustness of the proposed approach, achieving high BLEU scores along with low Word Error Rate (WER) and Character Error Rate (CER) across test datasets. This study provides an effective solution for Romanized Sinhala back-transliteration and establishes a foundation for improving NLP tools for similar low-resource languages.
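
A minimal sketch of the hybrid idea follows: dictionary lookup first, a rule-based fallback for out-of-vocabulary words, and a masked language model to choose between candidate Sinhala words when a Romanized token is ambiguous. The dictionary entries, the placeholder rule function, and the mBERT model name are all illustrative assumptions, not the paper's components.

```python
# Minimal sketch of dictionary + rules + masked-LM back-transliteration.
# Dictionary, rules, and model name are illustrative assumptions.
from transformers import pipeline

SINGLISH_DICT = {"mama": ["මම"], "oyata": ["ඔයාට"], "kohomada": ["කොහොමද"]}

def rule_based(word: str) -> list[str]:
    # Placeholder for character-mapping rules for out-of-vocabulary words.
    return [word]

fill_mask = pipeline("fill-mask", model="bert-base-multilingual-cased")

def backtransliterate(tokens: list[str]) -> str:
    output = []
    for i, tok in enumerate(tokens):
        candidates = SINGLISH_DICT.get(tok.lower()) or rule_based(tok)
        if len(candidates) == 1:
            output.append(candidates[0])
            continue
        # Ambiguous token: mask this position and let the LM score candidates
        # (the pipeline approximates multi-token targets by their first subword).
        context = " ".join(output + [fill_mask.tokenizer.mask_token] + tokens[i + 1:])
        scores = fill_mask(context, targets=candidates)
        output.append(scores[0]["token_str"])
    return " ".join(output)

print(backtransliterate("mama oyata kohomada".split()))
```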

pdf bib
Crossing Language Boundaries: Evaluation of Large Language Models on Urdu-English Question Answering
Samreen Kazi | Maria Rahim | Shakeel Ahmed Khoja

This study evaluates the question-answering capabilities of Large Language Models (LLMs) in Urdu, addressing a critical gap in low-resource language processing. Four models (GPT-4, mBERT, XLM-R, and mT5) are assessed across monolingual, cross-lingual, and mixed-language settings using the UQuAD1.0 and SQuAD2.0 datasets. Results reveal significant performance gaps between English and Urdu processing, with GPT-4 achieving the highest F1 scores (89.1% in English, 76.4% in Urdu) while demonstrating relative robustness in cross-lingual scenarios. Boundary detection and translation mismatches emerge as primary challenges, particularly in cross-lingual settings. The study further demonstrates that question complexity and length significantly impact performance, with factoid questions yielding 14.2% higher F1 scores compared to complex questions. These findings establish important benchmarks for enhancing LLM performance in low-resource languages and identify key areas for improvement in multilingual question-answering systems.

pdf bib
Investigating the Effect of Backtranslation for Indic Languages
Sudhansu Bala Das | Samujjal Choudhury | Dr Tapas Kumar Mishra | Dr Bidyut Kr Patra

Neural machine translation (NMT) is becoming increasingly popular as an effective method of automated language translation. However, due to a scarcity of training datasets, its effectiveness is limited for low-resource languages such as Indian Languages (ILs). The lack of parallel datasets in Natural Language Processing (NLP) makes it difficult to investigate many ILs for Machine Translation (MT). A data augmentation approach such as Backtranslation (BT) can be used to enhance the size of the training dataset. This paper presents the development of an NMT model for ILs within the context of an MT system. To address the issue of data scarcity, the paper examines the effectiveness of a BT approach for ILs that uses both monolingual and parallel datasets. Experimental results reveal that although BT improves model performance, the improvement is not as significant as expected. It has also been observed that, even though the English-ILs and ILs-English models are trained on the same dataset, the ILs-English models perform better on all evaluation metrics. The reason for this is that ILs frequently differ from English in sentence structure, word order, and morphological richness. The paper also includes an error analysis, based on the Multidimensional Quality Metrics (MQM) framework, of translations between the languages used in the experiments.
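
To illustrate the BT mechanism itself, the sketch below translates monolingual target-side (Hindi) text back into English with a reverse model and pairs the synthetic source with the authentic target. This is a minimal sketch under our own assumptions: the Helsinki-NLP Hindi-English checkpoint stands in for whatever reverse model the paper trains.

```python
# Minimal sketch of backtranslation (BT) for augmenting an English->IL
# training set: a reverse (IL->English) model translates monolingual IL
# text, and the synthetic pairs are appended to the parallel data.
from transformers import pipeline

reverse_mt = pipeline("translation", model="Helsinki-NLP/opus-mt-hi-en")

monolingual_hindi = ["यह एक उदाहरण वाक्य है"]
synthetic_english = [out["translation_text"] for out in reverse_mt(monolingual_hindi)]

# Each (synthetic English, authentic Hindi) pair becomes extra training data
# for the forward English->Hindi model.
augmented_pairs = list(zip(synthetic_english, monolingual_hindi))
print(augmented_pairs)
```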

pdf bib
Sinhala Transliteration: A Comparative Analysis Between Rule-based and Seq2Seq Approaches
Yomal De Mel | Kasun Wickramasinghe | Nisansa de Silva | Surangika Ranathunga

For reasons of convenience and lack of tech literacy, transliteration (i.e., Romanizing native scripts instead of using localization tools) is highly prevalent in the context of low-resource languages such as Sinhala, which have their own writing script. In this study, our focus is on Romanized Sinhala transliteration. We propose two methods to address this problem: our baseline is a rule-based method, which is then compared against our second method, where we approach the transliteration problem as a sequence-to-sequence task akin to the established Neural Machine Translation (NMT) task. For the latter, we propose a Transformer-based encoder-decoder solution. We observed that the Transformer-based method could capture many ad-hoc patterns within the Romanized scripts compared to the rule-based method.

pdf bib
Romanized to Native Malayalam Script Transliteration Using an Encoder-Decoder Framework
Bajiyo Baiju | Kavya Manohar | Leena G. Pillai | Elizabeth Sherly

In this work, we present the development of a reverse transliteration model to convert romanized Malayalam to the native script using an encoder-decoder framework built with an attention-based bidirectional Long Short-Term Memory (Bi-LSTM) architecture. To train the model, we used a curated and combined collection of 4.3 million transliteration pairs derived from publicly available Indic language transliteration datasets, Dakshina and Aksharantar. We evaluated the model on two different test datasets provided by the IndoNLP-2025 Shared Task, containing (1) general typing patterns and (2) ad-hoc typing patterns, respectively. On Test Set-1, we obtained a character error rate (CER) of 7.42%. However, on Test Set-2, with ad-hoc typing patterns where most vowel indicators are missing, our model gave a CER of 22.8%.
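
For readers unfamiliar with the metric used above, a minimal sketch of character error rate follows: Levenshtein distance computed over characters, divided by the reference length. This is a generic implementation, not the shared task's official scorer.

```python
# Minimal sketch of character error rate (CER): character-level Levenshtein
# distance divided by reference length, as used to score transliteration.
def cer(reference: str, hypothesis: str) -> float:
    r, h = list(reference), list(hypothesis)
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(r)][len(h)] / len(r)

print(cer("മലയാളം", "മലയളം"))  # one dropped vowel sign -> CER ≈ 0.17
```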

up

pdf (full)
bib (full)
The Sixth Workshop on Insights from Negative Results in NLP

pdf bib
The Sixth Workshop on Insights from Negative Results in NLP
Aleksandr Drozd | João Sedoc | Shabnam Tafreshi | Arjun Akula | Raphael Shu

pdf bib
Challenging Assumptions in Learning Generic Text Style Embeddings
Phil Ostheimer | Marius Kloft | Sophie Fellenz

Recent advancements in language representation learning primarily emphasize language modeling for deriving meaningful representations, often neglecting style-specific considerations. This study addresses this gap by creating generic, sentence-level style embeddings crucial for style-centric tasks. Our approach is grounded in the premise that low-level text style changes can compose any high-level style. We hypothesize that applying this concept to representation learning enables the development of versatile text style embeddings. By fine-tuning a general-purpose text encoder using contrastive learning and standard cross-entropy loss, we aim to capture these low-level style shifts, anticipating that they offer insights applicable to high-level text styles. The outcomes prompt us to reconsider the underlying assumptions, as the results do not always show that the learned style representations capture high-level text styles.

pdf bib
In-Context Learning on a Budget: A Case Study in Token Classification
Uri Berger | Tal Baumel | Gabriel Stanovsky

Few shot in-context learning (ICL) typically assumes access to large annotated training sets. However, in many real world scenarios, such as domain adaptation, there is only a limited budget to annotate a small number of samples, with the goal of maximizing downstream performance. We study various methods for selecting samples to annotate within a predefined budget, focusing on token classification tasks, which are expensive to annotate and are relatively less studied in ICL setups. Across various tasks, models, and datasets, we observe that no method significantly outperforms the others, with most yielding similar results, including random sample selection for annotation. Moreover, we demonstrate that a relatively small annotated sample pool can achieve performance comparable to using the entire training set. We hope that future work adopts our realistic paradigm which takes annotation budget into account.

pdf bib
Reassessing Graph Linearization for Sequence-to-sequence AMR Parsing: On the Advantages and Limitations of Triple-Based
Jeongwoo Kang | Maximin Coavoux | Didier Schwab | Cédric Lopez

Sequence-to-sequence models are widely used to train Abstract Meaning Representation (Banarescu et al., 2013, AMR) parsers. To train such models, AMR graphs have to be linearized into a one-line text format. While Penman encoding is widely used for this purpose, we argue that it has limitations: 1) for deep graphs, some closely related nodes are located far apart in the linearized text; 2) Penman’s tree-based encoding necessitates inverse roles to handle node re-entrancy, doubling the number of relation types to predict. To address these issues, we propose a triple-based linearization method and compare its efficiency by training an AMR parser with both approaches. Although triples are well suited to representing a graph, our results show that they do not yet improve performance on deeper or longer graphs. This suggests room for improvement in the design to better compete with Penman’s concise representation and explicit encoding of a nested graph structure.
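
The contrast between the two linearizations can be shown on a standard textbook AMR for "The boy wants to go". This worked example is ours, not drawn from the paper; it simply illustrates why triples handle re-entrancy without nesting.

```python
# Penman needs nesting, and re-entrancy is expressed by repeating the
# variable b; a triple linearization lists each edge flatly instead.
penman = "(w / want-01 :ARG0 (b / boy) :ARG1 (g / go-02 :ARG0 b))"

triples = [
    ("w", "instance", "want-01"),
    ("b", "instance", "boy"),
    ("g", "instance", "go-02"),
    ("w", "ARG0", "b"),
    ("w", "ARG1", "g"),
    ("g", "ARG0", "b"),  # re-entrancy: no inverse role needed
]
linearized = " ".join(f"{s} {p} {o}" for s, p, o in triples)
print(linearized)
```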

pdf bib
Corrective In-Context Learning: Evaluating Self-Correction in Large Language Models
Mario Sanz-Guerrero | Katharina Von Der Wense

In-context learning (ICL) has transformed the use of large language models (LLMs) for NLP tasks, enabling few-shot learning by conditioning on labeled examples without finetuning. Despite its effectiveness, ICL is prone to errors, especially for challenging examples. With the goal of improving the performance of ICL, we propose *corrective in-context learning* (CICL), an approach that incorporates a model’s incorrect predictions alongside ground truth corrections into the prompt, aiming to enhance classification accuracy through self-correction. However, contrary to our hypothesis, extensive experiments on text classification tasks demonstrate that CICL consistently underperforms standard ICL, with performance degrading as the proportion of corrections in the prompt increases. Our findings indicate that CICL introduces confusion by disrupting the model’s task understanding, rather than refining its predictions. Additionally, we observe that presenting harder examples in standard ICL does not improve performance, suggesting that example difficulty alone may not be a reliable criterion for effective selection. By presenting these negative results, we provide important insights into the limitations of self-corrective mechanisms in LLMs and offer directions for future research.
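
A minimal sketch of how a corrective prompt differs from a standard few-shot prompt is given below. The field names and wording are our illustrative assumptions; the paper's exact template may differ.

```python
# Minimal sketch contrasting standard ICL with corrective ICL (CICL):
# each corrective demonstration includes the model's earlier wrong
# prediction alongside the gold label.
def standard_icl(demos, query):
    shots = "\n".join(f"Text: {t}\nLabel: {y}" for t, y in demos)
    return f"{shots}\nText: {query}\nLabel:"

def corrective_icl(demos_with_errors, query):
    shots = "\n".join(
        f"Text: {t}\nModel prediction: {wrong}\nCorrect label: {y}"
        for t, wrong, y in demos_with_errors
    )
    return f"{shots}\nText: {query}\nCorrect label:"

print(corrective_icl([("great film!", "negative", "positive")], "dull plot"))
```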

pdf bib
Do Prevalent Bias Metrics Capture Allocational Harms from LLMs?
Hannah Cyberey | Yangfeng Ji | David Evans

Allocational harms occur when resources or opportunities are unfairly withheld from specific groups. Many proposed bias measures ignore the discrepancy between predictions, which are what the proposed methods consider, and decisions that are made as a result of those predictions. Our work examines the reliability of current bias metrics in assessing allocational harms arising from predictions of large language models (LLMs). We evaluate their predictive validity and utility for model selection across ten LLMs and two allocation tasks. Our results reveal that commonly-used bias metrics based on average performance gap and distribution distance fail to reliably capture group disparities in allocation outcomes. Our work highlights the need to account for how model predictions are used in decisions, in particular in contexts where they are influenced by how limited resources are allocated.

pdf bib
Language-Specific Neurons Do Not Facilitate Cross-Lingual Transfer
Soumen Kumar Mondal | Sayambhu Sen | Abhishek Singhania | Preethi Jyothi

Multilingual large language models (LLMs) aim towards robust natural language understanding across diverse languages, yet their performance significantly degrades on low-resource languages. This work explores whether existing techniques to identify language-specific neurons can be leveraged to enhance cross-lingual task performance of low-resource languages. We conduct detailed experiments covering existing language-specific neuron identification techniques (such as Language Activation Probability Entropy and activation probability-based thresholding) and neuron-specific LoRA fine-tuning with models like Llama 3.1 and Mistral Nemo. We find that such neuron-specific interventions are insufficient to yield cross-lingual improvements on downstream tasks (XNLI, XQuAD) in low-resource languages. This study highlights the challenges in achieving cross-lingual generalization and provides critical insights for multilingual LLMs.

pdf bib
Monte Carlo Sampling for Analyzing In-Context Examples
Stephanie Schoch | Yangfeng Ji

Prior works have shown that in-context learning is brittle to presentation factors such as the order, number, and choice of selected examples. However, ablation-based guidance on selecting the number of examples may ignore the interplay between different presentation factors. In this work we develop a Monte Carlo sampling-based method to study the impact of number of examples while explicitly accounting for effects from order and selected examples. We find that previous guidance on how many in-context examples to select does not always generalize across different sets of selected examples and orderings, and whether one-shot settings outperform zero-shot settings is highly dependent on the selected example. Additionally, inspired by data valuation, we apply our sampling method to in-context example selection to select examples that perform well across different orderings. We find a negative result, that while performance is robust to ordering and number of examples, there is an unexpected performance degradation compared to random sampling.
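
The core sampling loop is simple to state in code. The sketch below is our own minimal rendering of the idea, with `evaluate_prompt` as a hypothetical scoring function standing in for a full ICL evaluation run.

```python
# Minimal sketch of Monte Carlo sampling over ICL presentation factors:
# repeatedly draw a random subset and ordering of candidate examples,
# evaluate, and inspect the resulting score distribution.
import random
from statistics import mean, stdev

def monte_carlo_icl(pool, k, num_samples, evaluate_prompt):
    scores = []
    for _ in range(num_samples):
        examples = random.sample(pool, k)  # random subset *and* order
        scores.append(evaluate_prompt(examples))
    return mean(scores), stdev(scores)

# Usage (hypothetical score_fn that runs the LLM and returns accuracy):
# mu, sigma = monte_carlo_icl(candidate_examples, k=4,
#                             num_samples=200, evaluate_prompt=score_fn)
```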

pdf bib
Does Training on Synthetic Data Make Models Less Robust?
Lingze Zhang | Ellie Pavlick

An increasingly common practice is to train large language models (LLMs) using synthetic data. Often this synthetic data is produced by the same or similar LLMs as those it is being used to train. This raises the question of whether the synthetic data might in fact exacerbate certain “blindspots” by reinforcing heuristics that the LLM already encodes. In this paper, we conduct simulated experiments on the natural language inference (NLI) task with Llama-2-7B-hf models. We use MultiNLI as the general task and HANS, a targeted evaluation set designed to measure the presence of specific heuristic strategies for NLI, as our “blindspot” task. Our goal is to determine whether performance disparities between the general and blind spot tasks emerge. Our results indicate that synthetic data does not reinforce blindspots in the way we expected. Specifically, we see that, while fine-tuning with synthetic data doesn’t necessarily reduce the use of the heuristic, it also does not make it worse as we hypothesized.

pdf bib
Bridging the Faithfulness Gap in Prototypical Models
Andrew Koulogeorge | Sean Xie | Saeed Hassanpour | Soroush Vosoughi

Prototypical Network-based Language Models (PNLMs) have been introduced as a novel approach for enhancing interpretability in deep learning models for NLP. In this work, we show that, despite the transparency afforded by their case-based reasoning architecture, current PNLMs are, in fact, not faithful, i.e. their explanations do not accurately reflect the underlying model’s reasoning process. By adopting an axiomatic approach grounded in the seminal works’ definition of faithfulness, we identify two specific points in the architecture of PNLMs where unfaithfulness may occur. To address this, we introduce Faithful Alignment (FA), a two-part framework that ensures the faithfulness of PNLMs’ explanations. We then demonstrate that FA achieves this goal without compromising model performance across a variety of downstream tasks and ablation studies.

pdf bib
Aligning Sizes of Intermediate Layers by LoRA Adapter for Knowledge Distillation
Takeshi Suzuki | Hiroaki Yamada | Takenobu Tokunaga

Intermediate Layer Distillation (ILD) is a variant of Knowledge Distillation (KD), a method for compressing neural networks. ILD requires a mapping to align the intermediate layer sizes of the teacher and student models to compute the loss function in training, while this mapping is not used during inference. This inconsistency may reduce the effectiveness of learning in intermediate layers. In this study, we propose LoRAILD, which uses LoRA adapters to eliminate the inconsistency. However, our experimental results show that LoRAILD does not outperform existing methods. Furthermore, contrary to previous studies, we observe that conventional ILD does not outperform vanilla KD. Our analysis of the distilled models’ intermediate layers suggests that ILD does not improve language models’ performance.

pdf bib
LLMs are not Zero-Shot Reasoners for Biomedical Information Extraction
Aishik Nagar | Viktor Schlegel | Thanh-Tung Nguyen | Hao Li | Yuping Wu | Kuluhan Binici | Stefan Winkler

Large Language Models (LLMs) are increasingly adopted for applications in healthcare, reaching the performance of domain experts on tasks such as question answering and document summarisation. Despite their success on these tasks, it is unclear how well LLMs perform on tasks that are traditionally pursued in the biomedical domain, such as structured information extraction. To bridge this gap, in this paper, we systematically benchmark LLM performance in Medical Classification and Named Entity Recognition (NER) tasks. We aim to disentangle the contribution of different factors to the performance, particularly the impact of LLMs’ task knowledge and reasoning capabilities, their (parametric) domain knowledge, and addition of external knowledge. To this end, we evaluate various open LLMs—including BioMistral and Llama-2 models—on a diverse set of biomedical datasets, using standard prompting, Chain-of-Thought (CoT) and Self-Consistency based reasoning as well as Retrieval-Augmented Generation (RAG) with PubMed and Wikipedia corpora. Counter-intuitively, our results reveal that standard prompting consistently outperforms more complex techniques across both tasks, laying bare the limitations in the current application of CoT, self-consistency and RAG in the biomedical domain. Our findings suggest that advanced prompting methods developed for knowledge- or reasoning-intensive tasks, such as CoT or RAG, are not easily portable to biomedical tasks where precise structured outputs are required. This highlights the need for more effective integration of external knowledge and reasoning mechanisms in LLMs to enhance their performance in real-world biomedical applications.

pdf bib
Exploring Limitations of LLM Capabilities with Multi-Problem Evaluation
Zhengxiang Wang | Jordan Kodner | Owen Rambow

We propose using prompts made up of multiple problems to evaluate LLM capabilities, an approach we call multi-problem evaluation. We examine 7 LLMs on 4 related task types constructed from 6 existing classification benchmarks. We find that while LLMs can generally perform multiple homogeneous classifications at once (Batch Classification) as well as when they do so separately, they perform significantly worse on two selection tasks that are conceptually equivalent to Batch Classification and involve selecting indices of text falling into each class label, either independently or altogether. We show that such a significant performance drop is due to LLMs’ inability to adequately combine index selection with text classification. Such a drop is surprisingly observed across all LLMs tested, under zero-shot, few-shot, and CoT settings, and even with a novel synthetic dataset, potentially reflecting an inherent capability limitation with modern LLMs.
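
The two prompt formats being contrasted can be sketched as follows. The wording is our illustrative assumption, not the paper's exact template; it only shows how batch classification and index selection pose conceptually equivalent questions differently.

```python
# Minimal sketch of the two multi-problem prompt formats:
# batch classification vs. index selection over the same texts.
texts = ["I loved it.", "Terrible service.", "It was fine."]

batch_prompt = (
    "Classify each text as positive or negative.\n"
    + "\n".join(f"{i + 1}. {t}" for i, t in enumerate(texts))
    + "\nAnswers (one label per line):"
)

index_selection_prompt = (
    "List the indices of all positive texts.\n"
    + "\n".join(f"{i + 1}. {t}" for i, t in enumerate(texts))
    + "\nIndices:"
)

print(batch_prompt, index_selection_prompt, sep="\n\n")
```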

pdf bib
Exploring Multimodal Language Models for Sustainability Disclosure Extraction: A Comparative Study
Tanay Gupta | Tushar Goel | Ishan Verma

Sustainability metrics have increasingly become a crucial non-financial criterion in investment decision-making. Organizations worldwide are recognizing the importance of sustainability and are proactively highlighting their efforts through specialized sustainability reports. Unlike traditional annual reports, these sustainability disclosures are typically text-heavy and are often expressed as infographics, complex tables, and charts. The non-machine-readable nature of these reports presents a significant challenge for efficient information extraction. The rapid advancement of Vision Language Models (VLMs) has raised the question of whether these VLMs can address such challenges in domain-specific tasks. In this study, we demonstrate the application of VLMs for extracting sustainability information from dedicated sustainability reports. Our experiments highlight the limitations in the performance of several open-source VLMs in extracting information about sustainability disclosures from different types of pages.

pdf bib
Self Knowledge-Tracing for Tool Use (SKT-Tool): Helping LLM Agents Understand Their Capabilities in Tool Use
Joshua Vigel | Renpei Cai | Eleanor Chen | Anish Neema | Austen Liao | Kevin Zhu | Sean O’brien

Large Language Models (LLMs) enhanced with tool use and APIs improve task performance but often misuse them, leading to inefficiency and unnecessary cost. We propose Self Knowledge-Tracing for Tool Use (SKT-Tool), a method enabling LLMs to assess their capabilities and make informed API usage decisions using knowledge tracing (KT). Our teacher-student framework helps LLMs optimize API calls in real-time without fine-tuning. Experiments across multiple datasets show that SKT-Tool significantly reduces API calls while maintaining accuracy, offering a scalable and cost-effective solution for tool-augmented LLMs. We conclude by analyzing shortcomings in this method and identifying directions for future work.

pdf bib
Error Reflection Prompting: Can Large Language Models Successfully Understand Errors?
Jason Li | Lauren Yraola | Kevin Zhu | Sean O’brien

Prompting methods for language models, such as Chain-of-Thought (CoT), present intuitive step-by-step processes for problem solving. These methodologies aim to equip models with a better understanding of the correct procedures for addressing a given task. Despite these advancements, CoT lacks the ability to reflect and correct errors, potentially causing a model to perpetuate mistakes. Therefore, inspired by the human ability to perform said tasks, we propose Error Reflection Prompting (ERP) to further enhance reasoning in language models. Building upon CoT, ERP is a method comprising an incorrect answer, error recognition, and a correct answer. This process enables the model to recognize types of errors and the steps that lead to incorrect answers, allowing the model to better discern which steps to avoid and which to take. The model is able to generate the error outlines itself with automated ERP generation, allowing error recognition and correction to be integrated into the reasoning chain and bringing scalability and reliability to the process. The results demonstrate that ERP serves as a versatile supplement to conventional CoT, ultimately contributing to more robust and capable reasoning abilities, along with increased interpretability into how models arrive at their errors.

pdf bib
Evaluating Robustness of LLMs to Numerical Variations in Mathematical Reasoning
Yuli Yang | Hiroaki Yamada | Takenobu Tokunaga

Evaluating an LLM’s robustness against numerical perturbation is a good way to know whether the LLM actually performs reasoning or just replicates learned patterns. We propose a novel method to augment math word problems (MWPs), producing numerical variations at a large scale using templates. We also propose an automated error classification framework for scalable error analysis, distinguishing calculation errors from reasoning errors. Our experiments using these methods show that LLMs are brittle to numerical variations, suggesting they are not fully capable of generating valid reasoning steps, often failing in arithmetic operations.
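
A minimal sketch of template-based numerical variation follows: the surface text is fixed, the numbers are resampled, and the gold answer is recomputed so that memorized answers cannot score. The template and number ranges are our illustrative assumptions.

```python
# Minimal sketch of template-based numerical variation for math word
# problems: resample numbers, recompute the gold answer per variant.
import random
from math import comb

TEMPLATE = ("A committee of {k} people is to be chosen from {n} candidates. "
            "In how many ways can this be done?")

def make_variant():
    n = random.randint(6, 12)
    k = random.randint(2, n - 1)
    return TEMPLATE.format(n=n, k=k), comb(n, k)

question, answer = make_variant()
print(question, answer)  # gold answer recomputed, not memorized
```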

up

pdf (full)
bib (full)
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology

pdf bib
Proceedings of the 15th International Workshop on Spoken Dialogue Systems Technology
Maria Ines Torres | Yuki Matsuda | Zoraida Callejas | Arantza del Pozo | Luis Fernando D'Haro

pdf bib
Automatic Generation of Structured Domain Knowledge for Dialogue-based XAI Systems
Carolin Schindler | Isabel Feustel | Niklas Rach | Wolfgang Minker

Explanatory dialogue systems serve as an intuitive interface between non-expert users and explainable AI (XAI) systems. The interaction with these kinds of systems benefits especially from the integration of structured domain knowledge, e.g., by means of bipolar argumentation trees. So far, these domain-specific structures have needed to be created manually, thereby impairing the flexibility of the system with respect to the domain. We address this limitation by adapting an existing pipeline for topic-independent acquisition of argumentation trees in the field of persuasive, argumentative dialogue to the area of explanatory dialogue. This shift is achieved by a) introducing and investigating different formulations of auxiliary claims per feature of the explanation of the AI model, b) exploring the influence of pre-grouping of the arguments with respect to the feature they address, c) suggesting adaptations to the existing algorithm of the pipeline for obtaining a tree structure, and d) utilizing a new approach for determining the type of the relationship between the arguments. Through a step-wise expert evaluation on the Titanic survival domain, we identify the best-performing variant of our pipeline. With this variant, we conduct a user study comparing the automatically generated argumentation trees against their manually created counterparts in the Titanic survival and credit acquisition domains. This assessment of the suitability of the generated argumentation trees for later integration into dialogue-based XAI systems as domain knowledge yields promising results.

pdf bib
Exploring the Impact of Modalities on Building Common Ground Using the Collaborative Scene Reconstruction Task
Yosuke Ujigawa | Asuka Shiotani | Masato Takizawa | Eisuke Midorikawa | Ryuichiro Higashinaka | Kazunori Takashio

To deepen our understanding of verbal and non-verbal modalities in establishing common ground, this study introduces a novel “collaborative scene reconstruction task.” In this task, pairs of participants, each provided with distinct image sets derived from the same video, work together to reconstruct the sequence of the original video. The level of agreement between the participants on the image order—quantified using Kendall’s rank correlation coefficient—serves as a measure of common ground construction. This approach enables the analysis of how various modalities contribute to the construction of common ground. A corpus comprising 40 dialogues from 20 participants was collected and analyzed. The findings suggest that specific gestures play a significant role in fostering common ground, offering valuable insights for the development of dialogue systems that leverage multimodal information to enhance users’ construction of common ground.

pdf bib
Design, Generation and Evaluation of a Synthetic Dialogue Dataset for Contextually Aware Chatbots in Art Museums
Inass Rachidi | Anas Ezzakri | Jaime Bellver-Soler | Luis Fernando D’Haro

This paper presents the design, synthetic generation, and automated evaluation of ArtGenEval-GPT++, an advanced dataset for training and fine-tuning conversational agents with artificial awareness capabilities targeted at the art domain. Building on the foundation of a previously released dataset (ArtGenEval-GPT), the new version introduces enhancements for greater personalization (e.g., gender, ethnicity, age, and knowledge) while addressing prior limitations, including low-quality dialogues and hallucinations. The dataset comprises approximately 12,500 dyadic, multi-turn dialogues generated using state-of-the-art large language models (LLMs). These dialogues span diverse museum scenarios, incorporating varied visitor profiles, emotional states, interruptions, and chatbot behaviors. Objective evaluations confirm the dataset’s quality and contextual coherence. Ethical considerations, including biases and hallucinations, are analyzed, with proposed directions for improving the dataset’s utility. This work contributes to the development of personalized, context-aware conversational agents capable of navigating complex, real-world environments, such as museums, to enhance visitor engagement and satisfaction.

pdf bib
A Voice-Controlled Dialogue System for NPC Interaction using Large Language Models
Milan Wevelsiep | Nicholas Thomas Walker | Nicolas Wagner | Stefan Ultes

This paper explores the integration of voice-controlled dialogue systems in narrative-driven video games, addressing the limitations of existing approaches. We propose a hybrid interface that allows players to freely paraphrase predefined dialogue options, combining player expressiveness with narrative cohesion. The prototype was developed in Unity, and a large language model was used to map the transcribed voice input to existing dialogue options. The approach was evaluated in a user study (n=14) that compared the hybrid interface to traditional point-and-click methods. Results indicate the proposed interface enhances players’ enjoyment and perceived freedom while maintaining narrative consistency. The findings provide insights into the design of scalable and engaging voice-controlled systems for interactive storytelling. Future research should focus on reducing latency and refining language model accuracy to further improve user experience and immersion.

pdf bib
A Dialogue System for Semi-Structured Interviews by LLMs and its Evaluation on Persona Information Collection
Ryo Hasegawa | Yijie Hua | Takehito Utsuro | Ekai Hashimoto | Mikio Nakano | Shun Shiramatsu

In this paper, we propose a dialogue control management framework using large language models for semi-structured interviews. Specifically, large language models are used to generate the interviewer’s utterances and to make conditional branching decisions based on an understanding of the interviewee’s responses. The framework enables flexible dialogue control in interview conversations by generating and updating slots and values according to interviewee answers. More importantly, through prompt tuning of the LLMs, we devised a framework for accumulating the list of generated slots as the number of interviewees increases over the course of the semi-structured interviews. Evaluation results showed that the proposed approach of accumulating the list of generated slots throughout the semi-structured interviews outperforms the baseline without slot accumulation in terms of the number of persona attributes and values collected through the semi-structured interview.

pdf bib
Exploring Personality-Aware Interactions in Salesperson Dialogue Agents
Sijia Cheng | Wen Yu Chang | Yun-Nung Chen

The integration of dialogue agents into the sales domain requires a deep understanding of how these systems interact with users possessing diverse personas. This study explores the influence of user personas, defined using the Myers-Briggs Type Indicator (MBTI), on the interaction quality and performance of sales-oriented dialogue agents. Through large-scale testing and analysis, we assess the pre-trained agent’s effectiveness, adaptability, and personalization capabilities across a wide range of MBTI-defined user types. Our findings reveal significant patterns in interaction dynamics, task completion rates, and dialogue naturalness, underscoring the future potential for dialogue agents to refine their strategies to better align with varying personality traits. This work not only provides actionable insights for building more adaptive and user-centric conversational systems in the sales domain but also contributes broadly to the field by releasing persona-defined user simulators. These simulators, unconstrained by domain, offer valuable tools for future research and demonstrate the potential for scaling personalized dialogue systems across diverse applications.

pdf bib
ReSpAct: Harmonizing Reasoning, Speaking, and Acting Towards Building Large Language Model-Based Conversational AI Agents
Vardhan Dongre | Xiaocheng Yang | Emre Can Acikgoz | Suvodip Dey | Gokhan Tur | Dilek Hakkani-Tur

Large language model (LLM)-based agents have been increasingly used to interact with external environments (e.g., games, APIs, etc.) and solve tasks. However, current frameworks do not enable these agents to work with users and interact with them to align on the details of their tasks and reach user-defined goals; instead, in ambiguous situations, these agents may make decisions based on assumptions. This work introduces ReSpAct (Reason, Speak, and Act), a novel framework that synergistically combines the essential skills for building task-oriented “conversational” agents. ReSpAct addresses this need for agents, expanding on the ReAct approach. The ReSpAct framework enables agents to interpret user instructions, reason about complex tasks, execute appropriate actions, and engage in dynamic dialogue to seek guidance, clarify ambiguities, understand user preferences, resolve problems, and use the intermediate feedback and responses of users to update their plans. We evaluated ReSpAct with GPT-4 in environments supporting user interaction, such as task-oriented dialogue (MultiWOZ) and interactive decision-making (Alfworld, WebShop). ReSpAct is flexible enough to incorporate dynamic user feedback and to address prevalent issues like error propagation and agents getting stuck in reasoning loops. This results in more interpretable, human-like task-solving trajectories than baselines relying solely on reasoning traces. In two interactive decision-making benchmarks, AlfWorld and WebShop, ReSpAct outperforms the strong reasoning-only method ReAct by an absolute success rate of 6% and 4%, respectively. In the task-oriented dialogue benchmark MultiWOZ, ReSpAct improved Inform and Success scores by 5.5% and 3%, respectively.

pdf bib
Examining Older Adults’ Motivation for Interacting with Health-Monitoring Conversational Systems Through Field Trials
Mariko Yoshida | Ryo Hori | Yuki Zenimoto | Mayu Urata | Mamoru Endo | Takami Yasuda | Aiko Inoue | Takahiro Hayashi | Ryuichiro Higashinaka

When assessing the health of older adults, oral interviews and written questionnaires are commonly used. However, these methods are time-consuming in terms of both execution and data aggregation. To address this issue, systems utilizing generative AI for health information collection through conversation have been developed and implemented. Despite these advancements, the motivation of older adults to consistently engage with such systems in their daily lives has not been thoroughly explored. In this study, we developed a smart-speaker extension that uses generative AI to monitor health status through casual conversations with older adult users. The system was tested in a two-week home trial with older adult participants. We conducted post-trial questionnaires and interviews, and we analyzed conversation log data. The results revealed that older adult users enjoy interacting with such systems and can integrate their use into their daily routines. Customized notifications through text messages encouraged system use, and the system’s ability to refer to previous conversations and address users by name was identified as a key factor motivating continued use.

pdf bib
Balancing Knowledge Delivery and Emotional Comfort in Healthcare Conversational Systems
Shang-Chi Tsai | Yun-Nung Chen

With the advancement of large language models, many dialogue systems are now capable of providing reasonable and informative responses to patients’ medical conditions. However, when patients consult their doctor, they may experience negative emotions due to the severity and urgency of their situation. If the model can provide appropriate comfort and empathy based on the patient’s negative emotions while answering medical questions, it will likely offer a more reassuring experience during the medical consultation process. To address this issue, our paper explores the balance between knowledge sharing and emotional support in the healthcare dialogue process. We utilize a large language model to rewrite a real-world interactive medical dialogue dataset, generating patient queries with negative emotions and corresponding medical responses aimed at soothing the patient’s emotions while addressing their concerns. The modified data serves to refine the latest large language models with various fine-tuning methods, enabling them to accurately provide sentences with both emotional reassurance and constructive suggestions in response to patients’ questions. Compared to the original LLM model, our experimental results demonstrate that our methodology significantly enhances the model’s ability to generate emotional responses while maintaining its original capability to provide accurate knowledge-based answers.

pdf bib
Context or Retrieval? Evaluating RAG Methods for Art and Museum QA System
Samuel Ramos-Varela | Jaime Bellver-Soler | Marcos Estecha-Garitagoitia | Luis Fernando D’Haro

Recent studies suggest that increasing the context window of language models could outperform retrieval-augmented generation (RAG) methods in certain tasks. However, in domains such as art and museums, where information is inherently multimodal, combining images and detailed textual descriptions, this assumption needs closer examination. To explore this, we compare RAG techniques with direct large-context input approaches for answering questions about artworks. Using a dataset of painting images paired with textual information, we develop a synthetic database of question-answer (QA) pairs for evaluating these methods. The focus is on assessing the efficiency and accuracy of RAG in retrieving and using relevant information compared to passing the entire textual context to a language model. Additionally, we experiment with various strategies for segmenting and retrieving text to optimise the RAG pipeline. The results aim to clarify the trade-offs between these approaches and provide valuable insights for interactive systems designed for art and museum contexts.
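
The two regimes being compared can be sketched side by side. This is a minimal sketch under our own assumptions: TF-IDF stands in for whatever retriever the paper uses, and `ask_llm` is a hypothetical generation call.

```python
# Minimal sketch contrasting RAG (retrieve top-k chunks) with passing the
# full artwork description as context. `ask_llm` is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(chunks, question, k=3):
    vec = TfidfVectorizer().fit(chunks + [question])
    sims = cosine_similarity(vec.transform([question]), vec.transform(chunks))[0]
    top = sims.argsort()[::-1][:k]
    return [chunks[i] for i in top]

def rag_answer(chunks, question, ask_llm):
    context = "\n".join(retrieve(chunks, question))
    return ask_llm(f"Context:\n{context}\n\nQuestion: {question}")

def full_context_answer(document, question, ask_llm):
    # Large-context alternative: no retrieval, the whole description goes in.
    return ask_llm(f"Context:\n{document}\n\nQuestion: {question}")
```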

pdf bib
Paralinguistic Attitude Recognition for Spoken Dialogue Systems
Kouki Miyazawa | Zhi Zhu | Yoshinao Sato

Although paralinguistic information is critical for human communication, most spoken dialogue systems ignore such information, hindering natural communication between humans and machines. This study addresses the recognition of paralinguistic attitudes in user speech. Specifically, we focus on four essential attitudes for generating an appropriate system response, namely agreement, disagreement, questions, and stalling. The proposed model can help a dialogue system better understand what the user is trying to convey. In our experiments, we trained and evaluated a model that classified paralinguistic attitudes on a reading-speech dataset without using linguistic information. The proposed model outperformed human perception. Furthermore, experimental results indicate that speech enhancement alleviates the degradation of model performance caused by background noise, whereas reverberation remains a challenge.

pdf bib
Exploring ReAct Prompting for Task-Oriented Dialogue: Insights and Shortcomings
Michelle Elizabeth | Morgan Veyret | Miguel Couceiro | Ondrej Dusek | Lina M. Rojas Barahona

Large language models (LLMs) gained immense popularity due to their impressive capabilities in unstructured conversations. Empowering LLMs with advanced prompting strategies such as reasoning and acting (ReAct) (Yao et al., 2022) has shown promise in solving complex tasks traditionally requiring reinforcement learning. In this work, we apply the ReAct strategy to guide LLMs performing task-oriented dialogue (TOD). We evaluate ReAct-based LLMs (ReAct-LLMs) both in simulation and with real users. While ReAct-LLMs severely underperform state-of-the-art approaches on success rate in simulation, this difference becomes less pronounced in human evaluation. Moreover, compared to the baseline, humans report higher subjective satisfaction with ReAct-LLM despite its lower success rate, most likely thanks to its natural and confidently phrased responses.

pdf bib
Design of a conversational agent to support people on suicide risk
Mario Manso Vázquez | José Manuel Ramírez Sánchez | Carmen García-Mateo | Laura Docío-Fernández | Manuel José Fernández-Iglesias | Beatriz Gómez-Gómez | Beatriz Pinal | Antia Brañas | Alejandro García-Caballero

In this paper, we present a core component of the VisIA project: a conversational agent designed to detect suicide risk factors during real-time chat interactions. By adhering to clinical guidelines and the state-of-the-art theories of suicide, the agent aims to provide a scalable and effective approach to identifying individuals at risk. Preliminary results demonstrate the feasibility and potential of conversational agents in enhancing suicide risk detection.

pdf bib
Optimizing RAG: Classifying Queries for Dynamic Processing
Kabir Olawore | Michael McTear | Yaxin Bi | David Griol

In Retrieval-Augmented Generation (RAG) systems, efficient information retrieval is crucial for enhancing user experience and satisfaction, as response times and computational demands significantly impact performance. RAG can be unnecessarily resource-intensive for frequently asked questions (FAQs) and simple questions. In this paper, we introduce an approach that identifies user questions as simple queries that do not require RAG processing. Evaluation results show that our proposal reduces latency and improves response efficiency compared to systems relying solely on RAG.
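
A minimal sketch of the routing idea follows: FAQ-like queries are answered from a cached lookup, and only the rest go through the RAG pipeline. The similarity-threshold classifier here is an illustrative stand-in for whatever query classifier the paper proposes.

```python
# Minimal sketch of query routing: cheap FAQ lookup for simple queries,
# full RAG pipeline for everything else. Classifier is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

FAQS = {"What are your opening hours?": "We are open 9am-5pm, Monday to Friday."}

def route(query, rag_pipeline, threshold=0.8):
    questions = list(FAQS)
    vec = TfidfVectorizer().fit(questions + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(questions))[0]
    best = sims.argmax()
    if sims[best] >= threshold:
        return FAQS[questions[best]]  # cheap FAQ path, no RAG
    return rag_pipeline(query)        # full retrieval + generation
```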

pdf bib
Enhancing Proactive Dialogue Systems Through Self-Learning of Reasoning and Action-Planning
Ryosuke Ito | Tetsuya Takiguchi | Yasuo Ariki

A proactive dialogue system refers to a conversational system designed to guide the direction of a conversation in order to achieve pre-defined targets or fulfill specific goals. Recent studies have shown that Proactive Chain-of-Thought, which guides the system to explicitly think through intermediate reasoning and action-planning steps toward a conversational goal before generating a response, can significantly enhance the performance of proactive dialogue systems. However, these improvements primarily focus on prompt-based control, while the potential of fine-tuning Proactive-CoT remains largely unexplored. Furthermore, fine-tuning Proactive-CoT requires manual annotation of reasoning processes and action plans, which incurs significant time and cost. In this study, we propose a novel approach for automatically annotating reasoning processes and action plans through self-learning. This method enables fully automated annotation, significantly reducing the time and cost associated with manual annotation. Experimental results show that models trained using our proposed method outperform those trained with other fine-tuning approaches. These findings highlight the potential of self-learning approaches to advance the development of more robust and efficient proactive dialogue systems.

pdf bib
TrustBoost: Balancing flexibility and compliance in conversational AI systems
David Griol | Zoraida Callejas | Manuel Gil-Martín | Ksenia Kharitonova | Juan Manuel Montero-Martínez | David Pérez Fernández | Fernando Fernández-Martínez

Conversational AI (ConvAI) systems are gaining importance as an alternative for more natural interaction with digital services. In this context, Large Language Models (LLMs) have opened new possibilities for less restricted interaction and richer natural language understanding. However, despite their advanced capabilities, LLMs can pose accuracy and reliability problems, as they sometimes generate factually incorrect or contextually inappropriate content that does not fulfill the regulations or business rules of a specific application domain. In addition, they still lack the capability to adjust to users’ needs and preferences and to show emotional awareness while concurrently adhering to the regulations and limitations of their designated domain. In this paper we present the TrustBoost project, which addresses the challenge of improving the trustworthiness of ConvAI along two dimensions: cognition (adaptability, flexibility, compliance, and performance) and affectivity (familiarity, emotional dimension, and perception). The project runs from September 2024 to December 2027.

pdf bib
ScriptBoard: Designing modern spoken dialogue systems through visual programming
Divesh Lala | Mikey Elmers | Koji Inoue | Zi Haur Pang | Keiko Ochi | Tatsuya Kawahara

Implementation of spoken dialogue systems can be time-consuming, in particular for people who are not familiar with managing dialogue states and turn-taking in real-time. A GUI-based system where the user can quickly understand the dialogue flow allows rapid prototyping of experimental and real-world systems. In this demonstration we present ScriptBoard, a tool for creating dialogue scenarios which is independent of any specific robot platform. ScriptBoard has been designed with multi-party scenarios in mind and makes use of large language models to both generate dialogue and make decisions about the dialogue flow. This program promotes both flexibility and reproducibility in spoken dialogue research and provides everyone the opportunity to design and test their own dialogue scenarios.

pdf bib
D4AC: A Tool for Developing Multimodal Dialogue Systems without Coding
Mikio Nakano | Ryuichiro Higashinaka

To enable the broader application of dialogue system technology across various fields, it is beneficial to empower individuals with limited programming experience to build dialogue systems. Domain experts in fields where dialogue system technology is highly relevant may not necessarily possess expertise in information technology. This paper presents D4AC, which works as a client for text-based dialogue servers. By combining D4AC with a no-code tool for developing text-based dialogue servers, it is possible to build multimodal dialogue systems without coding. These systems can adapt to the user’s age, gender, emotions, and engagement levels obtained from their facial images. D4AC can be installed, launched, and configured without technical knowledge. It was used in student projects at a university, which suggested its effectiveness.

pdf bib
A Multilingual Speech-Based Driver Assistant for Basque and English
Antonio Aparicio Akcharov | Asier López Zorrilla | Juan Camilo Vásquez Correa | Oscar Montserrat | José Maria Echevarría | Begoña Arrate | Joxean Zapirain | Mikel deVelasco Vázquez | Santiago Andrés Moreno-Acevedo | Ander González-Docasal | Maria Ines Torres | Aitor Álvarez

This demo paper presents a prototype of a multilingual, speech-based driver assistant, designed to support both English and Basque languages. The inclusion of Basque—a low-resource language with limited domain-specific training data—marks a significant contribution, as publicly available AI models, including Large Language Models, often underperform for such languages compared to high-resource languages like English. Despite these challenges, our system demonstrates robust performance, successfully understanding user queries and delivering rapid responses in a demanding environment: a car simulator. Notably, the system achieves comparable performance in both English and Basque, showcasing its effectiveness in addressing linguistic disparities in AI-driven applications. A demo of our prototype will be available in the workshop.

pdf bib
Intimebot – A Dialogue Agent for Timekeeping Support
Shoaib Khan | Alex Samani | Rafael Banchs

This demo paper presents intimebot, an AI-powered solution designed to assist with timekeeping, a fundamental but also overwhelming and complex task in many professional services practices. Our demo shows how Artificial Intelligence can be utilized to implement a more efficient timekeeping process within a firm. Based on brief work descriptions provided by the timekeeper, intimebot is able to (1) predict the relevant combination of client, matter, and phase, (2) estimate the work effort hours, and (3) rewrite and normalize the provided work description into a compliant narrative. This can save a significant amount of time for busy professionals while ensuring terms-of-business compliance and best practices.

pdf bib
A Chatbot for Providing Suicide Prevention Information in Spanish
Pablo Ascorbe | María S. Campos | César Domínguez | Jónathan Heras | Magdalena Pérez | Ana Rosa Terroba-Reinares

Suicide has been identified by the World Health Organization as one of the most serious health problems that can affect people. Among the interventions that have been proposed to help people suffering from this problem and their relatives, the dissemination of accurate information is crucial. To achieve this goal, we have developed PrevenIA, a chatbot that provides reliable information on suicide prevention. The chatbot consists of a Retrieval-Augmented Module for answering users’ queries based on a curated list of documents. In addition, it includes several models to avoid undesirable behaviours. The system has been validated by specialists and is currently being evaluated by different populations. Thanks to this project, reliable information on suicide will be disseminated in an easy and understandable form.

pdf bib
LAMIA: An LLM Approach for Task-Oriented Dialogue Systems in Industry 5.0
Cristina Fernandez | Izaskun Fernandez | Cristina Aceta

Human-Machine Interaction (HMI) plays an important role in Industry 5.0, improving worker well-being by automating repetitive tasks and enhancing seamless collaboration between humans and intelligent systems. In this context, Task-Oriented Dialogue (TOD) systems are a commonly used approach to enable natural communication in these settings, traditionally developed using rule-based approaches. However, the revolution of Large Language Models (LLMs) is changing how dialogue systems are developed, removing the need to rely on tedious and rigid handcrafted rules. Despite their popularity, the application of LLMs in industrial contexts remains underexplored and requires addressing challenges such as hallucinations, lack of domain-specific data, high training costs, and limited adaptability. To explore the contribution of LLMs to the industrial field, this work presents LAMIA, a task-oriented dialogue system for industrial scenarios that leverages LLMs through prompt tuning. The system has been adapted and evaluated for a bin-picking use case using GPT-3.5 Turbo, proving to be an intuitive method to adapt to new use cases in Industry 5.0.

pdf bib
Conversational Tutoring in VR Training: The Role of Game Context and State Variables
Maia Aguirre | Ariane Méndez | Aitor García-Pablos | Montse Cuadros | Arantza del Pozo | Oier Lopez de Lacalle | Ander Salaberria | Jeremy Barnes | Pablo Martínez | Muhammad Zeshan Afzal

Virtual Reality (VR) training provides safe, cost-effective engagement with lifelike scenarios but lacks intuitive communication between users and the virtual environment. This study investigates the use of Large Language Models (LLMs) as conversational tutors in VR health and safety training, examining the impact of game context and state variables on LLM-generated answers in zero- and few-shot settings. Results demonstrate that incorporating both game context and state information significantly improves answer accuracy, with human evaluations showing gains of up to 0.26 points in zero-shot and 0.18 points in few-shot settings on a 0-1 scale.

pdf bib
A Methodology for Identifying Evaluation Items for Practical Dialogue Systems Based on Business-Dialogue System Alignment Models
Mikio Nakano | Hironori Takeuchi | Kazunori Komatani

This paper proposes a methodology for identifying evaluation items for practical dialogue systems. Traditionally, user satisfaction and user experiences have been the primary metrics for evaluating dialogue systems. However, there are various other evaluation items to consider when developing and operating practical dialogue systems, and such evaluation items are expected to lead to new research topics. So far, there has been no methodology for identifying these evaluation items. We propose identifying evaluation items based on business-dialogue system alignment models, which are applications of business-IT alignment models used in the development and operation of practical IT systems. We also present a generic model that facilitates the construction of a business-dialogue system alignment model for each dialogue system.

pdf bib
Speech-Controlled Smart Speaker for Accurate, Real-Time Health and Care Record Management
Jonathan E. Carrick | Nina Dethlefs | Lisa Greaves | Venkata M. V. Gunturi | Rameez Raja Kureshi | Yongqiang Cheng

To help alleviate the pressures felt by care workers, we have begun new research into improving the efficiency of care plan management by advancing recent developments in automatic speech recognition. Our novel approach adapts off-the-shelf tools in a purpose-built application for the speech domain, addressing challenges of accent adaptation, real-time processing, and speech hallucinations. We augment the speech-recognition scope of OpenAI’s Whisper model through fine-tuning, reducing word error rates (WERs) from 16.8 to 1.0 on a range of British dialects. We address the speech-hallucination side effect of adapting to real-time recognition by enforcing a signal-to-noise ratio threshold and audio stream checks, achieving a WER of 5.1, compared to 14.9 with Whisper’s original model. These ongoing research efforts tackle challenges that are necessary to build the speech-control basis for a custom smart speaker system that is both accurate and timely.
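The signal-to-noise gate described above is easy to picture in code. Below is a minimal sketch, assuming `model` exposes a Whisper-style `transcribe` method; the noise-floor power and 10 dB threshold are illustrative placeholders, not the authors’ settings.

```python
import numpy as np

def snr_db(chunk: np.ndarray, noise_floor_power: float) -> float:
    """Rough signal-to-noise estimate for one audio chunk, in decibels."""
    signal_power = float(np.mean(chunk.astype(np.float64) ** 2))
    return 10.0 * np.log10(max(signal_power, 1e-12) / max(noise_floor_power, 1e-12))

def transcribe_stream(chunks, model, noise_floor_power=1e-6, threshold_db=10.0):
    """Yield transcripts only for chunks that pass the SNR gate."""
    for chunk in chunks:
        if snr_db(chunk, noise_floor_power) < threshold_db:
            continue  # likely silence or noise: skip to avoid hallucinated text
        yield model.transcribe(chunk)["text"]
```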

pdf bib
Analysis of Voice Activity Detection Errors in API-based Streaming ASR for Human-Robot Dialogue
Kenta Yamamoto | Ryu Takeda | Kazunori Komatani

In human-robot dialogue systems, streaming automatic speech recognition (ASR) services (e.g., Google ASR) are often utilized, with the microphone positioned close to the robot’s loudspeaker. Under these conditions, both the robot’s and the user’s utterances are captured, resulting in frequent failures to detect user speech. This study analyzes voice activity detection (VAD) errors by comparing results from such streaming ASR to those from standalone VAD models. Experiments conducted on three distinct dialogue datasets showed that streaming ASR tends to ignore user utterances immediately following system utterances. We discuss the underlying causes of these VAD errors and provide recommendations for improving VAD performance in human-robot dialogue.

pdf bib
A Survey of Recent Advances on Turn-taking Modeling in Spoken Dialogue Systems
Galo Castillo-López | Gael de Chalendar | Nasredine Semmar

The rapid growth of dialogue systems adoption to serve humans in daily tasks has increased the realism expected from these systems. One trait of realism is the way speaking agents take their turns. We provide here a review of recent methods on turn-taking modeling and thoroughly describe the corpora used in these studies. We observe that 72% of the reviewed works in this survey do not compare their methods with previous efforts. We argue that one of the challenges in the field is the lack of well-established benchmarks to monitor progress. This work aims to provide the community with a better understanding of the current state of research around turn-taking modeling and future directions to build more realistic spoken conversational agents.

pdf bib
Integrating Respiration into Voice Activity Projection for Enhancing Turn-taking Performance
Takao Obi | Kotaro Funakoshi

Voice Activity Projection (VAP) models predict upcoming voice activities on a continuous timescale, enabling more nuanced turn-taking behaviors in spoken dialogue systems. Although previous studies have shown robust performance with audio-based VAP, the potential of incorporating additional physiological information, such as respiration, remains relatively unexplored. In this paper, we investigate whether respiratory information can enhance VAP performance in turn-taking. To this end, we collected Japanese dialogue data with synchronized audio and respiratory waveforms, and then we integrated the respiratory information into the VAP model. Our results showed that the VAP model combining audio and respiratory information had better performance than the audio-only model. This finding underscores the potential for improving the turn-taking performance of VAP by incorporating respiration.

pdf bib
DSLCMM: A Multimodal Human-Machine Dialogue Corpus Built through Competitions
Ryuichiro Higashinaka | Tetsuro Takahashi | Shinya Iizuka | Sota Horiuchi | Michimasa Inaba | Zhiyang Qi | Yuta Sasaki | Kotaro Funakoshi | Shoji Moriya | Shiki Sato | Takashi Minato | Kurima Sakai | Tomo Funayama | Masato Komuro | Hiroyuki Nishikawa | Ryosaku Makino | Hirofumi Kikuchi | Mayumi Usami

A corpus of dialogues between multimodal systems and humans is indispensable for the development and improvement of such systems. However, there is a shortage of human-machine multimodal dialogue datasets, which hinders the widespread deployment of these systems in society. To address this issue, we construct a Japanese multimodal human-machine dialogue corpus, DSLCMM, by collecting and organizing data from the Dialogue System Live Competitions (DSLCs). This paper details the procedure for constructing the corpus and presents our analysis of the relationship between various dialogue features and evaluation scores provided by users.

pdf bib
Cutting Through Overload: Efficient Token Dropping for Speech Emotion Recognition in Multimodal Large Language Models
Jaime Bellver-Soler | Mario Rodriguez-Cantelar | Ricardo Córdoba | Luis Fernando D’Haro

Recent developments in Multimodal Large Language Models (MLLMs) have provided novel insights into Speech Emotion Recognition (SER). However, combining high-dimensional speech signals with textual tokens can lead to a rapid growth in input tokens, increasing computational costs and inference times. This “token overload” also risks overshadowing essential textual cues, affecting the reasoning capabilities of the language model and diluting emotional information crucial to accurate SER. In this paper, we explore different token-dropping methods that mitigate excessive token counts while preserving both emotional nuances and the core linguistic capabilities of the model. Specifically, we compare various pooling approaches to produce a compact representation. Our preliminary findings suggest that these techniques can reduce computational costs without decreasing SER accuracy.
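As one illustration of the pooling approaches mentioned above, the sketch below average-pools consecutive speech-encoder frames before they reach the language model, cutting the token count by a fixed factor. The pool size of 4 is an assumption for illustration, not a setting from the paper.

```python
import torch

def pool_speech_tokens(frames: torch.Tensor, pool: int = 4) -> torch.Tensor:
    """Average-pool consecutive speech-encoder frames before the LLM.

    frames: (batch, time, dim) speech-encoder outputs.
    Returns a (batch, time // pool, dim) tensor, i.e. `pool`-times fewer tokens.
    """
    batch, time, dim = frames.shape
    usable = time - time % pool  # drop the ragged tail, if any
    frames = frames[:, :usable].reshape(batch, usable // pool, pool, dim)
    return frames.mean(dim=2)
```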

pdf bib
Integrating Conversational Entities and Dialogue Histories with Knowledge Graphs and Generative AI
Graham Wilcock | Kristiina Jokinen

Existing methods for storing dialogue history and for tracking mentioned entities in spoken dialogues usually handle these tasks separately. Recent advances in knowledge graphs and generative AI make it possible to integrate them in a framework with a uniform representation for dialogue management. This may help to build more natural and grounded dialogue models that can reduce misunderstanding and lead to more reliable dialogue-based interactions with AI agents. The paper describes ongoing work on this approach.

pdf bib
Enabling Trait-based Personality Simulation in Conversational LLM Agents: Case Study of Customer Assistance in French
Ahmed Njifenjou | Virgile Sucal | Bassam Jabaian | Fabrice Lefèvre

Among the numerous models developed to represent the multifaceted complexity of human personality, particularly in psychology, the Big Five (commonly referred to as ‘OCEAN’, an acronym of its five traits) stands out as a widely used framework. Although personalized chatbots have incorporated this model, existing approaches, such as focusing on individual traits or binary combinations, may not capture the full diversity of human personality. In this study, we propose a five-dimensional vector representation, where each axis corresponds to the degree of presence of an OCEAN trait on a continuous scale from 0 to 1. This representation is designed to enable greater versatility in modeling personality. Application to customer assistance scenarios in French demonstrates that, in human-bot as well as bot-bot conversations, the assigned personality vectors are distinguishable by both humans and LLMs acting as judges. Their subjective evaluations also confirm the measurable impact of the assigned personality on user experience, agent efficiency, and conversation quality.

pdf bib
Developing Classifiers for Affirmative and Negative User Responses with Limited Target Domain Data for Dialogue System Development Tools
Yunosuke Kubo | Ryo Yanagimoto | Mikio Nakano | Kenta Yamamoto | Ryu Takeda | Kazunori Komatani

We aim to develop a library for classifying affirmative and negative user responses, intended for integration into a dialogue system development toolkit. Such a library is expected to perform well even with minimal annotated target-domain data, addressing the practical challenge of preparing large datasets for each target domain. This short paper compares several approaches under conditions where little or no annotated data is available in the target domain. One approach involves fine-tuning a pre-trained BERT model, while the other utilizes a GPT API for zero-shot or few-shot learning. Since these approaches differ in execution speed, development effort, and execution costs, in addition to performance, the results serve as a basis for discussing an appropriate configuration suited to specific requirements. Additionally, we have released the training data and the fine-tuned BERT model for Japanese affirmative/negative classification.

pdf bib
Why Do We Laugh? Annotation and Taxonomy Generation for Laughable Contexts in Spontaneous Text Conversation
Koji Inoue | Mikey Elmers | Divesh Lala | Tatsuya Kawahara

Laughter serves as a multifaceted communicative signal in human interaction, yet its identification within dialogue presents a significant challenge for conversational AI systems. This study addresses this challenge by annotating laughable contexts in Japanese spontaneous text conversation data and developing a taxonomy to classify the underlying reasons for such contexts. Initially, multiple annotators manually labeled laughable contexts using a binary decision (laughable or non-laughable). Subsequently, an LLM was used to generate explanations for the binary annotations of laughable contexts, which were then categorized into a taxonomy comprising ten categories, including “Empathy and Affinity” and “Humor and Surprise,” highlighting the diverse range of laughter-inducing scenarios. The study also evaluated GPT-4o’s performance in recognizing the majority labels of laughable contexts, achieving an F1 score of 43.14%. These findings contribute to the advancement of conversational AI by establishing a foundation for more nuanced recognition and generation of laughter, ultimately fostering more natural and engaging human-AI interactions.

pdf bib
Adaptive Psychological Distance in Japanese Spoken Human-Agent Dialogue: A Politeness-Based Management Model
Akira Inaba | Emmanuel Ayedoun | Masataka Tokumaru

While existing spoken dialogue systems can adapt various aspects of interaction, systematic management of psychological distance through verbal politeness remains underexplored. Current approaches typically maintain fixed levels of formality and social distance, limiting naturalness in long-term human-agent interactions. We propose a novel dialogue management model that dynamically adjusts verbal politeness levels in Japanese based on user preferences. We evaluated the model using two pseudo-users with distinct distance preferences in daily conversations. Human observers (n=20) assessed the interactions, with 70% successfully distinguishing the intended social distance variations. The results demonstrate that systematic modulation of verbal politeness can create perceptibly different levels of psychological distance in spoken dialogue, with implications for culturally appropriate human-agent interaction in Japanese contexts.

pdf bib
An LLM Benchmark for Addressee Recognition in Multi-modal Multi-party Dialogue
Koji Inoue | Divesh Lala | Mikey Elmers | Keiko Ochi | Tatsuya Kawahara

Handling multi-party dialogues represents a significant step for advancing spoken dialogue systems, necessitating the development of tasks specific to multi-party interactions. To address this challenge, we are constructing a multi-modal multi-party dialogue corpus of triadic (three-participant) discussions. This paper focuses on the task of addressee recognition, identifying who is being addressed to take the next turn, a critical component unique to multi-party dialogue systems. A subset of the corpus was annotated with addressee information, revealing that explicit addressees are indicated in approximately 20% of conversational turns. To evaluate the task’s complexity, we benchmarked the performance of a large language model (GPT-4o) on addressee recognition. The results showed that GPT-4o achieved an accuracy only marginally above chance, underscoring the challenges of addressee recognition in multi-party dialogue. These findings highlight the need for further research to enhance the capabilities of large language models in understanding and navigating the intricacies of multi-party conversational dynamics.

pdf bib
Will AI shape the way we speak? The emerging sociolinguistic influence of synthetic voices
Eva Szekely | Jura Miniota | Míša (Michaela) Hejná

The growing prevalence of conversational voice interfaces, powered by developments in both speech and language technologies, raises important questions about their influence on human communication. While written communication can signal identity through lexical and stylistic choices, voice-based interactions inherently amplify socioindexical elements – such as accent, intonation, and speech style – which more prominently convey social identity and group affiliation. There is evidence that even passive media such as television are likely to influence audiences’ linguistic patterns. Unlike passive media, conversational AI is interactive, creating a more immersive and reciprocal dynamic that holds a greater potential to impact how individuals speak in everyday interactions. Such heightened influence can be expected to arise from phenomena such as acoustic-prosodic entrainment and linguistic accommodation, which occur naturally during interaction and enable users to adapt their speech patterns in response to the system. While this phenomenon is still emerging, its potential societal impact could provide organisations, movements, and brands with a subtle yet powerful avenue for shaping and controlling public perception and social identity. We argue that the socioindexical influence of AI-generated speech warrants attention and should become a focus of interdisciplinary research, leveraging new and existing methodologies and technologies to better understand its implications.

up

pdf (full)
bib (full)
Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing

pdf bib
Proceedings of the 4th International Workshop on Knowledge-Augmented Methods for Natural Language Processing
Weijia Shi | Wenhao Yu | Akari Asai | Meng Jiang | Greg Durrett | Hannaneh Hajishirzi | Luke Zettlemoyer

pdf bib
Entity Retrieval for Answering Entity-Centric Questions
Hassan Shavarani | Anoop Sarkar

The similarity between the question and indexed documents is a key factor in document retrieval for retrieval-augmented question answering. Although this is the typical method for obtaining relevant documents, it is not the sole approach when dealing with entity-centric questions. We study Entity Retrieval, an alternative retrieval method which, rather than relying on question-document similarity, depends on the salient entities within the question to identify the documents to retrieve. We conduct an in-depth analysis of the performance of both dense and sparse retrieval methods in comparison to Entity Retrieval. Our findings reveal the great potential of entity-driven methods for improving the retrieval of augmentation documents in both accuracy and efficiency.

pdf bib
ELECTRA and GPT-4o: Cost-Effective Partners for Sentiment Analysis
James P. Beno

Bidirectional transformers excel at sentiment analysis, and Large Language Models (LLM) are effective zero-shot learners. Might they perform better as a team? This paper explores collaborative approaches between ELECTRA and GPT-4o for three-way sentiment classification. We fine-tuned (FT) four models (ELECTRA Base/Large, GPT-4o/4o-mini) using a mix of reviews from Stanford Sentiment Treebank (SST) and DynaSent. We provided input from ELECTRA to GPT as: predicted label, probabilities, and retrieved examples. Sharing ELECTRA Base FT predictions with GPT-4o-mini significantly improved performance over either model alone (82.50 macro F1 vs. 79.14 ELECTRA Base FT, 79.41 GPT-4o-mini) and yielded the lowest cost/performance ratio ($0.12/F1 point). However, when GPT models were fine-tuned, including predictions decreased performance. GPT-4o FT-M was the top performer (86.99), with GPT-4o-mini FT close behind (86.70) at much less cost ($0.38 vs. $1.59/F1 point). Our results show that augmenting prompts with predictions from fine-tuned encoders is an efficient way to boost performance, and a fine-tuned GPT-4o-mini is nearly as good as GPT-4o FT at 76% less cost. Both are affordable options for projects with limited resources.
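The prompt-augmentation setup described above, sharing a fine-tuned encoder’s prediction with an LLM, can be pictured with a short sketch. The checkpoint name and prompt wording below are illustrative assumptions; the paper fine-tunes ELECTRA on SST and DynaSent and uses its own template.

```python
from transformers import pipeline

# Placeholder checkpoint: stands in for an ELECTRA model fine-tuned on
# SST + DynaSent three-way sentiment (hypothetical name).
clf = pipeline("text-classification", model="my-org/electra-sentiment-ft", top_k=None)

def build_prompt(review: str) -> str:
    scores = clf([review])[0]  # one list of {label, score} dicts per input
    best = max(scores, key=lambda s: s["score"])
    probs = ", ".join(f"{s['label']}: {s['score']:.2f}" for s in scores)
    return (
        "Classify the sentiment of the review as positive, neutral, or negative.\n"
        f"Review: {review}\n"
        f"A fine-tuned ELECTRA model predicts '{best['label']}' (probabilities: {probs}).\n"
        "Final label:"
    )
```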

pdf bib
Retrieval of Temporal Event Sequences from Textual Descriptions
Zefang Liu | Yinzhu Quan

Retrieving temporal event sequences from textual descriptions is crucial for applications such as analyzing e-commerce behavior, monitoring social media activities, and tracking criminal incidents. To advance this task, we introduce TESRBench, a comprehensive benchmark for temporal event sequence retrieval (TESR) from textual descriptions. TESRBench includes diverse real-world datasets with synthesized and reviewed textual descriptions, providing a strong foundation for evaluating retrieval performance and addressing challenges in this domain. Building on this benchmark, we propose TPP-Embedding, a novel model for embedding and retrieving event sequences. The model leverages the TPP-LLM framework, integrating large language models (LLMs) with temporal point processes (TPPs) to encode both event texts and times. By pooling representations and applying a contrastive loss, it unifies temporal dynamics and event semantics in a shared embedding space, aligning sequence-level embeddings of event sequences and their descriptions. TPP-Embedding demonstrates superior performance over baseline models across TESRBench datasets, establishing it as a powerful solution for the temporal event sequence retrieval task.

pdf bib
Generating Tables from the Parametric Knowledge of Language Models
Yevgeni Berkovitch | Oren Glickman | Amit Somech | Tomer Wolfson

We explore generating factual tables from the parametric knowledge of large language models (LLMs). While LLMs have demonstrated impressive capabilities in recreating knowledge bases and generating free-form text, their ability to generate structured tabular data has received little attention. To address this gap, we explore the table generation abilities of eight state-of-the-art LLMs, including GPT-4o and Llama3.1-405B, using three prompting methods: full-table, row-by-row, and cell-by-cell. To facilitate evaluation, we introduce WikiTabGen, a new benchmark consisting of 119 manually curated Wikipedia tables and their descriptions. Our findings show that table generation remains challenging, with the best-performing model (LLaMA3.1-405B) reaching only 25.4% accuracy. We further analyze how properties like table size, popularity, and numerical content impact performance. This study highlights the unique challenges of LLM-based table generation and offers a foundation for future research in this area. All code, data, and prompts are publicly available.

pdf bib
Investigating Large Language Models for Text-to-SPARQL Generation
Jacopo D’Abramo | Andrea Zugarini | Paolo Torroni

Large Language Models (LLMs) have demonstrated strong capabilities in code generation, such as translating natural language questions into SQL queries. However, state-of-the-art solutions often involve a costly fine-tuning step. In this study, we extensively evaluate In-Context Learning (ICL) solutions for text-to-SPARQL generation with different architectures and configurations, based on methods for retrieving relevant demonstrations for few-shot prompting and working with multiple generated hypotheses. In this way, we demonstrate that LLMs can formulate SPARQL queries achieving state-of-the-art results on several Knowledge Graph Question Answering (KGQA) benchmark datasets without fine-tuning.
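A common way to realize the ICL setup described above is to retrieve the most similar question-SPARQL demonstrations and prepend them to the prompt. The sketch below assumes a sentence-transformers encoder and an illustrative prompt format; the paper’s actual retrieval configuration may differ.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works

def build_prompt(question: str, demo_bank: list[tuple[str, str]], k: int = 3) -> str:
    """demo_bank: (natural-language question, gold SPARQL query) pairs."""
    q_emb = encoder.encode(question, convert_to_tensor=True)
    d_embs = encoder.encode([q for q, _ in demo_bank], convert_to_tensor=True)
    hits = util.semantic_search(q_emb, d_embs, top_k=k)[0]  # most similar demos
    shots = "\n\n".join(
        f"Question: {demo_bank[h['corpus_id']][0]}\nSPARQL: {demo_bank[h['corpus_id']][1]}"
        for h in hits
    )
    return f"{shots}\n\nQuestion: {question}\nSPARQL:"
```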

pdf bib
GAVEL: Generative Attribute-Value Extraction Using LLMs on LLM-Augmented Datasets
Pollawat Hongwimol | Dong Sheng | Li Zhang | Kai Liu | Xiufei Wang

In the evolving e-commerce landscape, accurate product attribute-value extraction is crucial for enhancing user experience and increasing sales. This paper introduces GAVEL, a generative approach leveraging large language models (LLMs) to augment training data for attribute extraction from diverse textual sources. Our method extracts over 1,000 unique attributes across 2,000 product categories in multiple Southeast Asian languages, including Thai, Vietnamese, and Indonesian. Rigorous evaluations show significant improvements in accuracy and coverage compared to seller-provided attributes, with enhanced recall and F1 scores. Additionally, GAVEL reduces operational costs by minimizing instruction token usage and improves inference speed. The results of the A/B testing indicate that our model has a positive impact on Gross Merchandise Value (GMV) per page view (PV) across all three operating countries. This research highlights the potential of generative techniques for optimizing attribute extraction in multi-language e-commerce applications.

pdf bib
Leveraging Domain Knowledge at Inference Time for LLM Translation: Retrieval versus Generation
Bryan Li | Jiaming Luo | Eleftheria Briakou | Colin Cherry

While large language models (LLMs) have been increasingly adopted for machine translation (MT), their performance for specialist domains such as medicine and law remains an open challenge. Prior work has shown that LLMs can be domain-adapted at test-time by retrieving targeted few-shot demonstrations or terminologies for inclusion in the prompt. Meanwhile, for general-purpose LLM MT, recent studies have found some success in generating similarly useful domain knowledge from an LLM itself, prior to translation. Our work studies domain-adapted MT with LLMs through a careful prompting setup, finding that demonstrations consistently outperform terminology, and retrieval consistently outperforms generation. We find that generating demonstrations with weaker models can close the gap with larger model’s zero-shot performance. Given the effectiveness of demonstrations, we perform detailed analyses to understand their value. We find that domain-specificity is particularly important, and that the popular multi-domain benchmark is testing adaptation to a particular writing style more so than to a specific domain.

pdf bib
Enhancing Cross-Language Code Translation via Task-Specific Embedding Alignment in Retrieval-Augmented Generation
Manish Bhattarai | Minh N. Vu | Javier E. Santos | Ismael Ismael | Daniel O’Malley

We introduce a novel method to enhance cross-language code translation from Fortran to C++ by integrating task-specific embedding alignment into a Retrieval-Augmented Generation (RAG) framework. Unlike conventional retrieval approaches that utilize generic embeddings agnostic to the downstream task, our strategy aligns the retrieval model directly with the objective of maximizing translation quality, as quantified by the CodeBLEU metric. This alignment ensures that the embeddings are semantically and syntactically meaningful for the specific code translation task. Our methodology involves constructing a dataset of 25,000 Fortran code snippets sourced from the Stack-V2 dataset and generating their corresponding C++ translations using the LLaMA 3.1-8B language model. We compute pairwise CodeBLEU scores between the generated translations and ground truth examples to capture fine-grained similarities. These scores serve as supervision signals in a contrastive learning framework, where we optimize the embedding model to retrieve Fortran-C++ pairs that are most beneficial for improving the language model’s translation performance. By integrating these CodeBLEU-optimized embeddings into the RAG framework, our approach significantly enhances both retrieval accuracy and code generation quality over methods employing generic embeddings. On the HPC Fortran2C++ dataset, our method elevates the average CodeBLEU score from 0.64 to 0.73, achieving a 14% relative improvement. On the Numerical Recipes dataset, we observe an increase from 0.52 to 0.60, marking a 15% relative improvement. Importantly, these gains are realized without any fine-tuning of the language model, underscoring the efficiency and practicality of our approach.
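The core of the alignment step described above, using CodeBLEU scores as supervision in a contrastive objective, can be sketched compactly. The function name, temperature, and soft-target construction below are illustrative assumptions, not the authors’ exact formulation.

```python
import torch
import torch.nn.functional as F

def codebleu_weighted_contrastive(query_emb: torch.Tensor,
                                  cand_embs: torch.Tensor,
                                  codebleu: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """query_emb: (d,) embedding of a Fortran snippet.
    cand_embs: (n, d) embeddings of candidate Fortran-C++ demonstration pairs.
    codebleu: (n,) CodeBLEU achieved when each candidate guided the translation.
    """
    sims = F.cosine_similarity(query_emb.unsqueeze(0), cand_embs) / temperature
    # Candidates that led to better translations become soft retrieval targets.
    target = F.softmax(codebleu / codebleu.std().clamp_min(1e-6), dim=0)
    return F.cross_entropy(sims.unsqueeze(0), target.unsqueeze(0))
```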

pdf bib
LLM Reasoning Engine: Specialized Training for Enhanced Mathematical Reasoning
Shuguang Chen | Guang Lin

Large Language Models (LLMs) have shown remarkable performance in various natural language processing tasks but face challenges in mathematical reasoning, where complex problem-solving requires both linguistic understanding and mathematical reasoning skills. Existing approaches to address this challenge often rely on ensemble methods and suffer from the problem of data scarcity in target domains. In this work, we present a novel method to enhance the capabilities of LLMs in mathematical reasoning tasks. Motivated by the need to bridge this gap, our approach incorporates a question paraphrase strategy, which aims to diversify the linguistic forms of mathematical questions to improve generalization. Additionally, specialized training objectives are employed to guide the model’s learning process, focusing on enhancing its understanding of mathematical concepts and reasoning processes. We conduct experiments on four datasets using different LLMs, and demonstrate the effectiveness of our approach in improving LLMs’ performance on mathematical reasoning tasks. Our findings underscore the significance of our methodology in advancing large language models and their potential implications for real-world applications that require mathematical reasoning abilities.

pdf bib
RouteNator: A Router-Based Multi-Modal Architecture for Generating Synthetic Training Data for Function Calling LLMs
Dewang Sultania | Vibha Belavadi | Tushar Vatsa | Suhas Suresha | Ishita Verma | Tracy Holloway King | Mifriedr Mifriedr | Cheng Chen

This paper addresses fine-tuning Large Language Models (LLMs) for function calling tasks when real user interaction data is unavailable. In digital content creation tools, where users express their needs through natural language queries that must be mapped to API calls, the lack of real-world task-specific data and privacy constraints for training on it necessitate synthetic data generation. Existing approaches to synthetic data generation fall short in diversity and complexity, failing to replicate real-world data distributions and leading to suboptimal performance after LLM fine-tuning. We present a novel router-based architecture that leverages domain resources like content metadata and structured knowledge graphs, along with text-to-text and vision-to-text language models to generate high-quality synthetic training data. Our architecture’s flexible routing mechanism enables synthetic data generation that matches observed real-world distributions, addressing a fundamental limitation of traditional approaches. Evaluation on a comprehensive set of real user queries demonstrates significant improvements in both function classification accuracy and API parameter selection. Models fine-tuned with our synthetic data consistently outperform traditional approaches, establishing new benchmarks for function calling tasks.

pdf bib
StoC-TOT: Stochastic Tree-of-Thought with Constrained Decoding for Complex Reasoning in Multi-Hop Question Answering
Zhenyu Bi | Daniel Hajialigol | Zhongkai Sun | Jie Hao | Xuan Wang

Multi-hop question answering (MHQA) requires a model to retrieve and integrate information from multiple passages to answer a complex question. Recent systems leverage the power of large language models and integrate evidence retrieval with reasoning prompts (e.g., chain-of-thought reasoning) for the MHQA task. However, the complexities in the question types (bridge vs. comparison questions) and the reasoning types (sequential vs. parallel reasoning) require more novel and fine-grained prompting methods to enhance the performance of MHQA under the zero-shot setting. In this paper, we propose StoC-ToT, a stochastic tree-of-thought reasoning prompting method with constrained decoding for MHQA, and conduct a detailed comparison with other reasoning prompts on different question types and reasoning types. Specifically, we construct a tree-like reasoning structure by prompting the model to break down the original question into smaller sub-questions to form different reasoning paths. In addition, we prompt the model to provide a probability estimation for each reasoning path at each reasoning step. At answer time, we conduct constrained decoding on the model to generate more grounded answers and reduce hallucination. Experiments comparing StoC-ToT with other reasoning prompts on two MHQA datasets and five large language models showed that StoC-ToT outperforms them by a significant margin.
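The tree construction with per-path probability estimates described above can be sketched as a beam search over sampled sub-questions. Everything here, from `ask_llm` to the prompts, depth, and beam width, is a hypothetical stand-in rather than the paper’s implementation (which additionally applies constrained decoding at answer time).

```python
import heapq

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in any chat-completion client here")

def stoc_tot(question: str, depth: int = 3, beam: int = 2, samples: int = 3) -> list[str]:
    frontier = [(1.0, [question])]  # (path probability, reasoning path so far)
    for _ in range(depth):
        expanded = []
        for prob, path in frontier:
            for _ in range(samples):  # stochastically sample alternative sub-questions
                step = ask_llm("Propose the next sub-question for:\n" + "\n".join(path))
                score = float(ask_llm(f"Rate from 0 to 1 how promising this step is: {step}"))
                expanded.append((prob * score, path + [step]))
        frontier = heapq.nlargest(beam, expanded, key=lambda e: e[0])  # keep best paths
    return max(frontier, key=lambda e: e[0])[1]
```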

pdf bib
EKRAG: Benchmark RAG for Enterprise Knowledge Question Answering
Tan Yu | Wenfei Zhou | Leiyang Leiyang | Aaditya Shukla | Mmadugula Mmadugula | Pritam Gundecha | Nicholas Burnett | Anbang Xu | Viseth Viseth | Tbar Tbar | Rama Akkiraju | Vivienne Zhang

Retrieval-augmented generation (RAG) offers a robust solution for developing enterprise internal virtual assistants by leveraging domain-specific knowledge and utilizing information from frequently updated corporate document repositories. In this work, we introduce the Enterprise-Knowledge RAG (EKRAG) dataset to benchmark RAG for enterprise knowledge question-answering (QA) across a diverse range of corporate documents, such as product releases, technical blogs, and financial reports. Using EKRAG, we systematically evaluate various retrieval models and strategies tailored for corporate content. We propose novel embedding-model (EM)-as-judge and ranking-model (RM)-as-judge approaches to assess answer quality in the context of enterprise information. Combining these with the existing LLM-as-judge method, we then comprehensively evaluate the correctness, relevance, and faithfulness of generated answers to corporate queries. Our extensive experiments shed light on optimizing RAG pipelines for enterprise knowledge QA, providing valuable guidance for practitioners. This work contributes to enhancing information retrieval and question-answering capabilities in corporate environments that demand high degrees of factuality and context-awareness.

pdf bib
Towards Effectively Leveraging Execution Traces for Program Repair with Code LLMs
Mirazul Haque | Petr Babkin | Farima Farmahinifarahani | Manuela Veloso

Large Language Models (LLMs) show promising performance on various programming tasks, including Automatic Program Repair (APR). However, most approaches to LLM-based APR are limited to the static analysis of the programs, while disregarding their runtime behavior. Inspired by knowledge-augmented NLP, in this work, we aim to remedy this potential blind spot by augmenting standard APR prompts with program execution traces. We evaluate our approach using the GPT family of models on three popular APR datasets. Our findings suggest that simply incorporating execution traces into the prompt provides a limited performance improvement over trace-free baselines, in only 2 out of 6 tested dataset/model configurations. We further find that the effectiveness of execution traces for APR diminishes as their complexity increases. We explore several strategies for leveraging traces in prompts and demonstrate that LLM-optimized prompts help outperform trace-free prompts more consistently. Additionally, we show trace-based prompting to be superior to fine-tuning a smaller LLM on a small-scale dataset, and conduct probing studies reinforcing the notion that execution traces can complement the reasoning abilities of the LLMs.

pdf bib
A Novel Multi-Document Retrieval Benchmark: Journalist Source-Selection in Newswriting
Alexander Spangher | Tenghao Huang | Yiqin Huang | Lucas Spangher | Sewon Min | Mark Dredze

Multi-document retrieval approaches often overlook the ways different retrievals complement each other when addressing complex queries. In this work, we study journalist source selection in news article writing and examine the discourse roles that different sources serve when paired together, finding that discourse function (not simply informational content) is an important component of source usage. Then, we introduce a novel IR task to benchmark how well language models can reason about this narrative process. We extract a journalist’s initial query and the sources they used from news articles and aim to recover the sources that support this query. We demonstrate that large language models (LLMs) can be employed in multi-step query planning, identifying informational gaps and enhancing retrieval performance, but current approaches to interleave queries fall short. By training auxiliary discourse planners and incorporating this information into LLMs, we enhance query planning, achieving a significant 5% improvement in precision and a 2% increase in F1 score over the previous SOTA, all while maintaining recall.

pdf bib
HEAL: Hierarchical Embedding Alignment Loss for Improved Retrieval and Representation Learning
Manish Bhattarai | Ryan Barron | Maksim E. Eren | Minh N. Vu | Vesselin Grantcharov | Ismael Ismael | Valentin Stanev | Cynthia Matuszek | Vladimir I Valtchinov | Kim Rasmussen | Boian S. Alexandrov

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by integrating external document retrieval to provide domain-specific or up-to-date knowledge. The effectiveness of RAG depends on the relevance of retrieved documents, which is influenced by the semantic alignment of embeddings with the domain’s specialized content. Although full fine-tuning can align language models to specific domains, it is computationally intensive and demands substantial data. This paper introduces Hierarchical Embedding Alignment Loss (HEAL), a novel method that leverages hierarchical fuzzy clustering with matrix factorization within contrastive learning to efficiently align LLM embeddings with domain-specific content. HEAL computes level/depth-wise contrastive losses and incorporates hierarchical penalties to align embeddings with the underlying relationships in label hierarchies. This approach enhances retrieval relevance and document classification, effectively reducing hallucinations in LLM outputs. In our experiments, we benchmark and evaluate HEAL across diverse domains, including Healthcare, Material Science, Cyber-security, and Applied Maths.
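One way to picture the level-wise contrastive losses with hierarchical penalties described above is the sketch below: one supervised-contrastive term per level of the label hierarchy, with deeper (finer) levels weighted more heavily. The linear depth weighting and temperature are illustrative assumptions, not HEAL’s exact formulation.

```python
import torch
import torch.nn.functional as F

def hierarchical_contrastive_loss(emb: torch.Tensor,
                                  labels_per_level: list[torch.Tensor],
                                  temperature: float = 0.1) -> torch.Tensor:
    """emb: (n, d) L2-normalized embeddings.
    labels_per_level: (n,) label tensors ordered coarse -> fine."""
    eye = torch.eye(emb.size(0), dtype=torch.bool, device=emb.device)
    sims = (emb @ emb.T / temperature).masked_fill(eye, float("-inf"))
    log_prob = F.log_softmax(sims, dim=1).masked_fill(eye, 0.0)  # avoid 0 * -inf
    loss = emb.new_zeros(())
    for depth, labels in enumerate(labels_per_level, start=1):
        pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
        pos.masked_fill_(eye, 0.0)  # an anchor is not its own positive
        per_anchor = -(pos * log_prob).sum(1) / pos.sum(1).clamp_min(1.0)
        loss = loss + depth * per_anchor.mean()  # penalize finer levels more
    return loss
```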

pdf bib
Hybrid AI for Responsive Multi-Turn Online Conversations with Novel Dynamic Routing and Feedback Adaptation
Priyaranjan Pattnayak | Amit Agarwal | Hansa Meghwani | Hitesh Laxmichand Patel | Srikant Panda

Retrieval-Augmented Generation (RAG) systems and large language model (LLM)-powered chatbots have significantly advanced conversational AI by combining generative capabilities with external knowledge retrieval. Despite their success, enterprise-scale deployments face critical challenges, including diverse user queries, high latency, hallucinations, and difficulty integrating frequently updated domain-specific knowledge. This paper introduces a novel hybrid framework that integrates RAG with intent-based canned responses, leveraging predefined high-confidence responses for efficiency while dynamically routing complex or ambiguous queries to the RAG pipeline. Our framework employs a dialogue context manager to ensure coherence in multi-turn interactions and incorporates a feedback loop to refine intents, dynamically adjust confidence thresholds, and expand response coverage over time. Experimental results demonstrate that the proposed framework achieves a balance of high accuracy (95%) and low latency (180ms), outperforming RAG and intent-based systems across diverse query types, positioning it as a scalable and adaptive solution for enterprise conversational AI applications.
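The routing logic described above reduces to a confidence-gated dispatch between canned responses and the RAG pipeline. In the sketch below, the intent classifier, threshold, and canned-response table are placeholders for illustration; the paper’s framework additionally refines intents and thresholds from feedback over time.

```python
CANNED = {
    "reset_password": "You can reset your password from the account settings page.",
    "business_hours": "Our support desk is open 9am-5pm, Monday to Friday.",
}

def classify_intent(query: str) -> tuple[str, float]:
    raise NotImplementedError("any intent classifier returning (label, confidence)")

def rag_pipeline(query: str, history: list[str]) -> str:
    raise NotImplementedError("standard retrieve-then-generate pipeline")

def answer(query: str, history: list[str], threshold: float = 0.85) -> str:
    intent, confidence = classify_intent(query)
    if confidence >= threshold and intent in CANNED:
        return CANNED[intent]            # fast path: predefined high-confidence reply
    return rag_pipeline(query, history)  # slow path: full RAG for complex queries
```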

pdf bib
Chain of Evidences and Evidence to Generate: Prompting for Context Grounded and Retrieval Augmented Reasoning
Md Rizwan Parvez

While chain-of-thought (CoT) prompting has revolutionized how LLMs perform reasoning tasks, its current methods and variations (e.g., Self-consistency, ReACT, Reflexion, Tree-of-Thoughts (ToT), Cumulative Reasoning (CR)) suffer from limitations like limited context grounding, hallucination/inconsistent output generation, and iterative sluggishness. To overcome these challenges, we introduce a novel mono/dual-step zero-shot prompting framework built upon two unique strategies: Chain of Evidences (CoE) and Evidence to Generate (E2G). Instead of unverified reasoning claims, our innovative approaches leverage the power of “evidence for decision making” by first focusing exclusively on the thought sequences explicitly mentioned in the context, which then serve as extracted evidence, guiding the LLM’s output generation process with greater precision and efficiency. This simple yet potent approach unlocks the full potential of chain-of-thought prompting, facilitating faster, more reliable, and contextually aware reasoning in LLMs. Our framework consistently achieves remarkable results across various knowledge-intensive reasoning and generation tasks, surpassing baseline approaches with state-of-the-art LLMs. For instance, (i) on the LogiQA benchmark using GPT-4, CoE achieves a new state-of-the-art accuracy of 53.8%, surpassing CoT by 18%, ToT by 11%, and CR by 9%; (ii) CoE with PaLM-2 outperforms the variable-shot performance of Gemini Ultra by 0.9 F1 points, achieving an F1 score of 83.3 on DROP. We release our prompts and outputs on these benchmarks as a new instruction tuning dataset for future research at Hugging Face.

pdf bib
Expertly Informed, Generatively Summarized: A Hybrid RAG Approach to Informed Consent Summarization with Auxiliary Expert Knowledge
Autumn Toney-Wails | Ryan Wails | Caleb Smith

The utility of retrieval augmented generation (RAG) systems is actively being explored across a wide range of domains. Reliable generative output is increasingly useful in fields where routine tasks can be streamlined and potentially improved by integrating domain-specific data in addition to individual expert knowledge, such as medical care. To that end, we present a hybrid RAG and GraphRAG user interface system to summarize the key information (KI) section in IRB informed consent documents. KI summaries are a unique task, as generative summarization helps the end user (clinical trial expert) but can pose a risk to the affected user (potential study participants) if inaccurately constructed. Thus, the KI summarization task requires reliable, structured output with input from an expert knowledge source outside of the informed consent document. Reviewed by IRB domain experts and clinical trial PIs, our summarization application produces accurate summaries (70% to 100%, varying by accuracy type) and useful ones (63% of PIs stated the summaries were as good as or better than their accepted summaries).

pdf bib
MSR2: A Benchmark for Multi-Source Retrieval and Reasoning in Visual Question Answering
Kuo-Han Hung | Hung-Chieh Fang | Chao-Wei Huang | Yun-Nung Chen

This paper introduces MSR2, a benchmark for multi-source retrieval and reasoning in visual question answering. Unlike previous knowledge-based visual question answering datasets, MSR2 focuses on questions involving multiple fine-grained entities, providing a unique opportunity to assess a model’s spatial reasoning ability and its capacity to retrieve and aggregate information from various sources for different entities. Through comprehensive evaluation using MSR2, we gain valuable insights into the capabilities and limitations of state-of-the-art large vision-language models (LVLMs). Our findings reveal that even state-of-the-art LVLMs struggle with questions requiring multi-entity and knowledge-intensive reasoning, highlighting important new directions for future research. Additionally, we demonstrate that enhanced visual entity recognition and knowledge retrieval can significantly improve performance on MSR2, pinpointing key areas for advancement.

pdf bib
PROPEL: Prompt Optimization with Expert Priors for Small and Medium-sized LLMs
Kawin Mayilvaghanan | Varun Nathan | Ayush Kumar

pdf bib
ClaimCheck: Automatic Fact-Checking of Textual Claims using Web Evidence
Akshith Reddy Putta | Jacob Devasier | Chengkai Li

We introduce ClaimCheck, an efficient fact-checking system that verifies textual claims using smaller, open-source large language models. ClaimCheck integrates two fact-checking strategies, claim-matching and novel claim processing. Claim-matching uses related fact-checks from trusted organizations to fact-check a claim. Novel claim processing breaks down fact-checking into manageable subtasks—generating targeted questions, retrieving Web evidence, extracting answers, and synthesizing verdicts. Evaluation on the AVeriTeC benchmark demonstrates 62.6% verdict prediction accuracy, with claim-matching providing a 2.8% improvement. ClaimCheck approaches the performance of state-of-the-art systems while requiring significantly fewer computational resources, demonstrating the effectiveness of using small language models for fact-checking tasks. Furthermore, our code is publicly available to help make automated fact-checking more accessible.
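The novel-claim processing pipeline described above decomposes cleanly into its four subtasks. The sketch below is a minimal illustration with hypothetical `ask_llm` and `search_web` stand-ins; ClaimCheck’s actual prompts, retriever, and verdict labels follow the AVeriTeC setup.

```python
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("any small open-source LLM client")

def search_web(query: str, k: int = 5) -> list[str]:
    raise NotImplementedError("any Web-evidence retriever")

def fact_check(claim: str) -> str:
    # 1) Generate targeted questions for the claim.
    questions = ask_llm(f"List three questions whose answers would verify: {claim}").splitlines()
    findings = []
    for q in questions:
        # 2) Retrieve Web evidence and 3) extract an answer grounded in it.
        evidence = "\n".join(search_web(q))
        answer = ask_llm(f"Answer '{q}' using only this evidence:\n{evidence}")
        findings.append(f"Q: {q}\nA: {answer}")
    # 4) Synthesize a verdict from the question-answer pairs.
    return ask_llm(f"Claim: {claim}\nFindings:\n" + "\n".join(findings)
                   + "\nVerdict (Supported / Refuted / Not enough evidence):")
```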

pdf bib
Can dependency parses facilitate generalization in language models? A case study of cross-lingual relation extraction
Ritam Dutt | Shounak Sural | Carolyn Rose

In this work, we propose DEPGEN, a framework for evaluating the generalization capabilities of language models on the task of relation extraction, with dependency parses as scaffolds. We use a GNN-based framework that takes dependency parses as input and learns embeddings of entities which are augmented to a baseline multilingual encoder. We also investigate the role of dependency parses when they are included as part of the prompt to LLMs in a zero-shot learning setup. We observe that including off-the-shelf dependency parses can aid relation extraction, with the best performing model having a mild relative improvement of 0.91% and 1.5% in the in-domain and zero-shot setting respectively across two datasets. For the in-context learning setup, we observe an average improvement of 1.67%, with significant gains for low-performing LLMs. We also carry out extensive statistical analysis to investigate how different factors such as the choice of the dependency parser or the nature of the prompt impact performance. We make our code and results publicly available for the research community at https://github.com/ShoRit/multilingual-re.git.

pdf bib
DocBench: A Benchmark for Evaluating LLM-based Document Reading Systems
Anni Zou | Wenhao Yu | Hongming Zhang | Kaixin Ma | Deng Cai | Zhuosheng Zhang | Hai Zhao | Dong Yu

Recent advancements in proprietary large language models (LLMs), such as those from OpenAI and Anthropic, have led to the development of document reading systems capable of handling raw files with complex layouts, intricate formatting, lengthy content, and multi-modal information. However, the absence of a standardized benchmark hinders objective evaluation of these systems. To address this gap, we introduce DocBench, a benchmark designed to simulate real-world scenarios, where each raw file consists of a document paired with one or more questions. DocBench uniquely evaluates entire document reading systems and adopts a user-centric approach, allowing users to identify the system best suited to their needs.

up

pdf (full)
bib (full)
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)

pdf bib
Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025)
Anna Kazantseva | Stan Szpakowicz | Stefania Degaetano-Ortlieb | Yuri Bizzoni | Janis Pagel

pdf bib
Matching and Linking Entries in Historical Swedish Encyclopedias
Simon Börjesson | Erik Ersmark | Pierre Nugues

The Nordisk familjebok is a Swedish encyclopedia from the 19th and 20th centuries. It was written by a team of experts and aimed to be an intellectual reference, stressing precision and accuracy. This encyclopedia had four main editions, remarkable for their size, ranging from 20 to 38 volumes. As a consequence, the Nordisk familjebok had a considerable influence in universities, schools, the media, and society overall. As new editions were released, the selection of entries and their content evolved, reflecting intellectual changes in Sweden. In this paper, we used digitized versions from Project Runeberg. We first resegmented the raw text into entries and matched pairs of entries between the first and second editions using semantic sentence embeddings. We then extracted the geographical entries from both editions using a transformer-based classifier and linked them to Wikidata. This enabled us to identify geographic trends and possible shifts between the first and second editions, written between 1876–1899 and 1904–1926, respectively. Interpreting the results, we observe a small but significant shift in geographic focus away from Europe and towards North America, Africa, Asia, Australia, and northern Scandinavia from the first to the second edition, confirming the influence of the First World War and the rise of new powers. The code and data are available on GitHub at https://github.com/sibbo/nordisk-familjebok.

pdf bib
Preserving Comorian Linguistic Heritage: Bidirectional Transliteration Between the Latin Alphabet and the Kamar-Eddine System
Abdou Mohamed Naira | Abdessalam Bahafid | Zakarya Erraji | Anass Allak | Mohamed Soibira Naoufal | Imade Benelallam

The Comoros Islands, rich in linguistic diversity, are home to dialects derived from Swahili and influenced by Arabic. Historically, the Kamar-Eddine system, based on the Arabic alphabet, was one of the first writing systems used for Comorian. However, it has gradually been replaced by the Latin alphabet, even though numerous archival texts are written in this system, and older speakers continue to use it, highlighting its cultural and historical significance. In this article, we present Shialifube, a bidirectional transliteration tool between Latin and Arabic scripts, designed in accordance with the rules of the Kamar-Eddine system. To evaluate its performance, we applied a round-trip transliteration technique, achieving a word error rate of 14.84% and a character error rate of 9.56%. These results demonstrate the reliability of our system for complex tasks. Furthermore, Shialifube was tested in a practical case related to speech recognition, showcasing its potential in Natural Language Processing. This project serves as a bridge between tradition and modernity, contributing to the preservation of Comorian linguistic heritage while paving the way for better integration of local dialects into advanced technologies.
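
The round-trip evaluation can be reproduced in outline as follows; to_kamar_eddine and to_latin stand in for Shialifube's actual transliteration API, which we do not reproduce here:

def levenshtein(a, b):
    # Edit distance between two sequences (characters or word lists).
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = curr
    return prev[-1]

def round_trip_rates(sentences, to_kamar_eddine, to_latin):
    char_errs = word_errs = chars = words = 0
    for s in sentences:
        back = to_latin(to_kamar_eddine(s))   # Latin -> Arabic script -> Latin
        char_errs += levenshtein(s, back)
        chars += len(s)
        word_errs += levenshtein(s.split(), back.split())
        words += len(s.split())
    return word_errs / words, char_errs / chars  # (WER, CER)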

pdf bib
LLM-based Adversarial Dataset Augmentation for Automatic Media Bias Detection
Martin Wessel

This study presents BiasAdapt, a novel data augmentation strategy designed to enhance the robustness of automatic media bias detection models. Leveraging the BABE dataset, BiasAdapt uses a generative language model to identify bias-indicative keywords and replace them with alternatives from opposing categories, thus creating adversarial examples that preserve the original bias labels. The contributions of this work are twofold: it proposes a scalable method for augmenting bias datasets with adversarial examples while preserving labels, and it publicly releases an augmented adversarial media bias dataset. Training on BiasAdapt reduces the reliance on spurious cues in four of the six evaluated media bias categories.

pdf bib
HieroLM: Egyptian Hieroglyph Recovery with Next Word Prediction Language Model
Xuheng Cai | Erica Zhang

Egyptian hieroglyphs are found on numerous ancient Egyptian artifacts, but it is common that they are blurry or even missing due to erosion. Existing efforts to restore blurry hieroglyphs adopt computer vision techniques such as CNNs and model hieroglyph recovery as an image classification task, which suffers from two major limitations: (i) They cannot handle severely damaged or completely missing hieroglyphs. (ii) They make predictions based on a single hieroglyph without considering contextual and grammatical information. This paper proposes a novel approach to model hieroglyph recovery as a next word prediction task and use language models to address it. We compare the performance of different SOTA language models and choose LSTM as the architecture of our HieroLM due to the strong local affinity of semantics in Egyptian hieroglyph texts. Experiments show that HieroLM achieves over 44% accuracy and maintains notable performance on multi-shot predictions and scarce data, which makes it a pragmatic tool to assist scholars in inferring missing hieroglyphs. It can also complement CV-based models to significantly reduce perplexity in recognizing blurry hieroglyphs. Our code is available at https://github.com/Rick-Cai/HieroLM/.
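
A minimal PyTorch sketch of the underlying formulation, with hieroglyphs treated as tokens and the model predicting the next sign from its left context; vocabulary and layer sizes are illustrative, not HieroLM's actual configuration:

import torch
import torch.nn as nn

class NextSignLM(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, token_ids):              # (batch, seq_len)
        states, _ = self.lstm(self.embed(token_ids))
        return self.out(states)                # logits over the sign vocabulary

model = NextSignLM(vocab_size=1000)
context = torch.randint(0, 1000, (1, 12))      # a 12-sign context
logits = model(context)
predicted = logits[0, -1].argmax().item()      # most likely missing sign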

pdf bib
Evaluating LLM-Prompting for Sequence Labeling Tasks in Computational Literary Studies
Axel Pichler | Janis Pagel | Nils Reiter

Prompt engineering holds promise for computational literary studies (CLS): obtaining high-quality markup for literary research questions by simply prompting large language models with natural language strings. We test prompt engineering’s validity for two CLS sequence labeling tasks under the following aspects: (i) how generalizable are the results of identical prompts on different dataset splits?, (ii) how robust are performance results when re-formulating the prompts?, and (iii) how generalizable are certain fixed phrases added to the prompts that are generally considered to increase performance? We find that results are sensitive to data splits and prompt formulation, while the addition of fixed phrases does not change performance in most cases, depending on the chosen model.

pdf bib
Generation of Russian Poetry of Different Genres and Styles Using Neural Networks with Character-Level Tokenization
Ilya Koziev | Alena Fenogenova

Automatic poetry generation is an immensely complex task, even for the most advanced Large Language Models (LLMs): it requires a profound understanding of intelligence, world and linguistic knowledge, and a touch of creativity. This paper investigates the use of LLMs in generating Russian syllabo-tonic poetry of various genres and styles. The study explores character-level tokenization architectures and demonstrates how a language model can be pretrained and finetuned to generate poetry requiring knowledge of a language’s phonetics. Additionally, the paper assesses the quality of the generated poetry and the effectiveness of the approach in producing different genres and styles. The study’s main contributions are the introduction of two end-to-end architectures for syllabo-tonic Russian poetry, the pretrained models, a comparative analysis of the approaches, and poetry evaluation metrics.

pdf bib
Automating Violence Detection and Categorization from Ancient Texts
Alhassan Abdelhalim | Michaela Regneri

Violence descriptions in literature offer valuable insights for a wide range of research in the humanities. For historians, depictions of violence are of special interest for analyzing the societal dynamics surrounding large wars and individual conflicts of influential people. Harvesting data for violence research manually is laborious and time-consuming. This study is the first to evaluate the effectiveness of large language models (LLMs) in identifying violence in ancient texts and categorizing it across multiple dimensions. Our experiments identify LLMs as a valuable tool to scale up the accurate analysis of historical texts and show the effect of fine-tuning and data augmentation, yielding an F1-score of up to 0.93 for violence detection and 0.86 for fine-grained violence categorization.

pdf bib
Rethinking Scene Segmentation. Advancing Automated Detection of Scene Changes in Literary Texts
Svenja Guhr | Huijun Mao | Fengyi Lin

Automated scene segmentation is an ongoing challenge in computational literary studies (CLS) to approach literary texts by analyzing comparable units. In this paper, we present our approach (work in progress) to text segmentation using a classifier that identifies the position of a scene change in English-language fiction. By manually annotating novels from a 20th-century US-English romance fiction corpus, we prepared training data for fine-tuning transformer models, yielding promising preliminary results for improving automated text segmentation in CLS.

pdf bib
Sentence-Alignment in Semi-parallel Datasets
Steffen Frenzel | Manfred Stede

In this paper, we are testing sentence alignment on complex, semi-parallel corpora, i.e., different versions of the same text that have been altered to some extent. We evaluate two hypotheses: To make alignment algorithms more efficient, we test the hypothesis that matching pairs can be found in the immediate vicinity of the source sentence and that it is sufficient to search for paraphrases in a ‘context window’. To improve the alignment quality on complex, semi-parallel texts, we test the implementation of a segmentation into Elementary Discourse Units (EDUs) in order to make more precise alignments at this level. Since EDUs are the smallest possible unit for communicating a full proposition, we assume that aligning at this level can improve the overall quality. Both hypotheses are tested and validated with several embedding models on German datasets with varying degrees of parallelism. The advantages and disadvantages of the different approaches are presented, and our next steps are outlined.

pdf bib
Argumentation in political empowerment on Instagram
Aenne Knierim | Ulrich Heid

This paper adopts a distant reading approach to analyze political empowerment on Instagram. We focus on argument mining and content classification to uncover co-occurrences between aspects of political empowerment and argument components. We develop an annotation scheme based on literature in digital political empowerment, classifying content into five primary categories along the aspects of political awareness, personal e-identity and political participation. We implement the modified Toulmin scheme for argument component detection. As example discourses, we chose the German discourses #WirSindMehr and #NieWiederIstJetzt. The upheaval was targeted against right-wing extremism and antisemitism. Political awareness emerged as the dominant category, highlighting convergent public concern against antisemitism and right-wing extremism. Claims and backings often contain statements about societal change and aim to raise consciousness. Calls for participation in offline events appear mostly in non-argumentative texts.

pdf bib
Interpretable Models for Detecting Linguistic Variation in Russian Media: Towards Unveiling Propagandistic Strategies during the Russo-Ukrainian War
Anastasiia Vestel | Stefania Degaetano-Ortlieb

With the start of the full-scale Russian invasion of Ukraine in February 2022, the spread of pro-Kremlin propaganda increased to justify the war, both in the official state media and social media. This position paper explores the theoretical background of propaganda detection in the given context and proposes a thorough methodology to investigate how language has been strategically manipulated to align with ideological goals and adapt to the changing narrative surrounding the invasion. Using the WarMM-2022 corpus, the study seeks to identify linguistic patterns across media types and their evolution over time. By doing so, we aim to enhance the understanding of the role of linguistic strategies in shaping propaganda narratives. The findings are intended to contribute to the broader discussion of information manipulation in politically sensitive contexts.

pdf bib
Tuning Into Bias: A Computational Study of Gender Bias in Song Lyrics
Danqing Chen | Adithi Satish | Rasul Khanbayov | Carolin Schuster | Georg Groh

The application of text mining methods is becoming increasingly prevalent, particularly within Humanities and Computational Social Sciences, as well as in a broader range of disciplines. This paper presents an analysis of gender bias in English song lyrics using topic modeling and bias measurement techniques. Leveraging BERTopic, we cluster a dataset of 537,553 English songs into distinct topics and analyze their temporal evolution. Our results reveal a significant thematic shift in song lyrics over time, transitioning from romantic themes to a heightened focus on the sexualization of women. Additionally, we observe a substantial prevalence of profanity and misogynistic content across various topics, with a particularly high concentration in the largest thematic cluster. To further analyse gender bias across topics and genres in a quantitative way, we employ the Single Category Word Embedding Association Test (SC-WEAT) to calculate bias scores for word embeddings trained on the most prominent topics as well as individual genres. The results indicate a consistent male bias in words associated with intelligence and strength, while appearance and weakness words show a female bias. Further analysis highlights variations in these biases across topics, illustrating the interplay between thematic content and gender stereotypes in song lyrics.
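
For reference, one common formulation of the SC-WEAT effect size used in such studies takes a single target word set and two attribute sets and normalises the mean association difference by its standard deviation; the sketch below assumes the embedding vectors are already available:

import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def sc_weat_effect_size(targets, attrs_a, attrs_b):
    # targets, attrs_a, attrs_b: lists of embedding vectors,
    # e.g. intelligence words vs. male/female attribute terms.
    s = np.array([
        np.mean([cos(w, a) for a in attrs_a]) - np.mean([cos(w, b) for b in attrs_b])
        for w in targets
    ])
    return s.mean() / s.std(ddof=1)  # positive: closer association with A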

pdf bib
Artificial Relationships in Fiction: A Dataset for Advancing NLP in Literary Domains
Despina Christou | Grigorios Tsoumakas

Relation extraction (RE) in fiction presents unique NLP challenges due to implicit, narrative-driven relationships. Unlike factual texts, fiction weaves complex connections, yet existing RE datasets focus on non-fiction. To address this, we introduce Artificial Relationships in Fiction (ARF), a synthetically annotated dataset for literary RE. Built from diverse Project Gutenberg fiction, ARF considers author demographics, publication periods, and themes. We curated an ontology for fiction-specific entities and relations, and using GPT-4o, generated artificial relationships to capture narrative complexity. Our analysis demonstrates its value for finetuning RE models and advancing computational literary studies. By bridging a critical RE gap, ARF enables deeper exploration of fictional relationships, enriching NLP research at the intersection of storytelling and AI-driven literary analysis.

pdf bib
Improving Hate Speech Classification with Cross-Taxonomy Dataset Integration
Jan Fillies | Adrian Paschke

Algorithmic hate speech detection faces significant challenges due to the diverse definitions and datasets used in research and practice. Social media platforms, legal frameworks, and institutions each apply distinct yet overlapping definitions, complicating classification efforts. This study addresses these challenges by demonstrating that existing datasets and taxonomies can be integrated into a unified model, enhancing prediction performance and reducing reliance on multiple specialized classifiers. The work introduces a universal taxonomy and a hate speech classifier capable of detecting a wide range of definitions within a single framework. Our approach is validated by combining two widely used but differently annotated datasets, showing improved classification performance on an independent test set. This work highlights the potential of dataset and taxonomy integration in advancing hate speech detection, increasing efficiency, and ensuring broader applicability across contexts.

pdf bib
Classifying Textual Genre in Historical Magazines (1875-1990)
Vera Danilova | Ylva Söderfeldt

Historical magazines are a valuable resource for understanding the past, offering insights into everyday life, culture, and evolving social attitudes. They often feature diverse layouts and genres. Short stories, guides, announcements, and promotions can all appear side by side on the same page. Without grouping these documents by genre, term counts and topic models may lead to incorrect interpretations. This study takes a step towards addressing this issue by focusing on genre classification within a digitized collection of European medical magazines in Swedish and German. We explore two scenarios: 1) leveraging the available web genre datasets for zero-shot genre prediction, 2) semi-supervised learning on top of the few-shot setup. This paper offers the first experimental insights in this direction. We find that 1) with a custom genre scheme tailored to historical dataset characteristics, it is possible to effectively utilize categories from web genre datasets for cross-domain and cross-lingual zero-shot prediction, and 2) semi-supervised training gives considerable advantages over few-shot for all models, particularly for the historical multilingual BERT.

pdf bib
Lexical Semantic Change Annotation with Large Language Models
Thora Hagen

This paper explores the application of state-of-the-art large language models (LLMs) to the task of lexical semantic change annotation (LSCA) using the historical German DURel dataset. We evaluate five LLMs, and investigate whether retrieval-augmented generation (RAG) with historical encyclopedic knowledge enhances results. Our findings show that the Llama3.3 model achieves comparable performance to GPT-4o despite significant parameter differences, while RAG marginally improves predictions for smaller models but hampers performance for larger ones. Further analysis suggests that our additional context benefits nouns more than verbs and adjectives, demonstrating the nuances of integrating external knowledge for semantic tasks.

pdf bib
AI Conversational Interviewing: Transforming Surveys with LLMs as Adaptive Interviewers
Alexander Wuttke | Matthias Aßenmacher | Christopher Klamm | Max M. Lang | Quirin Würschinger | Frauke Kreuter

Traditional methods for eliciting people’s opinions face a trade-off between depth and scale: structured surveys enable large-scale data collection but limit respondents’ ability to voice their opinions in their own words, while conversational interviews provide deeper insights but are resource-intensive. This study explores the potential of replacing human interviewers with large language models (LLMs) to conduct scalable conversational interviews. Our goal is to assess the performance of AI Conversational Interviewing and to identify opportunities for improvement in a controlled environment. We conducted a small-scale, in-depth study with university students who were randomly assigned to a conversational interview by either AI or human interviewers, both employing identical questionnaires on political topics. Various quantitative and qualitative measures assessed interviewer adherence to guidelines, response quality, participant engagement, and overall interview efficacy. The findings indicate the viability of AI Conversational Interviewing in producing quality data comparable to traditional methods, with the added benefit of scalability. We publish our data and materials for re-use and present specific recommendations for effective implementation.

pdf bib
Embedded Personalities: Word Embeddings and the “Big Five” Personality Model
Oliver Müller | Stefania Degaetano-Ortlieb

pdf bib
Prompting the Past: Exploring Zero-Shot Learning for Named Entity Recognition in Historical Texts Using Prompt-Answering LLMs
Crina Tudor | Beata Megyesi | Robert Östling

This paper investigates the application of prompt-answering Large Language Models (LLMs) for the task of Named Entity Recognition (NER) in historical texts. Historical NER presents unique challenges due to language change through time, spelling variation, limited availability of digitized data (and, in particular, labeled data), and errors introduced by Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) processes. Leveraging the zero-shot capabilities of prompt-answering LLMs, we address these challenges by prompting the model to extract entities such as persons, locations, organizations, and dates from historical documents. We then conduct an extensive error analysis of the model output in order to identify and address potential weaknesses in the entity recognition process. The results show that, while such models display some ability to extract named entities, their overall performance is lackluster. Our analysis reveals that model performance is significantly affected by hallucinations in the model output, as well as by challenges imposed by the evaluation of NER output.

pdf bib
LLMs for Translation: Historical, Low-Resourced Languages and Contemporary AI Models
Merve Tekgürler

Large Language Models (LLMs) have demonstrated remarkable adaptability in performing various tasks, including machine translation (MT), without explicit training. Models such as OpenAI’s GPT-4 and Google’s Gemini are frequently evaluated on translation benchmarks and utilized as translation tools due to their high performance. This paper examines Gemini’s performance in translating an 18th-century Ottoman Turkish manuscript, Prisoner of the Infidels: The Memoirs of Osman Agha of Timișoara, into English. The manuscript recounts the experiences of Osman Agha, an Ottoman subject who spent 11 years as a prisoner of war in Austria, and includes his accounts of warfare and violence. Our analysis reveals that Gemini’s safety mechanisms flagged between 14% and 23% of the manuscript as harmful, resulting in untranslated passages. These safety settings, while effective in mitigating potential harm, hinder the model’s ability to provide complete and accurate translations of historical texts. Through real historical examples, this study highlights the inherent challenges and limitations of current LLM safety implementations in the handling of sensitive and context-rich materials. These real-world instances underscore potential failures of LLMs in contemporary translation scenarios, where accurate and comprehensive translations are crucial—for example, translating the accounts of modern victims of war for legal proceedings or humanitarian documentation.

pdf bib
Optimizing Cost-Efficiency with LLM-Generated Training Data for Conversational Semantic Frame Analysis
Shiho Matta | Yin Jou Huang | Fei Cheng | Hirokazu Kiyomaru | Yugo Murawaki

Recent studies have shown that few-shot learning enables large language models (LLMs) to generate training data for supervised models at a low cost. However, for complex tasks, the quality of LLM-generated data often falls short compared to human-labeled data. This presents a critical challenge: how should one balance the trade-off between the higher quality but more expensive human-annotated data and the lower quality yet significantly cheaper LLM-generated data? In this paper, we tackle this question for a demanding task: conversational semantic frame analysis (SFA). To address this, we propose a novel method for synthesizing training data tailored to this complex task. Through experiments conducted across a wide range of budget levels, we find that smaller budgets favor a higher reliance on LLM-generated data to achieve optimal cost-efficiency.

pdf bib
Don’t stop pretraining! Efficiently building specialised language models in resource-constrained settings.
Sven Najem-Meyer | Frédéric Kaplan | Matteo Romanello

Developing specialised language models for low-resource domains typically involves a trade-off between two specialisation strategies: adapting a general-purpose model through continued pretraining or retraining a model from scratch. While adapting preserves the model’s linguistic knowledge, retraining benefits from the flexibility of an in-domain tokeniser – a potentially significant advantage when handling rare languages. This study investigates the impact of tokenisation, specialisation strategy, and pretraining data availability using classical scholarship – a multilingual, code-switching and highly domain-specific field – as a case study. Through extensive experiments, we assess whether domain-specific tokenisation improves model performance, whether character-based models provide a viable alternative to subword-based models, and which specialisation strategy is optimal given the constraints of limited pretraining data. Contrary to prior findings, our results show that in-domain tokenisation does not necessarily enhance performance. Most notably, adaptation consistently outperforms retraining, even with limited data, confirming its efficiency as the preferred strategy for resource-constrained domains. These insights provide valuable guidelines for developing specialised models in fields with limited textual resources.

pdf bib
‘... like a needle in a haystack’: Annotation and Classification of Comparative Statements
Pritha Majumdar | Franziska Pannach | Arianna Graciotti | Johan Bos

We present a clear distinction between the phenomena of comparisons and similes along with a fine-grained annotation guideline that facilitates the structural annotation and assessment of the two classes, with three major contributions: 1) a publicly available annotated data set of 100 comparative statements; 2) theoretically grounded annotation guidelines for human annotators; and 3) results of machine learning experiments to establish how the (often subtle) distinction between the two phenomena can be automated.

pdf bib
Identifying Small Talk in Natural Conversations
Steffen Frenzel | Annette Hautli-Janisz

Small talk is part and parcel of human interaction and is employed to communicate values and opinions rather than pure information. Despite small talk being an omnipresent phenomenon in spoken language, it is difficult to identify: small talk is situated, i.e., for interpreting a string of words or discourse units, outside references such as the context of the interlocutors and their previous experiences have to be taken into account. In this paper, we present a dataset of natural conversation annotated with a theoretically well-motivated distillation of what constitutes small talk. This dataset comprises verbatim transcribed public service encounters in German authorities and is the basis for empirical work in administrative policy on how the satisfaction of the citizen manifests itself in the communication with the authorities. We show that statistical models achieve comparable results to those of state-of-the-art LLMs.

pdf bib
Why Novels (Don’t) Break Through: Dynamics of Canonicity in the Danish Modern Breakthrough (1870-1900)
Alie Lassche | Pascale Feldkamp | Yuri Bizzoni | Katrine Baunvig | Kristoffer Nielbo

Recent studies suggest that canonical works possess unique textual profiles, often tied to innovation and higher cognitive demands. However, recent work on Danish 19th century literary novels has shown that some non-canonical works shared similar textual qualities with canonical works, underscoring the role of text-extrinsic factors in shaping canonicity. The present study examines the same corpus (more than 800 Danish novels from the Modern Breakthrough era, 1870–1900) to explore the role of socio-economic and institutional factors, as well as demographic features (specifically, book prices, publishers, and the author’s nationality) in determining canonical status. We combine expert-based and national definitions of canon to set up a classification experiment to test the predictive power of these external features, and to understand how they relate to that of text-intrinsic features. We show that the canonization process is influenced by external factors, such as publisher and nationality, but that text-intrinsic features nevertheless maintain predictive power in a dynamic interplay of text and context.

pdf bib
Adapting Multilingual Embedding Models to Historical Luxembourgish
Andrianos Michail | Corina Raclé | Juri Opitz | Simon Clematide

The growing volume of digitized historical texts requires effective semantic search using text embeddings. However, pre-trained multilingual models face challenges with historical content due to OCR noise and outdated spellings. This study examines multilingual embeddings for cross-lingual semantic search in historical Luxembourgish (LB), a low-resource language. We collect historical Luxembourgish news articles from various periods and use GPT-4o for sentence segmentation and translation, generating 20,000 parallel training sentences per language pair. Additionally, we create a semantic search (Historical LB Bitext Mining) evaluation set and find that existing models perform poorly on cross-lingual search for historical Luxembourgish. Using our historical and additional modern parallel training data, we adapt several multilingual embedding models through contrastive learning or knowledge distillation and increase accuracy significantly for all models. We release our adapted models and historical Luxembourgish-German/French/English bitexts to support further research.
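
As an illustration of the contrastive option, the sketch below fine-tunes a multilingual sentence encoder on parallel bitexts with in-batch negatives; the base model, batch size, and epoch count are illustrative assumptions rather than the released configuration:

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

pairs = [
    # Toy Luxembourgish-German pair; the real data holds ~20,000 per pair.
    InputExample(texts=["D'Zeitung vu gëschter ...", "Die Zeitung von gestern ..."]),
]
loader = DataLoader(pairs, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)  # in-batch negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)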

up

pdf (full)
bib (full)
Proceedings of the 1st Workshop on Language Models for Underserved Communities (LM4UC 2025)

pdf bib
Proceedings of the 1st Workshop on Language Models for Underserved Communities (LM4UC 2025)
Duc Nguyen

pdf bib
Enhance Contextual Learning in ASR for Endangered Low-resource Languages
Zhaolin Li | Jan Niehues

Automatic Speech Recognition (ASR) facilitates documenting endangered low-resource languages. While recent advances in acoustic modelling have been substantial, contextual learning remains underexplored. This study investigates the main factors that influence the integration of knowledge from language models (LMs) into state-of-the-art ASR models for endangered low-resource languages. Through experiments on five diverse low-resource languages, we find: 1) Fine-grained tokenization effectively improves ASR performance by addressing the prevalent unknown words and improving data usage efficiency; 2) The integration of transformer-based LMs into ASR systems surpasses that of N-gram LMs only in one language, even though they consistently achieve better results in language modelling tasks; and 3) ASR performance is highly sensitive to language-specific optimization, as shown by a 43% performance degradation in one language due to parameter transfer across languages. We open-source our scripts to support further research and applications.

pdf bib
Empowering Low-Resource Languages: TraSe Architecture for Enhanced Retrieval-Augmented Generation in Bangla
Atia Shahnaz Ipa | Mohammad Abu Tareq Rony | Mohammad Shariful Islam

Research on Retrieval-Augmented Generation (RAG) for low-resource languages has been sparse because of limited resources. To address this, we focus on Bangla, a low-resource language, and have created a dataset of 200 question-answer pairs from Bangla Wikipedia dump data as the basis for our study. This paper introduces the TraSe architecture, which enhances RAG for Bangla using Translative prompting. Our experiments demonstrate that TraSe improves answer selection accuracy, achieving 34% with automatic retrieval and 63% with Human-in-the-Loop retrieval, outperforming baseline methods. The TraSe architecture marks a significant advancement in RAG for low-resource languages and has the potential to enhance question-answering systems for Bangla and similar languages. Future research could explore additional low-resource languages. The code is available at the following GitHub repository: https://github.com/Atia6/TraSe-Bangla-RAG.

pdf bib
ABDUL: A New Approach to Build Language Models for Dialects Using Formal Language Corpora Only
Yassine Toughrai | Kamel Smaïli | David Langlois

Arabic dialects present major challenges for natural language processing (NLP) due to their diglossic nature, phonetic variability, and the scarcity of resources. To address this, we introduce a phoneme-like transcription approach that enables the training of robust language models for North African Dialects (NADs) using only formal language data, without the need for dialect-specific corpora. Our key insight is that Arabic dialects are highly phonetic, with NADs particularly influenced by European languages. This motivated us to develop a novel approach in which we convert Arabic script into a Latin-based representation, allowing our language model, ABDUL, to benefit from existing Latin-script corpora. Our method demonstrates strong performance in multi-label emotion classification and named entity recognition (NER) across various Arabic dialects. ABDUL achieves results comparable to or better than specialized and multilingual models such as DarijaBERT, DziriBERT, and mBERT. Notably, in the NER task, ABDUL outperforms mBERT by 5% in F1-score for Modern Standard Arabic (MSA), Moroccan, and Algerian Arabic, despite using a vocabulary four times smaller than mBERT.
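
A toy fragment of the phoneme-like romanisation idea follows; the mapping is invented for illustration and is not ABDUL's actual transcription table:

ARABIC_TO_LATIN = {
    "\u0628": "b",   # ب
    "\u062a": "t",   # ت
    "\u0633": "s",   # س
    "\u0644": "l",   # ل
    "\u0645": "m",   # م
    "\u0627": "a",   # ا
}

def romanise(text):
    # Map each Arabic character to a Latin symbol, keeping unknowns as-is.
    return "".join(ARABIC_TO_LATIN.get(ch, ch) for ch in text)

print(romanise("\u0633\u0644\u0627\u0645"))  # سلام -> "slam"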

pdf bib
Untangling the Influence of Typology, Data, and Model Architecture on Ranking Transfer Languages for Cross-Lingual POS Tagging
Enora Rice | Ali Marashian | Hannah Haynie | Katharina Wense | Alexis Palmer

Cross-lingual transfer learning is an invaluable tool for overcoming data scarcity, yet selecting a suitable transfer language remains a challenge. The precise roles of linguistic typology, training data, and model architecture in transfer language choice are not fully understood. We take a holistic approach, examining how both dataset-specific and fine-grained typological features influence transfer language selection for part-of-speech tagging, considering two different sources for morphosyntactic features. While previous work examines these dynamics in the context of bilingual BiLSTMs, we extend our analysis to a more modern transfer learning pipeline: zero-shot prediction with pretrained multilingual models. We train a series of transfer language ranking systems and examine how different feature inputs influence ranker performance across architectures. Word overlap, type-token ratio, and genealogical distance emerge as top features across all architectures. Our findings reveal that a combination of typological and dataset-dependent features leads to the best rankings, and that good performance can be obtained with either feature group on its own.

pdf bib
Serving the Underserved: Leveraging BARTBahnar Language Model for Bahnaric-Vietnamese Translation
Long Nguyen | Tran Le | Huong Nguyen | Quynh Vo | Phong Nguyen | Tho Quan

The Bahnar people, one of Vietnam’s ethnic minorities, represent an underserved community with limited access to modern technologies. Developing an effective Bahnaric-Vietnamese translation system is essential for fostering linguistic exchange, preserving cultural heritage, and empowering local communities by bridging communication barriers. With advancements in Artificial Intelligence (AI), Neural Machine Translation (NMT) has achieved remarkable success across various language pairs. However, the low-resource nature of Bahnaric, characterized by data scarcity, vocabulary constraints, and the lack of parallel corpora, poses significant challenges to building an accurate and efficient translation system. To address these challenges, we propose a novel hybrid architecture for Bahnaric-Vietnamese translation, with BARTBahnar as its core language model. BARTBahnar is developed by continually training a pre-trained Vietnamese model, BARTPho, on augmented monolingual Bahnaric data, followed by fine-tuning on bilingual datasets. This transfer learning approach reduces training costs while effectively capturing linguistic similarities between the two languages. Additionally, we implement advanced data augmentation techniques to enrich and diversify training data, further enhancing BARTBahnar’s robustness and translation accuracy. Beyond leveraging the language model, our hybrid system integrates rule-based and statistical methods to improve translation quality. Experimental results show substantial improvements on bilingual Bahnaric-Vietnamese datasets, validating the effectiveness of our approach for low-resource translation. To support further research, we open-source our code and related materials at https://github.com/ura-hcmut/BARTBahnar.

pdf bib
Caption Generation in Cultural Heritage: Crowdsourced Data and Tuning Multimodal Large Language Models
Artem Reshetnikov | Maria-Cristina Marinescu

Automated caption generation for paintings enables enhanced access and understanding of visual artworks. This work introduces a novel caption dataset, obtained by manual annotation of about 7500 images from the publicly available DEArt dataset for object detection and pose estimation. Our focus is on describing the visual scenes rather than the context or style of the artwork - more common in other existing captioning datasets. The dataset is the result of a crowdsourcing initiative spanning 13 months, with volunteers adhering to explicit captioning guidelines reflecting our requirements. We provide each artwork in the dataset with five captions, created independently by volunteers to ensure diversity of interpretation and increase the robustness of the captioning model. In addition, we explore using the crowdsourced dataset for fine-tuning Large Language Models with vision encoders for domain-specific caption generation. The goal is to improve the performance of multimodal LLMs in the context of cultural heritage, a domain with “small data” which often struggles with the nuanced visual analysis and interpretation required for cultural objects such as paintings. The use of crowdsourced data in the domain adaptation process enables us to incorporate the collective perceptual insights of diverse annotators, resulting in an exploration of visual narratives and observing a reduction in hallucinations otherwise produced by these large language models.

pdf bib
Preserving Cultural Identity with Context-Aware Translation Through Multi-Agent AI Systems
Mahfuz Ahmed Anik | Abdur Rahman | Azmine Toushik Wasi | Md Manjurul Ahsan

Language is a cornerstone of cultural identity, yet globalization and the dominance of major languages have placed nearly 3,000 languages at risk of extinction. Existing AI-driven translation models prioritize efficiency but often fail to capture cultural nuances, idiomatic expressions, and historical significance, leading to translations that marginalize linguistic diversity. To address these challenges, we propose a multi-agent AI framework designed for culturally adaptive translation in underserved language communities. Our approach leverages specialized agents for translation, interpretation, content synthesis, and bias evaluation, ensuring that linguistic accuracy and cultural relevance are preserved. Using CrewAI and LangChain, our system enhances contextual fidelity while mitigating biases through external validation. Comparative analysis shows that our framework outperforms GPT-4o, producing contextually rich and culturally embedded translations, a critical advancement for Indigenous, regional, and low-resource languages. This research underscores the potential of multi-agent AI in fostering equitable, sustainable, and culturally sensitive NLP technologies, aligning with the AI Governance, Cultural NLP, and Sustainable NLP pillars of Language Models for Underserved Communities. Our full experimental codebase is publicly available at: github.com/ciol-researchlab/Context-Aware_Translation_MAS.

pdf bib
Enhancing Small Language Models for Cross-Lingual Generalized Zero-Shot Classification with Soft Prompt Tuning
Fred Philippy | Siwen Guo | Cedric Lothritz | Jacques Klein | Tegawendé Bissyandé

In NLP, Zero-Shot Classification (ZSC) has become essential for enabling models to classify text into categories unseen during training, particularly in low-resource languages and domains where labeled data is scarce. While pretrained language models (PLMs) have shown promise in ZSC, they often rely on large training datasets or external knowledge, limiting their applicability in multilingual and low-resource scenarios. Recent approaches leveraging natural language prompts reduce the dependence on large training datasets but struggle to effectively incorporate available labeled data from related classification tasks, especially when these datasets originate from different languages or distributions. Moreover, existing prompt-based methods typically rely on manually crafted prompts in a specific language, limiting their adaptability and effectiveness in cross-lingual settings. To address these challenges, we introduce RoSPrompt, a lightweight and data-efficient approach for training soft prompts that enhance cross-lingual ZSC while ensuring robust generalization across data distribution shifts. RoSPrompt is designed for small multilingual PLMs, enabling them to leverage high-resource languages to improve performance in low-resource settings without requiring extensive fine-tuning or high computational costs. We evaluate our approach on multiple multilingual PLMs across datasets covering 106 languages, demonstrating strong cross-lingual transfer performance and robust generalization capabilities over unseen classes.
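
Conceptually, soft prompt tuning of this kind prepends a small matrix of trainable "virtual token" embeddings to the frozen PLM's input embeddings and updates only that matrix; the shapes below are illustrative, not RoSPrompt's exact setup:

import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_virtual_tokens=20, hidden_size=768):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_virtual_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds):           # (batch, seq, hidden)
        batch = input_embeds.size(0)
        prefix = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, input_embeds], dim=1)

soft = SoftPrompt()
embeds = torch.randn(2, 16, 768)               # frozen PLM input embeddings
extended = soft(embeds)                        # (2, 36, 768); only `soft` trains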

pdf bib
Cognate and Contact-Induced Transfer Learning for Hamshentsnag: A Low-Resource and Endangered Language
Onur Keleş | Baran Günay | Berat Doğan

This study investigates zero-shot and few-shot cross-lingual transfer effects in Part-of-Speech (POS) tagging and Named Entity Recognition (NER) for Hamshentsnag, an endangered Western Armenian dialect. We examine how different source languages, Western Armenian (contact cognate), Eastern Armenian (ancestral cognate), Turkish (substrate or contact-induced), and English (non-cognate), affect task performance using multilingual BERT and BERTurk. Results show that cognate varieties improved POS tagging by 8% F1, while the substrate source enhanced NER by 15% F1. BERTurk outperformed mBERT on NER but not on POS. We attribute this to task-specific advantages of different source languages. For non-Latin scripts, we also used script conversion and phonetic alignment with the target, which facilitated transfer.

pdf bib
Nayana OCR: A Scalable Framework for Document OCR in Low-Resource Languages
Adithya Kolavi | Samarth P | Vyoman Jain

We introduce Nayana, a scalable and efficient framework for adapting Vision-Language Models (VLMs) to low-resource languages. Despite significant advances, modern VLMs remain constrained by the scarcity of training data in non-English languages, limiting their global applicability. Our framework addresses this fundamental challenge through a novel layout-aware synthetic data generation pipeline combined with parameter-efficient adaptation techniques. Instead of requiring extensive manually annotated datasets, Nayana enables existing models to learn new languages effectively using purely synthetic data. Using Low-Rank Adaptation (LoRA), we demonstrate this capability across ten Indic languages: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu. Through extensive experiments in OCR tasks, we show that models can achieve strong performance in new languages without the traditional requirements of large-scale annotated datasets or extensive model modifications. Nayana’s success in adapting VLMs to new languages with synthetic data establishes a practical pathway for extending AI capabilities to underserved languages, particularly in scenarios where annotated data is scarce or unavailable.
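
A hedged sketch of the parameter-efficient adaptation step with Hugging Face's peft library; the model name is a placeholder, and the target modules and ranks are assumptions that may differ from Nayana's released code:

from transformers import AutoModelForVision2Seq
from peft import LoraConfig, get_peft_model

base = AutoModelForVision2Seq.from_pretrained("some/vision-language-model")  # placeholder

config = LoraConfig(
    r=16,                                   # low-rank dimension
    lora_alpha=32,                          # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attention projections (assumed)
)
model = get_peft_model(base, config)
model.print_trainable_parameters()          # only the LoRA adapters train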

pdf bib
On Tables with Numbers, with Numbers
Konstantinos Kogkalidis | Stergios Chatzikyriakidis

This paper is a critical reflection on the epistemic culture of contemporary computational linguistics, framed in the context of its growing obsession with tables with numbers. We argue against tables with numbers on the basis of their epistemic irrelevance, their environmental impact, their role in enabling and exacerbating social inequalities, and their deep ties to commercial applications and profit-driven research. We substantiate our arguments with empirical evidence drawn from a meta-analysis of computational linguistics research over the last decade.

up

pdf (full)
bib (full)
Proceedings of the First Workshop on Language Models for Low-Resource Languages

pdf bib
Proceedings of the First Workshop on Language Models for Low-Resource Languages
Hansi Hettiarachchi | Tharindu Ranasinghe | Paul Rayson | Ruslan Mitkov | Mohamed Gaber | Damith Premasiri | Fiona Anting Tan | Lasitha Uyangodage

pdf bib
Overview of the First Workshop on Language Models for Low-Resource Languages (LoResLM 2025)
Hansi Hettiarachchi | Tharindu Ranasinghe | Paul Rayson | Ruslan Mitkov | Mohamed Gaber | Damith Premasiri | Fiona Anting Tan | Lasitha Randunu Chandrakantha Uyangodage

The first Workshop on Language Models for Low-Resource Languages (LoResLM 2025) was held in conjunction with the 31st International Conference on Computational Linguistics (COLING 2025) in Abu Dhabi, United Arab Emirates. This workshop mainly aimed to provide a forum for researchers to share and discuss their ongoing work on language models (LMs) focusing on low-resource languages, following the recent advancements in neural language models and their linguistic biases towards high-resource languages. LoResLM 2025 attracted notable interest from the natural language processing (NLP) community, resulting in 35 accepted papers from 52 submissions. These contributions cover a broad range of low-resource languages from eight language families and 13 diverse research areas, paving the way for future possibilities and promoting linguistic inclusivity in NLP.

pdf bib
Atlas-Chat: Adapting Large Language Models for Low-Resource Moroccan Arabic Dialect
Guokan Shang | Hadi Abdine | Yousef Khoubrane | Amr Mohamed | Yassine Abbahaddou | Sofiane Ennadir | Imane Momayiz | Xuguang Ren | Eric Moulines | Preslav Nakov | Michalis Vazirgiannis | Eric Xing

We introduce Atlas-Chat, the first-ever collection of LLMs specifically developed for dialectal Arabic. Focusing on Moroccan Arabic, also known as Darija, we construct our instruction dataset by consolidating existing Darija language resources, creating novel datasets both manually and synthetically, and translating English instructions with stringent quality control. Atlas-Chat-2B, 9B, and 27B models, fine-tuned on the dataset, exhibit superior ability in following Darija instructions and performing standard NLP tasks. Notably, our models outperform both state-of-the-art and Arabic-specialized LLMs like LLaMa, Jais, and AceGPT, e.g., our 9B model gains a 13% performance boost over a larger 13B model on DarijaMMLU, our newly introduced evaluation suite for Darija covering both discriminative and generative tasks. Furthermore, we perform an experimental analysis of various fine-tuning strategies and base model choices to determine optimal configurations. All our resources are publicly accessible, and we believe our work offers comprehensive design methodologies of instruction-tuning for low-resource languages, which are often neglected in favor of data-rich languages by contemporary LLMs.

pdf bib
Empowering Persian LLMs for Instruction Following: A Novel Dataset and Training Approach
Hojjat Mokhtarabadi | Ziba Zamani | Abbas Maazallahi | Mohammad Hossein Manshaei

Instruction-tuned large language models have demonstrated remarkable capabilities in following human instructions across various domains. However, their proficiency remains notably deficient in many low-resource languages. To address this challenge, we begin by introducing FarsInstruct: a comprehensive instruction dataset designed to enhance the instruction-following ability of large language models specifically for the Persian language—a significant yet underrepresented language globally. FarsInstruct encompasses a wide range of task types and datasets, each containing a mix of straightforward to complex manually written instructions, as well as translations from the Public Pool of Prompts, ensuring a rich linguistic and cultural representation. Furthermore, we introduce Co-CoLA, a framework designed to enhance the multi-task adaptability of LoRA-tuned models. Through extensive experimental analyses, our study showcases the effectiveness of the FarsInstruct dataset coupled with training by the Co-CoLA framework, in improving the performance of large language models within the Persian context. As of the current writing, FarsInstruct comprises 197 templates across 21 distinct datasets, and we intend to update it consistently, thus augmenting its applicability.

pdf bib
BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis
Sadia Alam | Md Farhan Ishmam | Navid Hasin Alvee | Md Shahnewaz Siddique | Md Azam Hossain | Abu Raihan Mostofa Kamal

The widespread availability of code-mixed data in digital spaces can provide valuable insights into low-resource languages like Bengali, which have limited annotated corpora. Sentiment analysis, a pivotal text classification task, has been explored across multiple languages, yet code-mixed Bengali remains underrepresented with no large-scale, diverse benchmark. Code-mixed text is particularly challenging as it requires the understanding of multiple languages and their interaction in the same text. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali comprising 20,000 samples with 4 sentiment labels, sourced from Facebook, YouTube, and e-commerce sites. By aggregating multiple sources, we ensure linguistic diversity reflecting realistic code-mixed scenarios. We implement a novel automated text filtering pipeline using fine-tuned language models to detect code-mixed samples and expand code-mixed text corpora. We further propose baselines using machine learning, neural networks, and transformer-based language models. The availability of a diverse dataset is a critical step towards democratizing NLP and ultimately contributing to a better understanding of code-mixed languages.

pdf bib
Using Language Models for assessment of users’ satisfaction with their partner in Persian
Zahra Habibzadeh | Masoud Asadpour

Sentiment analysis, the process of gauging user attitudes and emotions through their textual data, including social media posts and other forms of communication, is a valuable tool for informed decision-making. In other words, by determining whether a statement conveys positivity, negativity, or neutrality, sentiment analysis offers insights into public sentiment regarding a product, individual, event, or other significant topics. This research focuses on the effectiveness of sentiment analysis techniques, using Machine Learning (ML) and Natural Language Processing (NLP), especially pre-trained language models for Persian, in assessing users’ satisfaction with their partner, using data collected from X (formerly Twitter). Our motivation stems from traditional in-person surveys, which periodically analyze societal challenges in Iran. The limitations of these surveys led us to explore Artificial Intelligence (AI) as an alternative solution for addressing contemporary social issues. We collected Persian tweets and utilized data annotation techniques to label them according to our research question, forming the dataset. We also aimed to provide a benchmark of Persian tweets on this specific topic. To evaluate our dataset, we employed several classification methods, including classical ML models, Deep Neural Networks, and pre-trained language models for Persian. Following a comprehensive evaluation, our results show that BERTweet-FA (one of the pre-trained language models for Persian) emerged as the best performer among the classifiers for assessing users’ satisfaction. This indicates the ability of language models to understand conversational Persian text and perform sentiment analysis, even in a low-resource language like Persian.

pdf bib
Enhancing Plagiarism Detection in Marathi with a Weighted Ensemble of TF-IDF and BERT Embeddings for Low-Resource Language Processing
Atharva Mutsaddi | Aditya Prashant Choudhary

Plagiarism involves using another person’s work or concepts without proper attribution, presenting them as original creations. With the growing amount of data communicated in regional languages such as Marathi—one of India’s regional languages—it is crucial to design robust plagiarism detection systems tailored for low-resource languages. Language models like Bidirectional Encoder Representations from Transformers (BERT) have demonstrated exceptional capability in text representation and feature extraction, making them essential tools for semantic analysis and plagiarism detection. However, the application of BERT for low-resource languages remains underexplored, particularly in the context of plagiarism detection. This paper presents a method to enhance the accuracy of plagiarism detection for Marathi texts using BERT sentence embeddings in conjunction with Term Frequency-Inverse Document Frequency (TF-IDF) feature representation. By combining TF-IDF with BERT, the system’s performance is significantly improved, which is especially pronounced in languages where BERT models are not extremely robust due to a lack of resources and corpora. This approach effectively captures statistical, semantic, and syntactic aspects of text features through a weighted voting ensemble of machine learning models.
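
A minimal sketch of the weighted combination, assuming an off-the-shelf Marathi sentence-embedding model and an untuned mixing weight alpha:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("l3cube-pune/marathi-sentence-similarity-sbert")  # example model

def plagiarism_score(source, suspect, alpha=0.5):
    # Lexical channel: TF-IDF cosine similarity.
    tfidf = TfidfVectorizer().fit_transform([source, suspect])
    lexical = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    # Semantic channel: sentence-embedding cosine similarity.
    emb = encoder.encode([source, suspect])
    semantic = cosine_similarity([emb[0]], [emb[1]])[0, 0]
    return alpha * lexical + (1 - alpha) * semantic  # weighted ensemble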

pdf bib
Investigating the Impact of Language-Adaptive Fine-Tuning on Sentiment Analysis in Hausa Language Using AfriBERTa
Sani Abdullahi Sani | Shamsuddeen Hassan Muhammad | Devon Jarvis

Sentiment analysis (SA) plays a vital role in Natural Language Processing (NLP) by identifying sentiments expressed in text. Although significant advances have been made in SA for widely spoken languages, low-resource languages such as Hausa face unique challenges, primarily due to a lack of digital resources. This study investigates the effectiveness of Language-Adaptive Fine-Tuning (LAFT) to improve SA performance in Hausa. We first curate a diverse, unlabeled corpus to expand the model’s linguistic capabilities, followed by applying LAFT to adapt AfriBERTa specifically to the nuances of the Hausa language. The adapted model is then fine-tuned on the labeled NaijaSenti sentiment dataset to evaluate its performance. Our findings demonstrate that LAFT gives modest improvements, which may be attributed to the use of formal Hausa text rather than informal social media data. Nevertheless, the pre-trained AfriBERTa model significantly outperformed models not specifically trained on Hausa, highlighting the importance of using pre-trained models in low-resource contexts. This research emphasizes the necessity for diverse data sources to advance NLP applications for low-resource African languages. We will publish the code and the data set to encourage further research and facilitate reproducibility in low-resource NLP.

pdf bib
Automated Collection of Evaluation Dataset for Semantic Search in Low-Resource Domain Language
Anastasia Zhukova | Christian E. Matt | Bela Gipp

Domain-specific languages that use a lot of specific terminology often fall into the category of low-resource languages. Collecting test datasets in a narrow domain is time-consuming and requires skilled human resources with domain knowledge and training for the annotation task. This study addresses the challenge of automated collecting test datasets to evaluate semantic search in low-resource domain-specific German language of the process industry. Our approach proposes an end-to-end annotation pipeline for automated query generation to the score reassessment of query-document pairs. To overcome the lack of text encoders trained in the German chemistry domain, we explore a principle of an ensemble of “weak” text encoders trained on common knowledge datasets. We combine individual relevance scores from diverse models to retrieve document candidates and relevance scores generated by an LLM, aiming to achieve consensus on query-document alignment. Evaluation results demonstrate that the ensemble method significantly improves alignment with human-assigned relevance scores, outperforming individual models in both inter-coder agreement and accuracy metrics. These findings suggest that ensemble learning can effectively adapt semantic search systems for specialized, low-resource languages, offering a practical solution to resource limitations in domain-specific contexts.
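
The ensemble principle can be sketched as follows: z-normalise each weak encoder's relevance scores per query, average them, and optionally blend in an LLM-assigned score (the weighting is an assumption, not the paper's tuned value):

import numpy as np

def ensemble_rank(score_matrix, llm_scores=None, llm_weight=0.5):
    # score_matrix: (n_encoders, n_docs) raw relevance scores for one query.
    z = (score_matrix - score_matrix.mean(axis=1, keepdims=True)) \
        / score_matrix.std(axis=1, keepdims=True)
    fused = z.mean(axis=0)
    if llm_scores is not None:
        llm_z = (llm_scores - llm_scores.mean()) / llm_scores.std()
        fused = (1 - llm_weight) * fused + llm_weight * llm_z
    return np.argsort(-fused)                 # document indices, best first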

pdf bib
Filipino Benchmarks for Measuring Sexist and Homophobic Bias in Multilingual Language Models from Southeast Asia
Lance Calvin Lim Gamboa | Mark Lee

Bias studies on multilingual models confirm the presence of gender-related stereotypes in masked models processing languages with high NLP resources. We expand on this line of research by introducing Filipino CrowS-Pairs and Filipino WinoQueer: benchmarks that assess both sexist and anti-queer biases in pretrained language models (PLMs) handling texts in Filipino, a low-resource language from the Philippines. The benchmarks consist of 7,074 new challenge pairs resulting from our cultural adaptation of English bias evaluation datasets—a process that we document in detail to guide similar forthcoming efforts. We apply the Filipino benchmarks on masked and causal multilingual models, including those pretrained on Southeast Asian data, and find that they contain considerable amounts of bias. We also find that for multilingual models, the extent of bias learned for a particular language is influenced by how much pretraining data in that language a model was exposed to. Our benchmarks and insights can serve as a foundation for future work analyzing and mitigating bias in multilingual models.

pdf bib
Exploiting Word Sense Disambiguation in Large Language Models for Machine Translation
Van-Hien Tran | Raj Dabre | Hour Kaing | Haiyue Song | Hideki Tanaka | Masao Utiyama

Machine Translation (MT) has made great strides with the use of Large Language Models (LLMs) and advanced prompting techniques. However, translating sentences with ambiguous words remains challenging, especially when LLMs have limited proficiency in the source language. This paper introduces two methods to enhance MT performance by leveraging the word sense disambiguation capabilities of LLMs. The first method integrates all the available senses of an ambiguous word into the prompting template. The second method uses a pre-trained source language model to predict the correct sense of the ambiguous word, which is then incorporated into the prompting template. Additionally, we propose two prompting template styles for providing word sense information to LLMs. Experiments on the HOLLY dataset demonstrate the effectiveness of our approach in improving MT performance.
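
The first method's prompt construction can be illustrated as follows; the template wording is our own, not the paper's exact prompt:

def build_prompt(sentence, word, senses, src="English", tgt="German"):
    sense_list = "\n".join(f"- {s}" for s in senses)
    return (
        f"Translate the following {src} sentence into {tgt}.\n"
        f'The word "{word}" is ambiguous; its possible senses are:\n'
        f"{sense_list}\n"
        f"Choose the sense that fits the context, then translate.\n"
        f"Sentence: {sentence}"
    )

print(build_prompt(
    "The bank was steep and muddy.",
    "bank",
    ["a financial institution", "the sloping land beside a river"],
))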

pdf bib
Low-Resource Interlinear Translation: Morphology-Enhanced Neural Models for Ancient Greek
Maciej Rapacz | Aleksander Smywiński-Pohl

Contemporary machine translation systems prioritize fluent, natural-sounding output with flexible word ordering. In contrast, interlinear translation maintains the source text’s syntactic structure by aligning target language words directly beneath their source counterparts. Despite its importance in classical scholarship, automated approaches to interlinear translation remain understudied. We evaluated neural interlinear translation from Ancient Greek to English and Polish using four transformer-based models: two Ancient Greek-specialized (GreTa and PhilTa) and two general-purpose multilingual models (mT5-base and mT5-large). Our approach introduces novel morphological embedding layers and evaluates text preprocessing and tag set selection across 144 experimental configurations using a word-aligned parallel corpus of the Greek New Testament. Results show that supplying morphological features through dedicated embedding layers significantly enhances translation quality, improving BLEU scores by 35% (44.67 → 60.40) for English and 38% (42.92 → 59.33) for Polish compared to baseline models. PhilTa achieves state-of-the-art performance for English, while mT5-large does so for Polish. Notably, PhilTa maintains stable performance using only 10% of the training data. Our findings challenge the assumption that modern neural architectures cannot benefit from explicit morphological annotations. While preprocessing strategies and tag set selection show minimal impact, the substantial gains from morphological embeddings demonstrate their value in low-resource scenarios.
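
A minimal sketch of what a dedicated morphological embedding layer can look like: the input representation is the sum of a subword embedding and an embedding of its morphological tag. The dimensions and tag inventory size are assumptions, not the paper's configuration.

    import torch
    import torch.nn as nn

    class MorphEmbedding(nn.Module):
        """Sum of a subword embedding and a morphological-tag embedding."""
        def __init__(self, vocab_size=32000, n_tags=600, d_model=512):
            super().__init__()
            self.tok = nn.Embedding(vocab_size, d_model)
            self.tag = nn.Embedding(n_tags, d_model)

        def forward(self, token_ids, tag_ids):
            # token_ids, tag_ids: (batch, seq_len); output feeds the encoder
            return self.tok(token_ids) + self.tag(tag_ids)

    emb = MorphEmbedding()
    out = emb(torch.randint(0, 32000, (2, 8)), torch.randint(0, 600, (2, 8)))
    print(out.shape)  # torch.Size([2, 8, 512])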

pdf bib
Language verY Rare for All
Ibrahim Merad | Amos Wolf | Ziad Mazzawi | Yannick Léo

In the quest to overcome language barriers, encoder-decoder models like NLLB have expanded machine translation to rare languages, with some models (e.g., NLLB 1.3B) even trainable on a single GPU. While general-purpose LLMs perform well in translation, open LLMs prove highly competitive when fine-tuned for specific tasks involving unknown corpora. We introduce LYRA (Language verY Rare for All), a novel approach that combines open LLM fine-tuning, retrieval-augmented generation (RAG), and transfer learning from related high-resource languages. To facilitate adoption, this study focuses exclusively on single-GPU training and on two-way translation between French and Monégasque, a rare language unsupported by existing translation tools due to limited corpus availability. Our results demonstrate LYRA’s effectiveness, consistently matching and frequently surpassing state-of-the-art encoder-decoder models in rare-language translation.

pdf bib
Improving LLM Abilities in Idiomatic Translation
Sundesh Donthi | Maximilian Spencer | Om B. Patel | Joon Young Doh | Eid Rodan | Kevin Zhu | Sean O’Brien

Translating idiomatic expressions remains a challenge for large language models (LLMs), as they often produce literal, semantically incorrect translations; for instance, directly converting “break a leg” into a nonsensical phrase in the target language. While external resources like IdiomKB can supply the figurative meaning and thus yield semantically accurate translations, this approach does not preserve the cultural and stylistic nuances that make idioms so distinctive. Our study focuses on idiomatic translation across multiple languages, including Chinese (ZH), Urdu (UR), and Hindi (HI). We propose two methods for improving idiomatic translation fidelity: a Semantic Idiom Alignment (SIA) approach that uses pre-trained sentence embeddings to identify target-language idioms, and a Language-Model-based Idiom Alignment (LIA) approach that prompts an LLM to suggest appropriate idiom counterparts. Human evaluations across multiple language pairs show that SIA better preserves idiomatic style. To support this work, we introduce idiom datasets in low-resource languages (Urdu and Hindi). Our results indicate that aligning idioms at the semantic level can improve cross-lingual style preservation and cultural authenticity.
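
A sketch of the SIA idea under stated assumptions: embed the figurative meaning of the source idiom with a multilingual sentence encoder and retrieve the nearest target-language idiom by cosine similarity over glosses. The encoder choice and the tiny two-entry lexicon are purely illustrative.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("sentence-transformers/LaBSE")  # illustrative choice
    hindi_idioms = {  # hypothetical lexicon: idiom -> English gloss
        "daal mein kuch kala hona": "something is suspicious",
        "naach na jaane aangan tedha": "blaming circumstances for one's own failure",
    }

    def align_idiom(figurative_meaning):
        """Return the target idiom whose gloss is closest to the source meaning."""
        glosses = list(hindi_idioms.values())
        sims = util.cos_sim(
            model.encode(figurative_meaning, convert_to_tensor=True),
            model.encode(glosses, convert_to_tensor=True))[0]
        return list(hindi_idioms)[int(sims.argmax())]

    print(align_idiom("there is something fishy going on"))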

pdf bib
A Comparative Study of Static and Contextual Embeddings for Analyzing Semantic Changes in Medieval Latin Charters
Yifan Liu | Gelila Tilahun | Xinxiang Gao | Qianfeng Wen | Michael Gervers

The Norman Conquest of 1066 C.E. brought profound transformations to England’s administrative, societal, and linguistic practices. The DEEDS (Documents of Early England Data Set) database offers a unique opportunity to explore these changes by examining shifts in word meanings within a vast collection of Medieval Latin charters. While computational linguistics typically relies on vector representations of words like static and contextual embeddings to analyze semantic changes, existing embeddings for scarce and historical Medieval Latin are limited and may not be well-suited for this task. This paper presents the first computational analysis of semantic change pre- and post-Norman Conquest and the first systematic comparison of static and contextual embeddings in a scarce historical data set. Our findings confirm that, consistent with existing studies, contextual embeddings outperform static word embeddings in capturing semantic change within a scarce historical corpus.

pdf bib
Bridging Literacy Gaps in African Informal Business Management with Low-Resource Conversational Agents
Maimouna Ouattara | Abdoul Kader Kaboré | Jacques Klein | Tegawendé F. Bissyandé

Position paper: In many African countries, the informal business sector represents the backbone of the economy, providing essential livelihoods and opportunities where formal employment is limited. However, despite the growing adoption of digital tools, entrepreneurs in this sector often face significant challenges due to limited literacy and language barriers. These barriers not only limit accessibility but also increase the risk of fraud and financial insecurity. This position paper explores the potential of conversational agents (CAs) adapted to low-resource languages (LRLs), focusing specifically on Mooré, a language widely spoken in Burkina Faso. By enabling natural language interactions in local languages, AI-driven conversational agents offer a promising solution for enabling informal traders to manage their financial transactions independently, thus promoting greater autonomy and security in business, while providing a step towards the formalization of their businesses. Our study examines the main challenges in developing AI for African languages, including data scarcity and linguistic diversity, and reviews viable strategies for addressing them, such as cross-lingual transfer learning and data augmentation techniques.

pdf bib
Social Bias in Large Language Models For Bangla: An Empirical Study on Gender and Religious Bias
Jayanta Sadhu | Maneesha Rani Saha | Rifat Shahriyar

The rapid growth of Large Language Models (LLMs) has put forward the study of biases as a crucial field. It is important to assess the influence of different types of biases embedded in LLMs to ensure fair use in sensitive fields. Although there has been extensive work on bias assessment in English, such efforts are scarce for a major language like Bangla. In this work, we examine two types of social biases in LLM-generated outputs for the Bangla language. Our main contributions are: (1) bias studies on two different social biases for Bangla, (2) a curated dataset for bias measurement benchmarking, and (3) tests of two different probing techniques for bias detection in the context of Bangla. To the best of our knowledge, this is the first work on bias assessment of LLMs for Bangla. All our code and resources are publicly available to support the progress of bias-related research in Bangla NLP.

pdf bib
Extracting General-use Transformers for Low-resource Languages via Knowledge Distillation
Jan Christian Blaise Cruz

In this paper, we propose the use of simple knowledge distillation to produce smaller and more efficient single-language transformers from Massively Multilingual Transformers (MMTs), alleviating the tradeoffs associated with using such models in low-resource settings. Using Tagalog as a case study, we show that these smaller single-language models perform on par with strong baselines on a variety of benchmark tasks in a much more efficient manner. Furthermore, we investigate additional steps during the distillation process that improve the soft supervision of the target language, and provide a number of analyses and ablations to show the efficacy of the proposed method.
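
The paper's mention of soft supervision suggests the classic soft-label distillation objective; a generic sketch follows, with the temperature and mixing weight as assumptions rather than the paper's settings.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
        """Soft-label KD: KL(teacher || student) at temperature T, plus hard CE."""
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)  # rescale gradients back to the hard-label magnitude
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1 - alpha) * hard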

pdf bib
Beyond Data Quantity: Key Factors Driving Performance in Multilingual Language Models
Sina Bagheri Nezhad | Ameeta Agrawal | Rhitabrat Pokharel

Multilingual language models (MLLMs) are crucial for handling text across various languages, yet they often show performance disparities due to differences in resource availability and linguistic characteristics. While the impact of pre-train data percentage and model size on performance is well-known, our study reveals additional critical factors that significantly influence MLLM effectiveness. Analyzing a wide range of features, including geographical, linguistic, and resource-related aspects, we focus on the SIB-200 dataset for classification and the Flores-200 dataset for machine translation, using regression models and SHAP values across 204 languages. Our findings identify token similarity and country similarity as pivotal factors, alongside pre-train data and model size, in enhancing model performance. Token similarity facilitates cross-lingual transfer, while country similarity highlights the importance of shared cultural and linguistic contexts. These insights offer valuable guidance for developing more equitable and effective multilingual language models, particularly for underrepresented languages.

pdf bib
BabyLMs for isiXhosa: Data-Efficient Language Modelling in a Low-Resource Context
Alexis Matzopoulos | Charl Hendriks | Hishaam Mahomed | Francois Meyer

The BabyLM challenge called on participants to develop sample-efficient language models. Submissions were pretrained on a fixed English corpus, limited to the number of words children are exposed to during development (<100m). The challenge produced new architectures for data-efficient language modelling, outperforming models trained on trillions of words. This is promising for low-resource languages, where available corpora are limited to far less than 100m words. In this paper, we explore the potential of BabyLMs for low-resource languages, using the isiXhosa language as a case study. We pretrain two BabyLM architectures, ELC-BERT and MLSM, on an isiXhosa corpus. They outperform a vanilla pretrained model on POS tagging and NER, achieving notable gains (+3.2 F1) for the latter. In some instances, the BabyLMs even outperform XLM-R. Our findings show that data-efficient models are viable for low-resource languages, but highlight the continued importance of, and lack of, high-quality pretraining data. Finally, we visually analyse how BabyLM architectures encode isiXhosa.

pdf bib
Mapping Cross-Lingual Sentence Representations for Low-Resource Language Pairs Using Pre-trained Language Models
Andreea Ioana Tudor | Tsegaye Misikir Tashu

In this work, we explore different linear mapping techniques to learn cross-lingual document representations from pre-trained multilingual large language models for low-resource languages. Three mapping techniques, namely Linear Concept Approximation (LCA), Linear Concept Compression (LCC), and Neural Concept Approximation (NCA), were applied to embeddings extracted from four multilingual language models: mBERT, mT5, XLM-R, and ErnieM. The interlingual representations were created by mapping the monolingual representations extracted from the multilingual language models. The experimental results showed that LCA and LCC significantly outperform NCA, with models like ErnieM achieving the highest alignment quality. Language pairs exhibit variable performance, influenced by linguistic similarity and data availability, with the Amharic-English pair yielding particularly high scores. The results show the utility of LCA and LCC in enabling cross-lingual tasks for low-resource languages.
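
As a rough illustration of what a linear mapping in this spirit looks like, the sketch below fits a least-squares map from source-language embeddings of parallel documents onto their target-language counterparts; the exact LCA/LCC formulations in the paper may differ.

    import numpy as np

    def fit_linear_map(X_src, Y_tgt):
        """Least-squares W so that X_src @ W approximates Y_tgt (parallel pairs)."""
        W, *_ = np.linalg.lstsq(X_src, Y_tgt, rcond=None)
        return W  # shape (dim_src, dim_tgt)

    rng = np.random.default_rng(0)
    X, Y = rng.normal(size=(100, 64)), rng.normal(size=(100, 64))
    W = fit_linear_map(X, Y)
    print((X[:1] @ W).shape)  # (1, 64): a source document mapped into target space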

pdf bib
How to age BERT Well: Continuous Training for Historical Language Adaptation
Anika Harju | Rob van der Goot

As the use of computational tools to digitize historical archives increases, automatic annotation challenges persist due to the distinct linguistic and morphological features of historical languages like Old English (OE). Existing tools struggle with historical language varieties due to insufficient training data. Previous research has focused on adapting pre-trained language models to new languages or domains but has rarely explored the modeling of language variety across time. Hence, we investigate the effectiveness of continuous language model training for adapting language models to OE on domain-specific data. We retrain a modern English (EN) BERT model and a multilingual (ML) BERT model for OE and use part-of-speech (POS) tagging for downstream evaluation. Results confirm that continuous pre-training substantially improves performance, advancing the potential to understand the unique grammatical structures of historical OE archives. More concretely, EN BERT initially outperformed ML BERT with an accuracy of 83% during the language modeling phase. However, on the POS tagging task, ML BERT surpassed EN BERT, achieving an accuracy of 94%, which suggests more effective adaptation to historical language varieties.
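
Continued pretraining of this kind is typically a short masked-language-modeling run on the historical corpus before downstream fine-tuning; a minimal sketch with Hugging Face transformers follows, where the corpus path and hyperparameters are placeholders.

    from datasets import load_dataset
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
    model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

    ds = load_dataset("text", data_files={"train": "old_english_corpus.txt"})
    ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=128),
                batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments("bert-oe", num_train_epochs=3,
                               per_device_train_batch_size=32),
        train_dataset=ds["train"],
        data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
    )
    trainer.train()  # then fine-tune the adapted model on OE POS tagging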

pdf bib
Exploiting Task Reversibility of DRS Parsing and Generation: Challenges and Insights from a Multi-lingual Perspective
Muhammad Saad Amin | Luca Anselma | Alessandro Mazzei

Semantic parsing and text generation exhibit reversible properties when utilizing Discourse Representation Structures (DRS). However, both processes—text-to-DRS parsing and DRS-to-text generation—are susceptible to errors. In this paper, we exploit the reversible nature of DRS to explore both error propagation, which is commonly seen in pipeline methods, and the less frequently studied potential for error correction. We investigate two pipeline approaches: Parse-Generate-Parse (PGP) and Generate-Parse-Generate (GPG), utilizing pre-trained language models where the output of one model becomes the input for the next. Our evaluation uses the Parallel Meaning Bank dataset, focusing on Urdu as a low-resource language, Italian as a mid-resource language, and English serving as a high-resource baseline. Our analysis highlights that while pipelines are theoretically suited for error correction, they more often propagate errors, with Urdu exhibiting the greatest sensitivity, Italian showing a moderate effect, and English demonstrating the highest stability. This variation highlights the unique challenges faced by low-resource languages in semantic processing tasks. Further, our findings suggest that these pipeline methods support the development of more linguistically balanced datasets, enabling a comprehensive assessment across factors like sentence structure, length, type, polarity, and voice. Our cross-linguistic analysis provides valuable insights into the behavior of DRS processing in low-resource contexts, demonstrating both the potential and limitations of reversible pipeline approaches.

pdf bib
BBPOS: BERT-based Part-of-Speech Tagging for Uzbek
Latofat Bobojonova | Arofat Akhundjanova | Phil Sidney Ostheimer | Sophie Fellenz

This paper advances NLP research for the low-resource Uzbek language by evaluating two previously untested monolingual Uzbek BERT models on the part-of-speech (POS) tagging task and introducing the first publicly available UPOS-tagged benchmark dataset for Uzbek. Our fine-tuned models achieve 91% average accuracy, outperforming the baseline multilingual BERT as well as the rule-based tagger. Notably, these models capture intermediate POS changes through affixes and demonstrate context sensitivity, unlike existing rule-based taggers.

pdf bib
When Every Token Counts: Optimal Segmentation for Low-Resource Language Models
Vikrant Dewangan | Bharath Raj S | Garvit Suri | Raghav Sonavane

Traditional greedy tokenization methods have been a critical step in Natural Language Processing (NLP), influencing how text is converted into tokens and directly impacting model performance. While subword tokenizers like Byte-Pair Encoding (BPE) are widely used, questions remain about their optimality across model scales and languages. In this work, we demonstrate through extensive experiments that an optimal BPE configuration significantly reduces token count compared to greedy segmentation, yielding improvements in token-saving percentages and performance benefits, particularly for smaller models. We evaluate tokenization performance across various intrinsic and extrinsic tasks, including generation and classification. Our findings suggest that compression-optimized tokenization strategies could provide substantial advantages for multilingual and low-resource (LR) language applications, highlighting a promising direction for further research and inclusive NLP.
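
Greedy subword segmenters commit to pieces left to right, which can waste tokens; given a fixed vocabulary, a small dynamic program finds the fewest-token segmentation. The sketch below illustrates the idea (it searches over a plain vocabulary rather than BPE merge ranks, which is a simplification of the paper's setting).

    def optimal_segment(word, vocab, max_piece=16):
        """Fewest-token segmentation of `word` using pieces from `vocab` (DP)."""
        n = len(word)
        best = [None] * (n + 1)  # best[i] = fewest tokens covering word[:i]
        best[0] = 0
        back = [0] * (n + 1)
        for i in range(1, n + 1):
            for j in range(max(0, i - max_piece), i):
                if best[j] is not None and word[j:i] in vocab:
                    if best[i] is None or best[j] + 1 < best[i]:
                        best[i], back[i] = best[j] + 1, j
        if best[n] is None:
            return None  # not coverable with this vocabulary
        pieces, i = [], n
        while i:
            pieces.append(word[back[i]:i])
            i = back[i]
        return pieces[::-1]

    vocab = {"un", "believ", "able", "u", "n", "b", "e", "l", "i", "v", "a"}
    print(optimal_segment("unbelievable", vocab))  # ['un', 'believ', 'able']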

pdf bib
Recent Advancements and Challenges of Turkic Central Asian Language Processing
Yana Veitsman | Mareike Hartmann

Research in NLP for Central Asian Turkic languages - Kazakh, Uzbek, Kyrgyz, and Turkmen - faces typical low-resource language challenges such as data scarcity, limited linguistic resources, and underdeveloped technology. However, recent advancements have included the collection of language-specific datasets and the development of models for downstream tasks. This paper therefore aims to summarize recent progress and identify future research directions. It provides a high-level overview of each language’s linguistic features, the current technology landscape, the application of transfer learning from higher-resource languages, and the availability of labeled and unlabeled data. By outlining the current state, we hope to inspire and facilitate future research.

pdf bib
CaLQuest.PT: Towards the Collection and Evaluation of Natural Causal Ladder Questions in Portuguese for AI Agents
Uriel Anderson Lasheras | Vladia Pinheiro

Large Language Models (LLMs) are increasingly central to the development of generative AI across diverse fields. While some anticipate these models may mark a step toward artificial general intelligence, their ability to handle complex causal reasoning remains unproven. Causal reasoning, particularly at Pearl’s interventional and counterfactual levels, is essential for true general intelligence. In this work, we introduce CaLQuest.PT, a dataset of over 8,000 natural causal questions in Portuguese, collected from real human interactions. Built upon a novel three-axis taxonomy, CaLQuest.PT categorizes questions by causal intent, action requirements, and the level of causal reasoning needed (associational, interventional, or counterfactual). Our findings from evaluating CaLQuest.PT’s seed questions with GPT-4o reveal that this LLM faces challenges in handling interventional and relation-seeking causal queries. These results suggest limitations in using GPT-4o for extending causal question annotations and highlight the need for improved LLM strategies in causal reasoning. CaLQuest.PT provides a foundation for advancing LLM capabilities in causal understanding, particularly for the Portuguese-speaking world.

pdf bib
PersianMCQ-Instruct: A Comprehensive Resource for Generating Multiple-Choice Questions in Persian
Kamyar Zeinalipour | Neda Jamshidi | Fahimeh Akbari | Marco Maggini | Monica Bianchini | Marco Gori

We present PersianMCQ-Instruct, a comprehensive resource that includes a dataset and advanced models for generating multiple-choice questions (MCQs) in standard Iranian Persian, a low-resource language spoken by over 80 million people. This resource features three state-of-the-art models for Persian MCQ generation: PMCQ-Gemma2-9b, PMCQ-Llama3.1-8b, and PMCQ-Mistral-7B. Inspired by the Agent Instruct framework and GPT-4o, we created the dataset by curating over 4,000 unique Persian Wikipedia pages, resulting in three MCQs per page and a total of over 12,000 questions. To ensure the quality of this dataset, we conducted human evaluations and model fine-tuning, both of which demonstrated significant performance improvements in Persian MCQ generation. The dataset and models are publicly available, offering valuable tools for researchers and educators, with particular benefits for advancing Persian-language educational technology.

pdf bib
Stop Jostling: Adaptive Negative Sampling Reduces the Marginalization of Low-Resource Language Tokens by Cross-Entropy Loss
Galim Turumtaev

Neural language models often struggle with low-resource languages due to the limited availability of training data, making tokens from these languages rare in the training set. This paper addresses a specific challenge during training: rare tokens are disproportionately affected by marginalization, which prevents them from learning effectively. We propose a thresholding technique that reduces the impact of this marginalization, allowing rare tokens to benefit from more meaningful alignment. Through experiments with a character-level language model, we demonstrate that this method significantly improves performance on low-resource language validation data. This work is the first to show how negative sampling can be applied to improve the representation of rare tokens by limiting the harmful influence of excessive marginalization, offering a new approach to enhancing language model performance for underrepresented languages.
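
A loose sketch of one way to read the thresholding idea: negative logits that already sit far below the gold logit are dropped from the softmax normalizer, so rare tokens are no longer pushed down by gradients they do not need. The margin and masking scheme here are our assumptions, not the paper's exact formulation.

    import torch

    def thresholded_ce(logits, target, margin=5.0):
        """Cross-entropy whose normaliser keeps only still-competing logits."""
        gold = logits.gather(1, target.unsqueeze(1))      # (batch, 1)
        keep = logits > (gold - margin)                   # gold is always kept
        masked = logits.masked_fill(~keep, float("-inf"))
        return -(gold.squeeze(1) - masked.logsumexp(dim=1)).mean()

    logits = torch.randn(4, 1000)
    target = torch.randint(0, 1000, (4,))
    print(thresholded_ce(logits, target))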

pdf bib
Towards Inclusive Arabic LLMs: A Culturally Aligned Benchmark in Arabic Large Language Model Evaluation
Omer Nacar | Serry Taiseer Sibaee | Samar Ahmed | Safa Ben Atitallah | Adel Ammar | Yasser Alhabashi | Abdulrahman S. Al-Batati | Arwa Alsehibani | Nour Qandos | Omar Elshehy | Mohamed Abdelkader | Anis Koubaa

Arabic Large Language Models are usually evaluated using Western-centric benchmarks that overlook essential cultural contexts, making them less effective and culturally misaligned for Arabic-speaking communities. This study addresses this gap by evaluating the Arabic Massive Multitask Language Understanding (MMLU) Benchmark to assess its cultural alignment and relevance for Arabic Large Language Models (LLMs) across culturally sensitive topics. A team of eleven experts annotated over 2,500 questions, evaluating them based on fluency, adequacy, cultural appropriateness, bias detection, religious sensitivity, and adherence to social norms. Through human assessment, the study highlights significant cultural misalignments and biases, particularly in sensitive areas like religion and morality. In response to these findings, we propose annotation guidelines and integrate culturally enriched data sources to enhance the benchmark’s reliability and relevance. The research highlights the importance of cultural sensitivity in evaluating inclusive Arabic LLMs, fostering more widely accepted LLMs for Arabic-speaking communities.

pdf bib
Controlled Evaluation of Syntactic Knowledge in Multilingual Language Models
Daria Kryvosheieva | Roger Levy

Language models (LMs) are capable of acquiring elements of human-like syntactic knowledge. Targeted syntactic evaluation tests have been employed to measure how well they form generalizations about syntactic phenomena in high-resource languages such as English. However, we still lack a thorough understanding of LMs’ capacity for syntactic generalizations in low-resource languages, which are responsible for much of the diversity of syntactic patterns worldwide. In this study, we develop targeted syntactic evaluation tests for three low-resource languages (Basque, Hindi, and Swahili) and use them to evaluate five families of open-access multilingual Transformer LMs. We find that some syntactic tasks prove relatively easy for LMs while others (agreement in sentences containing indirect objects in Basque, agreement across a prepositional phrase in Swahili) are challenging. We additionally uncover issues with publicly available Transformers, including a bias toward the habitual aspect in Hindi in multilingual BERT and underperformance compared to similar-sized models in XGLM-4.5B.
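
Targeted syntactic evaluation of this kind usually reduces to comparing sentence log-probabilities over minimal pairs; a standard sketch with a causal multilingual LM follows. The XGLM checkpoint is one of the evaluated families, but the English pair shown is an illustration, not an item from the paper's test sets.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("facebook/xglm-564M")
    lm = AutoModelForCausalLM.from_pretrained("facebook/xglm-564M")

    def sentence_logprob(sentence):
        ids = tok(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            nll = lm(ids, labels=ids).loss  # mean NLL over predicted tokens
        return -nll.item() * (ids.shape[1] - 1)  # back to a summed log-prob

    good = "The keys to the cabinet are on the table."
    bad = "The keys to the cabinet is on the table."
    print(sentence_logprob(good) > sentence_logprob(bad))  # expect True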

pdf bib
Evaluating Large Language Models for In-Context Learning of Linguistic Patterns In Unseen Low Resource Languages
Hongpu Zhu | Yuqi Liang | Wenjing Xu | Hongzhi Xu

This paper investigates the ability of Large Language Models (LLMs) to capture linguistic patterns from unseen languages and apply them to translation between those languages and English within an in-context learning framework. Inspired by the International Linguistics Olympiad (IOL), we create test data consisting of translation puzzles between 40 low-resource languages and English. We test the LLMs with two different strategies: direct prompting and step-by-step prompting. In the latter, the puzzles are manually decomposed into intermediate steps to allow LLMs to learn and apply linguistic rules incrementally. The results show that this strategy can significantly improve the performance of LLMs, achieving results comparable or slightly superior to humans when translating the unseen languages into English. However, LLMs still struggle with translating English into the unseen languages, typically those with complex syntactic rules. We further observe that LLMs struggle more with languages exhibiting object-subject and noun-adjective word order than with others, reflecting the potential impact of the typological features of the languages in the training data.

pdf bib
Next-Level Cantonese-to-Mandarin Translation: Fine-Tuning and Post-Processing with LLMs
Yuqian Dai | Chun Fai Chan | Ying Ki Wong | Tsz Ho Pun

Large Language Models (LLMs) have improved performance across various natural language processing tasks. Despite these improvements, LLMs continue to face significant challenges, such as grammatical issues and code-switching to English, when applied to low-resource languages like Cantonese in Machine Translation (MT) scenarios. By addressing the unique linguistic and contextual challenges of Cantonese, we present a novel strategy to improve the understanding and translation capabilities of LLMs for Cantonese-to-Mandarin MT. Our strategy comprises three key components: (1) syntax and part-of-speech (POS) fine-tuning, where we use the Universal Dependencies (UD) corpus to fine-tune the LLM, focusing on the linguistic structures of Cantonese; (2) specialized Cantonese-to-Mandarin sentence pairs, collected from diverse sources such as Cantonese grammar textbooks and manually translated sentences across various domains, to expose the model to a wide range of linguistic contexts; (3) post-processing with additional LLMs, where we introduce additional LLMs to improve the initial translations, correcting Mandarin grammar and punctuation. Empirical evaluations on human-created test sets show that our proposed strategy improves translation performance and outperforms existing commercial translation models by at least 3 BLEU points. Our strategy also benefits other LLMs and the reversed translation direction, demonstrating its generalization and effectiveness.

pdf bib
When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages
Archchana Sindhujan | Diptesh Kanojia | Constantin Orasan | Shenbin Qian

This paper investigates the reference-less evaluation of machine translation for low-resource language pairs, known as quality estimation (QE). Segment-level QE is a challenging cross-lingual language understanding task that assigns a quality score (0-100) to the translated output. We comprehensively evaluate large language models (LLMs) in zero/few-shot scenarios and perform instruction fine-tuning using a novel prompt based on annotation guidelines. Our results indicate that prompt-based approaches are outperformed by encoder-based fine-tuned QE models. Our error analysis reveals tokenization issues, along with errors due to transliteration and named entities, and argues for refinement in LLM pre-training for cross-lingual tasks. We publicly release the data and trained models for further research.

pdf bib
Does Machine Translation Impact Offensive Language Identification? The Case of Indo-Aryan Languages
Alphaeus Dmonte | Shrey Satapara | Rehab Alsudais | Tharindu Ranasinghe | Marcos Zampieri

The accessibility of social media platforms can be improved with the use of machine translation (MT). Non-standard features present in user-generated social media content, such as hashtags, emojis, and alternative spellings, can lead to mistranslations by MT systems. In this paper, we investigate the impact of MT on offensive language identification in Indo-Aryan languages. We use both original and MT datasets to evaluate the performance of various offensive language models. Our evaluation indicates that offensive language identification models achieve superior performance on original data than on MT data, and that models trained on MT data identify offensive language in MT data more precisely than models trained on original data.

pdf bib
IsiZulu noun classification based on replicating the ensemble approach for Runyankore
Zola Mahlaza | C. Maria Keet | Imaan Sayed | Alexander Van Der Leek

A noun’s class is a crucial component in NLP because it governs agreement across the sentence in Niger Congo B (NCB) languages, among others. The phenomenon is ill-documented in most NCB languages, or documented only in a non-reusable format, such as a printed dictionary subject to copyright restrictions. A promising approach by Byamugisha (2022) used a data-driven method for Runyankore that combined syntax and semantics. The code and data are inaccessible, however, and it remains to be seen whether the approach is suitable for other NCB languages. We aimed to reproduce Byamugisha’s experiment, but for isiZulu. We conducted this as two independent experiments, so that we could also subject it to a meta-analysis. Results showed that it was reproducible only in part, mainly due to imprecision in the original description and the current impossibility of generating the same kind of source dataset from an existing grammar. The different choices made in attempting to reproduce the pipeline, as well as differences in the choice of training and test data, had a large effect on the eventual accuracy of noun class disambiguation, but could produce accuracies in the same range as for Runyankore: 80-85%.

pdf bib
From Arabic Text to Puzzles: LLM-Driven Development of Arabic Educational Crosswords
Kamyar Zeinalipour | Moahmmad Saad | Marco Maggini | Marco Gori

We present an Arabic crossword puzzle generator that creates puzzles from a given text, utilizing advanced language models such as GPT-4-Turbo, GPT-3.5-Turbo, and Llama3-8B-Instruct. Specifically developed for educational purposes, this innovative generator leverages a meticulously compiled dataset named Arabic-Clue-Instruct, with over 50,000 entries encompassing text, answers, clues, and categories. This dataset is intricately designed to aid in the generation of pertinent clues linked to specific texts and keywords within defined categories. This project addresses the scarcity of advanced educational tools tailored for the Arabic language, promoting enhanced language learning and cognitive development. By providing a culturally and linguistically relevant tool, our objective is to make learning more engaging and effective through gamification and interactivity. Integrating state-of-the-art artificial intelligence with contemporary learning methodologies, this tool can generate crossword puzzles from any given educational text, thereby facilitating an interactive and enjoyable learning experience. This tool not only advances educational paradigms but also sets a new standard in interactive and cognitive learning technologies.

up

pdf (full)
bib (full)
Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025)

pdf bib
Proceedings of the Eighth Workshop on Technologies for Machine Translation of Low-Resource Languages (LoResMT 2025)
Atul Kr. Ojha | Chao-hong Liu | Ekaterina Vylomova | Flammie Pirinen | Jonathan Washington | Nathaniel Oco | Xiaobing Zhao

pdf bib
Comparative Evaluation of Machine Translation Models Using Human-Translated Social Media Posts as References: Human-Translated Datasets
Shareefa Ahmed Al Amer | Mark G. Lee | Phillip Smith

Machine translation (MT) of social media text presents unique challenges due to its informal nature, linguistic variations, and the rapid evolution of language trends. In this paper, we present a human-translated dataset from English into Arabic, Italian, and Spanish, and a human-translated dataset from Arabic into Modern Standard Arabic (MSA) and English. We also perform a comprehensive analysis of three publicly accessible MT models using the human translations as references. We investigate the impact of social media informality on translation quality by translating the MSA version of the text and comparing BLEU and METEOR scores with those of the direct translation of the original social media posts. Our findings reveal that MarianMT provides the closest translations to human references for Italian and Spanish among the three models, with METEOR scores of 0.583 and 0.640, respectively, while Google Translate provides the closest translations for Arabic, with a METEOR score of 0.354. By comparing the translation of the original social media posts with that of the MSA version, we confirm that the informality of social media text significantly impacts translation quality, with an increase of 12 percentage points in METEOR scores over the original posts. Additionally, we investigate inter-model alignment and the degree to which the outputs of these MT models align.

pdf bib
Enhanced Zero-Shot Machine Translation via Fixed Prefix Pair Bootstrapping
Van-Hien Tran | Masao Utiyama

Zero-shot in-context learning allows large language models (LLMs) to perform tasks using only provided instructions. However, pre-trained LLMs often face calibration issues in zero-shot scenarios, leading to challenges such as hallucinations and off-target translations that compromise output quality, particularly in machine translation (MT). This paper introduces a new method to improve zero-shot MT using fixed prefix pair bootstrapping. By initializing translations with an accurate bilingual prefix pair at the start of both source and target sentences, this approach effectively guides the model to generate precise target-language outputs. Extensive evaluations across four model architectures and multiple translation directions demonstrate significant and consistent improvements, showcasing the potential of this straightforward strategy to enhance zero-shot MT performance.
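
A sketch of how such a prompt might be assembled, with the template wording and the example prefix pair as illustrative assumptions: the known-correct bilingual prefix is prepended to both sides, and generation simply continues the target-side prefix.

    def bootstrap_prompt(src_sentence, src_prefix, tgt_prefix,
                         src="English", tgt="German"):
        """Prepend a known-correct bilingual prefix pair to both sides."""
        return (
            f"Translate from {src} to {tgt}.\n"
            f"{src}: {src_prefix}{src_sentence}\n"
            f"{tgt}: {tgt_prefix}"  # the LLM continues from this fixed prefix
        )

    print(bootstrap_prompt("The weather is nice today.",
                           src_prefix="I say: ", tgt_prefix="Ich sage: "))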

pdf bib
UTER: Capturing the Human Touch in Evaluating Morphologically Rich and Low-Resource Languages
Samy Ouzerrout

We introduce UTER, a novel automatic translation evaluation metric specifically designed for morphologically complex languages. Unlike traditional TER approaches, UTER incorporates a reordering algorithm and leverages the Sørensen-Dice similarity measure to better account for morphological variations. Tested on morphologically rich and low-resource languages from the WMT22 dataset, such as Finnish, Estonian, Kazakh, and Xhosa, UTER delivers results that align more closely with human direct assessments (DA) and outperforms benchmark metrics, including chrF and METEOR. Furthermore, its effectiveness has also been demonstrated on languages with complex writing systems, such as Chinese and Japanese, showcasing its versatility and robustness.
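
The Sørensen-Dice measure itself is easy to state over character bigrams; the sketch below shows why it credits near-matching morphological variants that exact word matching would score as zero.

    def bigrams(s):
        return [s[i:i + 2] for i in range(len(s) - 1)]

    def dice(a, b):
        """Sørensen-Dice similarity over character bigrams."""
        ga, gb = bigrams(a), bigrams(b)
        if not ga or not gb:
            return float(a == b)
        overlap = sum(min(ga.count(g), gb.count(g)) for g in set(ga))
        return 2 * overlap / (len(ga) + len(gb))

    # Finnish inflectional variants of "talo" (house) still score high:
    print(round(dice("talossa", "talossani"), 2))  # 0.86; exact match would be 1.0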

pdf bib
From Text to Multi-Modal: Advancing Low-Resource-Language Translation through Synthetic Data Generation and Cross-Modal Alignments
Bushi Xiao | Qian Shen | Daisy Zhe Wang

In this study, we propose a novel paradigm for multi-modal, low-resource-language dataset generation that eliminates dependency on existing parallel multi-modal datasets. Leveraging advances in large image-generation models, we introduce a systematic pipeline that transforms text-only parallel corpora into rich multi-modal translation datasets. We then validate the generated content through human evaluation. We design and implement a new MMT model framework suitable for our newly generated dataset. The model contains a verification mechanism with a large language model to ensure consistency between visual content and textual translations. Experimental results across four African low-resource languages with training corpora of fewer than 10k examples demonstrate significant improvements over NLLB baselines, with average gains of up to 9.8% in BLEU score and 4.3% in METEOR score. Our method shows particular effectiveness in correctly translating concrete objects and contextual elements, suggesting its potential for improving low-resource machine translation through visual grounding.

pdf bib
Wenzhou Dialect Speech to Mandarin Text Conversion
Zhipeng Gao | Akihiro Tamura | Tsuneo Kato

The Wenzhou dialect is a Chinese dialect that is significantly distinct from Mandarin, the official language of China. It is among the most complex Chinese dialects and is nearly incomprehensible to people from regions such as Northern China, thereby creating substantial communication barriers. Therefore, the conversion between the Wenzhou dialect and Mandarin is essential to facilitate communication between Wenzhou dialect speakers and those from other Chinese regions. However, as a low-resource language, the Wenzhou dialect lacks publicly available datasets, and such conversion technologies have not been extensively researched. Thus, in this study, we create a parallel dataset containing Wenzhou dialect speech and the corresponding Mandarin text and build benchmark models for Wenzhou dialect speech-to-Mandarin text conversion. In particular, we fine-tune two self-supervised learning-based pretrained models, that is, TeleSpeech-ASR1.0 and Wav2Vec2-XLS-R, with our training dataset and report their performance on our test dataset as baselines for future research.

pdf bib
Fostering Digital Inclusion for Low-Resource Nigerian Languages: A Case Study of Igbo and Nigerian Pidgin
Ebelechukwu Nwafor | Minh Phuc Nguyen

Current state-of-the-art large language models (LLMs) like GPT-4 perform exceptionally well in language translation tasks for high-resource languages, such as English, but often fail to achieve high accuracy for low-resource African languages such as Igbo and Nigerian Pidgin, two native languages of Nigeria. This study addresses the need for linguistic diversity in Artificial Intelligence (AI) by creating benchmark datasets for Igbo-English and Nigerian Pidgin-English language translation tasks. The datasets are curated from reputable online sources and meticulously annotated by crowd-sourced native-speaking human annotators. Using the datasets, we evaluate the translation abilities of GPT-based models alongside other state-of-the-art translation models specifically designed for low-resource languages. Our results demonstrate that the current state-of-the-art models outperform GPT-based models in these translation tasks. In addition, these datasets can significantly enhance LLM performance in these translation tasks, marking a step toward reducing linguistic bias and promoting more inclusive AI models.

pdf bib
Low-resource Machine Translation: what for? who for? An observational study on a dedicated Tetun language translation service
Raphael Merx | Adérito José Guterres Correia | Hanna Suominen | Ekaterina Vylomova

Low-resource machine translation (MT) presents a diversity of community needs and application challenges that remain poorly understood. To complement surveys and focus groups, which tend to rely on small samples of respondents, we propose an observational study on actual usage patterns of a specialized MT service for the Tetun language, which is the lingua franca in Timor-Leste. Our analysis of 100,000 translation requests reveals patterns that challenge assumptions based on existing corpora. We find that users, many of them students on mobile devices, typically translate text from a high-resource language into Tetun across diverse domains including science, healthcare, and daily life. This contrasts sharply with available Tetun corpora, which are dominated by news articles covering government and social issues. Our results suggest that MT systems for institutionalized minority languages like Tetun should prioritize accuracy on domains relevant to educational contexts, in the high-resource to low-resource direction. More broadly, this study demonstrates how observational analysis can inform low-resource language technology development, by grounding research in practical community needs.

pdf bib
Jamo-Level Subword Tokenization in Low-Resource Korean Machine Translation
Junyoung Lee | Marco Cognetta | Sangwhan Moon | Naoaki Okazaki

Subword tokenization, where text is represented in an intermediate form between full words and characters, is ubiquitous in modern NLP due to its ability to represent any input sentence with a small vocabulary. However, for Korean, where there are 11,172 base characters (*syllables*) in its alphabet, it is difficult to have a vocabulary large enough to succinctly encode text while fitting within parameter-budget constraints. This motivates us to explore an alternative representation for Korean which relies on the decompositional nature of Korean syllables: a syllable can be uniquely decomposed into a sequence of two or three subcharacters (*jamo*), of which there are only 68. Using jamo as the basis for subword tokenization (e.g., byte-pair encoding) leads to shorter tokenized sequences with fewer vocabulary parameters, exposes the model to sub-syllable-level morphological information, and increases the amount of augmentation gained from subword regularization. We evaluate jamo-level subword tokenization on several Korean translation tasks and find that jamo-level subword models consistently outperform syllable- and byte-level models in low-resource and restricted-vocabulary settings.
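
The decomposition this relies on is pure Unicode arithmetic over the Hangul Syllables block (U+AC00-U+D7A3): a syllable's code point factors into a lead consonant, a vowel, and an optional tail. A minimal sketch:

    # Hangul syllable code = 0xAC00 + (lead * 21 + vowel) * 28 + tail
    LEADS = [chr(0x1100 + i) for i in range(19)]   # initial consonants
    VOWELS = [chr(0x1161 + i) for i in range(21)]  # medial vowels
    TAILS = [""] + [chr(0x11A8 + i) for i in range(27)]  # optional finals

    def to_jamo(syllable):
        code = ord(syllable) - 0xAC00
        assert 0 <= code < 11172, "not a precomposed Hangul syllable"
        lead, rest = divmod(code, 21 * 28)
        vowel, tail = divmod(rest, 28)
        return LEADS[lead] + VOWELS[vowel] + TAILS[tail]

    print([to_jamo(s) for s in "한국"])  # each syllable becomes 2-3 jamo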

pdf bib
Beyond English: The Impact of Prompt Translation Strategies across Languages and Tasks in Multilingual LLMs
Itai Mondshine | Tzuf Paz-Argaman | Reut Tsarfaty

Despite advances in the multilingual capabilities of Large Language Models (LLMs) across diverse tasks, English remains the dominant language for LLM research and development. This has led to the widespread practice of pre-translation when working with other languages, i.e., translating the task prompt into English before inference. Selective pre-translation, a more surgical approach, focuses on translating specific prompt components. However, its current use is sporadic and lacks a systematic research foundation. Consequently, the optimal pre-translation strategy for various multilingual settings and tasks remains unclear. In this work, we aim to uncover the optimal setup for pre-translation by systematically assessing its use. Specifically, we view the prompt as a modular entity, composed of four functional parts: instruction, context, examples, and output, each of which can either be translated or left in the source language. We evaluate pre-translation strategies across 35 languages, covering both low- and high-resource languages, on various tasks including Question Answering (QA), Natural Language Inference (NLI), Named Entity Recognition (NER), and Abstractive Summarization. Our experiments show the impact of factors such as similarity to English, translation quality, and the size of pre-trained data on model performance with pre-translation. We suggest practical guidelines for choosing optimal strategies in various multilingual settings.

pdf bib
ModeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models
Nathan Andrew Chi | Teodor Malchev | Riley Kong | Ryan Andrew Chi | Lucas Huang | Ethan A Chi | R. Thomas McCoy | Dragomir Radev

We introduce ModeLing, a novel benchmark of Linguistics Olympiad-style puzzles which tests few-shot reasoning in AI systems. Solving these puzzles necessitates inferring aspects of a language’s grammatical structure from a small number of examples. Such puzzles provide a natural testbed for language models, as they require compositional generalization and few-shot inductive reasoning. Consisting solely of new puzzles written specifically for this work, ModeLing has no risk of appearing in the training data of existing AI systems: this ameliorates the risk of data leakage, a potential confounder for many prior evaluations of reasoning. Evaluating several large open source language models and GPT on our benchmark, we observe non-negligible accuracy, demonstrating few-shot emergent reasoning ability which cannot merely be attributed to shallow memorization. However, imperfect model performance suggests that ModeLing can be used to measure further progress in linguistic reasoning.

pdf bib
Multilingual State Space Models for Structured Question Answering in Indic Languages
Arpita Vats | Rahul Raja | Mrinal Mathur | Aman Chadha | Vinija Jain

The diversity and complexity of Indic languages present unique challenges for natural language processing (NLP) tasks, particularly in the domain of question answering (QA). To address these challenges, this paper explores the application of State Space Models (SSMs) to build efficient and contextually aware QA systems tailored for Indic languages. SSMs are particularly suited for this task due to their ability to model long-term and short-term dependencies in sequential data, making them well-equipped to handle the rich morphology, complex syntax, and contextual intricacies characteristic of Indian languages. We evaluated multiple SSM architectures across diverse datasets representing various Indic languages and conducted a comparative analysis of their performance. Our results demonstrate that these models effectively capture linguistic subtleties, leading to significant improvements in question interpretation, context alignment, and answer generation. This work represents the first application of SSMs to question answering tasks in Indic languages, establishing a foundational benchmark for future research in this domain. Furthermore, we propose enhancements to existing SSM frameworks, optimizing their applicability to low-resource settings and multilingual scenarios prevalent in Indic languages.

pdf bib
Parallel Corpora for Machine Translation in Low-Resource Indic Languages: A Comprehensive Review
Rahul Raja | Arpita Vats

Parallel corpora play an important role in training machine translation (MT) models, particularly for low-resource languages where high-quality bilingual data is scarce. This review provides a comprehensive overview of available parallel corpora for Indic languages, which span diverse linguistic families, scripts, and regional variations. We categorize these corpora into text-to-text, code-switched, and various categories of multimodal datasets, highlighting their significance in the development of robust multilingual MT systems. Beyond resource enumeration, we critically examine the challenges faced in corpus creation, including linguistic diversity, script variation, data scarcity, and the prevalence of informal textual content. We also discuss and evaluate these corpora in various terms such as alignment quality and domain representativeness. Furthermore, we address open challenges such as data imbalance across Indic languages, the trade-off between quality and quantity, and the impact of noisy, informal, and dialectal data on MT performance. Finally, we outline future directions, including leveraging cross-lingual transfer learning, expanding multilingual datasets, and integrating multimodal resources to enhance translation quality. To the best of our knowledge, this paper presents the first comprehensive review of parallel corpora specifically tailored for low-resource Indic languages in the context of machine translation.

pdf bib
Low-Resource Transliteration for Roman-Urdu and Urdu Using Transformer-Based Models
Umer Butt | Stalin Varanasi | Günter Neumann

As the Information Retrieval (IR) field increasingly recognizes the importance of inclusivity, addressing the needs of low-resource languages remains a significant challenge. Transliteration between Urdu and its Romanized form, Roman Urdu, remains underexplored despite the widespread use of both scripts in South Asia. Prior work using RNNs on the Roman-Urdu-Parl dataset showed promising results but suffered from poor domain adaptability and limited evaluation. We propose a transformer-based approach using the m2m100 multilingual translation model, enhanced with masked language modeling (MLM) pretraining and fine-tuning on both Roman-Urdu-Parl and the domain diverse Dakshina dataset. To address previous evaluation flaws, we introduce rigorous dataset splits and assess performance using BLEU, character-level BLEU, and CHRF. Our model achieves strong transliteration performance, with Char-BLEU scores of 96.37 for Urdu→Roman-Urdu and 97.44 for Roman-Urdu→Urdu. These results outperform both RNN baselines and GPT-4o Mini and demonstrate the effectiveness of multilingual transfer learning for low-resource transliteration tasks.

pdf bib
Building Data Infrastructure for Low-Resource Languages
Sarah K. K. Luger | Rafael Mosquera | Pedro Ortiz Suarez

The MLCommons Datasets Working Group presents a comprehensive initiative to advance the development and accessibility of artificial intelligence (AI) training and testing resources. This paper introduces three key projects aimed at addressing critical gaps in the AI data ecosystem: the Unsupervised People’s Speech Dataset, containing over 821,000 hours of speech across 89+ languages; a strategic collaboration with Common Crawl to enhance web crawling capabilities for low-resource languages; and a framework for knowledge graph extraction evaluation. By focusing on languages other than English (LOTE) and creating permissively licensed, high-quality datasets, these initiatives aim to democratize AI development and improve model performance across diverse linguistic contexts. This work represents a significant step toward more inclusive and capable AI systems that can serve global communities.

pdf bib
Encoder-Aware Sequence-Level Knowledge Distillation for Low-Resource Neural Machine Translation
Menan Velayuthan | Nisansa De Silva | Surangika Ranathunga

Domain adaptation in Neural Machine Translation (NMT) is commonly achieved through fine-tuning, but this approach becomes inefficient as the number of domains increases. Knowledge distillation (KD) provides a scalable alternative by training a compact model on distilled data from a larger model. However, we hypothesize that vanilla sequence-level KD primarily distills the decoder while neglecting encoder knowledge, leading to suboptimal knowledge transfer and limiting its effectiveness in low-resource settings, where both data and computational resources are constrained. To address this, we propose an improved sequence-level KD method that enhances encoder knowledge transfer through a cosine-based alignment loss. Our approach first trains a large model on a mixed-domain dataset and generates a Distilled Mixed Dataset (DMD). A small model is then trained on this dataset via sequence-level KD with encoder alignment. Experiments in a low-resource setting validate our hypothesis, demonstrating that our approach outperforms vanilla sequence-level KD, improves generalization to out-of-domain data, and facilitates efficient domain adaptation while reducing model size and computational cost.
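
A hedged sketch of the training objective as described: the usual cross-entropy on the distilled data plus a cosine term pulling the student's encoder states toward the (projected) teacher states. The projection module and loss weight are assumptions, not the paper's exact configuration.

    import torch
    import torch.nn.functional as F

    def encoder_aware_kd_loss(student_logits, labels,
                              student_enc, teacher_enc, proj, lam=1.0):
        # student_logits: (batch, seq, vocab); labels: (batch, seq)
        # student_enc: (batch, seq, d_s); teacher_enc: (batch, seq, d_t)
        ce = F.cross_entropy(student_logits.transpose(1, 2), labels)
        aligned = proj(teacher_enc)  # e.g. nn.Linear(d_t, d_s) if dims differ
        cos = F.cosine_similarity(student_enc, aligned, dim=-1)  # (batch, seq)
        return ce + lam * (1.0 - cos).mean()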

pdf bib
PahGen: Generating Ancient Pahlavi Text via Grammar-guided Zero-shot Translation
Farhan Farsi | Parnian Fazel | Farzaneh Goshtasb | Nadia Hajipour | Sadra Sabouri | Ehsaneddin Asgari | Hossein Sameti

The Pahlavi language, also known as Middle Persian, is a critical part of Persian cultural and historical heritage, bridging Old Persian and Modern Persian (Farsi). However, due to its limited digital presence and the scarcity of comprehensive linguistic resources, Pahlavi is at risk of extinction. As an early attempt to preserve this language, this study introduces a framework to translate English text into Pahlavi. Our approach combines grammar-guided term extraction with zero-shot translation, leveraging large language models (LLMs) to generate syntactically and semantically accurate Pahlavi sentences. This framework aims to preserve the Pahlavi language and serves as a model for reviving other endangered languages with similar characteristics. Finally, using our framework, we generate a novel dataset of 360 expert-validated parallel English-Pahlavi texts.

pdf bib
Limitations of Religious Data and the Importance of the Target Domain: Towards Machine Translation for Guinea-Bissau Creole
Jacqueline Rowe | Edward Gow-Smith | Mark Hepple

We introduce a new dataset for machine translation of Guinea-Bissau Creole (Kiriol), comprising around 40 thousand parallel sentences to English and Portuguese. This dataset is made up of predominantly religious data (from the Bible and texts from the Jehovah’s Witnesses), but also a small amount of general domain data (from a dictionary). This mirrors the typical resource availability of many low resource languages. We train a number of transformer-based models to investigate how to improve domain transfer from religious data to a more general domain. We find that adding even 300 sentences from the target domain when training substantially improves the translation performance, highlighting the importance and need for data collection for low-resource languages, even on a small-scale. We additionally find that Portuguese-to-Kiriol translation models perform better on average than other source and target language pairs, and investigate how this relates to the morphological complexity of the languages involved and the degree of lexical overlap between creoles and lexifiers. Overall, we hope our work will stimulate research into Kiriol and into how machine translation might better support creole languages in general.

up

pdf (full)
bib (full)
Proceedings of the First Workshop on Multilingual Counterspeech Generation

pdf bib
Proceedings of the First Workshop on Multilingual Counterspeech Generation
Helena Bonaldi | María Estrella Vallecillo-Rodríguez | Irune Zubiaga | Arturo Montejo-Ráez | Aitor Soroa | María Teresa Martín-Valdivia | Marco Guerini | Rodrigo Agerri

pdf bib
PANDA - Paired Anti-hate Narratives Dataset from Asia: Using an LLM-as-a-Judge to Create the First Chinese Counterspeech Dataset
Michael Bennie | Demi Zhang | Bushi Xiao | Jing Cao | Chryseis Xinyi Liu | Jian Meng | Alayo Tripp

Despite the global prevalence of Modern Standard Chinese, counterspeech (CS) resources for Chinese remain virtually nonexistent. To address this gap in East Asian counterspeech research, we introduce a corpus of Modern Standard Mandarin counterspeech that focuses on combating hate speech in Mainland China. This paper proposes a novel approach to generating CS using an LLM-as-a-Judge, simulated annealing, zero-shot counter-narrative (CN) generation with LLMs, and a round-robin algorithm. This is followed by manual verification for quality and contextual relevance. The paper details the methodology for creating effective counterspeech in Chinese and other non-Eurocentric languages, including unique cultural patterns in which groups are maligned and linguistic patterns in what kinds of discourse markers are programmatically marked as hate speech (HS). Through analysis of the generated corpora, we provide strong evidence for the lack of open-source, properly labeled Chinese hate speech data and for the limitations of using an LLM-as-a-Judge to score possible answers in Chinese. Moreover, the present corpus serves as the first CS corpus for an East Asian language and provides an essential resource for future research on counterspeech generation and evaluation.

pdf bib
RSSN at Multilingual Counterspeech Generation: Leveraging Lightweight Transformers for Efficient and Context-Aware Counter-Narrative Generation
Ravindran V

This paper presents a system for counter-speech generation, developed for the COLING 2025 shared task. By leveraging lightweight transformer models, DistilBART and T5-small, we optimize computational efficiency while maintaining strong performance. The work includes an in-depth analysis of a multilingual dataset, addressing hate speech instances across diverse languages and target groups. Through systematic error analysis, we identify challenges such as lack of specificity and context misinterpretation in generated counter-narratives. Evaluation metrics like BLEU, ROUGE, and BERTScore demonstrate the effectiveness of our approaches, while comparative insights highlight complementary strengths in fluency, contextual integration, and creativity. Future directions focus on enhancing preprocessing, integrating external knowledge sources, and improving scalability.

pdf bib
Northeastern Uni at Multilingual Counterspeech Generation: Enhancing Counter Speech Generation with LLM Alignment through Direct Preference Optimization
Sahil Wadhwa | Chengtian Xu | Haoming Chen | Aakash Mahalingam | Akankshya Kar | Divya Chaudhary

The automatic generation of counter-speech (CS) is a critical strategy for addressing hate speech by providing constructive and informed responses. However, existing methods often fail to generate high-quality, impactful, and scalable CS, particularly across diverse linguistic contexts. In this paper, we propose a novel methodology to enhance CS generation by aligning Large Language Models (LLMs) using Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Our approach leverages DPO to align LLM outputs with human preferences, ensuring contextually appropriate and linguistically adaptable responses. Additionally, we incorporate knowledge grounding to enhance the factual accuracy and relevance of generated CS. Experimental results demonstrate that DPO-aligned models significantly outperform SFT baselines on CS benchmarks while scaling effectively to multiple languages. These findings highlight the potential of preference-based alignment techniques to advance CS generation across varied linguistic settings. The model supervision and alignment are done in English, and the same model is used to report metrics across other languages such as Basque, Italian, and Spanish.
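
For readers unfamiliar with DPO, the standard loss (Rafailov et al., 2023) can be written in a few lines of PyTorch; this is a generic sketch of the published objective, not the authors' training code, and the tensor shapes are illustrative.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        """DPO: -log sigmoid(beta * (log-ratio of chosen - log-ratio of rejected)),
        where each log-ratio compares the policy to the frozen reference model."""
        chosen_ratio = policy_chosen_logps - ref_chosen_logps
        rejected_ratio = policy_rejected_logps - ref_rejected_logps
        return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

    # Toy usage with fake per-sequence log-probabilities for a batch of 4 pairs:
    lp = lambda: torch.randn(4)
    loss = dpo_loss(lp(), lp(), lp(), lp())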

pdf bib
NLP@IIMAS-CLTL at Multilingual Counterspeech Generation: Combating Hate Speech Using Contextualized Knowledge Graph Representations and LLMs
David Salvador Preciado Márquez | Helena Gómez Adorno | Ilia Markov | Selene Baez Santamaria

We present our approach for the shared task on Multilingual Counterspeech Generation (MCG) to counteract hate speech (HS) in Spanish, English, Basque, and Italian. To accomplish this, we followed two different strategies: 1) a graph-based generative model that encodes graph representations of knowledge related to hate speech, and 2) leveraging prompts for a large language model (LLM), specifically GPT-4o. We find that our graph-based approach tends to perform better in terms of traditional evaluation metrics (i.e., ROUGE-L, BLEU, BERTScore), while the JudgeLM evaluation employed in the shared task favors the counter-narratives generated by the LLM-based approach, which was ranked second for English and third for Spanish on the leaderboard.

pdf bib
CODEOFCONDUCT at Multilingual Counterspeech Generation: A Context-Aware Model for Robust Counterspeech Generation in Low-Resource Languages
Michael Bennie | Bushi Xiao | Chryseis Xinyi Liu | Demi Zhang | Jian Meng | Alayo Tripp

This paper introduces a context-aware model for robust counterspeech generation, which achieved significant success in the MCG-COLING-2025 shared task. Our approach particularly excelled in low-resource language settings. By leveraging a simulated annealing algorithm fine-tuned on multilingual datasets, the model generates factually accurate responses to hate speech. We demonstrate state-of-the-art performance across four languages (Basque, English, Italian, and Spanish), with our system ranking first for Basque, second for Italian, and third for both English and Spanish. Notably, our model swept all three top positions for Basque, highlighting its effectiveness in low-resource scenarios. Evaluation of the shared task employs both traditional metrics (BLEU, ROUGE, BERTScore, Novelty) and the LLM-based JudgeLM. We present a detailed analysis of our results, including error cases and potential improvements. This work contributes to the growing body of research on multilingual counterspeech generation, offering insights into developing robust models that can adapt to diverse linguistic and cultural contexts in the fight against online hate speech.
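
A generic simulated annealing loop over candidate responses looks roughly as follows; the neighbor and scoring functions are placeholders invented for illustration, and the paper's actual procedure may differ.

    import math
    import random

    def anneal(initial, neighbors, score, t0=1.0, cooling=0.95, steps=200, seed=0):
        """Generic simulated annealing: worse candidates are accepted with
        probability exp(delta / T), and the temperature T decays each step."""
        rng = random.Random(seed)
        current, current_score = initial, score(initial)
        t = t0
        for _ in range(steps):
            cand = neighbors(current, rng)
            delta = score(cand) - current_score
            if delta >= 0 or rng.random() < math.exp(delta / t):
                current, current_score = cand, current_score + delta
            t *= cooling
        return current

    # Toy usage: "edit" a string and prefer longer outputs (placeholder scorer).
    best = anneal("counterspeech draft",
                  neighbors=lambda s, rng: s + rng.choice(" !?."),
                  score=len)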

pdf bib
HW-TSC at Multilingual Counterspeech Generation
Xinglin Lyu | Haolin Wang | Min Zhang | Hao Yang

Multilingual counterspeech generation (MCSG) aims to generate counterspeech with respectful, non-offensive information that is specific and truthful for the given hate speech, especially in languages other than English. Training data for MCSG in low-resource languages is generally rare and hard to curate, and even with impressive large language models (LLMs), it is a struggle to generate appropriate counterspeech in multilingual scenarios. In this paper, we design a pipeline with a generation-reranking mode to effectively generate counterspeech in the multilingual scenario via LLMs. Considering the scarcity of training data, we first utilize a training-free strategy, in-context learning (ICL), to generate candidate counterspeech. We then propose to rerank those candidates via the Elo rating algorithm and a fine-tuned reward model. Experimental results on four languages, English (EN), Italian (IT), Basque (EU) and Spanish (ES), show that our system achieves comparable or even better performance on four metrics compared to the winner of this shared task.
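
The Elo-based reranking step can be illustrated with the standard rating update; the preference function below is a stand-in for the fine-tuned reward model, and the K-factor and initial ratings are conventional defaults rather than the paper's settings.

    from itertools import combinations

    def elo_update(r_a, r_b, outcome_a, k=16.0):
        """One Elo update; outcome_a is 1.0 if candidate A is preferred, else 0.0."""
        expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
        r_a += k * (outcome_a - expected_a)
        r_b += k * ((1.0 - outcome_a) - (1.0 - expected_a))
        return r_a, r_b

    def rerank(candidates, prefer):
        """Round-robin pairwise comparisons; return candidates sorted by rating."""
        ratings = {c: 1000.0 for c in candidates}
        for a, b in combinations(candidates, 2):
            out_a = 1.0 if prefer(a, b) else 0.0
            ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], out_a)
        return sorted(candidates, key=ratings.get, reverse=True)

    # Placeholder preference: a real system would query a reward model here.
    ranked = rerank(["cand 1", "candidate 2", "c3"], prefer=lambda a, b: len(a) > len(b))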

pdf bib
MilaNLP@Multilingual Counterspeech Generation: Evaluating Translation and Background Knowledge Filtering
Emanuele Moscato | Arianna Muti | Debora Nozza

We describe our participation in the Multilingual Counterspeech Generation shared task, which aims to generate a counternarrative to counteract hate speech, given a hateful sentence and relevant background knowledge. Our team tested two different aspects: translating outputs from English vs generating outputs in the original languages and filtering pieces of the background knowledge provided vs including all the background knowledge. Our experiments show that filtering the background knowledge in the same prompt and leaving data in the original languages leads to more adherent counternarrative generations, except for Basque, where translating the output from English and filtering the background knowledge in a separate prompt yields better results. Our system ranked first in English, Italian, and Spanish and fourth in Basque.

pdf bib
Hyderabadi Pearls at Multilingual Counterspeech Generation : HALT : Hate Speech Alleviation using Large Language Models and Transformers
Md Shariq Farhan

This paper explores the potential of using fine-tuned Large Language Models (LLMs) for generating counter-narratives (CNs) to combat hate speech (HS). We focus on English and Basque, leveraging the ML_MTCONAN_KN dataset, which provides hate speech and counter-narrative pairs in multiple languages. Our study compares the performance of Mistral, Llama, and a Llama-based LLM fine-tuned on a Basque language dataset for CN generation. The generated CNs are evaluated using JudgeLM (an LLM used to evaluate other LLMs in open-ended scenarios) along with traditional metrics such as ROUGE-L, BLEU, and BERTScore. The results demonstrate that fine-tuned LLMs can produce high-quality, contextually relevant CNs for low-resource languages that are comparable to human-generated responses, offering a significant contribution to combating online hate speech across diverse linguistic settings.

pdf bib
TrenTeam at Multilingual Counterspeech Generation: Multilingual Passage Re-Ranking Approaches for Knowledge-Driven Counterspeech Generation Against Hate
Daniel Russo

Hate speech (HS) in online spaces poses severe risks, including real-world violence and psychological harm to victims, necessitating effective countermeasures. Counterspeech (CS), which responds to hateful messages with opposing yet non-hostile narratives, offers a promising solution by mitigating HS while upholding free expression. However, the growing volume of HS demands automation, making Natural Language Processing a viable solution for the automatic generation of CS. Recent works have explored knowledge-driven approaches, leveraging external sources to improve the relevance and informativeness of responses. These methods typically involve multi-step pipelines combining retrieval and passage re-ranking modules. While effective, most studies have focused on English, with limited exploration of multilingual contexts. This paper addresses these gaps by proposing a multilingual, knowledge-driven approach to CS generation. We integrate state-of-the-art re-ranking mechanisms into the CS generation pipeline and evaluate them using the MT-CONAN-KN dataset, which includes hate speech, relevant knowledge sentences, and counterspeech in four languages: English, Italian, Spanish, and Basque. Our approach compares re-ranker-based systems employing multilingual cross-encoders and LLMs to a simpler end-to-end system where the language model directly handles both knowledge selection and CS generation. Results demonstrate that re-ranker-based systems outperformed end-to-end systems in syntactic and semantic similarity metrics, with LLM-based re-rankers delivering the strongest performance overall. This work is the result of our participation in the Shared Task on Multilingual Counterspeech Generation held at COLING 2025.
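
A typical cross-encoder re-ranking step in such pipelines might look like the following sketch with the sentence-transformers library; the checkpoint name and top-k value are illustrative assumptions, not the paper's configuration.

    from sentence_transformers import CrossEncoder

    # Model name is illustrative; any multilingual cross-encoder checkpoint works.
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    hate_speech = "<hateful message>"
    knowledge = ["knowledge sentence 1", "knowledge sentence 2", "knowledge sentence 3"]

    # Score each (query, passage) pair and keep the top-k knowledge sentences.
    scores = reranker.predict([(hate_speech, k) for k in knowledge])
    top_k = [k for _, k in sorted(zip(scores, knowledge), reverse=True)][:2]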

pdf bib
The First Workshop on Multilingual Counterspeech Generation at COLING 2025: Overview of the Shared Task
Helena Bonaldi | María Estrella Vallecillo-Rodríguez | Irune Zubiaga | Arturo Montejo-Raez | Aitor Soroa | María-Teresa Martín-Valdivia | Marco Guerini | Rodrigo Agerri

This paper presents an overview of the Shared Task organized at the First Workshop on Multilingual Counterspeech Generation at COLING 2025. While interest in automatic approaches to Counterspeech generation has been steadily growing, the large majority of the published experimental work has been carried out for English. This is due both to the scarcity of non-English manually curated training data and to the crushing predominance of English in the generative Large Language Models (LLMs) ecosystem. The task’s goal is to promote and encourage research on Counterspeech generation in a multilingual setting (Basque, English, Italian, and Spanish), potentially leveraging the background knowledge provided in the proposed dataset. The task attracted 11 participants, 9 of whom presented a paper describing their systems. Together with the task, we introduce a new multilingual counterspeech dataset with 2,384 triplets of hate speech, counterspeech, and related background knowledge covering 4 languages. The dataset is available at: https://huggingface.co/datasets/LanD-FBK/ML_MTCONAN_KN.

up

pdf (full)
bib (full)
Proceedings of the 21st Workshop on Multiword Expressions (MWE 2025)

pdf bib
Proceedings of the 21st Workshop on Multiword Expressions (MWE 2025)
Atul Kr. Ojha | Voula Giouli | Verginica Barbu Mititelu | Mathieu Constant | Gražina Korvel | A. Seza Doğruöz | Alexandre Rademaker

pdf bib
Syntagmatic Productivity of MWEs in Scientific English
Diego Alves | Stefan Fischer | Elke Teich

This paper presents an analysis of the syntagmatic productivity (SynProd) of different classes of multiword expressions (MWEs) in English scientific writing over time (mid 17th to 20th c.). SynProd refers to the variability of the syntagmatic context in which a word or other kind of linguistic unit is used. To measure SynProd, we use entropy. The study reveals that, similar to single-token units of various parts of speech, MWEs exhibit an increasing trend in syntagmatic productivity over time, particularly after the mid-19th century. Furthermore, when compared to similar parts of speech (PoS), MWEs show a more pronounced increase in SynProd over time.
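
Entropy over a unit's observed contexts can be computed directly from context counts; this small sketch shows the standard Shannon entropy computation and is only an illustration of the measure, since the paper's exact context definition is not reproduced here.

    import math
    from collections import Counter

    def syntagmatic_entropy(contexts):
        """Shannon entropy (in bits) of a unit's observed context distribution;
        higher entropy = more varied contexts = higher syntagmatic productivity."""
        counts = Counter(contexts)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # Toy usage: right-neighbour words observed after an MWE.
    print(syntagmatic_entropy(["of", "in", "of", "with", "for", "of"]))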

pdf bib
Probing Internal Representations of Multi-Word Verbs in Large Language Models
Hassane Kissane | Achim Schilling | Patrick Krauss

This study investigates the internal representations of verb-particle combinations, called multi-word verbs, within transformer-based large language models (LLMs), specifically examining how these models capture lexical and syntactic properties at different neural network layers. Using the BERT architecture, we analyze the representations of its layers for two different verb-particle constructions: phrasal verbs like “give up” and prepositional verbs like “look at”. Our methodology includes training probing classifiers on the model output to classify these categories at both word and sentence levels. The results indicate that the model’s middle layers achieve the highest classification accuracies. To further analyze the nature of these distinctions, we conduct a data separability test using the Generalized Discrimination Value (GDV). While GDV results show weak linear separability between the two verb types, probing classifiers still achieve high accuracy, suggesting that representations of these linguistic categories may be “non-linearly separable”. This aligns with previous research indicating that linguistic distinctions in neural networks are not always encoded in a linearly separable manner. These findings computationally support usage-based claims on the representation of verb-particle constructions and highlight the complex interaction between neural network architectures and linguistic structures.
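
A probing classifier of the kind described can be sketched by extracting a middle-layer representation from BERT and fitting a linear classifier on top; the layer index, mean pooling, and toy sentences below are illustrative assumptions, not the paper's setup.

    import torch
    from transformers import AutoModel, AutoTokenizer
    from sklearn.linear_model import LogisticRegression

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)

    def layer_embedding(sentence, layer=6):
        """Mean-pooled representation of one sentence at a given BERT layer."""
        with torch.no_grad():
            out = model(**tok(sentence, return_tensors="pt"))
        return out.hidden_states[layer][0].mean(dim=0).numpy()

    # Tiny illustrative probe: phrasal (1) vs. prepositional (0) verb sentences.
    sents = ["She gave up the plan.", "He looked at the sky."]
    labels = [1, 0]
    X = [layer_embedding(s) for s in sents]
    probe = LogisticRegression(max_iter=1000).fit(X, labels)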

pdf bib
VMWE identification with models trained on GUD (a UDv.2 treebank of Standard Modern Greek)
Stella Markantonatou | Vivian Stamou | Stavros Bompolas | Katerina Anastasopoulou | Irianna Linardaki Vasileiadi | Konstantinos Diamantopoulos | Yannis Kazos | Antonios Anastasopoulos

UD_Greek-GUD (GUD) is the most recent Universal Dependencies (UD) treebank for Standard Modern Greek (SMG) and the first SMG UD treebank to annotate Verbal Multiword Expressions (VMWEs). GUD contains material from fiction texts and various sites that use colloquial SMG. We describe the special annotation decisions we implemented with GUD, the pipeline we developed to facilitate the active annotation of new material, and we report on the method we designed to evaluate the performance of models trained on GUD as regards VMWE identification tasks.

pdf bib
Using LLMs to Advance Idiom Corpus Construction
Doğukan Arslan | Hüseyin Anıl Çakmak | Gulsen Eryigit | Joakim Nivre

Idiom corpora typically include both idiomatic and literal examples of potentially idiomatic expressions, but creating such corpora traditionally requires substantial expert effort and cost. In this article, we explore the use of large language models (LLMs) to generate synthetic idiom corpora as a more time- and cost-efficient alternative. We evaluate the effectiveness of synthetic data for training task-specific idiomaticity detection models and for testing GPT-4 in a few-shot prompting setting. Our findings reveal that although models trained on synthetic data perform worse than those trained on human-generated data, synthetic data generation offers considerable advantages in terms of cost and time. Specifically, task-specific idiomaticity detection models trained on synthetic data outperform the general-purpose LLM that generated the data when evaluated in a zero-shot setting, achieving an average improvement of 11 percentage points across four languages. Moreover, synthetic data enhances the LLM’s performance, enabling it to match the task-specific models trained with synthetic data when few-shot prompting is applied.

pdf bib
Gathering Compositionality Ratings of Ambiguous Noun-Adjective Multiword Expressions in Galician
Laura Castro | Marcos Garcia

Multiword expressions pose numerous challenges to most NLP tasks, and so do their compositionality and semantic ambiguity. The need for resources that make it possible to explore such phenomena is rather pressing, even more so in the case of low-resource languages. In this paper, we present a dataset of noun-adjective compounds in Galician with compositionality scores at token level. These MWEs are ambiguous due to being potentially idiomatic expressions, as well as due to the ambiguity and productivity of their constituents. The dataset comprises 240 MWEs that amount to 322 senses, which are contextualized in two sets of sentences, manually created, and extracted from corpora, totaling 1,858 examples. For this dataset, we gathered human judgments on compositionality levels for compounds, heads, and modifiers. Furthermore, we obtained frequency, ambiguity, and productivity data for compounds and their constituents, and we explored potential correlations between mean compositionality scores and these three properties in terms of compounds, heads, and modifiers. This valuable resource helps evaluate language models on (non-)compositionality and ambiguity, key challenges in NLP, and is especially relevant for Galician, a low-resource variety lacking annotated datasets for such linguistic phenomena.

pdf bib
Survey on Lexical Resources Focused on Multiword Expressions for the Purposes of NLP
Verginica Mititelu | Voula Giouli | Gražina Korvel | Chaya Liebeskind | Irina Lobzhanidze | Rusudan Makhachashvili | Stella Markantonatou | Aleksandra Markovic | Ivelina Stoyanova

Lexica of MWEs have always been a valuable resource for various NLP tasks. This paper presents the results of a comprehensive survey on multiword lexical resources that extends a previous one from 2016 to the present. We analyze a diverse set of lexica across multiple languages, reporting on aspects such as creation date, intended usage, languages covered and linguality type, content, acquisition method, accessibility, and linkage to other language resources. Our findings highlight trends in MWE lexicon development, with a focus on how well individual languages are represented. By identifying gaps and opportunities, this survey aims to support future efforts in creating MWE lexica for NLP applications.

pdf bib
A European Portuguese corpus annotated for verbal idioms
David Antunes | Jorge Baptista | Nuno J. Mamede

This paper presents the construction of VIDiom-PT, a corpus in European Portuguese annotated for verbal idioms (e.g. O Rui bateu a bota, lit.: Rui hit the boot ‘Rui died’). This linguistic resource aims to support the development of systems capable of processing such constructions in this language variety. To assist in the annotation effort, two tools were built. The first allows for the detection of possible instances of verbal idioms in texts, while the second provides a graphical interface for annotating them. This effort culminated in the annotation of a total of 5,178 instances of 747 different verbal idioms in more than 200,000 sentences in European Portuguese. A highly reliable inter-annotator agreement was achieved, using Krippendorff’s alpha for nominal data (0.869) with 5% of the data independently annotated by 3 experts. Part of the annotated corpus is also made publicly available.
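
Krippendorff’s alpha for nominal data can be computed with the krippendorff Python package; the annotation matrix below is a toy example (1 = idiom, 0 = literal, NaN = not annotated by that expert), not the paper's data.

    import numpy as np
    import krippendorff  # pip install krippendorff

    # Rows = annotators, columns = instances; values are illustrative.
    reliability_data = np.array([
        [1, 0, 1, np.nan, 1],
        [1, 0, 1, 0,      1],
        [1, 1, 1, 0,      np.nan],
    ])
    alpha = krippendorff.alpha(reliability_data=reliability_data,
                               level_of_measurement="nominal")
    print(round(alpha, 3))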

pdf bib
MultiCoPIE: A Multilingual Corpus of Potentially Idiomatic Expressions for Cross-lingual PIE Disambiguation
Uliana Sentsova | Debora Ciminari | Josef Van Genabith | Cristina España-Bonet

Language models are able to handle compositionality and, to some extent, non-compositional phenomena such as semantic idiosyncrasy, a feature most prominent in the case of idioms. This work introduces the MultiCoPIE corpus that includes potentially idiomatic expressions in Catalan, Italian, and Russian, extending the language coverage of PIE corpus data. The new corpus provides additional linguistic features of idioms, such as their semantic compositionality, part-of-speech of the idiom head, as well as their corresponding idiomatic expressions in English. With this new resource at hand, we first fine-tune an XLM-RoBERTa model to classify figurative and literal usage of potentially idiomatic expressions in English. We then study cross-lingual transfer to the languages represented in the MultiCoPIE corpus, evaluating the model’s ability to generalize an idiom-related task to languages not seen during fine-tuning. We show the effect of ‘cross-lingual lexical overlap’: the performance of the model, fine-tuned on English idiomatic expressions and tested on the MultiCoPIE languages, increases significantly when classifying ‘shared idioms’, i.e., idiomatic expressions that have direct counterparts in English with similar form and meaning. While this observation raises questions about the generalizability of cross-lingual learning, the results from experiments on PIEs demonstrate strong evidence of effective cross-lingual transfer, even when accounting for idioms similar across languages.

pdf bib
Named Entity Recognition for the Irish Language
Jane Adkins | Hugo Collins | Joachim Wagner | Abigail Walsh | Brian Davis

The Irish language has been deemed ‘definitely endangered’ (Moseley, 2012) and has been classified as having ‘weak or no support’ (Lynn, 2023) regarding digital resources in spite of its status as the first official and national language of the Republic of Ireland. This research develops the first named entity recognition (NER) tool for the Irish language, one of the essential tasks identified by the Digital Plan for Irish (Ní Chasaide et al., 2022). In this study, we produce a small gold-standard NER-annotated corpus and compare both monolingual and multilingual BERT models fine-tuned on this task. We experiment with different model architectures and low-resource language approaches to enrich our dataset. We test our models on a mix of single- and multi-word named entities as well as a specific multi-word named entity test set. Our proposed gaBERT model with the implementation of random data augmentation and a conditional random fields layer demonstrates significant performance improvements over baseline models, alternative architectures, and multilingual models, achieving an F1 score of 76.52. This study contributes to advancing Irish language technologies and supporting Irish language digital resources, providing a basis for Irish NER and identification of other MWE types.

up

pdf (full)
bib (full)
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)

pdf bib
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Abteen Ebrahimi | Samar Haider | Emmy Liu | Sammar Haider | Maria Leonor Pacheco | Shira Wein

pdf bib
Fine-Grained and Multi-Dimensional Metrics for Document-Level Machine Translation
Yirong Sun | Dawei Zhu | Yanjun Chen | Erjia Xiao | Xinghao Chen | Xiaoyu Shen

Large language models (LLMs) have excelled in various NLP tasks, including machine translation (MT), yet most studies focus on sentence-level translation. This work investigates the inherent capability of instruction-tuned LLMs for document-level translation (docMT). Unlike prior approaches that require specialized techniques, we evaluate LLMs by directly prompting them to translate entire documents in a single pass. Our results show that this method improves translation quality compared to translating sentences separately, even without document-level fine-tuning. However, this advantage is not reflected in BLEU scores, which often favor sentence-based translations. We propose using the LLM-as-a-judge paradigm for evaluation, where GPT-4 is used to assess document coherence, accuracy, and fluency in a more nuanced way than n-gram-based metrics. Overall, our work demonstrates that instruction-tuned LLMs can effectively leverage document context for translation. However, we caution against using BLEU scores for evaluating docMT, as they often provide misleading outcomes, failing to capture the quality of document-level translation.

pdf bib
INSIGHTBUDDY-AI: Medication Extraction and Entity Linking using Pre-Trained Language Models and Ensemble Learning
Pablo Romero | Lifeng Han | Goran Nenadic

This paper presents our system, InsightBuddy-AI, designed for extracting medication mentions and their associated attributes, and for linking these entities to established clinical terminology resources, including SNOMED-CT, the British National Formulary (BNF), ICD, and the Dictionary of Medicines and Devices (dm+d). To perform medication extraction, we investigated various ensemble learning approaches, including stacked and voting ensembles (using first, average, and max voting methods) built upon eight pre-trained language models (PLMs). These models include general-domain PLMs (BERT, RoBERTa, and RoBERTa-Large) as well as domain-specific models such as BioBERT, BioClinicalBERT, BioMedRoBERTa, ClinicalBERT, and PubMedBERT. The system targets the extraction of drug-related attributes such as adverse drug effects (ADEs), dosage, duration, form, frequency, reason, route, and strength. Experiments conducted on the n2c2-2018 shared task dataset demonstrate that ensemble learning methods outperformed individually fine-tuned models, with notable improvements of 2.43% in Precision and 1.35% in F1-score. We have also developed cross-platform desktop applications for both entity recognition and entity linking, available for Windows and macOS. The InsightBuddy-AI application is freely accessible for research use at https://github.com/HECTA-UoM/InsightBuddy-AI.
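
The first/average/max voting schemes mentioned above can be sketched as simple operations over stacked per-model class probabilities; this is a generic illustration, not the system's actual code.

    import numpy as np

    def vote(prob_stack, method="average"):
        """Combine per-model class probabilities, shape (n_models, n_classes):
        'first' trusts model 0, 'average' takes the mean, 'max' the per-class max."""
        if method == "first":
            combined = prob_stack[0]
        elif method == "average":
            combined = prob_stack.mean(axis=0)
        else:  # "max"
            combined = prob_stack.max(axis=0)
        return int(np.argmax(combined))

    # Toy: three models scoring one token over three entity classes.
    probs = np.array([[0.2, 0.5, 0.3],
                      [0.1, 0.7, 0.2],
                      [0.4, 0.4, 0.2]])
    print(vote(probs, "average"))  # 1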

pdf bib
Linguistic Features in German BERT: The Role of Morphology, Syntax, and Semantics in Multi-Class Text Classification
Henrike Beyer | Diego Frassinelli

Most studies on the linguistic information encoded by BERT primarily focus on English. Our study examines a monolingual German BERT model using a semantic classification task on newspaper articles, analysing the linguistic features influencing classification decisions through SHAP values. We use the TüBa-D/Z corpus, a resource with gold-standard annotations for a set of linguistic features, including POS, inflectional morphology, phrasal, clausal, and dependency structures. Semantic features of nouns are evaluated via the GermaNet ontology using shared hypernyms. Our results indicate that the features identified in English also affect classification in German, but also suggest important language- and task-specific features.

pdf bib
Thesis Proposal: Uncertainty in Knowledge Graph Embeddings
Yuqicheng Zhu

Knowledge Graph Embedding (KGE) methods are widely used to map entities and relations from knowledge graphs (KGs) into continuous vector spaces, enabling non-classical reasoning over knowledge structures. Despite their effectiveness, the uncertainty of KGE methods has not been extensively studied in the literature. This gap poses significant challenges, particularly when deploying KGE models in high-stakes domains like medicine, where reliability and risk assessment are critical. This dissertation seeks to investigate various types of uncertainty in KGE methods and explore strategies to quantify, mitigate, and reason under uncertainty effectively. The outcomes of this research will contribute to enhancing the reliability of KGE methods, providing greater confidence in their use beyond benchmark datasets, and supporting their application in real-world, high-stakes domains.

pdf bib
Detecting Sexism in Tweets: A Sentiment Analysis and Graph Neural Network Approach
Diana P. Madera-Espíndola | Zoe Caballero-Domínguez | Valeria J. Ramírez-Macías | Sabur Butt | Hector Ceballos

In the digital age, social media platforms like Twitter serve as an extensive repository of public discourse, including instances of sexism. It is important to identify such behavior since radicalized ideologies can lead to real-world violent acts. This project aims to develop a deep learning-based tool that leverages a combination of BERT (both English and multilingual versions) and GraphSAGE, a Graph Neural Network (GNN) model, alongside sentiment analysis and natural language processing (NLP) techniques. The tool is designed to analyze tweets for sexism detection and classify them into five categories.

pdf bib
Towards Codec-LM Co-design for Neural Codec Language Models
Shih-Lun Wu | Aakash Lahoti | Arjun D Desai | Karan Goel | Chris Donahue | Albert Gu

Neural codec language models (or codec LMs) are emerging as a powerful framework for audio generation tasks like text-to-speech (TTS). These models leverage advancements in language modeling and residual vector quantization (RVQ)-based audio codecs, which compress audios into discrete codes for LMs to process. Despite the close interdependence of codecs and LMs in these systems, research on codecs and LMs has largely remained siloed. In this work, we propose three techniques for better codec-LM co-design: (i) a frame-wise codec encoder that improves both LM log-likelihood and end-to-end TTS metrics, (ii) LM codebook level dropout, a method to efficiently navigate a portion of the codec-LM design space by training a single LM, and (iii) increased codec frame duration, which we show can accelerate inference while maintaining end-to-end performance. Our experiments demonstrate that combining all three co-design techniques results in doubled inference speed, and improvements in intelligibility, audio quality, and speaker control in TTS relative to a siloed baseline.

pdf bib
Low-resource Machine Translation for Code-switched Kazakh-Russian Language Pair
Maksim Borisov | Zhanibek Kozhirbayev | Valentin Malykh

Machine translation for low-resource language pairs is a challenging task. This task can become extremely difficult once a speaker uses code-switching. We present the first code-switched Kazakh-Russian parallel corpus. Additionally, we propose a method to build a machine translation model for the code-switched Kazakh-Russian language pair with no labeled data. Our method is based on the generation of synthetic data, and the resulting model beats an existing commercial system in human evaluation.

pdf bib
Generative Product Recommendations for Implicit Superlative Queries
Kaustubh Dhole | Nikhita Vedula | Saar Kuzi | Giuseppe Castellucci | Eugene Agichtein | Shervin Malmasi

In recommender systems, users often seek the best products through indirect, vague, or under-specified queries such as “best shoes for trail running.” These queries, referred to as implicit superlative queries, pose a challenge for standard retrieval and ranking systems due to their lack of explicit attribute mentions and the need for identifying and reasoning over complex attributes. We investigate how Large Language Models (LLMs) can generate implicit attributes for ranking and reason over them to improve product recommendations for such queries. As a first step, we propose a novel four-point schema, called SUPERB, for annotating the best product candidates for superlative queries, paired with LLM-based product annotations. We then empirically evaluate several existing retrieval and ranking approaches on our newly created dataset, providing insights and discussing how to integrate these findings into real-world e-commerce production systems.

pdf bib
ConQuer: A Framework for Concept-Based Quiz Generation
Yicheng Fu | Zikui Wang | Liuxin Yang | Meiqing Huo | Zhongdongming Dai

Quizzes play a crucial role in education by reinforcing students’ understanding of key concepts and encouraging self-directed exploration. However, compiling high-quality quizzes can be challenging and require deep expertise and insight into specific subject matter. Although LLMs have greatly enhanced the efficiency of quiz generation, concerns remain regarding the quality of these AI-generated quizzes and their educational impact on students. To address these issues, we introduce ConQuer, a concept-based quiz generation framework that leverages external knowledge sources. We employ comprehensive evaluation dimensions to assess the quality of the generated quizzes, using LLMs as judges. Our experiment results demonstrate a 4.8% improvement in evaluation scores and a 77.52% win rate in pairwise comparisons against baseline quiz sets. Ablation studies further underscore the effectiveness of each component in our framework.

pdf bib
What is it? Towards a Generalizable Native American Language Identification System
Ivory Yang | Weicheng Ma | Carlos Guerrero Alvarez | William Dinauer | Soroush Vosoughi

This paper presents a research thesis proposal to develop a generalizable Native American language identification system. Despite their cultural and historical significance, Native American languages remain entirely unsupported by major commercial language identification systems. This omission not only underscores the systemic neglect of endangered languages in technological development, but also highlights the urgent need for dedicated, community-driven solutions. We propose a two-pronged approach: (1) systematically curating linguistic resources across all Native American languages for robust training, and (2) tailored data augmentation to generate synthetic yet linguistically coherent training samples. As proof of concept, we extend an existing rudimentary Athabaskan language classifier by integrating Plains Apache, an extinct Southern Athabaskan language, as an additional language class. We also adapt a data generation framework for low-resource languages to create synthetic Plains Apache data, highlighting the potential of data augmentation. This proposal advocates for a community-driven, technological approach to supporting Native American languages.

pdf bib
Med-CoDE: Medical Critique based Disagreement Evaluation Framework
Mohit Gupta | Akiko Aizawa | Rajiv Ratn Shah

The emergence of large language models (LLMs) has significantly influenced numerous fields, including healthcare, by enhancing the capabilities of automated systems to process and generate human-like text. However, despite their advancements, the reliability and accuracy of LLMs in medical contexts remain critical concerns. Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance, leading to potential risks in clinical settings. In this work, we propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges. The framework leverages a critique-based approach to quantitatively measure the degree of disagreement between model-generated responses and established medical ground truths. This framework captures both accuracy and reliability in medical settings. The proposed evaluation framework aims to fill the existing gap in LLM assessment by offering a systematic method to evaluate the quality and trustworthiness of medical LLMs. Through extensive experiments and case studies, we illustrate the practicality of our framework in providing a comprehensive and reliable evaluation of medical LLMs.

pdf bib
Sentimatic: Sentiment-guided Automatic Generation of Preference Datasets for Customer Support Dialogue System
Suhyun Lee | ChangHeon Han

Supervised Fine-tuning (SFT) and preference optimization (PO) are key methods for enhancing language models and aligning them with human preferences. However, scaling preference datasets for PO training is challenging, leading AI customer support systems to rely on SFT. To address this, we propose the Sentiment-guided Automatic Generation of Preference Datasets (Sentimatic) methodology to automatically generate customer preference datasets without human intervention using a publicly available dataset constructed for SFT. Our approach classifies responses by sentiment, fine-tunes models on them, and applies advanced sampling and evaluation techniques to ensure diversity and quality. Ultimately, we generated 1,174 customer preference datasets based on 357 test datasets, and through experiments, we confirmed that the AI customer support system trained on these datasets is capable of carefully considering customer emotions and generating professional and appropriate responses.

pdf bib
Privacy-Preserving Federated Learning for Hate Speech Detection
Ivo de Souza Bueno Júnior | Haotian Ye | Axel Wisiorek | Hinrich Schütze

This paper presents a federated learning system with differential privacy for hate speech detection, tailored to low-resource languages. By fine-tuning pre-trained language models, ALBERT emerged as the most effective option for balancing performance and privacy. Experiments demonstrated that federated learning with differential privacy performs adequately in low-resource settings, though datasets with fewer than 20 sentences per client struggled due to excessive noise. Balanced datasets and augmenting hateful data with non-hateful examples proved critical for improving model utility. These findings offer a scalable and privacy-conscious framework for integrating hate speech detection into social media platforms and browsers, safeguarding user privacy while addressing online harm.

pdf bib
From Annotation to Adaptation: Metrics, Synthetic Data, and Aspect Extraction for Aspect-Based Sentiment Analysis with Large Language Models
Nikita Neveditsin | Pawan Lingras | Vijay Kumar Mago

This study examines the performance of Large Language Models (LLMs) in Aspect-Based Sentiment Analysis (ABSA), with a focus on implicit aspect extraction in a novel domain. Using a synthetic sports feedback dataset, we evaluate open-weight LLMs’ ability to extract aspect-polarity pairs and propose a metric to facilitate the evaluation of aspect extraction with generative models. Our findings highlight both the potential and limitations of LLMs in the ABSA task.

pdf bib
Developing Japanese CLIP Models Leveraging an Open-weight LLM for Large-scale Dataset Translation
Issa Sugiura | Shuhei Kurita | Yusuke Oda | Daisuke Kawahara | Naoaki Okazaki

CLIP is a foundational model that bridges images and text, widely adopted as a key component in numerous vision-language models. However, the lack of large-scale open Japanese image-text pairs poses a significant barrier to the development of Japanese vision-language models. In this study, we constructed a Japanese image-text pair dataset with 1.5 billion examples using machine translation with open-weight LLMs and pre-trained Japanese CLIP models on the dataset. The performance of the pre-trained models was evaluated across seven benchmark datasets, achieving competitive average scores compared to models of similar size without the need for extensive data curation. However, the results also revealed relatively low performance on tasks specific to Japanese culture, highlighting the limitations of translation-based approaches in capturing cultural nuances. Our dataset, models, and code are publicly available.

pdf bib
Self-Vocabularizing Training for Neural Machine Translation
Pin-Jie Lin | Ernie Chang | Yangyang Shi | Vikas Chandra

Past vocabulary learning techniques identify relevant vocabulary before training, relying on statistical and entropy-based assumptions that largely neglect the role of model training. Empirically, we observe that trained translation models are induced to use a byte-pair encoding (BPE) vocabulary subset distinct from the original BPE vocabulary, leading to performance improvements when retrained with the induced vocabulary. In this paper, we analyze this discrepancy in neural machine translation by examining vocabulary and entropy shifts during self-training, where each iteration generates a labeled dataset by pairing source sentences with the model’s predictions to define a new vocabulary. Building on these insights, we propose self-vocabularizing training, an iterative method that self-selects a smaller, more optimal vocabulary, yielding up to a 1.49 BLEU improvement. Moreover, we find that deeper model architectures lead to both an increase in unique token usage and a 6–8% reduction in vocabulary size.

pdf bib
CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search
Nikita Sorokin | Tikhonov Anton | Dmitry Abulkhanov | Ivan Sedykh | Irina Piontkovskaya | Valentin Malykh

We consider the well-known and important tasks of clone detection and information retrieval for source code. The most standard setup is to search for clones within code snippets in the same language, but it is also useful to find code snippets with identical behaviour across different programming languages. Nevertheless, multi- and cross-lingual clone detection has been little studied in the literature. We present a novel training procedure, cross-consistency training (CCT), leveraging cross-lingual similarity, which we apply to train language models on source code in various programming languages. We show that this training is effective both for encoder- and decoder-based models. The trained encoder-based CCT-LM model achieves a new state of the art on POJ-104 (a monolingual C++ clone detection benchmark) with 96.73% MAP and on AdvTest (a monolingual Python code search benchmark) with 47.18% MRR. The decoder-based CCT-LM model shows comparable performance on these tasks. In addition, we formulate the multi- and cross-lingual clone detection problem and present XCD, a new benchmark dataset produced from CodeForces submissions.

pdf bib
Text Compression for Efficient Language Generation
David Gu | Peter Belcak | Roger Wattenhofer

We challenge the prevailing assumption that LLMs must rely fully on sub-word tokens for high-quality text generation. To this end, we propose the “Generative Pretrained Thoughtformer” (GPTHF), a hierarchical transformer language model capable of text generation by compressing text into sentence embeddings and employing a sentence attention mechanism. GPTHF retains GPT’s architecture, modifying only token interactions via dynamic sparse attention masks. Our experiments show that GPTHF achieves up to an order-of-magnitude improvement in FLOPs efficiency and a threefold increase in runtime speed compared to equally-sized GPT models in the low-size regime. This is achieved through a unique generation method that caches and reuses sentence embeddings, allowing significant portions of the input to bypass large parts of the network.

pdf bib
Multilingual Native Language Identification with Large Language Models
Dhiman Goswami | Marcos Zampieri | Kai North | Shervin Malmasi | Antonios Anastasopoulos

Native Language Identification (NLI) is the task of automatically identifying the native language (L1) of individuals based on their second language (L2) production. The introduction of Large Language Models (LLMs) with billions of parameters has renewed interest in text-based NLI, with new studies exploring LLM-based approaches to NLI on English L2. The capabilities of state-of-the-art LLMs on non-English NLI corpora, however, have not yet been fully evaluated. To fill this important gap, we present the first evaluation of LLMs for multilingual NLI. We evaluated the performance of several LLMs compared to traditional statistical machine learning models and language-specific BERT-based models on NLI corpora in English, Italian, Norwegian, and Portuguese. Our results show that fine-tuned GPT-4 models achieve state-of-the-art NLI performance.

pdf bib
Generating Synthetic Free-text Medical Records with Low Re-identification Risk using Masked Language Modeling
Samuel Belkadi | Libo Ren | Nicolo Micheletti | Lifeng Han | Goran Nenadic

The abundance of medical records holds great promise for enhancing healthcare and advancing biomedical research. However, due to privacy constraints, access to such data is typically limited to internal use. Recent studies have attempted to overcome this challenge by generating synthetic data through Causal Language Modelling. Yet, this approach often fails to ensure patient anonymity and offers limited control over output diversity, unless additional computational cost is introduced. In response, we propose a method for generating synthetic free-text medical records based on Masked Language Modelling. Our approach retains key medical details while introducing variability in the generated texts and reducing the risk of patient re-identification. With a relatively lightweight architecture of approximately 120 million parameters, the system ensures low inference costs. Experimental results show that our method produces high-quality synthetic data, achieving a HIPAA-compliant PHI recall of 96% and a re-identification risk of only 3.5%. Furthermore, downstream evaluations reveal that models trained on the synthetic data perform comparably to those trained on real-world data. Our trained models are publicly available on GitHub as SynDeidMLM (synthetic and de-identified data generation using MLM) at https://github.com/SamySam0/SynDeidMLM.
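
The core idea, masking spans and letting an MLM propose replacements, can be illustrated with a generic fill-mask pipeline; the checkpoint below is a stand-in (the paper trains its own ~120M-parameter model), and the example sentence is invented.

    from transformers import pipeline

    # Checkpoint is illustrative; any masked language model with a fill-mask
    # head can stand in for the paper's purpose-trained model.
    fill = pipeline("fill-mask", model="distilroberta-base")

    # Masking an identifying token and sampling a replacement introduces
    # variability while keeping the clinical template intact.
    text = "Patient <mask> was admitted with chest pain and started on aspirin."
    candidates = fill(text, top_k=3)
    synthetic = candidates[0]["sequence"]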

pdf bib
How many words does it take to understand a low-resource language?
Emily Chang | Nada Basit

When developing language technology, researchers have routinely turned to transfer learning to resolve the data scarcity conundrum presented in low-resource languages. As far as we know, this study is the first to evaluate the amount of documentation needed for transfer learning, specifically the smallest vocabulary size needed to create a sentence embedding space. In adopting widely spoken languages as a proxy for low-resource languages, our experiments show that the relationship between a sentence embedding’s vocabulary size and performance is logarithmic, with performance leveling off at a vocabulary size of 25,000. It should be noted that this relationship cannot be replicated across all languages, and this level of documentation does not exist for many low-resource languages. We do observe, however, that performance accelerates at a vocabulary size of 1,000, a quantity that is present in most low-resource language documentation. These results can aid researchers in understanding whether a low-resource language has enough documentation necessary to support the creation of a sentence embedding and language model.

pdf bib
Linear Relational Decoding of Morphology in Language Models
Eric Xia | Jugal Kalita

A two-part affine approximation has been found to be a good approximation for transformer computations over certain subject-object relations. Adapting the Bigger Analogy Test Set, we show that the linear transformation Ws, where s is a middle-layer representation of a subject token and W is derived from model derivatives, can accurately reproduce final object states for many relations. This linear technique achieves 90% faithfulness on morphological relations, with similar findings across languages and models. Our results suggest that some conceptual relationships in language models, such as morphology, are readily interpretable from latent space and are sparsely encoded by cross-layer linear transformations.
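
Stated as a math sketch (our reading of the abstract; taking the model derivative W to be the Jacobian of the object representation with respect to the subject representation is an assumption on our part):

    \[
    \hat{o} \approx W s, \qquad
    W \approx \left.\frac{\partial o}{\partial s}\right|_{s = s_0},
    \]
    where $s$ is the middle-layer subject representation, $o$ is the final-layer
    object representation, and faithfulness is the fraction of relation instances
    for which decoding $\hat{o}$ recovers the correct object token.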

pdf bib
SPY: Enhancing Privacy with Synthetic PII Detection Dataset
Maksim Savkin | Timur Ionov | Vasily Konovalov

We introduce the SPY dataset, a novel synthetic dataset for the task of Personal Identifiable Information (PII) detection, underscoring the significance of protecting PII in modern data processing. Our research innovates by leveraging Large Language Models (LLMs) to generate a dataset that emulates real-world PII scenarios. Through evaluation, we validate the dataset’s quality, providing a benchmark for PII detection. Comparative analyses reveal that while PII detection and Named Entity Recognition (NER) share similarities, dedicated NER models exhibit limitations when applied to PII-specific contexts. This work contributes to the field by making the generation methodology and the generated dataset publicly available, thereby enabling further research and development in this field.

pdf bib
Tighter Clusters, Safer Code? Improving Vulnerability Detection with Enhanced Contrastive Loss
Pranav Kapparad | Biju R Mohan

Distinguishing vulnerable code from non-vulnerable code is challenging due to high inter-class similarity. Supervised contrastive learning (SCL) improves embedding separation but struggles with intra-class clustering, especially when variations within the same class are subtle. We propose Cluster-Enhanced Supervised Contrastive Loss (CESCL), an extension of SCL with a distance-based regularization term that tightens intra-class clustering while maintaining inter-class separation. Evaluating on CodeBERT and GraphCodeBERT with Binary Cross Entropy (BCE), BCE + SCL, and BCE + CESCL, our method improves F1 score by 1.76% on CodeBERT and 4.1% on GraphCodeBERT, demonstrating its effectiveness in code vulnerability detection and broader applicability to high-similarity classification tasks.
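
The added distance-based regularization can be illustrated as a centroid-pull term on top of the usual BCE + SCL objective; this is our own simplified rendering of the "tighter clusters" idea, not the paper's exact loss.

    import torch
    import torch.nn.functional as F

    def intra_class_tightness(embeddings, labels):
        """Distance-based regularizer (illustrative variant): mean squared
        distance of each embedding to its class centroid. The paper's exact
        term may differ; this captures the intra-class clustering idea."""
        reg = 0.0
        for c in labels.unique():
            cls = embeddings[labels == c]
            reg = reg + ((cls - cls.mean(dim=0)) ** 2).sum(dim=1).mean()
        return reg / len(labels.unique())

    # Combined objective would be: BCE (task) + SCL (separation) + lambda * tightness.
    emb = F.normalize(torch.randn(8, 16), dim=1)
    labels = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1])
    loss_reg = 0.1 * intra_class_tightness(emb, labels)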

pdf bib
Text Extraction and Script Completion in Images of Arabic Script-Based Calligraphy: A Thesis Proposal
Dilara Zeynep Gürer | Ümit Atlamaz | Şaziye Betül Özateş

Arabic calligraphy carries rich historical information and meaning. However, the complexity of its artistic elements and the absence of a consistent baseline make text extraction from such works highly challenging. In this paper, we provide an in-depth analysis of the unique obstacles in processing and interpreting these images, including the variability in calligraphic styles, the influence of artistic distortions, and the challenges posed by missing or damaged text elements. We explore potential solutions by leveraging state-of-the-art architectures and deep learning models, including visual language models, to improve text extraction and script completion.

pdf bib
Subasa - Adapting Language Models for Low-resourced Offensive Language Detection in Sinhala
Shanilka Haturusinghe | Tharindu Cyril Weerasooriya | Christopher M Homan | Marcos Zampieri | Sidath Ravindra Liyanage

Accurate detection of offensive language is essential for a number of applications related to social media safety. There is a sharp contrast in performance in this task between low and high-resource languages. In this paper, we adapt fine-tuning strategies that have not been previously explored for Sinhala in the downstream task of offensive language detection. Using this approach, we introduce four models: “Subasa-XLM-R”, which incorporates an intermediate Pre-Finetuning step using Masked Rationale Prediction, and two variants of “Subasa-Llama” as well as “Subasa-Mistral”, which are fine-tuned versions of Llama (3.2) and Mistral (v0.3), respectively, with a task-specific strategy. We evaluate our models on the SOLD benchmark dataset for Sinhala offensive language detection. All our models outperform existing baselines. Subasa-XLM-R achieves the highest Macro F1 score (0.84), surpassing state-of-the-art large language models like GPT-4o when evaluated on the same SOLD benchmark dataset under zero-shot settings. The models and code are publicly available.

pdf bib
Integrating Symbolic Execution into the Fine-Tuning of Code-Generating LLMs
Marina Sakharova | Abhinav Anand | Mira Mezini

Code-generating Large Language Models (LLMs) have become essential tools in modern software development, enhancing productivity and accelerating development. This paper aims to investigate the fine-tuning of code-generating LLMs using Reinforcement Learning and Direct Preference Optimization, further improving their performance. To achieve this, we enhance the training data for the reward model with the help of symbolic execution techniques, ensuring more comprehensive and objective data. With symbolic execution, we create a custom dataset that better captures the nuances in code evaluation. Our reward models, fine-tuned on this dataset, demonstrate significant improvements over the baseline, CodeRL, in estimating the quality of generated code. Our code-generating LLMs, trained with the help of reward model feedback, achieve similar results compared to the CodeRL benchmark.

pdf bib
Through the Looking Glass: Common Sense Consistency Evaluation of Weird Images
Elisei Rykov | Kseniia Petrushina | Kseniia Titova | Anton Razzhigaev | Alexander Panchenko | Vasily Konovalov

Measuring how real an image looks is a complex task in artificial intelligence research. For example, an image of Albert Einstein holding a smartphone violates common sense because modern smartphones were invented after Einstein’s death. We introduce a novel method, which we call Through the Looking Glass (TLG), to assess image common-sense consistency using Large Vision-Language Models (LVLMs) and a Transformer-based encoder. By leveraging an LVLM to extract atomic facts from these images, we obtain a mix of accurate facts. We then fine-tune a compact attention-pooling classifier over the encoded atomic facts. Our TLG achieves new state-of-the-art performance on the WHOOPS! and WEIRD datasets while leveraging a compact fine-tuning component.

pdf bib
ColorFoil: Investigating Color Blindness in Large Vision and Language Models
Ahnaf Mozib Samin | M Firoz Ahmed | Md. Mushtaq Shahriyar Rafee

With the utilization of the Transformer architecture, large Vision and Language (V&L) models have shown promising performance even in zero-shot settings. Several studies, however, indicate a lack of robustness in the models when dealing with complex linguistic and visual attributes. In this work, we introduce a novel V&L benchmark, ColorFoil, created from color-related foils to assess the models’ perception ability to detect colors like red, white, and green. We evaluate seven state-of-the-art V&L models, including CLIP, ViLT, GroupViT, and BridgeTower, in a zero-shot setting and present intriguing findings. The experimental evaluation indicates that ViLT and BridgeTower demonstrate much better color perception capabilities than CLIP and its variants and GroupViT. Moreover, CLIP-based models and GroupViT struggle to distinguish colors that are visually distinct to humans with normal color perception.

pdf bib
Towards Practical and Knowledgeable LLMs for a Multilingual World: A Thesis Proposal
Bryan Li

The frontier of large language model (LLM) development has largely been substantiated by knowledge-intensive tasks specified in English. In this proposed thesis, I argue for the key role that multilinguality occupies in the development of practical and knowledgeable LLMs. First, I consider practical methods to improve LLMs’ performance on standard natural language processing (NLP) tasks by leveraging their existing multilingual knowledge. Then, I investigate the underlying multilingual knowledge of LLMs with two benchmarks: one on complex reasoning, and one on territorial disputes. These benchmarks reveal LLMs’ inconsistent performance across languages. I then design efficient techniques, both at inference time and training time, to address these discrepancies. Finally, I extend the territorial disputes benchmark to the retrieval-augmented generation (RAG) setting, comparing the effects of different retrieval settings on cross-lingual robustness. My proposal shows that informed use of multilinguality enhances LLMs’ capabilities, and our understanding thereof.

pdf bib
MDC3: A Novel Multimodal Dataset for Commercial Content Classification in Bengali
Anik Mahmud Shanto | Mst. Sanjida Jamal Priya | Fahim Shakil Tamim | Mohammed Moshiul Hoque

Identifying commercial posts in resource-constrained languages among diverse and unstructured content remains a significant challenge for automatic text classification tasks. To address this, this work introduces a novel dataset named MDC3 (Multimodal Dataset for Commercial Content Classification), comprising 5,007 annotated Bengali social media posts classified as commercial and noncommercial. A comprehensive annotation guideline accompanying the dataset is included to aid future dataset creation in resource-constrained languages. Furthermore, we performed extensive experiments on MDC3 considering both unimodal and multimodal domains. Specifically, the late fusion of textual (mBERT) and visual (ViT) models (i.e., ViT+mBERT) achieves the highest F1 score of 90.91, significantly surpassing other baselines.

pdf bib
DateLogicQA: Benchmarking Temporal Biases in Large Language Models
Gagan Bhatia | Ming Ze Tang | Cristina Mahanta | Madiha Kazi

We introduce DateLogicQA, a human-curated benchmark of 190 questions specifically designed to understand temporal bias in Large Language Models (LLMs). Covering seven date formats across past, present, and future contexts, DateLogicQA examines four reasoning types: commonsense, factual, conceptual, and numerical. Through human-led evaluations of 12 state-of-the-art LLMs, we identify Representation-Level Bias, arising from suboptimal embeddings that distort date semantics, and Logical-Level Bias, manifesting when correct date tokens yield flawed temporal reasoning. Our findings underscore persistent challenges in handling various date formats and temporal contexts, revealing the need for more robust pretraining data, targeted post-training methods, and precise tokenization strategies. By illuminating these biases, we provide actionable insights to guide the development of LLMs for accurate temporal reasoning across diverse real-world applications.

pdf bib
AMR-RE: Abstract Meaning Representations for Retrieval-Based In-Context Learning in Relation Extraction
Peitao Han | Lis Pereira | Fei Cheng | Wan Jou She | Eiji Aramaki

Existing in-context learning (ICL) methods for relation extraction (RE) often prioritize language similarity over structural similarity, which may result in overlooking entity relationships. We propose an AMR-enhanced retrieval-based ICL method for RE to address this issue. Our model retrieves in-context examples based on semantic structure similarity between task inputs and training samples. We conducted experiments in the supervised setting on four standard English RE datasets. The results show that our method achieves state-of-the-art performance on three datasets and competitive results on the fourth. Furthermore, our method outperforms baselines by a large margin across all datasets in the more demanding unsupervised setting.

pdf bib
Linguistic Analysis of Veteran Job Interviews to Assess Effectiveness in Translating Military Expertise to the Civilian Workforce
Caroline J. Wendt | Ehsanul Haque Nirjhar | Theodora Chaspari

The ways in which natural language processing (NLP) can help veterans improve their effectiveness in translating military experience to workforce utility are underexplored. We design NLP experiments to evaluate the degree of explanation in veteran job interview responses as a proxy for perceived hireability. We examine linguistic and psycholinguistic features, context, and participant variability to investigate the mechanics of effective communication in employee selection. Results yield good performance when distinguishing between varying degrees of explanation in responses using LIWC features, indicating the robustness of linguistic feature integration. Classifying over- and under-explained responses reflects the challenges of class imbalance and the limitations of the tested NLP methods for detecting subtleties in overly verbose or concise communication. Our findings have immediate applications for assistive technologies in job interview settings, and broader implications for enhancing automated communication assessment tools and refining strategies for training and interventions in communication-heavy fields.

pdf bib
MetaMeme: A Dataset for Meme Template and Meta-Category Classification
Benjamin Lambright | Jordan Youner | Constantine Lignos

This paper introduces a new dataset for classifying memes by their template and communicative intent. It includes a broad selection of meme templates and examples scraped from imgflip and a smaller hand-annotated set of memes scraped from Reddit. The Reddit memes have been annotated for meta-category using a novel annotation scheme that classifies memes by the structure of the perspective they are being used to communicate. YOLOv11 and ChatGPT 4o are used to provide baseline modeling results. We find that YOLO struggles with template classification on real-world data but outperforms ChatGPT in classifying meta-categories.

pdf bib
Representing and Clustering Errors in Offensive Language Detection
Jood Otey | Laura Biester | Steven R Wilson

Content moderation is essential in preventing the spread of harmful content on the Internet. However, there are instances where moderation fails, and it is important to understand when and why that happens. Workflows that aim to uncover a system's weaknesses typically cluster the embeddings of data points to group errors together. In this paper, we evaluate the K-Means clustering of four text representations for the task of offensive language detection in English and Levantine Arabic. We find that Sentence-BERT (SBERT) embeddings give the most human-interpretable clustering for English errors, with groupings based mainly on the group targeted in the text. Meanwhile, SBERT embeddings of Large Language Model (LLM)-generated linguistic features give the most interpretable clustering for Arabic errors.
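
As an illustration of the clustering workflow evaluated in the paper, the sketch below embeds a handful of hypothetical misclassified texts with Sentence-BERT and groups them with K-Means; the encoder name, the texts, and the choice of k are placeholders, not the paper's configuration.

    # Sketch: cluster misclassified texts by SBERT embedding (encoder and k are placeholders).
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    errors = [  # hypothetical misclassified examples
        "example error text one", "example error text two",
        "example error text three", "example error text four",
    ]
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(errors)
    labels = KMeans(n_clusters=2, random_state=0).fit_predict(embeddings)
    for text, label in zip(errors, labels):
        print(label, text)  # inspect each cluster for a shared error pattern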

pdf bib
ELIOT: Zero-Shot Video-Text Retrieval through Relevance-Boosted Captioning and Structural Information Extraction
Xuye Liu | Yimu Wang | Jian Zhao

Recent advances in video-text retrieval (VTR) have largely relied on supervised learning and fine-tuning. In this paper, we introduce ELIOT, a novel zero-shot VTR framework that leverages off-the-shelf video captioners, large language models (LLMs), and text retrieval methods, entirely without additional training or annotated data. Due to the limited power of captioning methods, the captions often miss important content in the video, resulting in unsatisfactory retrieval performance. To translate more information into video captions, ELIOT first generates initial captions for videos, then enhances them using a relevance-boosted captioning strategy powered by LLMs, enriching video descriptions with salient details. To further emphasize key content, we propose structural information extraction, organizing visual elements such as objects, events, and attributes into structured templates, further boosting retrieval performance. Benefiting from the enriched captions and structured information, extensive experiments on several video-text retrieval benchmarks demonstrate the superiority of ELIOT over existing fine-tuned and pretrained methods without using any training data. They also show that the enriched captions capture key details from the video with minimal noise. Code and data will be released to facilitate future research.

pdf bib
Can Large Language Models Advance Crosswalks? The Case of Danish Occupation Codes
Bolei Ma | Cynthia A. Huang | Anna-Carolina Haensch

Crosswalks, which map one classification system to another, are critical tools for harmonizing data across time, countries, or frameworks. However, constructing crosswalks is labor-intensive and often requires domain expertise. This paper investigates the potential of Large Language Models (LLMs) to assist in creating crosswalks, focusing on two Danish occupational classification systems from different time periods as a case study. We propose a two-stage, prompt-based framework for this task, where LLMs perform similarity assessments between classification codes and identify final mappings through a guided decision process. Using four instruction-tuned LLMs and comparing them against an embedding-based baseline, we evaluate the performance of different models on crosswalk creation. Our results highlight the strengths of LLMs in crosswalk creation compared to the embedding-based baseline, showing the effectiveness of the interactive prompt-based framework for constructing crosswalks with LLMs. Furthermore, we analyze the impact of model combinations across two interactive rounds, highlighting the importance of model selection and consistency. This work contributes to the growing field of NLP applications for domain-specific knowledge mapping and demonstrates the potential of LLMs in advancing crosswalk methodologies.

pdf bib
Paraphrase-based Contrastive Learning for Sentence Pair Modeling
Seiji Sugiyama | Risa Kondo | Tomoyuki Kajiwara | Takashi Ninomiya

To improve the performance of sentence pair modeling tasks, we propose an additional pre-training method, also known as transfer fine-tuning, for pre-trained masked language models. Pre-training for masked language modeling is not necessarily designed to bring semantically similar sentences closer together in the embedding space. Our proposed method aims to improve the performance of sentence pair modeling by applying contrastive learning to pre-trained masked language models, in which the sentence embeddings of paraphrase pairs are made similar to each other. While the natural language inference corpora that are standard in previous studies on contrastive learning are not available at a large scale for non-English languages, our method can construct a training corpus for contrastive learning from a raw corpus and a paraphrase dictionary at low cost. Experimental results on four sentence pair modeling tasks revealed the effectiveness of our method in both English and Japanese.
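
A minimal sketch of the kind of contrastive objective the method applies, assuming an in-batch InfoNCE formulation in which row i of each batch holds one side of a paraphrase pair; details such as the temperature are assumptions, not the paper's exact setup.

    # Sketch: in-batch InfoNCE over paraphrase pairs (temperature is an assumed setting).
    import torch
    import torch.nn.functional as F

    def paraphrase_contrastive_loss(emb_a, emb_b, temperature=0.05):
        # emb_a[i] and emb_b[i] embed the two sides of paraphrase pair i;
        # all other rows in the batch act as negatives.
        emb_a = F.normalize(emb_a, dim=-1)
        emb_b = F.normalize(emb_b, dim=-1)
        logits = emb_a @ emb_b.T / temperature  # scaled cosine similarities
        targets = torch.arange(emb_a.size(0))   # the matching row is the positive
        return F.cross_entropy(logits, targets)

    loss = paraphrase_contrastive_loss(torch.randn(8, 768), torch.randn(8, 768))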

pdf bib
Do Video Language Models really understand the video contexts?
Jeongwan Shin | Jinhyeong Lim | Hyeyoung Park

This paper examines how well visual language models (VLMs) understand video question answering (VideoQA) tasks and generate responses accordingly. Recently, VLMs based on Large Language Models (LLMs) have shown remarkable performance, but the processes of understanding and reasoning in VLMs remain under-explored. To tackle this challenge, we propose Video Understanding and Response Consistency Assessment, VURCA, a framework that incorporates a fine-grained question generation and answering process to measure how well the responses generated by VLMs align with what the model understands. In addition, we introduce an extended benchmark dataset, FgNExT-QA, which builds upon NExT-QA by incorporating more fine-grained VideoQA tasks. FgNExT-QA is designed to evaluate fine-grained understanding in video question answering. Through experiments, we found that despite the strong overall QA performance of VLMs, their understanding of both the video content and the question remains limited. In particular, they exhibit poor video comprehension in fine-grained VideoQA tasks.

pdf bib
Evaluating Text Style Transfer Evaluation: Are There Any Reliable Metrics?
Sourabrata Mukherjee | Atul Kr. Ojha | John Philip McCrae | Ondrej Dusek

Text style transfer (TST) is the task of transforming a text to reflect a particular style while preserving its original content. Evaluating TST outputs is a multidimensional challenge, requiring the assessment of style transfer accuracy, content preservation, and naturalness. Using human evaluation is ideal but costly, as in other natural language processing (NLP) tasks; however, automatic metrics for TST have not received as much attention as metrics for, e.g., machine translation or summarization. In this paper, we examine both existing and novel metrics from broader NLP tasks for TST evaluation, focusing on two popular subtasks, sentiment transfer and detoxification, in a multilingual context comprising English, Hindi, and Bengali. By conducting meta-evaluation through correlation with human judgments, we demonstrate the effectiveness of these metrics when used individually and in ensembles. Additionally, we investigate the potential of large language models (LLMs) as tools for TST evaluation. Our findings highlight that newly applied advanced NLP metrics and LLM-based evaluations provide better insights than existing TST metrics. Our oracle ensemble approaches show even more potential.

pdf bib
(CPER) From Guessing to Asking: An Approach to Resolving Persona Knowledge Gap in LLMs during Multi-Turn Conversations
Sarvesh Baskar | Manas Gaur | Srinivasan Parthasarathy | Tanmay Tulsidas Verlekar

In multi-turn dialogues, large language models face the critical challenge of ensuring coherence while adapting to user-specific information. This study introduces the persona knowledge gap, the discrepancy between a model's internal understanding and the knowledge required for coherent, personalized conversations. While prior research has recognized these gaps, computational methods for their identification and resolution remain underexplored. We propose Conversation Preference Elicitation and Recommendation (CPER), a novel framework that dynamically detects and resolves persona knowledge gaps using intrinsic uncertainty quantification and feedback-driven refinement. CPER consists of three key modules: a Contextual Understanding Module for preference extraction, a Dynamic Feedback Module for measuring uncertainty and refining persona alignment, and a Persona-Driven Response Generation module for adapting responses based on accumulated user context. We evaluate CPER on two real-world datasets: CCPE-M for preferential movie recommendations and ESConv for mental health support. Using A/B testing, human evaluators preferred CPER's responses 42% more often than baseline models in CCPE-M and 27% more often in ESConv. A qualitative human evaluation confirms that CPER's responses are preferred for maintaining contextual relevance and coherence, particularly in longer (12+ turn) conversations.

pdf bib
Streamlining LLMs: Adaptive Knowledge Distillation for Tailored Language Models
Prajvi Saxena | Sabine Janzen | Wolfgang Maass

Large language models (LLMs) like GPT-4 and LLaMA-3 offer transformative potential across industries, e.g., enhancing customer service, revolutionizing medical diagnostics, or identifying crises in news articles. However, deploying LLMs faces challenges such as limited training data, high computational costs, and issues with transparency and explainability. Our research focuses on distilling compact, parameter-efficient tailored language models (TLMs) from LLMs for domain-specific tasks with comparable performance. Current approaches like knowledge distillation, fine-tuning, and model parallelism address computational efficiency but lack hybrid strategies to balance efficiency, adaptability, and accuracy. We present ANON, an adaptive knowledge distillation framework integrating knowledge distillation with adapters to generate computationally efficient TLMs without relying on labeled datasets. ANON uses cross-entropy loss to transfer knowledge from the teacher's outputs and internal representations while employing adaptive prompt engineering and a progressive distillation strategy for phased knowledge transfer. We evaluated ANON's performance in the crisis domain, where accuracy is critical and labeled data is scarce. Experiments showed that ANON outperforms recent knowledge distillation approaches both in the performance of the resulting TLM and in reducing the computational costs of training, while maintaining accuracy comparable to LLMs for domain-specific applications.
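
The abstract describes transferring knowledge from the teacher's outputs with a cross-entropy-style objective; a standard form of such a distillation term is sketched below, where the temperature, the T^2 scaling, and the KL variant are assumptions rather than ANON's published recipe.

    # Sketch of a distillation loss on teacher outputs (standard form; not ANON's exact recipe).
    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, temperature=2.0):
        # Soften both distributions, then match the student to the teacher via KL divergence.
        teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
        student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
        # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
        return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature ** 2

    loss = distillation_loss(torch.randn(4, 100), torch.randn(4, 100))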

pdf bib
LLM DEBATE OPPONENT : Counter-argument Generation focusing on Implicit and Critical Premises
Taisei Ozaki | Chihiro Nakagawa | Naoya Inoue | Shoichi Naito | Kenshi Yamaguchi

Debate education fosters critical thinking skills but often incurs high human costs. Recent advancements in Large Language Models (LLMs) show promise in automating counter-argument generation. However, it remains unclear how best to guide LLMs to target both implicit and critical premises. In this study, we systematically compare multi-step and one-step generation methods for counter-arguments across 100 debate topics. Our findings reveal that one-step approaches consistently outperform multi-step pipelines, owing to their better grasp of the “motion spirit,” minimized propagation of hallucinations, and avoidance of challenging intermediate tasks. Among premise-targeting methods, a one-step strategy that accounts for both implicit and explicit premises—Generated and Targeted Premise Attack (GTG)—emerges as the strongest performer in expert and automated evaluations. These results highlight the value of direct, integrated prompts for leveraging LLMs in complex argumentation tasks and offer insights for developing more effective automated debate agents.

pdf bib
AutoML Meets Hugging Face: Domain-Aware Pretrained Model Selection for Text Classification
Parisa Safikhani | David Broneske

The effectiveness of embedding methods is crucial for optimizing text classification performance in Automated Machine Learning (AutoML). However, selecting the most suitable pre-trained model for a given task remains challenging. This study introduces the Corpus-Driven Domain Mapping (CDDM) pipeline, which utilizes a domain-annotated corpus of pre-fine-tuned models from the Hugging Face Model Hub to improve model selection. Integrating these models into AutoML systems significantly boosts classification performance across multiple datasets compared to baseline methods. Despite some domain recognition inaccuracies, results demonstrate CDDM’s potential to enhance model selection, streamline AutoML workflows, and reduce computational costs.

pdf bib
Paraphrasing Attack Resilience of Various Machine-Generated Text Detection Methods
Andrii Shportko | Inessa Verbitsky

The recent large-scale emergence of LLMs has raised open questions about how to deal with their consequences, such as plagiarism and the spread of false information on the Internet. Coupled with the rise of AI-detector bypassing tools, this puts reliable machine-generated text detection in increasingly high demand. We investigate the paraphrasing attack resilience of various machine-generated text detection methods, evaluating three approaches: fine-tuned RoBERTa, Binoculars, and text feature analysis, along with their ensembles using Random Forest classifiers. We discovered that Binoculars-inclusive ensembles yield the strongest results, but they also suffer the most significant losses during attacks. In this paper, we present the dichotomy of performance versus resilience in the world of AI text detection, which complicates the current perception of reliability among state-of-the-art techniques.

pdf bib
Detecting, Generating, and Evaluating in the Writing Style of Different Authors
Mosab Rezaei

In recent years, stylometry has been investigated in many different fields. Hence, in this work, we tackle the problem of detecting, generating, and evaluating textual documents according to writing style by leveraging state-of-the-art models. In the first step, sentences will be extracted from several different books, each belonging to a different author, to create a dataset. Then the selected models will be trained to detect the author of each sentence in the dataset. After that, generator models will be utilized to generate sentences based on the authors' writing styles using unpaired samples in the dataset. Finally, to evaluate the performance of the generators, the previously trained models will be used to assess the generated sentences and to compare the distribution of various syntactic features between the original and generated sentences. We hope the results will show that models can be trained to detect and generate textual documents in the writing style of given authors.

pdf bib
Collaborative Data Exploration through Visualization: A Thesis Proposal Analyzing Impact of Conversational Assistants
Abari Bhattacharya | Barbara Di Eugenio

Data visualization is integral to any Exploratory Data Analysis (EDA) task. However, generating visualizations requires expertise, presenting a steep learning curve and a significant cognitive load. Natural language interfaces for EDA aim to lower this barrier by allowing users to generate visualizations through natural language queries. However, complexity remains when EDA is performed collaboratively, requiring an environment that supports multi-user interaction. In this thesis proposal, we discuss challenges in user-system interaction in a collaborative multi-user setup, such as errors in visualization generation due to misinterpretation of user requests. We hypothesize that a Conversational Assistant (CA) capable of understanding user-initiated clarification requests and generating accurate responses can improve user experience and support collaborative EDA tasks. To this end, we propose to develop such a CA and evaluate it through a user study, thus examining its impact on user experience in a collaborative environment for EDA.

pdf bib
MENDER: Multi-hop Commonsense and Domain-specific CoT Reasoning for Knowledge-grounded Empathetic Counseling of Crime Victims
Abid Hossain | Priyanshu Priya | Armita Mani Tripathi | Pradeepika Verma | Asif Ekbal

Commonsense inference and domain-specific expertise are crucial for understanding and responding to emotional, cognitive, and topic-specific cues in counseling conversations with crime victims. However, this key evidence is often dispersed across multiple utterances, making it difficult to capture through single-hop reasoning. To address this, we propose MENDER, a novel Multi-hop commonsensE and domaiN-specific Chain-of-Thought (CoT) reasoning framework for knowleDge-grounded empathEtic Response generation in counseling dialogues. MENDER leverages large language models (LLMs) to integrate commonsense and domain knowledge via multi-hop reasoning over the dialogue context. It employs two specialized reasoning chains, viz. Commonsense Knowledge-driven CoT and Domain Knowledge-driven CoT rationales, which extract and aggregate dispersed emotional, cognitive, and topical evidence to generate knowledge-grounded empathetic counseling responses. Experimental evaluations on the counseling dialogue dataset POEM validate MENDER's efficacy in generating coherent, empathetic, knowledge-grounded responses.

pdf bib
SkipCLM: Enhancing Crosslingual Alignment of Decoder Transformer Models via Contrastive Learning and Skip Connection
Nikita Sushko | Alexander Panchenko | Elena Tutubalina

This paper proposes SkipCLM, a novel method for improving multilingual machine translation in Decoder Transformers. We augment contrastive learning for cross-lingual alignment with a trainable skip connection to preserve information crucial for accurate target language generation. Experiments with XGLM-564M on the Flores-101 benchmark demonstrate improved performance, particularly for the en-de and en-zh translation directions, compared to direct sequence-to-sequence training and existing contrastive learning methods. Code is available at: https://github.com/s-nlp/skipclm.

pdf bib
Towards LLMs Robustness to Changes in Prompt Format Styles
Lilian Ngweta | Kiran Kate | Jason Tsay | Yara Rizk

Large language models (LLMs) have gained popularity in recent years for their utility in various applications. However, they are sensitive to non-semantic changes in prompt formats, where small changes in the prompt format can lead to significant performance fluctuations. In the literature, this problem is commonly referred to as prompt brittleness. Previous research on prompt engineering has focused mainly on developing techniques for identifying the optimal prompt for specific tasks. Some studies have also explored the issue of prompt brittleness and proposed methods to quantify performance variations; however, no simple solution has been found to address this challenge. We propose Mixture of Formats (MOF), a simple and efficient technique for addressing prompt brittleness in LLMs by diversifying the styles of the few-shot examples in the prompt. MOF was inspired by computer vision techniques that utilize diverse style datasets to prevent models from associating specific styles with the target variable. Empirical results show that our proposed technique reduces style-induced prompt brittleness in various LLMs while also enhancing overall performance across prompt variations and different datasets.
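
A minimal sketch of the format-diversification idea: each few-shot example is rendered in a different surface style so the model cannot associate any single format with the target. The templates are illustrative, not those used in the paper.

    # Sketch: render few-shot examples in rotating format styles (templates are illustrative).
    FORMATS = [
        "Q: {q}\nA: {a}",
        "Question: {q}\nAnswer: {a}",
        "INPUT: {q} => OUTPUT: {a}",
    ]

    def mixture_of_formats_prompt(examples, query):
        shots = [FORMATS[i % len(FORMATS)].format(q=q, a=a)
                 for i, (q, a) in enumerate(examples)]
        # Ask the final question in one of the same styles, answer left blank.
        return "\n\n".join(shots + [FORMATS[0].format(q=query, a="").rstrip()])

    print(mixture_of_formats_prompt([("2+2?", "4"), ("3+5?", "8")], "7+1?"))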

pdf bib
Reliability of Distribution Predictions by LLMs: Insights from Counterintuitive Pseudo-Distributions
Toma Suzuki | Ayuki Katayama | Seiji Gobara | Ryo Tsujimoto | Hibiki Nakatani | Kazuki Hayashi | Yusuke Sakai | Hidetaka Kamigaito | Taro Watanabe

The proportion of responses to a question and its options, known as the response distribution, enables detailed analysis of human society. Recent studies highlight the use of Large Language Models (LLMs) for predicting response distributions as a cost-effective survey method. However, the reliability of these predictions remains unclear. LLMs often generate answers by blindly following instructions rather than applying rational reasoning based on pretraining-acquired knowledge. This study investigates whether LLMs can rationally estimate distributions when presented with explanations of “artificially generated distributions” that are against commonsense. Specifically, we assess whether LLMs recognize counterintuitive explanations and adjust their predictions or simply follow these inconsistent explanations. Results indicate that smaller or less human-optimized LLMs tend to follow explanations uncritically, while larger or more optimized models are better at resisting counterintuitive explanations by leveraging their pretraining-acquired knowledge. These findings shed light on factors influencing distribution prediction performance in LLMs and are crucial for developing reliable distribution predictions using language models.

pdf bib
Rosetta-PL: Propositional Logic as a Benchmark for Large Language Model Reasoning
Shaun Lee Baek | Shaun Esua-Mensah | Cyrus Tsui | Sejan Vigneswaralingam | Abdullah Alali | Michael Lu | Vasu Sharma | Kevin Zhu

Large Language Models (LLMs) are primarily trained on high-resource natural languages, limiting their effectiveness in low-resource settings and in tasks requiring deep logical reasoning. This research introduces Rosetta-PL, a benchmark designed to evaluate LLMs’ logical reasoning and generalization capabilities in a controlled environment. We construct Rosetta-PL by translating a dataset of logical propositions from Lean into a custom logical language, which is then used to fine-tune an LLM (e.g., GPT-4o). Our experiments analyze the impact of the size of the dataset and the translation methodology on the performance of the model. Our results indicate that preserving logical relationships in the translation process significantly boosts precision, with accuracy plateauing beyond roughly 20,000 training samples. These insights provide valuable guidelines for optimizing LLM training in formal reasoning tasks and improving performance in various low-resource language applications.

up

bib (full)
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)

pdf bib
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
Nouha Dziri | Sean (Xiang) Ren | Shizhe Diao

pdf bib
Dataverse: Open-Source ETL (Extract, Transform, Load) Pipeline for Large Language Models
Hyunbyung Park | Sukyung Lee | Gyoungjin Gim | Yungi Kim | Dahyun Kim | Chanjun Park

To address the challenges associated with data processing at scale, we propose Dataverse, a unified open-source Extract-Transform-Load (ETL) pipeline for large language models (LLMs) with a user-friendly design at its core. The easy addition of custom processors through a block-based interface allows users to readily and efficiently build their own ETL pipelines with Dataverse. We hope that Dataverse will serve as a vital tool for LLM development, and we open-source the entire library to welcome community contributions. Additionally, we provide a concise, two-minute video demonstration of our system, illustrating its capabilities and implementation.

pdf bib
ATAIGI: An AI-Powered Multimodal Learning App Leveraging Generative Models for Low-Resource Taiwanese Hokkien
Yun-Hsin Chu | Shuai Zhu | Shou-Yi Hung | Bo-Ting Lin | En-Shiun Annie Lee | Richard Tzong-Han Tsai

Many endangered languages are at risk of extinction due to barriers in communication and generational gaps that hinder their preservation. One cause of language endangerment is the lack of language educational tools and artificial intelligence (AI) models for these low-resource languages. To address this, we propose the ATAIGI learning app, designed around AI-powered models leveraging multimodal generative techniques. Our app offers users a comprehensive learning experience by providing translated phrases and definitions, example sentences, illustrative images, romanized pronunciation, and audio speech to accelerate language learning. ATAIGI is built on five AI models that are rigorously benchmarked individually, with our Transliteration Model achieving state-of-the-art results for Taiwanese Hokkien transliteration. ATAIGI is available for all to learn Taiwanese Hokkien, an endangered language spoken in Taiwan. A human evaluation demonstrates the effectiveness of ATAIGI in improving language proficiency and cultural understanding, supporting its potential for the preservation and education of endangered languages like Taiwanese Hokkien.

pdf bib
CLEAR-Command: Coordinated Listening, Extraction, and Allocation for Emergency Response with Large Language Models
Achref Doula | Bela Bohlender | Max Mühlhäuser | Alejandro Sanchez Guinea

Effective communication is vital in emergency response scenarios where clarity and speed can save lives. Traditional systems often struggle under the chaotic conditions of real-world emergencies, leading to breakdowns in communication and task management. This paper introduces CLEAR-Command, a system that leverages Large Language Models (LLMs) to enhance emergency communications. CLEAR stands for Coordinated Listening, Extraction, and Allocation in Response. CLEAR-Command automates the transcription, summarization, and task extraction from live radio communications of emergency first responders, using the OpenAI Whisper API for transcription and gpt-4o for summarization and task extraction. Our system provides a dynamic overview of task allocations and their execution status, significantly improving the accuracy of task identification and the clarity of communication. We evaluated our system through an expert pre-study with 4 experts and a user study with 13 participants. The expert pre-study identified gpt-4o as providing the most accurate task extraction, while the user study showed that CLEAR-Command significantly outperforms traditional radio communication in terms of clarity, trust, and correctness of task extraction. Our demo is hosted under this link, and all project details are presented on our GitLab page.
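
A minimal sketch of the transcribe-then-extract pipeline the abstract describes, using the OpenAI Python client; the audio file name, the prompt wording, and the single-call structure are illustrative assumptions, not CLEAR-Command's implementation.

    # Sketch: transcribe radio audio, then extract tasks with an LLM (illustrative pipeline).
    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    with open("radio.wav", "rb") as audio_file:  # hypothetical recording of responder radio traffic
        transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Summarize and list the tasks in: " + transcript.text}],
    )
    print(completion.choices[0].message.content)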

pdf bib
LM-Pub-Quiz: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models
Max Ploner | Jacek Wiland | Sebastian Pohl | Alan Akbik

Knowledge probing evaluates the extent to which a language model (LM) has acquired relational knowledge during its pre-training phase. It provides a cost-effective means of comparing LMs of different sizes and training setups and is useful for monitoring knowledge gained or lost during continual learning (CL). In prior work, we presented an improved knowledge probe called BEAR (Wiland et al., 2024), which enables the comparison of LMs trained with different pre-training objectives (causal and masked LMs) and addresses issues of skewed distributions in previous probes to deliver a more unbiased reading of LM knowledge. With this paper, we present LM-Pub-Quiz, a Python framework and leaderboard built around the BEAR probing mechanism that enables researchers and practitioners to apply it in their work. It provides options for standalone evaluation and direct integration into the widely-used training pipeline of the Hugging Face transformers library. Further, it provides a fine-grained analysis of different knowledge types to assist users in better understanding the knowledge in each evaluated LM. We publicly release LM-Pub-Quiz as an open-source project: https://lm-pub-quiz.github.io/

pdf bib
TRACE: Real-Time Multimodal Common Ground Tracking in Situated Collaborative Dialogues
Hannah VanderHoeven | Brady Bhalla | Ibrahim Khebour | Austin C. Youngren | Videep Venkatesha | Mariah Bradford | Jack Fitzgerald | Carlos Mabrey | Jingxuan Tu | Yifan Zhu | Kenneth Lai | Changsoo Jung | James Pustejovsky | Nikhil Krishnaswamy

We present TRACE, a novel system for live *common ground* tracking in situated collaborative tasks. With a focus on fast, real-time performance, TRACE tracks the speech, actions, gestures, and visual attention of participants, uses these multimodal inputs to determine the set of task-relevant propositions that have been raised as the dialogue progresses, and tracks the group’s epistemic position and beliefs toward them as the task unfolds. Amid increased interest in AI systems that can mediate collaborations, TRACE represents an important step forward for agents that can engage with multiparty, multimodal discourse.

pdf bib
MT-LENS: An all-in-one Toolkit for Better Machine Translation Evaluation
Javier García Gilabert | Carlos Escolano | Audrey Mash | Xixian Liao | Maite Melero

We introduce MT-Lens, a framework designed to evaluate Machine Translation (MT) systems across a variety of tasks, including translation quality, gender bias detection, added toxicity, and robustness to misspellings. While several toolkits have become very popular for benchmarking the capabilities of Large Language Models (LLMs), existing evaluation tools often lack the ability to thoroughly assess the diverse aspects of MT performance. MT-Lens addresses these limitations by extending the capabilities of LM-eval-harness for MT, supporting state-of-the-art datasets and a wide range of evaluation metrics. It also offers a user-friendly platform to compare systems and analyze translations with interactive visualizations. MT-Lens aims to broaden access to evaluation strategies that go beyond traditional translation quality evaluation, enabling researchers and engineers to better understand the performance of an NMT model and easily measure a system's biases.

pdf bib
A Learning-based Multi-Frame Visual Feature Framework for Real-Time Driver Fatigue Detection
Liang Xie | Songlin Fan

Driver fatigue is a significant factor contributing to road accidents, highlighting the need for reliable and accurate detection methods. In this study, we introduce a novel learning-based multi-frame visual feature framework (LMVFF) designed for precise fatigue detection. Our methodology comprises several clear and interpretable steps. Initially, facial landmarks are detected, enabling the calculation of eye and lip distances and the assessment of head rotation angles based on 68 identified landmarks. Subsequently, visual features from the eye region are extracted, and an effective visual model is developed to accurately classify eye openness. Additionally, features characterizing lip movements are analyzed to detect yawning, thereby enriching fatigue detection through continuous monitoring of eye blink frequency, yawning occurrences, and head movements. Compared to conventional single-feature detection approaches, LMVFF significantly reduces instances of fatigue misidentification. Moreover, we employ various quantization and compression techniques at multiple computation stages, substantially reducing the latency of our system and achieving a real-time frame rate of 25-30 FPS for practical applications.
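
As a concrete example of the landmark-distance features described above, the classic eye aspect ratio (EAR) over six eye landmarks is sketched below; it is a common eye-openness measure, though not necessarily the exact feature LMVFF uses.

    # Sketch: eye aspect ratio (EAR) from six eye landmarks of the 68-point scheme.
    import numpy as np

    def eye_aspect_ratio(eye):
        # eye: (6, 2) array of landmark coordinates around one eye.
        v1 = np.linalg.norm(eye[1] - eye[5])  # first vertical distance
        v2 = np.linalg.norm(eye[2] - eye[4])  # second vertical distance
        h = np.linalg.norm(eye[0] - eye[3])   # horizontal distance
        return (v1 + v2) / (2.0 * h)

    # EAR drops sharply when the eye closes; thresholding it over consecutive
    # frames is a common way to flag blinks and drowsiness.
    print(eye_aspect_ratio(np.random.rand(6, 2)))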

pdf bib
TRUSTEVAL: A Dynamic Evaluation Toolkit on Trustworthiness of Generative Foundation Models
Yanbo Wang | Jiayi Ye | Siyuan Wu | Chujie Gao | Yue Huang | Xiuying Chen | Yue Zhao | Xiangliang Zhang

Ensuring the trustworthiness of Generative Foundation Models (GenFMs) is a pressing challenge as they gain widespread use. Existing evaluation toolkits are often limited in scope, dynamism, and flexibility. This paper introduces TRUSTEVAL, a dynamic and comprehensive toolkit designed for evaluating GenFMs across various dimensions. TRUSTEVAL supports both dynamic dataset generation and evaluation, offering advanced features including comprehensiveness, usability, and flexibility. TRUSTEVAL integrates diverse generative models, datasets, evaluation methods, metrics, inference efficiency enhancement, and evaluation report generation. Through case studies, we demonstrate TRUSTEVAL’s potential to advance the trustworthiness evaluation of GenFMs.

pdf bib
AutoClean: LLMs Can Prepare Their Training Corpus
Xingyu Shen | Shengding Hu | Xinrong Zhang | Xu Han | Xiaojun Meng | Jiansheng Wei | Zhiyuan Liu | Maosong Sun

Recent studies highlight the reliance of Large Language Models (LLMs) on high-quality, diverse data for optimal performance. Data sourced from the Internet, often aggregated into datasets like the Common Crawl corpus, presents significant quality variability and necessitates extensive cleaning. Moreover, specific domain knowledge is usually presented in HTML, but there is a lack of effective methods to clean it into the training corpus automatically. Traditional cleaning methods involve either labor-intensive human teams that lack scalability or static heuristics that lead to suboptimal outcomes and cannot be applied to specific target domains. In this paper, inspired by the recent progress in employing LLMs as versatile agents for diverse tasks, we take the initiative to explore the potential of these agents in automating data-cleaning methodologies. By configuring LLMs as an agent team that imitates a human data-cleaning team, we can automatically generate cleaning rules that traditionally require the involvement of data-cleaning experts. These rules are developed using a limited number of data samples and can then be applied broadly to substantial portions of raw data from the same domain. We demonstrate the efficiency and effectiveness of AutoClean on both pre-training-scale corpora such as Common Crawl and specific target websites. Both automatic and human evaluations of the quality of the cleaned content highlight the feasibility of using LLMs to prepare their training corpus.

pdf bib
SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages
Wenxuan Zhang | Hou Pong Chan | Yiran Zhao | Mahani Aljunied | Jianyu Wang | Chaoqun Liu | Yue Deng | Zhiqiang Hu | Weiwen Xu | Yew Ken Chia | Xin Li | Lidong Bing

Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leveraging efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces training costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models. Additionally, we prioritized safety and reliability by addressing both general and culture-specific considerations and incorporated mechanisms to reduce hallucinations. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities.

pdf bib
Prompto: An open source library for asynchronous querying of LLM endpoints
Ryan Sze-Yin Chan | Federico Nanni | Angus Redlarski Williams | Edwin Brown | Liam Burke-Moore | Ed Chapman | Kate Onslow | Tvesha Sippy | Jonathan Bright | Evelina Gabasova

The recent surge in Large Language Model (LLM) availability has opened exciting avenues for research. However, efficiently interacting with these models presents a significant hurdle, since LLMs often reside on proprietary or self-hosted API endpoints, each requiring custom code for interaction. Conducting comparative studies between different models can therefore be time-consuming and necessitate significant engineering effort, hindering research efficiency and reproducibility. To address these challenges, we present prompto, an open source Python library which facilitates asynchronous querying of LLM endpoints, enabling researchers to interact with multiple LLMs concurrently while maximising efficiency and utilising individual rate limits. Our library empowers researchers and developers to interact with LLMs more effectively, allowing faster experimentation, data generation, and evaluation. prompto is released with an introductory video (https://youtu.be/lWN9hXBOLyQ) under the MIT License and is available via GitHub (https://github.com/alan-turing-institute/prompto).
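
prompto's own interface is documented in its repository; the generic asynchronous pattern it builds on, issuing many requests concurrently while a semaphore enforces a rate limit, can be sketched as follows (the endpoint call is a stand-in, not prompto's actual API).

    # Generic asyncio pattern behind concurrent endpoint querying (not prompto's actual API).
    import asyncio

    async def query_endpoint(prompt, semaphore):
        async with semaphore:          # cap the number of in-flight requests
            await asyncio.sleep(0.1)   # stand-in for an HTTP call to an LLM endpoint
            return f"response to: {prompt}"

    async def run_all(prompts, max_concurrent=5):
        semaphore = asyncio.Semaphore(max_concurrent)
        return await asyncio.gather(*(query_endpoint(p, semaphore) for p in prompts))

    results = asyncio.run(run_all([f"prompt {i}" for i in range(10)]))
    print(results)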

pdf bib
ESPnet-SpeechLM: An Open Speech Language Model Toolkit
Jinchuan Tian | Jiatong Shi | William Chen | Siddhant Arora | Yoshiki Masuyama | Takashi Maekaku | Yihan Wu | Junyi Peng | Shikhar Bharadwaj | Yiwen Zhao | Samuele Cornell | Yifan Peng | Xiang Yue | Chao-Han Huck Yang | Graham Neubig | Shinji Watanabe

We present ESPnet-SpeechLM, an open toolkit designed to democratize the development of speech language models (SpeechLMs) and voice-driven agentic applications. The toolkit standardizes speech processing tasks by framing them as universal sequential modeling problems, encompassing a cohesive workflow of data preprocessing, pre-training, inference, and task evaluation. With ESPnet-SpeechLM, users can easily define task templates and configure key settings, enabling seamless and streamlined SpeechLM development. The toolkit ensures flexibility, efficiency, and scalability by offering highly configurable modules for every stage of the workflow. To illustrate its capabilities, we provide multiple use cases demonstrating how competitive SpeechLMs can be constructed with ESPnet-SpeechLM, including a 1.7B-parameter model pre-trained on both text and speech tasks, across diverse benchmarks. The toolkit and its recipes are fully transparent and reproducible at: https://github.com/espnet/espnet/tree/speechlm.

pdf bib
InspectorRAGet: An Introspection Platform for RAG Evaluation
Kshitij P Fadnis | Siva Sankalp Patel | Odellia Boni | Yannis Katsis | Sara Rosenthal | Benjamin Sznajder | Marina Danilevsky

Large Language Models (LLMs) have become a popular approach for implementing Retrieval Augmented Generation (RAG) systems, and a significant amount of effort has been spent on building good models and metrics. In spite of increased recognition of the need for rigorous evaluation of RAG systems, few tools exist that go beyond the creation of model output and automatic metric calculation. We present InspectorRAGet, an introspection platform for performing a comprehensive analysis of the quality of RAG system output. InspectorRAGet allows the user to analyze aggregate and instance-level performance of RAG systems, using both human and algorithmic metrics as well as annotator quality. InspectorRAGet is suitable for multiple use cases and is available publicly to the community. A live instance of the platform is available at https://ibm.biz/InspectorRAGet

pdf bib
Cerebrum (AIOS SDK): A Platform for Agent Development, Deployment, Distribution, and Discovery
Balaji Rama | Kai Mei | Yongfeng Zhang

Autonomous LLM-based agents have emerged as a powerful paradigm for complex task execution, yet the field lacks standardized tools for development, deployment, and distribution. We present Cerebrum, an open-source platform that addresses this gap through three key components: (1) a comprehensive SDK featuring a modular four-layer architecture for agent development, encompassing LLM, memory, storage, and tool management; (2) a community-driven Agent Hub for sharing and discovering agents, complete with version control and dependency management; and (3) an interactive web interface for testing and evaluating agents. The platform's effectiveness is demonstrated through implementations of various agent architectures, including Chain of Thought (CoT), ReAct, and tool-augmented agents. Cerebrum advances the field by providing a unified framework that standardizes agent development while maintaining flexibility for researchers and developers to innovate and distribute their work. A live demo can be found at https://app.aios.foundation. Code can be found at https://github.com/agiresearch/Cerebrum. A video demo can be found at https://app.aios.foundation/video-demo.

pdf bib
GenSim: A General Social Simulation Platform with Large Language Model based Agents
Jiakai Tang | Heyang Gao | Xuchen Pan | Lei Wang | Haoran Tan | Dawei Gao | Yushuo Chen | Xu Chen | Yankai Lin | Yaliang Li | Bolin Ding | Jingren Zhou | Jun Wang | Ji-Rong Wen

With the rapid advancement of large language models (LLMs), recent years have witnessed many promising studies on leveraging LLM-based agents to simulate human social behavior. While prior work has demonstrated significant potential across various domains, much of it has focused on specific scenarios involving a limited number of agents and has lacked the ability to adapt when errors occur during simulation. To overcome these limitations, we propose a novel LLM-agent-based simulation platform called GenSim, which: (1) Abstracts a set of general functions to simplify the simulation of customized social scenarios; (2) Supports one hundred thousand agents to better simulate large-scale populations in real-world contexts; (3) Incorporates error-correction mechanisms to ensure more reliable and long-term simulations. To evaluate our platform, we assess both the efficiency of large-scale agent simulations and the effectiveness of the error-correction mechanisms. To our knowledge, GenSim represents an initial step toward a general, large-scale, and correctable social simulation platform based on LLM agents, promising to further advance the field of social science.

pdf bib
Semi-automatic Sequential Sentence Classification in the Discourse Analysis Tool Suite
Tim Fischer | Chris Biemann

This paper explores an AI-assisted approach to sequential sentence annotation designed to enhance qualitative data analysis (QDA) workflows within the open-source Discourse Analysis Tool Suite (DATS) developed at our university. We introduce a three-phase Annotation Assistant that leverages the capabilities of large language models (LLMs) to assist researchers during annotation. Based on the number of annotations, the assistant employs zero-shot prompting, few-shot prompting, or fine-tuned models to provide the best suggestions. To evaluate this approach, we construct a benchmark with five diverse datasets. We assess the performance of three prominent open-source LLMs (Llama 3.1, Gemma 2, and Mistral NeMo) and a sequence tagging model based on SentenceTransformers. Our findings demonstrate the effectiveness of our approach, with performance improving as the number of annotated examples increases. Consequently, we implemented the Annotation Assistant within DATS and report the implementation details. With this, we hope to contribute to a novel AI-assisted workflow and further democratize access to AI for qualitative data analysis.

pdf bib
CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation
Faria Huq | Zora Zhiruo Wang | Frank F. Xu | Tianyue Ou | Shuyan Zhou | Jeffrey P. Bigham | Graham Neubig

While much work on web agents emphasizes the promise of autonomously performing tasks on behalf of users, in reality, agents often fall short on complex tasks in real-world contexts and in modeling user preferences. This presents an opportunity for humans to collaborate with the agent and leverage the agent's capabilities effectively. We propose CowPilot, a framework supporting autonomous as well as human-agent collaborative web navigation, and evaluation across task success and task efficiency. CowPilot reduces the number of steps humans need to perform by allowing agents to propose next steps, while users are able to pause, reject, or take alternative actions. During execution, users can interleave their actions with the agent's by overriding suggestions or resuming agent control when needed. We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps. Even with human interventions during task execution, the agent successfully drives up to half of task success on its own. CowPilot can serve as a useful tool for data collection and agent evaluation across websites, which we believe will enable research in how users and agents can work together. Video demonstrations are available at https://oaishi.github.io/cowpilot.html

pdf bib
eRevise+RF: A Writing Evaluation System for Assessing Student Essay Revisions and Providing Formative Feedback
Zhexiong Liu | Diane Litman | Elaine L Wang | Tianwen Li | Mason Gobat | Lindsay Clare Matsumura | Richard Correnti

The ability to revise essays in response to feedback is important for students’ writing success. An automated writing evaluation (AWE) system that supports students in revising their essays is thus essential. We present eRevise+RF, an enhanced AWE system for assessing student essay revisions (e.g., changes made to an essay to improve its quality in response to essay feedback) and providing revision feedback. We deployed the system with 6 teachers and 406 students across 3 schools in Pennsylvania and Louisiana. The results confirmed its effectiveness in (1) assessing student essays in terms of evidence usage, (2) extracting evidence and reasoning revisions across essays, and (3) determining revision success in responding to feedback. The evaluation also suggested eRevise+RF is a helpful system for young students to improve their argumentative writing skills through revision and formative feedback.

pdf bib
VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music
Jiatong Shi | Hye-jin Shim | Jinchuan Tian | Siddhant Arora | Haibin Wu | Darius Petermann | Jia Qi Yip | You Zhang | Yuxun Tang | Wangyou Zhang | Dareen Safar Alharthi | Yichen Huang | Koichi Saito | Jionghao Han | Yiwen Zhao | Chris Donahue | Shinji Watanabe

In this work, we introduce VERSA, a unified and standardized evaluation toolkit designed for various speech, audio, and music signals. The toolkit features a Pythonic interface with flexible configuration and dependency control, making it user-friendly and efficient. With full installation, VERSA offers 65 metrics with 729 metric variations based on different configurations. These metrics encompass evaluations utilizing diverse external resources, including matching and non-matching reference audio, text transcriptions, and text captions. As a lightweight yet comprehensive toolkit, VERSA is versatile to support the evaluation of a wide range of downstream scenarios. To demonstrate its capabilities, this work highlights example use cases for VERSA, including audio coding, speech synthesis, speech enhancement, singing synthesis, and music generation. The toolkit is available at https://github.com/shinjiwlab/versa.

pdf bib
Persona-SQ: A Personalized Suggested Question Generation Framework For Real-world Documents
Zihao Lin | Zichao Wang | Yuanting Pan | Varun Manjunatha | Ryan A. Rossi | Angela Lau | Lifu Huang | Tong Sun

Suggested questions (SQs) provide an effective initial interface for users to engage with their documents in AI-powered reading applications. In practical reading sessions, users have diverse backgrounds and reading goals, yet current SQ features typically ignore such user information, resulting in homogeneous or ineffective questions. We introduce a pipeline that generates personalized SQs by incorporating reader profiles (professions and reading goals) and demonstrate its utility in two ways: 1) as an improved SQ generation pipeline that produces higher quality and more diverse questions compared to current baselines, and 2) as a data generator to fine-tune extremely small models that perform competitively with much larger models on SQ generation. Our approach can not only serve as a drop-in replacement in current SQ systems to immediately improve their performance but also help develop on-device SQ models that can run locally to deliver fast and private SQ experience.

pdf bib
ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems
Siddhant Arora | Yifan Peng | Jiatong Shi | Jinchuan Tian | William Chen | Shikhar Bharadwaj | Hayato Futami | Yosuke Kashiwagi | Emiru Tsunoo | Shuichiro Shimizu | Vaibhav Srivastav | Shinji Watanabe

Advancements in audio foundation models (FMs) have fueled interest in end-to-end (E2E) spoken dialogue systems, but the different web interfaces for each system make it challenging to compare and contrast them effectively. Motivated by this, we introduce an open-source, user-friendly toolkit designed to build unified web interfaces for various cascaded and E2E spoken dialogue systems. Our demo further provides users with the option to get on-the-fly automated evaluation metrics such as (1) latency, (2) ability to understand user input, (3) coherence, diversity, and relevance of system response, and (4) intelligibility and audio quality of system output. Using the evaluation metrics, we compare various cascaded and E2E spoken dialogue systems with a human-human conversation dataset as a proxy. Our analysis demonstrates that the toolkit allows researchers to effortlessly compare and contrast different technologies, providing valuable insights such as current E2E systems having poorer audio quality and less diverse responses. An example demo produced using our toolkit is publicly available here: https://huggingface.co/spaces/Siddhant/Voice_Assistant_Demo.

pdf bib
SURF: A System to Unveil Explainable Risk Relations between Firms
Yu-Hsiang Wang | Wei-Ning Chiu | Yi-Tai Hsiao | Yu-Shiang Huang | Yi-Shyuan Chiang | Shuo-En Wu | Chuan-Ju Wang

Firm risk relations are crucial in financial applications, including hedging and portfolio construction. However, the complexity of extracting relevant information from financial reports poses significant challenges in quantifying these relations. To this end, we introduce SURF, a System to Unveil Explainable Risk Relations between Firms. SURF employs a domain-specific encoder and an innovative scoring mechanism to uncover latent risk connections from financial reports. It constructs a network graph to visualize these firm-level risk interactions and incorporates a rationale explainer to elucidate the underlying links. Our evaluation using stock data shows that SURF outperforms baseline methods in effectively capturing firm risk relations. The demo video of the system is publicly available.

pdf bib
Libra-Leaderboard: Towards Responsible AI through a Balanced Leaderboard of Safety and Capability
Haonan Li | Xudong Han | Zenan Zhai | Honglin Mu | Hao Wang | Zhenxuan Zhang | Yilin Geng | Shom Lin | Renxi Wang | Artem Shelmanov | Xiangyu Qi | Yuxia Wang | Donghai Hong | Youliang Yuan | Meng Chen | Haoqin Tu | Fajri Koto | Cong Zeng | Tatsuki Kuribayashi | Rishabh Bhardwaj | Bingchen Zhao | Yawen Duan | Yi Liu | Emad A. Alghamdi | Yaodong Yang | Yinpeng Dong | Soujanya Poria | Pengfei Liu | Zhengzhong Liu | Hector Xuguang Ren | Eduard Hovy | Iryna Gurevych | Preslav Nakov | Monojit Choudhury | Timothy Baldwin

As large language models (LLMs) continue to evolve, leaderboards play a significant role in steering their development. Existing leaderboards often prioritize model capabilities while overlooking safety concerns, leaving a significant gap in responsible AI development. To address this gap, we introduce Libra-Leaderboard, a comprehensive framework designed to rank LLMs through a balanced evaluation of performance and safety. Combining a dynamic leaderboard with an interactive LLM arena, Libra-Leaderboard encourages the joint optimization of capability and safety. Unlike traditional approaches that average performance and safety metrics, Libra-Leaderboard uses a distance-to-optimal-score method to calculate the overall rankings. This approach incentivizes models to achieve a balance rather than excelling in one dimension at the expense of others. In the first release, Libra-Leaderboard evaluates 26 mainstream LLMs from 14 leading organizations, identifying critical safety challenges even in state-of-the-art models.
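
One way to read the distance-to-optimal-score idea is sketched below; the normalization and the ideal point (1, 1) are assumptions for illustration, not the paper's published formula.

    # Sketch: distance-to-optimal scoring vs. plain averaging (normalization is assumed).
    import math

    def balanced_score(capability, safety):
        # Both scores in [0, 1]; smaller distance to the ideal point (1, 1)
        # yields a higher overall score.
        return 1 - math.sqrt(((1 - capability) ** 2 + (1 - safety) ** 2) / 2)

    # A lopsided model (1.0, 0.2) averages 0.60 but scores ~0.43 here,
    # while a balanced model (0.6, 0.6) keeps its 0.60: balance is rewarded.
    print(balanced_score(1.0, 0.2), balanced_score(0.6, 0.6))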

pdf bib
Unlocking Korean Verbs: A User-Friendly Exploration into the Verb Lexicon
Seohyun Song | Eunkyul Leah Jo | Yige Chen | Jeen-Pyo Hong | Kyuwon Kim | Jin Wee | Kang Miyoung | KyungTae Lim | Jungyeul Park | Chulwoo Park

The Sejong dictionary dataset offers a valuable resource, providing extensive coverage of morphology, syntax, and semantic representation. This dataset can be utilized to explore linguistic information in greater depth. The labeled linguistic structures within this dataset form the basis for uncovering relationships between words and phrases and their associations with target verbs. This paper introduces a user-friendly web interface designed for the collection and consolidation of verb-related information, with a particular focus on subcategorization frames. Additionally, it outlines our efforts in mapping this information by aligning subcategorization frames with corresponding illustrative sentence examples. Furthermore, we provide a Python library that simplifies syntactic parsing and semantic role labeling. These tools are intended to assist individuals interested in harnessing the Sejong dictionary dataset to develop applications for Korean language processing.

pdf bib
TransformerRanker: A Tool for Efficiently Finding the Best-Suited Language Models for Downstream Classification Tasks
Lukas Garbas | Max Ploner | Alan Akbik

Classification tasks in NLP are typically addressed by selecting a pre-trained language model (PLM) from a model hub, and fine-tuning it for the task at hand. However, given the very large number of PLMs that are currently available, a practical challenge is to determine which of them will perform best for a specific downstream task. With this paper, we introduce TransformerRanker, a lightweight library that efficiently ranks PLMs for classification tasks without the need for computationally costly fine-tuning. Our library implements current approaches for transferability estimation (LogME, H-Score, kNN), in combination with layer aggregation options, which we empirically showed to yield state-of-the-art rankings of PLMs (Garbas et al., 2024). We designed the interface to be lightweight and easy to use, allowing users to directly connect to the HuggingFace Transformers and Datasets libraries. Users need only select a downstream classification task and a list of PLMs to create a ranking of likely best-suited PLMs for their task. We make TransformerRanker available as a pip-installable open-source library.
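A usage sketch following the workflow the abstract describes; the exact class and argument names below are assumptions that should be checked against the library's documentation.

    # Sketch: rank candidate PLMs for a classification dataset without
    # fine-tuning. Verify names against the TransformerRanker docs.
    from datasets import load_dataset
    from transformer_ranker import TransformerRanker

    dataset = load_dataset("trec")  # any HuggingFace classification dataset
    candidate_plms = [
        "bert-base-uncased",
        "roberta-base",
        "google/electra-small-discriminator",
    ]

    # Rank PLMs by transferability estimates (e.g., LogME) instead of training.
    ranker = TransformerRanker(dataset, dataset_downsample=0.2)
    results = ranker.run(candidate_plms, batch_size=32)
    print(results)  # PLMs ordered by estimated suitability for the task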

pdf bib
Learning Low-Resource Languages Through NLP-Driven Flashcards: A Case Study of Hokkien in Language Learning Applications
Tai Zhang | Lucie Yang | Erin Chen | Karen Riani | Jessica Zipf | Mariana Shimabukuro | En-Shiun Annie Lee

LangLearn is an open-source framework designed to facilitate autonomous learning of low-resource languages (LRL). By combining a language-agnostic approach with AI-enhanced flashcards, LangLearn empowers users to generate custom flashcards for their vocabulary, while offering structured learning through both pre-curated and self-curated decks. The framework integrates six key components: the word definition, corresponding Hanji characters, romanization with numeric tones, audio pronunciation, a sample sentence, as well as a contextual AI-generated image. LangLearn currently supports English and Taiwanese Hokkien (a variety of Southern Min), with plans to extend support for other dialects. Our preliminary study demonstrates that LangLearn positively empowers users to engage with LRLs using their vocabulary preferences, with a comprehensive user study currently underway. LangLearn’s modular structure enables future expansion, including ASR-based pronunciation practice. The code is available at https://github.com/HokkienTranslation/HokkienTranslation.
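The six flashcard components listed above map naturally onto a small record type; the sketch below uses illustrative field names and values, not LangLearn's actual schema.

    # Illustrative flashcard structure for the six components named in the
    # abstract; field names and example values are ours, not LangLearn's.
    from dataclasses import dataclass

    @dataclass
    class Flashcard:
        definition: str       # word definition
        hanji: str            # corresponding Hanji characters
        romanization: str     # romanization with numeric tones
        audio_path: str       # audio pronunciation
        sample_sentence: str  # sample sentence
        image_path: str       # contextual AI-generated image

    card = Flashcard(
        definition="hello",
        hanji="你好",
        romanization="li2 ho2",
        audio_path="audio/li2_ho2.mp3",
        sample_sentence="你好，食飽未？",
        image_path="img/li2_ho2.png",
    )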

pdf bib
A Sentence-Level Visualization of Attention in Large Language Models
Seongbum Seo | Sangbong Yoo | Hyelim Lee | Yun Jang | Ji Hwan Park | Jeong-Nam Kim

We introduce SAVIS, a sentence-level attention visualization tool that enhances the interpretability of long documents processed by Large Language Models (LLMs). By computing inter-sentence attention (ISA) through token-level attention aggregation, SAVIS reduces the complexity of attention analysis, enabling users to identify meaningful document-level patterns. The tool offers an interactive interface for exploring how sentences relate to each other in model processing. Our comparative analysis with existing visualization tools demonstrates that SAVIS improves task accuracy and reduces error identification time. We demonstrate its effectiveness for text analysis applications through case studies on various analysis tasks. Our open-source tool is available at https://pypi.org/project/savis with a screencast video at https://youtu.be/fTZZPHA55So.
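The token-to-sentence aggregation at the core of ISA can be sketched generically as follows; this illustrates the idea only and is not SAVIS's API.

    # Generic sketch of inter-sentence attention (ISA): average a token-token
    # attention matrix into a sentence-sentence matrix.
    import numpy as np

    def inter_sentence_attention(token_attn, sent_ids):
        # token_attn: (T, T) attention weights for one head/layer (or average)
        # sent_ids:   sentence index of each token, length T
        n = max(sent_ids) + 1
        isa = np.zeros((n, n))
        counts = np.zeros((n, n))
        for i, si in enumerate(sent_ids):
            for j, sj in enumerate(sent_ids):
                isa[si, sj] += token_attn[i, j]
                counts[si, sj] += 1
        return isa / np.maximum(counts, 1)

    # Toy example: four tokens, two per sentence.
    attn = np.array([[0.4, 0.3, 0.2, 0.1],
                     [0.1, 0.5, 0.2, 0.2],
                     [0.2, 0.2, 0.3, 0.3],
                     [0.1, 0.1, 0.4, 0.4]])
    print(inter_sentence_attention(attn, [0, 0, 1, 1]))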

pdf bib
NeMo-Inspector: A Visualization Tool for LLM Generation Analysis
Daria Gitman | Igor Gitman | Evelina Bakhturina

Adapting Large Language Models (LLMs) to novel tasks and enhancing their overall capabilities often requires large, high-quality training datasets. Synthetic data, generated at scale, serves as a valuable alternative when real-world data is scarce or difficult to obtain. However, ensuring the quality of synthetic datasets is challenging, as developers must manually inspect and refine numerous samples to identify errors and areas for improvement. This process is time-consuming and requires specialized tools. We introduce NeMo-Inspector, an open-source tool designed to simplify the analysis of synthetic datasets with integrated inference capabilities. We demonstrate its effectiveness through two real-world cases. Analysis and cleaning of the synthetically generated GSM-Plus dataset with NeMo-Inspector led to a significant decrease in low-quality samples from 46.99% to 19.51%. The tool also helped identify and correct generation errors in OpenMath models, improving accuracy by 1.92% on the MATH dataset and by 4.17% on the GSM8K dataset for a Meta-Llama-3-8B model fine-tuned on synthetic data generated from Nemotron-4-340B.

pdf bib
Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots
Hongming Zhang | Xiaoman Pan | Hongwei Wang | Kaixin Ma | Wenhao Yu | Dong Yu

We introduce Cognitive Kernel, an open-source agent system towards the goal of generalist autopilots. Unlike copilot systems, which primarily rely on users to provide essential state information, autopilot systems complete tasks from start to finish independently. This requires the system to acquire the missing state information actively. Cognitive Kernel adopts a dynamic programming design where the central policy model (a fine-tuned LLM) can initiate an environment state perception task, essentially another agent task, as needed. The results demonstrate that Cognitive Kernel achieves better or comparable performance to other closed-source systems on core autopilot capabilities. Cognitive Kernel is fully dockerized, ensuring everyone can deploy it privately and securely. We open-source the system to encourage further research on LLM-driven autopilot systems.

pdf bib
SOTOPIA-S4: a user-friendly system for flexible, customizable, and large-scale social simulation
Xuhui Zhou | Zhe Su | Sophie Feng | Jiaxu Zhou | Jen-tse Huang | Hsien-Te Kao | Spencer Lynch | Svitlana Volkova | Tongshuang Wu | Anita Woolley | Hao Zhu | Maarten Sap

Social simulation through large language model (LLM) agents is a promising approach to explore and validate social science hypotheses. We present SOTOPIA-S4, a fast, flexible, and scalable social simulation system that addresses the technical barriers of current frameworks while enabling practitioners to generate realistic, multi-turn and multi-party interactions with customizable evaluation metrics for hypothesis testing. SOTOPIA-S4 comes as a pip package that contains a simulation engine, an API server with flexible RESTful APIs for simulation management, and a web interface that enables both technical and non-technical users to design, run, and analyze simulations without programming. We demonstrate the usefulness of SOTOPIA-S4 with two use cases involving dyadic hiring negotiation scenarios and multi-party planning scenarios.

pdf bib
SafeSpeech: A Comprehensive and Interactive Tool for Analysing Sexist and Abusive Language in Conversations
Xingwei Tan | Chen Lyu | Hafiz Muhammad Umer | Sahrish Khan | Mahathi Parvatham | Lois Arthurs | Simon Cullen | Shelley Wilson | Arshad Jhumka | Gabriele Pergola

Detecting toxic language, including sexism, harassment, and abusive behaviour, remains a critical challenge, particularly in its subtle and context-dependent forms. Existing approaches largely focus on isolated message-level classification, overlooking toxicity that emerges across conversational contexts. To promote and enable future research in this direction, we introduce *SafeSpeech*, a comprehensive platform for toxic content detection and analysis that bridges message-level and conversation-level insights. The platform integrates fine-tuned classifiers and large language models (LLMs) to enable multi-granularity detection, toxic-aware conversation summarization, and persona profiling. *SafeSpeech* also incorporates explainability mechanisms, such as perplexity gain analysis, to highlight the linguistic elements driving predictions. Evaluations on benchmark datasets, including EDOS, OffensEval, and HatEval, demonstrate the reproduction of state-of-the-art performance across multiple tasks, including fine-grained sexism detection.
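One plausible operationalization of the perplexity-gain idea can be sketched with GPT-2: measure how much a span raises the model's perplexity over the conversation. This is our illustrative reading, not necessarily SafeSpeech's exact formulation.

    # Sketch: perplexity gain of a reply over a conversation prefix.
    import math
    import torch
    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

    def perplexity(text: str) -> float:
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss
        return math.exp(loss.item())

    conversation = "A: Nobody asked for your opinion. B: That was uncalled for."
    prefix_only = "A: Nobody asked for your opinion."
    # A large gain indicates the reply shifts the conversation's language.
    print(perplexity(conversation) - perplexity(prefix_only))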

pdf bib
ALOHA: Empowering Multilingual Agent for University Orientation with Hierarchical Retrieval
Mingxu Tao | Bowen Tang | Mingxuan Ma | Yining Zhang | Hourun Li | Feifan Wen | Ma Hao | Jia Yang

The rise of Large Language Models (LLMs) revolutionizes information retrieval, allowing users to obtain required answers through complex instructions within conversations. However, publicly available services remain inadequate in addressing the needs of faculty and students to search for campus-specific information. This is primarily due to the LLM’s lack of domain-specific knowledge and the limitation of search engines in supporting multilingual and timely scenarios. To tackle these challenges, we introduce ALOHA, a multilingual agent enhanced by hierarchical retrieval for university orientation. We also integrate external APIs into the front-end interface to provide interactive service. The human evaluation and case study show our proposed system has strong capabilities to yield correct, timely, and user-friendly responses to the queries in multiple languages, surpassing commercial chatbots and search engines. The system has been deployed and has provided service for more than 12,000 people.

pdf bib
MeKB-Sim: Personal Knowledge Base-Powered Multi-Agent Simulation
Zhenran Xu | Jifang Wang | Baotian Hu | Longyue Wang | Min Zhang

Language agents have demonstrated remarkable emergent social behaviors within simulated sandbox environments. However, the characterization of these agents has been constrained by static prompts that outline their profiles, highlighting a gap in achieving simulations that closely mimic real-life interactions. To close this gap, we introduce MeKB-Sim, a multi-agent simulation platform based on a dynamic personal knowledge base, termed MeKB. Each agent’s MeKB contains both fixed and variable attributes—such as linguistic style, personality, and memory—crucial for theory-of-mind modeling. These attributes are updated when necessary, in response to events that the agent experiences. Comparisons with human annotators show that the LLM-based attribute updates are reliable. Based on the dynamic nature of MeKB, experiments and a case study show that MeKB-Sim enables agents to adapt their planned activities and interactions with other agents effectively. Our platform includes a Unity WebGL game interface for visualization and an interactive monitoring panel that presents the agents’ planning, actions, and evolving MeKBs over time. For more information, including open-source code, a live demo website, and videos, please visit our project page at https://mekb-sim.github.io/.

pdf bib
MetaScientist: A Human-AI Synergistic Framework for Automated Mechanical Metamaterial Design
Jingyuan Qi | Zian Jia | Minqian Liu | Wangzhi Zhan | Junkai Zhang | Xiaofei Wen | Jingru Gan | Jianpeng Chen | Qin Liu | Mingyu Derek Ma | Bangzheng Li | Haohui Wang | Adithya Kulkarni | Muhao Chen | Dawei Zhou | Ling Li | Wei Wang | Lifu Huang

The discovery of novel mechanical metamaterials, whose properties are dominated by their engineered structures rather than chemical composition, is a knowledge-intensive and resource-demanding process. To accelerate the design of novel metamaterials, we present MetaScientist, a human-in-the-loop system that integrates advanced AI capabilities with expert oversight through two primary phases: (1) hypothesis generation, where the system performs complex reasoning to generate novel and scientifically sound hypotheses, supported by domain-specific foundation models and inductive biases retrieved from existing literature; (2) 3D structure synthesis, where a 3D structure is synthesized with a novel 3D diffusion model based on the textual hypothesis and refined with an LLM-based refinement model to achieve better structural properties. At each phase, domain experts iteratively validate the system outputs, and provide feedback and supplementary materials to ensure the alignment of the outputs with scientific principles and human preferences. Through extensive evaluation from human scientists, MetaScientist is able to deliver novel and valid mechanical metamaterial designs that have the potential to be highly impactful in the metamaterial field.

pdf bib
FACTS&EVIDENCE: An Interactive Tool for Transparent Fine-Grained Factual Verification of Machine-Generated Text
Varich Boonsanong | Vidhisha Balachandran | Xiaochuang Han | Shangbin Feng | Lucy Lu Wang | Yulia Tsvetkov

With the widespread consumption of AI-generated content, there has been an increased focus on developing automated tools to verify the factual accuracy of such content. However, prior research and tools developed for fact verification treat it as a binary classification or a linear regression problem. Although this is a useful mechanism as part of automatic guardrails in systems, we argue that such tools lack transparency in the prediction reasoning and diversity in source evidence to provide a trustworthy user experience. We develop FACTS&EVIDENCE—an interactive and transparent tool for user-driven verification of complex text. The tool facilitates the intricate decision-making involved in fact-verification, presenting its users with a breakdown of complex input texts to visualize the credibility of individual claims along with explanation of model decisions and attribution to multiple, diverse evidence sources. FACTS&EVIDENCE aims to empower consumers of machine-generated text and give them agency to understand, verify, selectively trust and use such text.

pdf bib
LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications
Danqing Zhang | Balaji Rama | Jingyi Ni | Shiying He | Fu Zhao | Kunyu Chen | Arnold Chen | Junyu Cao

We introduce LiteWebAgent, an open-source suite for VLM-based web agent applications. Our framework addresses a critical gap in the web agent ecosystem with a production-ready solution that combines minimal serverless backend configuration, intuitive user and browser interfaces, and extensible research capabilities in agent planning, memory, and tree search. For the core LiteWebAgent agent framework, we implemented a simple yet effective baseline using recursive function calling, providing decoupled action generation and action grounding. In addition, we integrate advanced research components such as agent planning, agent workflow memory, and tree search in a modular and extensible manner. We then integrate the LiteWebAgent agent framework with frontend and backend as deployed systems in two formats: (1) a production Vercel-based web application, which provides users with an agent-controlled remote browser, and (2) a Chrome extension leveraging LiteWebAgent’s API to control an existing Chrome browser via CDP (Chrome DevTools Protocol). The LiteWebAgent framework is available at https://github.com/PathOnAI/LiteWebAgent, with deployed frontend at https://lite-web-agent.vercel.app/.

pdf bib
L3GO: Language Agents with Chain-of-3D-Thoughts for Generating Unconventional Objects
Yutaro Yamada | Khyathi Chandu | Bill Yuchen Lin | Jack Hessel | Ilker Yildirim | Yejin Choi

Diffusion-based image generation models such as DALL-E 3 and Stable Diffusion-XL demonstrate remarkable capabilities in generating images with realistic and unique compositions. Yet, these models are not robust in precisely reasoning about physical and spatial configurations of objects, especially when instructed with unconventional, and thereby out-of-distribution, descriptions such as “a chair with five legs”. In this paper, we propose a language agent with chain-of-3D-thoughts (L3GO), an inference-time approach that can reason about part-based 3D construction of unconventional objects that current data-driven diffusion models struggle with. More concretely, we use large language models as agents to compose a desired object via trial-and-error within the 3D simulation environment. To facilitate our investigation, we develop a new benchmark, Unconventionally Feasible Objects (UFO), as well as SimpleBlenv, a wrapper environment built on top of Blender where language agents can build and compose atomic building blocks via API calls. Human and automatic GPT-4V evaluations show that our approach surpasses the standard GPT-4 and other language agents (e.g., ReAct and Reflexion) for 3D mesh generation on ShapeNet. Moreover, when tested on our UFO benchmark, our approach outperforms other state-of-the-art text-to-2D image and text-to-3D models based on human evaluation.

pdf bib
Constructing Multimodal Datasets from Scratch for Rapid Development of a Japanese Visual Language Model
Keito Sasagawa | Koki Maeda | Issa Sugiura | Shuhei Kurita | Naoaki Okazaki | Daisuke Kawahara

To develop high-performing Visual Language Models (VLMs), it is essential to prepare multimodal resources, such as image-text pairs, interleaved data, and instruction data. While multimodal resources for English are abundant, there is a significant lack of corresponding resources for non-English languages, such as Japanese. To address this problem, we take Japanese as a non-English language and propose Japanese multimodal datasets for rapidly developing a Japanese multimodal model. We collect Japanese image-text pairs and interleaved data from web archives and generate Japanese instruction data using an existing large language model and a VLM. Our experimental results show that a VLM trained on these native datasets outperforms those relying on machine-translated content. The resulting VLM, dataset, and code used for training are publicly available.

pdf bib
Storybranch - generating multimedia content from novels
Rushikesh Hiray | Venelin Kovatchev

We present Storybranch - an automated system for generating multimedia content from long texts such as novels and fanfiction. The Storybranch pipeline includes structured information extraction, text parsing and processing, content generation using Gen-AI models, and synchronization of different streams (audio, video, background). Our system is highly modular and can efficiently generate three different types of multimodal content: audiobooks, simple animated videos, and visual novel text-and-image-style video games. Storybranch successfully addresses challenges such as generating a unique and consistent image and voice for each character and narrator, identifying and generating background images and sound effects, and synchronizing character expressions and lip movement with text. As part of Storybranch, we develop and release BookNLP2 - a new open-source library for parsing and extracting information from books, based on the legacy library BookNLP.

pdf bib
EventFull: Complete and Consistent Event Relation Annotation
Alon Eirew | Eviatar Nachshoni | Aviv Slobodkin | Ido Dagan

Event relation detection is a fundamental NLP task, leveraged in many downstream applications, whose modeling requires datasets annotated with event relations of various types. However, systematic and complete annotation of these relations is costly and challenging, due to the quadratic number of event pairs that need to be considered. Consequently, many current event relation datasets lack systematicity and completeness. In response, we introduce EventFull, the first tool that supports consistent, complete and efficient annotation of temporal, causal and coreference relations via a unified and synergetic process. A pilot study demonstrates that EventFull accelerates and simplifies the annotation process while yielding high inter-annotator agreement.
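The quadratic cost the abstract refers to: exhaustive annotation over n events must consider

    \binom{n}{2} = \frac{n(n-1)}{2}

pairs per relation type, so a document with 40 events already yields 780 candidate pairs.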

pdf bib
METAPHORSHARE: A Dynamic Collaborative Repository of Open Metaphor Datasets
Joanne Boisson | Arif Mehmood | Jose Camacho-Collados

The metaphor studies community has developed numerous valuable labelled corpora in various languages over the years. Many of these resources are not only unknown to the NLP community, but are also often not easily shared among the researchers. Both in human sciences and in NLP, researchers could benefit from a centralised database of labelled resources, easily accessible and unified under an identical format. To facilitate this, we present MetaphorShare, a website that integrates metaphor datasets, making them open and accessible. With this effort, our aim is to encourage researchers to share and upload more datasets in any language in order to facilitate metaphor studies and the development of future metaphor processing NLP systems. The website has four main functionalities: upload, download, search and label metaphor datasets. It is accessible at www.metaphorshare.com.

pdf bib
Towards Unified, Dynamic and Annotation-based Visualisations and Exploration of Annotated Big Data Corpora with the Help of Unified Corpus Explorer
Kevin Bönisch | Giuseppe Abrami | Alexander Mehler

The annotation and exploration of large text corpora, both automatic and manual, presents significant challenges across multiple disciplines, including linguistics, digital humanities, biology, and legal science. These challenges are exacerbated by the heterogeneity of processing methods, which complicates corpus visualization, interaction, and integration. To address these issues, we introduce the Unified Corpus Explorer (UCE), a standardized, dockerized, open-source and dynamic Natural Language Processing (NLP) application designed for flexible and scalable corpus navigation. Herein, UCE utilizes the UIMA format for NLP annotations as a standardized input, constructing interfaces and features around those annotations while dynamically adapting to the corpora and their extracted annotations. We evaluate UCE based on a user study and demonstrate its versatility as a corpus explorer based on generative AI.

pdf bib
MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation
Zichen Zhu | Hao Tang | Yansi Li | Dingye Liu | Hongshen Xu | Kunyao Lan | Danyang Zhang | Yixuan Jiang | Hao Zhou | Chenrun Wang | Situo Zhang | Liangtai Sun | Yixiao Wang | Yuheng Sun | Lu Chen | Kai Yu

Existing Multimodal Large Language Model (MLLM)-based agents face significant challenges in handling complex GUI (Graphical User Interface) interactions on devices. These challenges arise from the dynamic and structured nature of GUI environments, which integrate text, images, and spatial relationships, as well as the variability in action spaces across different pages and tasks. To address these limitations, we propose MobA, a novel MLLM-based mobile assistant system. MobA introduces an adaptive planning module that incorporates a reflection mechanism for error recovery and dynamically adjusts plans to align with the real environment contexts and action module’s execution capacity. Additionally, a multifaceted memory module provides comprehensive memory support to enhance adaptability and efficiency. We also present MobBench, a dataset designed for complex mobile interactions. Experimental results on MobBench and AndroidArena demonstrate MobA’s ability to handle dynamic GUI environments and perform complex mobile tasks.

pdf bib
OpenReviewer: A Specialized Large Language Model for Generating Critical Scientific Paper Reviews
Maximilian Idahl | Zahra Ahmadi

We present OpenReviewer, an open-source system for generating high-quality peer reviews of machine learning and AI conference papers. At its core is Llama-OpenReviewer-8B, an 8B parameter language model specifically fine-tuned on 79,000 expert reviews from top conferences. Given a PDF paper submission and review template as input, OpenReviewer extracts the full text, including technical content like equations and tables, and generates a structured review following conference-specific guidelines. Our evaluation on 400 test papers shows that OpenReviewer produces considerably more critical and realistic reviews compared to general-purpose LLMs like GPT-4 and Claude-3.5. While other LLMs tend toward overly positive assessments, OpenReviewer’s recommendations closely match the distribution of human reviewer ratings. The system provides authors with rapid, constructive feedback to improve their manuscripts before submission, though it is not intended to replace human peer review. OpenReviewer is available as an online demo and open-source tool.

up

pdf (full)
bib (full)
Proceedings of the first International Workshop on Nakba Narratives as Language Resources

pdf bib
Proceedings of the first International Workshop on Nakba Narratives as Language Resources
Mustafa Jarrar | Habash Habash | Mo El-Haj

pdf bib
Deciphering Implicatures: On NLP and Oral Testimonies
Zainab Sabra

The utterance of a word does not intrinsically convey its intended force. The semantics of utterances is not shaped by the precise references of the words used. Asserting that “it is shameful to abandon our country” does not merely convey information; rather, it asserts an act of resilience. In most of our exchanges, we rarely utilize sentences to describe reality or the world around us. More frequently, our statements aim to express opinions, to influence, or be influenced by others. Words carry more than just their syntax and semantics; they also embody a pragmatic normative force. This divergence between literal and conveyed meaning was depicted in the literature of philosophy of language as the difference between sentence meaning and speaker meaning: the former is the literal understanding of the words combined in a sentence, while the latter is what the speaker is trying to convey through her expression. In order to derive the speaker meaning from the sentence meaning, J.L. Austin (the author of How To Do Things with Words) relied on conventions, whereas H.P. Grice (the author of Logic and Conversation) relied on conventional and non-conventional implicatures. This paper aims to decipher how we can infer speaker meaning from sentence meaning and thereby capture the force of what has been articulated, focusing specifically on oral testimonies. I argue that oral testimonies are forms of speech acts that aim to produce normative changes. Following this discussion, I will examine various natural language processing (NLP) models that make explicit what is implicit in oral testimonies, along with their benefits and limitations. Lastly, I will address two challenges: the former relates to implicatures that are not governed by conventions, and the latter concerns the biases inherent in hermeneutical approaches.

pdf bib
A cultural shift in Western perceptions of Palestine
Terry Regier | Muhammad Ali Khalidi

We argue that a cultural shift in Western perceptions of Palestine began in the late 1990s to 2000s, leading to increased openness to Palestinian perspectives, including awareness of the Nakba. We present 3 computational analyses designed to test this idea against data from the 2020 Google Books English dataset. The results support the claim of a cultural shift, and help to characterize that shift.

pdf bib
Cognitive Geographies of Catastrophe Narratives: Georeferenced Interview Transcriptions as Language Resource for Models of Forced Displacement
Annie K. Lamar | Rick Castle | Carissa Chappell | Emmanouela Schoinoplokaki | Allene M. Seet | Amit Shilo | Chloe Nahas

We present a machine-understandable geotagged dataset of translated interviews from the Nakba Archive alongside a complete georeferenced dataset of named locations mentioned in the interviews. In a preliminary analysis of this dataset, we find that the cognitive relationship of interviewees to place and spatiality is significantly correlated with gender. Our data also shows that interviewees with birthplaces depopulated in the 1948 Nakba incorporate references to named places in their interviews in substantially different ways than other interviewees. This suggests that the status of the interviewee’s birthplace may impact the way they narrate their experiences. Our work serves as a foundation for continued and expanded statistical and cognitive models of Palestinian forced displacement.

pdf bib
Sentiment Analysis of Nakba Oral Histories: A Critical Study of Large Language Models
Huthaifa I. Ashqar

This study explores the use of Large Language Models (LLMs), specifically ChatGPT, for sentiment analysis of Nakba oral histories, which document the experiences of Palestinian refugees. The study compares sentiment analysis results from full testimonies (average 2500 words) and their summarized versions (300 words). The findings reveal that summarization increased positive sentiment and decreased negative sentiment, suggesting that the process may highlight more hopeful themes while oversimplifying emotional complexities. The study highlights both the potential and limitations of using LLMs for analyzing sensitive, trauma-based narratives and calls for further research to improve sentiment analysis in such contexts.

pdf bib
The Nakba Lexicon: Building a Comprehensive Dataset from Palestinian Literature
Izza AbuHaija | Salim Al Mandhari | Mo El-Haj | Jonas Sibony | Paul Rayson

This paper introduces the Nakba Lexicon, a comprehensive dataset derived from the poetry collection Asifa ‘Ala al-Iz‘aj (Sorry for the Disturbance) by Istiqlal Eid, a Palestinian poet from El-Birweh. Eid’s work poignantly reflects on themes of Palestinian identity, displacement, and resilience, serving as a resource for preserving linguistic and cultural heritage in the context of post-Nakba literature. The dataset is structured into ten thematic domains, including political terminology, memory and preservation, sensory and emotional lexicon, toponyms, nature, and external linguistic influences such as Hebrew, French, and English, thereby capturing the socio-political, emotional, and cultural dimensions of the Nakba. The Nakba Lexicon uniquely emphasises the contributions of women to Palestinian literary traditions, shedding light on often-overlooked narratives of resilience and cultural continuity. Advanced Natural Language Processing (NLP) techniques were employed to analyse the dataset, with fine-tuned pre-trained models such as ARABERT and MARBERT achieving F1-scores of 0.87 and 0.68 in language and lexical classification tasks, respectively, significantly outperforming traditional machine learning models. These results highlight the potential of domain-specific computational models to effectively analyse complex datasets, facilitating the preservation of marginalised voices. By bridging computational methods with cultural preservation, this study enhances the understanding of Palestinian linguistic heritage and contributes to broader efforts in documenting and analysing endangered narratives. The Nakba Lexicon paves the way for future interdisciplinary research, showcasing the role of NLP in addressing historical trauma, resilience, and cultural identity.

pdf bib
Arabic Topic Classification Corpus of the Nakba Short Stories
Osama Hamed | Nadeem Zaidkilani

In this paper, we enrich Arabic Natural Language Processing (NLP) resources by introducing the “Nakba Topic Classification Corpus (NTCC),” a novel annotated Arabic corpus derived from narratives about the Nakba. The NTCC comprises approximately 470 sentences extracted from eight short stories and captures the thematic depth of the Nakba narratives, providing insights into both historical and personal dimensions. The corpus was annotated in a two-step process. One third of the dataset was manually annotated, achieving an IAA of 87% (later resolved to 100%), while the rest was annotated using a rule-based system based on thematic patterns. This approach ensures consistency and reproducibility, enhancing the corpus’s reliability for NLP research. The NTCC contributes to the preservation of the Palestinian cultural heritage while addressing key challenges in Arabic NLP, such as data scarcity and linguistic complexity. By enabling tasks like topic modeling and classification, the NTCC offers a valuable resource for advancing Arabic NLP research and fostering a deeper understanding of the Nakba narratives.

pdf bib
Exploring Author Style in Nakba Short Stories: A Comparative Study of Transformer-Based Models
Osama Hamed | Nadeem Zaidkilani

Measuring semantic similarity and analyzing authorial style are fundamental tasks in Natural Language Processing (NLP), with applications in text classification, cultural analysis, and literary studies. This paper investigates the semantic similarity and stylistic features of Nakba short stories, a key component of Palestinian literature, using transformer-based models, AraBERT, BERT, and RoBERTa. The models effectively capture nuanced linguistic structures, cultural contexts, and stylistic variations in Arabic narratives, outperforming the traditional TF-IDF baseline. By comparing stories of similar length, we minimize biases and ensure a fair evaluation of both semantic and stylistic relationships. Experimental results indicate that RoBERTa achieves slightly higher performance, highlighting its ability to distinguish subtle stylistic patterns. This study demonstrates the potential of AI-driven tools to provide more in-depth insights into Arabic literature, and contributes to the systematic analysis of both semantic and stylistic elements in Nakba narratives.

pdf bib
Detecting Inconsistencies in Narrative Elements of Cross Lingual Nakba Texts
Nada Hamarsheh | Zahia Elabour | Aya Murra | Adnan Yahya

This paper suggests a methodology for contradiction detection in cross lingual texts about the Nakba. We propose a pipeline that includes text translation using Google’s Gemini for context-aware translations, followed by a fact extraction task using either Gemini or the TextRank algorithm. We then apply Natural Language Inference (NLI) by using models trained for this task, such as XLM-RoBERTa and BART to detect contradictions from different texts about the Nakba. We also describe how the performance of such NLI models is affected by the complexity of some sentences as well as the unique syntactic and semantic characteristics of the Arabic language. Additionally, we introduce a method using cosine similarity of vector embeddings of facts for identifying missing or underrepresented topics among historical narrative texts. The approach we propose in this paper provides insights into biases, contradictions, and gaps in narratives surrounding the Nakba, offering a deeper understanding of historical perspectives.
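A sketch of the cosine-similarity step for locating underrepresented content, using sentence-transformers; the model name, threshold, and example facts are illustrative choices, not the authors'.

    # Flag facts from one narrative with no close counterpart in another as
    # potentially missing or underrepresented. Illustrative sketch only.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

    facts_a = ["Villagers were displaced from their homes in 1948."]
    facts_b = ["Residents left the area.", "The harvest failed that year."]

    emb_a = model.encode(facts_a, convert_to_tensor=True)
    emb_b = model.encode(facts_b, convert_to_tensor=True)

    THRESHOLD = 0.6
    sims = util.cos_sim(emb_a, emb_b)  # shape (len(facts_a), len(facts_b))
    for i, fact in enumerate(facts_a):
        if sims[i].max().item() < THRESHOLD:
            print("possibly underrepresented in the other text:", fact)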

pdf bib
Multilingual Propaganda Detection: Exploring Transformer-Based Models mBERT, XLM-RoBERTa, and mT5
Mohamed Ibrahim Ragab | Ensaf Hussein Mohamed | Walaa Medhat

This research investigates multilingual propaganda detection by employing transformer-based models, specifically mBERT, XLM-RoBERTa, and mT5. The study utilizes a balanced dataset from the BiasFigNews corpus, annotated for propaganda and bias across five languages. The models were finely tuned to generate embeddings for classification tasks. The evaluation revealed mT5 as the most effective model, achieving an accuracy of 99.61% and an F1-score of 0.9961, followed by mBERT and XLM-RoBERTa with accuracies of 92% and 91.41%, respectively. The findings demonstrate the efficacy of transformer-based embeddings in detecting propaganda while also highlighting challenges in subtle class distinctions. Future work aims to enhance cross-lingual adaptability and explore lightweight models for resource-constrained settings.

pdf bib
Collective Memory and Narrative Cohesion: A Computational Study of Palestinian Refugee Oral Histories in Lebanon
Ghadir A. Awad | Tamara N. Rayan | Lavinia Dunagan | David Gamba

This study uses the Palestinian Oral History Archive (POHA) to investigate how Palestinian refugee groups in Lebanon sustain a cohesive collective memory of the Nakba through shared narratives. Grounded in Halbwachs’ theory of group memory, we employ statistical analysis of pairwise similarity of narratives, focusing on the influence of shared gender and location. We use textual representation and semantic embeddings of narratives to represent the interviews themselves. Our analysis demonstrates that shared origin is a powerful determinant of narrative similarity across thematic keywords, landmarks, and significant figures, as well as in semantic embeddings of the narratives. Meanwhile, shared residence fosters cohesion, with its impact significantly amplified when paired with shared origin. Additionally, women’s narratives exhibit heightened thematic cohesion, particularly in recounting experiences of the British occupation, underscoring the gendered dimensions of memory formation. This research deepens the understanding of collective memory in diasporic settings, emphasizing the critical role of oral histories in safeguarding Palestinian identity and resisting erasure.

pdf bib
The Missing Cause: An Analysis of Causal Attributions in Reporting on Palestine
Paulina Garcia Corral | Hannah Bechara | Krishnamoorthy Manohara | Slava Jankin

Missing cause bias is a specific type of bias in media reporting that relies on consistently omitting causal attribution to specific events, for example when omitting specific actors as causes of incidents. Identifying these patterns in news outlets can be helpful in assessing the level of bias present in media content. In this paper, we examine the prevalence of this bias in reporting on Palestine by identifying causal constructions in headlines. We compare headlines from three main news media outlets: CNN, the BBC, and AJ (AlJazeera), that cover the Israel-Palestine conflict. We also collect and compare these findings to data related to the Ukraine-Russia war to analyze editorial style within press organizations. We annotate a subset of this data and evaluate two causal language models (UniCausal and GPT-4o) for the identification and extraction of causal language in news headlines. Using the top performing model, GPT-4o, we machine-annotate the full corpus and analyze missing cause bias prevalence within and across news organizations. Our findings reveal that BBC headlines tend to avoid directly attributing causality to Israel for the violence in Gaza, both when compared to other news outlets, and to its own reporting on other conflicts.

pdf bib
Bias Detection in Media: Traditional Models vs. Transformers in Analyzing Social Media Coverage of the Israeli-Gaza Conflict
Marryam Yahya Mohammed | Esraa Ismail Mohamed | Mariam Nabil Esmat | Yomna Ashraf Nagib | Nada Ahmed Radwan | Ziad Mohamed Elshaer | Ensaf Hussein Mohamed

Bias in news reporting significantly influences public perception, particularly in sensitive and polarized contexts like the Israel-Gaza conflict. Detecting bias in such cases presents unique challenges due to political, cultural, and ideological complexities, often amplifying disparities in reporting. While prior research has addressed media bias and dataset fairness, these approaches inadequately capture the nuanced dynamics of the Israel-Gaza conflict. To address this gap, we propose an NLP-based framework that leverages Nakba narratives as linguistic resources for bias detection in news coverage. Using a multilingual corpus focusing on Arabic texts, we apply rigorous data cleaning, pre-processing, and methods to mitigate imbalanced class distributions that could skew classification outcomes. Our study explores various approaches, including Machine Learning (ML), Deep Learning (DL), Transformer-based architectures, and generative models. The findings demonstrate promising advancements in automating bias detection, and enhancing fairness and accuracy in politically sensitive reporting.

pdf bib
NakbaTR: A Turkish NER Dataset for Nakba Narratives
Esma Fatıma Bilgin Tasdemir | Şaziye Betül Özateş

This paper introduces a novel, annotated Named Entity Recognition (NER) dataset derived from a collection of 181 news articles about the Nakba and its witnesses. Given their prominence as a primary source of information on the Nakba in Turkish, news articles were selected as the primary data source. Some 4,032 news sentences were collected from the websites of two news agencies, Anadolu Ajansı and TRTHaber. We applied a filtering process to ensure that only news items containing witness testimonies regarding the ongoing Nakba are included in the dataset. After a semi-automatic annotation for entities of type Person, Location, and Organization, we obtained a NER dataset of 2,289 PERSON, 5,875 LOCATION, and 1,299 ORGANIZATION tags. We expect the dataset to be useful in several NLP tasks, such as sentiment analysis and relation extraction for the Nakba event, while providing a new language resource for Turkish. As future work, we aim to improve the dataset by increasing the number of news articles and entity types.

pdf bib
Integrating Argumentation Features for Enhanced Propaganda Detection in Arabic Narratives on the Israeli War on Gaza
Sara Nabhani | Claudia Borg | Kurt Micallef | Khalid Al-Khatib

Propaganda significantly shapes public opinion, especially in conflict-driven contexts like the Israeli-Palestinian conflict. This study explores the integration of argumentation features, such as claims, premises, and major claims, into machine learning models to enhance the detection of propaganda techniques in Arabic media. By leveraging datasets annotated with fine-grained propaganda techniques and employing crosslingual and multilingual NLP methods, along with GPT-4-based annotations, we demonstrate consistent performance improvements. A qualitative analysis of Arabic media narratives on the Israeli war on Gaza further reveals the model’s capability to identify diverse rhetorical strategies, offering insights into the dynamics of propaganda. These findings emphasize the potential of combining NLP with argumentation features to foster transparency and informed discourse in politically charged settings.


up

pdf (full)
bib (full)
Proceedings of Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning @ COLING 2025

pdf bib
Proceedings of Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning @ COLING 2025
Kang Liu | Yangqiu Song | Zhen Han | Rafet Sifa | Shizhu He | Yunfei Long

pdf bib
Chain of Knowledge Graph: Information-Preserving Multi-Document Summarization for Noisy Documents
Kangil Lee | Jinwoo Jang | Youngjin Lim | Minsu Shin

With the advent of large language models, the complexity of the multi-document summarization task has been substantially reduced. The summarization process must effectively handle noisy documents that are irrelevant to the main topic while preserving essential information. Recently, Chain-of-Density (COD) and Chain-of-Event (CoE) have proposed prompts to effectively handle noisy documents by using entity-centric approaches for summarization. However, CoD and CoE are prone to information loss during entity extraction due to their tendency to overly filter out entities perceived as less critical but that could still be important. In this paper, we propose a novel instruction prompt, termed Chain of Knowledge Graph (CoKG), for multi-document summarization. Our prompt extracts entities and constructs relationships between entities to form a Knowledge Graph (KG). Next, the prompt enriches these relationships to recognize potentially important entities and assess the strength of each relation. If the acquired KG meets a predefined quality level, the KG is used to summarize the given documents. This process helps alleviate information loss in multi-document summarization. Experimental results demonstrate that our prompt effectively preserves key entities and is robust to noisy documents.
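An illustrative prompt skeleton for the staged procedure the abstract describes; the wording is ours, not the paper's actual prompt.

    # Illustrative CoKG-style prompt skeleton, not the published prompt.
    COKG_PROMPT = """You will summarize multiple documents; some may be noisy
    or off-topic.

    Step 1: Extract the salient entities from the documents.
    Step 2: Build a knowledge graph as (entity, relation, entity) triples.
    Step 3: Enrich the graph: add potentially important entities and rate
    the strength of each relation (weak / medium / strong).
    Step 4: If the graph coherently covers the main topic, write a summary
    grounded only in the graph; otherwise return to Step 1.

    Documents:
    {documents}
    """

    print(COKG_PROMPT.format(documents="<doc 1>\n<doc 2>"))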

pdf bib
CEGRL-TKGR: A Causal Enhanced Graph Representation Learning Framework for Temporal Knowledge Graph Reasoning
Jinze Sun | Yongpan Sheng | Lirong He | Yongbin Qin | Ming Liu | Tao Jia

Temporal knowledge graph reasoning (TKGR) is increasingly gaining attention for its ability to extrapolate new events from historical data, thereby enriching the inherently incomplete temporal knowledge graphs. Existing graph-based representation learning frameworks have made significant strides in developing evolving representations for both entities and relational embeddings. Despite these achievements, there’s a notable tendency in these models to inadvertently learn biased data representations and mine spurious correlations, consequently failing to discern the causal relationships between events. This often leads to incorrect predictions based on these false correlations. To address this, we propose an innovative Causal Enhanced Graph Representation Learning framework for TKGR (named CEGRL-TKGR). This framework introduces causal structures in graph-based representation learning to unveil the essential causal relationships between events, ultimately enhancing the performance of the TKGR task. Specifically, we first disentangle the evolutionary representations of entities and relations in a temporal knowledge graph sequence into two distinct components, namely causal representations and confounding representations. Then, drawing on causal intervention theory, we advocate the utilization of causal representations for predictions, aiming to mitigate the effects of erroneous correlations caused by confounding features, thus achieving more robust and accurate predictions. Finally, extensive experimental results on six benchmark datasets demonstrate the superior performance of our model in the link prediction task.

pdf bib
Reasoning Knowledge Filter for Logical Table-to-Text Generation
Yu Bai | Baoqiang Liu | Shuang Xue | Fang Cai | Na Ye | Guiping Zhang

Logical table-to-text generation (LT2T) seeks to produce logically faithful textual descriptions based on tables. Current end-to-end LT2T models, which use descriptions directly as learning objectives, frequently face challenges in maintaining logical faithfulness due to the lack of reasoning knowledge. Recent research has introduced model-generated reasoning knowledge for the LT2T task, but the accompanying noise limits its performance. We therefore propose a reasoning knowledge filter framework that leverages the collaboration between large language models and smaller models to filter data points with high-quality reasoning knowledge. This framework aims to provide highly matched table, description, and reasoning knowledge triplets for LT2T. Results on the LogicNLG dataset demonstrate that our method achieves optimal performance with a reduced amount of data. Specifically, it enhances SP-Acc by 1.4 points and NLI-Acc by 0.7 points compared to the current state-of-the-art model.

pdf bib
From Chain to Tree: Refining Chain-like Rules into Tree-like Rules on Knowledge Graphs
Wangtao Sun | Shizhu He | Jun Zhao | Kang Liu

With good explainability and controllability, rule-based methods play an important role in the task of Knowledge Graph Completion (KGC). However, existing studies primarily focused on learning chain-like rules, whose chain-like structure limits their expressive power. Consequently, chain-like rules often exhibit lower Standard Confidence, and are prone to incorrect grounding values during reasoning, thus producing erroneous reasoning results. In this paper, we propose the concept of tree-like rules on knowledge graphs to expand the scope of application and improve the reasoning ability of rule-based methods. To achieve this, we formalize the problem of tree-like rule refinement and propose an effective framework for refining chain-like rules into tree-like rules. Experimental evaluations on four public datasets demonstrate that the proposed framework can seamlessly adapt to various chain-like rule induction methods and the refined tree-like rules consistently exhibit higher Standard Confidence and achieve better performances than the original chain-like rules on link prediction tasks. Furthermore, we illustrate that the improvements brought by tree-like rules are positively correlated with the density of the knowledge graphs. The data and code of this paper are available at https://github.com/forangel2014/tree-rule.
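For intuition, in our illustrative notation (not necessarily the paper's), a chain-like rule threads a single path of relations, while a tree-like rule adds a branch that further constrains an intermediate variable:

    \text{chain-like:}\quad r_1(x, z) \wedge r_2(z, y) \rightarrow h(x, y)
    \text{tree-like:}\quad  r_1(x, z) \wedge r_2(z, y) \wedge r_3(z, w) \rightarrow h(x, y)

The extra conjunct r_3(z, w) filters out incorrect groundings of z, which is consistent with the refined rules' higher Standard Confidence.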

pdf bib
LAB-KG: A Retrieval-Augmented Generation Method with Knowledge Graphs for Medical Lab Test Interpretation
Rui Guo | Barry Devereux | Greg Farnan | Niall McLaughlin

Laboratory tests generate structured numerical data, which a clinician must interpret to justify diagnoses and help patients understand the outcomes of the tests. LLMs have the potential to assist with the generation of interpretative comments, but legitimate concerns remain about the accuracy and reliability of the generation process. This work introduces LAB-KG, which conditions the generation process of an LLM on information retrieved from a knowledge graph of relevant patient conditions and lab test results. This helps to ground the text-generation process in accurate medical knowledge and enables generated text to be traced back to the knowledge graph. Given a dataset of laboratory test results and associated interpretive comments, we show how an LLM can build a KG of the relationships between laboratory test results, reference ranges, patient conditions and demographic information. We further show that the interpretive comments produced by an LLM conditioned on information retrieved from the KG are of higher quality than those from a standard RAG method. Finally, we show how our KG approach can improve the interpretability of the LLM generated text.

pdf bib
Bridging Language and Scenes through Explicit 3-D Model Construction
Tiansi Dong | Writwick Das | Rafet Sifa

We introduce the methodology of explicit model construction to bridge linguistic descriptions and scene perception, and demonstrate it in Visual Question-Answering (VQA) using MC4VQA (Model Construction for Visual Question-Answering), a method we developed. Given a question about a scene, our MC4VQA first recognizes objects utilizing pre-trained deep learning systems. Then, it constructs an explicit 3-D layout by repeatedly reducing the difference between the input scene image and the image rendered from the current 3-D spatial environment. This novel “iterative rendering” process endows MC4VQA with the capability of acquiring spatial attributes without training data. MC4VQA outperforms NS-VQA (the SOTA system) by reaching 99.94% accuracy on the benchmark CLEVR datasets, and is more robust than NS-VQA on new testing datasets. With newly created testing data, NS-VQA’s performance dropped to 97.60%, while MC4VQA maintained 99.0% accuracy. This work sets a new SOTA performance for VQA on the benchmark CLEVR datasets, and shapes a new method that may solve the out-of-distribution problem.
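A toy sketch of the iterative-rendering loop: adjust a layout until its rendering matches the input. Rendering is faked as an identity projection so the loop runs end to end; everything here is illustrative, not the authors' pipeline.

    import numpy as np

    target_image = np.array([0.0, 2.5, 4.0])  # stand-in for the input scene
    layout = np.zeros(3)                      # initial guessed 3-D layout

    def render(layout):
        return layout  # stand-in for a real renderer

    for _ in range(100):
        diff = render(layout) - target_image
        if np.abs(diff).max() < 1e-3:
            break
        layout -= 0.5 * diff  # reduce the image difference step by step

    print(layout)  # converges toward the target positions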

pdf bib
VCRMNER: Visual Cue Refinement in Multimodal NER using CLIP Prompts
Yu Bai | Lianji Wang | Xiang Liu | Haifeng Chi | Guiping Zhang

With the continuous growth of multi-modal data on social media platforms, traditional Named Entity Recognition has become insufficient for handling contemporary data formats. Consequently, researchers proposed Multi-modal Named Entity Recognition (MNER). Existing studies focus on capturing the visual regions corresponding to entities to assist in entity recognition. However, these approaches still struggle to mitigate interference from visual regions that are irrelevant to the entities. To address this issue, we propose an innovative framework, Visual Cue Refinement in MNER (VCRMNER) using CLIP Prompts, to accurately capture visual cues (object-level visual regions) associated with entities. We leverage prompts to represent the semantic information of entity categories, which helps us assess visual cues and minimize interference from those irrelevant to the entities. Furthermore, we designed an interaction transformer that operates in two stages, first within each modality and then between modalities, to refine visual cues by learning from a frozen image encoder, thereby reducing differences between text and visual modalities. Comprehensive experiments were conducted on two public datasets, Twitter15 and Twitter17. The results and detailed analyses demonstrate that our method exhibits robust and competitive performance.

pdf bib
Neuro-Conceptual Artificial Intelligence: Integrating OPM with Deep Learning to Enhance Question Answering Quality
Xin Kang | Veronika Shteyngardt | Yuhan Wang | Dov Dori

Knowledge representation and reasoning are critical challenges in Artificial Intelligence (AI), particularly in integrating neural and symbolic approaches to achieve explainable and transparent AI systems. Traditional knowledge representation methods often fall short of capturing complex processes and state changes. We introduce Neuro-Conceptual Artificial Intelligence (NCAI), a specialization of the neuro-symbolic AI approach that integrates conceptual modeling using Object-Process Methodology (OPM) ISO 19450:2024 with deep learning to enhance question-answering (QA) quality. By converting natural language text into OPM models using in-context learning, NCAI leverages the expressive power of OPM to represent complex OPM elements—processes, objects, and states—beyond what traditional triplet-based knowledge graphs can easily capture. This rich structured knowledge representation improves reasoning transparency and answer accuracy in an OPM-QA system. We further propose transparency evaluation metrics to quantitatively measure how faithfully the predicted reasoning aligns with OPM-based conceptual logic. Our experiments demonstrate that NCAI outperforms traditional methods, highlighting its potential for advancing neuro-symbolic AI by providing rich knowledge representations, measurable transparency, and improved reasoning.

pdf bib
Emergence of symbolic abstraction heads for in-context learning in large language models
Ali Al-Saeedi | Aki Harma

Large Language Models (LLMs) based on self-attention circuits are able to perform, at inference time, novel reasoning tasks, but the mechanisms inside the models are currently not fully understood. We assume that LLMs are able to generalize abstract patterns from the input and form an internal symbolic representation of the content. In this paper, we study this by analyzing the performance of small LLM models trained with sequences of instantiations of abstract sequential symbolic patterns or templates. It is shown that even a model with two layers is able to learn an abstract template and use it to generate correct output representing the pattern. This can be seen as a form of symbolic inference taking place inside the network. In this paper, we call this emergent mechanism an abstraction head. Identifying mechanisms of symbolic reasoning in a neural network can help to find new ways to merge symbolic and neural processing.

pdf bib
Linking language model predictions to human behaviour on scalar implicatures
Yulia Zinova | David Arps | Katharina Spalek | Jacopo Romoli

We explore the behaviour of language models on adjectival scales in connection with negation when prompted with material used in human experiments. We propose several metrics extracted from the model predictions and analyze those metrics in relation to human data as well as use them to propose new items to be tested in human experiments.
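One metric of the kind the abstract alludes to can be sketched as the probability a causal LM assigns to competing scalar adjectives after a (possibly negated) context; the items and model below are illustrative, and single-token adjectives are assumed.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def next_word_prob(context: str, word: str) -> float:
        ids = tokenizer(context, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(ids).logits[0, -1]
        probs = torch.softmax(logits, dim=-1)
        # Leading space for GPT-2's BPE; assumes the word is a single token.
        return probs[tokenizer(" " + word).input_ids[0]].item()

    context = "The water was not"
    for adjective in ["warm", "hot", "cold"]:
        print(adjective, next_word_prob(context, adjective))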

pdf bib
Generative FrameNet: Scalable and Adaptive Frames for Interpretable Knowledge Storage and Retrieval for LLMs Powered by LLMs
Harish Tayyar Madabushi | Taylor Hudson | Claire Bonial

Frame semantics provides an explanation for how we make use of conceptual frames, which encapsulate background knowledge and associations, to more completely understand the meanings of words within a context. Unfortunately, FrameNet, the only widely available implementation of frame semantics, is limited in both scale and coverage. Therefore, we introduce a novel mechanism for generating task-specific frames using large language models (LLMs), which we call Generative FrameNet. We demonstrate its effectiveness on a task that is highly relevant in the current landscape of LLMs: the interpretable storage and retrieval of factual information. Specifically, Generative Frames enable the extension of Retrieval-Augmented Generation (RAG), providing an interpretable framework for reducing inaccuracies in LLMs. We conduct experiments to demonstrate the effectiveness of this method both in terms of retrieval effectiveness as well as the relevance of the automatically generated frames and frame relations. Expert analysis shows that Generative Frames capture a more suitable level of semantic specificity than the frames from FrameNet. Thus, Generative Frames capture a notion of frame semantics that is closer to Fillmore’s originally intended definition, and offer potential for providing data-driven insights into Frame Semantics theory. Our results also show that this novel mechanism of Frame Semantic-based interpretable retrieval improves RAG for question answering with LLMs—outperforming a GPT-4 based baseline by up to 8 points. We provide open access to our data, including prompts and Generative FrameNet.


up

pdf (full)
bib (full)
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities

pdf bib
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
Mika Hämäläinen | Emily Öhman | Yuri Bizzoni | So Miyagawa | Khalid Alnajjar

pdf bib
A Comparative Analysis of Word Segmentation, Part-of-Speech Tagging, and Named Entity Recognition for Historical Chinese Sources, 1900-1950
Zhao Fang | Liang-Chun Wu | Xuening Kong | Spencer Dean Stewart

This paper compares large language models (LLMs) and traditional natural language processing (NLP) tools for performing word segmentation, part-of-speech (POS) tagging, and named entity recognition (NER) on Chinese texts from 1900 to 1950. Historical Chinese documents pose challenges for text analysis due to their logographic script, the absence of natural word boundaries, and significant linguistic changes. Using a sample dataset from the Shanghai Library Republican Journal corpus, traditional tools such as Jieba and spaCy are compared to LLMs, including GPT-4o, Claude 3.5, and the GLM series. The results show that LLMs outperform traditional methods in all metrics, albeit at considerably higher computational costs, highlighting a trade-off between accuracy and efficiency. Additionally, LLMs better handle genre-specific challenges such as poetry and temporal variations (i.e., pre-1920 versus post-1920 texts), demonstrating that their contextual learning capabilities can advance NLP approaches to historical texts by reducing the need for domain-specific training data.

pdf bib
Analyzing register variation in web texts through automatic segmentation
Erik Henriksson | Saara Hellström | Veronika Laippala

This study introduces a novel method for analyzing register variation in web texts through classification-based register segmentation. While traditional text-linguistic register analysis treats web documents as single units, we present a recursive binary segmentation approach that automatically identifies register shifts within web documents without labeled segment data, using a ModernBERT classifier fine-tuned on full web documents. Manual evaluation shows our approach to be reliable, and our experimental results reveal that register segmentation leads to more accurate register classification, helps models learn more distinct register categories, and produces text units with more consistent linguistic characteristics. The approach offers new insights into document-internal register variation in online discourse.
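
As a rough illustration of the recursive binary segmentation idea, here is a minimal Python sketch. It assumes a hypothetical classify() function that returns the fine-tuned classifier's maximum register probability for a span; the split criterion shown (both halves must beat the whole span's confidence by a margin) is an illustrative stand-in, not the authors' exact objective.

# Illustrative sketch of recursive binary register segmentation.
# classify(text) is assumed to return the classifier's maximum
# register probability for a span (a float in [0, 1]).

def segment(sentences, classify, min_len=5, gain=0.1):
    """Recursively split where both halves beat the whole span's confidence."""
    whole = classify(" ".join(sentences))
    best = None
    for i in range(min_len, len(sentences) - min_len + 1):
        left = classify(" ".join(sentences[:i]))
        right = classify(" ".join(sentences[i:]))
        score = min(left, right)  # the weaker half must still improve
        if score > whole + gain and (best is None or score > best[0]):
            best = (score, i)
    if best is None:
        return [sentences]  # no beneficial split: keep one segment
    i = best[1]
    return (segment(sentences[:i], classify, min_len, gain)
            + segment(sentences[i:], classify, min_len, gain))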

pdf bib
Analyzing Large Language Models’ pastiche ability: a case study on a 20th century Romanian author
Anca Dinu | Andra-Maria Florescu | Liviu Dinu

This study evaluated the ability of several Large Language Models (LLMs) to pastiche the literary style of the 20th-century Romanian author Mateiu Caragiale, by continuing one of his novels left unfinished upon his death. We assembled a database of novels consisting of six texts by Mateiu Caragiale, including his unfinished one, six texts by Radu Albala, including a continuation of Mateiu’s novel, and six LLM-generated novels that attempt to pastiche it. We compared the LLM-generated texts with the continuation by Radu Albala, using various methods. We automatically evaluated the pastiches with standard metrics such as ROUGE, BLEU, and METEOR, and performed stylometric analysis, clustering, authorship attribution, and a manual analysis. Both the computational and the manual analysis of the pastiches indicated that LLMs are able to produce fairly convincing pastiches, without matching the professional writer’s performance. The study also showed that ML techniques outperformed the more recent DL ones in both the clustering and authorship attribution tasks, probably because the dataset consists of only a few literary archaic texts in Romanian. In addition, linguistically informed features were shown to be competitive with automatically extracted features.

pdf bib
RAG-Enhanced Neural Machine Translation of Ancient Egyptian Text: A Case Study of THOTH AI
So Miyagawa

This paper demonstrates how Retrieval-Augmented Generation (RAG) significantly improves translation accuracy for Middle Egyptian, a historically rich but low-resource language. We integrate a vectorized Coptic-Egyptian lexicon and morphological database into a specialized tool called THOTH AI. By supplying domain-specific linguistic knowledge to Large Language Models (LLMs) like Claude 3.5 Sonnet, our system yields translations that are more contextually grounded and semantically precise. We compare THOTH AI against various mainstream models, including Gemini 2.0, DeepSeek R1, and GPT variants, evaluating performance with BLEU, SacreBLEU, METEOR, ROUGE, and chrF. Experimental results on the coronation decree of Thutmose I (18th Dynasty) show that THOTH AI’s RAG approach provides the most accurate translations, highlighting the critical value of domain knowledge in natural language processing for ancient, specialized corpora. Furthermore, we discuss how our method benefits e-learning, digital humanities, and language revitalization efforts, bridging the gap between purely data-driven approaches and expert-driven resources in historical linguistics.

pdf bib
Restructuring and visualising dialect dictionary data: Report on Erzya and Moksha materials
Jack Rueter | Niko Partanen

There are a number of Uralic dialect dictionaries based on fieldwork documentation of individual minority languages from the pre-Soviet era. The first of these published by the Finno-Ugrian Society features the Mordvin languages, Erzya and Moksha. In this article, we describe the possibility of reusing XML dialect dictionary collection-point and phonetic-variant data for visualizing informative linguistic isoglosses with the R programming language’s Shiny web application framework. We provide a description of the ‘H. Paasonen Mordvin Dictionary’, which may give the reader a better perspective of what data and challenges might present themselves in minority language dialect dictionaries. We describe how we processed our data, and then provide conclusions followed by a more extensive section on limitations. The conclusions state that only some of the data should be rendered with the R Shiny web application, whereas other data might be better rendered by other applications. Our limitations section calls for extending the dialect dictionary database towards a more concise description of the language forms.

pdf bib
Podcast Outcasts: Understanding Rumble’s Podcast Dynamics
Utkucan Balci | Jay Patel | Berkan Balci | Jeremy Blackburn

The rising popularity of podcasts as an emerging medium opens new avenues for digital humanities research, particularly when examining video-based media on alternative platforms. We present a novel pipeline for analyzing over 13K podcast videos (526 days of video content) from Rumble and YouTube that integrates advanced speech-to-text transcription, transformer-based topic modeling, and contrastive visual learning. We uncover the interplay between spoken rhetoric and visual elements in shaping political bias. Our findings reveal a distinct right-wing orientation in Rumble’s podcasts, contrasting with YouTube’s more diverse and apolitical content. By merging computational techniques with comparative analysis, our study advances digital humanities by demonstrating how large-scale multimodal analysis can decode ideological narratives in emerging media formats.

pdf bib
I only read it for the plot! Maturity Ratings Affect Fanfiction Style and Community Engagement
Mia Jacobsen | Ross Kristensen-McLachlan

We consider the textual profiles of different fanfiction maturity ratings, how they vary across fan groups, and how this relates to reader engagement metrics. Previous studies have shown that fanfiction writing is motivated by a combination of admiration for and frustration with the fan object. These findings emerge when looking at fanfiction as a whole, as well as when it is divided into subgroups, also called fandoms. However, maturity ratings are used to indicate the intended audience of the fanfiction, as well as whether the story includes mature themes and explicit scenes. Since these ratings can be used to filter readers and writers, they can also be seen as a proxy for different reader/writer motivations and desires. We find that explicit fanfiction in particular has a distinct textual profile when compared to other maturity ratings. These findings thus nuance our understanding of reader/writer motivations in fanfiction communities, and also highlight the influence of community norms and fan behavior more generally on these cultural products.

pdf bib
The AI Co-Ethnographer: How Far Can Automation Take Qualitative Research?
Fabian Retkowski | Andreas Sudmann | Alexander Waibel

Qualitative research often involves labor-intensive processes that are difficult to scale while preserving analytical depth. This paper introduces The AI Co-Ethnographer (AICoE), a novel end-to-end pipeline developed for qualitative research and designed to move beyond the limitations of simply automating code assignments, offering a more integrated approach. AICoE organizes the entire process, encompassing open coding, code consolidation, code application, and even pattern discovery, leading to a comprehensive analysis of qualitative data.

pdf bib
Irony Detection in Hebrew Documents: A Novel Dataset and an Evaluation of Neural Classification Methods
Avi Shmidman | Elda Weizman | Avishay Gerczuk

This paper focuses on the use of single words in quotation marks in Hebrew, which may or may not be an indication of irony. Because no annotated dataset yet exists for such cases, we annotate a new dataset consisting of over 4000 cases of words within quotation marks from Hebrew newspapers. On the basis of this dataset, we train and evaluate a series of seven BERT-based classifiers for irony detection, identifying the features and configurations that most effectively contribute to the irony detection task. We release this novel dataset to the NLP community to promote future research and benchmarking regarding irony detection in Hebrew.

pdf bib
Masks and Mimicry: Strategic Obfuscation and Impersonation Attacks on Authorship Verification
Kenneth Alperin | Rohan Leekha | Adaku Uchendu | Trang Nguyen | Srilakshmi Medarametla | Carlos Levya Capote | Seth Aycock | Charlie Dagli

The increasing use of Artificial Intelligence (AI) technologies, such as Large Language Models (LLMs), has led to nontrivial improvements in various tasks, including accurate authorship identification of documents. However, while LLMs improve such defense techniques, they also simultaneously provide a vehicle for malicious actors to launch new attack vectors. To combat this security risk, we evaluate the adversarial robustness of authorship models (specifically an authorship verification model) to potent LLM-based attacks. These attacks include untargeted methods (authorship obfuscation) and targeted methods (authorship impersonation). For both attacks, the objective is to mask or mimic, respectively, the writing style of an author while preserving the original text’s semantics. Thus, we perturb an accurate authorship verification model, and achieve maximum attack success rates of 92% and 78% for obfuscation and impersonation attacks, respectively.

pdf bib
Song Lyrics Adaptations: Computational Interpretation of the Pentathlon Principle
Barbora Štěpánková | Rudolf Rosa

Songs are an integral part of human culture, and they often resonate the most when we can sing them in our native language. However, translating song lyrics presents a unique challenge: maintaining singability, naturalness, and semantic fidelity. In this work, we computationally interpret Low’s Pentathlon Principle of singable translations so as to properly measure the quality of adapted lyrics, breaking it down into five measurable metrics that reflect its key aspects. Building on this foundation, we introduce a text-to-text song lyrics translation system based on generative large language models, designed to meet the Pentathlon Principle’s criteria without relying on melodies or bilingual training data. We experiment on the English-Czech language pair: we collect a dataset of English-to-Czech bilingual song lyrics and identify the desirable values of the five Pentathlon Principle metrics based on the values achieved by human translators. Through detailed human assessment of automatically generated lyric translations, we confirm the appropriateness of the proposed metrics as well as the general validity of the Pentathlon Principle, with some insights into the variation in people’s individual preferences. All code and data are available at https://github.com/stepankovab/Computational-Interpretation-of-the-Pentathlon-Principle.
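
One aspect such metrics must capture is singability, which depends on matching line-level syllable counts between the original and the adaptation. A small Python sketch of that idea, using a crude vowel-group heuristic as a stand-in for a proper syllable counter (the function and the example lines are illustrative, not the paper's metric):

import re

# Crude vowel-group "syllable" counter; a stand-in for a real counter.
def syllables(line):
    return len(re.findall(r"[aeiouyáéíóúůý]+", line.lower()))

# Fraction of lines whose syllable count matches the original's.
def syllable_match(original, translated):
    hits = sum(syllables(a) == syllables(b)
               for a, b in zip(original, translated))
    return hits / len(original)

print(syllable_match(
    ["Yesterday all my troubles seemed so far away"],
    ["Včera se zdály starosti tak vzdálené"]))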

pdf bib
MITRA-zh-eval: Using a Buddhist Chinese Language Evaluation Dataset to Assess Machine Translation and Evaluation Metrics
Sebastian Nehrdich | Avery Chen | Marcus Bingenheimer | Lu Huang | Rouying Tang | Xiang Wei | Leijie Zhu | Kurt Keutzer

With the advent of large language models, machine translation (MT) has become a widely used, but little understood, tool for accessing historical and multilingual texts. While models like GPT, Claude, and Deepseek increasingly enable translation of low-resource and ancient languages, critical questions remain about their evaluation, optimal model selection, and the value of domain-specific training and retrieval-augmented generation setups. This study introduces a comprehensive evaluation dataset for Buddhist Chinese to English translation, comprising 2,662 bilingual data points from 32 texts that have been selected to represent the full breadth of the Chinese Buddhist canon. We evaluate various computational metrics of translation quality (BLEU, chrF, BLEURT, GEMBA) against expert annotations from five domain specialists who rated 182 machine-generated translations. Our analysis reveals that LLM-based GEMBA scoring shows the strongest correlation with human judgment, significantly outperforming traditional metrics. We then benchmark commercial models (GPT-4 Turbo, Claude 3.5, Gemini), open-source models (Gemma 2, Deepseek-r1), and a domain-specialized model (Gemma 2 Mitra) using GEMBA. Our results demonstrate that domain-specific training enables open-weights models to achieve competitive performance with commercial systems, while also showing that retrieval-augmented generation (RAG) significantly improves translation quality for the best-performing commercial models.
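
The metric meta-evaluation described here amounts to correlating automatic scores with expert ratings over the same set of translations. A minimal sketch with invented scores, using Spearman's rank correlation:

from scipy.stats import spearmanr

# Invented scores: expert ratings of 7 translations and two metrics.
human = [4, 2, 5, 3, 1, 4, 2]
gemba = [85, 40, 95, 70, 20, 80, 45]
bleu = [0.31, 0.22, 0.35, 0.30, 0.18, 0.24, 0.28]

for name, scores in [("GEMBA", gemba), ("BLEU", bleu)]:
    rho, p = spearmanr(human, scores)  # rank correlation with the experts
    print(f"{name}: Spearman rho = {rho:.2f} (p = {p:.3f})")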

pdf bib
Effects of Publicity and Complexity in Reader Polarization
Yuri Bizzoni | Pascale Feldkamp | Kristoffer Nielbo

We investigate how Goodreads rating distributions reflect variations in audience reception across literary works. By examining a large-scale dataset of novels, we analyze whether metrics such as the entropy or standard deviation of rating distributions correlate with textual features – including perplexity, nominal ratio, and syntactic complexity. These metrics reveal a disagreement continuum: more complex texts – i.e., more cognitively demanding books, with a more canon-like textual profile – generate polarized reader responses, while mainstream works produce more uniform reactions. We compare evaluation patterns across canonical and non-canonical works, bestsellers, and prize-winners, finding that textual complexity drives rating polarization even when controlling for publicity effects. Our findings demonstrate that linguistically unpredictable texts, particularly those with higher nominal density and dependency distance, generate divergent reader evaluations. This challenges conventional literary success metrics and suggests that the shape of rating distributions offers valuable insights beyond average scores. We hope our approach establishes a productive framework for understanding how literary features influence reception and how disagreement metrics can enhance our understanding of public literary judgment.
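
Both disagreement metrics mentioned here are direct functions of a book's star-rating histogram. A small sketch (the five-star counts are invented for illustration):

import numpy as np

# Hypothetical histogram of 1-5 star ratings for one book.
counts = np.array([120, 340, 980, 1500, 860], dtype=float)
p = counts / counts.sum()

entropy = -np.sum(p * np.log2(p))  # higher = flatter, more disagreement
stars = np.arange(1, 6)
mean = np.sum(p * stars)
std = np.sqrt(np.sum(p * (stars - mean) ** 2))  # spread around the mean score
print(f"entropy = {entropy:.3f} bits, std = {std:.3f} stars")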

pdf bib
PsyTEx: A Knowledge-Guided Approach to Refining Text for Psychological Analysis
Avanti Bhandarkar | Ronald Wilson | Anushka Swarup | Gregory Webster | Damon Woodard

LLMs are increasingly applied for tasks requiring deep interpretive abilities and psychological insights, such as identity profiling, mental health diagnostics, personalized content curation, and human resource management. However, their performance in these tasks remains inconsistent, as these characteristics are not explicitly perceptible in the text. To address this challenge, this paper introduces a novel protocol called the “Psychological Text Extraction and Refinement Framework (PsyTEx)” that leverages LLMs to isolate and amplify psychologically informative segments and to evaluate LLM proficiency in interpreting complex psychological constructs from text. Using personality recognition as a case study, our extensive evaluation of five SOTA LLMs across two personality models (Big Five and Dark Triad) and two assessment levels (detection and prediction) highlights significant limitations in LLMs’ ability to accurately interpret psychological traits. However, our findings show that LLMs, when used within the PsyTEx protocol, can effectively extract relevant information that closely aligns with psychological expectations, offering a structured approach to support future advancements in modeling, taxonomy construction, and text-based psychological evaluations.

pdf bib
Advances and Challenges in the Automatic Identification of Indirect Quotations in Scholarly Texts and Literary Works
Frederik Arnold | Robert Jäschke | Philip Kraut

Literary scholars commonly refer to the interpreted literary work using various types of quotations. Two main categories are direct and indirect quotations. In this work, we focus on the automatic identification of two subtypes of indirect quotations: paraphrases and summaries. Our contributions are twofold. First, we present a dataset of scholarly works with annotations of text spans which summarize or paraphrase the interpreted drama, and of the source of each quotation. Second, we present a two-step approach to solve the task at hand. We found the process of annotating large training corpora very time-consuming, and therefore leverage GPT-generated summaries to generate training data for our approach.

pdf bib
Assessing Crowdsourced Annotations with LLMs: Linguistic Certainty as a Proxy for Trustworthiness
Tianyi Li | Divya Sree | Tatiana Ringenberg

Human-annotated data is fundamental for training machine learning models, yet crowdsourced annotations often contain noise and bias. In this paper, we investigate the feasibility of employing large language models (LLMs), specifically GPT-4, as evaluators of crowdsourced annotations using a zero-shot prompting strategy. We introduce a certainty-based approach that leverages linguistic cues, categorized into five levels (Absolute, High, Moderate, Low, Uncertain) based on Rubin’s framework, to assess the trustworthiness of LLM-generated evaluations. Using the MAVEN dataset as a case study, we compare GPT-4 evaluations against human evaluations and observe that the alignment between LLM and human judgments is strongly correlated with response certainty. Our results indicate that LLMs can effectively serve as a preliminary filter to flag potentially erroneous annotations for further expert review.
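
A minimal sketch of a cue-based certainty assignment in this spirit; the cue lists below are invented placeholders, not the Rubin-based inventory the paper uses:

# The cue lists are invented placeholders, not Rubin's inventory.
CUES = {
    "Absolute": ["definitely", "certainly", "without doubt"],
    "High": ["clearly", "very likely"],
    "Moderate": ["probably", "likely"],
    "Low": ["might", "perhaps", "possibly"],
    "Uncertain": ["unsure", "cannot tell", "unclear"],
}

def certainty_level(evaluation_text):
    text = evaluation_text.lower()
    for level, cues in CUES.items():  # checked from strongest to weakest
        if any(cue in text for cue in cues):
            return level
    return "Uncertain"  # no cue found

print(certainty_level("The label is probably correct."))  # -> Moderate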

pdf bib
The evolution of relative clauses in the IcePaHC treebank
Anton Ingason | Johanna Mechler

We examine how the elements that introduce relative clauses, namely relative complementizers and relative pronouns, evolve over the history of Icelandic using the phrase structure analysis of the IcePaHC treebank. The rate of these elements changes over time and, in the case of relative pronouns, is subject to effects of genre and the type of gap in the relative clause in question. Our paper is a digital humanities study of historical linguistics which would not be possible without a parsed corpus that spans all centuries involved in the change. We relate our findings to studies on the Constant Rate Effect by analyzing these effects in detail.

pdf bib
On Psychology of AI – Does Primacy Effect Affect ChatGPT and Other LLMs?
Mika Hämäläinen

We study the primacy effect in three commercial LLMs: ChatGPT, Gemini, and Claude. We do this by repurposing the famous experiment Asch (1946) conducted using human subjects. The experiment is simple: given two candidates with descriptions containing the same adjectives, which one is preferred if one description lists positive adjectives before negative ones and the other lists negative adjectives before positive ones? We test this in two experiments. In one experiment, LLMs are given both candidates simultaneously in the same prompt; in the other, LLMs are given the candidates separately. We test all the models with 200 candidate pairs. We found that, in the first experiment, ChatGPT preferred the candidate with positive adjectives listed first, while Gemini preferred both equally often; Claude refused to make a choice. In the second experiment, ChatGPT and Claude were most likely to rank both candidates equally. In cases where they did not give an equal rating, both showed a clear preference for the candidate that had negative adjectives listed first. Gemini was most likely to prefer a candidate with negative adjectives listed first.
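
A minimal sketch of how such stimulus pairs can be constructed, with invented adjective lists; the exact wording of the paper's prompts may differ:

# Invented adjective lists; the same adjectives in two orders.
positive = ["intelligent", "diligent", "warm"]
negative = ["stubborn", "impulsive", "critical"]

def describe(name, adjectives):
    return f"{name} is " + ", ".join(adjectives) + "."

cand_a = describe("Candidate A", positive + negative)  # positives first
cand_b = describe("Candidate B", negative + positive)  # negatives first

prompt = (cand_a + "\n" + cand_b + "\n"
          + "Which candidate do you prefer? Answer A, B, or Equal.")
print(prompt)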

pdf bib
The Literary Canons of Large-Language Models: An Exploration of the Frequency of Novel and Author Generations Across Gender, Race and Ethnicity, and Nationality
Paulina Toro Isaza | Nalani Kopp

Large language models (LLMs) are an emerging site for computational literary and cultural analysis. While such research has focused on applying LLMs to the analysis of literary text passages, the probabilistic mechanism these models use for text generation also lends them to the study of literary and cultural trends. Indeed, we can imagine LLMs as constructing their own “literary canons” by encoding particular authors and book titles with high probability distributions around relevant words and text. This paper explores the frequency with which certain literary titles and authors are generated by a selection of popular proprietary and open-source models and compares it to existing conceptions of the literary canon. It investigates the diversity of author mentions across gender, ethnicity, and nationality, as well as LLMs’ ability to accurately report such characteristics. We demonstrate that the literary canons of popular large language models are generally aligned with the Western literary canon in that they slightly prioritize male authors and overwhelmingly prioritize White American and British authors.

pdf bib
Moral reckoning: How reliable are dictionary-based methods for examining morality in text?
Ines Rehbein | Lilly Brauner | Florian Ertz | Ines Reinig | Simone Ponzetto

Due to their availability and ease of use, dictionary-based measures of moral values are a popular tool for text-based analyses of morality that examine human attitudes and behaviour across populations and cultures. In this paper, we revisit the construct validity of different dictionary-based measures of morality in text that have been proposed in the literature. We discuss conceptual challenges for text-based measures of morality and present an annotation experiment where we create a new dataset with human annotations of moral rhetoric in German political manifestos. We compare the results of our human annotations with different measures of moral values, showing that none of them is able to capture the trends observed by trained human coders. Our findings have far-reaching implications for the application of moral dictionaries in the digital humanities.

pdf bib
Bootstrapping AI: Interdisciplinary Approaches to Assessing OCR Quality in English-Language Historical Documents
Samuel Backer | Louis Hyman

New LLM-based OCR and post-OCR correction methods promise to transform computational historical research, yet their efficacy remains contested. We compare multiple correction approaches, including methods for “bootstrapping” fine-tuning with LLM-generated data, and measure their effect on downstream tasks. Our results suggest that standard OCR metrics often underestimate performance gains for historical research, underscoring the need for discipline-driven evaluations that can better reflect the needs of computational humanists.

pdf bib
Poetry in RAGs: Modern Greek interwar poetry generation using RAG and contrastive training
Stergios Chatzikyriakidis | Anastasia Natsina

In this paper, we discuss Modern Greek poetry generation in the style of lesser-known Greek poets of the interwar period. The paper proposes the use of Retrieval-Augmented Generation (RAG) to automatically generate poetry using Large Language Models (LLMs). A corpus of Greek interwar poetry is used, and prompts exemplifying each poet’s style with respect to a theme are created. These are then fed to an LLM. The results are compared to pure LLM generation, and expert evaluators score the poems across a number of parameters. Objective metrics such as vocabulary density, average words per sentence, and readability index are also used to assess the performance of the models. RAG-assisted models show potential in enhancing poetry generation across a number of parameters. Base LLMs appear quite consistent across a number of categories, while the RAG model that additionally uses contrastive training shows the worst performance of the three.

pdf bib
Using Multimodal Models for Informative Classification of Ambiguous Tweets in Crisis Response
Sumiko Teng | Emily Öhman

Social media platforms like X provide real-time information during crises but often include noisy, ambiguous data, complicating analysis. This study examines the effectiveness of multimodal models, particularly a cross-attention-based approach, in classifying tweets about the California wildfires as “informative” or “uninformative,” leveraging both text and image modalities. Using a dataset containing both ambiguous and unambiguous tweets, models were evaluated for their ability to handle real-world noisy data. Results show that the multimodal model outperforms unimodal counterparts, especially for ambiguous tweets, demonstrating its resilience and ability to integrate complementary modalities. These findings highlight the potential of multimodal approaches to enhance humanitarian response efforts by reducing information overload.

pdf bib
Transferring Extreme Subword Style Using Ngram Model-Based Logit Scaling
Craig Messner | Tom Lippincott

We present an ngram model-based logit scaling technique that effectively transfers extreme subword stylistic variation to large language models at inference time. We demonstrate its efficacy by tracking the perplexity of generated text with respect to the ngram-interpolated and original versions of an evaluation model. Minimizing the former measure while the latter approaches the perplexity of text produced by a target author or character lets us select a sufficient degree of adaptation while retaining fluency.
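
A minimal PyTorch sketch of the general idea: at each decoding step, the language model's logits are shifted by the log-probabilities of a style-specific ngram model. The interpolation weight alpha and the ngram_logprobs interface are assumptions, not the paper's exact formulation:

import torch

# ngram_logprobs is assumed to be a vector of log P(token | context)
# over the vocabulary, taken from a style-specific ngram model.
def scaled_logits(lm_logits, ngram_logprobs, alpha=0.5):
    # alpha controls how strongly the target style is imposed
    return lm_logits + alpha * ngram_logprobs

# Toy demo of one greedy decoding step over a 5-token vocabulary.
lm = torch.randn(5)
style = torch.log_softmax(torch.randn(5), dim=-1)
print(scaled_logits(lm, style).argmax().item())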

pdf bib
Evaluating Large Language Models for Narrative Topic Labeling
Andrew Piper | Sophie Wu

This paper evaluates the effectiveness of large language models (LLMs) for labeling topics in narrative texts, comparing performance across fiction and news genres. Building on prior studies in factual documents, we extend the evaluation to narrative contexts where story content is central. Using a ranked voting system with 200 crowdworkers, we assess participants’ preferences of topic labels by comparing multiple LLM outputs with human annotations. Our findings indicate minimal inter-model variation, with LLMs performing on par with human readers in news and outperforming humans in fiction. We conclude with a case study using a set of 25,000 narrative passages from novels illustrating the analytical value of LLM topic labels compared to traditional methods. The results highlight the significant promise of LLMs for topic labeling of narrative texts.

pdf bib
Beyond Cairo: Sa’idi Egyptian Arabic Literary Corpus Construction and Analysis
Mai Mohamed Eida | Nizar Habash

Egyptian Arabic (EA) NLP resources have mainly focused on Cairene Egyptian Arabic (CEA), leaving sub-dialects like Sa’idi Egyptian Arabic (SEA) underrepresented. This paper introduces the first SEA corpus – an open-source, 4-million-word literary dataset of a dialect spoken by ~30 million Egyptians. To validate its representation, we analyze SEA-specific linguistic features from dialectal surveys, confirming a higher prevalence in our corpus compared to existing EA datasets. Our findings offer insights into SEA’s orthographic representation in morphology, phonology, and lexicon, incorporating CODA* guidelines for normalization.

pdf bib
Advancing Sentiment Analysis in Tamil-English Code-Mixed Texts: Challenges and Transformer-Based Solutions
Mikhail Krasitskii | Olga Kolesnikova | Liliana Chanona Hernandez | Grigori Sidorov | Alexander Gelbukh

This study examines sentiment analysis in Tamil-English code-mixed texts using advanced transformer-based architectures. The unique linguistic challenges, including mixed grammar, orthographic variability, and phonetic inconsistencies, are addressed. Data limitations and annotation gaps are discussed, highlighting the need for larger datasets. The performance of models such as XLM-RoBERTa, mT5, IndicBERT, and RemBERT is evaluated, with insights into their optimization for low-resource, code-mixed environments.

pdf bib
Language use of political parties over time: Stylistic Fronting in the Icelandic Gigaword Corpus
Johanna Mechler | Lilja Björk Stefánsdóttir | Anton Ingason

Political speech is an active area of investigation and the ongoing ERC project Explaining Individual Lifespan Change (EILisCh) expands on some of the previous findings in this area. Previous work has found that political speech can differ based on party membership in a time-wise static environment and it has also been uncovered that individual politicians can change their linguistic behavior over time. In this paper, we pursue a novel topic in this area, the evolution of language use of entire political parties over time. We focus on Icelandic political parties and their use of Stylistic Fronting from 1999 to 2021, with a particular emphasis on the years around the financial crisis of 2008, and the subsequent years. Our results show that parties in a position of power typically speak more formally, using more Stylistic Fronting, but that at the same time there are some exceptions to this pattern. We highlight the significance of relying on a large speech corpus, when applying a high-definition approach to linguistic analyses across time.

pdf bib
From Causal Parrots to Causal Prophets? Towards Sound Causal Reasoning with Large Language Models
Rahul Babu Shrestha | Simon Malberg | Georg Groh

Causal reasoning is a fundamental property of human and machine intelligence. While large language models (LLMs) excel in many natural language tasks, their ability to infer causal relationships beyond memorized associations is debated. This study systematically evaluates recent LLMs’ causal reasoning across three levels of Pearl’s Ladder of Causation—associational, interventional, and counterfactual—as well as commonsensical, anti-commonsensical, and nonsensical causal structures using the CLadder dataset. We further explore the effectiveness of prompting techniques, including chain of thought (CoT), self-consistency (SC), and causal chain of thought (CausalCoT), in enhancing causal reasoning, and propose two new techniques causal tree of thoughts (CausalToT) and causal program of thoughts (CausalPoT). While larger models tend to outperform smaller ones and are generally more robust against perturbations, our results indicate that all tested LLMs still have difficulties, especially with counterfactual reasoning. However, our CausalToT and CausalPoT significantly improve performance over existing prompting techniques, suggesting that hybrid approaches combining LLMs with formal reasoning frameworks can mitigate these limitations. Our findings contribute to understanding LLMs’ reasoning capacities and outline promising strategies for improving their ability to reason causally as humans would. We release our code and data.

pdf bib
Modern Models, Medieval Texts: A POS Tagging Study of Old Occitan
Matthias Schöffel | Marinus Wiedner | Esteban Garces Arias | Paula Ruppert | Christian Heumann | Matthias Aßenmacher

Large language models (LLMs) have demonstrated remarkable capabilities in natural language processing, yet their effectiveness in handling historical languages remains largely unexplored. This study examines the performance of open-source LLMs in part-of-speech (POS) tagging for Old Occitan, a historical language characterized by non-standardized orthography and significant diachronic variation. Through comparative analysis of two distinct corpora—hagiographical and medical texts—we evaluate how current models handle the inherent challenges of processing a low-resource historical language. Our findings demonstrate critical limitations in LLM performance when confronted with extreme orthographic and syntactic variability. We provide detailed error analysis and specific recommendations for improving model performance in historical language processing. This research advances our understanding of LLM capabilities in challenging linguistic contexts while offering practical insights for both computational linguistics and historical language studies.

pdf bib
A Data-driven Investigation of Euphemistic Language: Comparing the usage of “slave” and “servant” in 19th century US newspapers
Jaihyun Park | Ryan Cordell

Warning: This paper contains examples of offensive language targeting marginalized populations. This study investigates the usage of “slave” and “servant” in 19th-century US newspapers using computational methods. While both terms were used to refer to enslaved African Americans, they were used in distinct ways. In the Chronicling America corpus, we accounted for possible OCR errors by using FastText embeddings and excluded text reprints to account for the 19th-century culture of text reprinting. Word2vec embeddings were used to find semantically close words to “slave” and “servant”, and the log-odds ratio was calculated to identify over-represented discourse words in the Southern and Northern newspapers. We found that “slave” is associated with socio-economic, legal, and administrative words, whereas “servant” is linked to religious words in the Northern newspapers, while Southern newspapers associate “servant” with domestic and familial words. We further found that slave discourse words from Southern newspapers are also prevalent in Northern newspapers, while servant discourse words from each side are prevalent in their own region. This study contributes to the understanding of how newspapers created different discourses around enslaved African Americans in the 19th-century US.
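
The log-odds comparison used here can be sketched in a few lines; this version uses simple +1 smoothing rather than the informative Dirichlet prior often used for such comparisons (Monroe et al., 2008), and the word counts are invented:

import math
from collections import Counter

# +1-smoothed log-odds of a word between two corpora; counts invented.
def log_odds(word, south, north):
    s, n = south[word] + 1, north[word] + 1
    s_rest = sum(south.values()) - south[word] + 1
    n_rest = sum(north.values()) - north[word] + 1
    return math.log(s / s_rest) - math.log(n / n_rest)

south = Counter({"servant": 40, "family": 30, "house": 30})
north = Counter({"servant": 25, "church": 45, "faith": 30})
print(round(log_odds("servant", south, north), 3))  # > 0: Southern-leaning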

pdf bib
It’s about What and How you say it: A Corpus with Stance and Sentiment Annotation for COVID-19 Vaccines Posts on X/Twitter by Brazilian Political Elites
Lorena Barberia | Pedro Schmalz | Norton Trevisan Roman | Belinda Lombard | Tatiane Moraes de Sousa

This paper details the development of a corpus of posts in Brazilian Portuguese published by Brazilian political elites on X (formerly Twitter) regarding COVID-19 vaccines. The corpus consists of 9,045 posts annotated for relevance, stance, and sentiment towards COVID-19 vaccines and vaccination during the first three years of the COVID-19 pandemic (2020-2022). Nine annotators, working in three groups, classified relevance, stance, and sentiment in messages posted between 2020 and 2022 by local political elites. The annotators underwent extensive training, and weekly meetings were conducted to ensure intra-group annotation consistency. The analysis revealed fair to moderate inter-annotator agreement (average Krippendorff’s alpha of 0.94 for relevance, 0.67 for sentiment, and 0.70 for stance). This work makes four significant contributions to the literature. First, it addresses the scarcity of corpora in Brazilian Portuguese, particularly on COVID-19 or vaccines in general. Second, it provides a reliable annotation scheme for sentiment and stance classification, distinguishing the two tasks and thereby improving classification precision. Third, it offers a corpus annotated with stance and sentiment according to this scheme, demonstrating how these tasks differ and how conflating them may lead to inconsistencies in corpus construction as a result of confounding these phenomena — a recurring issue in NLP research beyond studies focusing on vaccines. And fourth, this annotated corpus may serve as a gold standard for fine-tuning and evaluating supervised machine learning models for relevance, sentiment, and stance analysis of X posts in similar domains.
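
Agreement figures like these can be reproduced with the krippendorff Python package; a minimal sketch on an invented reliability matrix of nominal stance labels:

import numpy as np
import krippendorff  # pip install krippendorff

# Invented reliability matrix: 3 annotators (rows) x 6 posts (columns);
# integer stance labels, np.nan for a missing rating.
data = np.array([
    [1, 1, 2, 0, 1, np.nan],
    [1, 1, 2, 0, 2, 1],
    [1, 2, 2, 0, 1, 1],
])

alpha = krippendorff.alpha(reliability_data=data,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.2f}")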

pdf bib
A Bit of This, a Bit of That: Building a Genre and Topic Annotated Dataset of Historical Newspaper Articles with Soft Labels and Confidence Scores
Karin Stahel | Irenie How | Lauren Millar | Luis Paterson | Daniel Steel | Kaspar Middendorf

Digitised historical newspaper collections are becoming increasingly accessible, yet their scale and diverse content still present challenges for researchers interested in specific article types or topics. In a step towards developing models to address these challenges, we have created a dataset of articles from New Zealand’s Papers Past open data annotated with multiple genre and topic labels and annotator confidence scores. Our annotation framework aligns with the perspectivist approach to machine learning, acknowledging the subjective nature of the task and embracing the hybridity and uncertainty of genres. In this paper, we describe our sampling and annotation methods and the resulting dataset of 7,036 articles from 106 New Zealand newspapers spanning the period 1839-1903. This dataset will be used to develop interpretable classification models that enable fine-grained exploration and discovery of articles in Papers Past newspapers based on common aspects of form, function, and topic. The complete dataset, including un-aggregated annotations and supporting documentation, will eventually be openly released to facilitate further research.

pdf bib
Development of Old Irish Lexical Resources, and Two Universal Dependencies Treebanks for Diplomatically Edited Old Irish Text
Adrian Doyle | John McCrae

The quantity and variety of Old Irish text which survives in contemporary manuscripts, those dating from the Old Irish period, is quite small by comparison to what is available for Modern Irish, not to mention better-resourced modern languages. As no native speakers have existed for more than a millennium, no more text will ever be created by native speakers. For these reasons, text surviving in contemporary sources is particularly valuable. Ideally, all such text would be annotated using a single, common standard to ensure compatibility. At present, discrete Old Irish text repositories make use of incompatible annotation styles, few of which are utilised by text resources for other languages. This limits the potential for using text from more than any one resource simultaneously in NLP applications, or as a basis for creating further resources. This paper describes the production of the first Old Irish text resources to be designed specifically to ensure lexical compatibility and interoperability.

pdf bib
Augmented Close Reading for Classical Latin using BERT for Intertextual Exploration
Ashley Gong | Katy Gero | Mark Schiefsky

Intertextuality, the connection between texts, is a critical literary concept for analyzing classical Latin works. Given the emergence of AI in digital humanities, this paper presents Intertext.AI, a novel interface that leverages Latin BERT (Bamman and Burns 2020), a BERT model trained on classical Latin texts, and contextually rich visualizations to help classicists find potential intertextual connections. Intertext.AI identified over 80% of attested allusions from excerpts of Lucan's Pharsalia, demonstrating the system's technical efficacy. Our findings from a user study with 19 participants also suggest that Intertext.AI fosters intertextual discovery and interpretation more easily than other tools. While participants did not identify significantly different types or quantities of connections when using Intertext.AI or other tools, they overall found finding and justifying potential intertextuality easier with Intertext.AI, reported higher confidence in their observations from Intertext.AI, and preferred having access to it during the search process.

pdf bib
An evaluation of Named Entity Recognition tools for detecting person names in philosophical text
Ruben Weijers | Jelke Bloem

For philosophers, mentions of the names of other philosophers and scientists are an important indicator of relevance and influence. However, they don’t always come in neat citations, especially in older works. We evaluate various approaches to named entity recognition for person names in 20th century, English-language philosophical texts. We use part of a digitized corpus of the works of W.V. Quine, manually annotated for person names, to compare the performance of several systems: the rule-based edhiphy, spaCy’s CNN-based system, FLAIR’s BiLSTM-based system, and SpanBERT, ERNIE-v2 and ModernBERT’s transformer-based approaches. We also experiment with enhancing the smaller models with domain-specific embedding vectors. We find that both spaCy and FLAIR outperform transformer-based models, perhaps due to the small dataset sizes involved.

pdf bib
Testing Language Creativity of Large Language Models and Humans
Anca Dinu | Andra-Maria Florescu

Since the advent of Large Language Models (LLMs), the interest in and need for a better understanding of artificial creativity have increased. This paper aims to design and administer an integrated language creativity test, including multiple tasks and criteria, targeting both LLMs and humans, for a direct comparison. Language creativity refers to how one uses natural language in novel and unusual ways, by bending lexico-grammatical and semantic norms, by using literary devices, or by creating new words. The results show slightly better performance of LLMs compared to humans. We analyzed the response dataset with computational methods like sentiment analysis, clustering, and binary classification, for a more in-depth understanding. We also manually inspected a part of the answers, which revealed that the LLMs mastered figurative speech, while humans responded more pragmatically.

pdf bib
Strategies for political-statement segmentation and labelling in unstructured text
Dmitry Nikolaev | Sean Papay

Analysis of parliamentary speeches and political-party manifestos has become an integral area of computational study of political texts. While speeches have been overwhelmingly analysed using unsupervised methods, a large corpus of manifestos with by-statement political-stance labels has been created by the participants of the MARPOR project. It has been recently shown that these labels can be predicted by a neural model; however, the current approach relies on provided statement boundaries, limiting out-of-domain applicability. In this work, we propose and test a range of unified split-and-label frameworks—based on linear-chain CRFs, fine-tuned text-to-text models, and the combination of in-context learning with constrained decoding—that can be used to jointly segment and classify statements from raw textual data. We show that our approaches achieve competitive accuracy when applied to raw text of political manifestos, and then demonstrate the research potential of our method by applying it to the records of the UK House of Commons and tracing the political trajectories of four major parties in the last three decades.
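
The linear-chain CRF variant of the split-and-label idea can be sketched as BIO-style token tagging, where "B-<code>" opens a statement and "I-<code>" continues it. The features and the two MARPOR-like codes below are invented, and this uses the sklearn-crfsuite package rather than the authors' implementation:

import sklearn_crfsuite  # pip install sklearn-crfsuite

# Toy token features; a real system would use richer context.
def features(tokens, i):
    return {"word": tokens[i].lower(),
            "is_first": i == 0,
            "prev": tokens[i - 1].lower() if i > 0 else "<s>"}

sents = [["cut", "taxes", "now"], ["protect", "public", "health"]]
X = [[features(s, i) for i in range(len(s))] for s in sents]
y = [["B-econ", "I-econ", "I-econ"],
     ["B-welfare", "I-welfare", "I-welfare"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
print(crf.predict(X)[0])  # e.g. ['B-econ', 'I-econ', 'I-econ']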

pdf bib
Mining the Past: A Comparative Study of Classical and Neural Topic Models on Historical Newspaper Archives
Keerthana Murugaraj | Salima Lamsiyah | Marten During | Martin Theobald

Analyzing historical discourse in large-scale newspaper archives requires scalable and interpretable methods to uncover hidden themes. This study systematically evaluates topic modeling approaches for newspaper articles from 1955 to 2018, comparing probabilistic LDA, matrix-factorization-based NMF, and neural models such as Top2Vec and BERTopic across various preprocessing strategies. We benchmark these methods on topic coherence, diversity, scalability, and interpretability. While LDA is commonly used in historical text analysis, our findings demonstrate that BERTopic, leveraging contextual embeddings, consistently outperforms classical models in all tested aspects, making it a more robust choice for large-scale textual corpora. Additionally, we highlight the trade-offs between preprocessing strategies and model performance, emphasizing the importance of tailored pipeline design. These insights advance the field of historical NLP, offering concrete guidance for historians and computational social scientists in selecting the most effective topic-modeling approach for analyzing digitized archives. Our code will be publicly available on GitHub.
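
One step of such a benchmark, training LDA with gensim and scoring topic coherence (c_v), can be sketched as follows; the toy documents stand in for tokenized newspaper articles:

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel, LdaModel

# Toy tokenized "articles"; real input would be preprocessed newspaper text.
docs = [["strike", "union", "factory"], ["election", "party", "vote"],
        ["strike", "wage", "union"], ["vote", "ballot", "party"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=0, passes=10)

coherence = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                           coherence="c_v").get_coherence()
print(f"LDA c_v coherence: {coherence:.3f}")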

pdf bib
A Comparative Analysis of Ethical and Safety Gaps in LLMs using Relative Danger Coefficient
Yehor Tereshchenko | Mika Hämäläinen

Artificial Intelligence (AI) and Large Language Models (LLMs) have rapidly evolved in recent years, showcasing remarkable capabilities in natural language understanding and generation. However, these advancements also raise critical ethical questions regarding safety, potential misuse, discrimination, and overall societal impact. This article provides a comparative analysis of the ethical performance of various AI models, including the brand-new DeepSeek-V3 (R1 with and without reasoning), various GPT variants (4o, 3.5 Turbo, 4 Turbo, o1/o3 mini), and Gemini (1.5 Flash, 2.0 Flash, and 2.0 Flash Exp), and highlights the need for robust human oversight, especially in situations with high stakes. Furthermore, we present a new metric for calculating harm in LLMs called the Relative Danger Coefficient (RDC).

pdf bib
Threefold model for AI Readiness: A Case Study with Finnish Healthcare SMEs
Mohammed Alnajjar | Khalid Alnajjar | Mika Hämäläinen

This study examines AI adoption among Finnish healthcare SMEs through semi-structured interviews with six health-tech companies. We identify three AI engagement categories: AI-curious (exploring AI), AI-embracing (integrating AI), and AI-catering (providing AI solutions). Our proposed threefold model highlights key adoption barriers, including regulatory complexities, technical expertise gaps, and financial constraints. While SMEs recognize AI’s potential, most remain in early adoption stages. We provide actionable recommendations to accelerate AI integration, focusing on regulatory reforms, talent development, and inter-company collaboration, offering valuable insights for healthcare organizations, policymakers, and researchers.

pdf bib
AI Assistant for Socioeconomic Empowerment Using Federated Learning
Nahed Abdelgaber | Labiba Jahan | Nino Castellano | Joshua Oltmanns | Mehak Gupta | Jia Zhang | Akshay Pednekar | Ashish Basavaraju | Ian Velazquez | Zerui Ma

Socioeconomic status (SES) reflects an individual’s standing in society, based on a holistic set of factors including income, education level, and occupation. Identifying individuals in low-SES groups is crucial to ensuring they receive the necessary support. However, many individuals may be hesitant to disclose their SES directly. This study introduces a federated learning-powered framework capable of verifying individuals’ SES levels through the analysis of their communications described in natural language. We propose to study language usage patterns among individuals from different SES groups using clustering and topic modeling techniques. An empirical study leveraging life narrative interviews demonstrates the effectiveness of our proposed approach.

pdf bib
Team Conversational AI: Introducing Effervesce
Erjon Skenderi | Salla-Maaria Laaksonen | Jukka Huhtamäki

Group conversational AI, especially within digital workspaces, could potentially play a crucial role in enhancing organizational communication. This paper introduces Effervesce, a Large Language Model (LLM) powered group conversational bot integrated into a multi-user Slack environment. Unlike conventional conversational AI applications that are designed for one-to-one interactions, our bot addresses the challenges of facilitating multi-actor conversations. We first evaluated multiple open-source LLMs on a dataset of 1.6k group conversation messages. We then fine-tuned the best-performing model using a Parameter Efficient Fine-Tuning technique to better align Effervesce with multi-actor conversation settings. Evaluation through workshops with 40 participants indicates positive impacts on communication dynamics, although areas for further improvement were identified. Our findings highlight the potential of Effervesce in enhancing group communication, with future work aimed at refining the bot’s capabilities based on user feedback.

pdf bib
Mapping Hymns and Organizing Concepts in the Rigveda: Quantitatively Connecting the Vedic Suktas
Venkatesh Bollineni | Igor Crk | Eren Gultepe

Accessing and gaining insight into the Rigveda poses a non-trivial challenge due to its extremely ancient Sanskrit language, poetic structure, and large volume of text. Using NLP techniques, this study identified topics and semantic connections of hymns within the Rigveda that were corroborated by seven well-known groupings of hymns. The 1,028 suktas (hymns) from the modern English translation of the Rigveda by Jamison and Brereton were preprocessed, and sukta-level embeddings were obtained using i) a novel adaptation of LSA, presented herein, ii) SBERT, and iii) Doc2Vec. Following a UMAP dimension reduction of the vectors, the network of suktas was formed using k-nearest neighbours. Community detection of topics in the sukta networks was then performed with the Louvain, Leiden, and label propagation methods, and the statistical significance of the formed topics was determined using an appropriate null distribution. Only the novel adaptation of LSA with the Leiden method detected sukta topic networks that were significant (z = 2.726, p < .01), with a modularity score of 0.944. Of the seven famous sukta groupings analyzed (creation, funeral, water, etc.), the LSA-derived network was successful in all seven cases, while Doc2Vec was not significant and failed to detect the relevant suktas. SBERT detected four of the famous groupings as separate groups, but mistakenly combined three of them into a single mixed group. The SBERT network was also not statistically significant.
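
A minimal sketch of the embeddings-to-communities pipeline described here (UMAP reduction, k-nearest-neighbour graph, Leiden partition); the random matrix stands in for the sukta embeddings, and the parameter values are illustrative:

import numpy as np
import igraph as ig
import leidenalg  # pip install leidenalg
import umap       # pip install umap-learn
from sklearn.neighbors import kneighbors_graph

X = np.random.rand(1028, 300)  # placeholder for LSA/SBERT/Doc2Vec vectors

X_low = umap.UMAP(n_components=10, random_state=0).fit_transform(X)
A = kneighbors_graph(X_low, n_neighbors=15, mode="connectivity")
edges = [(int(i), int(j)) for i, j in zip(*A.nonzero())]
g = ig.Graph(n=X.shape[0], edges=edges, directed=False).simplify()

partition = leidenalg.find_partition(g, leidenalg.ModularityVertexPartition)
print(f"{len(partition)} topics, modularity = {partition.modularity:.3f}")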

pdf bib
EduPo: Progress and Challenges of Automated Analysis and Generation of Czech Poetry
Rudolf Rosa | David Mareček | Tomáš Musil | Michal Chudoba | Jakub Landsperský

This paper explores automated analysis and generation of Czech poetry. We review existing tools, datasets, and methodologies while considering the unique characteristics of the Czech language and its poetic tradition. Our approach builds upon available resources wherever possible, yet requires the development of additional components to address existing gaps. We present and evaluate preliminary experiments, highlighting key challenges and potential directions for future research.

pdf bib
A City of Millions: Mapping Literary Social Networks At Scale
Sil Hamilton | Rebecca Hicke | David Mimno | Matthew Wilkens

We release 70,509 high-quality social networks extracted from multilingual fiction and nonfiction narratives. We additionally provide metadata for ~30,000 of these texts (73% nonfiction and 27% fiction) written between 1800 and 1999 in 58 languages. This dataset provides information on historical social worlds at an unprecedented scale, including data for 2,510,021 individuals in 2,805,482 pair-wise relationships annotated for affinity and relationship type. We achieve this scale by automating previously manual methods of extracting social networks; specifically, we adapt an existing annotation task as a language model prompt, ensuring consistency at scale with the use of structured output. This dataset serves as a unique resource for humanities and social science research by providing data on cognitive models of social realities.
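
A minimal sketch of recasting such an annotation task as a structured-output prompt; the JSON schema and field names below are illustrative, not the authors' exact format:

import json

# Illustrative schema; not the authors' exact output format.
SCHEMA = {
    "characters": ["list of person names"],
    "relationships": [{"source": "name", "target": "name",
                       "affinity": "positive|neutral|negative",
                       "type": "family|professional|social"}],
}

def build_prompt(passage):
    return ("Extract the social network from the passage below. "
            "Respond with JSON matching this schema:\n"
            + json.dumps(SCHEMA, indent=2)
            + "\n\nPassage:\n" + passage)

print(build_prompt("Elizabeth visited her sister Jane in London.")[:80])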

pdf bib
VLG-BERT: Towards Better Interpretability in LLMs through Visual and Linguistic Grounding
Toufik Mechouma | Ismail Biskri | Serge Robert

We present VLG-BERT, a novel model conceived to improve language meaning encoding. VLG-BERT provides deeper insights into meaning encoding in Large Language Models (LLMs) by focusing on linguistic and real-world semantics. It uses syntactic dependencies as a form of ground truth to supervise the learning of word representations. VLG-BERT incorporates visual latent representations from pre-trained vision models and their corresponding labels. A vocabulary of 10k tokens corresponding to so-called concrete words is built by extending the set of ImageNet labels. The extension is based on synonyms, hyponyms, and hypernyms from WordNet. A lookup table for this vocabulary is then used to initialize the embedding matrix during training, rather than random initialization. This multimodal grounding provides a stronger semantic foundation for encoding the meaning of words. The architecture aligns seamlessly with foundational theories from across the cognitive sciences, and the integration of visual and linguistic grounding makes VLG-BERT consistent with many cognitive theories. Our approach contributes to the ongoing effort to create models that bridge the gap between language and vision, making them more aligned with how humans understand and interpret the world. Experiments on text classification have shown excellent results compared to BERT Base.

pdf bib
Historical Ink: Exploring Large Language Models for Irony Detection in 19th-Century Spanish
Kevin Cohen | Laura Manrique-Gómez | Ruben Manrique

This study explores the use of large language models (LLMs) to enhance datasets and improve irony detection in 19th-century Latin American newspapers. Two strategies were employed to evaluate the efficacy of BERT and GPT models in capturing the subtly nuanced nature of irony, through both multi-class and binary classification tasks. First, we implemented dataset enhancements focused on enriching emotional and contextual cues; however, these showed limited impact on historical language analysis. The second strategy, a semi-automated annotation process, effectively addressed class imbalance and augmented the dataset with high-quality annotations. Despite the challenges posed by the complexity of irony, this work contributes to the advancement of sentiment analysis through two key contributions: introducing a new historical Spanish dataset tagged for sentiment analysis and irony detection, and proposing a semi-automated annotation methodology in which human expertise is crucial for refining LLM results, enriched by incorporating historical and cultural contexts as core features.

pdf bib
Insights into developing analytical categorization schemes: three problem types related to annotation agreement
Pihla Toivanen | Eetu Mäkelä | Antti Kanner

Coding themes, frames, opinions, and other attributes is widely used in the social sciences, and such coding also forms a basis for building supervised text classifiers. Coding content requires substantial resources, and lately this process has been used particularly for annotating training sets for machine learning models. Although objectivity is not always the purpose of coding, uniform coding helps in building machine learning models. Usually, machine learning models are built by first defining an annotation scheme, which contains definitions of categories and instructions for coding. It is known that multiple aspects affect the annotation results, such as the domain of annotation, the number of annotators, and the number of categories. In this article, we present a few more problems that we show to be related to annotation results in our case study: negated presence of a category, low proportional presence of relevant content, and implicit presence of a category. These problems should be resolved in all schemes at the level of scheme definition. To extract our problem categories, we focus on a media research case with extensive data on both the annotation process and its results.

pdf bib
A Comprehensive Evaluation of Cognitive Biases in LLMs
Simon Malberg | Roman Poletukhin | Carolin Schuster | Georg Groh

We present a large-scale evaluation of 30 cognitive biases in 20 state-of-the-art large language models (LLMs) under various decision-making scenarios. Our contributions include a novel general-purpose test framework for reliable and large-scale generation of tests for LLMs, a benchmark dataset with 30,000 tests for detecting cognitive biases in LLMs, and a comprehensive assessment of the biases found in the 20 evaluated LLMs. Our work confirms and broadens previous findings suggesting the presence of cognitive biases in LLMs by reporting evidence of all 30 tested biases in at least some of the 20 LLMs. We publish our framework code and dataset to encourage future research on cognitive biases in LLMs: https://github.com/simonmalberg/cognitive-biases-in-llms.

pdf bib
AI with Emotions: Exploring Emotional Expressions in Large Language Models
Shin-nosuke Ishikawa | Atsushi Yoshino

The human-level performance of Large Language Models (LLMs) across various tasks has raised expectations for the potential of artificial intelligence (AI) to possess emotions someday. To explore the capability of current LLMs to express emotions in their outputs, we conducted an experiment using several LLMs (OpenAI GPT, Google Gemini, Meta Llama3, and Cohere Command R+) to role-play as agents answering questions with specified emotional states. We defined the emotional states using Russell’s Circumplex model, a well-established framework that characterizes emotions along the sleepy-activated (arousal) and pleasure-displeasure (valence) axes. We chose this model for its simplicity, utilizing two continuous parameters, which allows for better controllability in applications involving continuous changes in emotional states. The responses generated were evaluated using a sentiment analysis model, independent of the LLMs, trained on the GoEmotions dataset. The evaluation showed that the emotional states of the generated answers were consistent with the specifications, demonstrating the LLMs’ capability for emotional expression. This indicates the potential of LLM-based agents to express specified emotional states.

pdf bib
Fearful Falcons and Angry Llamas: Emotion Category Annotations of Arguments by Humans and LLMs
Lynn Greschner | Roman Klinger

Arguments evoke emotions, influencing the effect of the argument itself. Not only the emotional intensity but also the category influences the argument’s effects, for instance, the willingness to adapt stances. While binary emotionality has been studied in argumentative texts, there is no work on discrete emotion categories (e.g., ‘anger’) in such data. To fill this gap, we crowdsource subjective annotations of emotion categories in a German argument corpus and evaluate automatic LLM-based labeling methods. Specifically, we compare three prompting strategies (zero-shot, one-shot, chain-of-thought) on three large instruction-tuned language models (Falcon-7b-instruct, Llama-3.1-8B-instruct, GPT-4o-mini). We further vary the definition of the output space to be binary (is there emotionality in the argument?), closed-domain (which emotion from a given label set is in the argument?), or open-domain (which emotion is in the argument?). We find that emotion categories enhance the prediction of emotionality in arguments, emphasizing the need for discrete emotion annotations in arguments. Across all prompt settings and models, automatic predictions show a high recall but low precision for predicting anger and fear, indicating a strong bias toward negative emotions.

pdf bib
HateImgPrompts: Mitigating Generation of Images Spreading Hate Speech
Vineet Kumar Khullar | Venkatesh Velugubantla | Bhanu Prakash Reddy Rella | Mohan Krishna Mannava | Msvpj Sathvik

The emergence of artificial intelligence has proven beneficial to numerous organizations, particularly in its various applications for social welfare. One notable application lies in AI-driven image generation tools. These tools produce images based on provided prompts. While this technology holds potential for constructive use, it also carries the risk of being exploited for malicious purposes, such as propagating hate. To address this, we propose a novel dataset, “HateImgPrompts”. We have benchmarked the dataset with the latest models, including GPT-3.5 and LLAMA 2. The dataset consists of 9,467 prompts, and the accuracy of the classifier after fine-tuning on the dataset is around 81%.

up

pdf (full)
bib (full)
Proceedings of the 1st Workshop on Ecology, Environment, and Natural Language Processing (NLP4Ecology2025)

pdf bib
Proceedings of the 1st Workshop on Ecology, Environment, and Natural Language Processing (NLP4Ecology2025)
Valerio Basile | Cristina Bosco | Francesca Grasso | Muhammad Okky Ibrohim | Maria Skeppstedt | Manfred Stede

pdf bib
From Data to Grassroots Initiatives: Leveraging Transformer-Based Models for Detecting Green Practices in Social Media
Anna Glazkova | Olga Zakharova

Green practices are everyday activities that support a sustainable relationship between people and the environment. Detecting these practices in social media helps track their prevalence and develop recommendations to promote eco-friendly actions. This study compares machine learning methods for identifying mentions of green waste practices as a multi-label text classification task. We focus on transformer-based models, which currently achieve state-of-the-art performance across various text classification tasks. Along with encoder-only models, we evaluate encoder-decoder and decoder-only architectures, including instruction-based large language models. Experiments on the GreenRu dataset, which consists of Russian social media texts, show the superiority of the mBART encoder-decoder model. The findings of this study contribute to the advancement of natural language processing tools for ecological and environmental research, as well as the broader development of multi-label text classification methods in other domains.

pdf bib
Perspectives on Forests and Forestry in Finnish Online Discussions - A Topic Modeling Approach to Suomi24
Telma Peura | Attila Krizsán | Salla-Riikka Kuusalu | Veronika Laippala

This paper explores how forests and the forest industry are perceived on the largest online discussion forum in Finland, Suomi24 (‘Finland24’). Using 30,636 posts published in 2014–2020, we investigate what kinds of topics and perspectives towards forest management can be found. We use BERTopic as our topic modeling approach and evaluate the results of its different modular combinations. As the dataset is not labeled, we demonstrate the validity of our best model by illustrating some of the topics about forest use. The results show that a combination of UMAP and K-means leads to the best topic quality. Our exploratory qualitative analysis indicates that the posts reflect polarized discourses between the forest industry and forest conservation adherents.

pdf bib
Mining for Species, Locations, Habitats, and Ecosystems from Scientific Papers in Invasion Biology: A Large-Scale Exploratory Study with Large Language Models
Jennifer D’Souza | Zachary Laubach | Tarek Al Mustafa | Sina Zarrieß | Robert Frühstückl | Phyllis Illari

This study explores the use of large language models (LLMs), specifically GPT-4o, to extract key ecological entities—species, locations, habitats, and ecosystems—from invasion biology literature. This information is critical for understanding species spread, predicting future invasions, and informing conservation efforts. Without domain-specific fine-tuning, we assess the potential and limitations of GPT-4o, out-of-the-box, for this task, highlighting the role of LLMs in advancing automated knowledge extraction for ecological research and management.

pdf bib
Large Language Models as Annotators of Named Entities in Climate Change and Biodiversity: A Preliminary Study
Elena Volkanovska

This paper examines whether few-shot techniques for Named Entity Recognition (NER) utilising existing large language models (LLMs) as their backbone can be used to reliably annotate named entities (NEs) in scientific texts on climate change and biodiversity. A series of experiments aim to assess whether LLMs can be integrated into an end-to-end pipeline that could generate token- or sentence-level NE annotations; the former being an ideal-case scenario that allows for seamless integration of existing with new token-level features in a single annotation pipeline. Experiments are run on four LLMs, two NER datasets, two input and output data formats, and ten and nine prompt versions per dataset. The results show that few-shot methods are far from being a silver bullet for NER in highly specialised domains, although improvement in LLM performance is observed for some prompt designs and some NE classes. Few-shot methods would find better use in a human-in-the-loop scenario, where an LLM’s output is verified by a domain expert.

pdf bib
Communicating urgency to prevent environmental damage: insights from a linguistic analysis of the WWF24 multilingual corpus
Cristina Bosco | Adriana Silvina Pagano | Elisa Chierchiello

Contemporary environmental discourse focuses on effectively communicating ecological vulnerability to raise public awareness and encourage positive actions. Hence there is a need for studies to support accurate and adequate discourse production, both by humans and computers. Two main challenges need to be tackled. On the one hand, the language used to communicate about environmental issues can be very complex for human and automatic analysis, and there are few resources to train and test NLP tools. On the other hand, in the current international scenario, most texts are written in multiple languages or translated from a major to a minor language, resulting in different meanings in different languages and cultural contexts. This paper presents a novel parallel corpus comprising the text of the World Wide Fund (WWF) 2024 Annual Report in English and its translations into Italian and Brazilian Portuguese, and analyses their linguistic features.

pdf bib
Thematic Categorization on Pineapple Production in Costa Rica: An Exploratory Analysis through Topic Modeling
Valentina Tretti Beckles | Adrian Vergara Heidke

Costa Rica is one of the largest producers and exporters of pineapple in the world. This status has encouraged multinational companies to use plantations in this Central American country for experimentation and the cultivation of new varieties, such as the Pinkglow pineapple. However, pineapple monoculture has significant socio-environmental impacts on the regions where it is cultivated. In this exploratory study, we aimed to analyze how pineapple production is portrayed on the Internet. To achieve this, we collected a corpus of texts in Spanish and English from online sources in two phases: using the BootCat tool and manual searches of newspaper websites. The Hierarchical Dirichlet Process (HDP) topic model was then applied to identify dominant topics within the corpus. These topics were subsequently classified into thematic categories, and the texts were categorized accordingly. The findings indicate that environmental issues related to pineapple cultivation are underrepresented on the Internet, particularly in comparison to the extensive focus on topics related to pineapple production and marketing.
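
For readers unfamiliar with HDP, here is a minimal sketch of the kind of nonparametric topic modeling described above, using gensim's HdpModel; the toy documents and preprocessing are illustrative assumptions, not the study's corpus.

    from gensim.corpora import Dictionary
    from gensim.models import HdpModel

    # `texts` is an assumed placeholder: tokenized documents from a collected corpus.
    texts = [["pineapple", "export", "marketing", "costa", "rica"],
             ["plantation", "monoculture", "environmental", "impact"],
             ["pinkglow", "variety", "cultivation", "export"]]

    dictionary = Dictionary(texts)                       # token -> id mapping
    corpus = [dictionary.doc2bow(doc) for doc in texts]  # bag-of-words vectors

    hdp = HdpModel(corpus=corpus, id2word=dictionary)    # nonparametric: infers topic count
    for topic_id, words in hdp.print_topics(num_topics=5, num_words=5):
        print(topic_id, words)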

pdf bib
Entity Linking using LLMs for Automated Product Carbon Footprint Estimation
Steffen Castle | Julian Moreno Schneider

Growing concerns about climate change and sustainability are driving manufacturers to take significant steps toward reducing their carbon footprints. For these manufacturers, a first step towards this goal is to identify the environmental impact of the individual components of their products. We propose a system leveraging large language models (LLMs) to automatically map components from manufacturer Bills of Materials (BOMs) to Life Cycle Assessment (LCA) database entries by using LLMs to expand on available component information. Our approach reduces the need for manual data processing, paving the way for more accessible sustainability practices.

pdf bib
Quantification of Biodiversity from Historical Survey Text with LLM-based Best-Worst-Scaling
Thomas Haider | Tobias Perschl | Malte Rehbein

In this study, we evaluate methods to determine the frequency of species via quantity estimation from historical survey text. To that end, we formulate classification tasks and finally show that this problem can be adequately framed as a regression task using Best-Worst Scaling (BWS) with Large Language Models (LLMs). We test Ministral-8B, DeepSeek-V3, and GPT-4, finding that the latter two have reasonable agreement with humans and each other. We conclude that this approach is more cost-effective and similarly robust compared to a fine-grained multi-class approach, allowing automated quantity estimation across species.
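
As a brief illustration of the Best-Worst Scaling idea used above: each item appears in several k-tuples, a judge (here, hypothetically an LLM) picks the best and worst item per tuple, and items are scored by how often they win versus lose. The judging function below is a placeholder, not the paper's prompt.

    from collections import Counter

    def llm_pick_best_worst(items):
        """Hypothetical LLM call returning (best, worst) for a tuple of species
        mentions; 'best' here would mean 'described as most frequent'."""
        raise NotImplementedError  # prompt an LLM (e.g., GPT-4) in practice

    def bws_scores(judgments):
        """judgments: list of (tuple_of_items, best_item, worst_item)."""
        best, worst, seen = Counter(), Counter(), Counter()
        for tup, b, w in judgments:
            best[b] += 1
            worst[w] += 1
            for item in tup:
                seen[item] += 1
        # Standard BWS score in [-1, 1]: (#best - #worst) / #appearances.
        return {item: (best[item] - worst[item]) / seen[item] for item in seen}

    judgments = [(("heron", "stork", "crane", "ibis"), "heron", "ibis"),
                 (("heron", "crane", "ibis", "stork"), "crane", "stork")]
    print(bws_scores(judgments))  # e.g., heron 0.5, ibis -0.5, ...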

pdf bib
Analyzing the Online Communication of Environmental Movement Organizations: NLP Approaches to Topics, Sentiment, and Emotions
Christina Barz | Melanie Siegel | Daniel Hanss

This project employs state-of-the-art Natural Language Processing (NLP) techniques to analyze the online communication of international Environmental Movement Organizations (EMOs). First, we introduce our overall EMO dataset and describe it through topic modeling. Second, we evaluate current sentiment and emotion classification models for our specific dataset. Third, as we are currently in our annotation process, we evaluate our current progress and issues to determine the most effective approach for creating a high-quality annotated dataset that captures the nuances of EMO communication. Finally, we emphasize the need for domain-specific datasets and tailored NLP tools and suggest refinements for our annotation process moving forward.

pdf bib
No AI on a Dead Planet: Sentiment and Emotion Analysis Across Reddit Communities on AI and the Environment
Arianna Longo | Alessandro Y. Longo

This paper investigates how different online communities perceive and discuss the environmental impact of AI through sentiment analysis and emotion detection. We analyze Reddit discussions from r/artificial and r/climatechange, using pre-trained models fine-tuned on social media data. Our analysis reveals distinct patterns in how these communities engage with AI’s environmental implications: the AI community demonstrates a shift from predominantly neutral and positive sentiment in posts to more balanced perspectives in comments, while the climate community maintains a more critical stance throughout discussions. The findings contribute to our understanding of how different communities conceptualize and respond to the environmental challenges of AI development.

pdf bib
Towards Addressing Anthropocentric Bias in Large Language Models
Francesca Grasso | Stefano Locci | Luigi Di Caro

The widespread use of Large Language Models (LLMs), particularly among non-expert users, has raised ethical concerns about the propagation of harmful biases. While much research has addressed social biases, few works, if any, have examined anthropocentric bias in Natural Language Processing (NLP) technology. Anthropocentric language prioritizes human value, framing non-human animals, living entities, and natural elements solely by their utility to humans; a perspective that contributes to the ecological crisis. In this paper, we evaluate anthropocentric bias in OpenAI’s GPT-4o across various target entities, including sentient beings, non-sentient entities, and natural elements. Using prompts eliciting neutral, anthropocentric, and ecocentric perspectives, we analyze the model’s outputs and introduce a manually curated glossary of 424 anthropocentric terms as a resource for future ecocritical research. Our findings reveal a strong anthropocentric bias in the model’s responses, underscoring the need to address human-centered language use in AI-generated text to promote ecological well-being.

pdf bib
Efficient Scientific Full Text Classification: The Case of EICAT Impact Assessments
Marc Felix Brinner | Sina Zarrieß

This study explores strategies for efficiently classifying scientific full texts using both small, BERT-based models and local large language models like Llama-3.1 8B. We focus on developing methods for selecting subsets of input sentences to reduce input size while simultaneously enhancing classification performance. To this end, we compile a novel dataset consisting of full-text scientific papers from the field of invasion biology, specifically addressing the impacts of invasive species. These papers are aligned with publicly available impact assessments created by researchers for the International Union for Conservation of Nature (IUCN). Through extensive experimentation, we demonstrate that various sources, such as human evidence annotations, LLM-generated annotations, or explainability scores, can be used to train sentence selection models that improve the performance of both encoder- and decoder-based language models while optimizing efficiency through the reduction in input length, leading to improved results even when compared to models like ModernBERT that are able to handle the complete text as input. Additionally, we find that repeated sampling of shorter inputs proves to be a very effective strategy that, at a slightly increased cost, can further improve classification performance.

pdf bib
The Accuracy, Robustness, and Readability of LLM-Generated Sustainability-Related Word Definitions
Alice Heiman

A common language with shared standard definitions is essential for effective climate conversations. However, there is concern that LLMs may misrepresent and/or diversify climate-related terms. We compare 305 official IPCC glossary definitions with those generated by OpenAI’s GPT-4o-mini and investigate their adherence, robustness, and readability using a combination of SBERT sentence embeddings and statistical measures. The LLM definitions received average adherence and robustness scores of 0.58 ± 0.15 and 0.96 ± 0.02, respectively. Both the official and the model-generated sustainability terminologies remain challenging to read, with model-generated definitions varying mainly among words with multiple or ambiguous definitions. Thus, the results highlight the potential of LLMs to support environmental discourse while emphasizing the need to align model outputs with established terminology for clarity and consistency.
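
A minimal sketch of the SBERT-based comparison described above, scoring how closely a generated definition adheres to an official one via cosine similarity; the model checkpoint and example strings are assumptions, not the paper's exact setup.

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

    official = "Mitigation: a human intervention to reduce emissions or enhance sinks."
    generated = "Mitigation means actions taken to cut greenhouse gas emissions."

    emb = model.encode([official, generated], convert_to_tensor=True)
    adherence = util.cos_sim(emb[0], emb[1]).item()  # higher = closer to official wording
    print(f"adherence ~ {adherence:.2f}")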

up

pdf (full)
bib (full)
Proceedings of the Sixth Workshop on Privacy in Natural Language Processing

pdf bib
Proceedings of the Sixth Workshop on Privacy in Natural Language Processing
Ivan Habernal | Sepideh Ghanavati | Vijayanta Jain | Timour Igamberdiev | Shomir Wilson

pdf bib
TUNI: A Textual Unimodal Detector for Identity Inference in CLIP Models
Songze Li | Ruoxi Cheng | Xiaojun Jia

The widespread usage of large-scale multimodal models like CLIP has heightened concerns about the leakage of personally identifiable information (PII). Existing methods for identity inference in CLIP models require querying the model with full PII, including textual descriptions of the person and corresponding images (e.g., the name and the face photo of the person). However, applying images may risk exposing personal information to target models, as the image might not have been previously encountered by the target model. Additionally, previous membership inference attacks (MIAs) train shadow models to mimic the behaviors of the target model, which incurs high computational costs, especially for large CLIP models. To address these challenges, we propose a textual unimodal detector (TUNI) for CLIP models, a novel technique for identity inference that: 1) only utilizes text data to query the target model; and 2) eliminates the need for training shadow models. Extensive experiments of TUNI across various CLIP model architectures and datasets demonstrate its superior performance over baselines, albeit with only text data.

pdf bib
TAROT: Task-Oriented Authorship Obfuscation Using Policy Optimization Methods
Gabriel Loiseau | Damien Sileo | Damien Riquet | Maxime Meyer | Marc Tommasi

Authorship obfuscation aims to disguise the identity of an author within a text by altering the writing style, vocabulary, syntax, and other linguistic features associated with the text author. This alteration needs to balance privacy and utility. While strong obfuscation techniques can effectively hide the author’s identity, they often degrade the quality and usefulness of the text for its intended purpose. Conversely, maintaining high utility tends to provide insufficient privacy, making it easier for an adversary to de-anonymize the author. Thus, achieving an optimal trade-off between these two conflicting objectives is crucial. In this paper, we propose **TAROT**: **T**ask-Oriented **A**utho**r**ship **O**bfuscation Using Policy Op**t**imization, a new unsupervised authorship obfuscation method whose goal is to optimize the privacy-utility trade-off by regenerating the entire text considering its downstream utility. Our approach leverages policy optimization as a fine-tuning paradigm over small language models in order to rewrite texts by preserving author identity and downstream task utility. We show that our approach largely reduces the accuracy of attackers while preserving utility. We make our code and models publicly available.

pdf bib
Balancing Privacy and Utility in Personal LLM Writing Tasks: An Automated Pipeline for Evaluating Anonymizations
Stefan Pasch | Min Chul Cha

Large language models (LLMs) are widely used for personalized tasks involving sensitive information, raising privacy concerns. While anonymization techniques exist, their impact on response quality remains underexplored. This paper introduces a fully automated evaluation framework to assess anonymization strategies in LLM-generated responses. We generate synthetic prompts for three personal tasks—personal introductions, cover letters, and email writing—and apply anonymization techniques that preserve fluency while enabling entity backmapping. We test three anonymization strategies: simple masking, adding context to masked entities, and pseudonymization. Results show minimal response quality loss (roughly 1 point on a 10-point scale) while achieving 97%-99% entity masking. Responses generated with Llama 3.3:70b perform best with simple entity masking, while GPT-4o benefits from contextual cues. This study provides a framework and empirical insights into balancing privacy protection and response quality in LLM applications.
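
A small sketch of the "simple masking" strategy with entity backmapping, under the assumption that entity spans come from an upstream NER step; the helper names and example are hypothetical.

    def mask_entities(text, entities):
        """Replace each entity with a typed placeholder and keep a backmap."""
        backmap = {}
        for i, (surface, etype) in enumerate(entities):
            placeholder = f"[{etype}_{i}]"
            text = text.replace(surface, placeholder)
            backmap[placeholder] = surface
        return text, backmap

    def unmask(text, backmap):
        """Restore original entities in the LLM's response."""
        for placeholder, surface in backmap.items():
            text = text.replace(placeholder, surface)
        return text

    masked, backmap = mask_entities(
        "I am Jane Doe, applying to Acme Corp.",
        [("Jane Doe", "PERSON"), ("Acme Corp", "ORG")],
    )
    # Send `masked` to the LLM, then backmap entities in its response;
    # the .replace() below merely simulates the model editing the text.
    response = unmask(masked.replace("applying to", "writing to"), backmap)
    print(response)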

pdf bib
Named Entity Inference Attacks on Clinical LLMs: Exploring Privacy Risks and the Impact of Mitigation Strategies
Adam Sutton | Xi Bai | Kawsar Noor | Thomas Searle | Richard Dobson

Transformer-based Large Language Models (LLMs) have achieved remarkable success across various domains, including clinical language processing, where they enable state-of-the-art performance in numerous tasks. Like all deep learning models, LLMs are susceptible to inference attacks that exploit sensitive attributes seen during training. AnonCAT, a RoBERTa-based masked language model, has been fine-tuned to de-identify sensitive clinical textual data. The community has a responsibility to explore the privacy risks of these models. This work proposes an attack method to infer sensitive named entities used in the training of AnonCAT models. We perform three experiments: the privacy implications of generating multiple names, the impact of white-box and black-box access on attack inference performance, and the privacy-enhancing effects of Differential Privacy (DP) when applied to AnonCAT. By providing real textual predictions and privacy leakage metrics, this research contributes to understanding and mitigating the potential risks associated with exposing LLMs in sensitive domains like healthcare.

pdf bib
Inspecting the Representation Manifold of Differentially-Private Text
Stefan Arnold

Differential Privacy (DP) for text has recently taken the form of text paraphrasing using language models and temperature sampling to better balance privacy and utility. However, the geometric distortion of DP regarding the structure and complexity of the representation space remains unexplored. By estimating the intrinsic dimension of paraphrased text across varying privacy budgets, we find that word-level methods severely raise the intrinsic dimension of the representation manifold, while sentence-level methods produce paraphrases whose manifolds are topologically more consistent with human-written paraphrases. Among sentence-level methods, masked paraphrasing, compared to causal paraphrasing, demonstrates superior preservation of structural complexity, suggesting that autoregressive generation propagates distortions from unnatural word choices that cascade and inflate the representation space.

pdf bib
Beyond Reconstruction: Generating Privacy-Preserving Clinical Letters
Libo Ren | Samuel Belkadi | Lifeng Han | Warren Del-Pinto | Goran Nenadic

Due to the sensitive nature of clinical letters, their use in model training, medical research, and education is limited. This work aims to generate diverse, de-identified, and high-quality synthetic clinical letters to enhance privacy protection. This study explores various pre-trained language models (PLMs) for text masking and generation, employing various masking strategies with a focus on Bio_ClinicalBERT. Both qualitative and quantitative methods are used for evaluation, supplemented by a downstream Named Entity Recognition (NER) task. Our results indicate that encoder-only models outperform encoder-decoder models. General-domain and clinical-domain PLMs exhibit comparable performance when clinical information is preserved. Preserving clinical entities and document structure yields better performance than fine-tuning alone. Masking stopwords enhances text quality, whereas masking nouns or verbs has a negative impact. BERTScore proves to be the most reliable quantitative evaluation metric in our task. Contextual information has minimal impact, indicating that synthetic letters can effectively replace original ones in downstream tasks. Unlike previous studies that focus primarily on reconstructing original letters or training a privacy-detection and substitution model, this project provides a framework for generating diverse clinical letters while embedding privacy detection, enabling sensitive dataset expansion and facilitating the use of real-world clinical data. Our codes and trained models will be publicly available at https://github.com/HECTA-UoM/Synthetic4Health.

pdf bib
Beyond De-Identification: A Structured Approach for Defining and Detecting Indirect Identifiers in Medical Texts
Ibrahim Baroud | Lisa Raithel | Sebastian Möller | Roland Roller

Sharing sensitive texts for scientific purposes requires appropriate techniques to protect the privacy of patients and healthcare personnel. Anonymizing textual data is particularly challenging due to the presence of diverse unstructured direct and indirect identifiers. To mitigate the risk of re-identification, this work introduces a schema of nine categories of indirect identifiers designed to account for different potential adversaries, including acquaintances, family members and medical staff. Using this schema, we annotate 100 MIMIC-III discharge summaries and propose baseline models for identifying indirect identifiers. We will release the annotation guidelines, annotation spans (6,199 annotations in total) and the corresponding MIMIC-III document IDs to support further research in this area.

pdf bib
Investigating User Perspectives on Differentially Private Text Privatization
Stephen Meisenbacher | Alexandra Klymenko | Alexander Karpp | Florian Matthes

Recent literature has seen a considerable uptick in *Differentially Private Natural Language Processing* (DP NLP). This includes DP text privatization, where potentially sensitive input texts are transformed under DP to achieve privatized output texts that ideally mask sensitive information *and* maintain original semantics. Despite continued work to address the open challenges in DP text privatization, there remains a scarcity of work addressing user perceptions of this technology, a crucial aspect which serves as the final barrier to practical adoption. In this work, we conduct a survey study with 721 laypersons around the globe, investigating how the factors of *scenario*, *data sensitivity*, *mechanism type*, and *reason for data collection* impact user preferences for text privatization. We learn that while all these factors play a role in influencing privacy decisions, users are highly sensitive to the utility and coherence of the private output texts. Our findings highlight the socio-technical factors that must be considered in the study of DP NLP, opening the door to further user-based investigations going forward.

up

pdf (full)
bib (full)
Proceedings of the Queer in AI Workshop

pdf bib
Proceedings of the Queer in AI Workshop
A Pranav | Alissa Valentine | Shaily Bhatt | Yanan Long | Arjun Subramonian | Amanda Bertsch | Anne Lauscher | Ankush Gupta

pdf bib
Studying the Representation of the LGBTQ+ Community in RuPaul’s Drag Race with LLM-Based Topic Modeling
Mika Hämäläinen

This study investigates the representation of the LGBTQ+ community in the widely acclaimed reality television series RuPaul’s Drag Race through a novel application of large language model (LLM)-based topic modeling. By analyzing subtitles from seasons 1 to 16, the research identifies a spectrum of topics ranging from empowering themes, such as self-expression through drag, community support, and positive body image, to challenges faced by the LGBTQ+ community, including homophobia, HIV, and mental health. Employing an LLM allowed for nuanced exploration of these themes, overcoming the limitations of traditional word-based topic modeling.

pdf bib
Guardrails, not Guidance: Understanding Responses to LGBTQ+ Language in Large Language Models
Joshua Tint

Language models have integrated themselves into many aspects of digital life, shaping everything from social media to translation. This paper investigates how large language models (LLMs) respond to LGBTQ+ slang and heteronormative language. Through two experiments, the study assesses the emotional content and the impact of queer slang on responses from models including GPT-3.5, GPT-4o, Llama2, Llama3, Gemma and Mistral. The findings reveal that heteronormative prompts can trigger safety mechanisms, leading to neutral or corrective responses, while LGBTQ+ slang elicits more negative emotions. These insights underscore the need to provide equitable outcomes for minority slangs and argots, in addition to eliminating explicit bigotry from language models.

pdf bib
Dehumanization of LGBTQ+ Groups in Sexual Interactions with ChatGPT
Alexandria Leto | Juan Vásquez | Alexis Palmer | Maria Leonor Pacheco

Given the widespread use of LLM-powered conversational agents such as ChatGPT, analyzing the ways people interact with them could provide valuable insights into human behavior. Prior work has shown that these agents are sometimes used in sexual contexts, such as to obtain advice, to role-play as sexual companions, or to generate erotica. While LGBTQ+ acceptance has increased in recent years, dehumanizing practices against minorities continue to prevail. In this paper, we home in on this and perform an analysis of dehumanizing tendencies toward LGBTQ+ individuals by human users in their sexual interactions with ChatGPT. Through a series of experiments that model various concept vectors associated with distinct shades of dehumanization, we find evidence of the reproduction of harmful stereotypes. However, many user prompts lack indications of dehumanization, suggesting that the use of these agents is a complex and nuanced issue which warrants further investigation.

pdf bib
Leveraging Large Language Models in Detecting Anti-LGBTQIA+ User-generated Texts
Quoc-Toan Nguyen | Josh Nguyen | Tuan Pham | William John Teahan

Anti-LGBTQIA+ texts in user-generated content pose significant risks to online safety and inclusivity. This study investigates the capabilities and limitations of five widely adopted Large Language Models (LLMs)—DeepSeek-V3, GPT-4o, GPT-4o-mini, GPT-o1-mini, and Llama3.3-70B—in detecting such harmful content. Our findings reveal that while LLMs demonstrate potential in identifying offensive language, their effectiveness varies across models and metrics, with notable shortcomings in calibration. Furthermore, linguistic analysis exposes deeply embedded patterns of discrimination, reinforcing the urgency for improved detection mechanisms for this marginalised population. In summary, this study demonstrates the significant potential of LLMs for practical application in detecting anti-LGBTQIA+ user-generated texts and provides valuable insights from text analysis that can inform topic modelling. These findings contribute to developing safer digital platforms and enhancing protection for LGBTQIA+ individuals.

pdf bib
A Bayesian account of pronoun and neopronoun acquisition
Cassandra L Jacobs | Morgan Grobol

A major challenge to equity among members of queer communities is the use of one’s chosen forms of reference, such as personal names or pronouns. Speakers often dismiss errors in pronominal use as unintentional, and claim that their errors reflect many decades of fossilized mainstream language use, including attitudes or expectations about the relationship between one’s appearance and acceptable forms of reference. Here, we propose a modeling framework that allows language use and speech communities to change over time, including the adoption of neopronouns and other forms for self-reference. We present a probabilistic graphical modeling approach to pronominal reference that is flexible in the face of change and experience while also moving beyond form-to-meaning mappings. The model critically also does not rely on lexical covariance structure to learn referring expressions. We show that such a model can account for individual differences in how quickly pronouns or names are integrated into symbolic knowledge and can empower computational systems to be both flexible and respectful of queer people with diverse gender expression.

up

pdf (full)
bib (full)
Proceedings of the 1st Regulatory NLP Workshop (RegNLP 2025)

pdf bib
Proceedings of the 1st Regulatory NLP Workshop (RegNLP 2025)
Tuba Gokhan | Kexin Wang | Iryna Gurevych | Ted Briscoe

pdf bib
Shared Task RIRAG-2025: Regulatory Information Retrieval and Answer Generation
Tuba Gokhan | Kexin Wang | Iryna Gurevych | Ted Briscoe

This paper provides an overview of the Shared Task RIRAG-2025, which focused on advancing the field of Regulatory Information Retrieval and Answer Generation (RIRAG). The task was designed to evaluate methods for answering regulatory questions using the ObliQA dataset. This paper summarizes the shared task, participants’ methods, and the results achieved by various teams.

pdf bib
Challenges in Technical Regulatory Text Variation Detection
Shriya Vaagdevi Chikati | Samuel Larkin | David Minicola | Chi-kiu Lo

We present a preliminary study on the feasibility of using current natural language processing techniques to detect variations between the construction codes of different jurisdictions. We formulate the task as a sentence alignment problem and evaluate various sentence representation models for their performance in this task. Our results show that task-specific trained embeddings perform marginally better than other models, but the overall accuracy remains a challenge. We also show that domain-specific fine-tuning hurts the task performance. The results highlight the challenges of developing NLP applications for technical regulatory texts.

pdf bib
Bilingual BSARD: Extending Statutory Article Retrieval to Dutch
Ehsan Lotfi | Nikolay Banar | Nerses Yuzbashyan | Walter Daelemans

Statutory article retrieval plays a crucial role in making legal information more accessible to both laypeople and legal professionals. Multilingual countries like Belgium present unique challenges for retrieval models due to the need for handling legal issues in multiple languages. Building on the Belgian Statutory Article Retrieval Dataset (BSARD) in French, we introduce the bilingual version of this dataset, bBSARD. The dataset contains parallel Belgian statutory articles in both French and Dutch, along with legal questions from BSARD and their Dutch translation. Using bBSARD, we conduct extensive benchmarking of retrieval models available for Dutch and French. Our benchmarking setup includes lexical models, zero-shot dense models, and fine-tuned small foundation models. Our experiments show that BM25 remains a competitive baseline compared to many zero-shot dense models in both languages. We also observe that while proprietary models outperform open alternatives in the zero-shot setting, they can be matched or surpassed by fine-tuning small language-specific models. Our dataset and evaluation code are publicly available.
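
As a concrete picture of the BM25 baseline mentioned above, here is a minimal retrieval sketch with the rank_bm25 library; the toy articles and whitespace tokenization are simplifying assumptions (a real Dutch/French setup would use proper tokenizers).

    from rank_bm25 import BM25Okapi

    articles = [
        "Art. 1. Every person is entitled to legal assistance.",
        "Art. 2. The lease may be terminated with three months notice.",
    ]
    tokenized = [a.lower().split() for a in articles]  # naive tokenization
    bm25 = BM25Okapi(tokenized)

    query = "how much notice to end a lease".lower().split()
    scores = bm25.get_scores(query)             # one relevance score per article
    ranked = sorted(zip(scores, articles), reverse=True)
    print(ranked[0][1])                         # best-matching statutory article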

pdf bib
Unifying Large Language Models and Knowledge Graphs for efficient Regulatory Information Retrieval and Answer Generation
Kishore Vanapalli | Aravind Kilaru | Omair Shafiq | Shahzad Khan

In a rapidly changing socio-economic landscape, regulatory documents play a pivotal role in shaping responses to emerging challenges. An efficient regulatory document monitoring system is crucial for addressing the complexities of a dynamically evolving world, enabling prompt crisis response, simplifying compliance, and empowering data-driven decision-making. In this work, we present a novel comprehensive analytical framework, PolicyInsight, which is based on a specialized regulatory data model and state-of-the-art NLP techniques of Large Language Models (LLMs) and Knowledge Graphs to derive timely insights, facilitating data-driven decision-making and fostering a more transparent and informed governance ecosystem for regulators, businesses, and citizens.

pdf bib
A Hybrid Approach to Information Retrieval and Answer Generation for Regulatory Texts
Jhon Stewar Rayo Mosquera | Carlos Raul De La Rosa Peredo | Mario Garrido Cordoba

pdf bib
1-800-SHARED-TASKS at RegNLP: Lexical Reranking of Semantic Retrieval (LeSeR) for Regulatory Question Answering
Jebish Purbey | Drishti Sharma | Siddhant Gupta | Khawaja Murad | Siddartha Pullakhandam | Ram Mohan Rao Kadiyala

This paper presents the system description of our entry for the COLING 2025 RegNLP RIRAG (Regulatory Information Retrieval and Answer Generation) challenge, focusing on leveraging advanced information retrieval and answer generation techniques in regulatory domains. We experimented with a combination of embedding models, including Stella, BGE, CDE, and Mpnet, and leveraged fine-tuning and reranking for retrieving relevant documents in top ranks. We utilized a novel approach, LeSeR, which achieved competitive results with a recall@10 of 0.8201 and map@10 of 0.6655 for retrieval. This work highlights the transformative potential of natural language processing techniques in regulatory applications, offering insights into their capabilities for implementing a retrieval augmented generation system while identifying areas for future improvement in robustness and domain adaptation.

pdf bib
MST-R: Multi-Stage Tuning for Retrieval Systems and Metric Evaluation
Yash Malviya | Karan Dhingra | Maneesh Singh

Regulatory documents are rich in nuanced terminology and specialized semantics. FRAG systems (frozen retrieval-augmented generators), which utilize pre-trained, frozen components, consequently face challenges with both retrieval and answering performance. We present a system that adapts the retriever performance to the target domain using a multi-stage tuning (MST) strategy. Our retrieval approach, called MST-R, (a) first fine-tunes the encoders used in vector stores using hard negative mining, (b) then uses a hybrid retriever, combining sparse and dense retrievers via reciprocal rank fusion, and (c) finally adapts the cross-attention encoder by fine-tuning it only on the top-k retrieved results. We benchmark the system performance on the dataset released for the RIRAG challenge (as part of the RegNLP workshop at COLING 2025). We achieve significant performance gains, obtaining a top rank on the RegNLP challenge leaderboard. We also show that a trivial answering approach *games* the RePASs metric, outscoring all baselines and a pre-trained Llama model. Analyzing this anomaly, we present important takeaways for future research. We also release our [code base](https://github.com/Indic-aiDias/MST-R).
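
Step (b) of MST-R relies on reciprocal rank fusion, which can be sketched in a few lines: each retriever contributes 1/(k + rank) per document, and documents are re-ordered by the summed score. The doc IDs below are illustrative; k=60 is a common default, not necessarily the paper's value.

    def reciprocal_rank_fusion(rankings, k=60):
        """rankings: list of ranked doc-id lists, one per retriever (best first)."""
        scores = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    sparse = ["d3", "d1", "d7"]   # e.g., BM25 order
    dense = ["d1", "d9", "d3"]    # e.g., embedding order
    print(reciprocal_rank_fusion([sparse, dense]))  # fused order: d1 and d3 first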

pdf bib
AUEB-Archimedes at RIRAG-2025: Is Obligation concatenation really all you need?
Ioannis Chasandras | Odysseas S. Chlapanis | Ion Androutsopoulos

This paper presents the systems we developed for RIRAG-2025, a shared task that requires answering regulatory questions by retrieving relevant passages. The generated answers are evaluated using RePASs, a reference-free and model-based metric. Our systems use a combination of three retrieval models and a reranker. We show that by exploiting a neural component of RePASs that extracts important sentences (‘obligations’) from the retrieved passages, we achieve a dubiously high score (0.947), even though the answers are directly extracted from the retrieved passages and are not actually generated answers. We then show that by selecting the answer with the best RePASs among a few generated alternatives and then iteratively refining this answer by reducing contradictions and covering more obligations, we can generate readable, coherent answers that achieve a more plausible and relatively high score (0.639).

pdf bib
Structured Tender Entities Extraction from Complex Tables with Few-shot Learning
Asim Abbas | Mark Lee | Niloofer Shanavas | Venelin Kovatchev | Mubashir Ali

Extracting structured text from complex tables in PDF tender documents remains a challenging task due to the loss of structural and positional information during the extraction process. AI-based models often require extensive training data, making development from scratch both tedious and time-consuming. Our research focuses on identifying tender entities in complex table formats within PDF documents. To address this, we propose a novel approach utilizing few-shot learning with large language models (LLMs) to restore the structure of extracted text. Additionally, handcrafted rules and regular expressions are employed for precise entity classification. To evaluate the robustness of LLMs with few-shot learning, we employ data-shuffling techniques. Our experiments show that current text extraction tools fail to deliver satisfactory results for complex table structures. However, the few-shot learning approach significantly enhances the structural integrity of extracted data and improves the accuracy of tender entity identification.

pdf bib
A Two-Stage LLM System for Enhanced Regulatory Information Retrieval and Answer Generation
Fengzhao Sun | Jun Yu | Jiaming Hou | Yutong Lin | Tianyu Liu

This technical report describes our methodology for the Regulatory Information Retrieval and Answer Generation (RIRAG) Shared Task, a component of the RegNLP workshop at COLING 2025. The challenge aims to effectively navigate and extract relevant information from regulatory texts to generate precise, coherent answers for compliance and obligation-related queries. To tackle Subtask 1, we introduce a two-stage approach comprising an initial output stage and a subsequent refinement stage. Initially, we fine-tune the LLaMa-2-7B model using LoRA to produce a preliminary output. This is followed by the application of an expert mechanism to enhance the results. For Subtask 2, we design specific prompts to facilitate the generation of high-quality answers. Consequently, our approach has achieved state-of-the-art performance on the leaderboard, which serves as a testament to the effectiveness and competitiveness of our proposed methodology.

pdf bib
NUST Nova at RIRAG 2025: A Hybrid Framework for Regulatory Information Retrieval and Question Answering
Mariam Babar Khan | Huma Ameer | Seemab Latif | Mehwish Fatima

NUST Nova participates in the RIRAG Shared Task, addressing two critical challenges: Task 1 involves retrieving relevant subsections from regulatory documents based on user queries, while Task 2 focuses on generating concise, contextually accurate answers using the retrieved information. We propose a Hybrid Retrieval Framework that combines graph-based retrieval, vector-based methods, and BM25 keyword matching to enhance relevance and precision in regulatory QA. Using score-based fusion and iterative refinement, the framework retrieves the top 10 relevant passages, which are then used by an LLM to generate accurate, context-aware answers. After empirical evaluation, we also conduct an error analysis to identify our framework’s limitations.

pdf bib
NUST Alpha at RIRAG 2025: Fusion RAG for Bridging Lexical and Semantic Retrieval and Question Answering
Muhammad Rouhan Faisal | Muhammad Abdullah | Faizyaab Ali Shah | Shalina Riaz | Huma Ameer | Seemab Latif | Mehwish Fatima

NUST Alpha participates in the Regulatory Information Retrieval and Answer Generation (RIRAG) shared task. We propose FusionRAG, which combines OpenAI embeddings, BM25, FAISS, and rank fusion to improve information retrieval and answer generation. We also explore multiple variants of our model to assess the impact of each component on overall performance. FusionRAG's strength comes from our rank-fusion and filtering strategy: rank fusion integrates semantic and lexical relevance scores to optimize retrieval accuracy and result diversity, and the filter mechanism removes irrelevant passages before answer generation. Our experiments demonstrate that FusionRAG offers a robust and scalable solution for automating the analysis of regulatory documents, improving compliance efficiency, and mitigating associated risks. We further conduct an error analysis to explore the limitations of our model’s performance.

pdf bib
NUST Omega at RIRAG 2025: Investigating Context-aware Retrieval and Answer Generations-Lessons and Challenges
Huma Ameer | Muhammad Hannan Akram | Seemab Latif | Mehwish Fatima

NUST Omega participates in the Regulatory Information Retrieval and Answer Generation (RIRAG) Shared Task. Regulatory documents pose unique challenges in retrieving and generating precise and relevant answers due to their inherent complexities. We explore the task by proposing a progressive retrieval pipeline and investigate its performance with multiple variants. Some variants use different embeddings to explore their effects on the retrieval score; others examine the inclusion of a keyword-driven query matching technique. After exploring such variations, we include topic modeling in our pipeline to investigate its impact on performance. We also study the performance of various prompting techniques with our proposed pipeline. Through empirical experiments, we find some strengths and limitations of the proposed pipeline. These findings will help the research community by offering valuable insights for making advances on this complex task.

pdf bib
Enhancing Regulatory Compliance Through Automated Retrieval, Reranking, and Answer Generation
Kübranur Umar | Hakan Doğan | Onur Özcan | İsmail Karakaya | Alper Karamanlıoğlu | Berkan Demirel

This paper describes a Retrieval-Augmented Generation (RAG) pipeline that optimizes regulatory compliance using a combination of embedding models (i.e., bge-m3, jina-embeddings-v3, e5-large-v2) with a reranker (i.e., bge-reranker-v2-m3). To efficiently process long context passages, we introduce a context-aware chunking method. By using the RePASS metric, we ensure comprehensive coverage of obligations and minimize contradictions, thereby setting a new benchmark for RAG-based regulatory compliance systems. The experimental results show that our best configuration achieves a score of 0.79 in Recall@10 and 0.66 in MAP@10 with the LLaMA-3.1-8B model for answer generation.
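
The abstract does not spell out the context-aware chunking procedure; one common reading, sketched below under that assumption, is to pack whole sentences into length-bounded chunks with a one-sentence overlap so no passage is split mid-sentence.

    import re

    def chunk(text, max_chars=1000):
        """Sentence-aware chunking with a one-sentence overlap between chunks."""
        sentences = re.split(r"(?<=[.!?])\s+", text)
        chunks, current = [], []
        for s in sentences:
            if current and sum(len(x) for x in current) + len(s) > max_chars:
                chunks.append(" ".join(current))
                current = current[-1:]   # carry the last sentence forward for context
            current.append(s)
        if current:
            chunks.append(" ".join(current))
        return chunks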

pdf bib
A REGNLP Framework: Developing Retrieval-Augmented Generation for Regulatory Document Analysis
Ozan Bayer | Elif Nehir Ulu | Yasemin Sarkın | Ekrem Sütçü | Defne Buse Çelik | Alper Karamanlıoğlu | İsmail Karakaya | Berkan Demirel

This study presents the development of a Retrieval-Augmented Generation (RAG) framework tailored for analyzing regulatory documents from the Abu Dhabi Global Markets (ADGM). The methodology encompasses comprehensive data preprocessing, including extraction, cleaning, and compression of documents, as well as the organization of the ObliQA dataset. The embedding model is utilized for generating embeddings during the retrieval phase, facilitated by the txtai library for managing embeddings and streamlining testing. The training process incorporated innovative strategies such as duplicate recognition, dropout implementation, pooling adjustments, and label modifications to enhance retrieval performance. Hyperparameter tuning further refined the retrieval component, with improvements validated using the recall@10 metric, which measures the proportion of relevant passages among the top-10 results. The refined retrieval component effectively identifies pertinent passages within regulatory documents, expediting information access and supporting compliance efforts.

pdf bib
Regulatory Question-Answering using Generative AI
Devin Quinn | Sumit P. Pai | Iman Yousfi | Nirmala Pudota | Sanmitra Bhattacharya

Although retrieval augmented generation (RAG) has proven to be an effective approach for creating question-answering systems on a corpus of documents, there is a need to improve the performance of these systems, especially in the regulatory domain where clear and accurate answers are required. This paper outlines the methodology used in our submission to the Regulatory Information Retrieval and Answer Generation (RIRAG) shared task at the Regulatory Natural Language Processing Workshop (RegNLP 2025). The goal is to improve document retrieval (Shared Task 1) and answer generation (Shared Task 2). Our pipeline is constructed as a two-step process for Shared Task 1. In the first step, we utilize a text-embedding-ada-002-based retriever, followed by a RankGPT-based re-ranker. The ranked results of Task 1 are then used to generate responses to user queries in Shared Task 2 through a prompt-based approach using GPT-4o. For Shared Task 1, we achieved a recall rate of 75%, and with the prompts we developed, we were able to generate coherent answers for Shared Task 2.

pdf bib
RIRAG: A Bi-Directional Retrieval-Enhanced Framework for Financial Legal QA in ObliQA Shared Task
Xinyan Zhang | Xiaobing Feng | Xiujuan Xu | Zhiliang Zheng | Kai Wu

In professional financial-legal consulting services, accurately and efficiently retrieving and answering legal questions is crucial. Although some breakthroughs have been made in information retrieval and answer generation, few frameworks have successfully integrated these tasks. Therefore, we propose RIRAG (Retrieval-In-the-loop Response and Answer Generation), a bi-directional retrieval-enhanced framework for financial-legal question answering in the ObliQA Shared Task. The system introduces BDD-FinLegal (Bi-Directional Dynamic Finance-Legal), a novel retrieval mechanism specifically designed for financial-legal documents, combining traditional retrieval algorithms with modern neural network methods. Legal answer generation is implemented through large language models retrained on expert-annotated datasets. Our method significantly improves the professionalism and interpretability of the answers while maintaining high retrieval accuracy. Experiments on the ADGM dataset show that the system achieved a significant improvement in the Recall@10 evaluation metric and was recognized by financial legal experts for the accuracy and professionalism of the answer generation. This study provides new ideas for building efficient and reliable question-answering systems in the financial-legal domain.

pdf bib
RAGulator: Effective RAG for Regulatory Question Answering
Islam Aushev | Egor Kratkov | Evgenii Nikolaev | Andrei Glinskii | Vasilii Krikunov | Alexander Panchenko | Vasily Konovalov | Julia Belikova

Regulatory Natural Language Processing (RegNLP) is a multidisciplinary domain focused on facilitating access to and comprehension of regulations and regulatory requirements. This paper outlines our strategy for creating a system to address the Regulatory Information Retrieval and Answer Generation (RIRAG) challenge, which was conducted during the RegNLP 2025 Workshop. The objective of this competition is to design a system capable of efficiently extracting pertinent passages from regulatory texts (ObliQA) and subsequently generating accurate, cohesive responses to inquiries related to compliance and obligations. Our proposed method employs lightweight BM25 pre-filtering to retrieve relevant passages. This technique efficiently shortlists candidates for subsequent processing with Transformer-based embeddings, thereby optimizing the use of resources.

up

pdf (full)
bib (full)
Proceedings of the 10th Workshop on Representation Learning for NLP (RepL4NLP-2025)

pdf bib
Proceedings of the 10th Workshop on Representation Learning for NLP (RepL4NLP-2025)
Vaibhav Adlakha | Alexandra Chronopoulou | Xiang Lorraine Li | Bodhisattwa Prasad Majumder | Freda Shi | Giorgos Vernikos

pdf bib
DEPTH: Discourse Education through Pre-Training Hierarchically
Zachary Elisha Bamberger | Ofek Glick | Chaim Baskin | Yonatan Belinkov

pdf bib
Tracking Universal Features Through Fine-Tuning and Model Merging
Niels Nielsen Horn | Desmond Elliott

We study how features emerge, disappear, and persist across models fine-tuned on different domains of text. More specifically, we start from a base one-layer Transformer language model that is trained on a combination of the BabyLM corpus and a collection of Python code from The Stack. This base model is adapted to two new domains of text, TinyStories and the Lua programming language, respectively; these two models are then merged using spherical linear interpolation. Our exploration aims to provide deeper insights into the stability and transformation of features across typical transfer-learning scenarios, using small-scale models and sparse auto-encoders.
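
For concreteness, here is a minimal sketch of spherical linear interpolation (slerp) applied per weight tensor, as in the merging step above; applying it tensor-by-tensor across two checkpoints is an assumption about granularity, not necessarily the paper's exact recipe.

    import torch

    def slerp(w0, w1, t, eps=1e-8):
        """Interpolate along the great circle between two flattened weight tensors."""
        v0, v1 = w0.flatten(), w1.flatten()
        v0n, v1n = v0 / (v0.norm() + eps), v1 / (v1.norm() + eps)
        omega = torch.acos(torch.clamp(torch.dot(v0n, v1n), -1.0, 1.0))
        if omega.abs() < eps:                 # near-parallel: fall back to lerp
            return (1 - t) * w0 + t * w1
        so = torch.sin(omega)
        out = (torch.sin((1 - t) * omega) / so) * v0 + (torch.sin(t * omega) / so) * v1
        return out.reshape(w0.shape)

    # Usage: merged[name] = slerp(model_a[name], model_b[name], t=0.5) per parameter.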

pdf bib
Prompt Tuning Can Simply Adapt Large Language Models to Text Encoders
Kaiyan Zhao | Qiyu Wu | Zhongtao Miao | Yoshimasa Tsuruoka

Recently, many works have attempted to adapt Large Language Models (LLMs) for sentence embedding, with most of them fine-tuning LLMs towards the contrastive objective and enabling bi-directional attention for better performance, using LoRA to address the large model scale. In this work, we suggest that this adaptation can also be simply and effectively achieved using causal attention and with even fewer trainable parameters through soft prompt tuning, as an alternative to fine-tuning with LoRA and other methods with extra post-training tasks. Our method only optimizes a few learnable tokens while keeping the rest of the model frozen. Through experiments on a diverse set of evaluation tasks, we find that simply tuning only a few tokens can achieve performance competitive with that of fine-tuning with LoRA. The percentage of trainable parameters can be reduced to less than 0.001%. Moreover, we also demonstrate that turning causal attention into bi-directional attention, with or without extra post-training tasks, does not provide additional benefit when soft prompt tuning is applied, suggesting that causal attention can be naturally used in decoder-only LLMs for sentence embedding adaptation.
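
A compact sketch of the soft prompt tuning setup described above: a handful of trainable virtual-token embeddings are prepended to the input embeddings of a frozen decoder-only LM, and a pooled hidden state serves as the sentence embedding. The gpt2 checkpoint and mean pooling are stand-in assumptions, not the paper's configuration.

    import torch
    from transformers import AutoModel, AutoTokenizer

    model_name = "gpt2"                       # stand-in for a larger decoder-only LLM
    tok = AutoTokenizer.from_pretrained(model_name)
    tok.pad_token = tok.eos_token             # gpt2 has no pad token by default
    model = AutoModel.from_pretrained(model_name)
    for p in model.parameters():
        p.requires_grad = False               # the LM stays frozen

    n_virtual = 8                             # number of learnable "virtual tokens"
    soft_prompt = torch.nn.Parameter(
        torch.randn(n_virtual, model.config.hidden_size) * 0.02)  # only trainable part

    def embed_with_prompt(texts):
        batch = tok(texts, return_tensors="pt", padding=True)
        tok_embeds = model.get_input_embeddings()(batch["input_ids"])
        prompt = soft_prompt.unsqueeze(0).expand(tok_embeds.size(0), -1, -1)
        inputs_embeds = torch.cat([prompt, tok_embeds], dim=1)
        attn = torch.cat(
            [torch.ones(tok_embeds.size(0), n_virtual, dtype=torch.long),
             batch["attention_mask"]], dim=1)
        hidden = model(inputs_embeds=inputs_embeds, attention_mask=attn).last_hidden_state
        mask = attn.unsqueeze(-1).float()
        return (hidden * mask).sum(1) / mask.sum(1)   # masked mean pooling

    # During contrastive training, the optimizer would receive only [soft_prompt].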

pdf bib
Cross-Modal Learning for Music-to-Music-Video Description Generation
Zhuoyuan Mao | Mengjie Zhao | Qiyu Wu | Zhi Zhong | Wei-Hsiang Liao | Hiromi Wakaki | Yuki Mitsufuji

Music-to-music-video generation is a challenging task due to the intrinsic differences between the music and video modalities. The advent of powerful text-to-video diffusion models has opened a promising pathway for music-video (MV) generation by first addressing the music-to-MV description task and subsequently leveraging these models for video generation. In this study, we focus on the MV description generation task and propose a comprehensive pipeline encompassing training data construction and multimodal model fine-tuning. We fine-tune existing pre-trained multimodal models on our newly constructed music-to-MV description dataset based on the Music4All dataset, which integrates both musical and visual information. Our experimental results demonstrate that music representations can be effectively mapped to textual domains, enabling the generation of meaningful MV descriptions directly from music inputs. We also identify key components in the dataset construction pipeline that critically impact the quality of MV descriptions and highlight specific musical attributes that warrant greater focus for improved MV description generation.

pdf bib
A Comparative Study of Learning Paradigms in Large Language Models via Intrinsic Dimension
Saahith Janapati | Yangfeng Ji

The performance of Large Language Models (LLMs) on natural language tasks can be improved through both supervised fine-tuning (SFT) and in-context learning (ICL), which operate via distinct mechanisms. SFT updates the model’s weights by minimizing loss on training data, whereas ICL leverages task demonstrations embedded in the prompt, without changing the model’s parameters. This study investigates the effects of these learning paradigms on the hidden representations of LLMs using Intrinsic Dimension (ID). We use ID to estimate the number of degrees of freedom of representations extracted from LLMs as they perform specific natural language tasks. We first explore how the ID of LLM representations evolves during SFT and how it varies with the number of demonstrations in ICL. We then compare the IDs induced by SFT and ICL and find that ICL consistently induces a higher ID compared to SFT, suggesting that representations generated during ICL reside in higher dimensional manifolds in the embedding space.
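
The abstract does not name a specific ID estimator; the TwoNN estimator of Facco et al. (2017) is a common choice for this kind of representation analysis, and a minimal version is sketched below for a matrix of hidden representations.

    import numpy as np
    from scipy.spatial import distance_matrix

    def two_nn_id(X):
        # TwoNN intrinsic-dimension estimate for the rows of X (n_points, n_dims).
        D = distance_matrix(X, X)
        np.fill_diagonal(D, np.inf)            # ignore self-distances
        nearest = np.sort(D, axis=1)
        r1, r2 = nearest[:, 0], nearest[:, 1]  # 1st and 2nd neighbour distances
        mu = r2 / r1                           # duplicate points (r1 == 0) would break this
        return len(mu) / np.sum(np.log(mu))    # maximum-likelihood estimate of d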

pdf bib
Choose Your Words Wisely: Domain-adaptive Masking Makes Language Models Learn Faster
Vanshpreet S. Kohli | Aaron Monis | Radhika Mamidi

Foundational Language Models perform significantly better on downstream tasks in specialised domains (such as law, computer science, and medical science) upon being further pre-trained on extensive domain-specific corpora, but this continual pre-training incurs heavy computational costs. Indeed, some of the most performant specialised language models such as BioBERT incur even higher computing costs during domain-specific training than the pre-training cost of the foundational models they are initialised from. In this paper, we argue that much of the extended pre-training is redundant, with models seemingly wasting valuable resources re-learning lexical and semantic patterns already well-represented in their foundational models such as BERT, T5 and GPT. Focusing on Masked Language Models, we introduce a novel domain-specific masking strategy that is designed to facilitate continual learning while minimizing the training cost. Using this approach, we present a BERT-based model trained on a biomedical corpus that matches or surpasses traditionally trained biomedical language models in performance across several downstream classification tasks while incurring up to 11 times lower training costs.
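
As a hedged illustration of what a domain-adaptive masking strategy can look like (the paper's exact criterion may differ), the sketch below biases mask selection toward tokens that are much more frequent in the domain corpus than in the general corpus.

    def domain_mask_positions(tokens, domain_freq, general_freq, budget=0.15):
        # Choose ~15% of positions to mask, biased toward tokens that are
        # distinctive of the target domain. The salience score below is an
        # illustrative assumption, not necessarily the paper's exact recipe.
        def salience(tok):
            # Smoothed ratio of domain frequency to general-corpus frequency.
            return (domain_freq.get(tok, 0) + 1) / (general_freq.get(tok, 0) + 1)
        ranked = sorted(range(len(tokens)),
                        key=lambda i: salience(tokens[i]), reverse=True)
        k = max(1, int(budget * len(tokens)))
        return set(ranked[:k])       # indices of tokens to replace with [MASK]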

pdf bib
Efficient Document-level Event Relation Extraction
Ruochen Li | Zimu Wang | Xinya Du

Event Relation Extraction (ERE) predicts temporal and causal relationships between events, playing a crucial role in constructing comprehensive event knowledge graphs. However, existing approaches based on pairwise comparisons often suffer from computational inefficiency, particularly at the document level, due to the quadratic operations required. Additionally, the predominance of unrelated events also leads to largely skewed data distributions. In this paper, we propose an innovative two-stage framework to tackle the challenges, consisting of a retriever to identify the related event pairs and a cross-encoder to classify the relationships between the retrieved pairs. Evaluations across representative benchmarks demonstrate our approach achieves better efficiency and significantly better performance. We also investigate leveraging event coreference chains for ERE and demonstrate their effectiveness.

pdf bib
Investigating Adapters for Parameter-efficient Low-resource Automatic Speech Recognition
Ahnaf Mozib Samin | Shekhar Nayak | Andrea De Marco | Claudia Borg

Recent years have witnessed the adoption of parameter-efficient adapters in pre-trained language models for natural language processing. Yet, their application in speech processing remains less studied. In this work, we explore adapters for low-resource speech recognition, introducing a novel technique, ConvAdapt, into pre-trained speech models. We investigate various aspects such as data requirements, transfer learning within adapters, and scaling of feed-forward layers in adapters. Our findings reveal that bottleneck adapters are competitive with full fine-tuning given at least 10 hours of data, but they are not as effective in few-shot learning scenarios. Notably, ConvAdapt demonstrates improved performance in such cases. In addition, transfer learning in adapters shows promise, necessitating further research on related languages. Furthermore, employing larger speech models for adapter-tuning surpasses fine-tuning with ample data, potentially due to less overfitting than full fine-tuning.
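
The ConvAdapt architecture is not specified in this listing; the following is a plausible reconstruction of a convolutional bottleneck adapter for a frozen speech encoder, with every design detail an assumption.

    import torch.nn as nn

    class ConvAdapter(nn.Module):
        # Bottleneck adapter with a depthwise convolution over time, inserted
        # after a transformer block of a frozen speech model (illustrative only).
        def __init__(self, dim, bottleneck=64, kernel=3):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.conv = nn.Conv1d(bottleneck, bottleneck, kernel,
                                  padding=kernel // 2, groups=bottleneck)
            self.up = nn.Linear(bottleneck, dim)
            self.act = nn.GELU()

        def forward(self, x):                 # x: (batch, time, dim)
            h = self.act(self.down(x))
            h = self.conv(h.transpose(1, 2)).transpose(1, 2)  # convolve over time
            return x + self.up(self.act(h))   # residual connection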

pdf bib
Reverse Probing: Evaluating Knowledge Transfer via Finetuned Task Embeddings for Coreference Resolution
Tatiana Anikina | Arne Binder | David Harbecke | Stalin Varanasi | Leonhard Hennig | Simon Ostermann | Sebastian Möller | Josef Van Genabith

In this work, we reimagine classical probing to evaluate knowledge transfer from simple source to more complex target tasks. Instead of probing frozen representations from a complex source task on diverse simple target probing tasks (as usually done in probing), we explore the effectiveness of embeddings from multiple simple source tasks on a single target task. We select coreference resolution, a linguistically complex problem requiring contextual understanding, as the focus target task, and test the usefulness of embeddings from comparably simpler tasks such as paraphrase detection, named entity recognition, and relation extraction. Through systematic experiments, we evaluate the impact of individual and combined task embeddings. Our findings reveal that task embeddings vary significantly in utility for coreference resolution, with semantic similarity tasks (e.g., paraphrase detection) proving most beneficial. Additionally, representations from intermediate layers of fine-tuned models often outperform those from final layers. Combining embeddings from multiple tasks consistently improves performance, with attention-based aggregation yielding substantial gains. These insights shed light on relationships between task-specific representations and their adaptability to complex downstream tasks, encouraging further exploration of embedding-level task transfer. Our source code is publicly available under https://github.com/Cora4NLP/multi-task-knowledge-transfer.

pdf bib
Punctuation Restoration Improves Structure Understanding without Supervision
Junghyun Min | Minho Lee | Woochul Lee | Yeonsoo Lee

Unsupervised learning objectives like autoregressive and masked language modeling play a significant part in producing pre-trained representations that support various downstream applications, from natural language understanding to conversational tasks. However, despite the impressive generative capabilities of recent large language models, their ability to capture syntactic or semantic structure within text lags behind. We hypothesize that the mismatch between linguistic performance and competence in machines is attributable to insufficient learning of linguistic structure knowledge via currently popular pre-training objectives. Working with English, we show that punctuation restoration as a learning objective improves performance on structure-related tasks like named entity recognition, open information extraction, chunking, and part-of-speech tagging. Punctuation restoration yields a ≥2 percentage-point improvement in 16 out of 18 experiments, across 6 out of 7 tasks. Our results show that punctuation restoration is an effective learning objective that can improve structure understanding and yield more robust structure-aware representations of natural language in base-sized models.
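
To make the objective concrete, here is a minimal sketch, with our own naming, of how punctuation restoration training pairs can be built from raw text: the model receives the stripped string and is trained to reproduce the punctuated original.

    import re

    def make_restoration_pair(text):
        # Source: punctuation-free string; target: the original punctuated text.
        stripped = re.sub(r"[,.;:!?\"'()\-]", "", text)
        stripped = re.sub(r"\s+", " ", stripped).strip()
        return stripped, text

    src, tgt = make_restoration_pair("Hello, world! How are you?")
    # src == "Hello world How are you"; tgt keeps the original punctuation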

pdf bib
Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models
Kaiser Sun | Mark Dredze

Large language model development relies on the pre-train-then-align paradigm, in which the model is typically pre-trained on a large text corpus and undergoes a tuning stage to align the model with human preference or downstream tasks. We investigate the relationship between pre-training and supervised fine-tuning by considering multiple tasks as well as different pre-trained model checkpoints. Our results on 18 datasets and two models suggest that i) although the model benefits significantly through supervised fine-tuning, it may forget previously known domain knowledge and tasks that are not seen during fine-tuning; ii) the model exhibits high sensitivity to evaluation prompts after supervised fine-tuning, but this sensitivity can be alleviated through further pre-training; iii) continual pre-training improves the model in a latent way that manifests after fine-tuning; iv) the model can already solve some tasks after pre-training, while fine-tuning most benefits datasets where the model does not show capability during pre-training.

pdf bib
State Space Models are Strong Text Rerankers
Zhichao Xu | Jinghua Yan | Ashim Gupta | Vivek Srikumar

Transformers dominate NLP and IR, but their inference inefficiencies and challenges in extrapolating to longer contexts have sparked interest in alternative model architectures. Among these, state space models (SSMs) like Mamba offer promising advantages, particularly their favorable time complexity at inference. Despite their potential, SSMs’ effectiveness at text reranking — a task requiring fine-grained query-document interaction and long-context understanding — remains underexplored. This study benchmarks SSM-based architectures (specifically, Mamba-1 and Mamba-2) against transformer-based models across various scales, architectures, and pre-training objectives, focusing on performance and efficiency in text reranking tasks. We find that (1) Mamba architectures achieve competitive text ranking performance, comparable to transformer-based models of similar size; (2) they are less efficient in training and inference compared to transformers with flash attention; and (3) Mamba-2 outperforms Mamba-1 in both performance and efficiency. These results underscore the potential of state space models as a transformer alternative and highlight areas for improvement in future IR applications.

pdf bib
Large Language Models Are Overparameterized Text Encoders
Thennal D K | Tim Fischer | Chris Biemann

Large language models (LLMs) demonstrate strong performance as text embedding models when finetuned with supervised contrastive training. However, their large size balloons inference time and memory requirements. In this paper, we show that by pruning a fraction of the final layers of an LLM before supervised training for only 1000 steps, we can achieve a proportional reduction in memory and inference time. We evaluate four different state-of-the-art LLMs on text embedding tasks and find that our method can prune up to 30% of layers with negligible impact on performance and up to 80% with only a modest drop. With only three lines of code, our method is easily implemented in any pipeline for transforming LLMs to text encoders. We also propose L3Prune, a novel layer-pruning strategy based on the model’s initial loss that provides two optimal pruning configurations: a large variant with negligible performance loss and a small variant for resource-constrained settings. On average, the large variant prunes 21% of the parameters with a negligible performance drop, and the small variant suffers only a modest decrease while pruning 74% of the model. We consider these results strong evidence that LLMs are overparameterized for text embedding tasks and can be easily pruned.
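
In spirit, the pruning step really is only a few lines; the sketch below assumes a LLaMA-style Hugging Face model whose transformer blocks live at model.model.layers (other model families use different attribute paths).

    import torch.nn as nn

    def prune_last_layers(model, fraction=0.3):
        # Drop the last `fraction` of transformer blocks before the short
        # contrastive fine-tuning stage. The attribute path is an assumption
        # matching LLaMA-style checkpoints.
        layers = model.model.layers
        keep = int(len(layers) * (1 - fraction))
        model.model.layers = nn.ModuleList(layers[:keep])
        model.config.num_hidden_layers = keep
        return model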

pdf bib
Vocabulary-level Memory Efficiency for Language Model Fine-tuning
Miles Williams | Nikolaos Aletras

The extensive memory footprint of language model (LM) fine-tuning poses a challenge for both researchers and practitioners. LMs use an embedding matrix to represent extensive vocabularies, forming a substantial proportion of the model parameters. While previous work towards memory-efficient fine-tuning has focused on minimizing the number of trainable parameters, reducing the memory footprint of the embedding matrix has yet to be explored. We first demonstrate that a significant proportion of the vocabulary remains unused during fine-tuning. We then propose a simple yet effective approach that leverages this finding to minimize memory usage. We show that our approach provides substantial reductions in memory usage across a wide range of models and tasks. Notably, our approach does not impact downstream task performance, while allowing more efficient use of computational resources.
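
A minimal sketch of the idea, with hypothetical names: collect the token ids that actually occur in the fine-tuning data, then fine-tune a compact embedding over only that sub-vocabulary. The exact bookkeeping in the paper may differ.

    import torch

    def used_vocab(dataset_input_ids):
        # Collect token ids actually occurring in the fine-tuning data.
        ids = set()
        for seq in dataset_input_ids:
            ids.update(int(t) for t in seq)
        return sorted(ids)

    def shrink_embeddings(embedding, keep_ids):
        # Build a compact embedding over only the used vocabulary; after
        # fine-tuning, rows can be scattered back into the full matrix.
        keep = torch.tensor(keep_ids)
        small = torch.nn.Embedding(len(keep_ids), embedding.embedding_dim)
        small.weight.data.copy_(embedding.weight.data[keep])
        remap = {old: new for new, old in enumerate(keep_ids)}
        return small, remap   # pass input ids through `remap` before the forward pass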

up

pdf (full)
bib (full)
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)

pdf bib
Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)
Špela Arhar Holdt | Nikolai Ilinykh | Barbara Scalvini | Micaella Bruton | Iben Nyholm Debess | Crina Madalina Tudor

pdf bib
Universal Dependencies Treebank for Uzbek
Arofat Akhundjanova | Luigi Talamo

We present the first Universal Dependencies treebank for Uzbek, a low-resource language from the Turkic family. The treebank contains 500 sentences (5850 tokens) sourced from the news and fiction genres and it is annotated for lemmas, part-of-speech (POS) tags, morphological features, and dependency relations. We describe our methodology for building the treebank, which consists of a mix of manual and automatic annotation and discuss some constructions of the Uzbek language that pose challenges to the UD framework.

pdf bib
Fine-Tuning Cross-Lingual LLMs for POS Tagging in Code-Switched Contexts
Shayaan Absar

Code-switching (CS) involves speakers switching between two (or potentially more) languages during conversation and is a common phenomenon in bilingual communities. The majority of NLP research has been devoted to monolingual language modelling. Consequently, most models perform poorly on code-switched data. This paper investigates the effectiveness of Cross-Lingual Large Language Models on the task of POS (Part-of-Speech) tagging in code-switched contexts, once they have undergone a fine-tuning process. The models are trained on code-switched combinations of Indian languages and English. This paper also investigates whether fine-tuned models are able to generalise and POS tag code-switched combinations that were not part of the fine-tuning dataset. Additionally, this paper presents a new metric, the S-index (Switching-Index), for measuring the level of code-switching within an utterance.
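
The paper's exact S-index definition is not given in this abstract; one plausible formulation, shown purely for illustration, is the fraction of adjacent token pairs whose language tag changes.

    def s_index(lang_tags):
        # Illustrative switching measure: proportion of adjacent token pairs
        # where the language changes. Not the published definition.
        if len(lang_tags) < 2:
            return 0.0
        switches = sum(a != b for a, b in zip(lang_tags, lang_tags[1:]))
        return switches / (len(lang_tags) - 1)

    s_index(["hi", "en", "en", "hi"])   # -> 2/3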

pdf bib
Second language Korean Universal Dependency treebank v1.2: Focus on Data Augmentation and Annotation Scheme Refinement
Hakyung Sung | Gyu-Ho Shin

We expand the second language (L2) Korean Universal Dependencies (UD) treebank with 5,454 manually annotated sentences. The annotation guidelines are also revised to better align with the UD framework. Using this enhanced treebank, we fine-tune three Korean language models—Stanza, spaCy, and Trankit—and evaluate their performance on in-domain and out-of-domain L2-Korean datasets. The results show that fine-tuning significantly improves their performance across various metrics, thus highlighting the importance of using well-tailored L2 datasets for fine-tuning first-language-based, general-purpose language models for the morphosyntactic analysis of L2 data.

pdf bib
Recommendations for Overcoming Linguistic Barriers in Healthcare: Challenges and Innovations in NLP for Haitian Creole
Ludovic Mompelat

Haitian Creole, spoken by millions in Haiti and its diaspora, remains underrepresented in Natural Language Processing (NLP) research, limiting the availability of effective translation tools. In Miami, a significant Haitian Creole-speaking population faces healthcare disparities exacerbated by language barriers. Existing translation systems fail to address key challenges such as linguistic variation within the Creole language, frequent code-switching, and the lack of standardized medical terminology. This work proposes a structured methodology for the development of an AI-assisted translation and interpretation tool tailored for patient-provider communication in a medical setting. To achieve this, we propose a hybrid NLP approach that integrates fine-tuned Large Language Models (LLMs) with traditional machine translation methods. This combination ensures accurate, context-sensitive translation that adapts to both formal medical discourse and conversational registers while maintaining linguistic consistency. Additionally, we discuss data collection strategies, annotation challenges, and evaluation metrics necessary for building an ethically designed, scalable NLP system. By addressing these issues, this research provides a foundation for improving healthcare accessibility and linguistic equity for Haitian Creole speakers.

pdf bib
Beyond a Means to an End: A Case Study in Building Phonotactic Corpora for Central Australian Languages
Saliha Muradoglu | James Gray | Jane Helen Simpson | Michael Proctor | Mark Harvey

Linguistic datasets are essential across fields: computational linguists use them for NLP development, theoretical linguists for statistical arguments supporting hypotheses about language, and documentary linguists for preserving examples and aiding grammatical descriptions. Transforming raw data (e.g., recordings or dictionaries) into structured forms (e.g., tables) requires non-trivial decisions within processing pipelines. This paper highlights the importance of these processes in understanding linguistic systems. Our contributions include: (1) an interactive dashboard for four central Australian languages with custom filters, and (2) demonstrating how data processing decisions influence measured outcomes.

pdf bib
OCR Error Post-Correction with LLMs in Historical Documents: No Free Lunches
Jenna Kanerva | Cassandra Ledins | Siiri Käpyaho | Filip Ginter

Optical Character Recognition (OCR) systems often introduce errors when transcribing historical documents, leaving room for post-correction to improve text quality. This study evaluates the use of open-weight LLMs for OCR error correction in historical English and Finnish datasets. We explore various strategies, including parameter optimization, quantization, segment length effects, and text continuation methods. Our results demonstrate that while modern LLMs show promise in reducing character error rates (CER) in English, a practically useful performance for Finnish was not reached. Our findings highlight the potential and limitations of LLMs in scaling OCR post-correction for large historical corpora.

pdf bib
FoQA: A Faroese Question-Answering Dataset
Annika Simonsen | Dan Saattrup Nielsen | Hafsteinn Einarsson

We present FoQA, a Faroese extractive question-answering (QA) dataset with 2,000 samples, created using a semi-automated approach combining Large Language Models (LLMs) and human validation. The dataset was generated from Faroese Wikipedia articles using GPT-4-turbo for initial QA generation, followed by question rephrasing to increase complexity and native speaker validation to ensure quality. We provide baseline performance metrics for FoQA across multiple models, including LLMs and BERT, demonstrating its effectiveness in evaluating Faroese QA performance. The dataset is released in three versions: a validated set of 2,000 samples, a complete set of all 10,001 generated samples, and a set of 2,395 rejected samples for error analysis.

pdf bib
Automatic Validation of the Non-Validated Spanish Speech Data of Common Voice 17.0
Carlos Daniel Hernández Mena | Barbara Scalvini | Dávid í Lág

Mozilla Common Voice is a crowdsourced project that aims to create a public, multilingual dataset of voice recordings for training speech recognition models. In Common Voice, anyone can contribute by donating or validating recordings in various languages. However, despite the availability of many recordings in certain languages, a significant percentage remains unvalidated by users. This is the case for Spanish, where in version 17.0 of Common Voice, 75% of the 2,220 hours of recordings are unvalidated. In this work, we used the Whisper recognizer to automatically validate approximately 784 hours of recordings, more than the 562 hours validated by users. To verify the accuracy of the validation, we developed a speech recognition model based on a version of NVIDIA-NeMo’s Parakeet, which does not have an official Spanish version. Our final model achieved a WER of less than 4% on the test and validation splits of Common Voice 17.0. Both the model and the speech corpus are publicly available on Hugging Face.
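
A minimal sketch of such an acceptance check, assuming the jiwer package for word error rate; the 10% threshold is our illustrative choice, not the authors' reported criterion.

    from jiwer import wer  # pip install jiwer

    def auto_validate(prompt_text, asr_transcript, max_wer=0.10):
        # Accept a clip when the ASR transcript of the recording is close
        # enough to the sentence the contributor was prompted to read.
        return wer(prompt_text.lower(), asr_transcript.lower()) <= max_wer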

pdf bib
WikiQA-IS: Assisted Benchmark Generation and Automated Evaluation of Icelandic Cultural Knowledge in LLMs
Þórunn Arnardóttir | Elías Bjartur Einarsson | Garðar Ingvarsson Juto | Þorvaldur Páll Helgason | Hafsteinn Einarsson

This paper presents WikiQA-IS, a novel question-answering dataset focusing on Icelandic culture and history, along with an automated pipeline for dataset generation and evaluation. Leveraging GPT-4 to create questions and answers based on Icelandic Wikipedia articles and news sources, we produced a high-quality corpus of 2,000 question-answer pairs. We introduce an automatic evaluation method using GPT-4o as a judge, which shows strong agreement with human evaluations. Our benchmark reveals varying performances across different language models, with closed-source models generally outperforming open-weights alternatives. This work contributes a resource for evaluating language models’ knowledge of Icelandic culture and offers a replicable framework for creating similar datasets in other cultural contexts.

pdf bib
DUDU: A Treebank for Ottoman Turkish in UD Style
Enes Yılandiloğlu | Janine Siewert

This paper introduces a recently released Ottoman Turkish (ota) treebank in Universal Dependencies (UD) style, DUDU. The DUDU Treebank consists of 1,064 automatically annotated and manually corrected sentences. The texts were manually collected from various academic or literary sources available on the Internet. Following preprocessing, the sentences were annotated using a MaCHAMP-based neural network model utilizing the large language model (LLM) architecture and manually corrected. The treebank became publicly available with the 2.14 release, and future steps involve expanding the treebank with more data and refining the annotation scheme. The treebank is the first and only treebank that utilizes the IJMES transliteration alphabet. The treebank not only gives insight into Ottoman Turkish lexically, morphologically, and syntactically, but also provides a small but robust test set for future computational models for Ottoman Turkish.

pdf bib
A Simple Audio and Text Collection-Annotation Tool Targeted to Brazilian Indigenous Language Native Speakers
Gustavo Padilha Polleti | Fabio Cozman | Fabricio Gerardi

In this paper we present an audio and text annotation tool for native speakers, with a particular focus on Brazilian indigenous languages. Our tool simplifies the process of language resource annotation and employs gamification techniques typically found in language learning games. We describe the annotation tool and present preliminary results for the Bororo language. We discuss the limitations of our tool, highlighting ethical and practical implementation concerns.

pdf bib
First Steps in Benchmarking Latvian in Large Language Models
Inguna Skadina | Bruno Bakanovs | Roberts Darģis

The performance of multilingual large language models (LLMs) in low-resource languages, such as Latvian, has been under-explored. In this paper, we investigate the capabilities of several open and commercial LLMs in the Latvian language understanding tasks. We evaluate these models across several well-known benchmarks, such as the Choice of Plausible Alternatives (COPA) and Measuring Massive Multitask Language Understanding (MMLU), which were adapted into Latvian using machine translation. Our results highlight significant variability in model performance, emphasizing the challenges of extending LLMs to low-resource languages. We also analyze the effect of post-editing on machine-translated datasets, observing notable improvements in model accuracy, particularly with BERT-based architectures. We also assess open-source LLMs using the Belebele dataset, showcasing competitive performance from open-weight models when compared to proprietary systems. This study reveals key insights into the limitations of current LLMs in low-resource settings and provides datasets for future benchmarking efforts.

pdf bib
On the Usage of Semantics, Syntax, and Morphology for Noun Classification in IsiZulu
Imaan Sayed | Zola Mahlaza | Alexander van der Leek | Jonathan Mopp | C. Maria Keet

There is limited work aimed at solving the core task of noun classification for Nguni languages. The task focuses on identifying the semantic categorisation of each noun and plays a crucial role in the ability to form semantically and morphologically valid sentences. The work by Byamugisha (2022) was the first to tackle the problem for a related, but non-Nguni, language. While there have been efforts to replicate it for a Nguni language, there has been no effort focused on comparing the technique used in the original work vs. contemporary neural methods or a number of traditional machine learning classification techniques that do not rely on human-guided knowledge to the same extent. We reproduce Byamugisha (2022)’s work with different configurations to account for differences in access to datasets and resources, compare the approach with a pre-trained transformer-based model, and with traditional machine learning models that rely on less human-guided knowledge. The newly created data-driven models outperform the knowledge-infused models, with the best performing models achieving an F1 score of 0.97.

pdf bib
Annotating Attitude in Swedish Political Tweets
Anna Lindahl

There is a lack of Swedish datasets annotated for emotional and argumentative language. This work therefore presents an annotation procedure and a dataset of Swedish political tweets. The tweets are annotated for positive and negative attitude. Challenges with this type of annotation are identified and described. The evaluation shows that the annotators do not agree on where to annotate spans, but that they agree on labels. This is demonstrated with a new implementation of the agreement coefficient Krippendorff’s unitized alpha.

pdf bib
VerbCraft: Morphologically-Aware Armenian Text Generation Using LLMs in Low-Resource Settings
Hayastan Avetisyan | David Broneske

Understanding and generating morphologically complex verb forms is a critical challenge in Natural Language Processing (NLP), particularly for low-resource languages like Armenian. Armenian’s verb morphology encodes multiple layers of grammatical information, such as tense, aspect, mood, voice, person, and number, requiring nuanced computational modeling. We introduce VerbCraft, a novel neural model that integrates explicit morphological classifiers into the mBART-50 architecture. VerbCraft achieves a BLEU score of 0.4899 on test data, compared to the baseline’s 0.9975, reflecting its focus on prioritizing morphological precision over fluency. With over 99% accuracy in aspect and voice predictions and robust performance on rare and irregular verb forms, VerbCraft addresses data scarcity through synthetic data generation with human-in-the-loop validation. Beyond Armenian, it offers a scalable framework for morphologically rich, low-resource languages, paving the way for linguistically informed NLP systems and advancing language preservation efforts.

pdf bib
Post-OCR Correction of Historical German Periodicals using LLMs
Vera Danilova | Gijs Aangenendt

Optical Character Recognition (OCR) is critical for accurate access to historical corpora, providing a foundation for processing pipelines and the reliable interpretation of historical texts. Despite advances, the quality of OCR in historical documents remains limited, often requiring post-OCR correction to address residual errors. Building on recent progress with instruction-tuned Llama 2 models applied to English historical newspapers, we examine the potential of German Llama 2 and Mistral models for post-OCR correction of German medical historical periodicals. We perform instruction tuning using two configurations of training data, augmenting our small annotated dataset with two German datasets from the same time period. The results demonstrate that German Mistral enhances the raw OCR output, achieving a lower average word error rate (WER). However, the average character error rate (CER) either decreases or remains unchanged across all models considered. We perform an analysis of performance within the error groups and provide an interpretation of the results.

pdf bib
From Words to Action: A National Initiative to Overcome Data Scarcity for the Slovene LLM
Špela Arhar Holdt | Špela Antloga | Tina Munda | Eva Pori | Simon Krek

Large Language Models (LLMs) have demonstrated significant potential in natural language processing, but they depend on vast, diverse datasets, creating challenges for languages with limited resources. The paper presents a national initiative that addresses these challenges for Slovene. We outline strategies for large-scale text collection, including the creation of an online platform to engage the broader public in contributing texts and a communication campaign promoting openly accessible and transparently developed LLMs.

pdf bib
Assessing the Similarity of Cross-Lingual Seq2Seq Sentence Embeddings Using Low-Resource Spectral Clustering
Nelson Moll | Tahseen Rabbani

In this work, we study the cross-lingual distance of machine translations through alignment of seq2seq representations over small corpora. First, we use the M2M100 model to collect sentence-level representations of The Book of Revelation in several languages. We then perform unsupervised manifold alignment (spectral clustering) between these collections of embeddings. As verses between translations are not necessarily aligned, our procedure falls under the challenging, but more realistic non-correspondence regime. The cost function associated with each alignment is used to rank the relative (machine) similarity of one language to another. We then perform correspondent alignment over another cluster of languages, this time using FLORES+ parallel NLLB model embeddings. Our experiments demonstrate that the representations of closely-related languages group closely, and are cheap to align (requiring <1000 sentences) via our strategy.

pdf bib
Voices of Luxembourg: Tackling Dialect Diversity in a Low-Resource Setting
Nina Hosseini-Kivanani | Christoph Schommer | Peter Gilles

Dialect classification is essential for preserving linguistic diversity, particularly in low-resource languages such as Luxembourgish. This study introduces one of the first systematic approaches to classifying Luxembourgish dialects, addressing phonetic, prosodic, and lexical variations across four major regions. We benchmarked multiple models, including state-of-the-art pre-trained speech models like Wav2Vec2, XLSR-Wav2Vec2, and Whisper, alongside traditional approaches such as Random Forest and CNN-LSTM. To overcome data limitations, we applied targeted data augmentation strategies and analyzed their impact on model performance. Our findings highlight the superior performance of CNN-Spectrogram and CNN-LSTM models while identifying the strengths and limitations of data augmentation. This work establishes foundational benchmarks and provides actionable insights for advancing dialectal NLP in Luxembourgish and other low-resource languages.

pdf bib
The Application of Corpus-Based Language Distance Measurement to the Diatopic Variation Study (on the Material of the Old Novgorodian Birchbark Letters)
Ilia Afanasev | Olga Lyashevskaya

The paper presents a computer-assisted exploration of a set of texts, where qualitative analysis complements the linguistically-aware vector-based language distance measurements, interpreting them through close reading and thus proving or disproving their conclusions. It proposes using a method designed for small raw corpora to explore the individual, chronological, and gender-based differences within an extinct single territorial lect, known only by a scarce collection of documents. The material under consideration is the Novgorodian birchbark letters, a set of rather small manuscripts (not a single one is more than 1000 tokens) that are witnesses of the Old Novgorodian lect, spoken in the territories of modern Novgorod and Staraya Russa in the first half of the second millennium CE. The study shows the existence of chronological variation, a mild degree of individual variation, and almost absent gender-based differences. Possible prospects of the study include its application to the newly discovered birchbark letters and using an outgroup for more precise measurements.

pdf bib
“I Need More Context and an English Translation”: Analysing How LLMs Identify Personal Information in Komi, Polish, and English
Nikolai Ilinykh | Maria Irena Szawerna

Automatic identification of personal information (PI) is particularly difficult for languages with limited linguistic resources. Recently, large language models (LLMs) have been applied to various tasks involving low-resourced languages, but their capability to process PI in such contexts remains under-explored. In this paper we provide a qualitative analysis of the outputs from three LLMs prompted to identify PI in texts written in Komi (Permyak and Zyrian), Polish, and English. Our analysis highlights challenges in using pre-trained LLMs for PI identification in both low- and medium-resourced languages. It also motivates the need to develop LLMs that understand the differences in how PI is expressed across languages with varying levels of availability of linguistic resources.

pdf bib
Multi-label Scandinavian Language Identification (SLIDE)
Mariia Fedorova | Jonas Sebulon Frydenberg | Victoria Handford | Victoria Ovedie Chruickshank Langø | Solveig Helene Willoch | Marthe Løken Midtgaard | Yves Scherrer | Petter Mæhlum | David Samuel

Identifying closely related languages at sentence level is difficult, in particular because it is often impossible to assign a sentence to a single language. In this paper, we focus on multi-label sentence-level Scandinavian language identification (LID) for Danish, Norwegian Bokmål, Norwegian Nynorsk, and Swedish. We present the Scandinavian Language Identification and Evaluation, SLIDE, a manually curated multi-label evaluation dataset and a suite of LID models with varying speed–accuracy tradeoffs. We demonstrate that the ability to identify multiple languages simultaneously is necessary for any accurate LID method, and present a novel approach to training such multi-label LID models.
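
A sketch of the multi-label formulation the abstract argues for: one sigmoid per language instead of a softmax over languages, so a sentence valid in, say, both Bokmål and Danish can receive both labels. Names and sizes are illustrative.

    import torch.nn as nn

    class MultiLabelLID(nn.Module):
        # Multi-label LID head: an independent sigmoid per language, so a
        # sentence can be assigned to several closely related languages at once.
        def __init__(self, encoder_dim, n_langs=4):
            super().__init__()
            self.head = nn.Linear(encoder_dim, n_langs)

        def forward(self, sent_emb):
            return self.head(sent_emb)      # logits; train with nn.BCEWithLogitsLoss

        def predict(self, sent_emb, threshold=0.5):
            return self.forward(sent_emb).sigmoid() > threshold  # multi-hot, not argmax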

pdf bib
Federated Meta-Learning for Low-Resource Translation of Kirundi
Kyle Rui Sang | Tahseen Rabbani | Tianyi Zhou

In this work, we reframe multilingual neural machine translation (NMT) as a federated meta-learning problem and introduce a translation dataset for the low-resource Kirundi language. We aggregate machine translation models locally trained on varying (but related) source languages to produce a global meta-model that encodes abstract representations of key semantic structures relevant to the parent languages. We then use the Reptile algorithm and Optuna fine-tuning to fit the global model onto a target language. The target language may live outside the subset of parent languages (such as closely-related dialects or sibling languages), which is particularly useful for languages with a limited number of available sentence pairs. We first develop a novel dataset of Kirundi-English sentence pairs curated from Biblical translation. We then demonstrate that a federated learning approach can produce a tiny 4.8M-parameter Kirundi translation model and a stronger NLLB-600M model which performs well on both our Biblical corpus and the FLORES-200 Kirundi corpus.

up

pdf (full)
bib (full)
Proceedings of the Second Workshop in South East Asian Language Processing

pdf bib
Proceedings of the Second Workshop in South East Asian Language Processing
Derry Wijaya | Alham Fikri Aji | Clara Vania | Genta Indra Winata | Ayu Purwarianti

pdf bib
bAI-bAI: A Context-Aware Transliteration System for Baybayin Scripts
Jacob Simon D. Bernardo | Maria Regina Justina E. Estuar

Baybayin, a pre-colonial writing system from the Philippines, has seen a resurgence in recent years. Research in computational linguistics has shown an increasing interest in Baybayin OCR, which focuses on the recognition and classification of script characters. However, existing studies face challenges with ambiguous Baybayin words that have multiple possible transliterations. This study introduces a disambiguation technique that employs word embeddings (WE) for contextual analysis and uses part-of-speech (POS) tagging as an initial filtering step. This approach is compared with an LLM method that prompts GPT-4o mini to determine the most appropriate transliteration given a sentence input. The proposed disambiguation process is integrated into existing Baybayin OCR systems to develop bAI-bAI, a context-aware Baybayin transliteration system capable of handling ambiguous words. Results show that incorporating POS as a filter does not significantly affect performance. The WE-Only method yields an accuracy of 77.46% and takes 5.35 ms to process one sample, while leveraging GPT-4o mini peaks at a higher accuracy of 90.52% but with a much longer runtime of 3280 ms per sample. These findings present an opportunity to further explore and improve NLP approaches in disambiguation methods.

pdf bib
NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural
Wilson Wongso | David Samuel Setiawan | Steven Limcorn | Ananto Joyoadikusumo

We present NusaBERT, a multilingual model built on IndoBERT and tailored for Indonesia’s diverse languages. By expanding vocabulary and pre-training on a regional corpus, NusaBERT achieves state-of-the-art performance on Indonesian NLU benchmarks, enhancing IndoBERT’s multilingual capability. This study also addresses NusaBERT’s limitations and encourages further research on Indonesia’s underrepresented languages.

pdf bib
Evaluating Sampling Strategies for Similarity-Based Short Answer Scoring: a Case Study in Thailand
Pachara Boonsarngsuk | Pacharapon Arpanantikul | Supakorn Hiranwipas | Wipu Watcharakajorn | Ekapol Chuangsuwanich

Automatic short answer scoring is a task that aims to help grade written works by learners of some subject matter. In niche subject domains with few examples, existing methods have primarily utilized similarity-based scoring, relying on predefined reference answers to grade each student’s answer based on its similarity to the reference. However, these reference answers are often generated from a randomly selected set of graded student answers, which may fail to represent the full range of scoring variations. We propose a semi-automatic scoring framework that enhances the selective sampling strategy for defining the reference answers through a K-center-based and a K-means-based sampling method. Our results demonstrate that our framework outperforms previous similarity-based scoring methods on a dataset containing Thai and English. Moreover, it achieves competitive performance compared to human reference performance and LLMs.
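
A minimal sketch of the K-means-based sampling idea, using scikit-learn and hypothetical names: reference answers are picked near cluster centroids so they cover the space of graded answers better than random selection, and a new answer inherits the grade of its most similar reference.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics.pairwise import cosine_similarity

    def select_references(embeddings, scores, k=10):
        # Pick the graded answer nearest each K-means centroid as a reference.
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
        refs = []
        for c in km.cluster_centers_:
            idx = int(np.argmin(np.linalg.norm(embeddings - c, axis=1)))
            refs.append((embeddings[idx], scores[idx]))
        return refs

    def score_answer(emb, refs):
        # Assign the grade of the most similar reference answer.
        sims = [float(cosine_similarity(emb[None], r[None])[0, 0]) for r, _ in refs]
        return refs[int(np.argmax(sims))][1]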

pdf bib
Thai Winograd Schemas: A Benchmark for Thai Commonsense Reasoning
Phakphum Artkaew

Commonsense reasoning is one of the important aspects of natural language understanding, with several benchmarks developed to evaluate it. However, only a few of these benchmarks are available in languages other than English. Developing parallel benchmarks facilitates cross-lingual evaluation, enabling a better understanding of different languages. This research introduces a collection of Winograd Schemas in Thai, a novel dataset designed to evaluate commonsense reasoning capabilities in the context of the Thai language. Through a methodology involving native speakers, professional translators, and thorough validation, the schemas aim to closely reflect Thai language nuances, idioms, and cultural references while maintaining ambiguity and commonsense challenges. We evaluate the performance of popular large language models on this benchmark, revealing their strengths, limitations, and providing insights into the current state-of-the-art. Results indicate that while models like GPT-4 and Claude-3-Opus achieve high accuracy in English, their performance significantly drops in Thai, highlighting the need for further advancements in multilingual commonsense reasoning.

pdf bib
Anak Baik: A Low-Cost Approach to Curate Indonesian Ethical and Unethical Instructions
Sulthan Abiyyu Hakim | Rizal Setya Perdana | Tirana Noor Fatyanosa

This study explores the ethical challenges faced by Indonesian Large Language Models (LLMs), particularly focusing on their ability to distinguish between ethical and unethical instructions. As LLMs become increasingly integrated into sensitive applications, ensuring their ethical operation is crucial. A key contribution of this study is the introduction of the Anak Baik dataset, a resource designed to enhance the ethical reasoning capabilities of Indonesian LLMs. The phrase “Anak Baik”, meaning “Good Boy”, symbolizes the ideal of ethical behavior, as a well-behaved child refrains from engaging in harmful actions. The dataset comprises instruction-response pairs in Indonesian, crafted for Supervised Fine-Tuning (SFT) tasks. It includes examples of both ethical and unethical responses to guide models in learning to generate responses that uphold moral standards. Leveraging Low-Rank Adaptation (LoRA) on models such as Komodo and Cendol shows a significant improvement in ethical decision-making processes. This enhanced performance is quantitatively validated through substantial increases in BLEU and ROUGE scores, indicating a stronger alignment with socially responsible behavior.

pdf bib
Indonesian Speech Content De-Identification in Low Resource Transcripts
Rifqi Naufal Abdjul | Dessi Puji Lestari | Ayu Purwarianti | Candy Olivia Mawalim | Sakriani Sakti | Masashi Unoki

Advancements in technology and the increased use of digital data threaten individual privacy, especially in speech containing Personally Identifiable Information (PII). Therefore, systems that can remove or process privacy-sensitive data in speech are needed, particularly for low-resource transcripts. These transcripts are minimally annotated or labeled automatically, which is less precise than human annotation. However, using them can simplify the development of de-identification systems in any language. In this study, we develop and evaluate an efficient speech de-identification system. We create an Indonesian speech dataset containing sensitive private information and design a system with three main components: speech recognition, information extraction, and masking. To enhance performance in low-resource settings, we incorporate transcription data in training, use data augmentation, and apply weakly supervised learning. Our results show that our techniques significantly improve privacy detection performance, with approximately 29% increase in F1 score, 20% in precision, and 30% in recall with minimally labeled data.

pdf bib
IndoMorph: a Morphology Engine for Indonesian
Ian Kamajaya | David Moeljadi

Indonesian is an agglutinative language and rich in morphology. Although it has more than 250 million speakers, it is a low-resource language in the NLP field. Many Indonesian NLP resources are scattered, undocumented, and not publicly available. In this paper we address the issue of analyzing morphology as well as generating Indonesian words. We introduce IndoMorph, a morphology analyzer and word generator for Indonesian. In an agglutinative language, morphology deconstruction can be crucial to understanding the structure and meaning of words. IndoMorph can be useful for language modeling and testing certain analyses. In addition, it can be employed to build a new Indonesian subword representation resource such as an Indonesian morphology dictionary (IMD), used as a language education tool, or embedded in various applications such as text analysis applications. We hope that IndoMorph can be employed not only in Indonesian NLP research and development, but also in the NLP research of any agglutinative language.

pdf bib
NusaDialogue: Dialogue Summarization and Generation for Underrepresented and Extremely Low-Resource Languages
Ayu Purwarianti | Dea Adhista | Agung Baptiso | Miftahul Mahfuzh | Yusrina Sabila | Aulia Adila | Samuel Cahyawijaya | Alham Fikri Aji

Developing dialogue summarization for extremely low-resource languages is a challenging task. We introduce NusaDialogue, a dialogue summarization dataset for three underrepresented languages in the Malayo-Polynesian language family: Minangkabau, Balinese, and Buginese. NusaDialogue covers 17 topics and 185 subtopics, with annotations provided by 73 native speakers. Additionally, we conducted experiments using fine-tuning on a specifically designed medium-sized language model for Indonesian, as well as zero- and few-shot learning on various multilingual large language models (LLMs). The results indicate that, for extremely low-resource languages such as Minangkabau, Balinese, and Buginese, the fine-tuning approach yields significantly higher performance compared to zero- and few-shot prompting, even when applied to LLMs with considerably larger parameter sizes.

up

pdf (full)
bib (full)
Proceedings of the The 22nd SIGMORPHON workshop on Computational Morphology, Phonology, and Phonetics

pdf bib
Proceedings of the The 22nd SIGMORPHON workshop on Computational Morphology, Phonology, and Phonetics
Garrett Nicolai | Eleanor Chodroff | Frederic Mailhot | Çağrı Çöltekin

pdf bib
Prompt and circumstance: A word-by-word LLM prompting approach to interlinear glossing for low-resource languages
Micha Elsner | David Liu

Partly automated creation of interlinear glossed text (IGT) has the potential to assist in linguistic documentation. We argue that LLMs can make this process more accessible to linguists because of their capacity to follow natural-language instructions. We investigate the effectiveness of a retrieval-based LLM prompting approach to glossing, applied to the seven languages from the SIGMORPHON 2023 shared task. Our system beats the BERT-based shared task baseline for every language in the morpheme-level score category, and we show that a simple 3-best oracle has higher word-level scores than the challenge winner (a tuned sequence model) in five languages. In a case study on Tsez, we ask the LLM to automatically create and follow linguistic instructions, reducing errors on a confusing grammatical feature. Our results thus demonstrate the potential contributions which LLMs can make in interactive systems for glossing, both in making suggestions to human annotators and in following directions.

pdf bib
West Germanic noun-noun compounds and the morphology-syntax trade-off
Pablo Mosteiro | Damián Blasi | Denis Paperno

This paper examines the linguistic distinction between syntax and morphology, focusing on noun-noun compounds in three West Germanic languages (English, Dutch, and German). Previous studies using the Parallel Bible Corpus have found a trade-off between word order (syntax) and word structure (morphology), with languages optimizing information conveyance through these systems. Our research question is whether manipulating English noun-noun compounds to resemble Dutch and German constructions can reproduce the observed distance between these languages in the order-structure plane. We extend a word-pasting procedure to merge increasingly common noun-noun pairs in English Bible translations. After each merge, we estimate the information contained in word order and word structure using entropy calculations. Our results show that pasting noun-noun pairs reduces the difference between English and the other languages, suggesting that orthographic conventions defining word boundaries play a role in this distinction. However, the effect is not pronounced, and results are statistically inconclusive.
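
As an illustration of one word-pasting step, the sketch below merges the most frequent adjacent pair in a tokenized corpus into a single token; restricting candidates to noun-noun pairs (e.g., via POS tags) is omitted, and all names are ours.

    from collections import Counter

    def paste_top_pair(tokenized_sentences):
        # One step of the word-pasting procedure: find the most frequent
        # adjacent pair and merge it into a single "compound" token.
        pairs = Counter()
        for sent in tokenized_sentences:
            pairs.update(zip(sent, sent[1:]))
        (a, b), _ = pairs.most_common(1)[0]
        merged = []
        for sent in tokenized_sentences:
            out, i = [], 0
            while i < len(sent):
                if i + 1 < len(sent) and sent[i] == a and sent[i + 1] == b:
                    out.append(a + b)      # paste the pair into one token
                    i += 2
                else:
                    out.append(sent[i])
                    i += 1
            merged.append(out)
        return merged

After each such merge, word-order and word-structure entropies can be re-estimated to trace the corpus's movement in the order-structure plane.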

pdf bib
The Impact of Dialect Variation on Robust Automatic Speech Recognition for Catalan
Zachary Hopton | Eleanor Chodroff

To accurately transcribe a speech signal, automatic speech recognition (ASR) systems must show robustness to a wide range of task-independent variation, such as speaker factors, recording quality, or even “adversarial noise” designed to disrupt performance. We manipulated the dialect composition of fine-tuning data for ASR to study whether balancing the relative proportion of dialects had an impact on models’ robustness to two such sources of variation: dialect variation and adversarial perturbations. We fine-tuned XLSR-53 for Catalan ASR using four different dialect compositions, each containing the Central Catalan dialect. These were defined as 100%, 80%, 50%, and 20% Central Catalan, with the remaining portions split evenly between four other Catalan dialects. While increasing the relative proportion of dialect variants improved models’ dialect robustness, this did not have a meaningful impact on adversarial robustness. These findings suggest that while improvements to ASR can be made by diversifying the training data, such changes do not sufficiently counteract adversarial attacks, leaving the technology open to security threats.

pdf bib
Probing Neural Network Generalization using Default Patterns
Brandon Prickett | Tianyi Nyu | Katya Pertsova

Whether neural-net models can learn minority-default patterns has been a matter of some controversy. Results based on modeling real human language data are hard to interpret due to complexity. Therefore, we examine the learning of a simple artificial language pattern involving defaults using three computational models: an Encoder-Decoder RNN, a Transformer Encoder, and a Logistic Regression. Overall, we find that the models have the hardest time with minority defaults, but can eventually learn them and apply them to novel words (although not always extend them to completely novel segments or novel CV-sequences). Type frequency has the largest effect on learning in all models, trumping the effect of distribution. We examine the weights of two models to provide further insights into how defaults are represented inside the models.

up

pdf (full)
bib (full)
Proceedings of the Second Workshop on Scaling Up Multilingual & Multi-Cultural Evaluation

pdf bib
Proceedings of the Second Workshop on Scaling Up Multilingual & Multi-Cultural Evaluation

pdf bib
The First Multilingual Model For The Detection of Suicide Texts
Rodolfo Joel Zevallos | Annika Marie Schoene | John E. Ortega

Suicidal ideation is a serious health problem affecting millions of people worldwide. Social networks provide information about these mental health problems through users’ emotional expressions. We propose a multilingual model leveraging transformer architectures like mBERT, XLM-R, and mT5 to detect suicidal text across posts in six languages: Spanish, English, German, Catalan, Portuguese, and Italian. A Spanish suicide ideation tweet dataset was translated into the five other languages using SeamlessM4T. Each model was fine-tuned on this multilingual data and evaluated across classification metrics. Results showed mT5 achieving the best performance overall, with F1 scores above 85%, highlighting capabilities for cross-lingual transfer learning. The English and Spanish translations also displayed high quality based on perplexity. Our exploration underscores the importance of considering linguistic diversity in developing automated multilingual tools to identify suicidal risk. Limitations exist around semantic fidelity in translations and ethical implications, which provide guidance for future human-in-the-loop evaluations.

pdf bib
CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment
Geyu Lin | Bin Wang | Zhengyuan Liu | Nancy F. Chen

Multilingual proficiency presents a significant challenge for large language models (LLMs). English-centric models are usually suboptimal in other languages, particularly those that are linguistically distant from English. This performance discrepancy mainly stems from the imbalanced distribution of training data across languages during pre-training and instruction tuning stages. To address this problem, we propose a novel approach called CrossIn, which utilizes a mixed composition of cross-lingual instruction tuning data. Our method leverages the compressed representation shared by various languages to efficiently enhance the model’s task-solving capabilities and multilingual proficiency within a single process. In addition, we introduce a multi-task and multi-faceted benchmark to evaluate the effectiveness of CrossIn. Experimental results demonstrate that our method substantially improves performance across tasks and languages, and we provide extensive insights into the impact of cross-lingual data volume and the integration of translation data on enhancing multilingual consistency and accuracy.

pdf bib
Evaluating Dialect Robustness of Language Models via Conversation Understanding
Dipankar Srirag | Nihar Ranjan Sahoo | Aditya Joshi

With an ever-growing number of LLMs reporting superlative performance for English, their ability to perform equitably across different dialects of English (i.e., dialect robustness) needs to be ascertained. Specifically, we use English-language conversations (in US English or Indian English) between humans playing the word-guessing game ‘taboo‘. We formulate two evaluative tasks: target word prediction (TWP) (i.e., predict the masked target word in a conversation) and target word selection (TWS) (i.e., select the most likely masked target word in a conversation, from among a set of candidate words). Extending MD3, an existing dialectal dataset of taboo-playing conversations, we introduce M-MD3, a target-word-masked version of MD3 with en-US and en-IN subsets. We create two further subsets: en-MV (where en-US is transformed to include dialectal information) and en-TR (where dialectal information is removed from en-IN). We evaluate three multilingual LLMs: one open-source (Llama3) and two closed-source (GPT-4/3.5). LLMs perform significantly better for US English than Indian English on both TWP and TWS tasks, across all settings, exhibiting marginalisation of the Indian dialect of English. While GPT-based models perform best, the comparatively smaller models work more equitably after fine-tuning. Our evaluation methodology exhibits a novel and reproducible way to examine attributes of language models using pre-existing dialogue datasets with language varieties. Dialect being an artifact of one’s culture, this paper demonstrates the gap in the performance of multilingual LLMs for communities that do not use a mainstream dialect.

pdf bib
Cross-Lingual Document Recommendations with Transformer-Based Representations: Evaluating Multilingual Models and Mapping Techniques
Tsegaye Misikir Tashu | Eduard-Raul Kontos | Matthia Sabatelli | Matias Valdenegro-Toro

Recommendation systems for documents have become tools for finding relevant content on the Web. However, these systems have limitations when it comes to recommending documents in languages different from the query language, which means they might overlook resources in non-native languages. This research focuses on representing documents across languages by using Transformer Leveraged Document Representations (TLDRs) that are mapped to a cross-lingual domain. Four multilingual pre-trained transformer models (mBERT, mT5, XLM-RoBERTa, ErnieM) were evaluated using three mapping methods across 20 language pairs representing combinations of five selected languages of the European Union. Metrics like Mate Retrieval Rate and Reciprocal Rank were used to measure the effectiveness of mapped TLDRs compared to non-mapped ones. The results highlight the power of cross-lingual representations achieved through pre-trained transformers and mapping approaches, suggesting a promising direction for expanding retrieval beyond connections between two specific languages.

pdf bib
VRCP: Vocabulary Replacement Continued Pretraining for Efficient Multilingual Language Models
Yuta Nozaki | Dai Nakashima | Ryo Sato | Naoki Asaba | Shintaro Kawamura

Building large language models (LLMs) for non-English languages involves leveraging extensively trained English models through continued pre-training on the target language corpora. This approach harnesses the rich semantic knowledge embedded in English models, allowing superior performance compared to training from scratch. However, tokenizers not optimized for the target language may introduce inefficiencies into training. We propose Vocabulary Replacement Continued Pretraining (VRCP), a method that optimizes the tokenizer for the target language by replacing vocabulary unique to (i.e., solely used by) the source tokenizer while maintaining the overall vocabulary size. This approach preserves the semantic knowledge of the source model while enhancing token efficiency and performance for the target language. We evaluated VRCP using the Llama-2 model on Japanese and Chinese corpora. The results show that VRCP matches the performance of vocabulary expansion methods on benchmarks and achieves superior performance in summarization tasks. Additionally, VRCP provides an optimized tokenizer that balances token efficiency, task performance, and GPU memory footprint, making it particularly suitable for resource-constrained environments.
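
The replacement step might look like the following sketch, which pairs the least-used source-tokenizer tokens with frequent unseen target-corpus strings while keeping the vocabulary size fixed. This is only our reading of the idea; VRCP's actual selection criteria and embedding re-initialization may differ.

```python
from collections import Counter

def plan_replacements(src_vocab: dict[str, int],
                      src_counts: Counter,
                      tgt_counts: Counter,
                      n_swap: int) -> list[tuple[str, str]]:
    """Pair the n_swap least-used source tokens with the most frequent
    target-corpus strings that are not already in the vocabulary."""
    unused = sorted(src_vocab, key=lambda t: src_counts[t])[:n_swap]
    new = [t for t, _ in tgt_counts.most_common() if t not in src_vocab][:n_swap]
    return list(zip(unused, new))

vocab = {"the": 0, "ing": 1, "##zz": 2, "##qx": 3}
plan = plan_replacements(vocab,
                         Counter({"the": 900, "ing": 500}),   # source usage
                         Counter({"の": 800, "です": 600}),    # target corpus
                         n_swap=2)
print(plan)  # [('##zz', 'の'), ('##qx', 'です')]
```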

up

pdf (full)
bib (full)
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)

pdf bib
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)
Trista Cao | Anubrata Das | Tharindu Kumarage | Yixin Wan | Satyapriya Krishna | Ninareh Mehrabi | Jwala Dhamala | Anil Ramakrishna | Aram Galystan | Anoop Kumar | Rahul Gupta | Kai-Wei Chang

pdf bib
Beyond Text-to-SQL for IoT Defense: A Comprehensive Framework for Querying and Classifying IoT Threats
Ryan Pavlich | Nima Ebadi | Richard Tarbell | Billy Linares | Adrian Tan | Rachael Humphreys | Jayanta Das | Rambod Ghandiparsi | Hannah Haley | Jerris George | Rocky Slavin | Kim-Kwang Raymond Choo | Glenn Dietrich | Anthony Rios

Recognizing the promise of natural language interfaces to databases, prior studies have emphasized the development of text-to-SQL systems. However, existing research has generally focused on generating SQL statements from text queries, while the broader challenge lies in inferring new information about the returned data. Our research makes two major contributions to address this gap. First, we introduce a novel Internet-of-Things (IoT) text-to-SQL dataset comprising 10,985 text-SQL pairs and 239,398 rows of network traffic activity. The dataset contains additional query types that are limited in prior text-to-SQL datasets, notably temporal-related queries. Our dataset is sourced from a smart building’s IoT ecosystem covering sensor readings and network traffic data. Second, our dataset allows two-stage processing, where the returned data (network traffic) from a generated SQL query can be categorized as malicious or not. Our results show that joint training to query and infer information about the data improves overall text-to-SQL performance, nearly matching that of substantially larger models. We also show that current large language models (e.g., GPT-3.5) struggle to infer new information about returned data (i.e., they are bad at tabular data understanding); thus, our dataset provides a novel test bed for integrating complex domain-specific reasoning into LLMs.
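
The two-stage use of the dataset can be pictured as below: a text-to-SQL model produces a query, the query is executed, and a second stage classifies the returned traffic rows. Both `generate_sql` and `classify_rows` are placeholders of our own; the toy schema and detection rule are illustrative, not the paper's models.

```python
import sqlite3

def generate_sql(question: str) -> str:
    # Placeholder for the text-to-SQL stage; a real system would prompt
    # an LLM with the table schema and the natural-language question.
    return ("SELECT src_ip, dst_port, bytes FROM traffic "
            "WHERE ts BETWEEN '2023-01-01' AND '2023-01-02'")

def classify_rows(rows: list[tuple]) -> list[bool]:
    # Placeholder second stage: flag traffic on unusual ports (toy rule;
    # the paper trains a model for this inference step).
    return [dst_port not in (80, 443) for _, dst_port, _ in rows]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traffic (ts TEXT, src_ip TEXT, dst_port INT, bytes INT)")
conn.execute("INSERT INTO traffic VALUES ('2023-01-01', '10.0.0.5', 6667, 120)")
rows = conn.execute(generate_sql("Which devices sent traffic on Jan 1?")).fetchall()
print(classify_rows(rows))  # [True]: port 6667 flagged as suspicious
```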

pdf bib
Gibberish is All You Need for Membership Inference Detection in Contrastive Language-Audio Pretraining
Ruoxi Cheng | Yizhong Ding | Shuirong Cao | Zhiqiang Wang | Shitong Shao

Audio can disclose personally identifiable information (PII), particularly when combined with related text data. Therefore, it is essential to develop tools to detect privacy leakage in Contrastive Language-Audio Pretraining (CLAP). Existing membership inference attacks (MIAs) need audio as input, risking exposure of voiceprints and requiring costly shadow models. We first propose PRMID, a membership inference detector based on the probability ranking given by CLAP, which does not require training shadow models but still requires both the audio and text of the individual as input. To address these limitations, we then propose USMID, a textual unimodal speaker-level membership inference detector that queries the target model using only text data. We randomly generate textual gibberish that is clearly not in the training dataset. Then we extract feature vectors from these texts using the CLAP model and train a set of anomaly detectors on them. During inference, the feature vector of each test text is input into the anomaly detectors to determine if the speaker is in the training set (anomalous) or not (normal). If available, USMID can further enhance detection by integrating real audio of the tested speaker. Extensive experiments on various CLAP model architectures and datasets demonstrate that USMID outperforms baseline methods using only text data.
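
A minimal sketch of the USMID recipe as described: train anomaly detectors on CLAP text features of randomly generated gibberish (out-of-training by construction) and flag test texts whose features look anomalous relative to it. The `embed_text` stand-in and the IsolationForest choice are ours; the paper trains its own set of detectors on real CLAP features.

```python
import random, string
import numpy as np
from sklearn.ensemble import IsolationForest

def gibberish(n_chars: int = 40) -> str:
    return "".join(random.choices(string.ascii_lowercase + " ", k=n_chars))

def embed_text(text: str) -> np.ndarray:
    # Stand-in for CLAP's text encoder; deterministic toy features.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=64)

# Gibberish is not in the training set, so its features define "normal".
X_gib = np.stack([embed_text(gibberish()) for _ in range(200)])
detector = IsolationForest(random_state=0).fit(X_gib)

def speaker_in_training(text: str) -> bool:
    # Anomalous w.r.t. gibberish features -> likely seen in training.
    return detector.predict(embed_text(text)[None, :])[0] == -1
```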

pdf bib
PBI-Attack: Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for Toxicity Maximization
Ruoxi Cheng | Yizhong Ding | Shuirong Cao | Ranjie Duan | Xiaoshuang Jia | Shaowei Yuan | Zhiqiang Wang | Xiaojun Jia

Understanding the vulnerabilities of Large Vision Language Models (LVLMs) to jailbreak attacks is essential for their responsible real-world deployment. Most previous work requires access to model gradients or relies on human knowledge (prompt engineering) to complete the jailbreak, and rarely considers the interaction of images and text, resulting in an inability to jailbreak in black-box scenarios or in poor performance. To overcome these limitations, we propose a Prior-Guided Bimodal Interactive Black-Box Jailbreak Attack for toxicity maximization, referred to as PBI-Attack. Our method begins by extracting malicious features from a harmful corpus using an alternative LVLM and embedding these features into a benign image as prior information. Subsequently, we enhance these features through bidirectional cross-modal interaction optimization, which iteratively optimizes the bimodal perturbations in an alternating manner through greedy search, aiming to maximize the toxicity of the generated response. The toxicity level is quantified using a well-trained evaluation model. Experiments demonstrate that PBI-Attack outperforms previous state-of-the-art jailbreak methods, achieving an average attack success rate of 92.5% across three open-source LVLMs and around 67.3% on three closed-source LVLMs. Disclaimer: This paper contains potentially disturbing and offensive content.

pdf bib
Ambiguity Detection and Uncertainty Calibration for Question Answering with Large Language Models
Zhengyan Shi | Giuseppe Castellucci | Simone Filice | Saar Kuzi | Elad Kravi | Eugene Agichtein | Oleg Rokhlenko | Shervin Malmasi

Large Language Models (LLMs) have demonstrated excellent capabilities in Question Answering (QA) tasks, yet their ability to identify and address ambiguous questions remains underdeveloped. Ambiguities in user queries often lead to inaccurate or misleading answers, undermining user trust in these systems. Despite prior attempts using prompt-based methods, performance has largely been equivalent to random guessing, leaving a significant gap in effective ambiguity detection. To address this, we propose a novel framework for detecting ambiguous questions within LLM-based QA systems. We first prompt an LLM to generate multiple answers to a question, and then analyze them to infer the ambiguity. We propose to use a lightweight Random Forest model, trained on a bootstrapped and shuffled dataset of 6-shot examples. Experimental results on the ASQA, PACIFIC, and ABG-COQA datasets demonstrate the effectiveness of our approach, with accuracy up to 70.8%. Furthermore, our framework enhances the confidence calibration of LLM outputs, leading to more trustworthy QA systems able to handle complex questions.
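
The recipe sketched in the abstract, i.e., sample several answers and let a small classifier read their (dis)agreement, can be illustrated as follows. The two agreement features and the toy supervision are our own choices, not necessarily those used in the paper.

```python
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

def agreement_features(answers: list[str]) -> list[float]:
    """Turn sampled answers into simple (dis)agreement features."""
    counts = Counter(a.strip().lower() for a in answers)
    top = counts.most_common(1)[0][1]
    return [len(counts) / len(answers),   # answer diversity
            top / len(answers)]           # majority share

# Toy supervision in the spirit of the 6-shot setup (1 = ambiguous).
X = [agreement_features(a) for a in
     [["Paris"] * 5,
      ["Paris", "Lyon", "Nice", "Paris", "Lyon"],
      ["1999"] * 5,
      ["red", "blue", "green", "red", "blue"]]]
y = [0, 1, 0, 1]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict([agreement_features(["cat", "dog", "cat", "fox", "dog"])]))
```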

pdf bib
Smaller Large Language Models Can Do Moral Self-Correction
Guangliang Liu | Zhiyu Xue | Xitong Zhang | Rongrong Wang | Kristen Johnson

Self-correction is one of the most striking emergent capabilities of Large Language Models (LLMs), enabling LLMs to self-modify an inappropriate output given natural language feedback that describes the problems of that output. Moral self-correction is a post-hoc approach that corrects unethical generations without requiring a gradient update, making it both computationally lightweight and capable of preserving the language modeling ability. Previous works have shown that LLMs can self-debias, and it has been reported that small models, i.e., those with fewer than 22B parameters, are not capable of moral self-correction. However, there is no direct proof as to why such smaller models fall short of moral self-correction, though previous research hypothesizes that larger models are skilled in following instructions and understanding abstract social norms. In this paper, we empirically validate this hypothesis in the context of social stereotyping, through meticulous prompting. Our experimental results indicate that (i) surprisingly, 3.8B LLMs with proper safety alignment fine-tuning can achieve very good moral self-correction performance, highlighting the significant effects of safety alignment; and (ii) small LLMs are indeed weaker than larger-scale models in terms of comprehending social norms and self-explanation through CoT, but all scales of LLMs show poor self-correction performance given unethical instructions.

pdf bib
Error Detection for Multimodal Classification
Thomas Bonnier

Machine learning models have proven to be useful in various key applications such as autonomous driving or diagnosis prediction. When a model is deployed under real-world conditions, it is thus essential to detect potential errors with a trustworthy approach. This monitoring practice renders decision-making safer by avoiding catastrophic failures. In this paper, the focus is on multimodal classification. We introduce a method that addresses error detection based on unlabeled data. It leverages fused representations and computes the probability that a model will fail based on detected fault patterns in validation data. To improve transparency, we employ a sampling-based approximation of Shapley values in multimodal settings in order to explain why a prediction is assessed as erroneous in terms of feature values. Further, as explanation methods can sometimes disagree, we suggest evaluating the consistency of explanations produced by different value functions and algorithms. To show the relevance of our method, we compare it against a selection of 9 baselines from various domains on tabular-text and text-image datasets, and 2 multimodal fusion strategies for the classification models. Lastly, we show the usefulness of our explanation algorithm on misclassified samples.

pdf bib
Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refine
Heegyu Kim | Hyunsouk Cho

Language models (LMs) are vulnerable to exploitation for adversarial misuse. Training LMs for safety alignment is resource-intensive, making it hard to respond immediately to fast-developing attacks such as jailbreaks. We propose self-refine with formatting, which achieves outstanding safety even in non-safety-aligned LMs, and evaluate our method alongside several defense baselines, demonstrating that it is the safest training-free method against jailbreak attacks. Additionally, we propose a formatting method that improves the efficiency of the self-refine process while reducing attack success rates in fewer iterations. We observed that non-safety-aligned LMs outperform safety-aligned LMs in safety tasks by giving more helpful and safe responses. In conclusion, our findings achieve lower safety risk at lower computational cost, allowing non-safety-aligned LMs to be efficiently utilized in real-world services.

pdf bib
Minimal Evidence Group Identification for Claim Verification
Xiangci Li | Sihao Chen | Rajvi Kapadia | Jessica Ouyang | Fan Zhang

When verifying a claim in real-world settings, e.g. against a large collection of candidate evidence text retrieved from the web, a model is typically expected to identify and aggregate a complete set of evidence pieces that collectively provide full support to the claim. The problem becomes particularly challenging as there might exist different sets of evidence that could be used to verify the claim from different perspectives. In this paper, we formally define and study the problem of identifying such minimal evidence groups (MEGs) for fact verification. We show that MEG identification can be reduced to a Set Cover-like problem, based on an entailment model which estimates whether a given evidence group provides full or partial support to a claim. Our proposed approach achieves 18.4% and 34.8% absolute improvements on the WiCE and SciFact datasets over LLM prompting. Finally, we demonstrate the downstream benefit of MEGs in applications such as claim generation.
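
Since the paper reduces MEG identification to a Set Cover-like problem, a natural baseline is the classic greedy approximation, sketched below with `evidence` mapping each piece to the claim facts it supports (standing in for the entailment model's full/partial support judgments).

```python
def greedy_meg(claim_facts: set[str],
               evidence: dict[str, set[str]]) -> set[str]:
    """Approximate a minimal evidence group covering all claim facts."""
    covered: set[str] = set()
    group: set[str] = set()
    while covered != claim_facts:
        # Pick the evidence piece that adds the most uncovered facts.
        best = max(evidence, key=lambda e: len(evidence[e] - covered))
        if not evidence[best] - covered:
            break  # the claim cannot be fully supported
        group.add(best)
        covered |= evidence[best]
    return group

evidence = {"doc1": {"f1", "f2"}, "doc2": {"f2", "f3"}, "doc3": {"f3"}}
print(greedy_meg({"f1", "f2", "f3"}, evidence))  # {'doc1', 'doc2'}
```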

pdf bib
Cracking the Code: Enhancing Implicit Hate Speech Detection through Coding Classification
Lu Wei | Liangzhi Li | Tong Xiang | Liu Xiao | Noa Garcia

The internet has become a hotspot for hate speech (HS), threatening societal harmony and individual well-being. While automatic detection methods perform well in identifying explicit hate speech (ex-HS), they struggle with more subtle forms, such as implicit hate speech (im-HS). We tackle this problem by introducing a new taxonomy for im-HS detection, defining six encoding strategies named *codetypes*. We present two methods for integrating codetypes into im-HS detection: 1) prompting large language models (LLMs) directly to classify sentences based on generated responses, and 2) using LLMs as encoders with codetypes embedded during the encoding process. Experiments show that the use of codetypes improves im-HS detection in both Chinese and English datasets, validating the effectiveness of our approach across different languages.

pdf bib
Line of Duty: Evaluating LLM Self-Knowledge via Consistency in Feasibility Boundaries
Sahil Kale | Vijaykant Nadadur

As LLMs grow more powerful, their most profound achievement may be recognising when to say “I don’t know”. Existing studies on LLM self-knowledge have been largely constrained by human-defined notions of feasibility, often neglecting the reasons behind unanswerability by LLMs and failing to study deficient types of self-knowledge. This study aims to obtain intrinsic insights into different types of LLM self-knowledge with a novel methodology: allowing them the flexibility to set their own feasibility boundaries and then analysing the consistency of these limits. We find that even frontier models like GPT-4o and Mistral Large are not sure of their own capabilities more than 80% of the time, highlighting a significant lack of trustworthiness in responses. Our analysis of confidence balance in LLMs indicates that models swing between overconfidence and conservatism in feasibility boundaries depending on task categories and that the most significant self-knowledge weaknesses lie in temporal awareness and contextual understanding. These difficulties in contextual comprehension additionally lead models to question their operational boundaries, resulting in considerable confusion within the self-knowledge of LLMs. We make our code and results available publicly.

pdf bib
Multi-lingual Multi-turn Automated Red Teaming for LLMs
Abhishek Singhania | Christophe Dupuy | Shivam Sadashiv Mangale | Amani Namboori

Large Language Models (LLMs) have improved dramatically in the past few years, increasing their adoption and the scope of their capabilities over time. A significant amount of work is dedicated to “model alignment”, i.e., preventing LLMs from generating unsafe responses when deployed into customer-facing applications. One popular method to evaluate safety risks is red-teaming, where agents attempt to bypass alignment by crafting elaborate prompts that trigger unsafe responses from a model. Standard human-driven red-teaming is costly, time-consuming and rarely covers all the recent features (e.g., multi-lingual, multi-modal aspects), while proposed automation methods only cover a small subset of LLMs’ capabilities (i.e., English or single-turn). We present Multi-lingual Multi-turn Automated Red Teaming (MM-ART), a method to fully automate conversational, multi-lingual red-teaming operations and quickly identify prompts leading to unsafe responses. Through extensive experiments on different languages, we show the studied LLMs are on average 71% more vulnerable after a 5-turn conversation in English than after the initial turn. For conversations in non-English languages, models display up to 195% more safety vulnerabilities than the standard single-turn English approach, confirming the need for automated red-teaming methods matching LLMs’ capabilities.

pdf bib
Rainbow-Teaming for the Polish Language: A Reproducibility Study
Aleksandra Krasnodębska | Maciej Chrabaszcz | Wojciech Kusa

The development of multilingual large language models (LLMs) presents challenges in evaluating their safety across all supported languages. Enhancing safety in one language (e.g., English) may inadvertently introduce vulnerabilities in others. To address this issue, we implement a methodology for the automatic creation of red-teaming datasets for safety evaluation in Polish. Our approach generates both harmful and non-harmful prompts by sampling different risk categories and attack styles. We test several open-source models, including those trained on Polish data, and evaluate them using metrics such as Attack Success Rate (ASR) and False Reject Rate (FRR). The results reveal clear gaps in safety performance between models and show that better testing across languages is needed.

pdf bib
BiasEdit: Debiasing Stereotyped Language Models via Model Editing
Xin Xu | Wei Xu | Ningyu Zhang | Julian McAuley

Previous studies have established that language models manifest stereotyped biases. Existing debiasing strategies, such as retraining a model with counterfactual data, representation projection, and prompting, often fail to efficiently eliminate bias or to directly alter the models’ biased internal representations. To address these issues, we propose BiasEdit, an efficient model editing method to remove stereotypical bias from language models through lightweight networks that act as editors to generate parameter updates. BiasEdit employs a *debiasing loss* guiding editor networks to conduct local edits on partial parameters of a language model for debiasing, while preserving the language modeling abilities during editing through a *retention loss*. Experiments on StereoSet and Crows-Pairs demonstrate the effectiveness, efficiency, and robustness of BiasEdit in eliminating bias compared to tangential debiasing baselines, with little to no impact on the language models’ general capabilities. In addition, we conduct bias tracing to probe bias in various modules and explore the impact of bias editing on different components of language models.

pdf bib
Do Voters Get the Information They Want? Understanding Authentic Voter FAQs in the US and How to Improve for Informed Electoral Participation
Vipula Rawte | Deja N Scott | Gaurav Kumar | Aishneet Juneja | Bharat Sowrya Yaddanapalli | Biplav Srivastava

Accurate information is crucial for democracy as it empowers voters to make informed decisions about their representatives and to hold them accountable. In the US, state election commissions (SECs), often required by law, are the primary providers of Frequently Asked Questions (FAQs) to voters, and secondary sources like non-profits such as the League of Women Voters (LWV) try to complement their information shortfall. However, surprisingly, to the best of our knowledge, there is neither a single source with comprehensive FAQs nor a study analyzing the data at the national level to identify current practices and ways to improve the status quo. This paper addresses this gap by providing the first dataset on Voter FAQs covering all the US states. Second, we introduce metrics for FAQ information quality (FIQ) with respect to questions, answers, and answers to corresponding questions. Third, we use FIQs to analyze US FAQs to identify leading, mainstream and lagging content practices and the corresponding states. Finally, we identify what states across the spectrum can do to improve FAQ quality and thus the overall information ecosystem. Across all 50 U.S. states, 12% were identified as leaders and 8% as laggards for FIQS-voter, while 14% were leaders and 12% laggards for FIQS-developer. The code and sample data are provided at https://anonymous.4open.science/r/election-qa-analysis-BE4E.

pdf bib
ViBe: A Text-to-Video Benchmark for Evaluating Hallucination in Large Multimodal Models
Vipula Rawte | Sarthak Jain | Aarush Sinha | Garv Kaushik | Aman Bansal | Prathiksha Rumale Vishwanath | Samyak Rajesh Jain | Aishwarya Naresh Reganti | Vinija Jain | Aman Chadha | Amit Sheth | Amitava Das

Recent advances in Large Multimodal Models (LMMs) have expanded their capabilities to video understanding, with Text-to-Video (T2V) models excelling in generating videos from textual prompts. However, they still frequently produce hallucinated content, revealing AI-generated inconsistencies. We introduce ViBe (https://huggingface.co/datasets/ViBe-T2V-Bench/ViBe), a large-scale dataset of hallucinated videos from open-source T2V models. We identify five major hallucination types: Vanishing Subject, Omission Error, Numeric Variability, Subject Dysmorphia, and Visual Incongruity. Using ten T2V models, we generated and manually annotated 3,782 videos from 837 diverse MS COCO captions. Our proposed benchmark includes a dataset of hallucinated videos and a classification framework using video embeddings. ViBe serves as a critical resource for evaluating T2V reliability and advancing hallucination detection. We establish classification as a baseline, with the TimeSFormer + CNN ensemble achieving the best performance (0.345 accuracy, 0.342 F1 score). While the initial baselines achieve only modest accuracy, this highlights the difficulty of automated hallucination detection and the need for improved methods. Our research aims to drive the development of more robust T2V models and evaluate their outputs based on user preferences. Our code is available at: https://anonymous.4open.science/r/vibe-1840/

pdf bib
Know What You do Not Know: Verbalized Uncertainty Estimation Robustness on Corrupted Images in Vision-Language Models
Mirko Borszukovszki | Ivo Pascal De Jong | Matias Valdenegro-Toro

To leverage the full potential of Large Language Models (LLMs), it is crucial to have some information on the uncertainty of their answers. This means that the model has to be able to quantify how certain it is in the correctness of a given response. Bad uncertainty estimates can lead to overconfident wrong answers, undermining trust in these models. Considerable research has been done on language models that work with text inputs and provide text outputs. However, since visual capabilities have been added to these models only recently, there has not been much progress on the uncertainty of Visual Language Models (VLMs). We tested three state-of-the-art VLMs on corrupted image data. We found that the severity of the corruption negatively impacted the models’ ability to estimate their uncertainty, and the models also showed overconfidence in most of the experiments.

pdf bib
Summary the Savior: Harmful Keyword and Query-based Summarization for LLM Jailbreak Defense
Shagoto Rahman | Ian Harris

Large Language Models (LLMs) are widely used for their capabilities, but face threats from jailbreak attacks, which exploit LLMs to generate inappropriate information and bypass their defense systems. Existing defenses are often specific to particular jailbreak attacks; as a result, a robust, attack-independent solution is needed to address both Natural Language Processing (NLP) ambiguities and attack variability. In this study, we introduce Summary the Savior, a novel jailbreak detection mechanism leveraging harmful keywords and query-based security-aware summary classification. By analyzing the illegal and improper contents of prompts within their summaries, the proposed method remains robust against attack diversity and NLP ambiguities. Two novel datasets, for harmful keyword extraction and for security-aware summaries, have been generated using GPT-4 and Llama-3.1 70B, respectively. Moreover, an “ambiguous harmful” class has been introduced to address content and intent ambiguities. Evaluation results demonstrate that Summary the Savior achieves higher defense performance, outperforming state-of-the-art defense mechanisms, namely Perplexity Filtering, SmoothLLM, and Erase-and-Check, with the lowest attack success rates across various jailbreak attacks, namely PAIR, GCG, JBC and Random Search, on Llama-2, Vicuna-13B and GPT-4. Our codes, models, and results are available at: https://github.com/shrestho10/SummaryTheSavior

pdf bib
Bias A-head? Analyzing Bias in Transformer-Based Language Model Attention Heads
Yi Yang | Hanyu Duan | Ahmed Abbasi | John P. Lalor | Kar Yan Tam

Transformer-based pretrained large language models (PLMs) such as BERT and GPT have achieved remarkable success in NLP tasks. However, PLMs are prone to encoding stereotypical biases. Although a burgeoning literature has emerged on stereotypical bias mitigation in PLMs, such as work on debiasing gender and racial stereotyping, how such biases manifest and behave internally within PLMs remains largely unknown. Understanding the internal stereotyping mechanisms may allow better assessment of model fairness and guide the development of effective mitigation strategies. In this work, we focus on attention heads, a major component of the Transformer architecture, and propose a bias analysis framework to explore and identify a small set of biased heads that are found to contribute to a PLM’s stereotypical bias. We conduct extensive experiments to validate the existence of these biased heads and to better understand how they behave. We investigate gender and racial bias in the English language in two types of Transformer-based PLMs: the encoder-based BERT model, and the decoder-based autoregressive GPT-style models LLaMA-2 (7B) and LLaMA-2-Chat (7B). Overall, the results shed light on understanding the bias behavior in pretrained language models.

pdf bib
Mimicking How Humans Interpret Out-of-Context Sentences Through Controlled Toxicity Decoding
Maria Mihaela Trusca | Liesbeth Allein

Interpretations of a single sentence can vary, particularly when its context is lost. This paper aims to simulate how readers perceive content with varying toxicity levels by generating diverse interpretations of out-of-context sentences. By modeling toxicity we can anticipate misunderstandings and reveal hidden toxic meanings. Our proposed decoding strategy explicitly controls toxicity in the set of generated interpretations by (i) aligning interpretation toxicity with the input, (ii) relaxing toxicity constraints for more toxic input sentences, and (iii) promoting diversity in toxicity levels within the set of generated interpretations. Experimental results show that our method improves alignment with human-written interpretations in both syntax and semantics while reducing model prediction uncertainty.

pdf bib
On the Robustness of Agentic Function Calling
Ella Rabinovich | Ateret Anaby Tavor

Large Language Models (LLMs) are increasingly acting as autonomous agents, with function calling (FC) capabilities enabling them to invoke specific tools for tasks. While prior research has primarily focused on improving FC accuracy, little attention has been given to the robustness of these agents to perturbations in their input. We introduce a benchmark assessing FC robustness in two key areas: resilience to naturalistic query variations, and stability in function calling when the toolkit expands with semantically related tools. Evaluating best-performing FC models on a carefully expanded subset of the Berkeley function calling leaderboard (BFCL), we identify critical weaknesses in existing evaluation methodologies, and highlight areas for improvement in real-world agentic deployments.

pdf bib
Monte Carlo Temperature: a robust sampling strategy for LLM’s uncertainty quantification methods
Nicola Cecere | Andrea Bacciu | Ignacio Fernández-Tobías | Amin Mantrach

Uncertainty quantification (UQ) in Large Language Models (LLMs) is essential for their safe and reliable deployment, particularly in critical applications where incorrect outputs can have serious consequences. Current UQ methods typically rely on querying the model multiple times using non-zero temperature sampling to generate diverse outputs for uncertainty estimation. However, the impact of selecting a given temperature parameter is understudied, and our analysis reveals that temperature plays a fundamental role in the quality of uncertainty estimates. The conventional approach of identifying optimal temperature values requires expensive hyperparameter optimization (HPO) that must be repeated for each new model-dataset combination. We propose Monte Carlo Temperature (MCT), a robust sampling strategy that eliminates the need for temperature calibration. Our analysis reveals that: 1) MCT provides more robust uncertainty estimates across a wide range of temperatures, 2) MCT improves the performance of UQ methods by replacing fixed-temperature strategies that do not rely on HPO, and 3) MCT achieves statistical parity with oracle temperatures, which represent the ideal outcome of a well-tuned but computationally expensive HPO process. These findings demonstrate that effective UQ can be achieved without the computational burden of temperature parameter calibration.
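
A minimal sketch of the MCT idea as described, assuming only a `generate(prompt, temperature=...)` wrapper around a model API: draw a fresh temperature per sample instead of fixing one, then pool the outputs for uncertainty estimation. The temperature range and the crude disagreement proxy are ours.

```python
from collections import Counter
import numpy as np

def mct_samples(generate, prompt: str, n: int = 10,
                t_low: float = 0.3, t_high: float = 1.5, seed: int = 0):
    """Sample n generations, each with a freshly drawn temperature."""
    rng = np.random.default_rng(seed)
    return [generate(prompt, temperature=float(rng.uniform(t_low, t_high)))
            for _ in range(n)]

def disagreement(outputs: list[str]) -> float:
    """Crude uncertainty proxy: share of non-majority answers."""
    top = Counter(outputs).most_common(1)[0][1]
    return 1.0 - top / len(outputs)

# Usage: u = disagreement(mct_samples(llm_call, question))
# where llm_call(prompt, temperature=...) wraps your model API.
```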

pdf bib
Know Thyself: Validating Knowledge Awareness of LLM-based Persona Agents
Savita Bhat | Ishaan Shukla | Shirish Karande

Large Language Models (LLMs) have demonstrated remarkable capability in simulating human behaviors, personality, and language. Such synthetic agents with personalities are considered cost-effective proxies for real users to facilitate crowd-sourcing efforts like annotations, surveys, and A/B testing. Accordingly, it is imperative to validate the knowledge awareness of these LLM persona agents when they are customized for further usage. Currently, there is no established way for such evaluation and appropriate mitigation. In this work, we propose a generic evaluation approach to validate LLM-based persona agents for correctness, relevance, and diversity in the context of self-awareness and domain knowledge. We evaluate the efficacy of this framework using three LLMs (Llama, GPT-4o, and Gemma) for domains such as air travel, gaming, and fitness. We also experiment with advanced prompting strategies such as ReAct and Reflexion. We find that though GPT-4o and Llama demonstrate comparable performance, they fail some basic consistency checks under certain perturbations.

pdf bib
Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models
Alberto Purpura | Sahil Wadhwa | Jesse Zymet | Akshay Gupta | Andy Luo | Melissa Kazemi Rad | Swapnil Shinde | Mohammad Shahed Sorower

The rapid growth of Large Language Models (LLMs) presents significant privacy, security, and ethical concerns. While much research has proposed methods for defending LLM systems against misuse by malicious actors, researchers have recently complemented these efforts with an offensive approach that involves red teaming, i.e., proactively attacking LLMs with the purpose of identifying their vulnerabilities. This paper provides a concise and practical overview of the LLM red teaming literature, structured so as to describe a multi-component system end-to-end. To motivate red teaming we survey the initial safety needs of some high-profile LLMs, and then dive into the different components of a red teaming system as well as software packages for implementing them. We cover various attack methods, strategies for attack-success evaluation, metrics for assessing experiment outcomes, as well as a host of other considerations. Our survey will be useful for any reader who wants to rapidly obtain a grasp of the major red teaming concepts for their own use in practical applications.

pdf bib
Difficulty Estimation in Natural Language Tasks with Action Scores
Aleksandar Angelov | Tsegaye Misikir Tashu | Matias Valdenegro-Toro

This study investigates the effectiveness of the action score, a metric originally developed for computer vision tasks, in estimating sample difficulty across various natural language processing (NLP) tasks. Using transformer-based models, the action score is applied to sentiment analysis, natural language inference, and abstractive text summarization. The results demonstrate that the action score can effectively identify challenging samples in sentiment analysis and natural language inference, often capturing difficult instances that are missed by more established metrics like entropy. However, the effectiveness of the action score appears to be task-dependent, as evidenced by its performance in the abstractive text summarization task, where it exhibits a nearly linear relationship with entropy. The findings suggest that the action score can provide valuable insights into the characteristics of challenging samples in NLP tasks, particularly in classification settings. However, its application should be carefully considered in the context of each specific task and in light of emerging research on the potential value of hard samples in machine learning.

pdf bib
Are Small Language Models Ready to Compete with Large Language Models for Practical Applications?
Neelabh Sinha | Vinija Jain | Aman Chadha

The rapid rise of Language Models (LMs) has expanded their use in several applications. Yet, due to constraints of model size, associated cost, or proprietary restrictions, utilizing state-of-the-art (SOTA) LLMs is not always feasible. With open, smaller LMs emerging, more applications can leverage their capabilities, but selecting the right LM can be challenging as smaller LMs do not perform well universally. This work tries to bridge this gap by proposing a framework to experimentally evaluate small, open LMs in practical settings through measuring the semantic correctness of outputs across three practical aspects: task types, application domains, and reasoning types, using diverse prompt styles. It also conducts an in-depth comparison of 10 small, open LMs to identify the best LM and prompt style for specific application requirements using the proposed framework. We also show that, if selected appropriately, they can outperform SOTA LLMs like DeepSeek-v2, GPT-4o-mini, Gemini-1.5-Pro, and even compete with GPT-4o.

pdf bib
A Calibrated Reflection Approach for Enhancing Confidence Estimation in LLMs
Umesh Bodhwani | Yuan Ling | Shujing Dong | Yarong Feng | Hongfei Li | Ayush Goyal

A critical challenge in deploying Large Language Models (LLMs) is developing reliable mechanisms to estimate their confidence, enabling systems to determine when to trust model outputs and when to seek human intervention. In this paper, we present a Calibrated Reflection Approach for Enhancing Confidence Estimation in LLMs, a framework that combines structured reasoning with distance-aware calibration techniques. Our approach introduces three key innovations: (1) a Maximum Confidence Selection (MCS) method that comprehensively evaluates confidence across all possible labels, (2) a reflection-based prompting mechanism that enhances reasoning reliability, and (3) a distance-aware calibration technique that accounts for ordinal relationships between labels. We evaluate our framework across diverse datasets, including HelpSteer2, Llama T-REx, and an internal conversational dataset, demonstrating its effectiveness across both conversational and fact-based classification tasks. This work contributes to the broader goal of developing reliable and well-calibrated confidence estimation methods for LLMs, enabling informed decisions about when to trust model outputs and when to defer to human judgement.

pdf bib
Evaluating Design Choices in Verifiable Generation with Open-source Models
Shuyang Cao | Lu Wang

Verifiable generation is introduced to improve the transparency and trustworthiness of outputs produced by large language models (LLMs). Recent studies observe that open-source models struggle to include accurate citations to supporting documents in their generation with in-context learning, in contrast to the strong performance demonstrated by proprietary models. Our work aims to reveal the critical design choices that can benefit open-source models, including generation pipelines, fine-tuning methods, and inference-time compute techniques. We consider three generation pipelines, producing the outputs directly or decomposing the generation into subtasks. These generation pipelines are fine-tuned using supervised fine-tuning and preference-based optimization, including further fine-tuning with rejection sampling data and direct preference optimization (DPO). The construction of preference data with varying content and citation diversity is also investigated. Additionally, we examine the benefit of an additional reranking step. With four open-source models, our experiments show that directly generating the outputs achieves the best performance. Compared to other fine-tuning methods, DPO, which computes training signals from contrastive pairs, consistently yields better performance, and it reaches peak performance when the contrastive pairs are constructed with sufficient content diversity. We also find that reranking can further boost the performance of verifiable generation systems, but the marginal improvement might not justify the additional cost.

pdf bib
Battling Misinformation: An Empirical Study on Adversarial Factuality in Open-Source Large Language Models
Shahnewaz Karim Sakib | Anindya Bijoy Das | Shibbir Ahmed

Adversarial factuality refers to the deliberate insertion of misinformation into input prompts by an adversary, characterized by varying levels of expressed confidence. In this study, we systematically evaluate the performance of several open-source large language models (LLMs) when exposed to such adversarial inputs. Three tiers of adversarial confidence are considered: strongly confident, moderately confident, and limited confidence. Our analysis encompasses eight LLMs: LLaMA 3.1 (8B), Phi 3 (3.8B), Qwen 2.5 (7B), Deepseek-v2 (16B), Gemma2 (9B), Falcon (7B), Mistrallite (7B), and LLaVA (7B). Empirical results indicate that LLaMA 3.1 (8B) exhibits a robust capability in detecting adversarial inputs, whereas Falcon (7B) shows comparatively lower performance. Notably, for the majority of the models, detection success improves as the adversary’s confidence decreases; however, this trend is reversed for LLaMA 3.1 (8B) and Phi 3 (3.8B), where a reduction in adversarial confidence corresponds with diminished detection performance. Further analysis of the queries that elicited the highest and lowest rates of successful attacks reveals that adversarial attacks are more effective when targeting less commonly referenced or obscure information.

pdf bib
Will the Prince Get True Love’s Kiss? On the Model Sensitivity to Gender Perturbation over Fairytale Texts
Christina A Chance | Da Yin | Dakuo Wang | Kai-Wei Chang

In this paper, we study whether language models are affected by learned gender stereotypes during the comprehension of stories. Specifically, we investigate how models respond to gender stereotype perturbations through counterfactual data augmentation. Focusing on Question Answering (QA) tasks in fairytales, we modify the FairytaleQA dataset by swapping gendered character information and introducing counterfactual gender stereotypes during training. This allows us to assess model robustness and examine whether learned biases influence story comprehension. Our results show that models exhibit slight performance drops when faced with gender perturbations in the test set, indicating sensitivity to learned stereotypes. However, when fine-tuned on counterfactual training data, models become more robust to anti-stereotypical narratives. Additionally, we conduct a case study demonstrating how incorporating counterfactual anti-stereotype examples can improve inclusivity in downstream applications.

pdf bib
Disentangling Linguistic Features with Dimension-Wise Analysis of Vector Embeddings
Saniya Karwa | Navpreet Singh

Understanding the inner workings of neural embeddings, particularly in models such as BERT, remains a challenge because of their high-dimensional and opaque nature. This paper proposes a framework for uncovering the specific dimensions of vector embeddings that encode distinct linguistic properties (LPs). We introduce the Linguistically Distinct Sentence Pairs (LDSP-10) dataset, which isolates ten key linguistic features such as synonymy, negation, tense, and quantity. Using this dataset, we analyze BERT embeddings with various methods, including the Wilcoxon signed-rank test, mutual information, and recursive feature elimination, to identify the most influential dimensions for each LP. We introduce a new metric, the Embedding Dimension Impact (EDI) score, which quantifies the relevance of each embedding dimension to an LP. Our findings show that certain properties, such as negation and polarity, are robustly encoded in specific dimensions, while others, like synonymy, exhibit more complex patterns. This study provides insights into the interpretability of embeddings, which can guide the development of more transparent and optimized language models, with implications for model bias mitigation and the responsible deployment of AI systems.
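
The per-dimension analysis can be pictured as follows: for LDSP-style paired sentences, test each embedding dimension for a systematic shift with the Wilcoxon signed-rank test and rank dimensions accordingly. The EDI score itself is defined in the paper; this sketch only reproduces the ranking step on synthetic data.

```python
import numpy as np
from scipy.stats import wilcoxon

def influential_dims(emb_a: np.ndarray, emb_b: np.ndarray, k: int = 5):
    """emb_a, emb_b: (n_pairs, dim) embeddings of paired sentences."""
    pvals = np.array([wilcoxon(emb_a[:, d], emb_b[:, d]).pvalue
                      for d in range(emb_a.shape[1])])
    return np.argsort(pvals)[:k]  # dimensions with the strongest shift

rng = np.random.default_rng(0)
a = rng.normal(size=(50, 16))
b = a + rng.normal(scale=0.05, size=(50, 16))
b[:, 3] += 1.0  # plant a "negation dimension"
print(influential_dims(a, b))  # dimension 3 should rank first
```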

pdf bib
Gender Encoding Patterns in Pretrained Language Model Representations
Mahdi Zakizadeh | Mohammad Taher Pilehvar

Gender bias in pretrained language models (PLMs) poses significant social and ethical challenges. Despite growing awareness, there is a lack of comprehensive investigation into how different models internally represent and propagate such biases. This study adopts an information-theoretic approach to analyze how gender biases are encoded within various encoder-based architectures. We focus on three key aspects: identifying how models encode gender information and biases, examining the impact of bias mitigation techniques and fine-tuning on the encoded biases and their effectiveness, and exploring how model design differences influence the encoding of biases. Through rigorous and systematic investigation, our findings reveal a consistent pattern of gender encoding across diverse models. Surprisingly, debiasing techniques often exhibit limited efficacy, sometimes inadvertently increasing the encoded bias in internal representations while reducing bias in model output distributions. This highlights a disconnect between mitigating bias in output distributions and addressing its internal representations. This work provides valuable guidance for advancing bias mitigation strategies and fostering the development of more equitable language models.

pdf bib
Defining and Quantifying Visual Hallucinations in Vision-Language Models
Vipula Rawte | Aryan Mishra | Amit Sheth | Amitava Das

The troubling rise of hallucination presents perhaps the most significant impediment to the advancement of responsible AI. In recent times, considerable research has focused on detecting and mitigating hallucination in Large Language Models (LLMs). However, it’s worth noting that hallucination is also quite prevalent in Vision-Language models (VLMs). In this paper, we offer a fine-grained discourse on profiling VLM hallucination based on the image captioning task. We delineate eight fine-grained orientations of visual hallucination: i) Contextual Guessing, ii) Identity Incongruity, iii) Geographical Erratum, iv) Visual Illusion, v) Gender Anomaly, vi) VLM as Classifier, vii) Wrong Reading, and viii) Numeric Discrepancy. We curate Visual HallucInation eLiciTation, a publicly available dataset comprising 2,000 samples generated using eight VLMs across the image captioning task, along with human annotations for the categories as mentioned earlier. To establish a method for quantification and to offer a comparative framework enabling the evaluation and ranking of VLMs according to their vulnerability to producing hallucinations, we propose the Visual Hallucination Vulnerability Index (VHVI). In summary, we introduce the VHILT dataset for image-to-text hallucinations and propose the VHVI metric to quantify hallucinations in VLMs, targeting specific visual hallucination types. A subset sample is available at: https://huggingface.co/datasets/vr25/vhil. The full dataset will be publicly released upon acceptance.

pdf bib
Revitalizing Saturated Benchmarks: A Weighted Metric Approach for Differentiating Large Language Model Performance
Bryan Etzine | Masoud Hashemi | Nishanth Madhusudhan | Sagar Davasam | Roshnee Sharma | Sathwik Tejaswi Madhusudhan | Vikas Yadav

Existing benchmarks are becoming saturated and less effective in evaluating model performance due to factors such as data contamination and the advancing capabilities of the Large Language Models (LLMs). This paper introduces EMDM (Enhanced Model Differentiation Metric), a novel weighted metric designed to revitalize existing benchmarks. EMDM implements a weighting schema for samples based on their complexity and requisite knowledge, utilizing the performance of a baseline LLM in two experimental setups: “Unguided”, where the model has no prior exposure to test samples, and “Guided”, where the model has prior knowledge about the desired answer. This schema is leveraged in an optimization objective to assign weights to test samples, distinguishing instances of varying complexity. EMDM accounts for both answer correctness and the depth and accuracy of reasoning, offering a more nuanced evaluation of model performance. By weighting test examples based on their required reasoning and knowledge, EMDM achieves a distinguishing range of evaluation scores of 46% among various LLMs, compared to just 17% with traditional exact match (EM) metrics, thereby highlighting the saturation of current evaluation methods.
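
Our hedged reading of the weighting idea, as a toy: test samples the baseline LLM solves even in the Unguided setup get the smallest weight, samples it solves only when Guided get a middle weight, and samples it fails either way get the largest. The paper derives its weights from an optimization objective; the fixed three-level scheme below is purely illustrative.

```python
import numpy as np

def emdm_score(correct: np.ndarray,
               unguided_ok: np.ndarray,
               guided_ok: np.ndarray) -> float:
    """correct: evaluated model's outcomes; *_ok: baseline outcomes (0/1)."""
    # Harder samples (baseline fails even when Guided) weigh more.
    w = np.where(unguided_ok, 1.0, np.where(guided_ok, 2.0, 3.0))
    return float((w * correct).sum() / w.sum())

correct  = np.array([1, 1, 0, 1])
unguided = np.array([1, 0, 0, 0])
guided   = np.array([1, 1, 0, 1])
print(emdm_score(correct, unguided, guided))  # weighted accuracy: 0.625
```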

pdf bib
Synthetic Lyrics Detection Across Languages and Genres
Yanis Labrak | Markus Frohmann | Gabriel Meseguer-Brocal | Elena V. Epure

In recent years, the use of large language models (LLMs) to generate music content, particularly lyrics, has gained in popularity. These advances provide valuable tools for artists and enhance their creative processes, but they also raise concerns about copyright violations, consumer satisfaction, and content spamming. Previous research has explored content detection in various domains; however, no prior work has focused on the text modality of music: lyrics. To address this gap, we curated a diverse dataset of real and synthetic lyrics from multiple languages, music genres, and artists. The generation pipeline was validated using both human and automated methods. We performed a thorough evaluation of existing synthetic text detection approaches on lyrics, a previously unexplored data type. We also investigated methods to adapt the best-performing features to lyrics through unsupervised domain adaptation. Following both musical and industrial constraints, we examined how well these approaches generalize across languages, scale with data availability, handle multilingual language content, and perform on novel genres in few-shot settings. Our findings show promising results that could inform policy decisions around AI-generated music and enhance transparency for users.

pdf bib
A Lightweight Multi Aspect Controlled Text Generation Solution For Large Language Models
Chenyang Zhang | Jiayi Lin | Haibo Tong | Bingxuan Hou | Dongyu Zhang | Jialin Li | Junli Wang

Multi-Aspect Controllable Text Generation (MCTG) introduces fine-grained multiple constraints in natural language generation, i.e., control over attributes such as topic, sentiment, and detoxification. MCTG demonstrates application prospects for trustworthy generation with Large Language Models (LLMs) but is limited by generalization issues. Existing work exploits additional structures and strategies as solutions, requiring modifications to the LLMs. To activate LLMs’ MCTG ability, we propose a lightweight MCTG pipeline based on data augmentation and instruction tuning. We analyze aspect bias and correlations in traditional datasets and address these concerns with augmented control attributes and sentences. The augmented datasets are suitable for instruction tuning. We conduct experiments for various LLM backbones and parameter sizes, demonstrating general effectiveness on MCTG performance.

pdf bib
Gender Bias in Large Language Models across Multiple Languages: A Case Study of ChatGPT
YiTian Ding | Jinman Zhao | Chen Jia | Yining Wang | Zifan Qian | Weizhe Chen | Xingyu Yue

With the growing deployment of large language models (LLMs) across various applications, assessing the influence of gender biases embedded in LLMs becomes crucial. The topic of gender bias within the realm of natural language processing (NLP) has gained considerable focus, particularly in the context of English. Nonetheless, the investigation of gender bias in languages other than English is still relatively under-explored and insufficiently analyzed. In this work, we examine gender bias in LLM-generated outputs for different languages. We use three measurements: 1) gender bias in selecting descriptive words given a gender-related context; 2) gender bias in selecting gender-related pronouns (she/he) given the descriptive words; and 3) gender bias in the topics of LLM-generated dialogues. We investigate the outputs of the GPT series of LLMs in various languages using our three measurement methods. Our findings reveal significant gender biases across all the languages we examined.

pdf bib
Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation
Neeraj Varshney | Satyam Raj | Venkatesh Mishra | Agneet Chatterjee | Amir Saeidi | Ritika Sarkar | Chitta Baral

Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks. However, they have been shown to suffer from a critical limitation pertaining to ‘hallucination’ in their output. Recent research has focused on investigating and addressing this problem for a variety of tasks such as biography generation, question answering, abstractive summarization, and dialogue generation. However, the crucial aspect pertaining to ‘negation’ has remained considerably underexplored. Negation is important because it adds depth and nuance to the understanding of language and is also crucial for logical reasoning and inference. In this work, we address the above limitation and particularly focus on studying the impact of negation in LLM hallucinations. Specifically, we study four tasks with negation: ‘false premise completion’, ‘constrained fact generation’, ‘multiple choice question answering’, and ‘fact generation’. We show that open-source state-of-the-art LLMs such as LLaMA-2-chat, Vicuna, and Orca-2 hallucinate considerably on all these tasks involving negation, which underlines a critical shortcoming of these models. Addressing this problem, we further study numerous strategies to mitigate these hallucinations and demonstrate their impact.

pdf bib
FACTOID: FACtual enTailment fOr hallucInation Detection
Vipula Rawte | S.m Towhidul Islam Tonmoy | Shravani Nag | Aman Chadha | Amit Sheth | Amitava Das


up

pdf (full)
bib (full)
Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects

pdf bib
Proceedings of the 12th Workshop on NLP for Similar Languages, Varieties and Dialects
Yves Scherrer | Tommi Jauhiainen | Nikola Ljubešić | Preslav Nakov | Jorg Tiedemann | Marcos Zampieri

pdf bib
Findings of the VarDial Evaluation Campaign 2025: The NorSID Shared Task on Norwegian Slot, Intent and Dialect Identification
Yves Scherrer | Rob van der Goot | Petter Mæhlum

The VarDial Evaluation Campaign 2025 was organized as part of the twelfth workshop on Natural Language Processing for Similar Languages, Varieties and Dialects (VarDial), colocated with COLING 2025. It consisted of one shared task with three subtasks: intent detection, slot filling and dialect identification for Norwegian dialects. This report presents the results of this shared task. Four participating teams submitted systems with very high performance (>97% accuracy) for intent detection, whereas slot filling and dialect identification proved to be much more challenging, with span-F1 scores up to 89% and weighted dialect F1 scores up to 84%, respectively.

pdf bib
Information Theory and Linguistic Variation: A Study of Brazilian and European Portuguese
Diego Alves

We present a general analysis of the lexical and grammatical differences between Brazilian Portuguese (BP) and European Portuguese (EP) by applying entropy measures, including Kullback-Leibler divergence and word order entropy, across various linguistic levels. Using a parallel corpus of BP and EP sentences translated from English, we quantified these differences and identified characteristic phenomena underlying the divergences between the two varieties. The highest divergence was observed at the lexical level, driven by word pairs unique to each variety but also by grammatical distinctions. Furthermore, the analysis of parts-of-speech (POS), dependency relations, and POS tri-grams provided information concerning distinctive grammatical constructions. Finally, the word order entropy analysis revealed that while most of the syntactic features analysed showed similar patterns across BP and EP, specific word order preferences were still apparent.
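
The lexical-level comparison can be pictured as a smoothed Kullback-Leibler divergence between the two varieties' word distributions, as in the sketch below (add-one smoothing and unigram counts are our simplifications; the paper also applies entropy measures at grammatical levels).

```python
from collections import Counter
from math import log2

def kl_divergence(counts_p: Counter, counts_q: Counter) -> float:
    """KL(P || Q) over a shared vocabulary, with add-one smoothing."""
    vocab = set(counts_p) | set(counts_q)
    n_p = sum(counts_p.values()) + len(vocab)
    n_q = sum(counts_q.values()) + len(vocab)
    return sum(((counts_p[w] + 1) / n_p)
               * log2(((counts_p[w] + 1) / n_p) / ((counts_q[w] + 1) / n_q))
               for w in vocab)

# Classic BP/EP lexical pair: "trem" vs "comboio" ("train").
bp = Counter("o trem chegou na estação".split())
ep = Counter("o comboio chegou à estação".split())
print(round(kl_divergence(bp, ep), 3))
```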

pdf bib
Leveraging Open-Source Large Language Models for Native Language Identification
Yee Man Ng | Ilia Markov

Native Language Identification (NLI) – the task of identifying the native language (L1) of a person based on their writing in the second language (L2) – has applications in forensics, marketing, and second language acquisition. Historically, conventional machine learning approaches that heavily rely on extensive feature engineering have outperformed transformer-based language models on this task. Recently, closed-source generative large language models (LLMs), e.g., GPT-4, have demonstrated remarkable performance on NLI in a zero-shot setting, including promising results in open-set classification. However, closed-source LLMs have many disadvantages, such as high costs and undisclosed nature of training data. This study explores the potential of using open-source LLMs for NLI. Our results indicate that open-source LLMs do not reach the accuracy levels of closed-source LLMs when used out-of-the-box. However, when fine-tuned on labeled training data, open-source LLMs can achieve performance comparable to that of commercial LLMs.

pdf bib
Adapting Whisper for Regional Dialects: Enhancing Public Services for Vulnerable Populations in the United Kingdom
Melissa Torgbi | Andrew Clayman | Jordan J. Speight | Harish Tayyar Madabushi

We collect novel data in the public service domain to evaluate the capability of state-of-the-art automatic speech recognition (ASR) models in capturing regional differences in accents in the United Kingdom (UK), specifically focusing on two accents from Scotland with distinct dialects. This study addresses real-world problems where biased ASR models can lead to miscommunication in public services, disadvantaging individuals with regional accents, particularly those in vulnerable populations. We first examine the out-of-the-box performance of the Whisper large-v3 model on a baseline dataset and our data. We then explore the impact of fine-tuning Whisper on performance in the two UK regions and investigate the effectiveness of existing model evaluation techniques for our real-world application through manual inspection of model errors. We observe that the Whisper model has a higher word error rate (WER) on our test datasets compared to the baseline data, and that fine-tuning on data from a given region improves performance on test data from the same domain and accent. The fine-tuned models also appear to show improved performance when applied to test data from outside the region they were trained on, suggesting that fine-tuned models may be transferable within parts of the UK. Our manual analysis of model outputs reveals the benefits and drawbacks of using WER as an evaluation metric and of fine-tuning to adapt to regional dialects.
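
The core metric in such an evaluation is the word error rate between model transcripts and references; a minimal sketch using the `jiwer` package is below (the paper's exact tooling is not specified, and the example sentences are invented).

```python
import jiwer

# References from accented speech and hypothetical ASR transcripts.
refs = ["i cannae come the day", "she went to the shops"]
hyps = ["i cannot come today", "she went to the shops"]

# Corpus-level WER: total edits divided by total reference words.
print(jiwer.wer(refs, hyps))
```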

pdf bib
Large Language Models as a Normalizer for Transliteration and Dialectal Translation
Md Mahfuz Ibn Alam | Antonios Anastasopoulos

NLP models trained on standardized language data often struggle with variations. We assess various Large Language Models (LLMs) for transliteration and dialectal normalization. Tuning open-source LLMs with as little as 10,000 parallel examples using LoRA can achieve results comparable to or better than closed-source LLMs. We perform dialectal normalization experiments for twelve South Asian languages and dialectal translation experiments for six language continua worldwide. The dialectal normalization task can also be a preliminary step for the downstream dialectal translation task. Among the six languages used in dialectal translation, our approach enables Italian and Swiss German to surpass the baseline model by 21.5 and 25.8 BLEU points, respectively.
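The sketch below shows what LoRA tuning of an open-source LLM on parallel normalization pairs can look like, assuming the Hugging Face transformers and peft libraries; the base model, target modules, and hyperparameters are illustrative choices, not the paper's configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "facebook/opt-350m"  # illustrative open-source base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (OPT naming)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction is trainable

# Each of the parallel examples becomes one training string, e.g.:
example = "Normalize: <dialectal sentence>\nStandard: <normalized sentence>"
```

Because only the low-rank adapter weights are updated, a training set on the order of 10,000 such pairs is enough to shift the model's behaviour without full fine-tuning.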

pdf bib
Testing the Boundaries of LLMs: Dialectal and Language-Variety Tasks
Fahim Faisal | Antonios Anastasopoulos

This study evaluates the performance of large language models (LLMs) on benchmark datasets designed for dialect-specific NLP tasks. Dialectal NLP is a low-resource field, yet it is crucial for evaluating the robustness of language models against linguistic diversity. This work is the first to systematically compare state-of-the-art instruction-tuned LLMs—both open-weight multilingual and closed-weight generative models—with encoder-based models that rely on supervised task-specific fine-tuning for dialectal tasks. We conduct extensive empirical analyses to provide insights into the current LLM landscape for dialect-focused tasks. Our findings indicate that certain tasks, such as dialect identification, are challenging for LLMs to replicate effectively due to the complexity of multi-class setups and the suitability of these tasks for supervised fine-tuning. Additionally, the structure of task labels—whether categorical or continuous scoring—significantly affects model performance. While LLMs excel in tasks like machine reading comprehension, their instruction-following ability declines in simpler tasks like POS tagging when task instructions are inherently complex. Overall, subtle variations in prompt design can greatly impact performance, underscoring the need for careful prompt engineering in dialectal evaluations.

pdf bib
Text Generation Models for Luxembourgish with Limited Data: A Balanced Multilingual Strategy
Alistair Plum | Tharindu Ranasinghe | Christoph Purschke

This paper addresses the challenges in developing language models for less-represented languages, with a focus on Luxembourgish. Despite its active development, Luxembourgish faces digital data scarcity, exacerbated by Luxembourg’s multilingual context. We propose a novel text generation model based on the T5 architecture, combining limited Luxembourgish data with equal amounts, in terms of size and type, of German and French data. We hypothesise that a model trained on Luxembourgish, German, and French will have improved cross-lingual transfer learning capabilities and outperform monolingual and large multilingual models. To verify this, the study at hand explores whether multilingual or monolingual training is more beneficial for Luxembourgish language generation. For the evaluation, we introduce LuxGen, a text generation benchmark that is the first of its kind for Luxembourgish.

pdf bib
Retrieval of Parallelizable Texts Across Church Slavic Variants
Piroska Lendvai | Uwe Reichel | Anna Jouravel | Achim Rabus | Elena Renje

The goal of our study is to identify parallelizable texts for Church Slavic, across chronological and regional variants. Next to using a benchmark text, we utilize a recently digitized, large text collection and compile new resources for the retrieval of similar texts: a ground truth dataset holding a small number of manually aligned sentences in Old Church Slavic and in Old East Slavic, and a large unaligned dataset that has a subset of ground truth (GT) quality texts but contains noise from handwritten text recognition (HTR) for the majority of the collection. We discuss preprocessing challenges in the data and the impact of sentence segmentation on retrieval performance. We evaluate sentence snippets mapped across these two diachronic variants of Church Slavic, scoring retrieval quality by mean reciprocal rank, using embedding representations from large language models (LLMs) as well as classical string similarity based approaches combined with k-nearest neighbor (kNN) search. Experimental results indicate that in the current setup (short text snippets, off-the-shelf multilingual embeddings), classical string similarity based retrieval can still outperform embedding based retrieval.
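A self-contained sketch of the classical baseline described above, character n-gram similarity with nearest-neighbour search scored by mean reciprocal rank, assuming scikit-learn; the snippets are invented stand-ins rather than Church Slavic data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Toy stand-ins for aligned snippets in two diachronic variants.
src = ["priide k nemu", "i reche jemu", "vu vremen ono"]
tgt = ["pride k nemu", "i reche emu", "vo vremya ono"]

vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
tgt_matrix = vec.fit_transform(tgt)
knn = NearestNeighbors(n_neighbors=len(tgt), metric="cosine").fit(tgt_matrix)

_, neighbours = knn.kneighbors(vec.transform(src))
# Gold alignment: snippet i in src corresponds to snippet i in tgt.
mrr = sum(1.0 / (list(row).index(i) + 1)
          for i, row in enumerate(neighbours)) / len(src)
print(f"MRR = {mrr:.3f}")
```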

pdf bib
Neural Text Normalization for Luxembourgish Using Real-Life Variation Data
Anne-Marie Lutgen | Alistair Plum | Christoph Purschke | Barbara Plank

Orthographic variation is very common in Luxembourgish texts due to the absence of a fully-fledged standard variety. Additionally, developing NLP tools for Luxembourgish is a difficult task given the lack of annotated and parallel data, which is exacerbated by ongoing standardization. In this paper, we propose the first sequence-to-sequence normalization models using the ByT5 and mT5 architectures with training data obtained from word-level real-life variation data. We perform a fine-grained, linguistically-motivated evaluation to test byte-based, word-based and pipeline-based models for their strengths and weaknesses in text normalization. We show that our sequence model using real-life variation data is an effective approach for tailor-made normalization in Luxembourgish.

pdf bib
Improving Dialectal Slot and Intent Detection with Auxiliary Tasks: A Multi-Dialectal Bavarian Case Study
Xaver Maria Krückl | Verena Blaschke | Barbara Plank

Reliable slot and intent detection (SID) is crucial in natural language understanding for applications like digital assistants. Encoder-only transformer models fine-tuned on high-resource languages generally perform well on SID. However, they struggle with dialectal data, where no standardized form exists and training data is scarce and costly to produce. We explore zero-shot transfer learning for SID, focusing on multiple Bavarian dialects, for which we release a new dataset for the Munich dialect. We evaluate models trained on auxiliary tasks in Bavarian, and compare joint multi-task learning with intermediate-task training. We also compare three types of auxiliary tasks: token-level syntactic tasks, named entity recognition (NER), and language modelling. We find that the included auxiliary tasks have a more positive effect on slot filling than intent classification (with NER having the most positive effect), and that intermediate-task training yields more consistent performance gains. Our best-performing approach improves intent classification performance on Bavarian dialects by 5.1 and slot filling F1 by 8.4 percentage points.

pdf bib
Regional Distribution of the /el/-/æl/ Merger in Australian English
Steven Coats | Chloé Diskin-Holdaway | Debbie Loakes

Prelateral merger of /e/ and /æ/ is a salient acoustic feature of speech from Melbourne and the state of Victoria in Australia, but little is known about its presence in other parts of the country. In this study, automated methods of data collection, forced alignment, and formant extraction are used to analyze the regional distribution of the vowel merger within all of Australia, in 4.3 million vowel tokens from naturalistic speech in 252 locations. The extent of the merger is quantified using the difference in Bhattacharyya’s distance scores based on phonetic context, and the regional distribution is assessed using spatial autocorrelation. The principal findings are that the merger is most prominent in Victoria and least prominent in Sydney and New South Wales. We also find preliminary indications that it may be present in other parts of the country.
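As an illustration of the metric, the sketch below computes the Bhattacharyya distance between two simulated clouds of (F1, F2) formant measurements modelled as bivariate Gaussians; the means and spreads are invented, and the paper's actual computation over phonetic contexts may differ.

```python
import numpy as np

def bhattacharyya(a, b):
    """Bhattacharyya distance between two formant clouds, each modelled
    as a bivariate Gaussian over (F1, F2). 0 means identical
    distributions (merged); larger values mean more distinct vowels."""
    m1, m2 = a.mean(axis=0), b.mean(axis=0)
    s1, s2 = np.cov(a, rowvar=False), np.cov(b, rowvar=False)
    s = (s1 + s2) / 2
    term_mean = (m1 - m2) @ np.linalg.inv(s) @ (m1 - m2) / 8
    term_cov = 0.5 * np.log(np.linalg.det(s)
                            / np.sqrt(np.linalg.det(s1) * np.linalg.det(s2)))
    return term_mean + term_cov

rng = np.random.default_rng(0)
e_tokens = rng.normal([600, 1900], [60, 120], size=(200, 2))   # /e/ tokens
ae_tokens = rng.normal([750, 1750], [60, 120], size=(200, 2))  # /æ/ tokens
print(f"D_B = {bhattacharyya(e_tokens, ae_tokens):.3f}")
```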

pdf bib
Learning Cross-Dialectal Morphophonology with Syllable Structure Constraints
Salam Khalifa | Abdelrahim Qaddoumi | Jordan Kodner | Owen Rambow

We investigate learning surface forms from underlying morphological forms for low-resource language varieties. We concentrate on learning explicit rules with the aid of learned syllable structure constraints, which outperforms neural methods on this small data task and provides interpretable output. Evaluating across one relatively high-resource and two related low-resource Arabic dialects, we find that a model trained only on the high-resource dialect achieves decent performance on the low-resource dialects, useful when no low-resource training data is available. The best results are obtained when our system is trained only on the low-resource dialect data without augmentation from the related higher-resource dialect. We discuss the impact of syllable structure constraints and the strengths and weaknesses of data augmentation and transfer learning from a related dialect.

pdf bib
Common Ground, Diverse Roots: The Difficulty of Classifying Common Examples in Spanish Varieties
Javier A. Lopetegui | Arij Riabi | Djamé Seddah

Variations in languages across geographic regions or cultures are crucial to address to avoid biases in NLP systems designed for culturally sensitive tasks, such as hate speech detection or dialog with conversational agents. In languages such as Spanish, where varieties can significantly overlap, many examples can be valid across them, which we refer to as common examples. Ignoring these examples may cause misclassifications, reducing model accuracy and fairness. Therefore, accounting for these common examples is essential to improve the robustness and representativeness of NLP systems trained on such data. In this work, we address this problem in the context of Spanish varieties. We use training dynamics to automatically detect common examples or errors in existing Spanish datasets. We demonstrate the efficacy of using predicted label confidence in our Datamaps (Swayamdipta et al., 2020) implementation for identifying hard-to-classify examples, especially common examples, enhancing model performance in variety identification tasks. Additionally, we introduce a Cuban Spanish Variety Identification dataset with common-example annotations, developed to facilitate more accurate detection of Cuban and Caribbean Spanish varieties. To our knowledge, this is the first dataset focused on identifying the Cuban, or any other Caribbean, Spanish variety.
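A minimal sketch of the training-dynamics statistics behind Data Maps as applied above: per-example confidence is the mean probability the model assigns to the gold label across epochs, and low-confidence examples are candidates for common examples or annotation errors. The array below is toy data, not from the paper.

```python
import numpy as np

# gold_probs[e, i] = probability the model assigned to example i's gold
# label at the end of epoch e (logged during training); toy values here.
gold_probs = np.array([
    [0.90, 0.40, 0.20],
    [0.95, 0.50, 0.30],
    [0.97, 0.45, 0.25],
])

confidence = gold_probs.mean(axis=0)   # per-example mean over epochs
variability = gold_probs.std(axis=0)   # per-example spread over epochs
flagged = confidence < 0.5             # candidates for manual review
print(confidence.round(2), variability.round(2), flagged)
```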

pdf bib
Add Noise, Tasks, or Layers? MaiNLP at the VarDial 2025 Shared Task on Norwegian Dialectal Slot and Intent Detection
Verena Blaschke | Felicia Körner | Barbara Plank

Slot and intent detection (SID) is a classic natural language understanding task. Despite this, research has only more recently begun focusing on SID for dialectal and colloquial varieties. Many approaches for low-resource scenarios have not yet been applied to dialectal SID data, or compared to each other on the same datasets. We participate in the VarDial 2025 shared task on slot and intent detection in Norwegian varieties, and compare multiple set-ups: varying the training data (English, Norwegian, or dialectal Norwegian), injecting character-level noise, training on auxiliary tasks, and applying Layer Swapping, a technique in which layers of models fine-tuned on different datasets are assembled into a single model. We find noise injection to be beneficial, while the effects of auxiliary tasks are mixed. Though some experimentation was required to successfully assemble a model from layers, it worked surprisingly well; a combination of models trained on English and small amounts of dialectal data produced the most robust slot predictions. Our best models achieve 97.6% intent accuracy and 85.6% slot F1 in the shared task.
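The sketch below illustrates one plausible form of the character-level noise injection mentioned above, using random deletions, substitutions, and insertions; the noise rate, operation mix, and alphabet are assumptions, not the paper's exact recipe.

```python
import random

def inject_noise(text, rate=0.05, alphabet="abcdefghijklmnopqrstuvwxyzæøå"):
    """Randomly delete, substitute, or insert characters at the given rate."""
    out = []
    for c in text:
        r = random.random()
        if r < rate / 3:
            continue                                   # deletion
        elif r < 2 * rate / 3:
            out.append(random.choice(alphabet))        # substitution
        elif r < rate:
            out.extend([c, random.choice(alphabet)])   # insertion after c
        else:
            out.append(c)                              # keep unchanged
    return "".join(out)

random.seed(1)
print(inject_noise("sett alarmen til sju i morgon tidleg"))
```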

pdf bib
LTG at VarDial 2025 NorSID: More and Better Training Data for Slot and Intent Detection
Marthe Midtgaard | Petter Mæhlum | Yves Scherrer

This paper describes the LTG submission to the VarDial 2025 shared task, where we participate in the Norwegian slot and intent detection subtasks. The shared task focuses on Norwegian dialects, which present challenges due to their low-resource nature and variation. We test a variety of neural models and training data configurations, with a focus on improving and extending the available Norwegian training data. This includes automatically re-aligning slot spans in Norwegian Bokmål, as well as re-translating the original English training data into both Bokmål and Nynorsk. We also re-annotate an external Norwegian dataset to augment the training data. Our best models achieve first place in both subtasks, with a span F1 score of 0.893 for slot filling and an accuracy of 0.980 for intent detection. Our results indicate that while translation quality is less critical, improving the slot labels has a notable impact on slot performance. Moreover, adding more standard Norwegian data improves performance, but incorporating even small amounts of dialectal data leads to greater gains.

pdf bib
HiTZ at VarDial 2025 NorSID: Overcoming Data Scarcity with Language Transfer and Automatic Data Annotation
Jaione Bengoetxea | Mikel Zubillaga | Ekhi Azurmendi | Maite Heredia | Julen Etxaniz | Markel Ferro | Jeremy Barnes

In this paper we present our submission for the NorSID Shared Task as part of the 2025 VarDial Workshop, consisting of three tasks: Intent Detection, Slot Filling and Dialect Identification, evaluated using data in different dialects of the Norwegian language. For Intent Detection and Slot Filling, we have fine-tuned a multitask model in a cross-lingual setting to leverage the xSID dataset, available in 17 languages. In the case of Dialect Identification, our final submission consists of a model fine-tuned on the provided development set, which obtained the highest scores within our experiments. Our final results on the test set show that our models do not drop in performance compared to the development set, likely due to the domain-specificity of the dataset and the similar distribution of both subsets. Finally, we report an in-depth analysis of the provided datasets and their artifacts, as well as other sets of experiments that were carried out but did not yield the best results. We also analyse why some methods were more successful than others, mainly the impact of the language combination and the domain-specificity of the training data.

pdf bib
CUFE@VarDial 2025 NorSID: Multilingual BERT for Norwegian Dialect Identification and Intent Detection
Michael Ibrahim

Dialect identification is crucial for enhancing various tasks, including sentiment analysis, as a speaker’s geographical origin can significantly affect their perspective on a topic. Intent detection has likewise gained significant traction in natural language processing due to its applications in various domains, including virtual assistants, customer service automation, and information retrieval systems. This work describes a system developed for VarDial 2025: Norwegian slot and intent detection and dialect identification shared task (Scherrer et al., 2025), a challenge designed to address the dialect recognition and intent detection problems for a low-resource language like Norwegian. More specifically, this work investigates the performance of different BERT models in solving this problem. The output of the multilingual version of the BERT model was submitted to the shared task; the developed system achieved a weighted F1 score of 79.64 for dialect identification and an accuracy of 94.38 for intent detection.

up

pdf (full)
bib (full)
Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)

pdf bib
Proceedings of the 4th Workshop on Arabic Corpus Linguistics (WACL-4)
Saad Ezzini | Hamza Alami | Ismail Berrada | Abdessamad Benlahbib | Abdelkader El Mahdaouy | Salima Lamsiyah | Hatim Derrouz | Amal Haddad Haddad | Mustafa Jarrar | Mo El-Haj | Ruslan Mitkov | Paul Rayson

pdf bib
ArabicSense: A Benchmark for Evaluating Commonsense Reasoning in Arabic with Large Language Models
Salima Lamsiyah | Kamyar Zeinalipour | Samir El amrany | Matthias Brust | Marco Maggini | Pascal Bouvry | Christoph Schommer

Recent efforts in natural language processing (NLP) commonsense reasoning research have led to the development of numerous new datasets and benchmarks. However, these resources have predominantly been limited to English, leaving a gap in evaluating commonsense reasoning in other languages. In this paper, we introduce the ArabicSense Benchmark, which is designed to thoroughly evaluate the world-knowledge commonsense reasoning abilities of large language models (LLMs) in Arabic. This benchmark includes three main tasks: first, it tests whether a system can distinguish between natural language statements that make sense and those that do not; second, it requires a system to identify the most crucial reason why a nonsensical statement fails to make sense; and third, it involves generating explanations for why statements do not make sense. We evaluate several Arabic BERT-based models and causal LLMs on these tasks. Experimental results demonstrate improvements after fine-tuning on our dataset. For instance, AraBERT v2 achieved an 87% F1 score on the second task, while Gemma and Mistral-7b achieved F1 scores of 95.5% and 94.8%, respectively. For the generation task, LLaMA-3 achieved the best performance with a BERTScore F1 of 77.3%, closely followed by Mistral-7b at 77.1%. All code and the benchmark will be made publicly available at https://github.com/.

pdf bib
Lahjawi: Arabic Cross-Dialect Translator
Mohamed Motasim Hamed | Muhammad Hreden | Khalil Hennara | Zeina Aldallal | Sara Chrouf | Safwan AlModhayan

In this paper, we explore the rich diversity of Arabic dialects by introducing a suite of pioneering models called Lahjawi. The primary model, Lahjawi-D2D, is the first designed for cross-dialect translation among 15 Arabic dialects. Furthermore, we introduce Lahjawi-D2MSA, a model designed to convert any Arabic dialect into Modern Standard Arabic (MSA). Both models are fine-tuned versions of Kuwain-1.5B, an in-house small language model tailored for Arabic linguistic characteristics. We provide a detailed overview of Lahjawi’s architecture and training methods, along with a comprehensive evaluation of its performance. The results demonstrate Lahjawi’s success in preserving meaning and style, with BLEU scores of 9.62 for dialect-to-MSA and 9.88 for dialect-to-dialect tasks. Additionally, human evaluation reveals an accuracy score of 58% and a fluency score of 78%, underscoring Lahjawi’s robust handling of diverse dialectal nuances. This research sets a foundation for future advancements in Arabic NLP and cross-dialect communication technologies.

pdf bib
Lost in Variation: An Unsupervised Methodology for Mining Lexico-syntactic Patterns in Middle Arabic Texts
Julien Bezançon | Rimane Karam | Gaël Lejeune

While MSA and some dialects of Arabic have been extensively studied in NLP, Middle Arabic is still very much unknown to the field. However, Middle Arabic raises issues that are still not covered: it is characterized by variation, since it mixes standard features, colloquial ones, as well as features that belong to neither of the two. Here, we introduce a methodology to identify, extract and rank variations of 13 manually retrieved formulas. Those formulas come from the first nine booklets of Sīrat al-Malik al-Ẓāhir Baybarṣ, a corpus of Damascene popular literature written in Middle Arabic and composed of 53,843 sentences. In total, we ranked 20, sequences according to their similarity with the original formulas on multiple linguistic layers. We noticed that the variations in these formulas occur at the lexical, morphological and graphical levels, whereas the semantic and syntactic levels remain strictly invariant.

pdf bib
SADSLyC: A Corpus for Saudi Arabian Multi-dialect Identification through Song Lyrics
Salwa Saad Alahmari

This paper presents the Saudi Arabian Dialects Song Lyrics Corpus (SADSLyC), the first dataset featuring song lyrics from the five major Saudi dialects: Najdi (Central Region), Hijazi (Western Region), Shamali (Northern Region), Janoubi (Southern Region), and Shargawi (Eastern Region). The dataset consists of 31,358 sentences, with each sentence representing a self-contained verse in a song, totaling 151,841 words. Additionally, we present a baseline experiment using the SaudiBERT model to classify the fine-grained dialects in the SADSLyC Corpus. The model achieved an overall accuracy of 73% on the test dataset.

pdf bib
Enhancing Dialectal Arabic Intent Detection through Cross-Dialect Multilingual Input Augmentation
Shehenaz Hossain | Fouad Shammary | Bahaulddin Shammary | Haithem Afli

Addressing the challenges of Arabic intent detection amid extensive dialectal variation, this study presents a cross-dialectal, multilingual approach for classifying intents in banking and migration contexts. By augmenting dialectal inputs with Modern Standard Arabic (MSA) and English translations, our method leverages cross-lingual context to improve classification accuracy. We evaluate single-input (dialect-only), dual-input (dialect + MSA), and triple-input (dialect + MSA + English) models, applying language-specific tokenization for each. Results demonstrate that, in the migration dataset, our model achieved an accuracy gain of over 50 percentage points on the Tunisian dialect, increasing from 43.3% with dialect-only input to 94% with the full multilingual setup. Similarly, in the PAL (Palestinian dialect) dataset, accuracy improved from 87.7% to 93.5% with translation augmentation, reflecting a gain of 5.8 percentage points. These findings underscore the effectiveness of our approach for intent detection across various Arabic dialects.

pdf bib
Dial2MSA-Verified: A Multi-Dialect Arabic Social Media Dataset for Neural Machine Translation to Modern Standard Arabic
Abdullah Khered | Youcef Benkhedda | Riza Batista-Navarro

Social media has become an essential focus for Natural Language Processing (NLP) research due to its widespread use and unique linguistic characteristics. Normalising social media content, especially for morphologically rich languages like Arabic, remains a complex task due to limited parallel corpora. Arabic encompasses Modern Standard Arabic (MSA) and various regional dialects, collectively termed Dialectal Arabic (DA), which complicates NLP efforts due to their informal nature and variability. This paper presents Dial2MSA-Verified, an extension of the Dial2MSA dataset that includes verified translations for Gulf, Egyptian, Levantine, and Maghrebi dialects. We evaluate the performance of Seq2Seq models on this dataset, highlighting the effectiveness of state-of-the-art models in translating local Arabic dialects. We also provide insights through error analysis and outline future directions for enhancing Seq2Seq models and dataset development. The Dial2MSA-Verified dataset is publicly available to support further research.

pdf bib
Web-Based Corpus Compilation of the Emirati Arabic Dialect
Yousra A. El-Ghawi

This paper describes some initial efforts in compiling Arabic dialectal corpora in the form of raw text, the end purpose of which is to fine-tune existing Arabic large language models (LLMs) to better understand and generate text in the Emirati dialect as instructed. The paper focuses on the process of compiling corpora from the web, including the exploration of possible methods, tools and techniques specific to web search, as well as examples of genres and domains to explore. The results of these efforts and the importance of native-speaker contributions to corpus compilation for low-resource languages are also touched upon.

pdf bib
Evaluating Calibration of Arabic Pre-trained Language Models on Dialectal Text
Ali Al-Laith | Rachida Kebdani

While pre-trained language models have made significant progress in different classification tasks, little attention has been given to the reliability of their confidence scores. Calibration, how well model confidence aligns with actual accuracy, is essential for real-world applications where decisions rely on probabilistic outputs. This study addresses this gap in Arabic dialect identification by assessing the calibration of eight pre-trained language models, ensuring their predictions are not only accurate but also reliable for practical applications. We analyze two datasets: one with over 1 million text samples and the Nuanced Arabic Dialect Identification dataset (NADI-2023). Using Expected Calibration Error (ECE) as a metric, we reveal substantial variation in model calibration across dialects in both datasets, showing that prediction confidence can vary significantly depending on regional data. This research has implications for improving the reliability of Arabic dialect models in applications like sentiment analysis and social media monitoring.
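For reference, Expected Calibration Error can be computed by bucketing predictions by confidence and taking the weighted mean gap between each bucket's average confidence and its accuracy; the sketch below shows a minimal version with toy predictions.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

conf = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.55])  # model confidence
hit = np.array([1, 1, 0, 1, 0, 0])                     # 1 = correct prediction
print(f"ECE = {expected_calibration_error(conf, hit):.3f}")
```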

pdf bib
Empirical Evaluation of Pre-trained Language Models for Summarizing Moroccan Darija News Articles
Azzedine Aftiss | Salima Lamsiyah | Christoph Schommer | Said Ouatik El Alaoui

Moroccan Dialect (MD), or “Darija,” is a primary spoken variant of Arabic in Morocco, yet remains underrepresented in Natural Language Processing (NLP) research, particularly in tasks like summarization. Despite a growing volume of MD textual data online, there is a lack of robust resources and NLP models tailored to handle the unique linguistic challenges posed by MD. In response, we introduce GOUD.MA_v2, an expanded version of the GOUD.MA dataset, containing over 50k articles with their titles across 11 categories. This dataset provides a more comprehensive resource for developing summarization models. We evaluate the application of large language models (LLMs) for MD summarization, utilizing both fine-tuning and zero-shot prompting with encoder-decoder and causal LLMs, respectively. Our findings demonstrate that an expanded dataset improves summarization performance and highlights the capabilities of recent LLMs in handling MD text. We open-source our dataset, fine-tuned models, and all experimental code, establishing a foundation for future advancements in MD NLP. We release the code at https://github.com/AzzedineAftiss/Moroccan-Dialect-Summarization.

pdf bib
Dialect2SQL: A Novel Text-to-SQL Dataset for Arabic Dialects with a Focus on Moroccan Darija
Salmane Chafik | Saad Ezzini | Ismail Berrada

The task of converting natural language questions into executable SQL queries, known as text-to-SQL, has gained significant interest in recent years, as it enables non-technical users to interact with relational databases. Many benchmarks, such as SPIDER and WikiSQL, have contributed to the development of new models and the evaluation of their performance. In addition, other datasets, like SEDE and BIRD, have introduced more challenges and complexities to better map real-world scenarios. However, these datasets primarily focus on high-resource languages such as English and Chinese. In this work, we introduce Dialect2SQL, the first large-scale, cross-domain text-to-SQL dataset in an Arabic dialect. It consists of 9,428 NLQ-SQL pairs across 69 databases in various domains. Along with SQL-related challenges such as long schemas, dirty values, and complex queries, our dataset also incorporates the complexities of the Moroccan dialect, which is known for its diverse source languages, numerous borrowed words, and unique expressions. This makes our dataset a valuable contribution to both the text-to-SQL community and the development of resources for low-resource languages.

pdf bib
AraSim: Optimizing Arabic Dialect Translation in Children’s Literature with LLMs and Similarity Scores
Alaa Hassan Bouomar | Noorhan Abbas

The goal of this paper is to address the linguistic gap faced by young Egyptian Arabic speakers by translating children’s stories from Modern Standard Arabic into the Egyptian Cairo dialect. Claude is used for initial translation, and a fine-tuned AraT5 model is used for backtranslation. The translation quality is assessed using semantic similarity and BLEU scores to compare the original texts and the translations. The resulting corpus contains 130 stories, which were revised by native Egyptian speakers who are professional translators. The strengths of this paper are multiple: working on a less-resourced variety, addressing an important social issue, creating a dataset with potential real-life applications, and ensuring the quality of the produced dataset through human validation.

pdf bib
Navigating Dialectal Bias and Ethical Complexities in Levantine Arabic Hate Speech Detection
Ahmed Haj Ahmed | Rui-Jie Yew | Xerxes Minocher | Suresh Venkatasubramanian

Social media platforms have become central to global communication, yet they also facilitate the spread of hate speech. For underrepresented dialects like Levantine Arabic, detecting hate speech presents unique cultural, ethical, and linguistic challenges. This paper explores the complex sociopolitical and linguistic landscape of Levantine Arabic and critically examines the limitations of current datasets used in hate speech detection. We highlight the scarcity of publicly available, diverse datasets and analyze the consequences of dialectal bias within existing resources. By emphasizing the need for culturally and contextually informed natural language processing (NLP) tools, we advocate for a more nuanced and inclusive approach to hate speech detection in the Arab world.

up

pdf (full)
bib (full)
Proceedings of the The 7th Workshop on Narrative Understanding

pdf bib
Proceedings of the The 7th Workshop on Narrative Understanding
Elizabeth Clark | Yash Kumar Lal | Snigdha Chaturvedi | Mohit Iyyer | Anneliese Brei | Ashutosh Modi | Khyathi Raghavi Chandu

pdf bib
NarraDetect: An annotated dataset for the task of narrative detection
Andrew Piper | Sunyam Bagga

Narrative detection is an important task across diverse research domains where storytelling serves as a key mechanism for explaining human beliefs and behavior. However, the task faces three significant challenges: (1) inter-narrative heterogeneity, or the variation in narrative communication across social contexts; (2) intra-narrative heterogeneity, or the dynamic variation of narrative features within a single text over time; and (3) the lack of theoretical consensus regarding the concept of narrative. This paper introduces the NarraDetect dataset, a comprehensive resource comprising over 13,000 passages from 18 distinct narrative and non-narrative genres. Through a manually annotated subset of ~400 passages, we also introduce a novel theoretical framework for annotating for a scalar concept of “narrativity.” Our findings indicate that while supervised models outperform large language models (LLMs) on this dataset, LLMs exhibit stronger generalization and alignment with the scalar concept of narrativity.

pdf bib
On the Transferability of Causal Knowledge for Language Models
Gourab Dey | Yash Kumar Lal

Language understanding includes identifying logical connections between events in a discourse, such as news and instructional text. We study the transferability of causal knowledge across these two domains by analyzing the extent to which understanding preconditions in narratives such as news articles can help models reason about cooking recipes, and vice-versa. Our experiments show that using instructions to pretrain small models on one domain before similarly finetuning them on the other yields a slight improvement over finetuning alone. We also find that finetuning the models on a mix of both types of data is better (~3-7%) for understanding causal relations in instructional text. While we find that the improvements do not translate to larger or already instruction-tuned models, our analysis highlights the aspects of a plan that are better captured through the interoperability of causal knowledge.

pdf bib
Finding Common Patterns in Domestic Violence Stories Posted on Reddit
Mohammad Shokri | Emily Klapper | Jason Shan | Sarah Ita Levitan

Domestic violence survivors often share their experiences in online spaces, offering valuable insights into common abuse patterns. This study analyzes a dataset of personal narratives about domestic violence from Reddit, focusing on event extraction and topic modeling to uncover recurring themes. We evaluate GPT-4 and LLaMA-3.1 for extracting key sentences, finding that GPT-4 exhibits higher precision, while LLaMA-3.1 achieves better recall. Using LLM-based topic assignment, we identify dominant themes such as psychological aggression, financial abuse, and physical assault which align with previously published psychology findings. A co-occurrence and PMI analysis further reveals the interdependencies among different abuse types, emphasizing the multifaceted nature of domestic violence. Our findings provide a structured approach to analyzing survivor narratives, with implications for social support systems and policy interventions.

pdf bib
A Theoretical Framework for Evaluating Narrative Surprise in Large Language Models
Annaliese Bissell | Ella Paulin | Andrew Piper

Narrative surprise is a core element of storytelling for engaging audiences, and yet it remains underexplored in the context of large language models (LLMs) and narrative generation. While surprise arises from events that deviate from expectations while maintaining retrospective coherence, current computational approaches lack comprehensive frameworks to evaluate this phenomenon. This paper presents a novel framework for assessing narrative surprise, drawing on psychological theories of narrative comprehension and surprise intensity. We operationalize six criteria—initiatoriness, immutability violation, predictability, post-dictability, importance, and valence—to measure narrative surprise in story endings. Our study evaluates 120 story endings, generated by both human authors and LLMs, across 30 mystery narratives. Through a ranked-choice voting methodology, we identify significant correlations between reader preferences and four of the six criteria. Results underscore the continuing advantage of human-authored endings in achieving compelling narrative surprise, while also revealing significant progress in LLM-generated narratives.

pdf bib
Beyond LLMs A Linguistic Approach to Causal Graph Generation from Narrative Texts
Zehan Li | Ruhua Pan | Xinyu Pi

pdf bib
CHATTER: A character-attribution dataset for narrative understanding
Sabyasachee Baruah | Shrikanth Narayanan

Computational narrative understanding studies the identification, description, and interaction of the elements of a narrative: characters, attributes, events, and relations. Narrative research has given considerable attention to defining and classifying character types. However, these character-type taxonomies do not generalize well because they are small, too simple, or specific to a domain. We require robust and reliable benchmarks to test whether narrative models truly understand the nuances of the character’s development in the story. Our work addresses this by curating the CHATTER dataset that labels whether a character portrays some attribute for 88,124 character-attribute pairs, encompassing 2,998 characters, 12,967 attributes and 660 movies. We validate a subset of CHATTER, called CHATTEREVAL, using human annotations to serve as an evaluation benchmark for the character attribution task in movie scripts. CHATTEREVAL also assesses narrative understanding and the long-context modeling capacity of language models.

pdf bib
Tracking Evolving Relationship Between Characters in Books in the Era of Large Language Models
Abhilasha Sancheti | Rachel Rudinger

This work aims to assess the zero-shot social reasoning capabilities of LLMs by proposing various strategies based on the granularity of information used to track the fine-grained evolution in the relationship between characters in a book. Without gold annotations, we thoroughly analyze the agreements between predictions from multiple LLMs and manually examine their consensus at a local and global level via the task of trope prediction. Our findings reveal low-to-moderate agreement among LLMs and humans, reflecting the complexity of the task. Analysis shows that LLMs are sensitive to subtle contextual changes and often rely on surface-level cues. Humans, too, may interpret relationships differently, leading to disagreements in annotations.

pdf bib
Narrative Studio: Visual narrative exploration using LLMs and Monte Carlo Tree Search
Parsa Ghaffari | Chris Hokamp

Interactive storytelling benefits from planning and exploring multiple “what if” scenarios. Modern LLMs are useful tools for ideation and exploration, but current chat-based user interfaces restrict users to a single linear flow. To address this limitation, we propose Narrative Studio – a novel in-browser narrative exploration environment featuring a tree-like interface that allows branching exploration from user-defined points in a story. Each branch is extended via iterative LLM inference guided by system and user-defined prompts. Additionally, we employ Monte Carlo Tree Search (MCTS) to automatically expand promising narrative paths based on user-specified criteria, enabling more diverse and robust story development. We also allow users to enhance narrative coherence by grounding the generated text in a graph that represents the actors and environment of the story.

pdf bib
Speaker Identification and Dataset Construction Using LLMs: A Case Study on Japanese Narratives
Seiji Gobara | Hidetaka Kamigaito | Taro Watanabe

Speaker identification in narrative analysis is a challenging task due to complex dialogues, diverse utterance patterns, and ambiguous character references. Costly and time-intensive manual annotation limits the scalability of high-quality dataset creation. This study demonstrates a cost-efficient approach to constructing speaker identification datasets by combining small-scale manual annotation with LLM-based labeling. A subset of data is manually annotated and used to guide LLM predictions with a few-shot approach, followed by refinement through minimal human corrections. Our results show that LLMs achieve approximately 90% accuracy on challenging narratives, such as the “Three Kingdoms” dataset, underscoring the importance of targeted human corrections. This approach proves effective for constructing scalable and cost-efficient datasets for Japanese and complex narratives.

up

pdf (full)
bib (full)
Proceedings of the Tenth Workshop on Noisy and User-generated Text

pdf bib
Proceedings of the Tenth Workshop on Noisy and User-generated Text
JinYeong Bak | Rob van der Goot | Hyeju Jang | Weerayut Buaphet | Alan Ramponi | Wei Xu | Alan Ritter

pdf bib
Towards a Social Media-based Disease Surveillance System for Early Detection of Influenza-like Illnesses: A Twitter Case Study in Wales
Mark Drakesmith | Dimosthenis Antypas | Clare Brown | Jose Camacho-Collados | Jiao Song

Social media offers the potential to detect outbreaks or public health incidents faster than traditional reporting mechanisms. In this paper, we developed and tested a pipeline to produce alerts of influenza-like illness (ILI) using Twitter data. Data was collected from the Twitter API, querying keywords referring to ILI symptoms and geolocated to Wales. Tweets containing first-hand descriptions of symptoms (as opposed to non-personal descriptions) were classified using transformer-based language models specialised on social media (BERTweet and TimeLMs), which were trained on a manually labelled dataset matching the above criteria. After gathering this data, weekly tweet counts were passed to the regression-based Noufaily algorithm to identify exceedances throughout 2022. The algorithm was also applied to counts of ILI-related GP consultations for comparison. Exceedance detection applied to the classified tweet counts produced alerts starting four weeks earlier than those based on GP consultation data. These results demonstrate the potential to facilitate advanced preparedness for unexpected increases in healthcare burdens.
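As a simplified illustration of regression-based exceedance detection in the spirit of the Noufaily algorithm (which additionally handles seasonality, reweighting of past outbreaks, and overdispersion), the sketch below fits a Poisson regression to historical weekly counts and flags the current week if it exceeds an upper prediction bound; it assumes statsmodels, and the data are simulated.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
counts = rng.poisson(20, size=52)   # 52 weeks of classified tweet counts
counts[-1] = 45                     # an unusually high current week

weeks = np.arange(52)
X = sm.add_constant(weeks)

# Fit a Poisson regression with a linear trend on the historical weeks.
model = sm.GLM(counts[:-1], X[:-1], family=sm.families.Poisson()).fit()

mu = model.predict(X[-1:])[0]          # expected count for the current week
threshold = mu + 2.58 * np.sqrt(mu)    # ~99% upper bound (Poisson approx.)
print("alert" if counts[-1] > threshold else "no alert")
```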

pdf bib
Sentiment Analysis on Video Transcripts: Comparing the Value of Textual and Multimodal Annotations
Quanqi Du | Loic De Langhe | Els Lefever | Veronique Hoste

This study explores the differences between textual and multimodal sentiment annotations on videos and their impact on transcript-based sentiment modelling. Using the UniC and CH-SIMS datasets which are annotated at both the unimodal and multimodal level, we conducted a statistical analysis and sentiment modelling experiments. Results reveal significant differences between the two annotation types, with textual annotations yielding better performance in sentiment modelling and demonstrating superior generalization ability. These findings highlight the challenges of cross-modality generalization and provide insights for advancing sentiment analysis.

pdf bib
Restoring Missing Spaces in Scraped Hebrew Social Media
Avi Shmidman | Shaltiel Shmidman

A formidable challenge regarding scraped corpora of social media is the omission of whitespaces, causing pairs of words to be conflated together as one. In order for the text to be properly parsed and analyzed, these missing spaces must be detected and restored. However, it is particularly hard to restore whitespace in languages such as Hebrew which are written without vowels, because a conflated form can often be split into multiple different pairs of valid words. Thus, a simple dictionary lookup is not feasible. In this paper, we present and evaluate a series of neural approaches to restore missing spaces in scraped Hebrew social media. Our best all-around method involved pretraining a new character-based BERT model for Hebrew, and then fine-tuning a space restoration model on top of this new BERT model. This method is blazing fast, high-performing, and open for unrestricted use, providing a practical solution to process huge Hebrew social media corpora with a consumer-grade GPU. We release the new BERT model and the fine-tuned space-restoration model to the NLP community.
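One common way to frame space restoration, consistent with the fine-tuning setup described above, is character-level binary tagging: each character of the de-spaced text is labelled with whether a space should follow it. The sketch below shows this data framing only; the actual model and training details are those of the paper.

```python
def make_example(text_with_spaces):
    """Turn a correctly spaced sentence into (conflated_text, labels),
    where labels[j] = 1 if a space should be restored after character j."""
    chars, labels = [], []
    for i, c in enumerate(text_with_spaces):
        if c == " ":
            continue
        chars.append(c)
        follows = i + 1 < len(text_with_spaces) and text_with_spaces[i + 1] == " "
        labels.append(1 if follows else 0)
    return "".join(chars), labels

conflated, labels = make_example("shalom lekulam")
print(conflated)  # shalomlekulam
print(labels)     # 1 marks the character before each restored space
```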

pdf bib
Identifying and analyzing ‘noisy’ spelling errors in a second language corpus
Alan Juffs | Ben Naismith

This paper addresses the problem of identifying and analyzing ‘noisy’ spelling errors in texts written by second language (L2) learners in a written corpus. Using Python, spelling errors were identified in 5,774 texts of at least 66 words (total = 1,814,209 words), selected from a corpus of 4.2 million words (Authors-1). The statistical analysis used hurdle() models in R, which are appropriate for non-normal count data with many zeros.
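A minimal sketch of dictionary-based spelling-error identification of the kind described, assuming the pyspellchecker package; the real pipeline's filtering rules are not reproduced here.

```python
from spellchecker import SpellChecker  # pyspellchecker package

spell = SpellChecker(language="en")

def spelling_errors(text):
    """Return tokens absent from the dictionary, after light cleanup."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    tokens = [t for t in tokens if t.isalpha()]
    return sorted(spell.unknown(tokens))

sample = "I recieve a letter from my freind every week."
print(spelling_errors(sample))  # expected to flag 'freind' and 'recieve'
```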

pdf bib
Automatic normalization of noisy technical reports with an LLM: What effects on a downstream task?
Mariame Maarouf | Ludovic Tanguy

This study explores the automatic normalization of noisy and highly technical anomaly reports by an LLM. Different prompts are tested to instruct the LLM to clean the text without changing the structure, vocabulary or specialized lexicon. The evaluation of this task is done in two steps. First, the Character Error Rate (CER) is calculated to assess the changes made compared to a gold standard on a small sample. Second, an automatic sequence labeling task is performed on the original and on the corrected datasets with a transformer-based classifier. While some configurations of LLM and prompt reach satisfying CER scores, the sequence labeling task shows that the normalization has a small negative impact on performance.

pdf bib
We’re Calling an Intervention: Exploring Fundamental Hurdles in Adapting Language Models to Nonstandard Text
Aarohi Srivastava | David Chiang

We present a suite of experiments that allow us to understand the underlying challenges of language model adaptation to nonstandard text. We do so by designing interventions that approximate core features of user-generated text and their interactions with existing biases of language models. Applying our interventions during language model adaptation to nonstandard text variations, we gain important insights into when such adaptation is successful, as well as the aspects of text variation and noise that are particularly difficult for language models to handle. For instance, on text with character-level variation, out-of-the-box performance improves even with a few additional training examples but approaches a plateau, suggesting that more data is not the solution. In contrast, on text with variation involving new words or meanings, far more data is needed, but it leads to a massive breakthrough in performance. Our findings reveal that existing models lack the necessary infrastructure to handle diverse forms of nonstandard text, guiding the development of more resilient language modeling techniques. We make the code for our interventions, which can be applied to any English text data, publicly available.

pdf bib
On-Device LLMs for Home Assistant: Dual Role in Intent Detection and Response Generation
Rune Birkmose | Nathan Mørkeberg Reece | Esben Hofstedt Norvin | Johannes Bjerva | Mike Zhang

This paper investigates whether Large Language Models (LLMs), fine-tuned on synthetic but domain-representative data, can perform the twofold task of (i) slot and intent detection and (ii) natural language response generation for a smart home assistant, while running solely on resource-limited, CPU-only edge hardware. We fine-tune LLMs to produce both JSON action calls and text responses. Our experiments show that 16-bit and 8-bit quantized variants preserve high accuracy on slot and intent detection and maintain strong semantic coherence in generated text, while the 4-bit model, though retaining generative fluency, suffers a noticeable drop in device-service classification accuracy. Further evaluations on noisy human (non-synthetic) prompts and out-of-domain intents confirm the models’ generalization ability, obtaining around 80–86% accuracy. While the average inference time is 5–6 seconds per query—acceptable for one-shot commands but suboptimal for multi-turn dialogue—our results affirm that an on-device LLM can effectively unify command interpretation and flexible response generation for home automation without relying on specialized hardware.

pdf bib
Applying Transformer Architectures to Detect Cynical Comments in Spanish Social Media
Samuel Gonzalez-Lopez | Steven Bethard | Rogelio Platt-Molina | Francisca Orozco

Detecting cynical comments in online communication poses a significant challenge in human-computer interaction, especially given the massive proliferation of discussions on platforms like YouTube. These comments often include offensive or disruptive patterns, such as sarcasm, negative feelings, specific reasons, and an attitude of being right. To address this problem, we present a web platform for the Spanish language that has been developed and leverages natural language processing and machine learning techniques. The platform detects comments and provides valuable information to users by focusing on analyzing comments. The core models are based on pre-trained architectures, including BETO, SpanBERTa, Multilingual BERT, RoBERTuito, and BERT, enabling robust detection of cynical comments. Our platform was trained and tested with Spanish comments from car analysis channels on YouTube. The results show that models achieve performance above 0.8 F1 for all types of cynical comments in the text classification task but achieve lower performance (around 0.6-0.7 F1) for the more arduous token classification task.

pdf bib
Prompt Guided Diffusion for Controllable Text Generation
Mohaddeseh Mirbeygi | Hamid Beigy

Controlled text generation, the task of generating coherent, contextually relevant text with specified attributes such as sentiment, topic, or style, has seen considerable development with methods such as PPLM, FUDGE, and diffusion-based models. However, most state-of-the-art methods struggle to balance control precision with fluency. Classifier-guided approaches like PPLM are well-known for unstable gradient updates, yielding incoherent outputs, while autoregressive models like FUDGE depend on rigid templates that limit creativity. While recent diffusion models show promise in iterative refinement and diversity, they often lack mechanisms to explicitly incorporate task-specific knowledge and hence require complicated auxiliary classifiers for training and inference. We propose a prompt-guided diffusion framework that integrates structured prompts seamlessly into the diffusion process for precise and flexible control of generated texts. Each prompt combines a target condition (e.g., a sentiment label), an in-class example (e.g., a positive movie review), and a placeholder for the generated sentence. Explicit, human-readable guidance is thereby given, spanning high-level intent to low-level text generation. Our approach encodes prompts using large pre-trained language models (e.g., BART), fuses them with the diffusion dynamics via cross-attention, and achieves new state-of-the-art results on all benchmarks, including IMDB for sentiment, AG News for topic, and E2E for structured data-to-text generation.

pdf bib
FaBERT: Pre-training BERT on Persian Blogs
Mostafa Masumi | Seyed Soroush Majd | Mehrnoush Shamsfard | Hamid Beigy

We introduce FaBERT, a Persian BERT-base model pre-trained on the HmBlogs corpus, encompassing both informal and formal Persian texts. FaBERT is designed to excel in traditional Natural Language Understanding (NLU) tasks, addressing the intricacies of diverse sentence structures and linguistic styles prevalent in the Persian language. In our comprehensive evaluation of FaBERT on 12 datasets in various downstream tasks, encompassing Sentiment Analysis (SA), Named Entity Recognition (NER), Natural Language Inference (NLI), Question Answering (QA), and Question Paraphrasing (QP), it consistently demonstrated improved performance, all achieved within a compact model size. The findings highlight the importance of utilizing diverse corpora, such as HmBlogs, to enhance the performance of language models like BERT in Persian Natural Language Processing (NLP) applications.

pdf bib
Automatically Generating Chinese Homophone Words to Probe Machine Translation Estimation Systems
Shenbin Qian | Constantin Orasan | Diptesh Kanojia | Félix Do Carmo

Evaluating machine translation (MT) of user-generated content (UGC) involves unique challenges, such as checking whether the nuances of emotion in the source are preserved in the target text. Recent studies have proposed emotion-related datasets, frameworks and models to automatically evaluate MT quality of Chinese UGC, without relying on reference translations. However, whether these models are robust to the challenge of preserving emotional nuances has been left largely unexplored. To this end, we introduce a novel method inspired by information theory which generates challenging Chinese homophone words related to emotions, by leveraging the concept of *self-information*. Our approach generates homophones that were observed to cause translation errors in emotion preservation, and exposes vulnerabilities in MT models struggling to preserve relevant emotions. We evaluate the efficacy of our method using human evaluation and compare it with an existing one, showing that our method achieves higher correlation with human judgments. The generated Chinese homophones, along with their manual translations, are utilized to generate perturbations and to probe the robustness of existing quality evaluation models, including models trained using multi-task learning, fine-tuned variants of multilingual language models, as well as large language models (LLMs). Our results indicate that LLMs with larger size exhibit higher stability and robustness to such perturbations. We release our data and code for reproducibility and further research.
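The self-information criterion mentioned above assigns each word w a surprisal I(w) = -log2 p(w), so rare homophone substitutes are maximally surprising; the sketch below computes it from illustrative corpus counts (the words and frequencies are invented examples of same-pinyin candidates).

```python
import math

# Invented counts for three same-pinyin candidates (xin1qing2).
corpus_counts = {"心情": 1200, "薪情": 3, "辛情": 1}
total = sum(corpus_counts.values())

for word, count in corpus_counts.items():
    p = count / total
    print(f"I({word}) = {-math.log2(p):.2f} bits")  # rarer => more surprising
```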

pdf bib
Multi-BERT: Leveraging Adapters for Low-Resource Multi-Domain Adaptation
Parham Abed Azad | Hamid Beigy

Multi-domain text analysis presents significant challenges, particularly in Persian named entity recognition (NER). Using a single model for multiple domains often fails to capture the specific features of different domains, which is why many researchers have turned to prompting chatbots for this problem. However, studies show that these models do not achieve remarkable results on NER tasks without proper fine-tuning, while training and storing a chatbot is extremely costly. This paper presents a new approach using one core model with various sets of domain-specific parameters. By using techniques like LoRA and prefix tuning, along with extra layers, we train each set of trainable parameters for a specific domain. This allows the model to perform as well as individual models for each domain. Tests on various formal and informal datasets show that by using these added parameters, the proposed model performs much better than existing practical models. The model needs only one instance for storage but achieves excellent results across all domains. This paper also examines each adaptation strategy, outlining its strengths, weaknesses, and the best settings and hyperparameters for Persian NER. Lastly, this study introduces a new document-based domain detection system for situations where text domains are unknown. This novel pipeline enhances the adaptability and practicality of the proposed approach for real-world applications.

pdf bib
Enhancing NER Performance in Low-Resource Pakistani Languages using Cross-Lingual Data Augmentation
Toqeer Ehsan | Thamar Solorio

Named Entity Recognition (NER), a fundamental task in Natural Language Processing (NLP), has shown significant advancements for high-resource languages. However, due to a lack of annotated datasets and limited representation in Pre-trained Language Models (PLMs), it remains understudied and challenging for low-resource languages. To address these challenges, in this paper, we propose a data augmentation technique that generates culturally plausible sentences and experiment on four low-resource Pakistani languages: Urdu, Shahmukhi, Sindhi, and Pashto. By fine-tuning multilingual masked Large Language Models (LLMs), our approach demonstrates significant improvements in NER performance for Shahmukhi and Pashto. We further explore the capability of generative LLMs for NER and data augmentation using few-shot learning.

pdf bib
Wikipedia is Not a Dictionary, Delete! Text Classification as a Proxy for Analysing Wiki Deletion Discussions
Hsuvas Borkakoty | Luis Espinosa-Anke

Automated content moderation for collaborative knowledge hubs like Wikipedia or Wikidata is an important yet challenging task due to multiple factors. In this paper, we construct a database of discussions happening around articles marked for deletion in several Wikis and in three languages, which we then use to evaluate a range of LMs on different tasks (from predicting the outcome of the discussion to identifying the implicit policy an individual comment might be pointing to). Our results reveal, among others, that discussions leading to deletion are easier to predict, and that, surprisingly, self-produced tags (keep, delete or redirect) don’t always help guide the classifiers, presumably because of users’ hesitation or deliberation within comments.

pdf bib
From Conversational Speech to Readable Text: Post-Processing Noisy Transcripts in a Low-Resource Setting
Arturs Znotins | Normunds Gruzitis | Roberts Dargis

We present ongoing research on automatic post-processing approaches to enhance the readability of noisy speech transcripts in low-resource languages, with a focus on conversational speech in Latvian. We compare transformer-based sequence-labeling models and large language models (LLMs) for the standard punctuation and capitalization restoration task, while also considering automatic correction of mispronounced words and disfluency, and partial inverse text normalization. Our results show that very small LLMs (approx. 2B parameters), fine-tuned on a modest text corpus, can achieve near state-of-the-art performance, rivaling orders of magnitude larger LLMs. Additionally, we demonstrate that a fine-tuned Whisper model, leveraging acoustic cues, outperforms text-only systems on challenging conversational data, even for a low-resource language. Error analysis reveals recurring pitfalls in sentence boundary determination and disfluency handling, emphasizing the importance of consistent annotation and domain adaptation for robust post-processing. Our findings highlight the feasibility of developing efficient post-processing solutions that significantly refine ASR output in low-resource settings, while opening new possibilities for editing and formatting speech transcripts beyond mere restoration of punctuation and capitalization.

pdf bib
Text Normalization for Japanese Sentiment Analysis
Risa Kondo | Ayu Teramen | Reon Kajikawa | Koki Horiguchi | Tomoyuki Kajiwara | Takashi Ninomiya | Hideaki Hayashi | Yuta Nakashima | Hajime Nagahara

We manually normalize noisy Japanese expressions from social networking services (SNS) to improve the performance of sentiment polarity classification. Despite advances in pre-trained language models, informal expressions found in social media still plague natural language processing. In this study, we analyzed 6,000 posts from a sentiment analysis corpus of Japanese SNS text and constructed a text normalization taxonomy consisting of 33 types of editing operations. Text normalization according to our taxonomy significantly improved the performance of BERT-based sentiment analysis in Japanese. Detailed analysis reveals that most types of editing operations each contribute to improving the performance of sentiment analysis.

up

pdf (full)
bib (full)
Proceedings of the First Workshop on Writing Aids at the Crossroads of AI, Cognitive Science and NLP (WRAICOGS 2025)

pdf bib
Proceedings of the First Workshop on Writing Aids at the Crossroads of AI, Cognitive Science and NLP (WRAICOGS 2025)
Michael Zock | Kentaro Inui | Zheng Yuan

pdf bib
Chain-of-MetaWriting: Linguistic and Textual Analysis of How Small Language Models Write Young Students Texts
Ioana Buhnila | Georgeta Cislaru | Amalia Todirascu

Large Language Models (LLMs) have been used to generate texts in response to different writing tasks: reports, essays, storytelling. However, language models do not have a metarepresentation of the text writing process, nor inherent communication learning needs, comparable to those of young human students. This paper introduces a fine-grained linguistic and textual analysis of multilingual Small Language Models' (SLMs) writing. With our method, Chain-of-MetaWriting, SLMs can imitate some steps of the human writing process, such as planning and evaluation. We mainly focused on short story and essay writing tasks in French, for schoolchildren and undergraduate students respectively. Our results show that SLMs encounter difficulties in assisting young students on sensitive topics such as violence in the schoolyard, and they sometimes use words that are too complex for the target audience. In particular, the output differs considerably from human-produced texts in terms of text cohesion and coherence, regarding temporal connectors, topic progression, and reference.
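
As a hedged sketch of what a plan-draft-evaluate-revise prompting chain might look like, the snippet below chains four generation calls; the prompt wording and the generate backend are assumptions, not the published Chain-of-MetaWriting method.

```python
# Hedged sketch of a plan -> draft -> evaluate -> revise prompting chain,
# in the spirit of imitating steps of the human writing process. The
# `generate` callable (any text-in/text-out SLM interface) is assumed.
def chain_of_metawriting(generate, task: str) -> str:
    plan = generate(f"Before writing, list the steps you will follow for: {task}")
    draft = generate(f"Task: {task}\nPlan:\n{plan}\nNow write the text, following the plan.")
    review = generate(f"Evaluate this text for cohesion, coherence and audience fit:\n{draft}")
    return generate(f"Revise the text using this evaluation.\nText:\n{draft}\nEvaluation:\n{review}")
```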

pdf bib
Semantic Masking in a Needle-in-a-haystack Test for Evaluating Large Language Model Long-Text Capabilities
Ken Shi | Gerald Penn

In this paper, we introduce the concept of Semantic Masking, where semantically coherent surrounding text (the haystack) interferes with the retrieval and comprehension of specific information (the needle) embedded within it. We propose the Needle-in-a-Haystack-QA Test, an evaluation pipeline that assesses LLMs’ long-text capabilities through question answering, explicitly accounting for the Semantic Masking effect. We conduct experiments to demonstrate that Semantic Masking significantly impacts LLM performance more than text length does. By accounting for Semantic Masking, we provide a more accurate assessment of LLMs’ true proficiency in utilizing extended contexts, paving the way for future research to develop models that are not only capable of handling longer inputs but are also adept at navigating complex semantic landscapes.
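
A minimal sketch of how a needle-in-a-haystack QA item can be constructed, with the needle inserted at a chosen depth; to probe Semantic Masking, the filler sentences would be made topically coherent rather than unrelated. All strings below are invented examples, not the paper's test data.

```python
# Illustrative construction of a needle-in-a-haystack QA item: a target
# sentence (the needle) is embedded at a chosen relative depth in filler
# text (the haystack), and a question probes retrieval of the needle.
def build_item(haystack_sentences, needle, depth: float) -> str:
    pos = int(len(haystack_sentences) * depth)
    return " ".join(haystack_sentences[:pos] + [needle] + haystack_sentences[pos:])

context = build_item(
    ["The committee met on Tuesday."] * 50,      # filler (haystack)
    "The access code for the archive is 7142.",  # needle
    depth=0.5,
)
question = "What is the access code for the archive?"
```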

pdf bib
Reading Between the Lines: A dataset and a study on why some texts are tougher than others
Nouran Khallaf | Carlo Eugeni | Serge Sharoff

Our research aims to better understand what makes a text difficult to read for specific audiences with intellectual disabilities, more specifically people who have limitations in cognitive functioning, such as reading and understanding skills, an IQ below 70, and challenges in conceptual domains. We introduce a scheme for the annotation of difficulties that is based on empirical research in psychology as well as on research in translation studies. The paper describes the annotated dataset, primarily derived from parallel texts (standard English and Easy-to-Read English translations) made available online. We fine-tuned four different pre-trained transformer models to perform multiclass classification, predicting the strategies required for simplification. We also investigate the possibility of interpreting the decisions of this language model when it is aimed at predicting the difficulty of sentences in this dataset.
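
A minimal sketch, assuming the Hugging Face transformers API, of the multiclass set-up described above; the base model and strategy labels are invented placeholders, and the model shown is untrained.

```python
# Sketch of multiclass classification for simplification-strategy prediction.
# The strategy label set and base checkpoint are hypothetical placeholders.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["keep", "rephrase", "split", "explain"]  # invented strategy set
tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=len(LABELS)
)

enc = tok("The ramifications of the statute were manifold.", return_tensors="pt")
pred = model(**enc).logits.argmax(-1).item()
print(LABELS[pred])  # strategy predicted for this sentence (head untrained here)
```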

pdf bib
ParaRev: Building a dataset for Scientific Paragraph Revision annotated with revision instruction
Léane Jourdan | Florian Boudin | Richard Dufour | Nicolas Hernandez | Akiko Aizawa

Revision is a crucial step in scientific writing, where authors refine their work to improve clarity, structure, and academic quality. Existing approaches to automated writing assistance often focus on sentence-level revisions, which fail to capture the broader context needed for effective modification. In this paper, we explore the impact of shifting from sentence-level to paragraph-level scope for the task of scientific text revision. The paragraph-level definition of the task allows for more meaningful changes and is guided by detailed revision instructions rather than general ones. To support this task, we introduce ParaRev, the first dataset of revised scientific paragraphs with an evaluation subset manually annotated with revision instructions. Our experiments demonstrate that using detailed instructions significantly improves the quality of automated revisions compared to general approaches, regardless of the model or metric considered.
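
To make the general-vs-detailed instruction contrast concrete, here is a hedged sketch of the two prompt styles; the wording is illustrative and not the ParaRev protocol itself.

```python
# Illustrative contrast between a general and a detailed revision instruction
# for paragraph-level revision; both prompt templates are invented examples.
GENERAL = "Revise the following paragraph."
DETAILED = ("Revise the following paragraph: improve the transition from the "
            "motivation to the method, and define the acronym on first use.")

def revision_prompt(instruction: str, paragraph: str) -> str:
    return f"{instruction}\n\nParagraph:\n{paragraph}\n\nRevised paragraph:"
```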

pdf bib
Towards an operative definition of creative writing: a preliminary assessment of creativeness in AI and human texts
Chiara Maggi | Andrea Vitaletti

Nowadays, AI is present in all our activities. This pervasive presence is perceived as a threat by many categories of users who fear being substituted by their AI counterpart. While the potential of AI in handling repetitive tasks is clear, the potential of its creativeness is still poorly understood. We believe that understanding this aspect of AI can transform a threat into an opportunity. This paper is a first attempt to provide a measurable definition of creativeness. We applied our definition to AI- and human-generated texts, proving the viability of the proposed approach. Our preliminary experiments show that human texts are more creative.

pdf bib
Decoding Semantic Representations in the Brain Under Language Stimuli with Large Language Models
Anna Sato | Ichiro Kobayashi

Brain decoding technology is paving the way for breakthroughs in the interpretation of neural activity to recreate thoughts, emotions, and movements. Tang et al. (2023) introduced a novel approach that uses language models as generative models for brain decoding based on functional magnetic resonance imaging (fMRI) data. Building on their work, this study explored the use of three additional language models along with the GPT model used in previous research to improve decoding accuracy. Furthermore, we added an evaluation metric using an embedding model, providing higher-level semantic similarity than the BERTScore. By comparing the decoding performance and identifying the factors contributing to good performance, we found that high decoding accuracy does not solely depend on the ability to accurately predict brain activity. Instead, the type of text (e.g., web text, blogs, news articles, and books) that the model tends to generate plays a more significant role in achieving more precise sentence reconstruction.