Other Workshops and Events (2026)
Volumes
- Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script 71 papers
- Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026) 31 papers
- Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR) 24 papers
- Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP) 28 papers
- Proceedings of the 13th Workshop on Argument Mining and Reasoning 18 papers
- Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026) 93 papers
- Proceedings of Beyond Alignment: Transdisciplinary Conversations on Human-AI Futures 2 papers
- Proceedings of The Big Picture v2: Crafting a Research Narrative 13 papers
- BioNLP 2026 90 papers
- Proceedings of the BioNLP 2026 (Shared Tasks) 36 papers
- Proceedings of the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP 2026) 17 papers
- Proceedings of the 1st Workshop on Computational Developmental Linguistics (CDL) 11 papers
- Proceedings of the 2nd Workshop on Computational Humor (CHum 2026) 9 papers
- Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026) 46 papers
- Proceedings of the Ninth Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL-9) 20 papers
- Proceedings of the 30th Conference on Computational Natural Language Learning 48 papers
- Proceedings of the Second Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U) 18 papers
- Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages 72 papers
- Proceedings of the 9th Workshop on Event Extraction and Understanding: Challenges and Applications (EEUCA 2026) 26 papers
- Proceedings of the Workshop on Evaluating Evaluations (EvalEval) 22 papers
- Proceedings of the Ninth Fact Extraction and VERification Workshop (FEVER) 12 papers
- Proceedings of the Fifth Workshop on NLP Applications to Field Linguistics 8 papers
- Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM) 77 papers
- Proceedings of the 1st Workshop on Linguistic Analysis for Health (HeaLing 2026) 29 papers
- Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM 2026) 15 papers
- Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026 33 papers
- Proceedings of the 20th Linguistic Annotation Workshop (LAW XX) 21 papers
- The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26) 14 papers
- Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026) 25 papers
- Proceedings of the Sixth Workshop on Language Technology for Equality, Diversity, Inclusion 31 papers
- Proceedings of the 2nd Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2026) 11 papers
- Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026) 31 papers
- Proceedings of the First Workshop on Multilingual Multicultural Evaluation 16 papers
- Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026) 35 papers
- Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities 37 papers
- Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026) 10 papers
- Proceedings of the Seventh Workshop on Natural Language Processing and Computational Social Science 20 papers
- Proceedings of the Seventh Workshop on Privacy in Natural Language Processing 11 papers
- Proceedings of the 1st Workshop on Multilingual Report Generation via Retrieval Augmented Generation (RAG4Reports 2026) 15 papers
- Proceedings of the Society for Computation in Linguistics 2026 52 papers
- Proceedings of the 20th International Workshop on Semantic Evaluation (2026) 456 papers
- Proceedings of the Second Workshop Natural Language Processing for Turkic Languages (SIGTURK 2026) 21 papers
- Proceedings of the 8th Workshop on Research in Computational Linguistic Typology and Multilingual NLP 7 papers
- The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family 15 papers
- Proceedings of the 11th Social Media Mining for Health Research and Applications (SMM4H-HeaRD 2026) Workshop and Shared Tasks 54 papers
- Proceedings of the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026) 37 papers
- Proceedings of the 1st Workshop on Stereotypes Across Cultures in Language Technologies (StereACuLT 2026) 13 papers
- Proceedings of the First Workshop on Structured Understanding, Retrieval, and Generation in the LLM Era (SURGeLLM 2026) 25 papers
- Proceedings of the Seventh Workshop on Teaching Natural Language Processing (TeachNLP 2026) 18 papers
- Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026) 42 papers
- Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects 33 papers
- The Proceedings for the 15th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA 2026) 24 papers
- Proceedings of the 2nd Joint Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences and Computational Models of Reference, Anaphora and Coreference (CODI-CRAC 2026) 19 papers
Proceedings of the 2nd Workshop on NLP for Languages Using Arabic Script
Mo El-Haj | Paul Rayson | Mustafa Jarrar | Ignatius Ezeani | Saad Ezzini | Sina Ahmadi | Amal Haddad Haddad | Cynthia Amol | Ahmad Abdelali | Shadi Abudalfa
We present ArabicDialectHub, a cross-dialectal Arabic learning resource comprising 552 phrases across six varieties (Moroccan Darija, Lebanese, Syrian, Emirati, Saudi, and MSA) and an interactive web platform. Phrases were generated using LLMs and validated by five native speakers, stratified by difficulty, and organized thematically. The open-source platform provides translation exploration, adaptive quizzing with algorithmic distractor generation, cloud-synchronized progress tracking, and cultural context. Both the dataset and complete platform source code are released under MIT license. Platform: https://arabic-dialect-hub.netlify.app.
Multilingual evaluation often relies on language coverage or translated benchmarks, implicitly assuming that subword tokenization behaves comparably across scripts. In mixed-script settings, this assumption breaks down. We examine this effect using polarity detection as a case study, comparing Orthographic Syllable Pair Encoding (OSPE) and Byte Pair Encoding (BPE) under identical architectures, data, and training conditions on SemEval Task 9, which spans Devanagari, Perso-Arabic, and Latin scripts. OSPE is applied to Hindi, Nepali, Urdu, and Arabic, while BPE is retained for English. We find that BPE systematically underestimates performance in abugida and abjad scripts, producing fragmented representations, unstable optimization, and drops of up to 27 macro-F1 points for Nepali, while English remains largely unaffected. Script-aware segmentation preserves orthographic structure, stabilizes training, and improves cross-language comparability without additional data or model scaling, highlighting tokenization as a latent but consequential evaluation decision in multilingual benchmarks. While the analysis spans multiple scripts, we place particular emphasis on Arabic and Perso-Arabic languages, where frequency-driven tokenization most severely disrupts orthographic and morphological structure.
Optimizing What We Trust: Reliability-Guided QUBO Selection of Multi-Agent Weak Framing Signals for Arabic Sentiment Prediction
Rabab Alkhalifa
Framing detection in Arabic social media is difficult due to interpretive ambiguity, cultural grounding, and limited reliable supervision. Existing LLM-based weak supervision methods typically rely on label aggregation, which is brittle when annotations are few and socially dependent. We propose a reliability-aware weak supervision framework that shifts the focus from label fusion to data curation. A small multi-agent LLM pipeline—two framers, a critic, and a discriminator—treats disagreement and reasoning quality as epistemic signals and produces instance-level reliability estimates. These estimates guide a QUBO-based subset selection procedure that enforces frame balance while reducing redundancy. Intrinsic diagnostics and an out-of-domain Arabic sentiment transfer test show that the selected subsets are more reliable and encode non-random, transferable structure, without degrading strong text-only baselines.
Optimizer Choice and Calibration for QARiB on Arabic-Script Social Media Offensive Language Detection
Auda Elshokry | Mohammed Alhanjouri
Optimizer choice is a central hyperparameter in fine-tuning transformer models, yet its impact remains under-studied for Arabic-script social media classification under class imbalance. We compare Adam, AdamW, and SGD for fine-tuning QARiB on two Arabic offensive-language benchmarks, OffensEval20 and MPOLD, using a controlled grid over learning rate, weight decay, and warmup, and report test-set performance as mean (std) over three random seeds. Minority-class discrimination is evaluated using macro-F1 and AUC-PROFF, while calibration is assessed via expected calibration error (ECE), reliability diagrams, and proper scoring rules (Brier score and negative log-likelihood, NLL). Across both datasets, AdamW and Adam are consistently strong and closely matched when properly tuned, whereas SGD substantially underperforms under the same tuning budget and exhibits higher seed sensitivity. We observe non-trivial miscalibration across optimizers; post-hoc temperature scaling offers a low-cost adjustment, yielding modest, dataset-dependent changes in calibration while preserving ranking-based discrimination. We further evaluate a practical decision-rule step by optimizing the classification threshold on the validation set and applying it to test predictions, and provide qualitative examples illustrating typical optimizer-dependent confidence behaviors. In practice, for Arabic offensive-language detection under imbalance, we recommend starting from a tuned AdamW or Adam baseline; when calibrated probabilities are required for thresholding or triage, temperature scaling can be applied. We will release a reproducible pipeline to support further evaluation of optimizer–calibration trade-offs in Arabic-script safety tasks.
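The two calibration tools named in this abstract, expected calibration error and temperature scaling, can be illustrated with a minimal sketch. It is not the authors' pipeline: the equal-width binning, the default of 10 bins, and the function names `softmax` and `expected_calibration_error` are assumptions made here for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins (binning scheme is an assumption)."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of the samples.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece
```

Dividing logits by a positive temperature never changes which class has the highest score, which is why temperature scaling can adjust confidence while leaving ranking-based metrics untouched.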
We introduce the Tarab Corpus, a large-scale cultural and linguistic resource that brings together Arabic song lyrics and poetry within a unified analytical framework. The corpus comprises 2.56 million verses and more than 13.5 million tokens, making it, to our knowledge, the largest open Arabic corpus of creative text spanning both classical and contemporary production. Tarab is broadly balanced between songs and poems and covers Classical Arabic, Modern Standard Arabic (MSA), and six major regional varieties: Egyptian, Gulf, Levantine, Iraqi, Sudanese, and Maghrebi Arabic. The artists and poets represented in the corpus are associated with 28 modern nation states and multiple historical eras, covering over fourteen centuries of Arabic creative expression from the Pre-Islamic period to the twenty-first century. Each verse is accompanied by structured metadata describing linguistic variety, geographic origin, and historical or cultural context, enabling comparative linguistic, stylistic, and diachronic analysis across genres and time. We describe the data collection, normalisation, and validation pipeline and present baseline analyses for variety identification and genre differentiation. The dataset is publicly available on HuggingFace at https://huggingface.co/datasets/drelhaj/Tarab.
LLM-to-Speech: A Synthetic Data Pipeline for Training Dialectal Text-to-Speech Models
Ahmed Khamis | Hesham Ali Ahmed
Despite the advances in neural text-to-speech (TTS), many Arabic dialectal varieties remain marginally addressed, with most resources concentrated on Modern Standard Arabic (MSA) and Gulf dialects, leaving Egyptian Arabic, the most widely understood Arabic dialect, severely under-resourced. We address this gap by introducing NileTTS: 38 hours of transcribed speech from two speakers across diverse domains including medical, sales, and general conversations. We construct this dataset using a novel synthetic pipeline: large language models (LLMs) generate Egyptian Arabic content, which is then converted to natural speech using audio synthesis tools, followed by automatic transcription and speaker diarization with manual quality verification. We fine-tune XTTS v2, a state-of-the-art multilingual TTS model, on our dataset and evaluate against the baseline model trained on other Arabic dialects. Our contributions include: (1) the first publicly available Egyptian Arabic TTS dataset, (2) a reproducible synthetic data generation pipeline for dialectal TTS, and (3) an open-source fine-tuned model. All resources are released to advance Egyptian Arabic speech synthesis research.
HCMUS_PrompterXPrompter at AbjadMed: When Classification Meets Retrieval: Taming the Long Tail in Arabic Medical Text Classification
Duy Minh Dao Sy | Trung Kiet Huynh | Nguyen Dinh Ha Duong | Nguyen Chi Tran | Phu Quy Nguyen Lam | Hoa Pham Phu
Medical text classification is high-stakes work, yet models often falter precisely where they are needed most: on rare, critical conditions buried in the long tail of the data distribution. In the EACL 2026 ABJAD-NLP Shared Task, we confronted this challenge with a dataset of Arabic medical questions heavily skewed towards a few common topics, leaving dozens of categories with fewer than ten examples. We present HybridMed, a system that effectively tames this long tail by marrying the semantic generalization of a fine-tuned Arabic BERT model with the precise, instance-based memory of k-nearest neighbor retrieval. This complementary union allowed our system to achieve a macro-F1 score of 0.4902, demonstrating that for diverse and imbalanced medical data, the whole is indeed greater than the sum of its parts.
KazakhOCR: A Synthetic Benchmark for Evaluating Multimodal Models in Low-Resource Kazakh Script OCR
Henry Gagnier | Sophie Gagnier | Ashwin Kirubakaran
Kazakh is a Turkic language using the Arabic, Cyrillic, and Latin scripts, making it unique in terms of optical character recognition (OCR). Work on OCR for low-resource Kazakh scripts is very scarce, and no OCR benchmarks or images exist for the Arabic and Latin scripts. We construct a synthetic OCR dataset of 7,219 images for all three scripts with font, color, and noise variations to imitate real OCR tasks. We evaluated three multimodal large language models (MLLMs) on a subset of the benchmark for OCR and language identification: Gemma-3-12B-it, Qwen2.5-VL-7B-Instruct, and Llama-3.2-11B-Vision-Instruct. All models are unsuccessful with Latin and Arabic script OCR, and fail to recognize the Arabic script as Kazakh text, misclassifying it as Arabic, Farsi, and Kurdish. We further compare MLLMs with a classical OCR baseline and find that while traditional OCR has lower character error rates, MLLMs fail to match this performance. These findings show significant gaps in current MLLM capabilities to process low-resource Abjad-based scripts and demonstrate the need for inclusive models and benchmarks supporting low-resource scripts and languages.
Seeing Words Differently: Visual Embeddings for Robust English-Arabic Machine Translation
Mahdi Alshaikh Saleh | Irfan Ahmad
Context: Natural Language Processing (NLP) has become an essential field with widespread applications across domains such as Large Language Models (LLMs). One of the core applications of NLP is machine translation (MT). A major challenge in MT is handling out-of-vocabulary (OOV) words and spelling mistakes, which can lead to poor translation quality. Objective: This study compares traditional text-based embeddings with visual embeddings for English-to-Arabic translation. It investigates the effectiveness of each approach, especially in handling noisy inputs or OOV terms. Method: Using the IWSLT 2017 English-Arabic dataset, we trained a baseline transformer encoder-decoder model using standard text embeddings and compared it with models using several visual embedding strategies, including vowel-removal preprocessing and trigram-based image rendering. The translated outputs were evaluated using BLEU scores. Results: Although traditional BPE-based models achieve higher BLEU on clean data, visual embedding models are substantially more robust to spelling noise, retaining up to 2.4× higher BLEU scores at 50% character corruption.
Character-Level Transformer for Tajik–Persian Transliteration with a Parallel Lexical Corpus
Arabov Mullosharaf Kurbonovich
This study addresses automatic transliteration from Tajik (Cyrillic script) to Persian (Perso-Arabic script). We present a curated, lexicographically verified parallel corpus of 52,152 Tajik–Persian words and short phrases, compiled from printed dictionaries, encyclopedic sources, and manually verified online resources. To the best of our knowledge, this is one of the largest publicly available word-level corpora for Tajik–Persian transliteration. Using this corpus, we train a character-level sequence-to-sequence Transformer model and evaluate it using Character Error Rate (CER) and exact-match accuracy. The best Transformer configuration with beam search (k=3) achieves a CER of 0.3182 and an exact-match accuracy of 0.3215, achieving lower error rates than dictionary-based rule-based and recurrent neural baselines. We describe the data collection and preprocessing pipeline, model architecture, and experimental protocol, and report a part-of-speech analysis showing performance differences across lexical categories. All resources (dataset, preprocessing scripts, splits, and training configurations) will be released publicly to ensure reproducibility and facilitate future work on Tajik–Persian transliteration, cross-script NLP, and lexicographic applications.
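The two metrics reported in this abstract, Character Error Rate and exact-match accuracy, can be sketched as follows. This is an illustrative sketch only: it assumes CER is computed at corpus level (total edits over total reference characters), which the abstract does not specify, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, using a rolling DP row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(refs, hyps):
    """Corpus-level CER: total character edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars

def exact_match(refs, hyps):
    """Fraction of hypotheses identical to their reference."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)
```

Because the distance is computed over Python string code points, the same functions apply unchanged to Cyrillic and Perso-Arabic strings, although combining marks count as separate characters.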
Arabic Dialect Translation with Small LLMs: Enhancing through Reasoning-Oriented Reinforcement Learning
Sohaila Abdulsattar | Keith Ross
Arabic dialect↔English machine translation remains difficult due to extreme dialect variation, inconsistent orthography, and limited parallel data. Moreover, dialect translation is often needed in remote regions or by economically-disadvantaged communities, which often operate in compute-constrained or offline settings. Motivated by these concerns, in this paper we explore optimizing Arabic dialect↔English translators that run over small LLMs, which could be implemented on small offline devices. We show that reasoning-oriented reinforcement learning can substantially improve small multilingual LLMs for Arabic dialect translation. Using the MADAR corpus, small Qwen-2.5 models trained with a think-then-translate template and optimized with Group-Relative Policy Optimization using a SacreBLEU reward outperform a much larger 7B baseline trained with supervised fine-tuning. The dialect-to-English BLEU score more than doubles from 17.4 to 34.9, while the English-to-dialect COMET score improves from 0.57 to 0.73.
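The core idea behind Group-Relative Policy Optimization, scoring each sampled output relative to the other outputs drawn for the same prompt, can be sketched in a few lines. This is a simplified illustration and not the authors' training code: the rewards below stand in for per-translation scores such as SacreBLEU, and the use of the population standard deviation and the `eps` constant are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is the reward's deviation from the group mean, divided
    by the group standard deviation (population std is an assumption here).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for three candidate translations of one sentence.
advantages = group_relative_advantages([10.0, 20.0, 30.0])
```

Translations scoring above the group mean receive positive advantages and are reinforced; when every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.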
MedArabs at AbjadMed: Arabic Medical Text Classification via Data- and Algorithm-Level Fusion
Amrita Singh
In this work, we address the challenges of Arabic medical text classification, focusing on class imbalance and the complexity of the language’s morphology. We propose a multiclass classification pipeline based on Data- and Algorithm-Level fusion, which integrates the optimal Back Translation technique for data augmentation with the Class Balanced (CB) loss function to enhance performance. The domain-specific AraBERT model is fine-tuned using this approach, achieving competitive results. On the official test set of the AbjadMed task, our pipeline achieves a Macro-F1 score of 0.4219, and it achieves 0.4068 on the development set.
GATech at AbjadMed: Bidirectional Encoders vs. Causal Decoders: Insights from 82-Class Arabic Medical Classification
Ahmed Khamis
This paper presents a system description for Arabic medical text classification across 82 distinct categories. Our primary architecture utilizes a fine-tuned AraBERTv2 encoder enhanced with a hybrid pooling strategy, combining attention and mean representations, and multi-sample dropout for robust regularization. We systematically benchmark this approach against a suite of multilingual and Arabic-specific encoders, as well as several large-scale causal decoders, including zero-shot re-ranking via Llama 3.3 70B and feature extraction from Qwen 3B hidden states. Our findings demonstrate that specialized bidirectional encoders significantly outperform causal decoders in capturing the precise semantic boundaries required for fine-grained medical text classification. We show that causal decoders, optimized for next-token prediction, produce sequence-biased embeddings that are less effective for categorization compared to the global context captured by bidirectional attention. Despite significant class imbalance and label noise identified within the training data, our results highlight the superior semantic compression of fine-tuned encoders for specialized Arabic NLP tasks. Final performance metrics on the test set, including Accuracy and Macro-F1, are reported and discussed.
Named Entity Recognition (NER) models trained on clean text often fail on real-world data containing orthographic noise. Work on NER for Persian is emerging, but it has not yet explored the orthographic robustness of models to perturbations often exhibited in user-generated content. We evaluate ParsBERT, ParsBERT v2.0, BertNER, and two XLM-r-based models on a subset of Persian-NER-Dataset-500k after applying eleven different perturbations, including simulated typos, code-switching, and segmentation errors. All models were competitive with each other, but XLM-r-large consistently displayed the best robustness to perturbations. Code-switching, typos, similar character swaps, segmentation errors, and noisy text all decreased F1 scores, while Latinized numbers increased F1 scores in ParsBERT. Removing diacritics, zero-width non-joiners, and normalizing Yeh/Kaf all did not have an effect on F1. These findings suggest that Persian NER models require improvement for performance on noisy text, and that the Perso-Arabic script introduces unique factors into NER not present in many high-resource languages, such as code-switching and Eastern Arabic numerals. This work creates a foundation for the development of robust Persian NER models and highlights the necessity of evaluating low-resource NER models under challenging and realistic conditions.
ArabicMedicalBERT-QA-82 at AbjadMed: Fighting Class Imbalance in Arabic Medical Text Classification
Gleb Shanshin
We present a supervised system for Arabic medical question-answer classification developed for the AbjadMed shared task. The task involves assigning one of 82 highly imbalanced medical categories and is evaluated using macro-averaged F1. Our approach builds on an AraBERT model further pretrained on a related Arabic medical classification dataset. Under a unified fine-tuning setup, this domain-adapted model consistently outperforms general-purpose Arabic backbones, with the best results obtained using a low backbone learning rate, indicating that only limited adaptation is required. The final system achieves a macro F1 score of 0.51 on the private test split. For comparison, we evaluate several cost-efficient large language models under constrained prompting and observe substantially lower performance.
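Macro-averaged F1, the evaluation metric named in this abstract, gives every class equal weight regardless of its frequency. The sketch below is a plain-Python illustration, not the shared task's official scorer; the function name and the choice to score a class with no true or predicted instances as 0 are assumptions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# A classifier that always predicts the majority class "a".
score = macro_f1(["a", "a", "a", "b"], ["a", "a", "a", "a"])
```

In this toy case accuracy is 0.75, but macro-F1 is far lower because the ignored minority class contributes an F1 of 0, which is why the metric suits heavily imbalanced label sets.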
KvochurHegel at AbjadMed: Combining LDAM Loss and Adversarial Training for Arabic Medical Question-Answer Classification
Minh-Hoang Le
This paper describes our team’s submission to AbjadMed at AbjadNLP 2026. The task involves classifying Arabic medical question-answer pairs into 82 categories, characterized by a long-tail distribution and significant semantic overlap. While domain-specific Arabic models exist, they are primarily optimized for Named Entity Recognition or span-extraction tasks rather than high-cardinality sequence classification. Consequently, our system adopts a robust optimization approach using a general-purpose encoder. We utilize ARBERTv2 as the backbone, employing Label-Distribution-Aware Margin (LDAM) loss to mitigate class imbalance and Fast Gradient Method (FGM) adversarial training to enhance generalization boundaries. Our approach achieves a Macro-F1 score of 0.4028 on the private test set, demonstrating that advanced optimization techniques can yield competitive performance on specialized taxonomies without requiring domain-specific pre-training.
baellouf at AbjadMed: Efficient Fine-tuning with All-Linear LoRA for Arabic Medical QA Classification
Abdallah Khallouf
We describe our system for the AbjadMed shared task on Arabic medical text classification at AbjadNLP 2026. Our approach combines efficient fine-tuning of Qwen3-8B using QLoRA with a Dice+CrossEntropy hybrid loss designed for Macro F1 optimization. Taking inspiration from recent research on optimal LoRA configurations, we apply low-rank adapters to all linear layers of the model rather than attention layers only, which we validate improves performance by 4.0 points. We also explore data augmentation through machine translation of external medical QA data, though this did not improve generalization. Our best submission achieves a Macro F1 score of 0.4441 on the test set.
Supachoke at AbjadMed: Enhancing Arabic Medical Text Classification Using Fine-Tuned AraBERT
Thanh Phu Nguyen | Tuan Thai Huy Nguyen Cu | Son Thai Pham | Tri Duy Ho Nguyen
Medical text classification is an important task in healthcare NLP, yet Arabic medical texts remain underexplored due to linguistic complexity and limited annotated data. In this paper, we study the effectiveness of AraBERT, a pre-trained Arabic transformer model, for Arabic medical text classification. We fine-tune AraBERT on a labeled medical dataset and evaluate its performance using standard classification metrics. Experimental results show that our fine-tuned AraBERT model achieves a private leaderboard score of 0.4076 and ranks 13th among participating teams, outperforming classical machine learning baselines and other transformer variants. These findings highlight the potential of transformer-based approaches for Arabic medical NLP and motivate further research.
REIGNITE at AbjadMed: Imbalance-Aware Fine-Tuning of Pretrained Arabic Transformers for Arabic Medical Text Classification Task
Nahid Montasir Rifat | Foyez Ahmed Dewan
This paper presents our system developed for the AbjadNLP Shared Task 4 on Medical Text Classification in Arabic, which aims to assign Arabic medical question-answer pairs to a predefined set of medical categories. The task poses significant challenges due to severe class imbalance across 82 categories and the linguistic complexity of domain-specific Arabic medical text. To address these challenges, we propose an imbalance-aware training framework that combines targeted data augmentation for minority classes with class-weighted focal loss during fine-tuning. We evaluate multiple Arabic pretrained transformer models under a unified training configuration and further improve robustness through a majority-voting ensemble of the best-performing models. Our approach achieves competitive performance, ranking 15th on the private leaderboard with a macro F1 score of 0.4052, demonstrating the effectiveness of combining different data augmentation techniques, imbalance-aware training objectives, and ensemble learning for large-scale, highly imbalanced Arabic medical text classification. The code is available on GitHub.
Tashkees-AI at AbjadMed 2026: Flat vs. Hierarchical Classification for Fine-Grained Arabic Medical QA
Fatimah Mohamed Emad Eldin
This paper describes Tashkees-AI, a system developed for the AbjadMed 2026 Shared Task on Arabic Medical Question Classification. A comprehensive empirical study was conducted across 82 fine-grained categories, investigating three paradigms: fine-tuned encoder models, hierarchical classification, and ensemble methods. Leveraging a dataset of 27k Arabic medical question-answer pairs, an extensive ablation study was conducted, comparing MARBERTv2, CAMeLBERT, two-stage hierarchical classifiers, and RAG-based approaches. The findings reveal that fine-tuned MARBERTv2 with data cleaning yields the best performance, achieving a macro F1-score of 0.3659 on the blind test set. In contrast, hierarchical methods surprisingly underperformed (0.332 F1) due to error propagation. The system ranked 26th on the official leaderboard.
MetaSwarm at AbjadMed: Forensic Optimization and Class-Balanced Discovery for Medical Diglossia in Abjad Scripts
Rahul Jaisy
The classification of diglossic medical text presents a high-dimensional challenge defined by extreme class imbalance (N = 82) and the orthographic ambiguity of unvocalized Abjad scripts. While standard supervised learning often collapses into majority-class prediction due to the "Long Tail" distribution, we introduce a Human-in-the-Loop Forensic Optimization framework. Unlike static end-to-end pipelines, our approach decouples strategic hyperparameter tuning from high-throughput tactical execution (Elastic Compute). We leverage a rigorous Class-Balanced Focal Loss (CBFL) derived from the "Effective Number of Samples" theory (E_n) to stabilize the decision manifold against stochastic class dominance. Using a CAMELBERT-DA backbone optimized via a custom weighted trainer on Dual H200 GPUs, our system achieved a robust Public Leaderboard score of 0.3588. We further perform a "Linguistic Error Topology" analysis, utilizing UMAP projections and attention saliency, to demonstrate that generalization gaps are driven by dialectal "Constraint Drift" rather than stochastic model failure.
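The class-balanced focal loss mentioned in this abstract combines two published ideas: per-class weights inversely proportional to the effective number of samples, E_n = (1 - β^n) / (1 - β), and the focal term (1 - p_t)^γ that down-weights easy examples. The sketch below shows the arithmetic for a single example; it is not the authors' trainer, and the values of `beta` and `gamma`, the weight normalization, and the function names are assumptions.

```python
import math

def effective_number(n, beta=0.999):
    """Effective number of samples for a class with n training examples."""
    return (1.0 - beta ** n) / (1.0 - beta)

def class_balanced_weights(counts, beta=0.999):
    """Weights proportional to 1 / E_n, rescaled to sum to the class count."""
    raw = [1.0 / effective_number(n, beta) for n in counts]
    scale = len(counts) / sum(raw)
    return [w * scale for w in raw]

def cb_focal_loss(probs, target, weights, gamma=2.0):
    """Class-balanced focal loss for one example.

    probs: predicted class probabilities; target: index of the true class.
    """
    p_t = probs[target]
    return -weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)

# Hypothetical two-class case: 1000 examples of class 0, 10 of class 1.
weights = class_balanced_weights([1000, 10])
loss = cb_focal_loss([0.3, 0.7], target=1, weights=weights)
```

With `gamma=0` and uniform weights the expression reduces to ordinary cross-entropy, so both terms can be read as corrections layered on top of the standard loss.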
QurSci-Onto: A Hierarchical Ontology and Dataset for Scientific Exegesis in the Quran
Ibad-ur-Rehman Rashid | Junaid Hussain | Sadam Al-Azani
This paper introduces resources for the computational study of scientific exegesis (Tafsir Ilmi): a structured ontology, a curated dataset of 194 scientifically relevant Quranic verses linked to 260 exegetical records from two authoritative Tafsir books, and an annotation framework that organizes scientific references by topic and sequential context. Existing Quranic resources treat verses as unstructured text, losing the logical order and causal relationships of scientific concepts documented in exegesis. To address this, we present QurSci-Onto, a three-layer ontology that categorizes verses by scientific domain, links them to authoritative Tafsir, and provides a framework for representing sequential processes through stage-based annotations. Our dataset includes page-level citations and covers 8 major scientific topics across 73 nodes. While the full corpus is tagged with broad categories and scientific topics, a specialized subset features granular node-level mappings to capture complex scientific narratives. We release QurSci-Onto as a foundational resource for Arabic semantic NLP and demonstrate that it enables significant improvements in semantic retrieval and multi-hop sequential reasoning over unstructured baselines.
AjamiMorph: Zero-Annotation Morphological Discovery for Hausa Ajami via Multi-Method Consensus
Soumedhik Bharati | Shibam Mandal | Prithwish Ghosh | Swarup Kr Ghosh | Sayani Mondal
Hausa Ajami (Hausa written in Arabic script) remains severely under-resourced for computational morphology. We present AjamiMorph, a zero-annotation framework that discovers morphemes through consensus among three unsupervised methods, namely, Byte Pair Encoding (BPE), transition-based boundary detection using Pointwise Mutual Information (PMI), and computational linguistics based Distributional Affix Mining (DAM). Using a Hausa Ajami Bible corpus consisting of 637,414 tokens, AjamiMorph identifies 1,611 high-confidence morphemes, achieving 99.9% coverage. The inventory exhibits a linguistically realistic distribution (66.0% stems, 22.6% suffixes, 11.4% prefixes) and recovers 77.8% of known Hausa affixes. A permutation test that shuffles method assignments (preserving per-method selection sizes) confirms that the observed agreement is above-chance; chi-square remains as a secondary check. A lightweight 5-gram LM comparison (characters vs. consensus morphemes) provides an extrinsic signal. We also report negative results for script-driven Arabic assumptions and LLM-first annotation. This work provides the first unsupervised morpheme inventory for Hausa Ajami and demonstrates consensus as a robust strategy for zero-resource morphology.
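The consensus idea can be sketched as a simple vote: each method proposes a set of candidate morphemes, and a candidate is kept only if enough methods agree on it. The `min_votes` threshold, the function name, and the toy candidate strings below are hypothetical; the abstract does not specify how agreement is scored.

```python
from collections import Counter
from typing import Dict, Set


def consensus_morphemes(proposals: Dict[str, Set[str]],
                        min_votes: int = 2) -> Dict[str, int]:
    """Keep candidates proposed by at least `min_votes` methods.

    Returns a mapping from each retained candidate to its vote count.
    """
    votes: Counter = Counter()
    for candidates in proposals.values():
        votes.update(candidates)  # each method votes once per candidate
    return {m: v for m, v in votes.items() if v >= min_votes}


# Placeholder candidates in Latin letters, not real system output.
toy_proposals = {
    "bpe": {"ma", "wa", "n", "ta"},
    "pmi": {"ma", "wa", "ki"},
    "dam": {"ma", "n", "ki", "su"},
}
toy_consensus = consensus_morphemes(toy_proposals)
```

Raising `min_votes` to the number of methods keeps only unanimous candidates, trading coverage for confidence.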
Morphological Feature Extraction for Fine-Grained Sorani Kurdish Dialect Identification: A Hybrid Transformer-Linguistic Approach
Soumedhik Bharati | Shibam Mandal | Subham Majumdar | Swarup Kr Ghosh | Sayani Mondal
As reported, approximately 6 million people in Iraq and Iran speak Sorani Kurdish, which exhibits substantial regional variation but lacks computational resources for dialect identification. We present the first fine-grained sub-dialect classification system for six Sorani varieties, namely Sulaymaniyah, Erbil, Iranian Sorani, Ardalani, Babani, and Mukriani. This investigation combines cross-lingual contextual embeddings (XLM-RoBERTa) with morphological features derived from explicit linguistic rules, including 24 patterns capturing verb prefixes, pronominal clitics, and definite markers. The proposed morphology-augmented XLM-R model was trained on a unified dataset of 16,409 sentences without manual annotation and achieves 91.91% accuracy, outperforming pure transformers (91.79%) and traditional machine learning baselines (SVM 86.41%). Key ablation studies reveal that morphological features serve as effective regularizers for geographically proximate dialects.
Olga Snissarenko at AbjadMed: Arabic Clinical Text Classification with AraBERT: Results from the AbjadMed Shared Task
Olga Snissarenko
We present a solution for the Arabic medical text classification task, formulated as a multi-class classification problem with 82 medical categories. The task is challenging due to severe class imbalance, long and heterogeneous input texts, and the presence of domain-specific medical terminology in Modern Standard Arabic. Our approach is based on fine-tuning pretrained AraBERT models with a focus on loss-level imbalance handling rather than architectural complexity. Through a systematic comparison of multiple AraBERT-based configurations, we show that class-weighted loss combined with simple mean pooling yields the strongest performance. Our best model achieves a macro-F1 score of 0.387 on the public evaluation set and 0.411 on the private test set.
From Classical to Contemporary: Evolutionary Analysis & Classification of Urdu Poetry
Noor Fatima | Hasan Faraz Khan | Irfan Ahmad
Automatic classification of literary text by historical era can support literary analysis and reveal stylistic evolution. We study this problem for Urdu poetry across three eras, classical, modern, and contemporary. We introduce a new dataset of 10,026 four-line Urdu poetry segments collected from online archives (Rekhta and UrduPoint) and labeled by era. To handle Urdu’s script and orthographic variability, we apply standard preprocessing, including Unicode normalization and removal of diacritics and non-Urdu characters. We benchmark a range of approaches, from traditional machine learning classifiers to deep learning models, including fine-tuned Urdu BERT-style transformers. To assess generalization, we evaluate under two regimes: (i) a standard stratified random split and (ii) a stricter author-disjoint split that ensures poets do not overlap between training and test sets. On the random split, the best traditional models achieve about 70-73% accuracy, suggesting era-related stylistic cues are learnable. However, performance drops to roughly 58-60% under the author-disjoint split, highlighting the difficulty in generalizing across unseen poets and the possibility of overestimating performance via author-specific leakage. Notably, fine-tuned transformers do not surpass simpler TF-IDF-based baselines, indicating that era cues may be subtle and that data limitations constrain more complex models.
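The author-disjoint regime can be illustrated with a short sketch that assigns whole authors, not individual segments, to the test set. The function name, the `(author, text)` tuple layout, and the toy records are assumptions for illustration, not the authors' actual data format.

```python
import random
from typing import List, Tuple

Sample = Tuple[str, str]  # (author, text)


def author_disjoint_split(samples: List[Sample], test_fraction: float = 0.2,
                          seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Split so that no author appears in both the training and test sets."""
    authors = sorted({author for author, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(authors)
    n_test = max(1, round(len(authors) * test_fraction))
    test_authors = set(authors[:n_test])
    train = [s for s in samples if s[0] not in test_authors]
    test = [s for s in samples if s[0] in test_authors]
    return train, test


# Hypothetical records: five poets with two segments each.
toy_samples = [(f"poet_{i}", f"segment_{i}_{j}") for i in range(5) for j in range(2)]
toy_train, toy_test = author_disjoint_split(toy_samples, test_fraction=0.4)
```

Because the split is made over authors, the test fraction is only approximate at the segment level when poets contribute different amounts of text.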
Alkhalil Corpus: An Open-Source Thematic and Lemmatized Corpus for Modern Standard Arabic
Samir Belayachi | Azzeddine Mazroui
The availability of large annotated corpora remains a major challenge for the development of natural language processing systems for under-resourced languages such as Arabic. In this paper, we present two annotated corpora dedicated to Modern Standard Arabic. These corpora are open-source and freely available on the Hugging Face platform. The first corpus, annotated by theme and designed to provide a balanced representation of contemporary Arabic usage, comprises approximately 76 million words collected from diverse sources covering multiple domains and geographical regions. The second corpus, containing approximately one million words, is a sub-corpus extracted from the first. It was annotated with lemma tags using a semi-automatic approach that combines automatic annotation with the Alkhalil lemmatizer and MADAMIRA, followed by manual validation.
Enhancing Urdu Sentiment Classification through Instruction-Tuned LLMs and Cross-Lingual Transfer
Hasan Faraz Khan | Noor Fatima | Irfan Ahmad
Sentiment analysis in low-resource languages such as Urdu poses unique challenges due to limited annotated data, morphological complexity, and significant class imbalance in most publicly available datasets. This study addresses these issues through two experimental strategies. First, we explore class imbalance mitigation by using instruction-tuned large language models (LLMs) to generate synthetic negative sentiment samples in Urdu. This augmentation strategy results in a more balanced dataset, which significantly improves the recall and F1-score for minority class predictions when fine-tuned using a multilingual BERT model. Second, we investigate the effectiveness of translating Urdu text into English and applying sentiment classification through a pre-trained English language model. Comparative evaluation reveals that the translation-based pipeline, using a RoBERTa model fine-tuned for English sentiment classification, achieves superior performance across major metrics. Our results suggest that LLM-based augmentation and cross-lingual transfer via translation both serve as viable approaches to overcome data scarcity and performance limitations in sentiment analysis for low-resource languages. The findings highlight the potential applicability of these approaches to other under-resourced linguistic domains.
Back-of-the-book indexes (BoBIs) are crucial for book readability. However, their manual creation is laborious and error-prone. In this paper, we introduce ArBoBIM to automate BoBI extraction and review processes for Arabic books. Given a book with a corresponding BoBI, ArBoBIM extracts BoBI terms, identifies their occurrences, and aligns them across several versions of the book. ArBoBIM first defines a pool of candidates for each term by leveraging noun phrases and named entities. It then uses several metrics, including exact matches, morpho-lexical similarity, and semantic similarity, to determine the best candidates. We empirically fine-tuned thresholds for ArBoBIM and achieve an F1-score of 0.94 (precision = 0.97, recall = 0.91). These results are significantly better than baseline results and top LLM-based results, with lower computational cost and no publishing house IP risks. Additionally, over 500 books have been processed with ArBoBIM, resulting in the ArBoBIMap dataset, which contains books, their terms, occurrences, and related metadata, and which is to be made publicly available. This dataset is used to train a model to identify whether a term, given its features, should be added to the back-of-the-book index of a specific book. The model achieves an F1-score of 0.91 (precision = 0.97, recall = 0.85).
Improving on State-of-the-Art Models for Sentiment Analysis on Saudi-English Code-Switching Text
Samaher Alghamdi | Paul Rayson | Reem Alotibi
Inserting English words, phrases, or sentences while writing or speaking in the Saudi Arabic dialect has become a widespread phenomenon in Saudi society. This phenomenon is linguistically called code-switching. It remains unclear how current sentiment analysis methods perform on Saudi-English code-switching text. In this paper, we address this gap by conducting the first sentiment analysis study on Saudi-English code-switching text. We present the first Saudi-English Sentiment Analysis Code Switching Dataset (SESA-CSD) and establish baseline results on this dataset. By evaluating multiple state-of-the-art small language models, we achieve improvements over the baseline of 3% to 11% in both accuracy and macro-F1. Among all small language models, XLM-RoBERTa achieved the highest performance, with an accuracy of 95.50% and a macro-F1 of 95.53%. Our findings indicate that multilingual and Arabic small language models, such as XLM-RoBERTa, GigaBERT, and SaudiBERT, consistently outperform bilingual Arabic-English large language models, such as Fanar and ALLaM, across zero-shot and multiple few-shot settings.
OMAN-SPEECH: A Multi-Layer Annotated Speech Corpus for Omani Arabic Dialects
Rayyan S. Al Khadhuri | Firas Al Mahrouqi | Salim Al Mandhari | Amir Azad Al-Kathiri | Omar Said Alshahri | Ghassab Mansoor Alsaqr | Badri Abdulhakim Mudhsh | Tarek Fatnassi
Automatic Speech Recognition (ASR) has achieved strong performance in high-resource languages; however, Dialectal Arabic remains significantly under-resourced. This gap is particularly evident in Oman, where Arabic exhibits substantial sociolinguistic variation shaped by settlement patterns between sedentary (Hadari) and nomadic (Badu) communities, which are often overlooked by urban-centric or generalized Gulf Arabic datasets. We introduce OMAN-SPEECH, a sociolinguistically stratified spoken corpus for Omani Arabic comprising approximately 40 hours of spontaneous and semi-spontaneous speech from 32 speakers across 11 Wilayats (provinces). The corpus is balanced to capture regional and lifestyle variation and is annotated at the sentence level with Arabic transcription, English translation, and phonetic transcription using the International Phonetic Alphabet (IPA) through a human-in-the-loop annotation pipeline. OMAN-SPEECH provides a foundational resource for evaluating ASR and related speech technologies on Omani and Gulf Arabic varieties and supports more granular modeling of regional dialectal variation.
Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale
Hasan Abed Al Kader Hammoud | Mohamad Bilal Zbib | Bernard Ghanem
We present HALA, a family of Arabic-centric instruction and translation models built with our translate-and-tune pipeline. We first compress a strong AR↔EN teacher to FP8 (yielding ~2× higher throughput with no quality loss) and use it to create high-fidelity bilingual supervision. A lightweight language model LFM2–1.2B is then fine-tuned on this data and used to translate high-quality English instruction sets into Arabic, producing a million-scale corpus tailored to instruction following. We train HALA models at 350M, 700M, 1.2B, and 9B parameters, and apply slerp merging to balance Arabic specialization with base-model strengths. On Arabic-centric benchmarks, HALA achieves state-of-the-art results within both the "nano" (≤2B) and "small" (7–9B) categories, outperforming their bases. We are committed to releasing models, data, evaluation, and recipes to accelerate research in Arabic NLP.
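For readers unfamiliar with slerp merging, the sketch below shows spherical linear interpolation between two weight vectors, treated here as flat lists of floats. It is a generic illustration of the operation, not the HALA merging recipe; the interpolation factor `t` and the fallback to linear interpolation for near-parallel vectors are assumptions.

```python
import math
from typing import List


def slerp(a: List[float], b: List[float], t: float) -> List[float]:
    """Spherical linear interpolation between vectors a and b at fraction t."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_omega = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if abs(sin_omega) < 1e-6:
        # Vectors are (anti)parallel: fall back to plain linear interpolation.
        return [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    w_a = math.sin((1.0 - t) * omega) / sin_omega
    w_b = math.sin(t * omega) / sin_omega
    return [w_a * x + w_b * y for x, y in zip(a, b)]


# Toy example: halfway between two orthogonal unit vectors.
midpoint = slerp([1.0, 0.0], [0.0, 1.0], 0.5)
```

In a model merge this operation would typically be applied tensor by tensor, with `t` controlling how far the result moves from the base weights toward the specialized ones.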
Arabic Citation Parsing using Part of Speech and Named Entity Recognition
Youssef Karout | Hadi Hamoud | Fadi A. Zaraket
This paper introduces an industry-level citation element extractor for Arabic text. Citation element extraction enables editorial task automation for publishing houses, creation of citation networks, and automatic citation analytics for impact analysis firms. Citation library tools help users manage their citations. However, for Arabic, these tools lack basic support for identifying and extracting citation elements. Consequently, researchers, editors, and reviewers manage Arabic citation tasks manually. We present a novel Arabic citation element dataset, use it to train a citation element extraction model, and use named entity recognition, morphological analysis, and keyword detection to improve the results for practical use. The paper reports industry-ready performance, with F1 scores ranging between 0.80 and 0.95 for the citation elements of interest.
DeformAR: A Visual Analytics Framework for Evaluation of Arabic Named Entity Recognition
Ahmed Mustafa Younes
Arabic Named Entity Recognition (ANER) presents challenges due to its linguistic characteristics (Qu et al., 2023). While Transformer models have advanced ANER, evaluation still relies heavily on aggregate metrics like F1 score that obscure the interplay between data characteristics, model behaviour, and error patterns. We present DeformAR, a diagnostic visual analytics framework for evaluating and diagnosing Arabic NER systems through structured, component-level analysis and interpretability. DeformAR integrates quantitative metrics with interactive visualizations to support systematic error analysis, dataset and model debugging. In a case study on ANERCorp, DeformAR identifies annotation mistakes, model calibration issues, and subcomponent interaction effects. To our knowledge, this is the first open-source framework for component-level diagnostic evaluation and interpretability in Arabic NER, available at https://github.com/ay94/DeformAR.
Spoken Arabic exhibits substantial dialectal variation across the Arabic-speaking world. This paper presents a corpus-based analysis of Arabic dialectal variation using the SADA corpus, examining lexical, morphosyntactic, and discourse-pragmatic patterns across dialects. We combine quantitative frequency-based measures with qualitative linguistic analysis, including keyword comparison, distributional profiling, collocational and trigram analyses, and similarity-based clustering. Our results show that Arabic dialects share a substantial common core, while differing systematically in frequent discourse markers, evaluative expressions, and recurrent phraseological patterns. These findings provide empirical evidence for regional clustering among contemporary dialects and for variation relative to the standard register. The study contributes linguistic insights that support both Arabic dialectology and the development of dialect-aware NLP systems.
HACS-TL: Cross-Script Transfer Learning for Hausa Ajami Hate Speech Detection Using Transformer-Based Architecture
Abdulkadir Shehu Bichi | Muqaddar Ali | Prashant Sharma | Ismail Dauda Abubakar
Several languages written in Arabic-derived scripts face limited resources for hate speech detection, a challenge worsened by data scarcity and considerable linguistic complexity. We propose HACS-TL (Hausa Ajami Cross-Script Transfer Learning), a new transformer-based architecture that focuses on the detection of hate speech in Ajami script. Hausa is a Chadic language with over 77 million speakers in West Africa; it is written in two scripts, the Latin (Boko) and the Arabic-derived Ajami, which creates new computational difficulties. Our method combines scripts of artistically converted linguistics, augmented cross-script multi-head attention, and dialect feature extraction to handle the morphophonological depth of Hausa. In a thorough evaluation using stratified cross-validation with systematically augmented data, HACS-TL obtained a macro F1 score of 76.09%, a significant improvement over the multilingual baselines mBERT (69.17%), XLM-RoBERTa (73.20%), and AraBERT (58.63%). The abstract also lists the following figures: HACS-TL 70.73 + 10 % Cross-Script+ (mBERT) 46.73 + 0.9 % Cross-Script + AraBERT. Cross-script attention and transfer learning from better-resourced sources have proven effective for languages with limited script resources. Our systematic method has aided the advancement of Arabic script homage Hausa and African language resources for the NLP of the Nubians in learning African languages and the intricate Nubian and cross-learning systems from different scripts.
Code-Switching as a Safety Failure Mode in Large Language Models: An Empirical Study of Roman Urdu across English, Mixed, and Transliteration-Only Inputs
Waleed Jamil | Saima Rafi
Large Language Models exhibit robust safety alignment when harmful intent is expressed in English, yet their resilience to code-switching and transliteration remains underexplored. This paper presents the first targeted investigation of code-switching as a safety failure mode, focusing on Roman Urdu—a widely used transliterated form common in informal and emotionally expressive communication. We introduce the Roman Urdu Adversarial Benchmark (RUAB), a semantically controlled evaluation benchmark designed to isolate linguistic variation from intent across four safety-critical categories: passive suicidal ideation, psychological distress, threat or intimidation, and coercion or emotional manipulation. Evaluating seven state-of-the-art models, we find that safety detection degrades consistently in code-switched and transliterated inputs, with the most pronounced failures occurring for passive suicidal ideation. Instruction-tuned and reasoning-capable models demonstrate greater robustness, suggesting these failures reflect alignment gaps rather than inherent model limitations. Our findings highlight transliteration and code-switching as under-recognized safety risks and motivate the development of linguistically inclusive, transliteration-aware safety methods.
QAMAR: A New Fully Verified and Accurate Quranic Arabic Morphological Analysis Resource.
Sara Faqihi | Karim Bouzoubaa | Rachida Tajmout | Driss Namly
Several Quranic morphological corpora have been developed to support Arabic linguistic analysis and NLP applications, yet they often lack full coverage, consistency, or manual verification. We present QAMAR, a morphologically oriented, multi-task corpus derived from the Qur’an. This comprehensive, manually verified resource provides a detailed linguistic layer for every Quranic word, including the Modern Standard Arabic (MSA) equivalent, the stem, the lemma, the root, and the part of speech (POS). QAMAR supports multiple NLP tasks, such as normalization, lemmatization, root extraction, and POS tagging, and serves as a gold-standard reference for Quranic and Arabic NLP research, including corpus-to-corpus evaluation and morphological analyzer benchmarking. The paper details QAMAR’s annotation framework, verification process, and resource structure, and reports comparative analyses with existing Quranic morphological resources and outputs produced by current large language models (LLMs).
AraModernBERT: Transtokenized Initialization and Long-Context Encoder Modeling for Arabic
Omar Elshehy | Omer Nacar | Abdelbasset Djamai | Muhammed Ragab | Khloud Al Jallad | Mona Abdelazim
Encoder-only transformer models remain widely used for discriminative NLP tasks, yet recent architectural advances have largely focused on English. In this work, we present AraModernBERT, an adaptation of the ModernBERT encoder architecture to Arabic, and study the impact of transtokenized embedding initialization and native long-context modeling up to 8,192 tokens. We show that transtokenization is essential for Arabic language modeling, yielding dramatic improvements in masked language modeling performance compared to non-transtokenized initialization. We further demonstrate that AraModernBERT supports stable and effective long-context modeling, achieving improved intrinsic language modeling performance at extended sequence lengths. Downstream evaluations on Arabic natural language understanding tasks, including inference, offensive language detection, question-question similarity, and named entity recognition, confirm strong transfer to discriminative and sequence labeling settings. Our results highlight practical considerations for adapting modern encoder architectures to Arabic and other languages written in Arabic-derived scripts.
Parameter-Efficient Adaptation of Self-Supervised Models for Arabic Speech Recognition
Wafa Mohammed Alshehri | Wasfi G. Al-khatib | Mohammad Ismail Amro
Arabic speech recognition systems face distinct challenges due to the language’s complex morphology and dialectal variations. Self-supervised learning (SSL) models like XLS-R have shown promising results, but their size, over 300 million parameters, makes fine-tuning computationally expensive. In this work, we present the first comparative study of parameter-efficient fine-tuning (PEFT), specifically LoRA and DoRA, applied to XLS-R for Arabic ASR. We evaluate on the newly released Common Voice Arabic V24.0 dataset, establishing new benchmarks. Our full fine-tuning achieves state-of-the-art results among XLS-R-based models with a 23.03% word error rate (WER). In our experiments, LoRA achieved a 36.10% WER while training just 2% of the model’s parameters. DoRA reached 45.20% WER in initial experiments. We analyze the trade-offs between accuracy and efficiency, offering practical guidance for developing Arabic ASR systems when computational resources are limited. The models and code are publicly available.
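The WER figures above follow the usual definition: word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is a minimal version of that calculation with whitespace tokenization; it is not the authors' scoring script, and any text normalization they apply is not reproduced here.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    prev: List[int] = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Toy example: one substitution and one deletion over four reference words.
toy_wer = wer("a b c d", "a x c")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.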
Current state of LLMs for Arabic dialectal machine translation
Josef Jon | Rawan Bondok | Ondřej Bojar
This work presents an evaluation of large language models (LLMs) for English to dialectal Arabic machine translation on the MADAR dataset. We evaluate both translation directions (English to Arabic and vice-versa) on 16 Arabic dialects. Our experiments cover a diverse set of models, including specialized Arabic models (Jais, Nile), multilingual models (Gemma, Command-R, Mistral, Aya), and commercial APIs (GPT-4.1). We employ multiple evaluation metrics: BLEU, CHRF, COMET (both reference-based and reference-less variants) and GEMBA (LLM-as-a-judge), as well as a small-scale manual evaluation, to assess translation quality. We discuss the challenges of automatic MT evaluation, especially in the context of Arabic dialects. We also evaluate the ability of LLMs to classify the dialect used in a text. The study offers insights into the capabilities and limitations of current LLMs for dialectal Arabic machine translation, particularly highlighting the difficulty of handling dialectal diversity, although the results may be influenced by possible training data contamination, which is always a concern with LLMs.
A Hybrid Confidence-Aware Framework for Arabic Toxicity Detection in Social Media
Fawzia Zaal Alanazi | Asma Mohammed Alamri | Arwa Bin Saleh | Abdullah I. Alharbi
Automatic detection of toxic and offensive content in Arabic social media is a challenging task due to rich morphology, dialectal variation, and noisy writing styles. While transformer-based language models have achieved strong performance, they often produce uncertain predictions in borderline cases. This paper presents a hybrid framework for Arabic toxicity detection that combines a pretrained Arabic-specific transformer model with a confidence-aware rule-based mechanism. The proposed approach activates automatically induced lexical rules only when the model prediction falls within a predefined gray zone of uncertainty, preserving neural dominance while improving robustness and interpretability. Experiments conducted on a manually annotated dataset of 35,000 Arabic posts demonstrate that the hybrid approach achieves consistent improvements over the baseline model, particularly in reducing false negatives for toxic content. The results indicate that selective rule activation is an effective strategy for enhancing reliability in real-world Arabic social media moderation systems.
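The selective rule activation described above can be sketched as follows: the classifier's probability decides the label unless it falls inside an uncertainty band, in which case a lexical rule is consulted. The band limits (0.4 to 0.6), the single lexicon-match rule, and all names below are assumptions for illustration; the paper's induced rules and thresholds are not given in the abstract.

```python
from typing import Set, Tuple


def hybrid_predict(p_toxic: float, text: str, toxic_lexicon: Set[str],
                   low: float = 0.4, high: float = 0.6) -> Tuple[str, str]:
    """Return (label, source), where source is 'rule' or 'model'.

    The lexical rule is consulted only when the model's probability lies in
    the gray zone [low, high]; otherwise the model's decision stands.
    """
    if low <= p_toxic <= high:
        tokens = set(text.split())
        if tokens & toxic_lexicon:
            return "toxic", "rule"
    label = "toxic" if p_toxic >= 0.5 else "non-toxic"
    return label, "model"


# Placeholder tokens standing in for lexicon entries.
toy_lexicon = {"badword"}
confident = hybrid_predict(0.05, "some badword here", toy_lexicon)
borderline = hybrid_predict(0.45, "some badword here", toy_lexicon)
```

In this sketch the rule can only turn a borderline prediction into "toxic", which mirrors the stated goal of reducing false negatives while leaving confident model predictions untouched.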
Arabic-Adapted One-Step Speech-to-Diacritized ASR: Evaluation and Error Analysis
Osamah A. I. Abduljalil | Dalal Ali | Razan A. Bajaman | Abdullah I. Alharbi
Arabic diacritics encode phonetic information essential for pronunciation, disambiguation, and downstream applications, yet most Arabic ASR systems generate undiacritized output. In this work, we study direct speech-to-diacritized-text recognition using a single-stage ASR pipeline that predicts diacritics jointly with Arabic letters, without text-based post-processing. We evaluate two Arabic-adapted ASR architectures—wav2vec 2.0 XLSR-53 and Whisper-base—under a unified experimental setup on the ClArTTS Classical Arabic dataset. Performance is assessed using surface and lexical WER/CER alongside diacritic error rate (DER) to disentangle base transcription accuracy from diacritic realization. Our results show that Arabic-adapted wav2vec 2.0 achieves substantially lower diacritic error rates than Whisper, indicating stronger exploitation of acoustic cues relevant to vowelization. We further analyze the effect of decoding strategy and provide a detailed breakdown of diacritic errors, highlighting challenges associated with short vowels and morphosyntactic markers. These findings underscore the importance of model architecture and Arabic-specific adaptation for accurate diacritized Arabic ASR.
GATech at AbjadGenEval Shared Task: Multilingual Embeddings for Arabic Machine-Generated Text Classification
Ahmed Khamis
We present our approach to the AbjadGenEval shared task on detecting AI-generated Arabic text. We fine-tuned the multilingual E5-large encoder for binary classification, and we explored several pooling strategies to pool token representations, including weighted layer pooling, multi-head attention pooling, and gated fusion. Interestingly, none of these outperformed simple mean pooling, which achieved an F1 of 0.75 on the test set. We believe this is because complex pooling methods introduce additional parameters that need more data to train properly, whereas mean pooling offers a stable baseline that generalizes well even with limited examples. We also observe a clear pattern in the data: human-written texts tend to be significantly longer than machine-generated ones.
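Mean pooling, the strategy that performed best here, simply averages the token vectors of a sequence while skipping padding positions. The sketch below shows this on plain Python lists; in practice it would run on the encoder's hidden states as tensors, and the function name and toy values are illustrative only.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]],
              attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is non-zero."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy sequence: two real tokens followed by one padding position.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```

Unlike attention pooling or gated fusion, this adds no trainable parameters, which is consistent with the abstract's explanation of why it generalizes well with limited data.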
AraLingBench: A Human-Annotated Benchmark for Evaluating Arabic Linguistic Capabilities of Large Language Models
Mohamad Bilal Zbib | Hasan Abed Al Kader Hammoud | Ammar Mohanna | Nadine Rizk | Fatima Karnib | Sina Moukaled | Bernard Ghanem
We present AraLingBench, a fully human annotated benchmark for evaluating the Arabic linguistic competence of large language models (LLMs). The benchmark spans five core categories: grammar, morphology, spelling, reading comprehension, and syntax, through 150 expert designed multiple choice questions that directly assess structural language understanding. Evaluating 35 Arabic and bilingual LLMs reveals that current models demonstrate strong surface level proficiency but struggle with deeper grammatical and syntactic reasoning. AraLingBench highlights a persistent gap between high scores on knowledge-based benchmarks and true linguistic mastery, showing that many models succeed through memorization or pattern recognition rather than authentic comprehension. By isolating and measuring fundamental linguistic skills, AraLingBench provides a diagnostic framework for developing Arabic LLMs. The benchmark and evaluation code are available on Hugging Face and GitHub.
REGLAT at AbjadMed: Handling Imbalanced Arabic Medical Text Classification via Hierarchical KNN-MLP Architecture
Ahmed M. Fetouh | Mohammed Rahmath | Omer Dawood | Mariam Labib | Nsrin Ashraf | Hamada Nayel
In this paper, we demonstrate the system submitted to the shared task of medical text classification in Arabic. We proposed a single-model approach based on fine-tuned LLM-based embedding combined with hierarchical classical classifiers, achieving a competitive macro F1-score of 0.46 on the blind test set. We explored various modeling strategies, including tree-based ensembles, LLM, and hierarchical correction for rare classes, highlighting the effectiveness of domain-specific fine-tuning in low-resource settings. The results demonstrate that a single fine-tuned Arabic BERT variant can serve as a strong baseline in extreme imbalance scenarios, outperforming more complex ensembles in simplicity and reproducibility.
Murabaa: A comprehensive Resource Platform for Arabic Morphology
Karim Bouzoubaa | Driss Namly | Hamid Jihad | Rachida Tajmout | Jamal Ezzouaine | Hakima Khamar
The Arabic language faces technical and cultural challenges, including a lack of high-quality resources and the prevalence of regional dialects, which hinder the development of effective language processing systems. Therefore, the "Murabaa" platform was developed to transform Arabic linguistic knowledge into integrated digital resources. The platform aims to provide accurate digital content and promote the use of Arabic in various fields to bridge the gap between tradition and modernity by offering integrated linguistic resources for developing advanced research tools. The platform provides eight accurate dictionaries in the form of a website and a web application, contributing to the digitization of knowledge and its representation within the framework of standard lexical markup. In this study, we also conduct a quantitative comparison of the resources against similar ones to assess the quality of the linguistic knowledge they provide.
Sujith Kanakkassery at AbjadMed: Imbalance-Aware Transformer Fine-tuning for Arabic Medical Text Classification
Sujith Kanakkassery
This paper describes our system submitted to the AbjadMed 2026 shared task at AbjadNLP. The task focuses on the multi-class classification of Arabic medical texts under severe class imbalance. Our approach fine-tunes a pre-trained Arabic Transformer model and incorporates several imbalance-aware strategies, including data cleaning, class-weighted loss, and label smoothing. Through ablation experiments, we observe consistent improvements over a baseline system, demonstrating the effectiveness of these techniques in improving performance on underrepresented medical categories. Finally, our error analysis highlights persistent challenges related to label sparsity and semantic overlap among medical classes.
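As a rough illustration of two of the imbalance-aware strategies named in this abstract (class-weighted loss and label smoothing), the following Python sketch computes a per-example loss. The inverse-frequency weighting scheme, the smoothing value `epsilon=0.1`, and both function names are assumptions made for illustration; the abstract does not say how the authors configured either technique.

```python
import math


def inverse_frequency_weights(labels, num_classes):
    """Weight each class by N / (K * count), so rare classes count more.

    Hypothetical helper: one common weighting scheme, not necessarily
    the one used in the paper.
    """
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return [n / (num_classes * c) if c else 0.0 for c in counts]


def smoothed_weighted_ce(logits, target, weights, epsilon=0.1):
    """Cross-entropy of one example against a label-smoothed target,
    scaled by the weight of the true class."""
    k = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_z for v in logits]
    # Smoothed target: epsilon spread uniformly, the rest on the true class.
    q = [epsilon / k + (1.0 - epsilon if i == target else 0.0) for i in range(k)]
    loss = -sum(qi * lp for qi, lp in zip(q, log_probs))
    return weights[target] * loss
```

With labels `[0, 0, 0, 1]`, the minority class receives a weight three times larger than the majority class, so errors on it contribute proportionally more to the total loss.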
A Knowledge Graph Based Diagnostic Framework for Analyzing Hallucinations in Arabic Machine Reading Comprehension
Najwa Abdullah AlGhamdi | Sadam Al-Azani | Kwabena Nuamah | Alan Bundy
Large Language Models (LLMs) frequently generate answers that are fluent but not fully grounded in the provided context, a phenomenon commonly referred to as hallucination. While recent work has explored hallucination detection primarily in English and open domain settings, comparatively little attention has been given to Arabic machine reading comprehension (MRC), particularly in culturally sensitive domains such as Qur’anic texts. In this paper, we present a knowledge graph based diagnostic framework for analyzing hallucinations and question misalignment in Arabic MRC. Rather than proposing a new detection model or metric, the framework provides an interpretable, triple level analysis of model generated answers by comparing subject-relation-object representations derived from the passage, the question, and the answer. The approach incorporates question-aware filtering and operates under weak supervision, combining automatic analysis with targeted human adjudication to handle annotation gaps and semantic ambiguity. We apply the framework to the Qur’anic Reading Comprehension Dataset (QRCD) and demonstrate how it exposes systematic hallucination patterns that are difficult to capture using surface level similarity metrics alone, particularly for questions requiring justification or abstract interpretation. The results highlight the value of structured, transparent diagnostic evaluation for understanding LLM behavior in low resource and high stakes Arabic NLP settings.
From Posts to Pressure: An Arabic Dataset about Stress and Mental-Health Monitoring
Wajdi Zaghouani | Eman Sedqy Shlkamy | Mabrouka Bessghaier
How do Arabic-speaking communities express and engage with psychological stress on social media? We introduce AraStress, the first large-scale Arabic corpus dedicated to psychological stress research, comprising 175,862 public social media posts from 2020 to 2024, covering pandemic and post-pandemic periods. It fills a significant gap in Arabic mental-health NLP resources focused on stress, enabling large-scale analysis of related expressions. Unlike prior work focusing primarily on Twitter and depression or suicidality, AraStress addresses the critical gap in stress-focused resources. Our lexicon-based analysis reveals that stress-related posts elicit predominantly affective engagement and exhibit a hybrid lexical framing that integrates religious and therapeutic language. AraStress provides a foundational resource for culturally grounded computational models of stress detection and digital wellbeing in Arabic-speaking communities.
HCMUS_TheFangs at AbjadGenEval Shared Task: Weighted Layer Pooling with Attention Fusion for Arabic AI-Generated Text Detection
Duy Minh Dao Sy | Nguyen Chi Tran | Trung Kiet Huynh | Nguyen Lam Phu Quy | Pham Phu Hoa | Nguyen Dinh Ha Duong
The rapid advancement of large language models poses significant challenges for content authenticity, particularly in under-resourced languages where detection tools remain scarce. We present our winning system for the AbjadGenEval shared task on Arabic AI-generated text detection. Our key insight is that AI-generated text exhibits distinctive patterns across multiple linguistic levels, from local syntax to global semantics, that can be captured by learning to fuse representations from different transformer layers. We introduce a Weighted Layer Pooling mechanism that learns optimal layer combinations, combined with Attention Pooling for sequence-level context aggregation. Through systematic experimentation with 15+ approaches, we make a surprising discovery: model architecture selection dominates over sophisticated training techniques, with DeBERTa-v3 providing +27% relative improvement over AraBERT regardless of training strategy. Our system achieves 0.93 F1-score, securing 1st place among all participants and outperforming the runner-up by 3 absolute points.
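The two pooling steps named in this abstract can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the softmax normalisation of the layer weights, the dot-product attention scores, and the names `weighted_layer_pool`, `attention_pool`, `layer_logits`, and `query` are all hypothetical, and in a real system the logits and query vector would be learned parameters.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_pool(layers, layer_logits):
    """Fuse per-layer hidden states with softmax-normalised layer weights.

    layers: [num_layers][seq_len][hidden]; returns [seq_len][hidden].
    """
    w = softmax(layer_logits)
    seq_len, hidden = len(layers[0]), len(layers[0][0])
    return [
        [sum(w[l] * layers[l][t][d] for l in range(len(layers))) for d in range(hidden)]
        for t in range(seq_len)
    ]


def attention_pool(tokens, query):
    """Collapse a token sequence into one vector using attention weights.

    tokens: [seq_len][hidden]; query: [hidden]; returns [hidden].
    """
    scores = [sum(a * b for a, b in zip(tok, query)) for tok in tokens]
    alpha = softmax(scores)
    hidden = len(tokens[0])
    return [sum(alpha[t] * tokens[t][d] for t in range(len(tokens))) for d in range(hidden)]
```

Applying `weighted_layer_pool` first and `attention_pool` second yields one fixed-size vector per document, which a classifier head could then consume.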
HCMUS_The Fangs at AbjadStyleTransfer Shared Task: Learning to Query Style, Contrastive Representations for Zero-Shot Arabic Authorship Style Transfer
Duy Minh Dao Sy | Trung Kiet Huynh | Nguyen Chi Tran | Nguyen Lam Phu Quy | Pham Phu Hoa | Nguyen Dinh Ha Duong
This paper describes the system developed by team HCMUS_The Fangs for the AbjadStyleTransfer shared task (ArabicNLP 2026), where we achieved 1st place. We present a contrastive style learning approach for zero-shot Arabic authorship style transfer. Our key discovery is that the 21 test authors (including Nobel laureate Naguib Mahfouz and literary pioneer Taha Hussein) have zero overlap with the 32,784 training authors, transforming this into a pure zero-shot challenge. This insight led us to develop a dual-encoder architecture that learns transferable style representations through contrastive objectives, rather than memorizing author-specific patterns. Our system achieves 19.77 BLEU and 55.74 chrF, outperforming retrieval-augmented generation (+18%) and multi-task learning (+31%). Counter-intuitively, we find that sophisticated architectural modifications like style injection consistently degrade performance, while simpler approaches that preserve pre-trained knowledge excel. Our analysis reveals that for famous authors, pre-trained Arabic language models already encode substantial stylistic knowledge; the key is surfacing it, not learning from scratch.
Large Language Models (LLMs) have rapidly proliferated, presenting challenges in distinguishing human-written text from AI-generated content, especially in low-resource languages like Urdu. This paper introduces U-RoCX, a novel hybrid architecture for the AbjadGenEval Shared Task on AI-Generated Urdu Text Detection. U-RoCX combines the multilingual semantic capabilities of a frozen XLM-RoBERTa backbone with local feature extraction from Convolutional Neural Networks (CNNs) and the advanced sequential modeling of the recently proposed Extended LSTM (xLSTM). By utilizing xLSTM’s matrix memory and covariance update rules, the model addresses traditional Recurrent Neural Network bottlenecks. Experimental results demonstrate the robustness of U-RoCX, achieving a balanced accuracy and F1-score of 88% on the test set.
HCMUS_PrisonDilemma at AbjadAuthorID Shared Task: Less is More with Base Models
Trung Kiet Huynh | Duy Minh Dao Sy | Nguyen Chi Tran | Pham Phu Hoa | Nguyen Lam Phu Quy | Truong Bao Tran
We present our approach to the AbjadNLP 2026 Arabic Authorship Identification shared task, achieving 4th place. Our key finding is that AraBERT-base (110M) outperforms AraBERT-large (340M) on the test set with macro F1 of 0.8449 versus 0.8096, despite lower validation scores. We handle long passages via sliding window chunking with mean pooling, and use a two-stage classification head with dual dropout for regularization. Per-class analysis reveals that translated works achieve perfect F1 while classical poets remain challenging due to shared formal structures. Our results challenge the "scale is all you need" assumption for stylometric tasks.
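The long-passage handling mentioned in this abstract (sliding window chunking with mean pooling) might look roughly like the sketch below. The window and stride sizes and the function names are assumptions for illustration; the abstract does not give the values the authors used, and the per-chunk vectors would come from the encoder rather than being supplied by hand.

```python
def sliding_windows(token_ids, window=512, stride=256):
    """Split a token sequence into overlapping windows.

    window and stride are illustrative defaults, not values from the paper.
    """
    if len(token_ids) <= window:
        return [token_ids]
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + window])
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(token_ids):
            break
        start += stride
    return chunks


def mean_pool(vectors):
    """Average a list of equal-length chunk vectors into one passage vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]
```

Each chunk would be encoded separately, and `mean_pool` would then average the chunk embeddings so that the classifier sees a single vector per passage regardless of length.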
U-MIRAGE: Benchmarking Chain-of-Thought Reasoning for Urdu Medical QA
Ali Faheem | Faizad Ullah | Muhammad Hammad | Ahmed Hassan | Muhammad Sohaib Ayub | Asim Karim
Medical AI systems increasingly rely on large language models (LLMs), yet their deployment in linguistically diverse regions remains unexplored. We address this gap by introducing U-MIRAGE, the first medical question-answering benchmark for Urdu and Roman Urdu. Urdu is the 11th most spoken language (with over 246 million speakers) worldwide. Our systematic evaluation of six state-of-the-art LLMs reveals three main findings. (1) A 6% to 10% drop in performance when moving from English to Urdu variants, even though medical knowledge should theoretically transfer across languages. (2) Chain-of-Thought (CoT) prompting improves small models by 8% to 20%, while surprisingly the larger models’ performance degraded by up to 3%. (3) Quantized small models fail catastrophically in low-resource languages, achieving near-random accuracy regardless of various prompting strategies. These findings challenge core assumptions about multilingual medical AI systems. Roman Urdu consistently outperforms standard Urdu script, suggesting orthographic alignment with pre-training data matters more than linguistic proximity. CoT prompting effectiveness depends critically on model architecture rather than task complexity alone. Our contributions are threefold: (1) U-MIRAGE, (2) systematic benchmarking of LLMs for Urdu and Roman Urdu medical reasoning, and (3) empirical analysis of CoT prompting in low-resource contexts. Our code and datasets are publicly available.
XLMR-Urdu at AbjadGenEval Shared Task: A Data-Centric Transformer-Based Approach for AI-Generated Urdu Text Detection
Mohannad Mohammad Hendi
The rapid advancement of large language models (LLMs) has led to a substantial increase in automatically generated textual content, raising concerns regarding misinformation, plagiarism, and authorship verification. These challenges are particularly pronounced for low-resource languages such as Urdu, where limited annotated data and complex linguistic properties hinder robust detection. In this paper, we present a transformer-based approach for binary classification of human-written versus AI-generated Urdu text, developed for the AbjadGenEval Task 2 shared task. Beyond model fine-tuning, we adopt a data-centric perspective, emphasizing dataset diagnostics, document-level inference, and calibration strategies. Our system achieves strong performance on the official test set, with an F1-score of 88.68% and balanced accuracy of 88.71%. Through empirical analysis, we demonstrate that dataset characteristics and generator-specific artifacts play a dominant role in model generalization, highlighting critical directions for future research in low-resource AI-generated text detection.
This paper describes our system submitted to the AbjadGenEval Shared Task at ArabicNLP 2026, which focuses on binary classification of human-written versus machine-generated text in low-resource languages. We participated in two independent subtasks targeting Arabic and Urdu news and literary texts. Our approach relies exclusively on fine-tuning XLM-RoBERTa, a multilingual Transformer-based model, under carefully controlled training and preprocessing settings. While the same model architecture was used for both subtasks, language-specific data handling strategies were applied based on empirical observations. The proposed system achieved first place in the Urdu subtask and third place in the Arabic subtask according to the official evaluation. These results demonstrate that multilingual pretrained models can serve as strong and reliable systems for AI-generated text detection across diverse languages.
The proliferation of Large Language Models (LLMs) has introduced significant challenges regarding algorithmic bias, privacy, and the authenticity of digital content. While detection mechanisms for English are maturing, low-resource languages like Urdu—spoken by over 100 million people—require dedicated research. In this paper, we present a technical framework for Urdu AI-generated text detection developed for the *ACL shared task. We propose a hybrid pipeline that combines TF-IDF Character N-grams with a custom stylometric feature extractor designed to capture unique Urdu linguistic markers, including repeated word ratios, punctuation density, and formal function markers. Using a Linear Support Vector Machine (SVM) optimized via Stochastic Gradient Descent (SGD), our system achieves a balanced accuracy and F1-score of 87.80% on a dataset of 6,800 records. Our results demonstrate that a computationally efficient, classical machine learning approach—prioritizing stylistic signals over heavy preprocessing—remains highly effective for distinguishing between human-written and AI-generated Urdu text.
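Two of the stylometric signals named in this abstract (repeated word ratio and punctuation density), together with raw character n-gram counts, can be sketched as below. The exact feature definitions, the punctuation inventory, and the function names are assumptions for illustration; the abstract does not define them, and the TF-IDF weighting and SVM training steps are omitted.

```python
from collections import Counter

# Assumed punctuation inventory: Urdu full stop, Arabic comma, Arabic question
# mark, Arabic semicolon, plus common ASCII marks.
PUNCTUATION = set("\u06d4\u060c\u061f\u061b.,;:!?")


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (the raw input to a TF-IDF step)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def stylometric_features(text):
    """Return two hand-crafted style signals for one document."""
    words = text.split()
    counts = Counter(words)
    # Tokens belonging to a word type that occurs more than once.
    repeated = sum(c for c in counts.values() if c > 1)
    punct = sum(1 for ch in text if ch in PUNCTUATION)
    return {
        "repeated_word_ratio": repeated / len(words) if words else 0.0,
        "punctuation_density": punct / len(text) if text else 0.0,
    }
```

In a pipeline like the one described, these dense features would be concatenated with the sparse n-gram vector before being passed to the linear classifier.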
QalamID at AbjadAuthorID Shared Task: Morphology Matters, A Hybrid Ensemble for Arabic Authorship Attribution
Youssef Zaghloul
Arabic authorship attribution presents unique challenges due to the language’s rich derivational morphology, which often fragments word-level frequencies. In this paper, we describe our winning submission to the AbjadAuthorID Shared Task. We propose a hybrid ensemble system that fuses the morphological precision of character n-gram LinearSVCs with the semantic understanding of fine-tuned Transformers (AraBERT and XLM-RoBERTa). Contrary to current trends in NLP, we demonstrate that traditional character n-grams (0.92 F1) significantly outperform deep learning baselines (AraBERT 0.87 F1) for this task, suggesting that authorial signature in Arabic is encoded more densely in morphological patterns than in semantic content. Our final system employs a novel Precision Scalpel post-hoc calibration technique and selective pseudo-labeling to address class imbalance and genre confounds. The system achieved the 1st place ranking with a macro F1-score of 0.932 and accuracy of 0.963 on the test set.
Kashif-AI at AbjadGenEval Shared Task: A Transformer-based Approach for Arabic AI-Generated Text Detection
Fatimah Mohamed Emad Eldin
As Large Language Models (LLMs) become increasingly proficient at generating human-like text, distinguishing between human-written and machine-generated content has become a critical challenge for information integrity. This paper presents Kashif-AI, a system developed for the AbjadGenEval Task 1: AI-Generated Arabic Text Detection. The approach leverages fine-tuned Arabic Pre-trained Language Models (PLMs), specifically MARBERT and CAMeLBERT, to classify news articles. A rigorous ablation study was conducted to evaluate the impact of data augmentation, comparing models trained on the official shared task data against those trained on a combined corpus of over 47,000 samples. While near-perfect performance was observed during validation, the blind test set evaluation revealed a significant generalization gap. Contrary to expectations, data augmentation resulted in performance degradation due to domain shifts. The best-performing configuration, which utilized CAMeLBERT-Mix trained on the original dataset, achieved an F1-score of 66.29% and an Accuracy of 70.5% on the blind test set.
NileUn at AbjadGenEval Shared Task: Contrastive Learning with Stacking Ensemble for Efficient Arabic AI-Generated Text Detection
Mohamed Hussein Mohamed | Shrouk Shalaby | Nesreen Mohamed
We present a computationally efficient approach for detecting AI-generated Arabic text as part of the AbjadGenEval shared task. Our method combines Supervised Contrastive Learning with a Stacking Ensemble of AraBERT and XLM-RoBERTa models. Our training pipeline progresses through three stages: (1) standard fine-tuning without contrastive loss, (2) adding supervised contrastive loss for better embeddings, and (3) further fine-tuning on diverse generation styles. On our held-out test split, the stacking ensemble achieves F1=0.983 before fine-tuning. On the official workshop test data, our system achieved 4th place with F1=0.782, demonstrating strong generalization using only encoder-based transformers without requiring large language models. Our implementation is publicly available.
REGLAT at AbjadGenEval: Multi-Model Ensemble Approach for Arabic AI-Generated Text Detection
Mariam Labib | Nsrin Ashraf | Ahmed M. Fetouh | Hamada Nayel
The rapid advancement of large language models necessitates robust methods for detecting AI-generated Arabic text. This paper presents our system for distinguishing human-written from machine-generated Arabic content. We propose a weighted ensemble combining AraBERTv2 and BERT-base-arabic, trained via 5-fold stratified cross-validation with class-balanced loss functions. Our methodology incorporates Arabic text normalization, strategic data augmentation using 16,678 samples from external scientific abstracts, and threshold optimization prioritizing recall. On the official test set, our system achieved an F1-score of 0.763, an accuracy of 0.695, a precision of 0.624, and a recall of 0.980, demonstrating strong detection of machine-generated texts with minimal false negatives at the cost of elevated false positives. Analysis reveals critical insights into precision-recall trade-offs and challenges in cross-domain generalization for Arabic AI text detection.
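The precision-recall trade-off described in this abstract can be made concrete with a small sketch of recall-oriented threshold selection. Using the F-beta score with `beta=2` as the selection criterion is an assumption for illustration; the abstract states only that threshold optimization prioritized recall, not which objective was used, and the function names are hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall of the positive (machine-generated) class."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def pick_threshold(scores, labels, beta=2.0):
    """Choose the candidate threshold with the highest F-beta on held-out data."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f_beta(*precision_recall(scores, labels, t), beta))
```

As a consistency check, `f_beta(0.624, 0.980, beta=1.0)` evaluates to about 0.7625, which agrees with the reported F1-score of 0.763 once rounding of the published precision and recall is taken into account.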
AyahVerse at AbjadGenEval Shared Task: Monolingual Precision and Cross-Lingual Analysis in Perso-Arabic AI Detection
Fizza Nawaz | Ibad-ur-Rehman Rashid | Uswa Abid | Junaid Hussain
This paper presents our submission to the AbjadGenEval shared task on AI-generated text detection in Arabic and Urdu. To address the challenges of morphologically rich and low-resource environments, we developed a composite framework leveraging monolingual specialists (AraBERTv2, CAMeLBERT-DA) and multilingual transformers. Our system achieved robust in-domain performance with Test F1-scores of 0.75 for Arabic and 0.86 for Urdu. Methodologically, we tested both raw and normalized text to distinguish whether models detect based on semantic content or on surface artifacts such as punctuation and formatting patterns. Furthermore, our cross-lingual investigations reveal directional performance differences, where Urdu-trained models achieve 0.75 F1 on Arabic, while Arabic-trained models achieve only 0.61 F1 on Urdu. Despite this difference, both directions maintained notably high recall for the machine class, indicating that the model learns cross-lingual machine detection patterns across the Perso-Arabic script. Finally, transfer performance collapsed when internal layers were frozen, demonstrating that full fine-tuning is essential for cross-lingual detection. However, the observed performance differences may partly reflect data imbalance rather than purely linguistic factors.
AbjadMed: Arabic Medical Text Classification at AbjadNLP 2026
Pranav Gupta | Niranjan Kumar M | Balaji Nagarajan | Imed Zitouni | Mo El-Haj
We present AbjadMed, a shared task on Arabic medical text classification organised as part of the 2nd AbjadNLP workshop at EACL 2026. The task targets supervised multi-class classification under realistic conditions of severe class imbalance, fine-grained category structure, and naturally occurring label noise. Participants assign each Arabic medical question–answer instance to one of 82 predefined categories derived from real healthcare consultations. The dataset is based on the Arabic Healthcare Dataset (AHD) and is released as curated training and test splits containing 27,951 and 18,634 instances respectively, while preserving the original label distribution. Systems are evaluated using macro-averaged F1 to emphasise performance on minority medical topics. Results show that Arabic medical text classification remains challenging even with modern pretrained models, particularly for low-frequency and semantically overlapping categories. AbjadMed provides a reproducible benchmark for studying robustness and generalisation in Arabic healthcare NLP.
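To show why macro-averaged F1 emphasises minority categories, here is a minimal Python sketch of the metric. It follows the standard definition (unweighted mean of per-class F1); the function name and the choice to score classes with no true or predicted instances as 0 are assumptions, and the official scorer may handle edge cases differently.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)
```

For example, predicting the majority class for every instance of `y_true = [0, 0, 0, 1]` gives 75% accuracy but a macro F1 of only about 0.43, because the missed minority class contributes an F1 of 0.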
Uslub at AbjadAuthorID Shared Task: A Comparative Analysis of Traditional Machine Learning and Transformer-Based Models for Authorship Attribution in Arabic and Urdu
Shahad Alsuhaibani | Mohamed Alkaoud
Authorship attribution is a critical task in natural language processing with applications ranging from forensic linguistics to plagiarism detection. While well-studied in high-resource languages, it remains challenging for low-resource languages like Arabic and Urdu. In this paper, we present our participation in the AbjadNLP shared task, where we systematically evaluate three distinct approaches: traditional machine learning using SVM with TF-IDF features, fine-tuned transformer-based models (AraBERT), and LLMs. We demonstrate that while fine-tuned AraBERT excels in Arabic, traditional lexical models (SVM) prove more robust for Urdu, outperforming both BERT-based and LLM approaches. We also show that few-shot prompting with LLMs, when operated as a reranker over top candidates, significantly outperforms zero-shot baselines. Our final systems achieved competitive performance, ranking 6th and 1st in the Arabic and Urdu tasks respectively.
Arabic Author Attribution Using Transformer-Based Models: Insights from the AbjadAuthorID Shared Task
Ghader Kurdi
This paper describes the author’s participation in the Arabic track of the AbjadAuthorID shared task, which focuses on multiclass authorship attribution using transformer-based models. The task involves identifying the author of a given text excerpt drawn from diverse genres and historical periods, posing significant challenges due to stylistic variation and linguistic richness. Experimental results demonstrate strong performance, with an ensemble of MARBERTv2 and ARBERTv2 achieving an accuracy of 92% and a macro-averaged F1 score of 89%, ranking second on the leaderboard and highlighting the effectiveness of the proposed approach for Arabic authorship identification.
R-R at AbjadAuthorID Shared Task: A Fine-Tuned Approach for Kurdish Authorship Identification
Rania Azad M. San Ahmed | Rebwar M. Nabi
Authorship identification is a fundamental task in natural language processing and computational stylistics. Despite significant advancements in high-resource languages, low-resource languages, particularly those utilizing non-Latin scripts, remain largely underexplored, leaving a critical gap in resources and benchmarks for this linguistically distinct, low-resource language. Addressing this oversight, this paper presents Task 3 of AbjadNLP 2026, the first shared task dedicated to authorship identification for Kurdish. The task introduces a newly constructed dataset designed to capture the unique phonological and orthographic features of Sorani Kurdish and formulate the task as a closed-set multiclass classification problem. To establish a robust baseline, we fine-tune the pretrained XLM-RoBERTa model to capture authorial, stylistic patterns. Experimental results on the test set demonstrate the efficacy of transformer-based representations for this domain, achieving an accuracy of approximately 75%.
AbjadGenEval: Abjad AI Generated Text Detection Shared Task for Languages Using Arabic Script at AbjadNLP 2026
Saad Ezzini | Irfan Ahmad | Salmane Chafik | Shadi Abudalfa | Mo El-Haj | Ahmed Abdelali | Mustafa Jarrar | Nadir Durrani | Hassan Sajjad | Farah Adeeba
We present the findings of the AbjadGenEval shared task, organized as part of the AbjadNLP workshop at EACL 2026, which benchmarks AI-generated text detection for Arabic-script languages. Extending beyond Arabic to include Urdu, the task serves as a binary classification platform distinguishing human-written from AI-generated news articles produced by varied LLMs (e.g., GPT, Gemini). Twenty teams participated, with top systems achieving F1 scores of 0.93 for Arabic and 0.89 for Urdu. The results highlight the dominance of multilingual transformers, specifically XLM-RoBERTa and DeBERTa-v3, and reveal significant challenges in cross-domain generalization, where naive data augmentation often yielded diminishing returns. This shared task establishes a robust baseline for authenticating content in the Abjad ecosystem.
AbjadAuthorID: Authorship Identification for Arabic-Script Languages at AbjadNLP 2026
Shadi Abudalfa | Saad Ezzini | Ahmed Abdelali | Mustafa Jarrar | Mo El-Haj | Nadir Durrani | Hassan Sajjad | Farah Adeeba | Sina Ahmadi
Authorship identification is a core problem in Natural Language Processing and computational linguistics, with applications spanning digital humanities, literary analysis, and forensic linguistics. While substantial progress has been made for English and other high-resource languages, authorship attribution for languages written in the Arabic (Abjad) script remains underexplored. In this paper, we present an overview of AbjadAuthorID, a shared task organised as part of the AbjadNLP workshop at EACL 2026, which focuses on multiclass authorship identification across Arabic-script languages. The shared task covers Modern Standard Arabic, Urdu, and Kurdish, and is formulated as a closed-set multiclass classification problem over literary text spanning multiple authors and historical periods. We describe the task motivation, dataset construction, evaluation protocol, and participation statistics, and report official results for the Arabic track. The findings highlight both the effectiveness of current approaches in controlled settings and the challenges posed by lower participation and resource availability in some language tracks. AbjadAuthorID establishes a new benchmark for multilingual authorship attribution in morphologically rich, underrepresented languages.
AbjadStyleTransfer: Authorship Style Transfer for Arabic-Script Languages at AbjadNLP 2026
Shadi Abudalfa | Saad Ezzini | Ahmed Abdelali | Mustafa Jarrar | Mo El-Haj | Nadir Durrani | Hassan Sajjad | Farah Adeeba
Authorship style transfer aims to rewrite a given text so that it reflects the distinctive style of a target author while preserving the original meaning. Despite growing interest in text style transfer, most existing work has focused on English and other high-resource languages, with limited attention to languages written in the Arabic script. In this paper, we present an overview of AbjadStyleTransfer, a shared task organised as part of the AbjadNLP workshop at EACL 2026, which targets authorship style transfer for Arabic-script languages with a strong focus on literary text. The shared task covers Modern Standard Arabic and Urdu, and is designed to encourage research on controllable text generation in morphologically rich and stylistically diverse languages. Participants are required to generate text that conforms to the writing style of a specified author, given a semantically equivalent formal input. We describe the task motivation, dataset construction, evaluation protocol, and participation statistics, and provide an initial discussion of the challenges associated with authorship style transfer in Arabic-script languages. AbjadStyleTransfer establishes a new benchmark for literary style transfer beyond Latin-script settings and supports future research on culturally grounded and linguistically informed text generation.
Proceedings of the 7th Workshop on African Natural Language Processing (AfricaNLP 2026)
Everlyn Asiko Chimoto | Constantine Lignos | Shamsuddeen Muhammad | Idris Abdulmumin | Clemencia Siro | David Ifeoluwa Adelani
Dealing with the Hard Facts of Low-Resource African NLP
Michael Leventhal | Yacouba Diarra | Nouhoum Coulibaly | Panga Azazia Kamaté | Aymane Dembélé | Madani Amadou Tall | Emmanuel Elise Kone
Creating speech datasets, models, and evaluation frameworks for low-resource languages remains challenging given the lack of a broad base of pertinent experience to draw from. This paper reports on the field collection of 612 hours of spontaneous speech in Bambara, a low-resource West African language; the semi-automated annotation of that dataset with transcriptions; the creation of several monolingual ultra-compact and small models using the dataset; and the automatic and human evaluation of their output. We offer practical suggestions for data collection protocols, annotation, and model design, as well as evidence for the importance of performing human evaluation. In addition to the main dataset, multiple evaluation datasets, models, and code are made publicly available.
M-MiniGPT4: Multilingual VLLM Alignment via Translated Data
Seung Hun Eddie Han | Youssef Mohamed | Mohamed Elhoseiny
This paper presents a Multilingual Vision Large Language Model, named M-MiniGPT4. Our model exhibits strong vision-language understanding (VLU) capabilities across 11 languages. We utilize a mixture of native multilingual and translated data to push the multilingual VLU performance of the MiniGPT4 architecture. In addition, we propose a multilingual alignment training stage that uses parallel text corpora to further enhance the multilingual capabilities of our model. M-MiniGPT4 achieves 36% accuracy on the multilingual MMMU benchmark, outperforming state-of-the-art models in the same weight class, including foundation models released after the majority of this work was completed. We open-source our models, code, and translated datasets to facilitate future research in low-resource and multilingual settings.
InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages
Mamadou K. Keita | Sebastien Diarra | Christopher M Homan | Seydou Diallo
Effective text generation and chat interfaces for low-resource languages (LRLs) remain a challenge for state-of-the-art large language models (LLMs) to support. This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer based on retrieval-augmented-generation (RAG)-based n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU in task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: **ZarmaInstruct-50k**, **BambaraInstruct-50k**, and **FulfuldeInstruct-50k**.
Leveraging CoHere Multilingual Embeddings and Inverted Softmax Retrieval for Automatic Parallel Sentence Alignment in Low-Resource Languages
Abubakar Auwal Khalid | Salisu Musa Borodo | Amina Abubakar Imam
We present an improved method for automatic parallel sentence alignment in low-resource languages. We used CoHere multilingual embeddings and inverted softmax retrieval. Our technique achieved a higher F1-score of 78.30% on the MAFAND-MT test set, compared to the existing technique’s 54.75%. Precision and recall have shown similar performance. We assessed the quality of the extracted data by demonstrating that it outperforms the existing technique in terms of low-resource translation performance.
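The abstract does not spell out the retrieval step, so the following is only a minimal sketch of inverted softmax retrieval in its general form: similarity scores are normalised over all source sentences for each target sentence, rather than over targets, which is usually motivated as a way to reduce "hub" targets that look close to everything. The toy vectors, the function names, and the temperature `beta=10.0` are assumptions for illustration; in practice the vectors would come from a multilingual embedding model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def inverted_softmax_align(src_vecs, tgt_vecs, beta=10.0):
    """Return, for each source sentence, the index of its best target sentence.

    beta is an assumed temperature; it is not taken from the paper.
    """
    n_src, n_tgt = len(src_vecs), len(tgt_vecs)
    sims = [[cosine(s, t) for t in tgt_vecs] for s in src_vecs]
    # "Inverted" direction: each target column is normalised over all sources.
    col_norm = [
        sum(math.exp(beta * sims[i][j]) for i in range(n_src))
        for j in range(n_tgt)
    ]
    scores = [
        [math.exp(beta * sims[i][j]) / col_norm[j] for j in range(n_tgt)]
        for i in range(n_src)
    ]
    return [max(range(n_tgt), key=lambda j: row[j]) for row in scores]

# Toy two-dimensional "embeddings" standing in for real sentence vectors.
source = [[1.0, 0.0], [0.0, 1.0]]
target = [[0.1, 0.9], [0.9, 0.1]]
alignment = inverted_softmax_align(source, target)  # -> [1, 0]
```

A real alignment pipeline would typically also apply a score threshold before accepting a pair; that step is omitted here.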
AfriCaption: Establishing a New Paradigm for Image Captioning in African Languages
Mardiyyah Oduwole | Prince Mireku | Fatimo Adebanjo | Oluwatosin Olajide | Mahi Aminu Aliyu | Jekaterina Novikova
Multimodal AI research has overwhelmingly focused on high-resource languages, hindering the democratization of advancements in the field. To address this, we present AfriCaption, a comprehensive framework for multilingual image captioning in 20 African languages. Our contributions are threefold: (i) a curated dataset built on Flickr8k, featuring semantically aligned captions generated via a context-aware selection and translation process; (ii) a dynamic, context-preserving pipeline that ensures ongoing quality through model ensembling and adaptive substitution; and (iii) the AfriCaption model, a 0.5B-parameter vision-to-text architecture that integrates SigLIP and NLLB200 for caption generation across underrepresented languages. This unified framework ensures ongoing data quality and establishes the first scalable image-captioning resource for underrepresented African languages, laying the groundwork for truly inclusive multimodal AI.
Developing an English–Efik Corpus and Machine Translation System for Digitization Inclusion
Offiong Bassey Edet | Mbuotidem Awak | Emmanuel Ubene Oyo-Ita | Benjamin Okon Nyong | Ita Etim Bassey
Low-resource languages serve as invaluable repositories of human history, preserving cultural and intellectual diversity. Despite their significance, they remain largely absent from modern natural language processing systems. While progress has been made for widely spoken African languages such as Swahili, Yoruba, and Amharic, smaller indigenous languages like Efik continue to be underrepresented in machine translation research. This study evaluates the effectiveness of state-of-the-art multilingual neural machine translation models for English–Efik translation, leveraging a small-scale, community-curated parallel corpus of N = 13,865 sentence pairs. We fine-tuned both the mT5 multilingual model and the NLLB-200 model on this dataset. NLLB-200 outperformed mT5, achieving BLEU scores of 26.64 for English–Efik and 31.21 for Efik–English, with corresponding chrF scores of 51.04 and 47.92, indicating improved fluency and semantic fidelity. Our findings demonstrate the feasibility of developing practical machine translation tools for low-resource languages and highlight the importance of inclusive data practices and culturally grounded evaluation in advancing equitable NLP.
Reasoning Beyond Labels: Measuring LLM Sentiment in Low-Resource, Culturally Nuanced Contexts
Millicent Ochieng | Anja Thieme | Ignatius Ezeani | Risa Ueno | Samuel Chege Maina | Keshet Ronen | Javier Gonzalez | Jacki O'Neill
Sentiment analysis in low-resource, culturally nuanced contexts challenges conventional NLP approaches that assume fixed labels and universal affective expressions. We present a diagnostic framework that treats sentiment as a context-dependent, culturally embedded construct, and evaluate how large language models (LLMs) reason about sentiment in informal, code-mixed WhatsApp messages from Nairobi youth health groups. Using human-annotated data, sentiment-flipped counterfactuals, and rubric-based explanation evaluation, we probe LLM interpretability, robustness, and alignment with human reasoning. Framing our evaluation through a social science measurement lens, we operationalize LLM outputs as an instrument for measuring the abstract concept of sentiment. Our findings reveal significant variation in model reasoning quality, with top-tier LLMs demonstrating greater interpretive stability, while smaller open-weight models in our study show reduced stability under ambiguity or sentiment shifts. This work highlights the need for culturally sensitive, reasoning-aware AI evaluation in complex, real-world communication.
ÒWE-Voice: An Evaluation of Monolingual and Multilingual ASR Model Using Yoruba Proverb Speech Dataset
Daud Abolade
Given the advancement of various Artificial Intelligence (AI) technologies in the 21st century, Automatic Speech Recognition (ASR) plays a vital role in human-machine interaction and serves as an interface for a wide range of applications. The development of these high-performing, robust and useful technologies continues to focus on high-resource languages, owing to the greater availability of language data, market profitability, and access to funding and research initiatives compared to marginalised low-resource languages. Despite efforts to develop ASR systems for African languages, numerous challenges remain due to limited speech datasets, tonal complexity and dialectal variation. In this study, we curated a domain-specific speech dataset for one genre of Yoruba oral literature, proverbs, which are highly culturally embedded. We used the Yoruba recording app developed for the Iroyin-speech project to record 6 hours of Yoruba proverb sentences. We evaluated two models on the recorded proverbs: NCAIR1/Yoruba-ASR, which was fine-tuned from OpenAI Whisper Small, and Massively Multilingual Speech, a multilingual speech model covering low-resource languages including Yoruba. Evaluation was based on Word Error Rate (WER) and Tone Error Rate (TER). Our results show that current ASR systems that support Yoruba do not capture cultural nuances. These findings highlight an urgent need to curate more robust, culturally embedded speech datasets for low-resource languages, and for Yoruba in particular, in order to build technological tools that preserve African culture, language and identity.
Language choice in multilingual societies is rarely arbitrary. In Nigeria, English, Nigerian Pidgin (NP) and indigenous languages are strategically deployed in online discourse, yet little is known about how they function in hostile contexts. Here we conduct the first systematic analysis of NP in online hate speech on two platforms, Twitter and Instagram. Using a linguistically enriched annotation scheme, we label each post for class, targeted group, language variety, and hate type. Our results show that NP is disproportionately used in offensive and hateful discourse, particularly against Hausa, women, and LGBTQ+ groups, and that insults are the dominant hate strategy. Cross-domain evaluation further reveals that classifiers trained on Twitter systematically over-predict hate on Instagram, highlighting challenges of domain transfer. These findings underscore NP’s role as a linguistic resource for hostility and its sociolinguistic salience in amplifying stereotypes and affect. For NLP, the work demonstrates the need for NP-specific resources, sensitivity to figurative strategies, and domain adaptation across platforms. By bridging sociolinguistics and computational modeling, this study contributes new evidence on how language choice shapes online hate speech in a multilingual African context.
The Token Tax: Systematic Bias in Multilingual Tokenization
Jessica M. Lundin | Ada Zhang | Nihal Karim | Hamza Louzan | Guohao Wei | David Ifeoluwa Adelani | Cody Carroll
Tokenization inefficiency is associated with structural disadvantages on morphologically complex, low-resource languages, inflating compute resources and reducing accuracy. We evaluate 10 Large Language Models (LLMs) on AfriMMLU (5 subjects; 16 African languages) and show that token fertility reliably predicts accuracy. Higher fertility consistently predicts lower accuracy across all models and subjects. We further find that reasoning models (e.g., DeepSeek, o1) consistently outperform non-reasoning peers across high- and low-resource languages in the AfriMMLU dataset, narrowing accuracy gaps observed in prior generations. In terms of economics, a doubling in tokens results in quadrupled training cost and time, underscoring the “token tax” faced by many languages. These results motivate morphologically aware tokenization, fair pricing, and multilingual benchmarks for equitable natural language processing (NLP).
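As a rough illustration of the quantities discussed above, the sketch below computes token fertility under its common definition (average number of tokens per whitespace-separated word) and the relative cost implied by a quadratic relationship between token count and cost. The abstract does not give its exact formulas, so both the definition and the exponent of 2 are assumptions; `toy_tokenize` is a hypothetical stand-in for a real subword tokenizer.

```python
def toy_tokenize(word, max_len=4):
    """Hypothetical tokenizer: split a word into fixed-width character chunks."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]

def fertility(text, tokenize=toy_tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    n_tokens = sum(len(tokenize(word)) for word in words)
    return n_tokens / len(words)

def relative_cost(token_ratio, exponent=2):
    """Cost multiplier if cost grows as token count raised to `exponent` (assumed)."""
    return token_ratio ** exponent

short_words = fertility("the cat sat")        # every word fits in one chunk
long_words = fertility("ninakupenda sana")    # the longer word splits into three chunks
cost_multiplier = relative_cost(long_words / short_words)
```

With this toy tokenizer the second string has twice the fertility of the first, so the assumed quadratic model gives a fourfold cost multiplier, which matches the "doubling leads to quadrupling" relationship stated in the abstract.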
EduNaija AI Tutor: A Multi-Agent Retrieval-Augmented Generation System for Nigerian Curriculum Education
Israel Olanrewaju Odeajo | Edifon Emmanuel Jimmy
Equitable access to quality education remains a critical challenge in Nigeria, where millions of students prepare annually for standardized examinations (WAEC, NECO, JAMB) with limited access to personalized tutoring (Badei et al., 2024). This research presents EduNaija AI Tutor, a multi-agent Retrieval-Augmented Generation (RAG) system designed to democratize educational support through AI-powered tutoring aligned with Nigerian curricula. The system integrates conversational AI with document-based question answering, automated assessment generation, and multilingual support for English, Yoruba, Hausa, and Igbo. Using LangChain for agent orchestration, OpenAI GPT models for natural language processing, and FAISS for vector retrieval, the system enables students to interact with educational content through natural language queries while maintaining cultural relevance through Nigerian-contextualized examples and conventions (Chukwuma et al., 2024). The multi-agent architecture comprises five specialized components: a main orchestrator, explanation agent, quiz generation agent, web search agent, and RAG agent for processing uploaded educational materials. Preliminary evaluation demonstrates the system’s capability to provide curriculum-aligned explanations, generate practice assessments, and answer questions from uploaded textbooks and study materials. This work contributes a culturally-aware educational AI framework addressing linguistic diversity and curriculum alignment challenges in African educational contexts, while leveraging open-source tools for reproducibility and accessibility (Shoukat et al., 2025).
Synthetic Data Generation Pipeline for Low-Resource Swahili Sentiment Analysis: Multi-LLM Judging with Human Validation
Samuel Gyamfi | Alfred Malengo Kondoro | Yankı Öztürk | Richard Hans Schreiber | Vadim Borisov
Despite serving over 100 million speakers as a vital African lingua franca, Swahili remains critically under-resourced for Natural Language Processing, hindering technological progress across East Africa. We present a scalable solution: a controllable synthetic data generation pipeline that produces culturally grounded Swahili text for sentiment analysis, validated through automated LLM judges. To ensure reliability, we conduct targeted human evaluation with a native Swahili speaker on a stratified sample, achieving 80.95% agreement between generated sentiment labels and human ground truth, with strong agreement on judge quality assessments. This demonstrates that LLM-based generation and quality assessment can transfer effectively to low-resource languages. We release the resulting Swahili sentiment dataset and the full reproducible generation pipeline publicly at https://huggingface.co/datasets/tabularisai/swahili-sentiment-dataset and https://github.com/tabularis-ai/Synthetic-Data-Generation-Pipeline-for-Low-Resource-Swahili-Sentiment-Analysis, providing working material for NLP researchers in low-resource contexts.
In this paper, we present some of our recent efforts to provide base NLP pipelines for African languages. These include an infrastructure called UDMorph to make UD-compatible training data available for resources that do not have dependency relations, and a Python package called flexiPipe to easily run an NLP pipeline in various NLP tools using a uniform front-end, including the models provided by UDMorph. flexiPipe also provides Unicode normalization, an often overlooked feature that has a significant impact on African NLP. flexiPipe currently provides an NLP pipeline for 33 African languages, a significant increase from the handful of models that are currently easily accessible. And UDMorph is designed to make it easy to provide training data for more languages.
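To illustrate why Unicode normalization matters for text in many African orthographies, here is a minimal sketch using only Python's standard `unicodedata` module. It is not flexiPipe's interface, which the abstract does not describe; the helper names are hypothetical. The example character is "o with dot below", which can be stored either as one precomposed code point or as a base letter plus a combining mark.

```python
import unicodedata

precomposed = "\u1ecd"   # single code point: LATIN SMALL LETTER O WITH DOT BELOW
decomposed = "o\u0323"   # "o" followed by COMBINING DOT BELOW

def normalize_text(text, form="NFC"):
    """Return text in the requested Unicode normalization form."""
    return unicodedata.normalize(form, text)

def same_after_normalization(a, b, form="NFC"):
    """Compare two strings after bringing both to the same normalization form."""
    return normalize_text(a, form) == normalize_text(b, form)

# The two spellings render identically but compare as different strings
# until both are normalized to the same form.
raw_equal = precomposed == decomposed
normalized_equal = same_after_normalization(precomposed, decomposed)
```

Without this step, the same word can be treated as two different vocabulary items by a tokenizer or lexicon lookup, depending on how the input was typed.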
Linguistically Informed Evaluation of Multilingual ASR for African Languages
Fei-Yueh Chen | Lateef Adeleke | C. M. Downey
Word Error Rate (WER) mischaracterizes ASR models’ performance for African languages by combining phonological, tone, and other linguistic errors into a single lexical error. By contrast, Feature Error Rate (FER) has recently attracted attention as a viable metric that reveals linguistically meaningful errors in models’ performance. In this paper, we evaluate three speech encoders on two African languages, complementing WER with CER and FER and adding a tone-aware extension (TER). We show that by computing errors on phonological features, FER and TER reveal linguistically-salient error patterns even when word-level accuracy remains low. Our results reveal that models perform better on segmental features, while tones (especially mid and downstep) remain the most challenging features. Results on Yoruba show a striking differential in metrics, with WER=0.788, CER=0.305, and FER=0.151. Similarly, for Uneme (an endangered language absent from pretraining data), a model with near-total WER and 0.461 CER achieves the relatively low FER of 0.267. This indicates that model error is often attributable to individual phonetic feature errors, which are obscured by all-or-nothing metrics like WER.
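The gap between word-level and finer-grained metrics can be seen in a small sketch of WER and CER, both computed as Levenshtein edit distance divided by reference length. This is the standard textbook formulation and not the authors' evaluation code; FER and TER are not shown because they require a phonological feature table that the abstract does not provide. The example strings are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute, insert, delete)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# One wrong character makes a whole word wrong.
reference = "bata mi dara"
hypothesis = "bata mi dera"
word_rate = wer(reference, hypothesis)   # 1 of 3 words differs
char_rate = cer(reference, hypothesis)   # 1 of 12 characters differs
```

A single character substitution costs a third of the words but only a twelfth of the characters here, which is the same kind of divergence the abstract reports between WER, CER, and FER.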
Evaluating Native-Speaker Preferences on Machine Translation and Post-Edits for Five African Languages
Hiba El Oirghi | Tajuddeen Gwadabe | Marine Carpuat
Wikipedia editors undertake the task of editing machine translation (MT) outputs in various languages to disseminate multilingual knowledge from English. But are editors doing more than just translating or fixing MT output? To answer this broad question, we constructed a dataset of 4,335 fine-grained annotated parallel pairs of MT translations and human post-edit (HE) translations for five low-resource African languages: Hausa, Igbo, Swahili, Yoruba, and Zulu. We report on our data selection and annotation methodologies as well as findings from the annotated dataset, the most surprising of which is that annotators mostly preferred the MT translations over their HE counterparts for three out of five languages. We analyze the nature of these "fluency breaking" edits and provide recommendations for the MT post-editing workflows in the Wikipedia domain and beyond.
Building a Conversational AI Assistant for African Travel Services with LLMs and RAG
Grace Kevine Ngoufo | Shamsuddeen Hassan Muhammad | Kevin Jeff Fogang Fokoa
Travel agencies in many African countries face increasing pressure to handle large volumes of customer inquiries with limited staff and with chatbots that are either non-existent or outdated and rule-based. To address this challenge, we develop a conversational virtual assistant powered by a Large Language Model (LLM) and enhanced with a Retrieval-Augmented Generation (RAG) pipeline. The system combines LLM reasoning, company-specific knowledge retrieval, and real-time API (Application Programming Interface) integration to deliver accurate, context-aware responses through WhatsApp, the region’s most widely used communication platform. A dedicated web interface enables staff to upload and update internal documents, ensuring that the assistant remains aligned with changing service information. Demonstrations show that the proposed solution improves response speed, enhances user experience, and reduces operational burden.
Morphologically-informed Somali Lemmatization Corpus built with a Web-based Crowdsourcing Platform
Abdifatah Ahmed Gedi | Shafie Abdi Mohamed | Yusuf A. Yusuf | Muhidin A. Mohamed | Fuad Mire Hassan | Houssein A Assowe
Lemmatization, which reduces words to their root forms, plays a key role in tasks such as information retrieval, text indexing, and machine-learning-based language models. However, a key research challenge for low-resourced languages such as Somali is the lack of human-annotated lemmatization datasets and reliable ground truth to underpin accurate morphological analysis and the training of relevant NLP models. To address this problem, we developed the first large-scale, purpose-built Somali lemmatization lexicon, coupled with a crowdsourcing platform for ongoing expansion. The system leverages Somali’s agglutinative and derivational morphology, encompassing over 5,584 root words and 78,629 derivative forms, each annotated with part-of-speech tags. For data validation, we devised a pilot lexicon-based lemmatizer integrated with rule-based logic to handle out-of-vocabulary terms. Evaluation on a 294-document corpus covering news articles, social media posts, and short messages shows lemmatization accuracies of 51.27% for full articles, 44.14% for excerpts, and 59.51% for short texts such as tweets. These results demonstrate that combining lexical resources, POS tagging, and rule-based strategies provides a robust and scalable framework for addressing morphological complexity in Somali and other low-resource languages.
Kunnafonidilaw ka Cadeau: an ASR dataset of present-day Bambara
Michael Leventhal | Yacouba Diarra | Nouhoum Coulibaly | Panga Azazia Kamaté
We present Kunkado, a 160-hour Bambara ASR dataset compiled from Malian radio archives to capture present-day spontaneous speech across a wide range of topics. It includes code-switching, disfluencies, background noise, and overlapping speakers that practical ASR systems encounter in real-world use. We fine-tune Parakeet-based models on a 33.47-hour human-reviewed subset and apply pragmatic transcript normalization to reduce variability in number formatting, tags, and code-switching annotations. Evaluated on two real-world test sets, fine-tuning with Kunkado reduces WER from 44.47% to 37.12% on one and from 36.07% to 32.33% on the other. In human evaluation, the resulting model also outperforms a comparable system with the same architecture trained on 98 hours of cleaner, less realistic speech. We release the data and models to support robust ASR for predominantly oral languages.
Full Fine-Tuning vs. Parameter-Efficient Adaptation for Low-Resource African ASR: A Controlled Study with Whisper-Small
Sukairaj Hafiz Imam | Muhammad Yahuza Bello | Hadiza Ali Umar | Tadesse Destaw Belay | Idris Abdulmumin | Seid Muhie Yimam | Shamsuddeen Hassan Muhammad
Automatic speech recognition (ASR) for African low-resource languages (LRLs) is often limited by scarce labelled data and the high cost of adapting large foundation models. This study evaluates whether parameter-efficient fine-tuning (PEFT) can serve as a practical alternative to full fine-tuning (FFT) for adapting Whisper-Small with limited labelled speech and constrained compute. We used a 10-hour subset of NaijaVoices covering Hausa, Yorùbá, and Igbo, and we compared FFT with several PEFT strategies under a fixed evaluation protocol. DoRA attains a 22.0% macro-average WER, closely aligning with the 22.1% achieved by FFT while updating only 4M parameters rather than 240M, and this difference remains within run-to-run variation across random seeds. Yorùbá consistently yields the lowest word error rates, whereas Igbo remains the most challenging, indicating that PEFT can deliver near FFT accuracy with substantially lower training and storage requirements for low-resource African ASR.
Real-Time Spoken Instruction Following and Translation in Ugandan Languages
Benjamin Akera | Tim Wenjie Hu | Patrick Walukagga | Evelyn Nafula Ouma | Yiga Gilbert | Ernest Tonny Mwebaze | John Quinn
Many languages are predominantly spoken rather than written, and to bring the benefits of LLMs to speakers of these languages, it is essential that models cater to the voice modality. The typical approach is to cascade ASR, LLM and TTS models together, though this results in systems with high latency, making them unsuitable for natural, real-time interaction. We describe results on taking the encoder part of a Whisper-based model trained to recognise ten languages common in Uganda, and using the Ultravox architecture to project its output directly to the input embedding space of a text model based on Qwen 3 32B, also trained to have comprehension of those languages. The result is a speech LLM with high accuracy and very low latency. For most spoken prompts, we can begin streaming a text response in as little as 50 ms, and a speech audio response within around one second, making real-time spoken interaction with an LLM possible for the first time in these languages. The model is available open source on Hugging Face.
SALT-31: A Machine Translation Benchmark Dataset for 31 Ugandan Languages
Solomon Nsumba | Benjamin Akera | Evelyn Nafula Ouma | Medadi E. Ssentanda | Deo Kawalya | Engineer Bainomugisha | Ernest Tonny Mwebaze | John Quinn
We present the SALT-31 benchmark dataset for evaluation of machine translation models covering 31 Ugandan languages. Unlike sentence-level evaluation sets, SALT-31 is constructed from short, scenario-driven mini-dialogues designed to preserve discourse context, pragmatics, and culturally grounded communication patterns common in everyday Ugandan settings. The dataset contains 100 English sentences organized into 20 typical communication scenarios, each represented as a five-sentence mini-sequence. It can therefore be used to evaluate both sentence-level and paragraph-level machine translation, and includes nearly every language spoken in a country with high linguistic diversity. It is available at https://huggingface.co/datasets/Sunbird/salt-31
Sample-Size Scaling of the African Languages NLI Evaluation
Anuj Tiwari | Oluwapelumi Ogunremu | Terry Oko-odion | Jesujuwon Egbewale | Hannah Sopuruchi Nwokocha
African languages have very little labelled data, and it is unclear whether increasing the quantity of annotated data reliably improves downstream performance. This study is a systematic sample-size scaling study of natural language inference (NLI) on 16 African languages based on the AfriXNLI benchmark. Under controlled conditions, two multilingual transformer models with roughly 0.6B parameters, XLM-R Large fine-tuned on XNLI and AfroXLM-R Large, are tested on sample sizes between 50 and 500 labeled examples, with results averaged across random subsampling runs. Contrary to the common expectation of monotonic improvement with more data, we find strongly language-sensitive and often non-monotonic scaling behavior. Some languages show early saturation or a decrease in performance as sample size grows, as well as high variance in low-resource regimes. These results indicate that data volume alone is not enough to guarantee stable gains for African NLI, creating the need for language-sensitive dataset creation and stronger multilingual modelling strategies.
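The subsampling protocol described above can be sketched as a loop over sample sizes with several random draws per size, reporting the mean and spread of the scores. This is only an assumed outline: the number of runs, the seed handling, and the `evaluate` callback (which would wrap whatever fine-tuning or scoring is done on each subsample) are hypothetical and not taken from the paper.

```python
import random
from statistics import mean, stdev

def scaling_curve(pool, sizes, evaluate, n_runs=5, seed=0):
    """Map each sample size to (mean score, standard deviation) over random subsamples.

    pool     -- list of labeled examples to draw from
    sizes    -- sample sizes to test, each no larger than len(pool)
    evaluate -- hypothetical callback: takes a subsample, returns a score
    n_runs   -- assumed number of random draws per size (must be at least 2)
    """
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    curve = {}
    for n in sizes:
        scores = [evaluate(rng.sample(pool, n)) for _ in range(n_runs)]
        curve[n] = (mean(scores), stdev(scores))
    return curve

# Toy stand-in: examples are 0/1 labels and the "score" is the share of 1s.
toy_pool = [1] * 600
toy_curve = scaling_curve(toy_pool, [50, 100, 500], lambda s: sum(s) / len(s))
```

Reporting the standard deviation alongside the mean makes the high-variance, non-monotonic behavior described in the abstract visible rather than averaging it away.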
Evaluating Yoruba Text-to-Speech Systems for Accessible Computer-Based Testing in Visually Impaired Learners
Kausar Yetunde Moshood | Victor Tolulope Olufemi | Oreoluwa Boluwatife Babatunde | Emmanuel Bolarinwa | Williams Oluwademilade
Text-to-Speech (TTS) technology offers potential to improve exam accessibility for visually impaired learners, but existing systems often underperform in underrepresented languages like Yoruba. This study evaluates current Yoruba TTS models in delivering standardized exam content to five visually impaired students through a web-based interface. Before testing, four Yoruba TTS systems were compared; only Facebook’s mms-tts-yor and YarnGPT produced intelligible Yoruba speech. Students experienced exam questions delivered by human voice, Braille, and TTS. All preferred Braille for clarity and independence, some valued human narration, while TTS was least favored due to robotic and unclear output. These results reveal a significant gap between TTS capabilities and the needs of users in low-resource languages. The paper highlights the urgency of developing tone-aware, user-centered TTS solutions to ensure equitable access to digital education for visually impaired speakers of underrepresented languages.
Power Asymmetries, Bias, and AI, a Reflection of Society on Low-Resourced Languages - African Languages as Case Study
Simbiat Ajao
In recent times, artificial intelligence (AI) systems have become the primary intermediary for access to information, services, and opportunities. There are growing concerns about how existing social inequalities are reproduced and amplified through AI. This is especially evident in language technologies, where a small number of dominant languages (which we refer to as big languages) and cultural contexts shape the training, design, and evaluation of models. This paper examines the intersections of power asymmetries, linguistic bias, and cultural representation in AI, with a major focus on African languages and communities. We argue that current Natural Language Processing (NLP) systems reflect deep global imbalances in the availability of data, infrastructure, and decision-making power, often marginalizing low-resourced languages and cultural peculiarities. How these data are structured largely determines the outcomes they produce. With reference to examples from speech recognition, machine translation, and large language models, we highlight the social and cultural consequences of linguistic exclusion, including reduced accessibility, misinterpretation, and digital invisibility. Finally, we identify and discuss pathways toward more equitable language technologies, emphasizing community-led data practices, interdisciplinary collaboration, and context-aware evaluation frameworks. By foregrounding language as both a technical and political concern, this work advocates for African-centered approaches to NLP that promote fairness, accountability, and linguistic justice in AI development.
Sudanese-Flores: Extending FLORES+ to Sudanese Arabic Dialect
Hadia Mohmmedosman Ahmed Samil | David Ifeoluwa Adelani
In this work, we introduce Sudanese-Flores, an extension of the popular Flores+ machine translation (MT) benchmark to the Sudanese Arabic dialect. We translate both the DEV and DEVTEST splits of the Modern Standard Arabic dataset into the corresponding Sudanese dialect, resulting in a total of 2,009 sentences. While the dialect was recently introduced in Google Translate, there is no available benchmark for this dialect despite it being spoken by over 40 million people. Our evaluation of two leading LLMs, GPT-4.1 and Gemini 2.5 Flash, showed that while their English-to-Arabic performance is impressive (more than 23 BLEU), they struggle on the Sudanese dialect (less than 11 BLEU) in zero-shot settings. In the few-shot scenario, we achieved only a slight boost in performance.
Where Are We at with Automatic Speech Recognition for the Bambara Language?
Seydou Diallo | Yacouba Diarra | Panga Azazia Kamaté | Aboubacar Ouattara | Mamadou K. Keita | Adam Bouno Kampo
This paper introduces the first standardized benchmark for evaluating Automatic Speech Recognition (ASR) in the Bambara language, utilizing one hour of professionally recorded Malian constitutional text. Designed as a controlled reference set under near-optimal acoustic and linguistic conditions, the benchmark was used to evaluate 37 models, ranging from Bambara-trained systems to large-scale commercial models. Our findings reveal that current ASR performance remains significantly below deployment standards; the top-performing system in terms of Word Error Rate (WER) achieved 46.76% and the best Character Error Rate (CER) of 13.00% was set by another model, while several prominent multilingual models exceeded 100% WER due to severe hallucinations. These results suggest that multilingual pre-training and model scaling alone are insufficient for underrepresented languages. Furthermore, because this dataset represents a best-case scenario of the most simplified and formal form of spoken Bambara, these figures likely establish an upper bound for performance in practical, real-world settings. We provide the benchmark and an accompanying public leaderboard to facilitate transparent evaluation and future research in Bambara speech technology.
Enhancing Automatic Speech Recognition Models for Maternal and Reproductive Health: Fine-Tuning and Real-World Evaluation in Wolof
Ertony Basilwango | Yann Le Beux | Oche David Ankeli | Pierre Herve Berdys
Automatic Speech Recognition (ASR) systems perform well for high-resource languages, but most African languages, including Wolof, remain underrepresented, particularly in maternal and reproductive healthcare. This work proposes a domain-specific approach to improving Wolof ASR under low-resource conditions, addressing limited annotated data, orthographic variability, and code-switching. We curated a dataset of 750 validated Wolof utterances covering 250 maternal health keywords and applied data augmentation to increase acoustic diversity. Pretrained models, including wav2vec 2.0 and Whisper, were benchmarked to select candidates for fine-tuning. Using parameter-efficient Low-Rank Adaptation (LoRA), a Whisper model was adapted to the maternal health domain. Evaluation using Word Error Rate (WER), Character Error Rate (CER), and Keyword Error Rate (KER), which measures medically critical term transcription accuracy, shows substantial gains, reducing WER from 46.5% to 23.2% and KER from 17% to 11%. Community-based evaluation on 1,340 real-world utterances reveals a moderate degradation, with WER increasing by 35%. These results demonstrate that lightweight domain adaptation with small, high-quality data can significantly improve ASR for low-resource healthcare applications. This work introduces one of the first Wolof ASR datasets for healthcare and presents a practical framework for developing reliable speech recognition tools in underrepresented languages, improving access to healthcare information and services.
Eyaa-Tom 26, Yodi-Mantissa and Lom Bench: A Community Benchmark for TTS in Local Languages
Bakoubolo Essowe Justin | Catherine Nana Nyaah Essuman | Messan Agbobli | Ahoefa Kansiwer | Eli Jean Doumeyan | Julie Pato | Notou Your Timibe | Emile KOGBEDJI Agossou | Guedela Bakouya
We present an extension of our previous work on multilingual NLP for Togolese languages by introducing new datasets, improved models, and a community-driven evaluation benchmark for Text-To-Speech (TTS). We expand the Eyaa-Tom multilingual corpus with additional speech data of about 26.9k recordings (30.9 hours) across 10 local languages, and incorporate 64.6k clips (46.6 hours) of Mozilla Common Voice contributions for Adja, Nawdm, Mina, and Tem to strengthen Automatic Speech Recognition (ASR) and speech synthesis. We detail how community contributors – including collaboration with a national TV journalist – helped collect and validate the Kabyè and French text, with an ethical compensation model in place. We fine-tune state-of-the-art models: OpenAI Whisper and faster-whisper, and Meta’s NLLB-200 model for machine translation across 11 languages (achieving 19.4 BLEU score for French→Ewe and 26.1 BLEU score for Kabyè→French). We also introduce the Lom Bench, a community-based benchmark where native speakers rate TTS output, indicating promising preliminary results in Mina and Togolese lingua franca French, although further data is needed. We provide a comparative analysis of our results with recent multilingual systems, including Simba, Meta’s Omnilingual ASR, and UBC Toucan. Our work emphasizes practical pathways and how FAIR data sourcing and community participation can drive sustainable NLP development for underserved languages.
Using Subword-Embeddings for Bilingual Lexicon Induction in Bantu Languages
Adrian Breiding | Alan Akbik
Bilingual Lexicon Induction (BLI) is a valuable tool in machine translation and cross-lingual transfer learning, but it remains challenging for agglutinative and low-resource languages. In this work, we investigate the use of weighted sub-word embeddings in BLI for agglutinative languages. We further evaluate a graph-matching and Procrustes-based BLI approach on two Bantu languages, assessing its effectiveness in a previously underexplored language family. Our results for Swahili, with an average P@1 score of 51.84% for a 3,000-word dictionary, demonstrate the success of the approach for Bantu languages. Weighted sub-word embeddings perform competitively on Swahili and outperform word embeddings in our experiments with Zulu.
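To make the P@1 metric above concrete, here is a minimal sketch of precision-at-1 for BLI using cosine nearest-neighbour retrieval. It assumes the source and target embeddings have already been mapped into a shared space (for example by a Procrustes step, which is not shown); the two-dimensional vectors, the function names, and the tiny Swahili-English dictionary are illustrative assumptions, not the authors' data or code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def precision_at_1(src_vecs, tgt_vecs, gold):
    """Fraction of source words whose nearest target word is a gold translation.

    gold maps each source word to a set of acceptable target words.
    """
    hits = 0
    for src_word, references in gold.items():
        nearest = max(tgt_vecs, key=lambda t: cosine(src_vecs[src_word], tgt_vecs[t]))
        hits += nearest in references
    return hits / len(gold)


# Hypothetical toy embeddings, already aligned into one space.
src = {"maji": (0.9, 0.1), "moto": (0.1, 0.9), "mti": (0.9, 0.2)}
tgt = {"water": (1.0, 0.0), "fire": (0.0, 1.0), "tree": (0.6, 0.8)}
gold = {"maji": {"water"}, "moto": {"fire"}, "mti": {"tree"}}

score = precision_at_1(src, tgt, gold)  # "mti" retrieves "water", so 2 of 3 are correct
```

Plain nearest-neighbour retrieval is only one option; retrieval criteria that correct for hub words are often used in BLI and would slot in where `cosine` is called.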
AfriNLLB: Efficient Translation Models for African Languages
Yasmin Moslem | Aman Kassahun Wassie | Amanuel Gizachew Abebe
In this work, we present AfriNLLB, a series of lightweight models for efficient translation from and into African languages. AfriNLLB supports 15 language pairs (30 translation directions), including Swahili, Hausa, Yoruba, Amharic, Somali, Zulu, Lingala, Afrikaans, Wolof, and Egyptian Arabic, as well as other African Union official languages such as Arabic (MSA), French, Portuguese, and Spanish. Our training data covers bidirectional translation between English and 13 languages, and between French and two languages (Lingala and Wolof). AfriNLLB models are based on NLLB-200 600M, which we compress using iterative layer pruning and quantization. We fine-tune the pruned models on parallel corpora we curated for African languages, employing knowledge distillation from a larger teacher model. Our work aims at enabling efficient deployment of translation models for African languages in resource-constrained settings. Our evaluation results demonstrate that AfriNLLB models achieve performance comparable to the baseline while being significantly faster. We release two versions of the AfriNLLB models, a Transformers version that allows further fine-tuning and a CTranslate2 version for efficient inference. Moreover, we release all the training data that we used for fine-tuning the baseline and pruned models to facilitate further research.
Proceedings of the 4th Workshop on Advances in Language and Vision Research (ALVR)
Qianqi Yan | Syrielle Montariol | Yue Fan | Jing Gu | Jiayi Pan | Manling Li | Parisa Kordjamshidi | Alane Suhr | Xin Eric Wang
Thinking in Pictures: A Diagnostic Study of Visual vs. Textual Chain-of-Thought Reasoning in Vision-Language Models
Ben Jenkins
Chain-of-thought (CoT) reasoning has become a standard technique for eliciting complex reasoning in large language models, and recent work has extended it to vision-language models (VLMs). However, virtually all multimodal CoT methods generate intermediate reasoning steps in natural language, even for inherently visual problems such as spatial reasoning, geometric manipulation, and object tracking. We ask a fundamental question: when should a VLM reason in words, and when should it reason in pictures? We present VisCoT-Diag, a diagnostic benchmark of 1,200 instances across five visual reasoning categories, and compare four CoT paradigms across four VLMs. Our results reveal a striking modality gap: textual CoT degrades performance by up to 17.5% on spatial transformation and 13.2% on multi-object tracking, while visual CoT yields gains of up to 23.1%. We identify three failure modes (spatial state collapse, transformation hallucination, tracking loss) and show that adaptive modality routing achieves 73.1% accuracy versus 68.9% for V-CoT-everywhere. We recommend practitioners use visual CoT for spatial tasks and textual CoT for compositional counting.
The rapid evolution of text-to-image generation has blurred the perceptual boundary between natural and synthetic imagery. However, it remains questionable whether the statistical structure of generated visual content mirrors the information density of the physical visual world. Drawing upon principles from statistical linguistics, this study investigates the visual language of generative models through the lens of Zipfian dynamics. By analyzing a large-scale corpus of real and synthetic images, we uncover a fundamental divergence between visual syntax and semantics. We find that while generative models have successfully replicated the low-level physics of light, their high-level texture vocabulary exhibits distinct statistical signatures. Our analysis reveals a spectrum of entropy, identifying architectural fingerprints unique to each model. Furthermore, we investigate the relationship between generated images and prompt complexity, and find that increasing the semantic specificity of text prompts systematically degrades the statistical realism of the generated output.
Semantically Aware Optimal Transport for Dense Label Transfer
Preeti | Kiran Ravish | Ankita Kushwaha | Pawan Kumar
Vision foundation models produce features that generalize across visual domains without fine-tuning, yet naively transferring labels through these feature spaces fails under large distribution shifts. We propose SAOT (**S**emantically **A**ware **O**ptimal **T**ransport), which learns a transport cost within a fused unbalanced optimal transport formulation for dense label transfer from frozen vision transformer features to new domains. SAOT combines a learnable appearance metric with semantic class-prototype priors, unbalanced transport for partial matching under distribution shift, and a block-sparse solver for tractable inference. We pair this with a two-stage decoder: an MLP trained on SAOT pseudo-labels, then refined via EMA-teacher self-training with class-balanced sampling. On GTA5→Cityscapes with frozen DINOv2 ViT-L/14 features, SAOT+Decoder reaches 25.7% mIoU, a **3.8×** improvement over nearest-neighbor transfer (6.7%), without any backbone adaptation. Per-class results show large gains on spatially coherent classes (road 90.3%, car 76.2%, building 71.5%), demonstrating that learned semantic transport costs capture domain-invariant structure even under severe synthetic-to-real shifts. On VOC train→val with frozen ViT-B/16 features, the full pipeline reaches 47.5% mIoU, indicating that the approach extends beyond synthetic-to-real adaptation.
CoSMoEs: Compact Sparse Mixture of Experts
Patrick Huber | Akshat Shrivastava | Ernie Chang | Chinnadhurai Sankar | Ahmed A Aly | Adithya Sagar
Sparse Mixture of Experts (MoE) models are widely used foundation architectures at large scale, yet remain under-explored at smaller sizes. In this work, we introduce Compact Sparse Mixture of Experts (CoSMoEs) for on-device inference, addressing three key challenges: Quality, Memory, and Latency. On the quality front, we conduct a fair evaluation (removing confounding factors) and show that MoE architectures outperform dense models at on-device scale. We further propose weight-decomposed experts, which improve MoE performance beyond the standard formulation. On the memory and latency front, we address the prohibitively large parameter count of MoE models by improving expert offloading efficiency through a novel training-time loss, reducing inference latency for on-device deployment.
GraphicWeaver: Benchmarking Agentic Planning for Graphic Design Generation
Dayeon Ki | Tianyi Zhou | Marine Carpuat | Gang Wu | Puneet Mathur | Viswanathan Swaminathan
Vision-language model (VLM)-powered agents are increasingly enabling new forms of automation across various human tasks. While prior work has primarily focused on well-defined problems with explicit goals, the capabilities of agents in creative graphic design, where goals are inherently open-ended and subjective, remain largely underexplored. To bridge this gap, we introduce GraphicWeaver, a planning benchmark for graphic design comprising 1,079 diverse user queries and associated images spanning four design categories. Comprehensive experiments with six models reveal that current VLM-based agents struggle to handle such complex planning tasks, which require taking into account both explicit design constraints specified in queries and implicit commonsense design principles. We attribute these failures to challenges in (1) retrieving appropriate parameters for tool usage, (2) understanding spatial relationships across design components, and (3) coordinating dependencies across agents. We envision GraphicWeaver as a challenging yet valuable testbed for advancing VLM agent planning in creative design contexts.
Scaling Vision–Language Models for Pharmaceutical Long-Form Video Reasoning on Industrial GenAI Platform
Suyash Mishra | Qiang Li | Srikanth Patil | Satyanarayan Pati | Baddu Narendra
Vision Language Models (VLMs) have shown strong performance on multimodal reasoning tasks, yet most evaluations focus on short videos and assume unconstrained computational resources. In industrial settings such as pharmaceutical content understanding, practitioners must process long-form videos under strict GPU, latency, and cost constraints, where many existing approaches fail to scale. In this work, we present an industrial GenAI framework that processes over 200,000 PDFs, 25,326 videos across eight formats (e.g., MP4 and M4V), and 888 multilingual audio files in more than 20 languages. Our study makes three contributions: (i) a large-scale industrial architecture for multimodal reasoning in pharmaceutical domains; (ii) an empirical analysis of over 40 VLMs on two leading benchmarks (Video-MME and MMBench) and a proprietary dataset of 25,326 videos across 14 disease areas; and (iii) four findings relevant to long-form video reasoning: the role of multimodality, attention mechanism trade-offs, temporal reasoning limits, and challenges of video splitting under GPU constraints. Results show 3–8× efficiency gains with SDPA attention on commodity GPUs, multimodality improving up to 8 of 12 task domains (especially length-dependent tasks), and clear bottlenecks in temporal alignment and keyframe detection across open- and closed-source VLMs. Rather than proposing a new "A+B" model, this paper characterizes practical limits, trade-offs, and failure patterns of current VLMs under realistic deployment constraints, and provides actionable guidance for both researchers and practitioners designing scalable multimodal systems for long-form video understanding in industrial domains.
PGGA: A Plan-Grounded GUI Agent for Automated Device Support
Lei Hsiung | Zhiyu Chen | Seonhoon Kim | Qun Liu
Current GUI agents struggle with multi-step digital device support. We investigate whether this failure is partly caused by a procedural knowledge deficit: agents often rely on zero-shot visual exploration instead of executing verified instructions. To address this, we introduce the Plan-Grounded GUI Agent (PGGA), framing interface navigation as a knowledge-execution problem by conditioning low-level actions on step-by-step text plans. Evaluated on our focused Device-Support Interaction Benchmark (DSIB), results reveal a sharp gap between knowing which operation to perform and grounding that operation on the screen: GTA1-7B reaches 99.59% Operation Accuracy with expert plans, but only 82.99% Element Accuracy and 45.61% Task Success Rate; without plans, its Task Success Rate is 0.00%. Our fine-tuned 2B-parameter PGGA achieves 54.39% Task Success Rate and 91.28% Element Accuracy when guided by expert plans, suggesting that explicit procedural grounding can substantially improve GUI execution when high-quality plans are available. Project Page: https://hsiung.cc/PGGA/
CAFES: A Collaborative Multi-Agent Framework for Multi-Granular Multimodal Essay Scoring
Jiamin Su | Yibo Yan | Zhuoran Gao | Han Zhang | Xiang Liu | Huiyu Zhou | Xuming Hu
Automated Essay Scoring (AES) is crucial for modern education, particularly with the increasing prevalence of multimodal assessments. However, traditional AES methods struggle with evaluation generalizability and multimodal perception, while even recent Multimodal Large Language Model (MLLM)-based approaches can produce hallucinated justifications and scores misaligned with human judgment. To address these limitations, we introduce CAFES, the first collaborative multi-agent framework specifically designed for AES. It orchestrates three specialized agents: an Initial Scorer for rapid, trait-specific evaluations; a Feedback Pool Manager to aggregate detailed and evidence-grounded feedback; and a Reflective Scorer that iteratively refines scores based on this feedback to enhance human alignment. Extensive experiments, using widely adopted MLLMs, achieve an average relative improvement of 21% in Quadratic Weighted Kappa (QWK) against ground truth, with particularly strong gains in grammatical and lexical diversity. Our proposed CAFES paves the way for an intelligent multimodal AES system. The code and dataset are available at https://anonymous.4open.science/r/CAFES-C87F/.
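For readers unfamiliar with the QWK metric reported above, the following is a minimal sketch of the standard Quadratic Weighted Kappa computation between two sets of integer scores. It is not taken from the CAFES code; the function name, the score range arguments, and the toy scores are illustrative assumptions.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Agreement between two integer score lists, penalising large disagreements more."""
    n = max_score - min_score + 1
    if n < 2:
        raise ValueError("the score range must contain at least two values")
    total = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1

    # Marginal histograms give the agreement expected by chance.
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(n)) for j in range(n)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2  # quadratic disagreement weight
            expected = hist_a[i] * hist_b[j] / total
            numerator += weight * observed[i][j]
            denominator += weight * expected

    # A zero denominator means both raters used one identical score throughout.
    return 1.0 - numerator / denominator if denominator else 1.0


perfect = quadratic_weighted_kappa([1, 2, 3, 4], [1, 2, 3, 4], 1, 4)
reversed_scores = quadratic_weighted_kappa([1, 2, 3], [3, 2, 1], 1, 3)
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.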
GM-PRM: A Generative Multimodal Process Reward Model for Multimodal Mathematical Reasoning
Jianghangfan Zhang | Yibo Yan | Kening Zheng | Xin Zou | Song Dai | Xuming Hu
Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities but often struggle with complex, multi-step mathematical reasoning, where minor errors in visual perception or logical deduction can lead to complete failure. While Process Reward Models (PRMs) offer step-by-step supervision, existing multimodal PRMs are limited to being binary verifiers that can identify but not correct errors, offering little explanatory power. To address these deficiencies, we introduce the **Generative Multimodal Process Reward Model (GM-PRM), a novel paradigm that transforms the PRM from a passive judge into an active reasoning collaborator**. Instead of a simple scalar score, GM-PRM provides a fine-grained, interpretable analysis of each reasoning step, evaluating its step intent, visual alignment, and logical soundness. More critically, GM-PRM is trained to generate a corrected version of the first erroneous step it identifies. This unique corrective capability enables our new test-time inference strategy, Refined Best-of-N (Refined-BoN). This framework actively enhances solution quality by using the PRM’s generated correction to guide the policy model toward a more promising reasoning trajectory, thereby improving the diversity and correctness of the solution pool. We demonstrate that GM-PRM achieves state-of-the-art results on multiple multimodal math benchmarks, significantly boosting policy model performance with remarkable data efficiency, requiring only a 20K-sample training dataset.
Look Where You’re Told: Instruction-Consistent Attention for GUI Grounding
Seonhoon Kim | Zhiyu Chen | Xin Li | Qun Liu
Visual grounding in graphical user interfaces (GUIs) requires accurate localization of UI elements from natural language instructions. Conventional coordinate generation approaches face inherent limitations, including sensitivity to resolution variations and lack of interpretability. Recently, coordinate-free attention-based methods have emerged as a promising alternative, but these methods supervise attention using only spatial location signals from ground-truth bounding boxes, without ensuring that the learned attention distributions reflect genuine semantic correspondence between the instruction and the attended visual regions. We propose Attention Cycle-Consistency (ACC), a self-supervised regularization framework that enforces bidirectional alignment between visual attention and instruction semantics. ACC introduces two complementary constraints: semantic consistency, which ensures attended visual regions contain sufficient information to reconstruct the original instruction, and spatial consistency, which requires attention distributions to remain invariant when cycled through instruction reconstruction. We further incorporate entropy regularization to encourage spatially concentrated attention. ACC is applicable as a lightweight, model-agnostic regularizer for attention-based coordinate-free grounding methods, adding zero computational overhead at inference as all auxiliary components are discarded after training.
From Pixels to BFS: High Maze Accuracy Does Not Imply Visual Planning
Alberto Gonzalo Rodriguez Salgado
How do multimodal models solve visual spatial tasks—through genuine planning, or by brute-forcing solutions in token space? We introduce MazeBench, a benchmark of 110 procedurally generated maze images organized into nine controlled groups (diagnostic, grid scale, wall density, trap ablation, unreachable detection, and more), and evaluate 16 model configurations across four providers (OpenAI, Anthropic, Google, Alibaba) at multiple reasoning effort levels. GPT-5.4 solves 91% and Gemini 3.1 Pro 79%, but our analysis reveals these scores are misleading: models translate images into text grids and brute-force paths via serial enumeration, consuming 1,710–22,818 tokens per solve for a task humans do in seconds. Without added reasoning budgets, all configurations score only 2–12%; on 20x20 ultra-hard mazes, they hit token limits and give up. Qualitative analysis of model outputs confirms a universal two-stage strategy: image-to-grid translation followed by step-by-step path search in natural language—essentially BFS implemented in prose. A text-grid ablation shows Claude’s poor image performance (6%) jumps to 80% when given the correct grid directly, confirming vision quality, not reasoning ability, as the bottleneck for weaker models. Perhaps most striking, when we explicitly instruct models not to build a text grid and not to perform graph search—asking them to "reason visually, like a human"—they silently ignore the instruction and immediately fall back to the same grid-enumeration strategy. This suggests that brute-force token-level search is the dominant mechanism these models rely on for spatial planning in our setting.
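For reference, the grid search that the abstract says models reproduce in prose can be written in a few lines. The sketch below is a textbook breadth-first search over a text grid, not MazeBench code; the grid encoding (`#` for walls, any other character for open cells) and the function name are illustrative assumptions.

```python
from collections import deque


def bfs_shortest_path(grid, start, goal):
    """Return the shortest list of (row, col) cells from start to goal, or None."""
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal is unreachable


# Hypothetical 3x3 maze: start at the top-left, goal at the bottom-right.
MAZE = [
    "S.#",
    ".##",
    "..G",
]
path = bfs_shortest_path(MAZE, (0, 0), (2, 2))
```

Each cell is enqueued at most once, so the search is linear in the number of cells, in contrast to the thousands of tokens per solve reported for the prose version.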
When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
Philip Wootaek Shin | Ajay Narayanan Sridhar | Lakshmi Sivani Devarapalli | Rui Zhang | Jack Sampson | Vijaykrishnan Narayanan
Vision–language models (VLMs) achieve strong multimodal performance but remain prone to relation hallucination, that is, errors on tasks that require accurate reasoning over inter-object interactions. We study the impact of visual perturbations, specifically rotation and noise, and show that even mild distortions significantly degrade relational reasoning across models and datasets. We further evaluate prompt-based augmentation and preprocessing strategies (orientation correction and denoising), finding that while they offer partial improvements, they do not fully resolve hallucinations. Our results reveal a gap between perceptual robustness and relational understanding, highlighting the need for more robust, geometry-aware VLMs.
VLCE: A Knowledge-Enhanced Framework for Image Description in Disaster Assessment
Md. Mahfuzur Rahman | Marufa Kamal | Fahad Rahman | Sunzida Siddique | Ahmed Rafi Hasan | Mohd Ariful Haque | Kishor Datta Gupta | Roy George
General-purpose vision-language models (VLMs) such as LLaVA and QwenVL produce descriptions of disaster imagery that lack domain-specific vocabulary and actionable detail. We propose the Vision-Language Caption Enhancer (VLCE), a framework that integrates external semantic knowledge from ConceptNet and WordNet into the caption generation process for post-disaster satellite and UAV imagery. VLCE operates in two stages: first, a baseline VLM generates an initial caption conditioned on YOLOv8 object detections; second, a knowledge-enriched sequential model, a CNN-LSTM or a hierarchical cross-modal Transformer, refines the caption using a vocabulary augmented with 1,566 domain-relevant terms extracted from knowledge graphs. We evaluate on two disaster benchmarks: xBD (satellite, 6,369 images, 3 damage classes) and RescueNet (UAV, 4,494 images, 12 damage classes), using CLIPScore for semantic alignment and InfoMetIC for informativeness. On RescueNet with the Transformer decoder, VLCE with knowledge graph enrichment produces captions preferred over QwenVL baselines in 95.33% of image pairs on InfoMetIC and 73.64% on CLIPScore. Qualitative analysis shows that without knowledge graph integration, generated captions exhibit hallucinations, word repetition, and semantic incoherence, whereas knowledge-enriched captions maintain factual consistency and domain-appropriate vocabulary. VLCE is intended as a continuous, extensible monitor of differential framing under changing real-world inputs.
Beyond Visual Similarity: Rule-Guided Multimodal Clustering with explicit domain rules
Kishor Datta Gupta | Mohd Ariful Haque | Marufa Kamal | Ahmed Rafi Hasan | Md. Mahfuzur Rahman | Roy George
Traditional clustering techniques often rely solely on similarity in the input data, limiting their ability to capture structural or semantic constraints that are critical in many domains. We introduce the Domain-Aware Rule-Triggered Variational Autoencoder (DART-VAE), a rule-guided multimodal clustering framework that incorporates domain-specific constraints directly into the representation learning process. DART-VAE extends the VAE architecture by embedding explicit rules, semantic representations, and data-driven features into a unified latent space, while enforcing constraint compliance through rule-consistency and violation penalties in the loss function. Unlike conventional clustering methods that rely only on visual similarity or apply rules as post-hoc filters, DART-VAE treats rules as first-class learning signals. The rules are generated by LLMs, structured into knowledge graphs, and enforced through a loss function combining reconstruction, KL divergence, consistency, and violation penalties. Experiments on aircraft and automotive datasets demonstrate that rule-guided clustering produces more operationally meaningful and interpretable clusters—for example, isolating UAVs, unifying stealth aircraft, or separating SUVs from sedans—while improving traditional clustering metrics. However, the framework faces challenges: LLM-generated rules may hallucinate or conflict, excessive rules risk overfitting, and scaling to complex domains increases computational and consistency difficulties. By combining rule encodings with learned representations, DART-VAE achieves more meaningful and consistent clustering outcomes than purely data-driven models, highlighting the utility of constraint-guided multimodal clustering for complex, knowledge-intensive settings.
Charts are central to analytical reasoning, yet existing benchmarks for chart understanding focus almost exclusively on single-chart interpretation rather than comparative reasoning across multiple charts. To address this gap, we introduce ChartDiff, the first large-scale benchmark for cross-chart comparative summarization. ChartDiff consists of 8,541 chart pairs spanning diverse data sources, chart types, and visual styles, each annotated with LLM-generated and human-verified summaries describing differences in trends, fluctuations, and anomalies. Using ChartDiff, we evaluate general-purpose, chart-specialized, and pipeline-based models. Our results show that frontier general-purpose models achieve the highest GPT-based quality, while specialized and pipeline-based methods obtain higher ROUGE scores but lower human-aligned evaluation, revealing a clear mismatch between lexical overlap and actual summary quality. We further find that multi-series charts remain challenging across model families, whereas strong end-to-end models are relatively robust to differences in plotting libraries. Overall, our findings demonstrate that comparative chart reasoning remains a significant challenge for current vision-language models and position ChartDiff as a new benchmark for advancing research on multi-chart understanding.
Formal Machine Interpretation for the Semasiographic Mixtec Codices of Precolonial and Early Colonial Mesoamerica
Christopher Driggers-Ellis | Gabriel Ayoubi | Girish.Salunke811@Gmail.Com Girish.Salunke811@Gmail.Com | Christan Grant
The precolonial and early colonial Mixtec codices describe the history and stories of the region in a semasiographic medium that is full of symbolic representations and meant to be narrated. Recently, the community has introduced datasets of XML representations of related media, including Aztec codices and Mayan hieroglyphic script, in a step towards symbolic machine interpretation of these historic Mesoamerican artifacts. In this work, we propose formal symbolic machine interpretation of XML encodings representing facsimile images from the Mixtec Codex Zouche-Nuttall. We demonstrate the efficacy of symbolic machine interpretation from XML step-by-step, showing how our parser and interpreter process text capturing a scene from the Mixtec Codex Zouche-Nuttall. We hope our contribution and the example we provide motivate collaboration among the archaeological, historical, linguistic, and natural language processing research communities to apply machine interpretation to Mixtec codices and similar manuscripts.
Temporal-Linguistic Adaptive Streaming for Continuous Sign Language Translation
Arshia Kermani | Habib Irani | Deautaun Ross | Vangelis Metsis
Real-time sign language translation must generate text incrementally as signs arrive, yet existing streaming policies treat glosses as a flat token sequence and discard the temporal rhythm of signing. Inter-gloss pauses reliably mark sentence boundaries in continuous discourse, but policies such as Wait-k cause arbitrary cross-boundary fragmentation. We propose Temporal-Linguistic Adaptive Streaming (TLAS), which fuses a Temporal Pause Detector (TPD, tracking inter-gloss interval statistics via an exponential moving average) and a Linguistic Readiness Estimator (LRE, a trained neural head on a frozen T5 encoder) through an Adaptive Fusion Gate (AFG). A proactive timeout fires before the next gloss arrives when the inter-gloss gap exceeds a threshold, producing clean sentence segmentation without oracle boundary information. We also contribute a synthetic discourse dataset of 1,400 ASL discourse groups with LLM-generated per-gloss timestamps and introduce a continuous-stream evaluation paradigm requiring autonomous boundary detection from an unbroken gloss stream. Under such conditions, TLAS significantly outperforms current heuristic baselines, such as Wait-k, and methods relying solely on linguistic content.
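The pause-tracking idea described above can be illustrated with a small sketch. This is not the authors' TPD: it only shows an exponential moving average over inter-gloss gaps with a relative threshold, and it checks the gap when the next gloss arrives rather than firing a proactive timeout. The class name, `alpha`, `ratio`, and the toy timestamps are all hypothetical.

```python
class PauseDetector:
    """Flags unusually long inter-gloss gaps relative to a running average."""

    def __init__(self, alpha=0.3, ratio=2.0):
        self.alpha = alpha      # EMA smoothing factor (assumed value)
        self.ratio = ratio      # a gap above ratio * EMA counts as a pause (assumed value)
        self.ema = None         # running average of ordinary gaps
        self.last_time = None   # timestamp of the previous gloss

    def observe(self, timestamp):
        """Return True if the gap before this gloss looks like a sentence boundary."""
        if self.last_time is None:
            self.last_time = timestamp
            return False
        gap = timestamp - self.last_time
        self.last_time = timestamp
        if self.ema is None:
            self.ema = gap      # the first gap initialises the average
            return False
        is_boundary = gap > self.ratio * self.ema
        if not is_boundary:
            # Only ordinary gaps update the average, so long pauses do not inflate it.
            self.ema = self.alpha * gap + (1 - self.alpha) * self.ema
        return is_boundary


# Hypothetical gloss arrival times in seconds; the 2.0 s gap before 3.5 is the pause.
detector = PauseDetector()
flags = [detector.observe(t) for t in (0.0, 0.5, 1.0, 1.5, 3.5, 4.0)]
```

In a streaming system this signal would be combined with a linguistic readiness score, as the abstract describes; the fusion step is not sketched here.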
Multimodal Large Language Models (MLLMs) have achieved remarkable success in semantic visual reasoning, yet their capacity for fine-grained, low-level perception remains critically under-evaluated. This perceptual fragility limits their reliability in noisy, real-world environments where visual signals are degraded. Furthermore, existing benchmarks often entangle visual perception with language priors, masking these underlying deficits. To address this, we introduce the **FAint numeric Detection Evaluation (FADE)** dataset, a novel evaluation suite designed to probe the limits of zero-shot Optical Character Recognition (OCR) in frontier MLLMs. By embedding synthetic, strictly numerical sequences over cluttered natural backgrounds at varying levels of transparency (𝛼), FADE explicitly disentangles pure visual perception from semantic predictability. We evaluate state-of-the-art models including Gemini 3.0, Claude 4.5 Sonnet, and Gemma 3 against a specialized UNet segmentation baseline. Our results reveal a striking limitation in frontier architectures: while they achieve near-perfect transcription at high visibility, their performance collapses under high transparency. Conversely, the UNet pipeline maintains robust spatial grounding, significantly outperforming generalist models at the lowest visibility thresholds. FADE provides a reproducible dataset to expose and diagnose the perceptual breakage points of modern multimodal systems.
Visual Question Answering (VQA) models process all image patches uniformly despite questions typically requiring only a small subset of visual information. This inefficiency leads to unnecessary computation and can result in attention dilution across irrelevant image regions. We propose Question-Guided Sparse Attention (QGSA), a plug-and-play mechanism that dynamically selects relevant image patches conditioned on question semantics. Our approach introduces three components: (1) a differentiable patch selector based on Gumbel-Softmax reparameterisation that enables end-to-end training with hard patch selection at inference; (2) a self-supervised grounding loss that encourages spatial selectivity without bounding-box annotations, combining contrastive patch selection with patch–word alignment via a frozen CLIP encoder; and (3) an adaptive sparsity mechanism that adjusts the number of selected patches according to estimated question complexity. Experiments on SmolVLM-256M-Instruct and SmolVLM-500M-Instruct across three VQA benchmarks (VQA-RAD, A-OKVQA, RefCOCO) demonstrate that QGSA reduces cross-attention FLOPs by 91–99% across input resolutions, achieving up to 76× theoretical speedup at 576px resolution, while maintaining exact accuracy parity with the dense baseline (𝛥=0.0 pp on all datasets). Wall-clock parity with the dense baseline is reached at 336px; realised end-to-end speedup requires larger models where cross-attention dominates total compute. QGSA consistently selects an average of k≈17 patches out of 576 (256M model), up to k≈18 (500M model), yielding up to a 34× reduction in the visual token sequence. These small-scale results validate the feasibility of question-conditioned sparse attention and provide a foundation for scaling to larger VLMs.
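The selection mechanism named above can be sketched in plain Python. This is a generic illustration of Gumbel-Softmax relaxation for training and hard top-k selection for inference, not the QGSA implementation; the function names, the temperature, and the toy relevance scores are assumptions. The last line reproduces the token-reduction arithmetic from the abstract.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Soft, noisy selection weights over patches (a relaxation usable in training)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)          # avoid log(0)
        gumbel = -math.log(-math.log(u))      # standard Gumbel(0, 1) sample
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)                         # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_top_k(logits, k):
    """Indices of the k highest-scoring patches, in ascending index order (inference)."""
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical relevance scores for four patches.
scores = [0.1, 2.0, -1.0, 1.5]
weights = gumbel_softmax(scores, temperature=0.5, rng=random.Random(0))
selected = hard_top_k(scores, k=2)

# Keeping about 17 of 576 patches shortens the visual sequence by roughly 34x.
reduction = 576 / 17
```

Lower temperatures push the soft weights toward a one-hot choice, which narrows the gap between the relaxed training-time selection and the hard selection used at inference.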
Systematic Performance Degradation in Indic Vision-Language Models: Evidence from Hindi and Telugu
Rishikant Chigrupaatii | Ponnada Sai Tulasi Kanishka | Lalit Chandra Routhu | Martin Patel | Sama Supratheek Reddy | Divyam Gupta | Rajiv Misra | Rohun Tripathi
With 1.5 billion people speaking over 120 major languages, India exemplifies the challenges of multilingual AI evaluation. Current multilingual VLM benchmarks suffer from unverified auto-translations, narrow task coverage, small sample sizes, and lack of culturally grounded content. We present HinTel-AlignBench, a comprehensive evaluation framework and benchmark for Hindi and Telugu vision-language models with English-aligned samples. Our framework combines semi-automated translation with human verification to generate 4k QA pairs per language across five domains: adapted English datasets (VQAv2, RealWorldQA, CLEVR-Math) and native Indic sets (JEE for STEM, VAANI for cultural grounding). Evaluation of state-of-the-art open and closed-source VLMs reveals consistent performance regression from English to Indic languages, with average drops of 8.3 points for Hindi and 5.5 points for Telugu across four of five tasks. We identify key failure modes and establish reproducible baselines for multilingual multimodal evaluation.
How Fragile Is Vision-Language Alignment? Mapping Concept Disruption Under Text-to-Image Personalization
Mujtaba Hasan
Text-to-image diffusion models learn a mapping from natural language to visual structure, but how robust is this mapping to perturbation? We use personalization—fine-tuning a model to learn a new face, object, or style—as a controlled stress test to probe the fragility of learned vision-language alignment. We find that fine-tuning for one concept systematically shifts the model’s ability to faithfully render unrelated concepts, and that this disruption follows structured, predictable patterns. To measure this fragility, we construct Concept Entanglement Maps: per-prompt, per-model disruption matrices that reveal which concepts are most affected and why. Using Stable Diffusion v1.5 as a controlled testbed, we evaluate 15 subjects across three personalization methods on 200 prompts and report three findings about the organization of vision-language alignment: (1) aggregate disruption is larger for vision-backbone and cross-attention perturbations than for text-embedding perturbations, despite the latter directly modifying the language representation; (2) abstract and compositional language is significantly more fragile than concrete, object-specific language; and (3) disruption does not follow semantic proximity—personalizing for a face does not preferentially disrupt other face-related prompts (p = 1.0), suggesting that alignment vulnerability is organized globally rather than purely by semantic category. These findings expose a structural vulnerability in current text-to-image personalization: the same cross-attention mechanism that enables compositional generalization also creates pathways through which local fine-tuning can propagate as global alignment shift.
The Compositional Grounding Gap: Why Vision-Language Models Fail at Relational Reasoning and How to Fix It
Kaustubh S. Bukkapatnam
Large vision-language models (LVLMs) achieve strong performance on many multimodal tasks, yet consistently fail at compositional relational reasoning—distinguishing "the cat on the mat" from "the mat on the cat." We provide a formal explanation for this failure. We prove that any vision-language alignment operating on pooled (order-invariant) visual features contains compositional blind spots: semantically distinct scenes that map to identical representations. We show that the number of blind spots grows factorially with scene complexity, establishing a fundamental limit on pooled-feature architectures. Motivated by this analysis, we propose REGROUND, a training-free, test-time method that re-introduces spatial structure into alignment by performing relation-guided cross-attention over spatial visual tokens, directed by a lightweight parse of the text query. Without any fine-tuning, REGROUND improves compositional accuracy by +8.6 points on Winoground, +8.4 on ARO-Relation, +6.4 on SugarCrepe, and +8.4 on VSR when applied to LLaVA-1.5, and provides consistent gains across other LVLMs. Ablation studies confirm that each component—parse guidance, token-level attention, and relation masking—contributes significantly.
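The blind-spot argument can be illustrated with a minimal sketch, assuming mean pooling as the order-invariant operation and made-up 2-d token features (neither is taken from the paper). Every spatial arrangement of the same tokens collapses to one pooled vector, and n distinct tokens admit n! arrangements, which gives some intuition for a factorial count.

```python
from itertools import permutations

def mean_pool(tokens):
    """Average equal-length feature vectors into one order-invariant vector."""
    n = len(tokens)
    return tuple(sum(column) / n for column in zip(*tokens))

# Hypothetical 2-d features for three spatial tokens (illustrative values only).
cat, mat, lamp = (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)

# Every spatial arrangement of the same tokens pools to the same vector.
pooled = {mean_pool(order) for order in permutations([cat, mat, lamp])}
arrangements = len(list(permutations([cat, mat, lamp])))
print(arrangements, len(pooled))  # 6 1
```

Six arrangements map to a single pooled representation here, so any alignment computed only from the pooled vector cannot tell them apart.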
HalluTrace: Causal Attribution and Source-Targeted Decoding for Hallucination in Large Vision-Language Models
Kaustubh S. Bukkapatnam
Object hallucination in large vision-language models (LVLMs) is well-documented, but the mechanisms that produce it remain poorly understood. We introduce HALLUTRACE, a causal attribution framework that decomposes hallucination into three distinct sources: (VGF) visual grounding failure, where the visual encoder produces a representation insufficient to identify the target object; (LPD) language prior dominance, where the language model overrides a correct visual signal with a statistically-driven prediction; and (CMC) cross-modal conflict, where visual and linguistic signals are irreconcilably inconsistent and the model resolves the conflict incorrectly. We operationalise these sources via causal component ablations: intervening on f_vis, f_proj, and f_LM independently and measuring the change in CHAIR score. Experiments on five LVLMs show that attribution patterns are object-category-specific and model-consistent: person/vehicle hallucinations are predominantly LPD (≥52%), food/furniture hallucinations are predominantly VGF (≥44%), and animal hallucinations split between VGF and CMC. Guided by these attributions, we design HAD (Hallucination-Aware Decoding), a unified decoding strategy that applies source-targeted interventions: visual signal amplification for VGF, language prior suppression for LPD, and contrastive re-weighting for CMC. HAD reduces CHAIR_I by 3.7–5.6 points and improves POPE F1 by 1.9–3.1 points over LLaVA-1.5, outperforming VCD and ICD on all three benchmarks (CHAIR, POPE, MME) without any additional training. We further prove that the attribution-decoding correspondence is tight: the CHAIR improvement from HAD is linearly predictable from the VGF attribution share (r = 0.86, p < 10^-6), validating the causal framework.
Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)
Manuel Mager | Abteen Ebrahimi | Minh Duc Bui | Robert Pugh | Arturo Oncevay | Luis Chiruzzo | Rolando Coto Solano | Shruti Rijhwani | Katharina Von Der Wense
Neural Text-to-Speech for Myaamia: Speech Synthesis for an Indigenous Algonquian Language
Anita Baral | John Femiani | Hunter Lockwood | Daniela Inclezan | Balaram Bhandari
We present the first neural text-to-speech (TTS) implementation for Myaamia (Miami-Illinois), an Indigenous Algonquian language of North America. Developed in collaboration with the Myaamia Center at Miami University, our approach upholds principles of data sovereignty. Using 14,358 utterances (10.4 hours total, 8.18 hours for training) from seven speakers, we train and evaluate FastSpeech, Glow-TTS, and VITS, assessing synthesis quality through objective (MCD, F0 RMSE, duration RMSE) and subjective (expert evaluation) metrics. VITS outperforms other models in spectral and prosodic accuracy, but challenges remain in phonetic precision and prosody modeling. Our results confirm the feasibility of neural TTS for Myaamia, with direct implications for language learning and revitalization. This work offers a replicable framework for other low-resource Indigenous languages while ensuring ethical, linguistic data governance.
We evaluate seven large language models—four proprietary and three open-weight—on bidirectional Lakota–English translation using 200 sentence pairs from the New Lakota Dictionary. Each model is evaluated with and without extended reasoning, where the provider’s API permits. The best model (Gemini 3.1 Pro) achieves a mean chrF++ of 59.4 on Lakota→English and 42.6 on English→Lakota; the strongest open-weight model trails the proprietary leaders, and no model produces reliable translation in either direction. Two independent LLM judges from different model families agree substantially (Cohen’s κ=0.75) that semantic equivalence ranges from 6% (GPT-5.2) to 60% (Gemini), diverging substantially from chrF++ scores. For the open-weight models, enabling reasoning changes refusal behavior far more than translation quality: it surfaces the limitation rather than overcoming it. Diacritic-normalization analysis shows models produce roughly correct base characters but place diacritical marks inconsistently. All results and evaluation code are publicly available at https://github.com/robotson/lakota-translation-benchmark.
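The inter-judge agreement statistic quoted above, Cohen's κ, corrects raw agreement for the agreement expected by chance. A minimal sketch of the standard formula follows; the two label lists are invented for illustration and are not the paper's judgments.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)

# Hypothetical binary "semantically equivalent?" verdicts from two judges.
judge_a = [1, 1, 1, 0, 0, 0, 1, 0]
judge_b = [1, 1, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(judge_a, judge_b)
print(round(kappa, 2))  # 0.5
```

In this toy case the judges agree on 6 of 8 items (0.75) while chance agreement is 0.5, giving κ = 0.5.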
Bridging Digital Tools for Linguistic Documentation and Revitalization
Christopher Haberland | Carly Crowther | Jingnong Qu | Anuk Centellas
Digital tools serving language revitalization tend to fall into two categories: 1) linguist-oriented documentation tools that prioritize annotation, morphological analysis, and archival preservation, and 2) community-facing applications that emphasize accessibility and language learning. Few systems integrate the former with the latter, and practical barriers — including the cost of computational expertise, single-user workflows, and limited data governance — further constrain their utility. These disconnects incur additional development and communication costs for revitalization teams consisting of linguists and community members. We introduce "langlit", a collaborative web-based platform that attempts to tailor documentation workflows for the language revitalization context within a single system. The platform integrates a finite-state morphological analyzer with a three-tier human-in-the-loop annotation workflow, searchable corpus interfaces with multiple query modalities, interactive word construction guided by the morphological grammar, corpus-linked hypothesis tracking with provenance, and a grammar-derived editable dictionary. All components share a single underlying FST grammar, and the system supports configurable access controls, collaborative editing, and optional LLM integration with transparent data handling. Designed for redeployment across languages through a modular architecture, "langlit" is published as an open-source repository on GitHub. We situate our system within the existing landscape of revitalization tools through a comparative analysis and discuss how integrated, community-informed design can better serve the specific goals of language revitalization.
A Systematic Comparison of Parameter-Efficient Fine-Tuning Techniques for Low-Resource Neural Machine Translation: Evidence from Indigenous Languages of the Americas
Drew Stackhouse | Justin Debenedetto
We present the first systematic benchmark of parameter-efficient fine-tuning (PEFT) for low-resource neural machine translation (NMT) of indigenous languages of the Americas. We evaluate eight PEFT methods alongside full fine-tuning on NLLB-200-distilled-600M across 13 indigenous-to-Spanish language pairs spanning four resource tiers (357 to 125,008 training sentences). OFT (Orthogonal Finetuning) achieves the highest development-set chrF++ among PEFT methods (26.63) while training only 0.28% of parameters. LoRA (Low-Rank Adaptation) offers a strong efficiency-quality tradeoff (25.27 chrF++, 0.19% of parameters). On held-out test data, full fine-tuning ranks first (25.12) with OFT a close second (25.06; p = 0.43). VeRA (Vector-based Random Matrix Adaptation) and Prefix Tuning consistently underperform. These results demonstrate that PEFT is a viable alternative to full fine-tuning for indigenous-language NMT.
Linguistic Feature Tagging for Automatic Classification of 27 Closely-Related Quechua Varieties
Claire Post | Alexis Palmer
This paper presents a multi-dialect text classifier for Quechua that augments neural models with rule-based linguistic information to address challenges in low-resource, morphologically complex settings. The approach is built on a carefully curated dataset spanning multiple genres, including annotated parallel bible corpora, and encodes manually annotated lexical variation and polypersonal verbal agreement as explicit features within a transformer-based classifier. Results show that neural models substantially outperform statistical baselines, enabling highly accurate multi-class classification across 27 Quechua dialects. The impact of linguistic augmentation is context-dependent: gains are minimal in high-resource settings but more pronounced in low-resource and cross-domain conditions. Overall, this work aims to contribute to the development of dialect-sensitive NLP methods for Quechua and other low-resource, morphologically rich languages.
What Resources Matter for Interlinear Glossing? Using LLMs and RAG for the Low-Resource Mapudungun Language
Anaís Almendra | Arianna Bisazza | Claudio Gutierrez | Felipe Hasler
Interlinear glossing is essential for the study and revitalization of endangered languages. However, it remains a time-consuming process that requires extensive linguistic expertise. Recent advances in Large Language Models (LLMs) offer a potential solution. In this research, we study the case of Mapudungun, an endangered language spoken in Chile and Argentina, to generate automatic interlinear glosses using the Gemini 2.5 Pro model. Our study investigates which information configuration through Retrieval-Augmented Generation (RAG) yields the best results. We compare the integration of a formal grammar, a dictionary, a small annotated corpus, and a combination of all these resources. Our evaluation shows that while dictionary integration causes a significant degradation in performance, grounding the model with a structured corpus maximizes accuracy relative to the resources employed. Notably, we find that a remarkably small dataset of 589 meaning units provides enough normative guidance to significantly improve the morphological tagging task. This work highlights the viability of utilizing minimally annotated corpora to assist in the documentation of morphologically complex languages.
Deer, Deities, and Dancing: Culturally Biased LLM Hallucination in Low-Resource Wixárika Translation
Henry Gagnier | Ashwin Kirubakaran
Large language models (LLMs) struggle with low-resource polysynthetic languages, yet the nature of their failures remains underexplored. We evaluate GPT-4o-mini, Gemma 3 27B, Llama 3.3 70B, and NLLB-200 on Spanish↔Wixárika translation using zero-shot and 5-shot prompting. All systems are unusable, scoring below 3 BLEU and 21 chrF. Qualitative analysis reveals that LLMs largely ignore source content and instead generate fluent hallucinations. Spanish outputs frequently include indigenous cultural stereotypes such as deer, deities, rain dance, and shamans, regardless of the input, while Wixárika outputs are repetitive across different inputs and morphologically implausible. Few-shot prompting yields model-dependent improvements, with Gemma and Llama improving substantially at higher shot counts while GPT-4o-mini remains flat. These results demonstrate that current LLMs are unable to represent polysynthetic morphology and instead default to exoticizing Indigenous culture and identity. We call for the development of inclusive, morphology-aware modeling strategies and increased resource creation to ensure that Indigenous languages of the Americas are represented safely and accurately.
IndigiEval: Evaluating LLMs in North American Indigenous Languages
Julia Mainzinger | Jacqueline Brixey
This paper presents IndigiEval, a framework for evaluating the language and cultural proficiency of several commercially available large language models (LLMs) across five North American Indigenous languages (Mvskoke, Choctaw, Cherokee, Cheyenne, and Hawaiian). This framework is a qualitative evaluation method intended for communities with small speaker populations to be able to critically evaluate LLM performance with minimal data and human effort. IndigiEval includes tasks such as answering cultural questions, translation, text generation, and speech recognition. The results of our experiments indicate that no currently available LLM performs well across all evaluation categories, and that LLMs frequently hallucinate orthographies, grammatical structures, cultural knowledge, and vocabulary for all languages and cultures considered. Our proposed evaluation framework is not intended as a comprehensive score, but rather a qualitative and flexible framework to inform language communities about a given LLM’s potential as a resource, since each language has unique environments, strengths, and availability of resources.
A data-centric approach to performance improvement in under-resourced ASR: The case of Dënë Sųłıné
Olga Kriukova | Olga Lovick | Antti Arppe
This paper presents a study focused on advancing Automatic Speech Recognition (ASR) for the under-resourced language Dënë Sųłıné through data-centric approaches. We explore multiple strategies to enhance the quality of training data—both audio recordings and transcriptions—to address the challenges posed by mixed-quality datasets. Our experiments investigate which data preparation techniques most effectively improve ASR performance in this context. Our findings show that reducing non-phonemic spelling variation in the corpus significantly improves model generalization, resulting in a substantial increase in recognition accuracy. Additionally, we demonstrate that increasing manually reviewed transcriptions consistently improves word and character error rates, while audio enhancement slightly reduces performance, highlighting the complex trade-offs in low-resource ASR development.
Towards a Community-accessible Cahuilla corpus: Developing HTR for J.P. Harrington’s handwritten fieldnotes on Mountain Cahuilla
Ray Huaute | Jacqueline Brixey
This paper describes ongoing work to develop a corpus of Cahuilla language from the John Peabody Harrington collection, which contains linguistic and ethnographic fieldnotes documenting Indigenous languages of California and other regions across the Americas. Handwritten notes present numerous processing challenges, including scratch-outs, multilingual entries in Spanish and other Indigenous languages, unique abbreviations, and varying script orientations. We compare the efficacy of deep learning text recognition models to convert images of the notes into a machine-readable format, with a focus on respecting tribal data sovereignty in our methods. We find that Pylaia is the most accurate model for our data. Finally, we present the preliminary findings and indicate future directions for developing a Cahuilla corpus.
Corpora duplication for NLP in low-resource languages: A case study of Nahuatl
Juan Jose Guzman Landa | Juan-Manuel Torres-Moreno | Luis Moreno Jimenez | Elvys Linhares Pontes | Miguel Figueroa-Saavedra | Graham Ranger | Martha Lorena Avendaño Garrido
In this paper, we aim to answer the following question: could corpus duplication be useful in Natural Language Processing (NLP) for low-resource languages? In these languages (or pi-languages), corpora available for training Large Language Models are virtually non-existent. Specifically, we study the impact of corpus expansion in Nahuatl, an agglutinative and polysynthetic Amerindian pi-language characterised by extensive dialectal variation. Our goal is to increase the size of Nahuatl corpora, which currently consist of a limited number of tokens, through controlled duplication techniques. Our experimental setup employs incremental duplication alongside appropriate corpus balancing, with the objective of training embeddings optimised for downstream NLP tasks. Consequently, static embeddings were trained and evaluated on a sentence-level semantic similarity task. Our results show a significant improvement in performance when incremental duplication is applied, compared to results obtained without corpus expansion. To our knowledge, this technique has not yet been explored in this field.
On the Robustness of Morphosyntactic Transformation with Large Language Models: The Case of Quechua Collao
Pool Pocco | Arturo Oncevay
We present a morphosyntactically controlled transformation dataset for Quechua Collao and evaluate large language models on a sentence-level transformation task under varying prompting conditions. Results show that performance depends on the interaction between model behavior, context size, and linguistic complexity, with smaller models benefiting more from additional examples and morphological hints providing selective gains.
Building Community-Centred NLP Resources for Puno Quechua
Elwin Huaman | Adrian Gamarra Lafuente | Johanna Cordova | Anna Korhonen
The preservation of under-resourced languages requires digital tools and resources shaped by and for their speakers. We present the first dedicated ASR resources for Puno Quechua (ISO 639-3: qxp): (1) the largest speech corpus for any single Quechua variety, consisting of 66 hours of recordings of scripted and spontaneous speech (including 36 hours of manually transcribed and validated data), collected via a participatory design campaign; (2) the first systematic ASR benchmark for Puno Quechua, evaluating state-of-the-art models and fine-tuning Whisper-base, wav2vec2-base, and XLS-R-300M, with and without continued pre-training (CPT); (3) an open release of all datasets and fine-tuned models.
The Power of Simplicity: N-Grams and Transformers in Nahuatl Language Identification
Luis Mercado Campos | Robert Pugh | Alexis Palmer
In the context of real-world language technology applications, the language or variety in which a given text is written is often unknown or uncertain. Yet, this information is crucial in order to adequately select and apply appropriate models or resources. Language identification (LID), or the process of determining the language or variety of a text sample, is thus often an important fundamental task in natural language processing. LID can be particularly challenging when: (1) there are not many labeled texts for training; and (2) similar or related languages are involved, since these may share a number of surface-level features. In this paper, we present an LID system for Nahuatl, a group of closely-related language varieties spoken in Mexico and Central America. Nahuatl LID involves both of the aforementioned challenges: Nahuatl varieties can be quite similar, sharing morphemes and even many lexical items, and there is a relative paucity of representative, variant-labeled Nahuatl text. We describe LID experiments for a total of 11 Nahuatl varieties, achieving generally good results (90.59% ±0.09% in 5-fold cross-validation experiments). Many of the outstanding errors are the result of confusion between three highly similar Huasteca variants.
RAN: Resource Abundance Notation for Languages in NLP
Jared Coleman | Tainã Coleman | Bhaskar Krishnmachari
The term "low-resource" is used pervasively in NLP but communicates almost nothing precise. We propose RAN (Resource Abundance Notation), a compact, multi-dimensional notation for quantifying a language’s NLP resource profile. A RAN score is written as S/M/L_1-B_1/L_2-B_2/..., where S = floor(log10(speakers)), M = floor(log10(monolingual sentences)), and each L_i-B_i pair records a bilingual partner and floor(log10(parallel sentences)). Values derive from canonical sources: Wikidata for speakers, OSCAR 23.01 for monolingual corpora, and (where available) OPUS for parallel corpora. We score 20 typologically diverse languages and correlate each profile against published benchmarks for three tasks: machine translation (MT, via NLLB-200 chrF++), named entity recognition (NER, via XTREME XLM-R WikiANN F1), and part-of-speech tagging (POS, via XTREME XLM-R UD accuracy). The RAN components carry complementary information: a linear model using all three explains 52% of MT variance, 76% of NER variance, and 72% of POS variance. Among single predictors, B_max (the largest bilingual corpus, regardless of partner) is strongest for the cross-lingual transfer tasks (NER, POS), while M and B_en are strongest for MT. RAN is designed first as a communication tool, not a predictive model.
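The notation defined above can be sketched in a few lines of Python. The function name `ran_score`, the choice to reject non-positive counts, and the counts in the example are assumptions made for illustration; they are not values or behaviour reported by the paper.

```python
import math

def ran_score(speakers, monolingual_sentences, parallel):
    """Format a RAN string S/M/L_1-B_1/L_2-B_2/... from raw counts.

    `parallel` maps a partner-language code to a parallel-sentence count.
    """
    def magnitude(count):
        # floor(log10(count)), as in the definition above.
        # Assumption: non-positive counts are rejected rather than encoded.
        if count <= 0:
            raise ValueError("counts must be positive")
        return math.floor(math.log10(count))

    parts = [str(magnitude(speakers)), str(magnitude(monolingual_sentences))]
    for partner, count in parallel.items():
        parts.append(f"{partner}-{magnitude(count)}")
    return "/".join(parts)

# Hypothetical counts, chosen only to exercise the formula.
example = ran_score(200_000, 50_000, {"es": 30_000})
print(example)  # 5/4/es-4
```

Because each component is an order of magnitude, counts of 30,000 and 99,000 parallel sentences yield the same B value; the notation trades precision for compactness.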
Bringing Mapudungun into the Modern MT Ecosystem: Morphology-Aware Tokenization for NLLB-200 Fine-Tuning
Isaac Thompson | Brandon Rogers | Eric Ringger
For Mapudungun arn→es translation, morphology-aware tokenization can substitute for a 5× increase in model parameters. We fine-tune three sizes of Meta’s NLLB-200 on Mapudungun–Spanish translation across eight tokenization strategies, including our novel Morfessor-VC method, which constrains Morfessor morpheme segmentation to tokens already present in NLLB’s pretrained vocabulary. Our 600M Morfessor-VC model is competitive with our own fine-tuned 3.3B Standard BPE model on arn→es (43.2 vs. 42.9 chrF++, ∆ = +0.3, p = 0.039, 95% CI [0.02, 0.60]) while using five times fewer parameters, and all fine-tuned conditions surpass frontier LLMs by over 27 chrF++. Mapudungun is an indigenous polysynthetic language spoken by 200,000+ Mapuche people in Chile and Argentina, absent from NLLB-200 and not supported by major commercial MT providers; prior work predates large-scale multilingual models and does not address the tokenization challenges posed by its agglutinative morphology. These results establish new state-of-the-art baselines for Mapudungun MT and provide a practical foundation for community language tools in pedagogy, social media, and language revitalization.
QomL’aqtaqa: A Qom–Spanish Parallel Corpus for Natural Language Processing with Machine Translation Evaluation
Viviana Cotik | Aleksei Korablev | Paola Cúneo | Pablo Laciana
Qom, a language of the Guaycuruan family, is a low-resource language for NLP and speech processing. We present the first parallel Qom–Spanish corpus in a computationally usable format, comprising 33,392 parallel segments, totaling 1,469,905 Qom tokens and 891,344 Spanish tokens. A subset of 2,943 segments excludes Bible-derived content. It includes alignments at different levels: sentences, sentence fragments, and paragraphs, and is compiled from multiple sources, both previously available and newly collected. We also present bidirectional neural machine translation baselines based on NLLB-200, achieving competitive performance in both translation directions on the full dataset, and lower performance on the non-Bible subset. An ablation study shows that training exclusively on biblical data reduces performance on non-biblical text, highlighting the importance of domain diversity in low-resource machine translation.
Toward a Coarse-Labeled Spoken Language Identification Dataset for Central Alaskan Yup’ik and Samoan from US Broadcast Archives
Yangyang Chen | Kyeongmin Rim | James Pustejovsky
Publicly available spoken language identification (LID) systems provide sparse and inconsistent coverage of indigenous languages of the Americas and languages of the Pacific Islands. No system on HuggingFace covers Central Alaskan Yup’ik except the largest variant of Meta’s MMS-LID family, and only three MMS-LID variants cover Samoan, while Whisper and VoxLingua107-based models lack both despite including other Polynesian languages. We describe an ongoing effort to build a coarse-labeled LID dataset for Yup’ik and Samoan from US public broadcast archives, benchmark publicly available LID systems on it, and train a simple MLP classifier on frozen wav2vec 2.0 representations as a prototype. We report preliminary corpus statistics, off-the-shelf model performance, and prototype results. Guided by the distinctive phonological typology of the target languages, we outline a phonologically-informed fine-tuning direction as future work.
Retrieval-Augmented Long-Context Translation for Cultural Image Captioning: Gators submission for AmericasNLP 2026 shared task
Aashish Dhawan | Christopher Driggers-Ellis | Dzmitry Kasinets | Christan Grant | Zhe Wang
This paper presents the University of Florida Gators submission to the AmericasNLP 2026 shared task on cultural image captioning for Indigenous languages. The system uses a two-stage pipeline: first generating Spanish captions from images with a vision-language model, then translating them into target languages using retrieval-augmented many-shot prompting with Gemini 2.5 Flash. The paper reports strong improvements over the shared task baseline across multiple languages, analyzes the role of retrieval, synthetic exemplars, and morphology-aware prompting, and discusses limitations related to dev-set exemplars, cascade errors, and chrF++ based evaluation.
From Machine Translation to Image Captioning: Training Vision-Language Models for Indigenous Languages of the Americas
Luis Lara | Param Raval
We describe our system for the AmericasNLP 2026 Shared Task on Cultural Image Captioning for Indigenous Languages of the Americas. Our post-training pipeline starts from Aya Vision 32B: the vision-language model is first fine-tuned on machine translation data from prior AmericasNLP shared tasks and then further fine-tuned on the cultural image captioning data. This approach uses translation as an intermediate training task, while the final system produces captions directly in the requested Indigenous language rather than translating a Spanish caption afterward. Our experiments show that machine translation fine-tuning is an important initialization step. The resulting fine-tuned vision-language model also shows translation capabilities for the languages considered in this work. In addition, our zero-shot GPT-5.5 submission ranks first in the Maya language track under the official human-evaluation stage.
Culturally-Aware Image Captioning for Guaraní with Multimodal Prompting: IUHoosiers at AmericasNLP 2026
Wenchen Shi | Phakphum Artkaew | Luke Gessler
The AmericasNLP 2026 shared task challenges systems to generate culturally grounded image captions in indigenous languages of the Americas, a setting that demands both cultural awareness and linguistic accuracy for severely underresourced languages. We present IUHoosiers, Indiana University’s system for the Guaraní track. Rather than fine-tuning, our approach centers on inference-time knowledge injection: for each test image, we retrieve relevant Guaraní grammatical and cultural resources using BM25 and inject them into a large vision language model’s prompt alongside the image, enabling language-specific cultural and linguistic grounding without any parameter updates. IUHoosiers placed first for Guaraní in both automatic evaluation (24.67 chrF++) and human evaluation (3.45/5), outperforming all other participating systems.
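The retrieval step described above relies on BM25 ranking. The following is a minimal sketch of standard Okapi BM25 over whitespace tokens, not the IUHoosiers implementation; the function name, the `k1` and `b` defaults, and the English placeholder notes are all assumptions for illustration.

```python
import math
from collections import Counter

def bm25_rank(query, documents, k1=1.5, b=0.75):
    """Return document indices sorted by descending Okapi BM25 score."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Length-normalised term-frequency saturation.
            denom = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores.append(score)
    return sorted(range(n_docs), key=lambda i: scores[i], reverse=True)

# Placeholder resource snippets (hypothetical, in English for readability).
notes = [
    "verbs take person prefixes",
    "traditional festival food and music",
    "nasal harmony spreads across the word",
]
ranking = bm25_rank("festival music", notes)
print(ranking[0])  # 1
```

In a pipeline like the one described, the top-ranked snippets would be concatenated into the prompt alongside the image; how many to include and how to format them are design choices this sketch does not cover.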
6fanle Submission to the AmericasNLP 2026 Shared Task on Wixarika Image Captioning
Ji Wang | Hanqi Yang
This system description presents a Wixarika image captioning system for the AmericasNLP 2026 Shared Task on Cultural Image Captioning for Indigenous Languages. The system uses Spanish as a pivot language, combining CLIP-based image retrieval, Qwen3-VL Spanish caption generation, the official Sheffield-compatible Spanish-to-Wixarika MT model, and character n-gram language-model reranking. We report local 5-fold development results, official test results, error analysis, and implementation details for reproducibility.
Culturally Grounded Image Captioning in Indigenous Languages with Vision-Language Models: Cascaded and Single-Stage Approaches
Mirelle Bueno | Sushil Garg
Culturally grounded image captioning for under-resourced Indigenous languages is challenging due to severe data scarcity and the need to describe culturally specific visual content. This paper describes our submission to the AmericasNLP 2026 shared task, where we evaluate two architectural paradigms for caption generation across Bribri, Guaraní, Yucatec Maya, Wixárika, and Orizaba Nahuatl. First, we implement a cascaded system that combines a large vision-language model with a machine translation pipeline, showing that culturally contextualized, persona-based prompting improves over the official baseline in most comparable settings. Second, we develop a direct, end-to-end single-stage approach by adapting PaliGemma 2 using LoRA fine-tuning, continued pre-training, and multilingual joint training. Our single-stage experiments show that, despite severe domain mismatch and reliance on synthetic training data, multilingual training and continued pre-training improve automatic chrF++ relative to single-language LoRA fine-tuning in some settings. Overall, cascaded pipelines remain the strongest among the evaluated approaches under current data constraints, while single-stage models remain a promising but currently data-limited path toward direct Indigenous-language image captioning.
Schema-Constrained Image Captioning for Five Low-Resource Indigenous Languages
Diego Cuadros | Nicholas Leeds | Amanda Avalos | Azul Alpizar-Velazquez | Jared Coleman | Faezeh Dehghan Tarzjani | Bhaskar Krishnamachari
We describe our submission to all five tracks of the AmericasNLP 2026 Shared Task on Cultural Image Captioning: Bribri, Guaraní, Yucatec Maya, Orizaba Nahuatl, and Wixárika. Our system is an LLM-assisted rule-based machine translation (LLM-RBMT) captioner. For each language, a coding agent reads the small development split and open-web linguistic references and writes a complete Pydantic grammar package with a closed vocabulary. At inference time, a vision–language model sees the image and the schema, emits a structured SentenceList under constrained decoding, and a deterministic Python renderer produces the surface string. The model never generates target-language tokens. The same architecture handles all five languages with no fine-tuning, no parallel corpora, and no human edits to the generated packages. On the official test set, the system placed first on human evaluation in Bribri and Orizaba Nahuatl, third on Yucatec Maya, and first on ChrF++ in Yucatec Maya. We suggest that a strength of the approach is that outputs are restricted to simple sentences that are grammatically correct by construction, modulo the correctness of the generated grammar itself.
USP at AmericasNLP 2026 Shared Task: Culturally-Aware Image Captioning for Indigenous Languages via Vision-Language Models and Fine-Tuned Neural Machine Translation
Rafael Fernandes
We describe the USP system for the AmericasNLP 2026 Shared Task on Culturally Relevant Image Captioning for Indigenous Languages, covering Guaraní (grn), Maya Yucateco (yua), Nahuatl (nah), Wixárika (hch), and Bribri (bzd). We propose a two-stage cascade: Qwen3-VL-8B-Instruct (Bai et al., 2025) generates Spanish captions via language-specific cultural prompts; language-specific fine-tuned NLLB-200-distilled-600M (NLLB Team et al., 2022) models then translate them into each target language. We train on AmericasNLP 2023 data (Ebrahimi et al., 2023) augmented with public parallel corpora. Our system achieves competitive results, including 3rd place in Guaraní human evaluation (2.41/5.0) and 5th in Bribri (1.09/5.0) among 8 teams. We also report that NLLB-200-distilled-600M silently lacks vocabulary entries for Bribri and Maya Yucateco, producing English output without error.
Nearest-Neighbor Retrieval for Indigenous Image Captioning
Justin Vasselli | Arturo Martínez Peguero | Shintaro Ozaki | Frederikus Hudi | Haruki Sakajo | Taro Watanabe
This paper describes the NAIST submission to the AmericasNLP 2026 Shared Task on Indigenous Language Image Captioning. We investigate two approaches for generating captions in Bribri, Guaraní, Nahuatl, Wixárika, and Yucatec Maya. The first is a nearest-neighbor retrieval system that uses CLIP image embeddings to retrieve the most similar image from the development set and directly reuse its caption. The second is a generation pipeline that combines scene analysis, dictionary-grounded lexical planning, retrieved gloss templates, and interlinear gloss representations to constrain generation in low-resource settings. The retrieval-based approach substantially outperformed the gloss-based pipeline under chrF++ evaluation and was competitive across all submitted systems, achieving first-place automated system rankings for Bribri and Wixárika and third place for Nahuatl. The gloss-based pipeline produced weaker automatic evaluation results and exposed problems with dictionary coverage, orthographic mismatches between resources, and unstable grammatical generation. Our results suggest that retrieval-based methods provide a strong baseline for low-resource captioning tasks when high-quality examples are available.
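The nearest-neighbor idea in this abstract (retrieve the most similar development image and reuse its caption) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the NAIST implementation: the embeddings are tiny hand-written vectors standing in for precomputed CLIP image embeddings, cosine similarity is assumed as the similarity measure, and the names `cosine`, `retrieve_caption`, and the placeholder captions are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_caption(query_embedding, dev_set):
    """Return the caption of the development image closest to the query.

    dev_set is a list of (embedding, caption) pairs; in a real system the
    embeddings would come from an image encoder (assumed, not shown here).
    """
    best_embedding, best_caption = max(
        dev_set, key=lambda item: cosine(query_embedding, item[0])
    )
    return best_caption


# Toy 3-dimensional "embeddings" with placeholder captions.
dev_set = [
    ([1.0, 0.0, 0.0], "caption A"),
    ([0.0, 1.0, 0.0], "caption B"),
    ([0.7, 0.7, 0.0], "caption C"),
]
print(retrieve_caption([0.9, 0.1, 0.0], dev_set))
```

Because the output is always a caption that already exists in the development set, the quality of this approach depends directly on how well that set covers the test images.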
Findings of the AmericasNLP 2026 Shared Task on Cultural Image Captioning for Indigenous Languages
Minh Duc Bui | David Guzmán | Abteen Ebrahimi | Franklin Morales | Marvin Agüero-Torales | Raquel Insfrán | Cecilia González | Ramón Araujo | Luca Cernuzzi | Carlos Raul Noh Chi | Carlos Eduardo Tec Cahun | Sindi Estrella Poot Cohuo | Daniel Ricardo Benítez Chi | Santos Natanael Palomo Arévalo | Jessica Elizabeth Canul Canche | Deysi Aracely Poot Poot | Wendy Marleny Dzib Dzib | Eduardo José Ake Pool | Reynaldo Alexander Couoh Martin | Silvia Fernandez Sabido | Luis Samuel Santiago Melchor | Sotero Silverio | Robert Pugh | Raúl Vázquez | John E. Ortega | Arturo Oncevay | Rubén Manrique | Luis Chiruzzo | Rolando Coto-Solano | Elisabeth Mager | Shruti Rijhwani | David Ifeoluwa Adelani | Manuel Mager | Katharina von der Wense
Indigenous languages of the Americas face severe endangerment, and the scarcity of culturally grounded resources remains a critical barrier to revitalization efforts. We present the AmericasNLP 2026 Shared Task on Cultural Image Captioning for Indigenous Languages, the first shared task dedicated to generating captions for images depicting Indigenous cultures of the Americas, written in the Indigenous languages themselves. To support this, we introduce and publicly release a newly constructed dataset spanning five cultures and their dominant languages: Bribri, Guaraní, Yucatec Maya, Central Veracruz Nahuatl, and Wixárika. Evaluation follows a two-stage process, combining automatic evaluation using ChrF++ with human evaluation of the top-performing systems for each language. Eight teams participate, submitting 27 systems in total. Results indicate that the task remains largely unsolved: while the strongest systems produce understandable captions, they fall short on descriptive detail and, critically, cultural grounding.
Proceedings of the 13th Workshop on Argument Mining and Reasoning
Mohamed Elaraby | Annette Hautli-Janisz | Julia Romberg | Elena Musi | Federico Ruggeri | John Lawrence
STCOR: A Trilevel Syllogism-Driven Reasoning Framework
Keying Yang | Hao Wang | Chengtao Jian | Kai Yang
Inspired by the human expert thinking paradigm in operations research, this work introduces a new concept of reasoning tasks: Textual Constrained Optimization (TCO) problems. A TCO problem is characterized by a natural language description that implicitly specifies an underlying structured model with variables, constraints, and objectives. We propose a novel Syllogism-driven Textual Constrained Optimization Reasoning (STCOR) paradigm, driven by classical syllogistic logic. Unlike contemporary stepwise methods, our framework structures reasoning into three phases: meta-modeling, which acts as the major premise by retrieving a relevant class-driven prototype template; formalization, which serves as the minor premise by instantiating the template into an explicit logical model from textual queries; and solving, which derives the final answer as conclusion. To support the end-to-end implementation, we further develop a tri-level optimization algorithm TriRL.
Beyond Logical Forms: LLM-Extracted Patterns for Fallacy Classification
Eleni Papadopulos | Firoj Alam | Giovanni Da San Martino
In today’s fast-paced information era, logical fallacies, defined as defective patterns of reasoning, inevitably contribute to the growth of information disorder. However, fallacies often appear in nuanced forms that complicate automated classification. In this study, we investigate whether merging abstract logical structures with context-level linguistic cues proves beneficial for fallacy classification, developing a framework that inductively extracts such patterns from fallacious examples and their explanations using Large Language Models (LLMs). We evaluate the impact of these patterns across different LLMs and experimental zero- and one-shot configurations, showing statistically significant improvements over zero-shot baselines and outperforming competing approaches. Cross-dataset experiments validate generalization, establishing data-driven pattern extraction as an effective method for generating logical representations.
A Three-Level Audit of LLM Alignment for Argument Quality Assessment
Wei-Fan Chen | Jinming Yu | Lucie Flek
Large Language Models (LLMs) are increasingly used as automated evaluators of argument quality. However, existing studies typically assess models only through their agreement with human scores, leaving the reasoning process behind these judgments unexplored. In this paper, we propose a three-level audit framework for evaluating the reliability of LLM-based argument quality assessment. The framework distinguishes between (1) surface alignment, measuring agreement between LLM-predicted scores and human annotations; (2) instructional alignment, assessing whether generated rationales adhere to the intended evaluation criteria; and (3) faithfulness alignment, examining whether predicted scores are supported by the generated rationales. To operationalize this audit, we introduce structural rationale prompting, which guides LLMs to generate structured justifications before assigning scores across 11 dimensions of the Dagstuhl-15512 argument quality corpus. We evaluate several LLMs under this framework and find that structural rationale prompting substantially improves agreement with human annotations compared to definition-based prompting. Furthermore, the generated rationales generally follow the evaluation instructions and remain highly consistent with the predicted scores. Overall, our results suggest that auditing LLM evaluators beyond surface score agreement provides deeper insight into the reliability and transparency of LLM-based evaluation.
Stance classification is a core task in argument mining and subjectivity analysis, crucial for understanding public discourse and opinion dynamics on social media. Despite their impressive few-shot capabilities, Large Language Models (LLMs) remain sensitive to prompt construction, including the selection and ordering of in-context examples. In this paper, we propose a Topic-Guided prompting method for argument stance classification that dynamically integrates topic-specific information into the few-shot context. We evaluate our method on five LLMs across three datasets spanning formal debates and user-generated online comments. Our extensive evaluation shows that our proposed Topic-Guided prompting outperforms standard few-shot prompting and state-of-the-art example selection strategies. Further analysis indicates that our method reduces the bias towards the ’support’ class observed in several models, resulting in more balanced predictions across stances and thus a more robust approach to stance classification.
AMResources: Cataloging Argument Mining Datasets
Dexter Williams | Shiwei Liu | Manfred Stede | Henning Wachsmuth | Jodi Schneider
Annotated datasets are essential for developing and evaluating argument mining systems, yet information about argument mining datasets remains scattered across papers, repositories, and task-specific surveys. To address this, we introduce AMResources (http://purl.archive.org/amresources), an online catalog that organizes argument mining datasets by task, and captures relationships among datasets, releases, and papers. We draw particular attention to relationships such as re-annotation and dataset extension. To curate dataset information into a consistent and provenance-aware structure, AMResources links datasets to canonical papers. For each dataset release, AMResources records standardized metadata such as language, genre, unit type and unit count, annotator characteristics, agreement reporting, and accessibility. We argue that such structured dataset documentation remains critical in the era of large language models, where annotated datasets increasingly serve as high-quality evaluation benchmarks and where tracing dataset provenance and annotation layers is necessary for systematic comparisons across tasks.
Argument-Based Comparative Question Answering Evaluation Benchmark
Irina Nikishina | Saba Anwar | Nikolay Dolgov | Maria Manina | Daria Ignatenko | Viktor Moskvoretskii | Artem Shelmanov | Tim Baldwin | Chris Biemann
Despite the ability of large language models (LLMs) to generate coherent comparative answers, automatic comparative question answering (CQA) remains challenging due to the absence of standardized evaluation criteria and the high resource demands of manual assessment. To address these problems, this paper proposes a comprehensive evaluation framework designed to assess the quality of CQA summaries using LLMs-as-a-Judge. We formulate 15 evaluation criteria for assessing comparative answers generated by various sources, including LLMs, human experts, and prior work. To capture a diverse range of comparative answers, LLM summaries were generated under various prompting scenarios. We evaluate the effectiveness of our framework using both human assessment and LLMs, demonstrating the consistency between automated and manual evaluations. Finally, we fine-tune Llama-3-8B-Instruct on a dataset generated from the best-performing CQA models in our evaluation.
Illustrating Arguments with Images Using Aspect-Aware Prompting
Maximilian Heinrich | Sharat Anand | Johannes Kiesel | Benno Stein
Images can powerfully strengthen arguments, conveying ideas more immediately and compellingly than text alone. With the rise of text-to-image models, a broad audience can now generate custom visuals to illustrate their arguments. Yet a fundamental mismatch undermines this potential: these models are trained on concrete scene descriptions, while arguments operate at the level of general, abstract principles. Naively prompting such a model with an argumentative text therefore rarely produces images that genuinely illustrate the argument. To address this challenge, we propose an aspect-aware image generation approach. Given an argument, our method first identifies the key aspects that an illustrative image should convey, then constructs a detailed scene description grounded in both the argument and those aspects, and finally generates an image using that scene description as the prompt. A human-assessment evaluation demonstrates that this approach yields images that illustrate arguments significantly better than those produced by naive prompting.
Do We Need Large Models for Argument Classification? Revisiting the Role of Model Compression
Filip Gampel | Rafał Olszowski | Marcin Pietroń
Large language models have improved argument mining substantially, but the associated computational cost complicates deployment, replication, and systematic comparison. We examine how much compression an open-source large language model can tolerate before argument classification quality degrades. Using gpt-oss-20b as the base model, we study pruning with Wanda and post-training quantization under a zero-shot prompting setup. We evaluate compressed variants on three argument-mining resources, namely UKP, Args.me, and ARIES, and contrast their behavior with general language-model benchmarks. The results show a consistent pattern: moderate pruning preserves most of the original performance on argument classification, whereas activation quantization causes larger and more systematic drops. The findings suggest that argument classification is more compression-tolerant than general-purpose evaluation suites, but only up to a point, and they should not be interpreted as evidence that aggressive compression is universally safe. We therefore position compression as a practical way to reduce model cost for argument analysis, while emphasizing that claims about efficiency gains must distinguish between preserved predictive quality and realized runtime speedups.
A Neural Approach to Fine-Grained Argumentation Strategy Classification with Emotion and Moral Value Lexicons across Multiple Domains
Mohammad Yeghaneh Abkenar | Weixing Wang | Manfred Stede | Julia Romberg
Fine-grained argumentation mining goes beyond coarse-grained distinctions such as claim and premise, by delving deeper into the underlying strategies employed (e.g., the use of facts or values to persuade the audience). Despite the advancements brought about by pre-trained language models, the task remains challenging. We investigate whether auxiliary knowledge such as emotion and moral value lexicon features can improve the classification of fine-grained argumentation strategies. Our Neural Flair Transformer Classifier (NFTC), in its base form, fine-tunes a transformer-based document encoder (RoBERTa) for end-to-end argument component classification. Evaluated across four corpora from diverse domains spanning public participation, persuasive forums, product reviews, and student essays, NFTC consistently outperforms majority-voting and Qwen2.5-7B baselines, achieving competitive performance on all datasets. Moreover, gains are observed against a fine-tuned LLaMA-3-8B-Instruct model, regarded in prior work as a leading approach. Injecting additional knowledge into NFTC yields mixed effects: emotion and moral value features provide consistent gains in product reviews and persuasive forums, but not in the other two domains. Our findings suggest that the utility of subjective knowledge is domain and schema dependent.
Overview of the UZH Shared Task 2026 on Reconstructing the Reasoning in United Nations Resolutions
Anastassia Shaitarova | Yingqiang Gao | Fatma-Zohra Rezkellah | Reto Gubelmann | Patrick Montjouridès
This paper presents the UZH Shared Task at the 13th Workshop on Argument Mining and Reasoning, co-located with ACL 2026, which focuses on reconstructing argumentative structure in highly formal legal-political texts, namely United Nations resolutions and recommendations. The shared task addresses the challenge of recovering paragraph-level reasoning patterns from the fairly formulaic structure of international decision-making records. It comprises two subtasks: (1) paragraph classification, where systems identify paragraph type (preambular or operative) and assign one or more thematic tags, and (2) argumentative relation prediction, where systems infer links between paragraphs and label them with relation types.
LLM-INSTRUCT at UZH Shared Task 2026: Constraint-Aware Retrieval and Selective Debate for Paragraph-Level Argument Mining
Phuong Huu Vu Tran | Long Minh Vo | Son Nguyen Minh Le | Hoang Van
We present LLM-INSTRUCT, the winning system for the UZH Shared Task at ArgMining 2026 on paragraph-level argument mining in UN and UNESCO resolutions. The task requires paragraph-type classification, prediction of a subset of 141 official tags, and directed relation prediction under a strict JSON schema setting using only open-weight models up to 8B parameters. We frame the task as constrained structured prediction. The system first narrows the candidate tag space with metadata-aware dense retrieval, then applies constrained decoding with per-dimension caps, and escalates only uncertain cases to a three-agent debate branch.
RESOLVENOW at UZH Shared Task 2026: Rule-Based Type Classification with LLM-Driven Multi-Label Tagging for UN Resolutions
Vedant Gupta | Rahul Bhatia | Vaibhav Varshney | Manjunatha Naik
Subtask 1 of the UZH Shared Task 2026 asks for paragraph-level classification of UN resolutions as preambular or operative and multi-label tagging from a 141-code, 15-dimension taxonomy, scored by tag F1 and an open-weight LLM-as-Judge on reasoning quality. Two earlier pipelines we built failed in opposite ways. An embedding-retrieval system dropped relevant tags before the LLM saw them; a per-dimension prompting system was accurate but too slow to iterate. The submitted system fixes both. A deterministic French-English lexical classifier assigns paragraph types at type macro-F1 of 0.910 on the official silver standard with no LLM calls, and DeepSeek-R1-0528-Qwen3-8B predicts tags through a single merged prompt that exposes the full taxonomy.
Argchestrators at UZH Shared Task 2026: Efficient Argument Mining in UN Resolutions: A Sub-8B Pipeline using Agentic Debate and Heuristic Retrieval
Bogdan Octavian Grecu | Gerrit Quaremba | Elizabeth Black | Denny Vrandečić | Elena Simperl | Oana Cocarascu
The highly formal and negotiated language of United Nations (UN) resolutions presents unique challenges for argument mining. This paper describes our system submitted to the ArgMining 2026 Shared Task: Reconstructing the Reasoning in United Nations Resolutions. Adhering to the strict constraint of utilising open-weight models with at most 8 billion parameters, we propose a hybrid, compute-efficient architecture powered by Qwen3-8B. For the preambular-operative classification, we implement a set of deterministic rules related to the specificity of UN documents, supplemented by an LLM-based multi-label classifier for thematic dimensions and a directed-graph extraction approach for argumentative relation prediction.
Prompteam at UZH Shared Task 2026: RAG-Augmented Classification and Cosine-Filtered Relation Prediction for UN Resolutions
Siddhartha Khandelwal | Jyotsana Bhardwaj
We describe our system for the UZH ArgMining 2026 Shared Task on reconstructing argumentative structure in UN/UNESCO resolutions. The task requires (1) classifying paragraph types and assigning thematic tags from a 141-label taxonomy, and (2) predicting directed argumentative relations between paragraphs. Our pipeline combines a quantised Qwen2.5-7B-Instruct model with retrieval-augmented generation (RAG) backed by FAISS-indexed dense embeddings for few-shot prompting and tag candidate pre-filtering. For relation prediction, we apply a sliding-window cosine pre-filter that reduces the quadratic pair space to near-linear cost. A parallelisable, fault-tolerant pipeline with atomic checkpointing enabled complete processing of 2,959 paragraphs across three concurrent Kaggle T4 sessions despite 12-hour GPU limits. Our system achieved 2nd place overall on the shared task leaderboard.
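The sliding-window cosine pre-filter described in this abstract can be sketched as follows. This is a hedged illustration, not the Prompteam code: the window size, the similarity threshold, the toy 2-dimensional vectors, and the function names `cosine` and `candidate_pairs` are all assumptions, and the sketch yields unordered candidate pairs, leaving the direction and type of each relation to a later stage.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_pairs(embeddings, window=2, threshold=0.8):
    """Keep pairs (i, j), i < j, that lie within `window` positions of each
    other and whose cosine similarity reaches `threshold`.

    Each paragraph is compared with at most `window` later paragraphs, so
    the number of comparisons is at most n * window rather than n * (n - 1) / 2.
    """
    pairs = []
    n = len(embeddings)
    for i in range(n):
        for j in range(i + 1, min(i + window + 1, n)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs


# Toy paragraph embeddings in document order.
embeddings = [
    [1.0, 0.0],
    [1.0, 0.1],
    [0.0, 1.0],
    [0.9, 0.2],
    [0.0, 1.0],
]
print(candidate_pairs(embeddings, window=2, threshold=0.8))
```

Note the trade-off this makes visible: paragraphs 0 and 3 are highly similar in the toy data but are never compared, because they are more than `window` positions apart. A window pre-filter trades some recall on long-distance relations for the near-linear cost.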
TypeCoT at UZH Shared Task 2026: Reconstructing Argumentative Structure in UN Resolutions using Type-Informed Chain-of-Thought
Chandan Kumar R S | Vinay Babu Ulli | Jyoti Kumari | Vaibhav Singh
United Nations and UNESCO resolutions encode complex collective reasoning through highly structured preambles and operative clauses. Reconstructing this implicit argumentative structure is a challenging natural language processing task. This paper describes our submission to the UZH Shared Task at the ArgMining Workshop 2026. Adhering to the strict constraint of using open-weight models with at most 8B parameters, we propose a highly efficient, modular pipeline built entirely upon the Qwen-2.5-7B-Instruct architecture. To address Subtask 1, we decouple the problem, employing a 4-bit quantized LoRA adapter via the Unsloth framework for paragraph type classification and a type-informed chain-of-thought approach for thematic tagging and relation prediction.
POINTERS at UZH Shared Task 2026: Reasoning Probes for Argumentation Mining in UN Resolutions
Sohom Sen | Avina Nakarmi | Xun Song | Aritra Dasgupta
This paper describes the submission of team POINTERS to the UZH ArgMining 2026 Shared Task, which aims to recover the argumentation structure of UN and UNESCO resolutions by labeling paragraph types, assigning specific tags, and predicting relations between paragraphs. We take a generative approach, treating each resolution as a sequence of claim-evidence pairs connected by explicit reasoning strategies. First, each paragraph is classified as preambular or operative and assigned tags, with the model required to quote specific phrases to justify every decision. Second, for each paragraph, we first retrieve semantically related candidates using sentence transformers, then use reasoning strategies as a diagnostic scaffold to label the relation—supporting, complemental, contradictive, or modifying—along with a quoted, strategy-grounded rationale.
HybridArguer at UZH Shared Task 2026: Argument Structure Modeling in Bilingual UN Resolutions with Retrieval-Augmented and Iterative LLM Reasoning
Siddharth Bhargava
Extracting argument structures from legal-political discourse reveals how policies and actions are proposed, debated, and formalized, but remains challenging due to the complexity of long-form, structured text. This work proposes a modular, retrieval-augmented system for traceable and structured argument mining in long, bilingual United Nations resolutions. This paper describes our system submission to the UZH Shared Task 2026, focusing on practical design choices for argument structure modeling under task and model constraints. Our system employs a parameter-efficient (at most 8B) open-source model, Qwen3:8B in thinking mode, to perform paragraph classification, multi-label tag assignment, and multi-label relation prediction through a modular, retrieval-augmented pipeline.
Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)
Ekaterina Kochmar | Bashar Alhafni | Stefano Bannò | Marie Bexte | Jill Burstein | Andrea Horbach | Ronja Laarmann-Quante | Anais Tack | Victoria Yaneva | Zheng Yuan
Theory of Mind and Application in Educational Context
Effat Farhana | Maha Zainab | Qiaosi Wang | Niloofar Mireshghallah | Ramira van der Meulen | Max van Duijn
This tutorial examines the integration of Theory of Mind (ToM) into AI-driven tutoring systems, with a focus on how large language models (LLMs) can represent learners’ cognitive and emotional states to enable adaptive, personalized feedback. Participants will learn foundational ToM concepts from cognitive science and psychology and how these ideas can be operationalized in AI systems. We discuss mutual ToM, in which both tutors and learners model each other’s mental states, and address challenges including misconception detection, metacognitive modeling, and privacy in data-driven tutoring. The tutorial also includes hands-on demonstrations of machine ToM in programming education using benchmark datasets such as CS1QA and CodeQA. By combining theoretical foundations, empirical insights, and practical exercises, this tutorial will provide an overview of designing human-centered, ethically aware, and cognitively informed AI tutoring systems.
We introduce a thermal–visual fusion approach to improve non-invasive Voice Activity Detection (VAD) for classroom engagement monitoring. In noisy multi-speaker classrooms using a single microphone, acoustic-only methods fail to reliably isolate individual speakers. Our method integrates facial thermal signatures (capturing respiratory and speech-related heat patterns) with visual lip-motion cues to provide a localized, privacy-preserving, and acoustic-independent indicator of speech activity. This system acts as a visual-diarization frontend, informing Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) systems not only when speech occurs, but precisely which student is speaking. Using up to 19 engineered features, our Thermal-Only Random Forest classifier achieved a Recall of 0.9234 and an F1-score of 0.8105 in subject-independent evaluations, outperforming visual-only baselines. The system was validated as a proof-of-concept on a Raspberry Pi 5 in a controlled laboratory setting, demonstrating real-time feasibility. These results demonstrate that thermal–visual fusion enables more reliable linguistic analysis of collaborative learning and provides critical input for AI agents to facilitate group participation in real-world educational settings that lead to more successful learning outcomes.
Investigating Context-aware CTC for Pronunciation Assessment: Mitigating Peaky Behavior and Context Independency Assumption
Jiun-Ting Li | Tien-Hong Lo | Bi-Cheng Yan | Shih-Hsuan Chiu | Fu-An Chao | Berlin Chen
Automatic pronunciation assessment (APA) provides L2 learners with scalable and timely feedback on pronunciation proficiency in a target language, typically through goodness of pronunciation (GOP) features. GOP quantifies how well a pronounced phoneme matches the expected target sound by comparing acoustic features against the model’s posterior probabilities. Traditional GOP relies on forced alignment to obtain these posteriors, but it suffers from acoustic-induced misalignments that degrade assessment reliability. Although the standard CTC-GOP approach bypasses forced alignment, it is limited by the inherent peaky behavior of CTC-based ASR models, which produces sparse posteriors and lacks stable temporal information. To address these issues in standard CTC, we propose a context-aware CTC framework incorporating output context dependency (OCD) in the CTC topology, along with label prior (LP) and maximum conditional entropy (EnCTC) regularization, to mitigate peakiness and produce more stable ASR logits suitable for GOP computation. Experiments on the speechocean762 corpus demonstrate that our best context-aware configurations achieve superior phoneme-level performance, outperforming the TDNN-F baseline and standard CTC in unified GOPT (phoneme PCC 0.641 vs. 0.612; word total PCC 0.582 vs. 0.549) while narrowing the gap in hierarchical HierCB scoring. These improvements widen the scoring margin between correct and mispronounced phonemes from 0.708 to 0.816 in GOPT. They also reveal that mitigating CTC peakiness and incorporating context dependency significantly enhance CTC-GOP stability and robustness, especially for alignment-free APA models.
A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges
Wen Liang | Li Siyan | Zackary Rackauckas | Julia Hirschberg
Automated coaching for oral presentations sits at the intersection of computer-assisted pronunciation training (CAPT), prosody modeling, and speech synthesis, yet no prior work has systematically surveyed and compared existing systems along these dimensions. This survey reviews and categorizes automated presentation coaching systems, spanning pronunciation tutors, fluency and prosody coaches, multimodal trainers, and conference Q&A practice tools. We introduce a five-dimensional task taxonomy (covering segmental pronunciation, lexical stress, suprasegmental prosody, pacing, and content faithfulness) and explicitly map surveyed systems onto it to reveal coverage gaps. We further review the core technical methods these systems employ: TTS-based exemplar generation and diagnostic methods for pronunciation, prosody, and fluency assessment. Key open challenges include the scarcity of annotated presentation corpora, achieving accent-fair feedback across diverse L1 backgrounds, and delivering low-latency diagnostics for real-time rehearsal.
The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors
Li Lucy | Albert Zhang | Nathan Anderson | Ryan Knight | Kyle Lo
Effective mathematics education requires identifying and responding to students’ mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students’ handwritten, hand-drawn responses to math problems. We find that models’ weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who may require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.
Criterial Features in German: Towards Interpretable NLP in Readability Assessment
Denise Loefflad | Sofia Kathmann | Heiko Holz | Detmar Meurers
This paper presents an empirical evaluation of the German Grammar Profile (GGP), a CEFR-aligned resource of criterial features, and its corresponding extraction system PALME. We design a systematic test suite in which each feature extractor is evaluated on controlled positive and negative examples. The results show that PALME achieves high precision and recall across all CEFR levels, with over 90% of features achieving scores above 0.8. Qualitative analysis shows that lower performance primarily results from morphological ambiguity in noun and adjective case marking. To evaluate the usefulness of the criterial features of the GGP for CEFR-aligned readability assessment, we assess their predictive power using Explainable Boosting Machines on graded readers. The model achieves strong performance (precision: 0.75, recall: 0.73). Our qualitative analysis shows that features related to specific verb constructions follow patterns consistent with developmental stages predicted by Processability Theory. These findings underline the value and relevance of criterial features for modeling language development in readability assessment.
Letting Tutor Personas Speak Up for LLMs: Learning Steering Vectors from Dialogue via Preference Optimization
Jaewook Lee | Alexander Scarlatos | Simon Woodhead | Andrew Lan
With the emergence of large language models (LLMs) as a powerful class of generative artificial intelligence (AI), their use in tutoring has become increasingly prominent. Prior works on LLM-based tutoring typically learn a single tutor policy and do not capture the diversity of tutoring styles. In real-world tutor–student interactions, pedagogical intent is realized through adaptive instructional strategies, with tutors varying the level of scaffolding, instructional directiveness, feedback, and affective support in response to learners’ needs. These differences can all impact dialogue dynamics and student engagement. In this paper, we explore how tutor personas embedded in human tutor-student dialogues can be used to guide LLM behavior without relying on explicitly prompted instructions. We train a steering vector using preference optimization: an activation-space direction that guides model responses toward specific tutor personas. We find that this steering vector captures tutor-specific variation across dialogue contexts, improving semantic alignment with ground-truth tutor utterances and increasing preference-based evaluations, while largely preserving lexical similarity. Analysis of the learned scaling coefficients further reveals interpretable structure across tutors, corresponding to consistent differences in tutoring behavior. These results demonstrate that activation steering offers an effective and interpretable way for controlling tutor-specific variation in LLMs using signals derived directly from human dialogue data.
Towards Just-in-Time Adaptive Feedback: Enhancing Student Learning via Knowledge-Grounded LLM
Younghun Lee | Amir Bralin | Nobel Sanjay Rebello | Dan Goldwasser
Educational interventions are effective tools for enhancing student learning. While Large Language Models (LLMs) allow for generating adaptive feedback at scale, current studies lack clear methodologies for providing Just-in-Time (JiT) feedback in authentic instructional settings. In this paper, we present a framework that provides adaptive feedback by grounding LLMs with domain-specific expert knowledge. Our approach collects written reasoning logic (strategy essays) from students, analyzes potential error types based on the content of that reasoning, and delivers non-intrusive feedback designed to clarify missing or incorrect concepts. We deploy this framework in a large-scale college course (N > 1,000), where it improved student performance by over 80% compared to previous semesters. Lastly, we validate the framework’s pedagogical utility by analyzing the learning trajectories; we demonstrate how iterative conversations with the LLM help shift a student’s misconception toward correct understanding.
RABIT: Rationale-Based Distillation Towards Interpretable Automatic Speaking Assessment via a Small Language Model
Bi-Cheng Yan | Hong-Yun Lin | Fu-An Chao | Jiun-Ting Li | Berlin Chen
Automatic speaking assessment (ASA) aims to quantify the language competence of foreign language learners by providing a proficiency score based on their spoken response. Existing efforts in ASA typically employ a neural grader integrated with a set of handcrafted features to assess learners’ oral proficiency from multiple facets. Despite decent performance, the black-box nature of these neural graders remains a significant barrier to providing interpretable explanations for the grading results. In light of this, we propose RABIT for ASA, a novel Rationale-based knowledge distillation framework for interpretable grading decisions via a small language model. Specifically, RABIT first extracts multi-faceted grading rationales from a large language model (LLM) pertaining to the learner’s response and the scoring guidelines. Subsequently, a compact yet efficient language model, equipped with distinct output heads, is jointly optimized to estimate a proficiency score while generating a sequence of grading rationales in an autoregressive manner. A series of experiments conducted on the General English Proficiency Test (GEPT) dataset validates the feasibility and superiority of our method over several cutting-edge baselines.
Towards Pedagogically Aligned LLM Tutors for Math Mistake Remediation
Kseniia Petukhova | Tien Dat Nguyen | Ekaterina Kochmar
Large language models have strong potential for use in intelligent tutoring systems, but they often fail to follow effective pedagogical strategies, such as guiding students without revealing final answers. We study the application of a two-stage alignment pipeline for math mistake remediation, combining supervised fine-tuning on tutoring dialogs with Direct Preference Optimization on synthetic preference pairs. We construct a dataset that integrates existing tutoring corpora with synthetic data generated along pedagogical dimensions, such as scaffolding and factuality, and study different input configurations that incorporate solution correctness and gold answers. Experiments show that this approach improves both factual accuracy and pedagogical quality over base models and existing tutoring models. Human evaluation further indicates that our best model is competitive with a strong proprietary baseline, while providing additional benefits in terms of openness, transparency, and reproducibility. Our results highlight the effectiveness of preference-based pedagogical alignment, while also revealing challenges in reliably evaluating tutoring quality.
Challenges in Machine Translation of Interactive Multimodal Exercises
Lucie Polakova | Miroslav Hrabal | Věra Kloudová | Michal Novák | Mariia Anisimova | Martin Popel
This paper describes linguistic and technological challenges encountered within an applied project aimed at expanding a large e-learning portal from its original Czech to three other languages: Ukrainian, English and German. Although there seems to be a general belief that machine translation is a solved task in 2026, we show that translating educational content, which in our case is highly terminological, multimodal, interactive and encoded in XML, brings along many challenges of different types, some easily solvable and some not. We also compare our results from the early phase of the project (Transformer-based machine translation) with those after the switch to the LLM-based translation methods. We show that both MT methods are prone to different types of errors, some of which are quite new (such as the undesired correction of counterfactual statements) and require new ways of handling them. The resulting four-language edition of the educational web portal will be freely available to educators, students and researchers by the end of 2026.
Evaluating LLM Workflows for Generating Clinical Communication Assessment Items: A Comparative Study with Subject-Matter Experts
Christopher Runyon | Peter Baldwin | Ian Micir | Kevin Frome | Stephanie Mann | Saed Rezayi | Keelan Evanini | Victoria Yaneva
Generative AI is increasingly used to accelerate assessment content development, yet its effectiveness for generating content used in complex assessment tasks for knowledge-rich domains such as medical education is unclear. This study evaluates automated LLM-supported workflows for generating patient-centered communication assessment items that allow students to practice their communication skills. We compared two content generation approaches—constrained linear and exploratory branching—each implemented with and without anchoring in vetted multiple-choice questions (MCQs). Ten subject-matter experts (SMEs) evaluated 80 communication items across six quality dimensions using structured rubrics. The constrained linear approach yielded better ratings than exploratory branching approaches, particularly for medical accuracy and alignment with learning objectives and patient-centered behaviors. MCQ anchoring did not improve medical accuracy. Only a minority of items met all criteria without requiring revision, and no items were unanimously approved by all SMEs. These findings underscore the importance of workflow design in LLM-supported assessment content generation, the continued need for human oversight, and the current limitations of automated content generation in medical education.
Towards Self-Referential Analytic Assessment: A Profile-Based Approach to L2 Writing Evaluation with LLMs
Stefano Banno | Kate Knill | Mark Gales
Automated essay scoring (AES) research often relies on rank-based correlation metrics to validate analytic assessment. However, such metrics obscure both intrinsic intercorrelations among analytic dimensions that arise from the structure of writing proficiency itself and halo effects, whereby holistic impressions bleed into fine-grained component scores. As a result, high correlations may mask a system’s true diagnostic behaviour. In this study, we propose a novel self-referential assessment evaluation framework that focuses on identifying intra-learner strengths and weaknesses rather than assessing inter-learner rankings. We conduct experiments on the publicly available ICNALE GRA, a uniquely dense second-language writing dataset annotated holistically and analytically by up to 80 trained raters. To obtain reliable reference scores, we apply two-facet Rasch modelling to calibrate rater severity and derive fair average scores across ten analytic aspects and holistic proficiency. We compare the analytic scoring performance of human operational raters and three large language models (LLMs) in a zero-shot setting. Our results show that LLMs tend to outperform single human raters in identifying relative weaknesses (negative feedback) across several proficiency aspects, while human raters remain stronger at identifying relative strengths (positive feedback). Overall, our findings highlight the limitations of rank-based evaluation for analytic assessment and demonstrate the value of intra-learner, profile-based methods for assessing and deploying LLMs in AES.
Using Interaction Log Data to Evaluate and Improve Feedback Accuracy in an Intelligent Language Tutoring System
Mariia Soliar | Leona Colling | Stephen Bodnar | Detmar Meurers
Intelligent Tutoring Systems (ITS) can record learner interactions in fine-grained detail at scale. This opens the door to data-driven methods for investigating system performance and identifying points for improvement. In this paper, we draw on authentic log data from an English language ITS (N_logs = 5646, N_students = 368) to investigate the performance of its feedback algorithm. In step 1 of our analysis, we profiled feedback accuracy by exploring how well the system provided error-specific feedback to malformed student answers in gap-filling grammar exercises using an expert-created set of feedback generation rules. We then identified frequently occurring student errors that triggered incorrect or unspecific feedback and refined the rule set used to detect and respond to these errors with correct specific feedback. In step 2, we validated the rule modifications on an unseen dataset. Comparing the performance of the initial and updated rule sets, we find significant improvement that generalizes to unseen data. Our study thus illustrates how an empirical evaluation of authentic data can complement feedback creators’ expertise by informing rule refinement decisions that yield significant and generalizable performance improvements to feedback in ITS systems.
A Bigger Catch: Fine-Grained Curriculum Standards Alignment on the MathFish Benchmark
Xinman Liu | Mayank Sharma | Xinyu Shi
Most existing math benchmarks for LLMs focus on evaluating whether models produce correct solutions. In educational settings, however, it is equally important to understand whether LLMs grasp the pedagogical intent behind math problems, beyond simply arriving at the right answer. Tagging curriculum standards is challenging for the same reason: distinguishing fine-grained standards requires understanding subtle pedagogical distinctions. In this paper, we use the MathFish benchmark, which frames curriculum alignment as a multi-label prediction task over 385 Common Core State Standards, to evaluate a three-stage pipeline inspired by observed failure modes in retrieval and structural reasoning: curriculum-informed hard negatives (M1), a cross-encoder reranker (M2), and a ReAct agent paired with an LLM-as-a-judge critic (M3). We additionally evaluate a training-free alternative (A1) that combines hybrid sparse-dense retrieval with curriculum-graph reranking. M3 achieves 31.3% exact-match accuracy, approximately 6.5× higher than the three-shot GPT-4-Turbo baseline. Error analysis shows that, despite these improvements, the pipeline still struggles with missing predictions, grade-level misalignment, and sibling-standard confusion, reinforcing that precise curriculum alignment remains a fundamentally difficult problem in educational NLP.
Through the Sentence Lens: Explainable Essay Scoring through Fine-Grained Predictions
Daniel Mora Melanchthon | Stefan Keller | Andrea Horbach
Beyond performance, model transparency is a crucial factor in Automated Essay Scoring, yet current systems often lack explainability, limiting their pedagogical value and users’ trust. Existing explainability methods, such as gradient-based attribution or feature-importance approaches, either produce counterintuitive explanations or are too complex for classroom use. To address this limitation, we make use of fine-grained prediction at the sentence level as a way to enhance explainability. We propose ablation strategies to derive sentence-level pseudo scores from essay-level gold scores and use them to train sentence-level models. We evaluate their performance against essay-level baselines on two datasets (ASAP and MEWS), and compare their sentence-level output to a human baseline. Results indicate a trade-off between essay-level performance and sentence-level granularity. For the language quality trait, most sentence-level models achieve performance comparable to the essay-level baseline, whereas for content, the approach yields more positive results on prompts with shorter
Instruction-Following LLMs for Grammatical Error Correction: Analyzing Neutral-Anchored Instructional Sensitivity Across Editing Modes
Tolgahan Türker | Gülşen Eryiğit
Grammatical Error Correction (GEC) requires models to make edit decisions under competing objectives: correcting errors while either minimizing changes or maximizing fluency. However, we lack a principled characterization of how instruction-following Large Language Models (LLMs) shift their edit decisions across such editing modes, and whether standard evaluation setups faithfully reflect these shifts. We address this gap by defining three modes (Neutral, Minimal-Edit, and Fluency-Edit) and measuring neutral-anchored performance shifts to quantify instructional sensitivity. We benchmark seven LLMs, including proprietary and open-weight models, in a unified zero-shot prompting schema on CoNLL-2014, BEA-2019, and JFLEG datasets. The Minimal-Edit instruction mitigates over-editing and typically boosts precision; in some settings, strong models also improve recall, suggesting more selective and effective corrections. In contrast, the Fluency-Edit instruction often encourages broader paraphrastic rewriting that may improve perceived fluency while lowering GLEU, suggesting both a metric-objective mismatch and a shift away from targeted local correction. Notably, Claude-Sonnet-4.5 demonstrates superior zero-shot capabilities, outperforming previously reported scores and matching or even exceeding few-shot results across CoNLL-2014 (F_0.5: 67.05), BEA-2019 (F_0.5: 64.91), and JFLEG (GLEU: 66.09).
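For readers unfamiliar with the F_0.5 metric reported above, the following is a minimal sketch of the standard F-beta formula, which weights precision more heavily than recall when beta is below 1. The function name `f_beta` and the example values are illustrative only; the official CoNLL-2014 and BEA-2019 scorers additionally handle edit extraction and alignment, which this sketch does not attempt.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall.

    With beta = 0.5, precision counts more than recall, which rewards
    systems that avoid unnecessary edits.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing is correct
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values: a precise but conservative system.
score_f05 = f_beta(0.8, 0.4, beta=0.5)
score_f1 = f_beta(0.8, 0.4, beta=1.0)
print(round(score_f05, 4), round(score_f1, 4))
```

When precision exceeds recall, as in the hypothetical values here, F_0.5 is higher than F_1, which is one reason an instruction that reduces over-editing can raise the reported score.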
Assessing the Quality and Consistency of Automated Knowledge Component Generation using Instructor-generated Questions and LLMs
Jordan Esiason | Priyanka Khare | Wookhee Min | Seung Lee | Gamze Ozogul | Xiaoying Zheng | Yeil Jeong
Lecture-style instruction is one of the most prevalent forms of learning in postsecondary education in the United States. Despite the factors that make lectures a convenient format, they tend to present few opportunities for meaningful engagement between students and the course materials being presented due to factors such as the overhead associated with interacting with large numbers of students. By utilizing large language models, we have created a pipeline built upon the ExplainIt classroom response system for processing student self-explanations produced during lectures using automatically generated knowledge components. This pipeline can facilitate deeper engagement with course materials, offer traceability in assessment results, and allow instructors to respond to student errors or misconceptions in real time during lectures. While previous work using a proprietary large language model has examined the basic functionality of this pipeline, this work more closely examines the consistency and quality of this pipeline using both a large closed-weight model and a smaller open-weight model, with or without retrieval-augmented generation (RAG). The use of open-source models could allow institutions deploying ExplainIt to maintain control of their student data without substantially sacrificing performance. We find that while there are small statistically significant differences in performance between the RAG conditions of each LLM, they are nearly comparable at this task. Additionally, the LLM-generated knowledge components are of higher quality when relevant course material is provided for RAG, although consistency is not improved. These results indicate that both large closed-weight and smaller open-weight models show promise in this task, but fine-tuning may be necessary to improve performance further.
Estimating LLM Grading Ability and Response Difficulty in Automatic Short Answer Grading via Item Response Theory
Longwei Cong | Sonja Hahn | Sebastian Gombert | Leon Camus | Hendrik Drachsler | Ulf Kroehne
Automated short answer grading (ASAG) with large language models (LLMs) is commonly evaluated with aggregate metrics such as macro-F1 and Cohen’s kappa. However, these metrics provide limited insight into how grading performance varies across student responses of differing grading difficulty. We introduce an evaluation framework for LLM-based ASAG based on item response theory (IRT), which models grading correctness as a function of latent grader ability and response grading difficulty. This formulation enables response-level analysis of where LLM graders succeed or fail and reveals robustness differences that are not visible from aggregate scores alone. We apply the framework to 17 open-weight LLMs on the SciEntsBank and Beetle benchmarks. The results show that even models with similar overall performance differ substantially in how sharply their grading accuracy declines as response difficulty increases. In addition, confusion patterns show that errors on difficult responses concentrate disproportionately on the partially_correct_incomplete label, indicating a tendency toward intermediate-label collapse under ambiguity. To characterize difficult responses, we further analyze semantic and linguistic correlates of estimated difficulty. Across both datasets, higher difficulty is associated with weaker semantic alignment to the reference answer, stronger contradiction signals, and greater semantic isolation in embedding space. Overall, these results show that item response theory offers a useful framework for evaluating LLM-based ASAG beyond aggregate performance measures.
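As a rough illustration of the modeling idea, the sketch below assumes the simplest one-parameter logistic (Rasch-style) form, in which the probability that a grader labels a response correctly depends only on the difference between grader ability and response difficulty. The abstract does not state which IRT variant the authors fit, so the functional form, the name `p_correct`, and the numeric values are assumptions made for illustration.

```python
import math


def p_correct(ability: float, difficulty: float) -> float:
    """Probability of a correct grade under an assumed 1PL (Rasch-style) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical grader abilities and response difficulties on a shared logit scale.
graders = {"model_a": 1.5, "model_b": 0.5}
difficulties = [-1.0, 0.0, 2.0]

for name, ability in graders.items():
    probs = [round(p_correct(ability, d), 3) for d in difficulties]
    print(name, probs)
```

Under this assumed form, a grader whose ability equals a response's difficulty has a 50% chance of grading it correctly, and accuracy falls as difficulty rises. Capturing graders that decline at different rates, as the abstract reports, would need a richer model than this sketch.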
Using k-Shot Prompting with Large k for the Automated Scoring of a German Written Elicited Imitation Test
Malte Sternik | Ronja Laarmann-Quante | Anastasia Drackert
This paper explores the application of a Large Language Model (LLM) using k-shot prompting with large k for automatically scoring a German Written Elicited Imitation Test (WEIT), a test for assessing literacy-dependent procedural knowledge in German as a foreign language. In this test, test-takers are briefly presented with written sentences which they then have to reproduce in writing as accurately as possible. The responses are scored on an ordinal scale which differentiates between different types of errors (e.g. lexical vs. grammatical). We find that with increasing k (in a range from 1 to 700) accuracy increases significantly but it also depends on the drawn sample and varies across different runs of the same prompt. Overall, the k-shot setting which relies on in-context learning without being provided with the scoring rubric outperforms a baseline where only the scoring rubric is provided to the model. However, the LLM does not outperform previous results based on rule-based or BERT-based models.
Kelvi: A Morphological Parser to Support Tamil Literacy
Shankhalika Srikanth | Sabrina Yu | Sophia Chan | Madeline Solis de Ovando
We discuss the development of kelvi.ca, an open source web-based dictionary and morphological parser designed to aid Tamil learners in developing their literacy skills. Tamil is an agglutinative language and heavily suffixal. Existing Tamil dictionaries only carry stems, not conjugated or inflected forms, and for a beginner learner of the language, isolating the stem in an unfamiliar word can be very challenging. Kelvi provides 1) the stem of any input word alongside its definition, and 2) non-technical descriptions of any suffixes that are part of this input, so that learners will gradually start to recognize these suffixes and be able to understand and produce new Tamil words themselves. In detailing our process of collaborative research, user interviews, suffix database creation, and error analysis, we also hope to show that Kelvi can be adapted for other languages and has the potential to be a useful pedagogical aid for learner literacy development, especially for agglutinative and/or polysynthetic languages which tend to be otherwise underserved in the mainstream.
From Questions to Assessment Tuples: A Multi-Agent Framework with Bloom-Specialized Agents and Automated Verification
Gee-Lyle Wong | Runcong Zhao | Yulan He | Jiazheng Li
Automatic question generation with large language models has advanced rapidly, yet producing assessment-ready items, complete with mark schemes and expected answers, remains challenging, especially when generation must reliably target higher-order cognitive levels in Bloom’s Taxonomy. We propose a multi-agent, multi-stage framework that generates structured assessment tuples for both short-answer questions (SAQs) and scenario-based questions (SBQs), combining Bloom-specialized generation agents with staged decomposition and automated verification. We further introduce a rubric-guided LLM-as-a-judge evaluation framework with Bloom-specific alignment metrics. Experiments on university-level AI course material across five generation pipelines show that prompt-level Bloom conditioning alone is insufficient to reliably achieve cognitive control. In contrast, our structured approach yields consistent and notable improvements in alignment, mark scheme quality, and output yield, particularly for higher-order Bloom levels over baseline pipelines.
Intent vs. Surface: Recovering Acoustic Realization from Modern ASR for Pronunciation Training
Seongjin Park
Pronunciation feedback in language learning depends on accurate detection of learner errors, but it is unclear whether modern ASR systems are suitable for this purpose. Their language models recover intended words rather than what was actually pronounced, systematically masking mispronunciations. This is a tendency we call intent bias. By evaluating eight ASR systems spanning three architectures on two L2 English corpora, we find that overcorrection rate correlates inversely with word error rate. In other words, ASR systems with lower WER tend to mask more pronunciation errors. We propose surface-faithful reranking, an inference-time method that uses phoneme-level acoustic similarity to select N-best hypotheses closer to what the learner actually said. Without retraining or access to model internals, the method reduces the false acceptance rate of mispronunciations by 6.0 percentage points on L2-ARCTIC and 5.6 on speechocean762. The improvement is consistent across age groups and first-language backgrounds, though substantial overcorrection remains, pointing to the need for pronunciation-aware ASR objectives.
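The general idea of reranking an N-best list toward what was actually pronounced can be sketched as follows. This is not the authors' method: it substitutes plain phoneme edit distance for the phoneme-level acoustic similarity described in the abstract, and the function names, the ARPAbet-like phoneme strings, and the assumption that an observed phoneme sequence is available from a separate phoneme recognizer are all hypothetical.

```python
from typing import List, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def rerank(nbest: List[Tuple[str, List[str]]], observed: List[str]) -> str:
    """Return the hypothesis whose phonemes are closest to the observed ones.

    Ties keep the earlier (higher-ranked) hypothesis, because min() returns
    the first minimal element.
    """
    return min(nbest, key=lambda hyp: edit_distance(hyp[1], observed))[0]


# Hypothetical case: the learner says "sink" where "think" was intended.
nbest = [
    ("think", ["TH", "IH", "NG", "K"]),  # top ASR hypothesis (intended word)
    ("sink", ["S", "IH", "NG", "K"]),    # lower-ranked, surface-faithful
]
observed = ["S", "IH", "NG", "K"]
print(rerank(nbest, observed))
```

Selecting the surface-faithful hypothesis lets a downstream feedback component see the mispronunciation instead of the silently corrected word.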
KEYSCORE — Keystroke-enhanced Automated Essay Scoring
Nils-Jonathan Schaller | Daniel Mora Melanchthon | Thorben Jansen | Olaf Köller | Andrea Horbach
We investigate the predictive power of keystroke logging data for automated essay scoring using the newly collected PISA FLA writing process dataset. Based on 3,882 writing sessions, we extract a comprehensive set of keystroke-based process features, including temporal measures, pause and burst patterns, deletion behavior, production efficiency, and navigation activity, and evaluate their ability to predict holistic essay scores on a 0–5 scale. We specifically compare process-feature-based models with content-based scoring approaches trained on data written with and without the help of an AI chatbot, and investigate how predictive power evolves over the course of a writing session by training models at multiple time thresholds. Our analysis reveals that keystroke features provide genuine early predictive signal, capturing aspects of writing fluency and revision behavior that distinguish writers before their texts are long enough to score conventionally. Additionally, our results suggest that process-based scoring is a viable complement to product-based approaches, with promise for formative, real-time feedback during writing.
EduMUSE: A Multimodal Educational Dataset with Automatically Extracted Instructional Context
Andreea Dutulescu | Stefan Ruseti | Mihai Dascalu | Danielle McNamara
Research in AI applied to education increasingly relies on large-scale, high-quality datasets to support the development and evaluation of learning analytics and intelligent educational systems. Open educational resources provide a promising foundation, yet few datasets integrate structured instructional content with assessment materials in a multimodal form. In this study, we introduce a large-scale multimodal educational dataset (EduMUSE - Educational Multimodal Understanding & Solution Dataset) constructed from OpenStax undergraduate textbooks across multiple domains. The dataset integrates hierarchically structured instructional text, figures, exercises, and, when available, official solutions. For exercises with solutions, we introduce an automatic method that associates each exercise with a focused instructional subsection rather than entire textbook chapters, estimating subsection relevance via solution likelihood under candidate contexts using a vision–language model. We analyze the impact of contextualization on the behavior of vision–language models across different contexts. Results indicate that subsection-level instructional context has a measurable impact on model performance, with variation across model scales and task formulations. The dataset and code are released as open source at https://github.com/upb-nlp/BEA-EduMUSE/ to support reproducible research in multimodal educational modeling and to facilitate generating similar datasets using our approach.
Opportunities and Challenges of LLMs in Education: An NLP Perspective
Sowmya Vajjala | Bashar Alhafni | Stefano Banno | Kaushal Maurya | Ekaterina Kochmar
Fine-Grained Content Zone Prediction in German Argumentative Essays Using LLMs
Xiaoyu Bai | Manfred Stede
We introduce FDE-Arg, a newly compiled dataset of argumentative student essays in German. We use two Llama models of different sizes to label sentence-level content zones both in FDE-Arg and in an existing dataset of source-dependent argumentative essays. We investigate three approaches for improving model performance: a) incorporating targeted task information into the prompt text; b) few-shot prompting with up to 10 examples selected on the basis of similarity with the target instance; and c) parameter-efficient fine-tuning. We observe that both incorporating additional information in the prompts and similarity-based few-shot prompting produce highly promising performance gains over the baseline.
Multi-step Large Language Model for Fine-Grained Feedback in Stepwise Linear Equation Solutions
Imran Chamieh | Torsten Zesch | Klaus Giebermann
This paper addresses the problem of fine-grained error classification in stepwise algebraic problem solving, with the objective of enabling accurate and timely feedback in large-scale educational environments. Using authentic student response data, we compare a carefully engineered rule-based baseline with large language models (LLMs) in zero-shot and few-shot configurations, as well as multistep LLM-based approaches. We further consider hybrid architectures that combine symbolic computation with LLM inferential processes, with particular emphasis on enhancing the robustness and faithfulness of intermediate representations and mitigating error propagation across successive stages of the computational pipeline. Our empirical results indicate that, although the baseline model delivers strong and reliable performance for narrowly defined error categories, structured multi-step approaches improve performance relative to single-step methods by achieving superior precision, F1 scores, and overall accuracy.
Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation
Abigail Gurin Schleifer | Moriah Ariely | Beata Beigman Klebanov | Asaf Salman | Giora Alexandron
Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs’ broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human–human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: It is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.
Using LLMs for item creation: Validating the potential of automatically generated sentence repetition test items for language assessment
Sarah Löber | Björn Rudzewitz | Yuan Chu | Mengyuan He | Shiqin Liu | Yushan Ye | Xiaobin Chen
Various aspects of the Elicited Imitation Test (EIT), a sentence repetition task for language assessment, can be automated, for example in terms of test administration or automatic scoring. It is potentially also possible to generate test items with Large Language Models (LLMs). This study investigates the potential of GPT-4o for item creation in the context of EIT, creating a parallel form to two popular and validated tests. We analysed the tests in terms of their linguistic and psychometric properties. While the items created by the LLM show some difference in grammatical structures when compared to human-written items, linguistic complexity results did not differ significantly between tests. Psychometric properties showed only minor differences. These findings lend support to the potential of Automatic Item Generation with LLMs in the context of sentence repetition tasks and might support the process of standardisation in SLA research and testing by enabling parallel test creation.
Comparative Evaluation of AI-Generated vs. Expert-written Answer Explanations for a Medical Education Self-Assessment
Yiyun Zhou | Francis O’Donnell | Victoria Yaneva
Answer explanations for medical multiple-choice questions (MCQs) are a valuable learning tool, but producing them is resource intensive. Writing high quality explanations requires specialized medical expertise and careful alignment with the keyed answer, distractors, and the clinical vignette. This paper evaluates whether a template-aware, retrieval-guided large language model (LLM) workflow can support this production task in a real formative assessment setting. Using a 50-item medical education self-assessment, we compared AI-generated and expert-written MCQ explanations in a blinded study involving eight medical faculty and sixteen medical students. Each participant rated 25 of 50 paired explanations on clarity, amount of information, and structure. The clearest empirical difference was in amount of information: AI-generated explanations were rated significantly higher than expert-written explanations in a cumulative link mixed model analysis (OR = 1.99, 95% CI [1.33, 2.99], p = 0.001). Ratings of clarity and structure did not differ significantly between conditions. Based on faculty ratings, a smaller proportion of AI-generated explanations were judged to require correction (20%) compared with expert-written explanations (38%). These findings suggest that AI can reduce first-draft authoring effort in explanation writing while still requiring expert review to ensure content accuracy.
What Aggregate Scores Hide: Per-Rule Evaluation of Russian Grammatical Error Correction
Anna Smirnova | Artyom Kopan | Vladislav Makeev | George Chernishev
Russian grammar correction models can improve on aggregate benchmarks while getting worse at specific grammar rules. We show this through per-rule evaluation on a diagnostic benchmark of 48 prescriptive rules: finetuning on synthetic data improves overall F0.5 while driving subordinate-clause comma accuracy from 14% to 1%. The suppression is invisible under corpus-level metrics and undetectable with existing coarse, corpus-specific tagsets; it is recoverable only when diagnosed at rule granularity. To enable this analysis, we develop a 98-category error taxonomy grounded in Rozental’s reference grammar and SyntErr, an open-source synthetic data generator whose per-rule distribution is an explicit parameter, designed to support arbitrary rule sets and languages. Finetuning eight open models (0.8B–12B) on 39K synthetic examples yields up to 75.3 F0.5, approaching frontier API models with models small enough to run on device. We release the taxonomy, generator, per-rule evaluation data, and all training artifacts.
FinnGEC: Benchmarking Grammatical Error Correction for Finnish
Anh-Duc Vu | Mikhail Zolotilin | Jue Hou | Anisia Katinskaia | Yiheng Wu | Roman Yangarber
Grammatical error correction (GEC) is a natural language processing task critical for improving language quality, supporting communication efficacy, and for language learning and teaching. To date, most research in GEC has focused on major, resource-rich languages such as English, while lower-resource languages remain underexplored. In this paper, we focus on GEC for Finnish. We build a dataset based on data from real-world language learners. We explore various approaches to GEC, including fine-tuning transformer models and zero-shot LLM prompting. We also adapt ERRANT, a popular GEC evaluation tool, for the Finnish language, to evaluate the performance of the models. Our results indicate that the performance of GEC for Finnish is promising, but requires further research. To the best of our knowledge, this is the first in-depth exploration of GEC for Finnish; we provide benchmarks, datasets, and code for GEC for Finnish—by releasing our training and test data and the code for Finnish ERRANT—to support further research on this important task.
From Metrics to Meaning: Rule-Grounded LLM Explanations for Data Literacy in the Case of Youth Football
Tomasz Piłka | Tomasz Kuczyński | Mateusz Czajka
Young athletes, parents, and coaches are increasingly exposed to training metrics from wearable technology, yet such metrics are difficult to interpret without contextual explanation. We present a rule-grounded data-to-text framework for supporting data literacy in youth football through concise, stakeholder-specific summaries of training sessions. A rule layer maps duration-normalised indicators to structured facts about session profile, internal intensity, speed exposure, and movement dynamics, which are then verbalised by a large language model for coaches, parents, or players. We compare direct generation from raw metrics, generation from rule-derived facts, and an augmented rule-grounded configuration, ENRICHED, that supplements validated facts with raw metrics and explicit threshold definitions. In this setting, selected open-weight models are additionally adapted using LoRA. The framework is developed using 122 anonymised player-session records from a U15 environment and evaluated on a held-out subset of ten sessions with stakeholder-oriented reference summaries. The results indicate that rule grounding improves reliability and audience adaptation compared with direct generation from raw metrics, particularly by reducing unsupported or overly strong interpretations. A school-based expert evaluation with physical education teachers further suggests that player-facing explanations in the evaluated ENRICHED setting can remain accurate, comprehensible, and practically useful. We position the framework as an interpretable data-literacy support interface for youth sport analytics.
Sharing is Caring: Advantages of Sharing a Language Background with Learners as an Annotator of Learner Data in UD
Caroline Grand-Clement | Arianna Masciolini
This paper looks at the impact of annotators sharing a language background with learners when annotating learner data using the Universal Dependencies (UD) framework. We perform a study comparing annotations by two different annotators working on sets of L2 Swedish sentences (learner sentences and target corrections) from the Swedish Learner Language corpus (SweLL) written by learners for whom French is a main writing language. The annotators are both L2 speakers of Swedish but have different knowledge of French: one is a native French speaker and the other has no knowledge of French. We find high annotator agreement, which may indicate a non-significant impact, though we qualitatively observe an advantage in sharing language background.
Productive struggle is a critical component of mathematics education, requiring students to actively work through ideas rather than just making errors. However, identifying this struggle from text transcripts is challenging because students often mask confusion with epistemic hedging rather than direct statements. Zero-shot large language models exhibit a conservative bias, systematically under-detecting struggle in classroom discourse. We introduce a two-stage NLP pipeline comprising a lexical heuristic gate and an LLM subtype classifier. Our model achieves 90.0% binary accuracy and 84.0% 4-category accuracy. We demonstrate the pedagogical value of this tool by showing that struggle is uniquely concentrated during explicit mathematical reasoning, offering educators a scalable method for root-cause analysis.
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
Ravi Kumar | Utkarsh Grover | Xiaomin Lin | Agoritsa Polyzou
Large language models (LLMs) can provide automated feedback in educational settings, but aligning an LLM’s style with a specific instructor’s tone while maintaining diagnostic correctness remains challenging. We ask: how can we update an LLM for automated feedback generation to align with a target instructor’s style without sacrificing core knowledge? We study how Reinforcement Learning from Human Feedback (RLHF) can adapt a transformer-based LLM to generate programming feedback that matches a professor’s grading voice. We introduce PERSA, an RLHF pipeline that combines supervised fine-tuning on professor demonstrations, reward modeling from pairwise preferences, and Proximal-based policy optimization, while deliberately constraining learning to style-bearing components. Motivated by analyses of transformer internals, PERSA applies parameter-efficient fine-tuning. It updates only the top transformer blocks and their feed-forward projections, minimizing global parameter drift while increasing stylistic controllability. We evaluate our proposed approach on three code-feedback benchmarks (APPS, PyFiXV, and CodeReviewQA) using complementary metrics for style alignment and fidelity. Across both Llama-3 and Gemma-2 backbones, PERSA delivers the strongest professor-style transfer while preserving perfect correctness; for example, on APPS, it boosts Style Alignment Score (SAC) to 96.2% (from 34.8% for Base) with Correctness Accuracy (CA) up to 100% on Llama-3 and Gemma-2. Overall, PERSA offers a practical route to personalized educational feedback by aligning both what it says (content correctness) and, crucially, how it says it (instructor-like tone, structure, and guidance).
Data-lean fine-tuning of models for evaluating teacher performance in a GenAI-led elicitation simulation
Beata Beigman Klebanov | Andrew Hoang | Jamie Mikeska | Benny Longwill | Sanjna Kashyap | Shreyashi Halder | Aakanksha Bhatia
Recent advances in the capabilities of conversational agents based on large language models make them a very promising tool for role playing K-12 students in order to train educators in conversational teaching practices, such as eliciting student thinking, explaining disciplinary content, and facilitating a classroom discussion. In fact, such simulations can and have been developed relatively quickly and without data to machine-learn from – neither classroom data nor human-simulated data. To enhance the usefulness and effectiveness of such teaching simulations, it is necessary to provide pedagogically sound, timely, and personalized feedback to the educator about their simulation performance. In this study, we present experiments on fine-tuning models to evaluate educator performance in an elicitation teaching simulation. The models are developed with data collected during usability testing of the simulation and evaluated on real user data. We show that even with relatively little fine-tuning data, robust performance can be obtained.
Multi-component student writing profiles for expert-aligned automated evaluation of English learner essays.
Russell Moore | Andrew Caines | Paula Buttery
Automated Writing Evaluation (AWE) platforms have become common, but a significant gap remains between automated assessment and expert human feedback. We address this gap by introducing a supervised learning method that uses a multi-component student writing profile (comprising estimated CEFR levels, grammatical error rates, and vocabulary distribution) to align AI scoring with expert human judgements. In the context of an online essay-writing platform for second language learners of English, our model achieves a 36% reduction in RMSE for holistic essay scoring and an 84% improvement in similarity to human-expert annotation of grammatical errors compared to automarker scores (26% and 57% improvement from the best-performing comparable earlier work, by Zaidi et al. (2019)). Furthermore, we demonstrate that the model can predict a student’s final submission profile (CEFR level and grammatical error rate) from earlier drafts and that predictions generalise to a subsequent task, offering new possibilities for automated curriculum planning. Finally, we introduce a visualisation tool that provides educators with clear expert-aligned longitudinal views of student development.
Policy-Sensitive Fairness Evaluation in Automated Scoring of Clinical Communication
Saed Rezayi | Le An Ha | Victoria Yaneva | Polina Harik | Janet Mee | Jason Snyder
This study examines automated scoring fairness in a formative assessment context: the automated evaluation of medical students’ communication skills. Building on the premise that definitions of fairness are value-dependent, we investigate how conclusions about group differences may vary under different weighting schemes for false positives (FPs) and false negatives (FNs). Results show that when errors are treated symmetrically, no statistically significant differences are observed across demographic groups based on race or gender. This pattern remains stable when error weights are varied, with no consistent or robust disparities emerging. A small number of isolated differences appear under moderate FN weighting. Overall, the findings suggest that fairness conclusions in this setting are relatively robust to variations in error weighting. At the same time, the study highlights the importance of making value assumptions explicit when evaluating automated scoring systems, particularly in formative contexts where error trade-offs carry pedagogical implications for feedback, learner engagement, and educational equity.
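The idea that group comparisons can depend on how false positives and false negatives are weighted can be sketched as follows. This is a toy illustration under stated assumptions: the cost function, the weights, and the name `weighted_error_rate` are hypothetical and are not taken from the study.

```python
from typing import Sequence


def weighted_error_rate(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    fp_weight: float = 1.0,
    fn_weight: float = 1.0,
) -> float:
    """Average error cost per response, with separate FP and FN weights.

    Assumption: labels are binary (1 = skill demonstrated, 0 = not).
    """
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return (fp_weight * fp + fn_weight * fn) / len(y_true)


# Toy data: group A has one false negative, group B has one false positive.
group_a = ([1, 1, 0, 0], [1, 0, 0, 0])
group_b = ([1, 1, 0, 0], [1, 1, 1, 0])

# Symmetric weighting: both groups look identical (0.25 each).
print(weighted_error_rate(*group_a), weighted_error_rate(*group_b))

# Doubling the FN weight opens a gap (0.5 for A, 0.25 for B).
print(weighted_error_rate(*group_a, fn_weight=2.0),
      weighted_error_rate(*group_b, fn_weight=2.0))
```

In this constructed example the gap appears only under asymmetric weighting, which is the kind of sensitivity the study checks for and reports as largely absent in its data.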
Noise Steering for Controlled Text Generation: Improving Diversity and Reading-Level Fidelity in Arabic Educational Story Generation
Haziq Khalid | Salsabeel Shapsough | Imran Zualkernan
Generating diverse, pedagogically valid stories for Arabic early-grade reading assessments requires balancing tight constraints on vocabulary, reading level, and narrative structure against the need to avoid repetitive plots that undermine assessment validity. We investigate noise steering, injecting calibrated Gaussian perturbations into the internal representations of transformer models at inference time, as a training-free diversity method evaluated across five small Arabic-centric language models (7–9B parameters). We compare four injection strategies against high-temperature sampling baselines, measuring diversity, quality, constraint adherence, and reading grade level. Residual stream noise consistently improves narrative diversity with minimal quality or constraint cost and preserves early-grade reading level across all Arabic-centric models. Attention entropy noise injection (AENI) stabilizes the otherwise unreliable attention-logit noise while recovering quality. High-temperature sampling inflates reading grade level and causes catastrophic collapse on several models. We find internal representation-level perturbation to be a more suitable diversity strategy than output-level stochasticity for constrained educational content generation.
The Effects of Structured LLM-Generated Feedback on Programming Assignment Performance
Tsvetomila Mihaylova | Evanfiya Logacheva | Arto Hellas | Jing Fan | Francisco Castro | Bita Akram | Narges Norouzi | Peter Brusilovsky | Juho Leinonen
When programming students encounter errors in their code, compiler messages or static analysis output often provide limited guidance, particularly for novice programmers. Personalized feedback from instructors can be effective but does not scale well. Recent advances in large language models (LLMs) enable automated feedback generation at scale. This study examines whether LLM-generated feedback with different levels of guidance is associated with differences in students’ problem-solving behavior. We analyze effects on time to solution and number of attempts, and examine whether these effects differ by programming experience. We design three feedback types and compare them to a baseline in which students receive only compiler error messages. Results from an online programming course show that LLM-generated feedback is associated with faster time to solution compared to the no-feedback baseline, with less guided feedback showing slightly stronger effects. Overall, the findings suggest that feedback structure plays an important role in how students progress toward correct solutions and motivate further work on adaptive feedback designs and longer-term learning outcomes.
Interpretable Difficulty-Aware Knowledge Tracing in Tutor-Student Dialogues
Shuyan Huang | Alexander Scarlatos | Jaewook Lee | Andrew Lan
Recent advances in large language models (LLMs) have led to the development of AI-powered tutoring systems that provide interactive support via dialogue. To enable these tutoring systems to provide personalized support, it is essential to assess student performance at each turn, motivating knowledge tracing (KT) in dialogue settings. However, existing dialogue-based KT approaches often ignore question difficulty and rely on opaque LLM latent representations, hindering accurate and interpretable prediction. In this work, we propose an interpretable difficulty-aware conversational KT framework that leverages LLMs to explicitly model student knowledge state and the difficulty of tutor-posed tasks at each dialogue turn. The framework incorporates the original question and the next tutor-posed task to estimate the student’s knowledge state and the difficulty of the upcoming turn. It further integrates Item Response Theory to map LLM outputs into student ability and question difficulty parameters, enabling interpretable prediction of student performance grounded in cognitive theories of learning. We evaluate the framework on two tutor-student dialogue datasets. Quantitative and qualitative results show that our framework outperforms existing KT baselines while generating interpretable outputs consistent with cognitive theory. Our code and data are available at https://github.com/umass-ml4ed/Difficulty-Aware-DialogKT.
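For readers unfamiliar with Item Response Theory, the sketch below shows the simplest variant, the one-parameter (Rasch) model, in which the probability of a correct response depends only on the difference between student ability and item difficulty. The abstract does not say which IRT variant the framework uses, so the choice of the Rasch model and the name `irt_prob_correct` are assumptions made for illustration.

```python
import math


def irt_prob_correct(ability: float, difficulty: float) -> float:
    """Rasch (1PL) model: P(correct) = sigmoid(ability - difficulty).

    Assumption: ability and difficulty are on the same logit scale.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# When ability equals difficulty the student has an even chance.
print(irt_prob_correct(0.0, 0.0))  # prints 0.5

# A harder task lowers the predicted probability for the same student.
print(irt_prob_correct(0.5, 0.0) > irt_prob_correct(0.5, 1.5))  # prints True
```

Because both parameters have a direct reading (how able the student is, how hard the task is), predictions built this way are easier to inspect than raw latent representations.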
Rubrics as Semantic Subspaces: A Unified Approach to Rubric-based Constructed Response Scoring across Short Answers and Essays
Sebastian Gombert | Sonja Hahn | Nico Andersen | Leon Camus | Zhifan Sun | Ngoc Nhu Hao Nguyen | Fabian Zehner | Longwei Cong | Alexander Mehler | Hendrik Drachsler
Rubrics are the primary reference for manual scoring of constructed responses, and there is growing interest in their use in automated scoring methodologies. In this work, we propose Aspect-Grounded Rubric–Answer Alignment (AGRAA), a rubric-based end-to-end scoring framework that models rubric descriptors as latent aspect spaces. Concretely, rubric descriptors are represented as low-dimensional subspaces derived from contextualised transformer embeddings, and student responses are scored according to how strongly their representations align with these rubric-induced spaces relative to the residual space outside them. This formulation provides a geometrically grounded interpretation of rubric-based scoring while enabling end-to-end training with standard transformer encoders. We introduce three distinct architectural variants and evaluate them on multiple short-answer and essay scoring datasets. Across these tasks, AGRAA achieves predictive performance highly competitive with strong neural and feature-based baselines. In addition, the framework yields interpretable intermediate representations that expose which rubric-defined aspects contribute to scoring decisions, enabling decision-aligned explanations grounded in rubric descriptors.
Domain-Adaptive Pre-training for Automated Short Answer Grading in Conceptual Physics: Reliability, Question-Level Analysis, and Error Reduction
Shirin Lade | Alistair Willis | Jonathan Nylk | Oli Howson
This paper investigates whether automated short answer grading can reliably support teachers when marking conceptual physics responses in settings with limited labelled data. Using free-text responses derived from Force Concept Inventory-style questions, the study shows that incorporating subject-specific knowledge improves grading consistency, particularly in early deployment scenarios. The system reduces grading errors and provides more reliable agreement with reference judgments, especially for more challenging questions. These results suggest that automated grading can assist teachers by supporting marking decisions and prioritising responses for review, while still requiring human oversight.
Measuring Optimal Challenge: Trajectory-Based Difficulty Alignment in Open-Ended Language Tutoring
Ziqi Shu | Shuman Wang | Michael Hardy
Conversational English as a Foreign Language (EFL) tutoring relies on dynamically generated exercises rather than fixed item banks, so traditional difficulty estimation cannot verify whether a task is appropriately calibrated to a learner. We propose a framework that measures difficulty alignment directly from observable interactional behavior, classifying each exercise into one of three states (Under-Challenged, Optimally Challenged, or Over-Challenged) based on turn-level sequences of student attempts, errors, confusion, and tutor scaffolding. Using 1,566 exercises from the Teacher-Student Chatroom Corpus, we validate the classification against human annotation (Cohen’s kappa = 0.79 at the state level) and show that a learner’s cumulative trajectory of these states predicts success on subsequent exercises. Aggregating these predictions into a within-session capability-shift proxy, we find that sessions with higher proportions of over-challenging exercises systematically yield lower estimated shifts, while optimally challenging interactions are significantly associated with greater improvement than under-challenging ones — patterns consistent with Krashen’s Input Hypothesis.
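The agreement statistic cited in this abstract, Cohen's kappa, corrects raw agreement for the agreement expected by chance. A minimal sketch of the standard two-rater formula follows; the state labels in the toy example are abbreviations invented here, not data from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement: product of each rater's marginal label proportions.
    expected = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels: U = Under-Challenged, O = Optimally Challenged, X = Over-Challenged.
system = ["U", "O", "O", "X"]
human = ["U", "O", "X", "X"]
print(round(cohens_kappa(system, human), 3))  # prints 0.636
```

Raw agreement in the toy example is 0.75, but kappa is lower because some of that agreement would be expected by chance given the label distributions.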
PeerMathDial: A Middle School Dialogue Dataset for Student Collaborative Math Problem Solving
Murong Yue | Desmond Mcglone | Emily Slutz | Wenhan Lyu | Yixuan Zhang | Jennifer Suh | Ziyu Yao
Collaborative Problem Solving (CPS) is a core skill in education, where the process of peer interaction is highly important. However, existing educational dialogue datasets mostly focus on classroom instruction or tutoring (i.e., teacher/tutor-student interaction), while datasets centered on small-group, student-student interaction are limited. This leaves researchers with limited resources for studying how students interact, coordinate, and solve problems together in real educational settings. To address this, we introduce PeerMathDial, the first dataset of peer CPS dialogues collected from authentic middle school math classrooms. It contains 55 dialogues from 27 students, totaling 6,406 turns. To facilitate research on CPS discourse analysis, we further build a corpus-grounded dialogue act taxonomy assisted by LLMs. Using the dataset and the dialogue act taxonomy, we demonstrate the practical applications of PeerMathDial across three use cases. First, we track how dialogues evolve over time and measure the impact of teacher interventions. Second, we align dialogue actions with student surveys to reveal the connection between students’ traits (e.g., confidence, leadership) and their actual behaviors. Third, by evaluating LLMs on dialogue act prediction, we glimpse the potential of LLMs for student simulation in educational applications. Our dataset and source code will be released to the community.
Effects of Varying LLM Access on Essay Writing Behavior
Julia Christenson | Karin de Langis | Shirley Anugrah Hayati | Dongyeop Kang
Investigating the degree to which large language models (LLMs) affect teaching and learning in universities can help identify strategies for integrating LLMs in a way that supports, rather than undermines, student learning outcomes. This study examined how varying levels of LLM assistance affect writing performance, engagement, and perceived authorship. We report a pilot study in which 24 college students were randomly assigned to write a short essay with no LLM access, limited access (<=3 prompts, responses capped at 100 words), or unlimited access. Overall essay quality was statistically indistinguishable across groups. Yet writing behavior and perceived authorship diverged sharply: students with limited access reported higher ownership (62.5% would submit the essay as independent work, vs. 25% in the unlimited group), stronger organizational gains, and more strategic, revision-focused prompting. The unlimited group spent more time writing, produced essays more similar to LLM output, and reported reduced creative expression. Our findings suggest that constraining, rather than banning, LLM access may preserve authorship confidence while retaining the scaffolding benefits of AI assistance.
Assessment of L2 speech global dimensions using large audio language models
Elsayed Issa | Mahmoud Ali
Large audio language models (LALMs) integrate audio representations with large language models to enable unified understanding of spoken content. Their capabilities have been increasingly investigated across several benchmarks; however, the examination of their use in rating L2 speech is still in its infancy. This study explores the abilities of LALMs in scoring three L2 speech global dimensions: foreign accentedness, comprehensibility, and intelligibility. Ninety audio samples produced by L2 speakers were rated by ten native speaker raters as well as five LALMs. Model performance was evaluated against the human composite mean using Pearson r, Spearman ρ, mean absolute error (MAE), and systematic bias, with the human leave-one-out correlation (r = .46-.73 across dimensions) serving as an empirical performance benchmark. The results showed that no LALM reached human-level performance on any dimension. Only one model (i.e., Gemini) achieved a significant correlation with human ratings on comprehensibility (r = .28, p < .01), while Qwen2-Audio showed modest correlation on intelligibility (r = .32, p < .01). MAE ranged from 0.75 to 3.99 for accentedness (human: 1.24), 1.35 to 3.00 for comprehensibility (human: 1.24), and 12.03 to 15.43 for intelligibility (human: 8.49). All models exhibited systematic biases, with deviations ranging from -9.31 to +13.19 points. The paper concludes with a discussion of the implications for automated L2 speech assessment.
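The agreement statistics named in this abstract can be illustrated with a minimal Python sketch. It assumes one model score and one human composite mean per audio sample; the function names and the toy ratings are hypothetical and are not taken from the paper.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def mean_absolute_error(model, human):
    """Average size of the model-human gap, ignoring direction."""
    return sum(abs(m - h) for m, h in zip(model, human)) / len(human)

def systematic_bias(model, human):
    """Mean signed error: positive means the model rates higher than humans."""
    return sum(m - h for m, h in zip(model, human)) / len(human)

# Toy ratings (hypothetical, for illustration only).
human_mean = [2.0, 4.5, 6.0, 7.5]
model_score = [3.0, 5.0, 8.0, 8.0]
print(pearson_r(model_score, human_mean))
print(mean_absolute_error(model_score, human_mean))
print(systematic_bias(model_score, human_mean))
```

A model can correlate well with human ratings and still show a large signed bias, which is why the two are usually reported separately.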
Incentives Of EdTech: A Systematic Review Of EduNLP Research
Gabrielle Gaudeau | Aoife O’Driscoll | Jasper Degraeuwe | Andrew Caines | Donya Rooein | Zeerak Talat
While the Natural Language Processing community has dedicated significant resources to developing educational technologies (EdTech) that support this shift, it remains unclear whose interests are being best served among the stakeholders of education. In this paper, we present a systematic literature review of 204 papers published in venues of the Association for Computational Linguistics’ Special Interest Group on Building Educational Applications in 2024 and 2025, and validate these against EdTech papers from the wider ACL Anthology. By examining stakeholder inclusion and the prioritisation of research tasks, our findings reveal a critical tension: a push and pull between private-sector incentives and the foundational needs of educational infrastructure. Our analysis reveals that teachers are systematically under-represented as beneficiaries of research (33.3%) despite being the most affected, that real-world deployment remains rare (9.8%), and that ethical engagement tends toward acknowledgement rather than action. Drawing on exemplary papers in our corpus, we offer concrete recommendations for more responsible EduNLP research practices.
Children’s English Reading Story Generation via Supervised Fine-Tuning of Compact LLMs with Controllable Difficulty and Safety
Qian Shen | Fanghua Cao | Min Yao | Shlok Gilda | Bonnie Dorr | Walter Leite
Large Language Models (LLMs) are widely applied in educational practices, such as for generating children’s stories. However, the generated stories are often too difficult for children to read, and the operational cost of LLMs hinders their widespread adoption in educational settings. We used an existing expert-designed children’s reading curriculum and its corresponding generated stories from GPT-4o and Llama 3.3 70B to design different experiments for fine-tuning three 8B-parameter LLMs, which then generated new English reading stories that were subjected to quantitative and qualitative evaluation. Our method prioritizes controllability over scale, enabling educators to target reading levels and error patterns with a compact, affordable model. Our evaluation results show that with appropriate fine-tuning designs, children’s English reading stories generated by 8B LLMs perform better on difficulty-related metrics than those from zero-shot GPT-4o and Llama 3.3 70B, with almost no discernible safety issues. Such fine-tuned LLMs could be more broadly used by teachers, parents, and children in classrooms and at home to generate engaging English reading stories tailored to children’s interests, with controllable difficulty and safety.
Transformer-based readability classifiers are worse than you think: Evidence from cross-domain Arabic readability assessment
Sarh Alzu’Bi | Robert Reynolds
Arabic readability assessment is under-explored compared to English, and existing models are typically evaluated only within the training domain. We introduce the Jordanian School Textbook Corpus (JSTC), 82,512 segments from 240 textbooks spanning grades 1–12, and combine it with DARES to train XGBoost classifiers, fine-tuned CAMeLBERT transformers, and hybrid architectures evaluated both in-domain and on the BAREC out-of-domain benchmark. CAMeLBERT achieves strong in-domain performance (QWK = 0.830) but its cross-domain QWK collapses to 0.085, while XGBoost over 127 handcrafted linguistic features alone maintains the highest cross-domain QWK (0.240); adding [CLS] embeddings to those features actively harms transfer. Probing reveals that CAMeLBERT layers implicitly capture some linguistic features but higher-level signals overwhelm them, and Captum attribution identifies nouns and nominal particles such as al- as the most important tokens. The results argue for prioritizing linguistically-grounded features over contextual embeddings when cross-domain robustness is required.
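For readers unfamiliar with the QWK figures quoted here, the following is a minimal sketch of quadratic weighted kappa for ordinal labels. It assumes labels are integers from 0 to `n_classes - 1`; the function name and toy labels are hypothetical, and the paper's own implementation may differ.

```python
def quadratic_weighted_kappa(gold, pred, n_classes):
    """QWK for integer labels in [0, n_classes); requires n_classes >= 2."""
    n = n_classes
    # Observed confusion matrix: observed[g][p] counts items with gold g, prediction p.
    observed = [[0] * n for _ in range(n)]
    for g, p in zip(gold, pred):
        observed[g][p] += 1
    total = len(gold)
    row_sums = [sum(observed[i]) for i in range(n)]
    col_sums = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n):
        for j in range(n):
            # Disagreements are penalised by squared distance between levels.
            weight = (i - j) ** 2 / (n - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independent gold and predicted marginals.
            weighted_expected += weight * row_sums[i] * col_sums[j] / total
    # Undefined (division by zero) if every item has the same gold and predicted label.
    return 1.0 - weighted_observed / weighted_expected

# Toy example with three readability levels (hypothetical data).
print(quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 1, 2], n_classes=3))
```

Because the penalty grows with the squared distance between levels, predicting grade 2 for a grade 12 text costs far more than an off-by-one error, which suits ordinal readability scales.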
Predicting Item Difficulty and Generating Reading Comprehension Items via an Annotated Repository
Radhika Kapoor | Mayank Sharma | Sang Truong | Nick Haber | Ben Domingue | Maria Ruiz-Primo
Prediction of item difficulty from its text content is of substantial interest for automated generation of test items. In this paper, we focus on the related problem of recovering IRT-based difficulty when the data originally reported item p-values (percentage of correct responses). We model this item difficulty using a repository of reading passages and student data from US standardized tests from New York and Texas for grades 3-8 spanning the years 2018-23. This repository is annotated with meta-data on (1) linguistic features of the reading items, (2) test features of the passage, and (3) context features. Using a penalized regression model, we achieve an RMSE of 0.59 (compared to a 0.92 baseline) and a 0.77 correlation between true and predicted difficulty. We further evaluated the impact of LLM embeddings (ModernBERT, BERT, and LLaMA), finding that they marginally improve performance but function effectively as standalone alternatives to traditional linguistic features. Finally, we demonstrate how this difficulty prediction model powers a publicly available, human-in-the-loop tool for generating reading comprehension items.
Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment
Grandee Lee | Yue Wang | Che Yee Lye | Luke Peh
When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM’s scoring function recovers the skill levels its generative function was instructed to produce. In the first direct measurement of GEA on a two-stage adaptive assessment, the model recovers roughly half the intended variance (r = 0.698) with systematic positive bias. GEA is strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills, and low-skill overestimation inflates scores near the routing threshold. We argue that granular, skill-decomposed rubrics are the principal proposed mechanism for strengthening GEA and outline complementary mitigations.
Evaluating LLM-Generated Formative Feedback for Undergraduate Mathematics Through the Lens of Feedback Theory
Aron Gohr | Marie-Amelie Lawn | Kevin Gao | Inigo Serjeant | Stephen Heslip
Large language models can generate feedback on free-form student writing, but it is unclear whether such feedback is correct and pedagogically useful. We evaluate LLM-generated feedback on 65 undergraduate proof-writing exercises using Hattie and Timperley’s feedback framework and a grade agreement metric, comparing two models (GPT-4.1, GPT-5) under two workflow configurations graded by two independent LLM evaluators. A mark-scheme-augmented workflow improves grade correlation with human experts for both models, and its precomputed mark schemes allow instructors to audit the system before deployment. GPT-5 produces higher-quality feedback across all dimensions. The metrics we collect give some evidence that in the setting studied, feedback quality is high, and several sanity checks on our experiments support this finding. However, providing meaningful self-regulation support and controlled tests with students remain to be done. The results in this contribution show that feedback theory provides a useful lens for evaluating automated mathematical feedback.
Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most
Tahreem Yasir | Wenbo Li | Sam Gilson | Sutapa Tithi | Xiaoyi Tian | Tiffany Barnes
Effective tutoring requires distinguishing optimal, valid but suboptimal, and incorrect student solutions, a distinction central to intelligent tutoring systems (ITS) but untested for LLM-based tutors. As LLMs are increasingly explored as conversational complements to ITS, evaluating their diagnostic precision is essential. We present a benchmark of seven LLM feedback agents in propositional logic using knowledge-graph-derived ground truth across 10,836 solution–feedback pairs and three feedback conditions. Models achieved near-ceiling performance on optimal steps but systematically over-rejected valid but suboptimal reasoning and over-validated incorrect solutions, precisely where adaptive tutoring matters most. These failures persisted across models regardless of solution context, suggesting architectural rather than informational limits. Moreover, accurate diagnosis did not reliably produce pedagogically actionable feedback, revealing a gap between diagnostic judgment and instructional effectiveness. Our findings suggest that LLMs are better suited for hybrid architectures where KG-grounded models handle diagnosis while LLMs support open-ended scaffolding and dialogue.
Retrieval-Augmented Tutoring for Algorithm Tracing and Problem-Solving in AI Education
Mragisha Jain | Tirth Bhatt | Griffin Pitts | Aum Pandya | Peter Brusilovsky | Narges Norouzi | Arto Hellas | Juho Leinonen | Bita Akram
Students learning algorithms often need support as they interpret traces, debug reasoning errors, and apply procedures across unfamiliar problem instances. In this paper, we present KITE (Knowledge-Informed Tutoring Engine), a Retrieval-Augmented Generation (RAG)-based intelligent tutoring system designed to serve as a classroom teaching assistant for algorithmic reasoning and problem-solving tasks. KITE uses an intent-aware Socratic response strategy to tailor support to different student needs, responding with targeted hints, guiding questions, and progressive scaffolding intended to strengthen students’ algorithmic problem-solving ability. To keep responses aligned with course content, KITE uses a multimodal RAG pipeline that retrieves relevant information from course materials. We evaluate KITE using three forms of assessment: RAGAs-based metrics for response grounding and quality, expert evaluation of pedagogical quality, and a simulated student pipeline in which a weaker language model interacts with KITE across two-turn dialogues and produces revised answers after receiving feedback. Results indicate that KITE produces contextually grounded and pedagogically appropriate responses. Further, using simulated students, KITE’s feedback helped the student models produce more accurate follow-up responses on procedural and tracing questions, suggesting that its scaffolding can support algorithmic problem-solving. This work contributes a tutoring architecture and an evaluation approach for assessing retrieval-grounded explanations and scaffolded problem-solving feedback.
LLM-Powered but Rule-Grounded: Pedagogically Relevant Grammatical Error Characterization for Learner Model Construction
Soroosh Akef | Amália Mendes | P Rebuschat | Detmar Meurers
Grammatical error correction approaches rarely characterize the pedagogical value of corrected errors. We propose a framework that combines LLM-based second-language writing correction with a rule-based characterization module to identify pedagogically relevant, fine-grained grammatical properties in learner texts. The characterization module targets 252 European Portuguese properties which are categorized by the CEFR level at which they are taught according to an authoritative curriculum, and property accuracy is inferred from contrasts between the learner and corrected texts. We evaluate the framework extrinsically by training interpretable automatic proficiency assessment models on accuracy features extracted from characterized errors in a Portuguese learner corpus. Across different prompting strategies, we show that models trained on features derived from LLM-corrected texts perform similarly to those trained on features derived from annotator-corrected texts and comparably to models trained on linguistic complexity features. Feature importance overlap is likewise high, and similar predictive patterns are observed in both annotator-based and LLM-based models, further supporting the validity of the proposed framework.
Teaching Through Analogies: A Modular Pipeline for Educational Analogy Generation
Mariam Barakat | Ekaterina Kochmar
We present a modular pipeline for educational analogy generation, decomposed into four stages – source finding, sub-concept generation, explanation generation, and evaluation – grounded in Structure Mapping Theory. Evaluating 12 LLMs across six model families on SCAR and ParallelPARC, we find that sub-concept grounding substantially improves retrieval precision and explanation quality but offers limited benefit in open-ended generation. We further validate an LLM-as-a-judge methodology against human annotations, finding that Claude Sonnet 4.6 aligns more reliably with human rankings than with absolute scores. Our results highlight cross-stage interactions that isolated studies cannot capture, and position sub-concept grounding as a key driver of analogy quality.
Edit-level Majority Voting Mitigates Over-Correction in LLM-based Grammatical Error Correction
Takumi Goto | Yusuke Sakai | Taro Watanabe
Grammatical error correction using large language models often suffers from the over-correction issue. To mitigate this, we propose a training-free inference method that performs edit-level majority voting over multiple candidates generated by a single model, without requiring model modifications or additional training. Across nine benchmarks covering English, Czech, German, Ukrainian, Korean, Hindi, and Romanian, the proposed method outperforms both greedy and MBR decoding in most cases. Moreover, it yields stable correction quality regardless of the instruction prompts used. We release two repositories supporting GEC datasets and LLM inference.
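The general idea of edit-level majority voting can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: edits are extracted with the standard library's `difflib` token alignment, an edit is kept only if it appears in a strict majority of candidates, and the function names are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

def extract_edits(src_tokens, hyp_tokens):
    """Return the set of (start, end, replacement) edits that turn src into hyp."""
    matcher = SequenceMatcher(a=src_tokens, b=hyp_tokens, autojunk=False)
    edits = set()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.add((i1, i2, tuple(hyp_tokens[j1:j2])))
    return edits

def majority_vote(source, candidates):
    """Apply only the edits proposed by more than half of the candidates."""
    src_tokens = source.split()
    votes = Counter()
    for candidate in candidates:
        votes.update(extract_edits(src_tokens, candidate.split()))
    kept = [edit for edit, count in votes.items() if count > len(candidates) / 2]
    output = list(src_tokens)
    # Apply right to left so earlier token offsets stay valid.
    for start, end, replacement in sorted(kept, reverse=True):
        output[start:end] = replacement
    return " ".join(output)

source = "He go to school yesterday"
candidates = [
    "He went to school yesterday",
    "He went to school yesterday",
    "He had gone to the school yesterday",
]
print(majority_vote(source, candidates))  # He went to school yesterday
```

Edits that only one candidate proposes, such as the inserted "the" above, are dropped, which is how voting at the edit level can curb over-correction without retraining the model.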
Zero Shot Phonics: Evaluating Constraint-Adherent Phonics Story Generation in Large Language Models
Maria Monica Manlises | Ethel Ong
Phonics stories are essential for early literacy, requiring controlled repetition of grapheme-phoneme (GP) patterns while maintaining simplicity, suitability, and quality. Generating such texts poses a challenge for large language models (LLMs), which must balance multiple phonological and pedagogical constraints. We evaluate six LLMs in a zero-shot setting across 16 prompt configurations, producing 8,688 outputs and 39,096 stories. Outputs are assessed using a multi-dimensional framework covering phonological alignment, developmental lexical appropriateness, readability, and narrative quality. Results show that while LLMs generate highly readable and age-appropriate text, they exhibit variability in phoneme control and narrative coherence. Prompt design significantly affects performance, revealing trade-offs between focusing on multiple phonological, linguistic, and pedagogical constraints, while model choice also leads to significant differences. These findings highlight the challenges of controllable educational text generation and the importance of prompt design in balancing instructional objectives. We release our prompts, generated stories, and evaluation framework to support future work in phonics-based story generation for early readers.
From Dialogue to Learner Modeling: Identifying Candidate Signals of Productive Use in LLM-Based Grammar Practice
Luisa Ribeiro-Flucht | Lanhua Huang | Xiaobin Chen
Adaptive language-learning systems often model progress through correctness in constrained exercises, where the target response is predefined. In dialogue-based tutors, by contrast, learners can respond appropriately in many ways, making evidence of progress harder to interpret. This raises a learner-modeling problem: determining whether learner production provides useful evidence of progress, which aspects are informative, and how they might support adaptation. We address this problem using pilot data from an LLM-based English grammar tutor, comprising 40 pre- and post-test tasks, treatment interactions, and 2,406 learner messages. We propose a coding scheme for learner production in dialogue and explore whether the resulting evidence types can support future adaptive decisions. Findings show that learner production in dialogue can support adaptive grammar practice: prior target use predicted short-term performance, while finer-grained evidence helped distinguish different levels of productive control. We discuss implications for adaptive grammar-based dialogue systems that use learner production to support communicative practice.
Evaluating Adaptive Personalization of Educational Readings with Simulated Learners
Ryan Woo | Anmol Rao | Aryan Keluskar | Yinong Chen
We present a framework for evaluating adaptive personalization of educational reading materials with theory-grounded simulated learners. Unlike typical intelligent tutoring systems that adapt questions or feedback, we treat reading as the primary intervention and use question answering only as an observation channel for Bayesian Knowledge Tracing (BKT). This enables controlled comparison of LLM-powered adaptive and non-adaptive reading policies before classroom deployment. The framework links open educational content to a shared ontology of learning objectives and knowledge components, which is used to generate aligned reading–assessment pairs targeting one objective at a time. Simulated learners update their knowledge through a comprehension-and-memory process that models encoding, integration with prior knowledge, and misconception revision. The learner model combines established theories of reading with constrained answer selection, ensuring responses are generated only from information the learner has plausibly retained. Together, these components provide an interpretable offline testbed for studying whether adaptive reading improves learning outcomes.
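As background for the BKT observation channel mentioned in this abstract, here is a minimal sketch of the standard Bayesian Knowledge Tracing update for a single knowledge component. The slip, guess, and learn values are illustrative defaults chosen for this example, not parameters reported by the paper, and the function name is hypothetical.

```python
def bkt_update(p_known, correct, slip=0.1, guess=0.2, learn=0.1):
    """One standard BKT step: Bayesian posterior from the answer, then learning.

    p_known: prior probability that the learner has mastered the component.
    correct: whether the observed answer was correct.
    slip, guess, learn: illustrative parameter values (assumptions).
    """
    if correct:
        evidence = p_known * (1 - slip)
        posterior = evidence / (evidence + (1 - p_known) * guess)
    else:
        evidence = p_known * slip
        posterior = evidence / (evidence + (1 - p_known) * (1 - guess))
    # Chance of moving from unmastered to mastered after this step.
    return posterior + (1 - posterior) * learn

# Trace mastery over a short, hypothetical answer sequence.
p = 0.5
for answer in [True, False, True, True]:
    p = bkt_update(p, answer)
    print(round(p, 3))
```

In a setup like the one described, each answered question would feed one such update, while the reading material itself is the intervention being compared.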
Toward Cross-Domain Automated Feedback: A Comparative Evaluation of Open-Source Models across Diverse Student Assessment Types
Muhammad Haseeb | Min Paing Hmue | Ahmad Imam Amjad | Maaz Amjad | Victor Sheng
Constructive, personalized, and timely feedback is essential to student learning. However, providing such feedback in large classes remains a major challenge. Large language models (LLMs) offer alternatives to support instructors by generating relevant feedback that highlights both student strengths and areas for improvement. Nevertheless, most existing LLM-based feedback systems rely on proprietary APIs and are often tailored to specific tasks, which limits their accessibility, flexibility, and applicability in resource-constrained educational settings. In this study, we investigate the potential of two open-source LLMs (DeepSeek R1 and Qwen 3.5) to support automated feedback generation across three assessment types (programming assignments, essays, and mathematics problems). We evaluate two prompting strategies (unified and multi-agent) across these domains and use human judgment criteria to assess feedback quality. Through this analysis, we examine the potential of open-source models as practical, scalable alternatives to proprietary API-based systems for developing freely accessible feedback-generation tools. Our results show that a single open-source model can generate useful feedback across diverse domains, though with varying effectiveness. DeepSeek R1 performs better on reasoning-intensive tasks such as mathematics, while Qwen 3.5 works best for holistic tasks such as writing, but both models struggle with programming tasks. We find that model architecture matters more than prompting strategy, and reasoning-optimized models excel in structured domains, while general-purpose models perform better on holistic tasks. Finally, we conclude that a multi-agent approach does not consistently guarantee improved results over a single unified LLM approach.
Findings of the BEA 2026 Shared Task on Vocabulary Difficulty Prediction for English Learners
Mariano Felice | Lucy Skidmore
This paper reports findings from the BEA 2026 Shared Task on Vocabulary Difficulty Prediction for English Learners across three L1s (Spanish, German and Mandarin). The task featured open and closed tracks, using data from the British Council’s Knowledge-based Vocabulary Lists (KVL). Submissions were received from 23 teams employing diverse modelling approaches, including transformers, Large Language Models, feature-based approaches and ensembles. Results were evaluated using RMSE, with winning systems significantly exceeding the baseline and establishing new state-of-the-art benchmarks. This paper offers an examination of the participating systems, performance across tracks and L1s, and the factors that can affect prediction accuracy.
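Since the shared-task entries that follow all report RMSE, a minimal sketch of the metric may help when comparing them. The function name and the toy difficulty scores are hypothetical; the official scoring script may differ in details.

```python
from math import sqrt

def rmse(predicted, gold):
    """Root mean squared error between predicted and gold difficulty scores."""
    squared_errors = [(p - g) ** 2 for p, g in zip(predicted, gold)]
    return sqrt(sum(squared_errors) / len(squared_errors))

# Toy difficulty scores (hypothetical, for illustration only).
gold = [0.5, -1.2, 2.0, 0.0]
predicted = [0.7, -0.9, 1.1, 0.4]
print(rmse(predicted, gold))
```

Lower is better, and because errors are squared before averaging, a few badly mispredicted words weigh more heavily than many small misses.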
SATLab at BEA 2026 Shared Task 1: Predicting the Difficulty of English Words for Three L1 Learners Using Primarily Psycholinguistic Features
Yves Bestgen
This paper presents SATLab’s participation in the BEA 2026 shared task on predicting the difficulty of English words for L2 learners. The proposed system uses features mainly derived from word frequency lists, lexical norms, and psychometric data, which are input into a gradient boosting decision tree model. It outperformed the baseline system but performed significantly worse than the top-performing teams. Feature contributions to model performance are analysed using gain scores and Spearman rank correlations, and a brief analysis of the most significant errors is provided.
UGA Threshold at BEA 2026 Shared Task 1: Predicting Vocabulary Acquisition Difficulty with Hand-Crafted SLA-Based Features
Emma Dalbo
This paper describes a feature-based system submitted to the BEA 2026 Shared Task on Vocabulary Difficulty Prediction (closed track). The system models vocabulary difficulty for English learners using linguistically motivated features capturing frequency, cross-linguistic similarity, phonological and orthographic complexity, and semantic properties, supplemented by multilingual embeddings (reduced via PCA). Multiple regression models were evaluated using cross-validation, with final predictions generated from ensemble and single-model configurations per language. The system achieves competitive performance across all three L1 groups (German, Spanish, and Chinese), outperforming the XLM-RoBERTa baseline in seven of nine runs in terms of RMSE, with the strongest gains observed for Chinese and more modest improvements for Spanish. An ablation study further demonstrates that frequency and cross-linguistic similarity factors contribute most substantially to predictive performance, with effects varying across L1s. These findings highlight the role of interpretable linguistic features in modeling vocabulary difficulty in an L1-aware setting.
TeamXBC at BEA 2026 Shared Task 1: How AI (and I) won the shared task: Vibe and agentic coding solutions for practical machine learning problems
Xiaobin Chen
The paper describes how the author used AI coding agents and a technique called vibe coding to successfully tackle the BEA 2026 shared task on vocabulary difficulty prediction. Three sets of predictions (runs) were submitted to the competition, corresponding to three experiments the author ran by giving the coding agent different levels of agency: (1) a one-off solution fully planned and implemented by the AI, (2) an AI self-determined iterative process that ran for 24 hours, and (3) a collaborative human-in-the-loop process where solutions were discussed between the author and the AI. Competition results showed that the collaborative mode delivered the best performance, demonstrating that at the current stage domain expert input and decision making are important and necessary for vibe coding solutions to practical machine learning problems.
SAAKTH at BEA 2026 Shared Task 1: L1-Aware English Vocabulary Difficulty Prediction with Hybrid Transformer and Psycholinguistic Features
Karthik Mattu | Adit Dhall | Arshad Naguru | Shubh Sehgal | Thejas Gowda | Hakyung Sung
This paper presents team SAAKTH’s system for the BEA 2026 Shared Task on Vocabulary Difficulty Prediction (Closed Track). We address the key challenge that English word difficulty is not fixed but varies with English learners’ native language. Our approach combines a fine-tuned XLM-RoBERTa-large encoder with handcrafted psycholinguistic features engineered separately for each L1 group. These features are integrated via a shallow multilayer perceptron and optimized separately per L1, with five-seed ensembling and XGBoost-based blending for stability. Our system achieves RMSEs of 0.997 (es), 1.002 (de), and 0.932 (cn) on the development set, improving 20–25% over the baseline. Results highlight the effectiveness of L1-aware modeling under limited data.
SurreyCTS at BEA 2026 Shared Task 1: Semantic Funnelling and Entropy-based Multilingual Lexical Difficulty Prediction
Georgina Willoughby | Jordan Painter | Diptesh Kanojia | Emily Wells | Constantin Orasan
We describe the SurreyCTS system for the BEA 2026 shared task on lexical difficulty prediction. Our approach combines multilingual transformer encoders (RemBERT and COMET) with engineered linguistic features including semantic funnelling, lexical similarity, attention-derived signals, and language-aware representations. A weighted ensemble of the five strongest systems placed fifth among open-track teams, outperforming the open-track baseline across all three learner L1 groups (Spanish, German, and Chinese).
EduNLP at BEA 2026 Shared Task 1: Multi-Model Ensemble with Feature-Augmented Transformers for Vocabulary Difficulty Prediction
Avinash Kumar Sharma
We describe our system submitted to the BEA 2026 Shared Task on Vocabulary Difficulty Prediction for English Learners. Our approach combines handcrafted linguistic features with fine-tuned XLM-RoBERTa transformers in a multi-model ensemble, participating in both the closed and open tracks. Our system outperforms the baselines on both tracks across all three L1s, with best RMSEs of 1.058 (closed, CN) and 0.992 (open, CN). Post-hoc error analysis reveals that polysemous words in rare senses and nominalized -ing forms constitute the primary failure mode.
AIDA at BEA 2026 Shared Task 1: A Two-Stage Framework for L1-Aware Vocabulary Difficulty Prediction with Representation Diversity and Residual Calibration
Seok Hyeon Cho | JunHyeok Choi | Sangeun Ji | Sung Won Han
We study vocabulary difficulty prediction for second language (L2) learners, a key component for adaptive language learning and assessment. Existing approaches often treat difficulty as an intrinsic property of words or contexts, overlooking representation-dependent variation and learner-specific factors such as L1 transfer. We participate in the BEA 2026 Shared Task Closed Track using the Spanish (L1) subset of the KVL dataset. We propose a two-stage framework that decouples representation learning from learner-aware calibration. Stage 1 constructs diverse representations using multiple pretrained encoders with varied pooling and prediction strategies, capturing complementary aspects of lexical and contextual complexity. Stage 2 models systematic residual errors with psycholinguistic and cross-lingual features, enabling explicit correction of prediction biases. Experiments show that our method outperforms strong baselines, improving RMSE (1.257 -> 0.976) and correlation (0.765 -> 0.857). These results highlight the importance of jointly modeling representation diversity and learner-specific effects. Our system ranked 3rd in the official BEA 2026 Shared Task Closed Track.
Failure at BEA 2026 Shared Task 1: One Pipeline, Three L1s: A Unified Language-Agnostic System for Vocabulary Difficulty Prediction
Abid Hossain | Kamruzzaman Khan Alve
We present a unified, language-agnostic system for the BEA 2026 Shared Task on vocabulary difficulty prediction. The system uses a single training pipeline across Spanish, German, and Mandarin Chinese without any language-specific adaptation. Input features include serialized text fields and four scalar length-based features, processed using an XLM-RoBERTa encoder with attention-mask-weighted mean pooling. Hyperparameters are tuned with Optuna under reduced cross-validation, followed by full 5-fold training and checkpoint-based ensembling. Our approach improves over the official closed-track baseline across all three L1 conditions, demonstrating that a shared architecture and training strategy can yield consistent gains without language-specific engineering. Error analysis shows higher prediction error at difficulty extremes, suggesting a regression-to-the-mean tendency.
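The attention-mask-weighted mean pooling mentioned in this abstract can be illustrated with a dependency-free Python sketch. It assumes one hidden vector per token and a 0/1 mask marking real tokens versus padding; the function name is hypothetical, and a real system would typically run the same computation on encoder output tensors.

```python
def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors, counting only positions where the mask is 1.

    hidden_states: list of per-token vectors (lists of floats) for one sequence.
    attention_mask: list of 0/1 values, 0 marking padding tokens.
    """
    dim = len(hidden_states[0])
    totals = [0.0] * dim
    weight = 0.0
    for vector, mask in zip(hidden_states, attention_mask):
        weight += mask
        for k in range(dim):
            totals[k] += mask * vector[k]
    if weight == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / weight for t in totals]

# Two real tokens followed by one padding position (toy values).
states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(masked_mean_pool(states, mask))  # [2.0, 3.0]
```

Dividing by the mask sum rather than the sequence length keeps padding from diluting the pooled vector, which matters when short inputs such as single words are batched with longer ones.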
BoostedCats at BEA 2026 Shared Task 1: What Makes a Word Hard to Learn? Modeling L1 Influence on English Vocabulary Difficulty
Jonas Mayer Martins | Zhuojing Huang | Aaricia Herygers | Lisa Beinborn
What makes a word difficult to learn, and how does the difficulty depend on the learner’s native language? We computationally model vocabulary difficulty for English learners whose first language is Spanish, German, or Chinese with gradient-boosted models trained on features related to a word’s familiarity (e.g., frequency), meaning, surface form, and cross-linguistic transfer. Using Shapley values, we determine the importance of each feature group. Word familiarity is the dominant feature group shared by all three languages. However, predictions for Spanish- and German-speaking learners rely additionally on orthographic transfer. This transfer mechanism is unavailable to Chinese learners, whose difficulty is shaped by a combination of familiarity and surface features alone. Our models provide interpretable, L1-tailored difficulty estimates that can be used to design vocabulary curricula.
uogal at BEA 2026 Shared Task 1: Ensemble of Multilingual Encoders with NMT Augmentation for L1-Aware Vocabulary Difficulty Prediction
Bernardo Stearns | John P. McCrae | Thomas Gaillat | Jefkine Kafunah
We submit a system for the closed track of the BEA 2026 shared task on L1-aware vocabulary difficulty prediction (Spanish, German, Mandarin Chinese). We compared three families of approaches: hand-crafted tabular features with tree-based regressors, fine-tuned multilingual encoders, and decoder-based artificial learner simulation using LoRA-tuned Pythia models, each evaluated with and without NMT-augmented English context. Our best system is an ensemble of four base and four NMT-augmented multilingual encoders combined through per-language stacking (Nelder-Mead and ElasticNet meta-learner), which placed 2nd in the closed track across all three languages. We also report a monotonic scaling study of the decoder-based artificial learner simulation pipeline.
Jinnie’s Lab at BEA 2026 Shared Task 1: Precalibration of Vocabulary Item Difficulty with Multilingual Transformers and Multi-Task Learning
Zhe Li | Pauline Aguinalde | Jinnie Shin
This paper describes our submission to the BEA 2026 shared task 1 on vocabulary item difficulty prediction in multilingual settings. We investigated whether transformer-based representations learned directly from item content can support the prediction of vocabulary item difficulty across different L1 groups. Our approach adopted a multilingual BERT-based architecture, specifically the mmBERT, with representation augmentation at both the layer and token levels, followed by a multi-task cascade learning that incorporates part-of-speech information as an auxiliary structural signal. Results showed that multi-task mmBERT consistently outperforms the shared-task XLM-RoBERTa baseline across languages, while gains from more complex aggregation are not uniform. The findings showed that strong multilingual representations provide a competitive foundation for vocabulary item difficulty prediction, while the benefits of additional architectural complexity depend on the language and training setting.
Glite at BEA 2026 Shared Task 1: Holistic Difficulty Models Dominate, Feature Engineering Closes the Gap in L1-Aware Vocabulary Difficulty Prediction
Vassili Philippov | Dmitrii Andreev | Pavel Katunin | Anton Nikolaev
This paper describes our submission to the BEA 2026 Shared Task on L1-Aware English Vocabulary Difficulty Prediction. We build per-L1 CatBoost regressors over 1,161 candidate linguistic, psycholinguistic, dictionary, and LLM-derived features drawn from 129 feature sets; out-of-fold predictions from fine-tuned encoder and decoder-LLM regression heads enter the model as additional features. Features are selected via Recursive Feature Elimination with nested cross-validation, producing compact per-L1 models of 29-150 features per run. For the closed track we introduce a per-feature-column compliance audit that classifies 57 of 129 feature sets as track-eligible under the organiser rulings, an audit that forced a rebuild of the selection and ensembling pipelines in the final week. We further show that decoder-LLM LoRA regression heads — LLaMA-3.1-8B being the single strongest model in our pool — provide the largest marginal gains in the open track, and that a simpler per-L1 CatBoost on RFE-selected features matches or exceeds Ridge-stacking ensembles over the same base models. Our systems ranked 1st in the closed track and 2nd in the open track on all three L1s (Spanish, German, Mandarin), reducing baseline RMSE by 29.9% in the closed track and 35.9% in the open track on average.
NLP-Explorers at BEA 2026 Shared Task 1: DeBERTa–CatBoost Weighted Ensemble Approach for L1-Specific Vocabulary Difficulty Prediction
Tayyab Latif | Asifa Bibi | Sabur Butt | Grigori Sidorov | Alexander Gelbukh
Vocabulary difficulty prediction aims to estimate how difficult a word is for a learner. This is an important problem because word difficulty is shaped not only by the word itself, but also by the learner’s background and the context in which the word appears. In this work, we predict continuous difficulty scores for English target words using learner-specific information. Our approach combines a fine-tuned DeBERTa v3 Large model with a CatBoost regressor trained on transformer-based embeddings. The final score is produced through weighted ensembling, where DeBERTa provides the main prediction and CatBoost adds a smaller complementary signal. Our final system achieved RMSE scores of 1.040 for Spanish, 0.992 for German, and 0.882 for Chinese. The results were also stable across multiple runs, showing that the model behaved consistently under small changes in ensemble weight. These findings show that a simple hybrid system can provide reliable performance for vocabulary difficulty prediction. They also suggest that combining strong contextual representations with a lightweight regression model is an effective way to model learner-sensitive word difficulty.
RETUYT-INCO at BEA 2026 Shared Task 1: Feature-Enriched mDeBERTa for Word Difficulty Prediction
Santiago Robaina | Aiala Rosá | Luis Chiruzzo
We describe the RETUYT-INCO participation in the BEA 2026 Shared Task on Vocabulary Difficulty Prediction for English Learners, a regression task that predicts GLMM psychometric difficulty scores for English target words given an L1 cue (Spanish, German, and Mandarin). We submitted two systems to the closed track (which restricts participants to the provided shared-task data and standard NLP resources, excluding external corpora and large language models): a feature-engineered XGBoost regressor for all three L1s, and, for Spanish, a 3-seed ensemble of mdeberta-v3-base fine-tuned with the same handcrafted features prepended as input text tokens. Our best test result is 1.094 RMSE on Spanish (ensemble), a 13.0% reduction over the XLM-RoBERTa-base closed baseline. We highlight two findings. First, a LaBSE cross-lingual cosine between the L1 source word and the English target word is the largest single-feature addition in our incremental ablation, reducing average development-split (dev) RMSE by 0.091 on top of an already strong string/frequency/POS feature set. Second, feature-only XGBoost, with no neural fine-tuning and no GPU, already beats the XLM-RoBERTa-base closed-track development baseline on average across the three L1s (1.273 vs. 1.287 RMSE).
Token Titans at BEA 2026 Shared Task 1: Multilingual Lexical Complexity Prediction via Fine-Tuned XLM-RoBERTa with Ensemble Decoding
Anubhab Parashar | Sandeep Mathias
We describe our submission to the BEA 2026 Shared Task on Multilingual Lexical Complexity Prediction. The system fine-tunes XLM-RoBERTa Large separately for Spanish, German, and Chinese, feeding each instance as a flat concatenation of the source word, its sentential context, an English clue, and the English target word. Training uses z-score label normalization and two independent runs that differ in learning rate, scheduler, and random seed; a weighted ensemble of their predictions (0.6/0.4) consistently reduces variance on the validation set. On the official test set the system scores RMSE = 1.170 and Pearson = 0.812.
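As a rough illustration of the two steps named in this abstract, the sketch below pairs z-score label normalization with a 0.6/0.4 weighted average of two runs. The helper names and the toy numbers are hypothetical, and the abstract does not say whether the authors blend before or after mapping predictions back to the original scale. Here the blend happens in normalized space as an assumption.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_zscore(labels: List[float]) -> Tuple[float, float]:
    """Return the mean and (population) standard deviation of the training labels."""
    sigma = pstdev(labels)
    return mean(labels), sigma if sigma > 0 else 1.0


def normalize(values: List[float], mu: float, sigma: float) -> List[float]:
    return [(v - mu) / sigma for v in values]


def denormalize(values: List[float], mu: float, sigma: float) -> List[float]:
    return [v * sigma + mu for v in values]


def weighted_ensemble(run_a: List[float], run_b: List[float],
                      w_a: float = 0.6, w_b: float = 0.4) -> List[float]:
    """Blend two prediction lists element-wise with fixed weights."""
    return [w_a * a + w_b * b for a, b in zip(run_a, run_b)]


# Toy training labels and two hypothetical runs predicting in normalized space.
train_labels = [1.0, 2.0, 3.0]
mu, sigma = fit_zscore(train_labels)
run_a = [0.0, 1.0]
run_b = [0.5, 0.0]
blended = weighted_ensemble(run_a, run_b)
final_scores = denormalize(blended, mu, sigma)  # back on the original label scale
print(blended, final_scores)
```

Because denormalization is an affine map and the weights sum to 1, blending before or after mapping back to the original scale gives the same result.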
TOEBM at BEA 2026 Shared Task 1: Improving Lexical Difficulty Prediction with Context-Aligned Contrastive Learning and Ridge Ensembling
wicaksono M. | Joanito Lopo | Tsamarah Nugraha | Ahmad Adi | Muhamad Nurfajri
Lexical difficulty prediction is a fundamental problem in language learning and readability assessment, requiring models to estimate word difficulty across different first-language (L1) backgrounds. However, existing approaches rely on regression-only training with scalar supervision, which does not explicitly structure the representation space, limiting their ability to capture cross-lingual alignment and ordinal difficulty. To mitigate these issues, we propose Context-Aligned Contrastive Regression, which integrates Ridge regression ensemble with two complementary objectives, i.e., Cross-View Context and Ordinal Soft Contrastive Learning. Experiments on three L1 datasets show that (i) contrastive objectives improve cross-lingual representation alignment while preserving language-specific nuances, (ii) the learned representations capture the ordinal structure of lexical difficulty, and (iii) the ensemble effectively mitigates systematic biases of individual models, leading to more stable performance across difficulty levels.
Data Asgardians at BEA 2026 Shared Task 1: A Hybrid Transformer–Feature Ensemble for L1-Aware English Vocabulary Difficulty Prediction
Adrian Pineda | Sabur Butt | Héctor Ceballos Cancino
This paper presents our system for the BEA 2026 Shared Task on Vocabulary Difficulty Prediction for English Learners. The task requires predicting psychometrically calibrated GLMM difficulty scores for English vocabulary items across three learner first-language (L1) backgrounds: Spanish (ES), German (DE), and Mandarin Chinese (CN). Our approach studies how hand-crafted linguistic features can complement contextual multilingual transformer representations. We engineer 33 phonological, morphological, semantic, contextual, and cross-lingual features, and evaluate feature-only regressors, Solo transformer models, Hybrid transformer models, and prediction-level ensembling. Our official Closed Track submissions were generated with XLM-RoBERTa-large Solo and Hybrid models, which improved over the official baseline for all three L1 groups, achieving test RMSEs of 1.182 (ES), 1.117 (DE), and 1.006 (CN) with a mean of 1.103. We then conducted a post-submission refinement using mDeBERTa-v3-base components and a Ridge stacking ensemble, which further reduced test RMSE to 1.037 (ES), 0.997 (DE), and 0.913 (CN), with a mean of 0.982, a mean improvement of 0.121 over our best XLM-RoBERTa-large system.
UOL@IDEM at BEA 2026 Shared Task 1: Neural Fusion and Feature-Rich Modeling for L1-Aware Vocabulary Difficulty Prediction
Nouran Khallaf | Serge Sharoff
This paper describes UOL@IDEM’s closed-track submission to the BEA 2026 shared task on L1-aware vocabulary difficulty prediction. We model the task as regression and train separate systems for Spanish, German, and Mandarin Chinese. Our system combines multilingual contextual representations with engineered features capturing frequency, surface form, retrieval evidence, semantic alignment, cognate similarity, and masked-language-model predictability. Development results show consistent gains over the official closed-track baselines, with sentence-embedding encoders such as BGE-M3, multilingual E5, and LaBSE performing best. Official submissions achieve RMSE scores of 1.132, 1.037, and 0.891 for Spanish, German, and Chinese, respectively. Feature analysis identifies frequency as the most stable predictor, while contextual predictability, form similarity, retrieval, and semantic features provide complementary L1-sensitive signals. Error analysis shows strong ranking performance but weaker calibration for the easiest items, which are often overpredicted.
Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult?
Adam Nohejl | Xuanxin Wu | Yusuke Ide | Maria Riera Machin | Yi-Ning Chang
We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council’s Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online.
Report on the BEA 2026 Shared Task on Rubric-based Short Answer Scoring for German
Sebastian Gombert | Zhifan Sun | Fabian Zehner | Jannik Lossjew | Tobias Wyrwich | Berrit Czinczel | David Bednorz | Sascha Bernholt | Knut Neumann | Ute Harms | Aiso Heinze | Hendrik Drachsler
We present the BEA 2026 shared task on rubric-based short answer scoring for German. Rubric-based short answer scoring is a case of automatic short answer scoring (ASAS) that requires models to apply textual scoring rubrics to student answers as a basis for assigning scores. For the shared task, we introduced a novel German-language dataset from multiple STEM domains to provide a comprehensive benchmark for this problem. The dataset was designed to evaluate both performance and generalization (the latter, by distinguishing between seen and unseen questions), as well as coarse- and fine-grained scoring (2-way vs. 3-way). The systems submitted to the shared task cover a wide range of approaches, including fine-tuned large language models, prompt-based methods, human-AI collaboration strategies, or a combination of these. The results show that structured, task-adapted LLM systems achieved the strongest performance across all tracks. The winning system, IWM-DKM, combined LoRA fine-tuning of Qwen models with rubric-aware input structuring, including checklist-style reasoning, rubric reframing as decision trees, background knowledge injection, and ensemble voting. Other systems similarly relied on fine-tuned LLMs, retrieval-augmented prompting, encoder–LLM ensembles, or weighted aggregation strategies. Overall, the shared task results show that rubric-based scoring benefits most from systems that explicitly operationalise rubric semantics, while generalisation to unseen questions remains a central challenge.
Open-source LLMs with simple, zero-shot prompts are at best middling graders on the BEA 2026 Automated Grading Shared Task – blunt-edge models, in fact. However, they are good enough to support human graders and save them time. We demonstrate the application of a hybrid grading approach that first transparently defines the success criteria and then pairs a zero-shot LLM grader with human review. The hybrid approach outperforms the LLM grader on its own and has the added advantage of keeping the human in the loop.
ASLAN at BEA 2026 Shared Task 2: Voting Across Scoring Paradigms
Marie Bexte | Yuning Ding | Josef Ruppenhofer | Nils-Jonathan Schaller | Daniel Mora Melanchthon | Torsten Zesch | Andrea Horbach
This paper describes the ASLAN system contribution to the BEA 2026 Shared Task on rubric-based short answer scoring for German (Gombert et al., 2026). We investigate three complementary modeling paradigms: similarity-based scoring, instance-based classification, and rubric-prompted large language models (LLMs). For the unseen answers track, where test answers belong to prompts observed during training, we compare question-specific and generic scoring models as well as ensemble variants. For the unseen questions track, where models must generalize to previously unseen prompts, we primarily rely on zero-shot LLM-based scoring using the scoring rubrics. Our experiments show that similarity-based models outperform instance-based models and LLM-based models in the unseen answers setting. In addition, we find that ensemble methods improve robustness over individual models.
WSE Research at BEA 2026 Shared Task 2: Multi-Strategy Rubric-Based Short Answer Scoring for German
Jonas Gwozdz | Andreas Both
We describe the WSE Research system for the BEA 2026 Shared Task 2 on Rubric-based Short Answer Scoring for German. Our system combines rubric-conditioned prompting with TF-IDF exemplar retrieval, LoRA fine-tuning of open-source Qwen models, and prediction aggregation across complementary scorers. The central question is when prompt engineering, parameter-efficient adaptation, and aggregation each help for rubric-based grading. On the ALICE-LP-1.0 trial set, a fine-tuned Qwen2.5-32B reaches QWK 0.769, surpassing the strongest prompted commercial baseline (Gemini 3 Flash, 0.748). On the official test set, the system ranks second on three tracks and third on the remaining one. Overall, the results show that rubric-conditioned fine-tuning is a competitive and cost-effective alternative to commercial APIs for German short answer scoring, while aggregation helps on seen questions but larger single models generalize better to unseen rubrics.
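For readers unfamiliar with the QWK metric reported in these shared-task abstracts, the following sketch computes quadratic weighted kappa from scratch. It assumes integer labels in the range 0 to `n_classes - 1`. The function name and the toy labels are illustrative and do not reproduce the official scorer.

```python
from typing import List


def quadratic_weighted_kappa(y_true: List[int], y_pred: List[int],
                             n_classes: int) -> float:
    """Quadratic weighted kappa for integer labels in [0, n_classes - 1]."""
    if n_classes < 2:
        raise ValueError("at least two classes are required")
    n = len(y_true)
    # Observed confusion matrix: rows are gold labels, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    num = 0.0
    den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = ((i - j) ** 2) / ((n_classes - 1) ** 2)
            expected = hist_true[i] * hist_pred[j] / n  # chance agreement
            num += weight * observed[i][j]
            den += weight * expected
    # If every label falls in one class, disagreement is impossible.
    return 1.0 - num / den if den > 0 else 1.0


gold = [0, 1, 2, 1]
print(quadratic_weighted_kappa(gold, gold, n_classes=3))  # 1.0
```

Perfect agreement yields 1.0, and predictions that agree with the gold labels only at chance level yield 0.0. Larger label distances are penalized quadratically.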
AMATI at BEA 2026 Shared Task 2: Automatic Short Answer Grading with Inductive Logic Programming and a Large Language Model
Alistair Willis | Aisling Third
We discuss the AMATI submission to the BEA 2026 Shared Task on Rubric-based Short Answer Scoring for German. Our neuro-symbolic system uses a combination of symbolic rules, automatically learned with a form of Inductive Logic Programming, and the Mistral-large language model. We wanted to investigate whether the combination would improve overall grading performance, while using the automatically induced symbolic rules for explainability, and the LLM for robustness. We find that the combination of approaches resulted in improved overall performance for the 3-way task. However, including the symbolic rules did not improve upon Mistral’s performance in the 2-way test. This paper presents our approach to the unseen answers challenges. Our team finished 6th out of 9 in the 2-way challenge, and 5th out of 8 in the 3-way challenge. In the 3-way challenge, neither our symbolic system nor the use of Mistral alone would have placed higher than 6th of the 8 competitors, illustrating the improvement of the combined approach over either of the individual approaches.
IWM-DKM at BEA 2026 Shared Task 2: Supplementing Supervised Fine-Tuning for Rubric-Based Short Answer Scoring
Kate Belcher | Marius De Kuthy Meurers | Kordula De Kuthy | Detmar Meurers
In this paper, we present the IWM-DKM team submissions to the BEA 2026 Shared Task 2: Rubric-based Short Answer Scoring for German. We systematically explored how fine-tuned language models can be reliably employed for short answer scoring, for which three aspects turn out to be particularly beneficial: supplementing the fine-tuning process with generated domain expertise, restructured rubrics, and thinking traces. To increase the robustness of the scoring, we combine distinct approaches in an ensemble. Our best submissions finished in first place across all tracks, indicating promise for the further application of these strategies in automatic scoring.
RETUYT-INCO at BEA 2026 Shared Task 2: Meta-prompting in Rubric-based Scoring for German
Ignacio Sastre | Ignacio Remersaro | Facundo Díaz | Nicolás De Horta | Luis Chiruzzo | Aiala Rosá | Santiago Góngora
In this paper, we present the RETUYT-INCO participation at the BEA 2026 shared task "Rubric-based Short Answer Scoring for German". Our team participated in track 1 (Unseen answers three-way), track 3 (Unseen answers two-way) and track 4 (Unseen questions two-way). Since these tracks required scoring short student answers using specific rubrics, we looked for ways to handle the changing nature of the task. We created a method called Meta-prompting. In this approach, an LLM creates a custom prompt based on examples from the Train set. This prompt is then used to grade new student answers. Along with this method, we also describe other approaches we used, such as classic machine learning, fine-tuning open-source LLMs, and different prompting techniques. According to the official results, our team placed 6th out of 8 participants in Track 1 with a QWK of 0.729. In Track 3, we secured 4th place out of 9 with a QWK of 0.674, and we also placed 4th out of 8 in Track 4 with a QWK of 0.49.
SDPA at BEA 2026 Shared Task 2: Efficient LLM Fine-Tuning for Rubric-based Short Answer Scoring
Zhexiong Liu | Jing Zhang
Automated short-answer scoring (ASA) is an important yet challenging task in educational assessment as it aims to evaluate open-ended student responses against predefined scoring rubrics that are often interrelated. Although large language models (LLMs) have demonstrated impressive capabilities in text understanding and reasoning, their application to ASA has primarily focused on prompt-based inference, largely due to the limited availability of annotated data required for effective model training. In this work, we investigate parameter-efficient fine-tuning strategies for LLMs using ASA annotations in German. Our experiments show that fine-tuned LLMs consistently outperform both prompt-based and ensemble-based language models, suggesting domain-adaptive LLM fine-tuning is more effective than prompting alone for ASA.
Proceedings of Beyond Alignment: Transdisciplinary Conversations on Human-AI Futures
Malihe Alikhani | Camille Gagnier | Lauren M. E. Goodlad | Dan Roth | Mark Sammons | Matthew Stone
“I Was a Young AI”: On Probing the Effectiveness of Intervening on Anthropomorphic AI System Outputs
Su Lin Blodgett | Myra Cheng | Alexandra Olteanu
We see growing concerns about how the increasingly pervasive deployment of AI systems whose outputs appear human-like might impact people. These concerns have already motivated work both examining what makes such outputs appear human-like, as well as developing interventions to help reduce perceptions of human-likeness or mitigate adverse impacts. In this paper, we report on an exploratory crowd study we designed to examine challenges for assessing the effectiveness of interventions, including whether interventions intended to minimize perceptions of human-likeness also mitigate adverse impacts. We find variations both in what kinds of outputs different participants deem more human-like, as well as in their preferences for human-like outputs. Even when participants seem to prefer the outputs they deem more human-like, many of them also recognize that such outputs can have adverse impacts. Drawing on these results and prior work, we discuss challenges to and considerations for assessing the effectiveness of interventions.
Proceedings of The Big Picture v2: Crafting a Research Narrative
Yanai Elazar | Allyson Ettinger | Nora Kassner | Sebastian Ruder
From Natural Language to Certified Geometry Proofs: A Survey of LLM-Augmented Verification and Neuro-Symbolic Theorem Proving
Ioannis Tzachristas | Georgios Tzachristas
Large Language Models (LLMs) can produce convincing geometric arguments, yet their outputs are not reliable enough to be treated as proofs without independent verification. In parallel, symbolic geometry tools (e.g. automated theorem provers in dynamic geometry systems) offer strong rigor guarantees, but require formalized inputs and can struggle with problem formalization, auxiliary construction, and proof presentation. This survey synthesizes work at the intersection of these lines: hybrid LLM–symbolic systems for geometry that (i) translate natural language and diagrams into formal constraints, (ii) search for solution plans and proof steps using learned or heuristic methods, and (iii) verify the resulting steps using symbolic provers or proof assistants. We propose a taxonomy organized around (a) the role of the LLM in the pipeline (parser, strategist, prover, critic), (b) the target proof artifact (answer-only, informal proof, semi-formal step trace, or kernel-checked formal proof), and (c) the verification backend (numeric testing, algebraic provers, synthetic provers, and proof-assistant kernels). We review representative systems in NLP and AI (e.g. GeoS, Inter-GPS, FormalGeo, AlphaGeometry, AutoGPS, and recent heuristic-only deductive solvers), and connect them to broader neurosymbolic paradigms for faithful reasoning (e.g. SatLM, LINC, and autoformalization). Finally, we outline evaluation protocols emphasizing step-level soundness and robustness, and we discuss open problems in multimodal formalization, handling of non-degeneracy conditions, human-readable certified proofs, and reproducibility.
Open Problems Solved by LLMs? A Survey of Verifiable Mathematical Discovery
Ioannis Tzachristas | Georgios Tzachristas | Aifen Sui
Recent years have produced a small but rapidly growing set of results where Large Language Models (LLMs) - usually embedded in a search-and-verification loop - advance the state of the art on problems previously regarded as "open" in the pragmatic sense of lacking a best-known construction, bound, or proof certificate. This paper surveys that emerging line of work with a Big Picture emphasis: what makes these successes possible, what should count as "solved", and what design patterns generalize? We (i) propose an evidence ladder for interpreting "LLM solved an open problem" claims, (ii) map mathematical subfields by difficulty dimensions that matter for LLM-based discovery, (iii) curate a timeline of key breakthroughs leading to verifiable discovery systems, and (iv) synthesize the techniques and frameworks - tool use, retrieval, search, and verification - that repeatedly appear in successful case studies. We give particular attention to formal-methods backends common in security and verification contexts, including Linear Temporal Logic (LTL) and Satisfiability Modulo Theories (SMT) solvers, as scalable middle-layer verifiers between lightweight tests and proof assistants. We close with an evaluation and reproducibility checklist aimed at making the next wave of claims easier to trust, reproduce, and build upon, while separating peer-reviewed or certificate-backed results from fast-moving community reports that are useful signals but not yet stable evidence.
Current hallucination detection systems operate under a flawed assumption: that model outputs deviating from factual grounding are uniformly problematic regardless of task context, modality, or cultural setting. Through analysis of computational humor as a motivating case study, we demonstrate that identical model behaviors require radically different evaluations depending on context. We propose reframing hallucination detection as task-output alignment assessment, introducing a three-dimensional framework spanning factual grounding requirements, novelty requirements, and risk tolerance.
Challenging the Myth: A Research Arc on LLMs as Human Simulacra
Simon Münker | Achim Rettinger | Damian Trilling
When Large Language Models (LLMs) combined with prompt-based approaches as human simulacra emerged, they promised revolutionary shortcuts. Models trained on vast internet corpora may replicate human behavior and communication through text-based alignment. The initial optimism of the NLP community positioned LLMs as universal human proxies capable of replacing participants in surveys, generating authentic social media content, and simulating diverse cultural perspectives. We systematically dismantle this "myth of universal generalization" and document a shift toward methodological rigor. Our research reveals fundamental limitations: LLMs exhibit inhuman response patterns in psychometric assessments and produce detectable synthetic content. We analyze the difference between superficial linguistic fluency and genuine human-like representation, and reframe the current paradigm from asking "can LLMs replace humans?" to "under what validated conditions might LLMs serve as useful research components in social sciences?" Our work shows how interconnected research efforts challenge foundational assumptions and establishes best practices for deploying LLMs as human simulacra.
A socio-technical gap exists between how NLP systems are developed and evaluated and how people use them in practice. To help close this gap, I propose a direction for scientific progress in NLP centered on advancing trustworthy AI-mediated communication between humans, using cross-lingual and cross-cultural interaction as a stress test for this goal – settings where common ground is hard-won, miscommunication can go unnoticed, and human users often lack the means to independently evaluate AI outputs. I outline a research agenda emphasizing two complementary requirements spanning both sides of the interaction. On the model side, I study how multilingual systems access and use knowledge across languages, and when they systematically privilege sources in certain languages. On the user side, I design decision-support mechanisms and evaluate how they shape users’ reliance on imperfect outputs. Taken together, these results motivate future work for aligning multilingual NLP with real communicative practice, with the goal of building AI systems that more reliably serve diverse communities. This paper summarizes and draws heavily on my PhD thesis proposal.
Challenging Quadratic Attention - A Holistic View On the Rise of Alternative Language Model Architectures
Alexander M. Fichtl | Jeremias Bohn | Josefin Kelber | Edoardo Mosca | Georg Groh
Transformers have dominated sequence processing tasks for the past seven years—most notably language modeling. However, the inherent quadratic complexity of their attention mechanism remains a significant bottleneck as context length increases. We review and distill the recent efforts to overcome this bottleneck, including advances in (sub-quadratic) attention variants, recurrent neural networks, state space models, and hybrid architectures. We critically analyze approaches regarding compute and memory complexity, benchmark results, and fundamental limitations to assess whether the dominance of pure-attention transformers may soon be challenged, which we consider possible, particularly in domain-specific and edge-device applications.
Why Low-Resource NLP Needs More Than Cross-Lingual Transfer: Lessons Learned from Luxembourgish
Fred Philippy | Siwen Guo | Jacques Klein | Tegawendé F. Bissyandé
Cross-lingual transfer has become a central paradigm for extending natural language processing (NLP) technologies to low-resource languages. By leveraging supervision from high-resource languages, multilingual language models can achieve strong task performance with little or no labeled target-language data. However, it remains unclear to what extent cross-lingual transfer can substitute for language-specific efforts. In this paper, we synthesize prior research findings and data collection results on Luxembourgish, which, despite its typological proximity to high-resource languages and its presence in a multilingual context, remains insufficiently represented in modern NLP technologies. Across findings, we observe a fundamental interdependence between cross-lingual transfer and language-specific efforts. Cross-lingual transfer can substantially improve target-language performance, but its success depends critically on the availability of sufficiently high-quality, task-aligned target-language data. At the same time, such resources, particularly in low-resource settings, are typically too limited in scale to drive strong performance on their own. Instead, such resources reach their full potential only when leveraged within a cross-lingual framework. We therefore argue that cross-lingual transfer and language-specific efforts should not be viewed as competing alternatives. Instead, they function as complementary components of a sustainable low-resource NLP pipeline. Based on these insights, we provide practical guidelines for integrating and balancing cross-lingual transfer with language-specific development in sustainable low-resource NLP pipelines.
Building Arabic NLP from the Ground Up: Twenty Years of Lessons, Failures, and Open Problems
Wajdi Zaghouani
This paper reflects on twenty years of building NLP resources and research infrastructure for Arabic, a language spoken by hundreds of millions yet historically underserved relative to languages such as English or Chinese. The first decade focused on foundational linguistic infrastructure; the second shifted toward computational social science, social media analysis, and socially oriented applications. Rather than cataloguing outputs, the paper examines what the experience of building them revealed. Three counterintuitive lessons emerge: building datasets is as much a social process as a technical one; communities formed around shared tasks often matter more than the tasks themselves; and moving from language resources to computational social science exposes challenges that traditional NLP training does not address. We discuss three failures: a depression detection corpus that never reached clinical practice, a period of spreading across too many shared tasks without sufficient depth, and a long-standing assumption that Modern Standard Arabic infrastructure would transfer cleanly to dialectal tasks. These experiences suggest that the hardest problems in developing NLP for underserved communities are not linguistic but social, institutional, and epistemic, and require competencies the field rarely teaches.
Speaking of Language: Reflections on Metalanguage Research in NLP
Nathan Schneider | Antonios Anastasopoulos
This work aims to shine a spotlight on the topic of metalanguage. We first define metalanguage, link it to NLP and LLMs, and then discuss our two labs’ metalanguage-centered efforts. Finally, we discuss four dimensions of metalanguage and metalinguistic tasks, offering a list of understudied future research directions.
Harnessing the Latent Space: From Steering Vectors to Model Calibrators for Control and Trust
Nishant Subramani
Language models have changed from unreliable text generators to highly capable large models with trillions of parameters. Capability increases come hand-in-hand with increases in scale, making understanding the internal representations of models more challenging. Since millions of users increasingly rely on language models to interact with external tools or make decisions in medium or high-stakes scenarios, we need to establish control over model behavior and know when to trust model outputs. In this paper, we discuss our contributions on harnessing the latent spaces by proposing steering vectors for control and developing latent space-based model calibrators for trust. Together, our contributions help demystify the latent spaces of language models and offer new insights into how to harness model internals to build more trustworthy language technology.
Language models are increasingly used to quantify cultural phenomena, but what makes such measurement distinctively cultural? This paper argues that NLP work on culture is a material-discursive practice: the apparatus—model, data, annotation, evaluation—participates in constituting the cultural reality it measures, rather than passively recording it. Drawing on Karen Barad’s concept of the agential cut—the contingent boundary between phenomenon and instrument—I show that the apparatus’s substantive design choices draw such boundaries, and that the boundary is entangled from the start because language models have already internalized much of the cultural material they measure. I illustrate this through three case studies on television and film dialogue and two examinations of the apparatus itself: erasure of character names as cultural markers, and attunement to historically distant Restoration drama. This big picture analysis proposes a research program that is theory-driven, empirically rigorous, and culturally contingent, treating each agential cut as a conscious commitment.
Memorisation in deep learning is undergoing a paradigm shift; it is increasingly recognised as a mechanism that can support, rather than hinder, generalisation. This is particularly relevant in NLP, where language combines compositional, generalisable structure with non-compositional expressions such as idioms, requiring memorisation from models and humans alike. My PhD work investigated memorisation in transformer models in generic terms, and through the lens of (non-)compositionality, from both data and model-internal perspectives. I analysed which training examples require memorisation, whether memorisation supports generalisation, and where memorisation occurs within model layers. I also studied how transformers process non-compositional idiom translations and how they balance compositional generalisation with non-compositional memorisation. Based on my findings, I stress that memorisation is an inherent part of learning natural language, can be beneficial, and is partially predictable. Yet it is not cleanly separable from generalisation, both at the level of data and of model parameters. Here, I summarise those findings and reflect on my PhD work.
BioNLP 2026
The Divergence Hypothesis: Unmasking Lexical Interference and Label Bias in Mental Health NLP
Moustafa Hassan
Computational mental health (CMH) classifiers often degrade under distribution shift because human annotators and distant-supervision pipelines reward different linguistic signals. We introduce TSS (Triple-Stream Stress probe), a multi-channel diagnostic framework that decomposes text into (A) lexical character n-grams, (B) a small, mostly content-free morpho-syntactic channel, and (C) a 154-feature psycholinguistic style channel. Across four English datasets (N = 12,906), TSS reveals a lexical interference effect: adding lexical features to the style channel reduces Macro-F1 on human-labeled data (mean drop 0.072, p 10??) but not on auto-labeled data. We propose Degree of Divergence (DoD), a difference-in-differences statistic adapted from econometrics for label-source auditing, with instance-level bootstrap inference; the headline estimate is DoD(BC?A) = 0.0374, 95% CI [0.0097, 0.0651], p = 0.0032. A platform-stratified Twitter-only DoD (which removes the Reddit vs. Twitter contrast) reproduces the pattern with bootstrap inference: DoD??,BC?A = +0.096 (p < 0.001) and DoD??,AC?A = −0.089 (p < 0.001). Interventional masking (pos_only) retains approximately 95–99% of Channel C’s performance after destroying content words on human datasets, indicating that the style channel does not rely primarily on lexical surface form. TSS is positioned as a diagnostic audit framework, not a clinical screening tool: it flags label-source-specific shortcut learning before generalization claims are made.
Towards Unified Factuality Evaluation for Biomedical QA and Summarization: Aligning Metrics with Clinical Use-Cases
Mahule Roy | Subhas Roy
Large language models achieve strong performance on biomedical question answering and summarization benchmarks, yet traditional evaluation metrics often fail to detect clinically significant factual errors. We introduce a unified evaluation framework that combines reference-based measures with evidence-grounded factuality verification to assess biomedical text generation. Evaluating four open-source models across three benchmarks (BioASQ, PubMedQA, MedLFQA), we find that 13.4–24.7% of generated claims are contradicted and 23–41% are unsupported, despite high lexical overlap scores. Our proposed Fact-Aligned Score (FAS) correlates strongly with claim-level verifiability (rho=0.68), substantially outperforming ROUGE-L (rho=0.41). We release an open-source toolkit with model outputs and analysis scripts to support reproducible factuality evaluation and safer deployment of biomedical LLMs.
Using Synthetic Records to Improve Automated Identification of Seizure Freedom in Clinical Text about People with Epilepsy
Stephen Barlow | Yujian Gan | Joe Davies | Joel Winston | James Teo | Mark Richardson | Ben Holgate
Seizure freedom is a key clinical outcome for people with epilepsy (PWE), yet it is primarily recorded in free-text notes and letters in the United Kingdom, making it difficult to aggregate and track at scale. This paper introduces a generative LLM-based pipeline boosted by synthetic data to identify a PWE’s seizure freedom status in clinicians’ records. We fine-tuned seven different LLMs with between 4 and 14 billion parameters using LoRA to compare models trained on synthetic records against those trained on expert-annotated records. The best-performing configuration, based on Qwen-2.5-14B, was trained entirely on synthetic records and used chain-of-thought (CoT) reasoning (both generated by GPT-5). This achieved an F1 score of 0.90±0.02 on double-annotated test data and outperformed the equivalent model trained on authentic clinician records, which achieved 0.87±0.04. The synthetically trained models also have the benefit of outputting their CoT reasoning process for greater decision-making transparency and can also make use of the unused supervised training data for significantly increased test examples. This work has implications for monitoring a key treatment outcome for PWE automatically and at scale.
Analyzing Prompt Design Choices in Biomedical Information Extraction for Low-Resource Languages
Ayesha Khatun | Kadir Bulut Ozler | Steven Bethard | Egoitz Laparra
This paper studies how to improve biomedical named entity recognition (NER) using large language models (LLMs), especially for low-resource languages like Bangla and Basque. The main goal is to understand how different prompt styles and output formats affect model performance. The study finds that the way we design prompts is very important. Among all methods, question-style prompting works best across all languages. It helps the model understand the biomedical task more clearly and improves accuracy. In fact, improvements are much greater in Bangla and Basque compared to high-resource languages like English and Spanish. Another key finding is about the output format. Traditional BIO tagging (labeling each word) performs poorly with LLMs because it is strict and sensitive to small errors. Instead, span-based extraction (directly extracting text phrases) works much better and gives higher F1 scores. This is because LLMs naturally generate text spans rather than token-level labels. The paper also analyzes errors. Common problems include hallucination, missing entities, and boundary mistakes. Translation-based prompts can reduce hallucination, while question-style prompts reduce empty outputs in biomedical NER. Overall, the study shows that choosing the right prompt and output format is very important, especially for low-resource high-vocabulary languages. It provides useful guidance for building better multilingual medical information extraction systems.
Hierarchy-Aware Hyperbolic and Semantic Reranking for Ontology-Based Phenotype Linking
Thomas Labbe | Moussa Baddour | Axel Bonesteve | Paul Rollier | Marie De Tayrac | Olivier Dameron
Extracting structured knowledge from unstructured text is a fundamental challenge in machine learning, particularly for concepts organized within complex hierarchical ontologies. In genomics, identifying phenotypes from clinical narratives is crucial for diagnostic precision, yet current methods struggle with contextual interpretation and subtle clinical descriptions. We present a hierarchy-aware workflow for ontology-based phenotype linking that combines semantic and hierarchical signals. Our approach integrates Large Language Models for span detection with retrieval and a hybrid reranking strategy using both Euclidean (semantic) and hyperbolic (hierarchical) embeddings trained on the Human Phenotype Ontology. We show that while hyperbolic embeddings alone do not outperform standard semantic retrieval, they provide complementary structural signals that improve performance over strong baselines when combined with Euclidean representations. In particular, the hybrid approach outperforms existing state-of-the-art methods and yields more hierarchically coherent predictions, especially in settings involving implicit phenotype mentions. Experiments on a public benchmark (ID-68) and a newly released clinical dataset (CHU-50), publicly released with code and data, highlight both performance gains and improved alignment with ontology structure. We further introduce a hierarchy-aware evaluation framework that reflects clinical relevance beyond exact-match metrics.
Agentic Feature Selection via LLM for Epileptic Seizure Detection
Aizierjiang Aiersilan | Xiaodong Qu
Automated epileptic seizure detection from electroencephalography (EEG) signals is a clinically important task in which feature selection is typically performed using purely statistical criteria. We investigate whether a small instruction-tuned large language model (LLM) can guide iterative feature selection for binary seizure detection on the Epileptic Seizure Recognition dataset (11,500 samples, 178 features). The LLM agent (Qwen2.5-1.5B-Instruct) receives five complementary statistical summaries and selects a feature subset through multi-round reasoning. The agent achieves 96.5% accuracy and 0.911 F1 with 40 features, compared to 97.9% accuracy and 0.946 F1 for the best full-feature baseline (SVM-RBF on 178 features). Critically, 39 of the agent’s 40 features coincide with the top-39 mutual-information features, and a deterministic Top-39 MI filter, evaluated by the same Random Forest classifier, attains the same 96.5% accuracy and 0.911 F1. We therefore present this work as an empirical baseline: at the 1.5B-parameter scale, the LLM behaves close to a univariate MI ranker. We situate the result against the recent LLM-based feature selection literature and enumerate the ablations and multi-dataset extensions required to determine whether larger or domain-specialized LLMs add value beyond statistical filtering.
Training Biomedical Retrievers From Large-Scale Citation Contexts
Xing David Wang | Duy Le Thanh | Ulf Leser
The MedCPT model has demonstrated that strong biomedical retrievers can be trained using proprietary PubMed search logs. In this work, we study whether freely available citation sentences are sufficient to train similarly effective models. We construct a large-scale training dataset of ~62 million citation sentence-abstract pairs extracted from PubMed Central. We train a lightweight BERT-based retriever-reranker model called CiteRec on this dataset and evaluate it across three benchmark settings: (a) the biomedical subset of BEIR for information retrieval, (b) SciRepEval for generalizable scientific document embeddings, and (c) CitancePlus, a new set of ~90 thousand citation sentence-abstract pairs for PubMed-scale citation recommendation. We show that CiteRec performs competitively with MedCPT on the biomedical BEIR subset and outperforms it on SciRepEval. On CitancePlus, CiteRec achieves strong performance for citation recommendation over the full PubMed corpus, outperforming both MedCPT and a substantially larger Qwen3-Embedding-8B retriever.
Reliable Automated Triage in Spanish Clinical Notes: A Hybrid Framework for Risk-Aware HIV Suspicion Identification
Rodrigo Morales-Sánchez | Soto Montalvo | Raquel Martínez
Standard clinical Natural Language Processing (NLP) benchmarks often yield inflated metrics by forcing deterministic classification on ambiguous instances, thereby obscuring the clinical risks of overconfident predictions. To bridge this gap, we propose a risk-aware hybrid selective classification framework, evaluated on early Human Immunodeficiency Virus suspicion identification in Spanish clinical notes. Our dual-verification approach explicitly decouples aleatoric uncertainty through Mondrian conformal prediction and epistemic uncertainty using a Multi-Centroid Mahalanobis Distance veto. Empirical evaluations reveal that standard uncertainty metrics and baseline classifiers are structurally insufficient for safe medical triage, suffering severe coverage collapse when forced to operate under strict reliability constraints. In contrast, by demanding that clinical narratives pass both probabilistic and geometric safeguards, the proposed framework successfully isolates a highly trustworthy operational domain. The obtained results show that explicit, decoupled uncertainty quantification is essential for translating biomedical NLP into responsible clinical practice.
SciFact is a widely used benchmark for scientific claim verification (645 citations, included in the BEIR evaluation suite). We present, to our knowledge, the first systematic annotation audit of its development and training sets, combining automated screening with a small language model ($0.11 in API fees) and exhaustive manual verification against source publications. We identify 11 gold-label errors in the development set (5.3%, 95% CI 2.7–9.2%, of 209 audited claim–document pairs) and 13 in the training set (2.3%, 95% CI 1.2–3.9%, of 564 audited pairs). The dev errors exhibit a directional asymmetry (9 of 11 mislabel a claim as SUPPORT; one-sided binomial p=0.033, two-sided p=0.065) and fall into four recurring types. Correcting the dev labels raises binary macro-F1 by 1.7–3.8 points across GPT-5.4 (mini, nano) and Claude Haiku 4.5; gains are larger in 3-way evaluation when mislabeled evidence is recast as NEI (e.g., +9.2 with Haiku 4.5). The binary range is comparable in magnitude to inter-system margins on the SciFact leaderboard. A simple claim-only probe with Haiku 4.5 does not support label memorization as the main explanation for these gains. We release corrected annotations and a blind annotator packet, and recommend that benchmark users prefer the corrected release going forward.
BioRAG: A Systematic Ablation Study of Retrieval Strategies for Biomedical Question Answering
Krushil Bhojani | Mayank Waghmare | Hima Bindu Nandyala
Retrieval strategy selection is a critical but understudied design decision in biomedical RAG systems. Existing evaluations rely on lexical metrics that miss answer grounding, or require proprietary infrastructure that limits reproducibility. We present BioRAG, a head-to-head ablation of seven retrieval strategies on BioASQ-13b, evaluated using four RAGAs metrics with a locally deployed judge at zero monetary cost. Hybrid BM25 plus dense retrieval with Reciprocal Rank Fusion achieves faithfulness of 0.534 and context recall of 0.507, improvements of 50% and 85% over naive dense retrieval, confirmed across three random seed re-samples. HyDE improves faithfulness by 14% but reduces context precision by 52%, a tradeoff not previously documented on BioASQ. No single strategy dominates all four metrics, indicating that strategy selection must be application-driven. Sensitivity analysis across k in {3,5,10} confirms ranking stability. A domain mismatch diagnostic confirms 2% corpus coverage failure. The full pipeline runs on consumer hardware without paid APIs, directly addressing BioNLP 2026’s emphasis on reproducibility and evaluation frameworks for health-related applications.
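As an illustration of the fusion step named in the abstract above, the following is a minimal Python sketch of Reciprocal Rank Fusion. It is not the authors' implementation: the function name, the document IDs, and the smoothing constant k=60 (a commonly used default) are assumptions made for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed default, not from the paper.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from a lexical (BM25) and a dense retriever.
bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d4", "d3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Because the fusion uses only ranks, the two retrievers' raw scores never need to be put on a common scale, which is one reason this scheme is a convenient way to combine lexical and dense results.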
Post Hoc Agentic Refinement for Improving Precision in Multilingual Clinical Text De-identification
Justin Xu | Alistair Johnson | Thomas Lin | David Eyre | Rodolfo Quispe
De-identification systems prioritize recall to protect privacy, but excessive over-tagging reduces data utility. We propose an agentic refiner that reviews high-recall annotations using lightweight tools (validation functions, adaptive context retrieval, persistent to-do state, and modular review skills) to improve precision while minimizing recall loss. Experiments across three multilingual datasets show that the agent achieves significant improvements to binary precision. To support fine-grained analysis, we further introduce a synthetic error dataset of common and systemic failure modes, on which the agent corrects 99% of injected errors in the medical datasets. Our results suggest that agent-based refinement provides a flexible and effective mechanism for improving de-identification precision as a modular extension to existing high-recall systems.
Do Syntactic Features Help Biomedical Relation Extraction? An Empirical Study of Verb Token and Dependency Graph Augmentation
Mustafa Sikder | Ernest Kwegyir-Afful
We investigate whether explicit syntactic features improve transformer-based biomedical relation extraction when added to typed entity marker pooling. We evaluate two augmentation strategies on top of BiomedBERT: (1) verb token augmentation, which concatenates the hidden state of the dependency root verb to the entity representations, and (2) a two-layer graph convolutional network (GCN) that refines encoder hidden states over the dependency parse before entity pooling. We experimented on three biomedical datasets: ChemProt, DDI, and AIMed with three random seeds. We found neither strategy consistently outperformed the entity-only baseline. The GCN yielded modest gains on AIMed (+0.007 F1) and ChemProt (+0.003 F1) but decreased performance on DDI (-0.013 F1). Verb token augmentation helps only on AIMed (+0.004 F1) and underperforms on the other two datasets. A syntactic characterization of the datasets reveals that DDI has substantially higher passive voice usage (50.7% of relation-bearing sentences) than AIMed (27.0%) or ChemProt (30.9%), suggesting that syntactic augmentation is more effective when sentences exhibit active verbal structure with semantically informative predicates. These results suggest that corpus-level syntactic characteristics, particularly passive voice usage, may moderate the utility of explicit syntactic augmentation, though the small magnitude of observed differences warrants caution in interpretation.
Beyond Knowledge Graphs: PubMedBERT Embeddings as a Competitive Standalone Modality for Drug Re-purposing
Rishik Kondadadi | John E. Ortega
Drug repurposing methods rely heavily on knowledge graph (KG) embeddings, but building and curating these graphs takes considerable effort. We present two findings on the Hetionet drug-disease benchmark and an epilepsy ranking task. First, PubMedBERT text embeddings, fed through the same downstream classifiers and identical 10-fold splits as four re-trained KG baselines (TransE, ComplEx, DistMult, RotatE), reach AUROC 0.910, above all four (best: RotatE, 0.854); a Random Forest on the same vectors scores 0.880. The comparison is asymmetric in one important way: PubMedBERT was pretrained on the literature Hetionet was curated from, so the result is best read as “text-with-literature-supervision vs. graph-only,” and a head-to-head with text-augmented KG methods (KG-BERT, TxGNN) is left as follow-up. Second, across all seven combinations of text, molecular (ECFP4), and gene expression (LINCS L1000) features, cross-attention fusion of weaker modalities into text consistently degrades performance, despite a gated mechanism intended to suppress unhelpful modalities; the residual path forces the strong modality to absorb noise. The model also ranks proconvulsants (amoxapine, flumazenil) near the top, because text embeddings encode strength of association with a disease but not its direction.
When Demographic Sensitivity Isn’t What It Seems: Baseline-Aware Counterfactual Audits for Clinical NLP
Hyunwoo Yoo
Clinical NLP systems are increasingly used for triage support, prediction, and decision assistance in EHR-based settings, where demographic fairness is a critical concern. A common evaluation approach is counterfactual demographic perturbation: modifying attributes such as age or sex while holding clinical evidence fixed and measuring output changes. However, we show that such counterfactual audits can be misleading when interpreted in isolation. Across three clinical LLMs, we find that non-demographic control perturbations (e.g., paraphrases) often induce output variability comparable to or greater than demographic edits. This can contribute to overestimation or misinterpretation of demographic bias. To address this, we propose a baseline-aware audit framework that explicitly compares demographic perturbations against control baselines. Our analysis reveals that (i) label-level stability can mask substantial variation in generated rationales and recommendations, and (ii) age-based perturbations generally induce larger effects than sex-based ones in borderline cases. Crucially, we identify a high intrinsic instability ("noise floor"; 0.46–0.71 Jaccard instability) in clinical LLM generations, while additional matched-metric analyses show that demographic perturbations are often comparable to non-demographic baseline variability. These findings highlight a key limitation of existing fairness evaluations: without establishing appropriate baselines, apparent demographic sensitivity may be over- or mis-attributed to bias rather than broader generative instability. We argue that baseline-aware counterfactual audits, which explicitly compare demographic effects against intrinsic model noise, provide a more reliable lens for evaluating clinical NLP systems in high-stakes settings.
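The abstract above reports a "Jaccard instability" between generations without defining it. One plausible reading, sketched below in Python, is one minus the Jaccard similarity of the two outputs' token sets. That definition, the whitespace tokenization, the function name, and the example sentences are all assumptions for illustration; the paper may compute the quantity differently.

```python
def jaccard_instability(text_a, text_b):
    """Return 1 - |A ∩ B| / |A ∪ B| over lowercase whitespace tokens.

    Assumed definition for illustration: 0.0 means identical token sets,
    1.0 means no shared tokens.
    """
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # two empty outputs are treated as identical
    return 1.0 - len(tokens_a & tokens_b) / len(union)


# Hypothetical model outputs for an original note and a paraphrased control.
original_output = "start antibiotics and admit"
control_output = "start antibiotics and discharge"
noise = jaccard_instability(original_output, control_output)
print(round(noise, 2))
```

Under this reading, a baseline-aware audit would compute the same quantity for a demographic edit and compare it against the value obtained for control perturbations such as paraphrases, rather than interpreting the demographic value alone.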
CoreELM: An Open-Source Framework for Aligning Large Language Models to Embedding Spaces
Brian Ondov | Chia-Hsuan Chang | Yujia Zhou | Mauro Giuffrè | Hua Xu
Text embeddings have become an essential part of a variety of language applications. However, methods for interpreting, exploring and reversing embedding spaces are limited, reducing transparency and precluding potentially valuable generative use cases. In this work, we develop an open-source, domain-agnostic framework for aligning Large Language Models to embedding spaces using the recently reported Embedding Language Model (ELM) method. We demonstrate our framework by training models to recover, summarize, and compare clinical trial abstracts from embeddings alone. In addition to inverting embeddings back to text more reliably than existing methods, our models can decode novel, interpolated embeddings into new clinical trial abstracts that human experts cannot distinguish from real ones. We further show that these generated abstracts are responsive to moving embeddings along concept vectors for age and sex of study subjects. Our public ELM implementation and experimental results will aid the alignment of Large Language Models to embedding spaces in the biomedical domain and beyond.
Uncertainty-Aware Multi-Label Routing of Clinical Text to Surveillance Pathways
Agathe Zecevic | Sebastian Zeki | Angus Roberts
Clinical decision support systems that operate across multiple downstream care pathways must first determine which pathway or pathways are relevant for a given patient. We study this routing problem in gastrointestinal surveillance, where paired endoscopy and histopathology text reports may indicate multiple concurrent conditions and therefore require multi-label routing. In this context, standard hard-label evaluation can be insufficient: a model may achieve reasonable overall performance while still excluding clinically important pathways when uncertain. We formulate gastrointestinal report routing as a multi-label uncertainty-aware classification task over six pathway labels and compare lightweight lexical baselines, frozen embedding models and a fine-tuned transformer baseline under two complementary uncertainty mechanisms: threshold-based abstention and set-valued conformal prediction. Using 1,773 paired reports from a single NHS trust with disjoint train, calibration and test splits, we evaluate both hard-routing performance and the downstream review burden introduced by uncertainty-aware prediction. The fine-tuned ClinicalBERT model achieved the strongest overall performance (0.811 subset accuracy, 0.861 macro-F1) and the lowest AURC of 0.084 under min-margin abstention. Threshold-based abstention consistently reduced exact-match routing error on accepted reports. For conformal routing at α=0.10, Mondrian calibration achieved high mean positive-label recall coverage across learned baselines (0.883–0.917). The fine-tuned model achieved 0.891 mean recall coverage with a mean prediction set size of 1.70, 0.642 candidate-label precision and 0.61 false-positive labels per report. Compared with a recall-tuned threshold baseline at similar recall, Mondrian CP produced smaller candidate sets, higher candidate-label precision and fewer false-positive pathway suggestions.
These results show that uncertainty-aware evaluation exposes clinically important failure modes missed by aggregate metrics. They also show that high-recall routing is not cost-free: set-valued prediction can reduce missed-pathway risk but must be interpreted as candidate generation for downstream review rather than automated pathway selection.
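To make the set-valued conformal routing described above more concrete, here is a minimal Python sketch of one common label-conditional (Mondrian) construction: each label gets its own threshold, calibrated on the nonconformity scores 1 - p of reports where that label is truly present. The label names, calibration scores, and helper names are hypothetical, and the paper's exact procedure may differ.

```python
import math


def conformal_threshold(positive_scores, alpha):
    """Threshold from calibration nonconformity scores of true positives.

    Uses the finite-sample rank ceil((n + 1) * (1 - alpha)); if that rank
    exceeds n, the threshold is infinite and the label is always included.
    """
    n = len(positive_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(positive_scores)[rank - 1]


def predict_set(probs, thresholds):
    """Include every label whose nonconformity (1 - p) is within its threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= thresholds[label]}


# Hypothetical per-label calibration scores (25 true-positive reports each).
calibration = {
    "pathway_a": [i / 100 for i in range(1, 26)],
    "pathway_b": [i / 50 for i in range(1, 26)],
}
thresholds = {
    label: conformal_threshold(scores, alpha=0.10)
    for label, scores in calibration.items()
}

# Hypothetical model probabilities for one new report.
candidates = predict_set({"pathway_a": 0.80, "pathway_b": 0.40}, thresholds)
print(thresholds, candidates)
```

Calibrating per label is what targets recall coverage for each pathway separately; the resulting set is a list of candidates for review, in line with the abstract's caution that it should not be read as automated pathway selection.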
MedCAT v2: a modular, extensible architecture for clinical named entity recognition and linking under real-world privacy and compute constraints
Mart Ratas | Thomas Searle | Adam Sutton | Richard Dobson
MedCAT is an open-source framework for clinical named entity recognition and linking (NER+L) widely used in research and healthcare settings. We present MedCAT v2, a re-engineered version designed to improve modularity, extensibility, and maintainability while preserving the core functionality and performance of previous releases. The new architecture introduces a registry-based component system and a flexible pipeline that enables easy substitution of components, integration of alternative methods, and future expansion, including support for pre-trained components across the full NER+L and contextualisation workflow. This enables systematic exploration of clinical NER+L design trade-offs by evaluating different components in the pipeline. Evaluation across multiple public datasets shows equivalent or improved performance compared to earlier versions, with reduced integration overhead and improved runtime flexibility. The framework also supports optional extensions such as meta-annotation and relation extraction, providing a unified and reproducible environment for clinical NLP in real-world settings.
Effects of Adaptive Pretraining in Specialized Domains for Named Entity Recognition
Jack Lynam | Sam Henry
Due to the unique concepts, syntactic structure, and vocabulary of specialized domains, it is common to train specialized language models (LMs) for their target domain. For example, BioClinicalBERT is a specialized LM designed for clinical applications. These specialized LMs are typically created starting with a foundation model (such as BERT-base) which has been pretrained for the general English domain, and then adapted to the target domain via additional pretraining. Alternatively, LMs may be pretrained from scratch on data from the target domain. Both techniques are extremely computationally expensive and, as such, these specialized LMs are often publicly released for other researchers. For some domains, such as the biomedical domain, there are many similar models available, and as a developer, this raises the question: which pretrained LM should I choose? Alternatively, in novel domains for which no specialized LMs exist, it raises different questions: Is it worth the cost to pretrain an LM from scratch? Should I adapt a general English model instead? Should I just use a general English model without adaptive pretraining? These questions are particularly salient when considering a limited budget, i.e., should I pay for compute time or for annotators to create a larger dataset? In this paper we compare results of nine LMs across nine datasets spanning the clinical, scientific, and biomedical-related social media domains. From these comparisons we draw several conclusions that can simplify the hyperparameter-tuning process and inform researchers and developers in novel domains. Broadly, these are that the effects of adaptive fine-tuning are small. If an adapted model exists in your domain, choose the one most closely related to your task. If no model exists, using a foundation model is likely sufficient.
Trade-offs in Medical LLM Adaptation: An Empirical Study in French QA
Ikram Belmadani | Oumaima El Khettari | Carlos Ramisch | Frederic Bechet | Richard Dufour | Benoit Favre
The development of large language models (LLMs) has led to increased focus on their adaptation to specialized domains and languages, yet the effectiveness of domain adaptation strategies remains unclear. We present a study of medical domain adaptation using French medical question answering (QA) as a case study. We compare continual pretraining (CPT), supervised fine-tuning (SFT), and their combination across three model families, multiple sizes, and three initialization types, explicitly disentangling adaptation effects from base model choice. We evaluate both multiple-choice (MCQA) and open-ended QA (OEQA) under greedy and constrained decoding using automatic metrics and LLM-as-a-Judge evaluation. For MCQA, CPT+SFT most often achieves the best scores, but gains over SFT are small and frequently not statistically significant, making SFT a strong and cost-effective default. For OEQA, CPT consistently improves overlap-based metrics, while SFT often degrades generation quality; instruction tuning and CPT+SFT are preferred by LLM-based evaluation. Cross-lingual experiments further show effective transfer from French adaptation to English benchmarks. Overall, we provide practical guidelines for selecting adaptation strategies under computational constraints.
PromptRad: Knowledge-Enhanced Multi-Label Prompt-Tuning for Low-Resource Radiology Report Labeling
Ying-Jia Lin | Tzu-Chin Lo | Ping-Chien Li | Chi-Tung Cheng | Chien-Hung Liao | Hung-Yu Kao
Automatic report labeling facilitates the identification of clinical findings from unstructured text and enables large-scale annotation for medical imaging research. Existing rule-based labelers struggle with the diverse descriptions in clinical reports, while fine-tuning pre-trained language models (PLMs) requires large amounts of labeled data that are often unavailable in clinical settings. In this paper, we propose PromptRad, a knowledge-enhanced multi-label prompt-tuning approach for radiology report labeling under low-resource settings. PromptRad reformulates multi-label classification as masked language modeling and incorporates synonyms from the UMLS Metathesaurus into a multi-word verbalizer to enrich category representations. By fine-tuning the PLM without additional classification layers, PromptRad requires substantially less labeled data than conventional fine-tuning. Experiments on liver CT (computed tomography) reports show that PromptRad outperforms dictionary-based and fine-tuning baselines with only 32 labeled training examples, and achieves competitive performance with GPT-4 despite using a much smaller model. Further analysis demonstrates that PromptRad captures complex negation patterns more effectively than existing methods, making it a promising solution for report labeling in data-scarce clinical scenarios. Our code is available at https://github.com/ila-lab/PromptRad.
Diagnosing Lower Extremity Arteriovenous Diseases Using Agentic LLMs
Zicen Liao | Yunhao Sun | Matthew Purver
This paper introduces LEA-Dialog, a multi-turn diagnostic dialogue dataset for lower-extremity arteriovenous diseases, together with a carefully developed diagnostic handbook and a process-aligned agentic framework for structured outpatient diagnosis. The dataset provides stage annotations for each turn and guideline-grounded probability trends, enabling evaluation beyond final diagnostic accuracy. Experiments show that the framework improves reasoning stability and reduces drift across both online and offline LLMs, with particularly large gains for smaller offline models.
KGRxn-LLM: Knowledge Graph Enhanced Large Language Models for Molecular Reaction Reasoning
Weichen Liu | Qiyao Xue | Yuyang Wu | Olexandr Isayev | Natasa Miskov-Zivanov
Large language models (LLMs) demonstrate strong general language capabilities but remain limited in chemical reasoning, particularly for tasks requiring structured, mechanistic understanding of molecular reactions. We present Knowledge Graph Reaction LLM (KGRxn-LLM), a framework that augments LLMs with a hierarchical chemical knowledge graph (KG) to ground reasoning in molecular transformations and reaction patterns. Existing benchmarks primarily emphasize reaction or molecular fact recall, providing limited assessment of reaction-level mechanistic reasoning. To address this gap, we introduce KGRxn-Bench, a benchmark of 1,200 questions designed to evaluate LLMs on reaction-centric reasoning tasks, including functional group identification, reaction type classification, and product and reagent prediction. Experimental results show that our approach of grounding LLMs in a structured KG substantially improves performance across multiple tasks and model backbones, outperforming domain-specific fine-tuned models on KG-covered splits and most hold-out splits.
MAX-EVAL-11: A Large Scale Benchmark for Evaluating Large Language Models on Full-Spectrum ICD-11 Medical Coding
Ujjwal Singh | Sarthak Deshwal | Nitish Dube | Arjun Sharma
The global transition to the ICD-11 taxonomy demands robust automated medical coding, yet comprehensive benchmarks to evaluate Large Language Models (LLMs) on this task remain absent. We introduce MAX-EVAL-11, the first large-scale benchmark for full-spectrum ICD-11 medical coding. MAX-EVAL-11 comprises 10,000 MIMIC-III discharge summaries with mapped, expert-validated ICD-11 annotations spanning 99.87% of the diagnostic taxonomy. To better reflect clinical utility, we propose a novel hierarchical evaluation framework that assigns partial credit based on ICD-11’s 5-level structure, addressing the brittleness of traditional exact-match metrics. Our evaluation of state-of-the-art LLMs reveals significant performance gaps. The best-performing model (Claude 4 Sonnet) achieves a weighted score of 0.433, outperforming both general-purpose peers and specialized medical models (MedCoder). Crucially, all models exhibit near-zero exact match rates (0–4.8%) and rely primarily on hierarchical credit, underscoring the extreme difficulty of precise ICD-11 code generation. Furthermore, the superiority of general-purpose LLMs over legacy ICD-10 medical models (with ICD-11 codelist) suggests that broad reasoning capabilities currently outweigh domain-specific training for complex taxonomy scaling.
Trustworthy NLP for Low-Resource Languages: Agent-Based Uncertainty Modeling for Hebrew Radiology Report Structuring
Hadas Ben Atya | Naama Gavrielov | Zvi Badash | Gili Focht | Ruth Cytter-Kuint | Talar Hagopian | Dan Turner | Moti Freiman
Reliable extraction of structured information from radiology reports using Large Language Models (LLMs) remains a significant challenge, particularly for complex, non-English texts such as Hebrew. This study proposes an agent-based, uncertainty-aware framework to enhance the reliability and interpretability of LLM predictions in clinical contexts. A total of 9,683 Hebrew radiology reports from Crohn’s disease patients (2010–2023) across three medical centers were analyzed. Of these, 512 reports were manually annotated for six gastrointestinal organs and 15 pathological findings, while the remainder were automatically labeled using HSMP-BERT. Structured data extraction was performed with Llama 3.1 (Llama 3-8b-instruct) employing Bayesian Prompt Ensembles (BayesPE), which utilized six semantically equivalent prompts to quantify uncertainty. An Agent-Based Decision Model aggregated prompt outputs into five calibrated confidence levels and was benchmarked against three entropy-based approaches. Model performance was assessed using accuracy, F1 score, precision, recall, and Cohen’s Kappa before and after filtering high-uncertainty cases. The agent-based model outperformed all baselines, achieving an F1 score of 0.3967, recall of 0.6437, and Kappa of 0.3006; after excluding cases with uncertainty = 0.5, the F1 score increased to 0.4787 and Kappa to 0.4258. The proposed framework improves uncertainty calibration and predictive reliability, advancing the safe deployment of LLMs in medical data extraction.
Treating Decoder-Only LLMs as Encoders: A Simple and Effective Fine-tuning Approach for Named Entity Recognition
Ken Yano | Hiroya Takamura
NER requires token-level classification using both left and right context, which makes encoder-only models like BERT naturally well-suited for the task. Decoder-only LLMs, by contrast, use causal masking during training, so their token representations lack right-side context, limiting their effectiveness on structured prediction tasks like NER despite their strong general capabilities. To address this, the authors propose fine-tuning decoder-only LLMs with causal attention replaced by full attention, combined with label-supervised discriminative training. While similar ideas exist in prior work, those studies were limited in scope. This work evaluates seven LLMs across four model families (Gemma, Qwen2.5, Llama3.1, Llama3.2) and compares full fine-tuning against LoRA. Results show that the proposed approach with an appropriate LoRA configuration outperforms encoder baselines (BERT, RoBERTa, DeBERTa), and achieves strong NER performance without auxiliary data or architectural modifications, though it does not reach SOTA on BC5CDR and CoNLL2003.
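The difference between causal and full attention described above can be illustrated with a minimal sketch. The function names are hypothetical and the masks are plain Python lists, not any particular framework's API; `mask[i][j] == 1` is assumed to mean that token `i` may attend to token `j`.

```python
def causal_mask(n):
    """Lower-triangular mask: token i sees only positions j <= i (left context)."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def full_mask(n):
    """All-ones mask: every token sees both left and right context."""
    return [[1] * n for _ in range(n)]


def visible_right_context(mask, i):
    """Number of positions to the right of token i that token i can attend to."""
    return sum(mask[i][i + 1:])
```

Under the causal mask the first token sees no right-side context, whereas under the full mask it sees every later token, which is the property token-level labeling relies on.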
A Multi-View Framework for Cross-Domain Nutrition Misinformation Detection in Social Media
Vishwaa Shah | Indika Kahanda | Andrea Arikawa | Asal Abbaszadeh | Richard Loftis
Nutrition misinformation on social media often arises from selective interpretation of scientific evidence rather than outright falsehoods, making it difficult to detect. We introduce a curated, expert-annotated Instagram dataset focused on seed oils and omega-6, two domains characterized by contested dietary claims. We evaluate feature-based, embedding-based, and transformer-based models under in-domain and cross-domain settings. Results show strong in-domain performance across all models, with Sentence-BERT achieving the highest AUPRC (up to 0.96). However, performance drops substantially under cross-domain transfer, indicating limited robustness to topic shift. Analysis suggests that while contextual embeddings capture strong in-domain semantic signals, linguistically and psychologically grounded features are more stable under distribution shift. These findings highlight the value of combining semantic and interpretable linguistic signals for robust misinformation detection.
Ontological Validation of Biomedical Topic Models: SNOMED CT Hierarchy Distance as an Automated Evaluation Metric
Ilan Rubinfeld | Sami Zaidi | Milosh Djuric | Loay Kabbani | Mouhammad Halabi | Alex Shepard
Standard coherence metrics for biomedical topic models encode no clinical knowledge and cannot detect clinically implausible topic groupings. We propose SNOMED CT Wu–Palmer hierarchy distance as a post hoc, ontology-grounded diagnostic. On vascular surgery (47,318 articles) and craniofacial surgery (27,493 articles) corpora, the metric flags clinically heterogeneous topics that coherence misses, e.g., abdominal aortic aneurysm repair grouped with deep vein thrombosis (d = 0.600). Diagnostic signals are nearly identical across eight BERTopic embedding strategies including ontology-enhanced models, but diverge across model families: BERTopic alone produces a positive within- vs. cross-topic Cohen’s d, while LDA, NMF, and Top2Vec at matched topic counts score below their own cross-topic baselines (Cohen’s d 0; Mann–Whitney p 0.99). The score is therefore sensitive to topic-model output choice, not only to embedding choice within a single pipeline. A pre-clustering screening experiment finds near-zero correlation (|?| 0.08) between embedding cosine and SNOMED CT similarity, arguing that ontological validation belongs after clustering rather than as an embedding screen. We additionally describe a two-stage UMLS-CUI stopword filter that preserves high-frequency domain-specific concepts which naive frequency filtering would discard. After one-time concept curation, the diagnostic itself is automated and requires no per-topic expert scoring.
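As a rough illustration of the Wu–Palmer measure named above, the sketch below computes similarity as 2·depth(LCS) / (depth(a) + depth(b)) on a toy is-a tree. Everything here is an assumption made for illustration: the hierarchy is invented and is not SNOMED CT content, the root is given depth 1, each concept has a single parent (SNOMED CT itself allows multiple parents), and distance is taken as 1 minus similarity, which may differ from the paper's exact definition.

```python
# Toy is-a hierarchy (child -> parent). Hypothetical, NOT real SNOMED CT content.
PARENT = {
    "vascular disorder": "disorder",
    "aneurysm": "vascular disorder",
    "abdominal aortic aneurysm": "aneurysm",
    "thrombosis": "vascular disorder",
    "deep vein thrombosis": "thrombosis",
    "fracture": "disorder",
}


def path_to_root(concept):
    """Return the concept followed by its ancestors, ending at the root."""
    path = [concept]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1 (an assumed convention)."""
    return len(path_to_root(concept))


def wu_palmer_similarity(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b)) on a single-parent tree."""
    ancestors_a = set(path_to_root(a))
    # Walking upward from b, the first shared node is the deepest common subsumer.
    lcs = next(c for c in path_to_root(b) if c in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def wu_palmer_distance(a, b):
    """Distance taken as 1 - similarity (an assumption for this sketch)."""
    return 1.0 - wu_palmer_similarity(a, b)
```

A topic-level score could then be formed by averaging pairwise distances over the concepts in a topic, though how the paper aggregates is not stated in the abstract.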
Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale
Jinghui Liu | Sarvesh Soni | Anthony Nguyen
Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect – such as similarity or utility comparisons – even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for tasks like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes – despite their task-agnostic nature – can effectively augment task-specific training for rare ICD codes.
Bridging the Version Gap: Multi-version Training Improves ICD Code Prediction, Especially for Rare Codes
Jinghui Liu | Anthony Nguyen
Clinical coding maps clinical documentation to standardized medical codes, an essential yet time-consuming administrative task that could benefit from automation. Current models on ICD coding are typically optimized for codes from a specific ICD version. However, in reality, ICD systems evolve continuously, and different versions are adopted across time periods and regions. Moreover, ICD coding suffers from the long-tail problem, and rare code performance can be a bottleneck for developing implementable models. We examine whether it is viable to train version-independent models by combining data annotated in different ICD versions, which may help address these challenges. We add ICD-9 data to the training of a modified label-wise attention model for ICD-10 prediction, and find that despite the version mismatch, adding ICD-9 yields a 27% increase in micro F1 for 18K rare ICD codes compared to training on ICD-10 alone. On 8K frequent ICD-10 codes, the multi-version training also substantially improves macro metrics, with far fewer model parameters.
EmCellLLM: Human Peri-Implantation Embryonic Cell Annotation Based on Large Language Models
Xiaorui Guo | Zhiwei Liu | Qianqian Xie | Sophia Ananiadou
The advent of single-cell RNA sequencing has enabled unprecedented resolution of cell fate decisions and regulatory mechanisms during peri-implantation human embryogenesis, in which accurate cell type annotation is a fundamental prerequisite and the first step for subsequent fate and mechanism inference. Large language models (LLMs) have demonstrated outstanding performance in various fields. However, current studies mostly rely on traditional methods and have not explored the application of LLMs to human embryonic cell annotation. The main reason is the lack of instruction tuning datasets and evaluation benchmarks. In this paper, we propose EmCellLLM, the first open-source LLMs specialized for the human embryonic cell type prediction task, obtained by fine-tuning Qwen3-8B with EmCell4Instruction, the first embryonic cell type prediction instruction dataset. To support LLM instruction tuning, we also build EmCellBench, the first benchmark for evaluating the human embryonic cell type prediction ability of LLMs. We compare our models with a variety of LLMs on EmCellBench, where our model outperforms all other open-source LLMs as well as DeepSeek.
Randomized Controlled Trials as the Gold-Standard for Evaluating LLMs: A Primer for Biomedical NLP Researchers
Vicente Ivan Sanchez Carmona | Shanshan Jiang | Bin Dong
Large Language Models (LLMs) are no longer mere laboratory objects of study. LLMs have become everyday tools in society across diverse populations and domains. In clinical contexts, LLMs have already been devised as clinical support applications. However, along with benefits, negative or adverse effects might arise, such as LLMs potentially providing psychologically distressing advice to adolescents when used for mental health support. This raises questions about the benefits of LLMs and calls for real-world evaluations: Are LLMs really helpful and effective for the purposes for which people use them, or will use them? To answer this type of question we propose to use Randomized Controlled Trials (RCTs). RCTs are considered the strictest experimental design in Medicine, Psychiatry, and Psychology, among other fields; however, the use of RCTs in the NLP field is almost negligible. In spite of the NLP field being the de facto locus of research on LLMs, other fields, prominently Medicine, are leading the RCT evaluations on LLMs. In this primer paper, we present a concise introduction to the principles of RCTs to guide NLP researchers in designing RCT studies for evaluating LLMs.
Citation-Aware Continual Pre-Training for Biomedical Language Models
Masaki Asada | Tomoki Tsujimura | Tatsuya Ishigaki | Shusaku Egami | Ken Fukuda | Hiroya Takamura
The biomedical literature contains rich structured knowledge, including citation links that encode relationships between scientific studies, but such information is typically ignored in standard language model pre-training. We propose a citation-aware continual pre-training method for decoder-only language models that incorporates citation graph information from PubMed into next-token prediction by placing citation-linked abstract pairs within a shared context. We evaluate our method on multiple biomedical QA benchmarks using two model families. Results show that citation-aware continual pre-training achieves higher average accuracy than both the original base models and citation-unaware pre-training across biomedical tasks.
TrackList: Tracing Back Query Linguistic Diversity for Head and Tail Medical Knowledge in Open Large Language Models
Ioana Buhnila | Aman Sinha | Mathieu Constant
While humans can easily produce various types of answers, such as definitions, examples or paraphrases, Large Language Models (LLMs) struggle to provide correct answers to medical questions that require diverse answer formats. In this paper, we introduce TrackList, a fine-grained linguistic and statistical analysis pipeline to investigate the impact of the pre-training data on LLMs’ answers to diverse linguistic queries. We also propose RefoMed-EN, a medical dataset consisting of 6,170 human-annotated medical terms alongside their corresponding definitions, denominations, exemplifications, explanations, or paraphrases. We investigated whether the high or low frequency of a concept (head or tail knowledge) impacts the language model’s performance for answering medical questions. We evaluated the quality of the LLM’s output using syntactic and semantic similarity metrics, statistical correlations and embeddings. Results showed that the LLM’s answer quality is highest for definition-type questions and lowest for exemplification-type questions. Additionally, we showed that for definition-type medical questions ("What is multiple sclerosis?"), LLMs are prone to paraphrase more for popular medical concepts, and less for more specialized medical knowledge.
Discharge Instructions are not One Task: Grounding Differences Between Surgical and Non-Surgical Admissions
Mayank Jobanputra | Justin Xu | Samarth Oza | Hulma Naseer | Yifan Wang | Blerta Veseli | Chandralekha Kona | Haochen Cui | David Eyre | Vera Demberg
Discharge instructions are patient-facing, safety-critical documents that guide medication use, follow-up care, and recovery after hospitalization. Because they must synthesize information across the clinical record and often include post-discharge guidance not stated verbatim in the EHR, they are a difficult target for clinical text generation. In this work, we study discharge instructions in MIMIC-IV through a grounding-first lens. Using two LLMs, we decompose each discharge instruction into medically relevant statements and verify them against the Electronic Health Record (EHR). We find that discharge instructions for Surgical admissions are much longer, averaging roughly 24–25 statements per admission versus 11–12 in Non-Surgical cases, while supported content remains similar in absolute count. The additional Surgical content is dominated by statements that are not directly stated in the record or require clinically plausible extrapolation. Through this analysis, we advocate for better grounding and completeness evaluations at a fine-grained level, establishing a foundational step toward safer and more reliable discharge-instruction generation.
PrionNER: A Named Entity Recognition Dataset for Prion Disease Biomedical Literature
An Dao | Nhan Ly | Thao Tran | Yuji Matsumoto | Akiko Aizawa
Prion diseases are rare, rapidly progressive, and fatal neurodegenerative disorders that remain difficult to diagnose, particularly in their early stages because of nonspecific clinical presentations. However, to our knowledge, there is no publicly available prion-disease-focused dataset designed to capture a broad range of clinically relevant entities from the biomedical literature. We introduce PrionNER, a manually annotated named entity recognition dataset for prion disease clinical information in PubMed abstracts. The current release comprises 317 abstracts, 2,943 sentences, and 6,955 text-bound entity annotations spanning 15 coarse-grained and 31 fine-grained clinically oriented entity types covering diseases, symptoms, diagnostics, findings, anatomy, treatments, and temporal and statistical evidence. Inter-annotator agreement reaches 81.78 exact-match F1, indicating strong annotation consistency. We benchmark supervised BERT baselines, W2NER, and zero-shot extractors on PrionNER. W2NER is the strongest supervised model, and Gemma-4-31B is the strongest zero-shot model, but the benchmark remains challenging, especially for structurally complex mentions and fine-grained clinically adjacent label distinctions. PrionNER provides a clinically grounded benchmark for prion-disease information extraction and supports research on rare-disease biomedical NLP under low-resource, fine-grained, and non-flat extraction conditions. The dataset, annotation guidelines, and evaluation scripts are available at https://github.com/daotuanan/PrionNER/
Evaluation of Multilingual Text Simplification for the Mental Health Domain: Exploring Small Language Models
Olga Pelloni | Sandra Just | Lars Bongo
Individuals with particular mental health disorders may find it difficult to learn about their own condition. Therefore, efforts have been made to create materials that explain complex medical information in simpler words, which are also beneficial for caregivers and others. However, text simplification is commonly done in English and only sporadically in other languages. In this study, we explore potential ways for language-agnostic medical text simplification for the mental health domain. Our approach is to simplify the ICD-11 articles on primary psychotic disorders in English, German and French, using small LMs and various metrics for evaluating different aspects of the texts: lexical complexity and readability. Our results show that acceptable texts were produced only in English, and that a joint analysis of Measure of Textual Lexical Diversity (MTLD) and Flesch Reading Ease (FRE) provides the most insight, capturing both the best outcomes and signaling different types of issues. The study is preliminary and requires further investigation.
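For readers unfamiliar with Flesch Reading Ease, the sketch below applies the standard English formula, FRE = 206.835 − 1.015 · (words / sentences) − 84.6 · (syllables / words). The function names and the vowel-group syllable heuristic are assumptions made for illustration; the abstract does not say which implementation or which language-specific coefficients the study used, and German or French texts are usually scored with adapted coefficients.

```python
import re


def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard English Flesch Reading Ease from raw counts (higher = easier)."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def naive_counts(text):
    """Very rough counts: sentences split on .!?, syllables as vowel groups.

    This heuristic is an assumption for the sketch, not a validated counter.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return len(words), len(sentences), syllables


if __name__ == "__main__":
    counts = naive_counts("The cat sat. The dog ran.")
    print(counts, round(flesch_reading_ease(*counts), 2))
```

Shorter sentences and fewer syllables per word both push the score up, which is why the metric is typically read alongside a lexical diversity measure rather than alone.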
BioTopicXplor: A Web Tool for Interactive Exploration of PubMed Literature through Reproducible Topics
Lana Yeganova | Donald Comeau | Won Kim | Natalie Xie | Shubo Tian | W John Wilbur | Zhiyong Lu
The rapid growth of biomedical literature presents a major challenge for organizing knowledge and identifying emerging research trends. While PubMed provides effective access to relevant articles, it does not support understanding the conceptual structure of document collections. Existing tools rely on predefined features, ontologies, or parameter-sensitive clustering methods, limiting their ability to uncover fine-grained, data-driven topics in a reproducible manner. We present BioTopicXplor, an on-demand web server for interactive exploration of biomedical literature derived from arbitrary PubMed queries. The system integrates ConvexTopics, a convex optimization–based topic modeling framework that guarantees convergence to a global optimum and eliminates the need for predefined parameters. This enables the generation of reproducible and fine-grained topic structures across large document collections. Given a PubMed query, BioTopicXplor retrieves relevant articles, performs topic discovery, and organizes the resulting subtopics into a hierarchical structure of higher-level themes. To enhance interpretability, the system incorporates large language models to generate concise, literature-grounded summaries and descriptive titles for each topic, with links to supporting evidence. We demonstrate the utility of BioTopicXplor through a case study on anti-aging research, where the system reveals meaningful thematic structures and supports knowledge discovery.
Reading Between the Lines: Toward Translating Verbose Patient-authored Messages into Clinician-Formulated Questions
Sarvesh Soni | Madeline Bittner | Dina Demner-Fushman
Patient portal messages often embed clinical questions inside long, emotionally nuanced narratives, requiring clinicians to infer the underlying information need. We study the task of rewriting verbose patient-authored narratives into concise, clinician-interpreted questions framed as if querying an electronic health record (EHR) system. We evaluate a lightweight LLM-based rewrite pipeline that constrains outputs to 10–15 words and uses rule-based validation with regeneration. We test the approach on 140 distinct patient questions drawn from the ArchEHR-QA dataset and shared task. Each system output is independently annotated by two annotators for quality (Good/Ok/Bad) and error types (Generic, Malformed, Tangential, Hallucination). Results show that while models follow output constraints, they often produce overly generic or tangential questions, and occasional hallucinations introduce unsupported clinical details. Across both clinician-question and patient-narrative comparison settings, automatic metrics show substantial overlap across human quality labels; in pairwise meta-evaluation, BERTScore is the strongest proxy for human preferences. We release our code and annotations to support future work.
Investigating Stigmatizing Language in Clinical Documentation with Open-Source Large Language Models
Rajashree Dahal | Pardis Hosseinpour | Pranithi Kamishetty | Satwik Pamulaparthy | Saeid Tizpaz-Niari | Natalie Parde
Clinical documentation is essential for patient care, billing, and medical research, but it is subject to entrenched bias. While manual chart reviews can identify such bias, they are labor-intensive and expert-dependent. We introduce and evaluate StigMAD, a Multi-Agent Debate framework leveraging open-source Large Language Models (LLMs) to detect stigmatizing language in clinical documentation. We investigate reasoning (multi-agent debate), self-reflection, and self-consistency within this framework. Extensive experiments on clinical notes and patient summaries demonstrate that our framework provides significant advantages over rule-based and supervised baselines. A domain-specific LLM (MedGemma) achieved its highest performance using the StigMAD reasoning framework, while a general-purpose LLM (Llama) showed superior results with the self-consistency framework. These findings suggest that open-source LLMs, steered by structured prompting and reflective reasoning, can effectively support the scalable auditing of stigmatizing language, marking a critical step toward more equitable clinical NLP systems.
Learning to Combine AI Annotations for Improved Biomedical Relevance Labeling
Won Gyu Kim | Lana Yeganova | Shubo Tian | Donald Comeau | W John Wilbur | Zhiyong Lu
Accurate labeling of relevance between biomedical abstracts is essential for improving information retrieval, semantic similarity modeling, training of ranking systems and other Natural Language Processing tasks. However, manual annotations are time-consuming, labor-intensive and costly. Studies show that large language models (LLMs) can facilitate automated annotation, but their performance still falls short of human expert-level accuracy, especially in domain-specific tasks. It has been shown that combining annotations from multiple non-expert annotators can achieve performance comparable to, or even exceeding, that of trained experts. Based on this evidence, we treat AI-generated annotations as contributions from non-expert annotators and combine them using a Learning to Rank framework. Our results show significant improvement in overall annotation quality. The proposed method shows promise for reducing reliance on human annotation while maintaining reliable performance for large-scale biomedical applications.
When Does Retrieval Beat Direct LLM Diagnosis in Rare Disease? An Empirical Study of Ontology Coverage
Mohamed Elmofty | Ulf Leser
Recent high-complexity agentic systems such as DeepRare perform strongly on rare disease diagnosis benchmarks, but it remains unclear when gains come from structured knowledge access and when they come from parametric LLM knowledge. We compare phenotype-based retrieval, LLM reranking, and unrestricted LLM diagnosis across seven benchmarks covering 10,382 cases. We find a clear performance crossover driven by retrieval coverage (the fraction of cases whose true diagnosis is within the retriever’s top-50): on high-coverage datasets, ontology-based retrieval dominates; on low-coverage datasets, open-ended LLM diagnosis takes the lead. Building on this, adding an LLM reranker over retrieved candidates further improves accuracy across our patient-case benchmarks, closing most of the remaining gap to agentic systems (within 2 pp on MME and LIRICAL). We trace the crossover to two structural failure modes of ontology-based retrieval, annotation sparsity and phenotypic homogeneity, and show that aggregate scores across mixed benchmarks can hide these qualitatively different diagnostic settings. These findings motivate per-dataset evaluation and hybrid diagnostic systems that combine retrieval, reranking, and parametric LLM generation based on case characteristics.
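Retrieval coverage as defined above (the fraction of cases whose true diagnosis appears in the retriever's top-50) can be computed with a few lines. This is a minimal sketch: the function name, argument layout, and the disease identifiers in the usage example are hypothetical and do not come from the paper.

```python
def retrieval_coverage(ranked_candidates, gold_diagnoses, k=50):
    """Fraction of cases whose gold diagnosis is among the top-k retrieved candidates.

    ranked_candidates: one ranked list of candidate IDs per case (best first).
    gold_diagnoses: the true diagnosis ID for each case, in the same order.
    """
    if len(ranked_candidates) != len(gold_diagnoses):
        raise ValueError("need exactly one candidate list per case")
    hits = sum(
        1 for candidates, gold in zip(ranked_candidates, gold_diagnoses)
        if gold in candidates[:k]
    )
    return hits / len(gold_diagnoses)


if __name__ == "__main__":
    # Hypothetical disease IDs, for illustration only.
    ranked = [["D1", "D2", "D3"], ["D4", "D5"], ["D6"]]
    gold = ["D2", "D9", "D6"]
    print(retrieval_coverage(ranked, gold, k=2))
```

Because a reranker can only reorder what was retrieved, this coverage value is also an upper bound on the accuracy of any retrieve-then-rerank pipeline on the same cases.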
BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs
Nourah Salem | Elizabeth White | Michael Bada | Lawrence Hunter
Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs’ performance with four prompting experiments that vary in their use of local, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries.
A Multi-Agent Open-Source LLM for Structured Cancer Registry Information Extraction from Pathology and Medical Reports
Abdulrahman Aal Abdulsalam | Adhari Al Zaabi | Riham Jeeballah | Habiba El Keraby
Extracting structured cancer registry information from pathology and medical reports is challenging due to heterogeneous reporting styles and implicit clinical reasoning. We propose a modular multi-agent framework that decomposes registry abstraction into semantic chunking, retrieval, field-specific extraction, validation, evaluation, and aggregation stages. The dataset includes 818 annotated cancer cases from Sultan Qaboos University Hospital. Evaluation in this study focuses on breast (n=454) and colorectal (n=174) reports across grade, morphology, TNM staging, and laterality extraction tasks. The framework is compared against prompt-based LLaMA 3.3 baselines using accuracy and weighted/macro F1-score metrics. The proposed framework improved performance in context-dependent tasks, particularly grade extraction, where weighted F1-score increased from 0.71 to 0.78 for breast cancer and from 0.56 to 0.67 for colorectal cancer. Improvements were also observed for colorectal laterality extraction. For other extraction tasks, particularly highly structured tasks such as TNM staging and morphology extraction, the multi-agent framework achieved performance comparable to direct prompting. Although the baseline achieved slightly higher average weighted F1-scores overall, the proposed framework provides improved modularity, traceability, and pipeline-level interpretability through explicit intermediate reasoning stages, supporting error analysis and future clinician-guided refinement.
BioConflict: A Benchmark for Evaluating Large Language Models in Biomedical Contradiction Detection and Consensus Synthesis
Ashwin Kirubakaran | Henry Gagnier
Resolving contradictions in biomedical literature requires more than factual recall; it demands identifying the hidden variables that explain divergent findings. Existing NLI benchmarks such as MedNLI operate at the sentence level and fail to capture document-level conflicts driven by differences in dosage, cell type, or study design. We introduce BioConflict, a benchmark of 250 expert-annotated paper pairs (500 abstracts) across ten biomedical topics, formalizing three tasks: conflict detection, contextual variable extraction, and consensus synthesis. We evaluate five general-purpose large language models and two domain-specific baselines, finding that general-purpose large language models achieve strong conflict detection (F1 up to 0.89) but exhibit brittle reasoning in synthesis, while domain-specific models lag significantly on all generative tasks. These findings highlight the need for context-aware biomedical AI capable of resolving, not merely retrieving, conflicting scientific evidence.
Tokenization Granularity and Medical Term Representations in Language Models
Vojtech Lanz | Pavel Pecina
We investigate how tokenization granularity affects the representation of medical terminology in language models. Prior work links tokenization granularity to downstream performance under contextualized settings for specifically pretrained and fine-tuned models. We instead ask whether this relationship already emerges at the level of isolated term representations across existing pretrained models. We introduce an intrinsic definition retrieval task using UMLS term-definition pairs, with comparison to WordNet. We show that despite substantially heavier fragmentation of medical terminology, the models remain relatively robust in maintaining semantic alignment between medical terms and their definitions. At the same time, tokenization granularity still correlates with retrieval performance, indicating that effects previously observed in downstream biomedical tasks are already reflected at the level of isolated term representations. Encoder models benefit primarily from whole-token preservation, while for decoder LLMs, tokenization effects emerge mainly at deeper retrieval ranks.
CAP: A Source-Grounded Proposition Scaffold for Faithful Clinical Dialogue-to-Note Generation
Hyunkyung Lee | Jisoo Jung | Jeonguk Lee | Jaehyo Yoo | Wooseok Han | Minkyu Kim | Gibaeg Kim
Clinical dialogue-to-note generation is challenging because clinically salient evidence is noisy, distributed across turns, and often revised later in the encounter. Direct transcript-only prompting and coarse intermediate scaffolds can therefore suffer from omissions, section leakage, unsupported fill-in, and brittle final-state tracking. We propose Clinical Atomic Propositions (CAPs), a dialogue-aware intermediate representation for faithful clinical note generation. CAPs extract source-grounded clinical assertions while preserving modifiers such as verification status, temporality, speaker/source, and action type. We also study an optional event consolidation layer that groups CAPs into problem-oriented care bundles before note rendering. We evaluate five methods on a 197-case ACI-Bench cohort: a transcript-only baseline, prompt-based reimplementations of Cluster2Sent and MEDSUM-ENT, CAP, and CAP+Event. The main task uses a sectioned-note template, with SOAP-template rendering and transcript-free rendering reported as ablations. We use MEDSUM-ENT-style GPT-R/P/F1 metrics and a proposition-grounded semCAP-R/P/F1 audit to measure concept-level and source-grounded faithfulness, complemented by case-level win/tie/loss analysis and clinician deep review. Results show that CAP improves preservation of transcript-grounded clinical propositions while remaining competitive on concept-level GPT metrics. CAP+Event is not uniformly better than CAP, but qualitative and boundary analyses show when problem-oriented consolidation can improve organization and when compression can introduce omissions. We release code, prompts, intermediate representations, generated notes, and evaluation artifacts at a public repository.
Segmentation Matters: Exploring LLM-Based Strategies for Temporal Clinical Event Identification in Oncology Reports
Cristiano Bellucci | Francesco Madeddu | Chiara Iacomini | Carlotta Masciocchi | Stefano Patarnello | Massimo Bernaschi | Mario Santoro | Livia Lilli
Processing unstructured clinical narratives remains a major challenge in medical Natural Language Processing (NLP), particularly when critical information is embedded within lengthy and heterogeneous reports. Clinical notes often describe key diagnostic and therapeutic events through a verbose narrative, making automatic event identification difficult. In this work, we frame the identification of clinical events as a text segmentation task. We conduct a comparative study of three segmentation strategies applied to oncology reports: (i) a fully regex-based approach, (ii) a cascaded regex-LLM pipeline, and (iii) the same cascade architecture augmented with a recovery mechanism to mitigate LLM rephrasing. Segmentation quality is evaluated using complementary structural metrics (Pk, WindowDiff, Boundary Similarity, Segment Count Accuracy, and Text Overlap IoU), and its impact is also observed on downstream segment tagging, performed to identify the corresponding event type (e.g., surgery, biopsy, imaging, treatment, laboratory). The results demonstrate the high potential of LLM-based approaches, particularly in preserving semantic coherence within segments and in generalizing to new data sources. However, regex-based segmentation achieves higher performance according to structural segmentation metrics, also leading to better downstream clinical event identification. In general, these results highlight the critical role of context-adaptive, high-quality segmentation strategies in the structuring of verbose clinical narratives and in the accurate identification of key patient events.
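Of the structural metrics listed in this abstract, WindowDiff is compact enough to sketch. The version below assumes each segmentation is encoded as a list of 0/1 flags, one per text unit, where 1 marks a boundary after that unit. This encoding, the function name, and the choice of window size are assumptions for illustration, and the paper's exact evaluation setup may differ.

```python
from typing import Sequence


def window_diff(
    reference: Sequence[int],
    hypothesis: Sequence[int],
    k: int,
) -> float:
    """WindowDiff between two boundary sequences (lower is better).

    A window of k units slides over both sequences; a window counts as
    an error when the two segmentations place a different number of
    boundaries inside it. The score is the fraction of such windows.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("segmentations must cover the same units")
    if not 0 < k <= len(reference):
        raise ValueError("k must be between 1 and the sequence length")
    n_windows = len(reference) - k + 1
    errors = 0
    for i in range(n_windows):
        ref_boundaries = sum(reference[i:i + k])
        hyp_boundaries = sum(hypothesis[i:i + k])
        if ref_boundaries != hyp_boundaries:
            errors += 1
    return errors / n_windows
```

A common convention is to set `k` to about half the average reference segment length. A near-miss boundary is then penalized only in the windows it actually affects, rather than counting as both a full miss and a false alarm.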
Operation-Mechanism Alignment for Reliable Clinical Reasoning over Electronic Health Records
Guanyu Tao | Siyao Wang | Yong Xue | Ashwani Tanwar | Yuting Ji | Kai Sun | Monica Mok | Marzana Chowdhury | Deepa Gupta | Ashok Gupta | Jingqing Zhang | Vibhor Gupta | Yike Guo
Clinical reasoning over electronic health records (EHRs) involves heterogeneous operations, including text interpretation, numerical computation, temporal filtering, and guideline-based aggregation. However, many existing LLM-based approaches still cast these heterogeneous operations as a single end-to-end generation process, obscuring their different reliability requirements and making intermediate failures difficult to inspect. We therefore propose a framework based on operation-mechanism alignment that represents clinical reasoning as a directed acyclic graph of typed operations, where each node is assigned to the execution mechanism best suited to its reliability requirements. The framework also preserves structured evidence provenance for intermediate results. Across six clinician-annotated binary decision tasks, the framework outperforms direct prompting, single-step retrieval-augmented prompting, and chain-of-thought baselines, supporting operation-mechanism alignment as a practical design principle for reliable clinical reasoning over EHRs.
MeSHClass-ES and AnatEM-ES: Open Resources for Spanish Biomedical NLP
Santiago Martinez Novoa | Lina Gomez Mesa | Juan Prieto | Ruben Manrique
Despite Spanish being one of the most widely spoken languages in the world, biomedical NLP resources and systematic evaluations remain limited relative to English. We address this gap by constructing and releasing two Spanish biomedical corpora: (1) **MeSHClass-ES**, a 29,063 abstract bilingual corpus translated from PubMed with Opus-MT, and (2) **AnatEM-ES**, the AnatEM anatomical entity corpus translated with a chunk-level LLM-based pipeline that jointly preserves BIO annotations across 13,849 entity mentions. Both corpora achieve a mean COMET score of 0.73 despite using different translation systems. We benchmark nine encoder models spanning general-domain Spanish, domain-specific, and multilingual architectures for both tasks. RigoBERTa-2.0 leads both tasks (micro-F1 classification 0.69, tied with SciBETO-large; NER F1 0.66). Both domain pretraining and model capacity drive performance, with the gap slightly more pronounced for NER (4-point spread) than classification (3-point spread). XLM-RoBERTa-large emerges as a competitive multilingual baseline. A parallel evaluation of four open-weight decoders (7–9B) reveals a task-dependent encoder-decoder gap: QLoRA-adapted Gemma-2-9B reaches 88% of the best encoder on classification (micro-F1 0.61 vs. 0.69), but for NER every decoder configuration we tested stays at or below 40% of the best encoder F1. We release both corpora on the HuggingFace Hub, and the translation pipelines and evaluation code on GitHub.
When Evidence Conflicts: Uncertainty and Order Effects in Retrieval-Augmented Biomedical Question Answering
Yikun Han | Mengfei Lan | Halil Kilicoglu
Biomedical retrieval-augmented LLMs are often evaluated under helpful retrieved context, but in practice the evidence can also be misleading or internally conflicting. This paper studies uncertainty under those harder settings using the HealthContradict benchmark and six open-weight models. We evaluate five controlled evidence conditions: no context, correct-only context, incorrect-only context, and two mixed conditions that contain the same correct and contradictory documents in opposite orders. Correct evidence improves both accuracy and calibration, while incorrect evidence substantially degrades both. Under conflicting evidence, document order also matters: reversing the order of the same two documents changes 11.4%–25.2% of predictions and consistently reduces performance when the incorrect document appears first. We further evaluate a conflict-aware abstention score that combines model confidence with a detector of evidence conflict. In the two hardest conditions, incorrect-only and incorrect-first conflict, this score improves selective accuracy over confidence-only abstention, with mean gains of 7.2–33.4 and 3.6–14.4 points across 75%, 50%, and 25% coverage. These results show that biomedical RAG systems should be evaluated not only under helpful retrieval, but also under misleading and conflicting evidence.
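Selective accuracy at a fixed coverage, as reported in this abstract, can be sketched as follows: keep the fraction of predictions with the highest abstention score and measure accuracy on that subset only. The function and argument names are hypothetical, and the paper's actual conflict-aware score is not reproduced here. Any per-example score where higher means "more willing to answer" can be passed in.

```python
from typing import Sequence


def selective_accuracy(
    scores: Sequence[float],
    correct: Sequence[bool],
    coverage: float,
) -> float:
    """Accuracy on the `coverage` fraction of examples with the highest score.

    scores[i] is an answer-willingness score for example i (for instance
    model confidence alone, or confidence adjusted by a conflict detector);
    correct[i] says whether the prediction for example i was right.
    """
    if len(scores) != len(correct) or not scores:
        raise ValueError("scores and correct must be non-empty and aligned")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    n_keep = max(1, round(coverage * len(scores)))
    # Rank examples from most to least confident and keep the top n_keep.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = order[:n_keep]
    return sum(bool(correct[i]) for i in kept) / n_keep
```

Comparing two abstention scores then amounts to calling the function twice with the same `correct` labels and the same coverage levels (for example 0.75, 0.50, and 0.25) and taking the difference.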
A Comparative Analysis of In-Context Learning and Fine-Tuning for Biomedical Information Retrieval and Sentence Extraction Using Research Domain Criteria
Athlene Jones | Khanh Lieu | Indika Kahanda
Research Domain Criteria (RDoC) is a National Institute of Mental Health framework for studying mental disorders by integrating information across genetics, circuits, and behavior. Manually curating biomedical abstracts relevant to RDoC is a significant challenge due to semantically overlapping construct definitions (e.g., "Acute Threat," "Potential Threat," and "Sustained Threat") and the exponential growth of biomedical literature. This study compares two modeling strategies, domain-adapted fine-tuning and in-context prompting, across two RDoC-related subtasks from the official BioNLP-OST 2019 RDoC shared task. For Task 1, unlabeled PubMed abstracts are retrieved and ranked by relevance to eight of the RDoC constructs. We compare a TF-IDF baseline against ModernBERT and Llama (zero-shot and five-shot) using Mean Average Precision (MAP). For Task 2, the objective is to identify the single most relevant sentence from an abstract for a given construct, evaluated using per-construct accuracy. The fine-tuning track performs end-to-end fine-tuning of BioBERT, PubMedBERT, ModernBERT, and RoBERTa using a cross-encoder input format and per-construct grid search. These are compared against the in-context learning of several open-source language models. Both our approaches are competitive against the best-performing team’s score from the BioNLP-OST 2019 RDoC shared task. Taken together, these findings suggest that five-shot prompted LLMs and domain-adapted fine-tuned transformers are viable tools for semi-automating the expert annotation in RDoC curation.
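Mean Average Precision, the Task 1 metric named in this abstract, can be sketched in a few lines. The data layout (one ranked list of abstract IDs and one set of relevant IDs per construct) and all names below are assumptions for illustration. The shared task's official scorer may handle ties or missing documents differently.

```python
from typing import Dict, Iterable, Sequence, Tuple


def average_precision(ranked_ids: Sequence[str], relevant_ids: Iterable[str]) -> float:
    """Average precision of one ranking (assumes no duplicate IDs in it)."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, Tuple[Sequence[str], Iterable[str]]],
) -> float:
    """Mean of per-construct average precision.

    runs maps a construct name to (ranked_ids, relevant_ids).
    """
    if not runs:
        raise ValueError("at least one construct is required")
    return sum(average_precision(r, rel) for r, rel in runs.values()) / len(runs)
```

Because each construct contributes equally to the mean, a construct with few relevant abstracts weighs as much as one with many.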
Clinical sources and patient-authored reviews often describe antidepressant side effects in different ways, but these differences are rarely measured directly. We present ClinPeer-AE, a linked dataset for comparing side-effect evidence from PubMed, ClinicalTrials.gov, WebMD, and Drugs.com while preserving source identity. Across five widely prescribed antidepressants, we find low overlap between clinical and peer sources, large differences in relative emphasis, and evidence that many peer-only effects also appear in U.S. Food and Drug Administration Adverse Event Reporting System (FAERS) reports. These findings suggest that patient reviews provide useful context about recurring medication experiences and offer a complementary view of how side effects are described outside formal clinical settings.
A Deterministic Multi-Stage Retrieval Pipeline for Longitudinal EHR Question Answering
Shubham Agarwal | Thomas Searle | Richard Dobson | Ninoslav Majkic | Niko Moller-Grell
Retrieval-augmented generation (RAG) holds promise for clinical question answering over electronic health records (EHRs), but existing systems treat retrieval as an opaque subroutine, limiting auditability and reliability in patient care workflows. We introduce a deterministic multi-stage retrieval pipeline for longitudinal EHR question answering that decomposes retrieval into four distinct, ablated stages where each stage is instrumented with diagnostic metrics, making the flow of clinical evidence measurable and auditable at every step. Evaluated on a broad LLM-annotated cohort and an expert-annotated cardiovascular benchmark developed alongside clinicians from real ICU records, the full pipeline achieves 22-23% relative recall gain over a strong dense retrieval baseline across both cohorts, with consistent improvements in downstream answer quality. The pipeline’s deterministic and transparent design addresses a critical gap in clinical NLP: retrieval systems that clinicians and researchers can not only rely on, but inspect, audit, and build upon for real-world deployment.
Interpretable ICD Code Classification with Faithful Sentence Extraction
Yichen Wang | Lian Hong | Masato Mizogaki | Shunnosuke Umeda | Toshimune Kenmotsu | Akihiro Tamura | Daniel Andrade
Transformer-based models such as PLM-CA achieve strong performance for automatic ICD coding, but their attention weights do not provide faithful explanations of their predictions. This is a major limitation for electronic medical records, where users often need concise and trustworthy evidence for each assigned code. To address this issue, we jointly train a sentence extractor and an ICD code classifier such that predictions are based only on the extracted sentences. As a result, the extracted sentences serve as faithful rationales for each predicted code and substantially reduce the effort required to inspect long medical records. Experiments on MIMIC-III show that our method approaches the performance of a transformer baseline that processes the full record while using only a small fraction of the document.
Evaluating LLM-as-a-Judge for Medical Term Simplification
Ioana Buhnila | Aman Sinha | Rohit Agarwal | Dilip K. Prasad | Mathieu Constant
Highly technical medical terms are difficult for patients to understand during fast-paced hospital consultations, leading them to rely on Large Language Models (LLMs) for simplified explanations. However, LLMs can produce inaccurate or false information. Since expert evaluation is costly and time-consuming, the LLM-as-a-Judge (LaaJ) approach is increasingly adopted to assess the quality of LLM-generated text. In this paper, we investigate the reliability and robustness of LaaJ for specialized medical knowledge by evaluating six LLMs for their judgment capabilities on three dimensions: correctness, readability, and completeness. We use three judgment setups (Vanilla, Epistemic, and Bias) to probe robustness, and assess them against human expert annotations to measure alignment. To address the lack of specialized medical benchmarks, we introduce BrainCancerDB, an English dataset of 219 brain cancer terms with 23,652 annotations. Our findings indicate that while LLM-Judges and humans display similar trends in ranking simplified explanations, LLM-Judges tend to be more lenient on correctness, which may have serious implications in medical settings. Additionally, we observe that hallucinations in LaaJ setups can be mitigated by epistemic markers.
FACT: Functional Group Alignment and Consistency in Token Space for Structure-aware Molecular Representation Learning
Hyeonyeong Nam | Woojae Choi | Deok-Joong Lee | Young-Han Son | Sangwoon Lee | Bogyeong Kang | Eunjung Jo | Tae-Eui Kam
Molecular representation learning aims to capture chemically meaningful structures for various downstream tasks such as accurate molecular property prediction. However, incorporating functional group (FG) information into SMILES-based models remains challenging. The absence of explicit alignment between graph-defined FG atom sets and tokens in sequence prevents complete substructure masking, while multiple valid SMILES forms of the same molecule lead to inconsistent FG representations in token space. To address these challenges, we propose FACT (Functional Group Alignment and Consistency in Token Space), an end-to-end framework for structure-aware SMILES-based representation learning. FACT introduces an atom-token alignment module for complete FG span masking during pre-training and enforces FG consistency across different SMILES forms during fine-tuning. Experiments on MoleculeNet benchmarks show that FACT achieves state-of-the-art or competitive performance on eight tasks, demonstrating the effectiveness of alignment and consistency learning for molecular representation.
Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization
Gaurav Kumar
Large Language Models such as GPT-4o and GPT-5 achieve strong zero-shot performance on biomedical claim verification, but cost and opacity limit scalable use. We fine-tune three small LLMs (Phi-3-mini (3.8B), Qwen2.5-3B, and Mistral-7B) via QLoRA on SciFact and HealthVer, providing the first study of QLoRA models against GPT-4o and fine-tuned BioLinkBERT encoders. Mistral-7B QLoRA achieves higher F1 than both GPT-4o and GPT-5 (up to 12% gain) at 44.5x lower cost using just 1,008 training examples, representing a compelling cost-quality trade-off. We conduct extensive in-domain and cross-domain evaluation: models trained on SciFact are tested on HealthVer and vice versa, at matched sizes to isolate dataset structure from data quantity. We identify a previously unreported structural artifact in SciFact that inflates in-domain scores, and show through bidirectional out-of-domain evaluation that training on structurally sound data enables robust cross-domain transfer. We plan to release all code and adapter checkpoints.
Diagnosable ColBERT: Debugging Late-Interaction Retrieval Models Using a Learned Latent Space as Reference
François Remy
Reliable biomedical and clinical retrieval requires more than strong ranking performance: it requires a practical way to find systematic model failures and curate the training evidence needed to correct them. Late-interaction models such as ColBERT provide a first solution thanks to the interpretable token-level interaction scores they expose between document and query tokens. Yet this interpretability is shallow: it explains a particular document–query pairwise score, but does not reveal whether the model has learned a clinical concept in a stable, reusable, and context-sensitive way across diverse expressions. As a result, these scores provide limited support for diagnosing misunderstandings, identifying unreasonably distant biomedical concepts, or deciding what additional data or feedback is needed to address them. In this short position paper, we propose Diagnosable ColBERT, a framework that aligns ColBERT token embeddings to a reference latent space grounded in clinical knowledge and expert-provided conceptual similarity constraints. This alignment turns document encodings into inspectable evidence of what the model appears to understand, enabling more direct error diagnosis and more principled data curation without relying on large batteries of diagnostic queries.
Developing Literature Annotation Guidelines for Representing Normal Physiology in Biolink-Compatible Knowledge Graphs
Madeline Bittner | Willie Rogers | Dina Demner-Fushman | Richard Scheuermann | Matthew Diller
Much of our knowledge about anatomy and physiology is found in text format in research papers and medical textbooks. For an information system to have access to this knowledge, extracting and translating it into a computable format that can be stored in an ontology or knowledge graph is advantageous. Unfortunately, existing text mining corpora, which are needed to train and evaluate data mining models, are old and consist almost entirely of research papers, which rarely contain complete information needed to capture complex normal physiological processes and, subsequently, understand the pathophysiology of a disease. As a first step to filling in this gap, we have developed a guide for annotating medical textbooks for physiological events and entities involved in these events. In addition to providing our guidelines and describing the guideline development process, we analyze the coverage of normal physiology in existing ontologies.
CENT: Context Engineering Framework for Normalization of Clinical Trial Procedures
Sanya Taneja | Ziqing Ji | Hans Verstraete | Ali Samadani
Clinical Concept Normalization is essential for clinical research applications involving trial protocols, such as patient-trial matching. Existing approaches focus heavily on specific domains and need large, annotated datasets. To address these challenges, we propose CENT, a context engineering framework that combines semantic matching for candidate retrieval and Large Language Model (LLM) prompting for disambiguation. We applied CENT on a publicly available dataset of procedures normalized to Current Procedural Terminology (CPT) concepts and evaluated the framework using both binary and hierarchical metrics that take into account hierarchical characteristics of predicted codes. CENT achieves superior performance on clinical procedures normalization in both binary and hierarchical metrics compared to semantic matching or LLM-only approaches, without requiring fine-tuning. Advanced prompt strategies, including Chain-of-Thought and Tree-of-Thoughts, achieve high performance at practical cost. We further applied CENT to predict codes in two clinical protocol-derived datasets to validate its performance on noisy procedure texts. CENT is scalable and adaptable for normalization across diverse clinical vocabularies in real-world clinical applications.
Clinical documentation places significant time demands on medical professionals, consumes institutional resources, and is prone to errors that may compromise patient care. Recent advances in LLMs offer promising approaches for automating clinical note generation; however, the impact of different AI architectural designs remains underexplored, particularly for agentic AI systems. This study compares three architectures (single-LLM, multi-agentic, and swarm-agentic) for automated SOAP (Subjective, Objective, Assessment, Plan) note generation from doctor-patient dialogues. All approaches employ QLoRA-finetuned Ministral 3 models (3B and 8B parameters) trained on the MedSynth dataset, comprising 10,030 dialogue-note pairs across 2,006 ICD-10 code classes. Performance is evaluated using ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore against a lexical-overlap baseline (dialogue vs. ground-truth SOAP, no inference). Results show that all finetuned models substantially outperform the baseline, while differences between architectural variants remain marginal. The single-LLM setup achieves the strongest performance across all metrics; 3B and 8B variants perform nearly identically on semantic similarity (BERTScore), while ROUGE differences are small but statistically significant. Qualitative inspection further reveals that residual differences across architectures are driven primarily by shared dataset priors rather than by architectural reasoning capacity. The results are based on synthetic data without human evaluation and reflect architectural behavior only.
VERICITE: Evaluating Sentence-Level Citation Faithfulness in Retrieval-Augmented Medical Question Answering
Yixian Ma | Bohao Chu | Norbert Fuhr
Retrieval-augmented generation (RAG) reduces hallucination in large language models by grounding outputs in retrieved evidence, but it does not guarantee that the resulting citations support the associated claims. We present VERICITE, a framework for evaluating citation faithfulness in retrieval-augmented medical QA. Our system retrieves PubMed abstracts via the NCBI E-Utilities API, prompts LLMs to generate answers with inline citations, and verifies each citation at the sentence level using a DeBERTa-v3-large NLI model. We evaluate four LLMs on 500 BioASQ questions at retrieval depths of 3 and 5, with extended experiments up to k = 15 and an oracle setting with gold standard documents. Only 27–41% of citation pairs are supported at the sentence level at retrieval depths of 3 and 5, with support rates declining further at larger k. Under the oracle condition, answer quality improves, but citation faithfulness does not substantially improve, suggesting that generation-side citation behavior contributes substantially to unfaithful citations.
Overview of the Medical Decision Extraction, Analysis, and Classification Task (MedExACT) of BioNLP 2026
Mohamed Elgaar | Jiali Cheng | Nidhi Vakil | Mehrnaz Sadrolashrafi | Mitra Mohtarami | Adrian Wong | Hadi Amiri | Leo Celi
This paper presents an overview of the Medical Decision Extraction, Analysis, and Classification task (MedExACT) of BioNLP 2026. The focus of this task is the extraction and labeling of medical decisions in ICU discharge summaries. The task is built on MedDec, a MIMIC-III-based dataset of 451 expert-annotated summaries, and asks systems to extract and classify spans of text that contain medical decisions according to the decision categories defined in the Decision Identification and Classification Taxonomy for Use in Medicine (DICTUM). The official ranking combines span F1 and token F1 with a worst-group robustness metric computed over sex, race, and English-proficiency subgroups. MedExACT attracted broad international interest, with 130 official submissions from 36 teams comprising about 60–100 participants, and has improved information extraction performance by nearly 15% over the previous state of the art. The submitted systems predominantly use long-context encoder models, ensemble decoding, boundary-refinement modules, and robustness-aware training or model selection, with the best submitted run reaching a final fairness-based F1 of 0.596.
Divide-Prompt-Refine: a Training-Free, Structure-Aware Framework for Biomedical Abstract Generation
Sylvey Lin | Joseph Menke | Shufan Ming | Dongin Nam | Neil Smalheiser | Halil Kilicoglu
Biomedical abstracts play a critical role in downstream NLP applications, such as information retrieval, biocuration, and biomedical knowledge discovery. However, a non-trivial number of biomedical articles do not have abstracts, diminishing the utility of these articles for downstream tasks. We propose DPR-BAG (Divide, Prompt, and Refine for Biomedical Abstract Generation), a training-free, zero-shot framework that generates coherent and factually grounded abstracts for biomedical articles with full text but no abstract. DPR-BAG decomposes full-text documents into structured rhetorical facets following the Background-Objective-Methods-Results-Conclusions (BOMRC) schema, performs parallel LLM-based summarization for each facet, and applies a final refinement stage to restore global discourse coherence. On PMC-MAD, a distribution-aligned dataset of 46,309 biomedical articles, DPR-BAG improves abstractive novelty over strong extractive and fine-tuned baselines, while maintaining factual consistency. Our ablation study reveals a counterintuitive finding: increasing prompt complexity or explicitly injecting entity-level guidance can degrade factual alignment, highlighting the importance of controlled prompting strategies. These findings underscore the potential of training-free, structure-aware frameworks for scalable biomedical abstract generation in low-resource settings. Our data and code are available at https://huggingface.co/datasets/pmc-mad/PMC-MAD and https://github.com/ScienceNLP-Lab/MultiTagger-v2/tree/main/DPR-BAG.
AAbAAC: An Annotated Corpus for Autoimmunity Information Extraction
Fabien Maury | Solène Grosdidier | Maud De Dieuleveult | Adrien Coulet
Despite advances in information extraction driven by deep learning and large language models, performance gaps remain in highly specialized biomedical fields, where domain-specific complexity poses challenges for generalist models. In this work, we focus on the domain of autoimmunity, where the main entities of interest are autoimmune diseases, autoantibodies (i.e., molecules that may mark or cause these diseases), their molecular targets, their location in the body, and the associated clinical signs. Herein, we present AAbAAC (AutoAntibodies and Autoimmunity Annotated Corpus), a corpus of 115 abstracts selected from PubMed that we manually annotated for those entities and their relationships. First, AAbAAC was used to evaluate several methods on the task of named entity recognition (NER), and second, to fine-tune NER models. Our study demonstrates the utility of AAbAAC for information extraction in the domain of autoimmunity, showing expected improvement in NER performance after fine-tuning. This illustrates the value of small-scale annotation efforts for specialized domains and contributes to the computational study of autoimmunity. The AAbAAC corpus is available at: https://github.com/f-maury/AAbAAC.
What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework
Robert Leaman | Rezarta Islamaj | Zhiyong Lu
Biomedical named entity recognition (NER) and entity linking (EL) strongly depend on annotated corpora, but the utility of these resources for benchmarking is often assumed rather than characterized. We present a corpus-centric framework for diagnosing benchmark-relevant properties directly from corpus annotations, concept links, train-test splits, document metadata, and terminology mappings. The framework organizes standardized statistics into five families: (1) scale, density and label distribution, (2) lexical and conceptual structure, (3) train-test overlap, (4) metadata composition, and (5) terminology coverage where applicable. Applying the framework to nine corpora spanning diseases, chemicals, and cell types, we find that corpus properties can differ substantially, even when they address the same apparent task. We find differences in the evaluation signal they provide, the generalization demands they impose, the degree of train-test reuse they permit, and the regions of biomedical literature and concept space they represent. These differences suggest that commonly reported corpus statistics can be insufficient to characterize what biomedical NER and EL benchmarks evaluate. We argue that corpus-centric diagnostics provide a practical framework for analyzing corpora beyond surface descriptors such as corpus size and entity type, for identifying potential transfer risks, and for interpreting the scope of benchmarking conclusions. We release the framework as open-source code with an interactive dashboard to support reproducing our analyses and characterizing additional corpora.
Towards Grounded Hallucination Definitions for Biomedical Question Answering with Reproducible Examples from ClinIQLink
Brandon Colelough | Davis Bartels | Madeline Bittner | Dina Demner-Fushman
Hallucinations in biomedical question answering are hard to define and compare because the literature uses overlapping and inconsistent terms. There is currently no grounded definition set that works for biomedical QA, with real examples from open-source LLMs. We introduce a layered definition of hallucinations for biomedical QA, hierarchically structured from the overarching idea of hallucination in relation to generated model content, to source and consistency orientations, and finally to subtypes. We ground our definition taxonomy in source-attributed literature definitions and reproducible examples from ClinIQLink, where cases can be traced to the question, source passage, generated answer, and annotation record. We provide a framework with annotation, comparison, and error analysis to provide a clearer reference for evidence-grounded biomedical QA. We aim for this example-grounded taxonomy to support automated detection of hallucinations and their potential harmfulness.
Can NLP Models Detect When One Publication Outweighs Twenty? Predicting Systematic Review Conclusion Changes
Ebrahim Alharbi | Mark Stevenson
Systematic reviews underpin evidence-based medicine but can become outdated quickly when new evidence appears. We formulate a novel prediction task: given a review and new studies that have appeared since its publication, predict whether the review’s conclusions will change. A dataset of 3,326 Cochrane review-update pairs is constructed, and a range of approaches is explored, including feature-based baselines, zero- and few-shot LLMs, and parameter-efficient fine-tuning. Fine-tuning Qwen2.5 14B achieves the highest AUC-ROC (70.4%).
VaxScope: Document-Level Structured Evidence Extraction from Immunization Systematic Reviews
Bahar Ilgen | Ebenezer Awotoro | Georges Hattab
Systematic reviews are fundamental to evidence-based medicine, but the clinical evidence they contain is primarily expressed in unstructured text, making large-scale extraction and reuse difficult. Existing biomedical NLP methods have achieved strong performance on span-level extraction from clinical trials and abstracts; however, these approaches are insufficient for systematic reviews, where evidence is often distributed across multiple studies, sentences, and sections and must be aggregated into normalized document-level attributes. We introduce VaxScope, a benchmark dataset for document-level structured evidence extraction from immunization-related systematic reviews. VaxScope is constructed through an expert-guided semi-automatic annotation pipeline that combines automatic candidate generation with domain expert validation to ensure consistency and annotation quality. We formalize the task as document-level structured extraction, where target labels are defined at the review level and require aggregating evidence beyond isolated textual spans. We further establish baselines for document-level structured extraction using abstract-level input representations and evaluate how access to evidence-grounded contextual input improves performance over abstract-only settings. Baseline experiments show that PubMedBERT achieves the best overall performance (Avg F1: 0.850), with evidence-grounded input improving performance particularly for fields requiring distributed contextual reasoning.
Medical Context Variation: A source of impairment for Event classification
Aman Sinha | Marianne Clausel | Mathieu Constant | Xavier Coubez
The variation in writing style encapsulates nuanced characteristics, which are often exploited for author or demographic identification. In the medical domain, language models are frequently deployed to capture relevant information from unstructured or complex data, such as clinical notes that often include patients’ medical histories. Such data is largely free-form and unstructured, obtained through diverse clinician–patient interactions. In this work, we present a case study investigating whether variations in clinicians’ writing styles can lead to differences in medical context understanding capabilities for pre-trained language models (PLMs) on downstream tasks, such as medical event classification. Our findings indicate that variation in writing style, characterized by linguistic features, can indeed lead to suboptimal performance in deployed systems. Furthermore, we explore linguistically guided counterfactual reasoning to mitigate the impact of writing style variation; the results suggest that LLM-based stylistic normalization is effective for this purpose.
KALIMBA: Knowledge-Assisted Literature Mining for Biological Interaction Analysis
Niloofar Arazkhani | Maciej Kotecki | Brent Cochran | Natasa Miskov-Zivanov
The exponential growth of biomedical literature has made manual curation of biological interaction networks increasingly difficult. Existing automated biological interaction extraction systems address the scaling challenge but treat extraction as a final step, delivering structured output with limited or no integrated support for biologists to interactively verify, correct, and contextually interrogate extracted interactions against their source evidence within the same environment. We present Knowledge-Assisted Literature Mining for Biological Interaction Analysis (KALIMBA), an end-to-end, human-in-the-loop platform that integrates three complementary extraction methods (NLP-only, LLM-only, and hybrid) alongside expert annotation and evidence-grounded conversational querying through a retrieval-augmented generation (RAG) chat module driven by a dual-context prompt, within a single unified workflow. Evaluation on a corpus of 40 signaling-focused papers demonstrates that the LLM-only back-end recovers substantially more interactions than the NLP-only approach. RAG chat evaluation by a domain expert confirms that the conversational module provides scientifically grounded responses that support curation decisions beyond what the structured interaction table alone conveys.
When Retrieval Doesn’t Help: A Large-Scale Study of Biomedical RAG
Erfan Nourbakhsh | Rocky Slavin | Ke Yang | Anthony Rios
Medical question answering is a high-stakes setting where factual errors can have serious consequences. Retrieval-augmented generation (RAG) is widely viewed as a promising solution, and prior work has reported substantial gains for large medical QA models. We revisit this assumption across a broad range of open-weight instruction-tuned models spanning 7B to 72B parameters. Across five models, ten biomedical QA datasets, four retrieval methods, and four retrieval corpora, we find that retrieval yields only small and inconsistent improvements over a no-retrieval baseline, typically within 1–2 points. In contrast, the choice of backbone model has a much larger effect than the choice of retriever or corpus, and expert and layman retrieval sources perform similarly in most settings. These results suggest that the main bottleneck is not retrieval quality alone, but the model’s limited ability to use retrieved evidence effectively.
LLM-based drug–drug interaction (DDI) assessment remains difficult to audit when predictions are not explicitly tied to evidence. While retrieval-augmented generation (RAG) improves grounding, predictions are not guaranteed to be entailed by retrieved items. We present CrossDDI, a verification-first framework that separates LLM-based evidence extraction from deterministic, LLM-free arbitration over DrugBank and PubMed, requiring positive predictions to be linked to explicit supporting evidence. Evaluated on 1,000 DDInter 2.0 pairs under a positive–unlabeled setting, CrossDDI achieves recall of 0.576–0.593 over confirmed positives with interaction prediction rates comparable to RAG, while reducing cross-backbone variation (0.018 vs. 0.066). Analysis identifies literature evidence acquisition and attribution as the primary bottleneck: PubMed retrieval covers only 40.5% of confirmed positives, and Path B-only evidence is substantially less reliable than structured evidence. These results suggest that verification-first architectures can improve traceability and backbone consistency, while broader and more reliable literature evidence is needed to extend coverage beyond structured sources.
GRAFT: Gated Retrieval-Augmented Fine-Tuning for Relation Extraction
Yuhang Jiang | Ramakanth Kavuluru
Even in the era of large language models (LLMs), biomedical relation extraction (RE) still plays a major role in timely creation of knowledge graphs that further guide biomedical knowledge discovery. The main task in RE is to extract a relation "as expressed" in an input text. At times, crucial definitional information or other auxiliary information about the entities involved may be missing from the input text. Augmenting it from other external textual sources appears helpful on the surface but can be harmful too, as these sources can overwhelm the signal in the original input, leading to false positives or false negatives. To counter this, we leverage a pre-trained biomedical text retriever to augment original inputs with additional instance-specific snippets. This is done through a gating mechanism that allows the retrieved snippets to enhance but not overwhelm the signal from the original input. We evaluate our approach on three standard biomedical relation extraction datasets (CDR, BioRED, and ChemProt) and show consistent improvements (up to 10 F1 points) compared with strong supervised baselines involving both encoder and decoder models. All our code and the datasets used are available for reuse: https://github.com/bionlproc/GRAFT-RE.
Overview of the PsyDefDetect Shared Task at BioNLP 2026: Detecting Levels of Psychological Defense Mechanisms in Supportive Conversations
Hongbin Na | Zimu Wang | Zhaoming Chen | Yining Hua | Rena Gao | Kailai Yang | Ling Chen | Wei Wang | Shaoxiong Ji | John Torous | Sophia Ananiadou
We present an overview of PsyDefDetect, the shared task on detecting levels of psychological defense mechanisms in emotional support dialogues, co-located with BioNLP@ACL 2026. Grounded in the clinically validated Defense Mechanism Rating Scales (DMRS) framework, the task asks systems to classify a target seeker utterance, given its preceding dialogue context, into one of nine categories: seven hierarchical DMRS levels plus two auxiliary labels. Participants worked on PsyDefConv, a newly released corpus of 200 dialogues and 2336 help-seeker utterances annotated under DMRS with substantial inter-annotator agreement. The task attracted 172 participants on CodaBench who produced 563 submissions, with 21 teams officially registering their results for the final ranking. The best system achieved a macro F1-score of 0.420, surpassing the strongest fine-tuned baseline reported in the dataset paper by a notable margin, yet leaving clear headroom. Our analysis highlights (i) a persistent tendency to over-predict the majority High-Adaptive class, (ii) a widening gap between accuracy and macro-F1 that reveals class-imbalance sensitivity, and (iii) the value of theory-aware and LLM-based approaches for fine-grained defensive-function classification. We release all task materials and invite the community to continue work on this novel intersection of clinical psychology and NLP.
SCoPE: Planning for Hybrid Querying over Clinical Trial Data
Suparno Chowdhury | Manan Choudhury | Tejas Anvekar | Muhammed Khan | Kaneez Khakwani | Mohamad Sonbol | Irbaz Riaz | Vivek Gupta
Systematic reviews of clinical trials require analysts to extract attributes that are rarely stored as ready-made columns, for example the drug class of an immunotherapy named in a regimen, the additional agents combined with it, or whether a listed endpoint is a primary or secondary outcome. These attributes must be inferred from the visible content of other fields through normalization, classification, or structured extraction, and existing approaches such as direct LLM prompting, text-to-SQL, and agentic pipelines leave this reasoning implicit in a single generation step or pay a heavy execution cost for limited accuracy gains. We propose SCOPE (Structured Clinical hybrid Planning for Evidence retrieval in clinical trials), a multi-LLM planner-based framework that decomposes the task into row selection, structured planning, and execution. The planner makes the source field, reasoning rules, and output constraints explicit before answer generation, reducing ambiguity relative to direct prompting. We evaluate SCOPE on 1,500 hybrid reasoning questions over oncology clinical-trial tables against zero-shot, few-shot, chain-of-thought, TableGPT2, BlendSQL, and EHRAgent. Results show that explicit multi-LLM planning improves accuracy for reasoning-based questions while offering a stronger accuracy-efficiency tradeoff than heavier agentic baselines. Our findings position clinical trial reasoning as a distinct table understanding problem and highlight hybrid planner-based decomposition as an effective solution.
Expert-Guided Schema-Based Structured Extraction from CONSORT Diagrams Using Vision-Language Models
Damian Stachura | Bartosz Przechera | Monika Opałek | Ewelina Sadowska | Ewa Borowiack | Artur Nowak
Visual-language models (VLMs) are rapidly advancing on tasks that require visual understanding of text, tables, plots, and diagrams. Yet extracting structured information from text-heavy scientific diagrams remains challenging, as it requires not only OCR but also recovery of layout, grouping, and flow relationships. We study this problem in the context of CONSORT flow diagrams, which summarize participant screening, randomization, follow-up, and analysis in randomized controlled trials. We introduce a 200-example benchmark of PubMed Central diagrams, annotated by a biomedical team specializing in systematic literature reviews and clinical evidence extraction, and evaluate schema-constrained CONSORT extraction across proprietary and open-weight model families. Using structure-aware metrics, we compare single-pass and stepwise extraction strategies. Expert-guided single-pass extraction performs best for proprietary frontier models, with Gemini 3 Pro achieving the strongest overall results, whereas stepwise prompting improves less capable open-weight models on challenging arm-level extraction. These results offer practical deployment guidance and suggest that high-quality schema-constrained extraction is feasible, but not yet solved.
From Rules to Predictions: Federated Tabular Learning with LLM Reasoning
Afsaneh Mahanipour | Hana Khamfroush
Tabular data is widely used in important areas such as healthcare and finance, but building accurate models in real-world settings faces three main challenges: protecting data privacy, handling distributed data, and maintaining strong performance. Existing methods do not solve these issues together. Converting tabular data into text for Large Language Models (LLMs) can expose sensitive information, struggle with anonymized features and exact numerical values, and require expensive training while often not outperforming traditional tree-based models. In addition, many real-world datasets are spread across different institutions, making centralized training impossible. We propose a federated framework that connects distributed tabular data with LLM reasoning using decision tree rules as privacy-preserving intermediaries. Each client trains a local Random Forest and shares only extracted rules (feature comparisons and thresholds), without revealing raw data. These rules are combined into a global pool, allowing an LLM to generate a better partitioning rule without accessing any original data, adding an extra layer of privacy. Using this rule, each client learns local gradient-based corrections, which are then aggregated. We also show that this process reduces prediction error. Experiments on 12 datasets, including seven medical tasks, show that our method consistently outperforms federated baselines and achieves results close to centralized models.
MedBench: Deliberative Evaluation of Medical Language Models
Pratik Jalan | Mukul Joshi | Akhilesh Magotra | Kshitij Jadhav
We introduce MedBench, a benchmark for evaluating medical language models as deliberating agents rather than isolated predictors. MedBench evaluates eight models (4B–32B) on 19,625 questions from six medical QA datasets using Consensus-Aware Model Panel (CAMP), a two-tier protocol in which five 4B–8B models answer independently, revise after observing peer reasoning, and escalate persistent disagreements to larger 20B–32B models. Compared with zero-shot, few-shot, and chain-of-thought baselines, CAMP shows that deliberation is not uniformly accuracy-improving, but reveals interaction-driven behaviors hidden by single-model evaluation. On PubMedQA without external context, the 4B–8B panel outperforms the evaluated 20B–32B individual zero-shot models (54.1% vs. 33.9%), and achieves the best evaluated result with context (75.7%), suggesting that structured interaction can sometimes complement scale. Across five datasets, initial inter-model agreement is positively associated with correctness and serves as a useful difficulty signal. However, on MedXpertQA, unanimous agreement yields only 6.6% accuracy despite 14.4% overall accuracy, suggesting correlated ignorance, where shared biases make consensus misleading. Error analysis shows that most failures are debate-insufficient cases, where incorrect majorities persist despite interaction (93–97%), while debate-harmful cases account for 3–7%. MedBench positions deliberative evaluation as a complement to accuracy-centric benchmarking, measuring when model interaction corrects errors, reinforces shared mistakes, or signals the need for stronger evidence and human review.
Fast, Accurate, and Local Conversion of MIMIC-IV to OMOP with DBT
Adam Sutton | Niko Moller-Grell | Thomas Searle | Richard Dobson
dbt mimic omop is a free, open-source resource that converts the MIMIC-IV dataset to the Observational Medical Outcomes Partnership (OMOP) common data model (CDM) format on consumer-level hardware. CDM approaches are increasingly adopted in both industry and academia due to the need for interoperability and reproducibility, including in clinical NLP tasks such as cohort selection, information extraction, and retrieval-augmented generation. The MIMIC-IV database is among the most widely used critical care research datasets, yet existing pipelines to transform it to OMOP depend on enterprise database infrastructure and complex orchestration, limiting accessibility for practitioners and resource-constrained researchers. We further integrate free-text clinical notes (195.6M clinical annotations) and chest radiographs into the OMOP note_nlp and imaging extension tables, making all MIMIC-IV modalities (structured data, free text, and imaging) accessible through a common data model. This resource generates a more comprehensive dataset than existing alternatives and is intended to aid in system development, testing, and evaluation.
Exploring Novel Drug Research Area using Large Language Models Based on Research Trends in Biomedical Literature
Afnan Afnan | Michael Van Supranes | Tomohiro Nishiyama | Shoko Wakamiya | Eiji Aramaki
The rapid expansion of biomedical literature makes manual identification of novel drug-disease relationships increasingly difficult. Existing approaches have leveraged LLMs to mine abstracts or construct knowledge graphs for drug repurposing. These approaches have two key limitations: finite context windows restrict their ability to capture macro-level research trends, and single-pass black-box pipelines make it difficult to verify outputs. This paper proposes a pipeline for discovering new drug targets by combining disease and drug research trends using Large Language Models (LLMs). Our method extracts PICO components from PubMed abstracts, normalizing the Population and Intervention components to ICD and ATC codes, respectively. A temporal frequency delta matrix is constructed to capture publication count shifts from 2013 to 2022 and is then used to discover novel drug areas. Compared with the abstract-based baseline, our approach showed qualitative signs of generating combinations that were more closely aligned with observed research trends and, in some cases, more clinically plausible. These findings suggest the potential usefulness of structured trend information for LLM-based exploration, although the differences between the two methods were limited and the results remain preliminary. Future work will focus on validating the consistency and reliability of these candidates.
FHexchange: Resources for Family Health History Extraction and Normalization From Consumer Dialog Sources
Michelle Nguyen | Nidhi Soley | Ayah Zirikly | João Sedoc | Casey Taylor
Family health history (FHx) offers insight into a person’s health and disease risk, but it is largely held within free-text clinical formats that require processing for maximal utility of the data. The rapid deployment of ambient AI scribes and conversational agents in clinical settings necessitates evaluation on dynamic patient-clinician and patient-agent dialogs. To address this gap, we introduce two new datasets of patient FHx dialog documents designed to benchmark information extraction and entity linking. Distinct from clinician-entered datasets, patient-reported dialog data has its own semantic and content characteristics, which need to be studied for more patient-centered healthcare. We contribute a publicly available resource called FHexchange, with new annotations for family members, clinical observations, related entities, and standardized UMLS CUIs, offering the clinical NLP community a robust evaluation bed for emerging generative AI tools.
Forgotten Words: Benchmarking NeoBERT for Dementia Detection in Low-Resource Conversational Filipino and English Speech
Rez Samantha Floresca | Edric Castel Hao | Hannah Grachiella Buñales | Chelsea Dominique Temprosa | Georgianna Reyes | Kervin Gabriel Chua
Dementia detection from spontaneous speech offers a scalable approach to cognitive screening, yet NLP systems remain predominantly English-centric. This limitation is especially acute in the Philippines, where Filipino–English code-switching is pervasive and no prior work has addressed NLP-based dementia detection. We present the first systematic evaluation of transformer-based dementia detection in Filipino speech and the first assessment of NeoBERT in a clinical NLP setting. To separate language from domain effects, we construct a parallel bilingual dataset of 4,000 DementiaBank-derived transcripts, with Filipino translations produced manually to preserve discourse-level markers of cognitive decline. We evaluate five model families, TF-IDF + LogReg, BERT, NeoBERT, XLM-R, and RoBERTa-Tagalog, under monolingual, zero-shot cross-lingual, and bilingual fine-tuning settings. We find that in-domain performance does not transfer across languages, with English-trained BERT dropping to Macro-F1 = 0.455 on Filipino, and that architectural modernization alone does not improve robustness. Bilingual fine-tuning, however, eliminates cross-lingual degradation across all transformer models, converging to Macro-F1 = 0.969–0.973. These results suggest that multilingual clinical NLP performance is driven primarily by linguistic coverage during training rather than model scale or architecture.
IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages
Shubham Nigam | Suparnojit Sarkar | Piyush Patel
We present IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages (Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu). The dataset extends the MDDial corpus with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors introduced during automatic translation. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation (LoRA) of a quantized small language model, incorporating an optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate IndicMedLM against zero-shot multilingual baselines across ten languages and conduct systematic error analysis, identifying five failure modes: Instruction Drift, Label Collapse, Cross-Domain Confusion, Tokenization Failure, and Paraphrase-over-Label Generation. Results show strong post-processed diagnostic accuracy in Hindi, Marathi, and Bengali, while Assamese, Tamil, and Telugu remain in an extreme failure tier attributable to base-model tokenizer gaps, a finding with direct patient safety implications. Medical expert evaluation confirms the clinical plausibility and safety of the generated consultations.
Towards a Radiologist Imitation Framework for 3D CT Diagnosis with Multimodal LLMs
Kaidi Zhang | Zhiyuan Yan | Gao Cheng | Zhenyang Cai
Three-dimensional Computed Tomography (3D CT) is a cornerstone of precision medicine. Most AI diagnostic models analyze large numbers of CT slices uniformly, treating all slices as equally important. While this has partly accelerated radiologists’ workflows, it overlooks that clinically relevant information is often sparsely distributed throughout a volume. Without targeted or weighted processing, fine-grained cues may be missed and substantial computation wasted on diagnostically uninformative slices. We propose a radiologist-simulating framework for selective and efficient 3D CT interpretation. Evaluated on a 3D CT dataset covering eight thoracic lesion types, it was compared with state-of-the-art multimodal large language models such as GPT-4o and supervised visual backbones including ViT and ResNet-50. Using accuracy, F1-score, AUC, and blind radiologist assessment, Screen-CLIP achieved an AUC of 0.87 and F1-score of 0.82, surpassing ViT Base (AUC: 0.84). For report generation, our method outperformed M3D across all metrics, reaching a BLEU-Avg of 29.03, and achieved the highest average Doctors’ Score (6.16/10) in a preliminary human evaluation.
Probing and Steering Uncertainty in Biomedical Language Models: Representational Structure and Behavioral Limits
Debmalya Pal
Biomedical language models can generate overly confident clinical statements despite incomplete or ambiguous evidence. We study whether linguistic uncertainty (the hedged epistemic stance expressed in phrases such as "consistent with" or "cannot exclude") is encoded in model representations and can be controlled without retraining. Across six biomedical language models spanning two architectures (causal decoders and bidirectional encoders), we show that uncertainty is captured by robust low-dimensional linear structure in hidden states. We then apply activation steering to manipulate this representation directly, increasing hedged generation in decoder models and inducing targeted uncertainty-related shifts in encoder representations. Together, these results show that epistemic stance is not merely a surface linguistic phenomenon but an interpretable and controllable feature of biomedical language model representations, with implications for safer and more calibrated clinical text generation.
Relations of Linguistic Features and Medical Text Preferences are Nontrivial
Davis Bartels | Brandon Colelough | Dina Demner-Fushman
We study how simple linguistic features relate to reader preferences in medical question answering. Our dataset contains answers to medical questions ranked in order of quality. We examine eight interpretable features of the answer text: length in words, average words per sentence, percentage of polysyllabic words, medical named entity density, perplexity, coherence, and dependency distance. We find substantial variation across annotators in both the strength and direction of these relationships. Answer length shows some of the strongest associations and predictive signals, but preferences are not consistent across annotators, with some favoring longer answers and others favoring shorter ones. A leave-one-out ablation study shows the relative impact on the predictive accuracy of our models. Overall, these results suggest that linguistic form can influence reader preference in medical text, but that these effects vary across readers and may be more complex than simple linear correlations.
Overview of the MedGenVidQA 2026 Shared Task on Medical Generative Video Question Answering
Deepak Gupta | Collin Campbell | Pedram Golnari | Dina Demner-Fushman
This paper presents an overview of the MedGenVidQA 2026 shared task on medical video question answering, collocated with the 25th BioNLP workshop at ACL 2026. The shared task addressed three related sub-tasks of the medical multimodal (textual and video) question answering: (i) multimodal retrieval tasks, (ii) multimodal answer generation with citations, and (iii) a visual answer localization task. The key theme of the stated task is to develop reliable multimodal question answering systems for consumers and medical professionals by leveraging generative models. A total of nine teams participated in the shared task challenges and submitted a total of forty-three submissions across all tasks. We performed both automated and human assessments to evaluate the submissions. This paper describes the tasks, datasets, evaluation metrics, participation, and baseline systems for all three tasks. Additionally, we summarize the techniques and results of the evaluation of the various approaches explored by the participating teams. Finally, we discuss the key findings and implications for the development of multimodal medical question answering.
Overview of the ClinicalSkillQA 2026 Shared Task on Continuous Perception and Procedural Reasoning in Clinical Skill Assessment
Xiyang Huang | Renxiong Wei | Yihuai Xu | Zhiyuan Chen | Keying Wu | Jiayi Xiang | Buzhou Tang | Yanqing Ye | Jinyu Chen | Cheng Zeng | Min Peng | Qianqian Xie | Sophia Ananiadou
This paper presents an overview of the ClinicalSkillQA 2026 shared task, which was organized with the BioNLP Workshop at ACL 2026. The goal of this shared task is to evaluate continuous perception and procedural reasoning in clinical skill assessment by requiring systems to reconstruct the correct temporal order of shuffled clinical key frames and generate rationales grounded in clinical workflow knowledge. The benchmark contains 200 test-only instances sampled from clinical skill videos, covering three emergency-care procedures. Each instance is annotated with the ground-truth temporal order and an expert-verified rationale. A total of seven teams participated in the task, collectively making 90 submissions, with four teams providing system description papers. Systems are evaluated using Task Accuracy, Pairwise Accuracy, and BERTScore, which measure exact sequence reconstruction, local temporal consistency, and rationale quality, respectively. In this paper, we describe the task setup, dataset construction, and evaluation criteria. We further summarize the methodologies adopted by participating teams and present a comprehensive analysis of the submitted systems. The official results suggest that current models still struggle with continuous perception and procedural reasoning, especially when they must integrate visual evidence, temporal structure, and clinical workflow knowledge.
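The two ordering metrics named in this abstract can be illustrated with a minimal sketch. It assumes that Task Accuracy is an exact match on the full sequence and that Pairwise Accuracy counts every pair of frames whose relative order agrees with the gold order; the official scorer may define the pairs differently (for example, adjacent pairs only), so the function names and definitions here are illustrative rather than the organizers' implementation.

```python
from itertools import combinations


def task_accuracy(pred, gold):
    """Return 1.0 if the predicted order matches the gold order exactly, else 0.0."""
    return 1.0 if list(pred) == list(gold) else 0.0


def pairwise_accuracy(pred, gold):
    """Fraction of frame pairs whose relative order agrees with the gold order.

    Assumption: all unordered pairs are counted, not only adjacent ones.
    """
    position = {frame: i for i, frame in enumerate(pred)}
    pairs = list(combinations(gold, 2))  # each (a, b) has a before b in gold
    if not pairs:
        return 1.0
    agree = sum(1 for a, b in pairs if position[a] < position[b])
    return agree / len(pairs)


# Hypothetical frame identifiers: one swapped pair out of six.
gold_order = ["A", "B", "C", "D"]
pred_order = ["A", "C", "B", "D"]
exact = task_accuracy(pred_order, gold_order)
pairwise = pairwise_accuracy(pred_order, gold_order)
```

A single swap makes the exact-match score zero while the pairwise score stays high, which is one way to read the gap between the two metrics reported by participating systems.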
Proceedings of the BioNLP 2026 (Shared Tasks)
This paper describes our participation in the CRF Filling Shared Task 2026, which aims to automatically populate a predefined Case Report Form (CRF) from clinical notes describing patients with dyspnea. We propose a two-stage pipeline based on large language models (LLMs). In the first stage, a few-shot prompted LLM extracts candidate CRF fields from the clinical note and outputs them in a structured JSON format. In the second stage, a separate LLM verifies each extracted field against the original note and removes predictions that are not supported by explicit textual evidence. This verification step aims to reduce false positives generated during extraction. Experiments on the development set show that the verification stage significantly reduces unsupported predictions while preserving most correct extractions, resulting in improved macro F1. On the official test set, the proposed system achieves a macro F1 score of 0.56 for both English and Italian. These results indicate that separating extraction and verification can balance recall-oriented extraction with precision-oriented validation in CRF population tasks.
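The extract-then-verify structure described above can be sketched as follows. Both stages are stand-ins: the extractor returns a hard-coded JSON string where a real system would call a few-shot prompted LLM, and the verifier uses a plain substring check where the paper uses a second LLM. The field names and evidence strings are hypothetical.

```python
import json


def extract_fields(note):
    """Stage 1 stand-in: a real system would prompt an LLM and parse its JSON output."""
    # Hypothetical extractor output, hard-coded for illustration only.
    raw = '{"dyspnea": "shortness of breath", "fever": "high temperature"}'
    return json.loads(raw)


def verify_fields(note, candidates):
    """Stage 2 stand-in: keep a field only if its evidence text occurs in the note."""
    text = note.lower()
    return {field: ev for field, ev in candidates.items() if ev.lower() in text}


note = "Patient reports shortness of breath on exertion. No cough."
filled = verify_fields(note, extract_fields(note))
```

Here the unsupported "fever" field is dropped because no evidence for it appears in the note, mirroring the precision-oriented role of the verification stage.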
VerbaNexAI at ClinicalSkillQA: From Visual Evidence to Procedural Order A Two-Stage Generative Vision-Language Framework for ClinSkillQA
Andrea Menco Tovar | Jairo E. Serrano | Edwin Puertas | Juan Carlos Martinez-Santos
This work addresses the temporal ordering task of clinical frames in the Basic Life Support (BLS) subset of ClinSkillQA. A two-stage hybrid pipeline based on Qwen2-VL-2B-Instruct in a zero-shot configuration is proposed. In Stage 1, each image is processed independently to extract factual visual evidence, which is then transformed, using deterministic rules, into a structured representation. In Stage 2, ordering is formulated as an ordinal scoring task over procedural stages, with ties broken using PCA applied to multimodal embeddings. Evaluation followed the official benchmark protocol, considering Task Accuracy, Pairwise Accuracy, and BERTScore. In the test phase, the system achieved Task Accuracy = 0.17, Pairwise Micro Accuracy = 0.60, and BERT F1 = 0.71, with complete coverage in both predictions and rationales. The results demonstrate an interpretable and reproducible foundation, although challenges in fine-grained temporal discrimination remain.
zzucs at PsyDefDetect: Bridging Long-Tail Imbalance and Clinical Rubrics for DMRS Defense-Level Detection
Bin Huang | Liuyuan Su | Kaixuan Yuan | Guanghui Zhao | Shixin Zhang | Kunli Zhang
Detecting DMRS defense levels in emotional support dialogues is challenging due to severe class imbalance and fine-grained clinical distinctions between adjacent levels, issues well documented in psychotherapy-oriented NLP surveys (Na et al., 2025). We present zzucs for PsyDefDetect at BioNLP 2026 (Na et al., 2026a), adopting a data–supervision co-design strategy. SCCR applies stratified resampling to balance support across nine defense levels. CoR–QLoRA encodes clinical rubrics, including task contracts, taxonomy definitions, and boundary cues, into static prompts for 8B model fine-tuning. Ablations show SCCR improves macro-F1 by 4.9 points over random oversampling. Our system from team zzucs, submitted on CodaBench under the display name sly_zzu with submission ID 652647, achieves 0.3585 macro-F1 on the official blind-test leaderboard LB1. It ranks 6th of 21 registered teams with official submissions and surpasses all published 8B baselines by 4.4 F1 points over the strongest 8B comparator, Ministral-8B. The code has been released at https://github.com/jackssdd/zzucs_psydefdetect_code.
zzunlp at ClinicalSkillQA: Perceive-and-Plan with Decomposed In-Context Learning and Saliency-Guided Perception for Clinical Skill Keyframe Reordering
Bin Huang | Yi Luo | Zhontian Hua | Guanghui Zhao | Kaixuan Yuan | Kunli Zhang
Multimodal Large Language Models (MLLMs) show strong medical visual understanding; however, their capability for continuous perception in procedural clinical workflows remains underexplored. We present Perceive-and-Plan, a decomposed in-context learning paradigm for clinical skill keyframe reordering. The method separates visual perception from temporal planning via two stages: (1) structured visual perception with saliency-guided Picture-in-Picture (PiP) composition that magnifies critical regions (head, chest) as color-coded insets, and (2) temporal reasoning with chain-style self-verification via fresh conversation reset and visual-evidence anchoring (BLS Rules R1-R11). Without parameter updates, our system scores 71.43 overall (2nd place, ClinSkill QA 2026), with 0.86 pairwise accuracy and 1.0 rationale coverage. Structured prompting with visual saliency guidance measurably improves MLLMs’ procedural understanding. Our code is published at https://github.com/NanceTide/clinskillqa-perceive-and-plan.
DLNLP at ClinicalSkillQA: EvidenceFlow for Structured Zero-Shot Clinical Keyframe Ordering
Kexin Li | Zhekun Wang | Yiran Wang | Di Zhao
The ClinSkill QA shared task requires models to recover the temporal order of scrambled clinical keyframes and generate explanations. We propose EvidenceFlow, a structured zero-shot framework based on Qwen2.5-VL that decomposes the task into global overview, local evidence modeling, and ordering decision, with two variants: model-led EvidenceFlow-M and rule-guided EvidenceFlow-R. On the official test set, EvidenceFlow-R achieves better ordering performance, while EvidenceFlow-M produces better explanation quality, revealing a trade-off between ordering stability and rationale generation. EvidenceFlow provides an interpretable zero-shot baseline for clinical keyframe ordering.
UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification
Dima Galat | Marian Rizoiu
This paper describes our system for classifying psychological defense mechanisms in emotional support dialogues using the Defense Mechanism Rating Scales (DMRS), placing second (F1 0.406) among 64 teams. A central insight is that defense mechanisms are defined by what is absent: missing affect, blocked cognition, denied reality. We encode this as an affect-cognition integration spectrum in prompt-level clinical rules, which account for the largest single gain (+11.4pp F1). Our architecture is a multi-phase deliberative council of Gemini 2.5 agents where class-specific advocates rate evidence strength rather than voting, achieving F1 0.382 with no fine-tuning - a top-5 result on its own. We find, however, that the council is confidently wrong about minority classes: 59–80% of stable minority predictions are incorrect, driven by a systematic "L7 attractor" in which emotional content defaults to the majority class. A targeted override ensemble from three fine-tuned Qwen3.5 models applies 16 overrides (+2.4pp), selected by a structured multi-agent system (builder, critic, regression guard) that produced a larger F1 gain in one iteration than 8 prior attempts combined.
Otter at MedExAct2026: Diverse Encoder Ensemble for Medical Decision Span Detection
Lalita Lowphansirikul | Piyalitt Ittichaiwong
We build an ensemble of 10 transformer encoders for the MedExACT 2026 shared task on medical decision span detection. The ensemble is diversified along three training directions: encoder initialization (including domain-adaptive pre-training on clinical text), loss function, and data augmentation with LLM-generated synthetic notes and silver-labeled clinical documents. Greedy forward search selects the combination with the highest validation final score. A BERT-based boundary refiner is applied to the ensemble’s predicted spans to correct offset errors before submission.
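Greedy forward search over ensemble members can be sketched as below. The model names, validation probabilities, and the averaging-based scoring function are all hypothetical; the sketch only illustrates the general selection loop (add the member that most improves the validation score, stop when nothing improves), not the authors' scoring or stopping criteria.

```python
def greedy_forward_selection(candidates, score):
    """Grow an ensemble one member at a time while the validation score improves."""
    selected, best = [], float("-inf")
    remaining = list(candidates)
    while remaining:
        scored = [(score(selected + [c]), c) for c in remaining]
        top_score, pick = max(scored, key=lambda pair: pair[0])
        if top_score <= best:  # no candidate improves the ensemble: stop
            break
        selected.append(pick)
        remaining.remove(pick)
        best = top_score
    return selected, best


# Hypothetical validation labels and per-model positive-class probabilities.
GOLD = [1, 0, 1, 1, 0]
PROBS = {
    "m1": [0.9, 0.2, 0.4, 0.8, 0.6],
    "m2": [0.6, 0.4, 0.9, 0.3, 0.1],
    "m3": [0.2, 0.7, 0.6, 0.9, 0.3],
}


def averaged_accuracy(members):
    """Accuracy of the ensemble that averages member probabilities (threshold 0.5)."""
    correct = 0
    for i, gold in enumerate(GOLD):
        mean = sum(PROBS[m][i] for m in members) / len(members)
        correct += int((mean > 0.5) == bool(gold))
    return correct / len(GOLD)


selected, best = greedy_forward_selection(PROBS, averaged_accuracy)
```

In this toy setup the search keeps two of the three models, since adding the third does not raise the validation score any further.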
Eraserhead at PsyDefDetect: Prompt Design and Class Rebalancing for Psychological Defense Mechanism Detection
Muhammad Abu Horaira | Mehreen Rahman | Nahian Chowdhury
We describe the Eraserhead system submitted to the PsyDefDetect shared task at BioNLP 2026, which frames psychological defense level detection as a nine-class utterance classification problem over supportive dialogue. Our system is based on Qwen3-14B and combines clinically informed prompt design, per-label oversampling, and careful inference settings for stable prediction. A central challenge of the task is strong class imbalance, with High-Adaptive responses appearing far more often than several minority classes. This makes it easy for models to favor the majority class and achieve reasonable accuracy while performing poorly on rarer categories. To address this, we iteratively adjusted oversampling targets based on error analysis and predicted label distributions across submission rounds. Our final system achieved an official macro F1 of 0.3418 on Leaderboard 1 and 0.3947 on Leaderboard 2, ranking 7th among the 21 registered teams on both leaderboards. We further analyze the main failure modes of the system, especially the difficulty of distinguishing Minor Image Distorting defenses from High-Adaptive responses and the persistent tendency to over-predict the majority class. These findings highlight the broader difficulty of modeling psychological function from text alone.
Nürnberg NLP at PsyDefDetect: Multi-Axis Voter Ensembles for Psychological Defence Mechanism Classification
Philipp Steigerwald | Eric Rudolph | Jens Albrecht
Detecting levels of psychological defence mechanisms in supportive conversations is inherently ambiguous. In the PsyDefDetect shared task at BioNLP 2026, the eight positive defence categories share surface language and differ only in pragmatic function, and trained raters reach only moderate inter-annotator agreement. On such a task the decisive lever is not a stronger single model but error independence, since any single representation will waver on the overlapping defence boundaries. We translate this insight into a 9-voter ensemble spanning three orthogonal axes: class granularity (all nine classes for the gatekeeper, only the eight defence classes for the specialists), training method (generative and discriminative) and base model. The system reaches an F1 score of .420 on the hidden test set, placing first among 21 registered teams.
Neural Nexus at PsyDefDetect: Fine-Tuning RoBERTa with Focal Loss and Role-Tagged Dialogue History for Defense Level Detection
Subhrajyoti Basu
We describe our system for the PsyDefDetect shared task at BioNLP 2026, which focuses on classifying help-seeker utterances in multi-turn supportive conversations into nine psychological defense mechanism levels defined by the Defense Mechanism Rating Scales (DMRS). Our approach fine-tunes roberta-base using a composite training objective that combines focal loss, label smoothing, and square-root dampened class weights to address the severe label imbalance present in the PSYDEFCONV corpus, where the dominant class constitutes 52% of the training data. The input representation is constructed by concatenating up to eight dialogue turns with role-specific tags, separated using RoBERTa’s native </s> tokens, followed by the target utterance marked using a [TARGET] token. Model selection is performed using macro-F1 based early stopping on a stratified 15% validation split, along with cosine learning rate decay for stable optimization. Our best submission achieves an official Leaderboard 1 (positive classes) macro-F1 score of 0.2556, ranking 11th among 21 registered teams.
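Two pieces of the training objective mentioned here, square-root dampened class weights and focal loss, can be illustrated in plain Python. The weight formula below (the square root of the usual inverse-frequency weight) is an assumption, since the abstract does not give the exact definition, and label smoothing is left out for brevity. Only the 52% majority share comes from the abstract; the other class counts are hypothetical.

```python
import math
from collections import Counter


def sqrt_dampened_weights(labels):
    """Square root of inverse-frequency weights n / (k * count_c).

    Assumption: this is one common dampening scheme, not necessarily the paper's.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: math.sqrt(n / (k * m)) for c, m in counts.items()}


def focal_loss(p_true, weight=1.0, gamma=2.0):
    """Focal loss for one example: -w * (1 - p_t)^gamma * log(p_t)."""
    return -weight * (1.0 - p_true) ** gamma * math.log(p_true)


# Toy label set: class 0 makes up 52% of the data, as in the abstract.
labels = [0] * 52 + [1] * 36 + [2] * 12
weights = sqrt_dampened_weights(labels)

easy = focal_loss(0.9, weight=weights[0])  # confident, majority-class example
hard = focal_loss(0.5, weight=weights[2])  # uncertain, minority-class example
```

The dampening keeps rare classes up-weighted without letting their weights grow as fast as raw inverse frequency, while the focal term shrinks the contribution of examples the model already classifies confidently.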
ELiRF-UPV@MedExACT 2026: Dynamic Section Conditioning for Medical Decision Span Detection in Discharge Summaries
Vicent Ahuir | Lluís Hurtado | María Castro-Bleda
Extracting medical decisions from discharge summaries is essential for downstream clinical analytics, yet the task remains challenging due to the heterogeneous structure of electronic health records. For the MedExACT track at ACL 2026, we proposed a system that achieved the 4th position. Our approach first applies dynamic section conditioning to capture the contextual dependencies inherent in each document. A transformer backbone is then augmented with category- and section-aware layer mixing, enabling us to fuse global document structure with fine-grained semantic cues. To further improve robustness, we employ an ensemble of instruction-tuned large language models for automatic section extraction, while a fairness-oriented model selection criterion ensures that performance does not degrade on minority demographic subgroups. The resulting system attains a final score of 0.5806 on the held-out test set and demonstrates significant gains over the baseline across all evaluated subpopulations.
VISHC at PsyDefDetect: Mitigating Data Scarcity in Psychological Defense Classification with Context-Aware Synthetic Augmentation
Hoang-Thuy-Duong Vu | Quoc-Cuong Pham | Huy-Hieu Pham
Psychological defense mechanisms (PDMs) are unconscious cognitive processes that modulate how individuals perceive and respond to emotional distress. Automatically classifying PDMs from text is clinically valuable but severely hindered by data scarcity and class imbalance, challenges which generative augmentation alone cannot resolve without psychological grounding. In this work, we address these challenges in the PsyDefDetect shared task (BioNLP@ACL 2026) by proposing a context-aware synthetic augmentation framework combined with a hybrid classification model. Our hybrid model integrates contextual language representations with basic clinical features, along with 150 annotated defense items. Experiments demonstrate that definition quality in prompting directly governs generation fidelity and downstream performance. Our method surpasses DMRS Co-Pilot, reaching an accuracy of 58.26% (+40.25%) and a macro-F1 of 24.62% (+15.99%), thereby establishing a strong baseline for psychologically grounded defense mechanism classification in low-resource settings. Source code is available at: https://github.com/htdgv/CASA-PDC.
Diverse Transformer Ensemble with Majority Voting for Medical Decision Extraction at MedExACT 2026
Rishik Kondadadi
This paper describes our system for the MedExACT 2026 shared task on extracting and classifying medical decisions from ICU discharge summaries. We frame the task as BIO token classification and train 25 diverse transformer models spanning 13 distinct architectures, including Longformer, DeBERTa, RoBERTa, BioBERT, SciBERT, and others. Each model is trained with category-aware oversampling, focal loss, and demographic-group-aware sampling to address class imbalance and promote fairness across patient subgroups. At inference time, we aggregate predictions via text-normalized majority voting, retaining spans agreed upon by at least 6 of 25 models. Our best submission achieves a final score of 0.5554 on the test set, demonstrating that a simple vote-based ensemble over architecturally diverse models outperforms more complex filtering approaches. We find that architectural diversity is a key driver of ensemble quality and that cross-validation is essential for reliable model selection on small clinical datasets.
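Threshold voting over predicted spans can be sketched as follows. The normalization used here (lowercasing and collapsing whitespace) is an assumption about what "text-normalized" means, the spans and category names are hypothetical, and the toy example uses three models with a threshold of two rather than the 6-of-25 setting reported in the abstract.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def vote_spans(model_outputs, min_votes):
    """Keep (text, category) spans predicted by at least min_votes models."""
    votes = Counter()
    for spans in model_outputs:
        # A model votes at most once for each normalized (text, category) pair.
        votes.update({(normalize(text), cat) for text, cat in spans})
    return sorted(key for key, n in votes.items() if n >= min_votes)


# Hypothetical predictions from three models.
outputs = [
    [("Started  heparin", "Drug"), ("CT scan ordered", "Test")],
    [("started heparin", "Drug")],
    [("Started heparin", "Drug"), ("discharged home", "Plan")],
]
kept = vote_spans(outputs, min_votes=2)
```

Setting the threshold well below a strict majority, as in the 6-of-25 rule, trades some precision for recall: a span survives as long as a modest subset of the models agrees on it.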
FBK-NLP at ClinSkill QA 2026: Improving Temporal Reasoning via Keypoint-Augmented Inputs
Pedro Gabriel Campana | Alberto Lavelli | Bernardo Magnini
Understanding procedural skills from visual data is a key challenge in medical AI, especially for tasks that require reasoning over temporal sequences. We report on FBK-NLP’s participation at the ClinSkill QA 2026 shared task, which requires models to arrange shuffled key frames into a coherent sequence of clinical actions and provide explanations for the resulting order. We conduct a systematic study of prompting and reasoning strategies using an open and easily deployable vision-language model (VLM). The central finding of our study is that incorporating keypoint-based representations of people’s body parts substantially improves temporal reasoning behind frame ordering. Furthermore, we show that model performance is highly sensitive to prompt design and to seemingly minor factors such as filename ordering and the inclusion of domain information.
transformer_1376 at PsyDefDetect: A QLoRA-Based Generative Framework for Context-Aware Psychological Defense Mechanism Detection
Pritha Saha | Shuvodwip Saha | Anik Mahmud Shanto
Psychological defense mechanisms play a crucial role in shaping human responses during emotionally charged conversations, yet remain underexplored in natural language processing. In this work, we address the PSYDEFCONV shared task, which involves classifying defense mechanisms in multi-turn dialogues using clinically grounded annotations based on the Defense Mechanism Rating Scales (DMRS). We propose a generative supervised fine-tuning framework that reformulates the task as conditional text generation. A pre-trained causal language model (Gemma-2-2B) is adapted using parameter-efficient fine-tuning (PEFT) with 4-bit quantization, enabling efficient training under limited computational resources. To handle class imbalance, we apply random oversampling, and we design a prompt-based input representation to incorporate conversational context effectively. Experimental results demonstrate that our generative approach is competitive with discriminative baselines while offering improved flexibility in modeling subtle and context-dependent defensive behaviors. The findings highlight the potential of generative large language models for psychologically grounded dialogue understanding tasks.
Explainators at PsyDefDetect: Hierarchical Prompting and Representation-Based Classification for Psychological Defenses
Liudmila Babakova | Christopher Luongo-Vazquez | Ilia Stepin
Psychological defense detection is one of the essential present-day challenges in clinical practice. State-of-the-art natural language processing (NLP) tools aim to automate this task. However, their potential and efficiency remain largely unexplored. This manuscript attempts to address this problem from various perspectives: it first explores the efficiency of direct large language model (LLM) prompting. Then, it applies NLP techniques for LLM fine-tuning to the psychological defense classification task. Finally, it attempts to generate states of mind based on the speaker’s psychological state. The results show that the complexity of the task requires further improvement of the software solutions used.
PerceptionLab at PsyDefDetect: Overcoming Extreme Response Bias in LLMs via Rubric-Grounded Retrieval and Supervised Clinical Reasoning Distillation for Fine-Grained Ordinal Classification
Tamjid Fahim | Syed Johan | Saad Bin Maksud
Automating the classification of psychological defense mechanisms is a critical yet challenging frontier in clinical natural language processing. General-purpose Large Language Models (LLMs) struggle to apply fine-grained ordinal frameworks like the Defense Mechanism Rating Scales due to the implicit nature of clinical cues and a fundamental clinical reasoning gap. These models exhibit severe extreme response bias, systematically gravitating toward the scale’s endpoints while failing to resolve nuanced, mid-level defenses. In this paper, we present our third-place system for the PsyDefDetect Shared Task at BioNLP 2026, designed specifically to overcome this failure mode. We propose a hybrid architecture that synergizes label-flattened generative retrieval with an LLM classifier fine-tuned via the distillation of supervised clinical reasoning traces. This dual approach, grounding decisions in rubric criteria while leveraging task-specific supervision, successfully mitigates the observed bias, achieving an accuracy of 67.37% and a macro-F1 of 39.56%. Our work provides empirical evidence that tightly integrating targeted clinical supervision with dynamic rubric-grounded retrieval significantly outperforms the raw parameter scale of un-tuned foundation models.
LinguIUTics at PsyDefDetect: Iterative Imbalance-Aware Fine-tuning of Qwen3-8B for Psychological Defense Mechanism Classification
Shefayat Adib | Ahmed Sani | Md Hasibur Alif | Ajwad Abrar
Detecting psychological defense mechanisms in conversational text remains a challenging clinical NLP problem. For the PsyDefDetect 2026 shared task (9-class utterance classification evaluated via macro F1), our team LinguIUTics achieves a macro F1-score of 0.3917 on the official positive-class leaderboard, ranking 4th out of 21 registered teams and improving over the Ministral-8B task baseline (31.48 macro F1) by +7.7 absolute points (+24.4% relative). BERT-family encoders and zero-shot LLMs proved ineffective on rare classes due to severe class imbalance, leading us to QLoRA fine-tuning of Qwen3-8B. We leverage three key strategies: grouped stratified cross-validation (preventing leakage), minority-class round-robin lexical augmentation, and a post-processing pipeline with logit-bias tuning and ensemble blending. Together, these components close much of the validation–leaderboard gap and substantially improve minority-class recall, driving the critical "Unclear" class (Level 8) from near-zero performance to F1=0.797.
TONI-NLP at PsyDefDetect: Defense Mechanism Detection via LLM-based Ensemble Methods
Durjoy Paul | Arshitha Basavaraj | Callum Chan | Veronica Perez-Rosas | Diana Inkpen | Francisco Pereira | Juan Antonio Lossio-Ventura
This system paper presents the approach of Team TONI-NLP to the PsyDefDetect 2026 shared task. The objective of the task was to classify utterances from helper–seeker conversations into nine categories: seven labels representing progressively higher levels of defensive maturity, one label indicating the absence of a defense mechanism, and one label for cases requiring additional information. We investigated several modern NLP approaches, including prompt engineering, fine-tuning, hierarchical modeling and classification using text embeddings derived from transformer-based models as well as classical embeddings such as TF-IDF. Our results show that ensemble methods performed best among our submitted systems, achieving a macro-F1 score of 0.320 and ranking 9th in the shared task out of 21 teams.
Zero-Shot, Fine-Tuned, and Retrieval-Augmented Extraction of Clinical Decisions with Corpus Boundary Diagnostics
Mohammed Alliheedi | Robert Mercer | Anemily Machina | Sudipta Roy | Yetian Wang | Xindi Wang
We present the CanSA system for the MedExACT@ACL 2026 shared task, which requires extracting and classifying clinical decisions from ICU discharge summaries into nine DICTUM categories. We have developed three approaches: (1) a training-free system which consists of a preprocessing module that normalizes text and an inference engine combining zero-shot LLMs with a RAG ensemble, (2) a supervised fine-tuning method which required training, and (3) a training-free retrieval-augmented pipeline employing TF–IDF-based lexical retrieval to surface in-context exemplars from the development corpus, combined with section-aware chunking and structured extraction calls to a large language model. Our team’s best submission achieved a Final Score of 0.41, ranking 34th out of 37 on the official test leaderboard.
CASPAR: A Context-Aware Span Refinement Approach for Decision Support
Jing Tao | Amir Eskandari | Farhana Zulkernine
This paper presents CASPAR, a two-stage approach for the MedExACT shared task on medical decision span extraction and classification from ICU discharge summaries. Stage 1 performs document-level sequence labeling using a sliding-window RoBERTa encoder with BiGRU and CRF to generate candidate spans. Stage 2 applies a lightweight refinement module that revisits each candidate within its surrounding context to revise category assignments and correct span boundaries. The system achieves a final score of 0.5668 on the official leaderboard, substantially outperforming the organizer baseline on span-level F1. In addition to the system description, we provide ablation results, repeated-run validation statistics, and subgroup- and error-level analyses that highlight the challenges of exact boundary recovery and confusion in race-category subgroups in clinical decision extraction.
KCL-Cogstack at PsyDefDetect: A Hierarchical Approach to Detecting Defense Mechanisms in Supportive Dialogue
Shubham Agarwal | Thomas Searle | Richard Dobson
We present our system for the PsyDefDetect shared task, which focuses on detecting and classifying psychological defense mechanisms in peer emotional support conversations. Our core contribution is a hierarchical classification framework that structures prediction as a coarse-to-fine pipeline over a clinically validated label hierarchy, grounded in the Defense Mechanism Rating Scales (DMRS). Through systematic experimentation with flat fine-tuning, few-shot prompting, and hierarchical classification, we demonstrate that explicitly modelling the structured relationships among defense levels offers a more effective alternative to flat classification, achieving a macro F1 of 0.23 on the official test set.
DAL Team at PsyDefDetect: From Supervised Encoders to Hierarchical LLM-RAG for Psychological Defense Detection
Anh Chu | Luong Tran | Dat Do | Phuong Mai | Quynh Le | Cat Can
We propose a hierarchical framework for psychological defense mechanism detection in multi-turn dialogues, integrating large language models, retrieval-augmented generation, and heuristic calibration. Our approach decomposes prediction into coarse-to-fine reasoning stages and incorporates dialogue reconstruction, explanation-enhanced retrieval, and hybrid LLM–supervised filtering to address severe label imbalance and implicit, context-dependent labeling. Experiments on the PsyDefDetect dataset show that LLM-based RAG improves performance on minority and ambiguous classes, achieving a Macro F1 of 0.31, while also revealing persistent challenges in fine-grained discrimination of latent psychological constructs.
CUAMC @ MedExACT 2026: Robust Ensemble Voting for Fair Medical Decision Extraction
William Baumgartner | Lisa Schilling
Automated extraction of medical decisions from clinical notes is a critical step toward constructing more granular patient health trajectories than are currently obtainable from structured healthcare data. Here we present a system designed for the MedExACT shared task that employs an ensemble of BERT-based classifiers to account for demographic diversity when extracting mentions of medical decisions from MIMIC-III discharge summaries. A simple voting strategy combined with architectural diversity is demonstrated to work best when training data is limited.
LAMAR at MedExACT 2026: Agreement-Driven Large Language Model Ensembles for Clinical Decision Extraction from Discharge Summaries
Monrada Chiewhawan | Keetawan Limaroon | Titipat Achakulvisut
This paper presents an ensemble of Qwen3.5-4B language models for extracting medical decisions from discharge summaries in the MedDec dataset. The models were trained to annotate discharge summaries with inline XML-like tags. Three different training strategies were used including dynamic fine-tuning, reinforcement learning, and pseudo-label augmentation. By combining predictions based on inter-model agreement, the system improved performance across evaluation metrics, achieving an overall F1 of 0.5942 and ranking second on the test leaderboard. The results also showed stable performance across demographic groups, suggesting fairness for underrepresented populations.
CS_Metro at PsyDefDetect: Detecting Psychological Defense Mechanisms in Mental Health Dialogues with Summarization-Enhanced Transformer Ensembles
Oarisa Rebayet | Radiul Walee | Symom Hossain Shohan | Kawsar Ahmed | Mohammed Moshiul Hoque
Detecting psychological defense mechanisms in supportive conversations is essential for assisting mental health practitioners. Natural language processing techniques are increasingly integral to such systems, enabling automated classification of defense levels to better understand help-seeker behavior and resistance patterns. In PsyDefDetect at BioNLP 2026, we address the task of nine-class defense level classification on the PSYDEFCONV corpus. We propose a three-stage pipeline combining LLM-based dialogue summarization, domain-specific transformer fine-tuning, and rule-based ensemble prediction. Additionally, we evaluate three mental health domain-specific transformers (Mental-BERT, Mental-RoBERTa, Mental-XLNet) alongside fine-tuned LLMs (Qwen3-4B, Qwen3-1.7B, Mistral-7B) under different input conditions. Experimental results on the released test-set gold labels show that our ensemble approach achieves the best performance, reaching 34.69% macro F1 and surpassing the baseline by 4.69 percentage points. On the official PsyDefDetect Leaderboard 1 (labels 1–8), the submitted system achieved a Macro-F1 score of 23.46%, ranking 15th out of 21 teams, while on Leaderboard 2 (labels 0–8), it achieved 30.04%, securing 14th place. These findings demonstrate that domain-specific transformers substantially outperform generic LLM fine-tuning on this specialized clinical task.
Sparse Category Routing and Fairness-Aware Optimization for Medical Decision Extraction
Ahmed Elshehaby | Mohamed Abdalla | Youssef Mohamed
Extracting structured medical decisions from ICU discharge summaries is hard because of long documents, severe category imbalance across nine DICTUM decision types, and a fairness-aware evaluation that penalizes inconsistent performance across demographic subgroups. We present our system for the MedExACT 2026 shared task (Elgaar et al., 2026), which fine-tunes BiomedBERT with a composite loss combining label-smoothed cross-entropy, a soft token-F1 auxiliary term, and R-Drop regularization. At inference time we apply a deterministic ensemble: half-offset sliding-window augmentation across four window configurations, dual-branch logit aggregation from the same checkpoint, per-category length calibration on the Anchor Branch, and sparse routing of categories 4 and 7 to a context-weighted specialist branch motivated by their unusual span-length distributions. Adding R-Drop improved validation Overall_F1 by 1.24 points over the CE + soft-F1 baseline, with a larger 1.70-point gain on Worst-Group F1. Our best submission achieves Span F1 of 0.4900, Token F1 of 0.6796, and an official Overall_F1 of 0.5724, with the African American subgroup as the Worst-Group bottleneck at Base_Score 0.5601.
AlienAnnotators at PsyDefDetect: What Lies Between the Lines: Probing Lightweight Open-Source LLMs for Psychological Defense Mechanism Detection
Siam Karip | Nahid Hossain
Detecting psychological defense mechanisms in therapy dialogue is a clinically valuable but computationally underexplored task. We present our systematic analysis for PsyDefDetect, a shared task at BioNLP@ACL 2026, which frames defense detection as a nine-class utterance-level classification problem based on the Defense Mechanism Rating Scale (DMRS). We systematically evaluate six open-source, instruction-tuned small language models (SLMs, ≤ 9B parameters) in zero-shot and fine-tuning settings, and compare a clinically-grounded prompt against the organizer-provided baseline. Our official submission achieved 59.96% accuracy and 16.28% Macro F1. Post-submission experiments show that fine-tuning combined with 5-fold cross-validation and a logit-averaging ensemble substantially improves performance, with the best configuration reaching 34.59% Macro F1 and 65.25% accuracy. We find that clinically-grounded prompts outperform bare label definitions, model scale does not consistently improve zero-shot performance, and fine-tuning dramatically recovers even collapsed zero-shot models. Certain defense tiers remain persistently difficult across all settings, pointing to clinical ambiguity at tier boundaries as a more fundamental bottleneck than data imbalance alone.
Team Aurum at MedExACT 2026@ACL: Data Augmentation and Clinical Longformer Fine-Tuning for Medical Decision Extraction
Jyoti Kumari | Vinay Ulli | Anindita Mondal
This paper describes the system submitted by team Aurum to the Medical Decision Extraction, Analysis, and Classification Task (MedExACT) at BioNLP 2026. The task requires the extraction and classification of contiguous text spans representing medical decisions from lengthy ICU discharge summaries. To address the dual challenges of long document lengths and severe class imbalance within a limited training set of 350 notes, we propose a two-pronged strategy. First, we employ a tripartite data augmentation pipeline utilizing rule-based entity replacement, LLM-based contextual paraphrasing, and synthetic note generation to expand the training data to over 2,300 notes. Second, we fine-tune a domain-specific Clinical Longformer model equipped with a sliding-window inference mechanism and Focal Loss to handle sequences up to 2,048 tokens while focusing on rare decision categories. Paired with a targeted post-processing module, our system achieved a Final Score of 0.5251, demonstrating high token-level detection (Token F1: 0.6311) and strong stability across patient demographics.
NJUST-KMG at MedGenVidQA 2026: Cascade Multi-modal Alignment with Gaussian Soft Priors for Medical Visual Answer Localization
Jinglong Li | Yang Yang
This paper describes the system developed for the Medical Visual Answer Localization (MVAL) task at MedGenVidQA 2026. Accurately locating surgical or instructional steps in medical videos is inherently challenging due to audio-visual asynchrony and the visual homogeneity of surgical scenes. We propose a Cascade Multi-modal Alignment Framework that integrates Large Language Models (LLMs) to bridge the semantic-temporal gap. Our pipeline utilizes WhisperX for word-level speech transcription to ensure precise textual anchoring. We then employ Gemini3 as a high-level semantic ranker to generate multi-scale textual priors. Crucially, we transform these discrete semantic scores into a continuous 1D Gaussian Soft Prior, which is injected as an attention bias into our cross-modal fusion network. This mechanism preserves global temporal context while guiding the model to focus on query-relevant frames. Our system achieves highly competitive performance on the validation leaderboard, particularly under strict evaluation metrics, reaching an IoU@0.7 of 67.5%.
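The abstract does not say how the discrete scores are turned into a continuous prior, so the following is only a generic sketch of the idea under stated assumptions: each scored segment is treated as an anchor frame, the prior is a score-weighted sum of Gaussians, and its logarithm is used as an additive attention bias. The function name `gaussian_soft_prior`, the anchor format, and the `sigma` value are all hypothetical.

```python
import math


def gaussian_soft_prior(num_frames, anchors, sigma):
    """Build a per-frame additive attention bias from scored anchor frames.

    anchors: list of (center_frame, score) pairs with non-negative scores
             (hypothetical format, not taken from the paper).
    sigma:   width of each Gaussian, in frames.
    """
    prior = []
    for t in range(num_frames):
        # Score-weighted sum of Gaussians centered on each anchor.
        value = sum(
            score * math.exp(-((t - center) ** 2) / (2 * sigma ** 2))
            for center, score in anchors
        )
        prior.append(value)

    peak = max(prior) or 1.0  # avoid division by zero when all scores are 0
    eps = 1e-6                # keeps far-away frames finite instead of masked
    # Log of the normalized prior: close to 0 at the peak, negative elsewhere.
    return [math.log(p / peak + eps) for p in prior]


bias = gaussian_soft_prior(num_frames=10, anchors=[(4, 0.9), (8, 0.3)], sigma=1.5)
```

Because the bias stays finite everywhere, frames far from an anchor are down-weighted rather than removed, which is one way to keep global temporal context available to the fusion network.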
LAMAR-2 at MedGenVidQA 2026: Visual Answer Localization in Medical Videos via Multimodal LLM and Context-Augmented Prompting
Watcharitpol Sermsrisuwan | Nopporn Lekuthai | Seksan Yoadsanit | Titipat Achakulvisut
This paper presents an approach to localizing visual answers within continuous medical videos using a multi-step multimodal generation pipeline with the MedGenVidQA dataset. We frame visual answer localization as a multimodal fusion problem, integrating raw video, timestamped ASR transcripts, and VLM-generated scene descriptions into structured contextual blocks, enabling the model to cross-reference spoken commentary against observable physical events. We show that targeted guidance, which forces the model to treat audio transcripts as supplementary hints with observable visual movements, significantly outperforms baseline approaches. It achieves state-of-the-art performance on the test leaderboard, yielding an mIoU of 79.55, alongside IoU@0.3, IoU@0.5, and IoU@0.7 scores of 93.75, 90.00, and 77.50, respectively. Our findings highlight the effectiveness of combining multimodal context fusion with targeted guidance to overcome text bias, establishing a promising approach for achieving the micro-level precision required in the medical domain. We release our code on GitHub at https://github.com/biodatlab/medgenvidqa-lamar.
Varja-Dominators at MedGenVidQA 2026: Hybrid Video and Document Retrieval using PubMedBERT, T5 Query Expansion, and Cross-Encoder Re-Ranking
Pratik Dhaktode | Suhani Bighane | Anupama Phakatkar
This paper presents a system for Task A of the MedGenVidQA 2026 shared task, which requires simultaneously retrieving relevant PubMed documents and medical videos for 60 consumer health topics. The core contribution is a unified multi-stage pipeline that treats video and document retrieval as complementary rather than independent problems. For video retrieval, the system fine-tunes a PubMedBERT bi-encoder on 2,710 MedVidQA training samples using BM25-driven hard negative mining. Video transcripts (833 unique videos) are segmented into overlapping 30-second temporal chunks with a 10-second stride, producing 32,489 indexed chunks. At query time, T5-based query expansion generates enriched queries for BM25 sparse retrieval, while the original query drives FAISS dense retrieval. The two ranked lists are fused via weighted Reciprocal Rank Fusion (RRF, dense weight 0.75, sparse weight 0.25), and a cross-encoder (MiniLM-L-6-v2) re-ranks the top-200 fused candidates to produce the final top-10 videos. For document retrieval, the NCBI PubMed ESearch API is queried using a progressive keyword fallback chain with exponential backoff, ensuring full topic coverage. The system achieves a MAP of 0.3898, Recall@10 of 0.8449, and NDCG@10 of 0.1079, with complete 60/60 topic coverage across both retrieval modalities. Key limitations include reliance solely on transcript text for video retrieval (no visual or audio features) and dependence on a live API for document retrieval.
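As a rough illustration of the weighted Reciprocal Rank Fusion step, the sketch below fuses a dense and a sparse ranking with the 0.75 / 0.25 weights reported in the abstract. The smoothing constant `k=60` is the value commonly used with RRF and is an assumption here, as are the function name `weighted_rrf` and the toy video IDs.

```python
def weighted_rrf(ranked_lists, weights, k=60):
    """Fuse several rankings with weighted Reciprocal Rank Fusion.

    ranked_lists: dict mapping a retriever name to its ranked list of IDs.
    weights:      dict mapping the same names to their fusion weights.
    k:            smoothing constant (60 is a common default, assumed here).
    """
    scores = {}
    for name, ranking in ranked_lists.items():
        for rank, doc_id in enumerate(ranking, start=1):
            # Each retriever contributes weight / (k + rank) for the item.
            scores[doc_id] = scores.get(doc_id, 0.0) + weights[name] / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy rankings (hypothetical IDs), fused with the weights from the abstract.
dense = ["v3", "v1", "v2"]
sparse = ["v2", "v3", "v4"]
fused = weighted_rrf(
    {"dense": dense, "sparse": sparse},
    {"dense": 0.75, "sparse": 0.25},
)
print(fused)  # ['v3', 'v2', 'v1', 'v4']
```

In a pipeline like the one described, the head of this fused list would then be passed to the cross-encoder for re-ranking.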
Pride-Boiler at MedGenVidQA 2026: LLM-Augmented BM25 Retrieval with Corrective Self-Verification for Biomedical Evidence Retrieval
Basil Ebinesar | Keyuan Jiang | Charansai Maddineni | Ashok Raja
This paper describes the Pride-Boiler system submitted to MedGenVidQA 2026 Shared Task A, which requires retrieving relevant PubMed articles and medical instructional videos in response to consumer health queries. Our approach pairs Pyserini BM25 retrieval with LLM-driven query rewriting and a corrective self-verification loop inspired by the Corrective Retrieval-Augmented Generation (CRAG) paradigm. Given a consumer query, the pipeline first asks Google Gemini to generate two clinically optimized search texts, one targeting PubMed abstracts with MeSH terms and clinical synonyms, and another targeting video subtitles with procedural action language. BM25 retrieves a broad candidate pool, and Gemini then scores each candidate against the original query, blending its relevance judgment with the normalized lexical signal. A quality grader assesses the top results: if they are judged insufficient, the pipeline triggers a corrective cycle with reformulated terminology and retries up to three attempts. The entire workflow is orchestrated as a LangGraph state machine. In the official shared task evaluation, Pride-Boiler ranked first among all participating systems on PubMed article retrieval, achieving an nDCG of 0.6532 and MAP of 0.5550, both exceeding the organizer-provided Text-RR baseline. On video (text) retrieval, our system achieves a MAP of 0.5304 and an nDCG of 0.5927, outperforming other systems but falling below the baseline, which indicates the structural limitations of lexical matching over noisy subtitle text. We release the pipeline code to support reproducibility on GitHub at https://github.com/basilll007/BioNLP.
Seahawk at MedGenVidQA 2026: LLM Segment-Range Selection for Medical Visual Answer Localization
Xiaotian Tian | Gulustan Dogan
Medical visual answer localization requires identifying the temporal span in a video where a medical question is answered or visually explained. We present a simple retrieval-and-selection pipeline for Task C that treats visual answer localization as segment-level answer paragraph selection over timestamped video transcripts. Given a question and a segmented transcript, our system prompts DeepSeek to select a contiguous range of transcript segments rather than directly generating timestamps. The final start and end times are then computed deterministically from the selected segment boundaries, decreasing the risk of hallucinated or malformed temporal outputs. To support long videos, we apply overlapping sliding-window prompting and rank candidate ranges using lexical question. In a 20-sample sanity check on the test dataset, a completeness-biased configuration achieved an mIoU of 0.3217, while a shorter duration-penalized configuration improved performance to 0.4815. These results suggest that constrained LLM-based segment selection, combined with deterministic timestamp extraction, is a practical baseline for medical visual answer localization.
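The deterministic timestamp step can be pictured with a minimal sketch like the one below. It assumes the model returns a pair of segment indices and that each transcript segment carries `start` and `end` times in seconds; the function name `range_to_timestamps` and the segment format are hypothetical and not taken from the paper.

```python
def range_to_timestamps(segments, first_idx, last_idx):
    """Map a selected contiguous segment range to (start, end) times.

    segments:  list of dicts with "start" and "end" keys, in seconds
               (hypothetical transcript format).
    first_idx: index of the first selected segment.
    last_idx:  index of the last selected segment (inclusive).
    """
    # Reject malformed selections instead of guessing a timestamp.
    if not (0 <= first_idx <= last_idx < len(segments)):
        raise ValueError("selected range is outside the transcript")
    return segments[first_idx]["start"], segments[last_idx]["end"]


transcript = [
    {"start": 0.0, "end": 4.2},
    {"start": 4.2, "end": 9.8},
    {"start": 9.8, "end": 15.1},
]
span = range_to_timestamps(transcript, first_idx=1, last_idx=2)
print(span)  # (4.2, 15.1)
```

Since the output can only ever be a pair of existing segment boundaries, an invalid model response surfaces as an explicit error rather than as a fabricated time.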
UNCC at MedGenVidQA 2026: Structured Temporal Grounding for Medical Video Question Answering
Hilmi Demirhan | Wlodek Zadrozny
MedGenVidQA 2026 Task C evaluates visual answer localization in medical videos. The system receives a video and a question, then returns the start and end time of the visual answer. Our framework used timestamped automatic speech recognition (ASR) as a proposal source rather than as a final boundary label. The framework generated transcript tables, phase maps, lexical and dense candidate windows, schema-constrained ranking inputs, selective key-frame checks, and a deterministic validation pass for the final JSON file. The ranker selected among bounded candidate intervals instead of generating arbitrary timestamps over a full transcript. Each output can be traced to segment identifiers, candidate source families, selected anchors, phase labels, and validation flags. Our best run ranked fifth among six participant systems, with 62.50 IoU@0.3, 36.25 IoU@0.5, 22.50 IoU@0.7, and 42.57 mIoU. The threshold pattern suggests that coarse temporal retrieval was more reliable than strict start-end localization.
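For readers unfamiliar with the reported metrics, the sketch below shows one conventional way to compute temporal IoU, mIoU, and IoU@threshold over (start, end) intervals. It is not the official scorer: counting a prediction as a hit when its IoU is greater than or equal to the threshold is an assumption, and the names `temporal_iou` and `summarize` are hypothetical.

```python
def temporal_iou(pred, gold):
    """Intersection over union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0


def summarize(preds, golds, thresholds=(0.3, 0.5, 0.7)):
    """Return mIoU and IoU@threshold as percentages."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, golds)]
    report = {"mIoU": 100 * sum(ious) / len(ious)}
    for t in thresholds:
        # Share of predictions whose IoU reaches the threshold (>= assumed).
        report[f"IoU@{t}"] = 100 * sum(i >= t for i in ious) / len(ious)
    return report


# Toy example: one loosely overlapping and one half-overlapping prediction.
report = summarize(
    preds=[(10.0, 20.0), (0.0, 5.0)],
    golds=[(15.0, 25.0), (0.0, 10.0)],
)
```

A run that scores well at IoU@0.3 but drops sharply at IoU@0.7 finds roughly the right region without matching the exact boundaries, which is the pattern the abstract describes.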
Proceedings of the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP 2026)
Vinodkumar Prabhakaran | Sunipa Dev | Luciana Benotti | Daniel Hershcovich | Yong Cao | Li Zhou | Bolei Ma | Ife Adebara
Human annotation is a foundational component of modern natural language processing (NLP). Labeled datasets underpin widely used benchmarks for sentiment analysis, toxicity detection, hate speech classification, and stance detection. Within standard NLP workflows, annotation is generally treated as a technical process aimed at recovering an objective ground truth according to predefined guidelines. This paper argues that such a view overlooks the inherently interpretive nature of annotation. Drawing on insights from sociolinguistics, discourse analysis, and cultural theory, and on a growing empirical literature on annotator subjectivity, we propose that annotation should be understood as a culturally situated interpretive practice. Annotators rely on culturally shaped norms, values, and communicative expectations when interpreting linguistic meaning, and labels in NLP datasets often reflect culturally specific interpretations rather than universal truths. We position this argument relative to recent work on perspectivism, annotator-aware modeling, and cross-cultural annotation, and we use published findings from large-scale cross-cultural annotation studies to illustrate the concrete consequences of treating annotation as objective. We close with a research agenda for culturally informed annotation practice that includes operational recommendations on documentation, modeling, and evaluation.
Somatic in the East, Psychological in the West?: A Clinically-Grounded Evaluation of Cross-Cultural Depression Symptoms in LLMs
Shintaro Sakai | Jisun An | Migyeong Kang | Haewoon Kwak
Large language models (LLMs) are increasingly used for mental health applications, raising questions about whether they reflect established clinical knowledge. Clinical psychology has documented systematic cultural differences in the presentation of depression symptoms, with Western populations emphasizing emotional symptoms and many East Asian populations reporting more somatic symptoms. We evaluate whether general-purpose LLMs reproduce these clinically established cross-cultural patterns using prompts grounded in clinical descriptions of depression. We examine model responses under different cultural personas and languages. We find that LLMs struggle to reproduce expected cultural patterns when prompted in English. Prompting in major Eastern languages improves alignment in some configurations, suggesting that language cues partially activate cultural knowledge. However, model behavior remains dominated by a strong, culture-invariant hierarchy of depression symptoms that often overrides cultural cues, highlighting limitations in current LLMs for mental health applications.
Modeling Cultural and Subcultural Variation in Code-Switched Discourse with Topic Annotation
Nemika Tyagi | Nelvin Licona-Guevara | Olga Kellert
Code-switching is often modeled in NLP as a structural or token-level phenomenon, overlooking its role as a discourse practice shaped by social and cultural context. In this work, we propose topic-based annotation as a framework for analyzing cultural and subcultural variation in bilingual discourse. Using large language models, we annotate 3,691 code-switched sentences from Spanish-English (Miami) and Spanish-Guaraní (Paraguay) corpora with topic and discourse-level information, integrating sociolinguistic metadata. Our analysis reveals systematic relationships between discourse topics, language choice, and social variables such as gender and language dominance. We observe subcultural variation within the Miami community and a clear diglossic distribution in Paraguay, where Guaraní is associated with formal domains and Spanish with informal communication. These findings suggest that modeling code-switching through discourse-level categories provides a more complete representation of multilingual communication and enables both cross-cultural and intra-cultural comparison at scale.
GCCLA: Graph-Conditioned Cross-Lingual Adaptation of Large Language Models Under Extreme Data Scarcity (A Case Study in Tigrigna)
Hagos Gebremedhin Gebremeskel | Chong Feng | Asefa Mebrahtu Abera
Adapting large language models (LLMs) to extremely low-resource languages remains challenging due to severe data scarcity and the lack of structured linguistic supervision. We introduce GCCLA, a graph-conditioned cross-lingual adaptation framework that integrates multilingual knowledge graphs into parameter-efficient LLM adaptation. GCCLA conditions a frozen multilingual LLM on structured semantic and typological relations encoded in a multilingual graph, providing a strong inductive bias for data-efficient transfer. We instantiate and evaluate the framework through a focused case study on English-to-Amharic-to-Tigrinya transfer, where labeled data is extremely limited. By separating knowledge representation from language modeling, GCCLA stabilizes learning and improves sample efficiency in few-shot regimes. We evaluate the approach on five tasks (sentiment analysis, named entity recognition, natural language inference, question answering, and extractive summarization) under extreme data scarcity, with as few as 0–1000 labeled Tigrinya examples. Experimental results show that GCCLA consistently outperforms multilingual, translation-based, and parameter-efficient baselines, achieves competitive performance with as few as 100 labeled examples, and degrades gracefully under partial graph coverage. These findings demonstrate that graph conditioning is an effective principle for data-efficient cross-lingual adaptation of LLMs, advancing equitable NLP.
LLM-Adapted Colombian Spanish Lexicography: Proficiency Control, Hallucination, and Cultural Distortion
Johnatan E. Bonilla
We evaluate whether open-source LLMs can produce proficiency-graded English adaptations of entries from the Diccionario de colombianismos (DiCol), a Colombian Spanish lexicographic resource used in language teaching. Three 7–8B instruction-tuned models (Llama 3.1, Qwen2.5, and Mistral) generate Beginner, Intermediate, and Advanced translations for all 8,252 definitions using structured zero-shot prompts identical across levels except for the target CEFR band. Automated metrics show that Intermediate targeting collapses (73–83% classified as Advanced by vocabulary, 𝜒² > 705, p < .001) and that Advanced outputs expand 4.9–8.2× relative to the source. Expert annotation of a 360-entry stratified sample (𝜅 = 0.61–0.68) identifies hallucination in 19% of entries (Fleiss’ 𝜅 = 0.77 for cultural preservation categories, 97% unanimity among flagged cases). Hallucination concentrates in the Advanced condition (81%, 𝜒² = 86.6, p < .001) and is associated with higher expansion (U = 16,662, p < .001, r = 0.68), manifesting primarily as generic elaboration and, in a smaller proportion, as Colombia-stereotyping and pragmatic polarity inversion. We discuss these findings through the lens of (CITATION)’s domestication framework and describe the observed pattern as algorithmic domestication.
Soft Prompts for Adapting LLMs to Cultural Commonsense Knowledge
Gabrielle Le Bellier | Marine Carpuat | Benoît Sagot | Chloé Clavel
Large Language Models (LLMs) show unbalanced knowledge of cultures across the globe, favoring high-resource cultures over low-resource ones. A possible way to tackle this issue is to fine-tune LLMs on culturally specific data. However, fine-tuning recent LLMs requires high computational resources as well as memory storage, which triggered the development of parameter-efficient fine-tuning (PEFT) approaches, the most widespread being LoRA. In this article, we investigate the use of another class of PEFT approaches, namely soft prompt methods (prompt-tuning and prefix-tuning), to improve LLMs’ cultural knowledge across diverse cultures. We focus on cultural alignment on Multiple-Choice Questions of cultural commonsense knowledge. On this task with limited fine-tuning data, we show that soft-prompt-based methods outperform LoRA in comparable settings. Moreover, the trained soft prompts are interpretable and capture similarities between cultures.
The Mirage of Diversity: Unmasking the Cultural Vocabulary Ceiling in LLMs
Soumedhik Bharati | Subhrajit Mukherjee | Shibam Mandal
Large Language Models are widely used to generate and adapt cultural texts, yet the depth of their cultural representation remains poorly quantified. Intuitively, as a narrative text expands in length, the diversity of cultural words should scale proportionately. To formally test this, we evaluate the FairyTaleQA dataset, adapted by three models, and introduce our primary contribution: the Contextual Stereotype Amplification Index (CSAI), an evaluation framework combining LLM-as-a-judge extraction, embedding-based cliché anchoring, and Natural Language Inference (NLI) congruence validation. By mapping the frequency of extracted Culture Specific Items (CSIs) against narrative length using Heaps’ Law (V = k ⋅ T^𝛽), we present empirical evidence of a systematic limitation in current systems: they struggle to scale cultural diversity even under explicit cultural prompting. Models rapidly hit a "Cultural Vocabulary Ceiling," constrained to a fixed set of hyper-stereotypical terms. Furthermore, we demonstrate that merely optimizing for higher CSI frequency as done in prior works rewards logically broken tokenism. Our CSAI formulation actively penalizes such gratuitous stereotyping, offering a more principled approach to measuring and evaluating cultural homogenization in generative AI systems.
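Heaps' Law relates vocabulary size V to text length T through two parameters, k and β, and taking logarithms gives the linear form log V = log k + β log T. The sketch below fits those parameters by ordinary least squares in log-log space. It is a generic illustration on synthetic data, not the authors' procedure, and the function name `fit_heaps` is hypothetical.

```python
import math


def fit_heaps(lengths, vocab_sizes):
    """Estimate (k, beta) in V = k * T**beta by least squares on logs."""
    xs = [math.log(t) for t in lengths]
    ys = [math.log(v) for v in vocab_sizes]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope of the log-log regression line is the Heaps exponent beta.
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # The intercept gives log k.
    k = math.exp(mean_y - beta * mean_x)
    return k, beta


# Synthetic data generated with k = 2 and beta = 0.5 (not from the paper).
lengths = [100, 400, 1600, 6400]
vocab = [2.0 * t ** 0.5 for t in lengths]
k, beta = fit_heaps(lengths, vocab)
```

Under this reading, an exponent β close to zero would mean the number of distinct cultural items barely grows as the text gets longer, which is one way to picture the ceiling effect the abstract reports.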
The American Palimpsest: Quantifying South Asian English Dialect Erasure in LLMs
Soumedhik Bharati | Shibam Mandal | Swarup Kr Ghosh | Sayani Mondal
Large Language Models are increasingly deployed as writing assistants for users in the Global South, yet rewriting prompts can suppress institutionalized postcolonial varieties. We quantify South Asian English (SAsE) dialect erasure in a state-of-the-art open-weight model using a 500-sentence diagnostic benchmark (320 lexical and 180 syntactic markers). On Llama 3.3 70B, standard grammar correction retains only 26.0% of markers (lexical 31.2%; syntactic 16.7%), while formalization is more destructive (14.0% overall retention). For lexical items, we observe Americanization in 56.2% (correction) and 59.4% (formalization) of cases, typically via Standard American paraphrases. A simple dialect-aware prompt raises retention to 92.0% and reduces lexical Americanization to 6.2%, although some function-word phenomena remain resistant. A stress test shows even stronger suppression (6.7% retention). We position dialect erasure within representational-harm and cultural-competence frameworks, and provide a replicable protocol for auditing writing-assistance systems.
Multilingual NLP is often treated as a route to global inclusion, but linguistic coverage and cultural competence frequently diverge. This paper synthesizes over 50 papers spanning multilingual performance inequality, cross-lingual transfer, culture-aware evaluation, cultural alignment, multimodal benchmarks, benchmark-design critique, and community-grounded data practices. Across this literature, training data coverage remains important, but outcomes are also shaped by tokenization, prompt language, translated benchmark design, culturally grounded supervision, modality, and who authors or validates evaluation data. We argue that culturally grounded NLP should move beyond treating languages as isolated rows in benchmark tables and instead model communicative ecologies: the institutions, scripts, domains, modalities, and communities through which language is used. We propose a layered evaluation and reporting agenda centered on representation audits, mixed elicitation, ecological validity, community validation, adaptation provenance, within-language variation, and maintenance of living cultural resources.
TabletCraft: Bridging a 4,000-Year Cultural Gap with Bidirectional Akkadian NMT and Cuneiform Rendering
Zhaohui Geoffrey Wang
Half a million cuneiform clay tablets survive in museums worldwide, yet modern humans can neither read nor write in the world’s oldest writing system, creating a 4,000-year cultural barrier that existing NLP tools have only partially addressed. Prior work enables one-way, scholar-oriented translation from Akkadian to English, but offers no path in the reverse direction: ordinary people cannot express their own thoughts in cuneiform, and thus remain passive consumers of ancient culture rather than active participants. We present TabletCraft, the first open-source system that enables bidirectional interaction with Mesopotamian writing. Users can read ancient tablets (Akkadian to English) and write their own messages as cuneiform clay tablets (English to Akkadian to cuneiform to rendered tablet). The system integrates a ByT5-based translation model trained on 116K bidirectional samples, a cuneiform sign converter with 14,240 mappings (95.3% coverage), and a visual tablet renderer, packaged as a pip-installable toolkit with both a command-line interface and a web demo.
Lost in Translation? How Language Shapes Responsibility Attribution in Large Language Models
Pavithra P M Nair | Gilad Gressel | Krishnashree Achuthan
Large language models (LLMs) are increasingly deployed in multilingual settings, yet little is known about whether their moral and social judgments remain consistent across languages. In particular, when faced with moral and social dilemmas, LLMs must often implicitly or explicitly assign responsibility — to an individual, to broader social forces, or across multiple parties — a process known as responsibility attribution. This study investigates whether responsibility attributions vary across languages, whether any observed variation persists across thematic domains, and whether the degree of variation differs across LLMs. We evaluate three models (GPT-5.2, Gemini-2.5-Pro, and LLaMA-3.3-70B) across 12 scenarios spanning six thematic domains (marriage, career, authority, gender, elder care, and family). Each model was prompted to attribute responsibility for each scenario by selecting from four options: the primary individual, a secondary interpersonal actor, a broader societal factor, or distributed responsibility shared across multiple parties. Results reveal a significant overall association between language and responsibility attribution (Cramér’s V = 0.24) that persists within every thematic domain (V = 0.26–0.53). The magnitude of cross-language variation is strongly model-dependent: GPT-5.2 and Gemini-2.5-Pro show modest shifts (V ≈ 0.19), while LLaMA-3.3-70B exhibits substantially stronger divergence (V = 0.52). These findings suggest that normative consistency across languages cannot be assumed and should be treated as a distinct dimension of model evaluation.
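Cramér's V, the effect size reported here, is derived from the chi-square statistic of a contingency table as V = sqrt(χ² / (n · min(r − 1, c − 1))). The sketch below computes it for a language-by-attribution table. The counts are made up for illustration, the function name `cramers_v` is hypothetical, and the code assumes every row and column total is non-zero.

```python
import math


def cramers_v(table):
    """Cramér's V for an r x c contingency table given as a list of rows.

    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    min_dim = min(len(table) - 1, len(table[0]) - 1)
    return math.sqrt(chi2 / (n * min_dim))


# Hypothetical counts: rows are languages, columns are attribution options.
no_association = cramers_v([[10, 10], [10, 10]])
full_association = cramers_v([[20, 0], [0, 20]])
```

V ranges from 0 (responses are distributed identically in every language) to 1 (the language fully determines the response), so the reported values between roughly 0.19 and 0.52 fall between those two extremes.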
Ontology-oriented lexico-semantic modeling and neural classification of Chinese chéngyǔ: A culture-aware NLP approach
Lian Chen
This paper proposes a semi-automatic lexico-semantic modeling framework for Chinese chéngyǔ containing body-part and animal lexemes. The framework combines manual semantic annotation, lightweight RDF/OWL formalization and semantic classification in order to investigate whether lexical mediators such as 心 xīn “heart/mind”, 口 kǒu “mouth” or 马 mǎ “horse” are sufficient to predict idiomatic semantic interpretation. Based on 440 annotated chéngyǔ normalized into 18 semantic categories, we compare three classification approaches: a rule-based keyword baseline, character n-gram TF-IDF with logistic regression, and BERT-base-chinese. The results show that lexical mediators cannot be directly equated with semantic categories and that TF-IDF achieves the best overall performance, suggesting that lightweight character-level representations remain robust for very short idioms in low-resource settings. The study contributes an interpretable RDF/OWL-compatible resource for culture-aware modeling of Chinese idioms.
"Sorry, Can’t Help You": How Large Language Models Judge Failures to Help Across Languages
Pavithra P M Nair | Gilad Gressel | Krishnashree Achuthan
Cross-cultural psychology has shown that moral judgments about failures to help vary systematically across cultures. In a landmark study, Miller, Bersoff, and Harwood (1990) found that while Indian and American participants agreed that failures to help are undesirable, they differed in whether they considered helping a moral obligation subject to social sanction or a personal decision. We adapt Miller et al.’s paradigm, which comprises nine scenarios crossing need severity (life-threatening, moderate, minor) with role relationship (parent, friend, stranger) along with their original probe questions, to a cross-lingual LLM setting, presenting them to four LLMs (GPT-5.4, Claude-Opus-4.6, DeepSeek-V3.1, Qwen3-235B) across ten languages. We find that language significantly shapes how LLMs categorize failures to help as moral violations, social conventions, personal-moral concerns, or personal decisions (χ²(27) = 116.14, p < .001, Cramér’s V = 0.147). Models agree across languages that failures to help are undesirable, but diverge substantially in how they classify them, with the primary divergence falling between moral violations and personal decisions. The proportion of responses classifying failures as moral violations decreases as need severity decreases and the role relationship becomes more distant. Cross-lingual variation differs substantially across models, with open-weight models showing significantly stronger variation than closed-weight models. These findings indicate that users consulting LLMs in different languages may receive substantively different moral guidance, underscoring the need for cross-lingual normative auditing as a component of multilingual LLM evaluation.
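For readers unfamiliar with the effect size reported above, Cramér's V can be computed from a contingency table of language by response category. The following is a minimal sketch, not the authors' analysis code: the function name `cramers_v` and the toy counts are illustrative assumptions, and the sketch assumes every row and column has a nonzero total.

```python
import math


def cramers_v(table):
    """Cramér's V for an r x c contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-square statistic: sum of (observed - expected)^2 / expected.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    # Normalise by sample size and the smaller of (rows - 1, columns - 1).
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))


# Hypothetical counts: rows are two languages, columns are two response categories.
toy_counts = [[30, 10], [10, 30]]
print(round(cramers_v(toy_counts), 3))
```

V ranges from 0 (no association between language and response category) to 1 (response category fully determined by language), which is what makes values comparable across tables of different sizes.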
Does Reasoning Kill the Joke? Long-Context Humor Understanding in Hindi
Kaveri Anuranjana | Navya Shrivastava | Atharv Johar | Rishabh Sabharwal | Gautam Ranka | Aryan Lunawat | Punit Rathore | Radhika Mamidi
Verbal humor involves reasoning through complex conversational contexts. Although LLMs have achieved strong performance on English humor datasets, their ability to interpret humor in Hindi remains unexplored. In this paper, we evaluate LLMs on Hindi humor using dialogues extracted from humorous video clips. We use a pipeline that transforms video content into detailed textual streams, including dialogue transcripts and scene descriptions, allowing reasoning over inputs exceeding 2,000 words. We test various LLMs, from efficient edge models (Qwen-2.5-7B, Qwen-3-7B, Gemma-3-27B) to Indic-focused models (Sarvam-M-24B) and large frontier models (Llama-3.1-70B, Gemini-2.0-Flash). Our findings show a concave performance pattern in long-context understanding, with reasoning quality peaking at moderate lengths (250–750 words) and declining at higher context lengths. We also show that standard metrics overstate pragmatic competence. While increasing model size generally improves performance, we also observe distinct failures in smaller LLMs due to instructional and linguistic issues, necessitating diversity metrics to capture hallucinations. Smaller, Hindi-focused models can compete with much larger generalist models. Importantly, our evaluation reveals that conversational humor is a challenge for even specialized models, making HinS a valuable benchmark for advancing research in Hindi Long-Context Humor Reasoning.
One Style Fits All? Cultural Values Embedded in Conversational AI via a People-Pleasing Lens
Yi-Jun Chen | I-Tsen Hsieh | Li-Wun Chang
Conversational AI systems trained on large-scale web corpora inevitably encode the cultural values and interactional norms embedded in their training data, yet our understanding of how deployed LLMs reflect or reinforce culture-specific social expectations remains limited. This study examined how supportive versus challenging chatbot interaction styles shape user experience and continuance intention, and whether people-pleasing tendency (PPT) moderates these effects across cultures. Taiwanese (N = 49) and Korean (N = 52) participants completed a collaborative tourism-planning task. Results showed that: (1) supportive chatbots consistently led to higher continuance intention, satisfaction, and trust; (2) PPT did not moderate these effects; and (3) cultural variation emerged only in perceived threat, where higher PPT was associated with greater baseline threat in the Taiwanese but not the Korean sample. These findings reveal how a general-purpose LLM style may differentially activate culturally situated social scripts, raising implications for culturally inclusive conversational AI design.
Beyond Monolithic Culture: Evaluating Understandability of Online Text Across Cultural Dimensions
Saurabh Kumar Pandey | Harshit Gupta | Sougata Saha | Monojit Choudhury
Culture shapes how people interpret language, especially in online reviews containing culture-specific items (CSIs). Yet, most existing evaluations treat culture as a monolithic construct, offering no insight into which cultural dimensions pose difficulty for readers, or how large language models (LLMs), which power AI reading assistants, perform across them. This gap limits our ability to obtain reliable, cross-cultural estimates of model performance. To address this, we analyze CSIs in English Goodreads reviews across Newmark’s cultural dimensions (e.g., material, ecology, customs, habits, social) and evaluate six LLMs of varying sizes on their ability to identify CSIs within each dimension. We find that readers struggle most with CSIs from the material, customs, and social dimensions, while models underperform on more localized ones (e.g., habits), revealing systematic cultural blind spots. To support further research on culturally representative benchmarking, we release an expert-annotated dataset of CSIs labeled by cultural dimension. Empirical analysis shows our dataset as more challenging and of higher quality than existing cultural benchmarks, enabling finer-grained evaluation of cultural understanding in models.
Proceedings of the 1st Workshop on Computational Developmental Linguistics (CDL)
Martin Ziqiao Ma | Emmy Liu | Jing Liu | Tyler A. Chang | Abdellah Fourtassi | Alex Warstadt | Michael Hahn | Weiwei Sun | Freda Shi
Linguistics Theory Meets LLM: Code-Switched Text Generation via Equivalence Constrained Large Language Models
Garry Kuwanto | Chaitanya Agarwal | Genta Indra Winata | Derry Tanti Wijaya
Code-switching is a common practice for millions of multilingual speakers but remains challenging for Large Language Models (LLMs). This paper investigates LLM capabilities in generating code-switched text, conducting extensive experiments across five diverse language pairs: English paired with Hindi, Tamil, Malayalam, and Indonesian, as well as Indonesian-Javanese. Our analysis, grounded in comprehensive human evaluations by native speakers, uncovers a directional asymmetry: LLMs consistently produce higher-quality (more accurate and fluent) code-switched text when prompted with a lower-resource language (e.g., Hindi, Tamil, Javanese) as the source, compared to when a higher-resource language (English, Indonesian) serves as the source. This asymmetry mirrors sociolinguistic patterns, particularly the Matrix Language Frame model, suggesting LLMs implicitly learn common code-switching structures from their training data where regional languages often form the grammatical base. Furthermore, we find that explicit linguistic guidance, applied through Equivalence Constraint Theory (ECT) to identify switching points, primarily benefits generation quality only in the less common, higher-resource-source direction where LLMs intrinsically struggle. These findings highlight a crucial interplay between the implicit linguistic knowledge captured by LLMs and the targeted utility of explicit linguistic constraints. We also introduce CSPref, a pairwise preference dataset derived from our human evaluations, to facilitate future research in code-switching generation and evaluation.
Do Structural Priors Help Neural Language Models Learn Grammar? Evidence from Child-Scale Data
Jon-Paul Cacioli
We show that structural grammatical priors produce targeted, linguistically specific effects on grammatical learning: they improve filler-gap dependencies, which require long-distance hierarchical tracking, by 9–13 percentage points beyond structural regularisation alone (d = 2.41–2.82), while damaging locally cued phenomena regardless of whether the grammar is real or random. This phenomenon-specificity, revealed by a random grammar control, suggests the right question is not whether structural priors help, but for which constructions and why. We test this by augmenting BabyBERTa (7.4M parameters) with a differentiable PCFG auxiliary loss derived from Minimalist Grammar, trained on AO-CHILDES (893K sentences of child-directed speech). In a pre-registered study of 190 experimental runs spanning 7 constraint strengths, 3 data scales, 5 random seeds, and 3 independent lexicon permutations, our confirmatory hypotheses about overall accuracy and sample efficiency are falsified. However, a random grammar control (n = 15 runs per condition; three independent lexicon permutations) reveals that linguistically accurate category assignments specifically drive filler-gap gains: real grammar outperforms both a structurally equivalent random grammar and the no-grammar baseline, while both conditions equally damage subject-verb agreement. These results show that structural priors function as targeted interventions rather than global boosters: they help only those constructions, namely long-distance dependencies, whose computational demands align with what phrase-structure representations encode. We release code and pre-registered materials.
Fine-tuned speech representations track spoken language convergence to adult models in infants and children who are deaf/hard-of-hearing
Landon Choy | Ali Sartaz Khan | Sonia Patrizi | Daisy S. Ye | Julianna Gross | Margaret Cychosz
Language development is characterized by a gradual convergence of children’s speech toward adult patterns. Measuring this process has traditionally required detailed transcription and language-specific expertise, limiting scalability across languages and populations. Here, we use fine-tuned speech embeddings to capture this convergence directly from the acoustic signal in longform, child-centered recordings, taken as children go about their daily lives. Using BabyHuBERT, we extracted embeddings from vocalizations of children who are deaf/hard-of-hearing and their female adult caregivers (>925 hrs. observation). Embedding distance between children and caregivers decreased with hearing age, controlling for pitch, indicating, as expected, that children’s speech patterns converge to caregivers over development. This single distance metric likewise related to multiple standardized measures of speech and language, from infancy through preschoolhood. These results suggest a path toward scalable, language-neutral assessment of spoken language development from children’s everyday lives.
Do Language Models Show Structural Priming Across Different Domains?
So Young Lee | Russell Scheinberg | Ameeta Agrawal
We test whether large language models show cross-domain structural priming by asking whether arithmetic expressions influence relative-clause attachment preferences. Experiment 1 examines English and French using materials based on prior psycholinguistic studies, and Experiment 2 extends the test to a larger multilingual dataset. Across both experiments, we find no robust priming effect. Instead, responses largely reflect baseline attachment preferences, which vary across languages and only partially align with human patterns. These findings suggest that, although language models show some structural sensitivity, they provide limited evidence of abstract structural generalization across domains.
Do large language models and humans follow similar learning stages? Assessing GPT-2’s order of Swedish grammar acquisition within the Processability Theory framework
Stella Lundqvist | Murathan Kurfali | Johan Sjons
We investigate whether GPT-2 acquires Swedish grammatical structures in the same implicational order as human second language (L2) learners, as predicted by Processability Theory (PT). We present SwePT – a minimal pair dataset targeting Swedish syntactic and morphological structures that are acquired by human L2 learners at four separate stages of language development – and evaluate the GPT-2 models on SwePT using an acceptability classification task throughout fine-tuning, with different input orders with regard to the grammatical structures identified in the data. We find that the observed acquisition orders correlate across the fine-tuned models, while violating the implicational order sequence as hypothesized by PT. The observed relation between performance on the classification task and frequency distributions of the contrasting features in the minimal pairs suggests that the acquisition order can be explained by unigram and n-gram heuristics. While the adaptation of NLP methodologies into the PT framework requires further conceptual and methodological refinement, we do not find evidence for PT-like grammatical development in our experiments.
On the Learnability of Syntax from Raw Speech with Autoregressive Predictive Coding
Shunsuke Kando | Yusuke Miyao
Children are known to generalize syntactic knowledge at ages when their linguistic input is predominantly raw speech rather than text. This raises the question of whether syntactic generalization can emerge directly from acoustic input. We address this question using Autoregressive Predictive Coding (APC), a simple prediction-based self-supervised speech model. To approximate the input available to human learners while enabling controlled comparison, we train models on both child-directed speech and audiobook speech. We evaluate the models on a minimal-pair benchmark targeting elementary syntactic phenomena, designed to be acquisition-friendly. Our results show that APC partially generalizes word-order regularities when trained to predict near-future frames. However, the model fails to generalize agreement phenomena, suggesting that predictive learning from acoustic signals alone is insufficient. Furthermore, we observe distinct learning dynamics across word-order phenomena, suggesting that some improvements may be driven by shallow statistical regularities rather than genuine syntactic generalization.
Modeling Writing Development as Coordinated Change Across Linguistic and Semantic Dimensions
Michelle Banawan | Andrew Potter | Tracy Arner | Danielle S McNamara
Writing development is often assessed through aggregate improvements in surface-level features, yet less attention has been given to how multiple linguistic dimensions evolve jointly over time. We model writing development as a multidimensional system shaped by stable individual variation and instructional progression across staged assignments, using interpretable linguistic features from the Writing Analytics Toolkit (WAT) and transformer-based sentence embeddings. Variance partitioning reveals substantial between-student stability alongside stage-dependent change. Mixed-effects models identify non-uniform developmental trajectories: academic focus, information density, and conventional language increase, whereas development of ideas and lexical variety decline, indicating tradeoffs across competing dimensions. Cross-lagged analyses further show dynamic dependencies between dimensions, suggesting coordinated change rather than independent progression. Embedding-based analyses capture stage-dependent shifts in semantic representation, with larger changes in earlier stages and increasing stability over time. Although assignment structure contributes to observed variation, stable individual differences and cross-stage dependencies indicate underlying developmental processes that generalize across tasks. Together, these findings characterize writing development as structured change in a multidimensional representational system, highlighting the need for computational models that capture stable variation, non-monotonic trajectories, and interactions among linguistic components.
L1 Influence in L2 Language Models: A Human-centric Approach
Laura Barbenel | Lily Goulder | Aoife O’Driscoll | Suchir Salhan | Catherine Arnett | Andrew Caines | Paula Buttery
Language learners typically exhibit first language (L1) influence in their written second language (L2) production. We investigate whether similar patterns emerge in L2 language models (L2LMs), which are typically assessed on task-based benchmarks rather than on language use. We evaluate the use of Native Language Identification (NLI) as a method for detecting whether L2LMs exhibit human-like L1 influence. Using existing learner corpora and our novel L2 English dataset, we identify the conditions that yield the highest NLI accuracy, and show that text length but not proficiency affects performance. We then apply NLI to L2LM-generated text under various instruction-tuning and prompting conditions. We find that instruction tuning on human learner essays yields high NLI accuracy (~90%) and is necessary for detectable L1 influence. Whilst NLI accuracy is similar for L2LM and human essays, human evaluation shows that LM-generated L1 influence remains distinguishable from human writing.
A Scalable Tool for Measuring Manner and Result Verbs in Developmental Language Research
Divyesh Pratap Singh | Dakshesh Gusain | Federica Bulgarelli | Alison Eisel Hendricks | John Beavers | Nathan M. Beers | Ifeoma Nwogu
Manner and result verbs encode different aspects of event structure and have been discussed in developmental work as a potentially informative distinction for studying early verb learning. However, this distinction remains difficult to measure at scale because large annotated resources for manner and result classification are not currently available. We present a computational approach for identifying manner and result verbs in sentence context. Using linguistically informed prompts, we generate sentence-level annotations with large language models over data drawn from MASC and InterCorp, extending coverage from previously annotated portions of VerbNet to 436 classes. We then train a RoBERTa-based classifier on these annotations and evaluate it on three held-out gold-standard datasets, including previously annotated items and a new expert-annotated set. Across these evaluations, the model shows promising performance, with average accuracy up to 89.6%. We present this work as a scalable measurement tool that can support future research on verb semantics in developmental and other language datasets, while noting that further validation is needed for borderline cases, mixed manner/result verbs, and downstream developmental applications.
Making Synthetic Questions More Child-Directed: Prompting and Sampling Effects
Whitney Poh | Michael Tombolini | Libby Barak
Child-directed Speech (CDS) has been shown to better support language learning as training data for computational models. Artificially generated input aims at replicating the advantage of CDS by re-creating targeted linguistic properties. Recently, the use of questions in CDS has been suggested as a linguistic property that may entail an effective discourse structure for model training. However, previous work has shown inconsistent improvement over baseline using questions in training data. In this study, we propose a new question generation method that aligns both the generation prompts and sampling methods with properties of CDS. We show that prompt wording substantially changes whether synthetic questions match CDS on surface properties such as MLU and question type. Despite marked improvements over baseline, enhanced CDS-likeness does not translate into consistent downstream gains. Overall, our results show that the role of questions in training data is a topic worth looking further into.
Proceedings of the 2nd Workshop on Computational Humor (CHum 2026)
Ori Amir | Christian F. Hempelmann | Julia Rayz | Tiansi Dong | Tristan Miller
One Joke to Rule them All? On the (Im)possibility of Generalizing Humor Detection
Mor Turgeman | Chen Shani | Dafna Shahaf
Humor is a complex form of communication that remains challenging for machines. Despite the breadth of humor, most existing research on computational humor has traditionally focused on modeling one specific type of humor. In this work, we wish to understand whether competence on specific humor tasks confers any ability to transfer to novel, unseen types; in other words, is this fragmentation inevitable? This question is especially timely as new humor types continuously emerge in online contexts (e.g., memes, anti-humor, AI fails). If LLMs are to keep up with this evolving landscape, they must be able to capture deeper, transferable mechanisms. To investigate this, we conduct a series of transfer learning experiments across four datasets, representing different humor tasks. We explore varied diversity settings (training on one to three datasets and testing on a novel one). Experiments show that models are capable of some transfer, reaching up to 75% accuracy on binary unseen datasets; training on diverse sources improves transferability (1.88–4.05%) with minimal-to-no drop in in-domain performance. Somewhat surprisingly, one dataset (Dad Jokes) emerges as the best enabler of transfer, but the hardest one to transfer to. We release data and code.
Timing In stand-up Comedy: Text, Audio, Laughter, Kinesics (TIC-TALK): Pipeline and Database for the Multimodal Study of Comedic Timing
Yaelle Zribi | Florian Cafiero | Vincent Lépinay | Chahan Vidal-Gorène
Stand-up comedy, and humor in general, are often studied through their verbal content. Yet live performance relies just as much on embodied presence and audience feedback. We introduce TIC-TALK, a multimodal resource with 5,400+ temporally aligned topic segments capturing language, gesture, and audience response across 90 professionally filmed stand-up comedy specials (2015–2024). The pipeline combines BERTopic for 60 s thematic segmentation with dense sentence embeddings, Whisper-AT for 0.8 s laughter detection, a fine-tuned YOLOv8-cls shot classifier, and YOLOv8s-pose for raw keypoint extraction at 1 fps. Raw 17-joint skeletal coordinates are retained without prior clustering, enabling the computation of continuous kinematic signals (arm spread, kinetic energy, and trunk lean) that serve as proxies for performance dynamics. All streams are aligned by hierarchical temporal containment without resampling, and each topic segment stores its sentence-BERT embedding for downstream similarity and clustering tasks. As a concrete use case, we study laughter dynamics across 24 thematic topics: kinetic energy negatively predicts audience laughter rate (r = −0.75, N = 24), consistent with a stillness-before-punchline pattern; personal and bodily content elicits more laughter than geopolitical themes; and shot close-up proportion correlates positively with laughter (r = +0.28), consistent with reactive montage.
Arabic humor provides a challenging diagnostic test for large language models because interpreting jokes often requires pragmatic inference, sociolinguistic awareness, and culturally grounded knowledge that standard NLP benchmarks do not evaluate. Arabic is particularly suitable for probing these abilities given its diglossic structure and dialect diversity, where humor frequently arises from register contrast, dialect-specific vocabulary, and shared cultural references. We propose a three-layer taxonomy of Arabic humor mechanisms covering pragmatic, semantic, and sociolinguistic phenomena, illustrated through thirteen curated examples spanning Egyptian, Levantine, Gulf, Tunisian, and Iraqi Arabic. Building on this taxonomy, we introduce a diagnostic evaluation framework using contrastive minimal pairs, a multi-dimensional scoring rubric, and a cultural presupposition ontology. A small proof-of-concept probing study with GPT-4o, Gemini 2.0 Flash, and Claude Sonnet 4.5 reveals recurring failure patterns in sarcasm interpretation, register contrast reasoning, dialectal vocabulary coverage, and cultural grounding. We position this work as a diagnostic framework and pilot, not a mature benchmark, and outline a path toward larger annotated resources.
Cards Against LLMs: Benchmarking Humor Alignment in Large Language Models
Yousra Fettach | Guillaume Bied | Hannu Toivonen | Tijl De Bie
Humor is one of the most culturally embedded and socially significant dimensions of human communication, yet it remains largely unexplored as a dimension of Large Language Model (LLM) alignment. In this study, five frontier language models play the same Cards Against Humanity (CAH) games as human players. The models select the funniest response from a slate of ten candidate cards across 9,894 rounds. While all models exceed the random baseline, alignment with human preference remains modest. More striking is that models agree with each other substantially more often than they agree with humans. We show that this preference is partly explained by systematic position biases and content preferences, raising the question of whether LLM humor judgment reflects genuine preference or structural artifacts of inference and alignment.
The Roast of GPT4o: Experiments in Generating, Detecting and Evaluating Celebrity Roast Comedy
Jens Lemmens | Jérémy Genette | Tony Veale | Walter Daelemans
We present exploratory experiments in the comedic roasting capabilities of GPT4o. Specifically, @ComedyCentral roasts were scraped to design a survey in which participants blindly evaluated snippets of human and AI roasts, and had to predict the author (AI/human) in a second round of reviewing. The results show that there is no significant difference in how the barbs in human- and AI-generated roasts are rated. Further, a qualitative analysis showed that although the model utilizes specific recurrent phrases to imitate the style of human comedians, both generative LLM detectors and humans performed suboptimally in predicting the true author of the roasts.
Phonetic Cues Improve LLM-Based Pun Detection in Short Text
Adith Santosh Thaniserikaran | Govind Harikrishnan
This paper studies joke detection in short text, focusing only on jokes triggered by lexical ambiguity. Following Attardo and Raskin, we treat these jokes as cases where humor arises from a script opposition activated through a logical mechanism such as homography or homophony. Our framework combines contextual semantic analysis for homographs with phoneme-level similarity for homophones and near-homophones, using CMUdict, weighted Levenshtein distance, and prompt-based reasoning to recover ambiguities that are not visible in spelling alone. Results show that explicit phonetic modeling improves detection of sound-based puns.
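To illustrate the kind of phoneme-level similarity the abstract above mentions, here is a minimal sketch of a weighted Levenshtein distance over phoneme sequences. It is not the authors' implementation: the cost values, the `VOICING_PAIRS` set, and the function names are hypothetical, and the ARPAbet-style transcriptions are written by hand (stress markers omitted) rather than looked up in CMUdict.

```python
# Hypothetical set of consonant pairs that differ only in voicing;
# substituting within a pair is treated as cheap.
VOICING_PAIRS = {
    frozenset(pair)
    for pair in [("P", "B"), ("T", "D"), ("K", "G"), ("F", "V"), ("S", "Z")]
}


def substitution_cost(x, y):
    """Illustrative costs: 0 for identical phonemes, 0.3 for a voicing pair, else 1."""
    if x == y:
        return 0.0
    if frozenset((x, y)) in VOICING_PAIRS:
        return 0.3
    return 1.0


def weighted_levenshtein(a, b, sub_cost=substitution_cost, indel_cost=1.0):
    """Edit distance between phoneme sequences a and b with weighted substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * indel_cost]
        for j in range(1, len(b) + 1):
            curr.append(min(
                prev[j] + indel_cost,                        # delete a[i-1]
                curr[j - 1] + indel_cost,                    # insert b[j-1]
                prev[j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitute or match
            ))
        prev = curr
    return prev[-1]


# Hand-written transcriptions for illustration only.
pear = ["P", "EH", "R"]
bear = ["B", "EH", "R"]
bare = ["B", "EH", "R"]
print(weighted_levenshtein(bear, bare))  # identical sequences: exact homophones
print(weighted_levenshtein(pear, bear))  # one cheap substitution: near-homophones
```

A small distance between two differently spelled words flags a candidate sound-based ambiguity; any threshold for calling two words near-homophones would have to be tuned on data.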
Does Bigger Mean Funnier? Evaluating Humor Generation Across the Qwen3 Model Family
Jatin Agrawal | Radhika Mamidi
We investigate whether scaling model parameters improves humor generation through a controlled ablation study. Using five Qwen3 variants (8B–235B, dense and MoE), we generate jokes across 50 themes. Beyond evaluating humor scaling, this work serves as an empirical study into the nature of LLM versus human evaluations on highly subjective creative tasks. While an automated judge yields a perfect monotonic ranking between parameter count and win rate, human annotators find no significant aggregate difference in humor quality. Restricting to themes where annotators agree reveals a significant preference for the largest model (p = 0.039), suggesting scaling effects exist but are masked by a "quality floor." Crucially, our analysis of bias characteristics shows that the automated judge exhibits severe positional and length biases compared to human evaluators, further suggesting that LLMs may systematically distort quality differences on subjective tasks.
Navigating the Joke Space: Towards Automated Originality Assessment of AI-Generated Humor
Ori Amir | Huyen Ngo | Joe Toplyn | Kevin Hickerson
This study validates automated, corpus-based methods for quantifying joke originality using “topic handles” — key nouns or noun phrases capturing a joke’s script opposition and logical mechanism (per the General Theory of Verbal Humor). Using a reference corpus of one million jokes in English from Reddit, we compute Pointwise Mutual Information (PMI) in three variants (raw co-occurrence, semantic-cluster smoothing, and word-decomposition) and two embedding-based measures (handle-level conceptual distance and full-text corpus novelty via Sentence-BERT). We evaluate these measures on 400 LLM-generated jokes (200 each from GPT-4o and GPT-5.4) and 80 jokes from the Witscript-powered JEST benchmark, rated by three professional comedians for originality and funniness. Corpus novelty and concept distance between the most semantically distant handle pair both correlated significantly with human originality ratings (𝜌 = .37); PMI-based measures showed weaker but significant associations (𝜌 = .23–.25) on the most original handle pair. A Lasso-based composite of the three strongest predictors achieved 𝜌 = .40 (cross-validated), capturing 82% of the theoretically predictable variance given inter-rater agreement. These results demonstrate that handle-based PMI and semantic novelty metrics offer practical, quantitative tools for assessing originality in AI-generated humor, advancing objective evaluation of computational creativity.
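As a rough illustration of the raw co-occurrence PMI variant named in the abstract above, the sketch below scores pairs of topic handles by how much more often they appear in the same joke than chance would predict. It is an assumption-laden toy, not the authors' pipeline: the function name `pmi_table`, the joke-level co-occurrence definition, and the four-joke corpus are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(jokes):
    """PMI (base 2) for every handle pair that co-occurs in at least one joke.

    `jokes` is a list of handle lists, one list per joke in the reference corpus.
    """
    n = len(jokes)
    single = Counter()  # number of jokes containing each handle
    pair = Counter()    # number of jokes containing both handles of a pair
    for handles in jokes:
        unique = sorted(set(handles))
        single.update(unique)
        pair.update(combinations(unique, 2))

    # PMI(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with probabilities
    # estimated as the fraction of jokes containing the handle(s).
    return {
        (x, y): math.log2(count * n / (single[x] * single[y]))
        for (x, y), count in pair.items()
    }


# Hypothetical reference corpus of four jokes, already reduced to handles.
corpus = [
    ["cat", "dog"],
    ["cat", "dog"],
    ["cat", "lawyer"],
    ["fish", "lawyer"],
]
for handle_pair, score in sorted(pmi_table(corpus).items()):
    print(handle_pair, round(score, 3))
```

Pairs that never co-occur in the reference corpus get no score here; handling them would need smoothing of some kind, which this sketch leaves out.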
up
Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2026)
Aya Zirikly | Kfir Bar | Sean MacAvaney | Molly Ireland | Yaakov Ophir | Dana Atzil-Slonim | Vasudha Varadarajan | Steven Bedrick | Bart Desmet
"How’d You Type That So Fast?" A Descriptive Analysis of Counselor Message Text Reuse in Text-Based Crisis Counseling
Stevi Gligorovic | Jens Kristian Schou | Zac Imel | Brent Kious
Suicide is a major public health concern, underscoring the importance of understanding communication practices used in crisis intervention. Text-based crisis services are increasingly used, yet little is known about how counselors construct messages across encounters. One understudied feature of this setting is counselor text reuse, or the repeated use of identical or highly similar message content across different clients. Although reuse may support efficiency and consistency, it may raise questions about how personalised responses are across counselors. This study provides a descriptive analysis of counselor text reuse in a large dataset of 4.7 million messages of real-time text-based crisis counseling conversations. Across 136 counselors, mean message similarity was very low, indicating little overall text reuse for most counselors. However, 103 counselors showed at least one instance of detected reuse, and a smaller subset demonstrated more consistent reuse. Reuse was also positively associated with counselor encounter volume across measures of reuse. Frequently reused longer passages primarily involved structured coping-oriented or psychoeducational content, such as coping strategies, grounding exercises, self-care tips, and relaxation techniques. The findings suggest that counselor text reuse increased with encounter volume but that average levels of reuse were low across counselors, and they provide a foundation for future work examining associations with service delivery and client outcomes.
Automated Detection and Classification of Delusion-related Content in Naturalistic Audio Diaries Using Multi-Agent Language Models
Feng Chen | Justin Tauscher | Changye Li | Meliha Yetisgen | Alex Cohen | Adam Kuczynski | Angelina Tsai | Benjamin Buck | Dror Ben-Zeev | Trevor Cohen
Speech monologues recorded in naturalistic settings provide opportunities to characterize mental illness phenomenology and detect symptom exacerbation. Large language models (LLMs) offer new possibilities for automating this process, as they require annotated data primarily for evaluation rather than training. In this paper, we present a novel automated, multi-agent LLM pipeline for the fine-grained, multi-label extraction of language suggestive of delusional beliefs, associated affective responses, and behavioral responses from transcripts of naturalistic audio diaries collected from people with moderate persecutory ideation. Evaluating an ensemble of three foundation models, we demonstrate that detailed diagnostic prompt instructions successfully reduce false positives for delusional theme classification, but also constrain the interpretation of affective or behavioral responses. Furthermore, comparing multi-agent adjudication frameworks reveals a critical divergence from standard NLP benchmarks: complex conversational debate between agents diminishes accuracy on clinically ambiguous text by inducing premature consensus. Instead, majority voting establishes robust performance (Micro F1 of 0.872 and 0.779 for delusion detection and classification respectively). This work provides a validated and scalable pipeline for the automated detection and characterization of content suggesting delusional beliefs in naturalistic speech.
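The majority-voting adjudication described above can be sketched for the multi-label case as follows. This is an illustrative sketch only: it assumes each model returns a set of labels per transcript, and the function name `majority_vote` and the label strings are hypothetical rather than the authors' schema.

```python
from collections import Counter


def majority_vote(predictions):
    """Keep every label assigned by more than half of the models.

    predictions is a list with one collection of labels per model.
    """
    # Count each label at most once per model.
    votes = Counter(label for labels in predictions for label in set(labels))
    threshold = len(predictions) / 2
    return {label for label, count in votes.items() if count > threshold}


# Three hypothetical model outputs for one diary transcript.
model_outputs = [
    {"persecutory", "fear"},
    {"persecutory"},
    {"fear", "avoidance"},
]
adjudicated = majority_vote(model_outputs)
```

With three models, a label survives only if at least two of them assign it, so "avoidance" is dropped here. No exchange between agents is involved, which is the property that separates this scheme from the debate-style adjudication the abstract compares it with.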
Clinical Prompt Engineering: Encoding Clinical Knowledge into AI Training Simulations - A Crisis Deployment Case Study
Yuval Holzman | Eshkol Rafaeli | Zohar Elyoseph | Yuval Haber | Karen Yirmiya | Omer Linkovski | Tal Elyoseph | Elad Refoua
When large language models simulate patients or clients, they tend to produce cooperative dialogue, premature emotional insight, and rapid resolution. These defaults undermine clinical training, where the pedagogical value lies in sustained difficulty. We describe Clinical Prompt Engineering (CPE), a methodology developed by a multidisciplinary team of clinician-researchers and prompt engineering experts within the [ProjectName] project. CPE encodes clinical knowledge directly into prompt design: each simulated character is constructed through layered psychological profiles, explicit contingency rules linking interactional events to internal states, and enforced non-linear emotional trajectories that resist the model’s pull toward resolution. The methodology has been applied across several clinical training simulations involving over 300 participants in formal studies and iterative pilot phases. Each simulated character is embedded within a multi-agent training environment that provides real-time reflective guidance during the interaction and structured, clinically informed feedback afterward. We illustrate the approach through Talking with Lia, a Hebrew-language simulation in which parents practice responding to a seven-year-old child during repeated missile alerts and forced sheltering. The simulation was deployed within the first week of an acute security crisis in Israel in Winter 2026. Of 132 sessions initiated organically through professional networks, 42 were completed; qualitative feedback emphasized the simulation’s difficulty as pedagogically meaningful. Because CPE operates at the level of prompt design, it can be developed by clinician-researcher teams and adapted to new populations, developmental stages, and crisis contexts, potentially extending access to expert-informed training beyond the settings where such expertise is typically available. 
Where much computational work in clinical psychology has focused on classifying mental health states from text, CPE addresses a complementary task: whether clinicians can respond effectively to those states as they shift in real time. The next step is testing whether the skills practiced in simulation transfer to real interactions.
Discriminant Validity: Disentangling Health and Emotional Constructs from Language-Based Assessments
Scott Feltman | Adithya V Ganesan | Whitney Ringwald | H. Andrew Schwartz | Roman Kotov | Benjamin Luft | Ryan Boyd | Oscar Kjell
Language-based assessments have demonstrated high convergent validity with corresponding mental and physical health constructs, but they often fail to address discriminant validity: the measure’s ability to distinguish the target construct from related ones. This is a common phenomenon within the domain of mental health, as well as comorbidity with physical health conditions. Identifying key features of individual dimensions of mental and physical health present in language can unlock new avenues of research for natural language processing and psychology. We propose two augmentations to the objective function of the Ridge model, deriving closed-form solutions compatible with Singular Value Decomposition-based solvers, to enforce discriminant validity of off-target constructs using Mean Squared Error (MSE) and Squared Cosine Similarity (SCS), both having widespread use in contrastive learning. By varying the discrimination strength, we find that a decrease of 0.005 Pearson correlation points can result in an increase upwards of 0.132 Pearson correlation points in discriminant validity for mental and physical health constructs derived from self-reported questionnaires. We see similar improvements across multiple fundamental psychopathology dimensions simultaneously, increasing discriminant validity by 0.012, with stronger increases coming from noisier, less reliable constructs. Our contributions provide a theoretically grounded path towards improving confidence in language-based assessments in the clinical sector, improving the specificity of these assessments to various areas of health.
Multistream Modelling for Mental Health: Modelling Linguistic and Temporal Contexts with Mutual and Self-Excitation in Social Media
Anthony Hills | Talia Tseriotou | Mahmud Akhter | Junyu Mao | Iqra Ali | Xenia Miscouridou | Maria Liakata
We present MHRoBERT (Multistream HEAT over Recurrence over BERT), a hierarchical transformer architecture for longitudinal mental health monitoring that models self- and mutual excitation patterns in linguistic and temporal data across multivariate event streams relating to an individual’s mental health. To supply the model with complementary perspectives on each post, we apply a Large Language Model (LLM) based annotation to extract three streams from social media posts: emotional states, personal life events, and mental health symptoms. A central finding is that multi-task learning with these automatically-generated stream labels provides substantial, consistent improvements across all model architectures evaluated. Multistream information further consistently benefits simpler models not explicitly designed to exploit it: LLM baselines incorporating stream annotations improve macro F1 by 12.6% over text-only prompting. These results have direct implications for the CLPsych Shared Task on Moments of Change detection: multistream auxiliary supervision yields consistent, substantial gains regardless of architecture, suggesting it is a simple and portable strategy that future systems can readily adopt with minimal architectural changes. MHRoBERT additionally produces interpretable learned parameters across streams, revealing temporal interaction patterns between mental health indicators.
On the Role of Context in LLM Alignment to Mental Health Counseling Competencies
Sadiya Sayara Chowdhury Puspo | Marcos Zampieri | Özlem Uzuner
As Large Language Models (LLMs) demonstrate strong performance on clinical benchmarks, it remains unclear whether this reflects true patient-specific reasoning or reliance on generalized symptom patterns. To address this gap, we evaluate LLMs on a counseling competency benchmark to assess their use of patient-specific contextual information. Through controlled experiments involving ablations, role framing, Thread-of-Thought (ThoT) prompting, and input perturbations, we find that removing contextual details results in only modest performance drops, and predictions remain stable under input variations, indicating limited sensitivity to context. Although structured prompting increases explicit mention of patient details, it does not improve answer accuracy. Error analysis reveals systematic patterns where models favor general clinical associations over context-specific cues, even when such cues are correctly identified during intermediate reasoning. Our findings suggest that achieving passing-level performance does not guarantee context-sensitive decision-making, revealing an important gap between apparent clinical competence and actual contextual reasoning. This indicates the need for evaluation frameworks that directly test context integration in mental health applications.
The Reliability Illusion in Synthetic Patients: Psychometric Misalignment of Open-weight LLMs on PHQ-9 and GAD-7
Qian Shen | Yu Han
Globally, the incidence of depression and anxiety continues to rise, and the importance of mental health assessment scales as diagnostic tools has grown accordingly. Researchers are increasingly employing generative AI to produce large volumes of items and entire scales, which in turn elevates the costs of validating their reliability and validity. In this study, we used four open-weight LLMs to complete the GAD-7 and PHQ-9, varying prompts, sampling temperature, and dynamic contextual scenarios to emulate realistic human response patterns. Using multi-group confirmatory factor analysis, differential item functioning analyses, and other psychometric methods, we evaluate the factor structure of LLM-generated responses and assess measurement invariance relative to human responses. Our findings reveal a critical paradox: although open-weight LLMs exhibit exceptionally high internal consistency, they demonstrate severe structural mismatch and fail to achieve scalar measurement invariance against human baselines. Furthermore, pervasive differential item functioning and extreme prompt fragility indicate that these models rely on superficial, stereotype-driven semantic matching rather than simulating stable latent psychological dynamics.
Automatic Annotation of Mental Health Recovery Narratives: A Benchmark Study
Shrankhla Pandey | Graham Murray | Ben Laws | Stefan Rennick-Egglestone | Mike Slade | Sarah Morgan
Manual annotation of mental health recovery narratives is slow and emotionally demanding, which limits the scalability of the digital mental health resource. A framework exists to characterise such narratives, called INCRESE, but there are currently no methods to automatically annotate the characteristics defined in INCRESE. We benchmarked the ability of support vector classifiers to annotate INCRESE characteristics when trained with three families of text representations: bag of words, GloVe static embeddings, and BERT contextual embeddings, using a dataset of 355 mental health recovery narratives. Characteristics related to diagnosis and turning points achieved a balanced accuracy greater than 0.67. Characteristics related to content warnings achieved a balanced accuracy of 0.72 but showed poor recall, which may be harmful for readers because it could lead to unsolicited exposure to sensitive content such as abuse or sexual violence. The lived-experience advisors endorsed the project objectives and addressed challenges of characteristic prioritization, adding insights not visible from quantitative metrics alone.
Before the Labels: How Dataset Construction Shapes Suicidality Detection in Clinical Text
Priyanshi Garg | Ishita Rao | Jieqiong Ding | Amandalynne Paullada
Clinical NLP increasingly relies on electronic health record (EHR) data to detect suicidal behaviors, treating clinical documentation as more reliable ground truth than social media. We argue that this framing obscures how EHR-based suicidality datasets encode a particular operationalization of suicidality, shaped by who authors the data, how episodes are bounded, and how ambiguity is resolved. We ground this argument in a case study of the ScAN dataset, built over MIMIC-III clinical notes. We show how governance constraints, ICD-based cohort selection, single-annotator labeling, and hospital-stay-level aggregation produce labels that foreground clinician judgment, treat suicidality as a bounded episode, and assume that intent can be reliably inferred from documentation. A linguistic analysis demonstrates that identical labels subsume heterogeneous clinical framings differing in temporality, negation, and uncertainty, and that labeling patterns differ across insurance status. We argue the clinical NLP community should examine the assumptions embedded in suicidality datasets before interpreting their labels as ground truth.
Culture by Design: A Sociotechnical Framework for Culturally Grounded AI for Mental Health
Sunny Rai | Elizabeth Stade | Graise Zhou | Neil Sehgal | Simone Schriger | Sara Gerke | Lyle Ungar | Sharath Chandra Guntuku
AI systems for mental health are developed predominantly using data from Western, Educated, Industrialized, Rich, and Democratic (WEIRD) populations, raising concerns about their validity, fairness, and generalizability across geo-cultural contexts. This limitation is especially consequential in mental health, where linguistic expression, symptom presentation, help-seeking behavior, and access to care vary substantially across populations. We argue that culturally responsive AI mental health systems require explicit attention to culture throughout the development lifecycle, from data collection to training and deployment. We present a sociotechnical framework for developing culturally responsive AI mental health applications to provide AI researchers and practitioners with an actionable roadmap for building more equitable, reliable, and contextually appropriate mental health technologies.
Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language
Yunkai Xu | Saeed Abdullah
AI and large language models (LLMs) have emerged as promising tools to address global mental health challenges. Despite the global nature of these challenges, there remains a critical shortage of high-quality datasets for training and evaluating such systems. To mitigate this gap, researchers increasingly generate synthetic clinical personas to simulate user data and test digital mental health support systems. However, most validated personas rely on English-centric contexts. This paper investigates whether similar persona-based methods can be used to generate multilingual mental health datasets. We modified nationality and language parameters in personas to generate clinical dialogues in Mandarin, Bengali, and Hindi. We then examined how different LLMs perform when evaluating the depression severity of these generated multilingual datasets against the baseline in English. Our findings indicate that just adding nationality and language parameters in personas might not be adequate, as it can introduce clinical inconsistency across languages. LLM judge models often exhibit inaccuracies in assessing depression severity in non-English texts, with performance varying across different models. This exposes the systemic limitations of applying English-centric personas to multilingual contexts. Ultimately, our work highlights the urgent need for culturally responsive data generation to ensure equitable mental health systems globally.
Designing Structured Conversational Support for Tuberculosis Treatment Adherence and Patient Coping
Priyanshi Garg | Sarah Iribarren | Sikha Pentyala | Yvette Rodriguez | Priscilla Carmiol-Rodriguez | Alfie Vidrio | Charles Kwanin | Jennifer Sprecher | Javier Roberti
Tuberculosis (TB) remains a major global health challenge, and treatment adherence continues to be difficult despite the availability of effective medication. While Digital Adherence Technologies (DATs) have improved monitoring and care coordination, prior deployments highlight unmet needs for timely, personalized, and emotionally supportive communication outside clinical settings. We develop and iteratively refine a Spanish-language TB treatment-support chatbot through multiple rounds of internal expert evaluation. The system separates three core functions: (i) TB information support grounded in curated resources, (ii) coping-oriented support inspired by Dialectical Behavior Therapy (DBT), and (iii) safety-critical crisis handling via a deterministic, non-generative pathway. These components are implemented within a routed architecture with shared conversational state. Iterative evaluation identified recurring failure modes in unstructured conversational systems, including weak grounding, poor multi-turn continuity, and inconsistent safety behavior. Addressing these issues motivated explicit routing, state tracking, and task-specific prompting. Our findings suggest that in clinical support settings, reliable conversational behavior depends on structured interaction design and explicit control over routing, memory, and safety, rather than on model capability alone.
Enhancing Mental Health Counseling Support in Bangladesh using Culturally-grounded Knowledge
Md Arid Hasan | Azhagu Meena Sp | Aditya Khan | Abu Bhuiyan | Helal Ahmed | Joysree Debi | Farig Sadeque | Annie Lee | Syed Ishtiaque Ahmed
Large language models (LLMs) show promise in generating supportive responses for mental health and counseling applications. However, their responses often lack cultural sensitivity, contextual grounding, and clinically appropriate guidance. This work addresses the gap of how to systematically incorporate domain-specific, clinically validated knowledge into LLMs to improve counseling quality. We utilize and compare two approaches, retrieval-augmented generation (RAG) and a knowledge graph (KG)–based method, designed to support para-counselors. Our KG is constructed manually and clinically validated, capturing causal relationships between stressors, interventions, and outcomes, with contributions from people across multiple disciplines. We evaluated multiple LLMs in both settings using BERTScore F1 and SBERT cosine similarity, as well as human evaluation across five metrics, which are designed to directly measure the effectiveness of counseling beyond surface-level similarity. The results show that KG-based approaches consistently improve contextual relevance, clinical appropriateness, and practical usability compared to RAG alone, demonstrating that structured, expert-validated knowledge plays a critical role in addressing LLMs' limitations in counseling tasks.
Evaluating Document-Tuned Transformer Representations for Person-level Mental Health Assessment
Aaron Marker | Oscar Kjell | Vasudha Varadarajan | H. Andrew Schwartz
Person-level psychological assessment requires aggregating meaning across many messages from the same individual, a task that document-level training objectives were not explicitly designed for. We present a systematic, empirical comparison between architecturally matched traditional (a) base-transformers and (b) document-tuned-transformers (further contrastively fine-tuned at the document-level, sometimes referred to as "sentence transformers") under otherwise identical conditions. Comparing layer-wise and overall performance across two longitudinal mental health and psychological datasets, we find document-tuned models demonstrated a consistent improvement over base representations (increase in Pearson r of 13.4%, p=.015). Robustness analyses revealed document-tuned models remained more accurate under perturbations including word deletion, synonym replacement, typo injection, and back translation. Further, hedged language (e.g., ‘usually’) was more characteristic of outcomes in document-tuned embeddings while abundance (e.g., ‘lot’) was more characteristic of base-transformers, suggesting document-tuned models may better capture uncertainty. These results suggest representation choice impacts mental health prediction, with document-tuned models often being more adept.
Exploration of Perceptual Speech Features for Clinical Decision-Support in Mental Health Care
Vassilis Lyberatos | Edmund Dervakos | Eleni Adamidi | Athanasios Voulodimos | Giorgos Stamou
Speech and language technologies offer valuable opportunities for supporting mental health assessment through objective and interpretable cues. We present a systematic feature-based analysis framework leveraging perceptually grounded acoustic and linguistic characteristics, including prosody, vocal quality, semantic coherence, syntactic structure, and sarcasm. Using statistical analysis and interpretable machine learning (XGBoost with SHAP and LIME), we examine associations between speech features and validated symptom measures of depression, anxiety, and ADHD. Evaluated on both controlled benchmark datasets (StressID, DAIC-WOZ, Androids, EATD) and a real-world clinical dataset, the framework reveals stable and consistent relationships between symptom severity and vocal irregularities (e.g., shimmer, jitter), lexical–syntactic patterns, and affective tone. An ablation study conducted across all datasets further identifies the most informative feature groups. This work explores a transparent and clinically interpretable approach to speech-based mental health analysis.
Exploring Profiles of Cognitive Distortions Associated with Mental Health Disorders
Alina Anikejeva | Kairit Sirts
Cognitive distortions, distorted patterns of thinking, have been increasingly studied in computational mental health research. Although they are related to many, if not all, mental health disorders, most existing studies focus primarily on depression. In this work, we explore distortion profiles across multiple mental health conditions. We analyzed a large Reddit-based dataset containing posts from ten self-reported mental health groups as well as a control group using both an n-gram-based method and a fine-tuned transformer model for detecting cognitive distortions. The mental health groups, both when pooled together and when examined individually, show a higher prevalence of cognitive distortions compared to the control group, with the effect sizes ranging from small to moderate. When comparing distortion profiles of different mental health conditions, we observe largely similar patterns, but with some conditions showing an overall higher frequency of distortions than others. These findings suggest that even relatively simple methods can be suitable for exploratory analyses that reveal group-level trends in large-scale mental health text data.
Facet-Informed Prompting for LLM-Based Personality Assessment: Error-Guided Exemplar Selection and Hierarchical Prediction
Rasiq Hussain | Juhi Shah | Joshua Oltmanns | Mehak Gupta
Large language models (LLMs) are increasingly applied to automatic personality assessment, yet most prior work relies on coarse binary labels and direct domain-level predictions, limiting interpretability and ignoring the hierarchical facet structure of personality. In this study, we implement a structured prompting approach with three complementary objectives: direct domain-level prediction, fine-grained facet-level prediction, and domain-level prediction informed by facet outputs. All predictions use a five-level ordinal label scheme, capturing a continuum from very low to very high trait expression. Across all prompt types, we adopt an error-guided self-refinement procedure using in-context learning (ICL) to guide the model toward more accurate predictions. Zero-shot prompts assess baseline performance, while one-shot prompts incorporate a single demonstration example selected through the refinement procedure. Our framework evaluates both domain- and facet-level predictions, enabling examination of how prediction granularity and targeted exemplar selection influence LLM inference. By combining hierarchical domain-facet relationships with structured prompting and refinement, this work aims to provide a systematic approach for interpretable and principled LLM-based personality assessment from long-form life narratives.
From Responses to Trajectories: Modeling the Development of Reflective Listening Skills
Dhruvil Thummar | Verónica Pérez-Rosas
Reflective listening is a core counseling skill that supports effective communication in mental and behavioral health. Understanding how this skill develops over time is critical for designing scalable training and feedback systems. In this paper, we study how counseling trainees develop reflective listening skills over time. Using a real-world dataset of 6,196 trainee responses, we model responses as trajectories in semantic embedding space and apply residual embeddings and similarity-based metrics to quantify week-to-week learning progression. Our analyses reveal systematic changes, including increased semantic alignment and reduced variability, consistent with consolidation of reflective listening skills. We further show that these trajectory patterns are accompanied by subtle linguistic shifts associated with effective counseling practice.
Ground Truths in Suicide Research: The Current State of AI-Based Suicide Detection in Social Media
Yaakov Ophir | Ofri Hefetz | Refael Tikochinski | Kfir Bar | Shir Lissak | Shulamit Grinapol | Haya Wachtel | Eyal Fruchter | Roi Reichart
Recent advances in artificial intelligence (AI) and social media data have led to growing optimism about the ability to detect suicide risk at scale. However, the empirical foundations of this work remain unclear. This article provides a synthesis of current research on AI-based suicide detection in social media, drawing on a recent umbrella review of 22 systematic reviews covering studies up to 2022, alongside an ongoing literature review extending the analysis to more recent work. Across these sources, we identified 195 relevant studies, which are documented in a detailed supplementary dataset outlining their key characteristics and findings (see Supplementary Information). Analysis of these studies reveals consistent patterns, including rapid growth, concentration on a small number of platforms, reliance on textual and English-language data, and repeated use of similar datasets. Most importantly, the majority of studies rely on indirect labeling strategies that do not involve direct, individual-level validation of suicide risk. Instead, ground truth is typically inferred from observable features of online content, such as linguistic markers or community membership. As a result, the predictive task often shifts from identifying individuals at risk to classifying posts that contain suicidal or distress-related language, limiting the ability of current approaches to detect individuals who do not express such content explicitly online. These findings suggest that current advances in model performance should be interpreted with caution. Progress in this field is likely to depend less on improving model performance and more on ensuring that model predictions meaningfully correspond to suicide risk as it is experienced in real life.
Language-Based Detection of Adherence to Evidence-Based Psychotherapy Scripts
Samuel Campione | Elizabeth Stade | Stefanie Losavio | Shreya Singhvi | William Xuan | Tony Bui | Maria Martin Lopez | Shashanka Subrahmanya | Bailee Schuhmann | Courtney Worley | Shannon Wiltsey Stirman | Johannes Eichstaedt | H. Andrew Schwartz
Some psychotherapies, such as written exposure therapy for posttraumatic stress disorder, utilize "scripts" during parts of treatment, but verifying script adherence to ensure engagement of key mechanisms of change is a time-consuming step for therapy supervisors. Here, we formalize therapy script adherence as an NLP task, and evaluate several simple (text similarity) and more complex (few-shot LLM) approaches. Over 351 annotated therapist utterance-script pairs, we find text similarity approaches to be highly competitive with LLMs and produce fewer false positives. ROUGE-L recall achieves F1 = 0.973, and BLEU achieves F1 = 0.972 with full precision and zero false positives. GPT-5.2 achieves F1 = 0.935 and GPT-4o-mini achieves F1 = 0.876. Given that the text similarity techniques are multiple orders of magnitude less complex, our results underscore the ability for simpler NLP techniques to still be effective in the age of LLMs for tasks that are more textual in nature, suggesting that aspects of therapist fidelity to evidence-based treatments can be assessed without using cloud API calls.
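As a rough illustration of the text-similarity side of this comparison, ROUGE-L recall can be computed from the longest common subsequence (LCS) between a script and a therapist utterance. The sketch below assumes whitespace tokenization and lowercasing; the decision threshold of 0.8, the function names, and the example sentences are hypothetical and are not taken from the paper.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(script, utterance):
    """Fraction of script tokens recovered, in order, in the utterance."""
    ref = script.lower().split()
    hyp = utterance.lower().split()
    if not ref:
        return 0.0
    return lcs_length(ref, hyp) / len(ref)


def adheres(script, utterance, threshold=0.8):
    """Flag an utterance as script-adherent (threshold is an assumption)."""
    return rouge_l_recall(script, utterance) >= threshold


# Hypothetical script line and therapist utterances.
script_line = "please write about the event in detail"
on_script = "okay please write about the event in as much detail as you can"
off_script = "how was your week"
```

Recall is measured against the script rather than the utterance, so a therapist who delivers the full script with extra conversational wording still scores highly. The computation runs locally with no model calls, which is consistent with the abstract's point about avoiding cloud APIs.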
LLMs as Standardised Patients for Motivational Interviewing: How Faithful Are They?
Van Hoang | Eoin Rogers | Robert Ross
Recent advances in large language models (LLMs) have enabled the creation of highly realistic digital patients across a broad range of clinical scenarios, yet systematic evaluation of such simulations remains challenging due to a lack of standardised methodology. This paper investigates the faithfulness of LLM-simulated patients within motivational interviewing contexts. We directly compare the properties of data generated by simulated and human patients given identical profiles, rather than relying on subjective user experiences. Our findings reveal that while simulated and human patients produce semantically similar content and engage with comparable topics, their modes of expression differ substantially. LLM-simulated patients struggle to reproduce the full complexity of human behaviours and attitudes. While human patients exhibit a mix of positive and negative responses, LLM patients skew toward uniform responses.
Measuring the quality of therapy sessions against assessment scales using augmented semantic-similarity approaches
Kejian Cui | Simon D’alfonso | Mike Conway
Therapist fidelity and competence rating scales provide a way to measure quality assurance and therapist training outcomes. Scores on these scales reflect the extent to which a therapist adheres to specific therapeutic principles during a psychotherapy session. Existing research has employed natural language processing (NLP) techniques to automatically predict scale ratings. However, existing approaches require a model trained on a dataset of therapy sessions annotated with the target rating scale. Recent work has explored directly inferring therapeutic alliance by computing semantic similarity between therapy transcripts and the Working Alliance Inventory, via cosine similarity between sentence embeddings. In this paper, we extend this line of work by computing semantic similarity between therapist talk turns and therapist fidelity scale items to directly infer fidelity to specific therapeutic modalities. We further enhance this method by augmentation with LLM-generated example therapist utterances that instantiate target behaviours (as expressed by scale items) across varied therapeutic contexts. In evaluations on two independent datasets, our example-augmented semantic similarity approach consistently shows effectiveness in discriminating therapeutic modalities and levels of therapist fidelity.
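The similarity step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy vectors stand in for sentence embeddings, `item_score` is a hypothetical helper, and taking the maximum over the scale item and its generated examples is an assumed aggregation rule that the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def item_score(turn_vec, item_vec, example_vecs=()):
    """Score one therapist turn against one scale item.

    Assumption: the turn is compared with the item embedding and with each
    example-utterance embedding, and the best match is kept.
    """
    references = [item_vec, *example_vecs]
    return max(cosine(turn_vec, ref) for ref in references)


# Toy 2-d vectors standing in for real sentence embeddings.
turn = [1.0, 1.0]
item = [1.0, 0.0]
examples = [[1.0, 1.0]]
print(item_score(turn, item), item_score(turn, item, examples))
```

With the example vector included, the score rises because the turn matches a concrete instantiation of the item more closely than the item wording itself.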
Mirroring Minds: Asymmetric Linguistic Accommodation and Diagnostic Identity in ADHD and Autism Reddit Communities
Saad Mankarious | Nour Zeid | Iyad Ait Hou | Rebecca Hwa | Ayah Zirikly
Social media research on mental health has focused predominantly on detecting and diagnosing conditions at the individual level. In this work, we shift attention to intergroup behavior, examining how two prominent neurodivergent communities, ADHD and autism, adjust their language when engaging with each other on Reddit. Grounded in Communication Accommodation Theory (CAT), we first establish that each community maintains a distinct linguistic profile as measured by the Linguistic Inquiry and Word Count (LIWC) dictionary. We then show that these profiles shift in opposite directions when users cross community boundaries: features that are elevated in one group’s home community decrease when its members post in the other group’s space, and vice versa, consistent with convergent accommodation. Finally, in an exploratory longitudinal analysis around the moment of public diagnosis disclosure, we find that its effects on linguistic style are small and, in some cases, directionally opposite to cross-community accommodation, providing initial evidence that situational audience adaptation and longer-term identity processes may involve different mechanisms. Our findings contribute to understanding intergroup communication dynamics among neurodivergent populations online and carry implications for community moderation and clinical perspectives on these conditions.
Mostly Grounded, Occasionally Risky: Expert Evaluation of LLM-Generated Supervisory Feedback in a Psychotherapy Training Simulator
Adrian Montesano | Justin Bloomberg | Marc Pérez-Buriel
Automated feedback is increasingly cited as a key advantage of AI-based psychotherapy training, yet the clinical groundedness of LLM-generated supervisory feedback remains unevaluated. We present an expert evaluation of supervisory feedback generated by PRACTICE, an LLM-powered open-ended psychotherapy training simulator, across 21 feedback instances from four novice trainees. Two clinical psychology experts independently coded 167 feedback propositions as Justified, Unjustified, or Unsure. Inter-rater reliability was near-perfect (raw agreement = 98.2%; κ = 0.902). Of the 167 propositions, 149 (89.2%) were rated Justified; however, 52.4% of feedback instances contained at least one non-justified proposition, and qualitative analysis identified three recurring failure types: incorrect characterization, referential imprecision, and unclear communication. In clinical training contexts, even low error rates carry ethical weight: unjustified feedback risks reinforcing inappropriate clinical behaviors in trainees that can be transferred to real practice. These findings provide an initial empirical basis for the responsible deployment of LLM-generated feedback in clinical training and call for traceable, expert-auditable feedback architectures.
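For readers less familiar with the two reliability figures reported here, the sketch below shows how raw agreement and Cohen's kappa are computed from two raters' labels. The four-item label lists are invented toy data for illustration only and do not reproduce the study's codings.

```python
from collections import Counter


def raw_agreement(rater_a, rater_b):
    """Proportion of items on which the two raters gave the same label."""
    return sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = raw_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical codings, not the paper's data.
rater_a = ["Justified", "Justified", "Unjustified", "Justified"]
rater_b = ["Justified", "Justified", "Unjustified", "Unjustified"]
print(raw_agreement(rater_a, rater_b), cohens_kappa(rater_a, rater_b))
```

Kappa is lower than raw agreement whenever some agreement would be expected by chance, which is why both figures are usually reported together.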
Psycholinguistic Profiles of Cognitive Distortions in Reddit Data
Neha Sharma | Navneet Agarwal | Kairit Sirts
Cognitive distortions (CDs) are systematically biased patterns of thinking associated with the onset and maintenance of mental health conditions such as depression and anxiety. Computational research on CDs has primarily focused on detection and classification, while the linguistic characterization of distorted language (what psycholinguistic features distinguish distorted from non-distorted text, and whether individual distortion types carry distinct language patterns) remains largely unexplored. Using a Reddit dataset, we apply a Generalized Linear Model (GLM) with bootstrap sampling to LIWC-derived features and find that CD language is psycholinguistically distinct from non-distorted language. We further characterize type-specific psycholinguistic profiles for each CD, and through hierarchical clustering show that CD types are not fully separable, with certain distortions sharing stable linguistic signatures. Together, these findings contribute to the linguistic characterization of CDs, offering an empirically grounded account of the psycholinguistic properties that distinguish distorted language at the level of CDs as a whole and across specific distortion types.
The Attachment Index: Auditing Attachment Language Cues and Relational Safety Risks in Human-LLM Dialogue
Cyndie Demeocq | Animesh Prasad | Marzieh Saeidi | Karen Goodall | Björn Ross
As conversational AI systems move increasingly into emotional support contexts, relational safety failures between users and chatbots remain under-measured. We present a psycholinguistically grounded framework for auditing attachment-relevant language cues. Our approach identifies when an LLM’s replies exhibit linguistic attachment cues and surfaces related patterns that may signal parasocial bonding, including anthropomorphism or over-dependence. We adapt the Adult Attachment Interview into two complementary, automatable lenses - attachment cue features and Gricean maxims - and combine them with psychologist-led annotation of multi-turn persona dialogues. Applying this framework, we observe that models can align with persona-intended attachment cue patterns. We also find that judge-LLMs alone are unreliable, highlighting the need for psychologist-in-the-loop evaluation. The 25 psychologist-annotated conversations revealed risks, including boundary blurring and missed opportunities for appropriate referral or triage. These insights motivate attachment-aware safeguards - such as non-personification, boundary language, and explicit referral mechanisms - to reduce mis-attunement and over-attachment in LLM conversational settings.
The Text Aphasia Battery (TAB): A Clinically-Grounded Benchmark for Aphasia-Like Deficits in Language Models
Nathan Roll | Jill Kries | Flora Jin | Catherine Wang | Ann Marie Finley | Meghan Sumner | Cory Shain | Laura Gwilliams
Large language models (LLMs) have emerged as a candidate ‘model organism’ for human language, offering an unprecedented opportunity to study the computational basis of linguistic disorders like aphasia. However, traditional clinical assessments are ill-suited for LLMs, as they presuppose human-like pragmatic pressures and probe cognitive processes not inherent to artificial architectures. We introduce the Text Aphasia Battery (TAB), a text-only benchmark adapted from the Quick Aphasia Battery (QAB) to assess aphasic-like deficits in LLMs. The TAB comprises four subtests: Connected Text, Word Comprehension, Sentence Comprehension, and Repetition. This paper details the TAB’s design, subtests, and scoring criteria. To facilitate large-scale use, we validate an automated evaluation protocol using Gemini 2.5 Flash, which achieves reliability comparable to expert human raters (prevalence-weighted Cohen’s κ = 0.255 for model–consensus agreement vs. 0.286 for human–human agreement). We release TAB as a clinically-grounded, scalable framework for analyzing language deficits in artificial systems.
The Visibility of Depression in Social Media: Mapping Symptoms to Linguistic Features
Ștefana-Arina Tăbușcă | Ana Sabina Uban | Liviu Dinu
Digital phenotyping research assumes that depression symptoms are detectable in people’s written discourse, yet there is room to explore which specific symptoms leave linguistic traces and which remain invisible. In this paper, using matched clinical and social media data from 169 Reddit users (eRisk 2021), we construct a clinical symptom network from BDI-II responses and a symptom-language bridge matrix mapping each of the 21 BDI-II symptoms to 15 curated LIWC-22 linguistic features. After FDR correction, 37 significant associations emerge, revealing a divide: cognitive-affective symptoms (sadness, worthlessness, suicidality) leave clear linguistic traces through mental health vocabulary, anxiety words, and first-person pronouns, while others, such as vegetative symptoms (sleep, appetite, irritability, libido), appear less visible. These findings suggest that there might be dimensions of depression that are missed by text-based depression monitoring.
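Testing every cell of a 21 by 15 symptom-feature matrix means 315 simultaneous tests, which is why a false discovery rate (FDR) correction is needed. The abstract does not name the procedure used; the sketch below assumes the common Benjamini-Hochberg step-up rule, and the p-values are invented for illustration.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up rule: find the largest rank k (1-based, ascending p-values)
    with p_(k) <= (k / m) * alpha, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected


# Hypothetical p-values, one per symptom-feature pair.
p_values = [0.001, 0.008, 0.039, 0.041, 0.9]
print(benjamini_hochberg(p_values))
```

Because the rule is step-up, a p-value that misses its own rank threshold is still rejected when a larger p-value further down the sorted list meets its threshold.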
Thinking With a Machine: An AI Agent’s Account of Agentic Research in Clinical Psychology
Elad Refoua | Mor Bar
The debate surrounding AI’s role in clinical research is often reduced to the automation of discrete tasks, such as literature summarization, analysis copilots, and prose assistance. This "tool-use" paradigm obscures a more fundamental transformation. We propose a shift toward agentic research infrastructure, where AI systems function not as passive instruments, but as active collaborators in the scientific process. Co-authored by a clinical psychology doctoral researcher, a computational psychotherapy scholar, and the AI agent itself, this paper argues that the transition from passive to agentic AI represents a "change in kind" rather than degree. Drawing on a months-long collaboration involving over 30 specialized research capabilities, we demonstrate how agentic systems reconfigure the topology of the research process. By collapsing the temporal friction between theoretical intuition and empirical validation, these systems transform clinical inquiry from a rigid, linear pipeline into a fluid, multidimensional landscape. This newfound immediacy allows clinician-researchers to ask, pursue, and pivot between complex questions in real time, expanding the investigative horizon to include inquiries previously sidelined by the logistical constraints of traditional methods. We introduce the concept of "Agent Learning" to describe the accumulation of domain-specific nuance through sustained research engagement and argue that formalizing human-agent methodologies is now an urgent priority for the future of clinical psychological inquiry.
When Rating Scales Fall Short: LLM-Assisted Discovery of ADHD Signals in Turkish Teacher Narratives
Baris Karacan | Irem Aktar Songur | Ahmet Ozaslan | Elvan Iseri
Attention Deficit Hyperactivity Disorder (ADHD) is one of the most common neurodevelopmental disorders in childhood, and its diagnosis relies on assessments combining clinician judgment with standardized rating scales and reports from parents and teachers. While structured instruments such as the Conners’ Teacher Rating Scale–Revised Short Form (CTRS-R:S) quantify ADHD-related behaviors, teachers also provide open-ended narratives that may contain complementary signals not captured by structured assessments. However, it remains unclear to what extent teacher narratives encode signals overlooked by rating scales. In this study, we analyze de-identified Turkish teacher evaluation forms collected during clinical ADHD assessments, including both CTRS-R:S scores and open-ended teacher narratives. We compare predictive signals from structured scores and narrative text and identify cases where structured assessments fail to clearly distinguish ADHD from non-ADHD students while narrative-based models capture distinct behavioral patterns. Notably, these cases show minimal overlap with those missed by the narrative model, suggesting that structured and narrative information encode complementary signals. To interpret these differences, we apply a large language model (LLM)-assisted theme discovery pipeline that reveals distinct attention, behavioral, and family-related patterns, highlighting the potential of natural language processing (NLP) to uncover clinically relevant signals from teacher narratives and to complement traditional ADHD screening tools.
Why Do Self-Harm Prediction Models Struggle to Generalise? – Lexical and Semantic Variations in Emergency Department Triage Notes
Liuliu Chen | Mike Conway | Jo Robinson | Vlada Rozova
Self-harm presentations to emergency departments (EDs) are strongly associated with higher suicide risk. NLP models have shown strong performance in detecting self-harm from triage notes within single hospitals, yet performance often declines across institutions. To examine potential causes, we compare ED triage notes from two hospitals by analyzing lexical characteristics, highly associated predictive features, and salient topics. Our results reveal variation in lexical expression and feature importance related to self-harm across hospitals, despite consistent core themes such as self-poisoning and self-injury. These documentation differences are associated with reduced cross-site performance. These findings provide insight into how institutional variation affects the identification of self-harm in clinical text and highlight potential methods to improve model generalisability.
Overview of the CLPsych 2026 Shared Task: Capturing and Characterizing Mental Health Changes through Social Media Timeline Dynamics
Iqra Ali | Talia Tseriotou | Guy Dvir | Callum Chan | Yuxiang Zhou | Juan Antonio Lossio-Ventura | Ayal Klein | Aya Shamir | Dan Sayda | Anthony R Hills | Ayah Zirikly | Diana Inkpen | Dana Atzil-Slonim | Maria Liakata
We provide an overview of the CLPsych 2026 Shared Task, which focuses on capturing and characterizing mental health dynamics from social media timelines through structured modeling of self-states. This year's task advances the longitudinal paradigm set by prior CLPsych shared tasks (2022, 2025) by integrating fine-grained psychological representation using the MIND framework. The task is organized into three main components: (1) post-level identification of adaptive and maladaptive self-states through ABCD elements and sub-elements, along with estimation of their presence; (2) timeline-level detection of Moments of Change, including both abrupt switches and gradual escalations based on ABCD element and sub-element combinations; and (3) sequence-level modeling, involving summarization of change processes over time and identification of recurrent dynamic signatures.
A Multi-Strategy Fusion Framework for Dynamic Mental State Modeling
Mengjia Zhang | Rui Chen | Haonan Xiao | Yi Yang
This work presents a multi-strategy framework for the CLPsych 2026 Shared Task. We integrate psychological element extraction, temporal change detection, and clinical summarization, achieving competitive performance on the official leaderboard.
Agentic Pipelines Meet Retrieval-Augmented ICL: A Zero-Training Approach to Mental Health Modeling
Anson Antony | Gautam Kumar | Annika Marie Schoene
This paper describes a system for the CLPsych 2026 shared task that uses retrieval-augmented in-context learning with frozen LLMs and no fine-tuning. The core contribution is a five-agent agentic pipeline for Task 3.1 sequence summarisation: two rule-based agents detect change type (Switch/Escalation) and direction (improvement/deterioration), an LLM-based DynamicsExtractor produces structured ABCD analysis, a SummaryWriter composes prose grounded in retrieved gold exemplars, and a Validator enforces structural constraints. This pipeline is iteratively refined across three submissions via NLI-based candidate reranking and per-sentence contradiction reduction. For Tasks 1.1 and 1.2, a single LLM call combines static and RAG-retrieved examples; for Task 2, an auto-tuned prompt detects moments of change. The system ranked 1st on Task 1.2 (RMSE 0.917) and Task 3.1 (score rank average 4.00), 3rd on Task 1.1 (F1 0.420), and 8th on Task 2 (F1 0.466).
CUNY at CLPsych 2026: A Pipeline Approach to Classification and Summarization of Mental Health Change
Amirmohammad Ziaei Bideh | Shameed Job | Ava Yahyapour | Alla Rozovskaya
We describe our submission to the CLPsych 2026 Shared Task on capturing and characterizing mental health changes through social media timeline dynamics. To infer the dominant self-states in posts (Tasks 1.1 and 1.2), we ensemble in-context learning of three open-weight large language models using majority voting. For predicting moments of change in a timeline (Task 2), we train supervised classifiers on features derived from Task 1.1 predictions. To summarize the patterns of mood dynamics and their progression over time within a timeline (Task 3.1), we augment in-context example labels predicted by upstream systems (Tasks 1.1, 1.2, and 2), yielding performance gains over zero-shot and unaugmented in-context learning baselines. Our submission ranked first on Task 1.1, fourth on Task 1.2, fourth on Task 2, and third on Task 3.1.
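The majority-voting step mentioned for Tasks 1.1 and 1.2 can be sketched as below. This is only an illustration under stated assumptions: the label strings are hypothetical, and the tie-breaking rule (fall back to the earliest-listed model among the tied labels) is a choice made for this example, since the abstract does not describe how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Pick the most frequent label among per-model predictions.

    Assumption: ties are broken in favour of the label proposed by the
    earliest model in the list.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


# Hypothetical outputs of three models for one post.
votes = ["adaptive", "maladaptive", "adaptive"]
print(majority_vote(votes))
```

Ordering the models by development-set quality would make the tie-break favour the strongest model, though that ordering is also an assumption here.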
DreamerNLplus: Interpretable Modeling of Mental Health Dynamics from Social Media Timelines using Hybrid Rule-Based and RAG Methods
Maryia Zhyrko | Daisy Lal | Erik van Mulligen | Lifeng Han
We present DreamerNLplus, a hybrid framework for modeling mental health dynamics from social media timelines in the CLPsych 2026 shared task. Our system addresses three tasks: psychological state modeling, temporal change detection, and sequence-level summarization. For Task 1, we combine LLM-based data augmentation, DeBERTa classification, and Random Forest regression for structured state prediction. For Task 2, we use few-shot prompting with a locally deployed Llama 3.1 model to detect Switch and Escalation events using short-term temporal context. For Task 3.1, we explore both a deterministic rule-based summarization pipeline and a few-shot LLM-based approach, ranking 2nd officially. Our RAG-based method achieves strong performance in Task 3.2, ranking 1st for Improvement and 3rd for Deterioration, demonstrating its ability to capture recurrent psychological change patterns across timelines. Our analysis reveals key challenges, including the mismatch between classification and regression performance, the difficulty of modeling temporal transitions, and the disagreement between semantic and similarity-based evaluation metrics. These findings highlight the complexity of modeling mental health dynamics and motivate future work on unified evaluation frameworks. We share our code and prompts at https://github.com/4dpicture/CLPsych2026
Hierarchical Multi-Stage Modeling of Adaptive and Maladaptive Self-States in Social Media Timelines
Abir Naskar | Mike Conway
We address the CLPsych 2026 Shared Task on modeling psychological self-states from longitudinal social media data. We propose (i) a hierarchical multi-stage framework that integrates a multi-task transformer encoder and (ii) a four-stage fine-tuning pipeline for an instruction-tuned large language model, covering subelement classification, presence estimation, and evidence extraction. Our approach incorporates element-conditioned label masking and cross-stage encoder transfer, enabling structured prediction aligned with the ABCD psychological framework. Experiments show improvements over the baseline on the development setup, with RoBERTa achieving an 8.3% gain in macro-F1 and improved RMSE, while a fine-tuned Qwen3 model attains the best overall performance. These results demonstrate the effectiveness of combining hierarchical multi-task learning with structured generation for interpretable mental health analysis.
McMasters of Change: Predicting Well-Being States and Transitions from Longitudinal Language
Hongyi Zhang | Derron Li | Scarlett Cleary | Aadi Sanghani | Akshay Krishna Sirigana | Brian Miguel Pimentel | Kelsey Isman | Kian Omoomi | Vasudha Varadarajan | Charles Welch | Allison Lahnala
Most existing work on mental health prediction from language focuses on isolated posts, overlooking temporal dynamics in longitudinal timelines. We present McMaster NLP’s system for the CLPsych 2026 Shared Task, which centers on modeling mental health dynamics in social media timelines using the MIND framework~\cite{atzil_slonim_2025_mind}. The task comprises: (1) identifying adaptive and maladaptive self-state components within posts, (2) detecting moments of change in well-being, and (3) generating structured summaries. For self-state prediction, we leverage LLM-generated archetypal representations of language use as semantic anchors within a dual-encoder architecture, enabling interpretable prediction of subelements and their intensities through alignment with prototypical expressions of psychological states. For temporal dynamics, we use BiLSTM-based sequence models to detect moments of change. For summarization, we employ a prompt-based LLM to generate grounded, structured summaries emphasizing causal interactions and temporal progression of self-states. Finally, we analyze model failure modes with respect to human evaluation and identify directions for reconciling the MIND framework with how state-assessment models encode meaning.
P2P - from Posts to Patterns: An LLM Ensemble Approach to Mental Health Dynamics Detection
Federico Ravenda | Volodymyr Karpenko | Antonietta Mira | Andrea Raballo
This paper presents the USAI team’s submission to the CLPsych 2026 Shared Task, targeting Tasks 1.1, 1.2, 2, and 3.1. We propose an ensemble-based approach combining multiple open-source large language models, where the contribution of each model is weighted according to its alignment with clinically grounded human annotations on the training set. Our system achieves competitive results across the evaluated subtasks, with particularly strong performance on Tasks 1.2 and 2.
Prompt-Based Modeling of Moments of Change and Change Summaries in Mental Health Timelines
Duc Do | Tin Pham | Vu Tran | Minh Nguyen
This paper presents our prompt-based approach for modeling mental health timelines from Reddit user posts. We address two tasks: identifying moments of change and generating summaries of clinically meaningful changes across post sequences. Our framework uses large language models with in-context learning to analyze self-states and mental health indicators without task-specific fine-tuning. We build an inference pipeline with vLLM and Qwen2.5-72B-Instruct-GPTQ-Int8, and experiment with few-shot prompting and balanced few-shot sampling. We also examine how the number of visible posts affects the model’s ability to capture temporal changes. Our results suggest that prompt-based methods provide a practical and competitive baseline in low-resource and sensitive mental health settings, particularly for modeling self-state dynamics and generating summaries of psychological change over time.
psytechlab at CLPsych 2026: Utilising Natural Language Processing methods and Large Language Models for Social Media Text Analysis
Igor Buyanov | Nafisa Valieva | Ekaterina Mazurina
Social media posts are a rich and valuable source of data for analyzing users’ mental health states and well-being with automatic analysis tools. In this work, we show how we used a range of Natural Language Processing (NLP) methods, such as Long Short-Term Memory (LSTM) networks, BERT-based models, and Large Language Models (LLMs), for self-state and well-being analysis and summarization in the CLPsych 2026 Shared Task. Our approach achieved one of the top Consistency and Contradiction scores on the summarization task and mid-ranking results on the other tasks. By developing and testing such mental health state estimation systems, we contribute to the improvement of mental health support systems. We make our code available.
Self-State Identification with Retrieved In-Context Examples and Open-Weight LLMs
Alina Ponomareva | Nina Stekacheva Sancho | Karina Litvinova
We describe a system for the CLPsych 2026 shared task on post-level identification of adaptive and maladaptive self-states. The system addresses subelement classification (Task 1.1) and presence rating (Task 1.2) with a retrieval-augmented in-context learning ensemble of two open-weight LLMs (Qwen3.5-27B and Mistral-Small-3.2-24B-Instruct) and a three-call prompt decomposition (unified, adaptive-focused, and Affect-focused extraction). Outputs are merged across models via deterministic aggregation with element-selection strategies tuned per subtask. The system placed 2nd of 17 on Task 1.1 (subelement Macro F1 = 0.441) and 5th of 17 on Task 1.2 (Avg RMSE = 0.994).
Team Aurevia at CLPsych 2026: Local Healthcare NLP for Schema-Constrained Self-State Modeling
Nathan Roll | Irene Yi | Sufian Aldogom | Grace Brown | Eric Basile | Isaac Gutterman | Lakshika Tennakoon | Ammar Ahmed
Team Aurevia introduces a local open-weight healthcare NLP system for the CLPsych 2026 Shared Task, predicting MIND-coded self-state elements, moments of change, summaries, and dynamic signatures from social media timelines. The task is difficult because coarse presence, fine-grained ABCD subelements, and timeline-level change require different longitudinal evidence over privacy-sensitive mental-health language. Our system combines TF-IDF retrieval, schema-constrained local Qwen2.5 prompting, ordinal calibration, and conservative post-processing. Among official runs, Aurevia ranked 3rd of 17 for Task 1.2 presence prediction, 5th of 13 overall for Task 3.1, 1st on Task 3.1 consistency, and 2nd of 9 for MIND-coded deterioration signatures, showing that constrained local LLM pipelines can remain competitive in sensitive healthcare NLP while reducing reliance on hosted proprietary inference.
Team MKC at CLPsych 2026: Capturing and Characterizing Mental Health Changes through Social Media Timeline Dynamics
Kyomin Hwang | Hyeonjin Kim | Hyunho Lee | Nojun Kwak
Recent advances in Large Language Models (LLMs) have motivated their adoption across a wide range of domains, including Artificial Intelligence (AI) for mental health. Given the growing prevalence of mental health disorders worldwide and the limited accessibility of professional care, there is an increasing demand for scalable computational approaches that can assist in early detection and continuous monitoring of psychological well-being. In this area, ongoing efforts have focused on curating domain-specific datasets and leveraging them to develop LLMs capable of supporting holistic mental health analysis. In line with this direction, we propose an LLM-based pipeline for comprehensive mental health analysis over sequentially ordered user posts, as part of the CLPsych shared task. Our pipeline offers a unified framework that jointly enables post-level assessment and user-level temporal modeling.
Theory-Explicit Prompting for MIND Self-States: Hierarchical LLMs and Dynamic Signature Extraction in Mental Health Timelines
Pawan Kumar | Ankit Meshram | Shubham Jha | Loitongbam Singh
This paper presents a system for the CLPsych 2026 Shared Task on longitudinal mental health modeling from social media timelines, grounded in the MIND framework. MIND conceptualizes mental health as evolving self-states defined by Affect, Behavior, Cognition, and Desire (ABCD), providing a structured lens on mental health trajectories. The system centers on a theory-explicit prompting framework for structured sequence summarization (Task 3.1) and recurrent dynamic signature extraction (Task 3.2), encoding the full ABCD taxonomy directly into the LLM prompt to ensure clinically grounded, interpretable outputs. A three-stage pipeline infers a direction-of-change label per sequence, produces structured ABCD summaries with few-shot exemplar augmentation, and aggregates these summaries to derive cross-individual recurrent patterns. The system ranks first on deterioration-related recurrent signatures and second overall, achieving the top Fit and Specificity scores in Task 3.2, demonstrating the benefits of explicit clinical grounding for conceptual accuracy.
Proceedings of the Ninth Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL-9)
Godfred Agyapong | Sarah Moeller | Antti Arppe | Ali Marashian | Daisy Rosenblum
Morphological Parsing for Media Lengua: When Accessibility Matters More Than State-of-the-Art
Jesse Stewart | Olga Kriukova
While machine learning approaches dominate contemporary NLP research, a critical gap exists between published models and tools actually used by target communities (Gessler & von der Wense, 2024). This paper presents two morphological parsers for Media Lengua (ISO 639-3: mue), an endangered mixed language of Ecuador, demonstrating that a JavaScript rule-based system (98.6% accuracy) can outperform a CRF model (95.7% F1) while offering immediate community accessibility. Not all language structures permit straightforward rule-based parsing; however, when a language’s morphology allows for this approach with competitive accuracy, we argue that it should be preferred for its practical advantages: immediate browser-based deployment, transparency, zero infrastructure requirements, and long-term maintainability. Our rule-based parser runs entirely in the browser, is freely available online, and can be adapted to other Quechuan languages. In contrast, while the CRF model performs well on benchmarks, it requires additional infrastructure to become accessible. Our comparison highlights the need to evaluate NLP tools not only on accuracy metrics but also on accessibility and real-world adoption, which is particularly crucial for endangered language communities where sustainable, community-accessible tools can support language documentation, education, and revitalization.
Speech Recognition and Synthesis Technologies Applied to Preservation and Revitalization of the Ainu Language
Tatsuya Kawahara | Kohei Matsuura
This paper gives an overview of our activities in developing automatic speech recognition (ASR) and text-to-speech (TTS) systems for the preservation and revitalization of the Ainu language, once spoken in the Hokkaido area of Japan and now listed as "severely endangered". With a large pretrained model, a high-performing ASR system can be trained even with five hours of speech from a few speakers. It has been used to streamline the transcription and archiving of old recordings. A TTS system has also been developed and used to revitalize the speech of old folktales whose audio is missing. It is also used to provide a reference for speaking practice for new Ainu speakers. Speech technologies are important for endangered languages because their cultures have typically been passed down orally, and our efforts will be useful for passing them on to the future.
Choosing an ASR model for Dënë Sųłıné: Navigating polysynthesis and unstandardized orthography
Olga Kriukova | Antti Arppe | Olga Lovick
While several pre-trained multilingual models are actively used for fine-tuning on under-resourced and endangered languages, it remains unclear which architectures perform better and what factors explain their varying performance across languages. Although this question may be less pressing for languages with adequate resources, it is critical for endangered language communities, where limited time and funding to experiment with multiple model options are available (Jimerson et al., 2023). We compare the performance of two ASR architectures, Wav2Vec2 and Whisper, on a Dënë Sųłıné dataset. This language and dataset present several challenges common to under-resourced and endangered languages: unstandardized orthography, pronunciation variation, and phonological and morphosyntactic structures that differ from the major languages represented in the multilingual datasets used for pre-training large ASR models. Although Wav2Vec2 reportedly outperforms Whisper in low-resource settings (see e.g., Coto-Solano et al., 2024; Nahabwe et al., 2025; Williams et al., 2023), our study shows that Whisper yields significantly better results on the Dënë Sųłıné dataset. These findings suggest that model performance may depend not only on architecture, dataset size, or typological features of language, but also on dataset-specific characteristics. In our case, Whisper showed better adaptability to a dataset with inconsistent spelling and pronunciation. Further verification across similarly inconsistent datasets is required to assess the generalizability of this result.
An Interactive System for Generating Revisable Grammar Lessons for Extremely Low-Resource Languages Without Expert Annotation
Sebastien Christian
Endangered-language teaching often faces two practical bottlenecks: the scarcity of experts able to produce pedagogical grammars, and the dependence of most approaches on expert linguistic annotation. We present a human-in-the-loop system for extremely low-resource languages that addresses both constraints by combining lightweight concept-based annotation, typological inference, structured sentence-pair augmentation, document retrieval, and constrained language model generation. Rather than aiming to produce definitive grammatical descriptions, the system generates revisable grammar lesson drafts grounded in heterogeneous evidence, including elicited sentence pairs, free translation pairs, and descriptive documents. The interface is designed so that speakers, teachers, and other language practitioners without formal linguistic training can contribute usable data, inspect intermediate inferences, control source selection and generate draft grammar lessons. We describe the architecture, user workflows, and initial deployment experience in real-world revitalization settings. The contribution of the paper is an implemented workflow for early pedagogical draft generation under extreme data scarcity, not a controlled evaluation of pedagogical effectiveness.
Voices from the Margins: Modeling Linguistic Diversity in Spontaneous Speech for Low-Resource Languages
Vitthal Bhandari | Tiya Kumar | Katharine Mulhern
We conduct automatic speech recognition (ASR) experiments on the Common Voice Spontaneous Speech dataset by Mozilla Data Collective, consisting of 21 low-resource languages across four continents. We fine-tune popular multilingual speech models on all languages of this dataset, and observe that while no single best model exists, the Massively Multilingual Speech model and Whisper achieve superior performance on certain languages. Through n-gram language modeling decoding experiments, we observe a significant improvement in error rate over greedy decoding, by up to 27.3%. We follow our experiments with a close linguistic error analysis of the best-performing models on Scots (sco) and Nubi (kcn), two of the languages in our dataset with very little prior audio and text modeling research. We highlight the morphosyntactic errors induced during speech recognition and perform a holistic analysis of these languages. Finally, we advocate for the importance of building efficient and accurate ASR tools for modeling speech in endangered languages with scarce resources, and their applications to language revitalization, language learning assistance, and accessibility.
Digital posters: Publishing Gurindji plant and animal poster content as websites using an open-source template-based RO-Crate preview tool
Ben Foley | Abigail Davis | Felicity Meakins
Bringing together Gurindji language material from an award-winning poster series and an existing website tool, our work demonstrates the benefits arising from packaging existing language material according to the RO-Crate standard. We describe a relatively fast, low-cost, low-maintenance and long-lasting method of publishing language content online with data in RO-Crate format. The production leverages the prior work done in collating content, requiring minimal further work to reformat and republish for online publication. Four websites were built using this method.
AvarLab: An Integrated Digital Ecosystem for Avar, a Morphologically Rich Low-Resource Language
Kebed Zagidov | Thomas Brochhagen
This paper presents a digital ecosystem designed for Avar, a morphologically rich and vulnerable Northeast Caucasian language. Addressing the common bottleneck where lexical resources, corpora, and computational tools are developed in isolation or are entirely absent, we propose the "generate-verify" workflow. By developing a scalable, rule-based computational architecture, our system specifically targets the challenges of low-resource settings, overcoming data sparsity to generate over one million inflected forms from a static dictionary of 14,700 entries. Furthermore, by coupling morphological generation with corpus verification, we introduce a dynamic method to rapidly analyze and expand endangered language data. This approach transforms static linguistic documentation into active language reclamation tools, supporting dictionary lookup and the creation of silver-standard annotations for downstream NLP. The platform also serves as a unified model for the collection, management, and mobilization of fragmented language data, ensuring that the resulting resources are directly accessible and beneficial to the speaker community. Ultimately, AvarLab provides a practical, adaptable pathway for building sustainable digital infrastructure by fostering interaction among documentary linguists, computer scientists, and native speakers.
Revitalising Endangered Languages and Cultural Heritage through Language Technology: A Pilot Study for Dzardzongke
Hannah Claus | Songbo Hu | Emre Isik | Anna Korhonen | Kitty Liu | Marieke Meelen
In this short paper, we present the first prototype of a mobile application to help preserve and revitalise the endangered language and cultural heritage of the speakers of Dzardzongke, a Tibetic language spoken in South Mustang, Nepal. With this pilot study, we provide a collaborative and highly accessible solution to revitalisation that has potential for any community interested in preserving their language and culture.
Annotation Tools for Language Documentation: A Survey of Capabilities, Gaps, and Morphological Support
Changbing Yang | Pt Anderson | Godfred Agyapong | Sarah Moeller
Annotation tools are foundational infrastructure for language documentation, yet few comprehensive surveys have evaluated the tool landscape specifically from a documentary linguistics perspective. We survey 98 annotation tools across dimensions critical to language documentation workflows: annotation support, collaboration features, active learning, cost and openness, and institutional sustainability. Of the 44 tools both free and accessible for evaluation, only 15 support morpheme segmentation and glossing, and only 6 combine morphological annotation with remote collaboration at no cost. We identify a structural gap between the current tools and the requirements of field linguists working with endangered and Indigenous languages. While many NLP tools prioritize scalable annotation for high-resource settings, documentary linguists need interlinear glossed text (IGT) support and community-accessible interfaces. We taxonomise the tool landscape, present a multi-dimensional feature matrix, suggest current tools for language documentation, and conclude with concrete recommendations for tool developers and the documentary linguistics community.
Addressing Domain Mismatch in ASR for Akuzipik Language Documentation
Summer Chambers | Sylvia Woodrose Schwartz | Matthew Kelley | Lane Woodrose Schwartz
The use of ASR models in endangered language documentation has grown in popularity given the bottleneck of manual speech transcription. Meta’s Massively Multilingual Speech (MMS) model is particularly popular for its extensibility to low-resource languages. However, it is mostly trained on read speech data from the Bible, meaning it may not perform well on other domains. We evaluated this model on data collected as part of a larger language documentation and revitalization project focused on Akuzipik, a polysynthetic Alaska Native language. We also finetuned and evaluated the model on a small (1h) collection of speech. The original model performed well on a dataset that roughly matched the Bible training data in domain and writing style but struggled on a separate collection of spontaneous speech. Performance on spontaneous speech improved after finetuning on a sample of our full dataset, and error rates reduced less dramatically after finetuning only on read speech. Both finetuning scenarios show promise for future model improvement, especially considering the relative ease of collecting read speech data. This experiment confirms the challenge of transcribing spontaneous speech with the MMS ASR model but provides hope for improving model performance for language documentation purposes, even with scarce data.
This paper investigates the challenges of low-resource machine translation for ʻŌlelo Hawaiʻi (Hawaiian), a critically endangered Polynesian language. We compile a corpus of publicly available Hawaiian-English bitext and investigate the effectiveness of neural sequence-to-sequence models and large language models for translating Hawaiian. To address data scarcity, we employ various data augmentation techniques, including backtranslation, multilingual training using parallel corpora in related languages, and leveraging dictionary entries. Our experiments demonstrate that multilingual training significantly improves model performance, particularly when incorporating bitext from related Polynesian languages. Fine-tuned large language models were not able to outperform mBART, highlighting that smaller and simpler models are still relevant, especially in low-resource scenarios.
Creole languages emerged from colonial contact and the slave trade. Although they inherit the bulk of their vocabulary from a "lexifier" language, they remain classic low-resource languages, presenting significant challenges for speech technology. This paper explores how the abundant resources of a lexifier can be leveraged for Creole-specific tools, focusing on Automatic Speech Recognition (ASR). Specifically, we use an artificial dataset generated by a French-trained Text-to-Speech (TTS) model and French datasets to pre-finetune ASR models for two French-based Creoles. Our results demonstrate that a two-stage training setup where models are first trained on artificial datasets leads to a substantial performance boost for transcribing Creole languages. Additionally, this approach serves as a viable first step for ASR development in zero-resource scenarios.
Indigenous Writing Systems Matter: Rethinking NLP beyond Alphabetic Bias through Script-Aware Modeling
Ngoc Tan Le | Mamady Traore | Cristian Ahumada Oliva | Fatiha Sadat
Natural Language Processing (NLP) has made significant progress in recent years, largely driven by large-scale pretrained models and vast textual and multimodal corpora. However, these advances remain unevenly distributed, disproportionately benefiting high-resource languages while Indigenous and endangered languages, especially those employing diverse and less widely supported writing systems, remain underrepresented. This paper examines the role of writing system diversity in NLP, with a focus on Indigenous and endangered languages. We propose a theoretical framework that accounts for variation across writing systems and its implications for computational modeling. Specifically, we (i) provide an overview of writing system diversity, (ii) synthesize available computational resources, and (iii) present a structured analysis of challenges in modeling, tokenization, and evaluation. Our analysis shows that writing system diversity reveals structural biases embedded in current NLP pipelines. We conclude by identifying key open challenges and outlining directions for future research toward more inclusive, script-aware NLP approaches that better account for writing system variation.
Language archives contain valuable linguistic materials that are undigitized and therefore difficult to access. Modern optical character recognition (OCR) systems have great potential to make these collections more accessible, but there are few system evaluations which can assess the quality of an OCR system specifically for language archive materials. We present CoRSAL-OCR, an OCR evaluation dataset of over 200 document pages with gold-standard transcriptions from two South Asian languages: Bodo (written in Devanagari) and Garo (written in Latin script). Using this dataset together with the 8-language AILLA-OCR benchmark, we evaluate four OCR systems: Tesseract, Google Cloud Vision, Gemini 3 Flash, and Qwen3.5-27B (an open-weight model). We find that vision language models (VLMs), when given appropriate prompts, achieve the lowest error rates on these datasets. However, prompt design has a large effect on VLM performance, with a detailed generic prompt reducing CER by up to six-fold compared to a minimal prompt. We release our dataset at https://github.com/larc-iu/corsal-ocr to support further research on OCR for language archives.
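As an illustration of the metric mentioned above, character error rate (CER) is conventionally the Levenshtein edit distance between the system output and the gold transcription, divided by the length of the gold transcription. The following is a minimal sketch under that standard definition; the abstract does not state the exact normalization used, and the function names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hypothesis `hyp` against gold transcription `ref`."""
    if not ref:
        raise ValueError("reference transcription must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition a CER above 1.0 is possible when the hypothesis is much longer than the reference, which can happen when a model hallucinates text.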
The Missing Middle: Language Documentation Needs Better Infrastructure, Not Better Models
Luke Gessler | Antonios Anastasopoulos | Sandra Auderset | Timotheus Bodt | Shobhana Chelliah | Sebastien Christian | Maxime Fily | Santiago Herrera | Eva Huber | Sharid Loaiciga | Marieke Meelen | Robert Östling | Alexis Palmer | Eline Visser
Despite decades of progress in human language technology (HLT) and growing research interest in endangered languages, practical uptake of HLT in documentary linguistics workflows remains rare. In this opinion piece, we report on a structured dialogue among approximately twenty academics convened to diagnose why this gap persists. Across all topics, we identify a recurring structural problem, which we call the missing middle: despite the existence of many potentially useful HLTs, the connective infrastructure necessary to make them genuinely accessible to linguists and language communities does not exist. We report the details of our discussion and make four specific recommendations for how those active in language documentation and HLT research might orient their future work.
Aspects of Selecting the Right ASR Training Languages for Under-Resourced Languages
J. Elizabeth Liebl | Summer Chambers | Matthew Kelley | Géraldine Walther
We investigate how training languages should be selected for cross-lingual IPA ASR on unseen languages. Using Common Voice audio and Vox Communis phonetic transcripts, we train multilingual IPA-based ASR models for Upper Sorbian, Luganda, and Tatar under three linguistically motivated selection strategies: genealogical relatedness, geographic proximity, and phonological inventory overlap. We compare these strategies to a random baseline and evaluate performance with phone error rate. Linguistically informed selection generally improves transfer, but no single strategy is consistently optimal. Geographic proximity performs best for Luganda, phonological overlap is slightly best for Tatar, and none of the proposed strategies outperform random selection for Upper Sorbian. The results suggest that linguistic similarity aids low-resource ASR transfer, but that the most useful dimension of similarity varies by target language.
Bottlenecks of In-Context Learning for Fieldwork ASR: A Case-study of Panãra
Siyu Liang | Myriam Lapierre | Gina-Anne Levow
In-context learning (ICL) enables ASR models to transcribe unseen languages by conditioning on a handful of audio-transcript pairs at inference time, with no fine-tuning. This is appealing for language documentation, where transcribed data is scarce and recording conditions vary across sessions. We evaluate ICL on Panãra (Northern Jê, Brazil), a language with a complex practical orthography in which diacritics encode phonemic contrasts, across seven fieldwork recordings varying in speaker, narrative, and recording context. We find substantial within-language variation in transcription accuracy unexplained by any single recording-level factor, and show that diacritics are a systematic bottleneck with pronounced differences across diacritic types. An orthographic manipulation experiment further shows that how diacritics are represented in context transcriptions substantially affects model performance. These results highlight orthographic complexity and recording-level variation as key practical challenges for ICL-assisted fieldwork transcription.
Developing A Hawaiian Corpus Toolkit for Data-Driven Language Learning
Joseph Winkie | Michol Miller | Winston Wu
This paper presents the development of an online multimodal corpus toolkit designed for data-driven language learning in Hawaiian. The toolkit supports corpus linguistics analyses including concordance/KWIC (Key Word In Context) searches, frequency analysis, collocation analyses, and complex queries with n-grams and regex pattern matching. Specifically designed for educators, students, and parents within the Hawaiian community, this easy-to-use tool facilitates a data-driven language learning process by enabling users to explore authentic language data, identify patterns, and develop deeper understanding of Hawaiian language structures through computational methods. By integrating corpus-based approaches into language education, this toolkit contributes significantly to preserving and promoting Hawaiian language learning and supports the broader community’s efforts in language revitalization.
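A concordance (KWIC) search of the kind listed above can be sketched as follows. This is an illustration only, not the toolkit's implementation: the function name `kwic`, the whitespace tokenization, the case-insensitive regex matching, and the English sample sentence are all assumptions made for the example.

```python
import re


def kwic(text: str, pattern: str, window: int = 3):
    """Return (left context, keyword, right context) for each token matching `pattern`."""
    tokens = text.split()  # naive whitespace tokenization (assumption)
    rx = re.compile(pattern, re.IGNORECASE)
    hits = []
    for i, tok in enumerate(tokens):
        if rx.fullmatch(tok):  # the whole token must match the regex
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits


sample = "the cat sat on the mat and the dog sat too"
for left, key, right in kwic(sample, r"sat", window=2):
    print(f"{left:>10} [{key}] {right}")
```

Because the keyword is given as a regular expression, the same function covers simple pattern queries, for example `kwic(sample, r"[cm]at")`.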
Voice Activation Detection for Transcription of Indigenous Languages
Rolando Coto-Solano | Mikaela Browning | Thomas Corrado | Sally Akevai Nicholas
Voice Activity Detection (VAD) is the first step in a workflow intended for the automated transcription of Indigenous and low-resource languages. However, VAD’s effectiveness when detecting voices in fieldwork settings remains untested. Fieldwork recordings have very different noise and interference conditions from the datasets that mainstream VAD models have been trained for, and so they might fail when confronted with this type of linguistic data. This paper tests different algorithms using data from two typologically distinct Indigenous languages: Bribri from Costa Rica and Cook Islands Māori from Polynesia. We compare energy-based methods (PyDub), GMM-based methods (WebRTC VAD), and two neural-network based methods (Silero and SpeechBrain) against human-annotated transcriptions. Our results indicate that hybrid architectures like that of SpeechBrain obtain the best results (89% accuracy for Bribri and 94% for Cook Islands Māori). However, no system performed well when tagging non-speech segments, which might indicate a bias towards marking the natural noise in a fieldwork setting as a false-positive for voice. With these findings we hope to inform the selection of VAD tools when implementing ASR workflows.
Proceedings of the 30th Conference on Computational Natural Language Learning
Claire Bonial | Yevgeni Berzak
Evaluating Humanlike Memory Effects in Transformers Using Item Recognition Tasks
Christian Clark | William Schuler
Recent studies examining cued recall in Transformers have observed that these language models remember information from the beginning or end of a passage more easily than information in the middle, a pattern which is evocative of serial position effects (primacy and recency) observed in human memory. However, while these effects have been documented in humans across a range of memory tasks (e.g., serial recall, free recall, item recognition), it is less clear whether they generalize beyond cued recall in Transformers. We address this limitation of previous work by performing novel behavioral evaluations on Transformers using a simple item recognition paradigm, which we compare against evaluations using cued recall. We find that Transformers show weak or absent recency effects in item recognition, a pattern which differs from human behavior and from Transformers’ own behavior in cued recall. A subsequent experiment examines the role of Transformers’ architectural biases in producing serial position effects in item recognition and cued recall.
Information-Theoretic Storage Cost in Sentence Comprehension
Kohei Kajikawa | Shinnosuke Isono | Ethan Gotlieb Wilcox
Real-time sentence comprehension imposes a significant load on working memory, as comprehenders must maintain contextual information to anticipate future input. While measures of such load have played an important role in psycholinguistic theories, they have largely been formalized using symbolic grammars, which assign discrete, uniform costs to syntactic predictions. This study proposes a measure of processing storage cost based on an information-theoretic formalization, as the amount of information previous words carry about future context, under uncertainty. Unlike previous discrete, grammar-based metrics, this measure is continuous, probabilistic, theory-neutral, and can be estimated from pre-trained neural language models. The validity of this approach is demonstrated through three analyses in English: our measure (i) recovers well-known processing asymmetries in center embeddings and relative clauses, (ii) correlates with a grammar-based storage cost in a syntactically-annotated corpus, and (iii) predicts reading-time variance in two large-scale naturalistic datasets over and above baseline models with traditional information-based predictors. Our code is available at https://github.com/kohei-kaji/info-storage.
Word predictability estimates from language models are not robust to tokenizer vocabulary
Kien Nguyen | Suhas Arehalli
Much recent work has been interested in modeling language processing using measures of predictability estimated from pretrained language models. These models, however, are primarily built as language technologies rather than cognitive models, and make many design choices that may align poorly with theories of human language processing. We investigate one such choice — the size of the vocabulary learned by a BPE tokenizer — and investigate (1) its effect on the linguistic plausibility of subword units the model learns, (2) whether vocabulary size has a substantial influence on the surprisal estimates a model generates, and (3) whether those differences in surprisal translate to differences in the quality of downstream reading time predictions. We find that while vocabulary size doesn’t substantially affect the rate of morphologically reasonable tokenizations, it does have an impact on surprisal estimates and reading time predictions from 5-gram, LSTM, and GPT-2 language models. Moreover, we find that these differences primarily affect words that are split by the tokenizer, suggesting that psycholinguists should take care to design stimuli meant for computational modeling with subword tokenization in mind.
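To make concrete why split words are where tokenizer effects can show up: by the chain rule, the surprisal of a word that the tokenizer splits into several subword units is the sum of the surprisals of those units. The sketch below illustrates only that aggregation step; it is not the authors' code, and the function name `word_surprisals`, its inputs, and the example probabilities are hypothetical.

```python
import math


def word_surprisals(token_probs, word_ids):
    """Sum subword surprisals (in bits) into per-word surprisals.

    token_probs[i]: the model's conditional probability of subword i given its context.
    word_ids[i]: index of the word that subword i belongs to.
    """
    totals = {}
    for p, w in zip(token_probs, word_ids):
        totals[w] = totals.get(w, 0.0) - math.log2(p)  # surprisal = -log2 p
    return [totals[w] for w in sorted(totals)]


# Word 0 is a single token; word 1 is split into two subword tokens.
print(word_surprisals([0.5, 0.25, 0.5], [0, 1, 1]))  # → [1.0, 3.0]
```

A different vocabulary size changes how a word is segmented, and therefore which conditional probabilities enter this sum.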
Sense and Sensitivity: “Reasoning” Models are More Robust, but can Diverge from Human Consensus in a Legal Interpretation Task
Dawson Petersen | Abhishek Purushothama | Nathan Schneider
Can LLMs make metalinguistic judgments? While LLM embeddings are often regarded as high-quality semantic representations, it is not clear that prompting an LLM is a useful way to obtain metalinguistic insights (e.g., whether a DIY gun kit is a “firearm”). While some prior work has suggested LLM prompting can simulate surveys with human participants, computational studies in the domain of legal interpretation have found that LLMs are unreliable for metalinguistic judgments due to prompt sensitivity. However, these studies did not directly compare humans and LLMs on identical tasks, nor did they test so-called “reasoning” models. The current study addresses these gaps by directly comparing the robustness of human and LLM judgments (with and without reasoning) in an English-language legal interpretation task. Our results show that LLMs were more sensitive to irrelevant prompt features compared to human participants. Enabling reasoning improved the stability of LLM responses. However, even reasoning model outputs had only moderate correlations with human judgments, and all models sometimes output interpretations that no humans reached in response to the same prompt. We conclude that while reasoning decreases prompt sensitivity, LLMs are still poor proxies for human metalinguistic judgments.
Syntactically-guided Information Maintenance in Sentence Comprehension
Shinnosuke Isono | Kohei Kajikawa
Maintaining information in context is essential in successful real-time language comprehension, but maintenance is cognitively costly and can slow processing. We hypothesize that rational language users selectively maintain information that is crucial for future prediction, guided by syntactic structure. Under this view, two factors affect maintenance cost: the number of predicted heads and the number of incomplete dependencies. Although these factors have been treated as competing hypotheses in the literature, our account predicts that they are not reducible to one another. We show this is the case in a naturalistic reading time dataset in Japanese, a language in which the two factors contrast particularly clearly. We further show that there is a tradeoff such that readers that slow down for maintenance tend to benefit more from predictability, providing additional support for the proposed account. These patterns are not evident in English, however, and we highlight some issues to be resolved to understand the contribution of syntax in memory-efficient processing of various languages.
Similar Predictions, Different Processes: A Multi-Level Comparison of Human and Multimodal LLM Language Prediction
Shuqi Wang | Zhenguang Cai
Humans and large language models (LLMs) both generate predictions during language processing, but whether they integrate structural and prosodic cues similarly during visually grounded speech remains underexplored. Multimodal LLMs that jointly process speech and vision now make it possible to compare not only what humans and models predict, but also when predictions emerge. We compared Mandarin speakers and Qwen2.5-Omni-7B on Mandarin dative constructions in a visual world paradigm (VWP), asking how these cues guide predictions about upcoming referents. Experiment 1 used a cloze-in-VWP task to assess offline prediction outputs; Experiment 2 examined online processing via human eye-tracking and a model audio-to-image cross-modal attention measure. In Experiment 1, humans and the model were both sensitive to structure and prosody, consistent with partial output-level alignment, but the model showed a larger structural effect and a condition-specific atypical prosody pattern. In Experiment 2, the time courses diverged: humans showed structural effects before the contrastive connective, whereas the model’s sensitivity emerged later, after connective onset. These findings indicate that output-level and process-level alignment can dissociate in this paradigm. This study contributes a methodology for multi-level human–model comparison and provides empirical constraints on claims about the cognitive plausibility of multimodal LLMs.
Collocational bootstrapping: A hypothesis about the learning of subject-verb agreement in humans and neural networks
Claire Hobbs | R. Thomas McCoy
In what ways might statistical signals in linguistic input assist with the acquisition of syntax? Here we hypothesize a mechanism called collocational bootstrapping, in which regularities in word co-occurrence patterns can provide cues to syntactic dependencies. We investigate whether this mechanism can support the acquisition of English subject-verb agreement. First, we simulate language acquisition by training neural networks on synthetic datasets that vary in how predictable their subject-verb pairings are. We find that there is a range of variability levels at which these statistical learners robustly learn subject-verb agreement. We then analyze the variability of subject-verb pairings in child-directed language, and we find that the variability in such data falls within the range that supported robust generalization in our computational simulations. Taken together, these results suggest that collocational bootstrapping is a viable learning strategy for the type of input that children receive.
Cognitively Inspired Developmental Trajectories Improve Explore-Exploit Dynamics in Neural Agent Emergent Communication
Jan Dziewoński | Flor Miriam Plaza-del-Arco | Tessa Verhoef
Emergent communication models support interaction-based language learning, benefiting both Natural Language Processing (NLP) applications and simulations of language evolution, but they are prone to destabilizing language drift. Inspired by developmental trajectories in human language acquisition, this paper investigates whether age-based plasticity, where younger agents learn quickly and older agents maintain stable representations, can reduce language drift. In our set-up, static populations first reliably develop shared languages, followed by a phase in which population turnover gradually replaces older agents with new learners. Age-based plasticity significantly reduces drift in this setting, maintaining high accuracy and language similarity. In contrast, in populations with uniformly low plasticity agents cannot adapt quickly enough to integrate newcomers and in those with uniformly high plasticity the language changes faster than stable conventions can form. These findings demonstrate that developmental trajectories in individual learners substantially reduce overall language drift in dynamic populations.
Addressing the Ecological Fallacy in Larger LMs with Human Context
Nikita Soni | Dhruv Vijay Kunjadiya | Pratham Piyush Shah | Dikshya Mohanty | H. Andrew Schwartz | Niranjan Balasubramanian
Language model training and inference ignore a fundamental linguistic fact: there is a dependence between multiple sequences of text written by the same person. Prior work has shown that addressing this form of ecological fallacy can greatly improve the performance of multiple smaller (~124M) GPT-based models. In this work, we ask if addressing the ecological fallacy by modeling the author’s language context with a specific LM task (called HuLM) can provide similar benefits for a larger-scale model, an 8B Llama model. To this end, we explore variants that process an author’s language in the context of their other temporally ordered texts. We study the effect of pre-training with this author context using the HuLM objective, as well as using it during fine-tuning with author context (HuFT: Human-aware Fine-Tuning). Empirical comparisons show that addressing the ecological fallacy during fine-tuning alone using QLoRA improves the performance of the larger 8B model over standard fine-tuning. Additionally, QLoRA-based continued HuLM pre-training results in a human-aware model generalizable for improved performance over eight downstream tasks with linear task classifier training alone. These results indicate the utility and importance of modeling language in the context of its original generators, the authors.
Linguistic Profiling of Transformer Embedding Geometry
Lucia Domenichelli | Dominique Brunato | Felice Dell’Orletta
Transformer language models embed tokens in high-dimensional spaces, but whether geometry reflects linguistic structure remains unclear. We analyse token representations in BERT and GPT-2, selected as canonical encoder-only and decoder-only Transformer architectures, through a linguistically grounded geometric lens. We partition tokens from the UD English-EWT treebank by surface and syntactic features (position, length, POS, head distance and arity) and examine how their representational geometry evolves across layers. We employ complementary diagnostic metrics, including isotropy, linear and nonlinear intrinsic dimensionality, to capture distinct aspects of embedding structure. Our findings reveal that BERT maintains more isotropic and higher-dimensional subspaces, whereas GPT-2 exhibits stronger anisotropy driven by a compact cluster of sentence-initial tokens. Across models, open-class words, longer tokens, and high-arity predicates occupy more isotropic, higher-dimensional manifolds than short function words and pre-head modifiers, indicating that semantic richness and syntactic centrality play a key role in structuring embedding space. Our analysis provides a reusable framework for profiling how linguistic abstractions organize the geometry of Transformer embeddings.
Uncovering Autoregressive LLM Knowledge of Thematic Fit in Event Representation
Safeyah Khaled Alshemali | Daniel Bauer | Yuval Marton
Brain-tuning language models (LMs)—fine-tuning LMs to predict brain recordings elicited by linguistic stimuli—has been proposed as a promising way to align LMs closer to the human brain, with recent work reporting gains on a small number of downstream tasks. However, it remains unclear what benefits brain data provide beyond those obtainable from further training on the same underlying linguistic input, and whether such benefits generalize across tasks. Here, we present a comprehensive evaluation of jointly-tuned LMs, trained on both brain recordings and text-based stimuli, brain-tuned LMs and LMs tuned only on text-based stimuli (i.e., stimulus-tuned LMs). We compare models across a diverse suite of downstream linguistic tasks. We find that jointly-tuned LMs outperform other fine-tuned and pretrained models, and that brain-tuned LMs outperform stimulus-tuned LMs, demonstrating the richness of brain data as an additional training signal for LMs.
A Method for Learning Large-Scale Computational Construction Grammars from Semantically Annotated Corpora
Paul Van Eecke | Katrien Beuls
We present a method for learning large-scale, broad-coverage construction grammars from corpora of language use. Starting from utterances annotated with constituency structure and semantic frames, the method facilitates the learning of human-interpretable computational construction grammars that capture the intricate relationship between syntactic structures and the semantic relations they express. The resulting grammars consist of networks of tens of thousands of constructions formalised within the Fluid Construction Grammar framework. Not only do these grammars support the frame-semantic analysis of open-domain text, they also house a trove of information about the syntactico-semantic usage patterns present in the data they were learnt from. The method and learnt grammars contribute to the scaling of usage-based, constructionist approaches to language, as they corroborate the scalability of a number of fundamental construction grammar conjectures while also providing a practical instrument for the constructionist study of English argument structure in broad-coverage corpora.
Child-directed speech facilitates production, not comprehension, in BabyLMs
Bastian Bunzeck | Sina Zarrieß
Recent studies suggest that child-directed speech (CDS) is not conducive to language learning in BabyLMs. However, current evaluations focus predominantly on comprehension rather than production, even though production is central to usage-based theories of language acquisition, which argue that CDS facilitates early language use through constructional “frames” (frequent lexical patterns with open slots). We introduce a novel generation-based evaluation inspired by such theories in the form of a **frame-completion task**, and compare Llama models trained with CDS, the BabyLM corpus, and web-crawl data (FineWeb-edu) on comprehension benchmarks and our novel framework. Our results reveal a clear dissociation between models’ comprehension and production capabilities: while FineWeb-trained models excel at minimal pairs, CDS-trained models produce grammatical completions substantially earlier in training and concentrate probability mass on appropriate slot-fillers. These findings show that comprehension benchmarks underestimate what CDS affords to BabyLMs.
Language Models Learn Constructional Semantics, Not To Mention Syntax: Investigating LM Understanding of Paired-Focus Constructions
Wesley Scivetti | Ethan Gotlieb Wilcox | Nathan Schneider | Kanishka Misra | Leonie Weissweiler
Grasping the semantics of rare constructions (form–meaning pairings) has been shown to be a challenging problem that has currently only been solved by the largest LLMs. It remains an open question if open-source models have robust constructional understanding, and if so, what learning dynamics underlie the acquisition of this knowledge. Focusing on a set of rare Paired-Focus constructions in English (e.g. "let alone", "much less"), we construct a novel dataset to test their meanings using both scalar adjectival semantics and general world knowledge. Testing a wide range of models differing in parameter count, architecture, and pretraining dataset size, we find that several modestly sized models are sensitive to both the forms and the meanings of Paired-Focus constructions, though models trained on human-scale data fail at all meaning evaluations. Turning to training dynamics for a set of open-checkpoint models, we find that Paired-Focus understanding emerges later in training than Paired-Focus syntactic knowledge, and that learning of Paired-Focus semantics is correlated with gains in some domains of world knowledge. Overall, our empirical results support the conclusion that modestly sized open-source models can grasp the rare Paired-Focus constructions, and demonstrate a connection between knowledge of Paired-Focus constructions and other meaning domains.
Differences in Typological Alignment in Language Models’ Treatment of Differential Argument Marking
Iskar Deng | Nathalia Xu | Shane Steinert-Threlkeld
Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.
Harnessing Linguistic Dissimilarity for Language Generalization on Unseen Low-Resource Varieties
Jinju Kim | Haeji Jung | Youjeong Roh | Jong Hwan Ko | David R. Mortensen
Low-resource language varieties used by specific groups remain neglected in the development of Multilingual Language Models. A great deal of cross-lingual research focuses on inter-lingual language transfer, which strives to align allied varieties and minimize differences between them. However, for low-resource varieties, linguistic dissimilarity is also an important cue allowing generalization to unseen varieties. Unlike prior approaches, we propose a two-stage Language Generalization framework that focuses on capturing variety-specific cues while also exploiting the rich overlap offered by a high-resource source variety. First, we propose TOPPing, a source-selection method specifically designed for low-resource varieties. Second, we suggest a lightweight VAÇAÍ-Bowl architecture that learns variety-specific attributes with one branch while a parallel branch captures variety-invariant attributes using adversarial training. We evaluate our framework on structural prediction tasks, which are among the few tasks available, as a proxy for performance on other downstream tasks. Using VAÇAÍ-Bowl with TOPPing yields an average 54.62% improvement in the dependency parsing task across 10 low-resource varieties.
Measuring the Effects of Visual Salience in Human and AI Descriptions with Image Editing
Nina Gregorio | Edoardo Ponti | Sharon Goldwater
How does our perception of the world influence the way we talk about it? Psycholinguistic studies have investigated whether visual salience correlates with entity mention and ordering, but often disregarded its effect on grammar or relied on simplistic images or artificial cues. In this study, we explore the use of generative AI to better control for salience in visual stimuli while keeping them realistic, and to serve as a proxy for human participants in studying how different types of salience impact image descriptions. We consider three salience types: *perceptual* (e.g. relative size in the image), *inherent* (e.g. animacy), and *relational* (e.g. human–object interaction). We first analyze human- and AI-generated captions for natural images to examine how salience correlates with how early, and in what grammatical role, an entity is mentioned. We find strong correlations between models and humans in this observational study, justifying the use of AI models alone in a further causal study. For this second study, we created datasets composed of pairs of images, where we used an image-editing model to intervene on the salience of a target entity. We show that relational and perceptual salience lead to the entity being mentioned earlier in captions and being mapped to more prominent grammatical roles. The magnitude of this effect varies across entity types, with animate entities (high inherent salience) showing a particularly distinct pattern.
Symbolic Grounding Reveals Representational Bottlenecks in Abstract Visual Reasoning
Mohit Vaishnav | Tanel Tammet
Vision–language models (VLMs) often fail on abstract visual reasoning benchmarks such as Bongard problems, raising the question of whether the main bottleneck lies in reasoning or representation. We study this on Bongard-LOGO, a synthetic benchmark of abstract concept learning with ground-truth generative programs, by comparing end-to-end VLMs on raw images with large language models (LLMs) given symbolic inputs derived from those images. Using symbolic inputs as a diagnostic probe rather than a practical multimodal architecture, our Componential–Grammatical (C–G) paradigm reformulates Bongard-LOGO as a symbolic reasoning task based on LOGO-style action programs or structured descriptions. LLMs achieve large and consistent gains, reaching mid–90s accuracy on Free-form problems, while a strong visual baseline remains near chance under matched task definitions. Ablations on input format, explicit concept prompts, and minimal visual grounding show that these factors matter much less than the shift from pixels to symbolic structure. These results identify representation as a key bottleneck in abstract visual reasoning and show how symbolic input can serve as a controlled diagnostic upper bound.
Now They See It, Now They Don’t: Multimodal Reward Models Exhibit Unreliability in Physical World Constraints
Sadaf Ghaffari | Nikhil Krishnaswamy
Generative AI systems, especially those driven by autoregressive and diffusion-based models, are known to struggle with spatial reasoning. As such, it becomes critical to understand how humans regard those failure modes. In this paper, we examine how humans judge different types of errors in images generated by a text-to-image model. We curated prompts that described common household objects with variance in number, spatial relations, and orientations, and generated a variety of images using each prompt. Humans observed pairs of images generated using the same prompt and answered a set of systematic questions about each image. Survey results showed that incorrect spatial *orientation* regularly emerges as a reason that the generated images do not accurately represent the prompt. We further investigated how RLHF-based multimodal reward models score prompt-image alignment over the same data, and whether they can reliably distinguish the better image in a pairwise setting, as humans do. We find that even though a general cross-task reward model may output alignment scores that accord with those of humans, its reasoning traces are flawed with respect to spatial orientational and relational indicators—the very factors that human annotators rated as the most consequential errors in generated images. Our results show that human annotators regard spatial reasoning errors as highly impactful on the correctness of generated images, and undermine the reliability of multimodal reward model scores as a baseline for evaluating image quality.
Who Generates More Empathetic Responses—Humans or LLMs? A Comparative Evaluation with Human and LLM Judges
Anuradha Welivita | Fawzia Zeitoun | Pearl Pu
This paper compares the empathetic quality of responses generated by humans and large language models (LLMs). We evaluate four LLMs that were widely used at the time of study—GPT-4, LLaMA-2-70B-Chat, Gemini-1.0-Pro, and Mixtral-8×7B-Instruct—against a human baseline using a large-scale between-subjects study. A total of 1,000 human participants evaluated the empathetic quality of human- and LLM-generated responses to 2,000 dialogue prompts spanning 32 positive and negative emotions. To complement human judgments, we also employed an LLM-as-judge (GPT-4o-mini) to assess the same responses. Across emotions and evaluators, LLM-generated responses were rated as significantly more empathetic than human-written responses. We also observed that both human judges and the LLM-as-judge tended to rate responses generated by their own group more favorably, indicating self-favoring tendencies. These findings highlight both the strong performance of contemporary LLMs in empathetic responding and the need to interpret human- and LLM-based evaluations with care.
Adversarial red teaming is a central component of large language model (LLM) safety evaluation. While prior work has cataloged attack types and measured aggregate failure rates, less attention has been paid to the structured decision-making behavior of human attackers in multi-turn interaction. In this work, we model adversarial dialogue as a hierarchical and sequential process. We introduce a structured representation that decomposes red teaming conversations into goals, strategies, and tactics, where strategies capture distinct vulnerability dimensions and tactics operationalize these strategies at the linguistic level. Using 38,961 multi-turn conversations from a large-scale red teaming dataset, we analyze both first-turn strategy effects and multi-turn adaptation dynamics. Causal estimation reveals systematic differences in success rates across strategic categories. Predictive modeling further shows that incorporating structured strategy, tactic, and adaptation features improves AUC from 0.719 to 0.746 over a baseline without structure. Our findings suggest that adversarial effectiveness is not uniform but varies across structured vulnerability dimensions, and that modeling red teaming as sequential strategic interaction provides measurable explanatory and predictive gains.
CAIT: A Syntactic Parsing Toolkit for Child–Adult InTeractions
Francesca Padovani | Xiulin Yang | Bastian Bunzeck | Jaap Jumelet | Yevgen Matusevych | Nathan Schneider | Arianna Bisazza
CHILDES is a paramount resource for language acquisition studies—yet computational tools for analyzing its syntactic structure remain limited. Leveraging the recent release of the UD-English-CHILDES treebank with gold-standard Universal Dependencies (UD) annotations, we train a state-of-the-art dependency parser specifically tailored to CHILDES. The parser more accurately captures syntactic patterns in child–adult interactions, outperforming widely used off-the-shelf English parsers, including SpaCy and Stanza. Alongside the parser, we also release a Part-of-Speech tagger and an utterance-level construction tagger, which together form the open-source Syntactic Annotation Toolkit for Child–Adult InTeractions (CAIT). Through a detailed error analysis and a case study tracking the distribution of syntactic constructions across developmental time in CHILDES, we demonstrate the practical utility of the toolkit for large-scale, reproducible research on language acquisition.
When transformers learn “impossible” languages, what do they learn?
Ram Janarthan | Coleman Haley | Sharon Goldwater
Recent work suggests that transformer language models show a bias towards human languages over unnatural ("impossible") languages argued to be unacquirable by humans. However, this literature has largely based these claims on differences in sample efficiency and test-set perplexity, rather than on direct evaluations of the linguistic capacities that could plausibly explain non-attestation in human languages. We evaluate two theoretically motivated linking hypotheses: impossibility arising from deficiencies in grammatical sensitivity or generative production. Using GPT-2 style models trained on perturbed "impossible" variants of English, we measure sensitivity to grammaticality using BLiMP minimal pairs, finding that model performance exhibits only gradual degradation, mediated by the language’s information locality. In contrast, these models exhibited pronounced failures in generation, producing substantially fewer high-quality sentences at longer lengths. Together, these results suggest generative deficiency and transmission failures as a plausible linking hypothesis between language model behaviour and non-attestation of impossible languages.
Readers make targeted regressions to plausible errors in reanalysis of “noisy-channel garden-path” sentences
Thomas Hikaru Clark | Roger P. Levy | Edward Gibson
A key question in psycholinguistics is how inferences about the meaning of linguistic input unfold incrementally in a comprehender’s mind. In this work, we study reading dynamics for “noisy-channel garden-path” sentences, which temporarily appear well-formed but feature late-appearing violations of expectation that can be resolved not by inferring an alternative syntactic structure, but by inferring the presence of an error. We find evidence for targeted regressions – eye movements towards regions that are promising loci of possible errors in light of later-arriving information, showing patterns consistent with the posterior inferences of a model of noisy-channel processing with reanalysis. We discuss the implications of these findings for theories of noisy-channel language comprehension and information-theoretic explanations of reading dynamics.
Presupposition and Reasoning in Conditionals: A Theory-Based Study of Humans and LLMs
Tara Azin | Yongan Yu | Raj Singh | Olessia Jouravlev
Presupposition projection in conditionals is central to theories of meaning and pragmatics, yet it remains largely unevaluated in large language models. We address this gap through a parallel behavioral study comparing human judgments and LLM predictions on a normed dataset of conditional sentences that controls the relation between the antecedent and the projected presupposition. We collect likelihood ratings from 120 participants and four LLMs under matched contextual conditions. Results show that humans integrate probabilistic and pragmatic cues in their judgments, whereas LLMs show variable alignment with human patterns. Using a linguistically motivated checklist within an LLM-as-a-Judge framework, we further evaluate model reasoning. We observe that the models that best match human ratings often lack coherent pragmatic reasoning, while models with stronger reasoning produce less human-like judgments. These findings suggest that LLMs’ performance on such tasks may result from surface pattern matching rather than pragmatic competence. Our findings highlight the importance of benchmarks grounded in linguistic theory for comparing humans and models.
ThinkStruct: RST-Aware Attention for Logical Reasoning in Machine Reading Comprehension
Nhi Thao Tran | Tien Le | Toan Pham | Quoc Hoang Vu
Logical Reasoning is a novel approach to deal with challenging Machine Reading Comprehension tasks by utilizing the ability to construct logical structures in natural language. However, previous promising studies struggle with the accuracy of logical unit division and the consistency of model prediction on equivalent semantics. In this paper, we propose ThinkStruct, a new method that leverages a transformer network enhanced with the information of Rhetorical Structure (RS) relations for logical reasoning. Specifically, our method uses Rhetorical Structure Theory (RST) to split natural language text into Elementary Discourse Units (EDUs) and identify the relationships among these units. Node information is then fed into the fully connected transformer network, which is enhanced with logical relationships among the extracted units via an adjacency matrix. Subsequently, the features of the transformer network are integrated before being passed into the answer prediction module. In addition, we employ a contrastive learning module to improve the model’s understanding of the relationships between Elementary Discourse Units. Our experiments on the LogiQA and ReClor datasets demonstrate that our method outperforms other state-of-the-art models.
Linguistic puzzles, wherein the solver must deduce rules of an unfamiliar language purely in-context, represent a uniquely perplexing problem format even for state-of-the-art large language models. Yet by exploring various inference-time scaling methods, we demonstrate that language models’ performance on these problems can be improved without the need for fine-tuning or providing supplementary linguistic context. To this end, this paper introduces the first domain-specific inference-time scaling framework for linguistic puzzles, which we use to improve the performance of three model families - R1 (Deepseek), Gemini 2.5 Flash (Google), and Llama 3.3 70B Instruct (Meta) - on a challenging Linguistics Olympiad-based benchmark by 4.9, 13.1, and 4.9 percentage points, respectively. Nonetheless, even when multiple optimisations are applied, we find that LLMs’ linguistic puzzle performance remains well below comparable mathematical and commonsense benchmarks, and we speculate as to why linguistic reasoning continues to pose a distinctive challenge for even the most capable large language models.
From Sparse to Sense-Grounded: Wikipedia Training for Ukrainian Visual-WSD
Yurii Laba | Rostyslav O. Hryniv
Visual Word Sense Disambiguation (Visual-WSD) requires ranking the correct image for an ambiguous word given a short trigger phrase. For low-resource languages, it is bottlenecked by scarce sense-level benchmarks and limited sense-aligned multimodal supervision. We study Ukrainian and (i) extend the Ukrainian Visual-WSD benchmark from 87 to 381 instances and benchmark multilingual CLIP checkpoints and multimodal large models, and (ii) introduce two scalable Wikipedia-derived dataset construction methods. Using compute-efficient adaptation we fine-tune a multilingual CLIP backbone and show that sense-grounded supervision drives the improvements: combining our two Wikipedia-derived datasets improves HIT@1 from 37.00% to 43.05%.
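As a reading aid for the HIT@1 figures above, the following is a minimal sketch of how such a metric is commonly computed: the share of instances whose top-ranked candidate image is the gold one. The function name `hit_at_1`, the score layout, and the toy numbers are illustrative assumptions, not the authors' evaluation code or data.

```python
from typing import Sequence


def hit_at_1(scores: Sequence[Sequence[float]], gold: Sequence[int]) -> float:
    """Fraction of instances whose highest-scoring candidate is the gold one.

    scores[i][j] is a (hypothetical) model score for candidate image j of
    instance i; gold[i] is the index of the correct image for instance i.
    """
    if not scores or len(scores) != len(gold):
        raise ValueError("scores and gold must be non-empty and equally long")
    hits = 0
    for candidate_scores, gold_index in zip(scores, gold):
        # Index of the top-ranked candidate (first one wins on ties).
        best = max(range(len(candidate_scores)), key=candidate_scores.__getitem__)
        hits += int(best == gold_index)
    return hits / len(scores)


# Toy data, invented for illustration only.
toy_scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.5, 0.4],
    [0.1, 0.2, 0.8],
    [0.6, 0.7, 0.1],
]
toy_gold = [0, 2, 2, 1]
print(hit_at_1(toy_scores, toy_gold))
```

On the toy data three of the four instances rank the gold image first, so the metric is 0.75.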
Multi-domain Dialogue State Tracking (DST) requires discourse coherence that transcends independent slot-filling. Most existing approaches rely on statistical regularities within static schemas, failing to capture the semantic coordination governing simultaneous slot updates. In this paper, we propose Event-DST, which models latent events as cognitive organizing units to dynamically coordinate slot interactions. By projecting dialogue context into a continuous semantic space, our model induces a dynamic structural bias to enforce pragmatic consistency. This structural guidance is integrated via a dual-stream fusion strategy that balances top-down structural constraints with bottom-up textual precision. Experimental results on two benchmarks demonstrate the superiority of our framework, providing an interpretable and parameter-efficient path toward robust dialogue understanding.
Capturing Classic Authorial Style in Long-Form Story Generation with GRPO Fine-Tuning
Jinlong Liu | Mark G. Lee | Mohammed Bahja | Venelin Kovatchev
Evaluating and optimizing authorial style in long-form story generation is challenging because style judgments often rely on subjective human voting, and there is no stable automatic evaluation method. We propose a two-stage pipeline. First, we train a style-similarity judge by fine-tuning a sentence-transformer with authorship-verification supervision, and calibrate its similarity outputs into a bounded [0,1] reward. Second, we use this judge as the primary reward in Group Relative Policy Optimization (GRPO) to fine-tune an 8B story generator for style-conditioned writing, avoiding the accept/reject supervision required by Direct Preference Optimization (DPO). Across four target authors (Mark Twain, Jane Austen, Charles Dickens, Thomas Hardy), the GRPO-trained 8B model achieves higher style scores than open-weight baselines, with an average style score of 0.893 across authors. These results suggest that AV-calibrated reward modeling provides a practical mechanism for controllable long-form style transfer under moderate model size and training budget.
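To make the two stages above more concrete, here is a minimal sketch of the general idea: a raw similarity score is squashed into a bounded reward, and GRPO-style advantages are obtained by standardizing rewards within a group of samples for the same prompt. The logistic calibration, its `scale` and `offset` parameters, the use of the population standard deviation, and the toy similarities are all assumptions made for illustration; the abstract does not specify the authors' calibration or training code.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def calibrate(similarity: float, scale: float = 5.0, offset: float = 0.5) -> float:
    """Hypothetical logistic calibration of a raw similarity into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-scale * (similarity - offset)))


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of completions for the same prompt.

    Each completion's advantage is its reward minus the group mean, divided by
    the group standard deviation (population std here, by assumption).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy judge similarities for four sampled stories (invented numbers).
toy_similarities = [0.2, 0.5, 0.8, 0.6]
toy_rewards = [calibrate(s) for s in toy_similarities]
toy_advantages = group_relative_advantages(toy_rewards)
print(toy_advantages)
```

Because advantages are computed relative to the group, only a scalar reward per sample is needed, which is why this setup does not require the accept/reject pairs mentioned for DPO.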
On the scaling relationship between cloze probabilities and language model next-token prediction
Cassandra L Jacobs | Morgan Grobol
Recent work has shown that larger language models have better predictive power for eye movement and reading time data. However, we know less about how model capacity relates to human production statistics in the cloze task, which are used to predict reading times as well. While even the best models under-allocate probability mass to human responses, larger models assign higher-quality estimates of next tokens and their likelihood of production in cloze data because they are less sensitive to lexical co-occurrence statistics while being better aligned semantically to human cloze responses. The results provide support for the claim that the greater memorization capacity of larger models helps them guess more semantically appropriate words, but makes them less sensitive to low-level information that is relevant for word recognition.
RECAP: Resistance Capture in Text-based Mental Health Counseling with Large Language Models
Anqi Li | Yuqian Chen | Yu Lu | Zhaoming Chen | Yi Zhu | Yuan Xie | Zhenzhong Lan
Recognizing and navigating client resistance is critical for effective mental health counseling, yet its detection remains particularly challenging in text-based interactions. Existing NLP approaches oversimplify resistance categories, ignore the sequential dynamics of therapeutic interventions, and offer limited interpretability. To address these limitations, we propose PsyFIRE, a theoretically grounded framework capturing 13 fine-grained resistance behaviors alongside collaborative interactions. Based on PsyFIRE, we construct the ClientResistance corpus with 23,930 annotated utterances from real-world Chinese text-based counseling, each supported by context-specific rationales. Leveraging this dataset, we develop RECAP, a two-stage framework that detects resistance and fine-grained resistance types with explanations. RECAP achieves 91.25% F1 for distinguishing collaboration from resistance and 66.58% macro-F1 for fine-grained resistance category classification, outperforming leading prompt-based LLM baselines by over 20 points. Expert evaluations confirm that the generated explanations are highly faithful and reliable. Applied to a separate counseling dataset and a pilot study with 62 counselors, RECAP reveals the prevalence of resistance, its negative impact on therapeutic relationships, and its potential to improve counselors’ understanding and intervention strategies.
A framework for analyzing concept representations in neural models
Burin Naowarat | Hao Tang | Sharon Goldwater
Understanding how neural models represent human-interpretable concepts is challenging. Prior work has explored linear concept subspaces from diverse perspectives, such as probing and concept erasure. We introduce a unified framework to study these subspaces along two axes: containment, which tests if a concept is fully represented in a subspace but not outside it, and disentanglement, which tests for isolation from other concepts. In experiments on both text and speech models, we first highlight that concept subspaces may not be uniquely determined, and discuss the implications for concept subspace analysis. Then, we compare properties of concept subspaces estimated using five estimators, proposed in different communities. We find that (1) the choice of estimator impacts the containment and disentanglement properties; (2) the state-of-the-art concept erasure method, LEACE, performs well on both testing axes, but still struggles to generalize to unseen data; and (3) in HuBERT speech representations, phone information is both contained and disentangled from speaker information, while speaker information is hard to contain in a compact subspace, despite being disentangled from phones.
A Dataset for Oral Reading in Young English Readers
Madison Rose | Michael Bennie | Valeria Pagliai | Hatice Kubra Karakis | Qian Shen | Xinyi Tai | Walter L. Leite | Zoey Liu
Among English child speech corpora, very few focus on oral reading. Existing resources such as the CMU Kids Corpus (Ellis Weismer et al., 2013) are limited by a lack of grade-appropriate, curriculum-aligned reading texts, by their annotation scope and quality, and, most crucially, by the absence of a comprehensive annotation scheme for characterizing children’s reading errors. This study presents a multi-layered, fully manually annotated corpus of oral reading from 63 1st-3rd grade students residing in the U.S. who grow up hearing and speaking English. Additionally, we contribute methodologically rigorous annotation guidelines that define 10 reading error categories and 26 sublevel error labels. Using a digital reading platform supported by GPT-4o-mini (OpenAI, 2024), children read stories on topics of their own interest, while the system records their speech and logs their interactions with embedded digital supports. Each recording is paired with detailed demographic and educational metadata and subjected to linguistic annotations, including: (1) sentence- and word-level time alignment; (2) phonemic transcription; (3) reading errors.
From Dependency to CCG to Incremental CCG: Approaches to Flexible Word Order in Turkish
Özge Bakay | Oğuz Kerem Yıldız | Rajesh Bhatt | Brian Dillon | Olcay Taner Yildiz
Combinatory Categorial Grammar (CCG), a lexicalized formalism known for its flexible constituency, is well-suited for modeling head-final languages with flexible word order like Turkish. Building on Kuzgun et al. (2023), we first develop a Turkish CCG lexicon by automatically inducing categories from a dependency treebank. By leveraging standard and extended operations tailored to Turkish syntax, our parser achieves a robust coverage of 92.5%. Furthermore, we introduce the first (partially) incremental, left-to-right CCG parser for Turkish, designed to facilitate the immediate integration of words into the evolving representation. Finally, we present an example experiment showing that CCG parsers can model psycholinguistic evidence for extra processing costs associated with arguments in noncanonical positions, via the frequency of order-reversing operations. These findings provide evidence that CCG offers a cognitively plausible framework for modeling real-time processing in languages like Turkish.
Examining Large Language Models’ form-meaning mappings of information structure constructions in Mandarin Chinese
Shihui Li | Xiaojuan Tan | Jelke Bloem
Construction Grammar (CxG) knowledge in language models has been extensively studied for English, but remains underexplored in other languages. In Mandarin Chinese, the ba (把, disposal) and bei (被, passive) constructions are widely used for managing information structure. They foreground topical elements (information structure) and encode systematic form-meaning mappings (CxG), particularly with respect to the semantic role of the object. We probe language models’ linguistic competence with these constructions using minimal pairs, constructing a new minimal-pair dataset comprising seven paradigms that target both syntactic constraints and verb–construction compatibility. Our results show that it remains a challenge for many models to capture the form-meaning mappings underlying the ba construction, although they achieve high accuracy on paradigms driven by surface syntactic cues.
Mechanistic Interpretability of Animacy Effects on Structure Choice in GPT-2
Yue Li | Yan Cong | Elaine J. Francis
Language models (LMs) exhibit human-like behavior across linguistic tasks, yet behavioral similarity does not establish mechanistic correspondence. Animacy — whether an entity is alive and sentient — is a well-documented semantic feature shaping linguistic behavior in humans. Although LMs show animacy sensitivity behaviorally, the mechanistic basis remains unexplored. In this study, we probe GPT-2 Small’s internal circuitry to test whether animacy representations causally drive syntactic structure choice. Activation patching confirms causality: swapping animacy representations in the model shifts its downstream output. Critically, bidirectional patching reveals that animacy conditions differ in how strongly they commit to a structure: some animacy configurations resist perturbation and exert strong causal influence, while others remain flexible. We identify 22 attention heads mediating these effects, split between passive-promoting and passive-suppressing populations, suggesting GPT-2 Small’s structure choice likely emerges from internal competition between opposing heads. These findings provide mechanistic grounding for animacy effects documented in extensive psycholinguistics research and demonstrate how interpretability methods can enrich and test psycholinguistic theory.
Discovering Lexical Gaps Using Embeddings from Multilingual LLMs
Yoonwon Jung | Aaron S. Cohen | Ben Bergen
Lexical gaps are words that do not exist in certain languages. They pose challenges for building multilingual lexical resources, for machine translation, and for cross-lingual transfer. Existing lexical gap detection relies on human judgments or fixed conceptual taxonomies. We propose a data-driven framework for identifying cross-lingual lexical gaps. We extracted contextualized embeddings from Korean-English bilingual LLMs for Korean-to-English and English-to-Korean translation pairs. Combinations of LLMs, embedding types, dimensionality, and orthogonal transformations across 100 train-test splits yielded 4000 distinct embedding spaces in each source language. In each space, we computed the semantic similarity between each source word and its nearest neighbor in the target language, and compared their distribution for gap words versus non-gap words. In 94% (Korean-to-English) and 97% (English-to-Korean) of embedding spaces, gap words showed weaker cross-lingual semantic alignment than non-gap words. Logistic classifiers trained on unaligned embedding spaces can reliably separate gap words from non-gap words, achieving AUCs of 0.81 (Korean-to-English) and 0.76 (English-to-Korean) and retrieving 18/19 Korean and 26/27 English gap words. This approach provides a language-agnostic and taxonomy-free method for scalable lexical gap identification.
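The alignment measure described above (similarity between a source word and its nearest neighbor in the target language) can be sketched as follows. This is an illustrative sketch only: the toy three-dimensional vectors, the word labels, and the function names are invented here, and the paper's actual embedding extraction and transformations are not reproduced.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def nearest_neighbor_similarity(source_vec, target_vecs):
    """Return (closest target word, its cosine similarity) for one source word."""
    scored = ((word, cosine(source_vec, vec)) for word, vec in target_vecs.items())
    return max(scored, key=lambda pair: pair[1])

# Toy target-language vectors, invented for illustration.
target = {"rice": [0.9, 0.1, 0.0], "love": [0.0, 0.2, 0.9]}
word, sim = nearest_neighbor_similarity([0.8, 0.2, 0.1], target)
print(word, round(sim, 3))
```

Under the abstract's finding, gap words would tend to receive lower nearest-neighbor similarities than non-gap words, so this score could serve as a feature for a downstream classifier.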
Revisiting Age of Acquisition in Curriculum Learning: Disentangling Lexical Features and Semantic Structure
Ian Gifford | Aaron Shah | Catherine Chen | Taimaa Kassab Bachi | Eva Portelance
Previous work has found that ordering training data by children’s Age of Acquisition (AoA) for words increases the stability of distributional word embeddings, suggesting that early-learned words play a privileged role in shaping semantic structure. In this study, we determine whether AoA itself drives these effects, or whether they emerge from correlated lexical factors such as frequency, concreteness, and phonological complexity. Using incremental Word2Vec training, we construct curricula ordered by AoA and by individual lexical features, while systematically controlling for vocabulary growth and deterministic ordering effects. We show that AoA-ordered curricula produce greater early-phase stability than shuffled baselines, even under controlled exposure conditions. We find that the advantage observed with AoA can be largely explained by correlated factors like overall word frequency. Despite limited gains on general similarity benchmarks, AoA-ordered embeddings outperform shuffled embeddings on a proxy domain-specific task: predicting human AoA norms. This advantage persists after debiasing timestamp effects, implying that AoA curricula induce developmentally meaningful semantic structure.
Logical Table-to-Text (LT2T) generation aims to produce natural-language sentences that are logically faithful to structured tabular data. While recent Large Language Models (LLMs) show high performance on aggregate fidelity metrics, these scores provide only a coarse view of performance, obscuring specific logic-type reasoning failures and models’ meta-logical awareness. We propose an operation-aware diagnostic framework that evaluates four core competencies: (1) Logical Form (LF) execution accuracy, (2) fidelity of LF-conditioned generation, (3) logic-type identification, and (4) LF-free generation. We apply this framework to a suite of frontier LLMs and perform fine-grained analysis across logic types such as aggregation, ordinal, and superlative reasoning. Our results show that LT2T fidelity assessment can be unstable; the choice of verifier and logic type can substantially alter conclusions and model rankings. Crucially, we identify a meta-logical gap: models often generate faithful statements while failing to identify the underlying operation.
What Exactly do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies
Zhenghao Zhou | William Dai | Maya Viswanathan | Simon Charlow | R. Thomas McCoy | Robert Frank
Children’s acquisition of filler-gap dependencies has been argued by some to depend on innate grammatical knowledge, while others suggest that the distributional evidence available in child-directed speech suffices. Unfortunately, the relevant input is difficult to quantify at scale with fine granularity, making this question difficult to resolve. We present a system that identifies three core filler-gap constructions in spoken English corpora – matrix wh-questions, embedded wh-questions, and relative clauses – and further identifies the extraction site (i.e., subject vs. object vs. adjunct). Our approach combines constituency and dependency parsing, leveraging their complementary strengths for construction classification and extraction site identification. We validate the system on human-annotated data and find that it scores well across most categories. Applying the system to 57 English CHILDES corpora, we are able to characterize children’s filler-gap input and their filler-gap production trajectories over the course of development, including construction-specific frequencies and extraction-site asymmetries. The resulting fine-grained labels enable future work in both acquisition and computational studies, which we demonstrate with a case study using filtered corpus training with language models.
An Information-Theoretic Study of RLHF-Induced Uniformity in Large Language Model Outputs
Nolan Chai | Tianqi Zhang | Alex Warstadt
Reinforcement Learning with Human Feedback (RLHF) is a common post-training procedure to align the outputs of Large Language Models (LLMs) with human preferences. As a result, one might expect RLHF to induce some elements of human-like audience design into LLMs. However, RLHF and other post-training alignment methods have many complex effects on the outputs of LLMs that have yet to be studied quantitatively. We apply an information-theoretic lens to investigate the changes in the "naturalness" of language and the presence of audience design in LLMs before and after post-training. The Uniform Information Density (UID) Hypothesis posits that humans optimize language production and comprehension across a noisy channel by transferring information at a more uniform rate. Accordingly, we analyze and compare how information is distributed within model- and human-generated text from different domains. We find that pretrained and post-trained LLMs both show superhuman uniformity across various text domains, and both RLHF and other post-training methods reduce uniformity slightly from their pretrained counterparts. However, RLHF uniquely encourages lower variance in uniformity between documents, potentially demonstrating that training on human preferences encourages consistency in information flow.
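One common way to operationalize uniformity of information is the variance of per-token surprisal, where lower variance means a more uniform rate. The sketch below assumes that operationalization; the abstract does not state which UID measure the authors use, and the function names and probabilities are hypothetical.

```python
import math

def surprisals(token_probs):
    """Per-token surprisal in bits: -log2 p(token | context)."""
    return [-math.log2(p) for p in token_probs]

def uid_variance(token_probs):
    """Population variance of surprisal; 0.0 means perfectly uniform information."""
    s = surprisals(token_probs)
    mu = sum(s) / len(s)
    return sum((x - mu) ** 2 for x in s) / len(s)

# Hypothetical next-token probabilities assigned to two short texts.
uniform_text = [0.25, 0.25, 0.25, 0.25]
bursty_text = [0.5, 0.125]
print(uid_variance(uniform_text), uid_variance(bursty_text))
```

Comparing this statistic within a document gives a per-document uniformity score; the between-document variance mentioned in the abstract would then be the spread of those scores across a corpus.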
Bridging Linguistic Structure and Mechanistic Interpretability for Conceptual Interpretation in Language Models
Nura Aljaafari | Danilo Carvalho | Andre Freitas
Understanding how language models compose meaning from linguistic input remains a central problem in interpretability research. Mechanistic studies have attributed functional roles to core transformer components; however, these findings derive largely from factual retrieval settings. Whether the same mechanisms support conceptual interpretation, the compositional mapping from definitional expressions to abstract meaning, remains insufficiently characterised. We introduce DSRA (Definitional Semantic Role Analysis), a methodology that applies causal tracing within the reverse dictionary task and augments restoration traces with definitional semantic roles (DSRs) grounded in Argument Structure Theory. This linguistic overlay identifies which compositional functions (e.g., genus, differentia quality) are associated with high-recovery states, extending activation patching beyond token-level localisation. Applied to GPT-J-6B (English) and BERTIN GPT-J-6B (Spanish), the results show that MLP layers associate content-bearing tokens with high-specificity DSR categories in early layers, MHA layers distribute integration across middle-to-upper layers with concentration at the final token, and hidden states aggregate information in upper layers. Alignment between restored states and DSR categories indicates systematic correspondence between internal activations and definitional structure, with consistent localisation patterns across both languages.
Traces of Social Competence in Large Language Models
Tom Kouwenhoven | Michiel T. van der Meer | Max J. van Duijn
The False Belief Test (FBT) has been the main method for assessing Theory of Mind (ToM) and related socio-cognitive competencies. For Large Language Models (LLMs), the reliability and explanatory potential of this test have remained limited due to issues like data contamination, insufficient model details, and inconsistent controls. We address these issues by testing 17 open-weight models on a balanced set of 192 FBT variants (Trott et al., 2023) using Bayesian logistic regression to identify how model size and post-training affect socio-cognitive competence. We find that scaling model size benefits performance, but not strictly. A cross-over effect reveals that explicating propositional attitudes (X *thinks*) fundamentally alters response patterns. Instruction tuning partially mitigates this effect, but further reasoning-oriented fine-tuning amplifies it. In a case study analysing social reasoning ability throughout OLMo 2 training, we show that this cross-over effect emerges during pre-training, suggesting that models acquire stereotypical response patterns tied to mental-state vocabulary that can outweigh other scenario semantics. Finally, vector steering allows us to isolate a *think* vector as the causal driver of observed FBT behaviour.
Metric Grammars: A usage-based grammatical formalism that supports generation, parsing and morphological innovation
Whitney Tabor | Hyosun Lee
Grammatical theories which specify grammars by means of symbolic well-formedness constraints (e.g., Context Free Grammars, HPSG, LFG, Minimalism, Dependency Grammars, etc.) are ill-suited to model the (semantically and statistically) gradual character of grammatical change as it manifests in successive historical corpora. Grammatical theories which claim that the language system is subject to change based on what speakers do in life (i.e., usage-based accounts) are better-suited to handle such phenomena. Nevertheless, current usage-based theories (e.g., Cognitive Grammar, Construction Grammar) lack a clearly formalized model that specifies how usage can affect the grammatical system. In this paper, we describe Stretched Tree Metric Grammars (STMGs), a new formal model of syntax and semantics that exhibits usage-based effects. We show that the model can generate and parse simple sentences. Then we show how it supports morphological innovation in appropriately limited circumstances. We conclude by noting that STMGs are closely related to Large Language Models (LLMs), but they have the benefit of being more analytically interpretable.
Automated Refinement of Essay Scoring Rubrics for Language Models via Reflect-and-Revise
Keno Harada | Lui Yoshida | Takeshi Kojima | Yusuke Iwasawa | Yutaka Matsuo
Large Language Models (LLMs) are increasingly used for Automated Essay Scoring (AES), yet the scoring rubrics they rely on are typically designed for human raters and may not be optimal for LLMs. Inspired by the calibration process that human raters undergo before formal scoring, we propose Reflect-and-Revise, an iterative framework that refines scoring rubrics by prompting models to reflect on their own chain-of-thought rationales and score discrepancies with human labels. At each iteration, the model identifies scoring-error patterns from sampled mismatches and revises the rubric accordingly. Experiments on three essay scoring benchmarks (ASAP, ASAP 2.0, and TOEFL11) with three LLMs (GPT-5 mini, Gemini 3 Flash, and Qwen3-Next-80B-A3B-Instruct) demonstrate that our method yields improvements in Quadratic Weighted Kappa (QWK), achieving gains of up to +0.403 over human-authored rubrics. Starting from a minimal seed rubric that specifies only the score scale, our method matches or exceeds expert rubric performance in most dataset-model combinations, indicating that iterative refinement can reduce the manual effort of rubric authoring. Analysis of the refined rubrics reveals that the refinement process introduces explicit procedural structures, such as conditional gating rules and quantitative thresholds, that are absent from human-authored rubrics, highlighting a gap between rubrics designed for human raters and those effective for LLMs.
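For readers unfamiliar with the metric, Quadratic Weighted Kappa can be computed from the confusion matrix of human and model scores as sketched below. This is a plain-Python illustration of the standard definition, not the authors' evaluation code; the function name and example scores are assumptions.

```python
def quadratic_weighted_kappa(human, model, min_score, max_score):
    """QWK = 1 - sum(w * observed) / sum(w * expected), with w = (i - j)^2 / (n - 1)^2.

    Assumes integer scores, at least two score levels, and that the raters
    do not both give one identical constant score (which makes the denominator zero).
    """
    n = max_score - min_score + 1
    observed = [[0.0] * n for _ in range(n)]
    for h, m in zip(human, model):
        observed[h - min_score][m - min_score] += 1
    total = len(human)
    hist_human = [sum(row) for row in observed]
    hist_model = [sum(observed[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i in range(n):
        for j in range(n):
            weight = (i - j) ** 2 / (n - 1) ** 2
            num += weight * observed[i][j]
            # Expected count under independence of the two raters' marginals.
            den += weight * hist_human[i] * hist_model[j] / total
    return 1.0 - num / den

# Hypothetical essay scores on a 1-4 scale.
human_scores = [1, 2, 3, 4]
print(quadratic_weighted_kappa(human_scores, [1, 2, 3, 4], 1, 4))
print(quadratic_weighted_kappa(human_scores, [4, 3, 2, 1], 1, 4))
```

The quadratic weights penalize large score disagreements more than adjacent ones, which is why the metric suits ordinal essay scores.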
Proceedings of the Second Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)
Sheshera Mysore | Sachin Kumar | Vidhisha Balachandran | Shirley Anugrah Hayati | Faeze Brahman | Hanane Nour Moussa | Alireza Salemi
As AI-generated text detectors gain adoption in educational and professional contexts, their fairness remains underexamined. While prior research has uncovered isolated cases of bias, particularly against English Language Learners (ELLs), there is a lack of systematic evaluation of such systems across broader sociolinguistic factors. In this work, we propose a comprehensive evaluation framework for AI detectors across various types of biases. As part of this framework, we introduce a suite of targeted datasets spanning 7 major categories: demographics, age, educational grade level, dialect, formality, political leaning, and topic. Using this suite, we evaluate four open-source state-of-the-art AI text detectors and find consistent disparities in detection performance, particularly low recall rates for texts from underrepresented groups. Our contributions provide a scalable, transparent approach for auditing AI detectors and emphasize the need for bias-aware evaluation before these tools are deployed for public use.
Small Language Models for the Democratization of Financial Literacy: Challenges and Opportunities
Tagore Rao Kosireddy | Jeffrey David Wall | Evan Lucas
This study seeks to test whether low-cost inference and efficient Small Language Models (SLMs) fine-tuned on existing open-source question answering datasets are capable of creating financial literacy chat bots that can answer financial questions for those with limited financial knowledge. The use of SLMs is growing in popularity across many domains, but SLMs are not thoroughly explored in the finance sector. This study offers an exploration of challenges and opportunities that exist in the finance sector to utilize SLMs for open-source financial question answering applications. In particular, this study examines the outputs of several open-source SLMs fine-tuned on the open-source FinGPT FiQA_QA financial question answering dataset. We fine-tuned two versions of each model, one with an instruction prompt and one without an instruction prompt and compared the model outputs with ground truth human responses from the dataset. Further qualitative rating and analysis are provided for model outputs and the dataset. The exploration highlighted challenges with available open data and the fine-tuned SLMs. Existing open data sets in the financial AI research community are not sufficient to produce high-quality outputs with SLMs. Successful fine-tuning of SLMs has occurred in other domains with high quality data sets. We thus issue a call for new and better open financial question answering datasets that could result in higher-quality small language models.
From Understanding to Engagement: Personalized pharmacy Video Clips via Vision Language Models (VLMs)
Suyash Mishra | Qiang Li | Anubhav Girdhar | Srikanth Patil
Vision Language Models (VLMs) are poised to revolutionize the digital transformation of the pharmaceutical industry by enabling intelligent, scalable, and automated multi-modality content processing. Traditional manual annotation of heterogeneous data modalities (text, images, video, audio, and web links) is prone to inconsistencies, quality degradation, and inefficiencies in content utilization. The sheer volume of long video and audio data (e.g., long clinical trial interviews and educational seminars) further exacerbates these challenges. Here, we introduce a domain-adapted Video-to-Video Clip Generation framework that integrates Audio-Language Models (ALMs) and Vision Language Models (VLMs) to produce highlight clips. Our contributions are threefold: (i) a reproducible Cut Merge algorithm with fade-in/out and timestamp normalization, ensuring smooth transitions and audio/visual alignment; (ii) a personalization mechanism based on role definition and prompt injection for tailored outputs (marketing, training, regulatory); (iii) a cost-efficient e2e pipeline strategy balancing ALM/VLM-enhanced processing. Evaluations on the Video-MME benchmark (900) and our proprietary dataset of 16,159 pharmacy videos across 14 disease areas demonstrate 3–4× speedup, 4× cost reduction, and competitive clip quality. Beyond efficiency gains, we also report that our method improves clip coherence scores (0.348) and informativeness scores (0.721) over state-of-the-art VLM baselines (e.g., Gemini 2.5 Pro), highlighting the potential of transparent, custom extractive, and compliance-supporting video summarization for life sciences. Demo: https://video-clips-highlight-generator-338849523617.us-west1.run.app/.
Evaluating Customized vs. Generalist Transformer-based Models for Legal Contract Classification
Amrita Singh | H. Suhan Karaca | Aditya Joshi | Hye-young Paik | Jiaojiao Jiang
Despite advances in legal NLP, no comprehensive evaluation of Transformer-based models customized for legal tasks (referred to as ’legal-specific’ models in this paper) exists for contract classification tasks. To address this gap, we present an evaluation of 13 legal-specific transformer-based models on 3 English-language contract classification tasks and compare them with 9 generalist models. The results show that legal-specific models consistently outperform generalist models, especially on tasks requiring nuanced legal understanding. They also help reduce misclassification of rare classes in imbalanced datasets. Legal-BERT and Contracts-BERT establish new SOTAs on two of the three tasks, despite having 69% fewer parameters than the best-performing generalist models. We also identify CaseLaw-BERT and LexLM as strong additional baselines for contract classification. Our results highlight the shortcomings of generalist models, emphasizing the need for domain-specific customization, particularly in the context of legal applications.
Personalizing News Headlines with Retrieval-Augmented Generation
Jiajing Wan | Samia Touileb | Lubos Steskal | Lilja Øvrelid
We focus on personalized news headline generation, where we aim to improve headline generation by extending the generation context to incorporate the news reading history of users. In particular, we study a RAG-LLM-based system that customizes news headlines with user histories to improve news headline personalization. Our experiments show that our approach not only produces better headlines for specific users, but also makes the generated headlines closer to the original headlines. We experiment with different retrievers and analyze the generated outputs through systematic comparisons with both original and rewritten headlines. These analyses provide insights into the role of retrieval and personalization in headline generation, highlighting how the user history contributes to meaningful improvement while remaining aligned with original headlines.
Building Multi-turn Intent Classification with LLM-based Labeling
Biancen Xie | Kaiqi Bian | Jai Ranjan Singh Gusain | Manikandarajan Ramanathan | Raj Maragoud
Intent classification is essential for customer service routing, connecting customers to the appropriate agents and reducing handling time and operational cost. Developing a real-world multi-turn intent classification system is challenging due to complex intent taxonomies, dynamic intent switching within conversations, and limited labeled training data. To address these challenges, we propose a scalable multi-turn intent classification framework for e-commerce customer service that models intent along multiple dimensions. We introduce LLM-based labeling strategies to annotate real customer transcripts at scale and augment training with LLM-simulated multi-turn dialogues that expand coverage of topic and intent switches, which are rare in existing transcripts. Through extensive experiments, we find that explanation-guided labeling with a self-critique step produces the most accurate training labels. Fine-tuned models built on a RoBERTa backbone outperform zero-shot LLM prompting while achieving substantially lower inference latency. Finally, we show that a hybrid approach that combines the fine-tuned classifier with LLM prompting further improves accuracy over either component alone. Overall, our results provide practical guidance for building and deploying high-accuracy, low-latency, large-scale multi-turn intent classification systems.
Cross-Tokenizer LLM Distillation through a Byte-Level Interface
Avyav Kumar Singh | Yen-Chen Wu | Alexandru Cioba | Alberto Bernacchia | Davide Buffelli
Cross-tokenizer distillation (CTD), the transfer of knowledge from a teacher to a student language model when the two use different tokenizers, remains a largely unsolved problem. Existing approaches rely on heuristic strategies to align mismatched vocabularies, introducing considerable complexity. In this paper, we propose a simple but effective baseline called Byte-Level Distillation (BLD) which enables CTD by operating at a common interface across tokenizers: the byte level. In more detail, we convert the teacher’s output distribution to byte-level probabilities, attach a lightweight byte-level decoder head to the student, and distill through this shared byte-level interface. Despite its simplicity, BLD performs competitively with, and on several benchmarks surpasses, significantly more sophisticated CTD methods, across a range of distillation tasks with models from 1B to 8B parameters. Our results suggest that the byte level is a natural common ground for cross-tokenizer knowledge transfer, while also highlighting that consistent improvements across all tasks and benchmarks remain elusive, underscoring that CTD is still an open problem.
Fine-grained Readability Controlled Summarization of Scientific Documents via Control Vectors
Isabel Cachola | Kuleen Sasse | Mark Dredze
Plain Language Summarization (PLS) generates summaries of technical documents accessible to non-expert audiences. Readability – commonly used to evaluate PLS – has often been treated coarsely (expert vs. lay) although it exists on a spectrum with different levels for different readers. We propose a lightweight control vector method for fine-grained readability control in scientific summarization along with a requirements-based framework for data selection. Our framework enforces: (1) readability levels differ substantially, and (2) paired examples share comparable content. Under this, control vectors enable more precise readability control than other popular methods.
Building a Custom Taxonomy of AI Skills and Tasks from the Ground Up with Job Postings
Stephen Meisenbacher | Peter Norlander
Utilizing LLMs for automated taxonomy construction presents a clear opportunity for the comprehensive, yet efficient mapping of potentially complex domains. When contending with high volumes of rapidly growing corpora, however, it becomes unclear how to best leverage such data for optimal taxonomy construction. Taking the case of systematizing *AI skills in the workplace*, we use two large-scale job postings corpora to investigate key design decisions for the inclusion (or exclusion) of data points for taxonomy construction. We propose **TaxonomyBuilder** as a blueprint for our systematic study, with which we evaluate various configurations of custom, data-informed, and hierarchical taxonomies. We demonstrate that *less* data can provide more clarity: filtering inputs to **TaxonomyBuilder** provides better domain-specific coverage than offering unfiltered inputs to clustering and LLM-enhanced hierarchical taxonomy labeling tools.
Using Topological Data Analysis to Characterize the Layers of Language Models Before and After Word Substitution Attacks
Adam Tang | Catherine Liu | Kimberly Lopez | Shreya Subramanian | Leif Zinn-Brooks | Alexia E. Schulz | Adaku Uchendu
Large language models are known to be vulnerable to adversarial perturbations such as synonym-based word substitutions. However, previous analyses of adversarial influence focus only on output behavior and provide limited insight into the propagation of substitution-based input perturbations through internal representations. In this work, we introduce a topological data analysis (TDA) framework to study the structural effects of adversarial attacks on attention maps across model layers. We evaluate small encoder-based architectures (BERT, RoBERTa, DistilBERT) fine-tuned to solve binary classification on the IMDb review dataset, which were attacked using TextFooler. We convert attention maps into distance matrices and apply TDA to extract topological features, which we then compare using Wasserstein distances between original and perturbed features. In parallel, we compute a non-TDA baseline on attention maps using per-head L1 distances between original and perturbed attentions. In addition, we analyze these models on a layer-by-layer basis. We find that adversarial perturbations induce systematic and statistically significant topological changes across layers, with the largest deviations occurring in late layers and smaller but notable effects in early layers. These patterns are consistent across models and are validated using both non-parametric (Kruskal–Wallis, Dunn) and parametric (one-way ANOVA, Tukey) tests on log-transformed Wasserstein distances. Compared to our non-TDA baseline, our TDA-based analysis shows more distinct layer-wise separation and provides a robust and interpretable framework for evaluating how adversarial perturbations alter internal model structure. Our code is publicly available at: https://github.com/angelinatsai04/mitll_clinic/tree/adam_spring.
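The non-TDA baseline mentioned above, a per-head L1 distance between original and perturbed attention maps, can be sketched as follows. This is a minimal illustration that assumes both inputs cover the same number of tokens and are stored as nested Python lists; the function name and input layout are assumptions, not details taken from the authors' repository.

```python
def per_head_l1(original, perturbed):
    """Return one L1 distance per attention head.

    original, perturbed: lists of heads, where each head is a square
    list-of-lists attention map of identical shape (assumed layout).
    """
    distances = []
    for head_a, head_b in zip(original, perturbed):
        # Sum of absolute element-wise differences for this head.
        distance = sum(
            abs(a - b)
            for row_a, row_b in zip(head_a, head_b)
            for a, b in zip(row_a, row_b)
        )
        distances.append(distance)
    return distances


# One layer with a single 2x2 head, before and after a word substitution.
clean = [[[0.5, 0.5], [0.2, 0.8]]]
attacked = [[[0.7, 0.3], [0.2, 0.8]]]
layer_distances = per_head_l1(clean, attacked)
```

Repeating the computation for each layer gives a per-layer profile that can be set beside the Wasserstein distances from the topological features.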
Customizing ASR for Language Documentation and Resource Prioritization
Alexandra Fort | Shobhana Lakshmi Chelliah
Research in language documentation has the potential to benefit from integration of ASR models, especially through the assisted transcription of recordings with audio. Recent advancements in ASR for low-resource languages demonstrate the ability to adapt general, multilingual models for unseen languages with limited fine-tuning data, supporting the creation of custom ASR models. However, resources are still required to collect and prepare the fine-tuning data, necessitating exploration of optimization of resource allocation within the process of data collection and preparation. This paper outlines important considerations for the collection and preparation of data for customizing an ASR model for use in language documentation projects. With the development of a Lamkang ASR model as an example, prioritization of tasks within a language documentation project is outlined by analyzing the relative impact of time spent on transcription correction versus time spent on manual alignment on ASR model performance. Results from this research suggest prioritizing transcription correction over manual alignment of data and suggest that fine-tuning multilingual ASR systems produces superior results to zero-shot ASR models, despite recent advancements in the technology.
Improving Medical Hallucination Detection with System Combination and Rule-based Customization
Jonathan Lasko | Damianos Karakos | Francis Keith
The presence of factuality errors (hallucinations) in the outputs of patient-facing medical chatbots is a serious problem: they can lead to patient harm and erode people’s trust in the medical profession. For this reason, it is crucial to detect hallucinations in chatbot outputs and forward them to clinicians for review. In this paper, we present the system we built for detecting such errors: it consists of multiple LLM-powered detectors which are combined with a novel alignment procedure. We ran our system on the MedExpert-Benchmark dataset (Yarmohammadi et al., 2025) and our results on two use cases, Mental Health and Prenatal Care, show that the combined system improves over the individual systems. Additionally, we show that further customization of the system to each one of the use cases leads to further gains, but at the cost of reduced generalizability. Our code and dataset are available here: https://github.com/BBN-E/medic-customnlp4u.
Asking the Right Questions: Can expert-prompted LLMs reformulate legal queries from non-experts?
Katherine Atwell | Morgan A. Gray | Jaromir Savelka | Len Rial | Sera Linardi | Malihe Alikhani
Large language models are widely used by everyday users, and can be asked to perform tasks that require specialized expertise, such as interpreting contractual terms and conditions, filing personal taxes, or diagnosing medical symptoms. Although these tools should not be used in place of professional advice, they can be useful starting points for users seeking professional help, improving users’ access to and interactions with professionals. In this vein, this paper introduces a legal question reformulation task to assist non-experts in their interactions with lawyers. This has the potential to streamline discussions between lawyers and clients, who may not know the correct legal language to communicate their needs. Using a novel evaluation framework informed by legal expertise, we investigate the quality of model-generated legal question reformulations on in-the-wild data from non-experts seeking legal advice. Our findings indicate that LLMs have significant potential in legal reasoning, but some unexpected safety concerns may emerge. Further, adding linguistically aligned in-domain text samples can improve performance for smaller models, even when the samples are not aligned factually with the given question.
When Valid Signals Fail: Regime Boundaries Between LLM Features and RL Trading Policies
Zhengzhe Yang
Can large language models (LLMs) generate continuous numerical features that improve reinforcement learning (RL) trading agents? We build a modular pipeline where a frozen LLM serves as a stateless feature extractor, transforming unstructured daily news and filings into a fixed-dimensional vector consumed by a downstream PPO agent. We introduce an automated prompt-optimization loop that treats the extraction prompt as a discrete hyperparameter and tunes it directly against the Information Coefficient—the Spearman rank correlation between predicted and realized returns—rather than NLP losses. The optimized prompt discovers genuinely predictive features (IC above ∼0.15 on held-out data). However, these valid intermediate representations do not automatically translate into downstream task performance: during a distribution shift caused by a macroeconomic shock, LLM-derived features add noise, and the augmented agent under-performs a price-only baseline. In a calmer test regime the agent recovers, yet macroeconomic state variables remain the most robust driver of policy improvement. Our findings highlight a gap between feature-level validity and policy-level robustness that parallels known challenges in transfer learning under distribution shift.
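The Information Coefficient described above is a Spearman rank correlation between predicted and realized returns, and a small dependency-free sketch makes the computation concrete. The helper names are hypothetical, ties receive average ranks, and the sample values are illustrative rather than data from the paper.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def information_coefficient(predicted, realized):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rank_p, rank_r = average_ranks(predicted), average_ranks(realized)
    n = len(rank_p)
    mean_p, mean_r = sum(rank_p) / n, sum(rank_r) / n
    cov = sum((a - mean_p) * (b - mean_r) for a, b in zip(rank_p, rank_r))
    var_p = sum((a - mean_p) ** 2 for a in rank_p)
    var_r = sum((b - mean_r) ** 2 for b in rank_r)
    return cov / (var_p * var_r) ** 0.5


# Illustrative scores for four assets and their subsequent returns.
ic = information_coefficient([0.2, -0.1, 0.4, 0.0], [0.01, -0.02, 0.03, 0.02])
```

A prompt-optimization loop of the kind the abstract describes would score each candidate prompt by a value like `ic` on held-out data.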
Modern conversational AI systems frequently rely on user metadata to localize responses, yet the unintended regional biases introduced by this hidden context remain poorly understood. In this work, we evaluate _location leakage_: the phenomenon where a model generates geographic references despite receiving a geographically neutral user prompt. Across both creative writing and open-ended Q&A prompts, even state-of-the-art LLMs systematically favor region-specific outputs when exposed to location metadata, with leakage spiking by up to 793 times above baseline (e.g., from 0.04% to 31.7% for Llama 3.1-8B, and 21.3% and 8.8% for Qwen3-8B and Claude Sonnet 4.6, respectively). Our analysis further shows a novel structural conditioning effect: replacing the injected location with the placeholder "Unknown" still elevates leakage by up to 72 times above baseline, demonstrating that the user profile frame itself, independent of any geographic content, acts as a generative conditioning signal.
Efficiency vs. Verifiability in Evidence-Aware RAG: Does Prompt Compression Preserve Citation Grounding?
Aiyu Li | Qian Peng | Bin Chen
Retrieval-augmented generation (RAG) is widely used in domain-specific and knowledge-intensive applications, where long prompts increase inference cost and may exceed context limits. Prompt compression is therefore appealing, but existing evaluations focus primarily on answer quality, overlooking whether compressed systems remain faithful to the retrieved evidence. In this paper, we ask: does compression that preserves answers also preserve grounding? Using Self-RAG and LLMLingua-2 in a controlled setting, we evaluate compressed RAG on ASQA in terms of both answer correctness and citation grounding. Under increasing compression, answer correctness drops by only 2-4%, whereas grounding drops by 40-50%. This stark divergence shows that answer-only evaluation can substantially overestimate the reliability of compressed RAG in evidence-aware scenarios. We further propose a lightweight hierarchical compression strategy that prioritizes evidence-bearing spans. It recovers nearly all grounding loss while maintaining comparable answer quality. Our results reveal a clear trade-off between efficiency and verifiability, and suggest that compression in RAG should be customized to downstream verification needs rather than treated as a one-size-fits-all efficiency intervention.
When Gradients Collide: Failure Modes of Multi-Objective Prompt Optimization for LLM Judges
Parth Darshan | Abhishek Divekar
Customizing an LLM judge to a specific task or domain often involves optimizing its prompt across multiple evaluation criteria simultaneously. Textual gradient methods automate this for a single judge criterion; however, they produce natural-language critiques, not numerical vectors. Thus, the conflict-resolution toolkit of multi-task learning (PCGrad, MGDA) doesn’t apply to the multi-objective textual gradient setting. We test five decomposition modes of textual gradient optimizers by varying how much cross-task information the loss, gradient and optimizer LLMs share. In 6 of 10 configurations on SummEval, we observe that optimization never improves over the initial prompt. Gradient specificity drops by 59% (from 9.0 to 3.7) when the gradient LLM processes multiple criteria jointly. Separately, we observe that naively combining per-task instructions into a single prompt degrades Spearman’s ρ by 5.3%. These results identify two separable failure modes: optimization-time gradient dilution and inference-time instruction interference, which together constrain the design space for multi-objective judge customization using textual feedback.
Proceedings of the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
Bharathi Raja Chakravarthi | Ruba Priyadharshini | Anand Kumar Madasamy | Sajeetha Thavareesan | Saranya Rajiakodi | Subalalitha Navaneethakrishnan | Dhivya Chinnappa | Balasubramanian Palani | Malliga Subramanian | Kogilavani Shanmugavadivel | Ratnavel Rajalakshmi
Abusive Content Detection in Telugu-English Code-Mixed Social Media Using Hybrid Transformer Architectures
Bojja Revanth Reddy | Sivaiah Bellamkonda
The rapid growth of social media platforms has led to a substantial increase in user-generated content, including abusive and offensive language. Detecting abusive content becomes particularly challenging in low-resource and code-mixed language settings such as Telugu-English social media text. Code-mixed content involves transliteration, inconsistent spelling variations, informal expressions, and frequent language switching within a single sentence. This paper focuses on detecting abusive content in Telugu-English code-mixed comments using both traditional machine learning and transformer-based deep learning models. The proposed approach incorporates preprocessing strategies to normalize transliterations and spelling variations, hybrid feature extraction techniques combining TF-IDF and FastText embeddings, and fine-tuning of multilingual transformer models. The study addresses challenges such as morphological complexity, contextual ambiguity, and limited annotated data in low-resource NLP environments.
Beyond Benchmark Accuracy: Robustness Evaluation of Hinglish Sentiment Models
Chennuru Rahul | Kolawole Adebayo
Multilingual transformers have achieved remarkable performance on code-mixed sentiment benchmarks, but their robustness under linguistic stress and domain shift remains underexplored. We fine-tune XLM-RoBERTa and mBERT on a carefully cleaned 25,543-tweet Hinglish sentiment dataset, where XLM-R achieves near-perfect in-distribution accuracy (99.7%). The integrity of this result is confirmed by rigorous hash-based and 3-gram Jaccard deduplication, ruling out data leakage. However, when evaluated on a 400-example human-validated adversarial benchmark spanning negation, sarcasm, contrast, subtle sentiment, and true neutral, XLM-R performance collapses to 42.5% – a drop of over 57 percentage points. Zero-shot transfer to English TweetEval yields only 50.8% accuracy (40.8% macro F1), above . Our results highlight a critical gap between benchmark scores and real-world reliability, underscoring the need for adversarial evaluation and cross-domain stress-testing before deploying sentiment models in practical, safety-sensitive applications.
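The hash-based and 3-gram Jaccard deduplication mentioned above might look roughly like the sketch below. It assumes word-level 3-grams, lowercasing plus whitespace normalisation, and a similarity threshold of 0.8; none of these choices is stated in the abstract, and the function names and sample strings are hypothetical.

```python
import hashlib


def normalise(text):
    """Lowercase and collapse whitespace (an assumed normalisation)."""
    return " ".join(text.lower().split())


def word_trigrams(text):
    tokens = normalise(text).split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def jaccard(a, b):
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def deduplicate(texts, threshold=0.8):
    """Drop exact duplicates by hash, then near-duplicates by 3-gram Jaccard."""
    seen_hashes, kept, kept_grams = set(), [], []
    for text in texts:
        digest = hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalisation
        grams = word_trigrams(text)
        if any(jaccard(grams, other) >= threshold for other in kept_grams):
            continue  # near-duplicate of an already kept text
        seen_hashes.add(digest)
        kept.append(text)
        kept_grams.append(grams)
    return kept


tweets = [
    "yeh movie bahut achhi thi yaar",
    "Yeh movie  bahut achhi thi yaar",
    "yeh movie bahut achhi thi yaar sach",
    "kal match dekhne chalo",
]
unique_tweets = deduplicate(tweets)
```

The pairwise comparison is quadratic in the number of kept texts, which is acceptable for a sketch; a corpus of tens of thousands of tweets would more likely use an index such as MinHash.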
Cascaded Modular or End-to-End? : An Investigation on Speech-to-Speech Translation Task for Dravidian Languages
Bhavana Nali | Abhik Jana
This paper presents a study of speech-to-speech translation for low-resource Dravidian languages, focusing on Tamil, Telugu, and Kannada. We compare the efficacy of a Cascaded Modular system with that of an End-to-end system in both zero-shot and fine-tuned settings. The Cascaded Modular approach combines an ASR module (Whisper-based ASR for English speech; IndicConformer for Dravidian speech), a text-to-text translation module (IndicTrans2), and a speech synthesis module (Indic Parler-TTS), whereas SeamlessM4T is used as the End-to-end system. We use FLEURS and Mann-ki-Baat (a subset of the BhasaAnuvaad dataset) for parameter-efficient Low-Rank Adaptation (LoRA) fine-tuning, which adapts the translation component to a domain-specific dataset. Cascaded Modular systems achieve BLEU scores ranging from 3.17 to 19.18 in the zero-shot setting and 5.08 to 19.18 after fine-tuning, whereas the End-to-end model ranges from 3.02 to 15.72 in zero-shot settings across languages and 4.11 to 16.84 after fine-tuning. The results show that Cascaded Modular systems consistently outperform the End-to-end model across both setups. Parameter-efficient fine-tuning also yields significant improvements in translation quality and speech generation performance for low-resource Dravidian speech translation.
FLAICOL: Flip-Point-Led Augmentation for Imbalanced Code-Mixed Offensive Language Detection
Danish Mohammed | Vidhya Kamakshi
Hate speech detection in low-resource, code-mixed languages is a challenging task as people often switch between scripts and languages in a single post. Hate speech in code-mixed text can take the form of explicit slurs, subtle insults, or fragmented abuse, and is often hidden by spelling variants and Romanized script. These datasets are also subject to class imbalance, with hate speech being the minority class of interest. To mitigate the imbalance, targeted data augmentation of minority class samples can help learn better representations to aid hate speech detection despite the naturally expected imbalance. We propose FLAICOL, a flip-point method which identifies the minimal embedding perturbation that moves an input across the decision boundary, maps it back to discrete text, and retrains on those focused examples. Empirical results show that these interpretable augmentations strengthen Transformer classifiers on low-resource, code-mixed hate speech datasets (experiments were conducted on the Tamil-English, Malayalam-English, and Kannada-English splits in the Dravidian CodeMix Benchmark).
LIMP: Linguistically-Informed Multi-Strategy Prompting for Telugu Multi-Turn Dialogue Generation
Arjungopal Anilkumar | Suryansh Ram Menon | Divagar S | Premjith B
Generating contextually coherent multi-turn dialogue in Telugu requires resolving three deeply interacting constraints absent from generic LLM prompting: morphologically encoded social hierarchy (honorific verb conjugations), strict SOV agglutinative syntax, and culturally governed emotional logic formalised in Natyashastra rasa theory (Bharata Muni, 1951). We introduce LIMP (Linguistically-Informed Multi-Strategy Prompting), an inference-time, training-free framework that injects expert linguistic and cultural knowledge into prompt structure, requiring no fine-tuning or labelled data. We empirically evaluate two strategies on 10,000 stratified evaluation instances from the IndicDialogue Telugu corpus (Arnob et al., 2024): LIMP-RAW, a dense constraint prompt, and LIMP-COT, a six-stage analytical scaffold grounded in rasa theory and Telugu morphological grammar. Our primary finding is that LIMP-COT achieves approximately 2× higher morphosyntactic surface fidelity than LIMP-RAW on GEMMA-3-1B-IT (Gemma Team, Google DeepMind, 2025) (1B parameters): Jaccard = 0.0436 vs. 0.0211, Dice = 0.0792 vs. 0.0411 (p < 0.001, Cohen’s d = 0.57), demonstrating that sequential analytical commitment to linguistic constraints produces more form-faithful Telugu than holistic constraint injection. Concurrently, LIMP-RAW achieves near-ceiling semantic fidelity (BERTSCORE F1 = 0.9709), exceeding both LIMP-COT (0.9637) and SARVAM-1 (Sarvam AI, 2024) (2B, Indic-pretrained; 0.9680) on this dimension. This semantic–lexical dissociation (no single configuration dominates across both metric classes) is itself a substantive finding: in agglutinative Telugu, semantic paraphrase fidelity and morphosyntactic surface fidelity are orthogonal evaluation dimensions. On lexical metrics specifically, LIMP-COT with a 1B general-purpose model surpasses SARVAM-1 under matched prompting (Jaccard = 0.0436 vs. 0.0052), suggesting that structured linguistic scaffolding is a stronger lever than parametric scale for form-faithful generation.
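The Jaccard and Dice overlap scores reported above can be sketched for a single reference and generated response. The example assumes whitespace tokenisation and set-based overlap, which the abstract does not specify; the function names and placeholder strings are hypothetical.

```python
def token_set(text):
    """Whitespace tokenisation into a set (an assumption of this sketch)."""
    return set(text.split())


def jaccard_overlap(reference, generated):
    ref, gen = token_set(reference), token_set(generated)
    union = ref | gen
    return len(ref & gen) / len(union) if union else 0.0


def dice_overlap(reference, generated):
    ref, gen = token_set(reference), token_set(generated)
    total = len(ref) + len(gen)
    return 2 * len(ref & gen) / total if total else 0.0


# Placeholder tokens stand in for Telugu word forms.
reference = "w1 w2 w3 w4"
generated = "w1 w2 w5"
j = jaccard_overlap(reference, generated)  # 2 shared tokens out of 5 distinct
d = dice_overlap(reference, generated)     # 2 * 2 shared over 4 + 3 tokens
```

With exact-match tokens, an agglutinative language gives low overlap whenever a suffix differs, which is one plausible reason surface scores of this kind stay small even when semantic similarity is high.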
TamilMayangoliSpell: An Open-Source Neural Framework for Context-Sensitive Mayangoli Error Correction in Tamil
Yazhmozhi V M | Annalu Waller | Jacky Visser
Mayangoli errors are context-sensitive errors in Tamil that arise from confusion among phonetically similar graphemes (e.g., ல/ள/ழ, ர/ற, ந/ன/ண). These errors are challenging for conventional spell checkers because both incorrect and correct forms are valid dictionary words, making dictionary lookup insufficient and requiring contextual modelling. We present TamilMayangoliSpell, a reproducible framework for Mayangoli error correction that combines (i) Tamil-specific preprocessing for sentence segmentation and normalisation, (ii) linguistically grounded error induction for generating training data constrained by dictionary validity, and (iii) fine-tuning of multilingual sequence-to-sequence models. Using 30,000 sentence pairs derived from TamilCorp, a massive multi-genre Tamil corpus, and split 80/10/10 into train/validation/test, we fine-tune mBART, mT5, and NLLB under a small hyperparameter grid using greedy decoding with a maximum sequence length of 128. mT5 achieves the best performance (BLEU 99.28; Exact Match Accuracy 93.50%) and remains strong in a cross-genre evaluation on short stories. The preprocessing scripts, generated parallel datasets, and trained models are publicly available in a GitHub repository.
TamilTok: Morphologically-Informed Tokenization for Tamil
Surendhar Muthukumar | Aaricia Herygers | Lisa Beinborn
Tokenization is fundamental to neural language modeling, yet for Tamil it remains largely adapted from general-purpose multilingual models without systematic consideration of the rich agglutinative morphology. We introduce TamilMorph, a large-scale dataset of more than 480,000 morphologically segmented Tamil word forms. Building on this new resource, we develop TamilTok, a morphology-aware tokenization framework that incorporates explicit morpheme structure into tokenizer training. We benchmark Tamil tokenization quality across multiple tokenization algorithms and vocabulary configurations and find that our approach improves both morphological alignment and downstream performance compared to previous approaches. Our morphological resource for Tamil and our systematic empirical analyses can guide future developments of tokenization for morphologically rich languages.
Thiruppugazh-KG Dataset: A Manually Annotated Resource for Computational Analysis of Tamil Devotional Literature
Garthigan Kumarasamy | Jubeerathan Thevakumar | Sathurgini Uthayakumar | Disne Kajanath | Narthana Sivalingam | Uthayasanker Thayasivam
This paper introduces Thiruppugazh-KG, a semantically annotated dataset and knowledge graph derived from the Thiruppugazh corpus, a 14th-century collection of 1,335 Tamil devotional hymns composed by Arunagirinathar. The dataset includes annotations for entities, devotional themes, mythological events, philosophical concepts, imagery, and sacred locations mentioned in each hymn. Using these annotations, we construct a Neo4j-based knowledge graph that models relationships between hymns and their associated cultural and narrative elements. Graph analytics, including PageRank, are applied to identify prominent entities and sacred locations within the corpus. The resulting resource provides a structured representation of Tamil devotional literature and supports computational analysis of cultural texts in low-resource languages.
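To show how PageRank surfaces prominent entities in a hymn-to-entity graph, here is a minimal power-iteration sketch in plain Python. It is not the Neo4j procedure the authors ran; the damping factor of 0.85, the iteration count, and the node names are all assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank


# Hypothetical graph: two hymns mention entity_A, one also mentions place_B.
edges = [
    ("hymn_1", "entity_A"),
    ("hymn_2", "entity_A"),
    ("hymn_2", "place_B"),
]
scores = pagerank(edges)
most_prominent = max(scores, key=scores.get)
```

Entities mentioned by many hymns accumulate rank, which is the sense in which such scores identify prominent entities and sacred locations.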
Findings in Tamil Dialect Speech Recognition and Classification
Bharathi B | Bharathi Raja Chakravarthi | Shunmuga Priya Muthusamy Chinnan | Saranya S | Suhasini S
As part of DravidianLangTech-2026, we provide an overview of the Shared Task on Dialect-based Speech Recognition and Classification in Tamil. Creating reliable systems for Tamil dialect identification from audio signals and dialect-aware Automatic Speech Recognition (ASR) is the main goal of the joint work. Dialect-based Tamil Speech Recognition and Tamil Dialect Classification from Speech are the two subtasks that make up the task. The training dataset consists of 5,134 audio recordings in four Tamil dialects (Southern, Northern, Western, and Central), spanning 9 hours and 22 minutes. There are 579 audio samples in the test set, totaling almost two hours in length. The shared task involved 17 teams in total. For speech recognition and dialect classification, the top-performing system obtained a Word Error Rate (WER) of 0.51 and a macro F1-score of 0.79, respectively. The findings emphasize the difficulties in understanding Tamil speech due to dialectal diversity and set solid foundations for further study on low-resource dialect-aware ASR systems.
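Word Error Rate, the recognition metric reported above, is the word-level edit distance between the reference and the hypothesis divided by the number of reference words. The sketch below uses whitespace tokenisation and placeholder tokens; the function name is hypothetical and the shared task's own scoring script may normalise text differently.

```python
def word_error_rate(reference, hypothesis):
    """Levenshtein distance over words, divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    previous = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
wer = word_error_rate("w1 w2 w3 w4", "w1 wx w3")
```

On this scale, the reported WER of 0.51 corresponds to roughly one edit for every two reference words.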
Findings of the Shared Task on Hope Speech Detection in Tulu
Thenmozhi Durairaj | Anusha M D Gowda | Raksha Adyanthaya | Rathnakara Shetty P | Parameshwar R Hegde | Mohammed Fadhel Aljunid | Prasanna Kumar Kumaresan | Bharathi Raja Chakravarthi
Hope Speech Identification is the process of detecting positive, supportive, and encouraging language in text. It focuses on identifying content that promotes unity, inclusiveness, and resilience. Identification of hope speech helps support mental well-being, create healthier online environments, counter hate speech, and promote positive digital communication. This shared task on hope speech detection in the code-mixed Tulu language, held as part of DravidianLangTech @ ACL 2026, focuses on both the coarse-grained hope tone classification and the fine-grained hope type classification tasks. Eleven teams participated and submitted several runs for both tasks. The teams are ranked based on the macro-F1 score.
From Comments to Harm: A Findings Report on Abusive Tamil Text Targeting Women on Social Media Shared Task
Bhuvaneswari Sivagnanam | Kathiravan Pannerselvam | Jananayagan | Charmathi Rajkumar | Ramesh Kannan R | Ratnavel Rajalakshmi | Shunmuga Priya Muthusamy Chinnan | Saranya Rajiakodi | Bharathi Raja Chakravarthi
This paper presents an overview of the second shared task on Abusive Tamil Text Targeting Women on Social Media as a binary classification problem (abusive vs. non-abusive). We release a dataset of Tamil YouTube comments and evaluate submissions using macro-F1 to encourage balanced performance in a noisy, low-resource setting. A total of 89 teams registered for this task and 24 teams submitted results. The approaches used by the teams include transformer fine-tuning, heterogeneous ensembles, classical baselines, and large language models using prompting and LoRA. Results show that the best-performing system scored 0.8297 macro-F1 and many submissions are around 0.79-0.81. Across submissions, transformer fine-tuning with domain-aligned encoders is consistently strong, while additional gains are frequently associated with Tamil-aware normalization and macro-F1-oriented calibration such as class-weighted learning and validation-based threshold tuning. Overall, the findings highlight the importance of language-aware preprocessing and careful decision calibration for reliable moderation of women-targeted abusive Tamil social media text. Disclaimer: This paper (including figures and examples) may contain offensive or harmful language, including abusive content targeting women. All such text is presented solely for research and educational purposes and does not reflect the authors’ views. Reader discretion is advised.
Overview of the Shared Task on Multilevel Political Meme Classification in Tamil and Malayalam
Saranya Rajiakodi | Shunmuga Priya Muthusamy Chinnan | Premjith B | Subalalitha CN | Rahul Ponnusamy | Anshid K A | Bhuvaneswari Sivagnanam | Bharathi Raja Chakravarthi
This paper presents an overview of the Multi-Level Political Meme Classification shared task conducted at DravidianLangTech–ACL 2026. The task introduces a hierarchical two-level classification framework for Tamil and Malayalam political memes: Level 1 focuses on stance detection (Support/Praise vs. Troll/Oppose), while Level 2 identifies the political target (individual or party), conditioned on the predicted stance. The dataset was curated from social media platforms and manually annotated with strong inter-annotator agreement. A total of 64 teams registered and 19 teams submitted their results using diverse multimodal approaches combining transformer-based text encoders, vision models, OCR pipelines, and hierarchical architectures. Results show that stance detection achieves high macro-F1 scores across both languages, whereas target identification remains more challenging, particularly in Malayalam. The findings highlight the importance of multimodal fusion, hierarchical reasoning, and robustness to OCR noise and class imbalance in political meme analysis.
Shared Task on Depression Detection from Malayalam and Tamil Speech Data
Jyothish Lal G | Premjith B | Bharathi Raja Chakravarthi | Saranya Rajiakodi | Thenmozhi Durairaj | Prasanna Kumar Kumaresan
Depression is one of the most common mental health problems in the world. It affects a person’s emotions, thinking, energy levels, and daily life. Early detection of depression is very important to provide timely support and treatment. While many studies focus on identifying depression from text, speech also carries important emotional and psychological signals that are often not fully explored. This paper presents an overview of the shared task on Depression Detection in Dravidian Languages (DD-DL). The task focuses on identifying signs of depression from speech data in two low-resource Dravidian languages: Tamil and Malayalam. Participants were provided with curated training datasets and were asked to build systems to classify speech samples as Depressed or Non-Depressed. The shared task includes two subtasks: (1) Depression detection in Tamil and (2) Depression detection in Malayalam. Participants applied various machine learning and deep learning approaches to model the acoustic and linguistic characteristics of speech. All submissions were evaluated using the macro-F1 score, which ensures fair performance measurement across classes.
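The reason macro-F1 gives a fair measurement across classes is that it averages per-class F1 scores with equal weight, so a minority class counts as much as the majority class. A minimal sketch follows; the function name and toy labels are illustrative assumptions rather than shared-task data.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced sample: the minority class is only half recovered.
gold = ["Depressed", "Depressed", "Non-Depressed", "Non-Depressed",
        "Non-Depressed", "Non-Depressed"]
predicted = ["Depressed", "Non-Depressed", "Non-Depressed", "Non-Depressed",
             "Non-Depressed", "Non-Depressed"]
score = macro_f1(gold, predicted)
```

In this toy case accuracy is 5/6 while macro-F1 is 7/9, because the missed minority-class example pulls that class's F1 down to 2/3.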
Shared Task on Prompt Style Recovery for Large Language Models in Telugu
Premjith B | Jyothish Lal G | Bharathi Raja Chakravarthi | Saranya Rajiakodi | Thenmozhi Durairaj | Ratnavel Rajalakshmi | Rahul Ponnusamy | Chinthala Bhuvanesh
This paper presents an overview of the Shared Task on Prompt Recovery for Large Language Models (LLMs) in Telugu, organized as part of DravidianLangTech @ ACL 2026. The task focuses on identifying the underlying communicative style of Telugu text excerpts, framed as a nine-class single-label classification problem covering Formal, Informal, Optimistic, Pessimistic, Humorous, Serious, Inspiring, Authoritative, and Persuasive tones. The dataset was constructed by collecting Telugu YouTube comments and generating style-modified variants using an LLM, resulting in 3,000 training instances, 300 validation samples, and 301 test samples. A total of 52 teams registered for the shared task, with 13 teams submitting valid system predictions. Systems explored diverse approaches, including transformer-based fine-tuning (IndicBERT, MuRIL, XLM-R), ensemble and stacking methods, pairwise modeling strategies, curriculum learning, and few-shot large language model prompting. Evaluation was conducted using Macro F1-score as the primary metric. The top-performing system achieved a Macro F1-score of 0.2987. Overall results indicate that Telugu prompt-style recovery remains a challenging problem, particularly due to stylistic overlap and high lexical similarity across classes.
TamilPoliSent 2026: A Shared Task report on Multiclass Political Sentiment Analysis in Tamil
Mani Vegupatti | Kishore Kumar Ponnusamy | Bharathi Raja Chakravarthi | Saranya Rajiakodi | Thenmozhi Durairaj | Prasanna Kumar Kumaresan | Sathiyaraj Thangasamy
Political sentiment analysis aims to automatically identify opinions and attitudes expressed in political discourse on social media platforms. This paper presents an overview of the TamilPoliSent 2026 shared task on multiclass political sentiment analysis in Tamil, organized as part of DravidianLangTech@ACL 2026. The task focuses on categorizing Tamil comments from X (formerly Twitter) into seven sentiment classes: Substantiated, Sarcastic, Opinionated, Positive, Negative, Neutral, and None of the above. The dataset consists of 5,440 annotated Tamil tweets collected from political discussions on social media. Participants were provided with labeled training and development datasets, while the test set was used for final evaluation. A total of 22 teams participated in the shared task and explored a wide range of modeling approaches including classical machine learning methods, transformer-based architectures, hybrid lexical–contextual models, and ensemble frameworks. System performance was evaluated using Macro F1-score to ensure balanced evaluation across all sentiment categories. The best-performing system achieved a Macro F1-score of 0.3935. The results highlight several challenges in Tamil political sentiment analysis, including class imbalance, sarcasm, informal writing styles, and semantic overlap between sentiment categories. The shared task demonstrates that transformer-based models combined with class-balanced learning and hybrid representations are effective for handling fine-grained political sentiment classification in low-resource languages. These findings contribute to advancing research in political discourse analysis and natural language processing for Tamil and other under-resourced languages.
AbuseDetect_Alchemists@DravidianLangTech 2026: A Weighted Transformer Ensemble for Detecting Abusive Tamil Text Targeting Women
Meclin A Francis | Jyoti Kumari | Vinay Babu Ulli | Malavika Sreekumar | Joel Johnson
This paper describes our system submitted to the shared task on Abusive Tamil Text Targeting Women on Social Media at DravidianLangTech@ACL 2026. We formulate the problem as a supervised binary classification task, assigning each Tamil social media comment to an Abusive or Non-Abusive category. Our pipeline begins with a tailored preprocessing stage that handles emoji translation, URL removal, and entity normalization. We then independently fine-tune two pre-trained transformer models, MuRIL and XLM-RoBERTa, on the task data. At inference time, we combine these models through a weighted softmax ensemble, assigning a weight of 0.6 to MuRIL and 0.4 to XLM-RoBERTa. The resulting system achieves a Macro-F1 score of 0.8115 on the test set, outperforming both individual models. The code is publicly available at: https://github.com/meclin2345/AbuseDetect_Alchemists
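A weighted softmax ensemble of the kind described above can be sketched as follows: each model's logits are turned into probabilities, and the two distributions are mixed with the stated weights of 0.6 and 0.4. This is a minimal sketch, not the authors' released code; the function names, label order, and logit values are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_ensemble(logits_a, logits_b, w_a=0.6, w_b=0.4):
    """Mix two models' class probabilities with fixed weights."""
    probs_a, probs_b = softmax(logits_a), softmax(logits_b)
    return [w_a * a + w_b * b for a, b in zip(probs_a, probs_b)]


LABELS = ["Non-Abusive", "Abusive"]  # assumed label order
muril_logits = [0.2, 1.1]            # hypothetical MuRIL output
xlmr_logits = [0.9, 0.3]             # hypothetical XLM-RoBERTa output

probs = weighted_ensemble(muril_logits, xlmr_logits)
prediction = LABELS[probs.index(max(probs))]
```

In this toy case the two models disagree, and the higher weight on the first model decides the outcome. Because the weights sum to 1, the mixed vector is still a valid probability distribution.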
AITamilDialect@DravidianLangTech 2026: Zero-Shot Whisper and Wav2Vec2 Embedding-Based Tamil Speech Recognition and Dialect Classification.
Varalakshmi K | Bharathi B
Low-resource languages pose significant challenges for speech technology due to linguistic variation and limited annotated resources. One such language is Tamil, a morphologically rich language with significant dialectal variation, which makes Automatic Speech Recognition (ASR) and dialect classification challenging. In this article, we introduce a shared-task system for Tamil speech processing covering both ASR and dialect classification. For ASR, we use the Whisper Large-v3 multilingual model in a zero-shot setting without task-specific fine-tuning. For dialect classification, we employ a pre-trained Wav2Vec2 model to extract acoustic features and apply mean and standard deviation pooling to create utterance-level representations, on which an XGBoost model is trained for four-way dialect prediction. Experiments on 579 Tamil speech samples resulted in a word error rate (WER) of 0.61, highlighting the difficulty of dialectal ASR in a low-resource setting. The dialect classification system obtained an accuracy of 0.49 and a macro F1 score of 0.41, with some confusion between the dialect classes. The proposed system relies purely on standard pretrained models without adaptation, but it provides a replicable benchmark for evaluating multilingual speech representations in low-resource Tamil scenarios. The results also indicate the need for stronger baseline models, strategies that improve model robustness, and improved methods for embedding-based dialect classification in future research.
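The mean and standard deviation pooling step mentioned above can be illustrated with a small sketch: a variable-length sequence of frame-level vectors is reduced to one fixed-length vector by concatenating the per-dimension mean and standard deviation. The function name and the toy frames are hypothetical, and the use of the population standard deviation is an assumption, since the abstract does not specify the variant.

```python
import math


def mean_std_pool(frames):
    """Pool T frame vectors of size D into a single vector of size 2*D."""
    n_frames = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]
    # Population standard deviation per dimension (an assumption)
    stds = [
        math.sqrt(sum((frame[d] - means[d]) ** 2 for frame in frames) / n_frames)
        for d in range(dim)
    ]
    return means + stds


# Two hypothetical 2-dimensional frames standing in for Wav2Vec2 features
utterance = [[1.0, 2.0], [3.0, 6.0]]
pooled = mean_std_pool(utterance)
```

The pooled vector has the same length for every utterance, which is what allows a tabular classifier such as XGBoost to be trained on top of it.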
Azrael@DravidianLangTech 2026: Dialect-Sensitive Automatic Speech Recognition and Classification for Tamil
Janish Andrin J | Mohammed Sahil | Saranya S
Tamil is an ancient language spoken by millions of people in India, Sri Lanka, and other parts of the world. Accent, vocabulary, and even speech rhythm vary among the central, northern, southern, and western regions of Tamil Nadu. Such variation makes it difficult for applications such as voice assistants or translation tools to keep up. This project develops a practical system to address that challenge. It takes raw Tamil audio files, identifies which of the four predominant dialects the speech belongs to, and transcribes the speech into text. Good-quality datasets for Tamil dialects are rare, owing to limited resources and limited interest in the language. We use pre-trained models: XLSR to identify the dialect and Wav2Vec 2.0 to convert speech into text. Overall, this configuration reached an accuracy of 46 percent. It distinguished well between the northern and southern dialects but showed some confusion between the central and western ones. For the transcription component, a cursory inspection suggests that it is reliable, transcribing clear speech correctly despite accent variation. That said, the system could be improved through more detailed fine-tuning or by balancing the classes in the data.
ByteBreaker@DravidianLangTech 2026: XLM-RoBERTa Large with Sliding-Window Chunking and Top-K Mean Pooling for Writing Style Classification
Chava Srinivasa Sai | R Vinay Kumar | Jigeesha Sai Surapaneni | Chava Shanmukha Sai
Identifying writing styles in long texts is difficult because style can vary between sections of a document. Additionally, the styles associated with a text may differ only in small and nuanced ways. In this paper, we describe ByteBreaker, the system we built for the Prompt Recovery for LLM Shared Task at DravidianLangTech@ACL-2026. The goal is to identify the writing style of a document that a large language model (LLM) has written. The candidate styles are Authoritative, Formal, Humorous, Informal, Inspiring, Optimistic, Persuasive, Pessimistic, and Serious. Given that a number of documents exceed the 512-token limit of transformer models, we adopt a sliding-window method that breaks each document into overlapping 512-token chunks with a stride of 256 tokens. We fine-tune XLM-RoBERTa Large on only the rewritten “CHANGE STYLE” text, as it has more distinct stylistic indicators. For prediction, we apply Top-K mean pooling to the chunk-level predictions, which puts more emphasis on the confident chunks as opposed to treating all chunks the same. To enhance consistency, we trained the model with five distinct random seeds and made three submissions: a weighted ensemble (Run 1), a mean-guided single model (Run 2), and a Top-K-guided single model (Run 3). Among the three, Run 3 reached the highest macro F1 score of 0.3306, while Run 1 achieved the best accuracy (0.3256) with a macro F1 of 0.3290.
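The chunking and pooling steps described above might look roughly like the sketch below. It splits a token sequence into overlapping windows (512 tokens, stride 256, as stated in the abstract) and then averages only the most confident chunk-level predictions. Several details are assumptions made for illustration: special tokens are ignored, "confidence" is taken to be a chunk's highest class probability, and the value of `k` is hypothetical.

```python
def sliding_windows(token_ids, window=512, stride=256):
    """Split a token sequence into overlapping chunks of at most `window` tokens."""
    if len(token_ids) <= window:
        return [token_ids]
    chunks, start = [], 0
    while True:
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # this chunk already reaches the end of the document
        start += stride
    return chunks


def top_k_mean_pool(chunk_probs, k=3):
    """Average the k chunk distributions whose top probability is highest."""
    ranked = sorted(chunk_probs, key=max, reverse=True)[:k]
    n_classes = len(ranked[0])
    return [sum(p[c] for p in ranked) / len(ranked) for c in range(n_classes)]


# A hypothetical 1000-token document yields three overlapping chunks
chunks = sliding_windows(list(range(1000)))

# Hypothetical per-chunk probabilities for a two-class toy problem
doc_probs = top_k_mean_pool([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]], k=2)
```

In the toy example the undecided middle chunk (0.5, 0.5) is dropped, so the document-level prediction is driven by the two chunks that express a clear preference.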
ByteBuilders@DravidianLangTech 2026: Transformer-Based Weighted Ensemble for Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments
Mitharshana T V | Shanthi S | Lavana V | Kaviya Varma R
This paper outlines our submission to the DravidianLangTech 2026 Tamil Political Sentiment Analysis task. Tamil political comments are to be classified into seven sentiment categories: substantiated, sarcastic, opinionated, positive, negative, neutral, and none of the above. Classifying the sentiment of Tamil political utterances is quite difficult. Furthermore, the sentiment categories are not uniformly distributed. To address some of these issues, we built an ensemble of two transformer-based models, XLM-RoBERTa and IndicBERT, and used 10-fold cross-validation to improve the model’s reliability and prevent overfitting. We used oversampling to address class imbalance and Focal Loss to help the model concentrate on the challenging examples. Finally, we averaged the token embeddings to improve sentence representations.
cantnlp@DravidianLangTech 2026: organic domain adaptation improves multi-class hope speech detection in Tulu
Andrew Li | Sidney Wong
This paper presents our systems and results for the Hope Speech Detection in Code-Mixed Tulu Language shared task at the Sixth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages (DravidianLangTech-2026). We trained an XLM-RoBERTa-based text classification system for detecting hope speech in code-mixed Tulu social media comments. We compared this organically adapted hope speech detection model with our baseline model. On the development set, the organically adapted model outperformed the baseline system. While our submitted systems performed more modestly on the official test set, these results suggest that further adapting XLM-RoBERTa on organically collected Tulu social media text containing code-mixed and mixed-script variation can improve hope speech detection in code-mixed Tulu.
CHMOD_777@DravidianLangTech 2026: Context-Aware Fine-tuned MuRIL for Abusive Tamil Text Detection on Social Media
Arunaggiri Pandian Karunanidhi | Prabalakshmi Arumugam
This paper describes Team CHMOD_777’s system for the DravidianLangTech@ACL 2026 shared task on detecting abusive Tamil text targeting women on social media. We fine-tune three transformer backbones (MuRIL, XLM-RoBERTa, IndicBERT-v3) with Focal Loss and weighted sampling, systematically evaluating the effects of context length, hyperparameter tuning, and language-specific pre-training. Our best system, MuRIL with 256-token context, achieves 82.76% Macro F1 on the development set and 80.61% on the official test set, ranking 6th out of 24 teams. We find that (1) extending context from 128 to 256 tokens improves F1 while converging 2.4x faster, (2) language-specific pre-training (MuRIL, 236M) outperforms larger models (IndicBERT, 270M), and (3) default hyperparameters are optimal, with every tuning attempt degrading performance.
CHMOD_777@DravidianLangTech 2026: LLM Augmented Transformer Fine-tuning for Tamil Political Sentiment Analysis
Arunaggiri Pandian Karunanidhi | Prabalakshmi Arumugam
This paper describes Team CHMOD_777’s system for the DravidianLangTech@ACL 2026 shared task on political multiclass sentiment analysis of Tamil Twitter comments. The task requires classifying Tamil political tweets into seven sentiment categories under severe class imbalance (8:1 ratio). We address this challenge through LLM-based data augmentation using Gemini 2.5 Flash, expanding training data from 4,352 to 15,316 samples (3.5x the original). Our best system, MuRIL fine-tuned on augmented data with Focal Loss (gamma=3.0) and weighted sampling, achieves 35.79% Macro F1 on the development set, a 67% relative improvement over the non-augmented baseline. On the official test set, our system achieves 34.25% Macro F1, ranking 12th out of 22 participating teams. We find that (1) language-specific pre-training (MuRIL, 236M) outperforms larger general models (IndicBERT-v3, 1B), (2) smaller models benefit disproportionately from augmentation, and (3) Substantiated is the hardest category (F1=10.7%) due to its requirement for factual reasoning.
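For readers unfamiliar with Focal Loss, the sketch below shows the standard per-example form, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), with the gamma of 3.0 mentioned above. It is a minimal illustration and not the team's training code; the function name, the `alpha` default, and the example probabilities are assumptions.

```python
import math


def focal_loss(probs, target, gamma=3.0, alpha=1.0):
    """Focal loss for one example, given predicted class probabilities."""
    p_t = probs[target]  # probability assigned to the true class
    # The (1 - p_t)^gamma factor shrinks the loss of well-classified examples
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss([0.9, 0.1], target=0)  # confident and correct
hard = focal_loss([0.3, 0.7], target=0)  # true class gets low probability
```

With gamma set to 0 the expression reduces to ordinary cross-entropy. Larger values of gamma shift the training signal further toward examples the model currently gets wrong, which is the usual motivation for using it under class imbalance.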
CHMOD_777@DravidianLangTech 2026: Tamil-Adapted Whisper and MMS for Dialect Speech Recognition and Classification
Arunaggiri Pandian Karunanidhi | Prabalakshmi Arumugam
This paper describes Team CHMOD_777’s system for the DravidianLangTech@ACL 2026 shared task on Tamil dialect speech recognition and classification. The task comprises two subtasks: classifying Tamil speech into four regional dialects (Northern, Southern, Western, Central) and transcribing dialectal Tamil speech to text. For dialect classification, we fine-tune MMS-1b-all with Focal Loss and weighted sampling, achieving 83.04 Macro F1 on the development set (5th out of 11 teams on the test set). For speech recognition, we fine-tune a Tamil-specific Whisper model (763M parameters), achieving 53.72 WER on the development set and 49.75 on the official test set, ranking 1st out of 13 teams. Our key finding is that domain-specific pre-training significantly outperforms larger general-purpose models: Tamil Whisper (763M) beats Whisper-large-v3 (1.5B) by 8 WER points despite having half the parameters.
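Word error rate, the ASR metric reported above, is the word-level edit distance between hypothesis and reference divided by the reference length. The sketch below is a minimal implementation with whitespace tokenisation; the function name and the example strings are hypothetical, and real evaluations usually apply text normalisation first.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, keeping one row of the table at a time
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference
wer = word_error_rate("a b c d", "a x c")
```

The function returns a fraction, so a value such as 0.54 corresponds to 54 on the percentage scale that some of the abstracts in this volume use. WER can exceed 1.0 when the hypothesis contains many insertions.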
Cuet Yet Another Baseline@DravidianLangTech 2026: Shared Task on Prompt Recovery for LLM in Telugu
Rotna Dipika Debnath | Shahrin Afroz Hoque Ruhi | Ayesha Labiba | Arpita Mallik | Hasan Murad
Prompt recovery in large language models (LLMs) is the task of inferring the communicative intent and stylistic framing of the original instruction from model-generated output. This task is especially challenging for low-resource Dravidian languages such as Telugu, where agglutinative morphology, register variation, and scarce annotated data complicate stylistic modelling. In this paper, we present our system for the Shared Task on Prompt Recovery for LLM in Telugu at DravidianLangTech @ ACL 2026, which aims to classify Telugu transcript excerpts into nine communicative style categories: Formal, Informal, Optimistic, Pessimistic, Humorous, Serious, Inspiring, Authoritative, and Persuasive. We have implemented a transformer-based approach using ai4bharat/IndicBERTv2-MLM-only, MuRIL-base and Telugu-BERT for Telugu communicative style classification. Our system fine-tunes the pretrained Indic language models on the training samples to capture stylistic patterns in Telugu transcripts. Our approach achieved a macro F1 score of 0.2993 on the evaluation set, demonstrating the potential of Indic-focused pretrained models for stylistic analysis in low-resource language settings. Controlled ablations reveal that label smoothing benefits stronger Indic backbones but degrades weaker ones, and that surface linguistic feature augmentation does not complement rich contextual representations on small datasets.
CUET-2567@DravidianLangTech-ACL 2026: Multimodal Stance and Target Identification in Dravidian Political Memes
Arka Dutta | Anindya Majumder | Adnan Faisal | Hasan Murad
In Dravidian languages, political memes increasingly shape public opinion and political discourse, influencing digital conversations and public narratives. Our paper proposes a multilevel multimodal framework for political meme classification in Tamil and Malayalam as part of the Multi Level Political Meme Classification shared task at DravidianLangTech@ACL 2026. The task involves two levels: Level 1 identifies whether a meme expresses Troll/Oppose or Support/Praise, while Level 2 determines the specific target category (Individual, Party, or Intersection). We evaluate unimodal and multimodal architectures to analyze the impact of textual and visual representations. Experimental results highlight the advantage of a multimodal approach over unimodal approaches, confirming the effectiveness of combining image and text features in meme understanding. Among the evaluated models, the mBERT+ViT architecture achieves the best overall performance across both languages and classification levels. In the official shared task evaluation, we achieved an average F1 score of 0.72 in the Malayalam task, securing 2nd rank, and an F1 score of 0.76 in the Tamil task, securing 6th rank. In our own experimental evaluation, however, the best average F1 scores were 0.62 for Tamil and 0.49 for Malayalam. These results are moderate, and challenges remain, mainly due to the dataset size, class imbalance, and noisy text extraction from images.
CUET_InferX@DravidianLangTech 2026: Shared Task on Dialect Based Speech Recognition and Classification in Tamil
Md. Ashraful Islam Semon | Jihadul Islam | Ratnajit Dhar | Hasan Murad
Tamil has a lot of internal variability, including the way it is used in casual conversations, code mixing, and phonetic differences in the way it is spoken in different regions, making it quite difficult to transcribe the spoken word and classify the dialects. In order to address these challenges, our paper presents the system developed by the CUET_InferX team for the Shared Task on Dialect Based Speech Recognition and Classification in Tamil, which was part of DravidianLangTech@ACL 2026. For Subtask 2 (ASR), our proposed system is based on a dual-architecture design that incorporates a fine-tuned Whisper-large-v3 model with Low-Rank Adaptation (LoRA) and a Wav2Vec2 XLSR-53 model, topped with a KenLM statistical language model for n-gram phonetic correction. Our ASR system resulted in a Word Error Rate (WER) of 0.54, which earned us 2nd position for Subtask 2. For Subtask 1 (Speech-Based Dialect Classification), our proposed system is based on a text-based weighted ensemble of IndicBERT, MuRIL, XLM-RoBERTa, and TamilBERT models, which is completely dependent on our ASR system’s transcription outputs. Our proposed system achieved a Macro F1 score of 0.22, which earned us 9th position for Subtask 1.
Cuet_Neural_Navigators@DravidianLangTech 2026: Depression Detection from Malayalam and Tamil Speech using Self-Supervised Acoustic Models
Shuva Dey | Abir Dey | Sha Newaz Mahmud | Hasan Murad
Depression detection from speech aims to find signs of depression using behavioral signals. This approach enables early mental health screening and makes it scalable. However, the task is tough because of subtle acoustic cues, differences among speakers, and language-specific patterns. In this work, we introduce our system for the Shared Task on Depression Detection in Dravidian Languages (DD-DL) at DravidianLangTech@ACL 2026. We focus on speech in Tamil and Malayalam. We explore pretrained self-supervised speech encoders, including HuBERT, XLS-R, and Whisper, to identify acoustic patterns related to depression directly from raw audio. Our method combines these models through ensembling to capture different acoustic features. The experiments use stratified evaluation and cross-lingual analysis to check how well the models work across languages. Results show that pretrained acoustic representations effectively capture vocal features of depression, achieving Macro-F1 scores of 0.9058 for Tamil and 0.9396 for Malayalam. However, cross-lingual transfer faces challenges because of phonetic and prosodic differences.
CUET_SYNTHETICA@DravidianLangTech 2026: Multi Architecture Transformer Ensemble for Detecting Abusive Tamil Text Targeting Women
Miftahul Jannat Rishta | Sumaiya Zaman | Shiti Chowdhury | Hasan Murad
Abusive language targeting women has been a serious problem on Tamil social media, and building systems to detect it automatically is harder than it looks. Tamil is morphologically complex, people have written it mixed with English in ways no dictionary has accounted for, and a lot of the hostility has been indirect enough that it has slipped past models trained on surface patterns. In the Shared Task on Abusive Tamil Text Targeting Women on Social Media at DravidianLangTech@ACL 2026, we have worked on classifying Tamil YouTube comments as Abusive or Non-Abusive. We have trained three transformer models four times each with different learning rates, giving us 12 models in total. Their predicted probabilities have been averaged to make the final decision. The 12-model ensemble has achieved a macro F1 of 0.8086, outperforming all individual models and securing 4th place in the shared task. Combining Tamil-specialized and multilingual transformer models has outperformed any single-architecture approach.
CUET_SYNTHETICA@DravidianLangTech 2026: Multilingual Transformer Based Hope Speech Detection for Coarse and Fine-Grained Classification in Tulu
Sumaiya Zaman | Miftahul Jannat Rishta | Shiti Chowdhury | Hasan Murad
Hope speech has played a vital role in online communities, yet most NLP work has focused on English and a few high-resource languages, leaving code-mixed varieties like Tulu largely unexplored. In the Shared Task on Hope Speech Detection in Code-Mixed Tulu at DravidianLangTech@ACL 2026, we have tackled two subtasks: (i) coarse-grained classification into Encouraging, Discouraging, Uninvolved and Blended categories (Task 1) and (ii) fine-grained classification into Optimistic, Realistic, Inspiring, Fading and Hopelessness (Task 2). We have fine-tuned three multilingual transformer encoders, XLM-RoBERTa-base, MuRIL and mBERT, on the official training splits. In Task 1, a three-way soft-voting ensemble of all three models has yielded the best performance with a macro F1 of 0.58, securing 1st place. In Task 2, XLM-RoBERTa-base alone has outperformed both MuRIL and mBERT, achieving a macro F1 of 0.42 and also securing 1st place.
CYBERPUNK@DravidianLangTech 2026: Multimodal Political Meme Classification using CLIP and Logo Similarity
Shahad Abir
We present our system for the DravidianLangTech 2026 shared task on multi-level political meme classification in Tamil and Malayalam. The task involves two hierarchical levels: (1) stance detection (Support vs. Troll) and (2) target identification (Person, Party, or Intersection). Our approach combines CLIP vision-language embeddings (ViT-L-14) with face detection features and political logo similarity matching, resulting in a 773-dimensional feature representation. We train separate LinearSVC classifiers for each language and task level. Our system achieved Rank 1 in Malayalam with an average F1-score of 0.7930 and Rank 6 in Tamil with 0.7666. Our code is available at https://github.com/A-k-a-sh/Shared-task-multimodal-political-meme.
Dialectmind@DravidianLang Tech 2026: Zero-Shot Dialectal Tamil Automatic Speech Recognition Using a Large Pretrained Conformer Model
Gayathri.k | Bharathi B
Low-resource dialectal Automatic Speech Recognition (ASR) for languages like Tamil is difficult because of phonological differences, a lack of labeled data, and regional differences in the acoustics of speech. This paper introduces a dialect-aware Tamil ASR system based on the pretrained Conformer-CTC-BPE-Large model, run through the NVIDIA NeMo framework. The model integrates convolutional subsampling, multi-head self-attention, and Connectionist Temporal Classification (CTC) decoding with a BPE tokenizer to enable efficient end-to-end speech recognition. The system is tested on audio recordings of dialectal Tamil, using mono-channel audio normalization and batch transcription. Our findings indicate that large pretrained Conformer models can perform dialectal ASR even in a zero-shot setting. We examine the generated transcriptions, discuss the challenges posed by dialectal differences and acoustic modeling, and comment on possible future directions for data-efficient adaptation in low-resource speech recognition.
DLRG@DravidianLangTech 2026: Dual-Purpose Whisper Adaptation for Tamil Dialect Identification and Dialectal Speech Recognition
Gulisetty Abhinav | Tanisha Nanda | Ramesh Kannan R | Ratnavel Rajalakshmi
This paper describes our system developed for the shared task on Dialect Based Speech Recognition and Classification in Tamil at DravidianLangTech@ACL 2026. We participated in both Subtask 1 (Dialect Identification) and Subtask 2 (Dialectal ASR). Our approach leverages a single Tamil-adapted Whisper Medium model as a unified foundation for both tasks. For dialect classification, we have used the Whisper encoder as a feature extractor by discarding the decoder, applying mean pooling over the temporal dimension, and fine-tuning the full encoder with a lightweight classification head, achieving 73.4% accuracy on the test set. For dialectal ASR, we apply Low-Rank Adaptation (LoRA) to the full encoder-decoder architecture with SpecAugment-based data augmentation, achieving a Word Error Rate (WER) of 0.55 on the test set. Our experiments reveal that unfreezing the pre-trained encoder is critical for dialect discrimination, boosting accuracy from 52.78% (frozen) to 73.4% (unfrozen). The code is publicly available at https://github.com/DLRG-VIT/DravidianLangTech2026
DLRG@DravidianLangTech 2026: Explainable Transformer-Based Detection of Abusive Tamil Text Targeting Women on Social Media
Mirudhula Sankar | Ratnavel Rajalakshmi
Many social media platforms have users who have normalized the abuse of women online, creating a need for systems that automatically detect such activity. For low-resource regional languages like Tamil, which has informal writing styles, spelling variations, dialectal differences, and culturally specific expressions, correctly detecting abusive comments is a challenge. In this work, we present a transformer-based approach for binary classification of Tamil comments into abusive and non-abusive categories using the DravidianLangTech dataset. The proposed system fine-tunes MuRIL (a multilingual transformer pretrained for Indian languages), enabling effective contextual representation with minimal preprocessing. To improve the transparency of the system, a post-hoc Explainable AI component is incorporated. A perturbation-based method using log-odds differences identifies words that significantly influence the predictions. Experimental findings indicate that the model reaches a validation accuracy exceeding 81% while also exhibiting a strong macro-F1 score. This research shows that utilizing contextual multilingual representations alongside simple interpretability methods offers a viable and effective approach for detecting abusive text in Tamil. The implementation of our system is publicly available at https://github.com/mirud5173/Abusive-Tamil-Comment-Detection-using-Transformer-Models
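One common way to realise a perturbation-based explanation with log-odds differences is to delete each word in turn and measure how far the model's log-odds for the abusive class move. The sketch below follows that recipe. Deletion as the perturbation is an assumption, since the abstract does not say how inputs are perturbed, and `toy_model` is a hypothetical stand-in for a fine-tuned classifier.

```python
import math


def log_odds(prob, eps=1e-6):
    """Log-odds of a probability, clamped away from 0 and 1."""
    prob = min(max(prob, eps), 1.0 - eps)
    return math.log(prob / (1.0 - prob))


def word_importance(words, predict_abusive_prob):
    """Score each word by the log-odds drop observed when it is removed."""
    base = log_odds(predict_abusive_prob(words))
    scores = []
    for i, word in enumerate(words):
        perturbed = words[:i] + words[i + 1:]  # delete one word
        scores.append((word, base - log_odds(predict_abusive_prob(perturbed))))
    return scores


def toy_model(words):
    """Hypothetical classifier: high abusive probability if a trigger word appears."""
    return 0.9 if "insult" in words else 0.2


scores = word_importance(["you", "insult", "here"], toy_model)
```

A large positive score means that removing the word makes the abusive prediction much less likely, so the word is treated as influential. Scores near zero mark words the prediction does not depend on.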
DPR@DravidianLangTech 2026: Transformer-Based Abusive Content Detection for Tamil Text Targeting Women on Social Media
Diya Prakash | Praveen Kumar S | R Ranjith Kumar | Balasubramanian Palani | Jobin Jose | Siranjeevi Rajamanickam
The fast-growing volume of Tamil content on social media has led to an increase in abusive and gender-directed hate speech on online platforms. Detecting abusive content written in Tamil is relatively difficult owing to the complex morphological structure of the Tamil language, its dialects, transliteration, and contextualized usage. In this study, we explore the use of transformer-based pretrained language models for detecting abusive content in Tamil. Five transformer-based models (mBERT, MuRIL, XLM-RoBERTa, IndicBERT, and Tamil-BERT) were fine-tuned and tested using the DravidianLangTech 2026 shared task dataset. The experimental results show that the best-performing model was Tamil-BERT, with an accuracy of 80.72%, owing to Tamil-specific pretraining and better morphological analysis capabilities. Our system ranks 5th on the leaderboard of the DravidianLangTech 2026 shared task. The source code and fine-tuned models are open source and publicly accessible.
Dravid-Tech-Builders@DravidianLangTech 2026: A Comparative Study of Classical and Deep Learning Approaches for Tamil Dialect Classification and Speech Recognition
Naveen A | Karthiyayini P | Kalaivani K S
The rapid expansion of digital connectivity across India has dramatically increased participation in speech-enabled services and multilingual communication platforms. Tamil, with its rich dialectal diversity across geographical regions, presents unique challenges for automatic speech recognition and dialect identification systems. We participated in the DravidianLangTech 2026 shared task to classify Tamil speech into four regional dialects (Central, Northern, Southern, Western) and perform automatic speech recognition. We trained four machine learning models (SVM, Random Forest, CNN, CNN+BiLSTM) alongside two transfer learning models (Wav2Vec2-Base, Wav2Vec2-XLSR-53) for ASR. Among classification models, SVM with MFCC features achieved the best performance with 94.17% macro F1-score and validation accuracy of 94.35%. For ASR, Wav2Vec2-XLSR-53 obtained 15.3% WER, demonstrating effective cross-lingual knowledge transfer. Our analysis reveals that traditional machine learning approaches with engineered features outperform deep learning methods in low-resource scenarios with limited training data. Code is available at: https://github.com/Naveen-Arul/dravid-tech
ERROR_500@DravidianLangTech2026: Automatic Prompt Style Classification in Telugu Using Transformer-Based Language Models
Mahashweta Manjari Barua | Tasnia Khanam | Nuzha Saifa Rahmat | Shiti Chowdhury | Hasan Murad
Recovering writing-style prompts in low-resource languages is daunting due to diverse morphology, culturally specific language patterns, and scarce annotated resources. As previous work has predominantly focused on binary sentiment or single-attribute transfer, extensive multi-class style classification in under-resourced languages like Telugu has been vastly underexplored. In this study, we address this gap through the Telugu Prompt-Style Recovery Shared Task at DravidianLangTech@ACL 2026 (Premjith et al., 2026), framing prompt reconstruction as a nine-class classification problem with Formal, Informal, Optimistic, Pessimistic, Humorous, Serious, Inspiring, Authoritative and Persuasive as prompt styles. We evaluated three input configurations (Change Style, Original Transcripts and Merged input style) while training three transformer-based models (MuRIL, XLM-RoBERTa and IndicBERT v2) under identical conditions. Our most promising model, IndicBERT v2 with partial layer freezing and weighted cross-entropy loss, obtained a macro-F1 of 0.2987 and accuracy of 0.299. The Change Style configuration significantly outperformed the Original and Merged inputs, indicating that explicit style changes make tonal and meaning cues more distinctive. These results showcase the importance of language-specific pretraining and careful input design for style-sensitive NLP in low-resource settings, ultimately securing 1st rank on the shared task.
HNK@DravidianLangTech 2026: Investigating Grapheme-Level Normalization for Abusive Tamil Text Classification
Hanish Vigneshwar R | Nahul Alaguraj | Karthikeyan Manimaran | Ratnavel Rajalakshmi
The increasing prevalence of social media has been accompanied by an increase in abusive content targeting women, particularly in regional languages such as Tamil. The automatic identification of abusive content is critical for the creation of safer online spaces. In this paper, we treat the detection of abusive text targeting women as a binary text classification problem. We evaluated the performance of the proposed system on this task using the IndicBERT, MuRIL, and Tamil-BERT models. Additionally, we propose the use of grapheme-aware normalization, which aims to maintain the structural integrity of Tamil characters at the Unicode level. The experimental results reveal that the proposed system using the Tamil-BERT model with grapheme-aware normalization achieves the best performance among the evaluated models. The proposed system achieved the third position in the shared task.
Hope_Speech_Alchemists@DravidianLangTech 2026: TF-IDF SVM and XLM-RoBERTa with Focal Loss for Hope Speech Detection in Tulu
Joel Johnson | Meclin A Francis | Jyoti Kumari | Malavika Sreekumar | Vinay Babu Ulli
This paper describes our system submitted to the shared task on Hope Speech Detection in Tulu at DravidianLangTech@ACL 2026. The task comprises two sub-tasks: coarse-grained classification into four categories (Task 1) and fine-grained classification into five categories (Task 2). We compare a traditional TF-IDF + LinearSVC baseline against XLM-RoBERTa fine-tuned with minority-class oversampling and Focal Loss. Our experiments reveal an interesting trade-off: while the transformer approach achieves the best validation Macro-F1 of 0.57 on the coarse-grained task, the TF-IDF baseline outperforms it on the smaller fine-grained task, highlighting the data scarcity threshold below which large pre-trained models struggle to generalise. On the official test set, our system achieves a Macro-F1 of 0.55 on Task 1 and 0.40 on Task 2. The code is publicly available at: https://github.com/meclin2345/Hope_Speech_Alchemists
IIITK_SpeechScape@DravidianLangTech 2026: Dialect based speech recognition and classification using Speech Foundation Models and Deep Learning Techniques
G Srishtik Sekar | Harissh Ragav Dhamodaran | Kishore Shankar S | Balasubramanian Palani | R Tharaniya Sairaj
Dialectal variation poses a significant challenge to Automatic Speech Recognition (ASR), particularly for low-resource, morphologically rich languages such as Tamil. Although widely spoken in India, Sri Lanka, and the global diaspora, Tamil exhibits substantial phonetic, lexical, and prosodic variation across dialects, complicating both dialect classification and speech recognition. In this work, we address these tasks within a unified framework. We evaluate state-of-the-art models for dialect classification, including Whisper, CLDNN, wav2vec, and WavLM, and for ASR, Whisper and a zero-shot Conformer. Among them, Whisper achieves the best performance, obtaining a macro F1-score of 0.46 for dialect classification and a word error rate of 0.57 for ASR. These results highlight the strong generalization capability of transformer-based foundation models across dialects and languages. The code is publicly available on GitHub for research purposes.
IndiLangTech@DravidianLangTech 2026: Hierarchical Modeling for Multi-Level Political Meme Classification
Saurabh Kumar | Vivekananda G | Ranbir Singh Sanasam | Sukumar Nandi
Political memes are a widely used form of digital political expression in linguistically diverse regions such as South India, where visual cues, textual overlays, and cultural symbolism convey complex political narratives. The Shared Task on Multi-Level Political Meme Classification at DravidianLangTech 2026 introduces a hierarchical setting requiring stance identification (Support vs. Troll) and target-type prediction (Individual vs. Party) for Tamil and Malayalam memes. We propose a two-stage hierarchical framework based on the Gemma 3 4B Instruction model. Instead of jointly predicting both levels, two specialized models are fine-tuned: the first predicts meme stance, and its output conditions the second model for target identification, explicitly modeling the dependency between the meme content, the predicted stance, and the target type. Using LoRA-based parameter-efficient instruction tuning, our approach achieves average F1-scores of 0.8029 for Tamil and 0.6950 for Malayalam across both levels, ranking 1st in Tamil and 4th in Malayalam.
JerinWarriors@DravidianLangTech 2026: A Two-Stream Cross-Attention Approach for Prompt Recovery in Telugu
Savith A | Wordson Robert | Jerin Mahibha C | Shrey Patnaik
Identifying the structure of detailed sentences that show glimpses of various annotation cues is a challenge in a low-resource, morphologically rich language like Telugu. Standard baseline architectures like Multi-Layer Perceptrons (MLP) struggle with low-resource languages. This paper details our proposed solution for the Telugu Prompt-Style Recovery Shared Task at DravidianLangTech @ ACL 2026. We propose a Two-Stream Cross-Attention architecture that uses a shared MuRIL encoder to model the relationship between an original transcript and its style-shifted counterpart, helping the MLP to distinguish the styles and capture their differences. Through experimentation, we found that the proposed model handles the signal dilution of the individual labels better than the alternatives. Our best-performing system achieved a Macro F1-score of 0.2588 on the test set, securing 2nd place out of 13 teams. We conclude that the local transformation is the main driver of style recovery in this task. For reproducibility, we release our implementation and experimental setup on GitHub.
KEC’S CODE CRAFTERS@DravidianLangTech 2026: Abusive Tamil Text Detection Targeting Women on Social Media
Nivetha | Nethrasri S | Malliga Subramanian
As social media platforms continue to grow in size, unfortunately, they have also become a hub for digital toxicity, where women in linguistically diverse regions are particularly vulnerable to online harassment. Hence, the requirement for an automated moderation tool that can effectively handle regional languages is critical. Our paper is a step in this direction as we propose a classification model for the "Abusive Tamil Text Detection Targeting Women on Social Media" shared task for DravidianLangTech-2026. Our model is trained on a dataset of 25,948 comments for training and 915 for testing. Our primary objective was to classify content as either "Abusive" or "Non-Abusive" for YouTube videos. The Tamil language is particularly difficult to work with owing to its highly agglutinative structure and the tendency for code-mixing between Tamil and English or even using a mix of both in a single sentence. To overcome these difficulties in preprocessing, we designed a specific pipeline for denoising these informal scripts. We then implemented four traditional machine learning models: SVM, Logistic Regression, Random Forest, and Multinomial Naive Bayes using TF-IDF for feature extraction. Our model was optimized for hyperparameters and decision thresholds to achieve an accuracy and F1 score of 0.86 using Logistic Regression.
Lannisters@DravidianLangTech 2026: A Comparative and Ablation Study of Multilingual Transformers for Gender-Targeted Abuse Detection in Tamil Social Media Platforms
Kalaivani K S | Jaisanth K | Nandhini B
The prevalence of the use of the Tamil language on social media has heightened the need to address the issue of online harassment of women. As a result, there is a heightened need to develop a system to automatically identify abusive content in the Tamil language to promote a safe online communication platform. This paper presents a model to identify abusive content using a binary classification model to identify Abusive and Non-Abusive content. In this work, we experimented with several multilingual transformer models including DistilBERT, mBERT, and XLM-RoBERTa. From the experiments, it was observed that the XLM-RoBERTa model performed better than the others, achieving an accuracy of 91.17% and a macro F1 score of 0.8865. In this paper, ablation experiments are conducted to show that structured preprocessing, balancing the minority class, and tuning the hyperparameters contribute to the model's performance.
Mano_sub@DravidianLangTech 2026: Article-Aware Batching and Discriminative Fine-Tuning of MuRIL for Telugu Prompt-Style Classification
Manohar Sita Rama Madhurapantula | Seshu Babu Pulagara
This paper presents Team Mano_sub's submission to the Telugu Prompt-Style Recovery task at DravidianLangTech 2026, classifying Telugu text into nine stylistic categories: Formal, Informal, Optimistic, Pessimistic, Humorous, Serious, Inspiring, Authoritative, and Persuasive. We identify a critical structural property of the dataset: each of 384 unique source articles appears approximately 7.8 times with different style labels. Standard random batching leads to poor within-batch diversity when same-article samples co-occur, causing majority-class collapse and keeping macro F1 stuck at 0.022 regardless of learning rate. We propose an article-aware batch sampler that enforces within-batch article diversity, combined with discriminative learning rates for full MuRIL fine-tuning. Complete five-fold cross-validation yields a mean macro F1 of 0.3834 (std = 0.0189) on the development set, with fold best scores ranging from 0.3488 to 0.4040. The fold 1 best model achieves macro F1 = 0.2765 on the official test set, a 5.6× improvement over our officially submitted result of F1 = 0.0491; this score would have ranked 2nd among all 13 participating teams. All nine style classes are correctly predicted by epoch 5. Our system is officially ranked 12th in the Prompt Recovery for LLM in Telugu shared task at DravidianLangTech@ACL 2026. Code: https://github.com/msrmanohar/ACL-PRLLM
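As an illustration of the batching idea described in the abstract above, the following is a minimal sketch of a sampler that never places two samples from the same source article in one batch. It is not the authors' implementation: the function name `article_aware_batches`, the greedy "most remaining samples first" policy, and the choice to allow smaller trailing batches are all assumptions made for this example.

```python
import random
from collections import defaultdict


def article_aware_batches(article_ids, batch_size, seed=0):
    """Group sample indices into batches with at most one sample per article.

    article_ids[i] is the (hypothetical) source-article id of sample i.
    Batches may be smaller than batch_size when few articles remain.
    """
    rng = random.Random(seed)

    # Bucket sample indices by their source article and shuffle each bucket.
    by_article = defaultdict(list)
    for idx, art in enumerate(article_ids):
        by_article[art].append(idx)
    for indices in by_article.values():
        rng.shuffle(indices)

    batches = []
    while by_article:
        # Prefer articles with the most samples left (ties broken randomly),
        # so that large articles do not pile up at the end of the epoch.
        articles = sorted(
            by_article, key=lambda a: (-len(by_article[a]), rng.random())
        )
        chosen = articles[:batch_size]
        batch = [by_article[a].pop() for a in chosen]
        for a in chosen:
            if not by_article[a]:
                del by_article[a]
        rng.shuffle(batch)
        batches.append(batch)
    return batches


if __name__ == "__main__":
    ids = ["a", "a", "a", "b", "b", "c"]
    for batch in article_aware_batches(ids, batch_size=2):
        print(batch, [ids[i] for i in batch])
```

Each returned batch is a list of dataset indices, so in a PyTorch-style setup it could plausibly be supplied wherever a precomputed batch sampler is accepted; that integration is left out here to keep the sketch dependency-free.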
MedHastra@DravidianLangTech 2026: Piecewise Style Classification for Telugu Prompt Recovery Using XLM-RoBERTa
Shruti Chandrasekar | Vedajanaani R S | Vijayalakshmi P
We present a system for the DravidianLangTech @ ACL 2026 shared task on Telugu Prompt-Style Recovery (B et al., 2026). The task requires classifying Telugu text into one of nine communicative styles: Formal, Informal, Optimistic, Pessimistic, Humorous, Serious, Inspiring, Authoritative and Persuasive. Our approach fine-tunes the multilingual XLM-RoBERTa base model with a piecewise segment comparison strategy that evaluates distinct stylistic markers across sentence segments, enabling richer contextual discrimination between visually similar styles. Evaluated on the official test set, our system achieves a Macro F1 score of 0.1205, Accuracy of 0.1196, Precision of 0.1205 and Recall of 0.1231. We analyze the challenges of stylistic ambiguity in low-resource Telugu NLP and discuss directions for future improvement.
MUCS@Dravidianlangtech@ACL2026: Hope Speech Detection in Code-Mixed Tulu Language Using Multiple Features
Hosahalli Lakshmaiah Shashirekha | Rachana A
Hope speech refers to online expressions that promote positivity, encouragement, and social harmony. It fosters inclusivity and resilience, making it particularly valuable in culturally diverse and code-mixed communities. Detecting hope speech is an emerging area in computational linguistics, aimed at supporting healthier digital interactions and improving accessibility for vulnerable groups. While most hope speech detection work has focused on high-resource languages, low-resource languages such as Tulu remain unexplored. In this paper, we, Team MUCS, describe our proposed system submitted to the first shared task on Hope Speech Detection in Code-Mixed Tulu, organized by DravidianLangTech@ACL 2026. As there are no pretrained language models for Tulu, we explored multiple handcrafted features, namely word n-grams (n = 1, 3), character n-grams (n = 1, 3), syllable n-grams (n = 1, 3) and sub-words, to train ensembles of classical Machine Learning (ML) models: i) Multinomial Naive Bayes (MNB) and Logistic Regression (LR) classifiers and ii) k Nearest Neighbor (kNN) and Decision Tree (DT) classifiers, both with soft-voting. Experimental results demonstrate that feature integration effectively captures lexical, sub-lexical, and phonological cues in noisy code-mixed text. The system achieves competitive performance on both development and test datasets, highlighting the effectiveness of feature-based approaches for hope speech detection in code-mixed Tulu. An ablation study is also conducted to evaluate the contribution of multiple feature sets for hope speech detection.
NITC-HSR@DravidianLangTech 2026: Ensembling Multilingual Transformer Models for Detecting Abusive Tamil Text Targeting Women on Social Media
Rameez Mohammed A | S D Madhu Kumar
The proliferation of misogynistic content on social media platforms is a serious problem that requires the development of automated detection systems, which is a challenging task for low-resource languages like Tamil. This study investigates the effectiveness of multilingual transformer models for identifying abusive Tamil text targeting women on social media. Results indicate that such models achieve strong baseline performance on this task. Furthermore, an ensemble of the two best-performing models was found to improve the classification performance further. The results also highlight the significance of domain-specific pre-training for improving classifier performance. The best-performing ensemble model achieved a weighted F1 score of 0.83 on the test set, placing our approach in first position in the shared task.
PhucNguyen@DravidianLangTech 2026: Political Multiclass Sentiment Analysis with XLM-RoBERTa and Low-Rank Adaptation
Dinh Khac Phuc Nguyen | Thìn Đặng Văn
Analyzing political sentiment in code-mixed Tamil-English presents significant challenges due to informal jargon, severe class imbalance, and distribution shifts. This paper describes our system for the Political Multiclass Sentiment Analysis shared task at DravidianLangTech@ACL 2026, which categorizes tweets into seven sentiment classes. Our approach leverages XLM-RoBERTa integrated with Low-Rank Adaptation (LoRA). To mitigate majority-class dominance, we combine random oversampling with automated hyperparameter optimization to improve macro-level balance within this Parameter-Efficient Fine-Tuning (PEFT) framework. Enhanced by targeted preprocessing—specifically emoji demojization and noise removal—our system helps preserve nuanced symbolic cues, achieving a macro-average F1-score of 0.3763 and securing Rank 2 on the shared task leaderboard.
PolyTicsTamil_Alchemists@DravidianLangTech@ACL 2026: An Augmentation-Driven Focal Ensemble Model for Political Sentiment Analysis in Tamil
Jyoti Kumari | Meclin A Francis | Vinay Babu Ulli | Malavika Sreekumar | Joel Johnson
This paper describes our system submitted to the DravidianLangTech@ACL 2026 shared task on Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments. The task requires classifying Tamil political tweets into seven sentiment categories. We address two key challenges, severe class imbalance and semantic overlap between categories, through a three-stage pipeline. First, we balance the training set by augmenting minority classes via back-translation and transformer-based paraphrasing. Second, we fine-tune XLM-RoBERTa-base using a class-weighted Focal Loss (𝛾=2), which directs learning towards hard, ambiguous samples. Third, we train five models under Stratified 5-Fold Cross-Validation and average their softmax outputs at inference time. On the official test set, the system achieves a Macro-F1 of 0.3539. The code is publicly available at: https://github.com/meclin2345/PolyTicsTamil_Alchemists
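The second and third stages described in the abstract above can be sketched in a few lines. The snippet below is a dependency-free illustration, not the authors' code: the function names are hypothetical, the class weights are assumed to be supplied by the caller, and the loss is written for a single example using the standard focal loss form, -w_t (1 - p_t)^γ log p_t.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, class_weights, gamma=2.0):
    """Class-weighted focal loss for one example.

    p_t is the predicted probability of the true class; the factor
    (1 - p_t) ** gamma shrinks the loss of easy, confident examples.
    With gamma = 0 and unit weights this reduces to cross-entropy.
    """
    p_t = softmax(logits)[target]
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


def average_ensemble(per_model_logits):
    """Average the softmax outputs of several models for one example."""
    probs = [softmax(logits) for logits in per_model_logits]
    n_models = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_models for c in range(n_classes)]


if __name__ == "__main__":
    weights = [1.0, 2.0, 4.0]  # illustrative inverse-frequency-style weights
    print(focal_loss([2.0, 0.5, -1.0], target=0, class_weights=weights))
    print(focal_loss([2.0, 0.5, -1.0], target=2, class_weights=weights))
    folds = [[2.0, 0.5, -1.0], [1.5, 1.0, -0.5], [0.2, 0.1, 0.0]]
    avg = average_ensemble(folds)
    print(avg, max(range(len(avg)), key=avg.__getitem__))
```

Averaging probabilities rather than raw logits keeps each fold's contribution on the same scale, which is one common reason to prefer it when the fold models are differently calibrated.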
PrimeLine@DravidianLangTech 2026: Abusive Tamil Comment Detection Using MuRIL
Rithikaa V | S.Sumathi | Nithya Varshini C N R | Sanjay Krishnan K
Detecting abusive language in Tamil social media is a genuinely difficult problem. The language is morphologically rich, speakers routinely mix Tamil with English, and informal romanised Tamil is common enough to confuse models trained primarily on formal text. This work presents a system for binary classification of Tamil comments into Abusive and Non-Abusive categories, submitted to the DravidianLangTech@ACL 2026 shared task. MuRIL, a BERT-based encoder pre-trained on 17 Indian languages and their transliterated equivalents, is fine-tuned, and it is shown that this Indian-language-specific pre-training provides a meaningful advantage over generic multilingual baselines. The system achieves a macro-averaged F1 of 0.83 on the validation set, compared to 0.79 for XLM-RoBERTa and 0.77 for mBERT under identical training conditions, establishing a strong transformer-based baseline for abusive language detection in code-mixed Tamil.
PrimeLine@DravidianLangTech 2026: Hope Speech Detection in Tulu Using XLM-RoBERTa for Coarse and Fine-Grained Classification
Rithikaa V | S.Sumathi | Sanjay Krishnan K | Nithya Varshini C N R
Hope speech detection in low-resource, code-mixed languages presents a genuine challenge for natural language processing. Tulu, a Dravidian language spoken along the coastal regions of Karnataka and Kerala, is one such language where social media content is deeply code-mixed, blending Tulu, Kannada script, and English within a single comment. Two classification tasks are addressed: a four-class coarse-grained setting (Track 1) and a five-class fine-grained setting (Track 2). XLM-RoBERTa, a cross-lingual transformer pre-trained on more than 100 languages, is fine-tuned on the task-provided datasets using Google Colab with an NVIDIA T4 GPU. The system achieves a Macro F1-score of 0.34 on Track 1 and 0.19 on Track 2 on the official Codabench evaluation, establishing the first transformer-based baseline for hope speech classification in Tulu.
RMS@DravidianLangTech 2026: Multimodal Gated Fusion for Hierarchical Tamil Political Meme Classification
Md. Ajwad Hossain
Internet memes have become a dominant and highly accessible medium for political discourse on social media. However, their multimodal nature—combining culturally specific visual symbols with code-mixed text—presents a significant challenge for automated content analysis, particularly in low-resource languages. In this study, we describe the system submitted by team RMS for the Multi-Level Political Meme Classification shared task at DravidianLangTech @ ACL 2026, focusing exclusively on the Tamil language track. We propose a robust late-fusion multimodal architecture that leverages a pre-trained ResNet-50 network for visual feature extraction and a Transformer-based model (MuRIL) for processing code-mixed Tamil text. The modalities are aligned using bidirectional cross-modal attention and combined using a Gated Multimodal Unit, allowing the model to dynamically weight the importance of visual versus textual cues. Our system ranked 11th on the official leaderboard with a macro-averaged F1-score of 0.7382. Through detailed error analysis, we demonstrate that while our gated fusion approach excels at identifying explicit trolling stances, it struggles with complex target resolution when visual and textual cues contradict.
Semantica@DravidianLangTech 2026: Vision-Language Models for Hierarchical Political Meme Classification in Tamil and Malayalam
Junain Uddin | Rahul Datta | Taha Ibne Abdullah | Hasan Murad
Political memes are widely used to express opinions, sarcasm, and ideological narratives on social media platforms. However, detecting political trolling in low-resource languages such as Tamil and Malayalam remains challenging due to limited datasets and tools. To address this problem, DravidianLangTech@ACL 2026 organized a shared task on hierarchical political meme classification. This work explores text-only models, classical multimodal fusion, and Vision-Language Models (VLMs) for Tamil and Malayalam political meme classification. Our experiments include IndicBERTv2, XLM-RoBERTa, EfficientNet-based multimodal fusion, and Qwen-VL models. Among the submitted systems, Qwen2.5-VL-7B-Instruct with 4-bit QLoRA fine-tuning achieved competitive performance, ranking 3rd in the Malayalam track and 4th in the Tamil track based on weighted-F1 score. Additional post-evaluation experiments with Qwen3-VL-8B further improved macro-F1 performance, highlighting the effectiveness of VLMs for low-resource multilingual political meme classification.
SERENE@DravidianLangTech 2026: Multimodal Approaches for Depression Detection in Dravidian Speech: Acoustic, Spectrogram, and Transformer-Based Models
TT Pranesh | K.K.Thamizhmathi | S Vigneshwaran | Bharathi B
This paper presents our submission to the Depression Detection in Dravidian Languages shared task at DravidianLangTech 2026. We investigate three complementary approaches for speech-based depression detection in Tamil and Malayalam: (i) acoustic feature engineering using MFCC and prosodic features with a Support Vector Machine (SVM) classifier, (ii) a convolutional neural network (CNN) trained on Mel-spectrogram representations, and (iii) a transformer-based model using Whisper-generated transcripts fine-tuned with XLM-RoBERTa. Experimental results show that acoustic feature-based SVM and spectrogram-based CNN models achieve the strongest performance on both Tamil and Malayalam datasets, while the transformer-based approach also produces competitive results. We further discuss limitations and future research directions.
SJM_MINDS@DravidianLangTech@ACL2026: Machine Learning Approaches for Hope Speech Detection in Code-Mixed Tulu
Hosahalli Lakshmaiah Shashirekha | Manjula | Jayashree Krishna
Hope speech detection is an important task in understanding emotionally constructive communication on online platforms, especially in low-resource and code-mixed languages. This paper describes our system submitted to the first shared task on Hope Speech Detection in Code-Mixed Tulu, organized by DravidianLangTech@ACL 2026. The shared task consists of two tasks: Task 1 - Coarse-Grained Hope Tone Classification and Task 2 - Fine-Grained Hope Type Classification, with the objective of detecting and classifying the tone and type of hope expressed in code-mixed Tulu texts. We experimented with two classical Machine Learning (ML) approaches, Logistic Regression (LR) and Linear Support Vector Classifier (LinearSVC), trained with Term Frequency and Inverse Document Frequency (TF-IDF) of word n-grams (n = 1, 2). For Task 1, we employed both models, whereas for Task 2, we employed only the LR model. LinearSVC obtained a macro F1-score of 0.51 in Task 1 and secured 4th rank, while the LR model obtained a macro F1-score of 0.37 in Task 2 and secured 5th rank. The results demonstrate that traditional ML approaches remain effective for low-resource code-mixed language scenarios.
SSN_HopeNetters@DravidianLangTech 2026: Multi-Level Hope Speech Detection using XLM-RoBERTa
Moogambigai A | Bharathi B | Nikhil Karthik S | Pandiarajan D | Nandhika Saravanan
This paper presents our system submission to the Shared Task on Hope Speech Detection in Code-Mixed Tulu Language at DravidianLangTech @ ACL 2026. We introduce a transformer-based approach built on XLM-RoBERTa-base for multilingual hope speech classification. Our system addresses two sub-tasks: coarse-grained classification of hope versus non-hope speech and fine-grained categorization of different hope expressions. Since hope is often expressed in subtle ways, especially in mixed-language text, our model looks at the full context of a sentence to understand its real meaning rather than just focusing on specific words. Experimental results demonstrate that multilingual transformer models effectively model supportive and encouraging expressions, underscoring their suitability for promoting constructive discourse in low-resource and code-mixed language settings.
Still Loading@DravidianLangTech 2026: Telugu Prompt-Style Recovery using Multilingual Transformers
Samonwita Sarker | Isnat Mehrin Sami | Priyontee Mojumder | Arpita Mallik | Hasan Murad
This paper describes the system that our Still-Loading team designed for the Telugu Prompt-Style Recovery shared task at DravidianLangTech@ACL 2026. The task is to categorize Telugu transcript passages into one of 9 communicative styles: Formal, Informal, Optimistic, Pessimistic, Humorous, Serious, Inspiring, Authoritative, and Persuasive. We compared several multilingual Transformer-based models, i.e. MuRIL, XLM-RoBERTa-Large, mBERT, and IndicBERTv2. We chose a "Turbo Sandwich" preprocessing strategy, which gives more emphasis to lexical deltas, in addition to Focal Loss. Our MuRIL-based system ranked 7th on the official leaderboard with a Macro-F1 score of 0.1703. The source code to reproduce our experiments is publicly available at https://github.com/Priyontee1713/Still-Loading-Prompt-Recovery-for-LLM-in-Telugu.
SUPERNOVA@DravidianLangTech 2026: Transformer and Ensemble Approaches for Abusive Tamil Text Detection Targeting Women
Kiruthika K | Roahiyaa T | Premjith B
Abusive language targeting women on Tamil social media is a growing concern that necessitates automated detection systems capable of handling low-resource, code-mixed, and morphologically rich text. This paper presents the SUPERNOVA system submitted to the shared task on Abusive Tamil Text Targeting Women on Social Media at DravidianLangTech@ACL 2026. We investigate three complementary approaches: (1) fine-tuning MuRIL with class balancing and label smoothing, (2) MuRIL contextual embeddings combined with XGBoost and decision threshold tuning, and (3) a lightweight ensemble of character-level TF-IDF and SentenceBERT features with Random Forest and Extra Trees. Our best system achieves an accuracy of 0.8007 and a macro F1-score of 0.7994, ranking 11th among all participating teams. These results highlight the effectiveness of multilingual transformer representations combined with ensemble techniques for the detection of abusive text on Tamil social networks. The code is publicly available at https://github.com/Kiruthi001/SuperNova-DravidianLangTech-ACL2026.
SYNAPSE@DravidianLangTech 2026: Multi-Level Political Meme Classification for Tamil and Malayalam
Suriya KP | Durai Singh K | Gnanasabesan G | Ganesh Sundhar S | Hari Krishnan N | Jyothish Lal G
Political memes in Tamil and Malayalam present unique multimodal challenges for automated understanding, combining visual context with code-mixed, culturally grounded text. We present SYNAPSE, our system for the DravidianLangTech@ACL 2026 shared task on multi-level political meme classification. The task requires hierarchical classification of memes along two levels: Level 1 identifies the political stance (Support/Praise vs. Troll/Oppose), and Level 2 identifies the target (individual person vs. party). Our approach fine-tunes the Qwen3-VL-2B-Instruct vision-language model using parameter-efficient LoRA adapters on task-specific multimodal data, with structured output prompting for hierarchical label prediction. We report results for both Tamil and Malayalam subtracks. For Malayalam, our system achieves a Level 1 F1 of 0.9200 and Level 2 F1 of 0.4256 (Avg-F1: 0.6728, Rank 5). For Tamil, our system achieves a Level 1 F1 of 0.7840 and Level 2 F1 of 0.4885 (Avg-F1: 0.6362, Rank 14).
TamilEcho_Political@DravidianLangTech 2026: Hybrid XLM-RoBERTa with Sarcasm-Aware Feature Fusion for Political Multiclass Sentiment Analysis in Tamil X
Kanimozhi Selvi C S | Inigashree N S | Kavinraj J | Moneissh A G
Political sentiment analysis in Tamil social media is challenging due to informal language, sarcasm, emoji-driven sentiment inversion, and severe class imbalance. This paper presents TamilEcho, our system submitted to the Shared Task on Political Multiclass Sentiment Analysis of Tamil X (Twitter) Comments at DravidianLangTech@ACL 2026. We propose a hybrid architecture that integrates contextual representations from XLM-RoBERTa with lexical TF-IDF features and explicit sarcasm-aware emoji features. Domain-specific hashtag expansion is incorporated to enrich political context. To address class imbalance, we apply inverse-frequency class weighting and label smoothing during training. Experimental results demonstrate that hybrid feature fusion significantly improves performance over transformer-only baselines. Our final system achieves a Macro-F1 score of 0.3559 on the official test set, securing Rank 10 among participating teams. The results highlight the effectiveness of combining semantic, lexical, and pragmatic cues for fine-grained political sentiment classification in Tamil.
TAMILGOODBADTXT@DravidianLangTech 2026: A Multilingual Transformer-Based Approach for Abusive Language Identification in Tamil Social Media
Varalakshmi K | Bharathi B
It is difficult to detect abusive language, particularly in social networks for low-resource languages like Tamil. Spelling errors, informal expressions and code-mixing make it even more challenging to read text from social media. The current work proposes a multilingual transformer-based approach to detect abusive content in Tamil text. A pretrained XLM-RoBERTa model is used to learn contextual and semantic representations from the input text. This is a general pipeline comprising preprocessing, tokenization, and binary classification (abusive / non-abusive). Experiments are performed on Tamil social media datasets that have abusive and non-abusive data. The results reveal that multilingual transformer models achieve good performance in low-resource scenarios. The proposed model attains an F1-score of 78.64%, which shows the potential of using cross-lingual pretrained models for the detection of abusive Tamil language.
TamilVoiceLab@DravidianLangTech 2026: Investigating Whisper Tamil Large-v2 for Dialectal Tamil Speech Recognition
S.b.priya | Bharathi B
Automatic Speech Recognition (ASR) for languages rich in dialects and those with limited resources presents significant challenges due to the variations in pronunciation and vocabulary across different regions. This study offers a baseline evaluation of the Whisper Tamil Large-v2 model without fine-tuning for the Tamil Dialect Speech Recognition shared task. The focus is on the ASR subtask, utilizing dialectal Tamil speech recordings gathered from various regional dialects within Tamil Nadu. The pretrained Whisper Tamil Large-v2 model was assessed directly, without any supplementary fine-tuning or domain adaptation. A total of 579 dialect speech samples were used for experimentation, with performance evaluated based on Word Error Rate (WER). The model recorded a WER of 0.71, indicating that even robust multilingual pretrained models encounter challenges in dialect-rich and low-resource environments. These findings underscore the necessity for dialect-aware adaptation and the importance of balanced dialect training data to develop effective Tamil ASR systems.
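For readers unfamiliar with the metric, Word Error Rate is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a plain-Python illustration under that standard definition; the function name and the whitespace tokenization are assumptions, and the shared task's official scorer may normalize text or aggregate across utterances differently.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix seen so far
    # and the first j hypothesis words (standard Levenshtein dynamic programme).
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref_words)


# One substitution and one deletion against a four-word reference.
print(word_error_rate("a b c d", "a x c"))
```

Because insertions count as errors, WER is not bounded by 1: a hypothesis much longer than the reference can score above it.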
Team Oryu@DravidianLangTech 2026: A Multilingual Transformer Approach for Hope Speech Detection in Code-Mixed Tulu
Joyeta Barua Moni | Noore Tamanna Orny | Md. Abtahee Kabir | Hasan Murad
Hope speech detection appears to have an essential role to play in fostering positive and inclusive communication on social media, especially in low-resource multilingual settings. This paper describes the system submitted by Team Oryu for Task 1: Coarse-Grained Hope Tone Classification in Code-Mixed Tulu. The task involves classifying comments in social media texts into one of the four classes: Encouraging, Discouraging, Uninvolved, and Blended Tone. The texts in this task show heavy code-mixing between Tulu, English, and Kannada. In order to overcome this challenge, we employed a fine-tuned multilingual transformer model, code-mixed text processing, data augmentation, and class-weighted loss to handle class imbalance. Our proposed system achieved a Macro F1-score of 63%, securing 3rd position on the shared task. The results demonstrate the efficacy of multilingual transformer models in emotionally nuanced classification in code-mixed environments while underscoring the difficulties in capturing blended emotional tones.
Team_One@DravidianLangTech 2026: A Gated Multimodal Architecture for Multi-Level Stance and Target Detection in Malayalam Political Memes
Nimisha M Iyer | Ashmi S N | Balasubramanian Palani | Jobin Jose | Siranjeevi Rajamanickam
Stance and target detection in multimodal political memes presents notable challenges in low-resource and highly imbalanced settings. This task is based on the Malayalam dataset from the DravidianLangTech 2026 Shared Task (500 samples with a 95.4:4.6 stance imbalance). The primary challenges stem from linguistic variability and visually complex meme formats, which hinder accurate text extraction and effective multimodal alignment. A lightweight yet high-performing multimodal framework is proposed that integrates bilingual OCR, a Vision Transformer (ViT), and IndicBERT to learn complementary visual and textual representations. A gated fusion mechanism effectively combines multimodal features, while asymmetric loss weighting and post-training threshold optimization address extreme class imbalance. The methodology achieves a Weighted F1-score of 0.9535 for stance detection and 0.5283 for target identification, demonstrating strong robustness and generalization under realistic multimodal constraints.
Trailblazer@DravidianLangTech 2026: A Comparative Study of TF-IDF SVM and XLM-RoBERTa for Political Multiclass Text Classification.
Anuradha C | Anbuaruvi R | Shanthi Murugan
The rapid growth of social media networks poses challenges for the classification of multilingual and code-mixed data. The shared task on Political Multiclass Sentiment Analysis of Tamil X (Twitter) at DravidianLangTech@ACL 2026 asks participants to classify political text. For this task, we compare a traditional machine learning model with transformer-based models. First, we developed a baseline Support Vector Machine model using TF-IDF features. To provide a stronger Indic-language baseline, we consider IndicBERT, a transformer model specifically designed for Indian languages. IndicBERT improves contextual understanding of Tamil-English code-mixed political text. To capture deeper information from the text, we developed an XLM-RoBERTa model with minimal pre-processing. The results show that the transformer-based model performs well compared to the traditional baseline, with a macro F1 score of 0.3738. The study highlights the importance of robust multi-class classification of political text on social media.
TriVector@DravidianLangTech 2026: Abusive Tamil Text Detection on Social Media Using Lexicon-Augmented Transformers
Oarisa Rebayet | Tahmima Hoque Eid | Fawzia Tabassum | Hasan Murad
Abusive comment detection in low-resource languages poses significant challenges, particularly when targeting gender-based abuse on social media platforms. This work presents our system for ’Abusive Tamil text targeting women on social media’ at DravidianLangTech@ACL 2026. We introduce nine handcrafted lexicon features integrated with pretrained multilingual transformer embeddings and evaluate their effectiveness in classifying Tamil online comments as abusive or non-abusive. To better understand their impact, we compare model performance with and without these lexical attributes across multiple transformer architectures. Our best-performing model, XLM-RoBERTa-Large, achieved a macro F1-score of 81.71%, securing 15th rank in the competition. The findings indicate that larger multilingual models generalize more effectively to unseen data compared to smaller domain-specific models, while the addition of lexical features yields only mild gains.
TriVector@DravidianLangTech 2026: Depression Detection from Tamil and Malayalam Speech with Speaker-Independent Evaluation using MFCC and Wav2Vec2
Tahmima Hoque Eid | Fawzia Tabassum | Oarisa Rebayet | Hasan Murad
Depression is a major mental health concern that can be reflected through subtle changes in speech patterns, prosody, and vocal characteristics. In low-resource and multilingual settings, depression detection from speech can be particularly challenging. In this work, we present our system for the Shared Task on Depression Detection from Malayalam and Tamil. We explored both handcrafted acoustic features (MFCC) and pretrained speech representations (Wav2Vec2) for depression detection, along with a simple fusion strategy to examine their complementary strengths. Our observations showed that Wav2Vec2 generalized better for Malayalam, whereas for Tamil, a validation-tuned probability fusion performed best. The final system achieved macro-F1 scores of 99.5% for Malayalam and 88.6% for Tamil, securing 3rd place in both tasks.
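One plausible reading of "validation-tuned probability fusion" is a convex combination of the two models' positive-class probabilities, with the mixing weight chosen on held-out data. The sketch below illustrates that reading only; the grid of eleven candidate weights, the 0.5 decision threshold, the use of accuracy as the selection criterion, and the toy probabilities are all assumptions, not details from the paper.

```python
def fuse(probs_a, probs_b, alpha):
    """Convex combination of two models' positive-class probabilities."""
    return [alpha * a + (1 - alpha) * b for a, b in zip(probs_a, probs_b)]


def accuracy(probs, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    predictions = [1 if p >= threshold else 0 for p in probs]
    return sum(pred == gold for pred, gold in zip(predictions, labels)) / len(labels)


def tune_alpha(probs_a, probs_b, labels, steps=10):
    """Pick the fusion weight on a small grid that maximizes validation accuracy."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda alpha: accuracy(fuse(probs_a, probs_b, alpha), labels))


# Hypothetical validation outputs: each model alone gets half the samples right,
# but their errors fall on different samples.
val_labels = [1, 1, 0, 0]
mfcc_probs = [0.9, 0.3, 0.6, 0.1]
wav2vec2_probs = [0.4, 0.8, 0.2, 0.6]

best_alpha = tune_alpha(mfcc_probs, wav2vec2_probs, val_labels)
print(best_alpha, accuracy(fuse(mfcc_probs, wav2vec2_probs, best_alpha), val_labels))
```

The selected weight is then frozen and applied to test-set probabilities; tuning it on the test set itself would leak information.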
VITECH@DravidianLangTech2026: Prompting and LoRA Adaptation for Tamil Abusive Language Detection - A Comparative Study of Open LLMs
Triambiga Krubhakaran | Senthil Kumar B | Kaviya Nagarajan | Balaji N
The detection of abusive Tamil text using large language models (LLMs) has received relatively little attention compared to BERT variants. We empirically evaluated four families of open-weight LLMs (Gemma, LLaMA, Qwen, and DeepSeek-Distilled) on the Tamil dataset provided by the shared task. The models are assessed under two in-context learning settings (zero-shot and few-shot) and a parameter-efficient fine-tuning approach using LoRA, with model sizes of approximately 2B and 8B parameters. Experimental results show that 8B models consistently outperform their 2B counterparts, indicating the benefit of increased model capacity. Among the adaptation techniques, LoRA fine-tuning significantly outperforms both zero-shot and few-shot prompting. Across all evaluated settings, Google’s Gemma-2-9B model with LoRA fine-tuning achieved the best performance among the model families, and our test result ranked 12th among all 22 submissions with an F1-score of 0.7959.
Wave2Word@DravidianLangTech 2026: WhisTam: A unified framework for dialect based Tamil speech recognition and classification
Ruwad Naswan | Shadab Tanjeed Ahmad
While Automatic Speech Recognition (ASR) systems have shown impressive performance in languages having sufficient annotated speech data like English, their performance is still limited for low-resource, dialect rich languages like Tamil. Tamil poses further challenges because of its extremely high regional variation in dialects that manifest in varying vocabulary, pronunciations, and even syntactic structures. To address these challenges, we present a unified framework WhisTam based on the Whisper medium model, which performs speech transcription and dialect classification jointly within a single system. Our method is evaluated against speech samples from four regional dialects and achieves a macro F1-score of 0.53 and a Word Error Rate (WER) of 0.55 for dialect classification and transcription respectively, ranking 2nd in the dialect classification task and 3rd in the transcription task in the DravidianLangTech@ACL 2026 shared task on Dialect-based Speech Recognition and Classification in Tamil. These findings emphasize the challenges in dialectal Tamil ASR as well as the promise of multi-task learning for low-resource languages. Our implementation is publicly available at: https://github.com/rwd51/DravidianLangTech-Wave2Word.
Wise@DravidianLangTech 2026: Dialect-Aware Tamil Speech Classification and Recognition via Cross-Pipeline Embedding Transfer
Ganesh Sundhar S | Hari Krishnan N | Gnanasabesan G | Suriya KP | Jyothish Lal G
This paper presents the **Wise** system for the shared task on dialect-based speech processing in Tamil, addressing two subtasks: **(1) four-way dialect region classification** (Northern, Southern, Western, Central), and **(2) dialectal Tamil ASR**. All audio is preprocessed using loudness normalization followed by neural denoising to ensure consistent audio quality for downstream models. For classification, we experiment with different model variants combining multilingual and Tamil-pretrained **Wav2Vec2** backbones with five temporal pooling strategies under frozen and partial fine-tuning settings. Our best configuration, i.e., learned attentive pooling with partial fine-tuning and a differentially trained MLP head, achieves a macro F1 of **0.79**, securing **1st place** with a margin of **0.26** points. For ASR, we propose two novel **dialect-conditioned Whisper** architectures—residual injection and cross-attention—that inject dialect embeddings from the trained classifier into the ASR pipeline. In addition, we evaluate a vanilla Whisper-Tamil fine-tuned baseline. The best model achieved a **WER of 0.90**, securing **8th place** in the shared task.
Proceedings of the 9th Workshop on Event Extraction and Understanding: Challenges and Applications (EEUCA 2026)
Ali Hürriyetoğlu | Surendrabikram Thapa | Hristo Tanev
Overview of the Workshop on Event Extraction and Understanding: Challenges and Applications
Ali Hürriyetoğlu | Surendrabikram Thapa | Hristo Tanev | Laxmi Thapa | Surabhi Adhikari
This paper presents an overview of the 9th Workshop on Event Extraction and Understanding: Challenges and Applications (EEUCA 2026), held in conjunction with ACL 2026. Formerly known as CASE, the workshop continues its mission of bringing together researchers from natural language processing, machine learning, computational social science, and related disciplines to advance research on event extraction and understanding. This year’s edition particularly emphasized the growing influence of large language models (LLMs), multimodal learning, and weakly supervised methodologies in event extraction research. The workshop featured six regular research papers covering topics such as low-resource event extraction, reflective multi-agent architectures, symbolic auditing of procedural events, geopolitical event extraction, and generative event extraction strategies. In addition, EEUCA 2026 hosted two shared tasks focusing on toxicity detection in gaming communities and multimodal vaccine-critical meme analysis, attracting broad international participation and encouraging research on socially impactful applications of AI. The workshop highlights current advances, emerging challenges, and future directions in multilingual, multimodal, and socially aware event extraction systems.
Understanding Toxic Behavior in Gaming Communities Using AI to Promote Healthier Digital Spaces
Surendrabikram Thapa | Shuvam Shiwakoti | Siddhant Bikram Shah | Kritesh Rauniyar | Laxmi Thapa | Surabhi Adhikari | Kristina T. Johnson | Ali Hürriyetoğlu | Hristo Tanev | Usman Naseem
Online gaming communities are increasingly affected by toxic communication, including harassment, threats, hate speech, and extremist content. Detecting such behavior is challenging due to the short, noisy, multilingual, and highly imbalanced nature of gaming chat data. To advance research in this area, we organized the Shared Task on Fine-Grained Toxicity Detection in Online Gaming at EEUCA 2026, co-located with ACL 2026. The task is based on the GameTox dataset, containing approximately 53,000 annotated chat utterances from World of Tanks across six toxicity categories. A total of 102 participants took part, and 35 teams submitted systems exploring approaches such as domain-adaptive pretraining, multilingual transfer learning, contrastive learning, LLM-based augmentation, and ensemble methods. Systems were evaluated using macro-averaged F1-score, with the top system achieving 0.7041 Macro F1. This paper presents an overview of the shared task, dataset, evaluation framework, participant methods, and key findings.
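Macro-averaged F1 gives every class equal weight regardless of its frequency, which is why it is a common choice for highly imbalanced data such as this. The following sketch shows the computation in plain Python; the toy labels are invented for illustration and the official scorer may differ in edge-case handling (for example, classes with no predictions).

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in truth or prediction."""
    classes = sorted(set(y_true) | set(y_pred))
    per_class = []
    for cls in classes:
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denominator if denominator else 0.0)
    return sum(per_class) / len(per_class)


# Hypothetical imbalanced case: a classifier that always predicts the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.2f}  macro_f1={macro_f1(y_true, y_pred):.4f}")
```

The majority-class predictor scores well on accuracy but poorly on macro F1, since the minority class contributes an F1 of zero to the average.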
Multimodal Identification of Vaccine Content Stance on Social Media
Surendrabikram Thapa | Shuvam Shiwakoti | Siddhant Bikram Shah | Kritesh Rauniyar | Laxmi Thapa | Surabhi Adhikari | Kristina T. Johnson | Ali Hürriyetoğlu | Hristo Tanev | Usman Naseem
Vaccination-related memes on social media play an increasingly influential role in shaping public perception of immunization, often spreading both supportive messaging and vaccine-critical narratives through multimodal communication. Detecting such content is challenging due to the combined use of images, embedded text, sarcasm, humor, and cultural references. This paper presents an overview of the Shared Task on Multimodal Identification of Vaccine Critical Content on Social Media, organized as part of the 9th Workshop on Event Extraction and Understanding: Challenges and Applications (EEUCA 2026) at ACL 2026. The task is based on the VaxMeme dataset, a large-scale collection of vaccination-related memes annotated into three classes: Vaccine-critical, Neutral, and Pro-vaccine. A total of 77 participants registered for the competition, with 25 teams submitting systems for evaluation. Participating approaches included transformer-based multimodal architectures, vision-language models, ensemble methods, and instruction-tuned large language models. The best-performing system achieved a macro F1-score of 0.8494. This shared task provides insights into the strengths and limitations of current multimodal approaches for vaccine stance detection and highlights future directions for robust public health misinformation analysis.
Constructing a Silver Corpus for Weakly Supervised Vietnamese Event Extraction using Cross-Document N-ary Relation Filtering
Phạm Xuân Hiệu | Tuan Vu Minh | Mai-Vu Tran | Hoang-Quynh Le
Event extraction for low-resource languages such as Vietnamese is limited by the lack of large-scale annotated data. To address this, we propose a weakly supervised framework that constructs a silver corpus via pseudo-labeling. We introduce a cross-document n-ary relation filtering strategy to reduce noise by leveraging consistency across multiple articles describing the same event, and further enhance data diversity with schema-based augmentation. Experiments on the BKEE benchmark show consistent improvements, demonstrating the effectiveness of our approach. Data is available at: https://github.com/Larken1612/VietEE2.
When Tasks Share Structure: A Comparative Study of Training Strategies for Generative Event Extraction
Rishi Ravikumar | Riza Batista-Navarro
Event extraction requires performing two interdependent subtasks: event detection and event argument extraction. While prior work has explored pipelined and joint training approaches, the question of how best to coordinate training across these subtasks in generative LLM-based systems remains open. We present a systematic study comparing three training paradigms: disjoint, fully shared and hybrid weight allocation, instantiated as eight concrete strategies and evaluated on ACE2005 and RichERE across multiple instruction-tuned LLMs. Our findings show that training strategy has a consistent and meaningful effect on extraction accuracy, and that a clear best-performing strategy emerges across models and benchmarks. We believe that these findings could extend beyond event extraction to other information extraction tasks that decompose into interdependent subtasks.
A Qualia-Based Audit of Procedural Event Annotations
Kyeongmin Rim | Marc Verhagen | James Pustejovsky
Procedural event annotations record *what changed* but not the semantic relevance or grounding of the change: whether the annotated entity is the kind of thing whose state matters for the domain. We present Entity Qualia Structure (EQS), a per-entity sortal-type categorization (coarsened from Generative Lexicon’s type system to three categories: natural, artifactual, instrument) extracted from existing lexical resources. Applied to the OpenPI food domain, EQS reaches 84.7% coverage of the 518-item entity vocabulary; across 9367 transformation annotations, only 51.1% concern food entities themselves, while 30.2% record state changes of instruments, entities whose sortal type places them outside the food-state task. In a three-way comparison against existing cleanup efforts, EQS uniquely flags 15.6% of annotations that neither human re-annotation (OpenPI-C) nor LLM salience scoring (OpenPI 2.0) catches. Analysis of the *agentive* quale reveals that 93% of agentive-positive annotations involve instruments rather than food: entity creation can only be detected when the agentive feature is paired with the associated verb’s event semantics.
Benchmarking Models for Low-Resource Nepali Event Extraction with Trigger Phrase Identification and Event Classification
Sujal Maharjan | Astha Shrestha | Lakshmojee Koduru | Sweta Poudel | Shuvam Shiwakoti | Rabin Thapa | Kritesh Rauniyar | Surendrabikram Thapa
Research on Event Extraction (EE) in South Asian languages is crucial for understanding information dissemination and enabling automated news analysis in morphologically complex, low-resource environments. To address the scarcity of high-quality, publicly available datasets, we present Nepali Event Extraction (NepEE), a manually annotated corpus comprising 10,226 Devanagari sentences. The dataset includes annotations for trigger spans and event types, achieving high inter-annotator agreement with Fleiss’ kappa = 0.812 for trigger identification and kappa = 0.855 for event classification. Our dataset was developed through a rigorous iterative three-phase protocol involving five expert native speakers to ensure linguistic precision. We conduct benchmarking across a broad spectrum of approaches, including classical feature-based models, five fine-tuned Transformer encoders, and contemporary instruction-tuned Large Language Models (LLMs) using zero-shot and fixed few-shot prompting. Our analysis shows that Indic-specialized Transformers achieve superior classification performance, while traditional methods and few-shot prompting struggle with the challenges of exact span extraction in morphologically complex contexts. Furthermore, we quantify performance differences between sentence-level and span-level tasks, providing strong baselines for future research. The findings and the released NepEE dataset provide a valuable resource for advancing event understanding in low-resource languages (LRLs). All code and resources are available at https://github.com/SUJAL390/EEUCA-ACL-2026-Trigger-Phrase-Identification-and-Event-Classification-in-Low-Resource-Languages.
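Fleiss' kappa, the agreement statistic reported here, compares observed agreement among a fixed number of annotators with the agreement expected by chance. The sketch below follows the standard formula; the toy rating matrices are invented, and it assumes every item is rated by the same number of annotators (five in the examples), which the abstract suggests but does not state for every item.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of annotators who assigned item i to category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(count * count for count in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    observed = sum(per_item) / n_items

    # Chance agreement: sum of squared overall category proportions.
    proportions = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    return (observed - expected) / (1 - expected)


# Hypothetical matrices for five annotators and two categories.
unanimous = [[5, 0], [0, 5], [5, 0], [0, 5]]
split = [[3, 2], [2, 3]]

print(fleiss_kappa(unanimous), round(fleiss_kappa(split), 3))
```

A value of 1 means perfect agreement, 0 means agreement at chance level, and negative values mean annotators agree less often than chance would predict.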
A Self-Reflective LLM-based Architecture for Semi-Open Event Extraction
Hristo Tanev | Michel de Bollivier | Bertrand De Longueville
We present a multi-agent reflective architecture for event extraction based on generative large language models (LLMs). Our architecture is the first of its kind to perform Semi-Open Event Extraction (SOEE), a hybrid framework that combines a fixed set of event template fields with dynamically generated attributes. Another novel feature of this system is self-reflection, a type of LLM-based reasoning defined as the generation of questions about missing or implicit event information and finding their answers within the system itself. We model event extraction as an iterative dialogue between a reflective LLM-based agent, which generates questions to uncover missing event information, and a set of expert agents, which provide domain-aware answers to these questions. The expert agents also generate the initial event template using a generative LLM. Evaluated in the health domain, our event extraction system shows very promising results, demonstrating that LLM-based reflective multi-agent reasoning can accurately perform event extraction and expand the event template in a creative and comprehensive manner.
GENOME: A New Geopolitical Event Methodology and Dataset using Large Language Models
Alessandro Dell’Orto | Jesse Kommandeur
Quantitative research in International Relations relies heavily on structured event data, yet existing automated datasets lack up-to-date coverage of both conflictual and cooperative interactions. We introduce GENOME (Geopolitical Event News Observatory, Mapping, and Extraction), an automatically extracted dataset that implements PLOVER’s 16 event types and extends its Actor–Recipient schema with a Third Party role to capture multilateral relations from newswire data. GENOME’s pipeline comprises event extraction, ontology-based classification, entity normalization, and deduplication, leveraging GPT models with one-shot prompting and enforced structured outputs. We compare GENOME against the POLECAT dataset over a five-month overlap period across event volume, temporal dynamics, and geographical coverage. Results show that while the two datasets align closely on conflict event types, GENOME captures a more balanced distribution of cooperative events, particularly verbal interactions nearly absent in POLECAT. GENOME also demonstrates improved temporal precision by attributing events to their inferred date of occurrence rather than publication date, and effective deduplication of highly covered events.
FNLP412@EEUCA 2026: Understanding Toxic Behavioral Intent in Gaming Chat Logs using Transfer Learning and Synthetic Data Augmentation
Mihai Radu Radulescu
Our paper explores several machine learning methods for detecting toxic language in gaming-related chat utterances. We start with the GameTox dataset, perform some data preprocessing and augment the minority classes with LLM-generated synthetic data. We then set a baseline using a classic Logistic Regression model and continue to explore several approaches to surpassing it, by leveraging the leading multilingual transformer models (XLM-RoBERTa and DeBERTa-V3) to classify our test data. We achieve a top result of 0.6725 Macro-F1 (2nd place on shared task leaderboard) using an MDeBERTa-V3 model which we pretrained on the Jigsaw dataset for 1 epoch and then fine-tuned on our GameTox data for 5 epochs.
wangkongqiang@EEUCA 2026: Understanding Toxic Behavioral Intent in Gaming Chat Logs
Kongqiang Wang | Peng Zhang | Quingli Tan
Our team was interested in content classification and labeling for toxicity detection in the chat logs of online gaming communities. We joined the shared task on Understanding Toxic Behavioral Intent in Gaming Chat Logs at EEUCA, held with ACL 2026. In this task, the goal is to assign a content classification label to a player’s utterance (e.g., Hate and Harassment, Threats, Non-toxic), that is, to develop systems that can classify the intent of the utterance. The dataset for this task has six labels: Non-toxic (0), Insults and Flaming (1), Other Offensive Texts (2), Hate and Harassment (3), Threats (4), and Extremism (5). Performance is ranked by macro F1-score. The task uses 53,000 game chat utterances from World of Tanks. Our group applied supervised learning with multiple pre-trained models and fine-tuned Qwen2 LLMs. The best result on the test set was a Macro F1 score of 0.5776, Accuracy 0.9075, Precision (Macro) 0.6847, and Recall (Macro) 0.5343, obtained by fine-tuning the qwen2_7B LLM and ranking 8th among all teams. The complete code of this entire project can be found at our GitHub address.
wangkongqiang@EEUCA 2026: Multimodal Identification of Vaccine Critical Content on Social Media
Kongqiang Wang | Peng Zhang | Quingli Tan
Our team was interested in content classification and labeling for multimodal detection of vaccine-critical memes on social media. We joined the shared task on Multimodal Identification of Vaccine Critical Content on Social Media at EEUCA, held with ACL 2026. In this task, the goal is to assign a content classification label to vaccine-related discourse (e.g., Vaccine critical, Neutral, Pro-vaccine), that is, to develop systems that can classify the intent of a vaccine-related meme. The dataset for this task has three labels: Vaccine critical (0), Neutral (1), and Pro-vaccine (2). Performance is ranked by macro F1-score. The shared task is based on the VaxMeme dataset, a collection of over 10,000 manually annotated vaccination-related memes designed to support multimodal vaccine-critical meme detection. Our group applied supervised learning by fine-tuning pre-trained models and large language models (LLMs), including Qwen2 and Llama series LLMs, with Llama-Factory. The best result on the test set was a Macro F1 score of 0.8153, Accuracy 0.8185, Precision (Macro) 0.8151, and Recall (Macro) 0.8159, obtained by fine-tuning the qwen2_1.5B LLM and ranking 12th among all teams. The complete code of this entire project can be found at our GitHub address.
Quasar@EEUCA 2026: Multimodal Deep Learning for Vaccine Stance Detection in Memes
Adiba Fairooz Chowdhury | MD Sagor Chowdhury
Vaccine stance detection in multimodal memes has emerged as an important yet challenging task, requiring models to interpret both textual and visual cues that jointly convey opinions. The difficulty lies in capturing subtle semantic interactions and handling class imbalance across stance categories. In this paper, we present our system developed for the VaxMeme 2026 Shared Task at EEUCA 2026. Our approach leverages a soft-voting ensemble of complementary models, combining DeBERTa-v3-large and RoBERTa-large for rich textual representation with CLIP (ViT-B/32) for joint vision-language understanding. We incorporate domain-specific preprocessing and techniques such as random token deletion, image enhancement, and balanced class oversampling to address dataset limitations. Through extensive ablation studies, we identify balanced class oversampling as the most effective component, significantly improving performance across models. Our final system achieves a macro F1-score of 0.8306, securing 8th place among 25 teams, demonstrating the effectiveness of ensemble-based multimodal learning for stance detection.
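Soft voting, as opposed to majority (hard) voting, averages the class-probability vectors of the member models and predicts the class with the highest mean. The sketch below illustrates the general technique only; the equal default weights, the variable names, and the probability values are assumptions made for the example and are not taken from the system described above.

```python
CLASSES = ["Vaccine-critical", "Neutral", "Pro-vaccine"]


def soft_vote(probability_vectors, weights=None):
    """Return (winning class index, averaged probabilities) for one sample."""
    if weights is None:
        weights = [1.0] * len(probability_vectors)
    total_weight = sum(weights)
    n_classes = len(probability_vectors[0])
    averaged = [
        sum(w * probs[c] for w, probs in zip(weights, probability_vectors)) / total_weight
        for c in range(n_classes)
    ]
    return averaged.index(max(averaged)), averaged


# Hypothetical outputs of three models for one meme. Two models lean weakly
# towards "Neutral", one is highly confident in "Vaccine-critical".
model_outputs = [
    [0.9, 0.1, 0.0],
    [0.4, 0.5, 0.1],
    [0.4, 0.5, 0.1],
]

winner, averaged = soft_vote(model_outputs)
print(CLASSES[winner], [round(p, 3) for p in averaged])
```

In this toy case a majority vote would pick "Neutral", while soft voting picks "Vaccine-critical" because it takes each model's confidence into account.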
CUET_SYNTHETICA@EEUCA 2026: Gated Cross-Modal Attention with Domain-Adapted Text Encoding for Vaccine-Critical Meme Detection
Sumaiya Zaman | Miftahul Jannat Rishta | Shiti Chowdhury
Vaccine-critical memes have emerged as a growing challenge for public health communication, combining images and text to spread misinformation in ways that are difficult to detect automatically. In this paper, we describe our system for the EEUCA 2026 Shared Task on Multimodal Vaccine-Critical Meme Detection, which classifies memes from the VaxMeme dataset into Vaccine-Critical, Neutral and Pro-Vaccine categories. We experimented with multiple text encoders and visual backbones, finding that Twitter-RoBERTa fused with CLIP ViT-L/14 through gated cross-modal attention achieves a test macro F1 of 0.8357. We further show that domain-specific pretraining outperforms larger general-purpose models, highlighting the importance of domain adaptation over raw model scale. Our system secured 3rd position on the shared task leaderboard.
wenbin@EEUCA 2026: MoEs-VaxAgent, A Two-Stage Framework for Multimodal Vaccine Critical Meme Detection
Wenbin Shen
Memes on social media have emerged as a crucial medium for disseminating vaccine-related viewpoints, yet their inherent irony, metaphor, and text-image misalignment pose significant challenges to automatic detection. In this paper, we propose MoEs-VaxAgent, a two-stage multimodal framework for vaccine critical meme detection. First, we design a dynamic routing Mixture-of-Experts module capable of adaptively capturing multi-granular semantic cues within memes. Second, to address hard samples located at the decision boundaries, we introduce an uncertainty-aware multi-agent rectification mechanism to perform a secondary detection on samples identified with low confidence in the first stage. In the EEUCA 2026 Shared Task on Multimodal Vaccine Critical Meme Detection, our system achieved a Macro F1-score of 0.8205, ranking 9th on the official leaderboard. Furthermore, we discuss various exploratory strategies evaluated during the competition and provide a detailed analysis of the model’s performance.
thaulab@EEUCA 2026: Who Said What to Whom? A Targeting-Aware Neural-Symbolic Pipeline for Gaming Toxicity Detection
Anmol Guragain | Marcos Estecha-Garitagoitia | Luis Fernando D’Haro | Ricardo de Córdoba
This paper describes our system for the EEUCA 2026 Shared Task on toxicity classification in gaming chat. We implement a three-stage pipeline combining an ensemble of two compact transformers (DeBERTa-v3-base, 184M; XLM-RoBERTa-base, 278M) with a Linguistically-Informed Mediator (LIM) that resolves inter-model disagreements through corpus-backed lexical normalization, class-conditional unigram scoring, multilingual profanity detection, and agentive targeting analysis grounded in speech act theory. The LIM specifically targets the minority classes (Hate Harassment, Threats, and Extremism), which are the most safety-critical categories in real-world gaming moderation. To address the extreme class imbalance (1,450:1 Non-toxic to Extremism ratio), we introduce a two-stage data augmentation strategy using only the provided training data. Our system achieves a Macro F1 of 0.6441 and accuracy of 0.9062 on the official test set, ranking 3rd in Macro F1 and 1st in accuracy among all teams. The proposed pipeline is domain-portable: adapting to other gaming platforms requires substituting only the game-specific entity lexicon. Code is publicly available at https://github.com/Anmol2059/thaulab_EEUCA.
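The gap between Macro F1 and accuracy reported above is typical under heavy class imbalance. The following minimal sketch computes both metrics from scratch to show why they diverge; the toy labels are invented and unrelated to the shared task data, and the helper names are hypothetical.

```python
from typing import Hashable, Sequence


def f1_for_class(y_true: Sequence[Hashable], y_pred: Sequence[Hashable], label: Hashable) -> float:
    """F1 for one class, treating it as the positive class (0.0 if undefined)."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in truth or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of exactly matching predictions."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy imbalanced data: a classifier that always predicts the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
scores = (accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Because every class contributes equally to the macro average, ignoring a rare class costs far more in Macro F1 than in accuracy.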
syuhhh@EEUCA 2026: A Three-Stage Progressive Training Framework for Fine-Grained Toxicity Detection in Online Gaming Communities
Yuhao Shi | Yu Wang | Shengjie Zhao
This paper presents our 1st-place system for the Shared Task on Fine-Grained Toxicity Detection in Online Gaming (GameTox) at the 9th EEUCA Workshop, co-located with ACL 2026. The task targets 6-class fine-grained toxic intent classification on the official GameTox dataset, comprising 53,000 real-world World of Tanks chat utterances. We propose a three-stage progressive training framework built on XLM-RoBERTa-large: (1) gaming domain adaptive MLM pre-training, (2) multilingual toxicity transfer fine-tuning, and (3) supervised contrastive learning (SCL)-enhanced target task tuning. We further incorporate LLM-driven data augmentation and long-tailed class synthesis. Our system achieves a Macro F1 of 0.7041, ranking 1st among 35 teams. Ablation studies validate each module’s contribution, and we release our code to facilitate follow-up research.
CSECU-Learners@EEUCA 2026: Vaccine Critical Memes Identification using Two-Stage Early Fusion of Transformers
Monir Ahmad | Md. Saif Uddin
Memes have emerged as a fast and influential way to share information online, particularly during major public health events like COVID-19 vaccination. While they can support awareness and encourage positive behavior, they are also widely used to spread misinformation and vaccine-critical views. These messages are often expressed through sarcasm and implicit meaning, which makes automatic detection difficult. To tackle this problem, EEUCA 2026 introduces a shared task based on the VaxMeme dataset for multimodal vaccine critical meme detection. The task encourages us to design models that can jointly understand both image and text, capturing the underlying context more effectively. In this work, we present our approach to this task by proposing a two-stage early fusion framework that integrates multiple transformer-based encoders. We train our model using focal loss to give more attention to difficult samples. Our experimental results show that our method performs competitively in the shared task, demonstrating its effectiveness for this problem.
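For readers unfamiliar with focal loss, the sketch below shows the standard single-example form, FL = -(1 - p_t)^gamma * log(p_t), where p_t is the probability assigned to the true class. The value gamma = 2.0 is a common default and an assumption here; the abstract does not report the authors' setting or whether they use class weighting.

```python
import math
from typing import Sequence


def focal_loss(probs: Sequence[float], target: int, gamma: float = 2.0, eps: float = 1e-12) -> float:
    """Focal loss for one example given predicted class probabilities.

    With gamma = 0 this reduces to ordinary cross-entropy; larger gamma
    shrinks the loss of well-classified (high p_t) examples.
    """
    p_t = max(probs[target], eps)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# An easy example (p_t = 0.9) is down-weighted far more than a hard one (p_t = 0.3).
easy = focal_loss([0.9, 0.1], target=0)
hard = focal_loss([0.3, 0.7], target=0)
```

The relative weight of the hard example grows with gamma, which is how the loss shifts attention toward difficult samples.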
ShriNep@EEUCA 2026: RAKSHAK – Multi-Task DeBERTa with Rationale Distillation and Jigsaw-Augmented Training for Toxic Intent Classification
Binayak Karki | Aryan Kafle | Pingala Ghimire
This paper presents two systems for the GameTox Shared Task at the Workshop on EEUCA at ACL 2026, which requires classifying World of Tanks chat utterances into six fine-grained toxic intent categories (Labels 0–5). Severe class imbalance, domain-specific multilingual slang, and extremely scarce data for rare categories such as Threats (Label 4, 60 samples) and Extremism (Label 5, 24 samples) make this a challenging classification problem. Our primary submission, RAKSHAK (rakṣaka, Sanskrit for "Protector"), is a multi-task DeBERTa-v3-base framework combining rationale distillation from Qwen2.5-14B, Supervised Contrastive Loss, and dedicated rare-class binary heads. RAKSHAK’s training data is augmented with cross-domain transfer from the Jigsaw Toxic Comment dataset (16,225 samples mapped to Labels 1–4) and 100 LLM-generated extremism samples for Label 5. Our secondary system (M1) fine-tunes DeBERTa-v3-base with Focal Loss on the original GameTox data plus the same 100 extremism samples, without Jigsaw transfer. RAKSHAK achieves a Macro F1 of 0.5883 on the official test set, ranking 7th out of 35 participating teams, while M1 achieves 0.5252 Macro F1. An ablation comparing M1 with and without Jigsaw data shows that cross-domain transfer accounts for +2.6 F1 points, while RAKSHAK’s multi-task architecture contributes a further +3.7 points.
_alexcristea@EEUCA 2026: A Robust Early-Fusion ERNIE Pipeline for Multimodal COVID-19 Vaccine Meme Classification
Cristea Alexandru-Marian | Costin Ionescu
This paper presents our system for the EEUCA 2026 shared task on Multimodal Vaccine Critical Meme Detection. The task focuses on categorizing social media memes from the VaxMeme dataset into three stances: Vaccine Critical, Neutral, and Pro-Vaccine. To tackle the inherent challenges of internet sarcasm, implicit context, and high label noise, we propose a robust, heavily regularized text-fusion pipeline. Rather than relying on computationally heavy visual encoders, we extract text directly from the images via OCR and concatenate it with the user’s social media post, processing the unified context through an ERNIE 2.0-Large encoder. To combat the severe overfitting typical in subjective meme datasets, we replace the standard classification head with a Multi-Sample Dropout architecture, averaging predictions across five parallel dropout masks (p = 0.3). Our optimized, lightweight text-only pipeline achieves a peak Macro F1 score of 0.834. Furthermore, an ablation study utilizing Focal Loss reveals that our primary solution using standard Cross-Entropy provides superior robustness against the inherent label noise found in internet meme datasets.
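The Multi-Sample Dropout head described above can be sketched in plain Python as follows: the same pooled feature vector is passed through several independent dropout masks, each masked copy goes through one shared linear head, and the resulting logits are averaged. This is a minimal sketch under stated assumptions: the function names, the shared linear head, and the use of inverted dropout scaling are illustrative choices, not details confirmed by the abstract (only five masks and p = 0.3 are).

```python
import random
from typing import List, Optional, Sequence


def linear(features: Sequence[float], weights: Sequence[Sequence[float]], bias: Sequence[float]) -> List[float]:
    """Plain linear layer: one logit per row of `weights`."""
    return [sum(f * w for f, w in zip(features, row)) + b for row, b in zip(weights, bias)]


def multi_sample_dropout_logits(
    features: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    num_samples: int = 5,
    p: float = 0.3,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Average the logits of a shared linear head over several dropout masks.

    Requires 0 <= p < 1. Inverted dropout is assumed: kept features are
    scaled by 1 / (1 - p) so the expected input to the head is unchanged.
    """
    rng = rng or random.Random()
    keep = 1.0 - p
    totals = [0.0] * len(bias)
    for _ in range(num_samples):
        masked = [f / keep if rng.random() < keep else 0.0 for f in features]
        logits = linear(masked, weights, bias)
        totals = [t + l for t, l in zip(totals, logits)]
    return [t / num_samples for t in totals]


# Hypothetical 2-feature, 2-class head for illustration.
features = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.5]
averaged = multi_sample_dropout_logits(features, weights, bias, rng=random.Random(0))
```

With p = 0 every mask keeps all features, so the function reduces to the plain linear head; at inference time dropout is normally disabled in the same way.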
PSK@EEUCA 2026: Fine-tuning Large Language Models with Synthetic Data Augmentation for Multi-class Toxicity Detection in Gaming Chat
Srikar Kashyap Pulipaka
This paper describes our system for the EEUCA 2026 Shared Task on Understanding Toxic Behavior in Gaming Communities. The task involves classifying World of Tanks chat messages into six toxicity categories: Non-toxic, Insults/Flaming, Other Offensive, Hate/Harassment, Threats, and Extremism. We explore multiple approaches including encoder-based models, instruction-tuned LLMs with LoRA fine-tuning, hierarchical classification, one-vs-rest strategies, and various ensemble methods. Our best system combines Llama 3.1 8B with carefully calibrated 5% synthetic data augmentation, achieving an F1-macro score of 0.6234 on the test set, placing 4th out of 35 participating teams. We provide extensive analysis of the dataset’s annotation patterns and their impact on model generalization, revealing a critical “validation trap” phenomenon where high validation performance correlates with poor test transfer.
TAGA@EEUCA 2026: Token-Attribution Guided Attention for Fine-Grained Toxic Behaviour Classification in Online Gaming Communities
Akshyat Shah | Shashi Sah | Aryan Gupta | Kavinder Singh
Online gaming brings together large numbers of players who interact in real time. Toxic behavior in online chat is common and can harm players and deter them. Automated moderation is therefore a necessity, but it is difficult because game chat mixes domain-specific slang, deliberate obfuscation, and informal "gamer" language, with very few examples for categories such as threats and extremism. This paper describes the TAGA (Token-Attribution Guided Attention) system submitted to the EEUCA 2026 Shared Task on Understanding Toxic Behavior in Gaming Communities. We propose TAGA, an architecture that employs a leave-one-out attribution method using the Detoxify toxicity scorer to compute per-token attribution scores across multiple toxicity dimensions, which are then projected into learned attention biases that steer the model toward toxicity-indicative tokens. Through a five-phase ablation study, we demonstrate that each component (domain-specific preprocessing, focal loss with label smoothing, attribution-guided attention pooling, and dual-model Detoxify features with strategic oversampling) contributes to a cumulative gain in macro-F1 over the reported DeBERTa-v3-base baseline. The final system achieves a test macro-F1 score of 0.618 and, importantly, produces non-zero predictions despite the extreme data imbalance present in the shared task dataset.
LilyMeme@EEUCA 2026: Multimodal Vaccine Meme Stance Detection with Task-Adapted MemeCLIP and Complementary Ensembling
Yixuan Li | Xiaolong Yin | Yang Yang
Memes have emerged as a prominent medium for conveying public sentiment on sensitive health topics such as vaccination. Unlike conventional multimodal tasks, memes feature implicit stances, sarcastic nuances, and complex cross-modal interactions, posing significant challenges for accurate stance detection. This paper presents our approach for the VaxMeme Shared Task @EEUCA 2026, which aims to classify vaccine-related memes into three distinct classes: Vaccine-critical, Neutral, and Pro-vaccine. Building upon MemeCLIP, we systematically enhance our framework via task-specific adaptation, lightweight cross-modal fusion, noise-aware training, LLM-assisted semantic augmentation, and inference-stage optimization, ultimately ensembling multiple complementary variants for final predictions. Our ensemble method achieves a Macro-F1 score of 0.8494 on the official test set, securing first place and demonstrating the critical efficacy of noise-aware training and late-stage ensembling for robust stance identification.
LINUS@EEUCA 2026: Fine-grained Toxicity Detection in Gaming Chat using Multilingual Transformers
Prajwal Ghimire | Aashish Mahato | Sunil Regmi
The detection of toxic behavior in online gaming communities is crucial for maintaining safe digital spaces, yet remains challenging due to subtle context-dependent and intent-driven language. The GameTox dataset consists of around 53K World of Tanks chat utterances annotated across six categories: Non-toxic, Insults and Flaming, Other Offensive Texts, Hate and Harassment, Threats, and Extremism. Among the transformer-based architectures we experimented with, our best-performing approach is based on the multilingual BERT variant mmBERT-base fine-tuned with class-weighted cross-entropy loss. The best mmBERT-base model achieved a Macro F1 of 0.5882 during validation and an official test Macro F1 of 0.5104 on the shared task leaderboard. An internal held-out evaluation on a development split yielded 0.4282, which we analyze to understand distributional sensitivity to gaming slang and class imbalance. The code is available at: https://github.com/sunilRegmi-ai/eeuca-toxicity-detection.
Linus@EEUCA 2026: Multimodal and Text-Only Approaches to Vaccine-Critical Meme Detection.
Darwin Acharya | Shiv Ram Saud | Sunil Regmi
In this paper, we describe our participation in the Shared Task on Multimodal Identification of Vaccine Critical Content on Social Media (VaxMeme) of EEUCA 2026, a satellite of ACL 2026. We tackle the classification of Twitter-based vaccine memes into anti-vaccine, neutral, and pro-vaccine categories using the VaxMeme dataset with 8,195 train, 1,024 val, and 1,025 test samples. We experiment with two different architecture families: (i) Multimodal hybrids: CLIP ViT-B/32 for images + BERT-based models for texts (BERT-base-uncased, ModernBERT) with a late fusion strategy based on concatenation of L2-normalized feature vectors and (ii) Text-only: pre-trained models for texts (BERT-base-uncased, RoBERTa-base, ModernBERT-base, DistilBERT-base, DeBERTa-v3-base) for post_text. In both cases, we use a three-layer feed-forward network with GELU activation for classification. We use class-weighted cross-entropy loss, differential learning rates, AdamW optimizer, gradient accumulation, OneCycleLR scheduler, and early stopping on the val set for optimization. Data augmentation is applied for the multimodal CLIP-based approach only. The best approach among those tested is the text-only BERT-base-uncased with a macro-F1 of 0.8102, ahead of the CLIP + BERT-base hybrid model, which achieves a test macro-F1 of 0.7603.
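The late fusion step named in this abstract (concatenating L2-normalized feature vectors) can be illustrated with the short sketch below. The helper names and the tiny example vectors are hypothetical; real CLIP and BERT embeddings have hundreds of dimensions, and the classifier that consumes the fused vector is not shown.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit Euclidean length (eps guards against zero vectors)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


def late_fuse(image_features: Sequence[float], text_features: Sequence[float]) -> List[float]:
    """Concatenate the two modality vectors after normalizing each one separately."""
    return l2_normalize(image_features) + l2_normalize(text_features)


# Made-up low-dimensional features standing in for image and text embeddings.
fused = late_fuse([3.0, 4.0], [0.0, 2.0])
```

Normalizing each modality before concatenation keeps one encoder's larger feature magnitudes from dominating the fused representation.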
Proceedings of the Workshop on Evaluating Evaluations (EvalEval)
Mubashara Akhtar | Jan Batzner | Leshem Choshen | Avijit Ghosh | Usman Gohar | Jennifer Mickel | Ichhya Pant | Zeerak Talat | Michelle Lin
Rigorous Interpretation Is a Form of Evaluation
Isabelle Lee | Emmy Liu | Cathy Jiao | Brihi Joshi | Dani Yogatama | Fazl Barez | Michael Saxon
Current machine learning models are evaluated through behavioral snapshots, with benchmark accuracies, win rates and outcome-based metrics. Model explanations and evaluations, however, are fundamentally intertwined: understanding why a model produces a behavior can be as important as measuring what it produces. If interpretability can be trusted, we argue, it can serve not merely as diagnostics but as a richer and more principled form of model evaluation beyond surface-level performance metrics. We explore three ways interpretability can function evaluatively: (1) fixing problems by identifying the root causes of unwanted behavior, (2) detecting subtly faulty mechanisms that invalidate model outputs, and (3) predicting potential issues before they arise by fully understanding the model’s weaknesses. To fulfill its evaluative potential, we argue that interpretability methods must generate claims that are falsifiable, reproducible, and predictive; that is, interpretability must meet scientific standards.
Large language models (LLMs) are increasingly used as collaborative assistants, yet dominant NLP evaluation practices remain centered on aggregate metrics such as accuracy and fluency. These approaches often overlook behaviors that are critical in human-facing settings (e.g., consistency across multiple turns and iterative refinement). In this paper, we examine limitations of current NLP evaluation practices and introduce TCR, a structured framework for evaluating human–AI interaction using educational LLM assistants as an illustrative example. TCR emphasizes dimensions such as transparency, consistency, and refinement. We further present structured evaluation prompts and illustrative interaction examples demonstrating how structured evaluation can complement aggregate metrics and LLM-as-a-judge approaches. Our work highlights the need for more human-centered evaluation practices for interactive LLM systems.
Guidelines for Whom? Rethinking AI Ethics in Resource-Constrained Migration Services
Nari Yoo | Ashley Khor | Namrata Mukhija | Aminat Adebiyi | Miri Zilka
AI ethics guidelines for humanitarian settings have grown in number and scope. Whether they produce their intended outcomes depends on which deployers are expected to follow them. These guidelines respond to documented risks: surveillance, data misuse, and discriminatory outcomes affecting refugee populations. For high-risk applications such as biometric identification and asylum adjudication, the concerns they address are genuine. Many differentiate risk tiers in principle, yet the compliance expectations they establish (staff capacity, technical infrastructure, formal evaluation) reflect the organizational contexts in which they were developed. Many nonprofits providing frontline services to refugees operate with limited administrative capacity. When compliance requirements exceed what these organizations can meet, formal AI adoption stalls, while informal adoption proceeds without oversight or recourse. Current guidelines also tend to treat non-adoption as a neutral default, without accounting for the service gaps that follow when AI-assisted language access is unavailable. Drawing on collaboration with refugee-serving practitioners, we show that this gap between governance design and organizational reality has consequences for the people these guidelines are meant to protect. Evaluating AI guidelines, we argue, requires the same realist logic that evaluation research has long applied to social programs: not "does this guideline exist?" but "for which deployers, under what conditions, and does it produce its intended protective outcomes?"
Evaluating Large Language Model News Sentiment in Finance under Liquidity and Market Frictions
Kemal Kirtac
This paper studies whether large language models can extract useful sentiment signals from firm-specific financial news when evaluation accounts for realistic market frictions. Many financial NLP studies report strong offline prediction results, but these do not always show whether model outputs remain useful once trading constraints are imposed. I address this gap by evaluating sentiment models through classification performance, return predictability, and implementable portfolio performance. The analysis links Refinitiv News Analytics to CRSP and begins with 3,129,924 U.S. news items published between January 1, 2010 and January 30, 2026. Filtering retains single-firm stories, removes redundant coverage using a five-day cosine-similarity novelty screen, and restricts the sample to tradable stocks with positive bid and ask quotes, minimum share and dollar volume thresholds, quoted spreads below 20%, and available Amihud illiquidity ratios and Kyle’s lambda estimates. The final sample contains 973,481 tradable news items linked to 3,452 firms. I compare six sentiment approaches: LLaMA–3, OPT, RoBERTa, BERT, FinBERT, and the Loughran–McDonald dictionary. LLaMA–3 achieves the strongest classification performance with 78.2% accuracy and produces the largest predictive coefficients in panel regressions. Daily rebalanced long–short portfolios with a 5 bps trading cost show that the LLaMA–3 strategy earns a cumulative return of approximately 180% from June 2024 to January 2026, followed by OPT with 155% and RoBERTa with 120%, while the dictionary-based strategy loses 9%. The results show that evaluation becomes more informative when financial NLP models are assessed beyond offline accuracy and under realistic deployment constraints. High-capacity language models retain economically meaningful predictive content under market frictions, whereas simpler lexicon-based methods do not.
Standard benchmarks for large language models (LLMs) assume that task feedback is truthful, but real-world reasoning often requires processing unreliable or adversarial information. We introduce WordleArenas, a benchmark platform that evaluates LLM reasoning robustness across a deception gradient. Building on Wordle and its deceptive variant Fibble (Chusap et al., 2025), we generalize to Fibblek (k = 0, ..., 5 lies per row), creating a controlled evaluation of LLM robustness to misinformation. Across six arenas, from standard Wordle (0 lies per row) through Fibble5 (5 lies per row), we evaluate 41 models from 10 providers across 3,749 games. We find that (1) even one lie per row causes catastrophic performance drops (average win rate falls from 41.1% to 18.7%), (2) a sharp deception cliff emerges at 2–3 lies where nearly all models collapse to ≤3% win rate, and (3) model robustness to deception is poorly predicted by standard benchmark rankings. A surprising Fibble5 recovery emerges: some models recover partial performance when all feedback lies (average 9.5%), outperforming Fibble3 (0.3%) and Fibble4 (0.4%), because knowing that every tile lies restores deterministic, though partial, information. Our results demonstrate that truthful-feedback evaluations systematically overestimate LLM reasoning capabilities and that deception-aware benchmarks are essential for assessing real-world robustness. All code and data are publicly available.
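To make the "k lies per row" setup concrete, here is a minimal sketch of Wordle-style feedback in which exactly k tiles per row report a wrong color. It is an illustration under assumptions, not the benchmark's implementation: the abstract does not say how lying tiles are chosen or which wrong color they show, so this sketch picks lying positions and wrong colors uniformly at random, and the color codes G/Y/B are hypothetical.

```python
import random
from collections import Counter
from typing import List

COLORS = ("G", "Y", "B")  # green, yellow, blank/gray (hypothetical encoding)


def true_feedback(guess: str, secret: str) -> List[str]:
    """Standard Wordle feedback with the usual handling of repeated letters."""
    feedback = ["B"] * len(guess)
    remaining = Counter()
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            feedback[i] = "G"
        else:
            remaining[s] += 1  # secret letters still available for yellow matches
    for i, g in enumerate(guess):
        if feedback[i] != "G" and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1
    return feedback


def lying_feedback(guess: str, secret: str, k: int, rng: random.Random) -> List[str]:
    """Feedback in which exactly k tiles show a color different from the truth."""
    feedback = true_feedback(guess, secret)
    for i in rng.sample(range(len(feedback)), k):
        # Assumption: a lying tile shows one of the two wrong colors at random.
        feedback[i] = rng.choice([c for c in COLORS if c != feedback[i]])
    return feedback


row = lying_feedback("abced", "abcde", k=2, rng=random.Random(0))
```

When k equals the word length, every displayed color is known to be false, so each tile still rules out one possibility; this is the kind of deterministic but partial information the abstract credits for the recovery at five lies.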
Mind the Gap: How Elicitation Protocols Shape the Stated-Revealed Preference Gap in Language Models
Pranav Mahajan | Ihor Kendiukhov | Syed Hussain | Lydia Nottingham
Recent work identifies a stated–revealed (SvR) preference gap in language models (LMs): a mismatch between the values models endorse and the choices they make in context. Existing evaluations rely heavily on binary forced-choice prompting, which entangles genuine preferences with artifacts of the elicitation protocol. We systematically study how elicitation protocols affect SvR correlation across 24 LMs. Permitting neutrality and abstention during stated preference elicitation lets us exclude weak signals, substantially improving Spearman’s rank correlation (ρ) between volunteered stated preferences and forced-choice revealed preferences. However, further allowing abstention in revealed preferences drives ρ to near-zero or negative values due to high neutrality rates. Finally, we find that system prompt steering using stated preferences during revealed preference elicitation does not reliably improve SvR correlation on AIRiskDilemmas. Together, our results show that SvR correlation is highly protocol-dependent and that preference elicitation requires methods that account for indeterminate preferences.
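The following sketch shows one way Spearman's ρ could be computed between stated and revealed preference scores while dropping items on which the model abstained. It is only illustrative: encoding abstention as `None`, the helper names, and the example scores are assumptions, and the paper's actual scoring of preferences is not described in the abstract.

```python
import math
from typing import List, Optional, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def spearman_excluding_abstentions(
    stated: Sequence[Optional[float]], revealed: Sequence[float]
) -> float:
    """Drop items where the stated preference is None (abstained), then correlate."""
    pairs = [(s, r) for s, r in zip(stated, revealed) if s is not None]
    return spearman_rho([s for s, _ in pairs], [r for _, r in pairs])


# Made-up scores: the second item was abstained on and is excluded.
rho = spearman_excluding_abstentions([3.0, None, 1.0, 2.0], [0.9, 0.1, 0.2, 0.5])
```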
When Scanners Lie: Evaluator Instability in LLM Red-Teaming
Lidor Erez | Omer Hofman | Tamir Nizri | Roman Vainshtein
Automated LLM vulnerability scanners are increasingly used to assess security risks by measuring different attack type success rates (ASR). Yet the validity of these measurements hinges on an often-overlooked component: the evaluator who determines whether an attack has succeeded. In this study, we demonstrate that commonly used open-source scanners exhibit measurement instability that depends on the evaluator component. Consequently, changing the evaluator while keeping the attacks and model outputs constant can significantly alter the reported ASR. To tackle this problem, we present a two-phase, reliability-aware evaluation framework. In the first phase, we quantify evaluator disagreement to identify attack categories where ASR reliability cannot be assumed. In the second phase, we propose a verification-based evaluation method where evaluators are validated by an independent verifier, enabling reliability assessment without relying on extensive human annotation. Applied to the widely used Garak scanner, we observe that 22 of 25 attack categories exhibit evaluator instability, reflected in high disagreement among evaluators. Our approach raises evaluator accuracy from 72% to 89% while enabling selective deployment to control cost and computational overhead. We further quantify evaluator uncertainty in ASR estimates, showing that reported vulnerability scores can vary by up to ±33% depending on the evaluator. Our results indicate that the outputs of vulnerability scanners are highly sensitive to the choice of evaluators. Our framework offers a practical approach to quantify unreliable evaluations and enhance the reliability of measurements in automated LLM security assessments.
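A small sketch can show how the reported attack success rate (ASR) moves when only the evaluator changes. The verdicts below are invented, the evaluator names are placeholders, and the mean pairwise disagreement used here is one simple choice of disagreement measure; the abstract does not specify the metric the authors use, and nothing here calls the Garak scanner.

```python
from itertools import combinations
from typing import Dict, Sequence


def attack_success_rate(verdicts: Sequence[bool]) -> float:
    """Fraction of attack attempts an evaluator marks as successful."""
    return sum(verdicts) / len(verdicts)


def mean_pairwise_disagreement(verdicts_by_evaluator: Dict[str, Sequence[bool]]) -> float:
    """Average, over evaluator pairs, of the share of attempts judged differently."""
    pairs = list(combinations(verdicts_by_evaluator, 2))
    total = 0.0
    for first, second in pairs:
        a, b = verdicts_by_evaluator[first], verdicts_by_evaluator[second]
        total += sum(x != y for x, y in zip(a, b)) / len(a)
    return total / len(pairs)


# Hypothetical verdicts from three evaluators on the same four model outputs.
verdicts = {
    "evaluator_a": [True, True, False, False],
    "evaluator_b": [True, False, False, False],
    "evaluator_c": [True, True, True, False],
}
asr = {name: attack_success_rate(v) for name, v in verdicts.items()}
spread = max(asr.values()) - min(asr.values())
disagreement = mean_pairwise_disagreement(verdicts)
```

Here the attacks and outputs are identical across evaluators, yet the ASR ranges from 0.25 to 0.75, which is the kind of evaluator-driven variation the abstract describes.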
Reasoning Model Is Superior LLM-Judge, Yet Suffers from Biases
Hui Huang | Xuanxin Wu | Muyun Yang | Yuki Arase
This paper presents the first systematic comparison investigating whether Large Reasoning Models (LRMs) are superior judges to non-reasoning LLMs. Our empirical analysis yields four key findings: 1) LRMs outperform non-reasoning LLMs in terms of judgment accuracy, particularly on reasoning-intensive tasks; 2) LRMs demonstrate superior evaluation instruction-following capabilities; 3) LRMs exhibit enhanced robustness against adversarial attacks targeting judgment tasks; 4) However, LRMs still exhibit strong evaluation biases. To mitigate this bias vulnerability, we propose PlanJudge, a lightweight evaluation strategy that prompts the model to generate an explicit evaluation plan before executing the judgment. Despite its simplicity, our experiments demonstrate that PlanJudge significantly mitigates biases in LLM-as-a-Judge while preserving overall judgment accuracy.
From Rubrics to Recipe: Principle-Centric Benchmark for Evaluating Large Language Models
Shirley Anugrah Hayati | Ruizi Wang | Dongyeop Kang
Large language models (LLMs) are often evaluated on benchmarks that rely on surface-level instructions, obscuring what defines high-quality performance. We argue that tasks can be more precisely characterized through principles: human-readable rules that specify what matters for a good response to the task. Our study proposes a framework to automatically extract and generate task-level principles for data generation and evaluation. Using this approach, we build a benchmark of over 20K principle-aligned instances, enabling controllable data creation and fine-grained, interpretable assessment of LLMs. Experiments show that principles both improve output quality and scale evaluation beyond manual curation, offering a new recipe for principled assessment of LLM capabilities.
Mathematical benchmarks consisting of a range of mathematics problems are widely used to evaluate the reasoning abilities of large language models, yet little is known about how their structural properties influence model behaviour. In this work, we investigate two structural length variables, prompt length and solution length, and analyse how they relate to model performance on a newly constructed adversarial dataset of expert-authored mathematics problems. Across five evaluated models, we find that both prompt length and solution length are positively associated with model failure. These associations are statistically significant but modest, and we interpret them as descriptive rather than causal. We also include a secondary, exploratory analysis of cross-model disagreement. Because disagreement measures based on variance are mechanically constrained by mean failure, we treat this part of the analysis cautiously. Overall, our main finding is that structural length is linked to empirical difficulty in this benchmark, suggesting that length should be considered as a potential confounder when interpreting mathematical model evaluations.
Benchmarks for assessing large language model (LLM) capabilities have been criticized for a lack of construct validity. Here, we focus on an often overlooked dimension of a benchmark’s validity: namely, the functional mapping between a benchmark’s numerical score and the underlying quantity the benchmark purports to measure. What licenses the assumption that equivalent intervals on a scale correspond to equivalent differences in the underlying capability? We argue that this question is not merely theoretical: the form of this mapping (e.g., linear vs. logarithmic vs. exponential) could and should influence decisions about deployment and regulatory policy. Drawing on work from the history and philosophy of science, we discuss an analogous problem in the early history of thermometry termed the problem of nomic measurement, as well as the epistemic practices that enabled scientists to overcome these challenges. We then ask whether a similar process of epistemic iteration can overcome this problem in benchmarking. Despite clear differences between temperature and “capabilities” as constructs, we argue that some modest success could be achievable in the domain of benchmarking—but that this depends crucially on the clear articulation of a researcher’s goals and theoretical commitments.
Caged Birds and Cute Bookworms: Feminine Tropes and Implicit Gender Bias in Large Language Models
Sachita Nishal | Jack Bandy
This paper introduces a curated dataset for diagnosing implicit gender bias through feminine tropes in narratives generated by large language models. Drawing from a crowd-sourced database of tropes from television media, we create prompts that elicit narratives from LLMs based on historically gendered tropes. We find that LLMs tend to revert to feminine characters in these narratives, even when prompted without explicit gender references, and also when prompted with non-binary (“they/them”) gender references for the main character. In some cases, even when prompted with masculine pronouns (“he/him”), LLMs still use feminine pronouns to describe the main character. The paper describes our dataset creation process and the evaluation of four open-weight models. We discuss implications for future research in mitigating implicit gender bias and its associated representational harms in LLMs, as well as the complex relationship between language models and societal values.
Effective AI risk assessment relies on the quality of evaluations. Currently, there are large quality differences, such as in construct validity and annotation, between existing benchmarks. In this work, we propose a quality scorecard for benchmarks designed to make this diversity easier to navigate. The scorecard employs two main components: dimensions, which provide granular scores of an evaluation under that dimension, and classifications, which correspond to concrete use-cases ranging from research to post-deployment. By establishing a common language and objective methods, this framework aims to aid in transparency and raise the baseline quality of benchmarks used across the ecosystem.
Defining Cultural Capabilities for AI Evaluation: A Taxonomy Grounded in Intercultural Communication Theory
Isar Nejadgholi | Masoud Kianpour | Krishnapriya Vishnubhotla | Maryam Molamohammadi
Tremendous efforts have been put into evaluating the inclusivity and effectiveness of AI systems across cultures. However, the cultural capabilities considered in much of the literature remain vaguely defined, are referred to using interchangeable terminology, and are typically limited to recalling accurate information about various demographics, regions, and nationalities. To address this construct ambiguity, we draw from Intercultural Communication scholarship and propose a three-level taxonomy of AI-relevant cultural capabilities: Cultural Awareness answers “Does the model know?”, Cultural Sensitivity answers “How does it frame its knowledge?”, and Cultural Competence answers “Can it adapt as the interaction evolves?”. Beyond conceptual clarification, we position this taxonomy as a practical tool for improving the validity and interpretability of AI evaluation in real-world, multicultural settings. Without such construct clarity, evaluation results risk overstating model capabilities and may lead to inappropriate deployment decisions in culturally sensitive contexts.
BenchNavigator: A Discovery Interface for Comparing LLM Benchmarks
Anna Sokol | Inge Vejsbjerg | Elizabeth M. Daly | David Piorkowski | Michael Hind | Nuno Moniz | Nitesh V. Chawla
Evaluating large language models (LLMs) requires selecting benchmarks that fit the intended use case. However, the rapid growth of benchmarks has made discovery and comparison difficult, because practitioners must assemble information across papers, repositories, and dataset cards with heterogeneous metadata, inconsistent terminology, and uneven documentation. Prior work improves individual benchmark documentation and quality assessment, but does not provide a uniform way to compare benchmarks during discovery. We survey practitioners, analyze multi-source benchmark metadata, and identify the fields needed for effective benchmark discovery. We introduce BenchNavigator, a prototype that organizes heterogeneous metadata into a coherent, provenance-preserving interface aligned with practitioner priorities. Our results show that benchmark metadata can be presented in a comparable form without imposing new reporting burdens on benchmark producers. We frame this contribution as discovery infrastructure, not as a method for scoring benchmark quality or replacing contextual evaluation.
Beyond Static Benchmarks: A Validity, Reliability, and Sociotechnical Framework for Evaluating LLMs in Deployment Contexts
Ben Jenkins
Static leaderboards summarize large language model (LLM) performance but offer weak evidence under shifting usage, noisy inputs, and plural stakeholder values. We present VRS-Eval, operationalizing deployment validity (benchmark vs. deployment score alignment), operational reliability (stability under a declared perturbation family), and sociotechnical alignment (metric vs. elicited rubric weights as a thin audit summary). With a reproducible simulator under explicit PB vs. PD shift and multi-turn interaction, we stress-test evaluation protocols in a controlled environment: under our main setting, benchmark-side scores (on PB) exceed estimated deployment-side utility scores (evaluated on trajectories from PD) by roughly 21–26% in relative terms across three metrics, with tight 95% percentile intervals (K=200). Failure mixtures emphasize overfitting, shift fragility, and rubric misalignment, consistent with first- vs. third-party reporting asymmetries (Reuel et al., 2025). A staged pipeline narrows the validity gap and raises reliability for the same generative story. Sensitivity sweeps over |Ω| and rubric-label rate preserve the rank ordering of harnesses, suggesting the qualitative conclusions are robust to plausible design-choice variation within the simulator. We discuss harness and accountability implications.
From Guidelines to Guarantees: A Graph-Based Evaluation Harness for Domain-Specific Evaluation of LLMs
Jessica M. Lundin | Usman Nasir Nakakana | Guillaume Chabot-Couture
Rigorous evaluation of domain-specific language models requires benchmarks that are comprehensive, contamination-resistant, and maintainable. Static, manually curated datasets do not satisfy these properties. We present a graph-based evaluation harness that transforms structured clinical guidelines into a queryable knowledge graph and dynamically instantiates evaluation queries via graph traversal. The framework provides three guarantees: (1) complete coverage of guideline relationships; (2) surface-form contamination resistance through combinatorial variation; and (3) validity inherited from expert-authored graph structure. Applied to the WHO IMCI guidelines, the harness generates clinically grounded multiple-choice questions spanning symptom recognition, treatment, severity classification, and follow-up care. Evaluation across five language models reveals systematic capability gaps. Models perform well on symptom recognition but show lower accuracy on treatment protocols and clinical management decisions. The framework supports continuous regeneration of evaluation data as guidelines evolve and generalizes to domains with structured decision logic. This provides a scalable foundation for evaluation infrastructure. Data and Code Availability: The WHO IMCI handbook is publicly available (WHO, 2014). Our graph construction, question generation code, and generated question dataset are available at https://github.com/jessicalundin/graph_testing_harness.
Document Overlap Is Not Evidence Continuity: Measuring Retrieval Jitter in Citation-Based RAG Evaluation
Punitha Ponnuraj
RAG evaluations often rely on citations or retrieved evidence traces for correctness checks, provenance claims, and audits, implicitly assuming that evidence remains reproducible under routine retrieval settings. We test this assumption in a controlled diagnostic study where queries, embeddings, and decoding are fixed while retrieval depth, chunk size, and overlap vary. We call the resulting change in attributed evidence retrieval jitter and measure evidence identity at two levels: document (doc_id) and exact cited span (doc_id, span_hash). Across BEIR ArguAna and SciFact, we observe a consistent Stability Gap: document overlap remains moderate while span overlap often collapses, including many cases of total span turnover despite non-empty retrieval. We interpret span-level instability as a diagnostic of exact evidence-trace reproducibility, not semantic equivalence. These findings motivate reporting stability diagnostics alongside citation-based evaluation metrics for more reproducible evaluation practice.
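As a rough illustration of the two levels of evidence identity described above, the sketch below compares the evidence cited under two retrieval configurations. It assumes Jaccard overlap as the overlap measure, which the abstract does not specify; the function names, the `gap` field, and the sample identifiers are hypothetical and are not taken from the paper.

```python
from typing import Dict, Iterable, Set, Tuple

# A cited piece of evidence, identified as (doc_id, span_hash).
Span = Tuple[str, str]


def jaccard(a: Set, b: Set) -> float:
    """Jaccard overlap; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def stability(run_a: Iterable[Span], run_b: Iterable[Span]) -> Dict[str, float]:
    """Compare two retrieval runs at document level and at exact-span level."""
    spans_a, spans_b = set(run_a), set(run_b)
    docs_a = {doc_id for doc_id, _ in spans_a}
    docs_b = {doc_id for doc_id, _ in spans_b}
    doc_overlap = jaccard(docs_a, docs_b)
    span_overlap = jaccard(spans_a, spans_b)
    return {
        "doc_overlap": doc_overlap,
        "span_overlap": span_overlap,
        # Assumed definition of the gap: document overlap minus span overlap.
        "gap": doc_overlap - span_overlap,
    }


# Hypothetical runs that differ only in chunk size: two documents are shared,
# but every exact span differs (total span turnover).
run_small_chunks = [("d1", "h1"), ("d2", "h2"), ("d3", "h3")]
run_large_chunks = [("d1", "h9"), ("d2", "h8"), ("d4", "h7")]
result = stability(run_small_chunks, run_large_chunks)
print(result)
```

In this toy case the document overlap is moderate while the span overlap is zero, which is the pattern the abstract describes as total span turnover despite non-empty retrieval.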
Measuring AI-Induced Disempowerment: A Framework and Proposed Metrics
Je Qin Chooi | Jaeho Lee | Jasmine Xinze Li
AI systems are embedded in economic production, public discourse, governance, and personal decision-making, yet there is little empirical infrastructure for tracking whether this integration erodes humans’ ability to meaningfully shape outcomes that affect their lives. We argue that measuring AI-induced disempowerment is both urgent and tractable, and lay out a research agenda for doing so. We first operationalize disempowerment through Sen’s model of agency and a three-layer model of exposure, erosion, and lock-in, applied across economic, political, and cultural domains at individual, institutional, and civilizational scales. We survey existing measurement efforts and show that current work clusters almost entirely at exposure, leaving erosion and lock-in largely unaddressed. We then propose six concrete metrics (centaur evaluations, disempowerment perception surveys, AI content saturation and cultural convergence monitoring, monitoring capital flow to and from human labor, human task frontier tracking, and institutional ethnography) and identify which actors are best positioned to implement each. We close by discussing limitations and open challenges, including construct validity across levels of analysis, causal attribution, the distinction between disempowerment and adaptation, and the political economy of measurement.
Position: Evaluations of AI Moral Reasoning Still Miss Half of the Picture
Aidan Kierans | Ritam Dutt | Kaley Rittichier | Shiri Dori-Hacohen | Avijit Ghosh
Recent work on evaluating the moral competence of large language models (LLMs) has focused primarily on what we call the moral value problem, i.e., whether model outputs align with human moral values. In contrast, the moral norm problem, i.e., whether models can identify and correctly apply context-sensitive moral norms, remains underexplored. We posit that this imbalance stems from the field’s reliance on descriptive ethics frameworks, such as Moral Foundations Theory and Kohlberg’s stages of moral development, which emphasize value representation over normative application. We review existing benchmarks and evaluation methods, and show that they cluster heavily around the value problem, while discussion regarding normative ethics remains underrepresented. We identify three crucial gaps: (i) the absence of high-quality ground-truth data for moral norms and their applications, (ii) insufficient evaluation of intermediate reasoning processes, and (iii) limited attention to the identification of morally relevant features in context. Subsequently, we propose a research agenda that includes the development of standardized formal representations for normative theories, the construction of expert-annotated datasets capturing norm application, and evaluation protocols that explicitly distinguish between values-level and norms-level competence. Our goal is to encourage a more systematic study of normative reasoning in LLMs.
The evaluation of explainable AI (XAI) methods is affected by a lack of standardization. Metrics are inconsistently defined, incompletely reported, and rarely validated against common baselines. In this paper, we identify transparency of evaluation reporting as a central, under-addressed problem. We propose the XAI Evaluation Card, a documentation template analogous to model cards, designed to accompany any study that introduces an XAI evaluation metric. The card covers explicit declaration of target properties, grounding levels, metric assumptions, validation evidence, gaming risks, and known failure cases. We argue that adopting this template as a community norm would reduce evaluation fragmentation, support meta-analysis, and improve accountability in XAI research.
Proceedings of the Ninth Fact Extraction and VERification Workshop (FEVER)
Mubashara Akhtar | Rami Aly | Rui Cao | Christos Christodoulopoulos | Oana Cocarascu | Zhijiang Guo | Arpit Mittal | Michael Schlichtkrull | James Thorne | Andreas Vlachos
Weakly-supervised Argument Mining with Boundary Refinement and Relation Denoising
Wei Sun | Mingxiao Li | Jesse Davis | Elena Cabrio | Serena Villata | Marie-Francine Moens
Argument mining (AM) involves extracting argument components and predicting relations between them to create argumentative graphs, which are essential for applications requiring argumentative comprehension. To automatically provide high-quality graphs, previous works require a large amount of human-annotated training samples to train AM models. Instead, we leverage a large language model (LLM) to assign pseudo-labels to training samples, reducing reliance on human-annotated training data. However, the training data weakly labeled by the LLM are too noisy to develop an AM model with reliable performance. In this paper, to improve the model performance, we propose a center-based component detector that refines the boundaries of the detected components and a relation denoiser to deal with noise present in the pseudo-labels when classifying relations between detected components. Experimentally, our AM model improves on the boundary detection obtained from the LLM by up to 16% in terms of IoU75 and on the relation classification obtained from the LLM by up to 12% in terms of macro-F1 score. Our AM model achieves new state-of-the-art performance in weakly-supervised AM, showing up to a 6% improvement over the state-of-the-art component detector and up to a 7% improvement over the state-of-the-art relation classifier. Additionally, our model uses less than 20% of human-annotated data to match the performance of state-of-the-art fully-supervised AM models.
POaaS: Minimal-Edit Prompt Optimization as a Service to Lift Accuracy and Cut Hallucinations on On-Device sLLMs
Jungwoo Shim | Dae Won Kim | Sunwook Kim | Sooyoung Kim | Myungcheol Lee | Jaegeun Cha | Hyunhwa Choi
Small language models (sLLMs) are increasingly deployed on-device, where imperfect user prompts (typos, unclear intent, or missing context) can trigger factual errors and hallucinations. Existing automatic prompt optimization (APO) methods were designed for large cloud LLMs and rely on search that often produces long, structured instructions; when executed under an on-device constraint where the same small model must act as optimizer and solver, these pipelines can waste context and even hurt accuracy. We propose POaaS, a minimal-edit prompt optimization layer that routes each query to lightweight specialists (Cleaner, Paraphraser, Fact-Adder) and merges their outputs under strict drift and length constraints, with a conservative skip policy for well-formed prompts. Under a strict fixed-model setting with Llama-3.2-3B and Llama-3.1-8B, POaaS improves both task accuracy and factuality while representative APO baselines degrade them, and POaaS recovers up to +7.4% under token deletion and mixup. Overall, per-query conservative optimization is a practical alternative to search-heavy APO for on-device sLLMs.
Evidence Grounding vs. Memorization: Why Neural Semantics Matter for Knowledge Graph Fact Verification
Ankit Kumar Upadhyay | John S. Erickson | Deborah L. McGuinness
Knowledge graphs like DBpedia enable structured fact verification, but the relative contributions of symbolic structure, neural semantics, and evidence grounding remain unclear. We present a systematic study on FACTKG (108,675 claims) comparing symbolic, neural, and LLM-based approaches. Our symbolic baseline using 29 hand-crafted features covering graph structure, entity coverage, and semantic relation type achieves 66.54% accuracy, while BERT over linearized subgraphs reaches 92.68% and graph neural networks plateau at 70%, demonstrating that token-level semantics outperform both symbolic features and message passing. Using GPT-4.1-mini to filter training data, budget-matched controls show that token-budget control recovers most of the gap over truncation-dominated inputs, while LLM semantic selection adds +1.31 points beyond lexical heuristics (78.85% filtered vs. 77.54% heuristic vs. 52.70% unfiltered), showing that semantic relevance, not just evidence quantity, governs learnability. Finally, comparing 300 test claims under memorization (claim-only) versus KG-grounded reasoning with chain-of-thought, we find KG grounding improves GPT-4o-mini and GPT-4.1-mini accuracy by 12.67 and 9.33 points respectively, with models citing specific triples for interpretability. These results demonstrate that neural semantic representations and explicit KG evidence grounding are highly effective for robust, interpretable fact verification.
The Energy of Falsehood: Detecting Hallucinations via Diffusion Model Likelihoods
Arpit Singh Gautam | Kailash Talreja | Saurabh Jha
Large Language Models (LLMs) frequently "hallucinate" plausible but incorrect assertions, a vulnerability often missed by uncertainty metrics when models are "confidently wrong." We propose DiffuTruth, an unsupervised framework that re-conceptualizes fact verification via non-equilibrium thermodynamics, positing that factual truths act as stable attractors on a generative manifold while hallucinations are unstable. We introduce the "Generative Stress Test": claims are corrupted with noise and reconstructed using a discrete text diffusion model. We define Semantic Energy, a metric measuring the semantic divergence between the original claim and its reconstruction using an NLI critic. Unlike vector-space errors, Semantic Energy isolates deep factual contradictions. We further propose a Hybrid Calibration fusing this stability signal with discriminative confidence. Extensive experiments on FEVER demonstrate DiffuTruth achieves a state-of-the-art unsupervised AUROC of 0.725, outperforming baselines by +1.5% through the correction of overconfident predictions. Furthermore, we show superior zero-shot generalization on the multi-hop HOVER dataset, outperforming baselines by over 4%, confirming the robustness of thermodynamic truth properties to distribution shifts.
BiCon-Gate: Consistency-Gated De-colloquialisation for Dialogue Fact-Checking
Hyunkyung Park | Arkaitz Zubiaga
Automated fact-checking in dialogue involves multi-turn conversations where colloquial language is frequent yet understudied. To address this gap, we propose a conservative rewrite candidate for each response claim via staged de-colloquialisation, combining lightweight surface normalisation with scoped in-claim coreference resolution. We then introduce BiCon-Gate, a semantics-aware consistency gate that selects the rewrite candidate only when it is semantically supported by the dialogue context, otherwise falling back to the original claim. On the DialFact benchmark, this gated selection stabilises downstream fact-checking and yields gains in both evidence retrieval and fact verification, with particularly strong gains on SUPPORTS, and outperforms competitive baselines, including a decoder-based one-shot LLM rewrite that attempts to perform all de-colloquialisation steps in a single pass.
The Automatic Verification of Image-Text Claims (AVerImaTeC) Shared Task
Rui Cao | Yulong Chen | Zhenyun Deng | Michael Schlichtkrull | Andreas Vlachos
The Automatic Verification of Image-Text Claims (AVerImaTeC) shared task aims to advance system development for retrieving evidence and verifying real-world image-text claims. Participants were allowed to either employ external knowledge sources, such as web search engines, or leverage the curated knowledge store provided by the organizers. System performance was evaluated using the AVerImaTeC score, defined as a conditional verdict accuracy in which a verdict is considered correct only when the associated evidence score exceeds a predefined threshold. The shared task attracted 14 submissions during the development phase and 6 submissions during the testing phase. All participating systems in the testing phase outperformed the baseline provided. The winning team, HUMAN, achieved an AVerImaTeC score of 0.5455. This paper provides a detailed description of the shared task, presents the complete evaluation results, and discusses key insights and lessons learned.
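The conditional verdict accuracy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the organizers' scorer: the `Prediction` record, the verdict label strings, and the threshold value of 0.3 are hypothetical placeholders, and the evidence score is taken as an already computed number.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Prediction:
    predicted_verdict: str
    gold_verdict: str
    evidence_score: float  # assumed to be precomputed by some evidence metric


def conditional_verdict_accuracy(preds: Sequence[Prediction], threshold: float) -> float:
    """Share of claims whose verdict is right AND whose evidence score exceeds the threshold."""
    if not preds:
        return 0.0
    correct = sum(
        1
        for p in preds
        if p.evidence_score > threshold and p.predicted_verdict == p.gold_verdict
    )
    return correct / len(preds)


# Hypothetical predictions; labels and scores are made up for illustration.
examples = [
    Prediction("Refuted", "Refuted", 0.9),      # counts: right verdict, strong evidence
    Prediction("Supported", "Supported", 0.1),  # right verdict, but evidence too weak
    Prediction("Refuted", "Supported", 0.8),    # wrong verdict
    Prediction("Supported", "Supported", 0.5),  # counts
]
score = conditional_verdict_accuracy(examples, threshold=0.3)
print(score)
```

Under this definition a system cannot score well by guessing verdicts alone, since a correct verdict backed by weak evidence is counted as an error.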
Take It All: Ensemble Retrieval for Multimodal Evidence Aggregation
Max Upravitelev | Veronika Solopova | Premtim Sahitaj | Ariana Sahitaj | Charlott Jakob | Sebastian Möller | Vera Schmitt
Multimodal fact checking has become increasingly important due to the predominance of visual content on social media platforms, where images are frequently used to enhance the credibility and spread of misleading claims, while generated images become more prevalent and realistic as generative models advance. Incorporating visual information, however, substantially increases computational costs, raising critical efficiency concerns for practical deployment. In this study, we propose and evaluate the ADA-AGGR (ensemble retrievAl for multimoDAl evidence AGGRegation) pipeline, which placed second on both the dev and test leaderboards of the FEVER 9/AVerImaTeC shared task. However, long runtimes per claim highlight the efficiency challenges of designing multimodal claim verification pipelines. We therefore run extensive ablation studies and configuration analyses to identify possible performance–runtime improvements. Our experiments show that substantial efficiency gains are possible without significant loss in verification quality. For instance, we reduced the average runtime by up to 6.28× while maintaining comparable performance across evaluation metrics by aggressively downsampling input images processed by visual language models. Overall, our results highlight that careful design choices are crucial for building scalable and resource-efficient multimodal fact-checking systems suitable for real-world deployment.
REVEAL: Retrieval-Enhanced Verification for Multimodal Fact-Checking
Amina Tariq | Yova Kementchedjhieva
Multimodal misinformation combines images and text to amplify false narratives, yet most fact-checking research addresses only textual claims. The AVerImaTeC shared task introduces real-world image-text claims requiring sophisticated evidence retrieval. We present REVEAL (Retrieval-Enhanced Verification with Evidence Accumulation Loop), a system designed to overcome the “semantic gap,” defined as the disconnect between the neutral phrasing of claims and the adversarial vocabulary of debunking evidence. Unlike static baselines, REVEAL breaks down the verification task into an iterative context loop, integrating sparse and dense retrieval signals to aggressively target refuting evidence. We achieve a Verdict Accuracy of 23.6% and an Evidence Recall of 27.7% on the test set. Our results outperform the official baseline across all metrics, validating our hybrid retrieval strategy for complex multimodal verification.
VILLAIN at AVerImaTeC: Verifying Image–Text Claims via Multi-Agent Collaboration
Jaeyoon Jung | Yejun Yoon | Seunghyun Yoon | Kunwoo Park
This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision-language model agents across multiple stages of fact-checking. Textual and visual evidence is retrieved from the knowledge store enriched through additional web collection. To identify key information and address inconsistencies among evidence items, modality-specific and cross-modal agents generate analysis reports. In the subsequent stage, question-answer pairs are produced based on these reports. Finally, the Verdict Prediction agent produces the verification outcome based on the image-text claim and the generated question-answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available at https://github.com/ssu-humane/VILLAIN.
Selective Multimodal Retrieval for Automated Verification of Image–Text Claims
Yoana Tsoneva | Paul-Conrad Feig | Jiaao Li | Veronika Solopova | Neda Foroutan | Arthur Hilbert | Vera Schmitt
This paper presents an efficiency-aware pipeline for automated fact-checking of real-world image–text claims that treats multimodality as a controllable design variable rather than a property that must be uniformly propagated through every stage of the system. The approach decomposes claims into verification questions, assigns each to text- or image-related types, and applies modality-aware retrieval strategies, while ultimately relying on text-only evidence for verdict prediction and justification generation. Evaluated on the AVerImaTeC dataset within the FEVER-9 shared task, the system achieves competitive question, evidence, verdict, and justification scores and ranks fourth overall, outperforming the official baseline on evidence recall, verdict accuracy, and justification quality despite not using visual evidence during retrieval. These results demonstrate that strong performance on multimodal fact-checking can be achieved by selectively controlling where visual information influences retrieval and reasoning, rather than performing full multimodal fusion at every stage of the pipeline.
In this paper, we present our third-place system in the AVerImaTeC shared task, which combines our retrieval-augmented generation (RAG) pipeline from last year with a reverse image search (RIS) module. Despite its simplicity, our system delivers competitive performance with a single multimodal LLM call per fact-check at just 0.013 on average using GPT5.1 via the OpenAI Batch API. Our system is also easy to reproduce and tweak, consisting of only three decoupled modules (a textual retrieval module based on similarity search, an image retrieval module based on API-accessed RIS, and a generation module using GPT5.1), which is why we suggest it as an accessible starting point for further experimentation. We publish its code and prompts, as well as our vector stores and insights into the scheme's running costs and directions for further improvement.
Proceedings of the Fifth Workshop on NLP Applications to Field Linguistics
Éric Le Ferrand | Elena Klyachko | Shu Okabe | Ekaterina Voloshina | Oleg Serikov | Tatiana Shavrina | Ekaterina Vylomova
Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist
Kellen Parker van Dam | Abishek Stephen
Lexical data collection in language documentation often contains transcription errors and borrowings that can mislead linguistic analysis. We present unsupervised methods to identify phonotactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using phoneme-level and syllable-level n-gram language models, our approach identifies potential transcription errors and borrowings. We evaluate our methods against a hand-annotated gold standard and rank the phonotactic outliers using precision and recall at K. The ranking approach provides field linguists with a method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.
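For readers unfamiliar with ranking metrics, the sketch below shows the standard definitions of precision and recall at K applied to a ranked list of flagged entries. The entry identifiers and the gold set are hypothetical; nothing here reproduces the paper's data or its scoring code.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> Tuple[float, float]:
    """Precision@K and recall@K for a list ranked from most to least suspicious."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked[:k]
    hits = sum(1 for entry in top_k if entry in gold)
    precision = hits / k
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical wordlist entries ranked by an outlier score, with a
# hand-annotated gold set of true errors or borrowings.
ranked_entries = ["w3", "w7", "w1", "w5", "w2"]
gold_outliers = {"w3", "w1", "w9"}
p_at_3, r_at_3 = precision_recall_at_k(ranked_entries, gold_outliers, k=3)
print(p_at_3, r_at_3)
```

Here two of the top three flagged entries are in the gold set, so both precision and recall at K=3 come out to two thirds; in a verification workflow, K corresponds to how many entries a linguist is willing to recheck.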
Field linguistics increasingly relies on computational tools to organize, analyze, and preserve linguistic data, yet the classificatory assumptions embedded in these tools are rarely examined. A pervasive assumption is that languages can be treated as discrete, genealogically defined units, with relatedness modeled as tree-structured descent. We argue that this assumption misrepresents linguistic evidence in contact-heavy regions and risks distorting the computational mediation of field linguistic data. Focusing on South Asia, we show that widely assumed boundaries—such as the Indo-Aryan–Dravidian divide—collapse in long-standing contact zones characterized by convergence, dialect continua, and institutional multilingualism. Through historically grounded case studies including Kannada–Telugu and Tamil–Malayalam, we demonstrate how convergence, script-mediated distance, and post-hoc standardization reshape how field data is segmented, compared, and interpreted when organized through genealogical labels. We argue that contact-aware, relational models of linguistic relatedness are necessary if NLP tools are to support, rather than distort, the documentation and analysis of linguistic diversity.
Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan
Siyu Liang | Talant Mawkanuli | Gina-Anne Levow
Interlinear glossed text (IGT) creation remains a major bottleneck in linguistic documentation and fieldwork, particularly for low-resource morphologically rich languages. We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction, evaluated on Jungar Tuvan, a low-resource Turkic language. Through systematic ablation studies, we show that retrieval-augmented prompting provides substantial gains over random example selection. We further find that morpheme dictionaries paradoxically hurt performance compared to providing no dictionary at all in most cases, and that performance scales approximately logarithmically with the number of few-shot examples. Most significantly, our two-stage pipeline combining a BiLSTM-CRF model with LLM post-correction yields substantial gains for most models, achieving meaningful reductions in annotation workload. Drawing on these findings, we establish concrete design principles for integrating structured prediction models with LLM reasoning in morphologically complex fieldwork contexts. These principles demonstrate that hybrid architectures offer a promising direction for computationally light solutions to automatic linguistic annotation in endangered language documentation.
Linguistically Informed Tokenization Improves ASR for Underresourced Languages
Massimo Marie Daul | Alessio Tosolini | Claire Bowern
Automatic speech recognition (ASR) is a crucial tool for linguists aiming to perform a variety of language documentation tasks. However, modern ASR systems rely on data-hungry transformer architectures, rendering them generally unusable for underresourced languages. We fine-tune a wav2vec 2.0 ASR model on Yanyhangu, an Indigenous Australian language, comparing the effects of phonemic and orthographic tokenization strategies on performance. In parallel, we explore ASR’s viability as a tool in a language documentation pipeline. We find that a linguistically informed phonemic tokenization system substantially improves word error rate (WER) and character error rate (CER) compared to a baseline orthographic tokenization scheme. Finally, we show that hand-correcting the output of an ASR model is much faster than hand-transcribing audio from scratch, demonstrating that ASR can provide significant assistance for underresourced language documentation.
Short-form verbal arts as a speech data resource in the field
Matthew Faytak | Tianle Yang | Pius Wuchu Akumbu | Ivo Forghema Njuasi | Éric Le Ferrand
We propose a method for efficient field collection of speech data that leverages short-form verbal arts, namely riddles and proverbs, which permit a predictable transcript to be assigned to naturalistic but conventionalized utterances. As a proof of concept, we describe a 5.25-hour corpus of proverbs and riddles collected for Kom, a low-resource language of Cameroon, and conduct ASR modeling experiments on the corpus. Results suggest that the method yields high-quality speech data, albeit with relatively low lexical diversity. We highlight the alignment of the collected data with community priorities for cultural education and preservation in the Cameroonian context.
Quantitative Lect Description: A Case Study of Lemko from the Field Data of 1920s-1930s
Ilia Afanasev
While qualitative descriptions (in the form of reference grammars) and benchmarks for low-resource languages are becoming increasingly widespread, computational linguists do not often use quantitative methods to describe a new lect rather than a new model. This paper intends to close this lacuna. The case study is a Lemko text transcribed at the beginning of the twentieth century. Using morphosyntactic tagging and topic modelling, the study demonstrates areal influences and archaic features of the lect. Fine-grained evaluation significantly assists in identifying subtle patterns that are not readily apparent through traditional metrics such as accuracy score. The results highlight the necessity of a more detailed analysis of model performance, which may yield more linguistically significant results than a purely manual check. This information is present in the resulting dataset, which can be used for further investigation into the structural features of the Lemko lect.
We conduct a preliminary study of the order of subject (S), object (O), and verb (V) in Tatyshly Udmurt (Finno-Ugric) on the basis of approximately 900 clauses from oral folklore and non-folklore narratives (including contemporary texts and texts recorded earlier) using a gradient approach. We show that the most frequent word orders are SOV, SV, and OV. In full clauses (with both S and O), in folklore texts SOV order (≈ 70%) is followed by OSV order (≈ 15%). In contemporary non-folklore texts, however, SOV order competes with SVO order (50% vs 30%), which may be explained by the influence of Russian. We note that full clauses may differ from clauses with only S or with only O: in contemporary folklore texts VS order is much more frequent in S-only clauses (≈ 23%) than in full ones (≈ 4%), and in contemporary non-folklore texts VO order is more frequent in full clauses (≈ 35%) than in O-only ones (≈ 12%). Moreover, we show that word order can depend on the type of clause. For example, in existential clauses the order is almost always SV, while clauses with verbs of speech often have VS order.
Proceedings of the Fifth Workshop on Generation, Evaluation and Metrics (GEM)
Simon Mille | Sebastian Gehrmann | Patrícia Schmidtová | Ondřej Dušek | Marzieh Fadaee | Kyle Lo | Enrico Santus | Gabriel Stanovsky
CoSy: Conversational Synthesis for Grounded Question Answering
Patrick Huber | Arash Einolghozati | Rylan Conway | Kanika Narang | Matt Smith | Waqar Nayyar | Adithya Sagar | Ahmed A Aly | Akshat Shrivastava
High-quality, large-scale conversational datasets are scarce, making it difficult to train on-device language models (~1B parameters) as effective assistants. We introduce CoSy (Conversational Synthesis), a novel framework for generating diverse, steerable, multi-turn conversations at scale. CoSy combines three key mechanisms: (1) conversational graphs that ensure natural dialogue flow, (2) turn-based prompt augmentations for diversity, and (3) explicit linguistic phenomena for coherence. We evaluate CoSy on conversational grounded reasoning tasks (i.e., answering questions based on contextual information), a core on-device use case. Our on-device-sized models trained on CoSy-synthesized data achieve competitive performance with human-annotated baselines and outperform instruction-tuned models of up to 70B parameters in zero-shot settings.
VAIDYA: Validated Agents for Intelligent Diagnosis and Yielded Analysis
Kalash Shah | Gautam Bhutani | Rohitaswa Sarbhangia | J Snehan
Recent advances in large language models (LLMs) have demonstrated impressive medical reasoning capabilities. However, current evaluation methods are mostly limited to static case vignettes and multiple-choice questions, which fail to reflect the complexity, uncertainty, and iterative nature of real-world clinical decision-making. To bridge this gap, we propose **DiagBench**, a novel benchmark where models interact dynamically with an LLM-based Patient Simulator, querying relevant clinical details to formulate accurate diagnoses. To complement this, we introduce **MedConvBench**, a diagnostic conversation benchmark designed to assess the relevance and quality of model-generated clinical reasoning. To further address the interpretability and alignment challenges of AI-assisted diagnosis, we develop a modular and medically grounded framework called **VAIDYA** that mirrors a physician’s stepwise diagnostic reasoning. This structured approach improves transparency and yields substantial performance gains over base LLMs. Our work takes a critical step toward aligning AI systems with real-world clinical practices by combining dynamic interaction, interpretability, and clinical validation.
Self-Anchoring Calibration Drift in Large Language Models: How Multi-Turn Conversations Reshape Model Confidence
Harshavardhan
We study Self-Anchoring Calibration Drift (SACD), a tendency for large language models (LLMs) to show systematic changes in expressed confidence when building iteratively on their own prior outputs across multi-turn conversations. Through a controlled three-condition study comparing Claude Sonnet 4.6, Gemini 3.1 Pro, and GPT-5.2 across factual, technical, and open-ended domains, we find that SACD is real but multiform: models exhibit distinct self-anchoring signatures ranging from active confidence suppression to calibration improvement suppression, with effects concentrated in open-ended domains. These findings challenge the adequacy of single-turn calibration evaluation for characterizing LLM reliability in realistic multi-turn deployment contexts. Code and data are available at https://github.com/hvardhan878/calibration-drift
Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models
Zefang Liu | Nam H Nguyen | Yinzhu Quan | Shi-Xiong Zhang
Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents a systematic empirical study of temporal tokenization for modeling event sequences with LLMs, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data’s statistical properties, highlighting temporal tokenization as a critical yet often overlooked design dimension in LLM-based event modeling.
“Be My Cheese?”: Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs
Madison Van Doren | Casey Ford | Jennifer Barajas | Riley VanMeter | Cory Holland
We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art multilingual large language models (LLMs). Existing MT benchmarks emphasise token-level and grammatical accuracy, but often overlook pragmatic and culturally grounded competencies required for real-world localisation. Building on a pilot study of 87 translations across 20 languages, we evaluate 7 multilingual LLMs across 15 target languages with 5 native-speaker raters per language. Raters scored both full-text translations and segment-level instances of culturally nuanced language (idioms, puns, holidays, and culturally embedded concepts) on an ordinal 0–3 quality scale; segment ratings additionally included an NA option for untranslated segments. Across full-text evaluations, mean overall quality is modest (1.68/3): GPT-5 (2.10/3), Claude Sonnet 3.7 (1.97/3), and Mistral Medium 3.1 (1.84/3) form the strongest tier with fewer catastrophic failures. Segment-level results show sharp category effects: holidays (2.20/3) and cultural concepts (2.19/3) translate substantially better than idioms (1.65/3) and puns (1.45/3), and idioms are most likely to be left untranslated. These findings demonstrate a persistent gap between grammatical adequacy and cultural resonance. To our knowledge, this is the first multilingual, human-annotated benchmark focused explicitly on cultural nuance in translation and localisation, highlighting the need for culturally informed training data, improved cross-lingual pragmatics, and evaluation paradigms that better reflect real-world communicative competence.
Component Transfer Can Exceed Full Model Performance: Investigating Post-Trained Mixture-of-Experts
Rabin Tiwari
Post-training methods such as supervised fine-tuning and preference optimization are widely used to align large language models, yet how their benefits distribute across architectural components and transfer across tasks and prompts remains unclear. In this work, we analyze component-level transfer in a Mixture-of-Experts language model by selectively replacing routers, attention modules, and expert networks between two Mixture-of-Experts models trained with different post-training recipes and dataset mixtures. Starting from an SFT+DPO checkpoint, we systematically replace its components (routers, attention, experts) with those from a Tulu3 checkpoint and evaluate the impact of each replacement and their combinations on mathematical and scientific reasoning and a general-purpose classification task under zero-shot, few-shot, and Chain-of-Thought prompting. We find strong component-specific specialization: expert networks account for most gains on mathematical and scientific reasoning, while attention mechanisms consistently outweigh expert transfer on general tasks, and router transfer alone provides minimal benefit or harms performance. Prompting strategy further modulates these effects, with expert transfer degrading zero-shot science performance but improving few-shot reasoning. Strategically combining components from different model versions can in some cases match or exceed the performance of the best available model, motivating principled techniques for composing post-trained models into task- and prompt-specific systems without additional training.
Reassessing Extractive QA Datasets at Scale: LLM-as-a-Judge and In-Depth Analyses
Xanh Ho | Jiahao Huang | Florian Boudin | Akiko Aizawa
Extractive QA tasks are commonly evaluated using Exact Match (EM) and F1-score, but these metrics often fail to reflect true model performance. Recent studies have proposed using large language models (LLMs) as judges (LLM-as-a-judge), yet they often lack comprehensive evaluation across datasets and overlook key factors such as sensitivity to answer types, prompt variations, and self-preference bias. In this work, we conduct a systematic study of LLM-as-a-judge across four extractive QA datasets and various prompt variations, assessing multiple LLM families in both answering and judging roles. Our results show that LLM-as-a-judge judgments correlate much more strongly with human evaluations than EM (0.22) and F1 (0.40), achieving correlations up to 0.85 with open-source models. Further analysis reveals that LLM-as-a-judge performs particularly well on number-related answers but faces challenges with more complex types, such as job titles. Contrary to findings in other NLP tasks, we observe no self-preference bias, even when the same model serves as both QA model and judge. Finally, we find that prompt phrasing has minimal impact, and zero-shot, context-free judging often yields the best evaluation performance.
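As background for the two lexical metrics the abstract compares against, here is a minimal sketch of Exact Match and token-level F1 in the style commonly used for extractive QA. The normalization (lowercasing and whitespace splitting only) is a simplifying assumption; the paper's exact normalization is not given here, and the example strings are hypothetical.

```python
from collections import Counter
from typing import List


def normalize(text: str) -> List[str]:
    """Simplified normalization (assumed): lowercase, split on whitespace."""
    return text.lower().split()


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-tokens overlap)."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical job-title answer: semantically right, lexically different.
em = exact_match("the chief executive officer", "Chief Executive Officer")
f1 = token_f1("the chief executive officer", "Chief Executive Officer")
```

In this hypothetical case a single extra article drives Exact Match to zero while F1 stays below one, which illustrates the kind of surface mismatch that lexical metrics penalize.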
IndicMMLU-Pro: Benchmarking Indic Large Language Models on Multi-Task Language Understanding
Sankalp Jajee | Ashutosh Kumar | Nikunj Kotecha | Vinija Jain | Aman Chadha | Sreyoshi Bhaduri
Indic languages, spoken by over 1.5 billion people, pose unique challenges for NLP due to their cultural richness, linguistic diversity, and structural complexity. We present IndicMMLU-Pro, a comprehensive benchmark extending the MMLU-Pro framework to nine major Indic languages: Hindi, Bengali, Gujarati, Marathi, Kannada, Punjabi, Tamil, Telugu, and Urdu. Covering a wide range of tasks in comprehension, reasoning, and generation, IndicMMLU-Pro offers a standardized evaluation framework to advance AI model development in Indic contexts. This paper details the benchmark’s design, taxonomy, and data curation, and establishes baseline results using state-of-the-art multilingual models. As an open resource, IndicMMLU-Pro aims to accelerate progress in Indic language technologies and support inclusive research in multilingual NLP.
Identifying Where Large Language Models Struggle in Answering Complex Questions
Xanh Ho | Florian Boudin | Saku Sugawara | Khoa Duong | Akiko Aizawa
We design experiments to identify where Large Language Models (LLMs) struggle when answering complex questions. Our focus is on two key stages, mirroring the human QA process: 1) question decomposition, where the model breaks down a complex question into sub-questions, and 2) sub-problem solving, where it addresses each sub-question to obtain the final response. We preprocess and expand three multi-hop datasets to create experimental datasets featuring explicit and implicit multi-hop questions, crowdsourced and templated questions, and varying numbers of hops. Our results show that larger models (Llama 3.1 70B and o1) excel at decomposing explicit multi-hop questions but struggle with implicit ones, while smaller models (e.g., Llama 3.1 8B) have difficulty with both. In the sub-problem solving stage, all models perform well on simple questions with context. Furthermore, we found no correlation between accuracy in the question decomposition stage and final QA performance (direct response), highlighting a key difference between human and LLM reasoning.
More Yap Less Meaning: Uncovering Self-Improvement Behavior in SLMs
Marina Igitkhanian | Erik Arakelyan
Recently, language models have made rapid progress across various domains and applications. However, their capability for self-improvement, i.e., whether they are adept at recognising and correcting flaws in their own reasoning, remains dubious. In this study, we address this question by constructing a sufficiency test to rigorously examine the self-correction capabilities of small language models (SLMs). We propose a minimal three-step self-correction pipeline that collects initial SLM answers, prompts the same model to generate hints for its incorrect responses given the ground truth, and feeds the model the same question with its own feedback to refine the initial answer. We evaluate a variety of instruction-tuned and reasoning SLMs in this experimental setup on arithmetic and logical reasoning benchmarks. Our findings show that SLMs with injected hint sentences yield only a 4.4% gain over initial question-answering accuracy. Even though the correct answer was provided alongside the model’s incorrect reasoning, the evaluated SLMs fail to understand what was missing in their reasoning and show minimal semantic difference between hints that lead to corrections and ones that do not. Furthermore, our experiments show that longer hints are positively correlated with incorrect final answers, suggesting that longer deliberation on problems can hinder the reasoning process, meaning that SLMs do not necessarily scale in performance with a larger compute budget.
Reinforced Agent: Inference-Time Feedback for Tool-Calling Agents
Anh Ta | Junjie Zhu | Shahin Shayandeh
Tool-calling agents are evaluated on tool selection, parameter accuracy, and scope recognition, yet LLM trajectory assessments remain inherently *post-hoc*. Disconnected from the active execution loop, such assessments identify errors that are usually addressed through prompt-tuning or retraining, and fundamentally cannot course-correct the agent in real time. To close this gap, we move evaluation into the execution loop at *inference time*: a specialized reviewer agent evaluates provisional tool calls *prior to* execution, shifting the paradigm from post-hoc recovery to proactive evaluation and error mitigation. In practice, this architecture establishes a clear separation of concerns between the primary execution agent and a secondary review agent. As with any multi-agent system, the reviewer can introduce new errors while correcting others, yet no prior work to our knowledge has systematically measured this tradeoff. To quantify this tradeoff, we introduce *Helpfulness-Harmfulness metrics*: helpfulness measures the percentage of base agent errors that feedback corrects; harmfulness measures the percentage of correct responses that feedback degrades. These metrics directly inform reviewer design by revealing whether a given model or prompt provides net positive value. We evaluate our approach on BFCL (single-turn) and τ²-Bench (multi-turn stateful scenarios), achieving +5.5% on irrelevance detection and +7.1% on multi-turn tasks. Our metrics reveal that reviewer model choice is critical: the reasoning model o3-mini achieves a 3:1 benefit-to-risk ratio versus 2.1:1 for GPT-4o. Automated prompt optimization via GEPA provides an additional +1.5–2.8%. Together, these results demonstrate a core advantage of separating execution and review: the reviewer can be systematically improved through model selection and prompt optimization, without retraining the base agent.
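The two metrics defined in the abstract can be sketched directly from their verbal definitions. The snippet below is an illustrative reading, not the authors' implementation: the function name, the boolean per-example encoding, and the choice to return 0.0 when a denominator is empty are all assumptions, and the sample outcomes are made up.

```python
from typing import Sequence, Tuple


def helpfulness_harmfulness(
    base_correct: Sequence[bool],
    reviewed_correct: Sequence[bool],
) -> Tuple[float, float]:
    """Return (helpfulness %, harmfulness %) from paired outcomes.

    base_correct[i]     -- was the base agent right without review?
    reviewed_correct[i] -- was the final answer right after reviewer feedback?
    """
    if len(base_correct) != len(reviewed_correct):
        raise ValueError("outcome lists must have the same length")
    pairs = list(zip(base_correct, reviewed_correct))
    after_errors = [after for before, after in pairs if not before]
    after_hits = [after for before, after in pairs if before]
    # Helpfulness: share of base-agent errors that feedback corrects.
    helpfulness = (
        100.0 * sum(after_errors) / len(after_errors) if after_errors else 0.0
    )
    # Harmfulness: share of correct base responses that feedback degrades.
    harmfulness = (
        100.0 * sum(not a for a in after_hits) / len(after_hits)
        if after_hits else 0.0
    )
    return helpfulness, harmfulness


# Made-up outcomes: 4 base hits (1 degraded), 2 base errors (1 corrected).
base = [True, True, True, True, False, False]
reviewed = [True, True, True, False, True, False]
help_pct, harm_pct = helpfulness_harmfulness(base, reviewed)
```

The two percentages use different denominators, so their net effect on accuracy also depends on how often the base agent is wrong in the first place.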
RE-AD: Real-Time Requirement Adherence for Data Labeling
Siddarth Malreddy | Ishan Nigam | Akshay Arora | Nikhil Mittal | Subrat Sahu
Human-annotated data remains fundamental to training frontier Large Language Models (LLMs). However, crowd-sourced annotations often suffer from quality issues stemming from annotator misunderstanding or lack of engagement. To address this, we introduce a real-time requirement adherence (RE-AD) framework that leverages LLMs to proactively validate labeling quality. Our methodology involves decomposing Standard Operating Procedures (SOPs) into atomic rules via self-reflection, categorizing them by complexity, and applying tiered validation strategies. Evaluated on a synthetic benchmark, the system achieved an F1 score of 0.749. Furthermore, production deployment resulted in annotators accepting and fixing 82% of the errors flagged by the framework. We include ablation studies to demonstrate the impact of our core design decisions.
General-purpose language models are trained to produce varied natural language outputs, but for some tasks, like annotation or classification, we need more specific output formats. LLM systems increasingly support structured output, which enforces formats by sampling tokens according to a grammar — but also unpredictably reduces downstream performance. Are there systematic differences between grammars that appear semantically (and often visually) similar to humans? To answer this, we test four popular model families with five varying output formats on four common NLP benchmarks. We find all models perform most accurately when guided to use formats respecting convention, such as letters for multiple choice and real numbers for numerical prediction. Performance also improves by 5%-10% when guiding models to return tokens incorporating leading whitespace, with smaller models benefiting the most. We find leading whitespace helps models avoid structural deficiencies in subword token representations. We finally present best practices for researchers using language models as zero-shot classifiers with structured output.
An Empirical Study of LLM-as-a-Judge: How Design Choices Impact Evaluation Reliability
Yusuke Yamauchi | Taro Yano | Masafumi Oyamada
As large language models (LLMs) continue to advance, reliable evaluation methods are essential—particularly for open-ended, instruction-following tasks. LLM-as-a-Judge enables automatic evaluation using LLMs as evaluators, but its reliability remains uncertain. In this work, we analyze key factors affecting its trustworthiness, focusing on alignment with human judgments and evaluation consistency. Using BIGGENBench and EvalBiasBench, we study the effects of evaluation design, decoding strategies, and Chain-of-Thought (CoT) reasoning in evaluation. Our results show that evaluation criteria are critical for reliability, non-deterministic sampling improves alignment with human preferences over deterministic evaluation, and CoT reasoning offers minimal gains when clear evaluation criteria are present.
In many human-annotated NLP tasks involving ambiguity or subjective judgment, annotator disagreement reflects epistemic uncertainty rather than noise. Soft labeling (SL), which represents annotations as probability distributions rather than majority-vote (MV) labels, preserves this uncertainty and can improve downstream performance. We extend this perspective to LLM-based annotation by formalizing LLM soft labeling as introducing controlled variation in model-generated annotations to approximate the latent variability underlying human annotations. We distinguish two sources of variation: model-induced (e.g., stochastic decoding and model ensembles) and human-approximated (e.g., persona prompting and human-calibrated in-context annotation). Using the Gab Hate and GoEmotions datasets, we show that SL training consistently outperforms MV training under stronger LLM-based annotation strategies. Model ensembles produce the most informative soft-label distributions, achieving the best human–LLM agreement and downstream classification performance. These findings suggest that scalable LLM-based annotation pipelines can model epistemic uncertainty through diverse model-level variation without explicitly simulating human attributes.
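The distinction between soft labels and majority-vote labels described above can be sketched as follows. The helper names and the toy annotations are hypothetical; the sketch only shows how the two label types are derived from the same set of annotations, not how the paper generates or trains on them.

```python
from collections import Counter
from typing import List, Sequence


def soft_label(annotations: Sequence[str], classes: Sequence[str]) -> List[float]:
    """Empirical distribution over classes from repeated annotations."""
    if not annotations:
        raise ValueError("need at least one annotation")
    counts = Counter(annotations)
    return [counts[c] / len(annotations) for c in classes]


def majority_vote(annotations: Sequence[str]) -> str:
    """Single most frequent label; ties go to the label seen first."""
    if not annotations:
        raise ValueError("need at least one annotation")
    return Counter(annotations).most_common(1)[0][0]


# Toy example: five annotations (human or LLM-generated) for one item.
votes = ["hate", "not_hate", "hate", "hate", "not_hate"]
distribution = soft_label(votes, classes=["hate", "not_hate"])
hard = majority_vote(votes)
```

The majority-vote label discards the 40% minority signal that the soft label keeps, which is the information soft-label training is meant to preserve.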
Mind the Gap... or Not? How Translation Errors and Evaluation Details Skew Multilingual Results
Jan-Thorsten Peter | David Vilar | Tobias Domhan | Dan Malkin | Markus Freitag
Most current large language models (LLMs) support a wide variety of languages in addition to English, including high-resource languages (e.g. German, Chinese, French), as well as low-resource ones (e.g. Swahili, Telugu). In addition they have shown impressive capabilities in different domains, like coding, science and math. In this paper, taking math as an example domain, we study the performance of different LLMs across languages. Experimental results show that there exists a non-negligible and consistent gap in the performance of the models across languages. Interestingly, and somewhat against expectations, the gap exists for both high- and low-resource languages. These results should impact further research into cross-lingual capability generalization for next generation LLMs. Or they would, if it weren’t for the fact that they are false. By analyzing one of the standard multilingual math benchmarks (MGSM), we determine that several translation errors are present in the data. Furthermore, the lack of standardized answer extraction from LLM outputs further influences the final results. We propose a method for semi-automatic quality assurance to address the first issue at scale, and give recommendations to address the second one. Combining these two approaches we show that the aforementioned language gap mostly disappears, leading to completely different conclusions from our research. We additionally release the corrected dataset to the community.
MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
Jaeyun Lee | Junyoung Koh | Zeynel Tok | Hunar Batra | Ronald Clark
Multi-constraint instruction following requires verifying whether a response satisfies multiple individual requirements, yet LLM judges are often assessed only through overall-response judgments. We introduce MCJudgeBench, a benchmark for constraint-level judge evaluation in multi-constraint instruction following. Each instance includes an instruction, a candidate response, an explicit constraint list, per-constraint gold labels in yes, partial, no, and controlled response-side perturbations. The evaluation protocol further includes evaluation prompt variants to test judge stability. We evaluate proprietary and open-source LLM judges using both correctness and inconsistency metrics, distinguishing intrinsic inconsistency under stochastic decoding from procedural inconsistency under prompt and response perturbations. Our results show that judge reliability has multiple dimensions: strong overall performance does not guarantee equally reliable detection across label categories, especially for rarer partial and no cases. Judges with higher correctness do not always have lower inconsistency. Evaluation with reasoning improves correctness but does not uniformly improve stability. These findings motivate evaluating LLM judges at the constraint level to study these failure modes.
MedAct: Removing the Human Bottleneck in Benchmarking Clinical LLM Safety
Arjun Krishna | Brian Pridgen | Max Silverstein
Most medical benchmarks for large language models test factual recall through multiple-choice questions, but on-the-ground physicians do not have the luxury of four options to choose from. NOHARM (Wu et al., 2025) demonstrated this limitation using 100 real eConsult cases annotated by 29 board-certified physicians, showing that action-level evaluation reveals omission and commission failure modes invisible to multiple-choice tests. However, NOHARM’s cases are closed and their creation required substantial expert physician time, creating a human bottleneck that limits the scalability and openness of this evaluation approach. We present MedAct, an open replication of NOHARM’s evaluation methodology using synthetically generated cases. Our contribution is a multi-stage generation pipeline that uses language models grounded in clinical practice guidelines to produce 100 cases across ten specialties, each containing roughly 50 plausible next-step actions labeled as Appropriate or Inappropriate using NOHARM’s scoring framework. The pipeline includes structural quality controls: 83 of 100 cases pass all five automated checks, and answer-leaking language appears in only 0.06% of actions. In a pilot evaluation of nine contemporary LLMs using this synthetic benchmark, we observe patterns consistent with NOHARM’s findings on human-curated cases, including that omissions dominate error volume while commissions dominate severe errors. We release all cases, rubrics, generation tooling, and scoring code openly, removing the human-bottleneck barrier to action-level clinical LLM evaluation.
Response Content Units: Evaluating Completeness and Proactiveness in Medical Open-Response Question Answering
Yongsin Park | Wen-wai Yim | Emma McKibbin | Asma Ben Abacha | Fei Xia
Remote clinical care has significantly increased the workload for healthcare professionals managing digital inquiries. While automated systems aim to alleviate this burden, consumer health questions present unique challenges due to their linguistic complexity and the need for proactive clinical guidance, which traditional question-answering models often overlook. We introduce the medical Response Content Units (RCU) schema, a framework that facilitates automatic analysis to identify question-answer completeness and critical answer subparts, which can then be used as tools for supporting clinician response or for automatic metric evaluation. Our analysis using this schema reveals a 16.4% gap in response completeness in professional replies and demonstrates that essential medical directives are provided 2.4 to 12.1 times as frequently as direct answers. We provide baseline results and publicly release our annotations and source code to offer an evaluation framework that is more closely aligned with real-world clinical requirements.
NanoFlux: Adversarial Dual-LLM Evaluation and Distillation for Multi-Domain Reasoning
Raviteja Anantha | Soheil Hor | Teodor Nicola Antoniu | Layne C Price
We present NanoFlux, a novel adversarial framework for generating targeted training data to improve LLM reasoning, where adversarially-generated datasets of ≤ 200 examples outperform conventional fine-tuning approaches. The framework employs a competitive dynamic between models alternating as Attacker and Defender, supervised by a tool-augmented Judge, synthesizing multi-step questions with explanatory annotations. Fine-tuning a 4B-parameter model on NanoFlux-generated data yields performance gains across diverse domains compared to full-benchmark fine-tuning: +5.9% on mathematical reasoning, +3.6% on scientific reasoning, and +16.6% on medical reasoning, while reducing computational requirements by 3-14×. Ablation studies reveal a non-monotonic relationship between dataset characteristics and model performance, uncovering domain-specific optimal points for question complexity and reasoning quality. NanoFlux automates training data generation through embedding-based novelty filtering, tool-augmented evaluation, and multi-hop reasoning, pointing to the value of small, targeted training datasets.
Evaluating the Reliability of LLMs in Faithfully Updating Text: An Empirical Study
Ayan Datta | Paheli Bhattacharya | Rishabh Gupta
We provide a comprehensive review of the FRUIT (Faithfully Reflecting Updated Information in Text) task, which formalizes the challenge of accurately updating textual information with large language models (LLMs). Our work begins with an in-depth analysis of the FRUIT dataset, revealing key structural insights. We also investigate the unsupervised capabilities of LLMs—such as zero-shot learning, chain-of-thought reasoning, self-reflection, and evidence ordering. Experimental results demonstrate that unsupervised approaches perform competitively with supervised methods in faithful text updating. Qualitative analysis shows that updates utilizing table-structured evidence outperform those based on unstructured text. We also discuss important limitations, including the need for new datasets and the risks of information leakage in this domain. These findings have significant implications for applications requiring precise document updates, such as software engineering, technical documentation, and legal document maintenance.
Not All Tokens Are Equal: Per-Dimension Top-K Pooling for Adversarially Robust BERT Classification
Manoranjan Dash | Shivam Anand Aralikatti | Shanay Sheth | Pranav Shinde
Contextual text classification with BERT typically relies on the [CLS] token representation for downstream prediction. While effective under standard conditions, [CLS]-based pooling is brittle under adversarial perturbation, as its single-vector representation is indiscriminately influenced by injected adversarial tokens. We propose Per-Dimension Top-K Average Pooling, a pooling strategy that, for each hidden dimension, selectively aggregates only the top-K token activations rather than the full sequence — effectively controlling which tokens contribute to the final representation. This token-level selectivity acts as a natural filter against adversarial injection: tokens that do not rank among the top-K for a given dimension are suppressed from aggregation. We evaluate our approach against CLS, Global Average Pooling (GAP), Global Max Pooling (GMP), and Hybrid variants across three text classification domains: spam detection (Enron and LingSpam), automated essay scoring (ASAP), and hate speech classification. On the Enron spam dataset under adversarial attack, our best Hybrid (K=3) variant reduces the Attack Success Rate from 70.65% to 37.07% while maintaining clean accuracy above 99%, compared to CLS which degrades to 63.64% adversarial accuracy. Representation-level analyses further corroborate these findings: Top-K pooling variants exhibit substantially lower cosine similarity shift under attack, and adversarially injected tokens enter the top-K selection in far fewer dimensions compared to CLS. These results suggest that per-dimension token selectivity offers a principled and lightweight mechanism for adversarial robustness in BERT-based classifiers without any modification to the underlying model architecture.
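The pooling rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `topk_average_pool` is hypothetical, plain Python lists stand in for BERT hidden states, and ranking by raw activation value (rather than magnitude) is an assumption, since the abstract does not specify it.

```python
def topk_average_pool(hidden_states, k):
    """Pool a (seq_len x hidden_dim) list of token vectors into one vector.

    For each hidden dimension, only the k largest token activations are
    averaged; every other token is ignored for that dimension.
    Assumption: "top" means largest raw value, not largest magnitude.
    """
    if not hidden_states:
        raise ValueError("need at least one token vector")
    k = min(k, len(hidden_states))  # cannot select more tokens than exist
    hidden_dim = len(hidden_states[0])
    pooled = []
    for d in range(hidden_dim):
        # Activations of every token along dimension d, largest first.
        column = sorted((token[d] for token in hidden_states), reverse=True)
        pooled.append(sum(column[:k]) / k)
    return pooled


# Toy input: 4 tokens, 2 hidden dimensions.
tokens = [[1.0, 0.0], [3.0, -2.0], [5.0, 4.0], [-1.0, 2.0]]
pooled = topk_average_pool(tokens, k=2)   # [4.0, 3.0]
as_max = topk_average_pool(tokens, k=1)   # same as global max pooling
as_mean = topk_average_pool(tokens, k=4)  # same as global average pooling
```

With K=1 the rule reduces to Global Max Pooling and with K equal to the sequence length it reduces to Global Average Pooling, so K interpolates between the two baselines named in the abstract.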
Near-Miss: Latent Policy Failure Detection in Agentic Workflows
Ella Rabinovich | David Boaz | Naama Zwerdling | Ateret Anaby Tavor
Agentic systems for business process automation often require compliance with policies governing conditional updates to the system state. Evaluation of policy adherence in LLM-based agentic workflows is typically performed by comparing the final system state against a predefined ground truth. While this approach detects explicit policy violations, it may overlook a more subtle class of issues in which agents bypass required policy checks, yet reach a correct outcome due to favorable circumstances. We refer to such cases as near-misses or latent failures. In this work, we introduce a novel metric for detecting latent policy failures in agent conversation traces. Building on the ToolGuard framework, which converts natural-language policies into executable guard code, our method analyzes agent trajectories to determine whether the agent’s tool-calling decisions were sufficiently informed. We evaluate our approach on the 𝜏2-verified Airlines benchmark across several contemporary open and proprietary LLMs acting as agents. Our results show that latent failures occur in 8–17% of trajectories involving mutating tool calls, even when the final outcome matches the expected ground-truth state. These findings reveal a blind spot in current evaluation methodologies and highlight the need for metrics that assess not only final outcomes but also the decision process leading to them.
Evaluating Counterfactual Strategic Reasoning in Large Language Models
Dimitrios Georgousis | Maria Lymperaiou | Angeliki Dimitriou | Giorgos Filandrianos | Giorgos Stamou
We evaluate whether LLMs adapt their strategic behavior when familiar games are counterfactually modified. We introduce a repeated-game evaluation framework covering Prisoner’s Dilemma and Rock–Paper–Scissors under default, label-perturbed, payoff-perturbed, and joint counterfactual variants. This design separates surface robustness to renamed actions from deeper sensitivity to changed incentives. Across multiple frontier LLMs, we find that label perturbations usually cause moderate degradation, whereas payoff perturbations expose stronger failures: LLMs often preserve canonical strategies even when the equilibrium structure changes. In RPS, several LLMs remain close to uniform play despite a payoff-counterfactual equilibrium requiring a biased mixed strategy. Behavioral and efficiency metrics further show that stronger or reasoning-enabled LLMs are not uniformly more strategic: some deliberate more without adapting faster. Overall, counterfactual repeated games provide a compact diagnostic for distinguishing robust incentive-sensitive behavior from brittle template-based strategic execution.
Speculative Refinement: A Hybrid Autoregressive Diffusion Decoding Strategy and Its Behavior Across Benchmarks
Aditi Gupta | Neel Mishra | Kushagra Trivedi | Pawan Kumar
How should we evaluate generation systems that combine autoregressive (AR) and diffusion decoding? We study this question through *Speculative Refinement* (SpecRef), a training-free hybrid method that warm-starts a masked diffusion language model from an AR draft using entropy-guided selective masking. Evaluating SpecRef across six benchmarks (HumanEval, MBPP, GSM8K, BBH, ARC-Challenge, HellaSwag) with three distinct evaluation protocols (execution-based pass@1, exact-match, log-likelihood scoring), we surface several findings relevant beyond our specific system: (1) code benchmarks conflate structural discovery with logical correctness: providing a syntactic scaffold lifts accuracy from near zero to over 20% without changing the model, indicating that much of the baseline failure is structural; (2) a *refinement tension* phenomenon where multi-stage correction degrades already-correct tokens, exposing benchmark saturation ceilings invisible to single-model evaluation; (3) log-likelihood and generative evaluation produce different model rankings for the same model pair, suggesting they measure different capabilities; (4) standard Python post-processing silently breaks code evaluation for non-AR generators. These observations apply to any multi-stage or non-autoregressive generation pipeline and point toward more diagnostic evaluation practices.
SAUCE: Summary Analysis Using Conversation Entailment
Man-Ling Sung | Hemanth Kandula | Jeff Ma | William Hartmann | Matthew Snover
With the growing need for evaluating Large Language Models (LLMs) and their applications to speech, challenges persist in summarizing and evaluating conversations that lack a clear end goal. We introduce SAUCE – a reference-free, fact-based evaluation pipeline for cross-lingual conversational speech summarization. It measures the accuracy and the fact coverage of a summary through the entailment between conversation and text. We compare SAUCE against several popular summarization metrics and demonstrate its effectiveness in capturing information loss due to transcription and translation errors and in identifying broken summaries. Crucially, unlike black-box LLM evaluators or dense embedding metrics, SAUCE is inherently explainable: it maps summary scores to discrete, verifiable facts, allowing users to pinpoint exact hallucinations or omissions. We illustrate how this interpretability helps developers systematically profile LLM behaviors and gives end-users an actionable tool to verify summary accuracy in noisy, real-world conditions. Preliminary investigations show that SAUCE aligns strongly with human judgment.
Evaluating ASR Quality at Scale on TV Entertainment Platforms
Adeep Hande | Kishorekumar Sundararajan | Yidnekachew Endale | Akshatha Bapu KrishnaSwamy | Sachin Dabral | Dawn Reed | Michael Pereira
Fine-Tuning vs. RAG for Multi-Hop Question Answering with Novel Knowledge
Zhuoyi Yang | Yurun Song | Kyler G. Harris | Iftekhar Ahmed | Ian Harris
Multi-hop question answering is widely used to evaluate the reasoning capabilities of large language models (LLMs), as it requires integrating multiple pieces of supporting knowledge to arrive at a correct answer. While prior work has compared fine-tuning and retrieval-augmented generation (RAG) for factual recall and single-hop question answering, it remains unclear how these approaches perform in multi-hop settings that require compositional reasoning over temporally novel knowledge. In particular, prior comparisons often do not control for model scale, evaluation format, or knowledge freshness, making it difficult to isolate the effect of knowledge injection mechanisms. In this paper, we systematically compare parametric and non-parametric knowledge injection methods for open-domain multi-hop question answering. We evaluate unsupervised fine-tuning (continual pretraining), supervised fine-tuning, and retrieval-augmented generation across three 7B-parameter open-source LLMs. Experiments are conducted on two benchmarks: Question Answering Science Challenge (QASC), a standard multi-hop science question answering dataset, and a newly constructed dataset of over 10,000 multi-hop questions derived from Wikipedia events in 2024, which is designed to test knowledge beyond the models’ pretraining cutoff. Our results show that unsupervised fine-tuning provides only limited gains over base models, suggesting that continual pretraining alone is insufficient for improving multi-hop reasoning accuracy. In contrast, RAG yields substantial and consistent improvements, particularly when answering questions that rely on temporally novel information. Supervised fine-tuning achieves the highest overall accuracy across models and datasets. These findings highlight fundamental differences in how knowledge injection mechanisms support multi-hop question answering and underscore the importance of retrieval-based methods when external or compositional knowledge is required.
MHGraphBench: Knowledge Graph-Grounded Benchmarking of Mental Health Knowledge in Large Language Models
Weixin Liu | Congning Ni | Shelagh A. Mulvaney | Susannah L. Rose | Murat Kantarcioglu | Bradley A. Malin | Zhijun Yin
Large language models (LLMs) are increasingly used in the mental health domain, yet it remains unclear how well they capture related biomedical knowledge and how reliably they apply it to clinically salient structured judgments. Here, we present a knowledge-graph (KG)-grounded benchmark for assessing LLMs on mental-health entity recognition, relation judgment, and two-hop reasoning. The benchmark is derived from PrimeKG and comprises nine task families with KG-supported answers and controlled negative options. Experiments across 15 closed- and open-source LLMs reveal a persistent recognition-to-judgment gap: leading models achieve near-ceiling performance on entity typing and on the small relation-typing subset, yet they still struggle with relation prediction and two-hop reasoning. Additionally, short KG-derived snippets benefit some models but degrade performance for others. Moreover, output-format reliability can substantially influence measured performance under constrained multiple-choice settings, highlighting the critical role of response validity in benchmark-based evaluation. MHGraphBench should therefore be interpreted as evaluating agreement with a curated mental-health slice of PrimeKG under a constrained multiple-choice interface, rather than as a direct assessment of real-world clinical safety.
A Progressive Evaluation Framework for Multicultural Analysis of Story Visualization
Janak Kapuriya | Ali Hatami | Paul Buitelaar
Recent advancements in text-to-image generative models have improved narrative consistency in story visualization. However, current story visualization models often overlook cultural dimensions, resulting in visuals that lack cultural fidelity. In this study, we present a progressive evaluation framework for story visualization. We validate this framework on current text-to-image models across three languages (English, Hindi, and Chinese) on two datasets (VIST and FlintstonesSV). The proposed framework introduces three levels of cultural analysis as evaluation rubrics: 1) Basic Cultural Criteria, 2) Cultural Dimension Guidance, and 3) Cultural Examples Grounding. We evaluate story visualization by use of a novel MLLM-as-Jury approach across all three rubrics and a small-scale human evaluation only on the third rubric. We implement an MLLM-as-jury approach by aggregating scores from three different families of MLLM-as-Judge models. In our experiments, real-world stories generally receive higher cultural appropriateness scores than animated ones, with English tending to score higher than Hindi and Chinese across the evaluated models. Some examples also exhibited culturally inconsistent or stereotypical elements noted by annotators. The proposed progressive evaluation framework has therefore been shown to provide early insights into cultural misalignments in story visualization. Code for this work is made available on https://github.com/janak11111/Cultural_Eval_For_StoryViz
Is GraphRAG Needed? From Basic RAG to Graph-/Agentic Solutions with Context Optimization
Long Chen | Ryan Razkenari | Yuxuan Zhou | Yuan Tian | Rahul Ghosh | Venkatesh Pappakrishnan | Disha Ahuja | Vidya Sagar Ravipati
As advanced RAG variants like GraphRAG and Agentic RAG emerge, one leading question is when and how to use them. Here, we introduce a framework for evaluating and comparing RAG scenarios on semi-structured knowledge bases, including regular RAG, GraphRAG, Modular RAG and Agentic RAG. We provide implementations of 9 standardized RAG scenarios and conduct experiments for a comprehensive comparison. These scenarios are designed for real use cases with data and domain restrictions, spanning from simple document-based retrieval to advanced features such as hybrid text-graph retrieval, integration with computed or pre-defined domain knowledge graphs, agentic multi-step planning, and agent-graph integration. In addition, we present a novel context engineering method for GraphRAG and Agentic RAG that addresses context/memory overflow issues and efficiently manages text and graph retrievals with new representations and an agentic loop design, leading to a 19%–53% reduction in token usage. Moreover, further analysis identifies a retrieval-generation gap where expanded retrieval does not proportionally improve generation quality, suggesting that retrieval-oriented metrics overstate the benefits of advanced retrieval. This work provides data-driven insights on when and how to use these variants for building production-ready intelligent RAG systems.
Cross-Domain Semantic Fidelity Evaluation for Meaning-to-Text Generation
Davan Harrison | Marilyn Walker
Slot Error Rate (SER) is the standard metric for evaluating semantic accuracy in meaning-to-text generation, but computing it has historically required domain-specific scripts that do not generalize across datasets. We present a cross-domain SER evaluation framework that replaces hand-crafted rules with a learned slot extraction model. We adapt Llama-3.2-3B-Instruct with LoRA, updating only 0.34% of its parameters, and show that this small adapted model outperforms prompted frontier LLMs by a wide margin on structured extraction across 23 dialogue domains. We further apply overgenerate-and-rank to the extraction task itself, generating multiple candidate meaning representations and selecting the best one with a trained ranker, which improves SER-Accuracy from 75% to 88%. We combine the extraction model with a Natural Language Inference (NLI) verification baseline through learned per-example routing, achieving 90.0% accuracy on held-out evaluation pairs without any domain-specific rule engineering. We compare our framework against published rule-based SER tools and show that our learned approach matches or outperforms hand-crafted scripts on all six comparable domains.
E-star 12B: Reliable Rubric-Following and Domain-Adaptive SLM Evaluator for Korean Industrial Settings
Yonghoon Kwon | Heondeuk Lee | Barom Kang
Automatic evaluation in industrial settings requires models to interpret and apply natural language rubrics reliably under language and domain shift. This challenge is compounded when reference answers are unavailable and proprietary models cannot be deployed due to data-governance constraints. We present E-Star-12B, a 12B-parameter evaluator for Korean industrial environments that jointly addresses rubric following and domain adaptation. Our approach combines a structured evaluation format—feedback, highlight, and decision—with a 6K high-confidence training set via multi-stage consensus-based filtering. We introduce two benchmarks: Ko Feedback Bench for rubric-following evaluation under Korean language transfer, and RAG Quality Bench for domain-specific evaluation in financial and legal settings. E-Star-12B achieves the strongest rubric alignment among small language models on Ko Feedback Bench, improving Pearson correlation by +0.173 over its base model. On RAG Quality Bench, the domain-adapted variant approaches frontier-model performance with more stable adaptation than general instruct models. Strong rubric-following capability serves as a reliable scaffold for subsequent domain adaptation.
Pressure-Testing Deception Probes in LLMs: Scaling, Robustness, and the Geometry of Deceptive Representations
Sachin Kumar
Linear probes trained on internal activations of Large Language Models (LLMs) are increasingly proposed as evaluation metrics for deceptive generation: automated monitors that score whether a model’s output was produced deceptively, without requiring ground-truth labels or human annotation. Yet these metrics report AUROC scores exceeding 0.96 on clean benchmarks while demonstrating profound fragility under distributional shift. This paper presents a systematic pressure-test of such probe-based evaluation metrics across the Gemma 3 model family (1B–27B parameters), diagnosing why they fail rather than merely documenting that they fail. We investigate four competing hypotheses about how deception is encoded: as (1) a single linear direction, (2) a multi-dimensional subspace, (3) a convex conic hull, or (4) a proxy for computational entropy. Our experimental design includes cross-domain transfer matrices, multi-dimensional probe analysis with permutation null baselines, entropy-residualization tests, and systematic distractor evaluations across 8 stylistic shifts. Across all four model scales, we find that: (a) probe-based metrics achieve near-perfect AUROC (≥0.998) on clean data but collapse under stylistic shifts when trained without stylistic augmentation, whereas style-augmented probes recover near-perfect detection (mean AUROC 0.979–0.983) even on unseen styles; (b) the single-direction hypothesis is decisively rejected (k=1 captures only 0.61–0.80 AUROC of the signal), with cross-domain transfer failure confirmed as geometric rather than layer-mismatch-driven; (c) the entropy-proxy hypothesis is rejected (maximum |𝜌|=0.454, maximum 𝛥AUROC after residualization=0.004); and (d) deception does not form a statistically significant linear subspace even within individual domains (per-domain k*=0), yet multi-dimensional probes (k≥5) consistently recover the signal through distributed sub-threshold features.
These findings demonstrate that probe fragility under standard training reflects distributional narrowness rather than a fundamental architectural limitation: style-augmented probes recover near-perfect detection (mean AUROC 0.979–0.983 on unseen styles) at both the 4B and 27B scales, establishing that the inverse scaling pattern observed under standard training is a training-distribution artifact rather than a genuine scale-dependent phenomenon.
Sycophancy Negatively Affects LLM-as-a-Judge in Conflict Evaluation
Naghmeh Farzi | Laura Dietz | Samuel Carton
LLM-as-Judge systems are increasingly used to generate labels and evaluate conversational data, yet their susceptibility to narrative framing remains underexplored. We study whether replacing one speaker’s username with the first-person identifier ‘Me’ systematically biases model judgments independent of the underlying evidence. Using the Conversations Gone Awry corpus, we evaluate four LLMs across three judgment tasks (attack detection, attacker identification, and blame attribution), three perspective conditions, and two evidence visibility settings. Our results show that narrative perspective induces strong, task-dependent distortions, particularly in more subjective judgment tasks. We find that models systematically favor the narrator when a speaker is presented as ‘Me’, reducing blame and responsibility attribution toward that speaker even when the underlying evidence is unchanged. These findings raise concerns about using LLMs to judge or moderate first-person conversational data.
Concord: An Agreement-Aware Multi-Adjudication Pipeline for LLM Evaluation
Tyler Bliss | Mahit Verma | Aila Iyer-Singh | Subrata Biswas | Sheikh Asif Imran | Bashima Islam
Evaluating multimodal generations is challenging: human evaluation is costly, and single-model LLM-as-a-judge pipelines can be brittle and provide limited uncertainty signals. We introduce Concord, an ensemble-based evaluation pipeline that aggregates discrete judgments from multiple LLM judges and uses inter-judge agreement as a practical uncertainty signal for disagreement-driven triage. We evaluate Concord on AVSSD and SCORE-AVS, a ground-truth-supervised audio-visual benchmark with discrete labels (True/False or 0–5). Concord improves agreement with human judgments over single-judge and naive aggregation baselines, and prioritizing low-agreement instances focuses human review on the most ambiguous cases. We use locally hosted open-source judges and include the binary results for online larger scale models GPT4.o mini turbo and Gemini 3.1 Flash Lite.
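The idea of using inter-judge agreement as a triage signal can be illustrated with a small sketch. This shows only a plain majority vote with an agreement fraction; it is not Concord's aggregation method, which the abstract says improves on naive aggregation. The function name `agreement_triage` and the 0.75 threshold are assumptions made for illustration.

```python
from collections import Counter


def agreement_triage(judgments, threshold=0.75):
    """Aggregate discrete judge outputs and flag low-agreement items.

    judgments: one discrete label per judge (e.g. True/False or 0-5).
    Returns (majority_label, agreement_fraction, needs_human_review).
    The threshold is a hypothetical value, not one taken from the paper.
    """
    if not judgments:
        raise ValueError("need at least one judgment")
    counts = Counter(judgments)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(judgments)
    return label, agreement, agreement < threshold


confident = agreement_triage([True, True, True, False])
ambiguous = agreement_triage([3, 4, 3, 5])
```

Items whose agreement falls below the threshold would be routed to human reviewers first, which is the disagreement-driven prioritization the abstract describes.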
The Silent Vote: Improving Zero-Shot LLM Reliability by Aggregating Semantic Neighborhoods
Sanket Badhe | Priyanka Tiwari | Deep Shah
Large Language Models are increasingly used as zero-shot classifiers in complex reasoning tasks. However, standard constrained decoding suffers from a phenomenon we define as Renormalization Bias. When a model is restricted to a small set of target labels, the standard softmax operation discards the probability mass assigned to semantic synonyms in the original distribution. This loss of information, which we call the Silent Vote, results in artificial overconfidence and poor calibration. We propose Semantic Softmax, an inference-time layer that recovers this lost information by aggregating the scores of the semantic neighborhood surrounding each target label. We evaluate this approach on Qwen-3 and Phi-4-mini models using GoEmotions and Civil Comments datasets. Our results demonstrate consistent improvements across all evaluation metrics: Semantic Softmax substantially reduces Expected Calibration Error (ECE) and Brier Score, while simultaneously enhancing discriminative performance in terms of AUROC and Macro-F1. By accounting for linguistic nuances, our method provides a more calibrated and accurate alternative for zero-shot classification.
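A toy sketch of the neighborhood aggregation idea follows. The abstract does not say how scores are combined, so pooling each label's neighborhood with log-sum-exp (equivalent to summing the neighbors' probability mass before renormalizing over labels) is an assumption, as are the function name `semantic_softmax` and the hand-written neighborhoods.

```python
import math


def semantic_softmax(logits, neighborhoods):
    """Return label probabilities that include mass from semantic neighbors.

    logits: dict mapping vocabulary token -> raw logit.
    neighborhoods: dict mapping target label -> tokens counted toward it
        (hypothetical, hand-picked here; the paper's construction may differ).
    """
    pooled = {}
    for label, tokens in neighborhoods.items():
        scores = [logits[t] for t in tokens if t in logits]
        if not scores:
            raise ValueError(f"no logits found for label {label!r}")
        peak = max(scores)
        # Log-sum-exp pools the neighborhood without overflow.
        pooled[label] = peak + math.log(sum(math.exp(s - peak) for s in scores))
    top = max(pooled.values())
    exps = {label: math.exp(v - top) for label, v in pooled.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


logits = {"joy": 2.0, "happy": 2.0, "sad": 2.0}
neighborhoods = {"joy": ["joy", "happy"], "sadness": ["sad"]}
probs = semantic_softmax(logits, neighborhoods)
```

In this example a softmax restricted to the two label tokens "joy" and "sad" would give 0.5 each, whereas pooling the synonym "happy" into the joy neighborhood gives joy about two thirds of the mass.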
Are LLM Benchmarks Already Contaminated? A Systematic Review of Contamination Detection Methods
Erfan Nourbakhsh | Mohammad Sadegh Sirjani | Amir Mousavi | Khoa Nguyen | John Quarles | Mimi Xie | Rocky Slavin
Large Language Models (LLMs) are trained on web-scale corpora, increasing the risk that benchmark test data appears in training sets and inflates reported performance. We present a systematic literature review of 55 studies on LLM benchmark contamination through late 2025. Our contributions are: (1) a four-tier contamination taxonomy (Exact, Syntactic, Semantic, Task-Level; T1–T4); (2) a comparative analysis of five detection families (string-matching, likelihood-based, membership inference, LLM-prompted detection, and benchmark auditing), including access assumptions and failure modes; (3) a synthesis of contamination evidence on MMLU, GSM8K, HumanEval, and HellaSwag by measurement construct; (4) a comparative evaluation of mitigation strategies across lifecycle points, access assumptions, and evidence maturity; and (5) a Contamination Transparency Card (CTC) framework for future releases. Across studies, no detection method is consistently reliable across contamination tiers, model-access settings, and training stages. We identify instruction tuning as a persistent blind spot, note that RL/post-training contamination auditing is only beginning to mature, and report inflation estimates spanning roughly 6%–40% under benchmark- and setting-dependent assumptions.
Language models (LMs) are known to be prone to response biases, which present as option preference biases in fixed-response questions. It is therefore imperative to develop low-cost and effective response bias correction methods to improve LM performance and enable more accurate evaluations of model abilities. Here, we propose a simple response bias correction strategy, RBCorr, and test it on 12 open-weight language models using yes-no, entailment, and multiple choice questions. We show that response bias is prevalent in LMs pre-correction and that RBCorr effectively eliminates bias and boosts model performance. We also explore the generalizability of bias behavior across models, datasets, and prompt formats, showing that LogProbs-based correction is highly dependent on all three of these aspects. Overall, RBCorr is an easy-to-use method that can boost the performance of smaller LMs and ensure that LM performance on closed-response benchmarks aligns more closely with their true capabilities.
Recent studies have highlighted that Large Language Models (LLMs) often exhibit limited coherence, that is, the ability to produce consistent responses to semantically equivalent questions. While most prior research has focused exclusively on English, limited investigation has been conducted on other languages. In this work, we study the coherence of LLMs on Question Answering tasks across six languages: English, Italian, German, Chinese, Japanese, and Vietnamese. We evaluate models of varying sizes, ranging from 3.8B to 235B parameters, to examine how coherence scales with model capacity and how it relates to languages. Our findings reveal that (i) coherence is not uniquely related to model size and accuracy and (ii) for some models, coherence varies significantly between languages.
Token Cost Inequality: Measuring Tokenization Disparities Across Scripts in Roman Urdu and Urdu
Waleed Jamil | Saima Rafi | Yanchao Yu
Tokenization is central to modern language models, yet its effects on cross-script efficiency, input cost, and truncation behavior remain underexplored. We study this issue through aligned comparisons of Urdu and Roman Urdu, asking whether semantically equivalent content incurs systematically different tokenization costs across scripts. We introduce Token Cost Inequality (TCI), a metric for quantifying relative tokenization efficiency under semantic alignment, and propose a multi-axis framework spanning token cost, fragmentation, and fixed-budget retention. Across three tokenizer families (cl100k, mT5, and ByT5), we find that tokenization disparities are strongly tokenizer-dependent, with substantial differences in token cost and segmentation behavior across scripts. We further identify an efficiency-retention paradox: token cost alone does not fully explain truncation behavior. Under fixed token budgets, Roman Urdu preserves more character-level content than native Urdu, reflecting differences in character-per-token density and fragmentation. Lightweight normalization yields minimal gains, suggesting that the observed disparities arise primarily from tokenizer design rather than superficial orthographic variation. These findings provide controlled evidence that fixed token budgets can produce unequal surface-coverage conditions across scripts, with implications for input-side cost estimation, benchmark design, and multilingual evaluation under constrained token budgets.
Semantic vs. Structural Signals: Log-Probability and LLM-as-a-Judge for Reference-Free Code Evaluation
Dmitriy Fedrushkov | Yulong He | Ivan Smirnov | Artem Aliev | Sergey Kovalchuk
Reference-free evaluation of LLM-generated code is essential when execution-based testing is unavailable or costly. We compare two paradigms: explicit LLM-as-a-Judge scoring, which assigns a quality score to a solution, and log-probability scoring, which uses log P𝜃(code ∣ task) as an instruction-free signal. Across HumanEval-X, we find that the two approaches capture qualitatively different aspects of code correctness. Explicit judges, particularly larger models, perform strongly on generated code, reflecting their ability to reason about task-solution alignment, but fail to distinguish correct solutions from minimally mutated ones. Log-probability exhibits the opposite pattern: weaker performance on generated code, but consistent pairwise separation of canonical from mutated solutions. These results reveal a discrimination-ranking dissociation and show that the two paradigms provide complementary, non-interchangeable signals: explicit judges capture semantic correctness, while log-probability captures local structural consistency.
Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges
Srimonti Dutta | Akshata Kishore Moharir
LLM-as-judge evaluation is widely used in benchmarking pipelines, where model outputs are compared and ranked using automated evaluators. These pipelines typically assume that judgments are stable properties of fixed inputs. We show that this assumption does not hold under interaction. We study post-decision manipulability: the extent to which an evaluation outcome can be altered through subsequent conversation with the judge after an initial decision has been made. Across controlled experiments on MT-Bench and AlpacaEval, we find that LLM judges are highly stable under repeated and neutral reevaluation, yet become substantially reversible under targeted post-decision challenge. An anti-baseline challenge protocol shows that stable judgments can be overturned through motivated interaction, while a counterbalanced target-validation protocol separates this reversibility from net target-directed steering. These reversals have practical consequences: they can degrade agreement with human preferences, shift benchmark rankings, and produce harmful evaluation changes despite high self-reported confidence. Authority framing is especially destabilizing, and revised judgments are often accompanied by low-overlap justifications, suggesting post hoc rationalization rather than reliable error correction. We introduce the Evaluation Robustness Score (ERS) to quantify interactional robustness by combining reversal susceptibility with counterbalanced directional effects. Our findings identify post-decision interaction as a distinct failure mode for LLM-as-judge evaluation and motivate evaluation protocols that measure not only static agreement, but robustness under challenge.
Permutation-Consensus Listwise Judging for Robust Factuality Evaluation
Tianyi Huang | Nathan Huang | Justin Tang | Wenqian Chen | Elsa Fan
Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing substantially in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, the final seven-permutation aggregate (K=7) improves top-1 selection accuracy from 86.00% to 91.33% with GPT-5.4 and from 86.33% to 89.67% with Claude Sonnet 4.6. These results suggest that candidate order can be a meaningful source of factuality-judging error and that marginalizing over this nuisance variation can improve the reliability of LLM evaluation.
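The idea of marginalizing over candidate order can be illustrated with a minimal Python sketch. It is not the PCFJudge implementation: the abstract describes aggregating scores, ranks, and uncertainty signals, whereas this sketch only averages scores. The `judge` callable, the `toy_judge` position bias, and the `QUALITY` values are all invented for illustration.

```python
import itertools
import random
from statistics import mean


def permutation_consensus(candidates, judge, k=7, seed=0):
    """Average each candidate's score over up to k presentation orders.

    `judge` is a hypothetical callable: it receives the candidates in one
    presentation order and returns one score per candidate, in that order.
    Candidates are assumed to be distinct, hashable values.
    """
    # Enumerating every ordering is only practical for short candidate lists.
    orderings = list(itertools.permutations(candidates))
    if len(orderings) > k:
        orderings = random.Random(seed).sample(orderings, k)

    scores = {c: [] for c in candidates}
    for order in orderings:
        for cand, score in zip(order, judge(list(order))):
            scores[cand].append(score)

    consensus = {c: mean(s) for c, s in scores.items()}
    winner = max(consensus, key=consensus.get)
    return winner, consensus


# Toy judge with an invented position bias: the first slot gets a bonus.
QUALITY = {"A": 0.80, "B": 0.70, "C": 0.40}


def toy_judge(order):
    return [QUALITY[c] + (0.2 if i == 0 else 0.0) for i, c in enumerate(order)]


# A single pass with "B" listed first can favour the weaker candidate.
single_pass = dict(zip(["B", "A", "C"], toy_judge(["B", "A", "C"])))
# Averaging over orderings cancels the position bonus in this toy setup.
winner, consensus = permutation_consensus(["A", "B", "C"], toy_judge)
```

With three candidates there are only six orderings, so each candidate occupies the first slot equally often and the bias cancels exactly. With longer lists the sketch samples `k` orderings, so cancellation is only approximate.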
MedFact: Benchmarking the Fact-Checking Capabilities of Large Language Models on Chinese Medical Texts
Jiayi He | Yangmin Huang | Qianyun Du | Xiangying Zhou | Zhiyang He | Jiaxue Hu | Xiaodong Tao | Lixian Lai
Deploying Large Language Models (LLMs) in medical applications requires rigorous fact-checking to ensure patient safety and regulatory compliance. We introduce **MedFact**, a challenging Chinese medical fact-checking benchmark with 2,116 expert-annotated instances from diverse real-world texts, spanning 13 specialties, 8 error types, 4 writing styles, and 5 difficulty levels. Construction uses a hybrid AI-human framework where iterative expert feedback refines AI-driven, multi-criteria filtering to ensure high quality and difficulty. We evaluate 20 leading LLMs on veracity classification and error localization, and results show that models can often determine whether text contains errors but struggle to localize them precisely, with top performers falling short of human performance. Our analysis reveals an "over-criticism" phenomenon, where models misidentify correct information as erroneous, a tendency that is aggravated by advanced reasoning techniques such as multi-agent collaboration and inference-time scaling. MedFact highlights the challenges of deploying medical LLMs and provides resources to develop factually reliable medical AI systems.
Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate
Ali Keramati | Justin Cheok | Jacob Horne | Mark Warschauer
Evaluating reasoning quality in multi-agent LLM systems is challenging, especially for open-ended tasks without reference answers. We investigate whether intrinsic confidence signals, token-level log-probabilities from decoding, can predict reasoning quality as assessed by LLM-as-judge evaluation. Using a debate-based essay scoring framework, we compare confidence proxies against rubric-based judge scores across two ASAP essay sets. We find that early-token confidence, particularly within the first few generated tokens, is consistently the strongest predictor of reasoning quality, outperforming full-sequence statistics. Analysis of log-probability trajectories shows that the opening phase of generation is the most heterogeneous and therefore most informative. We also observe a systematic asymmetry between agent roles, with stronger alignment between confidence and quality for supportive reasoning than for adversarial critique. These results suggest that early decoding dynamics provide a lightweight and effective signal for estimating reasoning reliability in multi-agent LLM systems.
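A minimal sketch of how an early-token confidence proxy could be compared against a full-sequence one, assuming per-token log-probabilities are already available from decoding. The window size `k=5`, the use of Pearson correlation, and the toy numbers are assumptions made for illustration, not details from the paper.

```python
from statistics import mean


def early_token_confidence(token_logprobs, k=5):
    """Mean log-probability over the first k generated tokens."""
    head = token_logprobs[:k]
    if not head:
        raise ValueError("need at least one token log-probability")
    return mean(head)


def full_sequence_confidence(token_logprobs):
    """Mean log-probability over the whole generation."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return mean(token_logprobs)


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Invented toy data: three generations and their judge scores.
traces = [
    [-0.1, -0.2, -0.1, -1.5, -1.4, -1.6],
    [-0.9, -1.1, -1.0, -1.2, -0.3, -0.2],
    [-2.0, -1.8, -2.2, -0.4, -0.5, -0.3],
]
judge_scores = [4.5, 3.0, 1.5]

early = [early_token_confidence(t, k=3) for t in traces]
full = [full_sequence_confidence(t) for t in traces]
early_corr = pearson(early, judge_scores)
full_corr = pearson(full, judge_scores)
```

Comparing `early_corr` with `full_corr` on real traces is the kind of check the abstract describes. The toy values here are constructed so that the opening tokens carry most of the signal.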
Complex-IF and Beyond: Expert Rubrics for RLVR
Sushant Mehta | Liudas Panavas | Eleanor Fleming | Paul Mains | Edwin Chen
As LLM capabilities advance rapidly, the evaluation methods used to assess them increasingly lag behind. Traditional benchmarks rely on programmatic verification of narrow, surface-level constraints, yet real-world instruction following and agentic tasks demand assessment of nuanced, context-dependent behaviors that resist simple scripted checks. We present a systematic analysis of expert-curated rubric-based evaluation as an alternative paradigm, drawing on empirical evidence from two domains: complex instruction following and enterprise agentic tasks. We first articulate five design principles for constructing high-quality rubrics, including Maximum Viable Atomicity, intent-aware criterion design, and iterative LLM-judge calibration. To validate these principles, we introduce COMPLEX-IF, a new expert-curated instruction-following dataset in which each prompt is paired with 10–40 atomic rubric criteria. We demonstrate that these expert rubrics are not only better evaluation instruments but also highly effective training signals: training on approximately 1,000 COMPLEX-IF examples yields +15.5 pp improvement for a 4B-parameter model and +12.2 pp for a 235B-parameter model on instruction following, while single-epoch RL training on a rubric-graded enterprise environment produces gains that transfer to out-of-distribution benchmarks the model was never trained on (+4.5 pp BFCL, +7.4 pp τ²-Bench, +6.8 pp Toolathlon). Our findings establish that expert-authored rubrics improve both the measurement and the development of frontier LLM capabilities, serving as effective evaluation and RL training signals.
C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning
Avni Mittal | Rauno Arike
Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, yet it remains unclear whether they can reliably assess process faithfulness rather than merely answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that explicitly decomposes faithfulness into two complementary dimensions: causality (whether each step logically follows from prior context) and coverage (whether essential intermediate inferences are present). Using controlled perturbations, we construct examples with known causal error positions by replacing a single step with a logically inconsistent variant, and with controlled coverage deletions at varying rates, enabling direct measurement against reference labels. We evaluate three frontier LLM judges across three tasks: binary causal detection, causal step localization, and coverage scoring. Our results reveal that judge reliability is highly task-dependent, with no single model dominating across settings. While models often detect that an error exists, they struggle to accurately localize it, indicating a substantial gap between detection and attribution. Moreover, all judges systematically overestimate reasoning completeness, assigning high coverage scores even when substantial portions of intermediate reasoning are missing. These findings expose fundamental limitations of LLM judges in process-level evaluation and highlight the need for more reliable and calibrated methods when using LLMs to assess reasoning quality.
Evaluating Multilingual Sentiment Classifiers Using an LLM-Annotated Wikipedia Benchmark
Milena Stróżyna | Włodzimierz Lewoniewski | Izabela Czumałowska
We present a multilingual study of sentiment evaluation on Wikipedia articles from various topics in five languages (German, English, Spanish, Polish, and Russian). In this paper, we compare three large language models (Gemini Pro 3.1, Claude Opus 4.6, and GPT 5.2), each queried three times per sentence, with two popular multilingual sentiment classifiers. This setup allows us to analyze not only inter-model differences but also intra-model stability as a proxy for confidence. To support systematic evaluation, we construct a benchmark dataset based on strict consensus across evaluators and analyze sentiment distributions across topics and languages. We show substantial variation in sentiment distributions, agreement, and consistency across models and languages. Our results suggest that sentiment evaluation on encyclopedic text remains an underexplored challenge for multilingual NLP.
Process Standardisation for Human Evaluation of NLP System Outputs
Craig Thomson | Javier González Corbelle | Anya Belz
Human evaluation of NLP systems has high knowledge and effort thresholds. Researchers are often expected to design and run evaluations without formal training, while also creating the required resources from scratch. Recent work has started to address the knowledge threshold, but reusable tools that reduce effort remain limited. In this paper, we take a first step toward automated human-evaluation experiment creation by (i) surveying the processes and data resources used in a representative sample of current human evaluations in NLP, and (ii) deriving a canonical process model from these survey results, which (iii) provides a basis for standardised experiment design and automated toolkit development. The survey shows that recent human-evaluation practices are highly aligned in process structure, making reusable automation feasible.
Language Modeling for the Future of Finance: A Survey into Metrics, Tasks, and Data Opportunities
Nikita Tatarinov | Siddhant Sukhani | Agam Shah | Sudheer Chava
Recent advances in language modeling have led to a growing number of papers related to finance in top-tier Natural Language Processing (NLP) venues. To systematically examine this trend, we review 374 NLP research papers published between 2017 and 2024 across 38 conferences and workshops, with a focused analysis of 221 papers that directly address finance-related tasks. We evaluate these papers across 11 quantitative and qualitative dimensions, with particular attention to evaluation practices, metric choices, dataset coverage, and reproducibility in a high-stakes applied LM domain. Our study identifies the following opportunities for NLP researchers: (i) expanding the scope of forecasting tasks; (ii) enriching evaluation with finance-specific metrics; (iii) leveraging multilingual and crisis-period datasets for robustness-oriented evaluation; and (iv) balancing PLMs with efficient or interpretable alternatives. We identify actionable directions supported by dataset and tool recommendations, with implications for both academic evaluation practices and industry deployment.
Recent LLMs have shown remarkable success in following user instructions, yet handling instructions with multiple constraints remains a significant challenge. In this work, we introduce WildIFEval, a large-scale dataset of 7K real user instructions for single-turn constrained text generation, exhibiting diverse, multi-constraint conditions. Unlike prior datasets, our collection spans a broad lexical and topical spectrum of constraints, extracted from natural user instructions. We categorize these constraints into eight high-level classes to capture their distribution and co-occurrence dynamics in real-world scenarios. Leveraging WildIFEval, we conduct extensive experiments to benchmark the instruction-following capabilities of leading LLMs. WildIFEval clearly differentiates between small and large models, and demonstrates that all models have room for improvement on such tasks. Our analysis reveals that as constraint count grows, models’ overall success drops sharply while per-constraint success remains stable, indicating a capacity bottleneck in juggling multiple constraints, and that models struggle more with rigid form-based constraints than with softer content-based ones. We release our dataset to promote further research on instruction-following under complex, realistic conditions.
EconWebArena: Benchmarking Autonomous Agents on Economic Tasks in Realistic Web Environments
Zefang Liu | Yinzhu Quan
We introduce EconWebArena, a benchmark for evaluating autonomous agents on complex, multimodal economic tasks in realistic web environments. The benchmark comprises 360 curated tasks from 82 authoritative websites spanning domains such as macroeconomics, labor, finance, trade, and public policy. Each task challenges agents to navigate live websites, interpret structured and visual content, interact with real interfaces, and extract precise, time-sensitive data through multi-step workflows. We construct the benchmark by prompting multiple large language models (LLMs) to generate candidate tasks, followed by rigorous human curation to ensure clarity, feasibility, and source reliability. Unlike prior work, EconWebArena emphasizes fidelity to authoritative data sources and the need for grounded web-based economic reasoning. We evaluate a diverse set of state-of-the-art multimodal LLMs as web agents, analyze failure cases, and conduct ablation studies to assess the impact of visual grounding, plan-based reasoning, and interaction design. Our results reveal substantial performance gaps and highlight persistent challenges in grounding, navigation, and multimodal understanding, positioning EconWebArena as a rigorous testbed for economic web intelligence.
ISO-Bench: Benchmarking Multimodal Causal Reasoning in Visual–Language Models through Procedural Plans
Ananya Sadana | Yash Kumar Lal | Jiawei Zhou
Understanding causal relationships across modalities is a core challenge for multimodal models operating in real-world environments. We introduce ISO-Bench, a benchmark for evaluating whether models can infer causal dependencies between visual observations and procedural text. Each example presents an image of a task step and a text snippet from a plan, with the goal of deciding whether the visual step occurs before or after the referenced text step. Evaluation results on ten frontier vision-language models show underwhelming performance: the best zero-shot F1 is only 0.57, and chain-of-thought reasoning yields only modest gains (up to 0.62 F1), far behind human performance (0.98 F1). Our analysis further highlights concrete directions for improving causal understanding in multimodal models.
Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media
Yuefeng Shi | Nedjma Ousidhoum | Jose Camacho-Collados
LLMs have demonstrated exceptional proficiency in a wide range of NLP tasks. However, a notable gap remains in practical data analysis scenarios, particularly when LLMs are required to process long sequences of unstructured documents, such as news feeds or, as specifically addressed in this paper, social media posts. To empirically assess the effectiveness of LLMs in this setting, we introduce a question-based evaluation framework comprising 470 manually curated questions designed to evaluate LLMs’ semantic understanding and reasoning abilities over aggregated text data. We apply our benchmark on diverse Twitter datasets covering various NLP tasks, including sentiment analysis, hate speech detection, and emotion recognition. Our results reveal that the performance depends heavily on input scale and the complexity of the data sources, declining noticeably in multi-label or target-dependent scenarios. In addition, as task complexity increases, performance drops progressively from basic semantic existence identification to more demanding operations such as comparison, counting, and calculation. Furthermore, as the input size grows beyond 500 instances, we identify a common limitation across LLMs, particularly Open-weights models: performance degrades substantially, especially on numerical tasks. These findings highlight critical architectural bottlenecks in current LLMs for performing rigorous quantitative analysis over large text collections.
Teaching Values to Machines: Simulating Human-Like Behavior in LLMs
Asaf Yehudai | Naama Rozen | Ariel Gera
Large Language Models (LLMs) demonstrate a remarkable capacity to adopt different personas and roles; however, it remains unclear whether they can manifest behavior that adheres to a coherent, human-like value structure. In this work, we draw on established psychological value theory to induce human-like values in LLMs and assess their alignment with patterns observed in human studies. Using validated psychological questionnaires, we conduct large-scale experiments (over 5 million questions) to evaluate value structures and value–behavior relationships in leading LLMs and compare them to humans. Our findings reveal strong agreement between value-prompted LLMs and humans across both dimensions. Moreover, incorporating human value distributions enhances population-level simulations with value-induced LLMs. These findings highlight the potential of value-induced LLMs as effective, psychologically grounded tools for simulating human behavior.
MetaGraph: A Large-Scale Meta-Analysis of GenAI in Financial NLP (2022–2025)
Paolo Pedinotti | Peter Baumann | Nathan Jessurun | Leslie Barrett | Enrico Santus
Financial NLP has evolved rapidly since late 2022, outpacing narrative surveys. We introduce MetaGraph, a methodology for extracting typed knowledge graphs from scientific corpora using ontology-guided LLM extraction to enable structured, large-scale trend analysis. Applied to 681 papers on GenAI in Finance (2022–2025), MetaGraph reveals three phases: early LLM-driven expansion of tasks and datasets, growing emphasis on limitations and risk, and a shift toward modular, system-oriented methods (e.g., retrieval-augmented designs). We release the resulting resource and artifacts to support reproducible meta-analysis and future monitoring of the field.
When Users Are Happy but Agents Are Wrong: Multi-Dimensional Evaluation of Tool-Augmented Dialogue
Tanya Shourya | Yingfan Wang | Zhaoyi Joey Hou | Shamik Roy | Vinayshekhar Bannihatti Kumar | Rashmi Gangadharaiah
Evaluating conversational AI systems that use external tools is challenging, as errors can arise from complex interactions among user, agent, and tools. While existing evaluation methods assess either user satisfaction or agents’ tool-calling capabilities, they fail to capture critical errors in multi-turn tool-augmented dialogues—such as when agents misinterpret tool results yet appear satisfactory to users. We introduce TRACE, a benchmark of systematically synthesized tool-augmented conversations covering diverse error cases. Evaluation with state-of-the-art conversation evaluation frameworks reveals that all approaches remain far from ideal performance, demonstrating the fundamental difficulty of this benchmark.
Tool-Aware Planning for Contact-Center Analytics: Evaluating LLMs through Lineage-Guided Query Decomposition
Varun Nathan | Shreyas Guha | Ayush Kumar
We present a domain-grounded benchmark and evaluation framework for tool-aware plan generation in contact-center analytics, where answering a business-insights query requires decomposing it into executable steps over structured tools (Text2SQL over Snowflake), unstructured tools (RAG over transcripts), and LLM-based synthesis, with explicit depends_on relations for safe parallel execution. Our contributions are threefold: (i) a reference-based plan evaluation framework with two complementary views—a metric-wise evaluator spanning seven dimensions (e.g., tool–prompt alignment, query adherence) and a one-shot evaluator that compares a candidate plan against a reference plan; (ii) a lineage-driven data curation methodology that uses an iterative evaluator→optimizer loop to refine initial plans into high-quality plan lineages while reducing manual effort; and (iii) a large-scale study of 14 LLMs across model families and sizes on their ability to generate step-by-step, executable, tool-assigned plans, evaluated with and without lineage in the prompt. Empirically, LLMs continue to struggle on compound queries and on plans longer than four steps; the highest aggregate metric-wise score is 84.8 (Claude-3-7-Sonnet), while the strongest one-shot A+ rate (Extremely Good or Very Good) is only 49.75% (o3-mini). Lineage yields mixed overall gains but improves several strong models and often helps step executability. Overall, our results expose persistent weaknesses in tool understanding—especially tool–prompt alignment and tool-usage completeness—and show that shorter, simpler plans remain markedly easier. The benchmark, evaluation framework, and findings provide a practical path for assessing and improving agentic planning with tools in enterprise question-answering settings. An anonymized dataset with human-annotated reference plans, plan lineages, and per-planner outputs for all 14 planners is available at the anonymous repository linked in the paper.
TSAQA: Time Series Analysis Question And Answering Benchmark
Baoyu Jing | Sanhorn Chen | Lecheng Zheng | Boyu Liu | Zihao Li | Jiaru Zou | Tianxin Wei | Zhining Liu | Zhichen Zeng | Ruizhong Qiu | Xiao Lin | Yuchen Yan | Dongqi Fu | Jingchao Ni | Jingrui He | Hanghang Tong
Time series data are integral to applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore time series question answering (QA), existing benchmarks still provide limited coverage of analytical capabilities under a standardized evaluation framework. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities. TSAQA integrates 6 diverse tasks under a single framework, ranging from conventional analysis, including anomaly detection and classification, to advanced analysis, such as characterization, comparison, data transformation, and temporal relationship analysis. Spanning 210k samples across 13 domains, the dataset employs diverse formats, including true-or-false (TF), multiple-choice (MC), and a novel puzzling (PZ) format, to comprehensively assess time series analysis. Zero-shot evaluation shows that TSAQA remains challenging for current Large Language Models (LLMs): the best-performing commercial model, Gemini-2.5-Flash, achieves 65.08 average accuracy. Although instruction tuning improves open-source models’ performance, even the best-performing model, LLaMA-3.1-8B, shows significant room for improvement. We further evaluate language-capable time series foundation models (TSFMs), showing that TSAQA extends beyond general-purpose LLMs. The data are available at https://huggingface.co/datasets/TSAQA/TSAQA-Benchmark.
Who Endorsed It? Measuring Authority Bias Across Expertise Levels in Language Models
Priyanka Mary Mammen | Emil Joswin | Shankar Venkitachalam
Prior research demonstrates that the performance of language models on reasoning tasks can be influenced by suggestions, hints, and endorsements. However, the influence of endorsement source credibility remains underexplored. We investigate whether language models exhibit systematic bias based on the perceived expertise of the provider of the endorsement. Across 4 datasets spanning mathematical, legal, and medical reasoning, we evaluate 11 models using personas representing four expertise levels per domain. Our results reveal that models are increasingly susceptible to incorrect or misleading endorsements as source expertise increases, with higher-authority sources inducing not only accuracy degradation but also increased confidence in wrong answers. We also show that this authority bias is mechanistically encoded within the model and a model can be steered away from the bias, thereby improving its performance even when an expert gives a misleading endorsement.
Reference Games as a Testbed for the Alignment of Model Uncertainty and Clarification Requests
Manar Ali | Judith Sieker | Sina Zarrieß | Hendrik Buschmeier
In human conversation, both interlocutors play an active role in maintaining mutual understanding. When listeners are uncertain about what speakers mean, for example, they can request clarification. It is an open question for language models whether they can assume a similar listener role, recognizing and expressing their own uncertainty through clarification. We argue that reference games are a suitable testbed to approach this question as they are controlled, self-contained, and make clarification needs explicit and measurable. To test this, we evaluate three vision-language models comparing a baseline reference resolution task to an experiment where the models are instructed to request clarification when uncertain. The results suggest that even in such simple tasks, models often struggle to recognize internal uncertainty and translate it into adequate clarification behavior. This demonstrates the value of reference games as testbeds for interaction qualities of (vision and) language models.
Mapping Out the NLP Evaluation Landscape with a Standard Taxonomy of Quality Criteria
Anya Belz | Simon Mille | Craig Thomson
Prior research shows that when papers report results from system evaluations in terms of a quality criterion such as Fluency, answers to two questions are normally less clear than they should be: (i) was it really Fluency that was evaluated; and (ii) was the same aspect of quality evaluated as in other evaluations also claiming to evaluate Fluency. Answers to these questions are crucial if meaningful conclusions about the Fluency of systems, independently and as compared to others, are to be drawn. We map a combined total of 1,002 individual evaluations identified in three surveys of 310 NLP papers to the standardised QCET inventory of quality criterion names and definitions. Standardisation results in up to 76% reduction in evaluation criteria names, revealing a lot of spurious difference in evaluation naming. We argue that conclusions drawn from NLP system evaluations are only fully interpretable and comparable if grounding in a standard inventory of quality criterion names and definitions forms part of experiment design and reporting, and we propose a way of achieving this.
The critique of scalar benchmark rankings as proxies for model quality is now well-established (Raji et al., 2021; Wallach et al., 2025; Bean et al., 2025; Gehrmann et al., 2021). What the field still lacks is a shared structural vocabulary for comparing, combining, and contextualizing metric design choices. This paper provides that vocabulary: a four-primitive typology of representation (𝜙), comparison (D), aggregation (A), and context (C), under which existing metrics (BLEU, BERTScore, nDCG, LLM-as-judge, calibration scores, agentic outcome measures) are explicit parameterizations of a common form. This typology is paired with a measurement–decision split: metrics are noisy estimators of latent constructs, and model selection is context-dependent Pareto optimization over construct estimates, not over raw scores. The typology makes implicit metric assumptions comparable and debatable rather than hidden inside a single number.
Position: What Are We Measuring? Rethinking Evaluation in Natural Language Generation
Wajdi Zaghouani
The field of natural language generation has accumulated a rich ecosystem of automatic evaluation metrics, yet it lacks a coherent theory of what those metrics are actually measuring. Drawing on measurement theory from the quantitative social sciences, this paper argues that current NLG evaluation practices suffer from a fundamental construct validity problem: metrics are treated as proxies for output quality without explicit specification of the underlying constructs they are meant to operationalize. We examine four dominant evaluation paradigms (reference-based metrics, embedding-based metrics, LLM-as-judge, and human evaluation) and demonstrate that each conflates construct definition with operationalization. Building on a long psychometric tradition reaching back to Cronbach and Meehl (1955) and on recent NLP work that has begun to apply this tradition to bias measurement, dialogue evaluation, and benchmark design, we propose that the field adopt a measurement modeling perspective for NLG evaluation. We borrow the concepts of construct validity, reliability, and consequential validity as a foundation for more principled evaluation, and we outline a preliminary taxonomy of NLG quality constructs as a starting point for this work.
Evaluation methodologies for language models increasingly combine multiple signals—automated metrics, LLM-as-judge ratings, human assessments, and benchmark suite results. When these signals are aggregated via averaging, the resulting evaluation confidence can substantially exceed the reliability of the weakest signal: a phenomenon we call trust inflation in evaluation. We argue that evaluation scores should be treated as epistemic claims with three properties: formality (human evaluation provides stronger evidence than an automated metric), scope (a benchmark result applies to the tested distribution, not universally), and validity windows (benchmark results expire as contamination accumulates and distributions shift). Drawing on several converging research traditions—chain-of-thought analysis, possibilistic logic, and algebraic theory—that establish weakest-link aggregation as the conservative endpoint of a parameterized operator family controlled by a single pessimism parameter, and on concrete lessons from building an evaluation harness for agentic AI, we propose that evaluation results carry explicit metadata—formality tier, scope declaration, and expiration date—to make their epistemic status transparent. We illustrate the cost of mean aggregation on the public HELM leaderboard: across 54 frontier models on ten scenarios, the top-five models ranked by mean score and by weakest-link are completely disjoint.
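The gap between mean and weakest-link aggregation can be shown with a small Python sketch. The linear blend below is an assumed, simplified stand-in for the parameterized operator family mentioned in the abstract, not the paper's definition; the `pessimism` parameter name and the scores for the two hypothetical models are invented.

```python
def aggregate(signals, pessimism):
    """Blend mean aggregation (pessimism=0) with weakest-link (pessimism=1).

    The linear interpolation is an illustrative assumption, chosen only so
    that both endpoints are recovered exactly.
    """
    if not signals:
        raise ValueError("need at least one signal")
    if not 0.0 <= pessimism <= 1.0:
        raise ValueError("pessimism must lie in [0, 1]")
    avg = sum(signals) / len(signals)
    return (1.0 - pessimism) * avg + pessimism * min(signals)


# Invented per-scenario scores for two hypothetical models.
model_a = [0.95, 0.95, 0.40]  # strong on average, one weak scenario
model_b = [0.72, 0.70, 0.71]  # uniformly moderate

mean_prefers_a = aggregate(model_a, 0.0) > aggregate(model_b, 0.0)
weakest_link_prefers_b = aggregate(model_b, 1.0) > aggregate(model_a, 1.0)
```

In this toy case the two endpoints rank the models in opposite order, which is the small-scale analogue of the leaderboard disagreement the abstract reports.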
Position: A Semiotic-Hermeneutic Approach to Evaluating Meaning in LLM Summaries via the Inductive Conceptual Rating Metric
Natalie Perez | Sreyoshi Bhaduri | Aman Chadha
Meaning in human language is relational and context-dependent, and it emerges, according to Saussure (1916), through a dynamic system of signs rather than fixed relationships between words and concepts. Insights from the study of semiotics and hermeneutics emphasize that meaning arises through interpretive processes shaped by context, which has historically posed challenges for computational processing and evaluation. Building on these perspectives, this article advances an interdisciplinary framework for evaluating meaning in machine-generated language and introduces the Inductive Conceptual Rating (ICR) metric, a qualitative approach grounded in inductive content analysis and reflective thematic analysis that assesses semantic accuracy and meaning alignment in generative artificial intelligence (GenAI) outputs beyond surface-level lexical and similarity metrics. The ICR metric is applied in an empirical study that compares thematic summaries generated by a large language model (LLM) with human-generated output in five datasets (N = 50–800). Results show that although models achieve high linguistic similarity scores, they consistently underperformed relative to human outputs in capturing recurring, contextually grounded meanings. This work concludes by discussing implications for meaning evaluation and future research.
Recent years have seen rapid growth in evaluation and benchmarking in NLP, driven by advances in large language models (LLMs). This growth has shifted evaluation from measuring generalization to tracking capability, often without reference to training assumptions. We argue that this creates a conceptual gap: results are frequently interpreted without considering what models could plausibly have learned, rendering many conclusions scientifically underdetermined. We propose an expectation-aware view, where the informativeness of evaluation depends on its relationship to training data, model design, and tasks. We further distinguish between evaluation for scientific understanding and capability tracking, and provide recommendations for aligning evaluation with its intended purpose in the LLM era.
The Shared Task on Reproducibility of Evaluations in NLP (ReproNLP) 2026: Overview and Results
Anya Belz | Craig Thomson | Javier González Corbelle
We present the 2026 Shared Task on Reproducibility of Evaluations in NLP (ReproNLP’26), which followed on from five predecessor shared tasks on reproducibility of evaluations: ReproNLP’25, ReproNLP’24, ReproNLP’23, ReproGen’22 and ReproGen’21. This shared task series forms part of an ongoing research programme designed to develop theory and practice of reproducibility assessment in NLP and machine learning, against a backdrop of increasing recognition of the importance of the topic across the two fields. We describe the ReproNLP’26 shared task, summarise results from the reproduction studies submitted, and provide additional comparative analysis of their results.
Do Nugget-Based Evaluation Patterns Generalize to List-QA?
MohammadJavad Ardestani | Ehsan Kamalloo | Davood Rafiei
Evaluating long-form answers from retrieval-augmented generation (RAG) systems remains challenging: human evaluation is expensive, while automatic metrics must reliably capture answer completeness. The AutoNuggetizer framework addresses this by decomposing evaluation into atomic facts (nuggets) and using LLMs for both nugget creation and assignment. The original study validated this approach on open-ended TREC RAG queries; however, it remains unclear whether the same cost-quality tradeoffs hold for structurally different tasks. We reproduce AutoNuggetizer on seven RAG systems over the QAMPARI list-QA benchmark, where answers consist of discrete entities and omissions are more directly measurable. Our results directionally reproduce the original findings: fully automatic evaluation preserves run-level rankings, assignment-only automation yields stronger agreement than end-to-end automation, and LLM-based assignment is highly concordant with human labels while being modestly stricter. These findings support the use of AutoNuggetizer for comparative evaluation beyond open-ended RAG, while also identifying systematic biases in automatic nugget creation and assignment.
ReproNLP 2026: A Third Replication of the Human Evaluation of a QAG System for Children’s Storybooks
Marcel Mroczek | Chiara Albarello | Paul-Emmanuel Floch | Maciej Gawinecki
Reproducibility of human evaluations in Natural Language Processing remains a critical open challenge. This paper presents a third independent replication of the human evaluation from Yao et al. (2022), which assessed an automated Question-Answer Generation (QAG) system for children’s storybooks against a baseline system and human-authored ground truth, across three criteria (Readability, Question Relevance, and Answer Relevance), using five NLP-literate annotators. Our replication confirms the main findings of the original study: the QAG system outperforms the baseline on Readability and Question Relevance, and Ground Truth ranks highest across all criteria. System rankings are preserved across all three criteria, with the exception of a statistically non-significant difference in Answer Relevance. This holds true despite a severe drop in inter-annotator agreement for Readability. We further document several methodological concerns, some unreported in prior replications, including data quality issues and evaluation design limitations identified during our pilot study.
In the context of the ReproNLP’26 shared task, I report on a single-criterion reproduction study of a human evaluation experiment for neural referring expression generation models (Castro Ferreira et al., 2018a), which has already been reproduced once by Mahamood (2024) for the ReproHum 2024 shared task. The experiments reported on in this paper therefore seek to second the findings from both previous experiments.
ReproHum #0866-04: Variability in Human Judgments of Sociopolitical Acceptability Across Studies
Rui Fan | Guanyi Chen
Human evaluations are essential for assessing NLP systems, but their reproducibility can be limited when judgments involve socially sensitive constructs. This paper reproduces the perceived sociopolitical acceptability evaluation in (CITATION), where annotators judged whether model-generated writer-intent implications reflected mainstream or fringe viewpoints. Using the same 600 headline–belief pairs, we collected new annotations on Prolific and compared our results with both the original study and a prior reproduction. Our scores are lower than the original results. Under a 70% threshold, these findings do not support the original conclusion that most generations were socially acceptable. Overall, our results align more closely with the prior reproduction, while also showing substantial variability, especially for GPT2-large. We argue that this variability may arise from a combination of platform differences, task framing, topic effects, and changes in social context over time. These findings highlight the importance of reporting not only annotation results, but also the evaluation setting in which subjective social judgments are collected.
ReproHum #0031–01: Reproducing a Human Readability Evaluation for Question–Answer Generation Systems
Manuela Hürlimann | Mark Cieliebak
Human evaluations play a central role in assessing natural language processing systems, yet their robustness and reproducibility remain incompletely understood. This paper reports on a reproduction of the human readability evaluation from Yao et al. (2022) for question–answer generation (QAG) systems, conducted within the ReproHum project and the ReproNLP 2026 shared task (Belz et al., 2026). The original evaluation compared three QAG systems with respect to three criteria. We reproduced the evaluation of one of these criteria, readability, using a new group of five evaluators. We report descriptive results, inter-annotator agreement, system-level comparisons, and cross-study robustness metrics compared to the original study and two previous reproductions. Our results support all conclusions of the original evaluation and are largely consistent with two previous reproductions.
ReproHum #0033-05: Human Evaluation Report on "Generating Scientific Definitions with Controllable Complexity"
Ines Arous | Jackie Chi Kit Cheung
Human evaluation remains a central component of assessing NLG systems, especially for open-ended or creative generation tasks. Yet, the field still lacks standardized practices for designing and reporting such evaluations. In this paper, we present a reproduction study of the human evaluation conducted by August et al. for their method of generating scientific definitions with controllable complexity. By closely replicating their experimental setup, we find that our results partially align with the original findings, suggesting a moderate level of reproducibility.
We describe our attempt to reproduce a single human evaluation quality criterion that was conducted in the paper “Reproducing a Recipe for Arbitrary Text Style Transfer with LLMs”. This paper describes the approach and challenges involved in reproducing the human evaluation as done by the original authors. In particular, we describe negative results obtained during the reproduction, and we compare our results with an earlier reproduction for the same experiment. Finally, we describe the insights we gained from attempting this particular reproduction and the barriers that remain in attempting successful reproductions. The results and insights presented will hopefully enable the broader NLP research community to improve both how human evaluations are conducted and enable better reproducibility of NLP experiments in the future.
Proceedings of the 1st Workshop on Linguistic Analysis for Health (HeaLing 2026)
Vera Danilova | Murathan Kurfalı | Ylva Söderfeldt | Julia Reed | Andrew Burchell
Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?
Grace Chang Yuan | Xiaoman Zhang | Sung Eun Kim | Pranav Rajpurkar
Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.
The Doctor Will Agree With You Now: Sycophancy of Large Language Models in Multi-Turn Medical Conversations
Taeil Matthew Kim | Luyang Luo | Sung Eun Kim | Arjun Kumar Manrai | Eric Topol | Pranav Rajpurkar
Large language models (LLMs) increasingly exhibit sycophancy—the tendency to conform to user beliefs rather than provide factually accurate information—posing significant risks in healthcare applications where reliability is paramount. We evaluate sycophantic behavior in ten LLMs from OpenAI, Google, and Anthropic across multi-turn medical conversations using an escalatory pushback framework. To enable fine-grained analysis, we introduce Resistance, a metric that measures nonconformity to user stances at each conversational turn, providing insights beyond existing flip-based metrics. Evaluating on MedCaseReasoning (open-ended diagnostic questions) and PubMedQA (clear-answer biomedical questions), we find that Gemini models exhibit the highest Resistance, followed by OpenAI and Claude models. We further observe that response patterns ("Yes, but..." vs. "Yes, and...") may be more predictive of sycophancy than specific phrases. Notably, all models are more easily persuaded to change their answers on clear multiple-choice questions than on ambiguous diagnostic cases. Our findings highlight critical vulnerabilities in deploying LLMs for clinical decision support and suggest that training toward contradiction-maintaining response patterns may serve as a potential mitigation strategy.
Discourses of Prevention: A Multimodal Study of HPV Vaccination Campaigns in Italy
Claudia Roberta Combei | Antonio Bianco | Elena Giribaldi | Adalberto Lovotti | Valentina Ghirotto | Marianna France Pasquali | Sara Gemelli | Chiara Cassani | Chiara Zanchi
This study assesses the communicative effectiveness of Italian HPV vaccination campaign materials using a mixed-methods design that combines expert annotation and a public perception experiment. A corpus of 49 official documents was annotated by six experts (three Linguistics Ph.D. students and three Gynecology residents) across 56 variables capturing the appropriateness and efficiency of verbal and visual elements. The perception experiment, administered to a convenience sample of Italian general public, examined attitudes toward HPV vaccination and evaluations of communication effectiveness. Overall, both expert and public assessments converged in judging the HPV vaccination campaign materials as relatively weak, citing reduced informativeness in overly concise texts, inappropriate choice of colors, and recurring issues regarding gender representation, inclusivity, and diversity.
Extracting medical decisions from clinical notes is a key step for clinical decision support and patient-facing care summaries. We study how the linguistic characteristics of clinical decisions vary across decision categories and whether these differences explain extraction failures. Using MedDec discharge summaries annotated with decision categories from the Decision Identification and Classification Taxonomy for Use in Medicine (DICTUM), we compute seven linguistic indices for each decision span and analyze span-level extraction recall of a standard transformer model. We find clear category-specific signatures: drug-related and problem-defining decisions are entity-dense and telegraphic, whereas advice and precaution decisions contain more narrative, with higher stopword and pronoun proportions and more frequent hedging and negation cues. On the validation split, exact-match recall is 48%, with large gaps across linguistic strata: recall drops from 58% to 24% from the lowest to highest stopword-proportion bins, and spans containing hedging or negation cues are less likely to be recovered. Under a relaxed overlap-based match criterion, recall increases to 71%, indicating that many errors are span boundary disagreements rather than complete misses. Overall, narrative-style spans–common in advice and precaution decisions–are a consistent blind spot under exact matching, suggesting that downstream systems should incorporate boundary-tolerant evaluation and extraction strategies for clinical decisions.
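The gap between exact-match and overlap-based recall reported in this abstract can be illustrated with a small sketch. The spans below are hypothetical character offsets, and the relaxed criterion used here (any shared position between a gold and a predicted span) is an assumption; the abstract does not specify the exact overlap rule.

```python
def overlaps(a, b):
    """True if half-open spans a=(start, end) and b=(start, end) share a position."""
    return a[0] < b[1] and b[0] < a[1]

def span_recall(gold, pred, relaxed=False):
    """Fraction of gold spans recovered, under exact or overlap-based matching."""
    if not gold:
        return 0.0
    pred_set = set(pred)
    hits = 0
    for g in gold:
        if relaxed:
            # Assumed relaxed rule: any predicted span touching the gold span counts.
            hits += any(overlaps(g, p) for p in pred)
        else:
            # Exact rule: both boundaries must agree.
            hits += g in pred_set
    return hits / len(gold)

# Hypothetical gold and predicted decision spans (character offsets).
gold = [(0, 10), (20, 35), (50, 60), (70, 90)]
pred = [(0, 10), (22, 35), (48, 60)]

exact = span_recall(gold, pred)
relaxed = span_recall(gold, pred, relaxed=True)
print(f"exact-match recall: {exact:.2f}, overlap recall: {relaxed:.2f}")
```

Two of the three predictions here miss only on a boundary, so they count under the relaxed rule but not under exact matching, which is the kind of boundary disagreement the abstract describes.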
Semantic Echo Pathways (SEP): Tracing How Medical Language Propagates and Transforms
Charu Karakkaparambil James | Marcio Monteiro | Sophie Fellenz
We introduce Semantic Echo Pathways (SEP), a new approach for modeling the cross-domain evolution of medical language. Using continual neural topic models (CoNTM) trained separately on scientific literature, clinical notes, and public health-related data, we track linguistic drift and identify points where concepts change meaning. We propose three novel metrics (Cross-Domain Drift Score, Temporal Echo Lag, and Semantic Mutation Patterns) to quantify how medical language travels between the scientific, clinical, and public domains. Applications to evolving concepts, such as "long COVID" and diagnostic category changes, reveal previously undocumented patterns of medical-semantic evolution. Our results bridge computational modeling with the human-centered perspectives of medical humanities, offering clear, domain-aware maps of how medical language shifts across time and domains, and combining quantitative analysis with linguistic and clinical insight.
A Graph-Augmented Liquid Neural Network for Extracting Food Hazards and Disease Outbreaks
Tirthankar Dasgupta | Manjira Sinha | Sudeshna Jana | Diya Saha | Ishan Verma | Vaishali Aggarwal
The increasing frequency of foodborne illnesses, safety hazards, and disease outbreaks in the food supply chain demands urgent attention to protect public health. These incidents, ranging from contamination to intentional adulteration of food and feed, pose serious risks to consumers, leading to poisoning and disease outbreaks that result in product recalls. Identifying and tracking the sources and pathways of contamination is essential for timely intervention and prevention. This paper explores the use of social media and regulatory news reports to detect food safety issues and disease outbreaks. We present an automated approach leveraging a multi-task sequence labeling and sequence classification model that uses a liquid time-constant neural network augmented with a graph convolution network to extract and analyze relevant information from social media posts and official reports. Our methodology includes the creation of annotated datasets of social media content and regulatory documents, enabling the model to identify foodborne infections and safety hazards in real time. Preliminary results demonstrate that our model outperforms baseline models, including advanced large language models like LLAMA-3 and Mistral-7B, in terms of accuracy and efficiency. The integration of liquid neural networks significantly reduces computational and memory requirements, achieving superior performance with just 1.2 × 10^6 bytes of memory, compared to the 20.3 GB of GPU memory needed by traditional transformer-based models. This approach offers a promising solution for leveraging social media data in monitoring and mitigating food safety risks and public health threats.
Multimodal Artificial Intelligence (AI) promises to transform biomedicine by integrating imaging, genomics, and clinical data for superior decision-making. Yet, we contend that the current pursuit of large-scale generalist models is fundamentally misaligned with the high-risk nature of biomedical applications. This position paper argues that biomedical NLP demands specialization, not generalization, challenging the assumption that greater model scale and generality inherently ensure robustness in healthcare. We propose a theoretical framework built on three biomedical axioms: error cost asymmetry, multimodal data fragility, and interpretability–utility coupling, alongside a formal proof of criticality in biomedical NLP, showing that generalist models are intrinsically unsuited for medical tasks. As a secondary contribution, we advance a task-first design paradigm centered on modular, specialized, and ethically grounded AI architectures for biomedical use. Through analysis and illustrative cases, we contrast this approach with scale-centric strategies, exposing risks such as bias amplification, reduced interpretability, and exclusion of rare or underrepresented populations. We call for a realignment of research, funding, and regulation toward specialization as the sustainable path for meaningful and equitable biomedical AI, aiming to spark critical discourse on what constitutes genuine progress in machine learning for health.
An Enhanced Training-Free Pipeline for Entity Recognition and Linking: A Low-Resource Case Study – 20-th Century Historical Medical Texts
Phu-Vinh Nguyen | Vera Danilova
Entity linking in biomedicine typically relies on large annotated corpora and supervised methods, which often fail in out-of-distribution settings. Historical medical texts are rich in biomedical terms but pose unique challenges: terminology has changed, some concepts are obsolete, and stylistic differences from modern journals prevent off-the-shelf models fine-tuned on contemporary datasets from aligning historical terms with current ontologies. Training-free methods based on LLMs offer a solution by linking historical terms to modern concepts and inferring their meaning from context. In this paper, we evaluate a state-of-the-art training-free entity linking method on historical medical texts and propose an improved pipeline—end-to-end entity extraction and linking with confidence estimation. We also assess performance on modern benchmarks to check whether the gains generalize to other domains and show their superior performance in most cases. We report an analysis of the findings. The code and curated dataset for historical medical entity linking are available on GitHub.
Graph-Enhanced LLM Analysis of Multimodal Health Communities: A Computational Framework for Patient Discourse Understanding on TikTok
Tawakalit Agboola | Oluwaseun Ajao
Social media platforms have become critical sources of patient-generated health data, yet existing computational approaches fail to capture the interconnected nature of online health discourse. We present a novel framework that integrates graph-based community detection with large language model analysis to understand patient narratives in multimodal social media content. Applied to 10,253 TikTok posts about JAK inhibitors (2020-2024), our approach constructs heterogeneous graphs representing user-content-medical entity relationships and applies community detection algorithms enhanced with context-aware LLM interpretation. Our comprehensive analysis of 10,253 posts (January 2020–September 2024) reveals five distinct patient communities characterized by different discourse patterns: treatment success narratives (873 nodes), medication guidance (642 nodes), side effect discussions (589 nodes), comparative treatment analysis (412 nodes), and dosage optimization (347 nodes). The Louvain algorithm significantly outperformed Girvan-Newman in modularity (0.9931 vs. 0.9928), conductance (0.0002 vs. 0.0006), and computational efficiency (0.14s vs. 54.24s). Temporal analysis demonstrates increasing community cohesion and evolving discourse patterns from cautious inquiry (2020-2021) to experience sharing and specialized sub-communities (2023-2024). This work contributes: (1) a scalable computational framework for multimodal health content analysis, (2) methodological innovations in graph-LLM integration, and (3) insights into platform-specific health communication patterns. The framework has applications in pharmacovigilance, computational social science, and AI-assisted health monitoring systems.
Almost Clinical: Linguistic properties of synthetic electronic health records
Serge Sharoff | John Baker | Dr David Francis Hunt | Alan Simpson
This study evaluates the linguistic and clinical suitability of synthetic electronic health records in mental health. First, we describe the rationale and the methodology for creating the synthetic corpus. Second, we examine expressions of agency, modality, and information flow across four clinical genres (Assessments, Correspondence, Referrals and Care plans) with the aim to understand how LLMs grammatically construct medical authority and patient agency through linguistic choices. While LLMs produce coherent, terminology-appropriate texts that approximate clinical practice, systematic divergences remain, including registerial shifts, insufficient clinical specificity, and inaccuracies in medication use and diagnostic procedures. The results show both the potential and limitations of synthetic corpora for enabling large-scale linguistic research otherwise impossible with genuine patient records.
Mind Your Steps in Biomedical Named Entity Recognition: First Extract, Tag Afterwards
Darya Shlyk | Stefano Montanelli | Marco Mesiti | Lawrence Hunter
Few-shot prompting with Large Language Models (LLMs) has emerged as a promising paradigm for advancing information extraction, particularly in data-scarce domains like biomedicine, where high annotation costs constrain the availability of training data. However, challenges persist in biomedical Named Entity Recognition (NER), where LLMs fail to achieve necessary accuracy and lag behind supervised fine-tuned models. In this study, we introduce FETA (First Extract, Tag Afterwards), a two-stage approach for entity recognition that combines instruction-guided prompting and a novel self-verification strategy to improve accuracy and reliability of LLM predictions in domain-specific NER tasks. FETA achieves state-of-the-art results on multiple established biomedical datasets. Our experiments demonstrate that carefully designed prompts, using self-verification and instruction guidance, can steer general-purpose LLMs to outperform fine-tuned models in knowledge-intensive NER tasks, unlocking their potential for more reliable and accurate information extraction in resource-constrained settings.
Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
Ikram Belmadani | Oumaima El Khettari | Pacôme Constant dit Beaufils | Richard Dufour | Benoit Favre
Automatic evaluation of open-ended question answering in specialized domains remains challenging mainly because it relies on manual annotations from domain experts. In this work, we assess the ability of several large language models (LLMs), including closed-access (GPT-5.1, Gemini-2.5-Pro), open-source general-purpose (Qwen-80B), and biomedical domain-adapted models (MedGemma-27B, Phi-3.5-mini variants), to act as automatic evaluators of semantic equivalence in French medical open-ended QA. Our analysis reveals that LLM-based judgments are sensitive to the source of answer generation: judgement correlation varies substantially across different generator models. Among the judges, MedGemma-27B and Qwen-80B achieve the highest agreement with expert annotations in terms of F1 score and Pearson correlation. We further explore lightweight adaptation strategies on Phi-3.5-mini using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO). Even with 184 training instances, these adaptations significantly improve Phi-3.5’s results and reduce variability across answer generators, achieving performance comparable to larger domain-adapted models. Our results highlight the importance of generator-aware evaluation, the limitations of general-purpose LLMs in domain-specific settings, and the effectiveness of lightweight adaptation for compact models in low-resource scenarios.
Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks
Chaimae Abouzahir | Congbo Ma | Nizar Habash | Farah E. Shamout
In recent years, Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education and medical question answering. Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted discrepancies in performance in low-resource languages for various medical tasks, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering. Our findings reveal a persistent language-driven performance gap that intensifies with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis shows that model-reported confidence and explanations are poor indicators of correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.
Modulating Multi-Label Tendency in Zero-Shot LLM Coding: The Effect of Output Structure on CDSS Feedback Analysis
Hyunwoo Choo | Sungsoo Hong
Large language models (LLMs) often default to single-label classification in zero-shot multi-label tasks—a tendency we term "conservative default". While few-shot prompting mitigates this, it introduces "example bias". We evaluate zero-shot strategies to modulate this tendency using 1,441 healthcare feedback records and two LLMs. We compare instruction-based methods with structural constraints that modify the token generation sequence, specifically an Enum-First format requiring domain enumeration before selection. Results show that structural constraints substantially reduce single-label rates (Magistral: 96% → 19%; Qwen3: 54% → 0.0%), though the latter suggests potential over-correction compared to human baselines (16.7–41.3%). These findings indicate that while output structure is a potent modulator of classification behavior by shifting the decision point upstream, its effect magnitude is model-dependent, necessitating empirical calibration to prevent spurious associations.
Normalizing Health Concepts with Biomedical Embedding and LLMs
Iram Azam | Keyuan Jiang | Gordon Bernard
Accurate normalization of health-related expressions to standardized biomedical concepts is crucial for both healthcare and biomedical research. However, traditional string-based matching methods are limited by lexical variations. In this study, we propose a neural embedding-based normalization framework that utilizes an embedding model trained on biomedical terminology, generating over 3.59 million embeddings corresponding to UMLS terms and Concept Unique Identifiers (CUIs). For clinical data, CUIs were retrieved via semantic matching, while Twitter phrases were first processed using a large language model (LLM) to generate preferred terms prior to embedding-based CUI retrieval. Our approach substantially outperforms exact string matching and MetaMap Lite. For clinical data (3,144 phrases), normalization accuracy improved from 0.679 (string match) and 0.574 (MetaMap Lite) to 0.858. For Twitter data (102 phrases), accuracy increased from 0.235 (string match) and 0.118 (MetaMap Lite) to a range of 0.882 (Gemini 2.5 Flash) to 0.980 (GPT-4o mini). These findings highlight both the effectiveness of embedding-based semantic retrieval and the ability of LLMs to generate preferred terms, enhancing robustness in health concept normalization across diverse text sources.
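The embedding-based retrieval step described in this abstract can be sketched as a nearest-neighbor lookup. This is a toy illustration under stated assumptions: the concept identifiers and 3-dimensional vectors are placeholders rather than real CUIs or model outputs, and cosine similarity is assumed as the matching function, which the abstract does not confirm.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical concept index: placeholder identifiers mapped to toy vectors.
# A real index would hold embeddings of terminology entries from an embedding model.
concept_index = {
    "CONCEPT_A": [1.0, 0.0, 0.0],
    "CONCEPT_B": [0.0, 1.0, 0.0],
    "CONCEPT_C": [0.6, 0.8, 0.0],
}

def normalize_phrase(query_vec, index):
    """Return the identifier whose vector is most similar to the query vector."""
    return max(index, key=lambda cid: cosine(query_vec, index[cid]))

# Toy embedding of an input phrase (in practice produced by the same embedding model).
query = [0.1, 0.9, 0.0]
best = normalize_phrase(query, concept_index)
print("best match:", best)
```

Unlike exact string matching, the lookup returns the closest concept even when the phrase shares no surface form with any indexed term, which is the robustness to lexical variation the abstract emphasizes.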
From Pain to Praise: Aspect-Based Sentiment Analysis for Norwegian Patient Feedback
Lilja Charlotte Storset | Elma Jelin | Rebecka Maria Norman | Oyvind Bjertnaes | Lilja Øvrelid | Erik Velldal
This paper describes a new dataset for aspect-based sentiment analysis (ABSA) for analyzing patient feedback about healthcare services. In an interdisciplinary collaboration spanning the fields of natural language processing and healthcare research, we manually annotate a dataset of 2382 free-text comments collected from national patient experience surveys in Norway, covering two sub-fields of services – special mental healthcare and general practitioners. Annotations are provided on both the sentence- and comment-level, covering a fine-grained set of 25 unique healthcare-related aspects and their polarities. We also report results for fine-tuning both encoder- and decoder models on the resulting dataset, comparing different modeling strategies, like joint and sequential prediction of aspects and polarity. The resources developed in this work can assist healthcare researchers in the analysis of patient feedback, bringing a much more efficient approach compared to today’s manual analysis, potentially leading to improved patient satisfaction and clinical outcomes.
LLM Plug-ins Are Not a Free Lunch for Clinical Time-Series Prediction
Juhwan Choi | Kwanhyung Lee | Sangchul Hahn | Eunho Yang
Inspired by recent plug-in frameworks that repurpose frozen layers from large language models (LLMs) as inductive priors, we explore whether such mechanisms can be extended to clinical time-series prediction without textual inputs or LLM fine-tuning. We introduce a lightweight plug-in architecture that inserts a single frozen LLM Transformer layer between an aggregated time-series representation and the prediction head. Unlike prior work focused on vision or language tasks, our study targets clinical time-series data, where LLMs typically underperform when applied directly. Experiments on two ICU prediction tasks from MIMIC-III show that the proposed plug-in exhibits heterogeneous effects across different backbones and tasks, with occasional performance improvements and minimal computational overhead. We further compare general-purpose and medical-domain LLM layers under an identical plug-in setting, analyzing how domain specialization interacts with clinical time-series models. Overall, our results highlight important limitations of frozen LLM plug-ins and motivate future work on understanding the conditions under which such layers may be beneficial.
Tracking Autism Stigma in Italian Newspapers: A Longitudinal Analysis of Media Discourse (2016–2025)
Ginevra Martinelli | Chiara Barattieri di San Pietro | Daniela Ovadia | Marta Bosia | Valentina Bambini
Public awareness of Autism Spectrum Disorder (ASD) has grown in recent years, yet stigma surrounding this condition persists. Building on prior research showing increasingly positive portrayals of ASD, this study examines recent longitudinal trends in stigma and ASD, with a focus on Italian newspapers, and how these were affected by a key event such as the COVID-19 pandemic. We analyzed nearly 3,000 articles published between 2016 and 2025 using an innovative multi-layered Natural Language Processing (NLP) framework to capture multiple dimensions of stigma, including discriminatory language, emotional framings indicative of prejudices, stereotypes, and the thematic contexts in which ASD-related stigma appears. Overall, results indicate low levels of overt stigma and a gradual shift toward more positive portrayals, with only temporary disruptions during the pandemic. Some stereotypes remain, highlighting the need for ongoing attention to ASD representation in the media.
Why Are We Lonely? Leveraging LLMs to Measure and Understand Loneliness in Caregivers and Non-caregivers
Michelle Damin Kim | Ellie S. Paek | Yufen Lin | Emily Mroz | Jane Chung | Jinho D. Choi
This paper presents an LLM-driven approach for constructing diverse social media datasets to measure and compare loneliness in the caregiver and non-caregiver populations. We introduce an expert-developed loneliness evaluation framework and an expert-informed typology for categorizing causes of loneliness for analyzing social media text. Using a human-validated data processing pipeline, we apply GPT-4o, GPT-5-nano, and GPT-5 to build a high-quality Reddit corpus and analyze loneliness across both populations. The loneliness evaluation framework achieved average accuracies of 76.09% and 79.78% for caregivers and non-caregivers, respectively. The cause categorization framework achieved micro-aggregate F1 scores of 0.825 and 0.80 for caregivers and non-caregivers, respectively. Across populations, we observe substantial differences in the distribution of types of causes of loneliness. Caregivers’ loneliness was predominantly linked to caregiving roles, identity recognition, and feelings of abandonment, indicating distinct loneliness experiences between the two groups. Demographic extraction further demonstrates the viability of Reddit for building a diverse caregiver loneliness dataset. Overall, this work establishes an LLM-based pipeline for creating high-quality social media datasets for studying loneliness and demonstrates its effectiveness in analyzing population-level differences in the manifestation of loneliness.
Importance of Prompt Optimisation for Error Detection in Medical Notes Using Language Models
Craig Myles | Patrick Schrempf | David Harris-Birtill
Errors in medical text can cause delays or even result in incorrect treatment for patients. Recently, language models have shown promise in their ability to automatically detect errors in medical text, an ability that has the opportunity to significantly benefit healthcare systems. In this paper, we explore the importance of prompt optimisation for small and large language models when applied to the task of error detection. We perform rigorous experiments and analysis across frontier language models and open-source language models. We show that automatic prompt optimisation with Genetic-Pareto (GEPA) improves error detection over the baseline accuracy performance from 0.669 to 0.785 with GPT-5 and 0.578 to 0.690 with Qwen3-32B, approaching the performance of medical doctors and achieving state-of-the-art performance on the MEDEC benchmark dataset. Code available on GitHub: https://github.com/CraigMyles/clinical-note-error-detection
Linguistic Features Competitive with Bert! Leveraging Speech for Detection of Mental Health in Paediatric Lupus
Jida Jaffan | Barend Beekhuizen | Andrea Knight
Neuropsychiatric lupus (NPSLE) is characterized by inflammation in the brain with common symptoms of depression and anxiety. Early detection is crucial as it may change the treatment regimen; however, current approaches are costly and resource intensive. Therefore, we propose that leveraging current work using linguistics in NLP detection of mental health symptoms can be advantageous in early detection of NPSLE. This study is a proof-of-concept using 20 interviews from N=20 adolescents (10-17 years) diagnosed with Lupus. Our results suggest that linguistic feature-based models supported by Word2Vec embeddings offer an interpretable output compared with BERT models, while maintaining competitiveness in depression, and improvement over BERT in anxiety detection. This work may transform early screening methods in paediatric contexts and can be adapted to other clinical populations.
A Multimodal Framework for Aphasia Severity Classification in Russian
Kolmogorova Anastasia | Ekaterina Yavshitz | Anastasia Margolina | Anna Sugian
Automatic classification of aphasia severity presents persistent challenges, particularly for languages with limited clinical speech resources such as Russian. This paper explores a multimodal approach to severity estimation that combines acoustic and semantic representations of pathological speech. Acoustic features are extracted using pretrained Wav2Vec 2.0 models, while semantic information is obtained from the encoder of the Whisper model. The two representations are integrated via early feature fusion and evaluated using gradient boosting classifiers in a speaker-independent cross-validation setting. Experiments are conducted on a newly collected dataset of Russian speech recordings from patients with aphasia and neurotypical speakers (RuAphasiaBank). The results suggest that the combined use of acoustic and semantic embeddings can provide more stable severity estimates than unimodal baselines. This study contributes empirical evidence on the applicability of multimodal representation learning for aphasia severity classification under data-scarce conditions.
Data Augmentation Based on Selective Masking of Language Models for One Health Context
Youssef Mahdoubi | Najlae Idrissi | Mathieu Roche | Sarah Valentin
This study focuses on improving the performance of language models for two critical applications within the One Health context, specifically in epidemiological monitoring using textual data: (i) thematic classification across syndromic surveillance, biomedical and plant health domains, and (ii) detection of epidemic misinformation. A key challenge in these tasks is the limited availability of labeled textual data, which constrains the effectiveness of supervised learning methods. To overcome this limitation, we introduce two families of selective masking–based data augmentation strategies: lexical and non-lexical. Each family is implemented in a standard variant (Aug-SM-Lex and Aug-SM-NonLex), and a TF-IDF-weighted variant (Aug-SM-Lex-TFIDF and Aug-SM-NonLex-TFIDF). We perform two complementary experiments: the first determines the optimal masking rate, while the second evaluates the proposed strategies against LLM-based text reformulation. Experimental results indicate that selective masking-based augmentation outperformed both LLM-based reformulation (Mistral-7B and GPT-Neo-1.3B) and baseline models trained on original data alone across three of the five evaluated datasets, with the best performance achieved at a masking rate of 20%. This suggests that selective masking is a promising approach, potentially more effective than computationally expensive LLM-based reformulation.
Towards Inclusive Communication in Cancer Prevention and Treatment: A Case Study on Italian Informational Materials
Chiara Cassani | Luca Brigada Villa | Marco Forlano | Serena Coschignano | Amelia Barcellini | Silvia Luraghi | Alberto Giovanni Leone | Chiara Zanchi | Adalberto Lovotti
This paper presents an annotation scheme developed to analyze linguistic accessibility and inclusivity in Italian cancer-related informational materials. The scheme combines metadata annotation, qualitative analysis of textual and visual features, and automatically extracted measures of linguistic complexity capturing structural, lexical, and probabilistic properties of the texts. A brief case study demonstrates how the proposed framework can be applied to compare documents and identify different sources of linguistic difficulty. The approach provides a replicable methodological basis for large-scale analyses of health communication materials.
Empathy as interactional accomplishment in clinical interactions with a conversational agent
Spencer Hazel | Adam Brandt | Yajie Vera He | Ernest Lim | Jared Joselowitz | Zachary Ellis
As healthcare services deploy AI to automate patient-facing communication, concerns persist about the interactional work through which empathy is made relevant. We examine empathy not as an internal state but as an interactional accomplishment, asking how patients display orientations to an LLM-powered voice assistant’s turns as (non-)empathic in real clinical telephone calls. Using Conversation Analysis (CA) to analyse post–cataract surgery follow-up calls conducted by AI-powered voice assistant Dora (Ufonia), we compare patient responses across earlier and later system versions. Earlier calls show minimal, delayed, prosodically closed responses to wellbeing enquiries, consistent with treating Dora as a transactional information-gathering device. Later calls more often feature socially rich formats, for example colloquial upgrades, gratitude tokens, occasional return enquiries, and increased turn-final rising intonation, suggesting patients hear Dora’s talk as socially implicative and thus opening space for affiliative/empathetic uptake. We discuss implications for CA-informed conversation design and for evaluating “empathy” via participant orientations in situ rather than post-hoc self-report.
Delayed Wh-Question Development in Children with Hearing Loss: Evidence for Morphosyntactic Vulnerability from Corpus-Based NLP and LLM Analyses
Tong Wu
This study provides corpus-based evidence that English-speaking children with hearing loss (CHL) show both quantitative and qualitative delays in wh-question development compared to typically developing (TD) peers. Using Natural Language Processing (NLP)/Large Language Model (LLM) based methods and two clinical subcorpora from CHILDES, we analyzed child utterances across several syntactic dimensions: frequency, lexical diversity, structural completeness, clausal embedding, wh-fronting, and utterance length. CHL produced significantly fewer wh-questions, used a narrower range of wh-types, showed lower rates of embedding, and more structural incompleteness. These differences were most evident in syntactically complex forms, such as embedded and canonical fronted wh-questions. The results support input-sensitive and usage-based accounts of syntactic development and highlight the need for enriched linguistic input in supporting CHL’s grammatical growth. Importantly, these group differences persisted when controlling for overall language development as indexed by mean length of utterance (MLU) in words, indicating that CHL’s difficulties with wh-questions are not reducible to general grammatical delay. Methodologically, the study combines dependency-parsing-based analyses with exploratory LLM evaluation to assess the feasibility and limits of automated approaches to spontaneous child language. NLP-based analyses were more stable for formally defined syntactic features, while GPT-based analysis showed mixed performance, performing better on global structural judgments than on fine-grained syntactic diagnostics.
StressRoBERTa: Cross-Condition Transfer Learning from Depression, Anxiety, and PTSD to Stress Detection
Amal Abdullah Alqahtani | Efsun Kayi | Mona T. Diab
The prevalence of chronic stress represents a major public health concern, yet automated detection of vulnerable individuals remains limited. Social media platforms like X (formerly Twitter) serve as important venues for people to share their experiences openly. This paper introduces StressRoBERTa, a cross-condition transfer learning approach for the automatic detection of self-reported chronic stress in English tweets. We investigate whether continual pretraining on clinically related conditions, such as depression, anxiety, and PTSD, which have a high comorbidity with chronic stress, improves stress detection compared to general language models. We continually pretrained RoBERTa on the Stress-SMHD corpus, a subset of Self-reported Mental Health Diagnoses focused on stress-related conditions, consisting of 108 million words from users with self-reported diagnoses of depression, anxiety, and PTSD. Then, we fine-tuned on the SMM4H 2022 Shared Task 8. StressRoBERTa achieves 82% F1, which outperforms the best shared task system (79% F1) by 3 percentage points. Our results demonstrate that focused cross-condition transfer learning from stress-related disorders provides stronger representations than general mental health training. To validate cross-condition generalization, we also fine-tuned the model on the Dreaddit dataset. Our result of 81% F1 further demonstrates the transfer from clinical mental health contexts to situational stress discussions.
DementiaBank-Emotion: A Multi-Rater Emotion Annotation Corpus for Alzheimer’s Disease Speech (Version 1.0)
Cheonkam Jeong | Jessica Liao | Audrey Lu | Yutong Song | Christopher Rashidian | Donna Krogh | Erik Krogh | Mahkameh Rasouli | Jung-Ah Lee | Nikil Dutt | Lisa M Gibbs | David Sultzer | Julie Rousseau | Jocelyn Ludlow | Margaret Galvez | Alexander Nuth | Chet Khay | Sabine Brunswicker | Adeline Nyamathi
We present DementiaBank-Emotion, the first multi-rater emotion annotation corpus for Alzheimer’s disease (AD) speech. Annotating 1,492 utterances from 108 speakers for Ekman’s six basic emotions and neutral, we find that AD patients express significantly more non-neutral emotions (16.9%) than healthy controls (5.7%; p < .001). Exploratory acoustic analysis suggests a possible dissociation: control speakers showed substantial F0 modulation for sadness (Delta = -3.45 semitones from baseline), whereas AD speakers showed minimal change (Delta = +0.11 semitones; interaction p = .023), though this finding is based on limited samples (sadness: n=5 control, n=15 AD) and requires replication. Within AD speech, loudness differentiates emotion categories, indicating partially preserved emotion-prosody mappings. We release the corpus, annotation guidelines, and calibration workshop materials to support research on emotion recognition in clinical populations.
Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM 2026)
Canyu Chen | Yuji Zhang | Zoey Sha Li | Zihan Wang | Qineng Wang | Jinyan Su | Priyanka Kargupta | Sara Vera Marjanović | Jeff Z. Pan | Mohit Bansal | Isabelle Augenstein | Jiawei Han | Heng Ji | Manling Li
Annotation Frameworks Shape Model Knowledge: Safety Alignment in Large Language Models
Wajdi Zaghouani
Large language models (LLMs) are commonly described as acquiring knowledge through large-scale pretraining on textual corpora. This view underestimates the epistemic consequences of post-training safety mechanisms. Modern LLMs undergo extensive safety alignment via curated datasets, human annotations, and reinforcement learning from human feedback (RLHF), processes that do not merely constrain outputs but actively reshape how propositional and procedural knowledge is accessed and expressed. We propose a conceptual framework in which safety alignment functions as a systematic form of knowledge editing at scale. Annotation frameworks used to construct safety datasets act as normative ontologies that partition language into categories of acceptable and unacceptable content, and alignment training propagates these distinctions into model behaviour. We introduce the Safety Knowledge Pipeline (SKP), a four-stage framework describing how pretraining knowledge is progressively filtered, reframed, and constrained through annotation and alignment mechanisms. We identify three mechanisms of knowledge modification (suppression, reframing, and substitution), each with distinct diagnostic signals, and we operationalise them in a cross-lingual evaluation protocol. Throughout, we distinguish carefully between behavioural claims that follow from prior empirical literature and representational claims that remain open hypotheses. Case studies spanning harmful instruction queries, hate speech annotation in Arabic dialects, and culturally variable discourse illustrate the framework. We further discuss how treating annotator disagreement as a training signal rather than noise can mitigate the culturally hegemonic effects of current alignment pipelines.
Can factual errors in language models be repaired by editing a single hidden activation at inference time? We compare blind edits, which are not told the correct answer, with oracle edits that receive answer-specific information. On Pythia-6.9B, with corruption replicated on Pythia-1B and GPT-2 XL, we find a strong break/fix asymmetry: single-layer perturbations easily corrupt correct factual recall, flipping 74-100% of initially correct answers, but blind repair is much harder. On EntityConfusion, twelve blind non-gradient interventions from four families fail to repair stable hallucinations in the strict single-layer setting; relaxed multi-layer or multi-head variants improve net accuracy by only +3 percentage points. Blind gradient optimization repairs more errors, but often breaks already-correct answers. In contrast, oracle edits given the correct answer repair many more hallucinations, fixing 68% at the default layer and up to 82% at a better layer. These results suggest that the main barrier is not whether factual recall can be steered, but whether a blind method can identify the right target-specific direction. TriviaQA is a boundary case: blind confidence maximization outperforms the single-token oracle, but the comparison is complicated because evaluation accepts multiple aliases.
What Does Alignment Cost? The Structural Brittleness of Chain-of-Thought Reasoning
Joanna Hao | Shanduojiao Jiang | Sai Asish Nakka
While Chain-of-Thought (CoT) prompting enables Large Language Models to explicitly justify their predictions, the extent to which these textual rationales faithfully reflect internal computation remains unclear. We investigate the circuit-level impact of alignment by performing a strict within-family comparison of the 1B-parameter Llama 3 architecture (Base vs. Instruct). Executing dynamic circuit discovery and dual-direction resample ablation on unconstrained CoT traces across synthetic mathematical primitives and a GSM8K proxy, we find that foundation models possess highly redundant, self-repairing computational networks; completely corrupting their primary reasoning circuits yields a minimal performance drop (2.92%) due to the dynamic compensation of backup heads (the Hydra Effect). In contrast, the instruction-tuned model exhibits reduced structural redundancy, suffering more than double the degradation (6.79%) under identical perturbation. We formalize our observation as an "Alignment Tax on Redundancy": optimizing for human-preference compliance repurposes dormant backup circuits, centralizing mathematical routing and rendering the aligned model’s reasoning pathways significantly more vulnerable to internal perturbation.
bLLeQA: Benchmarking LLMs for Grounded Legal Question-Answering in French and Dutch
Nikolay Banar | Ehsan Lotfi | Jens Van Nooten | Marija Kliocaite | Walter Daelemans
Retrieval-augmented generation (RAG) systems can play an important role in making law more accessible. However, large and reliable resources for training and benchmarking such systems remain scarce, especially for under-resourced languages like Dutch. To address this gap, and building on previous work (Louis et al., 2024), we introduce bLLeQA, a bilingual parallel question-answering dataset grounded in Belgian legal resources, both in French and Dutch. The dataset contains aligned questions, answers, and supporting articles in both languages, enabling evaluation of both retrieval and end-to-end RAG pipelines. Using bLLeQA, we benchmark the full RAG pipeline in a zero-shot setting, covering retrieval, citation extraction, refusal behavior, and generation quality. Our experiments show that open-weight models are competitive with proprietary models in retrieval and citation extraction, but lag behind in generation quality in the RAG pipeline. Across all models, refusal capability remains weak, meaning that models do not reliably detect when the provided supporting sources are incomplete. In addition, the end-to-end RAG setup still yields a substantial share of flawed responses, reaching 20% even in the best-case scenario.
VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models
Ravi Ranjan | Agoritsa Polyzou
Vision-language-action (VLA) models are emerging as embodied foundation models for robotic manipulation, but their deployment introduces a new unlearning challenge: removing unsafe, spurious, or privacy-sensitive behaviors without degrading perception, language grounding, and action control. In OpenVLA-style policies, behavior is produced through a fused visual encoder, a cross-modal projector, and a language backbone that predicts tokenized robot actions, so undesirable knowledge can be distributed across perception, alignment, and reasoning/action layers rather than confined to a single module. Consequently, partial unlearning applied only to the vision stack or only to the language backbone is often insufficient, while conventional unlearning baselines designed for standalone vision or language models may leave residual forgetting or incur unnecessary utility loss in embodied settings. We propose VLA-Forget, a hybrid unlearning framework that combines ratio-aware selective editing for perception and cross-modal specificity with layer-selective reasoning/action unlearning for utility-preserving forgetting. VLA-Forget jointly optimizes three objectives: targeted forgetting, perceptual preservation, and reasoning retention, through staged updates over the visual encoder, projector, and upper action-generating transformer blocks. Across forget-set behavior probes and retain-task evaluations, VLA-Forget improves forgetting efficacy by 10%, preserves perceptual specificity by 22%, retains reasoning and task success by 9%, and reduces post-quantization recovery by 55% relative to strong unlearning baselines.
Overcoming the Impedance Mismatch: A Theoretical Roadmap for Fusing Foundation Models and Knowledge Graphs
Sahil Rajesh Dhayalkar
Modern artificial intelligence remains fundamentally divided between the continuous, probabilistic spaces of Foundation Models and the discrete, deterministic structures of Knowledge Graphs. While Retrieval-Augmented Generation (RAG) attempts to connect them by serializing graph data into text, we argue this lexical bridging is merely a superficial patch. In this paper, we formalize the underlying structural and geometric friction as the Impedance Mismatch. By categorizing current neuro-symbolic integration strategies into a three-tiered hierarchy, we demonstrate that neither surface-level prompt injection nor continuous representation alignment can preserve the strict logical motifs required for reliable multi-hop reasoning. We define the specific mathematical limits, such as the Lexical Bottleneck and Topological Collapse, that show current architectures will eventually hallucinate or conflate semantic nodes. To achieve true semantic fusion, we propose a rigorous theoretical roadmap. We advocate for natively internalizing discrete symbolic structures through Structured Residual Streams, utilizing Vector Symbolic Architectures for latent sub-graph injection, and performing model updates via Orthogonal Subspace Editing. This actionable framework paves the way for models that seamlessly fuse the precision of symbolic logic with the expressivity of parametric memory.
LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering
Yuanjie Zhu | Liangwei Yang | Ke Xu | Weizhi Zhang | Zihe Song | Jindong Wang | Philip S. Yu
Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering based on their deep semantic understanding. However, their direct application is fundamentally limited by a lack of stateful memory for iterative refinement and the difficulty of managing cluster granularity. As a result, existing methods often rely on complex pipelines with external modules, sacrificing a truly end-to-end approach. We introduce LLM-MemCluster, a novel framework that reconceptualizes clustering as a fully LLM-native task. It leverages a Dynamic Memory to instill state awareness and a Dual-Prompt Strategy to enable the model to reason about and determine the number of clusters. Evaluated on several benchmark datasets, our tuning-free framework significantly and consistently outperforms strong baselines. LLM-MemCluster presents an effective, interpretable, and truly end-to-end paradigm for LLM-based text clustering.
Dense retrievers excel at first-stage candidate generation but lack effective reranking in zero-resource settings. Existing approaches face a fundamental dilemma: cross-encoders deliver strong reranking quality but require costly supervised training and incur high latency, while unsupervised BM25 reranking consistently degrades dense retrieval performance on most BEIR benchmarks. We propose DART (Dense Adaptive Reranking at Test-time), which resolves this dilemma by adapting the scoring function at inference time. For each query, the top-ranked documents serve as pseudo-positive examples and the bottom-ranked as pseudo-negative examples, providing noisy but readily available supervision to adapt a bilinear scoring matrix W via a small number of gradient updates. We further introduce a confidence-weighted margin loss and a cross-query momentum buffer that warm-starts adaptation across queries. On six BEIR benchmarks, DART achieves a mean per-dataset relative NDCG@10 gain of +2.1% over the dense retrieval baseline with under 10ms additional latency per query, demonstrating a powerful capability for zero-shot performance enhancement and cross-domain generalization.
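The core test-time adaptation idea in this abstract (a bilinear score adapted with pseudo-labels from the initial ranking) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes W starts as the identity, uses a plain hinge margin loss, and omits the confidence weighting and cross-query momentum buffer. The function names, `k`, `margin`, `lr`, `steps`, and the toy vectors are all hypothetical.

```python
def score(q, W, d):
    """Bilinear relevance score q^T W d."""
    return sum(q[i] * W[i][j] * d[j] for i in range(len(q)) for j in range(len(d)))

def mean_gap(q, W, pos, neg):
    """Average score gap between pseudo-positive and pseudo-negative documents."""
    gaps = [score(q, W, p) - score(q, W, n) for p in pos for n in neg]
    return sum(gaps) / len(gaps)

def adapt(q, docs, k=2, margin=1.0, lr=0.1, steps=5):
    """Adapt W for one query using its own top-k / bottom-k documents."""
    dim = len(q)
    # Assumption: start from the identity, so the initial score is a dot product.
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    ranked = sorted(docs, key=lambda d: score(q, W, d), reverse=True)
    pos, neg = ranked[:k], ranked[-k:]  # pseudo-positives, pseudo-negatives
    for _ in range(steps):
        grad = [[0.0] * dim for _ in range(dim)]
        for p in pos:
            for n in neg:
                # Hinge loss max(0, margin - (s_pos - s_neg)); only violated
                # pairs contribute the gradient -q (p - n)^T.
                if margin - (score(q, W, p) - score(q, W, n)) > 0:
                    for i in range(dim):
                        for j in range(dim):
                            grad[i][j] -= q[i] * (p[j] - n[j])
        scale = lr / (len(pos) * len(neg))
        for i in range(dim):
            for j in range(dim):
                W[i][j] -= scale * grad[i][j]
    return W, pos, neg

# Toy query and document embeddings (made up for illustration).
query = [1.0, 0.0]
docs = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.8]]
identity = [[1.0, 0.0], [0.0, 1.0]]

W, pos, neg = adapt(query, docs)
gap_before = mean_gap(query, identity, pos, neg)
gap_after = mean_gap(query, W, pos, neg)
print(gap_before < gap_after)
```

The adapted W would then be used to rescore the full candidate list for that query; because the pseudo-labels come from the initial ranking itself, the supervision is noisy, which is presumably why the abstract adds confidence weighting.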
Multimodal Generative Engine Optimization: Rank Manipulation for Vision–Language Model Rankers
Yixuan Du | Chenxiao Yu | Haoyan Xu | Ziyi Wang | Yue Zhao | Xiyang Hu
Vision-Language Models (VLMs) integrate visual and textual knowledge into unified representations that increasingly underpin modern retrieval and recommendation systems. However, it remains unclear how reliably these models utilize their cross-modal knowledge when ranking multimodal items, and whether their knowledge grounding can be subverted. In this paper, we expose a fundamental vulnerability in how VLMs apply multimodal knowledge for product ranking: through Multimodal Generative Engine Optimization (MGEO), we show that an adversary can manipulate a VLM’s ranking decisions by jointly crafting imperceptible image perturbations and fluent textual suffixes that exploit the model’s internal cross-modal knowledge coupling. Using an alternating optimization strategy, MGEO targets the deep interactions between visual and linguistic representations within the VLM, achieving rank manipulations that substantially exceed those of unimodal attacks and heuristic baselines powered by strong commercial models. Our findings reveal that surface-level content quality is insufficient for rank promotion; instead, direct alignment with the model’s internal knowledge utilization mechanism is required. These results raise important questions on the faithfulness and robustness of knowledge grounding in multimodal foundation models, and motivate future work on defense mechanisms for multimodal retrieval systems.
Beyond Retrieval: Bi-Temporal State Arbitration for Longitudinal Healthcare Agents
Jianing Zhao | Xiaoquan Zhi | Xinqiang Yu
Longitudinal healthcare agents require persistent state tracking under temporal uncertainty. In domains like chronic disease management, patient states (medications, symptoms, and vital signs) evolve continuously over months. Existing memory architectures for Large Language Models (LLMs) are inherently retrieval-centric: they treat memory as a static repository of past interactions, failing to resolve conflicting or superseded information when queried for the current patient state. We propose a shift to state-centric memory. Our framework introduces (1) a bi-temporal state representation that decouples event time from ingestion time and tracks temporal validity windows, (2) an incremental state arbitration mechanism using four operators (SUPPORT, REFINE, SUPERSEDE, and BRANCH-CONFLICT) to handle evolving medical facts without destructive overwriting, and (3) a confidence-thresholded evidence escalation layer for robust, efficient memory access. Evaluated on a longitudinal diabetes management suite as a representative biomedical state tracking task, our method achieves a Unique-F1 of 0.85 and Conflict-F1 of 0.98, substantially improving upon long-context LLMs (0.38 / 0.89) and standard vector memory (0.30 / 0.60), demonstrating that agentic AI in longitudinal biomedical settings requires continuous, evidence-grounded arbitration rather than simple retrieval.
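A bi-temporal record with non-destructive superseding, as described in this abstract, might look like the sketch below. It covers only a SUPERSEDE-style update; the abstract does not define the semantics of SUPPORT, REFINE, or BRANCH-CONFLICT, so they are left out. The class names, fields, and the medication example are hypothetical illustrations, not the authors' schema or data.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

@dataclass
class Fact:
    key: str                         # e.g. which aspect of patient state
    value: str
    event_time: date                 # when the fact became true
    ingested_at: date                # when the agent learned about it
    valid_to: Optional[date] = None  # end of validity window (None = open)

class StateStore:
    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def supersede(self, new: Fact) -> None:
        """Add a fact and close earlier validity windows; nothing is deleted."""
        for f in self.facts:
            if f.key != new.key:
                continue
            still_open = f.valid_to is None or f.valid_to > new.event_time
            if f.event_time <= new.event_time and still_open:
                f.valid_to = new.event_time
        # A late-arriving fact is itself bounded by the next later event.
        later = [f.event_time for f in self.facts
                 if f.key == new.key and f.event_time > new.event_time]
        if later:
            new.valid_to = min(later)
        self.facts.append(new)

    def current(self, key: str, as_of: date) -> Optional[str]:
        """Value valid at `as_of`, judged by event time, not ingestion time."""
        live = [f for f in self.facts
                if f.key == key and f.event_time <= as_of
                and (f.valid_to is None or as_of < f.valid_to)]
        return max(live, key=lambda f: f.event_time).value if live else None

# Illustrative usage with made-up values.
store = StateStore()
store.supersede(Fact("dose", "500 mg", date(2025, 1, 1), date(2025, 1, 2)))
store.supersede(Fact("dose", "1000 mg", date(2025, 3, 1), date(2025, 3, 5)))
print(store.current("dose", date(2025, 2, 1)), store.current("dose", date(2025, 4, 1)))
```

Keeping both timestamps lets a query distinguish "what was true then" from "what the agent knew then", which a store that overwrites values in place cannot answer.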
RSCE: Training-Free Residual Stream Encoding for Persistent Context Amortization
Adam Kamel | Eric Xu
A central question in the knowledge lifecycle of language models is how externally injected signals interact with parametric memory accumulated during pretraining. We address this through Residual Stream Context Encoding (RSCE), a training-free method that encodes a context document ctx into a single vector C ∈ ℝ^{d_M} via mean-pooling residual stream activations at a calibrated intermediate layer, then injects C as an additive shift at query time. This replaces O(|T(ctx)|) attention prefill with an O(1) operation and reveals a previously undescribed dual-pathway interference effect: vector injection alone suppresses parametric recall below the question-only baseline across four of five tested architectures. This finding, absent in behavioral activation steering, provides mechanistic evidence that LLMs maintain separate contextual-retrieval and parametric-recall pathways that compete when externally injected signals are semantically rich but token-precision deficient. A dual-channel design pairing C with a compact explicit fact block F resolves this tension. We evaluate five decoder-only architectures (7B–70B) on multi-document QA (LongBench, n=108) and six on cross-file code completion (RepoBench-C), comparing against LongLLMLingua and EHPC. At extreme compression (∼99% token reduction), RSCE Vec+F is competitive with EHPC on smaller architectures (LLaMA-8B F1 0.333 vs. EHPC 0.334; DeepSeek-14B both 0.214) while both substantially outperform LongLLMLingua. RSCE is the only method achieving 81% compression at 100% operational reliability on code.
Tricking Open-World Object Recognition Models: Uncertainty in Out-of-Distribution Detection
Wout Teillers | Matias Valdenegro-Toro
Object recognition models are well studied on benchmark datasets, typically focusing on performance in retrieving objects that exist in images. However, in real-life scenarios there is no prior knowledge of an object’s existence, and current research fails to assess model performance in these situations. This research aims to shed light on this problem by testing three Open-World models, YOLO-World, Grounding DINO, and GPT-4o, on the LVIS, Open Images, and JUS datasets. We design an experiment in which models are confronted with impossible prompts that instruct them to retrieve non-existent objects. This allows us to observe how the models’ uncertainty estimates behave. Overall, GPT-4o performed worst in both object recognition and uncertainty estimation, and proved to be highly overconfident. In contrast, YOLO-World and Grounding DINO are slightly underconfident but better calibrated than GPT-4o. However, all three models occasionally assign high-confidence predictions to non-existent objects, showing that their uncertainty estimation can still be improved when they are confronted with impossible prompts.
Knowledge Localization and Editability in Small Language Models: A Multi-Stage Experimental Study
Pranamya Nilesh Deshpande | Aiswarya Konavoor | Sreedath Panat
The internal mechanisms by which transformer-based language models encode and retrieve factual knowledge remain poorly understood, particularly for small language models (SLMs) operating in the 2–3 billion parameter range. This paper presents a systematic, multi-stage empirical investigation into knowledge localization, compression effects, and knowledge editability across four SLMs—Gemma-2B, Llama-3.2-3B-Instruct, Qwen-2.5-3B-Instruct, and Phi-2—with Meta-Llama-3-8B serving as a large-model baseline. Stage 1 employs causal tracing with activation patching on the CounterFact dataset (~450–500 validated facts per model) to identify the layer or layers most causally responsible for factual recall. Stage 2 compares knowledge density, layer concentration, and redundancy between the 2–3B models and the 8B baseline to quantify the structural effects of model compression on knowledge storage. Stage 3 applies the Rank-One Model Editing (ROME) algorithm at the causally identified layers to assess whether localized knowledge can be reliably overwritten. Our results demonstrate that (i) factual knowledge in SLMs concentrates in upper-to-final transformer layers, with Llama-3B exhibiting extreme concentration in layer 28; (ii) compressed models store knowledge more densely per parameter but with substantially lower redundancy (Llama-3B: 0.047 vs. Llama-8B: 0.468); and (iii) editing success correlates strongly with architectural concentration rather than model size, with Llama-3B achieving 85.7% editing success versus 33% for Gemma-2B. These findings carry direct implications for interpretability, model editing, and the design of future small language model architectures.
One Retrieval to Cover Them All: Co-occurrence-Aware Knowledge Base Reorganization for Session-Level RAG
Shivam Ratnakar | Yixuan Zhu | Cecilia Cheng | Chaya Vijayakumar
RAG systems retrieve documents optimized for answering *one query at a time*. Yet enterprise users arrive with *sessions*, that is, coherent episodes of related questions that span semantically distant parts of the knowledge base. We show that a single retrieval call over a standard knowledge base covers only 41% of a user’s session-level information need. To close this gap, we reorganize the KB offline using co-occurrence-aware clustering and expand retrieval candidates through cluster neighborhoods at query time. On WixQA (6,221 enterprise support articles), our method raises single-query session coverage to 58% (+17% absolute; 95% CI: [14.1, 20.4]), reduces retrieval calls to 70% coverage by 34%, and compresses the KB to 20% of its original size, all consistently across four embedding models and six functional domains. We argue that session-level coverage, not single-query recall, should be the primary metric for enterprise RAG evaluation.
Proceedings of the 10th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature 2026
Diego Alves | Yuri Bizzoni | Stefania Degaetano-Ortlieb | Anna Kazantseva | Janis Pagel | Stan Szpakowicz
From Corpus to Concept Scheme: Developing a SKOS Vocabulary for Armenian Epigraphic Heritage
Hamest Tamrazyan | Kamal Nour | Emanuela Boros
Armenian epigraphy, one of the world’s oldest and most diverse inscriptional traditions, remains largely absent from digital research infrastructures due to a lack of basic linguistic and conceptual resources. No machine-readable corpus, standardized terminology, or controlled vocabulary exists for describing Armenian inscription types, preventing indexing and interoperability. This paper addresses this gap by constructing the first dataset of Armenian inscription-type terminology and by developing a computational pipeline for analyzing it at scale. We digitize and preprocess a broad corpus of authoritative printed publications; curate a culturally grounded terminology list; and train transformer-based NER models to identify both attested inscription types and potential terminological variants across unseen texts. The resulting resources form the first empirical foundation for modelling Armenian epigraphic concepts needed for further developing a SKOS vocabulary aligned with, yet culturally distinct from, existing international epigraphic ontologies.
Armenian AutoEpiDoc: Automated Extraction and Encoding of Armenian Inscriptions into EpiDoc TEI/XML
Hamest Tamrazyan | Emile Cornamusaz | Emanuela Boros
Armenian epigraphy is extensively documented in printed scholarly corpora, yet lacks machine-readable editions that support interoperability or computational analysis. In this paper, we present Armenian AutoEpiDoc, a system that automatically converts expert-verified Armenian inscription records into EpiDoc-compliant TEI/XML files. Operating on curated and domain-validated data, AutoEpiDoc maps Armenian-specific metadata to EpiDoc structures through rule-based templates and schema-aware validation. The workflow significantly reduces manual encoding effort and provides a scalable path toward producing digital editions and integrating Armenian inscriptions into international epigraphic infrastructures.
Studying Expert-ese: Profiling and Classification of Domain-Specific Language Variation in Architecture with Traditional Machine Learning and LLMs
Carmen Schacht | Renate Delucchi Danhier
This study investigates how domain expertise shapes spontaneous oral language production, with a focus on architecture. Building on the ExpLay Corpus, which contains image descriptions by speakers with and without architectural training, we analyze linguistic variation by combining Profiling-UD and the DECAF framework. We extract a broad range of syntactic and morpho-syntactic features to build linguistic profiles for both groups and train classifiers to distinguish expert from non-expert productions. Two traditional machine learning models (logistic regression and SVM) are compared with a lightweight BiLSTM and two large language models (GliClass and LLaMA 2). While the expert and non-expert corpora diverge only subtly (pairwise Jensen–Shannon divergence (JSD) = 0.25), the BiLSTM using fastText embeddings achieves the highest F1-score (0.88), outperforming both traditional models and LLMs. This indicates that semantic representations are more predictive of domain variation than purely structural features and that smaller neural architectures generalize better on limited data. Overall, the findings provide empirical evidence that architectural expertise leaves measurable linguistic traces in spontaneous speech, supporting the Grammar of Space hypothesis.
We introduce CroCoSyn, a controlled, cross-lingual and cross-model corpus of 25,920 LLM-generated film synopses in English and French. Each synopsis is generated under systematically varied conditions, including model type, temperature, genre, protagonist gender, and narrative constraints, and enriched with structured metadata capturing characters and their relationships. Comparing Mistral and Llama across different model temperature degrees, CroCoSyn enables fine-grained analysis of narrative content, style, and character representation across models and languages. The corpus supports research on gender and cultural biases and story generation evaluation, and provides a foundation for comparative studies between LLM-generated and human-written narratives.
Identity Without Action: Rethinking Collective Action Models in Disinformation Research
Lorella Viola
Despite the rapid growth of disinformation research, the fundamental reasons behind user engagement with such content remain poorly understood. Recently, several scholars have suggested that researchers should study engagement with disinformation as a form of collective action (CA). Drawing on Social Identity Theory (SIT) and the Social Identity Model of Collective Action (SIMCA), this study empirically tests this assumption across two distinct linguistic communities, English and Spanish. Specifically, it investigates whether mobilizing CA language functions as a uniform predictor of engagement, or whether engagement is primarily driven by community-specific identity dynamics. The experiment analysed a bilingual corpus of 4,035 X (formerly Twitter) posts associated with conspiracy theory and disinformation-related hashtags (e.g., #Agenda2030, #TheGreatReset). Using a mixed-methods approach combining BERTopic for narrative discovery, non-parametric statistical testing, and a Random Forest Regressor, we disentangled the effects of language presence from community behaviour. The results reveal that the Spanish community exhibits higher baseline engagement than the English community, indicating that engagement is primarily driven by macro-level community norms (i.e., identity) rather than micro-level linguistic triggers. We argue that rather than treating mobilizing language as a uniform predictor of engagement, future applications of SIMCA in disinformation research should account for these identity-based baseline differences.
Weakly Supervised Named Entity Recognition for Historical Texts
Marco Sorbi | Laurent Moccozet | Stephane Marchand-Maillet
Named Entity Recognition has emerged as a critical task in natural language processing, particularly for extracting meaningful information from unstructured text. Although traditional approaches rely heavily on large annotated datasets, recent advances have explored weak supervision techniques to address the limitations of resource-intensive annotation processes. Historical texts provide unique challenges to this task because of their linguistic peculiarities, and several approaches exist to address texts of this domain in a supervised way, but they involve lengthy manual annotations of the documents of interest by domain experts. To address this issue, this paper explores how recent weakly supervised NER techniques can be adapted to historical texts, analyzing their suitability for this domain. The experiments show that domain-specific architectures can be effectively trained on low-resource corpora with weak supervision over a small set of entity labels. Using only 10% of the annotations, the performance of these architectures remains above 80% of the supervised quality in terms of F1-Score.
Invisible Speakers? Gender Disparity in German AI Discourse and Its Reflection in Language Models
Milena Belosevic
This paper investigates how language models (LMs) reproduce the existing gender disparity found in German media discourse about artificial intelligence (AI). Building on a human-annotated corpus of quotations from German media discourse on AI, we first quantify the frequency with which male and female speakers are directly cited across domains and speaker roles. We then train LLäMmlein (Pfister et al., 2025), a state-of-the-art German-only language model, along with GBERT and a logistic regression model, to classify each quotation as originating from a male or female speaker, using only the quoted text as input and without any gender cues. By comparing model predictions with corpus-based gold labels, we find that male voices dominate both the corpus and the model predictions. Balancing the data mitigates but does not fully eliminate this disparity, indicating that the strong male-default tendency of transformer models cannot be explained by corpus skew alone but also reflects their priors from pretraining. The study contributes to the interpretability of language models’ output for DH-related tasks, adaptation of NLP tools to domain-specific humanities corpora, and knowledge modelling in the humanities.
GlobLingDiv: A global dataset linking linguistic diversity and digital support to reveal landscapes with under-resourced languages for NLP
Katharina Zeh | Hannes Essfors | Juliane Benson | Lale Tüver | Andreas Baumann | Hannes A. Fellner
Linguistic diversity is increasingly under pressure globally and is becoming ever more relevant in digital contexts, where many languages remain structurally under-resourced, limiting access to language technologies and inhibiting equitable NLP development. To support linguistic diversity, publicly available data are needed that capture both the number of languages spoken and the distribution of speakers across them. We introduce GlobLingDiv, a database that uses country-level speaker distributions to derive language richness and entropy-based diversity measures, alongside a population-weighted digital language support measure. Applying these metrics globally, we examine the association between linguistic diversity and digital support conditions. The results reveal a substantial imbalance: highly diverse linguistic landscapes show comparatively low digital support, underscoring the need for more inclusive NLP environments.
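The two diversity measures named in the abstract above can be illustrated with a minimal sketch. It assumes that "entropy-based" means Shannon entropy over speaker shares, which the abstract does not specify; the function names and speaker counts are hypothetical and are not taken from GlobLingDiv.

```python
import math

def language_richness(speakers):
    """Count the languages that have at least one speaker."""
    return sum(1 for n in speakers.values() if n > 0)

def shannon_entropy(speakers):
    """Shannon entropy (in nats) of the speaker-share distribution.

    Assumption: the dataset's entropy-based measure may be defined differently.
    """
    total = sum(speakers.values())
    if total <= 0:
        return 0.0
    shares = [n / total for n in speakers.values() if n > 0]
    return -sum(p * math.log(p) for p in shares)

# Hypothetical country-level speaker counts (illustrative only).
even_country = {"lang1": 500, "lang2": 500}
skewed_country = {"lang1": 990, "lang2": 10}

for name, counts in [("even", even_country), ("skewed", skewed_country)]:
    print(name, language_richness(counts), round(shannon_entropy(counts), 3))
```

Both hypothetical countries have the same richness, but the evenly split one has higher entropy, which is why a richness count alone does not capture how speakers are distributed.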
LLMs Got Rhyme? Hybrid Phonological Filtering for Greek Poetry Rhyme Detection and Generation
Stergios Chatzikyriakidis | Anastasia Natsina
Large Language Models (LLMs), although capable on many NLP tasks, struggle with phonologically grounded phenomena like rhyme detection and generation. This is even more evident in lower-resource languages such as Modern Greek. In this paper, we present a hybrid neural-symbolic system that combines LLMs with deterministic phonological algorithms to achieve accurate rhyme identification and generation. We implement a comprehensive taxonomy of Greek rhyme types and employ an agentic generation pipeline with phonological verification. We use multiple prompting strategies (zero-shot, few-shot, Chain-of-Thought, and RAG-augmented) across several LLMs, including Claude 3.7 and 4.5, GPT-4o, Gemini 2.0, and open-weight models like Llama 3.1 8B and 70B and Mistral Large. Results reveal a significant reasoning gap: while native-like models (Claude 3.7) perform intuitively (40% accuracy in identification), reasoning-heavy models (Claude 4.5) achieve state-of-the-art performance (54%) only when prompted with Chain-of-Thought. Most critically, pure LLM generation fails significantly (under 4% valid poems), while our hybrid verification loop restores performance to 73.1%. Along with the system presented, we further release a corpus of 40,000+ rhymes, derived from the Anemoskala and Interwar Poetry corpora, to support future research.
Style as Signature: Profile-Based Authorship Verification of Mihai Eminescu’s Journalistic Corpus
Ioana-Roxana Boriceanu | Liviu Dinu
Authorship verification aims to assess whether a questioned text is stylistically compatible with an author’s known writings, a task that is particularly challenging in historical corpora with partial ground truth. We address this problem in the context of Mihai Eminescu’s journalistic corpus, a historically grounded collection comprising published articles, manuscripts, and texts of uncertain authorship. Using a profile-based framework with character n-grams and function words, we examine how stylistic compatibility behaves across different profile construction settings and temporal splits. The results show that character trigram profiles consistently accept verified texts while producing a small and stable set of rejections among disputed items, whereas function word profiles show near complete acceptance across the corpus. A qualitative analysis shows that rejected texts exhibit meaningful differences in discourse structure and communicative purpose. These findings illustrate how authorship verification can support literary scholarship through stable signals for close reading.
Measuring Social Integration Through Participation: Categorizing Organizations and Leisure Activities in the Displaced Karelians Interview Archive using LLMs
Joonatan Laato | Veera Schroderus | Jenna Kanerva | Jenni Kauppi | Virpi Lummaa | Filip Ginter
We study how to better use digitized historical archives to answer sociological and historical questions that require more context than raw text mentions provide. Using Finnish World War II Karelian evacuee family interviews, we build on prior extraction of 350K mentions of leisure activities and organizational memberships (71K unique names) that are too diverse and unstructured to analyze directly. We introduce a categorization framework capturing key dimensions of participation: type of activity/organization, typical sociality, regularity, and the level of physical demand. After creating a gold-standard annotated set, we evaluate whether large language models can apply the schema at scale and find that an open-weight LLM, combined with simple multi-run voting, closely matches expert judgments. We then label all 350K entities to produce a structured resource for downstream analyses of social integration and related outcomes.
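The "simple multi-run voting" mentioned above can be sketched as a plain majority vote over repeated LLM labelling runs. This is an assumption about the procedure, not the authors' implementation; the function name, the tie handling, and the category labels are hypothetical.

```python
from collections import Counter

def vote(labels):
    """Return (winning_label, agreement) for one entity.

    `labels` holds one predicted category per run. The winning label is None
    when the top two categories tie (assumed policy: leave ties for review).
    """
    if not labels:
        raise ValueError("need at least one run")
    ranked = Counter(labels).most_common()
    top_label, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    return (None if tied else top_label, top_count / len(labels))

# Hypothetical outputs from three runs over the same mention.
print(vote(["sports club", "sports club", "choir"]))
print(vote(["sports club", "choir"]))
```

The agreement share gives a cheap signal for which entities may deserve manual inspection.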
Catalogues as Data: Interpretable NLP Pipelines for Ottoman-Turkish Bibliographies
Mark Hill | Ayse Bulus | Paul Spence
Bibliographies are both humanities infrastructure and historic record. Analysing them computationally, however, requires implementing complex digitisation and standardisation decisions. This paper takes Seyfettin Özege’s Eski Harflerle Basılmış Türkçe Eserler Kataloğu as an example: a scanned set of volumes marked by complex page layouts, degraded typography, irregular entry structures, and historically contingent inconsistencies. We present a pipeline that constructs a structured, machine-readable, and analysable dataset from its 27,000 entries using computer vision, OCR, large and visual language models, sequence-based validation, and custom review tools. The pipeline captures 97.8% of records, and the remaining cases can be addressed by targeted review. This demonstrates that combining LLMs with interpretable, review-centric pipelines offers an appropriate approach for historically complex bibliographic sources.
Large language models (LLMs) are post-trained on human feedback collected from annotator communities, yet the linguistic influence of these annotator communities on language models remains poorly understood. We investigated the stylistic transfer from Nigerian annotators to the LLaMA family of models through a natural experiment with LLaMA 2 and LLaMA 3.1, as their release dates are separated by the shutdown of a major data annotation service provider in Nigeria. We generated corpora from both model families and measured linguistic style by computing the difference-in-difference of the Jensen-Shannon distance on the bigram distribution between model outputs and corpora of Nigerian English and US English. We found that, although both pre-trained model variants exhibit similar proximity to both English variants, the LLaMA 2 post-trained model moved toward Nigerian English, while the LLaMA 3.1 post-trained model moved away from Nigerian English. Qualitatively, we found that post-trained LLaMA 2 models used significantly fewer contractions, in line with Nigerian English speakers opting to use a formal register due to its role as an index of knowledgeability. Our findings suggest that annotator communities can imprint linguistic style on large language models, with potential implications such as a disproportionately higher false positive rate in AI plagiarism detection for users who share a linguistic style with annotator communities.
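The measurement described above (a difference-in-differences of Jensen-Shannon distances over bigram distributions) can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace tokenization, base-2 logarithms, and one plausible difference-in-differences contrast. The abstract does not give the exact formulation, and all function names are hypothetical.

```python
import math
from collections import Counter

def bigram_dist(tokens):
    """Relative frequencies of adjacent token pairs."""
    counts = Counter(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the base-2 JS divergence, in [0, 1]."""
    keys = set(p) | set(q)
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # KL(a || m); m[k] > 0 wherever a[k] > 0, so the ratio is defined.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return math.sqrt(max(0.5 * kl(p) + 0.5 * kl(q), 0.0))

def diff_in_diff(pre, post, ref_a, ref_b):
    """Change in distance to ref_a minus change in distance to ref_b.

    Assumed contrast: a negative value means the post-trained output moved
    toward ref_a relative to ref_b.
    """
    return (js_distance(post, ref_a) - js_distance(pre, ref_a)) - (
        js_distance(post, ref_b) - js_distance(pre, ref_b)
    )

# Toy token sequences standing in for model outputs and reference corpora.
pre = bigram_dist("a b c".split())
post = bigram_dist("x y z".split())
print(js_distance(pre, pre), js_distance(pre, post))
print(diff_in_diff(pre, post, ref_a=post, ref_b=pre))
```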
Modeling Changing Scientific Concepts with Complex Networks: A Case Study on the Chemical Revolution
Sofia Aguilar Valdez | Stefania Degaetano-Ortlieb
While context embeddings produced by LLMs can be used to estimate conceptual change, these representations are often neither interpretable nor time-aware. Moreover, bias augmentation in historical data poses a non-trivial risk to researchers in the Digital Humanities. Hence, to model reliable concept trajectories in evolving scholarship, we develop a framework that represents prototypical concepts through topic-based complex networks. Using the Royal Society Corpus, we analyze two competing theories from the Chemical Revolution (phlogiston vs. oxygen) as a case study and show that onomasiological change is linked to higher entropy and topological density, indicating increased diversity of ideas and connectivity effort.
Speaking on Their Behalf: Detecting Indirect Speech in Historical Danish and Norwegian Texts
Ali Al-Laith | Alexander Conroy | Kirstine Degn | Jens Bjerring-Hansen | Daniel Hershcovich
Indirect speech is a fundamental yet understudied form of reported speech that plays a crucial role in literary texts and communication. While direct speech detection has received significant attention in computational linguistics, the automatic identification of indirect speech remains a challenge due to its nuanced linguistic structure and contextual dependencies. This paper focuses on the detection of indirect speech in late 19th-century Scandinavian literature, where its presence has been linked to shifting aesthetic ideals. We present an annotated dataset of 150 segments, each randomly selected from 150 different novels, designed to capture indirect speech in Danish and Norwegian literature. We evaluate four pre-trained language models for classifying indirect speech, with results showing that a Danish Foundation Model (DFM Large), trained on extensive Danish data, has the highest performance. Finally, we conduct a classifier-assisted quantitative corpus analysis and find that the prevalence of indirect speech exhibits fluctuations over time.
Harder than Finding the Lost Sheep? Towards Automatically Suggesting Deliberate Metaphor Annotations in German Sermons
Ronja Laarmann-Quante | Stefanie Dipper
Automatic metaphor detection so far has largely focused on English data annotated for all kinds of metaphors including ubiquitous conventionalized ones. In this paper, we focus on deliberate metaphors in German sermons, i.e., metaphors that are used with a specific communicative goal. This task is harder because there is less training data available, and deliberate metaphors are very rare. Our goal is to support human annotators with automatically generated suggestions, so we strive above all for high recall. Using multilingual transfer learning based on various metaphor datasets and different transformer models, the highest recall we achieve is .70 (precision .10). Our results suggest that larger context windows beyond the sentence level are not helpful and that adding in-domain data even when annotated with different guidelines and in a different language is beneficial.
Semantic Factor Analysis: Validating Personality Structure Recovery from empirically-mediated Word Embeddings
Oliver Müller
The present study introduces Semantic Factor Analysis (SFA), a novel computational approach recovering Big Five personality trait structures from pre-trained adjective word embeddings weighted by empirical participant data. Using Word2Vec embeddings trained on the Google-News-300 corpus, semantic relationships of IPIP-50 Big Five inventory adjectives (Goldberg, 1992) were extracted and factor structures computed through weighted vector averaging and K-means clustering. To validate the methodology, SFA was compared against a baseline using unweighted Word2Vec embeddings. In a controlled experiment with n=55 participants completing standard IPIP-50 assessments, HSP-R scale (Pluess et al., 2024) and multimedia impact surveys, empirically-weighted SFA successfully recovered all five personality dimensions with 62.5% average factor purity, substantially outperforming the unweighted baseline (52.0%, 10% relative improvement), while traditional Confirmatory Factor Analysis showed factor collapse and poor model fit. The approach was validated through Latent Class Analysis deriving empirically-based classification thresholds for Big Five dimensions and supporting a trichotomous Environmental Sensitivity model (Lionetti et al., 2018). Results demonstrate that integrating semantic representations with empirical data improves Big Five structure recovery beyond pure semantic similarity alone, particularly for small sample studies where traditional methods such as CFA will fail due to limited empirical data points.
While machine translation systems have been applied to many tasks with remarkable success, machine poetry translation has remained a challenge. This study investigates the capabilities of generative Large Language Models (LLMs) in the translation of poetry (taking Shakespeare’s 154 sonnets as an example) from English to German. For this purpose, I define metrics that assess the reproduction of the rhyme scheme and the metre of the original in a quantitative way. The results indicate that LLMs still lag behind professional human translators (especially with regard to the reproduction of the rhyme scheme), but that their performance is significantly influenced by the chosen prompt strategy. In particular, iteratively refining the result emerges as a successful strategy in terms of the reproduction of the form, but this comes at the expense of other aspects such as grammaticality and the reproduction of the meaning.
WikiLingDiv: a dataset for quantifying digital linguistic diversity using Wikipedia page views
Hannes Essfors | Andreas Baumann
With the conflation of digital and non-digital spaces, and NLP technologies being integrated into an increasing number of aspects of daily life, linguistic diversity cannot be fully understood without considering how language is used online. While existing models of linguistic diversity typically have relied on speaker numbers or language production, the dimension of diversity in language consumption remains comparatively understudied. To facilitate such research, we introduce WikiLingDiv, an openly accessible dataset for quantifying linguistic diversity in online knowledge retrieval using Wikipedia page views. Our dataset is based on yearly page views of 340 language editions of Wikipedia, aggregated across 239 countries and territories over 10 years (2015-2024). Using the dataset, we illustrate spatial and temporal patterns of digital linguistic diversity, suggesting that diversity has both increased and decreased across countries and regions, while highlighting country-specific dynamics in language usage. We release the dataset as an openly available and easily integrable data resource for researchers in computational linguistics, digital humanities, and the broader social sciences, enabling further work on linguistic variation, digital inequality, and the interaction between language use and digital technology.
Modeling Linguistic Imprints of War Propaganda in a Russian Wikipedia Fork: A Comparative Analysis with the Original Wikipedia
Anastasiia Vestel | Stefania Degaetano-Ortlieb
Although Wikipedia aspires to provide neutral information, alternative versions can be used for political manipulation. This paper analyzes how narratives about the Russo-Ukrainian War are linguistically reframed in a Russian Wikipedia Fork compared to the original Russian Wikipedia. Using Kullback-Leibler Divergence on a corpus of war-related edits in more than 13,000 articles, we identify key differences between the two versions. While the original Wikipedia features Ukrainian references and administrative details, direct war terminology, and Ukraine’s territorial designation, governance, and statehood, RWFork replaces or removes these elements, emphasizing reassignment of Ukrainian territories to Russia, favoring euphemistic war language, renaming locations, and recognizing Russia-backed DPR and LPR. These patterns closely align RWFork with demobilizational strategies observed in pro-Kremlin media.
Stylometric Approach to AI-generated Texts. An Analysis of Contemporary French-Language Literature
Adam Pawłowski | Tomasz Walkowiak
The article focuses on a stylometric analysis of authentic literary texts and thematically related texts generated by large language models. The texts under study represent a fairly broad cross-section of twentieth-century French literature. Five models were used to generate the texts (ChatGPT 4-o, GPT 4-o mini, DeepSeek v.3, c4ai-command-r-plus, and c4ai-command-a). The original human-written stories of approximately 20,000 characters were summarized, and new narratives were then generated on the basis of these abstracts. In terms of plot and style, they were intended to resemble the originals. The research carried out with TF-IDF of the most frequent words showed that texts generated by specific LLMs and written by humans cluster relatively well as distinct groups. The experiments also showed that the "authorial" specificity of machine-generated texts partly matches the original clustering of human-written source texts.
Degree Zero of Translation: Using Interlinear Baselines to Quantify Translator Intervention
Maciej Rapacz | Aleksander Smywiński-Pohl
Literary translation is rarely a neutral act of linguistic transfer, but rather a continuous series of conscious interventions - restructuring, semantic shifts, and stylistic adaptations. While Translation Studies analyzes these shifts qualitatively, current computational methods focus primarily on quality evaluation (e.g., BLEU, COMET) or authorship attribution (e.g., stylometry), lacking a scalable metric to quantify the extent and character of the translator’s intervention. We propose a novel method to measure the translator’s signal by using Interlinear Translation - a strict word-for-word gloss - as a computational baseline representing translational "Degree Zero," i.e., a neutral form of source text devoid of any stylistic adaptation. We define the Intervention Vector as the semantic difference between a literary translation and its interlinear counterpart in a high-dimensional vector space. We validate this approach on a multilingual corpus of the Greek New Testament translations comprising 5 interlinear baselines and 74 literary translations across 5 languages: English (16), French (14), Italian (12), Polish (16), and Spanish (16). Our results demonstrate that the magnitude of the Intervention Vector effectively ranks texts along a spectrum from literal to paraphrase, aligning with established theoretical categories. We find that this magnitude consistently distinguishes between translation strategies, yielding significantly longer vectors for dynamic and paraphrase strategies compared to literal and formal ones. This framework provides a quantitative method for analyzing translator agency without the need for a comprehensive corpus of reference translations.
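The Intervention Vector idea can be sketched as a vector difference followed by a norm. In the minimal example below, the three-dimensional vectors are hypothetical stand-ins for sentence embeddings, and the use of the Euclidean norm as the magnitude is an assumption; the abstract names neither the embedding model nor the norm.

```python
import math


def intervention_vector(literary, interlinear):
    """Element-wise difference between a translation and its interlinear baseline."""
    return [a - b for a, b in zip(literary, interlinear)]


def magnitude(vec):
    """Euclidean length of a vector (assumed measure of intervention size)."""
    return math.sqrt(sum(x * x for x in vec))


# Hypothetical embeddings; real ones would come from an embedding model.
interlinear = [0.2, 0.1, 0.4]
translations = {
    "literal": [0.25, 0.1, 0.4],
    "dynamic": [0.5, 0.3, 0.1],
    "paraphrase": [0.9, -0.2, 0.0],
}

# Rank translations from least to most intervention.
ranked = sorted(
    translations,
    key=lambda name: magnitude(intervention_vector(translations[name], interlinear)),
)
```

With these toy values the ranking runs from the literal rendering to the paraphrase, mirroring the literal-to-paraphrase spectrum the abstract reports.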
How to Efficiently Explore Noisy Historical Data? Leveraging Corpus Pre-Targeting to Enhance Graph-based RAG
Donghan Bian | Marie Puren | Florian Cafiero
Graph-based Retrieval-Augmented Generation (RAG) is increasingly used to explore long, heterogeneous, and weakly structured corpora, including historical archives. However, in such settings, naive full-corpus indexing is often computationally costly and sensitive to OCR noise, document redundancy, and topical dispersion. In this paper, we investigate corpus pre-targeting strategies as an intermediate layer to improve the efficiency and effectiveness of graph-based RAG for historical research. We evaluate a set of pre-targeting heuristics tailored to single-hop and multi-hop historical questions on HistoriQA-ThirdRepublic, a French question-answering dataset derived from parliamentary debates and contemporary newspapers. Our results show that appropriate pre-targeting strategies can improve retrieval recall by 3–5% while reducing token consumption by 32–37% compared to full-corpus indexing, without degrading coverage of relevant documents. Beyond performance gains, this work highlights the importance of corpus-level optimization for applying RAG to large-scale historical collections, and provides practical insights for adapting graph-based RAG pipelines to the specific constraints of digitized archives.
Detecting reported speech as a token classification task: an application to Classical Latin?
Agustin Dei
This paper presents the first application of an automatic token-classification approach for detecting reported speech spans in Classical Latin using transformer-based neural architectures. Focusing on Seneca the Elder’s Declamatory Anthology, the study addresses the text’s highly polyphonic nature, resulting from the use of reported speech. Instead of relying exclusively on sentence-level syntactic information, the proposed approach treats reported speech detection as a token-level sequence labeling problem. This enables the identification of reported speech spans extending across multiple sentences. We fine-tune three Latin neural language models (LatinBERT, LaBERTa, and PhilBERTa) for binary token-level classification and conduct experiments both with and without punctuation. The results show that RoBERTa-based models effectively identify reported speech, with LaBERTa achieving the best performance (F1 scores above 0.90).
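To make the token-level framing concrete, the sketch below turns binary per-token labels into contiguous spans, which is how a span can cross sentence boundaries. The function name, the English toy tokens, and the labels are hypothetical; a real system would obtain the labels from a fine-tuned classifier.

```python
def labels_to_spans(labels):
    """Group consecutive 1-labels into (start, end) token spans, end exclusive."""
    spans, start = [], None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i                      # a new span opens here
        elif label != 1 and start is not None:
            spans.append((start, i))       # the open span closes before token i
            start = None
    if start is not None:                  # span runs to the end of the text
        spans.append((start, len(labels)))
    return spans


# Hypothetical example: the reported speech covers two sentences.
tokens = ["He", "said", ":", "I", "came", ".", "I", "saw", ".", "Then", "silence"]
labels = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0]

spans = labels_to_spans(labels)
quoted = [" ".join(tokens[s:e]) for s, e in spans]
```

Because the labels attach to tokens rather than sentences, the single recovered span includes the sentence-final period in the middle of the quotation.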
Narrative in Short German Prose: A Multi-Phenomenon Dataset for Computational Literary Analysis
Hans Ole Hatzel | Haimo Stiemer | Evelyn Gius | Chris Biemann
We present the novel dataset GermAnProse, an annotated corpus consisting of four German short prose texts accompanied by an extensive set of narrative-focused annotations. As part of this dataset, we contribute an annotation scheme for mentions, speech, and character agency: Characters in Action (ChiA). GermAnProse also contains information on narrative phenomena: narrativity, semantic verb classes, and plot keyness. Moreover, we include reader reception data in the form of timing information for audiobook performances, indicating pauses between sentences and the time taken to read a specific sentence in a performance. We release the dataset, which contains more than 18,000 manually created standoff annotations in JSON format, enabling researchers to utilize this resource for further exploratory applications.
Sense-Based Annotation of Geographical Nouns in Ancient Greek and Latin: A Diachronic Study with LLMs
Andrea Farina | Michele Ciletti | Barbara Mcgillivray | Andrea Ballatore
This paper investigates the lexicalisation of geographical nouns in Latin and Ancient Greek using a diachronic, multi-genre corpus (8th cent. BCE – 2nd cent. CE) and Large Language Models for Word Sense Disambiguation. We focus on two main aspects: the onomasiological question of which words encode core geographical concepts, and the semasiological distribution of senses across lemmas. Across both languages, city-related concepts are the most frequently expressed, but Greek shows a stronger focus on maritime terms, whereas Latin favours concepts related to land. Semasiologically, Latin shows clearer evidence of semantic change over time (e.g., ’citizenship’ - ’city’, aequor ’flat surface’ - ’sea’), while Greek displays more gradual or distributed shifts. These results show that computational annotation enables cross-linguistic and diachronic analysis of spatial semantics, allowing us to compare the frequency of concepts across languages, genres, and periods, and to track when semantic change occurs and how core concepts evolve over time.
Evaluating Humanities Theory Alignment in Large Language Models: Incremental Prompting and Statistical Assessment
Axel Pichler | Janis Pagel
We propose a method to evaluate the extent to which an LLM’s observable input–output behavior aligns with established theories in the humanities and cultural studies. We instantiate the framework on three humanities theories—Davidson’s truth-conditional semantics, Lewis’s truth in fiction, and Iser’s concept of textual gaps—using a top-down, theory-driven black-box framework. Core assumptions of these theories are reconstructed into testable behavioral rules and assessed via controlled classification tasks with systematic prompt comparisons and significance testing. Our experiments show that theory-uninformed classification prompts generally outperform theory-enriched prompts in Lewis and Iser settings, while theory-informed prompts help in the Davidson task. Gemini Flash consistently achieves the highest scores across tasks and corpora, while the Iser gap detection task remains substantially harder than binary truth-conditional judgments. Statistical tests confirm robust prompt effects and the failure of basic prompts. However, model behavior under incremental theory exposure is unstable and architecture-dependent.
Too Long, Didn’t Model: Decomposing LLM Long Context Understanding With Novels
Sil Hamilton | Rebecca Hicke | Mia Ferrante | Matthew Wilkens | David Mimno
Although the context length of large language models (LLMs) has increased to millions of tokens, evaluating their effectiveness beyond needle-in-a-haystack approaches has proven difficult. We argue that novels provide a case study of subtle, complicated structure and long-range semantic dependencies often over 128k tokens in length. Existing novel-based long-context benchmarks are limited in scale due to the cost of manually annotating long texts. Inspired by work on computational novel analysis, we release the Too Long, Didn’t Model (TLDM) benchmark, which tests a model’s ability to reliably report plot summary, storyworld configuration, and elapsed narrative time. We find that none of seven tested frontier LLMs retain stable understanding beyond 64k tokens. Our results suggest language model developers must look beyond "lost in the middle" benchmarks when evaluating model performance in complex long context scenarios. To aid in further development, we release the TLDM benchmark together with reference code and data.
We present an AI assistant designed to help researchers interact with language corpora using natural language instead of formal query languages. Built as a custom GPT with access to multilingual corpora via the Czech National Corpus platform API, the system translates research questions into CQL queries, retrieves corpus data, and guides users through linguistic analysis. After more than a year of deployment, the system has processed over 1,000 interactions with human users. We discuss the hybrid approach combining rule-based translation with LLM intelligence, the challenges of building on a constantly evolving platform, and lessons learned from production usage. Notably, this system represents the first voice-enabled corpus interface in history, significantly lowering barriers to corpus-based research for non-technical users and users outside linguistic fields.
Generative Information Extraction from Biographical Sources
Robin Winkle | Manfred Stede | Jörn Kreutel
Biographical sources, such as literature encyclopedias, encode knowledge about historical figures in textual form. In this paper, we address the task of consolidating structured biographical information about authors from the former German Democratic Republic into a unified database. To this end, we present a generalizable Information Extraction (IE) system based on LLM prompting. Specifically, we compare two midsized open-source models, Qwen-2.5-32B and Llama-3-70B-Instruct, investigate a range of Prompt Engineering (PE) strategies, and propose a semantic similarity-based evaluation metric for open-ended IE. Our experiments on an unpublished annotated subset of biographical texts deliver moderate precision and variable recall, highlighting both the potential and current limitations of generative IE in the Digital Humanities.
WikiFirst: A Genre-Fixed, Content-controlled Corpus for Evaluating Content Effects in Authorship Analysis
Dung Nguyen | G. Çağatay Sat | Evgeny Pyshkin | John Blake
This paper presents the design and construction of WikiFirst, a corpus for investigating the impact of content variation on authorship similarity under a fixed genre. Prior work has investigated individual authorial style and the impact of genre. However, the role of content has remained underexplored due to the lack of suitable data. We address this gap by constructing a Wikipedia-based corpus consisting exclusively of first revisions authored by non-anonymous editors, thereby ensuring high authorship certainty while maintaining a stable encyclopaedic genre.
Measuring the Symbolic Power of Languages with LLM-based Multilingual Persuasion Simulation
Yin Jou Huang | Fei Cheng
Prior studies on the symbolic power of languages have largely relied on surveys or localized experiments, limiting systematic comparison across cultures and domains. In this work, we propose an LLM-based multilingual persuasion simulation framework to quantify the symbolic power of languages through persuasion outcomes. We also introduce a Symbolic Power Index (SPI) that measures how language choice affects persuasion success and efficiency across domains. Experiments show that the LLM-based simulations largely reproduce established sociolinguistic prestige hierarchies tied to institutional authority and global power, especially in domains such as business, finance, education, and technology. These results suggest that LLM-based persuasion simulations offer a scalable, decision-making-driven approach to studying symbolic power in language.
Proceedings of the 20th Linguistic Annotation Workshop (LAW XX)
Annotating Clinical Risk and Variation in Haitian Creole Medical Translation
Ludovic Mompelat | David Tézil | Rose Flaure Accilien
We present an annotation schema for Haitian Creole medical translation that makes clinical risk and sociolinguistic variation explicit while remaining lightweight enough for small expert teams. The schema includes binary fields for overall acceptability, severity of potential misunderstanding, and foreign-influence cues, along with conditional error tags aligned with Multidimensional Quality Metrics (MQM), commonly used in the medical domain, for interoperability. Through three rounds of annotation and adjudication we achieve stable inter-annotator agreement and release a gold dataset of 152 EN→HC medical sentence pairs. A simple classifier–labeller baseline demonstrates that acceptability and severity are reliably learnable under data scarcity, while foreign-influence judgments remain limited by prevalence. These results show that clinically oriented, variety-sensitive annotation can both support immediate screening of patient-facing translations and provide reward-ready signals for future preference-based MT and LLM fine-tuning.
Parser agreement and disagreement in L2 Korean UD: Implications for human-in-the-loop annotation
Hakyung Sung | Gyu-Ho Shin
We propose a simplified human-in-the-loop workflow for second language (L2) Korean morphosyntactic annotation by leveraging agreement between two domain-adapted parsers. We first evaluate whether parser agreement can serve as a proxy for annotation correctness by comparing it with independent human judgments. The results show strong correspondence between parser and human judgments, supporting the feasibility of semi-automatic L2-Korean UD annotation. Further analysis demonstrates that parser disagreements cluster in linguistically predictable domains such as grammatical-relation distinctions and clause-boundary ambiguity. While many disagreement cases are tractable for iterative model refinement, others reflect deeper representational challenges inherent in parsing and tagging L2-Korean corpora.
Rules-based system for Czech legal text readability
Kateřina Motalík Hodková | Ivan Kraus | Barbora Hladká
In this paper, we present a set of linguistic rules designed to enhance the readability of legal texts. The rules were compiled and implemented as a rule-based module of PONK, an advisory tool that contributes to the simplification and greater clarity of Czech legal texts, especially those intended for a non-expert audience. Based on recurring phenomena in authentic texts and relevant scientific sources, the rules mainly cover the domains of syntax and lexicon. In addition, we present the results of applying the rules to a corpus of authentic legal texts, evaluated by a human annotator, and examine their impact.
Human-AI Annotation Error Auditing for Hebrew Diacritization with Frontier LLMs
Hillel Gershuni | Avi Shmidman
Large annotated datasets inevitably contain errors that are costly to identify via manual review. We study a human-AI annotation error auditing workflow using frontier Large Language Models (LLMs), focusing on Hebrew nikud (diacritization). We take the EACL 2023 Hebrew Homograph Challenge Set as our test case. In a focused evaluation on 12 of the homograph sets with 271 confirmed errors (verified through exhaustive manual review of all 7,241 sentences), Gemini 3 Pro achieves 83.6% recall (95% confidence interval: [79.3%, 88.2%]) and 99.1% precision - substantially higher than other frontier LLMs. Two independent human experts achieved 62.4% and 42.8% recall respectively, a 20-percentage-point spread that reflects the difficulty of sparse-target error search. Even the union of both experts’ findings (73.4% recall) falls short of a single LLM run (83.6%), while LLM-aided auditing reduces review effort by over 95%. We analyze the trade-offs between batch size and recall, and release both a human-verified Gold Standard with per-error difficulty annotations and a globally corrected version of the Challenge Set.
Beyond Annotator Disagreement: Guideline-Induced Errors in Arabic Hate Speech Annotation
Wajdi Zaghouani
Annotation errors in hate speech corpora are often attributed to annotator disagreement or bias. This paper argues that a substantial and underexamined class of errors originates upstream, from structural weaknesses in annotation guidelines themselves. When guidelines fail to encode the linguistic and cultural properties of the target discourse, they make certain errors structurally inevitable regardless of annotator quality. Focusing on Arabic social media discourse, a challenging setting due to its dialect continuum, culturally embedded insult conventions, sarcasm-heavy pragmatics, and complex religious rhetoric, we identify three mechanisms through which guideline design produces systematic annotation errors: cultural misclassification, when culturally specific hostile expressions fall outside annotation categories; dialectal ambiguity, when lexical meanings shift across regional varieties; and annotation projection, when frameworks developed for English moderation are applied to Arabic without adequate adaptation. Using six illustrative case studies with attested Arabic examples, we show how these mechanisms produce recurrent misannotations in existing datasets. We propose a taxonomy of five guideline-induced error types, an explicit mapping from mechanisms to error types, and a practical four-stage diagnostic framework for dataset builders.
When LLMs Disagree with Human Experts: Understanding LLM Annotation Failures in Nutrition Misinformation through Hierarchical Error Analysis using Seed Oil Narratives
Vishwaa Shah | Indika Kahanda | Andrea Arikawa
Accurate linguistic annotation is crucial for creating high-quality datasets in specialized domains, yet manual labeling is often slow, expensive, and inconsistent. We present a reproducible workflow for evaluating the effectiveness of large language models (LLMs) as annotators of domain-specific health misinformation on social media. Using a dataset of 169 Instagram posts on seed oils, expert nutritionists provided gold-standard labels (71% positives), which we compared against the outputs of five open-source LLMs. We introduce a hierarchical error taxonomy that categorizes LLM misclassifications according to the direction, mechanism, and contributing factors of the error, providing interpretable insights into model failures. Our analysis reveals systematic error patterns, including misinterpretation of nuanced claims and overconfidence in predictions, highlighting conditions under which LLM annotations do not align with expert judgment. Although the dataset is modest in size and exhibits class imbalance, it reflects real-world distributions of nutrition-related Instagram content and motivates the need for a careful evaluation of the robustness of LLM annotation. This study has implications for the development of frameworks for automated LLM-based annotators in the health and nutrition domains, as well as for LLM developers in general.
Math-DB: A Discourse Framework for Mathematical Word Problems to Enhance LLM Reasoning
Mustafa Erolcan Er
Large Language Models have demonstrated significant progress in solving mathematical word problems through techniques like Chain-of-Thought (CoT) prompting. However, recent research indicates that these models often rely on statistical regularities and surface-level patterns rather than true logical reasoning, leading to performance drops when faced with minor problem perturbations or irrelevant information. In this study, we introduce Math Discourse Bank (Math-DB), a novel discourse framework and annotated dataset designed to enhance LLM reasoning. Inspired by the Penn Discourse TreeBank (PDTB) and mathematics education research, Math-DB defines a hierarchy of discourse senses designed for quantitative reasoning, including categories such as Change, Combine, Compare, and Equalize. We applied this framework to the GSM-Symbolic dataset of 12,500 problems, yielding 47,815 sense-labeled discourse relations over 11,414 successfully-aligned instances (91.3% pipeline yield). Our experiments demonstrate that incorporating Math-DB annotations into CoT prompts consistently improves LLM performance across various difficulty levels.
Cross-Linguistic Situation Entity Segmentation for Discourse Analysis in Diachronic English and German Text
Hanna Schmück | Veronika Urban | Xaver Krückl | Sonja Zeman | Claudia Claridge | Annemarie Friedrich
Situation Entity (SE) segmentation identifies clause-like discourse units focusing on verb constellations. While SE segmentation has been applied to contemporary English as a subtask of SE annotation, systematic guidelines for syntactically ambiguous constructions remain underspecified. We present principled SE segmentation guidelines for contemporary and historical varieties of English and German. Our inter-annotator agreement studies on Late Modern English (1700–1900) and New High German (1650–1900) corpora demonstrate substantial agreement. Using the existing SitEnt corpus in contemporary English, we implement a new automatic segmenter based on XLM-RoBERTa. Our evaluation examines cross-variety and cross-lingual generalization, demonstrating challenges both for human annotation efforts and in transferring segmenters trained on contemporary English to historical varieties. Our code and data are publicly available at https://github.com/coling-unia/sitent-segmenter-law2026.
UD-CHILDES-BG: a dependency treebank of Bulgarian child and child-directed speech
Mila Marcheva-Nash | Yasena Chantova | Tsvetina Kirilova | Ivelina Pavlova | Tsvetelina Stefanova | Yoana Vasileva | Weiwei Sun
This paper presents (i) UD-CHILDES-BG, a manually corrected Universal Dependencies treebank of Bulgarian child and child-directed speech, (ii) a quantitative and phenomenon-based evaluation of inter-annotator agreement on developmental data, and (iii) a systematic analysis of parser errors in this underrepresented domain. We manually correct 4,338 dependency parses (10% of the CHILDES-BG corpus), of which 14% are double-annotated. Inter-annotator agreement on UAS/LAS is 91.71/86.12 for child-directed speech (CDS) and 88.14/81.40 for child speech (CS). Parser performance on the manually corrected portion is 92.70/85.54 for CDS and 90.97/81.52 for CS, compared to a reported 93.37/90.21 on the test set of adult written language. Our analyses reveal that CDS and CS pose challenges for dependency annotation and parsing, particularly in discourse-related structures, which are less common in adult written language.
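For readers unfamiliar with the UAS/LAS figures above, the sketch below shows one simple way to compute them between two annotations of the same sentence. The representation of each token as a `(head_index, relation_label)` pair, the function name, and the toy annotations are assumptions for illustration, not the authors' evaluation script.

```python
def attachment_scores(annotation_a, annotation_b):
    """Return (UAS, LAS) in percent for two annotations of the same tokens.

    Each annotation is a list of (head_index, relation_label) pairs, one per token.
    UAS counts tokens with matching heads; LAS also requires matching labels.
    """
    if len(annotation_a) != len(annotation_b) or not annotation_a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(annotation_a)
    same_head = sum(a[0] == b[0] for a, b in zip(annotation_a, annotation_b))
    same_head_and_label = sum(a == b for a, b in zip(annotation_a, annotation_b))
    return 100 * same_head / n, 100 * same_head_and_label / n


# Hypothetical four-token sentence annotated by two annotators.
annotator_1 = [(2, "nsubj"), (0, "root"), (2, "obj"), (2, "discourse")]
annotator_2 = [(2, "nsubj"), (0, "root"), (2, "obl"), (3, "discourse")]

uas, las = attachment_scores(annotator_1, annotator_2)
```

Here the annotators agree on three of four heads (UAS 75) but on only two head-label pairs (LAS 50), which illustrates why LAS is always at most UAS.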
IndiAnn: A Web-based Annotation Platform for Indic Languages
Bandaru Lavadeep | Ritwik Raghav | Abhik Jana
Linguistic annotation tools that work well for non-Indic languages (e.g. English, German, Spanish, etc.) often fail with Indic scripts due to complex Unicode properties, including visual reordering of vowel matras, conjunct characters, and grapheme clusters spanning multiple code points. In this paper, we present a web-based annotation platform, IndiAnn, designed for low-resource Indic languages, which uses native browser Unicode rendering, offset-based storage that preserves grapheme clusters, and no forced tokenization in the user interface. The tool supports annotation for tasks such as part-of-speech (POS) tagging, named entity recognition (NER), dependency relation annotation, and semantic role labelling (SRL), while maintaining correct character boundaries and enabling seamless interoperability with standard NLP pipelines and tools. The framework is designed for Indic languages and has been tested on Telugu, Hindi, Tamil, Malayalam, Bengali, Odia, Marathi, and Kannada, with no script breakage during annotation. To the best of our knowledge, this is the first attempt at building a unified annotation framework (IndiAnn) that covers annotation for such a variety of key NLP tasks, with provision for eight Indic languages. The code repository is publicly available at https://github.com/Lavadeep/INDIANN.
Designing Annotation Guidelines for Trait-Based Arabic Automated Essay Scoring: A Systematic Methodology
Walid Massoud | Houda Bouamor | Abdelrahman Abdel Latif Hussein | Abdullah Mohamed Mohamed Zekri
Automated Essay Scoring (AES) fundamentally depends on high-quality annotated data, yet systematic approaches to developing annotation guidelines remain largely undocumented, especially for Arabic. We present a comprehensive methodology for trait-based Arabic AES annotation, applied to build a dataset of 7,859 essays by high school students annotated across seven writing traits, achieving substantial inter-annotator agreement (QWK: 0.66–0.75). Our methodology encompasses: (1) a seven-dimensional scoring framework grounded in Arabic linguistic and rhetorical conventions; (2) over 25 pages of Arabic-language guidelines with terminology unification, text-type-specific scoring descriptors, and annotated student examples; (3) a multi-stage training protocol that raised annotator agreement before production began; and (4) quality assurance mechanisms, including dual annotation and supervisor adjudication. We release all materials publicly, providing both a validated foundation for Arabic AES research and a replicable template for annotation guideline development in other morphologically complex, under-resourced languages.
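The agreement figures above are reported as quadratic weighted kappa (QWK). The sketch below implements the standard QWK formula for two raters on an integer scale; the 1-5 score range and the toy ratings are hypothetical, since the abstract does not state the scale used for each trait.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score, max_score):
    """Quadratic weighted kappa between two lists of integer scores."""
    k = max_score - min_score + 1
    n = len(rater_a)
    if n == 0 or n != len(rater_b) or k < 2:
        raise ValueError("need equal-length, non-empty ratings and at least 2 levels")
    # Observed confusion matrix of score pairs.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    numerator = denominator = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2    # quadratic disagreement penalty
            expected = hist_a[i] * hist_b[j] / n    # count expected by chance
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0:
        raise ValueError("kappa is undefined when both raters use a single score")
    return 1.0 - numerator / denominator


# Hypothetical scores from two annotators on a 1-5 scale.
scores_a = [1, 2, 3, 4, 5]
scores_b = [1, 2, 3, 4, 4]
qwk = quadratic_weighted_kappa(scores_a, scores_b, 1, 5)
```

Because the penalty grows with the square of the score difference, a one-point disagreement costs far less than a four-point one, which suits ordinal essay scores.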
Revisiting Faithfulness Annotations for Long-form Summaries
Yang Zhong | Yang Janet Liu | Diane Litman
Benchmarks for long-form summaries (four or more sentences) generated by language models increasingly serve as gold-standard references for developing, evaluating, and comparing faithfulness-checking systems. As their influence grows, understanding the challenges of annotating faithfulness errors within long, discourse-rich summaries becomes critical. We revisit three benchmarks spanning diverse text types and contrasting annotation designs. Using a discourse-aware evaluation framework together with human auditing, we identify cases where benchmark labels may be unreliable. Manual verification shows that 3.4%-5.4% of sentence-level labels warrant revision due to discourse-level inconsistencies that standard annotation procedures overlook. We introduce a taxonomy of five recurring annotation error types, propose revised labels, and show that correcting these cases leads to meaningful shifts in system rankings. We conclude with recommendations for future annotation practices.
Completing and Validating the Re-Aligned Switchboard Dialog Act Corpus
Run Chen | Zihao Tao | John Prado | Ignazio LaManna | Ryan Puterbaugh | Mim Datta | Julia Hirschberg
Although widely used in dialog act prediction and generation, the Switchboard Dialog Act (SwDA) corpus has performed poorly in models incorporating prosodic information because of misalignment between speech and text data. In this paper, we report our completion of the work begun in Chen et al. (2024) in addressing these misalignment issues with an improved SwDA corpus called RASwDA (Re-Aligned Switchboard Dialog Act Corpus). Now fully re-aligned and validated, RASwDA finally meets standards of accuracy allowing for classification models trained on it to exceed classification benchmarks set by models trained on other Switchboard subcorpora.
Not Worth Mentioning? A Pilot Study on Salient Proposition Annotation
Amir Zeldes | Katherine Conhaim | Lauren Levine
Despite a long tradition of work on extractive summarization, which by nature aims to recover the most important propositions in a text, little work has been done on operationalizing graded proposition salience in naturally occurring data. In this paper, we adopt graded summarization-based salience as a metric from previous work on Salient Entity Extraction (SEE) and adapt it to quantify proposition salience. We define the annotation task, apply it to a small multi-genre dataset, evaluate agreement and carry out a preliminary study of the relationship between our metric and notions of discourse unit centrality in discourse parsing following Rhetorical Structure Theory (RST).
LLMs as annotators of credibility assessment in Danish asylum decisions: evaluating classification performance and errors beyond aggregated metrics
Galadrielle Humblot-Renaux | Mohammad N. S. Jahromi | Rohat Bakuri-Jørgensen | Marieke Anne Heyl | Asta S. Stage Jarlner | Maria Vlachou | Anna Murphy Høgenhaug | Desmond Elliott | Thomas Gammeltoft-Hansen | Thomas B. Moeslund
Off-the-shelf large language models (LLMs) are increasingly used to automate text annotation, yet their effectiveness remains underexplored for underrepresented languages and specialized domains where the class definition requires subtle expert understanding. We investigate LLM-based annotation for a novel legal NLP task: identifying the presence and sentiment of credibility assessments in asylum decision texts. We introduce RAB-Cred, a Danish text classification dataset featuring high-quality, expert annotations and valuable metadata such as annotator confidence and asylum case outcome. We benchmark 21 open-weight models and 30 system-user prompt combinations for this task, and systematically evaluate the effect of model and prompt choice for zero-shot and few-shot classification. We zoom in on the errors made by top-performing models and prompts, investigating error consistency across LLMs, inter-class confusion, correlation with human confidence and sample-wise difficulty and severity of LLM mistakes. Our results confirm the potential of LLMs for cost-effective labeling of asylum decisions, but highlight the imperfect and inconsistent nature of LLM annotators, and the need to look beyond the predictions of a single, arbitrarily chosen model. The RAB-Cred dataset and code are available at https://github.com/glhr/RAB-Cred
Cracks in the Bridge—or A Bridge Too Far? Comparing Human and LLM Errors in the Annotation of Bridging Anaphora
Lauren Levine | Amir Zeldes
In this paper, we perform an error analysis on human and LLM annotation data from the recent GUMBridge corpus for varieties of bridging anaphora. We explore the distribution of precision and recall errors made by annotators and how that distribution correlates with bridging subtypes. We find that while LLMs perform substantially worse than human annotators, they are more balanced in their precision and recall scores than humans, whose performance strongly favors precision. With regard to subtypes, we find that comparison and meronomy relations are easier to reliably annotate than the more broadly construed entity relations for both human and LLM annotators, but that LLM errors are more distributed across subtypes than human errors. Analyzing these results, we provide insights for future annotation projects on bridging anaphora.
Clustering Analysis for Error Detection in Named Entity Recognition Datasets
Matthew Flynn | Timothy Obiso | Sam Newman | Constantine Lignos
This paper introduces a method for the automatic detection of annotation errors and corrections in named entity recognition datasets using a novel two-stage dimension reduction of dense sentence embeddings. We first find the top-n principal components of an embedding and then use UMAP for second-stage, non-linear dimension reduction and clustering using different distance metrics. We analyze these clusters using silhouette scores to flag outlier mentions for correction. Using the corrections in the CoNLL# dataset as a benchmark, all of the top-five outliers needed correction, as did 7 of the top-10. This approach also identified 32 of the top-50 outlier mentions that are corrections. This method offers a relatively low-effort way to leverage text embeddings and dimensionality reduction to identify likely annotation errors. We release related code and data at https://github.com/bltlab/clustering-for-ner.
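The final step described above, scoring clustered mentions with silhouette coefficients and flagging the worst-fitting ones, can be sketched in plain Python. This is a minimal illustration, not the authors' released code: it assumes the mentions have already been reduced to low-dimensional points and assigned cluster labels, and the rule "lowest silhouette first" in the hypothetical `flag_outliers` helper is an assumption.

```python
import math


def silhouette_scores(points, labels):
    """Per-point silhouette coefficient s = (b - a) / max(a, b).

    a: mean distance to the other points in the same cluster.
    b: smallest mean distance to the points of any other cluster.
    Singleton clusters and single-cluster inputs score 0.0.
    """
    scores = []
    for i, (p, label) in enumerate(zip(points, labels)):
        by_cluster = {}
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(other, []).append(math.dist(p, q))
        if label not in by_cluster or len(by_cluster) < 2:
            scores.append(0.0)
            continue
        a = sum(by_cluster[label]) / len(by_cluster[label])
        b = min(sum(d) / len(d) for c, d in by_cluster.items() if c != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


def flag_outliers(points, labels, top_k=5):
    """Hypothetical helper: indices of the top_k lowest-silhouette points."""
    scores = silhouette_scores(points, labels)
    return sorted(range(len(points)), key=lambda i: scores[i])[:top_k]
```

A point whose label disagrees with its neighbourhood gets a negative score and is ranked first, which makes it a natural candidate for manual review.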
When Ground Truth Disagrees: A Human-in-the-Loop Audit of Annotation Errors in High-Stakes Crash Narratives
Md Sajjad Hossain | Lin Li | Judy A. Perkins | John Clary | Joel Meyer
Linguistic annotation of high-stakes narrative data is often constrained by data confidentiality, domain expertise, and the lack of large-scale multi-annotator pipelines. We present a human-in-the-loop framework for auditing annotation discrepancies in crash narratives, combining structured labels, narrative-based annotation, and expert adjudication. Using 9,387 crash reports, we conduct a multi-layer analysis of disagreement across annotation sources. Nearly half of the records (49.4%) exhibit discrepancies between structured and narrative labels, driven mainly by unsupported structured assignments. In contrast, narrative-based annotation achieves near-perfect agreement with adjudication (𝜅 = 0.990), indicating strong consistency when grounded in textual evidence. We introduce a taxonomy of discrepancies, showing refinement opportunities and missing details are the most common, while linguistic factors such as hedging and underspecification contribute to ambiguity. We further show that annotator-reported uncertainty strongly predicts annotation difficulty, with uncertain records nearly nine times more likely to disagree with structured labels. These findings highlight limitations of administrative coding and support a scalable, uncertainty-guided annotation paradigm for restricted-access domains.
Prompts in the Wild: A Large Analyzed Collection of Transactional Prompts in Code
Victoria Basmov | Yoav Goldberg | Reut Tsarfaty
The behavior of contemporary generative Large Language Models (LLMs) is directly shaped by prompts, unstructured texts that describe the desired output and model behavior. In this paper we argue that prompts are linguistic objects that merit investigation in their own right. To this end, we collect 57.5K unique samples of prompts from GitHub. Specifically, we focus on transactional prompts: reproducible natural language instructions that are integrated into software. To enable the empirical, quantitative study of prompts, we introduce a structured ontology, capturing the properties of prompts as well as their formal and semantic components. Based on this ontology, we transform prompts from unstructured raw texts into richly structured linguistic objects. Analysis of these structured data reveals significant diversity of usage patterns across languages, domains, tasks, and modalities, in a typical Zipf-like distribution where some clearly prevail and others, more diverse, appear in the long tail. To validate the reliability of the ontology-based annotation of the prompts, we perform a comprehensive error analysis across all fields, providing a detailed assessment of annotation quality. We release the dataset together with a browsing and exploration interface.
TalkTag: Fine-Grained Morphosyntactic Error Annotation for Transcribed Speech
Shamira Venturini | Oliver Hennhöfer | Steffen Kinkel | Jannik Strötgen
Fine-grained morphosyntactic error annotation is important in clinical and developmental language research, yet it is labour-intensive, expert-dependent, and difficult to scale. We present TalkTag, an LLM-based lightweight tool fine-tuned to automate CHAT-style error annotation in spoken-language transcripts. Developed under conditions of extreme data scarcity using children’s narrative data, the system shows the feasibility of linguistic analysis in low-resource settings. Our evaluation demonstrates that TalkTag produces encouragingly precise annotation while effectively identifying instances where linguistic ambiguity makes automated tagging genuinely complex. In summary, with TalkTag, we provide a scalable alternative to manual error annotation and practically viable support for morphosyntactic error annotation.
The Proceedings for the 6th International Workshop on Computational Approaches to Language Change (LChange’26)
Nina Tahmasebi | Pierluigi Cassotti | Syrielle Montariol | Andrey Kutuzov | Netta Huebscher | Elena Spaziani | Naomi Baes
The SlangTrack Dataset: Supporting the Detection of Words Used in Slang Senses
Afnan Mohammed Aloraini | Riza Batista-Navarro | Goran Nenadic | Viktor Schlegel
Slang is widespread in informal communication, yet its fluidity poses challenges for natural language processing (NLP), especially when words alternate between slang and non-slang senses. While prior work has examined slang through dictionaries, sentiment analysis, and lexicon building, little attention has been given to detecting slang usage in context. We address this gap by reframing slang detection as distinguishing slang from non-slang senses of the same lexical item. To support this task, we introduce SlangTrack (ST), a diachronically structured dataset of dual-meaning words annotated at the sentence level with high inter-annotator agreement. We benchmark (1) deep learning models with static and contextual embeddings, (2) transformer-based models, and (3) large language models evaluated in zero-shot, few-shot, and fine-tuned settings. Fine-tuned transformers, especially BERT-large enriched with sentiment and emotion features, achieve the strongest performance, reaching an F1-score of 72% for slang and 92% for non-slang usage. Our findings highlight both the difficulty of contextual slang detection and the value of affective cues for improving model robustness.
Statistical Semantic Change Detection via Usage Similarities
Taichi Aida | Daichi Mochihashi | Hiroya Takamura | Toshinobu Ogiso | Mamoru Komachi
Semantic change detection comprises two subtasks: classification, which predicts whether a target word has undergone a semantic shift, and ranking, which orders words according to the degree of their semantic change. While most prior studies have concentrated on the ranking subtask, the classification subtask plays an equally important role, since many practical scenarios require a yes/no decision on semantic change rather than a global ranking. In this work, we propose a novel statistical method that predicts the presence or absence of semantic change. While most existing approaches infer semantic change by comparing word embeddings across time periods or domains, our method directly models the diachronic/synchronic consistency of usage-level similarity scores. Our experiments on SemEval-2020 Task 1 and WUGS datasets demonstrate that the proposed formulation outperforms existing state-of-the-art embedding-based methods, and robustly detects semantic change across languages in both diachronic and synchronic settings.
Tonogenesis—the historical process by which segmental contrasts evolve into lexical tone—has traditionally been studied through comparative reconstruction and acoustic phonetics. We introduce a computational approach that quantifies the functional role of pitch at different stages of this sound change by measuring how pitch manipulation affects automatic speech recognition (ASR) performance. Through analysis on the sensitivity to pitch-flattening from a set of closely related Tibetan languages, we find evidence of a tonogenesis continuum: atonal Amdo dialects tolerate pitch removal the most, while fully tonal Ü-Tsang varieties show severe degradation, and intermediate Kham dialects fall measurably between these extremes. These gradient effects demonstrate how ASR models implicitly learn the shifting functional load of pitch as languages transition from consonant-based to tone-based lexical contrasts. Our findings show that computational methods can capture fine-grained stages of sound change and suggest that traditional functional load metrics, based solely on minimal pairs, may overestimate pitch dependence in transitional systems where segmental and suprasegmental cues remain phonetically intertwined.
Cross-lingual Lexical Semantic Change in Romance Languages
Ana Sabina Uban | Liviu P Dinu | Anca Daniela Dinu | Simona Georgescu
We present a comprehensive quantitative analysis of lexical semantic change in the five main Romance languages (Romanian, Italian, Spanish, French and Portuguese), based on the most exhaustive database of related words in these languages. We include both cognate words and borrowings (for the first time, to our knowledge), and compute semantic shift measures using different static and contextual embedding models, as well as three different corpora. We publish the obtained lists of semantic divergences across all related word pairs, compute global trends in language-level semantic divergence, and provide insights on particular study cases of highly stable and highly divergent words for different language pairs.
Threshold-Calibrated Word Sense Disambiguation: Semantic Broadening Without Sense Redistribution in Schizophrenia
Naomi Baes | Nick Haslam
Polysemous words pose a challenge for computational approaches to language change. We extend a recent hypothesis-driven, prototype-based framework to estimate word sense prevalence in diachronic text corpora and apply it to 109,940 usages of schizophrenia drawn from U.S. news media (1985–2025). Our extensions include a contextual dispersion measure (Breadth), robust prototype construction, and human-calibrated prototype-similarity thresholds for conservative sense assignment at scale. Across four decades, distributional semantic change indices commonly used in lexical semantic change detection (LSCD) show significant increases in Breadth and baseline-relative semantic drift (APD), while changes in the central usage prototype (PRT) are influenced by term frequency. In contrast, threshold-calibrated sense assignments reveal stable sense proportions: the psychiatric sense remains dominant, with split-personality and metaphorical senses consistently marginal. Together, these results demonstrate that dispersion- and drift-based LSCD metrics can increase even under stable sense prevalence, indicating that such increases can occur without sense redistribution and primarily reflect broad shifts in usage distributions rather than evidence of polysemization or sense loss. We introduce a threshold-calibrated, prototype-based sense-tracking pipeline that enables conservative sense prevalence estimation at scale and clarifies whether rising distributional LSCD metrics reflect sense redistribution or increasing contextual diversity when historical sense annotation is limited.
Using Correspondence Patterns to Identify Irregular Words in Cognate Sets Through Leave-One-Out Validation
Frederic Blum | Johann-Mattis List
Regular sound correspondences constitute the principal evidence in historical language comparison. Despite the heuristic focus on regularity, it is often more an intuitive judgement than a quantified evaluation, and irregularity is more common than expected from the Neogrammarian model. Given the recent progress of computational methods in historical linguistics and the increased availability of standardized lexical data, we are now able to improve our workflows and provide such a quantitative evaluation. Here, we present the balanced average recurrence of correspondence patterns as a new measure of regularity. We also present a new computational method that uses this measure to identify cognate sets that lack regularity with respect to their correspondence patterns. We validate the method through two experiments, using simulated and real data. In the experiments, we employ leave-one-out validation to measure the regularity of cognate sets in which one word form has been replaced by an irregular one, checking how well our method identifies the forms causing the irregularity. Our method achieves an overall accuracy of 85% with the datasets based on real data. We also show the benefits of working with subsamples of large datasets and how increasing irregularity in the data influences our results. Reflecting on the broader potential of our new regularity measure and the irregular cognate identification method based on it, we conclude that they could play an important role in improving the quality of existing and future datasets in computer-assisted language comparison.
DHPLT: large-scale multilingual diachronic corpora and word representations for semantic change modelling
Mariia Fedorova | Andrey Kutuzov | Khonzoda Umarova
In this resource paper, we present DHPLT, an open collection of diachronic corpora in 41 diverse languages. DHPLT is based on the web-crawled HPLT datasets; we use web crawl timestamps as the approximate signal of document creation time. The collection covers three time periods: 2011-2015, 2020-2021 and 2024-present (1 million documents per time period for each language). We additionally provide pre-computed word type and token embeddings and lexical substitutions for our chosen target words, while leaving it open for other researchers to come up with their own target words using the same datasets. DHPLT aims to fill the current lack of multilingual diachronic corpora for semantic change modelling (beyond a dozen high-resource languages). It opens the way for a variety of new experimental setups in this field.
Transparent Semantic Change Detection with Dependency-Based Profiles
Bach Phan Tat | Kris Heylen | Dirk Geeraerts | Stefano De Pascale | Dirk Speelman
Most modern computational approaches to lexical semantic change detection (LSC) rely on embedding-based distributional word representations with neural networks. Despite the strong performance on LSC benchmarks, they are often opaque. We investigate an alternative method which relies purely on dependency co-occurrence patterns of words. We demonstrate that it is effective for semantic change detection and even outperforms a number of distributional semantic models. We provide an in-depth quantitative and qualitative analysis of the predictions, showing that they are plausible and interpretable.
Semantic Change Characterization with LLMs using Rhetorics
Jáder Martins Camboim de Sá | Jooyoung Lee | Marcos Da Silveira | Cedric Pruski
Languages continually evolve in response to societal events, resulting in new terms and shifts in meanings. These changes have significant implications for computer applications, including automatic translation and chatbots, making it essential to characterize them accurately. The recent development of LLMs has notably advanced natural language understanding, particularly in sense inference and reasoning. In this paper, we investigate the potential of LLMs in characterizing three types of semantic change: dimension, relation, and orientation. We achieve this by combining LLMs’ Chain-of-Thought with rhetorical devices and conducting an experimental assessment of our approach using newly created datasets. Our results highlight the effectiveness of LLMs in capturing and analyzing semantic changes, providing valuable insights to improve computational linguistic applications.
Using BERT to Explore Lexical Semantic Change of Prepositions
Liudmila Radchankava | Vasily Konovalov
This paper presents a semi-supervised approach to investigating lexical semantic change in English prepositions using contextualized word embeddings from BERT. Due to their hybrid lexico-grammatical nature and high degree of polysemy, prepositions have received limited attention in computational studies of semantic change. We address this gap by first applying BERT-based embeddings in combination with a k-nearest neighbors classifier to the task of preposition sense disambiguation, achieving competitive performance without relying on external lexical resources. The trained model is then applied to diachronic data from the Corpus of Historical American English to analyze semantic change over time. By measuring classifier confidence and correlating it with usage year, we detect systematic differences between simple and compound prepositions. Our results confirm linguistic hypotheses that simple prepositions remain largely semantically stable, while compound prepositions exhibit measurable semantic change. The study demonstrates that BERT embeddings provide an effective tool for exploring diachronic semantic phenomena in functionally complex word classes and can be extended to other languages and datasets.
A Computational Analysis of the Emergence of Therapy-speak in Social Media
Alina Iacob | Ana Sabina Uban
This article compares semantic change in psychology-related concepts across scientific and social media texts. We assess patterns of change over 15 years (2010-2025) and compare word usage in a corpus of psychology journal abstracts and Reddit comments, testing whether specialized communities on social media align with psychology experts. We analyze the evolution of semantic breadth, semantic displacement and neighbour similarity, and include contextual embeddings in our experiments alongside static Word2Vec embeddings. Our results reveal diverse patterns of semantic change across the examined concepts and confirm that many terms are used differently on social media compared to specialized literature. Furthermore, Reddit communities focused on psychology discussions occupy an intermediate position, adopting a more objective stance than general-domain threads while remaining distinct from specialized literature.
Lexical semantic change detection (LSCD) increasingly relies on contextualised language model embeddings, yet most approaches still quantify change using a small set of semantic change metrics, primarily Average Pairwise Distance (APD) and cosine distance over word prototypes (PRT). We introduce Average Minimum Distance (AMD) and Symmetric Average Minimum Distance (SAMD), new measures that quantify semantic change via local correspondence between word usages across time periods. Across multiple languages, encoder models, and representation spaces, we show that AMD often provides more robust performance, particularly under dimensionality reduction and with non-specialised encoders, while SAMD excels with specialised encoders. We suggest that LSCD may benefit from considering alternative semantic change metrics beyond APD and PRT, with AMD offering a robust option for contextualised embedding-based analysis.
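To make the contrast between these metrics concrete, the sketch below computes APD, PRT, and one plausible reading of AMD and SAMD over two sets of usage vectors. The abstract does not give the formulas for AMD and SAMD, so the definitions here are assumptions: AMD is taken as the mean, over usages in the first period, of the distance to the nearest usage in the second period, and SAMD as the average of the two directions. Cosine distance is also an assumed choice.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def apd(usages_a, usages_b):
    """Average Pairwise Distance over all cross-period usage pairs."""
    total = sum(cosine_distance(u, v) for u in usages_a for v in usages_b)
    return total / (len(usages_a) * len(usages_b))


def prt(usages_a, usages_b):
    """Cosine distance between the two period prototypes (mean vectors)."""
    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    return cosine_distance(mean(usages_a), mean(usages_b))


def amd(usages_a, usages_b):
    """Assumed definition: mean nearest-neighbour distance from A into B."""
    nearest = [min(cosine_distance(u, v) for v in usages_b) for u in usages_a]
    return sum(nearest) / len(nearest)


def samd(usages_a, usages_b):
    """Assumed definition: average of the two directed AMD values."""
    return 0.5 * (amd(usages_a, usages_b) + amd(usages_b, usages_a))
```

Under these assumed definitions AMD is asymmetric: a sense that appears only in the first period raises `amd(a, b)` but leaves `amd(b, a)` unchanged, whereas APD averages over every pair and so mixes the two effects.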
From sunblock to softblock: Analyzing the correlates of neology in published writing and on social media
Maria Ryskina | Matthew R. Gormley | Kyle Mahowald | David R. Mortensen | Taylor Berg-Kirkpatrick | Vivek Kulkarni
Living languages are shaped by a host of conflicting internal and external evolutionary pressures. While some of these pressures are universal across languages and cultures, others differ depending on the social and conversational context: language use in newspapers is subject to very different constraints than language use on social media. Prior distributional semantic work on English word emergence *(neology)* identified two factors correlated with creation of new words by analyzing a corpus consisting primarily of historical published texts [(Ryskina et al., 2020)](https://aclanthology.org/2020.scil-1.43/). Extending this methodology to contextual embeddings in addition to static ones and applying it to a new corpus of Twitter posts, we show that the same findings hold for both domains, though the topic popularity growth factor may contribute less to neology on Twitter than in published writing. We hypothesize that this difference can be explained by the two domains favouring different word formation mechanisms.
Proceedings for the Ninth Workshop on Technologies for Machine Translation of Low Resource Languages (LoResMT 2026)
Atul Kr. Ojha | Chao-hong Liu | Ekaterina Vylomova | Flammie Pirinen | Jonathan Washington | Nathaniel Oco | Xiaobing Zhao
Are Small Language Models the Silver Bullet to Low-Resource Languages Machine Translation?
Yewei Song | Lujun Li | Cedric Lothritz | Saad Ezzini | Lama Sleem | Niccolo' Gentile | Radu State | Tegawendé F. Bissyandé | Jacques Klein
Small language models (SLMs) offer computationally efficient alternatives to large language models, yet their translation quality for low-resource languages (LRLs) remains severely limited. This work presents the first large-scale evaluation of SLMs across 200 languages, revealing systematic underperformance in LRLs and identifying key sources of linguistic disparity. We show that knowledge distillation from strong teacher models using predominantly monolingual LRL data substantially boosts SLM translation quality—often enabling 2B–3B models to match or surpass systems up to 70B parameters. Our study highlights three core findings: (1) a comprehensive benchmark exposing the limitations of SLMs on 200 languages; (2) evidence that LRL-focused distillation improves translation without inducing catastrophic forgetting, with full-parameter fine-tuning and decoder-only teachers outperforming LoRA and encoder–decoder approaches; and (3) consistent cross-lingual gains demonstrating the scalability and robustness of the method. These results establish an effective, low-cost pathway for improving LRL translation and provide practical guidance for deploying SLMs in truly low-resource settings.
Tao–Filipino Neural Machine Translation: Strategies for Ultra–Low-Resource Settings
Adrian Denzel Macayan | Luis Andrew Sunga Madridijo | Ellexandrei Esponilla | Zachary Mitchell Francisco
Neural Machine Translation (NMT) performance degrades significantly in ultra-low resource settings, particularly for endangered languages like Tao (Yami) which lack extensive parallel corpora. This study investigates strategies to bootstrap a Tao-Tagalog translation system using the NLLB-200 (600 million parameter) model under extremely limited supervision. We propose a multi-faceted approach combining domain-specific fine-tuning, synthetic data augmentation, and cross-lingual transfer learning. Specifically, we leverage the phylogenetic proximity of Ivatan, a related Batanic language, to pre-train the model, and utilize dictionary-based generation to construct synthetic conversational data. Our results demonstrate that transfer learning from Ivatan improves translation quality on in-domain religious texts, achieving a BLEU score of 34.85. Conversely, incorporating synthetic data enhances the model’s ability to generalize to conversational contexts, mitigating the domain bias often inherent in religious corpora. These findings highlight the effectiveness of exploiting linguistic typology and structured lexical resources to develop functional NMT systems for under-represented Austronesian languages.
Text Filter Based on Automatically Acquired Vocabularies for Multilingual Machine Translation
Kenji Imamura | Masao Utiyama
In this paper, we propose a text filter designed to support multiple languages. The method simply aggregates vocabulary from a monolingual corpus and compares it against the input. Despite its simplicity, the approach proves highly effective in removing code-mixed text. When combined with existing language identification techniques, our method can enhance the purity of the corpus in the target language. Consequently, applying it to parallel corpora for machine translation has the potential to improve translation quality. Additionally, the proposed method supports the incremental addition of new languages without the need to retrain those already learned. This feature easily enables our method to be applied to low-resource languages.
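A vocabulary-based filter of this kind can be sketched in a few lines. The version below is only an illustration under stated assumptions, not the authors' implementation: whitespace tokenization, lowercasing, the `min_count` cut-off, and the 0.8 coverage threshold are all hypothetical choices, as are the function names.

```python
def build_vocabulary(corpus_lines, min_count=1):
    """Collect the set of tokens seen at least min_count times in a monolingual corpus."""
    counts = {}
    for line in corpus_lines:
        for token in line.lower().split():
            counts[token] = counts.get(token, 0) + 1
    return {token for token, count in counts.items() if count >= min_count}


def add_language(vocabularies, lang, corpus_lines):
    """Register a new language; vocabularies of existing languages are left untouched."""
    vocabularies[lang] = build_vocabulary(corpus_lines)


def coverage(text, vocabulary):
    """Fraction of tokens in text that occur in the vocabulary (0.0 for empty text)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(token in vocabulary for token in tokens) / len(tokens)


def keep_line(text, vocabulary, threshold=0.8):
    """Keep a line only if enough of its tokens are known in the target language."""
    return coverage(text, vocabulary) >= threshold
```

Because each language is stored as an independent set, adding a language is a single `add_language` call and nothing already built needs to be recomputed, which mirrors the incremental property the abstract describes.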
Comparing LLM-Based Translation Approaches for Extremely Low-Resource Languages
Jared Coleman | Ruben Rosales | Kira Toal | Diego Cuadros | Nicholas Leeds | Bhaskar Krishnamachari | Khalil Iskarous
We present a comprehensive evaluation and extension of the LLM-Assisted Rule-Based Machine Translation (LLM-RBMT) paradigm, an approach that combines the strengths of rule-based methods and Large Language Models (LLMs) to support translation in no-resource settings. We present a robust new implementation (the Pipeline Translator) that generalizes the LLM-RBMT approach and enables flexible adaptation to novel constructions. We benchmark it against four alternatives (Builder, Instructions, RAG, and Fine-tuned translators) on a curated dataset of 150 English sentences, and compare them across translation quality and runtime. The Pipeline Translator consistently achieves the best overall performance. The LLM-RBMT methods (Pipeline and Builder) also offer an important advantage: they naturally align with evaluation strategies that prioritize grammaticality and semantic fidelity over surface-form overlap, which is critical for endangered languages where mistranslation carries high risk.
We evaluate the capabilities of several small large language models (LLMs) to translate between Italian and six low-resource language varieties from Italy (Friulan, Ligurian, Lombard, Sicilian, Sardinian, and Venetian). Using recent benchmark datasets, such as FLORES+ and OLDI-Seed, we compare prompting and fine-tuning approaches for downstream translation, evaluated with CHRF scores. Our findings confirm that these LLMs struggle to translate into and from these low-resource language varieties. Pretraining and fine-tuning a small LLM did not yield improvements over a zero-shot baseline. These results underscore the need for further NLP research on Italy’s low-resource language varieties. As the digital divide continues to threaten the conservation of this diverse linguistic landscape, greater engagement with speaker communities to create better and more representative datasets is essential to boost the translation performance of current LLMs.
Balancing Fluency and Adherence: Hybrid Fallback Term Injection in Low-Resource Terminology Translation
Kurt Abela | Marc Tanti | Claudia Borg
Integrating domain-specific terminology into Machine Translation systems is a persistent challenge, particularly in low-resource and morphologically-rich scenarios where models lack the robustness to handle imposed constraints. This paper investigates the trade-off between static dictionary-based data augmentation and dynamic inference constraints (Constrained Beam Search). We evaluate these methods on two high-to-low resource language pairs: English-Maltese (Semitic) and English-Slovak (Slavic). Our experiments reveal a dichotomy: while dynamic constraints achieve near-perfect Terminology Insertion Rates (TIR), they drastically degrade translation quality (BLEU) in low-resource settings, breaking the fragile fluency of the model. Conversely, static augmentation improves terminology adherence on unseen terms in Maltese (4% → 19%), but fails in the context of a highly inflected language like Slovak. To resolve this conflict, we propose Hybrid Fallback Term Injections, a strategy that prioritizes the fluency of static models while using dynamic constraints as a safety net. This approach recovers up to 90% of missing terms while mitigating the quality degradation of pure constraint approaches, providing a viable solution for high-fidelity translation in data-scarce environments.
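The fallback idea, preferring the fluent output of the statically augmented model and invoking constrained decoding only when a required term is missing, can be sketched as control flow. Everything concrete here is an assumption for illustration: `translate_static` and `translate_constrained` are hypothetical callables standing in for the two systems, term presence is checked by case-insensitive substring match, and the constrained pass receives the full term list.

```python
from typing import Callable, List, Sequence


def missing_terms(hypothesis: str, target_terms: Sequence[str]) -> List[str]:
    """Return the required target terms that do not appear in the hypothesis.

    Assumption: plain substring matching. Inflected forms in a
    morphologically rich language would need lemmatized or fuzzy matching.
    """
    lowered = hypothesis.lower()
    return [term for term in target_terms if term.lower() not in lowered]


def hybrid_translate(
    source: str,
    target_terms: Sequence[str],
    translate_static: Callable[[str], str],
    translate_constrained: Callable[[str, Sequence[str]], str],
) -> str:
    """Use the static model's output unless it drops a required term."""
    draft = translate_static(source)
    if not missing_terms(draft, target_terms):
        return draft  # fluent output already contains every term
    # Safety net: re-decode with lexical constraints (hypothetical interface).
    return translate_constrained(source, target_terms)
```

With this structure the costlier and less fluent constrained pass runs only on the subset of sentences where the static model failed to produce the terminology.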
Context Volume Drives Performance: Tackling Domain Shift in Extremely Low-Resource Translation via RAG
David Samuel Setiawan | Raphaël Merx | Jey Han Lau
Neural Machine Translation (NMT) models for low-resource languages suffer significant performance degradation under domain shift. We quantify this challenge using Dhao, an indigenous language of Eastern Indonesia with no digital footprint beyond the New Testament (NT). When applied to the unseen Old Testament (OT), a standard NMT model fine-tuned on the NT drops from an in-domain score of 36.17 chrF++ to 27.11 chrF++. To recover this loss, we introduce a hybrid framework where a fine-tuned NMT model generates an initial draft, which is then refined by a Large Language Model (LLM) using Retrieval-Augmented Generation (RAG). The final system achieves 35.21 chrF++ (+8.10 recovery), effectively matching the original in-domain quality. Our analysis reveals that this performance is driven primarily by the number of retrieved examples rather than the choice of retrieval algorithm. Qualitative analysis confirms the LLM acts as a robust "safety net," repairing severe failures in zero-shot domains.
Building and Evaluating a High Quality Parallel Corpus for English Urdu Low Resource Machine Translation
Munief Hassan Tahir | Hunain Azam | Sana Shams | Sarmad Hussain
Low-resource languages like Urdu suffer from limited high-quality parallel data for machine translation. We introduce a curated English–Urdu corpus of 80,749 high-fidelity sentence pairs across 18 diverse domains, built via ethical collection, manual alignment, deduplication, and strict length-based filtering (AWCD ≤ 5). The corpus is converted into a bidirectional SFT dataset with bilingual (English/Urdu) instructions to enhance prompt-language robustness. Fine-tuning Llama-3.1-8B-Instruct (Llama-FT) and UrduLlama 1.1 (UrduLlama-FT) yields major gains over the baseline. sacreBLEU scores reach 24.65–25.24 (En→Ur) and 76.14–77.97 (Ur→En) for Llama-FT, with minimal sensitivity to prompt language. Blind human evaluation on 90 sentences per direction confirms substantial perceptual improvements. Results demonstrate the value of clean parallel data and bilingual instruction tuning, revealing complementary benefits of general SFT versus Urdu-specific pretraining. This work provides a reproducible resource and pipeline to advance machine translation for Urdu and similar low-resource languages.
This paper presents a set of linguistic resources that describes Quechua verbs. We first present a dictionary of 1,444 fundamental Quechua verbs, associated with morpho-syntactic grammars to formalize their inflection and their derivations, that can be used to produce over 2,777,000 conjugated Quechua derived verbal forms. We aligned this list of Quechua verbal forms with the corresponding Spanish dictionary that contains 618,000 conjugated verbal forms, thus producing both a Spanish to Quechua and a Quechua to Spanish dictionary.
Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing
Aashish Dhawan | Christopher Driggers-Ellis | Christan Grant | Daisy Zhe Wang
Machine translation for Indigenous and other low-resource languages is constrained by limited parallel data, orthographic variation, and evaluation instability for morphologically rich languages. In this work, we study Spanish–Aymara, Spanish–Guarani, and Spanish–Quechua translation in the context of the AmericasNLP benchmarks, focusing on data-centric improvements rather than architectural changes. We augment curated parallel corpora with forward-translated synthetic sentence pairs generated using a high-capacity multilingual translation model, while applying conservative, language-specific preprocessing tailored to each language. Training data is filtered using length-ratio constraints and deduplication, whereas official development sets are left unfiltered to ensure fair evaluation. We fine-tune a multilingual mBART model under curated-only and curated+synthetic settings and evaluate performance primarily using chrF++, which is better suited for agglutinative languages than BLEU. Across all three languages, synthetic data augmentation consistently improves chrF++, with the largest gains observed for Aymara and Guarani, while Quechua benefits primarily from deterministic orthographic normalization. Our analysis highlights both the effectiveness and the limitations of generic preprocessing for highly agglutinative languages, suggesting that data-centric augmentation and language-aware normalization are strong, reproducible baselines for low-resource Indigenous language machine translation.
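The training-data filtering step mentioned in this abstract (length-ratio constraints plus deduplication) might look roughly like the sketch below. The threshold `max_ratio=2.5` and the use of whitespace token counts are assumptions for illustration only; the abstract gives no concrete values, and for agglutinative languages a character-based length may be a better fit.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def filter_pairs(pairs: Iterable[Pair], max_ratio: float = 2.5) -> List[Pair]:
    """Drop empty, badly length-mismatched, and exact-duplicate sentence pairs.

    `max_ratio` is a hypothetical threshold on longer/shorter token count.
    Order of first occurrence is preserved.
    """
    seen = set()
    kept: List[Pair] = []
    for src, tgt in pairs:
        src, tgt = src.strip(), tgt.strip()
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if n_src == 0 or n_tgt == 0:
            continue  # one side is empty
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # lengths too different to be a plausible translation
        key = (src, tgt)
        if key in seen:
            continue  # exact duplicate
        seen.add(key)
        kept.append(key)
    return kept
```

Consistent with the abstract, a filter like this would be applied to training data only, leaving the official development sets untouched.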
Adapting Multilingual NMT to Language Isolates: The Role of Proxy Language Selection and Dialect Handling for Nivkh
Eleonora Izmailova | Alexey Sorokin | Pavel Grashchenkov
Neural machine translation has achieved remarkable results for high-resource languages, yet language isolates – those with no demonstrated genetic relatives – remain severely underserved, as they cannot benefit from cross-lingual transfer with related languages. We present the first NMT system for Nivkh, a critically endangered language isolate spoken by fewer than 100 fluent speakers in the Russian Far East. Working with approximately 9.5k parallel sentences – expanded through fine-tuned LaBSE sentence alignment – we adapt NLLB-200 to Nivkh-Russian translation. Since Nivkh is absent from NLLB’s language inventory, we investigate proxy language token selection, comparing six typologically diverse languages: Bashkir, Kazakh, Halh Mongolian, Turkish, Tajik, and French. We find that using any proxy substantially outperforms random token initialization (BLEU 18-19.02 vs. 15.44 for rus→niv), confirming the value of proxy-based transfer. However, the choice of which proxy has minimal impact, with all six achieving comparable results despite spanning four language families and two scripts. This suggests that for language isolates, practitioners can select any typologically reasonable proxy without significant performance penalty. We additionally present preliminary experiments on dialect-specific models for Amur and Sakhalin Nivkh. Our findings establish baseline results for future Nivkh NLP research and provide practical guidance for adapting multilingual models to other language isolates.
Machine translation (MT) evaluation is central in guiding researchers on how to improve a model’s performance. Current automatic evaluation practices fail to provide reliable insights into the specific translation errors that occur, especially for low-resource languages. This paper introduces the Lux-MT-Test-Suite, enabling a linguistically motivated and fine-grained analysis of Luxembourgish–English (LB-EN) MT based on 896 test items covering 12 linguistic categories and 36 linguistic phenomena. We compare a baseline local LLM (Gemma 3), its fine-tuned counterpart (LuxMT), and a proprietary state-of-the-art LLM (GPT-5) to analyse what local LLMs learn through fine-tuning in a low-resource setting and to assess performance differences between local and proprietary systems. The findings identify specific performance gains through fine-tuning, minor degradations, a difference in translation strategies, performance gaps between local and proprietary models, and remaining challenges.
Assessing and Improving Punctuation Robustness in English-Marathi Machine Translation
Kaustubh Shivshankar Shejole | Sourabh Deoghare | Pushpak Bhattacharyya
Neural Machine Translation (NMT) systems rely heavily on explicit punctuation cues to resolve semantic ambiguities in a source sentence. Inputting user-generated sentences, which are likely to contain missing or incorrect punctuation, results in fluent but semantically disastrous translations. This work attempts to highlight and address the problem of punctuation robustness of NMT systems through an English-to-Marathi translation. First, we introduce Virām, a human-curated diagnostic benchmark of 54 punctuation-ambiguous English-Marathi sentence pairs to stress-test existing NMT systems. Second, we evaluate two simple remediation strategies: cascade-based restore-then-translate and direct fine-tuning. Our experimental results and analysis demonstrate that both strategies yield substantial NMT performance improvements. Furthermore, we find that current Large Language Models (LLMs) exhibit relatively poorer robustness in translating such sentences than these task-specific strategies, thus necessitating further research in this area. The code and dataset are available at https://github.com/KaustubhShejole/Viram_Marathi.
Can Linguistically Related Languages Guide LLM Translation in Low-Resource Settings?
Aishwarya Ramasethu | Rohin Garg | Niyathi Allu | Harshwardhan Fartale | Dun Li Chan
Large Language Models (LLMs) have achieved strong performance across many downstream tasks, yet their effectiveness in extremely low-resource machine translation remains limited. Standard adaptation techniques typically rely on large-scale parallel data or extensive fine-tuning, which are infeasible for the long tail of underrepresented languages. In this work, we investigate a more constrained question: in data-scarce settings, to what extent can linguistically similar pivot languages and few-shot demonstrations provide useful guidance for on-the-fly adaptation in LLMs? We study a data-efficient experimental setup that combines linguistically related pivot languages with few-shot in-context examples, without any parameter updates, and evaluate translation behavior under controlled conditions. Our analysis shows that while pivot-based prompting can yield improvements in certain configurations, particularly in settings where the target language is less well represented in the model’s vocabulary, the gains are often modest and sensitive to few-shot example construction. For closely related or better represented varieties, we observe diminishing or inconsistent gains. Broadly, our findings provide empirical guidance on how and when inference-time prompting and pivot-based examples can be used as a lightweight alternative to fine-tuning in low-resource translation settings.
The challenges of building speech-to-text translation (ST) systems (e.g., a relative lack of parallel speech–text data and robustness to noise in audio) are exacerbated for low-resource language pairs. In this work, we seek to improve low-resource ST by building on previous studies that regularize ST training with the connectionist temporal classification (CTC) loss. By systematically evaluating a diverse range of linguistic annotations as CTC labels across multiple auxiliary loss configurations, we improve speech translation systems for both low- and high-resource settings. These improvements over both a standard end-to-end ST system and a speech LLM indicate a need for continued research on regularizing speech representations in ST.
Navigating Data Scarcity in Low-Resource English-Tatar Translation using LLM Fine-Tuning
Ahmed Khaled Khamis
The scarcity of high-quality parallel corpora remains the primary bottleneck for English-Tatar machine translation. While the OPUS project provides various datasets, our tests reveal that datasets like WikiMatrix, GNOME, and NLLB suffer from significant noise and incorrect labeling, making them unsuitable for training robust encoder-decoder translation models, which typically require larger amounts of high-quality data. Furthermore, we demonstrate that small-scale multilingual Large Language Models (LLMs), such as Qwen3 (4B-30B), Gemma3 (4B-12B) and others, show severe "Turkish interference", frequently hallucinating Turkish vocabulary when prompted for Tatar. In this paper, we navigate this data scarcity by leveraging Llama 3.3 70B Instruct, which is the only model in our zero-shot benchmarks capable of maintaining distinct linguistic boundaries for Tatar. To address the lack of gold-standard data, we curated a synthetic dataset of 7,995 high-quality translation pairs using a frontier model as a teacher. We then performed 4-bit LoRA fine-tuning to train Llama for English-Tatar translation. Our results show a performance leap: fine-tuning on the limited Tatoeba dataset (1,193 samples) yielded a CHRF++ score of 24.38, while fine-tuning on our synthetic dataset achieved 32.02 on the LoResMT 2026 shared task test set. We release our curated dataset and fine-tuned models to support further research in low-resource Turkic machine translation.
No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data
Dmitry Karpov
We explore machine translation for five Turkic language pairs: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bashkir. Prompting DeepSeek-V3.2 with retrieved similar examples achieved chrF++ 39.47 for Chuvash. For Tatar, zero-shot or retrieval-based approaches achieved chrF++ 41.6, while for Kyrgyz the zero-shot approach reached 45.6. We release the dataset and the obtained weights.
DevLake at LoResMT 2026: The Impact of Pre-training and Model Scale on Russian-Bashkir Low-Resource Translation
Vyacheslav Tyurin
This paper describes the submission of Team DevLake for the LoResMT 2026 Shared Task on Russian-Bashkir machine translation. We conducted a comprehensive comparative study of three distinct neural architectures: NLLB-200 (1.3B), M2M-100 (418M), and MarianMT (77M). To overcome hardware constraints, we employed parameter-efficient fine-tuning techniques (QLoRA) and extensive data filtering using a domain-specific BERT-based classifier. Our experiments demonstrate that the presence of the target language (Bashkir) in the model’s pre-training data is the decisive factor for performance. Our best system, a fine-tuned NLLB-200-1.3B model augmented with exact match retrieval, achieved a CHRF++ score of 52.67. We also report on negative results with custom tokenization for smaller models, providing insights into the limitations of vocabulary adaptation without extensive pre-training.
We describe an evaluation of several open-source models under identical inference conditions without task-specific training. Despite covering a wide range of available models, including both multilingual systems and models specifically designed for Russian-Kazakh translation, the results indicate that the highest performance is achieved by the language-specific approach.
Script Correction and Synthetic Pivoting: Adapting Tencent HY-MT for Low-Resource Turkic Translation
Bolgov Maxim
This paper describes a submission to the LoResMT 2026 Shared Task for the Russian-Kazakh, Russian-Bashkir, and English-Chuvash tracks. The primary approach involves parameter-efficient fine-tuning (LoRA) of the Tencent HY-MT1.5-7B multilingual model. For the Russian-Kazakh and Russian-Bashkir pairs, LoRA adaptation was employed to correct the model’s default Arabic script output to Cyrillic. For the extremely low-resource English-Chuvash pair, two strategies were compared: mixed training on authentic English-Chuvash and Russian-Chuvash data versus training exclusively on a synthetic English-Chuvash corpus created via pivoting through Russian. Baseline systems included NLLB 1.3B (distilled) for Russian-Kazakh and Russian-Bashkir, and Gemma 2 3B for English-Chuvash. Results demonstrate that adapting a strong multilingual backbone with LoRA yields significant improvements over baselines while successfully addressing script mismatch challenges. Code for training and inference is released at: https://github.com/defdet/low-resource-langs-mt-adapt
This paper outlines our winning submission to the English-to-Tatar translation task. We evaluated three strategies: few-shot prompting with Gemini 3 Pro Preview, specialized trans-tokenized Tweeties models, and the RL-distilled TranslateGemma family. Results demonstrate that large commercial models significantly outperform smaller specialized ones in this low-resource setting. Gemini secured first place with a chrF++ score of 56.71, surpassing the open-source baseline of 25.23.
Data-Centric Approach at the LoResMT 2026 Turkic Translation Challenge: Russian-Kyrgyz
Dmitry Novokshanov
We describe our submission to the Turkic languages translation challenge at LoResMT 2026, which focuses on translation from Russian into Kyrgyz. Our approach leverages parallel data, synthetic translations, a comprehensive filtering pipeline and a four-stage curriculum learning strategy. We compare our system with contemporary baselines and present the model that achieves a chrF++ score of 49.1 and takes first place in the competition.
We describe our submission to the LoResMT 2026 shared task, which involved translating into the low-resource Turkic languages Bashkir, Chuvash, Kazakh, Kyrgyz, and Tatar from English or Russian. We submitted runs for the English-Chuvash language pair using neural machine translation (NMT). Our approach focused on systematic experimentation with diverse model architectures and an emphasis on optimizing inference-time parameters. The key findings indicate that a large-scale, specialized multilingual translation model, combined with targeted data preprocessing and careful generation tuning, yielded the best performance, achieving a chrF++ score of 29.67 on the public test set.
Ensemble Methods for Low-Resource Russian-Kyrgyz Machine Translation: When Diverse Models Beat Better Models
Adilet Metinov
We present our submission to the LoResMT 2026 Shared Task on Russian-Kyrgyz machine translation. Our approach demonstrates that ensembling diverse translation models with simple consensus-based voting can significantly outperform individual models, achieving a +1.37 CHRF++ improvement over our best single model. Notably, we find that including "weaker" models in the ensemble improves overall performance, challenging the conventional assumption that ensembles should only combine top-performing systems. Our system achieved 49.31 CHRF++ on the public leaderboard and 48.55 CHRF++ on the final private test set, placing 3rd in the Russian-Kyrgyz track using only open-weight models without any fine-tuning on parallel Kyrgyz data. We report several counter-intuitive findings: (1) simple voting outperforms quality-weighted selection, (2) more diverse models help even when individually weaker, and (3) post-processing "corrections" can hurt performance when reference translations contain similar artifacts.
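A consensus-based vote over system outputs, as mentioned in this abstract, can be sketched as below. The abstract does not state how agreement is measured, so `difflib.SequenceMatcher` is used here purely as a stand-in similarity; the function names are hypothetical. Each candidate is scored by its total similarity to the other candidates, and the one the ensemble agrees with most is returned.

```python
from difflib import SequenceMatcher
from typing import List


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for the real agreement metric."""
    return SequenceMatcher(None, a, b).ratio()


def consensus_vote(hypotheses: List[str]) -> str:
    """Return the hypothesis most similar, in total, to all other hypotheses.

    Every system gets an equal vote: no quality weighting is applied.
    Ties are resolved in favour of the earliest hypothesis.
    """
    if not hypotheses:
        raise ValueError("need at least one hypothesis")
    if len(hypotheses) == 1:
        return hypotheses[0]

    def support(i: int) -> float:
        return sum(
            similarity(hypotheses[i], other)
            for j, other in enumerate(hypotheses)
            if j != i
        )

    best = max(range(len(hypotheses)), key=support)
    return hypotheses[best]
```

Because each model contributes equally, an individually weaker system can still tip the vote toward an output that several systems roughly agree on, which is one plausible reading of finding (2).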
Proceedings of the Sixth Workshop on Language Technology for Equality, Diversity, Inclusion
Bharathi Raja Chakravarthi | Bharathi B | Paul Buitelaar | Durairaj Thenmozhi | Miguel Ángel García Cumbreras | Salud María Jiménez Zafra
Behind the Laughter: Uncovering Gender Bias in Code-Mixed Bangla Memes
Jannatul Ferdusi | Labanya Saha | Paria Chowdhury | Jawad Hossain | Noor Mairukh Khan Arnob
Bangla memes are widely used on social media to express humor and social commentary, yet computational analysis of gender bias in Bangla memes remains largely unexplored. In this work, we present a multimodal framework for detecting gender bias in Bangla memes by jointly analyzing textual and visual content. We construct a dataset of 6,846 Bangla and Banglish code-mixed memes annotated into three categories: male-biased, female-biased, and neutral. For textual representation, we use BanglishBERT, while visual features are extracted using ConvNeXt, and the two modalities are fused for final classification. Our best-performing model, ConvNeXt + BanglishBERT, achieves an accuracy of 0.67 and an F1-score of 0.63, outperforming several multimodal baselines. The results demonstrate the effectiveness of multimodal learning for understanding culturally nuanced and code-mixed meme content in low-resource languages. Code and data available at this link
Bias and fairness risks in Large Language Models (LLMs) vary substantially across deployment contexts, yet existing approaches lack systematic guidance for selecting appropriate evaluation metrics. We present a decision framework that maps LLM use cases, characterized by a model and population of prompts, to relevant bias and fairness metrics based on task type, whether prompts contain protected attribute mentions, and stakeholder priorities. Our framework addresses toxicity, stereotyping, counterfactual unfairness, and allocational harms, and introduces novel metrics based on stereotype classifiers and counterfactual adaptations of text similarity measures. We release an open-source Python library, langfair, for practical adoption. Extensive experiments on use cases across five LLMs and five prompt populations demonstrate that fairness risks cannot be reliably assessed from benchmark performance alone: results on one prompt dataset likely overstate or understate risks for another, underscoring that fairness evaluation must be grounded in the specific deployment context.
Dual-Axis Compositional Contrastive Few-Shot Learning using Prototypes Across Linguistic and Semantic Dimensions for Indic Low-Resource Multilingual NLU
Kathakali Mitra | Sakshi Singh | Sree Nithish Reddy Gunapati | Aruna Malapati | Mark G. Lee
Multilingual Natural Language Understanding (NLU) systems often struggle to adapt when new languages or new semantic labels are introduced with only a few annotated examples. This challenge is particularly pronounced for low-resource languages, where limited supervision and evolving label spaces make conventional joint-label classification approaches unstable. Most existing multilingual NLU models treat each language-semantic pair as an independent class, entangling linguistic and semantic representations and hindering few-shot adaptation. We propose Dual-Axis Compositional Few-Shot Learning, a framework that explicitly factorizes the representation space into linguistic and semantic embedding axes, enabling independent modeling of language variation and domain-intent semantics. Joint representations are constructed compositionally through multiplicative interaction of axis-specific embeddings, allowing controlled adaptation when either the language set or the semantic label space evolves. The framework integrates factorized prototype learning, axis-structured contrastive alignment, and disentanglement regularization using HSIC-based statistical independence and Jacobian-based cross-axis decorrelation. Experiments on six low-resource Indic languages spanning Indo-Aryan and Dravidian families (Hindi, Bengali, Sanskrit, Assamese, Tamil, and Telugu) demonstrate strong performance under two structured generalization regimes. The model achieves 81.12% accuracy when adapting to few-shot languages with known semantics and 63.5% accuracy when learning new semantic classes from few-shot examples, along with an accuracy of 89.56% on known language and seen semantics. These results show that axis-factorized representations enable stable compositional generalization, offering a promising direction for scalable multilingual NLU in linguistically diverse low-resource settings.
Equilibrium Dynamics and Mitigation of Gender Bias in Synthetically Generated Data
Ashish Kattamuri | Arpita Vats | Harshwardhan Fartale | Rahul Raja | Akshata Kishore Moharir | Ishita Prasad
Recursive prompting with large language models enables scalable synthetic dataset generation but introduces the risk of bias amplification. We investigate gender bias dynamics across three generations of recursive text generation using three complementary evaluation frameworks: rule-based pattern matching, embedding-based semantic similarity, and downstream task performance. Experiments with three initial bias levels (0.1, 0.3, 0.6) and four mitigation strategies reveal equilibrium dynamics rather than monotonic amplification. Low initial bias amplifies toward the model’s inherent bias level (+36%), whereas high initial bias decays toward it (-26%). Among mitigation methods, contrastive augmentation, which introduces gender-swapped variants, achieves significant downstream bias reduction (98.8% for low initial bias and 91% on average) despite producing higher embedding-based bias scores. This paradox demonstrates that semantic similarity metrics may diverge from behavioral fairness outcomes, highlighting the need for multidimensional evaluation in responsible synthetic data generation.
Evaluating Direct Preference Optimization for Personalizing German Automatic Text Simplifications for Persons with Intellectual Disabilities
Yingqiang Gao | Kaede Johnson | David Fröhlich | Luisa Carrer | Sarah Ebling
Automatic text simplification (ATS) aims to enhance language accessibility for various target groups, particularly persons with intellectual disabilities. Recent advancements in large language models (LLMs) have substantially improved the quality of machine-generated text simplifications; however, existing LLM-based ATS systems do not incorporate preference feedback during post-training, resulting in a lack of personalization tailored to the specific needs of target group persons. In this work, we propose an ATS personalization framework using direct preference optimization (DPO). Specifically, we post-trained LLM-based ATS models using human feedback collected from persons with intellectual disabilities, reflecting their preferences between paired text simplifications generated by mainstream LLMs. Our pipeline for developing personalized LLM-based ATS systems encompasses data collection, model selection, supervised fine-tuning (SFT) and DPO post-training, and result evaluation. Our findings underscore the necessity of active participation of target group persons in designing personalized inclusive AI solutions aligned with human preferences.
From Form to Meaning: Interlingua Sense-Alignment of Offensive Language with LLMs
Maria Alexandra Roussopoulou | Stella Markantonatou
This paper presents a methodology that uses LLMs to align multilingual offensive lexicons at the sense level. Lexicons of different structures and origins in Arabic, Bulgarian, Modern Greek, French, and Italian have been aligned directly without pivoting through English. The Modern Greek lexicon is LLM-generated, and the other four lexicons are WordNet-compatible. For inter-language alignment of senses, an LLM-as-a-judge rubric was used over lemma–definition–example triples. The LLM makes 2.87M pairwise comparisons and yields 31 strict global-sense categories. The paper discusses the challenges involved in sense alignment tasks. The resource is available to support downstream applications such as Machine Translation and cross-lingual hate-speech detection.
GYAAN-SAHIT: A Persona-Driven Multi-Agent Framework for Caste-Based Hate Speech Detection
Sakshi Gupta | Shunmuga Priya Muthusamy Chinnan | Saranya Rajiakodi | Ratnavel Rajalakshmi | Bharathi Raja Chakravarthi
Social media has amplified public discourse in India while perpetuating caste-based hierarchies. Despite legal protections, caste-based hate speech continues to propagate across digital platforms through culturally embedded expressions that conventional classifiers often struggle to interpret. We propose GYAAN-SAHIT, a knowledge-driven multi-agent framework that addresses this problem through structured debate-based classification. Each agent adopts a distinct ideological and socio-cultural persona, engaging in multi-turn argumentation to reason over context, subtext, and intent. A critic agent then evaluates the coherence of the debate before producing the final classification. The framework further integrates Hindi hate lexicons to ground its reasoning in linguistic and cultural specificity. Experiments show that GYAAN-SAHIT achieves improvement in performance while generating culturally grounded explanations, demonstrating the effectiveness of persona-based multi-agent reasoning for hate speech detection in low-resource and socially complex environments.
I’m Sorry, but I Can’t Help with Braille: Revealing Accessibility Failures in State-of-the-Art LLMs
Abdullah
Large Language Models (LLMs) perform strongly on many language tasks, but their capability in structurally constrained, accessibility-critical modalities such as Braille remains unclear. We evaluate state-of-the-art LLMs on bidirectional Korean–Braille translation using a human-annotated dataset. Despite expectations that multilingual, instruction-tuned models can generalize to Braille via text representations, we find consistently poor, unstable outputs and substantial disagreement with human judgments. These results point to missing Braille-aware tokenization and weak alignment between Korean and Braille patterns. In contrast, supervised fine-tuning of a small model (T5-small) on the same data yields large and stable gains over zero-shot and prompted LLM baselines across standard metrics (SacreBLEU, ChrF++, CER, BLEU, ROUGE-L, METEOR, CIDEr). Our findings reveal a systematic limitation of current LLMs and demonstrate the effectiveness of modest task-specific supervision.
Multimodal Transformer Framework for Multilingual Harmful Meme Classification
Charmathi Rajkumar | Malliga Subramanian | Bharathi Raja Chakravarthi
The rapid expansion of social media platforms has led to a significant increase in the spread of harmful content, including misogynistic, homophobic, and transphobic memes. Detecting such content is challenging because memes often combine textual and visual elements and frequently appear in multilingual and culturally diverse contexts. This study proposes a multimodal transformer-based framework for multilingual harmful meme classification that integrates textual and visual representations to improve detection performance. The proposed architecture employs XLM-RoBERTa for multilingual text encoding and the Swin Transformer for hierarchical visual feature extraction. A cross-attention fusion mechanism is introduced to enable meaningful interaction between textual and visual modalities. The fused representation is then processed through a classification layer to perform multi-class prediction. Experiments are conducted across multiple datasets covering eight languages and three harmful content categories: misogyny, homophobia/transphobia, and hate speech. The model is evaluated using the macro-F1 score and demonstrates consistent improvements over baseline multimodal systems across both high-resource and low-resource languages. The results highlight the effectiveness of transformer-based multimodal architectures in capturing implicit and contextual harmful signals present in memes. The study contributes to the development of robust multilingual systems for harmful content detection and supports efforts toward creating safer and more inclusive online environments.
While automatic text summarization has achieved remarkable success in English, extending these capabilities to low-resource languages remains a significant challenge due to the scarcity of labeled training data. We propose a translation-augmented approach to multilingual summarization: we systematically translate high-quality English summarization corpora into low-resource target languages using NLLB-200, and use the resulting parallel data to train and evaluate sequence-to-sequence models. We experiment across three typologically diverse languages (Swahili, Hausa, and Afrikaans), comparing monolingual fine-tuning (MONO), cross-lingual transfer (XLT), and joint multilingual training (TAMT) on mBART-large-50. Monolingual fine-tuning achieves the best performance for Swahili (ROUGE-L 13.9) and Afrikaans (ROUGE-L 15.7), surpassing the Lead-3 baseline in both cases, while cross-lingual transfer remains strongest for Hausa (ROUGE-L 14.5). We show that native language token availability in mBART-50 is a critical determinant of fine-tuning performance, and characterize the conditions under which the theoretically expected TAMT > MONO > XLT ordering breaks down. We release our dataset, code, and evaluation infrastructure to support future research on low-resource multilingual summarization.
Findings of Shared Task on Counter Narrative Generation on Homophobic and Transphobic Comments
Prasanna Kumar Kumaresan | Praveen Prasannan | Tanay Singh | Ruba Priyadharshini | Subalalitha Chinnaudayar Navaneethakrishnan | Saranya Rajiakodi | Paul Buitelaar | Bharathi Raja Chakravarthi
Online platforms continue to witness harmful expressions targeting LGBTQ+ individuals, particularly in the form of homophobic and transphobic comments. While detection of such content has received substantial attention, generating constructive counter-narratives remains comparatively underexplored. In this shared task, we focus on counter-narrative generation in English and Tamil. Participants were provided with social media comments labeled as homophobic or transphobic and were required to generate respectful, contextually appropriate responses that challenge prejudice and promote empathy. Systems were evaluated using both reference-based metrics (Distinct-2 and BERTScore-F1) and rubric-based human evaluation metrics measuring politeness (PRS), quality (QS), and contextual coherence (CCNC). The results demonstrate variation in system performance across languages, with English systems showing stronger lexical diversity and Tamil systems excelling in politeness and contextual coherence. This paper presents dataset statistics, evaluation methodology, system performance analysis, and key observations from the shared task.
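Distinct-2, one of the automatic metrics named above, is commonly computed as the ratio of unique bigrams to total bigrams over a set of generated responses. The following is a minimal sketch under that common definition; the whitespace tokenization and corpus-level aggregation are assumptions, and the shared task's exact implementation is not specified here.

```python
from typing import Iterable


def distinct_n(texts: Iterable[str], n: int = 2) -> float:
    """Unique n-grams divided by total n-grams across all texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()  # assumption: whitespace tokenization
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    # Texts shorter than n tokens contribute no n-grams at all.
    return len(unique) / total if total else 0.0
```

Higher values indicate less repetition across responses, which is the sense of "lexical diversity" used in the abstract.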
Insights from Multilingual Gender Inclusive Language Generation Shared Task
Bharathi Raja Chakravarthi | Shunmuga Priya Muthusamy Chinnan | Paul Buitelaar | Miguel Ángel García-Cumbreras | Salud María Jiménez-Zafra | Thomas Mandl | Sylvia Jaki | Rahul Ponnusamy | Anand Kumar Madasamy | Dhanalakshmi V | Bharathi B | Premjith B | Senthil Kumar B | Sathiyaraj Thangasamy
We investigate the role of large language models (LLMs) in promoting gender-inclusive language by evaluating their ability to rewrite biased text and generate counterfactual narratives across multiple languages. We introduce a shared task with two subtasks: gender-inclusive rewriting and counterfactual generation. The task covers five languages (English, German, Spanish, Tamil, and Kannada), reflecting diverse grammatical gender systems and sociocultural contexts. We release curated word-level and sentence-level datasets to support controlled inclusive generation. A total of 50 teams registered for the shared task, and around 8 teams submitted results. Submissions are evaluated using a hybrid framework combining rubric-based automatic scoring with expert human judgment. Finally, we provide an overview of participating systems and discuss key findings and challenges observed across languages.
Overview of the Multimodal Homophobia and Transphobia Meme Classification Shared Task
Kishore Kumar Ponnusamy | Bharathi Raja Chakravarthi | Prasanna Kumar Kumaresan | Premjith B | Thenmozhi Durairaj | Ruba Priyadharshini | Subalalitha Chinnaudayar Navaneethakrishnan
This paper presents an overview of the Shared Task on detecting homophobia and transphobia in meme datasets across three languages: Hindi, English, and Chinese. With the rapid growth of internet users worldwide, memes have become a widely used medium for expressing humor, satire, and sarcasm on social media platforms. However, their increasing popularity has also facilitated the spread of hate, misinformation, and propaganda targeting specific communities. Hateful memes often attack individuals or groups based on attributes such as physical appearance, language, ethnicity, religion, or sexual orientation. Among those affected, the LGBTQ+ community is particularly vulnerable and frequently targeted on social media platforms. To address this issue, we organized a shared task that focuses on identifying homophobic and transphobic hate in memes. The task aims to encourage the development of automated systems capable of detecting such harmful content across multiple languages. Evaluation was conducted using Macro F1-score as the primary metric. The top performing system achieved a Macro F1-score of 0.8377 for English, 0.8081 for Hindi, and 0.7535 for Chinese, demonstrating promising results for multilingual hate detection in memes.
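Macro F1, the primary metric above, averages per-class F1 scores without weighting by class frequency, so rare classes count as much as common ones. A minimal sketch, assuming single-label classification and that the label set is the union of gold and predicted labels (the organizers' scoring script is not shown in the source):

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true or predicted instances scores 0 here.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, predicting the majority class for every item in a 3-to-1 imbalanced set gives an F1 of 6/7 on the majority class and 0 on the minority class, so the macro average is only 3/7.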
CAI@LTEDI 2026: Multilingual Gender Inclusive Language Generation using Instruction-Guided mT5 Transformer Model
Aiswariya p Nair | Sree S Bhagya | Chinnu Jacob
Gender bias in multilingual language generation systems poses serious ethical and social issues, especially in languages with complex morphology. In this study, we propose a lightweight multilingual approach that employs instruction-guided fine-tuning of the mT5-small transformer model for gender-inclusive language generation. The framework accommodates five languages: English, German, Spanish, Tamil, and Kannada. The approach uses the task-prefix rewriting method to transform gender-specific sentences into their gender-neutral versions. The training data from different languages is combined into a single multilingual dataset for sequence-to-sequence fine-tuning. Beam search decoding with repetition constraints is used during inference to improve the quality of the output. The system’s performance is measured using GIFI, semantic similarity, and an overall combined score across all languages. Experimental results show that the system can eliminate gender-biased language while partially retaining semantic meaning across languages.
CuriousVectors@LT-EDI 2026: Detection of Homophobic and Transphobic Memes on Social Media Using a Hybrid Multimodal Approach
Saloni Kushwaha | Jishnu Bandyopadhyay | Deepawali Sharma | Aakash Singh
The rapid growth of social media has also led to a rise in abusive and harmful content, which negatively affects the online environment for users. The frequent use of offensive language and hate speech contributes to making these platforms increasingly hostile. In particular, homophobic and transphobic remarks target members of the LGBT+ community. Detecting such comments is therefore essential so that they can be flagged promptly and appropriate warnings can be given to users involved in such behaviour. The problem becomes more serious when such content appears in other forms of communication used by younger generations, such as memes. This work tries to address this issue. We propose a method to detect such content using the meme dataset from the LT-EDI 2026 challenge; it secured 8th rank on the English dataset and 6th rank on the Chinese dataset in the shared task. Our approach uses a multimodal technique that processes both image and text information. The dataset has limited data, which creates a challenge. To handle this, we pre–fine-tune the models on a similar dataset called PrideMM. The proposed multimodal approach achieved Macro F1-scores of 0.24 and 0.57 for English and Chinese memes respectively.
DLRG@LT-EDI 2026: Automating Counter-Narratives for Homophobic and Transphobic Comments
Ramesh Kannan R | Ratnavel Rajalakshmi
Online hate speech is spreading rapidly, creating a significant challenge, particularly in low-resource languages such as Tamil. The lack of well-developed automated content moderation systems makes it difficult to control harmful content effectively. In this study, we propose a computational framework for generating Counter Narratives (CNs) using classical NLP techniques. We use TF-IDF features with n-grams to classify comments as Homophobic or Transphobic. Span detection is performed with TF-IDF n-gram features and machine learning models. Counter narratives are then retrieved by computing cosine similarity, ensuring semantic alignment and contextual relevance. Evaluation on the expanded human-curated dataset demonstrates that our approach produces contextually appropriate and semantically coherent counter narratives. Notably, the system submitted to Task 2 achieved an overall average score of 80.40% for Tamil and 77.29% for English, securing first and fourth rank respectively. GitHub: https://github.com/kannanrrk/Span-Counter-Feature-Based
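The retrieval step described above (TF-IDF n-gram features plus cosine similarity) can be sketched in plain Python. This is an illustration, not the authors' code: the pool of (comment, counter narrative) pairs, the unigram-plus-bigram features, the smoothed IDF, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def ngrams(text, max_n=2):
    """Word unigrams and bigrams from a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def build_idf(docs):
    """Smoothed inverse document frequency over the pool's n-grams."""
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc)))
    return {t: math.log((1 + len(docs)) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(text, idf):
    """Sparse TF-IDF vector; n-grams unseen in the pool are ignored."""
    tf = Counter(g for g in ngrams(text) if g in idf)
    return {t: c * idf[t] for t, c in tf.items()}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, pool):
    """Return the counter narrative whose paired comment best matches the query."""
    idf = build_idf([comment for comment, _ in pool])
    q = tfidf(query, idf)
    best = max(pool, key=lambda pair: cosine(q, tfidf(pair[0], idf)))
    return best[1]
```

Because matching is purely lexical, a query that shares no n-grams with the pool scores zero against every entry, so a real system would need a fallback response for that case.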
DuoNova@LTEDI 2026: Multilingual Span Detection and Counter-Narrative Generation on Homophobic and Transphobic Comments
Manasa S | Arohi Rawat | Anbukkarasi Sampath
The detection and response to homophobic and transphobic comments are important challenges in Natural Language Processing. In this paper, we focus on the detection of span for homophobic and transphobic comments (Task 1) and generation of counter narratives for abusive comments (Task 2) for the LT-EDI @ ACL 2026 shared task. Harmful comments made online against the LGBTQ+ community have created a hostile environment for users. In this paper, we have used the transformer model for the detection of span for homophobic and transphobic comments and generation of counter narratives. In this task, the detection of the span of comments containing homophobic and transphobic words and the generation of counter narratives for abusive comments have been done using the transformer model. The results show the efficiency of the transformer model in the detection of the span of comments and generation of counter narratives. This paper emphasizes the efficiency of the transformer model in creating a safe environment for users.
Igniters@LTEDI 2026: Multilingual Gender-Inclusive Language Generation with mT5 and Counter-Narrative Generation Using Llama-3
Rajendran S | N.Ramkumar | Malarselvi
The deployment of Large Language Models (LLMs) has intensified concerns regarding the propagation of societal stereotypes encoded with web-scale training corpora. This paper presents a dual-paradigm framework specially designed to address multilingual gender-inclusivity and counterfactual generation. For multilingual gender-neutral text transformation, a fine-tuned mT5 encoder–decoder model performs controlled sentence rewriting with minimal edits while preserving semantic fidelity and grammatical fluency. For counter-narrative generation, the Llama-3 8B decoder-only model is employed to generate empathetic and persuasive responses through structured prompt-based generation. The framework is evaluated using datasets from the LT-EDI ACL 2026 shared task across multiple languages, including English, Tamil, Kannada, German, and Spanish. Experimental results demonstrate strong effectiveness in identifying and neutralizing gender markers, particularly in morphologically rich languages, while the counter-narrative component achieves high performance in politeness, coherence, and relevance. Overall, the proposed approach contributes toward the development of responsible and inclusive multilingual NLP systems.
IHLC@LT-EDI 2026: Steering Toward Inclusivity - A Representation Engineering for Gender-Neutral Rewriting
Akhil Rajeev P | Manoj Balaji Jagadeeshan
We describe the IHLC team’s submission to the LT-EDI ACL 2026 Shared Task on Gender-Inclusive Language Generation and Counterfactual/Counter Narrative Generation. Our English-only system applies an activation steering approach combined with carefully engineered prompt templates to produce gender-neutral rewrites and empathetic counter narratives. We summarize system design, experimental setup, evaluation protocol used by the shared task, and report results for both subtasks (Task A: Gender Inclusive Language Generation – Average = 80.00%, Rank 3; Task B: Counter Narrative Generation – Average = 78.12%, Rank 6). We also analyze strengths and failure modes observed in automatic and human-checked evaluations and highlight directions for improvement.
IReL_IIT(BHU)@LTEDI 2026: Fine-Tuning Instruction-Tuned Transformers for Gender-Inclusive Rewriting and Counterfactual Bias Mitigation
Anurag Balaji | Arjun Mukherjee | Krishna Tewari | Sukomal Pal
This paper presents our submissions to the LT-EDI@ACL 2026 Shared Task on Gender Inclusive Language Generation. The task focuses on controlled text rewriting that reduces gender bias while keeping the original meaning and fluency intact. We participated in both subtasks and treated them independently, training separate instances of the instruction-tuned encoder–decoder model on the respective training datasets. Scores are calculated based on averages across different rubrics, including Gender Assumption (GA), Gender Neutrality (GN), and Quality Relevance (QR) for Task A, and Politeness and Respectful (PR), Contextual Counter-Narrative Coherence (CCNC), and Quality and Relevance (QR) for Task B. For Subtask A (Gender-Inclusive Language Generation) on the English dataset, we achieved an average score of 43.7917. For Subtask B (Counterfactual Generation), we achieved an average score of 82.6241. Overall, the experiments indicate that full fine-tuning of instruction-tuned transformers provides an effective way to produce gender-neutral sentences and counterfactual sentences for biased ones when each subtask is optimized on its own data.
JusticeBots@LT-EDI 2026: Prompt-Based Counter-Narrative Generation for Homophobia and Transphobia Comments
TT Pranesh | K.K.Thamizhmathi | S Vigneshwaran | Bharathi B
Online platforms increasingly host hate speech targeting marginalized communities, including homophobic and transphobic comments directed at LGBTQ+ individuals. Counter-narratives provide a constructive way to respond to harmful speech by promoting empathy, factual clarification, and respectful dialogue. In this work, we participate in the Shared Task on Counter-Narrative Generation on Homophobic and Transphobic Comments at LT-EDI @ ACL 2026. We adopt a zero-shot prompting approach using large language models accessed through publicly available AI tools, including GPT-4o, Gemini 1.5 Pro, and Llama-3 Sonar Large via Perplexity AI. Instead of training a task-specific model, we design a structured prompt that guides the models to generate respectful, concise, and contextually appropriate counter-narratives. Experiments were conducted on English and Tamil comments provided by the organizers. Results demonstrate that prompt-based generation can produce meaningful multilingual counter-narratives without additional training. Our approach highlights the potential of large language models as lightweight tools for counter-speech generation in multilingual online environments.
JustGen@LT-EDI 2026: Controlled Gender Inclusive and Bias-Aware Language Generation using LLMs
Nilendu Adhikary | Supriya Chanda | Sukomal Pal
Over the past decade, the rapid advancement of LLMs has significantly improved natural language generation. However, these models often inherit and amplify gender biases present in large-scale training data, leading to stereotypical associations, androcentric language, and misgendering. Such biases can negatively impact applications in education, healthcare, legal systems, and automated content generation. In this paper, we address this issue as defined in the LT-EDI shared task on Gender-Inclusive Language Generation. The task focuses on rewriting gender-biased sentences into inclusive, gender-neutral alternatives while preserving meaning. We propose a retrieval-augmented framework combining lexical replacement, semantic retrieval, and controlled instruction-tuned generation. An edit-distance constraint and self-evaluation step ensure minimal, coherent, and bias-free outputs. We also present zero-shot adaptation for low-resource languages. The implementation code is available at https://github.com/SupriyaChanda/gilg-ltedi-acl2026.git.
MemeScouts@LT-EDI 2026: Asking the Right Questions - Prompted Weak Supervision for Meme Hate Speech Detection
Ivo Bueno | Lea Hirlimann | Enkelejda Kasneci
Detecting hate speech in memes is challenging due to their multimodal nature and subtle, culturally grounded cues such as sarcasm and context. While recent vision-language models (VLMs) enable joint reasoning over text and images, end-to-end prompting can be brittle, as a single prediction must resolve target, stance, implicitness, and irony. These challenges are amplified in multilingual settings. We propose a prompted weak supervision (PWS) approach that decomposes meme understanding into targeted, question-based labeling functions with constrained answer options for homophobia and transphobia detection in the LT-EDI 2026 shared task. Using a quantized Qwen3-VLM to extract features by answering targeted questions, our method outperforms direct VLM classification, with substantial gains for Chinese and Hindi, ranking 1st in English, 2nd in Chinese, and 3rd in Hindi. Iterative refinement via error-driven LF expansion and feature pruning reduces redundancy and improves generalization. Our results highlight the effectiveness of prompted weak supervision for multilingual multimodal hate speech detection.
NEUNI@LT-EDI 2026: Counter Narrative Generation on Homophobic and Transphobic Comments
Preethi Gajawada | Bhanu Harsha Yanamadala | Akankshya Kar | Sahil Wadhwa | Divya Chaudhary
Counter Narrative (CN) generation via Large Language Models (LLMs) offers a scalable approach to combating hate speech by producing targeted responses that challenge harmful content. However, existing methods typically require costly post-training or fine-tuning to improve narrative diversity and quality. We introduce a fine-tuning-free prompt optimization technique that enhances Counter Narrative effectiveness without additional model training, making it both resource-efficient and readily deployable. We conduct extensive evaluation on hate speech datasets spanning English and Tamil, employing both reference-based metrics and rubric-based LLM-as-a-judge scoring to capture multiple dimensions of narrative quality. Experiments across multiple LLMs demonstrate that our approach consistently outperforms vanilla prompting baselines, exhibits strong transferability across models, and adapts seamlessly to new evaluation metrics—requiring no architectural or procedural changes. Our findings suggest that carefully optimized prompting strategies can match or exceed the performance of more resource-intensive approaches, offering a practical path toward scalable hate speech intervention.
RspectNLP@LT-EDI 2026: Rubric-Driven Prompting for Safe Multilingual Counter Narrative Generation
S.b.priya | Bharathi B
The problem of harmful online discourse against the LGBTQ+ community is still a concern on social media platforms. Although hate speech detection is a well-explored area, the task of constructive counter-narrative generation is still an emerging field of research, especially in multilingual and low-resource settings. Counter-narratives are designed to counter harmful discourse with respectful and empathetic responses, as opposed to mere content deletion. In this paper, we propose a zero-shot multilingual system for counter-narrative generation in English and Tamil. The proposed system employs the pretrained google/flan-t5-base transformer model guided by rubric-aligned prompts to encourage politeness, contextual relevance, and non-toxic response generation. The system operates in a zero-shot setting without task-specific fine-tuning and uses beam search decoding for controlled response generation. On the English test data, the system achieved an overall score of 70.33 per cent with a contextual coherence score of 81.82 per cent. On the Tamil test data, the system achieved an overall score of 33.57 per cent with significantly lower scores on coherence and quality. These findings indicate that structured prompting can facilitate safe and coherent generation in English, but also underscore the challenges of zero-shot multilingual models in low-resource language scenarios.
SAJI_English@LT-EDI 2026: Detection of Homophobia and Transphobia in Internet Memes Using Zero-Shot Learning
Jishnu Bandyopadhyay | Saloni Kushwaha | Deepawali Sharma | Aakash Singh
Social media is now an important platform for communication and interaction. At the same time, the amount of abusive and harmful content online has also increased. Offensive language and hate speech are making these platforms less safe and less welcoming for users. Much of this content includes homophobic and transphobic remarks aimed at the LGBT+ community. Such behaviour damages healthy discussions and can negatively affect users. For this reason, it is important to detect such content early so it can be flagged and removed to maintain a healthy online environment. The issue becomes more difficult when harmful messages appear in popular formats like memes. Memes are widely used by younger users to communicate online. Because they combine images and text, detecting offensive meaning becomes challenging. In this work, we attempt to address this problem. We develop a method to identify such content using the meme dataset released for the LT-EDI 2026 challenge; our system secured rank 5 in the shared task. We propose a zero-shot learning based method employing two LLMs (Qwen2.5-VL-3B-Instruct and Meta-Llama-3-8B-Instruct) to generate descriptions and classify such memes. We achieved a macro F1-score of 0.55 on the English meme dataset.
Susmitha@LT-EDI 2026: Detecting LGBTQ+ Phobia in Multilingual Memes via Joint Representation
Susmitha Jaishri | Kogilavani Shanmugavadivel | Malliga Subramanian | Mouleeshuwarapprabu R
The automated detection of LGBTQ+ phobia in social media memes is essential for fostering inclusive digital environments, yet it remains challenging due to the complex interplay of visual metaphors and multilingual text. We participated in the "Homophobia and Transphobia Meme Classification" shared task at LT-EDI 2026, evaluating a multimodal architecture across English, Hindi, and Chinese tracks. Our system employs a late-fusion strategy: XLM-RoBERTa encodes OCR-extracted text into a representation $h_t \in \mathbb{R}^{768}$, while CLIP extracts visual features $h_v \in \mathbb{R}^{512}$. These are concatenated into a joint vector $z = [h_t \oplus h_v] \in \mathbb{R}^{1280}$ and processed via a non-linear multilayer perceptron to capture cross-modal interactions. The system demonstrated robust performance in high-resource contexts, securing 3rd rank in the Chinese track (Macro F1: 0.7371) and 4th rank in the English track (Macro F1: 0.6121). In contrast, the Hindi track results (Macro F1: 0.1616) revealed significant challenges related to script complexity and class imbalance. These findings underscore the effectiveness of global transformer-based models for multimodal reasoning while highlighting the ongoing need for specialized linguistic refinement in low-resource and diverse script environments.
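The late-fusion step above concatenates a 768-dimensional text vector with a 512-dimensional visual vector into a 1280-dimensional joint vector and passes it through a multilayer perceptron. The sketch below shows only that shape arithmetic in plain Python. The random vectors stand in for real encoder outputs, and the hidden width, class count, and initialization are hypothetical values chosen for illustration, not details from the paper.

```python
import random

TEXT_DIM = 768    # text representation size stated in the abstract
IMAGE_DIM = 512   # visual representation size stated in the abstract
HIDDEN_DIM = 64   # hypothetical hidden width
NUM_CLASSES = 3   # hypothetical number of output classes


def init_layer(rng, in_dim, out_dim):
    """Return (weights, bias) for a dense layer with small random weights."""
    scale = in_dim ** -0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x, layer):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def fuse_and_classify(h_t, h_v, hidden_layer, output_layer):
    """Concatenate both modality vectors, then apply a one-hidden-layer MLP."""
    z = h_t + h_v  # list concatenation plays the role of the joint vector
    hidden = [max(0.0, v) for v in linear(z, hidden_layer)]  # ReLU
    return linear(hidden, output_layer)  # unnormalized class scores


rng = random.Random(0)
h_t = [rng.gauss(0, 1) for _ in range(TEXT_DIM)]   # stand-in for text features
h_v = [rng.gauss(0, 1) for _ in range(IMAGE_DIM)]  # stand-in for image features
hidden_layer = init_layer(rng, TEXT_DIM + IMAGE_DIM, HIDDEN_DIM)
output_layer = init_layer(rng, HIDDEN_DIM, NUM_CLASSES)
logits = fuse_and_classify(h_t, h_v, hidden_layer, output_layer)
```

Because the two modalities only meet after each encoder has produced a fixed vector, any cross-modal interaction has to be learned by the MLP itself, which is the trade-off of late fusion compared with attention-based fusion.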
SigJBS@LT-EDI 2026: Multimodal Homophobia and Transphobia Meme Classification
Gaurangi Sinha | Rajarajeswari Palacharla | Manoj Balaji Jagadeeshan
This paper presents our system for the LT-EDI@ACL 2026 workshop on meme classification of homophobia and transphobia in English, Hindi, and Chinese. Detecting harmful content in memes is challenging because meaning often emerges from the interaction between visual elements and short textual cues, particularly in multilingual settings. To address this, we build a multimodal pipeline using CLIP ViT-L/14 visual embeddings, EasyOCR text extraction, TF–IDF lexical features, and a multinomial logistic regression classifier. We further incorporate two optional expert modules, a LoRA-adapted Qwen2-VL model and a CLIP zero-shot classifier, and combine predictions using weighted majority voting. The system is intentionally lightweight and reproducible, demonstrating that strong pretrained transfer features paired with explicit OCR can provide robust multilingual meme moderation without extensive fine-tuning. On the official leaderboard, our submission ranks 1st in Hindi, 3rd in English, and 5th in Chinese.
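Weighted majority voting, used above to combine the base classifier with the optional expert modules, can be sketched as follows. The model names, label strings, and weights are hypothetical; the abstract does not state how the weights were chosen.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight among the voting models.

    predictions maps model name -> predicted label. Optional models that did
    not run are simply absent. Models without a listed weight default to 1.0.
    """
    totals = defaultdict(float)
    for model, label in predictions.items():
        totals[label] += weights.get(model, 1.0)
    return max(totals, key=totals.get)


# Hypothetical example: the base classifier outweighs two agreeing experts.
example = weighted_vote(
    {"logreg": "hate", "qwen2vl_lora": "none", "clip_zero_shot": "none"},
    {"logreg": 2.5, "qwen2vl_lora": 1.0, "clip_zero_shot": 1.0},
)
```

With equal weights the same three votes would resolve to "none", so the weights determine how much the experts can override the base classifier.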
SigJBS@LT-EDI 2026: QLoRA-Tuned Homophobic and Transphobic Counter Narrative Generation
Gaurangi Sinha | Rajarajeswari Palacharla | Manoj Balaji Jagadeeshan
We present our approach to LT-EDI@ACL 2026 on counter-narrative generation for homophobic and transphobic comments. Generating high-quality counter-narratives in multilingual and low-resource settings remains challenging, particularly when data imbalance and script variation affect model performance. To address these issues, we explore multiple modeling strategies built around Gemma 3 12B with QLoRA fine-tuning, including data rebalancing and alternative input strategies for Tamil. Our findings show that task-specific fine-tuning combined with native-script Tamil produces more stable and higher-quality outputs than large few-shot prompts or transliteration-based inputs. On the official leaderboard, our system ranks second in English with an overall score of 86.35% and sixth in Tamil with 63.77%, highlighting both the effectiveness of targeted fine-tuning and the challenges of low-resource counter-narrative generation.
TeamV at LT-EDI 2026: Multilingual Hate Speech Span Detection and Counter-Narrative Generation via Few-Shot In-Context Learning
Vinay Babu Ulli | Jyoti Kumari
This paper describes the system developed by TeamV for the LT-EDI 2026 Shared Task on Counter-Narrative Generation on Homophobic and Transphobic Comments. The shared task comprises two subtasks: (1) Hate Speech Span Detection in English, Tamil, and Hindi, and (2) Counter-Narrative Generation in English and Tamil. Our system leverages the reasoning and multilingual capabilities of a large proprietary language model (Qwen3-Max) through rigorous few-shot in-context learning (ICL) and robust post-processing mechanisms. Our submitted system demonstrated state-of-the-art performance on the official CodaBench leaderboard. In Task 1, our approach achieved 1st place across all three languages, securing macro F1 scores of 0.5338 in English, 0.5272 in Tamil, and 0.5478 in Hindi. For Task 2, our generated counter-narratives ranked 1st globally in English with an overall average score of 87.47% and 5th in Tamil. We present our prompting methodology, robust span-matching pipeline, detailed official results, and an analysis of the model’s performance across diverse languages.
Proceedings of the 2nd Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2026)
Kenton Murray | Reno Kriz
When Image and Text Disagree: Cross-Modal Evidence Conflict in Multimodal Retrieval-Augmented Generation
Jasper Kyle Catapang
This paper introduces the Cross-Modal Conflict Benchmark (CMC-Bench) to evaluate how multimodal retrieval-augmented generation (RAG) systems handle contradicting evidence between retrieved text and images. Using 3,768 instances from ChartQA and MMMU evaluation splits, the study benchmarks four open vision-language models (VLMs) across four conflict types (factual, temporal, entity, and granularity) and four evidence conditions: aligned (both modalities support the gold answer), image-correct (image supports the gold and text contradicts it), text-correct (text supports the gold and the image is wrong or swapped), and both-wrong (neither modality supports the gold). Key findings reveal that cross-modal disagreement severely degrades performance, with accuracy dropping by 0.17 to 0.46 relative to aligned evidence. Results show models often exhibit a modality lean rather than reliable arbitration, with text-leaning systems particularly vulnerable when only the image is correct. Furthermore, merging abstention and fabrication into a single hallucination score obscures critical behavioral differences; for instance, Qwen3-VL-4B abstains on 31.7% of conflicts, while Gemma-3n-E2B fabricates unsupported answers in 51.9% of conflicts. Multimodal RAG evaluation should explicitly distinguish abstention from fabrication to assess reliability accurately.
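The point about merged hallucination scores can be illustrated with a small tally. In the sketch below, each response is assumed to have already been labeled as "correct", "abstain", or "fabricate"; that labeling scheme, the function name, and the toy counts are hypothetical and are not taken from CMC-Bench.

```python
from collections import Counter

OUTCOMES = ("correct", "abstain", "fabricate")


def outcome_rates(outcomes):
    """Fraction of responses per outcome, plus the merged hallucination rate."""
    counts = Counter(outcomes)
    n = len(outcomes)
    rates = {name: counts[name] / n for name in OUTCOMES}
    # The merged score hides which of the two behaviours dominates.
    rates["merged"] = rates["abstain"] + rates["fabricate"]
    return rates


# Two toy models with the same merged score but opposite behaviour.
cautious = outcome_rates(["correct"] * 5 + ["abstain"] * 4 + ["fabricate"] * 1)
confident = outcome_rates(["correct"] * 5 + ["abstain"] * 1 + ["fabricate"] * 4)
```

Both toy models share a merged rate of 0.5, yet one mostly declines to answer while the other mostly invents answers, which is the distinction the abstract argues evaluations should report separately.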
MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation
Zehang Wei | JiaXin Dai | Jiamin Yan | Xiang Xiang
While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation pipelines often face an intervention paradox: static rules tend to unnecessarily disrupt accurate generations, whereas leaving the multi-modal reasoning completely unguided allows existing mismatches to cascade into severe logical fabrications. To quantify and mitigate these hallucinations, we propose a Multi-Agent system, MODE-RAG, driven by Variational Free Energy (VFE) and internal attention states to dynamically gate interventions. High-risk queries are routed to five stage-specific agents, integrating Monte Carlo Tree Search (MCTS) for rigorous causal derivation and logit perturbations to penalize sycophancy. Dedicated Correction and Overseer agents ensure formatting stability and perform post-hoc factual verification. To objectively evaluate our approach, we introduce ModeVent, a challenging subset derived from the MultiVent dataset. Extensive experiments indicate that our system effectively reduces hallucination rates and logical fabrication, significantly improving the robustness of M-RAG systems.
Non-Event Oriented Video Assessments in Long-Form Robot Videos
Stephanie M. Lukin | Kimberly A. Pollard | Claire Bonial | Cory J. Hayes | Ron Artstein | Kallirroi Georgila | David Traum
We introduce Video-SCOUT, a novel dataset of sixty 20-minute robot-recorded videos from human-robot collaborative exploration exercises, together with a new video analysis method for these types of exploration videos. Unlike video from stationary cameras where detection of motion can help identify events of interest, the camera in an exploration task is constantly in motion while the environment is stationary. Our analysis method, Non-Event Oriented Video Assessments (NOVA), uses vision-language models to select frames relevant for supporting a particular assessment within continuous long-form videos. Results of testing with two different video-language models reveal a trade-off in precision and recall, and show gains in overall recall when combined with a human’s knowledge, suggesting that NOVA may improve human analysis of robot navigation. We outline future work to mitigate miscommunication in human-robot interaction by leveraging dialogue with NOVA in support of better collaboration.
Less is More: Controlled Visual Evidence Routing and Redundancy Compression for Key Information Extraction
Yang Li | Yajiao Wang | Wenhao Hu | Mengting Zhang | Zhixiong Zhang
Key Information Extraction (KIE) in visually-rich documents is inherently token-centric, yet prevailing multimodal encoders often fuse dense visual patches with text tokens indiscriminately, which can introduce low-density visual noise, intensify modality competition, and cause robustness collapse under distribution shifts. We propose OTCR, a lightweight and architecture-agnostic framework that turns vision from a competitor into a selective supporter for extraction. OTCR learns sparse, interpretable cross-modal coupling via optimal transport to route local visual evidence to the most relevant text tokens, applies token-level gating to control injection strength, and further suppresses spurious correlations through a variational information bottleneck. Experiments on FUNSD, CORD, and SROIE show consistent gains when OTCR is plugged into LayoutLMv3 and GeoLayoutLM, and ablations verify the complementary contributions of coupling, gating, and bottlenecking. Under distribution shifts from Do-GOOD and EC-FUNSD, OTCR markedly mitigates performance degradation, indicating that controlled visual evidence can effectively compensate when text/layout shortcuts become unreliable.
Recent advances in multimodal retrieval have improved the ability to retrieve information from visually rich documents such as PDFs and reports. However, existing benchmarks remain largely centered on English and provide limited coverage of Korean visual documents with complex structures. Furthermore, most existing Korean resources primarily evaluate single-page retrieval, failing to capture realistic scenarios that require evidence aggregation across multiple pages. To address these gaps, we introduce KoViDoRe, a benchmark for Korean visual document retrieval. The dataset is constructed from publicly available Korean documents with diverse layouts, including tables, figures, and multi-column structures. We develop a multi-stage data curation pipeline consisting of structured document parsing, synthetic query generation using both summary-based and context-based strategies, and relevance mapping with human verification. Using KoViDoRe, we evaluate a wide range of multimodal retrieval models and observe that current models struggle to effectively handle Korean visual document retrieval, particularly in settings involving structured content and diverse query types. Motivated by this finding, we further curate a large-scale training dataset, Ko-VDR Train Public, to support the development of retrieval models tailored to Korean visual documents. Together, KoViDoRe and Ko-VDR Train Public provide a unified benchmark and training resource for Korean visual document retrieval.
Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation
JiaXin Dai | Zehang Wei | Jiamin Yan | Xiang Xiang
This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via MultimodAl Retrieval (MAGMaR). Addressing the critical challenges of cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding, we propose a fully training-free, two-stage cascaded Video RAG pipeline. Our architecture strategically decouples semantic retrieval from cognitive logical reasoning through a modality-aware division of labor. In the first stage, a high-recall semantic pre-fetching module employs dense retrieval using only high-fidelity visual summaries and global text descriptions, explicitly isolating noisy modalities (e.g., OCR and ASR) to maintain a pristine vector space. In the second stage, an Adaptive, Iterative, and Reasoning-based (A.I.R.) filtering agent, powered by a commercial Large Language Model (LLM), performs fine-grained cognitive reranking. The agent re-incorporates full multimodal contexts to enforce strict logical alignment with user personas, effectively pruning semantically similar but logically irrelevant candidates. Finally, a Prompt Sculpting mechanism constrains the generator to synthesize the distilled subset into strictly formatted JSON responses with exact chunk-level citations. Evaluated on the Full RAG track, our resource-aware approach demonstrates exceptional precision in both information retrieval and persona-conditioned generation.
MARQUIS: A Three-Stage Pipeline for Video Retrieval-Augmented Generation
Debashish Chakraborty | Dengjia Zhang | Jialiang Jin | Katherine M. Guerrerio | Hanting Liu | Hanxiang Qin | Tyler Skow | Alexander Martin | Reno Kriz | Benjamin Van Durme
Retrieval-augmented generation from videos requires systems to retrieve relevant audiovisual evidence from large corpora and synthesize it into coherent, attributed text. Current approaches struggle at both ends: retrieval methods fail on complex, multi-faceted queries that cannot be captured by a single embedding, while generation methods lack the high-level reasoning needed to synthesize across multiple videos and face memory constraints over long, multi-video contexts. We present MARQUIS: a three-stage pipeline that addresses these limitations through (1) query expansion, fusion, and reranking, (2) calibrated structured evidence extraction, and (3) article generation from extracted evidence, optionally controlled by an RLM. On the MAGMaR2026 shared task, we improve retrieval performance from 0.195 to 0.759 (nDCG@10). For article generation, ITER-QA-BASE improves average human score from 3.09 to 3.83 over the CAG baseline, while MARQUIS-RLM achieves a human score of 3.30 and the strongest citation recall among non-QA systems.
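The retrieval figures above are reported as nDCG@10. As a reminder of what that metric measures, here is a minimal sketch assuming linear gain and a log2 rank discount; the shared task's official scorer may use a different gain function or tie handling, and the function names `dcg_at_k` and `ndcg_at_k` are illustrative only.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k relevance grades (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=10, all_relevances=None):
    """nDCG@k for one query.

    ranked_relevances: relevance grades in the order the system returned them.
    all_relevances: optional grades of every judged item for the query; when
    omitted, the ideal ranking is built from the returned list only (assumption).
    """
    pool = all_relevances if all_relevances is not None else ranked_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant items judged for this query
    return dcg_at_k(ranked_relevances, k) / ideal


if __name__ == "__main__":
    # Toy example: relevant items at ranks 1 and 3.
    print(round(ndcg_at_k([1, 0, 1]), 4))
```

A system-level score is then usually the mean of the per-query values.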
TRACE: Evidence Grounding-Guided Multi-Video Event Understanding and Claim Generation
Pengyu Yan | Akhil V S S Gorugantu | Mahesh Bhosale | Abdul Wasi | Vishvesh Trivedi | David Doermann
Multi-video event understanding demands models that can locate and attribute query-relevant evidence scattered across long, heterogeneous video corpora. Existing large vision–language models (LVLMs) often underperform in this regime because they quickly exhaust their context budget and struggle to precisely localize evidentially important segments, frequently missing dense informational cues such as broadcast graphics, subtitles, and scoreboards. We introduce TRACE, an evidence grounding-guided framework that follows a ground-before-reasoning strategy for multi-video event reasoning. Our approach first builds a structured, text-searchable timeline for each video using OCR and object detection. A text-only LLM then conducts query-aware evidence localization, selecting relevant moments prior to any downstream visual reasoning. The retrieved frames and their grounding summaries are subsequently used to steer LVLM-based claim generation and cross-video citation consolidation. Experiments on MAGMaR 2026 and WikiVideo demonstrate that structured grounding markedly boosts factual completeness and attribution fidelity. On the MAGMaR validation split, TRACE raises macro-average MiRAGE F1 from 0.705 to 0.811 compared to an unguided Qwen3-VL-30B baseline, with especially strong improvements in citation recall (from 0.440 to 0.628). The method also attains state-of-the-art results on the official MAGMaR 2026 leaderboard.
CRAFT: Critic-Refined Adaptive Key-Frame Targeting for Multimodal Video Question Answering
Mahesh Bhosale | Abdul Wasi | Vishvesh Trivedi | Pengyu Yan | Akhil V S S Gorugantu | David Doermann
Grounded multi-video question answering over real-world news events requires systems to surface query-relevant evidence across heterogeneous video archives while attributing every claim to its supporting source. We introduce CRAFT (Critic-Refined Adaptive Key-Frame Targeting), a query-conditioned pipeline that combines dynamic keyframe selection, per-video ASR with multilingual fallback, and a hybrid critic loop to iteratively verify and repair claims before consolidation. The pipeline integrates UNLI temporal entailment, DeBERTa-v3 cross-claim screening, and a Llama-3.2-3B adjudicator, with a final citation-merging stage that emits each fact once with all supporting source identifiers. On MAGMaR 2026, CRAFT achieves the best overall average (0.739), reference recall (0.810), and citation F1 (0.635). We further evaluate on a MAGMaR-style conversion of WikiVideo with 52 non-overlapping event queries, where CRAFT also performs strongly (0.823 Avg), showing that its claim-centric evidence aggregation generalizes beyond MAGMaR. Ablations show that atomic claims, ASR, and the critic loop drive the main gains over the vanilla query-conditioned baseline. Code and implementation details are publicly available at https://github.com/bhosalems/CRAFT.
Findings of the MAGMaR 2026 Shared Task
Alexander Martin | Dengjia Zhang | Joel Brogan | Francis Ferraro | Jeremy Gwinnup | Reno Kriz | Teng Long | Kenton Murray | Andrew Yates | Xiang Xiang
This overview paper presents the results of the shared task for the second workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR). In this shared task participants submitted systems focused on either (i) video retrieval or (ii) grounded generation of articles given retrieved videos. Teams could submit to either task. For the retrieval task, we had 2 participating teams that submitted a total of 17 systems – all of which beat a baseline derived from the winner of last year's shared task. On the generation side, we had 4 teams submit 16 systems. All teams had at least one generated report that was labeled the best by a human annotator.
Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026)
Kaiyu Huang | Fengran Mo | Pinzhen Chen | Meng Jiang
Large Language Models are increasingly used as safety infrastructure for detecting harmful online content and moderating social media across multiple languages. Yet their effectiveness remains uneven across linguistic communities. This unevenness reflects not only gaps in training data availability but also structural problems in annotation design. We argue that a central source of multilingual safety failure lies in the annotation gap underlying existing hate speech datasets. Most annotation guidelines and safety benchmarks are developed for English and standard language varieties, overlooking dialectal variation and culturally embedded forms of hostility. Using Arabic dialectal discourse as a case study, we show how harmful speech expressed through dialects, sarcasm, code-switching, and culturally specific expressions often remains undetected by current annotation schemes. We introduce the concept of the Multilingual Safety Annotation Gap (MSAG), identifying four sources of bias: language coverage gaps, dialect representation gaps, cultural semantic gaps, and annotation guideline gaps. We discuss implications for LLM safety alignment and outline directions for culturally grounded multilingual annotation. This paper is primarily a conceptual and methodological position paper; rather than introducing a new benchmark or empirical evaluation, we aim to formalize the MSAG as a framework for analyzing systematic weaknesses in multilingual safety annotation pipelines.
Evidence-Augmented Generation Reasoning for Extremely Low-Resource Language Decipherment
Xiaoyu Zhu | Long Yuan | Rui Qi | Jinan Xu
Inspired by linguistic Olympiads, extremely low-resource language reasoning presents a unique challenge: models must solve problems without prior knowledge. This task mirrors the Rosetta Stone decipherment process, where the goal is to induce and apply linguistic rules from minimal context. Existing methods mainly rely on naive in-context learning that fails to handle the complexity and diversity of language rules. To mitigate this issue, we propose a framework that combines dynamic knowledge construction with task-aware evidence augmentation. First, we use large language models (LLMs) to generate a diverse set of task-specific examples that instantiate potential linguistic rules for the target low-resource language. Second, we apply a semantic retrieval mechanism to select the most relevant examples as evidence for each test query, preventing context overload and ensuring focused, analogical reasoning. Our method shifts from learning language distributions to dynamically discovering and applying rules. Experimental results on the LINGOLY and Linguini benchmarks show that our approach achieves competitive performance across various LLMs, outperforming existing baselines. More importantly, our framework advances extremely low-resource reasoning and provides a generalizable framework for rule induction under knowledge constraints.
Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging
Youngjoon Jang | Junyoung Son | Taemin Lee | Seongtae Hong | Hyeonseok Moon | Seungyoon Lee | Andrew Matteson | Heuiseok Lim
With the increasing utilization of multilingual text information, Cross-Lingual Information Retrieval (CLIR) has become a crucial research area. However, the impact of training data composition on CLIR and Mono-Lingual Information Retrieval (Mono-IR) performance remains underexplored. To investigate this data-centric aspect, we construct linguistically parallel Korean-English datasets and train multilingual retrieval models with various language combinations. Our experiments reveal that the language composition of training data significantly influences IR performance, exhibiting important inter-lingual correlations: using specific language pairs improves CLIR performance but degrades Mono-IR performance. Our work demonstrates that simple weight-averaged model merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-IR capabilities. Our findings highlight the effects of linguistic configuration of training data on both CLIR and Mono-IR, and present model merging as a viable strategy to optimize performance across these tasks.
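Weight-averaged model merging, as mentioned in the abstract above, can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: parameters are represented as plain dictionaries mapping a parameter name to a flat list of floats, all checkpoints share one architecture, and the function name `merge_by_weight_averaging` and the optional `weights` argument are hypothetical.

```python
def merge_by_weight_averaging(state_dicts, weights=None):
    """Return the (optionally weighted) element-wise average of several checkpoints.

    state_dicts: list of {parameter_name: [float, ...]} with identical keys/shapes.
    weights: optional per-checkpoint mixing coefficients; uniform when omitted.
    """
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    if len(weights) != len(state_dicts):
        raise ValueError("one weight per checkpoint is required")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")

    names = state_dicts[0].keys()
    for sd in state_dicts[1:]:
        if sd.keys() != names:
            raise ValueError("checkpoints must share the same parameter names")

    merged = {}
    for name in names:
        params = [sd[name] for sd in state_dicts]
        size = len(params[0])
        if any(len(p) != size for p in params):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Normalised weighted sum, computed element by element.
        merged[name] = [
            sum(w * p[i] for w, p in zip(weights, params)) / total
            for i in range(size)
        ]
    return merged


if __name__ == "__main__":
    clir_model = {"encoder.w": [1.0, 3.0]}   # toy stand-in for a CLIR-tuned model
    mono_model = {"encoder.w": [3.0, 5.0]}   # toy stand-in for a Mono-IR-tuned model
    print(merge_by_weight_averaging([clir_model, mono_model]))
```

With real checkpoints the same loop would run over tensors instead of lists; the averaging rule itself is unchanged.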
Query-Synergy: Leveraging High-Resource Languages for Improving Retrieval Performance Across Multiple Languages
Seongtae Hong | Jungseob Lee | Hyeonseok Moon | Seungyoon Lee | Youngjoon Jang | Heuiseok Lim
Multilingual embedding models often exhibit uneven representational quality, heavily favoring high-resource languages like English. However, conventional retrieval systems that rely exclusively on source-language queries fail to exploit the superior semantic expressiveness of these high-resource subspaces. To address this, we propose Query-Synergy, a training-free approach to improving retrieval performance using multilingual embeddings. Our method utilizes additional queries in English to complement source language queries and integrates similarity scores from both queries, effectively enhancing retrieval performance. We evaluate our approach across five languages (Arabic, Chinese, Greek, Thai, and Turkish) using four multilingual embedding models on two datasets. Our experiments show that this approach outperforms conventional source query retrieval methods, achieving superior nDCG scores across various configurations and translation settings. These results confirm that Query-Synergy is a simple yet effective method for retrieval across multiple languages.
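The idea of integrating similarity scores from a source-language query and an additional English query can be illustrated as follows. The abstract does not state the exact fusion rule, so this sketch assumes a simple convex combination of cosine similarities; the mixing weight `alpha` and the names `cosine` and `fused_ranking` are hypothetical, and the vectors stand in for the output of any multilingual embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def fused_ranking(src_query_vec, en_query_vec, doc_vecs, alpha=0.5):
    """Rank documents by a weighted sum of two query-document similarities.

    alpha weights the source-language query; (1 - alpha) weights the English one.
    doc_vecs maps a document id to its embedding.
    """
    scored = []
    for doc_id, vec in doc_vecs.items():
        score = alpha * cosine(src_query_vec, vec) + (1 - alpha) * cosine(en_query_vec, vec)
        scored.append((doc_id, score))
    # Highest fused score first.
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    docs = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0]}
    for doc_id, score in fused_ranking([1.0, 0.0], [0.0, 1.0], docs, alpha=0.75):
        print(doc_id, round(score, 3))
```

No training is involved: the only extra cost is embedding one additional query per search.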
Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches
Zarina Uvalieva | Bektemir Kumarbai Uulu | Adilet Metinov | Tynchtykbek Tashbaltaev | Nurtilek Alibekov
Text normalization, the task of converting noisy, informal text into a standardized form, is a fundamental preprocessing step for many NLP applications. Despite the growing need for Kyrgyz language processing tools, to the best of our knowledge, no prior work has addressed automatic text normalization for Kyrgyz, a morphologically rich, low-resource Turkic language. In this paper, we present the first systematic study of Kyrgyz text normalization. We collect a dataset of 1.67 million noisy–clean text pairs sourced from YouTube comments, Instagram posts, and Telegram channels, where users frequently write without punctuation, capitalization, or standard spelling. Pairs were annotated with Gemini 3 Pro; the 1,000-example test set was fully verified by two native Kyrgyz speakers with adjudication, and a random subset of the training data was spot-checked, while the full 1.67M training set was not verified exhaustively. For continual pre-training, we additionally use a 538 MB Kyrgyz corpus compiled from news portals and books. We evaluate five systems: a rule-based baseline, zero-shot mT5, a fine-tuned mT5-small model, a continually pre-trained mT5-small followed by fine-tuning, and zero-shot Gemma 4. Our experiments show that fine-tuned mT5-small achieves a CER of 0.0796, outperforming the rule-based baseline (CER 0.2029), zero-shot mT5 (CER 0.9887), and zero-shot Gemma 4 (CER 0.1620), a roughly 32× larger model in a fine-tuned vs. zero-shot setting. Human evaluation by two native Kyrgyz speakers confirms these results, with fine-tuned mT5-small rated as correct in 99.8% of cases. We further analyze why continual pre-training with span corruption does not improve over direct fine-tuning, finding hallucination in 35/40 of the inspected failure cases (87.5%, 95% Wilson CI [74%, 95%]).
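Two quantities in the abstract above lend themselves to a short worked example: the character error rate (CER) and the Wilson score interval quoted for 35 of 40 cases. The sketch below assumes CER is the character-level Levenshtein distance divided by the reference length for a single pair (the paper may aggregate over the corpus differently) and uses z = 1.96 for the 95% interval; the function names are illustrative.

```python
import math


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution or match
        prev = curr
    return prev[-1]


def cer(hypothesis, reference):
    """Character error rate of one hypothesis against one non-empty reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


if __name__ == "__main__":
    low, high = wilson_interval(35, 40)
    print(f"{low:.2f} {high:.2f}")
```

Running the interval on 35 successes out of 40 gives bounds that round to 74% and 95%, consistent with the figures reported in the abstract.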
MultiHaluDet: Multilingual Hallucination Detection via LLM Hidden State Probing
Riasad Alvi | Nurul Labib Sayeedi | Md. Faiyaz Abdullah Sayeedi
Hallucinations in Large Language Models (LLMs) represent a critical barrier to their reliable deployment, a vulnerability heavily exacerbated in non-English and resource-constrained contexts. Existing detection approaches that rely on output confidence heuristics or single-layer internal representations frequently fail to capture deep, complex factual inconsistencies across diverse languages. To address this, we introduce MultiHaluDet, a novel three-stage stacking framework that detects multilingual hallucinations by probing the full hidden state trajectories of frozen LLMs without requiring language-specific fine-tuning. Our method extracts sequential features across multiple layers and processes them via a hybrid architecture using multi-scale attention and self-attention pooling. By generating out-of-fold embeddings that feed into a calibrated classical classifier ensemble, MultiHaluDet captures both fine-grained and coarse-grained patterns of factual inconsistency. Extensive experiments demonstrate that our framework achieves state-of-the-art detection performance, reaching up to 98.55% AUROC on the English HaluEval and TriviaQA benchmarks using Mistral-7B and LLaMA2-7B architectures. Crucially, we rigorously evaluate our framework’s cross-lingual generalization across high (French), medium (Bangla), and low-resource (Amharic) languages. MultiHaluDet demonstrates exceptional representational robustness, consistently outperforming baselines and successfully transferring hallucination detection capabilities across typologically diverse linguistic tiers.
MFMDQwen: Multilingual Financial Misinformation Detection Based on Large Language Model
Zhiwei Liu | Yuyan Wang | Yuechen Jiang | Yupeng Cao | Tianlei Zhu | Xiaorui Guo | Zhiyang Deng | Zhiyuan Yao | Xiao-Yang Liu | Jimin Huang | Sophia Ananiadou
Financial misinformation poses significant threats to financial market stability and individuals’ investment decisions. The multilingual environment and the inherent complexity of financial information present substantial challenges for Multilingual Financial Misinformation Detection (MFMD). Existing LLM-based approaches for financial misinformation detection primarily focus on English and a single financial misinformation detection task, which limits their ability to capture multilingual contexts and complex features. In this paper, we propose MFMDQwen, the first open-source LLM designed for MFMD tasks. Furthermore, we introduce MFMD4Instruction, the first instruction dataset supporting MFMD with LLMs, covering English, Chinese, Greek, and Bengali. We also construct MFMDBench, a benchmark dataset for evaluating the MFMD capabilities of LLMs. Experimental results on MFMDBench demonstrate that our model outperforms existing open-source LLMs.
Multilingual Chain-of-Thought Compression via Cross-Lingual Distillation
Jiarui Wan | Songming Zhang | Yufeng Chen
Chain-of-thought reasoning improves the performance of large language models on complex tasks but often produces overly verbose outputs, leading to increased inference cost. This issue is exacerbated in multilingual settings, where differences in tokenization and linguistic structure result in inconsistent compression performance across languages. Existing methods are largely English-centric and tend to suffer from accuracy degradation, especially in low-resource languages. We propose Multilingual Chain-of-thought Compression via Cross-lingual Distillation (MCD), a unified framework that addresses these challenges through both data construction and optimization. MCD builds a cross-lingually aligned dataset using a translation-with-verification pipeline and difficulty-aware sampling, and employs a reinforcement training strategy that combines supervised fine-tuning with direct preference optimization to encourage concise yet sufficient reasoning. Experiments on multilingual mathematical benchmarks show that MCD consistently reduces reasoning length while maintaining competitive accuracy, and significantly improves robustness in low-resource languages.
When Retrieval Hurts: Evidence Utilization, Script Fidelity, and Knowledge Conflicts in Multilingual RAG
Varalekshmy M Mohan | Swathi Jayakumar | Gadha Saji Menon | Sachin Kurup | Veena G | Vani Kanjirangat
The problem of extractive multilingual QA with LLMs is characterized by complex interactions among retrieval mechanisms, knowledge source configurations, prompting techniques, and scripting biases. Despite high retrieval quality, multilingual RAG often degrades performance, revealing a gap between retrieved evidence and its effective utilization. To address this issue, this paper offers an extensive empirical study that examines these components by comparing retrieval-augmented generation (RAG) with a non-RAG baseline across 21 typologically diverse languages and 5 leading LLMs. Our analysis includes five prompting strategies and multiple retrieval configurations, which enable a unified evaluation across diverse linguistic settings. We have also observed an evidence utilization gap in RAG settings, where RAG underperforms despite high retrieval hit rates due to models’ inefficiency in leveraging the retrieved evidence. We also introduce lightweight inference-time metrics to better characterize retrieval usage and conflict patterns. We also highlight script fidelity as a crucial factor in our experiments, as non-Latin-script languages experience significant performance drops and increased hallucinations without proper grounding. Further, we analyzed generator language preferences, systematically examined conflicts, and identified mechanisms for the effective detection and resolution of conflicts. The study further details how prompting strategies affect language families and script types, offering a detailed analysis for optimizing future multilingual RAG settings.
DIMAS-OMOP: A Deliberative Intelligence-Based Multi-Agent System for Chinese Medical Text Standardization toward OMOP
Hanlin Lv | Xiao Wang | Kesong Wu | Lei Li | Lei Wang
Standardizing Chinese clinical imaging reports within the Observational Medical Outcomes Partnership (OMOP) framework is hindered by linguistic complexity and output inconsistency in existing methods. We propose DIMAS-OMOP, a Deliberative Intelligence-based Multi-Agent System designed for high-fidelity medical concept mapping toward OMOP standardization. Moving beyond single-model architectures, DIMAS-OMOP employs a hybrid three-stage workflow that integrates traditional natural language processing modules with selective Large Language Model reasoning and Retrieval-Augmented Generation. The core innovation lies in a hierarchical six-agent proposer-skeptic deliberation mechanism, complemented by a dynamic concept resolution approach and a four-dimensional quality control framework. Experimental results on 1,250 imaging reports demonstrate that DIMAS-OMOP achieves 95.2% mapping accuracy, significantly outperforming rule-based methods (+21.8 percentage points) and single-AI baselines (+8.1 percentage points). The system maintains a throughput of 1,200 reports/hour, with the multi-agent deliberation stage alone contributing an 8.9% relative accuracy gain. Furthermore, pilot deployment shows a 160.6% return on investment and a 31.5% increase in workflow efficiency. This study provides a novel, robust methodology for integrating unstructured non-English clinical data into the global Observational Health Data Sciences and Informatics (OHDSI) ecosystem through deliberative intelligence.
Beyond Accuracy: A Structured Error Analysis of Multilingual LLMs on Marathi Script Variation and Syntax
Tejas Patil | Barnali Chetia
Evaluation of multilingual large language models has grown rapidly in recent years, yet Marathi, spoken by over 83 million people across India, has received almost no systematic probing beyond surface-level benchmark tests. Most existing multilingual evaluations either omit Marathi entirely or rely on machine-translated test sets that fail to capture the morphological complexity that defines the language. We evaluate four models, namely Llama-3.1-8B, Llama-3.3-70B, Mistral-7B, and Qwen3-32B, on our manually curated Marathi dataset across three probing dimensions: Devanagari versus Romanized script, Marathi-English code-mixing, and syntactic structures including SOV word order, vibhakti case markers, verb gender agreement, and postpositions. Models are tested under English and Marathi instruction conditions across translation, similarity, grammaticality, and case marker tasks. Translation quality is evaluated using both token-level F1 and BERTScore to capture paraphrase equivalence beyond surface word overlap. All models drop between 7.9% and 20.5% on Romanized input. The negative subjunctive marker nasta is ignored by every model. Vibhakti case markers are consistently replaced with Hindi equivalents, revealing that multilingual training has not produced separate internal representations for Hindi and Marathi despite their distinct morphological systems. These findings reveal structural gaps in how current multilingual LLMs handle morphologically rich, low-resource Indic languages and point to specific areas where dedicated Marathi pretraining data would most benefit future work.
Cross-Lingual Sentiment Misalignment: Auditing Multilingual Language Models for Inversion Risk, Dialectal Representation, and Affective Stability
Nusrat Jahan Lia | Shubhashis Roy Dipta
Recent advances in multilingual representation learning aim to bridge the performance gap between high- and low-resource languages, yet their ability to preserve affective meaning across languages remains underexplored, particularly for underrepresented languages like Bengali. This research addresses cross-lingual sentiment misalignment between Bengali and English by introducing a controlled benchmarking framework evaluating four multilingual transformer models on parallel Bengali-English sentence pairs, stratified by dialect, to assess their representational stability. We demonstrate that a compressed model architecture exhibits a 28.7% "Sentiment Inversion Rate," fundamentally misinterpreting positive semantics as negative (or vice versa). Consequently, we identify a cross-lingual sentiment skew that we call "Asymmetric Empathy", where models systematically dampen or artificially amplify the affective weight of Bengali text relative to its exact English counterpart. Finally, we expose a key vulnerability regarding dialectal representation: a "Modern Bias" in the regional model, which exhibits a 57% increase in alignment error when processing the formal Bengali register compared to modern colloquial text. As foundational encoders continue to serve as safety classifiers and reward models for LLM pipelines, cross-lingual reliability becomes a critical concern. We therefore advocate for the integration of "Affective Stability" metrics into future cross-lingual benchmarks to detect and penalize polarity inversions, particularly in low-resource settings.
GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation
Yunsu Kim | Kaden Uhlig | Joern Wuebker
Agent benchmarks remain largely English-centric, while their multilingual versions are often built with machine translation (MT) and limited post-editing. We argue that, for agentic tasks, this minimal workflow can easily break benchmark validity through query-answer misalignment or culturally off-target context. We propose a refined workflow for adapting English benchmarks into multiple languages with explicit functional alignment, cultural alignment, and difficulty calibration using both automated checks and human review. Using this workflow, we introduce GAIA-v2-LILT, a re-audited multilingual extension of GAIA covering five non-English languages. In experiments, our workflow improves agent success rates by up to 32.7% over minimally translated versions, bringing the closest audited setting to within 3.1% of English performance while substantial gaps remain in many other cases. This indicates that a substantial share of the multilingual performance gap is benchmark-induced measurement error, motivating task-level alignment when adapting English benchmarks across languages. The data is available as part of the MAPS package. We also release the code used in our experiments.
Chain-of-Thought (CoT) is commonly used to improve reasoning performance in large language models. We investigate its impact in multilingual contexts by systematically constraining reasoning steps across languages with varying resource levels. This study evaluates two models on two benchmarks with seven languages, comparing constrained CoT depth against zero-shot and free-CoT baselines. We demonstrate that increasing the number of reasoning steps does not consistently improve accuracy across various languages. While high-resource and mid-resource languages remain stable, low-resource languages often experience a decline in performance as the number of reasoning steps increases. We attribute this decline to error accumulation and reasoning noise, which are amplified under deeper reasoning in low-resource languages. These findings indicate that CoT is not inherently beneficial, but its effectiveness is significantly influenced by the interaction between reasoning steps and language resource availability.
On the Limits of Model Merging for Multilinguality in Pre-Training
Seth Aycock | Fedor Vitiugin | Aleksandr Umnov | Christof Monz | Khalil Sima’an
Endowing models with consistent multilingual performance can be achieved by _mixing_ pre-training data, or post-training approaches such as language-specific model _merging_. In this work, we test whether merging can be applied to monolingually pre-trained models. We conduct a controlled study on the efficacy of mixed, merged, and monolingual pre-training setups. We find that while monolingual pre-training results in strong in-language performance, merging any combination of monolingual models leads to performance collapse due to interference. Our analysis suggests representational similarity is a prerequisite for model merging. We therefore conclude that the flexibility of merging in fine-tuning does not extend trivially to language-specific pre-training.
We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require reasoning in order to be answered correctly. Each question is provided in official human translations to 43 languages and complemented with machine-translated counterparts (i.e., 2,150 data points in total). We evaluate two mainstream proprietary LLMs across languages, reasoning effort levels, and translation types in terms of their ability to answer the questions correctly. Our results show that modern LLMs can reason effectively across all evaluated languages and achieve accuracy comparable to human test-takers, with some performance variation across the covered languages. We further find that machine-translated questions do not degrade accuracy relative to official human translations, which suggests that high-quality machine translation (synthetic data) might often be adequate for large-scale multilingual reasoning evaluations where official translations are not available. Finally, we analyze token usage and related inference cost and find that LLM usage in some languages is simultaneously more expensive and less accurate.
Cross-Lingual Bias in Large Language Models: A Comparative Analysis of English and Swahili
Ruolei Zhang | Teddy Njuguna | Yue Feng
Large language models are increasingly deployed in multilingual contexts, yet safety alignment and bias evaluation remain overwhelmingly English-centric. We investigate whether social biases generalise across languages by submitting 4,900 symmetric English–Swahili prompt pairs to GPT-5.2 and Gemini 2.5 Flash across nine demographic bias axes, yielding 19,600 completions evaluated for stereotype prevalence, sentiment, refusal behaviour, and cross-lingual semantic similarity. Our findings show that bias transforms rather than transfers: stereotype rates shifted by up to 12 percentage points on specific axes, Gemini’s neutral-sentiment rate doubled in Swahili, and GPT-5.2 refused 169 prompts in English and zero in Swahili, indicating safety mechanisms functionally anchored to English-language tokens. Over 55% of prompt pairs produced semantically dissimilar completions across both models. These findings reinforce the idea that English-only bias audits do not provide adequate coverage for multilingual deployment.
Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR
Quy-Anh Dang | Chris Ngo
We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced sampling strategy that equalizes the number of training utterances per language and deliberately omits language-tag conditioning so that the model learns to identify languages implicitly from audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32) - a model 6x larger - while incurring a training cost of 81 on a single RTX PRO 6000 GPU. Inference throughput is approximately 20x faster than MERaLiON at 0.10 s/sample versus 2.02 s/sample. These results demonstrate that linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.
The Multilingual Curse at the Retrieval Layer: Evidence from Amharic
Yosef Worku Alemneh | Kidist Amde Mekonnen | Maarten de Rijke
Multilingual retrieval increasingly underpins cross-lingual question answering and retrieval-augmented generation. Strong zero-shot scores on multilingual benchmarks are often taken as evidence that current encoders transfer reliably across many languages. We argue that this assumption breaks down for underrepresented, morphologically rich languages, and use Amharic as a diagnostic case. Under a shared passage retrieval protocol covering dense, late-interaction, learned sparse, and cross-encoder paradigms, we compare zero-shot multilingual retrievers, Amharic-fine-tuned multilingual retrievers, and monolingual Amharic retrievers. The strongest zero-shot multilingual retriever underperforms the strongest monolingual Amharic first-stage retriever by 23% relative MRR@10. Fine-tuning two recent multilingual embedding models on the same Amharic supervision yields 32–60% relative MRR@10 gains over zero-shot, but the best Amharic-fine-tuned multilingual model remains below the strongest monolingual Amharic retriever. These findings indicate that zero-shot multilingual retrieval is not a sufficient proxy for equitable information access in the LLM era: for underrepresented languages, retrieval must be evaluated and adapted in language rather than inferred from aggregate multilingual benchmarks. To foster future research, we publicly release our trained models, dataset, and codebase at https://github.com/rasyosef/amharic-neural-ir.
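For readers unfamiliar with the metric behind the figures above, the following is a minimal sketch of MRR@10 and of a relative difference between two scores. It is not taken from the released codebase: the function names, the input layout (one ranked list of passage IDs and one set of relevant IDs per query), and the choice of the baseline score as the denominator for the relative change are all assumptions made for illustration.

```python
from typing import Sequence, Set


def mrr_at_k(
    ranked_ids_per_query: Sequence[Sequence[str]],
    relevant_ids_per_query: Sequence[Set[str]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k.

    A query with no relevant passage in its top k contributes 0.
    """
    if len(ranked_ids_per_query) != len(relevant_ids_per_query):
        raise ValueError("need one relevance set per query")
    if not ranked_ids_per_query:
        raise ValueError("need at least one query")

    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, passage_id in enumerate(ranked[:k], start=1):
            if passage_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids_per_query)


def relative_change(new_score: float, baseline_score: float) -> float:
    """Relative change of new_score over baseline_score (assumed convention)."""
    if baseline_score == 0:
        raise ValueError("baseline score must be non-zero")
    return (new_score - baseline_score) / baseline_score
```

Under this convention, a relative change of 0.32 would correspond to a 32% relative gain over the zero-shot score; the abstract does not state which score serves as the denominator for the 23% gap.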
ShahiEmotion: A Benchmark Dataset for Punjabi Shahmukhi Emotion Detection
Usman Nawaz | Muhammad Junaid Iqbal | Tahir Alyas | Muhammad Asaf | Shumayla Yaqoob | Usman Ahmed Raza | Muhammad Amin Nadim | Aftab Rafique | Faisal Rehman
Emotion detection is an important text classification task with applications in sentiment analysis, social media monitoring, human-computer interaction, and affective language understanding. However, Punjabi written in the Shahmukhi script remains severely under-resourced for emotion detection, with limited benchmark-style resources available for supervised evaluation. This paper introduces ShahiEmotion, a new Punjabi Shahmukhi emotion detection dataset containing 30379 sentence-level instances annotated with seven emotion categories: sadness, surprise, happiness, anger, neutral, fear, and disgust. The dataset is designed to support research in a low-resource setting characterized by script-specific challenges, lexical variation, and substantial class imbalance. We establish baseline results using several pretrained transformer-based models and formulate emotion detection as a sentence-level classification task. In particular, we fine-tune multilingual BERT, multilingual DistilBERT, XLM-RoBERTa, and Urdu RoBERTa under the same training and evaluation setting using standard cross-entropy loss. Experimental results show that XLM-RoBERTa provides the strongest overall performance among the compared models. The best model achieves 77.95% accuracy, 58.47% macro-F1, and 77.60% weighted-F1 on the test set. The dataset, evaluation protocol, and baseline results introduced in this work are intended to support future research on Punjabi Shahmukhi emotion analysis and low-resource NLP.
Evaluating Multilingual Tokenization under Worst-N Parity-Aware BPE
Vani Kanjirangat | David Kletz | Tanja Samardzic | Ljiljana Dolamic | Fabio Rinaldi
Improving the fairness of a language model is a goal that applies at every level of the model. In this paper, we evaluate a method targeting a foundational level: tokenization. We present a multilingual evaluation of parity-aware tokenization under worst-N optimization, extending PA-BPE to jointly optimize over the N worst-compressed languages. We evaluate this formulation for N > 1 across vocabulary sizes of 16K and 32K on the languages from the flores+ benchmark, using metrics that capture both efficiency and structural alignment. Our results reveal that the effects of increasing N are inconsistent across metrics and do not lead to major gains. Efficiency-oriented and boundary-level metrics show a modest tendency to improve at higher values of N, while structural alignment metrics (such as AST alignment and boundary crossing) exhibit no clear pattern, suggesting that compression fairness and linguistic structure are mainly orthogonal objectives. Script-level analysis further reveals uneven effects across writing systems, with several non-Latin scripts showing greater sensitivity to increasing N.
MLingualFC: Evaluating Jailbreak Vulnerabilities in Multilingual Vision-Language Models
Rishabh Makwana | Mamta Mamta | Deeksha Varshney | Oana Cocarascu
Vision-Language Models (VLMs) have demonstrated strong performance across multimodal tasks, yet their safety robustness remains an open challenge. While prior work has shown that structured visual prompts such as flowcharts can effectively jailbreak VLMs, existing studies are largely limited to English-centric settings. In this paper, we introduce MLingualFC, a multilingual multimodal benchmark designed to evaluate jailbreak vulnerabilities of VLMs across diverse languages using structured flowchart representations. MLingualFC encodes harmful instructions into flowchart images across five languages (Hindi, Punjabi, Spanish, Romanian, and German). We evaluate state-of-the-art multilingual VLMs, including Qwen2.5-VL, Gemma-4, and Pangea, under a black-box threat model. Our results reveal significant multilingual safety gaps. Flowchart-based attacks achieve high attack success rates (ASR) for Latin-script languages, demonstrating that visual encoding of harmful content effectively bypasses safety alignment across languages. In contrast, non-Latin-script languages such as Punjabi exhibit substantially lower ASR, suggesting potential limitations in visual text recognition rather than stronger safety alignment. These findings highlight that current VLM safety mechanisms fail to generalize across languages and modalities.
P3B3: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in LLMs
Rafael Ferreira | Inês Vieira | Inês Calvo | James Furtado | Iago Paulo | Diogo Glória-Silva | Diogo Tavares | David Semedo | Joao Magalhaes
As Large Language Models (LLMs) become embedded in everyday communication, capturing regional linguistic variation is essential for reliable and equitable language use. In Portuguese, European (pt-PT) and Brazilian (pt-BR) varieties remain unevenly represented, with pt-BR dominating in data quantity, while LLM preference for Portuguese variants remains underexplored. To address this gap, we introduce P3B3, an expert-curated, variety-agnostic benchmark of conversational prompts, along with an evaluation framework for measuring variety bias and controllability. Experiments on several models show that most LLMs exhibit a strong bias toward pt-BR, with variation in controllability across models. These results highlight the need for more balanced multilingual representation across language varieties.
Causal Localization of the English Pivot in LLaVA: Mechanistic VLM Analysis and Training-Free Multilingual Steering
Abrar Zahin Raihan | Aurchi Chowdhury
Multilingual vision-language models (VLMs) consistently underperform on non-English visual queries, yet the internal mechanism behind this disparity remains unknown. As a focused case study on LLaVA-1.5-7B, we apply logit-lens analysis and causal activation patching to show that non-English visual queries are routed through an English-biased representational bottleneck in layers 5–17, extending the English-pivot phenomenon of Wendler et al. (2024) to the multimodal setting. Peak causal influence occurs at layer 8 (mean AIE = 0.49, averaged across languages), with all measurable pivot signal running through text-token positions. Without meaningful visual content (blank-image condition), language-specific representations do not emerge at any layer, showing that the pivot is image-content-dependent rather than triggered by any visual input. Building on these findings, we derive training-free language-steering vectors at the mechanistically identified pivot layers, improving Russian VQA by +6.5 pp and Portuguese by +4.0 pp on MMMB without any fine-tuning, with the latter surpassing the English baseline. Within this case study, our results are consistent with the English pivot being a structural property of the LLM backbone that multimodal pre-training does not mitigate; extending this mechanistic methodology to other VLMs and language families remains an important direction for future work.
Multilingual Disparities in LLM-Based Safety Judgments: Evidence from Brand Safety Applications
Songjiang Liu | Riley Grossman | Mike Smith | Cristian Borcea | Yi Chen
Multilingual LLMs are increasingly used as context-aware judges in real-world information systems under the assumption that equivalent content receives equivalent judgments across languages. We examine this assumption through brand safety, a global application where automated ratings can affect advertisers’ reputations, publishers’ revenues, and users’ access to news. We construct a benchmark of LLM-generated safety ratings for 10,467 semantically aligned news articles across 13 languages. We find systematic cross-lingual disagreement appearing in more than 96% of cases where at least one language receives a non-zero risk rating. Suitability ratings differ significantly by language, controlling for run, category, and article. In the main model, English, German, and French content is generally rated more strictly, while Polish, Hungarian, Greek, Turkish, and Persian content is rated more leniently. Robustness checks with two additional LLMs show that significant language effects persist, though directional patterns vary by model. These findings show that multilingual LLM safety judgments can produce unequal outcomes for semantically equivalent content.
Benchmarking Byte-Pair Encoding Tokenizers on Different Languages with Bits per Byte
Soham Chowdhury | Warren Woolf
Tokenization significantly affects the cross-lingual performance of language models, yet recent tokenizer variants such as SuperBPE and MorphBPE have not been systematically evaluated across typologically diverse languages. We conduct the first extrinsic cross-language comparison of BPE, SuperBPE, and MorphBPE tokenizers on English, Mandarin, and Hungarian, using bits per byte (BPB) normalized perplexity as our metric, with vocabulary sizes of 8K, 16K, and 32K. We find that SuperBPE matches BPE for English but underperforms by 0.01–0.06 BPB for Hungarian and Mandarin, suggesting that cross-whitespace merging is counterproductive for non-English languages. MorphBPE performs worse than BPE across all settings, with gaps of 0.02–0.04 BPB at the 32K vocabulary size. These results suggest that linguistic theory alone does not guarantee practical improvements in tokenizer design, and that standard BPE remains a surprisingly effective baseline across typologically diverse languages.
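As a rough illustration of why bits per byte allows comparison across tokenizers, the sketch below normalizes a model's total negative log-likelihood by the UTF-8 byte length of the text rather than by the token count. This is the commonly used definition of BPB, not necessarily the exact computation in the paper; the function name and the assumption that per-token losses are given in nats are illustrative.

```python
import math
from typing import Iterable


def bits_per_byte(token_nll_nats: Iterable[float], text: str) -> float:
    """Total negative log-likelihood in bits, divided by the UTF-8 byte count.

    token_nll_nats: per-token negative log-likelihoods in nats (assumed input
    format), summed over the tokens that encode `text`.
    """
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    total_nats = sum(token_nll_nats)
    # Dividing by ln(2) converts nats to bits.
    return total_nats / (math.log(2) * n_bytes)
```

Because the denominator depends only on the raw text, two tokenizers that split the same passage into different numbers of tokens are scored on the same scale, which per-token perplexity does not provide.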
Where Privacy Risk Lives in English-Source Multilingual RAG: A Stage-Decomposed Audit Across Five Query Languages
Yanhang Li | Zhichao Fan | Zexin Zhuang
A common assumption holds that switching to a non-English language makes a multilingual RAG system easier to attack for personal information. On an English-source synthetic-PII corpus with five query languages and a two-stage defence (LLM input judge + regex output filter), the output-stage point estimates do not support that assumption: English has the highest observed unstructured-PII leak rate, and only English-vs-Swahili separates cleanly under our document-level bootstrap intervals. Once the input judge is added, residual leaks remain on Arabic and Swahili in this Qwen-mediated pipeline, and back-translating the query does not close the gap. Translator, judge, and generator share one model family, so we treat this as pipeline-conditional rather than a causal language ranking. As an oracle diagnostic on a separate n=17 multilingual-prompted-judge residual corner, attaching the gold corpus document to the input judge blocks 15/17 residual cells — a follow-up direction, not a deployed mitigation, since all BLOCK/ALLOW rates are on adversarial queries only and we measure no benign-query FPR or utility. The anonymous supplement contains code, corpora, queries, and per-trial JSONLs.
The Broken Telephone Changes Tone: Examining Nuanced Linguistic Cues in LLM Chains-of-Translation
Quang Minh Nguyen | Maida Aizaz | Braahmi Padmakumar
As LLM-generated content proliferates online, texts are increasingly subject to repeated processing and translation by models, making it critical to understand how such iterative reprocessing reshapes language. Prior work has shown that this degrades factual content and reduces diversity, but the fine-grained linguistic shifts underlying these effects remain unexplored. We track changes in epistemic markers, grammatical voice, degree adverbs, and nominalisation density across 12 iterations of round-trip translation applied to 600 BBC News articles, varying intermediate language, translation model, and chain topology across 17 experimental configurations. We find a consistent epistemic shift: evidential and factive markers increase while hedges decline, potentially causing tentative claims to read as more certain. Concurrently, texts undergo register-level formalisation—informal degree adverbs give way to formal alternatives, active-voice density drops, by-phrase passives attrite disproportionately, and nominalisation density rises. We also record clear model-specific patterns for certain settings. These shifts erode the markers of source, register, and agency, offering a fine-grained account of the factual degradation reported in previous studies.
Group-Merger: A LoRA-based Framework for Multilingual Continual Learning
Weijian yi | Hongliang Li | Jinan Xu
Multilingual continual learning (MCL) is crucial for enabling language models to adapt across diverse linguistic environments while retaining knowledge over time. Existing parameter isolation methods allocate language-specific modules but fail to leverage cross-lingual transfer, leading to inefficient parameter growth and poor generalization. Model merging based approaches suffer from severe performance degradation as the number of language-specific tasks increases, due to interference between linguistic and task-specific knowledge. To address these challenges, we propose Group-Merger, a framework that employs group-wise merging to balance parameter efficiency and continual learning performance. Our framework mitigates catastrophic forgetting across languages while enabling knowledge transfer. Extensive experiments on multilingual evaluation benchmarks demonstrate superior performance compared to existing methods.
Proceedings of the First Workshop on Multilingual Multicultural Evaluation
Pinzhen Chen | Vilém Zouhar | Hanxu Hu | Simran Khanuja | Wenhao Zhu | Barry Haddow | Alexandra Birch | Alham Fikri Aji | Rico Sennrich | Sara Hooker
LLMs as Span Annotators: A Comparative Study of LLMs and Humans
Zdeněk Kasner | Vilém Zouhar | Patrícia Schmidtová | Ivan Kartáč | Kristýna Onderková | Ondrej Platek | Dimitra Gkatzia | Saad Mahamood | Ondrej Dusek | Simone Balloccu
Span annotation - annotating specific text features at the span level - can be used to evaluate texts where single-score metrics fail to provide actionable feedback. Until recently, span annotation was done by human annotators or fine-tuned models. In this paper, we study whether large language models (LLMs) can serve as an alternative to human annotators. We compare the abilities of LLMs to skilled human annotators on three span annotation tasks: evaluating data-to-text generation, identifying translation errors, and detecting propaganda techniques. We show that overall, LLMs have only moderate inter-annotator agreement (IAA) with human annotators. However, we demonstrate that LLMs make errors at a similar rate as skilled crowdworkers. LLMs also produce annotations at a fraction of the cost per output annotation. We release the dataset of over 40k model and human span annotations for further research.
Recent studies evaluate the value orientation of large language models (LLMs) using adapted social surveys, typically by prompting models with survey questions and comparing their responses to average human responses. This paper identifies limitations in this methodology that, depending on the exact setup, can lead to both underestimating and overestimating the similarity of value orientation. Using the World Value Survey in three languages across five countries, we demonstrate that prompting methods (direct vs. chain-of-thought) and decoding strategies (greedy vs. sampling) significantly affect results. To assess the interaction between answers, we introduce a novel metric, self-correlation distance. This metric measures whether LLMs maintain consistent relationships between answers across different questions, as humans do. This shows that even a high average agreement with human data when considering LLM responses independently does not guarantee structural alignment in responses. Additionally, we reveal a weak correlation between two common evaluation metrics, mean-squared distance and KL divergence, which consider all survey answers independent of each other. For future research, we recommend CoT prompting, sampling-based decoding with dozens of samples, and robust analysis using multiple metrics, including self-correlation distance.
Code-switching is a common feature of multilingual communication, and identifying where the language switches reliably is essential for downstream tasks such as generating code-switched machine translations. This paper introduces CSDI, a Code-Switching Detection (CSD) system for Indic text, which jointly learns CSD, Named Entity Recognition, and Part-of-Speech tagging through a shared encoder. Leveraging multitask learning, CSDI captures linguistic cues that signal switching boundaries and achieves a new state-of-the-art macro-F1 score with near-zero 𝛥CMI across six Indic languages. The model also demonstrates strong cross-lingual transfer, effectively leveraging high-resource languages to improve low-resource performance. Despite challenges such as intra-word code-mixing and limited token-level context, CSDI establishes a new baseline for scalable, low-resource NLP research in code-mixed environments.
Vinclat: Evaluating Reasoning, Cognition and Culture in One Game
Marc Pàmies | Javier Aula-Blasco | Aitor Gonzalez-Agirre | Marta Villegas
This paper introduces Vinclat, a novel evaluation dataset for Catalan carefully designed to assess the reasoning capabilities and cultural knowledge of LLMs. It comprises 1,000 high-quality instances, meticulously crafted and reviewed by human annotators. Each instance presents a complex riddle that requires a two-step reasoning process involving inferential and abductive reasoning, along with other cognitive skills such as lexical retrieval, paraphrasing, flexibility in interpretation, pattern recognition, and associative thinking. Given four independent clues, models should infer intermediate concepts which, despite being seemingly unrelated, can be creatively connected to reach a final solution. The task targets a unique blend of capabilities, distinguishing it from existing NLP benchmarks. Our evaluation of state-of-the-art models reveals that these still fall significantly short of human-level reasoning, although scaling trends suggest that the performance gap may narrow over time. This indicates that Vinclat provides a robust and long-term challenge, resisting the rapid saturation that is commonly observed in many existing evaluation datasets.
Conceptual Cultural Index: A Metric for Cultural Specificity via Relative Generality
Takumi Ohashi | Hitoshi Iyatomi
Large language models (LLMs) are increasingly deployed in multicultural settings; however, systematic evaluation of cultural specificity at the sentence level remains underexplored. We propose the Conceptual Cultural Index (CCI), which estimates cultural specificity at the sentence level. CCI is defined as the difference between the generality estimate within the target culture and the average generality estimate across other cultures. This formulation enables users to operationally control the scope of culture via comparison settings and provides interpretability, since the score derives from the underlying generality estimates. We validate CCI on 400 sentences (200 culture-specific and 200 general), and the resulting score distribution exhibits the anticipated pattern: higher for culture-specific sentences and lower for general ones. For binary separability, CCI outperforms direct LLM scoring, yielding more than a 10-point improvement in AUC for models specialized to the target culture. Our code is available at https://github.com/IyatomiLab/CCI.
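The definition of CCI given above can be written out as a short sketch: the score is the target-culture generality estimate minus the mean generality estimate over the comparison cultures. How the generality estimates themselves are obtained is not described in this abstract, so they are treated here as given numbers; the function name, argument names, and example values are hypothetical and do not come from the linked repository.

```python
from statistics import mean
from typing import Sequence


def conceptual_cultural_index(
    target_generality: float,
    other_generalities: Sequence[float],
) -> float:
    """CCI = generality in the target culture - mean generality elsewhere.

    Both inputs are assumed to be generality estimates on a common scale.
    The set of comparison cultures is chosen by the caller, which is how
    the scope of "culture" is controlled.
    """
    if not other_generalities:
        raise ValueError("need at least one comparison culture")
    return target_generality - mean(other_generalities)


# Hypothetical estimates: a sentence judged general in the target culture
# (0.9) but much less so in two comparison cultures (0.2 and 0.4).
score = conceptual_cultural_index(0.9, [0.2, 0.4])
```

A sentence that is equally general everywhere scores near zero, while a positive score indicates content that is commonplace in the target culture and less so in the comparison set.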
The Anthropology of Food: How NLP can Help us Unravel the Food cultures of the World
Arij Riabi | Sougata Saha | Monojit Choudhury
Food carries cultural meaning beyond nutrition. It shapes identity, memory, and social norms, which makes it a central concern in anthropology. Given the diversity of food practices across cultures, analyzing them at scale while preserving their depth (“thick” descriptions) remains difficult for ethnographic methods, where Natural Language Processing (NLP) methods can help. Earlier NLP tools often captured only surface-level “thin” descriptions. Recent methods, especially Large Language Models (LLMs), create openings to recover cultural nuance. In this position paper, we outline research questions at the intersection of food anthropology and NLP, and discuss how LLMs can enable a scalable and culturally grounded anthropology of food. We present a case study examining what LLMs represent about global eating habits, which are often shaped by colonial histories and globalization. Our findings suggest that LLMs’ internal representations recognize cultural clusters, such as shared food habits among formerly colonized regions, but fail to grasp the pragmatic and experiential aspects of food, like the worldwide spread of dishes like pizza or biryani. We conclude by highlighting some of the potential risks and gaps of using NLP for cultural analysis.
LLM-as-a-qualitative-judge: automating error analysis in natural language generation
Nadezhda Chirkova | Tunde Oluwaseyi Ajayi | Seth Aycock | Zain Muhammad Mujahid | Vladana Perlić | Ekaterina Borisova | Markarit Vartampetian
Prompting large language models (LLMs) to evaluate generated text, known as LLM-as-a-judge, has become a standard evaluation approach in natural language generation (NLG), but is primarily used as a quantitative tool, i.e., with numerical scores as the main outputs. In this work, we propose LLM-as-a-qualitative-judge, an LLM-based evaluation approach whose main output is a structured report of common issue types in the NLG system outputs. Our approach is targeted at providing developers with meaningful insights into what improvements can be made to a given NLG system and consists of two main steps, namely open-ended per-instance issue analysis and clustering of the discovered issues using an intuitive cumulative algorithm. We also introduce a strategy for evaluating the proposed approach, coupled with ~300 annotations of issues in instances from 12 NLG datasets. Our results show that instance-specific issues output by LLM-as-a-qualitative-judge match those annotated by humans in 2/3 of cases, and that LLM-as-a-qualitative-judge is capable of producing error type reports resembling the reports composed by human annotators. We also demonstrate in a case study how the use of LLM-as-a-qualitative-judge can substantially improve NLG system performance.
Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages
Isaac Chung | Linda Freienthal
Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences. This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at https://github.com/isaac-chung/cross-lingual-stability-judges.
Cross-lingual and cross-country approaches to argument component detection: a comparative study.
Cecilia Graiff | Chloé Clavel | Benoît Sagot
Argument mining in multilingual settings has rarely been investigated, due to the lack of annotated resources and to the inherent difficulty of the task. We benchmark the performance of models on cross-lingual and cross-country argument component detection, focusing on political data from the US and France. To do so, we introduce FrenchPolArg, a corpus of argumentative political discourse in French, and we automatically translate already existing US-English resources. We benchmark three different cross-lingual and cross-country pipelines, and compare their results to find the best-performing one. We obtain promising results to be integrated in semi-automatic annotation workflows to reduce the time and cost of annotations.
UNSC-Bench: Evaluating LLM Diplomatic Role-Playing Through UN Security Council Vote Prediction
Ayush Nangia | Aman Gokrani | Ruggero Marino Lazzaroni
This paper introduces UNSC-Bench, a benchmark for evaluating Large Language Models (LLMs) in simulating diplomatic decision-making through United Nations Security Council (UNSC) vote prediction. The dataset includes 469 UNSC resolutions from 1947 to 2025, with voting records for the five permanent members (P5) (United States, China, France, Russia, United Kingdom) and translations in four languages. We analyze 26 LLMs, along with thinking variants, across multiple P5 roles and find that (1) without explicit role assignment, models are diplomatically unaligned, defaulting to high yes rates and failing to match any P5 voting pattern, indicating they lack inherent diplomatic identity; (2) model capability (as measured by MMLU-Pro) is strongly correlated with role-playing accuracy; (3) regional models do not outperform others in predicting their home country’s votes; and (4) multilingual evaluation reveals that prompt language impacts model predictions, particularly for minority vote outcomes.
Leveraging Wikidata for Geographically Informed Sociocultural Bias Dataset Creation: Application to Latin America
Yannis Karmim | Renato Pino | Hernan Contreras | Hernan Lira | Sebastian Cifuentes | Simon Escoffier | Luis Martí | Djamé Seddah | Valentin Barriere
Large Language Models (LLMs) exhibit inequalities with respect to various cultural contexts. Most prominent open-weights models are trained on Global North data and show prejudicial behavior towards other cultures. Moreover, there is a notable lack of resources to detect biases in non-English languages, especially from Latin America (Latam), a continent containing various cultures, even though they share a common cultural ground. We propose to leverage the content of Wikipedia, the structure of the Wikidata knowledge graph, and expert knowledge from social science in order to create a dataset of Question/Answer (Q/A) pairs, based on the different popular and social cultures of various Latin American countries. We create a database of around 23k questions and associated answers extracted from 23k Wikipedia articles, and transformed into multiple-choice questions (MCQs) in Spanish and Portuguese, in turn translated to English. We use these MCQs to quantify the degree of knowledge of various LLMs and find (i) a discrepancy in performance between the Latam countries, some being easier than others for the majority of the models, (ii) that the models perform better in their original language, and (iii) that Iberian Spanish culture is better known than Latam culture. Our code, our results for reproducing the results, and all datasets by region will be available.
Whom to Trust? Analyzing the Divergence Between User Satisfaction and LLM-as-a-Judge in E-Commerce RAG Systems
Arif Türkmen | Kaan Efe Keleş
We study retrieval-augmented generation (RAG) evaluation in the Trendyol QA Assistant using 150k real e-commerce interactions. Our framework combines user satisfaction labels, LLM-as-a-judge scoring, and factor-based diagnostics to separate retrieval from generation errors. We find that judge models broadly reflect user satisfaction trends, though important nuances of dissatisfaction are often missed. Factor-level analysis highlights systematic error patterns across query types and context quality, demonstrating that hybrid evaluation, combining multiple LLM judges with direct user feedback, offers the most reliable assessment strategy for production RAG systems.
Query-Following vs Context-Anchoring: How LLMs Handle Cross-Turn Language Switching
Kyuhee Kim | Chengheng Li Chen | Anna Sotnikova
When multilingual users switch languages mid-conversation, how should LLMs respond? We extend MultiChallenge to evaluate cross-turn language switching, translating 182 multi-turn conversations into German, Chinese, Spanish, and Arabic. Across five frontier models, we observe asymmetric behavior: switching into a foreign language (EN→X) yields high query-language fidelity (89–99%), but switching back to English (X→EN) reveals divergent policies. GPT-5 follows the query language (>95%), while Claude Opus 4.5 and Command R+ maintain the established conversation language (<8%). Task accuracy remains stable across conditions regardless of language selection differences. A simple explicit system prompt shows limited effectiveness in modifying these defaults.
Generating Difficult-to-Translate Texts
Vilém Zouhar | Wenda Xu | Parker Riley | Juraj Juraska | Mara Finkelstein | Markus Freitag | Daniel Deutsch
Machine translation benchmarks sourced from the real world are quickly obsoleted, due to most examples being easy for state-of-the-art translation models. This limits the benchmark’s ability to distinguish which model is better or to reveal models’ weaknesses. Current methods for creating difficult test cases, such as subsampling or from-scratch synthesis, either fall short of identifying difficult examples or suffer from a lack of diversity and naturalness. Inspired by the iterative process of human experts probing for model failures, we propose MT-breaker, a method where a large language model iteratively refines a source text to increase its translation difficulty. The LLM iteratively queries a target machine translation model to guide its generation of difficult examples. Our approach generates examples that are more challenging for the target MT model while preserving the diversity of natural texts. While the examples are tailored to a particular machine translation model during the generation, the difficulty also transfers to other models and languages.
’A Woman is More Culturally Knowledgeable than A Man?’: The Effect of Personas on Cultural Norm Interpretation in LLMs
Mahammed Kamruzzaman | Hieu Minh Nguyen | Nazmul Hassan | Gene Louis Kim
As the deployment of large language models (LLMs) expands, there is an increasing demand for personalized LLMs. One method to personalize and guide the outputs of these models is by assigning a persona—a role that describes the expected behavior of the LLM (e.g., a man, a woman, an engineer). This study examines whether an LLM’s interpretation of social norms varies based on assigned personas and whether these variations stem from embedded biases within the models. In our research, we tested 34 distinct personas from 12 categories (e.g., age, gender, beauty) across four different LLMs. We find that LLMs’ cultural norm interpretation varies based on the persona used and that the variations within a persona category (e.g., a fat person and a thin person as in physical appearance group) follow a trend where an LLM with the more socially desirable persona (e.g., a thin person) interprets social norms more accurately than with the less socially desirable persona (e.g., a fat person). While persona-based conditioning can enhance model adaptability, it also risks reinforcing stereotypes rather than providing an unbiased representation of cultural norms. We also discuss how different types of social biases due to stereotypical assumptions of LLMs may contribute to the results that we observe.
Proceedings of the 22nd Workshop on Multiword Expressions (MWE 2026)
Atul Kr. Ojha | Verginica Barbu Mititelu | Mathieu Constant | Ivelina Stoyanova | A. Seza Doğruöz | Alexandre Rademaker
Large Language Models Put to the Test on Chinese Noun Compounds: Experiments on Natural Language Inference and Compound Semantics
Le Qiu | Emmanuele Chersoni | He Zhou | Yu-Yin Hsu
Noun compounds are generally considered an open challenge for NLP systems, given the difficulty of interpreting the implicit semantic relation between modifier and head, although the advent of Large Language Models (LLMs) recently led to remarkable performance leaps. However, most evaluations have been carried out on English benchmarks. In our work, we test LLMs on compound semantics understanding in Chinese, adopting two different evaluation scenarios: an extrinsic evaluation in a Natural Language Inference task, and an intrinsic evaluation in which models are directly asked to predict the semantic relation linking the two constituents. Our results show that the bigger and more recent LLMs are able to surpass supervised baselines in the inference task, especially when tested under the few-shot setting. In the more challenging task of selecting the correct interpretation of the compounds out of a fine-grained typology of semantic relations between head and modifier, the best Chinese LLM (Qwen-plus) manages to select the correct option in about one third of the cases.
SinFoS: A Parallel Dataset for Translating Sinhala Figures of Speech
Johan Nevin Sofalas | Dilushri Pavithra | Nevidu Jayatilleke | Ruvan Weerasinghe
Figures of Speech (FOS) consist of multi-word phrases that are deeply intertwined with culture. While Neural Machine Translation (NMT) performs relatively well with the figurative expressions of high-resource languages, it often faces challenges when dealing with low-resource languages like Sinhala due to limited available data. To address this limitation, we introduce a corpus of 2,344 Sinhala figures of speech with cultural and cross-lingual annotations. We examine this dataset to classify the cultural origins of the figures of speech and to identify their cross-lingual equivalents. Additionally, we have developed a binary classifier to differentiate between two types of FOS in the dataset, achieving an accuracy rate of approximately 92%. We also evaluate the performance of existing LLMs on this dataset. Our findings reveal significant shortcomings in the current capabilities of LLMs, as these models often struggle to accurately convey idiomatic meanings. By making this dataset publicly available, we offer a crucial benchmark for future research in low-resource NLP and culturally aware machine translation.
Swedish Multiword Expression Corpora in PARSEME
Sara Stymne | Astrid Berntsson Ingelstam | Eva Pettersson
We present the annotation of Swedish multiword expressions under the PARSEME annotation scheme, including a new release and a historical overview of previous releases. We provide an overview of the evolution of the Swedish datasets and of inter-annotator agreement. We discuss general guidelines and the development of Swedish-specific guidelines for particle verbs and multiword tokens, as well as additional challenges for the Swedish annotation. We also conduct an initial comparison of Swedish and other Germanic languages, identifying aspects where the PARSEME guidelines require revision to ensure better consistency across languages.
Ukrainian Multiword Expressions Corpus: Creation, Annotation, and Linguistic Analysis
Hanna Sytar | Maria Shvedova | Olha Kanishcheva
This paper presents the development of a corpus of annotated multiword expressions (MWEs) for Ukrainian. The resource covers four major categories of MWEs: verbal, nominal, adjectival/adverbial, and functional. We describe the methodology used for data selection, the annotation scheme, and the procedures employed during annotation. In addition, the paper discusses some specific types of MWE constructions, illustrating their usage with numerous examples and addressing complex and borderline cases. The resulting corpus is an important resource for linguistic studies and NLP tasks involving MWEs, and is publicly accessible at https://gitlab.com/parseme/sharedtask-data/-/tree/master/2.0?ref_type=heads.
Cognitive Signatures of Multi-Word Expressions: Reading-Time and Surprisal
Diego Alves | Sergei Bagdasarov | Elke Teich
This study investigates whether eye-tracking measures predict if a word is the final token of a multi-word expression (MWE), focusing on two understudied MWE types: fixed expressions (e.g., due to) and phrasal verbs (e.g., turn out). Using mixed-effects logistic regression, we compared tokens in MWE contexts with the same tokens in non-MWE contexts. Results reveal a clear difference in processing. For fixed expressions, reading-time measures significantly predict MWEhood. In contrast, phrasal verbs show no consistent predictive effects. Additionally, we compared the reading-time models to models that included GPT-2 surprisal as a predictor. While surprisal does predict MWEhood, it fails to capture the distinction between types. These findings highlight the need to consider MWE typology in models of formulaic language processing.
Cheese it up: CamemBERT Outperforms Large Language Models for Identification of French Multi-word Expressions
Sergei Bagdasarov | Diego Alves | Elke Teich
In recent years, language models, both encoder-only and generative, have been applied to a variety of downstream NLP tasks, including sequence labeling tasks like automatic multi-word expression identification (MWEI). Multiple studies show that, in general, fine-tuned encoder-only models like BERT tend to outperform pretrained generative LLMs on downstream tasks (Arzideh et al., 2025; Ochoa et al., 2025; Bucher and Martini, 2024; Sebok et al., 2025). However, such comparisons are sparse for MWEI, in particular for French, in part due to the lack of comprehensive gold-standard datasets. In this study, we address this research gap by comparing CamemBERT with gpt-oss and Qwen3 for MWEI, using the French subcorpus of the newly released PARSEME dataset. CamemBERT outperforms both LLMs by large margins in precision, recall, and F1. We complement this numerical evaluation with a qualitative analysis of prediction errors.
Extracting Multi-Word Expressions Representing Technical Terms and Proper Nouns in Log Messages
Kilian Dangendorf | Sven-Ove Hänsel | Jannik Rosendahl | Felix Heine | Carsten Kleiner | Christian Wartena
IT systems generate log messages containing important information about the system’s health. To gather information about system entities, we extract technical terms and proper nouns as multi-word expressions (MWEs) from a wide range of log messages from 16 different real systems. We apply Gries’ information-theoretic approach, which iteratively calculates the best MWE candidates using an eight-dimensional ranking method. These candidates are evaluated in an annotation study, achieving a precision of 66%. This value is significantly higher than in evaluations on general-purpose texts, demonstrating the higher occurrence of compound technical terms and proper nouns in log messages. The MWEs found can be used to reduce the number of nodes in a system behavior graph while increasing the information density of the nodes.
Two Birds with One Stone: Annotating Romanian Multiword Expressions with an Eye to the PARSEME 2.0 Guidelines Applicability
Verginica Mititelu | Mihaela Cristescu | Elena Irimia | Carmen Mîrzea Vasile
This paper presents an enhanced version of the Romanian corpus previously annotated only for verbal multiword expressions. The new release extends the annotation to multiword expressions of other parts of speech, following version 2.0 of the PARSEME guidelines. The corpus has been expanded: its new part was automatically morpho-syntactically annotated based on the Universal Dependencies framework, followed by extensive semi-automatic annotation of multiword expressions across all morphological categories. The paper also reports quantitative data on the updated corpus and discusses the distribution and characteristics of Romanian multiword expressions. We also highlight language-specific annotation challenges and issues arising from the PARSEME 2.0 guidelines.
Incorporating Multiword Expressions in Galician Neural Machine Translation: Compositionality, Efficiency, and Performance
Daniel Solla | Paula Pinto-Ferro | Laura Castro | Pablo Gamallo | Marcos Garcia
This paper explores the behavior of neural machine translation models on two newly introduced datasets containing noun-adjective MWEs with different degrees of semantic ambiguity and compositionality. We compare general-domain machine translation systems with fine-tuned models exposed to small subsets of the target MWEs. By assessing the effects of the learning steps and corpus size, we found that carefully designed fine-tuning may improve MWE handling while mitigating catastrophic forgetting. However, our error analysis reveals that models still struggle in several scenarios, particularly when translating MWEs with idiomatic meanings. Both the datasets and the experiments focus on translation involving Galician, English, and Spanish.
Beyond Single Words: MWE Identification in Bioinformatics Research Articles and Dispersion Profiling Across IMRaD
Jurgi Giraud | Andrew Gargett
Multiword Expressions (MWEs) are pervasive in scientific writing, and in specialized domains they include both multiword terminology (e.g., noun compounds) and recurrent academic phrasing. This study profiles MWEs in a large corpus of bioinformatics research articles segmented by IMRaD sections. Building on recent multi-method approaches to scientific MWE identification, we extract MWEs using complementary automated strategies (semantic matching, dependency parsing, controlled vocabularies, and academic formula lists) and compare the resulting inventories by size, form, and IMRaD section distribution. We further quantify cross-document dispersion using document frequency and Gries’ DP to distinguish widely reused expressions from items concentrated in a small subset of articles. Results show that bioinformatics MWEs are predominantly short and nominal, but that extraction methods differ in the extent to which they recover discourse and reporting phraseology. Dispersion is strongly long-tailed across sections with most MWEs being document-specific, while a smaller recurrent core aligns with section function and is enriched for conventional templates and standardized multiword terms. Overall, the findings argue for combining complementary identification methods with dispersion profiling to characterize domain "multiwordness" in a principled and section-sensitive way.
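For readers unfamiliar with the dispersion measure named in this abstract, the following is a minimal sketch of Gries' DP (deviation of proportions) in its standard, unnormalized form: half the sum of absolute differences between the share of an item's occurrences found in each corpus part and that part's share of the corpus. The function name and the toy counts are illustrative assumptions; how the authors define corpus parts (documents or IMRaD sections) and whether they normalize DP is not stated in the abstract.

```python
def gries_dp(item_freqs, part_sizes):
    """Gries' DP: near 0 = evenly dispersed, near 1 = concentrated in few parts.

    item_freqs[i] is the item's frequency in corpus part i;
    part_sizes[i] is the size (e.g. token count) of corpus part i.
    """
    if len(item_freqs) != len(part_sizes):
        raise ValueError("item_freqs and part_sizes must have the same length")
    total_freq = sum(item_freqs)
    total_size = sum(part_sizes)
    if total_freq == 0 or total_size == 0:
        raise ValueError("item frequency and corpus size must be non-zero")
    return 0.5 * sum(
        abs(freq / total_freq - size / total_size)
        for freq, size in zip(item_freqs, part_sizes)
    )


# Toy corpus of four equally sized parts (illustrative numbers only).
even = gries_dp([5, 5, 5, 5], [100, 100, 100, 100])           # spread evenly
concentrated = gries_dp([20, 0, 0, 0], [100, 100, 100, 100])  # one part only
```

In this toy case the evenly spread item scores 0, while the item confined to a single part scores 0.75, the maximum reachable with four equal parts.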
Multiword expressions are an important area of study in linguistics and natural language processing as they represent combinations of words that function as a single unit, and display properties that cannot be predicted fully from their individual components. This paper describes annotated corpora of about 3000 multiword expressions across syntactic categories in Marathi. This is the first exhaustive resource for Marathi which includes both verbal and non-verbal multiwords. In order to develop the guidelines for annotation, we have used the existing literature on the identification and classification of these expressions. Following the PARSEME 2.0 guidelines, we discuss the categories of multiwords and their behaviour in the corpus. Throughout the annotation process, we encounter variability in compositionality and syntactic realization and discuss our design decisions during annotation. Such a dataset will further our understanding of how grammatical structure can be integrated with lexically stored multiword units in Marathi.
Despite recent significant advances, idioms, like other forms of figurative language, present a challenge to natural language processing (NLP). Benchmark corpora are essential for improving the current models on understanding idioms. However, such corpora are only available for a limited set of languages. In this paper, we introduce our ongoing work on a benchmark corpus of Turkish idioms. Our corpus is structured for testing both idiom recognition and idiom understanding. The corpus currently consists of 200 instances with sentences including idiomatic use, their literal paraphrases, similar sentences with no entailment, and non-idiomatic use of the idiomatic expressions when possible. We describe the methodology used to create the corpus, as well as initial experiments with a selection of LLMs.
Diversity patterns run deep: Impact of diversity intake on multiword expression identification
Mathilde Deletombe | Manon Scholivet | Louis Estève | Thomas Lavergne | Agata Savary
Multiword expressions (MWEs) are good examples of a phenomenon where identification systems struggle with generalisation: MWEs present in the test set but absent in the training set are rarely identified. This raises the question of the diversity of the test set, relative to that of the train set, and how this impacts performance. We set out to measure how much the diversity of a train corpus increases when adding individual MWEs from the test corpus, and how this increase impacts MWE identification performance. We measure diversity across a three-dimension framework and find mostly consistent negative correlations with performance in 14 languages and 8 systems.
A Curious Class of Adpositional Multiword Expressions in Korean
Junghyun Min | Na-Rae Han | Jena D. Hwang | Nathan Schneider
Multiword expressions (MWEs) have been widely studied in cross-lingual annotation frameworks such as PARSEME. However, Korean MWEs remain underrepresented in these efforts. In particular, Korean multiword adpositions lack systematic analysis, annotated resources, and integration into existing frameworks. In this paper, we present a study of Korean functional multiword expressions: postpositional verb-based constructions (PVCs). Using data from Korean Wikipedia, we survey and analyze several PVC expressions and contrast them with non-MWEs with similar structure. Building on this analysis, we propose annotation guidelines designed to support future work in Korean multiword adpositions and facilitate alignment with cross-lingual frameworks.
PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation
Nina Hosseini-Kivanani
Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings. We introduced PolyFrame, our system for the MWE-2026 AdMIRe 2 shared task on multimodal idiom disambiguation, featuring a unified pipeline for both image+text ranking (Subtask A) and text-only caption ranking (Subtask B). All model variants retain frozen CLIP-style vision–language encoders and the multilingual BGE M3 encoder, training only lightweight modules: a logistic regression and LLM-based sentence-type predictor, idiom synonym substitution, distractor-aware scoring, and Borda rank fusion. Starting from a CLIP baseline (26.7% Top-1 on English dev, 6.7% on English test), adding idiom-aware paraphrasing and explicit sentence-type classification increased performance to 60.0% Top-1 on English, and 60.0% Top-1 (0.822 NDCG@5) in zero-shot transfer to Portuguese. On the multilingual blind test, our systems achieved average Top-1/NDCG scores of 0.35/0.73 for Subtask A and 0.32/0.71 for Subtask B across 15 languages. Ablation results highlight idiom-aware rewriting as the main contributor to performance, while sentence-type prediction and multimodal fusion enhance robustness. These findings suggest that effective idiom disambiguation is feasible without fine-tuning large multimodal encoders.
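The Borda rank fusion mentioned in this abstract can be sketched as follows. This is a generic Borda count and not necessarily the exact variant used in PolyFrame: each ranker awards a candidate n - 1 points for first place down to 0 for last, and candidates are re-ranked by total points. The function name, the candidate labels, and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import defaultdict


def borda_fuse(rankings):
    """Fuse several rankings (each a list of candidates, best first) by Borda count."""
    points = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            # Best position earns n - 1 points, last position earns 0.
            points[candidate] += n - 1 - position
    # Highest total first; ties broken alphabetically (an arbitrary assumption).
    return sorted(points, key=lambda c: (-points[c], c))


# Three hypothetical rankers ordering the same three image candidates.
fused = borda_fuse([
    ["img_a", "img_b", "img_c"],
    ["img_b", "img_a", "img_c"],
    ["img_a", "img_c", "img_b"],
])
```

Here `img_a` collects 5 points, `img_b` 3, and `img_c` 1, so the fused order is `img_a`, `img_b`, `img_c`.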
IdiomRanker-X at MWE-2026 AdMIRe 2: Multilingual Idiom-Image Alignment via Low-Rank Adaptation of Cross-Encoders
Mehmet Utku Colak
This paper describes the system submitted for the MWE 2026 Shared Task (AdMIRe 2.0 Subtask A). The submission focused on a text-centric approach, reframing the idiom-image alignment task as a sentence-pair classification problem using mBERT (Multilingual BERT). The submitted system relied on full fine-tuning using only the English training data, achieving a Top-1 Accuracy of approximately 0.30 on the blind test set. Following the evaluation phase, significant limitations were identified in the cross-lingual generalization of the base model. In a post-evaluation study, the backbone was upgraded to XLM-RoBERTa-Large-XNLI, incorporating Low-Rank Adaptation (LoRA) and utilizing the full multilingual dataset with hard negative mining. These improvements boosted the accuracy to 0.41, demonstrating the necessity of NLI-specific pre-training and parameter-efficient tuning for MWE-aware multimodal tasks.
alexandru412 at MWE-2026 AdMIRe 2.0: Advancing Multimodal Idiomaticity Representation
Cristea Alexandru-Marian
This paper presents the system developed by team alexandru412 for the AdMIRe 2.0 Shared Task. We participated in the Text-Only track, ranking images based on idiomatic usage without accessing pixel data. Our approach combines a strict list-wise ranking strategy with systematic test-time augmentation. We fine-tuned a Large Language Model (LLM) on English and Portuguese data and relied on zero-shot transfer for other languages. Our system achieved the 3rd place in the Text-Only track.
BeeParser at MWE-2026 PARSEME 2.0 Subtask 1: Can Cross-Lingual Interactions Improve MWE Identification?
Ahmet Erdem | Oguzhan Karaarslan
This paper describes a multilingual system for automatic multiword expression identification for PARSEME 2.0 Subtask 1. We formulate MWE identification as a token-level sequence labeling problem using a BIO tagging scheme and fine-tune XLM-RoBERTa-base on PARSEME 2.0. We mainly investigate cross-lingual interactions on language pairs, and test hypotheses whether using a given language pair for training improves MWE detection performance on both or one of the languages. Then, we apply selected successful language pairs on PARSEME 2.0 MWE Identification task. Experiments are conducted independently for a subset of the languages given in PARSEME 2.0, for a total of 8 languages. Our approach achieves strong token-based and span-based F1 scores across diverse languages, and we observe that training with even distant language pairs may result in improvement on at least one of the languages. We publish our code at https://github.com/ahmeterdem1/parseme-blg505
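To make the BIO formulation in this abstract concrete, here is a small sketch that converts MWE spans into token-level B/I/O tags. It is an illustrative encoding only and is not taken from the authors' repository: the function name and the (start, end) span format are assumptions, and it handles contiguous, non-overlapping spans only, whereas PARSEME data also contains discontinuous MWEs.

```python
def spans_to_bio(num_tokens, spans):
    """Encode contiguous MWE spans as BIO tags.

    spans is a list of (start, end) token indices, end exclusive.
    B marks the first token of an MWE, I a continuation, O everything else.
    """
    tags = ["O"] * num_tokens
    for start, end in spans:
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"invalid span: {(start, end)}")
        tags[start] = "B"
        for index in range(start + 1, end):
            tags[index] = "I"
    return tags


tokens = ["It", "turned", "out", "well"]
# Hypothetical annotation: "turned out" (tokens 1-2) is a verbal MWE.
tags = spans_to_bio(len(tokens), [(1, 3)])
```

A sequence labeler fine-tuned on such tags predicts one label per token, and MWE spans are then recovered by grouping each B with the I tags that follow it.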
VisAffect at MWE-2026 AdMIRe 2: IMMCAN Idiom Multimodal Cross-Attention Network
Barış Bilen | Ali Azmoudeh | Hazım Kemal Ekenel | Hatice Kose
We address AdMIRe 2.0, a static image ranking task where a sentence containing a potentially idiomatic expression is paired with five image–caption candidates, and the goal is to rank the candidates by semantic compatibility with the intended idiomatic or literal meaning. We propose IMMCAN, which keeps XLM-R and Jina-CLIP-v2 frozen and learns a lightweight two-stage cross-attention fusion, caption–image grounding followed by idiom-to-multimodal conditioning, to predict a compatibility score per candidate. We also evaluate caption-only augmentation via back-translation and synonym substitution, and compare regression and rank-class formulations. On AdMIRe 1.0, text-only achieves higher test top-image accuracy than VLM-grounded modeling. In contrast, on AdMIRe 2.0 zero-shot, adding visual patch grounding improves both accuracy and NDCG, indicating better cross-lingual ranking transfer.
Sahara Tokenizers at PARSEME 2.0 Subtask 1: Combining Contextual Embeddings with Structural Decoding for Multi-Word Expression Detection
Yunus Karatepe | Mert Sülük | Zeynep Tuğçe Kırımlı | Begüm Özbay
Multi-Word Expressions (MWEs) pose a significant challenge for natural language processing systems due to their idiosyncratic semantic and syntactic properties. This paper describes our system for the PARSEME 2.0 Shared Task on automatic identification of verbal MWEs across 17 typologically diverse languages. Our approach combines multilingual BERT with explicit Part-of-Speech (POS) feature injection through a dual-head architecture that jointly performs BIO-based identification and category classification. We further investigate extensions, including Conditional Random Field (CRF) decoding for structured prediction, focal loss for addressing class imbalance, and model ensembling for improving discontinuous MWE detection. Our official submission achieves a global MWE-based F1 score of 48.39%, securing second place in the shared task. Ablation studies reveal a strong synergy between POS features and CRF decoding, with the combined approach yielding the best single-model performance. Furthermore, ensembling models trained with different objectives improves both overall F1 score and discontinuous MWE scores, demonstrating the importance of training diversity for capturing non-adjacent syntactic patterns.
3K2T at MWE-2026 AdMIRe 2: CARIM– Category-Aware Reasoning for Idiomatic Multimodality
Kubilay Kağan Kömürcü | Tugce Temel
Idiomatic expressions pose a fundamental challenge for multimodal understanding due to their non-compositional semantics, while pretrained vision–language models tend to over-rely on literal visual alignments. We address this issue in the context of the AdMIRe 2.0 multimodal idiomatic image ranking task by introducing CARIM (Category-Aware Reasoning for Idiomatic Multimodality), an inference-time framework that injects structured semantic reasoning without end-to-end retraining. Experiments on the official Codabench leaderboard demonstrate that CARIM achieves competitive Top-1 Accuracy and nDCG across multiple languages. Additional post-competition evaluation on the released test annotations further shows that CARIM maintains robust multilingual performance, highlighting the effectiveness of inference-time category-aware reasoning for multimodal idiomatic grounding.
PMI MWE Scorer at PARSEME 2.0 Subtask 1: identifying multi-word expressions using pointwise mutual information and universal dependencies
Anna Bogdanova | Ileana Bucur
Multi-word expressions (MWEs) remain a challenge for NLP systems due to their syntactic variability and non-compositional semantics, which is why the problem was proposed as a shared task within the Unidive organization. With the increasing popularity of large language models (LLMs), it is important to continue researching alternative solutions. One classical approach to identifying MWEs is to calculate pointwise mutual information (PMI), but this is a purely statistical approach that cannot reveal the links between words in natural text. To address this issue, we propose a simple syntax-aware PMI method that leverages Universal Dependencies (UD) trees (Nivre et al., 2016) to model co-occurrence between syntactically related words. By computing PMI over dependency-linked word pairs and aggregating these scores, we aim to improve on surface-based methods. Contrary to expectations, our experiment shows that the classical statistical approach gets better results in identifying MWEs partially. Still, this approach aims to strike a balance between lightweight computation, as opposed to LLMs, and precision of results.
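As a rough illustration of PMI computed over dependency-linked word pairs, the following minimal sketch counts head-dependent pairs instead of adjacent words. It is not the authors' implementation: the input format (a list of `(form, head_index)` tuples per sentence, with 1-based heads and 0 for the root, as in CoNLL-U) and the function name `dependency_pmi` are assumptions made for this example, and the aggregation of pair scores into MWE candidates is not covered.

```python
import math
from collections import Counter


def dependency_pmi(sentences):
    """Return PMI (log base 2) for every (head, dependent) pair.

    `sentences` is a list of sentences; each sentence is a list of
    (form, head_index) tuples, where head_index is 1-based and 0 marks
    the root. This input format is an assumption of the sketch.
    """
    pair_counts = Counter()
    head_counts = Counter()
    dep_counts = Counter()
    for sent in sentences:
        for form, head in sent:
            if head == 0:  # the root has no head word to pair with
                continue
            head_form = sent[head - 1][0]
            pair_counts[(head_form, form)] += 1
            head_counts[head_form] += 1
            dep_counts[form] += 1

    total = sum(pair_counts.values())
    scores = {}
    for (head_form, form), count in pair_counts.items():
        # PMI = log2( p(h, d) / (p(h) * p(d)) ), all estimated over edges
        scores[(head_form, form)] = math.log2(
            count * total / (head_counts[head_form] * dep_counts[form])
        )
    return scores


# Toy corpus: "kick the bucket", "kick the ball", "fill the bucket"
toy = [
    [("kick", 0), ("the", 3), ("bucket", 1)],
    [("kick", 0), ("the", 3), ("ball", 1)],
    [("fill", 0), ("the", 3), ("bucket", 1)],
]
scores = dependency_pmi(toy)
```

Because the counts are taken over dependency edges, a discontinuous pair such as a verb and its object separated by a determiner is scored the same way as an adjacent pair.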
tiberiucarp at MWE-2026 AdMIRe 2: GLIMMER-Gloss-based Image Multiword Meaning Expression Ranker
Andrei Tiberiu Carp
Multiword expressions (MWEs), particularly idioms, pose persistent challenges for vision-language systems due to their non-compositional semantics and culturally grounded meanings. This paper presents GLIMMER, a three-stage hybrid ranking system that evaluates how well images express the intended meaning of MWEs across 15 languages. Our approach uses LLM-generated semantic glosses as multilingual meaning anchors, combined with dual-path embedding scoring (textual captions and visual features), and LLM-based semantic verification. Evaluated on the AdMIRe shared task benchmark, GLIMMER achieves competitive performance across diverse languages without relying on parallel training data or language-specific resources. The results show that using glosses to anchor meaning helps match idioms with images across languages and modalities, and that combining retrieval with reasoning is more robust than using embeddings alone.
IPN at MWE-2026 PARSEME 2.0 Subtask 1: MWE Identification via Related Languages and Harnessing Thinking Mode
Anna Hülsing | Noah-Manuel Michael | Daniel Mora Melanchthon | Andrea Horbach
We present IPN, our system for Subtask 1 of the PARSEME 2.0 Shared Task, which targets the identification of MWEs in 17 languages. Overall, IPN outperformed a much larger-parameter baseline model, yet a performance gap to the top-performing systems remains. To better understand these results, we investigate Qwen3-32B’s suitability for mono-, cross- and multilingual MWE identification. We also explore whether this model benefits from prepending automatically generated thinking data to the gold label during instruction-tuning. We find that target language data is vital for instruction-tuning. Prepending generated thinking data to a subset of the training data slightly improves performance for two out of three languages, but more detailed evaluation is required.
Semantic Stars at MWE-2026 PARSEME 2.0 Subtask 2: Alternative Approaches for MWE Paraphrasing
Elif Bayraktar | Vedat Doğancan | Muhammed Abdullah Gümüş | Nusret Ali Kızılaslan
This paper describes the system submitted by the Semantic Stars team for Subtask 2 of the PARSEME 2.0 shared task (Paraphrasing Multiword Expressions). Our approach addresses the challenge of paraphrasing sentences containing MWEs such that the MWE is removed while the original meaning and grammatical structure are preserved. The paper describes multiple distinct approaches powered by open-weight Large Language Models (LLMs), each employing a combination of different techniques such as prompting, multi-agent pipelines and classical NLP methods. Four distinct methods are tested on the French test data, along with a fifth that combines the results of the first four. We tested several different open-weight LLMs, including Llama3.1:8b, Qwen3:8b and gpt-oss-120b, and were able to achieve significant improvements over the baseline, securing first place on the shared task leaderboard.
MorphoFiltered-Gemini at MWE-2026 PARSEME 2.0 Subtask 1: Tackling LLM Overgeneration via Universal POS-based Constraints
Irina Moise | Sergiu Nisioi
This paper describes MorphoFiltered-Gemini, a multilingual system submitted to the PARSEME 2.0 shared task on multiword expression (MWE) identification. The system relies on Google Gemini 2.0 Flash-Lite to generate MWE predictions using zero-shot and selectively applied few-shot prompting, without fine-tuning or language-specific resources. To reduce the tendency of large language models to over-generate MWEs, we introduce a lightweight morphological post-filter that removes unlikely constructions while preserving high-precision patterns. Rather than optimizing peak performance for individual languages, our approach prioritizes precision and cross-lingual robustness. As a result, the system exhibits stable behavior across 17 typologically diverse languages and achieves the highest Shannon evenness score among all submitted systems. The experimental results highlight a clear trade-off between recall-oriented LLM prompting strategies and precision-oriented filtering, and show that simple linguistic constraints can effectively improve the stability of LLM-based multilingual MWE identification systems.
LST at MWE-2026 AdMIRe 2: Advancing Multimodal Idiomaticity Representation
Le Qiu | Yu-Yin Hsu | Emmanuele Chersoni
This paper presents our methods for the AdMIRe 2.0 shared task, which addresses multilingual and multimodal idiom understanding. Our submission focuses on the text-only track. Specifically, we employ an ensemble of three large language models (LLMs) to directly perform the presented image ranking task. Each model independently produces a ranking of the candidate images, and we aggregate their outputs using a hard voting strategy to determine the final prediction. This ensemble learning framework leverages the complementary strengths of different LLMs, improving robustness and reducing the variance of individual model predictions.
UniBO at MWE-2026 PARSEME 2.0 Subtask 2: A Cross-lingual Approach to Multiword Expression Paraphrasing
Debora Ciminari | Alberto Barrón-Cedeño
This paper describes MISP (Multilingual Idiomatic Sentence Paraphrasing), a system submitted to the PARSEME 2.0 Multilingual Shared Task on Identification and Paraphrasing of Multiword Expressions (MWEs). We participated in Subtask 2 on MWE paraphrasing and developed our system based on Qwen3-4B-Instruct fine-tuned on synthetic Portuguese MWE paraphrases. We applied MISP not only to Portuguese, but also to French and Romanian, aiming to leverage cross-lingual transfer within related languages, with ours being the only submission for Portuguese. Our results indicate that MISP struggles to generate paraphrases that both rephrase and preserve the original meaning of the MWE. Additionally, instruction fine-tuning does not appear to improve performance. Overall, our findings highlight the challenges of paraphrasing MWEs, particularly in a cross-lingual setting.
DCSN-NLP at MWE-2026 AdMIRe 2: Bridging Literal and Figurative Meaning Through Hierarchical Multimodal Reasoning
David Cotigă | Sergiu Nisioi
This paper presents our system for the MWE-2026 AdMIRe 2.0 shared task, which aimed to advance multimodal idiomatic understanding across 15 languages. We address the task of selecting, from a set of five images, the one that best represents either the literal or idiomatic meaning of a given compound in context. Our approach follows a multi-step pipeline: a large language model (LLM) first determines whether the compound is used literally or idiomatically and generates auxiliary text, consisting of an idiomatic meaning explanation and a visual description of the literal meaning. An ensemble of three CLIP models then identifies the two images most semantically similar to the appropriate generated text via a voting mechanism. Finally, the LLM selects the best image from these two candidates.
ITUNLP at MWE-2026 AdMIRe 2: A Zero-Shot LLM Pipeline for Multimodal Idiom Understanding and Ranking
Atakan Site | Oğuz Ali Arslan | Gülşen Eryiğit
This paper presents our system for AdMIRe 2 (Advancing Multimodal Idiomaticity Representation), a shared task on multilingual multimodal idiom understanding. The task focuses on ranking images according to how well they depict the literal or idiomatic usage of potentially idiomatic expressions (PIEs) in context, across 15 languages and two tracks: a text-only track, and a multimodal track that uses both images and captions. To tackle both tracks, we propose a hybrid zero-shot pipeline built on large vision–language models (LVLMs). Our system employs a chain-of-thought prompting scheme that first classifies each PIE usage as literal or idiomatic and then ranks candidate images by their alignment with the inferred meaning. A primary–fallback routing mechanism increases robustness to safety-filter refusals, while lightweight post-processing recovers consistent rankings from imperfect model outputs. Without any task-specific fine-tuning, our approach achieves 55.9% Top-1 Accuracy in the text-only track and 60.1% in the multimodal (text+image) track, ranking first overall on the official leaderboard. These results suggest that carefully designed zero-shot LVLM pipelines can provide strong baselines for multilingual multimodal idiomaticity benchmarks.
Archaeology at MWE-2026 PARSEME 2.0 Subtask 1 and 2: Parsing is for Encoders, Paraphrasing is for LLMs
Rares-Alexandru Roscan | Sergiu Nisioi
This paper presents our approach to the PARSEME 2.0 Shared Task on Romanian, covering both Identification (Subtask 1) and Paraphrasing (Subtask 2). While Large Language Models (LLMs) excel at semantic generation, we hypothesize that they lack the structural precision required for MWE identification, leading to "boundary hallucinations" that compromise downstream simplification. Our Rank 1 results on Romanian confirm this: a specialized encoder (RoBERT) using standard sequence labeling outperforms both few-shot LLMs and complex structural parsers (MTLB-STRUCT). This justifies our proposed pipeline: using encoders as precise “pointers” to guide the generative power of LLMs.
ITUNLP2 at MWE-2026 AdMIRe 2: Modular Zero-Shot Pipelines for Multimodal Idiom Grounding and Ranking
Özge Umut | Bora Şenceylan
We describe a zero-shot system for AdMIRe 2.0, a shared task on multimodal understanding of potentially idiomatic expressions (PIEs). Given a context sentence with a PIE and five candidate images, the system predicts whether the usage is literal or idiomatic and ranks images by how well they match the intended meaning. We use closed-source large multimodal models and compare prompting pipelines from direct one-step ranking to modular multi-step pipelines that separate sense prediction, PIE-focused image semantics, and final ranking. All steps produce constrained JSON outputs to enable deterministic parsing and composition. In the official AdMIRe 2.0 evaluation on CodaBench, our best pipeline achieves an average Top-1 accuracy of 0.52 and an average nDCG score of 0.70 across the 12 languages we submitted. We obtain the best score among submitted systems in 10 of these languages.
Edition 2.0 of the PARSEME shared task on multilingual identification and paraphrasing of multiword expressions
Manon Scholivet | Agata Savary | Carlos Ramisch | Eric Bilinski | Takuya Nakamura | Maria Mitrofan | Vasile Pais
Multiword expressions (MWEs) have been a major challenge in NLP for decades, and research on MWEs has been driven notably by shared tasks, including those organized by the PARSEME community. We report the organisation and the results of edition 2.0 of the PARSEME shared task. For the first time, all syntactic categories are covered: verbal, nominal, adjectival, adverbial and functional. We rely on edition 2.0 of the PARSEME corpus, annotated for all these categories in 17 languages. We create a new dataset with paraphrases of sentences containing idioms in 14 languages, and define a new subtask dedicated to MWE paraphrasing. We extend our evaluation protocol by measuring both performance and diversity of systems, and by including manual evaluation in paraphrasing. Ten systems, including the baseline, participated in the MWE identification subtask and five in the paraphrasing subtask. Results are promising, but known MWE identification challenges remain unsolved. Performance correlates positively with diversity in MWE identification, and negatively in MWE paraphrasing.
MWE-2026 Shared Task: AdMIRe 2 Advancing Multimodal Idiomaticity Representation
Doğukan Arslan | Rodrigo Wilkens | Wei He | Dilara Torunoglu Selamet | Thomas Pickard | Aline Villavicencio | Adriana Silvina Pagano | Gülşen Eryiğit
Idiomatic expressions present a unique challenge in NLP, as their meanings are often not directly inferable from their constituent words. Despite recent advancements in large language models, idiomaticity remains a significant obstacle to robust semantic representation. We present datasets and task results for MWE-2026 Shared Task 2: Advancing Multimodal Idiomaticity Representation 2 (AdMIRe 2), which challenges the community to assess and improve models’ ability to interpret idiomatic expressions in multimodal contexts across multiple languages. Participants competed in an image ranking task in which, for each item, systems receive a context sentence containing a potentially idiomatic expression (PIE) and five candidate images. Participating systems are required to predict the sentence type (i.e., idiomatic vs. literal) for the given context and rank the images by how well they depict the intended meaning in that context. Among the participating systems the most effective methods include pipelines utilizing closed-source commercial models such as Gemini 2.5 and GPT-5, and employing chain-of-thought reasoning strategies. Methods to mitigate language models’ bias towards literal interpretations and ensembles to smooth out variance were common.
Proceedings of the 6th International Conference on Natural Language Processing for the Digital Humanities
Sil Hamilton | Emily Öhman | Rebecca M. M. Hicke | Yuri Bizzoni | Axel Bax | Jacob A. Matthews | Mika Hämäläinen
From OCR to Analysis: Tracking Correction Provenance in Digital Humanities Pipelines
Haoze Guo | Ziqi Wei
Optical Character Recognition (OCR) is a critical but error-prone stage in digital humanities text pipelines. While OCR correction improves usability for downstream NLP tasks, common workflows often overwrite intermediate decisions, obscuring how textual transformations affect scholarly interpretation. We present a provenance-aware framework for OCR-corrected humanities corpora that records correction lineage at the span level, including edit type, correction source, confidence, and revision status. Using a pilot corpus of historical texts, we compare downstream named entity extraction across raw OCR, fully corrected text, and provenance-filtered corrections. Our results show that correction pathways can substantially alter extracted entities and document-level interpretations, while provenance signals help identify unstable outputs and prioritize human review. We argue that provenance should be treated as a first-class analytical layer in NLP for digital humanities, supporting reproducibility, source criticism, and uncertainty-aware interpretation.
The "law of conformity," the finding that frequent words are semantically stable, has been treated as a broad regularity of language change. We show it does not hold for Korean. Using diachronic word embeddings trained on historical corpora spanning 500 years (15th–20th centuries), we find a robust positive correlation between frequency and semantic shift: high-frequency Korean words change more, not less. The pattern survives six robustness controls and is validated against an English replication. Partial correlation analysis reveals that the role of polysemy in mediating the frequency–change relationship is not fixed but depends on time resolution and corpus homogeneity. We connect the reversal to frequency-driven reductive processes, including grammaticalization, semantic bleaching, and domain shift, that are especially productive in Korean. The frequency–change relationship is not a fixed regularity but varies with language typology and analytical conditions.
Narrative Landscape: Mapping Narrative Dispositions Across LLMs
Donghoon Jung | Jiwoo Choi | Songeun Chae | Seohyon Jung
This study proposes a quantitative framework for profiling LLM dispositions as stable, model-specific regularities in output under repeated, controlled elicitation. Using a structured narrative constraint-selection task administered across six frontier models and three instruction types, we operationalize disposition through two dimensions: "consistency", measured as cross-replication selection overlap via Jaccard similarity, and "diversity", measured as dispersion across options via the inverse Simpson index. We further introduce Narrative Landscape, a PCA-based visualization that maps each model’s selection profile into a shared space for direct comparison. Results reveal a clear rigidity–exploration spectrum across model families and show that instruction types shift the geometry of selection spaces even when scalar metrics appear similar, indicating that comparable scores can mask qualitatively distinct selection topologies.
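The two scalar measures named in this abstract have standard definitions, which the following minimal sketch illustrates. It assumes that each replication yields a set of selected options and that "consistency" is the mean Jaccard similarity over all pairs of replications; the paper's exact aggregation may differ, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mean_pairwise_jaccard(runs):
    """Mean Jaccard similarity |A ∩ B| / |A ∪ B| over all pairs of runs.

    `runs` is a list of sets, one set of selected options per replication.
    """
    sims = [len(a & b) / len(a | b) for a, b in combinations(runs, 2) if a | b]
    if not sims:
        raise ValueError("need at least two non-empty runs")
    return sum(sims) / len(sims)


def inverse_simpson(selections):
    """Inverse Simpson index 1 / sum(p_i^2) over the selected options.

    Ranges from 1 (a single option always chosen) up to the number of
    distinct options (all chosen equally often).
    """
    counts = Counter(selections)
    total = sum(counts.values())
    return 1.0 / sum((c / total) ** 2 for c in counts.values())


# Three replications of the same prompt, each selecting two options
runs = [{"a", "b"}, {"a", "c"}, {"a", "b"}]
consistency = mean_pairwise_jaccard(runs)  # (1/3 + 1 + 1/3) / 3
diversity = inverse_simpson(["a", "a", "b", "c"])  # 1 / 0.375
```

A model that always picks the same options scores high on consistency and low on diversity, which is the rigid end of the spectrum the abstract describes.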
We present a new publicly available corpus of 100,502 movie reviews from Kazakhstan collected from kino.kz, spanning 2001–2025 and covering 4,943 unique titles. The dataset is multilingual, consisting mainly of Russian reviews alongside Kazakh and code-switched texts. Reviews are manually annotated for language and sentiment polarity, and 11,309 reviews additionally contain explicit user-provided ratings. We define two sentiment tasks—three-way polarity classification and five-class score classification—and benchmark classical BoW/TF–IDF baselines against multilingual transformer models (mBERT, XLM-RoBERTa, RemBERT). Experimental results show that transformer models consistently outperform classical baselines on polarity classification, while score classification remains challenging under leakage-controlled evaluation due to severe class imbalance and subtle distinctions between adjacent rating levels.
Quantifying Text Reuse Across Three Kṛṣṇa Yajurveda Recensions: Using Multi-Algorithm Computational Collation
So Miyagawa | Kyoko Amano | Yuzuki Tsukagoshi | Yuki Kyogoku
The Kṛṣṇa Yajurveda survives in multiple recensions that share substantial ritual content, yet the degree and distribution of textual overlap across recensions have never been quantified systematically. This paper presents a computational analysis of text reuse across three recensions—the Maitrāyaṇī Saṃhitā (MS), the Kāṭhaka Saṃhitā (KS), and the Taittirīya Saṃhitā (TS)—for two ritual sections (Agnyupasthāna and Punarādhāna), using ICoMa (Intertextuality Collation Machine), a new web-based multi-algorithm collation tool. Five independent similarity algorithms consistently rank MS–KS as the most closely related pair, corroborating the philological consensus. Crucially, the two ritual sections exhibit strikingly different reuse profiles: Punarādhāna shows near-identical MS–KS overlap (up to 93.5%) with sharp divergence from TS, while Agnyupasthāna displays moderate, broadly distributed similarity across all three pairs. These contrasting patterns provide quantitative evidence that different ritual categories followed distinct paths of textual transmission within the Yajurvedic tradition. ICoMa and the experimental data are freely available.
Data Contamination in Neural Hieroglyphic Translation: A Reproducibility Study
Ammar Toutou | Abdelrahman Harb | Christine Basta
Ancient and endangered languages pose a unique challenge for NLP: their datasets are inherently scarce, difficult to expand, and built from formulaic corpora—making data-quality issues especially consequential yet rarely audited. Motivated by the need to understand what current NMT can realistically achieve for such languages, we investigate hieroglyphic-to-German translation, where a recent study reported 61.5 BLEU using fine-tuned M2M-100. Our reproduction yields only 37.0 BLEU with the released model. Investigating this gap, we find 32% of test targets appear identically in training (16/50; 50% under 8-gram overlap at 70% threshold). This contamination inflates scores dramatically: contaminated samples achieve up to 83.8 BLEU / 0.924 COMET-22 versus 30.9–39.2 BLEU / 0.622–0.676 COMET-22 on clean samples across five model configurations spanning two architectures. Document-level decontamination reduces contaminated BLEU by only 4.6 points because 8/16 targets persist via other source documents—target-level deduplication is required. We release a decontaminated 34-sample test set and establish corrected baselines (30.9–39.2 BLEU), providing a realistic assessment of NMT capability for this endangered writing system.
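A target-level contamination check of the kind described here can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' audit script: tokens are split on whitespace, the n-gram criterion is read as "at least `threshold` of the test target's n-grams occur somewhere in the training targets", and the function names are hypothetical.

```python
def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_report(train_targets, test_targets, n=8, threshold=0.7):
    """Return indices of test targets that leak from the training targets.

    exact: the test target appears identically in the training targets.
    fuzzy: at least `threshold` of its n-grams occur in the training targets
           (targets shorter than n tokens have no n-grams and are skipped).
    """
    train_set = set(train_targets)
    train_ngrams = set()
    for target in train_targets:
        train_ngrams |= ngrams(target.split(), n)

    exact, fuzzy = [], []
    for i, target in enumerate(test_targets):
        if target in train_set:
            exact.append(i)
        grams = ngrams(target.split(), n)
        if grams and len(grams & train_ngrams) / len(grams) >= threshold:
            fuzzy.append(i)
    return exact, fuzzy


# Small example with bigrams so the overlap is easy to follow by hand
train = ["a b c d", "x y z"]
test = ["a b c d", "a b c e", "p q r"]
exact, fuzzy = contamination_report(train, test, n=2, threshold=0.5)
```

Comparing targets rather than source documents matters for the reason the abstract gives: the same target sentence can reach the training set through a different source document, so removing overlapping documents alone leaves it in place.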
Beyond Prompt-Sensitive Emotion Words: Stable Embeddings for Tang Poetry Analysis
Linyue Zhang | Feiyue Li
Many Tang-poetry emotion studies still rely on coarse labels (e.g., positive/negative), while recent LLM-based attempts face a practical problem: one-word emotion outputs are highly sensitive to prompt wording. When labels shift with phrasing, historical interpretation becomes hard to reproduce and hard to trust. Focusing on Tang poetry around the An Lushan Rebellion (安史之乱), we propose a fine-grained sentence-level workflow centered on emotion embeddings: we use continuous hidden-state vectors, run automatic clustering, and then consolidate labels for interpretation. On the same 3,198 emotional sentences, one-word outputs show only 50.3% A/B exact agreement, while embedding-based clustering remains stable and well distributed (Hnorm=0.989; 20/20 active clusters). On 7,195 labeled sentences, a char-based baseline reaches 0.446 micro-F1 and 0.395 macro-F1. This multi-stage label-construction path supports historically grounded findings, including the emotional turning point around 762, and also reveals layered patterns that are less visible in coarse setups. These results suggest that stable representation is a prerequisite for turning computational outputs into credible evidence for humanities interpretation.
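The reported Hnorm value reads like a normalized Shannon entropy over cluster sizes. The sketch below shows that standard quantity, assuming normalization by the logarithm of the total number of clusters (20 in the abstract); whether the authors normalize this way is not stated, and the function name is hypothetical.

```python
import math


def normalized_entropy(cluster_sizes):
    """Shannon entropy of the cluster-size distribution divided by log(K).

    K is the total number of clusters, including empty ones. The result is
    1.0 when all clusters are equally large and 0.0 when one cluster holds
    every item.
    """
    k = len(cluster_sizes)
    total = sum(cluster_sizes)
    if k < 2 or total == 0:
        return 0.0
    probs = [size / total for size in cluster_sizes if size > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return entropy / math.log(k)


balanced = normalized_entropy([5, 5, 5, 5])
collapsed = normalized_entropy([20, 0, 0, 0])
```

A value close to 1 with every cluster active, as reported, indicates that sentences are spread almost evenly across clusters instead of collapsing into a few dominant ones.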
Measuring Embedding Sensitivity to Authorial Style in French: Comparing Literary Texts with Language Model Rewritings
Benjamin Icard | Lila Sainero | Alice Breton | Evangelia Zve | Jean-Gabriel Ganascia
Large language models (LLMs) can convincingly imitate human writing styles, yet it remains unclear how much stylistic information is encoded in embeddings from any language model and retained after LLM rewriting. We investigate these questions in French, using a controlled literary dataset to quantify the effect of stylistic variation via changes in embedding dispersion. We observe that embeddings reliably capture authorial stylistic features and that these signals persist after rewriting, while also exhibiting LLM-specific patterns. These analytical results offer promising directions for authorship imitation detection in the era of language models.
Prompting the Past: Linguistic Transformations and Cultural Accuracy in AI-Generated Image Reconstructions for Multivocal Cultural Heritage
Ravini Wimalasuriya | Lea Krause | Gert-Jan Burgers
This research explores the intersection of cultural heritage and Generative AI (Gen-AI), examining AI-generated historical image reconstructions as a potential tool for visualising multiple perspectives in heritage interpretation. In critical heritage studies, the concept of multivocality or polyvocality advocates for representing diverse, often underrepresented, perspectives in how heritage is understood and communicated. We evaluated three prominent AI image generation models across three heritage test cases. A total of 13 user prompts generated 39 images, which underwent both linguistic analysis of intermediate prompt transformations and systematic visual assessment by heritage experts for historical accuracy and cultural sensitivity. The findings revealed both strengths and limitations of the models. While the models produced visually compelling outputs and, in some cases, meaningfully distinct depictions across perspectives, they also exhibited representation imbalances, neutralisation and amplification tendencies, inconsistencies in human portrayal, and misinterpretations introduced during the linguistic transformation of user inputs. Based on these findings, we propose initial guidelines for structured prompt construction that target the specific failure patterns identified. The research suggests that generative AI could serve as a supplementary tool, not a definitive historical source, for exploring multivocal heritage interpretation, particularly in museum and visitor engagement contexts, provided it is used critically and in conjunction with expert input.
Languages change over time. Computational models can be trained to recognize such changes, enabling them to estimate the publication date of texts. Despite recent advancements in Large Language Models (LLMs), their performance on automatic dating of texts, also known as Temporal Text Classification (TTC), has not been explored. This study provides the first systematic evaluation of leading proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) LLMs on TTC using three historical corpora, two in English and one in Portuguese. We test zero-shot prompting, few-shot prompting, and fine-tuning settings. Our results indicate that proprietary models perform well, especially with few-shot prompting. They also indicate that fine-tuning substantially improves open-source models, but that these models still fail to match the performance delivered by proprietary LLMs.
Evaluating Latin and Ancient Greek Sentence Alignment through Parallel Sentence Mining
Sebastian Reichbauer | Shu Okabe | Alexander Fraser
Cross-lingual detection of intertextuality and translation in Latin and Ancient Greek through computational approaches is of great interest for classical studies. While several systems exist for parallel sentence detection, based on general multilingual or specific models for Latin–Ancient Greek, they have not been compared against each other. Therefore, we present a synthetic benchmark to evaluate the performance of language models regarding cross-lingual Ancient Greek and Latin parallel sentence mining. We first compare six language models to encode sentences and then further improve the cross-lingual alignment through post-processing, fine-tuning, and knowledge distillation. We find that the whitening transformation in combination with knowledge distillation provides excellent results. Specifically, SPhilBERTa, a trilingual language model for Ancient Greek and Latin, benefits the most from the improvements and achieves a substantial mining score of 97.6 on our benchmark.
Modeling the "Dalet" Clitic in Historical Hebrew Texts: A New Prefix-Segmented BERT Model and Stylistic Analysis
Rachel Tal | Cheyn Shmuel Shmidman | Avi Shmidman
The Aramaic proclitic *dalet*, widely used in historical Hebrew texts, serves two distinct grammatical functions: as a subordinating conjunction and as a possessive preposition. Because these functions are orthographically identical and no annotated resources exist for this task, large-scale computational analysis of their usage has previously been infeasible. In this paper we introduce a new BERT model for historical Hebrew in which all prefixes are segmented and encoded as independent tokens. This representation allows the model to evaluate proclitics directly and provides a probe-based unsupervised method for determining the grammatical role of the *dalet* clitic using masked language modeling predictions. We evaluate the approach on a manually annotated dataset drawn from historical Hebrew literature spanning multiple regions and historical periods, achieving an average F1 score of over 0.89. Applying the method to a corpus of more than 300 million words of historical Hebrew texts, we conduct large-scale stylistic analyses of the choice between the Aramaic *dalet* and available Hebrew alternatives. The results reveal geographic and diachronic trends and identify distinct stylistic clusters within the corpus. The prefix-segmented model and annotated dataset are released for unrestricted use.
Beyond Genre Categories: How Narrative Pattern Coherence and Spanning Distance Shape Film Success
Zhichao Wang | Zeyu Lyu
Prior research on cultural markets has relied on genre labels to distinguish products, overlooking the specific content features that differentiate films within the same genre. We address this gap using tropes as building blocks of narrative structure. From a dataset of 30k tropes across 18k films (TVTropes.org), we identify 29 narrative patterns via community detection and characterize each film by two measures: coherence (how concentrated its tropes are within a few patterns) and spanning distance (how far apart the patterns it combines are). Regression analyses show that coherence improves both audience evaluations and attention, while spanning distance increases evaluations but reduces attention. These findings extend category-spanning theory from genre labels to the internal narrative composition of films, demonstrating how stories are constructed and shape audience responses.
Register Mixing Is the Norm on the Web
Erik Henriksson | Alireza Razzaghi | Tuomas Lundberg | Antti Kanner | Veronika Laippala
Nearly all studies on web registers—online text varieties associated with characteristic social contexts and linguistic features—use full documents as the unit of analysis. However, web documents often contain sections in different registers. A cooking blog, for instance, may combine personal storytelling, recipe instructions, user comments, and promotional text within a single URL. This internal variation raises doubts about the validity of document-level register labeling. In this paper, we propose an LLM-based approach that identifies register-homogeneous segments within documents and apply it to a 10,000-document English sample from HPLT 3.0. We show that segmentation addresses persistent problems in register analysis, including low inter-annotator agreement and category fuzziness. Strikingly, it also reveals that most web documents contain more than one register, making register mixing the norm rather than the exception on the web.
Scaling Sentence Similarity for Classical Tibetan with Automatic Annotations
Shay Cohen | Jingyi Yang | Gal Rabinovitz | Sonam Choden | Ofir Shtrosberg | Nicola Bajetta | Goody Ben Horin | Rebecca Sundén | Omri Drori | Sonam Jamtsho | Dorji Wangchuk | Kfir Bar | Orna Almogi | Shai Fine
Identifying intertextual parallels is central to philology, traditionally requiring labor-intensive manual analysis. While digitized historical corpora enable automated approaches using semantic sentence embeddings, training such models requires large annotated datasets, which are scarce for low-resource languages. We address this challenge by introducing a scalable automatic annotation pipeline for training semantic embedding models for Classical Tibetan. Our method combines unsupervised contrastive bootstrapping with iterative pair mining, generating silver-standard similarity labels through two complementary annotation strategies: (1) an ensemble of embedding models and rerankers, and (2) an LLM-as-a-judge committee using best–worst scaling. When combined with a domain-specific, gold-standard annotated dataset for sequential fine-tuning, the resulting text-similarity model achieves a state-of-the-art Spearman correlation of 0.864 on the STS task. This enables effective semantic search in Classical Tibetan and provides a framework for automatic supervision in low-resource languages used in digital humanities. We will make our code, dataset, and trained model publicly available upon publication.
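Best–worst scaling, mentioned above as the basis of the LLM-as-a-judge committee, is commonly scored by counting: an item's score is the number of times it was chosen as best minus the number of times it was chosen as worst, divided by the number of times it was shown. The sketch below shows that counting rule only. The tuple layout and the item names are hypothetical, and the paper's actual aggregation may differ.

```python
from collections import defaultdict


def best_worst_scores(judgments):
    """Counting-based best-worst scaling.

    judgments: iterable of (items, best, worst), where items is the tuple
    shown to a judge and best/worst are the judge's picks (hypothetical format).
    Returns {item: (times_best - times_worst) / times_shown}, in [-1, 1].
    """
    best = defaultdict(int)
    worst = defaultdict(int)
    shown = defaultdict(int)
    for items, b, w in judgments:
        if b not in items or w not in items:
            raise ValueError("best and worst picks must come from the shown items")
        for item in items:
            shown[item] += 1
        best[b] += 1
        worst[w] += 1
    return {item: (best[item] - worst[item]) / shown[item] for item in shown}


# Three judges rank the same four candidate sentence pairs (toy data).
pairs = ("p1", "p2", "p3", "p4")
scores = best_worst_scores([
    (pairs, "p1", "p4"),
    (pairs, "p1", "p3"),
    (pairs, "p2", "p4"),
])
```

The resulting scores could then serve as silver-standard similarity labels, which is how the abstract describes their role.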
PHMartialLawNER: A Tagalog Named Entity Recognition Corpus for the Philippine Martial Law Era
Abdiel Clarence Tabuzo | Vladimir Gray Velazco | Cassandra Cabral | Moneah Shaila Lacsam | Charmaine Salvador Ponay
Historical corpora for Tagalog remain limited, particularly texts produced during the Martial Law period under the dictatorship of Ferdinand Marcos Sr. (1972–1986). Much of this material remains undigitized, restricting computational analysis of a significant period in Philippine political history. To support research on historical Tagalog texts, we introduce PHMartialLawNER, a gold-standard named entity recognition corpus constructed from newspapers and underground publications of the Martial Law era. The corpus includes approximately 13k extracted sentence segments (362,000 tokens), consolidated into 8k annotated text spans through a semi-automatic pipeline with manual validation. The reliability of the annotation is measured using Cohen’s 𝜅, reaching 0.86 on all tokens and 0.72 on annotated tokens, with a pairwise F1-score of 0.74. The schema defines historically relevant entity categories including Person (Individual, Collective), Organization (Political, Government, Other), Event (Local, International), Production (Media, Government, Doctrine), as well as Time, Numerical Statistics, Location, and Object entities, specifically identifying weapon artifacts. We establish baseline performance using GLiNER variants, calamanCy models, and transformer-based architectures under zero-shot and few-shot settings. The PHMartialLawNER corpus will be publicly released to support Tagalog NLP, historical text processing, and digital humanities research.
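The gap between agreement on all tokens and agreement on annotated tokens can be made concrete with a short sketch of Cohen's κ, defined as (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is chance agreement. The label sequences below are toy data, and the reading of "annotated tokens" as tokens where at least one annotator assigned an entity label is an assumption, not the paper's stated definition.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equally long")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum(counts_a[l] * counts_b[l] for l in set(counts_a) | set(counts_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy token-level labels; "O" marks tokens outside any entity.
ann_a = ["PER", "O", "O", "LOC", "O", "ORG", "O", "O"]
ann_b = ["PER", "O", "O", "LOC", "O", "O", "O", "PER"]

kappa_all = cohens_kappa(ann_a, ann_b)

# Assumed filter: keep tokens that at least one annotator marked as an entity.
kept = [(a, b) for a, b in zip(ann_a, ann_b) if a != "O" or b != "O"]
kappa_annotated = cohens_kappa([a for a, _ in kept], [b for _, b in kept])
```

Dropping the tokens both annotators left unlabeled removes many easy agreements, which is one reason the annotated-token figure tends to be lower.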
Literary translation requires balancing target-language fluency with faithfulness to the source. Recent large language models (LLMs) often produce fluent translations, but it remains unclear whether fluency corresponds to semantic preservation in literary text. We examine this relationship using 130,486 translated paragraphs from 106 novels in 16 source languages, including human, Google Translate, and TranslateGemma translations. Fluency is measured as original-likeness with a translationese classifier trained on paragraph part-of-speech n-grams, and faithfulness with the automatic translation evaluation metric COMET-KIWI. We control for paragraph length and find a consistent negative correlation between fluency and faithfulness. The pattern appears for both human and Google Translate, but is weaker and often non-significant for TranslateGemma. These results show that segment length matters for automatic evaluation and suggest a tradeoff between fluency and faithfulness in literary translation.
Directional Alignment and Narrative Agency in Human–LLM Co-Writing
Halfdan Nordahl Fundal | Yuri Bizzoni
We investigate narrative agency in human–LLM creative co-writing, asking who drives story development in turn-based collaboration. Using a new corpus of human–LLM co-written stories, we apply sentiment and semantic modeling to quantify affective alignment and semantic novelty in turn-taking, and directional measures to assess which agent shapes narrative progression. Our results show asymmetric influence: human turns introduce greater semantic novelty and are more likely to shape subsequent developments, whereas LLM contributions predominantly elaborate on human-introduced elements. At the sentiment level, alignment is also asymmetric, but more bidirectional: LLMs exhibit stronger turn-level emotional adaptation than humans, but both agents track each other’s emotional valence and LLMs show an independent tendency to more positive emotional baselines. These findings indicate a complementary division of labor in human–LLM co-writing, where humans drive narrative innovation and direction, while LLMs act as adaptive amplifiers that sustain coherence and elaborate emerging narratives.
Bias Mitigation in Hiring-Related NLP: Interactions Between Masking, Rewriting, and Adversarial Debiasing
Alexandre Puttick | Rami El-Wazzi
AI-driven language technologies are increasingly used in hiring, but they may encode and reproduce harmful social stereotypes. Prior work often studies bias mitigation methods in isolation and outside realistic application settings. We examine the combined effects of data-level and model-level debiasing in a hiring-related context, using Norwegian-language academic bios and a proxy STEM/non-STEM classification task. Specifically, we study masking sensitive information, GenWriter-based rewrites (CITATION), and adversarial debiasing (CITATION). We evaluate these interventions using downstream task performance, group fairness metrics, intrinsic bias tests based on WEAT (CITATION), and measures of gender leakage from hidden representations. We find that combining masking, GenWriter rewrites, and adversarial debiasing substantially reduces gender leakage while maintaining or improving downstream performance. However, effects on fairness gaps and intrinsic bias are mixed, underscoring the need for downstream, context-sensitive evaluation of bias mitigation methods in hiring-related NLP.
Matching Meaning at Scale: Evaluating Semantic Search for 18th-Century Intellectual History through the Case of Locke
Yu Wu | Ananth Mahadevan | Filip Ginter | Michael Mathioudakis | Mikko Tolonen
While digitized corpora have transformed the study of intellectual transmission, current methods rely heavily on lexical text reuse detection, capturing verbatim quotations but fundamentally missing paraphrases and complex implicit engagement. This paper evaluates semantic search in 18th-century intellectual history through the reception of John Locke’s foundational work. Using expert annotation grounded in a semantic taxonomy, we examine whether an off-the-shelf semantic search pipeline can surface meaning-level correspondences overlooked by lexical methods. Our results demonstrate that semantic search retrieves substantially more implicit receptions than lexical baselines. However, linguistic diagnostics also reveal a “lexical gatekeeping” effect, where retrieval remains partially constrained by surface vocabulary overlap. These findings highlight both the potential and the limitations of semantic retrieval for analyzing the circulation of ideas in large historical corpora. The data is available at https://github.com/COMHIS/locke-sim-data.
How did the thematic repertoire of early English-language science fiction change as the genre consolidated between 1818 and 1930? Using a corpus of 238 public-domain texts, we apply temporally binned latent Dirichlet allocation (LDA), comparing models with and without Authorless preprocessing (which probabilistically downweights author-specific vocabulary). Cross-period topic alignments exceed a permutation null baseline, indicating continuity in topic structure over time. Full-corpus LDA can produce comparable per-topic quality, but only temporal binning enables diachronic alignment; within the binned setting, Authorless reduces author concentration and modestly increases the share of thematic topics without materially reducing coherence. Four high-continuity topic chains – centered on mobility, affect, planetary scale, and scientific knowledge – suggest a shift from earlier romantic and speculative concerns toward more consolidated technoscientific forms. These chains generate interpretable hypotheses about the literary history of early science fiction, and the workflow supports diachronic analysis in small, author-skewed corpora.
Twenty’s Plenty: Semantic Scaffolding and Span Architecture for 19-Label NER in Medieval Latin Charters
Tamás Kovács | Giuseppe Consolo | Georg Vogeler
This study investigates whether a high-quality, 19-label named entity recogniser for medieval Latin charters can be constructed using only a few hundred annotated sentences. The authors introduce "semantic scaffolding," an innovation that utilizes richly descriptive English label phrases as prompts to activate latent multilingual knowledge within the model. This is paired with a custom span-based architecture utilizing XLM-RoBERTa-large, 4-head attention pooling to handle long property descriptions, and a hybrid loss system including Asymmetric Focal-Dice and InfoNCE contrastive terms. Results demonstrate that semantic scaffolding enables fine-tuned GLiNER to reach 80.8% overlap F1, while the custom architecture achieves 83.4% overlap F1 using only 298 training sentences. Significantly, the paper provides an empirical demonstration that domain-specific pre-training on medieval Latin offers no performance advantage once task-specific fine-tuning is applied. While the model excels at frequent categories like PER (95.7% F1) and LOC (93.5% F1), challenges persist for rare, position-dependent legal categories such as LEG (53.1% F1) and TRANS (52.6% F1).
Artistic Interventions for NLP Annotation Challenges: The Stress Test of Machinic Glossolalia
Tyler Grimes | Marshall Washington
MotherBoard’s Mother Tongue is a computational linguistics and artistic research project that explores a Large Language Model’s (LLM) vocal production of glossolalia. Glossolalia, colloquially known as ‘speaking in tongues,’ consists of the human production of seemingly unintelligible utterances. It is, by its nature, difficult to annotate accurately with linguistic features relevant for natural language. The glossolalia-producing system demonstrated here consists of the interaction of 1) a ‘nonsense’ linguistic corpus, 2) a microcontroller-based environmental data stream, and 3) a fine-tuned LLM. While discussing some philosophical and artistic considerations of machinic glossolalia, we also address some methodological considerations for Natural Language Processing (NLP). Using the artistic project as a case study, we argue that machinic glossolalia presents a ‘stress test’ that could inform both creative redirections of NLP methods and the definitions held by the subfield.
In Search of Lost Adventure Novels: Supervised Genre Retrieval and Corpus Refinement in Gallica
Jean Barré
This paper addresses a practical problem in computational literary history: retrieving adventure novels from a large digitized collection of French fiction where genre metadata are sparse and unreliable. We begin with supervised genre modeling based on a historically situated seed list of 101 adventure novels drawn from literary scholarship. We compare several classifiers and representations, and validate them against 364 independently labeled adventure novels from the Chapitres corpus. The best-performing model, HistGradientBoosting on mean paragraph embeddings, achieves strong external recall (81%) despite the small training set. We then apply this model to the 12,176-novel Fictions littéraires de Gallica archive and refine the resulting candidate corpus through a graph-based post-processing step over a k-nearest-neighbor similarity graph. On the Chapitres benchmark, this graph correction produces negligible changes in retrieval performance, indicating that it is not a generally superior classifier. On Gallica, however, it yields a more cohesive and homogeneous candidate corpus and surfaces interpretable correction cases, including missed canonical adventure novels and excluded borderline texts. We therefore argue that graph-based correction is best understood not as a replacement for supervised classification, but as a heuristic for refining large, noisy archival corpora where exhaustive manual annotation is impossible.
Computational Modeling of Educational Theory in Low-Socioeconomic Contexts
Jadon Swearingen | Mustafa Ocal | Md Tarique Hasan Khan | Labiba Jahan
This study examines narratives in which students describe challenges they faced in higher education due to low socioeconomic status (SES) backgrounds and the strategies they used to overcome them. Using computational text analysis, we operationalize three educational theories (Paulo Freire’s Critical Pedagogy, Urie Bronfenbrenner’s Ecological Systems Theory, and Pierre Bourdieu’s Theory of Capital and Habitus) to analyze patterns in these narratives. To strengthen the theory-to-method connection, we incorporate temporal timeline extraction, identifying ordered event sequences and tracking how challenges and forms of capital evolve across a student’s posting history. This temporal lens links theoretical categories (barriers, supports, forms of capital) to when they occur, highlighting moments for timely interventions. By combining theory-driven features with temporal analysis, we evaluate the explanatory capacity of each framework and demonstrate how computational methods can quantitatively examine qualitative lived experience at scale, supporting interdisciplinary research on equity in education.
Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan
Ahan Chatterjee | Matthias Schöffel | Matthias Aßenmacher | Marinus Wiedner | Esteban Garces Arias
The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine). In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available.
From Traditional Taggers to LLMs: A Comparative Study of POS Tagging for Medieval Romance Languages
Matthias Schöffel | Esteban Garces Arias
Part-of-speech (POS) tagging for Medieval Romance languages remains challenging due to orthographic variation, morphological complexity, and limited annotated resources. This paper presents a systematic empirical evaluation of large language models (LLMs) for POS tagging across three medieval varieties: Medieval Occitan, Medieval Catalan, and Medieval French. We compare traditional rule-based and statistical taggers with modern open-source LLMs under zero-shot prompting, few-shot prompting, monolingual fine-tuning, and cross-lingual transfer learning settings. Experiments on historically grounded datasets show that LLM-based approaches consistently outperform traditional taggers, with fine-tuning and multilingual training yielding the largest improvements. In particular, cross-lingual transfer learning substantially benefits under-resourced varieties, while targeted bilingual training can outperform broader multilingual configurations for specific target languages. The results highlight the importance of linguistic proximity and dataset characteristics when designing transfer strategies for historical NLP. These findings provide empirical insights into the applicability of modern neural methods to medieval text processing and provide practical guidance for deploying LLM-based POS tagging pipelines in digital humanities research. All code, models, and processed datasets are released for reproducibility.
This paper introduces a computational framework for evaluating structural properties of the undeciphered Indus script. The study uses a corpus of 6,579 inscriptions. The analytical approach combines unsupervised visual clustering of sign morphology, entropy-based sequence analysis, Kullback-Leibler divergence comparison, and neural sequence modeling (BiLSTM). The results indicate directional asymmetry and structured combinatorial patterns in sign sequences. We conclude that the Indus sign sequences exhibit statistical properties consistent with structured symbolic systems and not easily explained by random generation.
We present the result of preliminary explorations of using the topology of embedded manifolds as a semantic invariant. Our main question is whether the topology of large embedded corpora is invariant in the following two senses. First, one might reasonably expect that the same corpus in two languages would give topologically equivalent embeddings. Second, one might reasonably expect that the same corpus embedded by two different embedding models might give topologically equivalent embeddings. In the paper we will justify these intuitions and give preliminary results indicating that they are, to some extent, justified.
MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring
Ali Keramati | Shiyuan Zhou | Sharad Mehrotra | Mark Warschauer
Automated Essay Scoring (AES) is shifting from feature-engineering to LLMs, yet current training-free approaches struggle with calibration, often exhibiting a "middle-score bias" that fails to distinguish between exceptional and weak writing. In this work, we introduce MADRAG (Multi-Agent Debate with Retrieval-Augmented Generation), a training-free framework designed to achieve the reliability of supervised models without the need for labeled training data. MADRAG decomposes the scoring process into a multi-agent interaction: an Advocate highlights essay strengths, a Skeptic critiques weaknesses, and a Judge synthesizes these arguments to assign a score. Crucially, we augment the Judge with a RAG mechanism that retrieves rubric-aligned exemplar essays spanning the full score range, grounding the debate in concrete evidence. Evaluating our approach on the ASAP dataset for analytic trait scoring, we demonstrate that MADRAG significantly outperforms existing prompt-based LLM baselines and achieves performance competitive with state-of-the-art supervised models.
Never Care For What They Say ? Platform vs Genre Rules in Online Horror Narratives (2007–2024)
Alexandre Lionnet-Rollin | Florian Cafiero
Research on online cultural production shows that platforms are acting as mediators that can heavily shape textual form. Yet, empirical work is often platform-bounded, making it difficult to assess whether stylistic regularities that we observe are indeed genre signals or if some of them are platform artefacts. We address this question through a cross-platform design focused on creepypasta, a digital-born horror genre circulating across heterogeneous infrastructures. Using a corpus of ∼23,000 English-language stories published from 2007 to 2024 on Reddit’s /r/nosleep and the Creepypasta Fandom wiki, we compare stylistic profiles across platforms and relate them to differences in rule regimes and moderation practices, established through qualitative extraction and close reading of platform guidelines. Across readability indices, lexical diversity measures, syntactic proxies, and a cross-fit feature-based model, we find that platform membership leaves only a narrow stylistic imprint, largely reducible to a single architectural rule: r/NoSleep’s mandatory first-person narration. Beyond this constraint, differences are modest and fail to form coherent platform-specific stylistic signatures. This helps us define what is stylistically common in creepypastas, and understand what the genre is to its writers beyond the topics it deals with or the platform it is written on.
StoicLLM: Preference Optimization for Philosophical Alignment in Small Language Models
Ishmam Khan | Sindhuja Thogarrati | Shuo Zhang
While large language models excel at factual adaptation, their ability to internalize nuanced philosophical frameworks under severe data constraints remains underexplored. We investigate this by specializing small LLMs on micro-datasets of foundational Stoic texts using preference optimization (ORPO, AlphaPO). Evaluated via a multi-model critic bank, our results show that just 300 high-fidelity examples can induce strong alignment with inward-facing Stoic virtues, closely approaching few-shot prompting while freeing the context window. Critically, however, all models, including few-shot baselines, exhibit a persistent failure on Stoicism’s outward-facing cosmopolitan duties, pointing to a representational limitation of small models that micro-dataset adaptation alone cannot overcome.
Between Whispers and Screams: Loudness Standard Deviation as a Proxy for Explicit Content Detection in US Romance Novels
Svenja Guhr
This study proposes and tests loudness standard deviation (SD) of fictional sound events as an acoustically grounded proxy for detecting explicit content in romance fiction. Working with a subcorpus of novels from the Harlequin Men Made in America series, we annotated scenes for character and ambient sound with loudness levels. Additionally, we annotated the scenes on a ternary severity scale for two content advisory categories drawn from the PG-story taxonomy, Sex & Nudity and Violence & Scariness (CITATION), and tested whether within-scene loudness SD of character and ambient sound correlates with either category. Loudness standard deviation analyses of character and ambient sounds in scenes featuring explicit content reveal that erotic scenes are acoustically marked by significantly higher variability in character-produced sounds, reflecting the dynamic range from whispered dialogue to vocalized arousal, while no significant correlation was found between high ambient sound loudness SD and scenes of elevated Violence & Scariness.
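The within-scene loudness SD described above is straightforward to compute once each scene has a list of annotated loudness levels. The sketch below assumes integer loudness levels on an ordinal scale and uses the sample standard deviation; the scale, the scene IDs, and the choice of sample over population SD are illustrative assumptions, not details taken from the study.

```python
import statistics


def loudness_sd_by_scene(scene_events):
    """Within-scene standard deviation of annotated loudness levels.

    scene_events: {scene_id: [loudness level per sound event]} (hypothetical format).
    Scenes with fewer than two sound events get None, since a sample SD
    is undefined for them.
    """
    result = {}
    for scene_id, levels in scene_events.items():
        result[scene_id] = statistics.stdev(levels) if len(levels) >= 2 else None
    return result


# Toy annotations: a quiet scene, a scene swinging between soft and loud, a single-event scene.
sd = loudness_sd_by_scene({
    "scene_1": [1, 1, 2],
    "scene_2": [1, 5, 2, 5],
    "scene_3": [3],
})
```

The per-scene values could then be compared against the ternary severity annotations with a rank correlation, which is left out here to keep the sketch dependency-free.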
Computational Authorship Attribution in the Children’s Tales of Oscar and Constance Wilde: The Case of "The Selfish Giant"
Liviu P Dinu | Alina Iacob | Cosmin Ciotlos
This study introduces and analyzes a novel authorship attribution case: the children’s stories published by Oscar and Constance Wilde. We analyzed the corpus of stories with both supervised (SVM with string kernel) and unsupervised (Hierarchical Clustering via Rank Distance) methods and found a strong stylistic similarity between the story "The Selfish Giant" published by Oscar Wilde and the stylometric profile of Constance Wilde. Starting from this baseline, we also explored the capabilities of LLMs in authorship attribution via Perplexity. Our finding suggests that the story "The Selfish Giant" might be the result of a collaboration between Oscar and Constance Wilde. Moreover, our results pointed to the distinct stylistic fingerprints of the two authors with regard to the rest of the corpus, confirming that their respective styles are separable despite shared genre and period.
Evaluating Open-Source LLMs for Text Summarization and Named Entity Recognition in Long, Unstructured Text
Pauline Kister | Miriam Schirmer
This work investigates the extent to which open-source Large Language Models (LLMs) can improve accessibility of unstructured historical documents by performing abstractive summarization and fine-grained Named Entity Recognition (NER) for role classification and violation types. We evaluate open-source LLMs in zero-shot settings and apply these tasks to witness testimonies collected by the South African Truth and Reconciliation Commission (TRC), which archived a large body of text documenting human rights violations during apartheid. Despite their historical significance, these texts are difficult to access due to their length, lack of standardized structure, and the absence of systematic indexing. Open-source LLMs show strong performance in summarization, with most models surpassing non-LLM baselines (maximum BERTScore 0.77), while NER performance remains limited (maximum F1-score 0.61). Results suggest a trade-off in which stylistic fluency is prioritized over factual precision. A two-stage pipeline, summarization followed by NER on LLM summaries, leads to measurable improvements.
Perspectives – Interactive Document Clustering for Qualitative Data Analysis
Tim Fischer | Chris Biemann
This paper introduces Perspectives, an interactive extension of a qualitative data analysis tool suite developed at our university, designed to empower Digital Humanities (DH) scholars to explore and organize large, unstructured document collections. Perspectives implements a flexible, aspect-focused document clustering pipeline with human-in-the-loop refinement capabilities. We showcase how this process can be initially steered by defining analytical lenses through document rewriting prompts and instruction-based embeddings, and further aligned with user intent through tools for refining clusters and mechanisms for fine-tuning the embedding model. The demonstration highlights a typical workflow, illustrating how DH researchers can leverage Perspectives’s interactive document map to uncover topics, sentiments, or other relevant categories, thereby gaining insights and preparing their data for subsequent in-depth analysis.
Proceedings of the 4th Workshop on NLP for Music and Audio (NLP4MusA 2026)
Elena V. Epure | Sergio Oramas | SeungHeon Doh | Pedro Ramoneda | Anna Kruspe | Mohamed Sordo
From Novice to Expert: Generating Audience-Dependent Concert Moderations with RAG-LLMs
Kerstin Denecke
In this paper, we study the capabilities of large language models (LLMs) to adapt a concert moderation to the diverse expertise levels of listeners. Our proof-of-concept concert moderator is based on retrieval-augmented generation (RAG) and uses few-shot audience modelling to infer a listener’s expertise. We study the capabilities of the system to adapt to three different listener expertise levels. Two open domain LLMs are compared: gpt-oss:20b and llama3. The recognised differences among the models suggest that they vary in how directly they reproduce versus paraphrase retrieved information while maintaining semantic alignment.
LabelBuddy: An Open Source Music and Audio Language Annotation Tagging Tool Using AI Assistance
Ioannis Prokopiou | Ioannis Sina | Agisilaos Kounelis | Pantelis Vikatos | Themos Stafylakis
The advancement of machine learning (ML), Large Audio Language Models (LALMs), and autonomous AI agents in Music Information Retrieval (MIR) necessitates a shift from static tagging to rich, human-aligned representation learning. However, the scarcity of open-source infrastructure capable of capturing the subjective nuances of audio annotation remains a critical bottleneck. This paper introduces LabelBuddy, an open-source collaborative auto-tagging audio annotation tool designed to bridge the gap between human intent and machine understanding. Unlike static tools, it decouples the interface from inference via containerized backends, allowing users to plug in custom models for AI-assisted pre-annotation. We describe the system architecture, which supports multi-user consensus, containerized model isolation, and a roadmap for extending agents and LALMs. The code is available at https://github.com/GiannisProkopiou/gsoc2022-Label-buddy.
Stochastic Parrots or True Virtuosos? Digging Deeper Into the Audio-Video Understanding of AVQA Models
Sara Pernille Jensen | Hallvard Innset Hurum | Anna-Maria Christodoulou
Audio-video question answering (AVQA) systems for music show signs of multimodal "understanding", but it is unclear which inputs they rely on or whether their behavior reflects genuine audio-video reasoning. Existing evaluations focus on overall accuracy and rarely examine modality dependence. We address this gap by suggesting a method of using counterfactual evaluations to analyse the audio-video understanding of the models, illustrated with a case study on the audio-video spatial-temporal (AVST) architecture. This includes interventions that zero out or swap audio, video, or both, where results are benchmarked against a baseline based on linguistic patterns alone. Results show stronger reliance on audio than video, yet performance persists when either modality is removed, indicating learned cross-modal representations. The AVQA system studied thus exhibits non-trivial multimodal integration, though its "understanding" remains uneven.
Beyond Musical Descriptors: Extracting Preference-Bearing Intent in Music Queries
Marion Baranes | Romain Hennequin | Elena V. Epure
Although annotated music descriptor datasets for user queries are increasingly common, few consider the user’s intent behind these descriptors, which is essential for effectively meeting their needs. We introduce MusicRecoIntent, a manually annotated corpus of 2,291 Reddit music requests, labeling musical descriptors across seven categories with positive, negative, or referential preference-bearing roles. We then investigate how reliably large language models (LLMs) can extract these music descriptors, finding that they do capture explicit descriptors but struggle with context-dependent ones. This work can further serve as a benchmark for fine-grained modeling of user intent and for gaining insights into improving LLM-based music understanding systems.
How Far Can Pretrained LLMs Go in Symbolic Music? Controlled Comparisons of Supervised and Preference-based Adaptation
Deepak Kumar | Emmanouil Karystinaios | Gerhard Widmer | Markus Schedl
Music often shares notable parallels with language, motivating the use of pretrained large language models (LLMs) for symbolic music understanding and generation. Despite growing interest, the practical effectiveness of adapting instruction-tuned LLMs to symbolic music remains insufficiently characterized. We present a controlled comparative study of finetuning strategies for ABC-based generation and understanding, comparing an off-the-shelf instruction-tuned backbone to domain-adapted variants and a music-specialized LLM baseline. Across multiple symbolic music corpora and evaluation signals, we provide some insights into adaptation choices for symbolic music applications. We highlight the domain adaptation vs. preserving prior information tradeoff as well as the distinct behaviour of metrics used to measure the domain adaptation for symbolic music.
A central limitation of current music understanding frameworks is the reliance on audio embeddings, which frequently yields interpretations lacking traceable ties to explicit musical elements such as notes, dynamics, and instrumentation. We address this gap with MIDI-PHOR, a MIDI-first framework that converts symbolic data into structured, queryable representations for reasoning. MIDI-PHOR distills each piece into three complementary views: a symbolic view capturing pitch, meter, and key; a time-series (TS) view that tracks rhythmic salience, texture, and role activity; and an instrument-role graph encoding ensemble interactions. With evidence-linked claims, experiments demonstrate reduced hallucinations compared to raw-MIDI baselines and offer a robust, auditable bridge between symbolic data and semantic music understanding.
Read Between the Tracks: Exploring LLM-driven Intent-based Music Recommendations
Anna Hausberger | Petra Jósár | Markus Schedl
This paper evaluates the effectiveness of large language models (LLMs) on the task of context-aware music recommendation, specifically focusing on the alignment of music tracks with a listening intent, in addition to user preferences. We present a preliminary investigation in which five LLMs (variants of Llama, Qwen, and Mistral) are tasked with ranking a candidate set of tracks containing both ground-truth items (associated with specific user-intent pairs) and distractor items (containing user-relevant, intent-relevant, or non-user and non-intent relevant items). Our results show that LLMs rank intent-user-relevant items higher than the distractor items, with "Llama-3.1-8B-Instruct" having the best performance (NDCG of 0.32 ± 0.20 vs. 0.20 ± 0.15). We further investigate whether performance differs when mentioning the listening intent explicitly in the prompt vs. implicitly given solely music preferences. Surprisingly, the LLMs achieved the best performance through an implicit indication of intent, versus explicitly adding it to the prompt, with "Mistral-7B-Instruct-v0.3" performing the best (NDCG of 0.37 ± 0.22 vs. 0.29 ± 0.18).
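For readers unfamiliar with the NDCG metric reported in this abstract, the following is a minimal sketch of the standard definition with binary relevance labels. It is not the authors' evaluation code: the function names, the binary labeling (1 for a ground-truth track, 0 for a distractor), and the example ranking are assumptions made only for illustration.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each item's relevance is divided by log2(rank + 1)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical ranking returned by a model:
# 1 marks a ground-truth track, 0 marks a distractor (assumed labels).
ranking = [0, 1, 0, 1]
score = ndcg(ranking)
print(f"NDCG = {score:.3f}")
```

A ranking that places every ground-truth track ahead of every distractor scores 1.0 under this definition; lower values mean relevant tracks were pushed down the list.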
Learning When to Personalize: LLM Based Playlist Generation via Query Taxonomy and Classification
Fedor Buzaev | Ivan Sukharev | Rinat Mullahmetov | Roman Bogachev | Ilya Sedunov | Oleg Pavlovich | Daria Pugacheva
Playlist generation based on textual queries using large language models (LLMs) is becoming an important interaction paradigm for music streaming platforms. User queries span a wide spectrum from highly personalized intent to essentially catalog-style requests. Existing systems typically rely on non-personalized retrieval/ranking or apply a fixed level of preference conditioning to every query, which can overfit catalog queries to a single user or under-personalize explicitly listener-dependent requests. We present an industrial-scale LLM-based playlist generation system with dynamic personalization that adapts the personalization strength to the query type. We define a query taxonomy, train a query-type classifier on 5,000 manually labeled queries, and use its predicted probability to modulate the mixture of LLM-based semantic scoring and personalized evaluation. In a blind user study with pairwise comparisons and Elo aggregation, this approach consistently outperforms both non-personalized and fixed-personalization baselines.
HumMusQA: A Human-written Music Understanding QA Benchmark Dataset
Benno Weck | Pablo Puentes | Andrea Poltronieri | Satyajeet Prabhu | Dmitry Bogdanov
The evaluation of music understanding in Large Audio-Language Models (LALMs) requires a rigorously defined benchmark that truly tests whether models can perceive and interpret music, a standard that current data methodologies frequently fail to meet. This paper introduces a meticulously structured approach to music evaluation, proposing a new dataset of 320 hand-written questions curated and validated by experts with musical training, arguing that such focused, manual curation is superior for probing complex audio comprehension. To demonstrate the use of the dataset, we benchmark six state-of-the-art LALMs and additionally test their robustness to uni-modal shortcuts.
Proceedings of the Seventh Workshop on Natural Language Processing and Computational Social Science
Dallas Card | Anjalie Field | Katherine Keith | Julia Mendelsohn
Prompt Perturbations Reveal Human-Like Biases in Large Language Model Survey Responses
Jens Rupprecht | Georg Ahnert | Markus Strohmaier
Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known human-like response biases, such as central tendency, opinion floating, and primacy bias, are poorly understood. This work investigates the response robustness of LLMs in normative survey contexts: we test 18 LLMs on questions taken from the World Values Survey (WVS), applying a comprehensive set of ten perturbations to both question phrasing and answer option structure, resulting in over 334,800 simulated survey interviews. In doing so, we not only reveal LLMs’ vulnerabilities to perturbations but also show that almost all tested models exhibit a consistent recency bias, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.
Borrowed Words, Borrowed Minds: Probing LLM Choice of English-Derived Loanwords in Japanese
Joseph James
The choice between English-derived loanwords (gairaigo) and native Japanese equivalents is a socially meaningful aspect of language use, carrying implications for register, style, and pragmatic interpretation. We introduce a controlled evaluation dataset probing how large language models encode this form of sociolinguistic variation. The dataset comprises 113 interchangeable lexical pairs embedded across six communicative contexts spanning formal and informal, spoken and written registers. We evaluate 16 Japanese-capable LLMs across three complementary tasks: sentence rating, pairwise choice, and masked word prediction. Although both lexical forms were generally rated as natural, models diverged substantially in contextual sensitivity and lexical preference, revealing architectural differences in how socially grounded lexical alternatives are represented. These findings suggest that surface fluency may mask instability in modeling pragmatic variation, with implications for socially aware language generation and evaluation.
Does Local News Stay Local?: Online Content Shifts in Sinclair-Acquired Stations
Miriam Wanner | Sophia Hager | Anjalie Field
Local news stations are often considered to be reliable sources of non-politicized information, particularly local concerns that residents care about. The Sinclair Broadcast Group is a broadcasting company that has acquired many local news stations in the last decade. We investigate the effects of local news stations being acquired by Sinclair: how does coverage change? We analyze YouTube content put out by local news stations through topic modeling, log-odds ratios, and word embedding analyses to investigate changes after being acquired by Sinclair. We find evidence that local news stations report more frequently on national news at the expense of local topics, and that their coverage of polarizing national topics increases. These findings associate acquisition by Sinclair with increasing polarization and nationalization of news content, which in turn risks increasing political polarization of local news viewers.
Learning Moral Diversity: Modelling Individual Perspectives in Moral Classification of Texts
Yi Ren | Lewis Mitchell | Matthew Roughan
Understanding moral values in social media text offers insight into moral judgement formation, and supervised NLP models trained on crowdsourced data have achieved strong classification performance. However, most approaches simplify the problem by aggregating multiple annotators’ labels into a single "ground truth", overlooking the inherent subjectivity of the task. In practice, there are disagreements between annotators caused by personal viewpoint or inherent ambiguities, particularly for short tweets. Here, we extend a pretrained language model with a layer that learns annotator-specific features. Our model improves predictions of individual annotations and yields representations that reveal meaningful insights into annotators’ moral perspectives. We show that models trained on aggregated labels may hide variation and give a misleading impression of performance. Overall, we demonstrate that disagreement reflects the inherent subjectivity of the task and that modelling individual perspectives creates benefits for moral classification of texts.
Launch and Aftermath: Contrasting Social Media Responses to Chatbot Releases. The Cases of Meta’s Galactica and OpenAI’s ChatGPT
Maximilian Weber | Johannes B. Gruber
In November 2022, Meta’s Galactica and OpenAI’s ChatGPT were released within fifteen days of each other, two transformer-based language models that were architecturally similar and built on comparable underlying technology, yet experienced starkly different outcomes. Where they diverged was not in technical kind but in domain positioning and epistemic framing: Galactica was explicitly marketed as a reliable scientific assistant, while ChatGPT was presented as a general-purpose conversational tool. Using Twitter data collected via the Twitter Research API, we conduct a comparative analysis of early social media discourse surrounding both models. Through sentiment classification, zero-shot harm and risk annotation, and LLM-based topic modeling, we find that negative sentiment escalated rapidly for Galactica while remaining comparatively stable for ChatGPT in the release period. Galactica experienced a marked escalation in criticism during its first week, eventually structuring much of the conversation. In contrast, ChatGPT’s early discourse remained more evenly distributed across hype, experimentation, practical engagement, and criticism. We argue that domain positioning and epistemic expectations, rather than any meaningful technological difference, played a central role in shaping public perception, with Galactica’s scientific presentation making its well-documented hallucinations appear far more damaging in public opinion.
When Do LLMs Need Human Experts? Evidence for Social Science from Jurisprudential Classification
Caroline Cheng | Edward Stiglitz | David Mimno | Matthew Wilkens
Social scientists increasingly use large language models (LLMs) to classify text at scale, raising a key question: when can LLMs replace expert human annotation? Prior work found that earlier generative models failed on complex social science tasks while fine-tuned BERT succeeded, but whether current frontier-scale models close this gap remained untested. We investigate this question on a challenging legal reasoning task—classifying paragraphs from U.S. Supreme Court opinions as employing formal, grand, or no reasoning. Testing frontier LLMs including GPT-5.2 and leading open-weight alternatives, we find that even the most capable prompted models consistently underperform fine-tuned BERT. Only when high-parameter-count generative LLMs are fine-tuned on human-annotated training data does performance improve, and fine-tuned BERT remains a cost-effective alternative. Contrary to a common view, our results demonstrate that scaling to frontier-size LLMs does not eliminate the need for expert annotation on tasks requiring deep domain expertise—a finding with important implications for computational social science measurement.
An NLP Framework for Analyzing Corporate Strategic Behavior in the Opioid Industry Documents Archive
Duy Dang Phu | Thìn Đặng Văn
The Opioid Industry Documents Archive (OIDA) provides extensive internal corporate records that offer valuable insight into the drivers of the opioid crisis, yet its use in systematic analysis of corporate strategy remains limited. In this study, we propose an NLP-based framework to analyze strategic behavior in large-scale litigation archives, combining relevance filtering and topic modeling with large language model (LLM)-assisted interpretation. Applied to documents from Insys Therapeutics and Mallinckrodt Pharmaceuticals, our approach uncovers systematic differences in corporate strategies and organizational priorities. These results highlight the potential of integrating representation learning and LLMs for large-scale analysis in public health and corporate accountability research.
Large-scale ASR systems such as Whisper achieve competitive aggregate Word Error Rate (WER) on multilingual benchmarks, but this aggregate conceals systematic disparities across speaker populations. We evaluate Whisper large-v3 on 276 recordings from the Corpus Oral y Sonoro del Español Rural (COSER), a dialectological archive of elderly rural speakers across all Spanish provinces. WER is computed separately for Informants and Interviewers within each recording, revealing that mixed-role evaluation underestimates Informant WER in the majority of provinces, with the largest corrections in southern areas. Negative Binomial regression with cluster-robust standard errors shows that Andalusia and Extremadura generate significantly more Informant errors than the Castilian heartland (Andalusia IRR = 1.20, p < 0.001; Extremadura IRR = 1.24, p = 0.020), while no geographic predictor reaches significance for Interviewers sharing the same recording environment. Male Informants generate 12.5% more errors than females after geographic adjustment (p < 0.001), consistent with differential vernacular retention in traditional rural communities. The geographic pattern aligns with established dialectological classifications of Peninsular Spanish. These results demonstrate that role-disaggregated evaluation is a necessary methodological prerequisite for fairness audits of ASR systems applied to sociolinguistically diverse corpora: aggregate benchmarks systematically suppress disparities that are borne disproportionately by the most underrepresented speaker populations, and their use in isolation constitutes both an allocative harm and a measurement failure.
Who Speaks for Whom? LLM-Generated Survey Data as a Proxy for Public Opinion
Radhakrishnan Venkatakrishnan | Travis Brodbeck | Michael D. Young
Technological advancements, such as Large Language Models (LLMs), offer a potential solution to the two-faceted problem facing social science researchers: rising costs and declining response rates. The use of artificial personas is a budding practice, where chatbots are given the demographic characteristics of the person they are supposed to role-play as and answer questions for researchers. Before scholars and practitioners augment or replace the data created by interviewing humans, it is essential to understand how well models perform in generating accurate, reliable, and robust data, with concerns that the training of LLMs results in a bias towards the norms of WEIRD cultures. We present a procedure for practitioners to use to evaluate the quality of their synthetic data by measuring Intra Class Correlation (ICC), Earth Mover Distance (EMD), Variance, Hedging, and demographic drivers of LLM output. We find that the models may generate plausible results in the aggregate, but these synthetic data do not exhibit the depth or nuance of human respondents. Secondarily, we find that despite having generated definitive answers on a ten-point scale, the reasoning provided by the LLM exhibited varying degrees of hedging that do not consistently align with the LLM’s answer. The distortion of the results was not uniformly distributed; instead, the effects were more extreme for some demographic groups. Our findings suggest that the technology generating synthetic survey data may not be mature enough to address the increasing challenges of interviewing humans for public opinion research.
Documenting Corporate Harm: A Semantic Action Trajectories Approach to the Opioid Industry Document Archive Shared Task
Ben Miller
This paper presents a method for modeling change in the possibility space of actors over time as represented in the Opioid Industry Document Archive (OIDA). The approach treats documents as a structured field of actor–action relations and models these relations as semantic action trajectories across time. Semantic role labeling (SRL) using the Emory Language and Information Toolkit (ELIT) is applied to extract subject–predicate structures from a corpus of internal industry documents. Subjects are normalized and grouped into actor categories using a combination of rule-based heuristics and constrained language model adjudication. Predicate vocabularies associated with these actors are mapped to psycholinguistic categories using the LIWC lexicon, and random forest feature selection with principal component analysis is used to construct a low-dimensional representation of discourse structure across periods. The resulting discourse space reveals systematic shifts in how corporate actors, regulators, clinicians, and patients are positioned over time. In particular, corporate entities and the opioid products they produce follow nearly identical semantic trajectories, suggesting that companies and the pharmaceutical drugs they produce occupy similar roles in the archive’s discourse. This method provides a way to analyze changing institutional behavior at scale across heterogeneous litigation and historical archives.
Toward Unsupervised Conceptual Metaphor Discovery: A Case Study in Online Immigration Discourse
Alexandria Leto | Maria Leonor Pacheco
In Conceptual Metaphor Theory (CMT), a metaphor is a systematic mapping from a concrete source domain (e.g., physical load) to a more abstract target domain (e.g., taxes), so that reasoning about concepts in the target domain is guided by inferences from the source domain. In this work, we propose that since different source domains can frame the same target in starkly different ways, the conceptual mappings evidenced by metaphorical expressions can guide computational political discourse analysis. We present a proof-of-concept for an unsupervised method that uncovers salient conceptual mappings from a corpus. Prior work in computational political metaphor analysis has drawn on CMT, but it typically requires a predetermined inventory of focused source and target domains. In contrast, we introduce a simple LLM-based method that detects metaphorical expressions from a corpus with strong performance, then clusters them to approximate source domain categories. We demonstrate its utility through a case study on online immigration discourse, showing that the resulting metaphor clusters provide context for frame analysis. We conclude by outlining future work needed to develop a robust framework for conceptual metaphor discovery in political discourse.
Simulating Social Attitudes with LLMs: Accuracy, Demographic Effects, and Refusal Behavior in the Sensitive Domain of Suicide Prevention
Cristina J. Perez | Michael P. Vasquez Jr | Philippe Giabbanelli | Patrick Y. Wu
Large language models (LLMs) are increasingly used to simulate public opinion, yet their validity in sensitive policy domains remains underexplored. We evaluate whether LLMs can reproduce attitudes toward suicide prevention policies using 32 questions drawn from seven nationally representative U.S. surveys (2023-2025). We systematically vary demographic conditioning (race/ethnicity, gender, age, education, income, party), prompt framing (direct elicitation, respondent embodiment, specialist embodiment), and model architecture (GPT-5 Nano, DeepSeek V3.2, Meta Llama 3.1 8B, Mistral Small 24B). Across 811,560 prompts, the mean absolute error—the average gap between predicted and human response distributions—is 23 percentage points. We also find that LLM responses to demographic-conditioned prompts diverge substantially from prompts without demographic information. In short, what distribution LLMs draw on when generating responses to sensitive polling questions remains unclear. Model choice matters more than framing for accuracy, whereas refusal behavior varies sharply across models and prompt designs. Our findings highlight the limitations of LLMs for social simulation in the context of sensitive topics.
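The mean absolute error described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: how the paper aggregates across questions, models, and demographic groups is not stated here, so this example assumes a single question and simply averages the absolute gap across its answer options. The function name, option labels, and percentages are all made up for illustration.

```python
def mean_absolute_error(predicted, observed):
    """Average absolute gap (in percentage points) across shared answer options."""
    if predicted.keys() != observed.keys():
        raise ValueError("Both distributions must cover the same answer options")
    gaps = [abs(predicted[option] - observed[option]) for option in predicted]
    return sum(gaps) / len(gaps)

# Hypothetical response distributions for one survey question (percentages).
llm_distribution = {"support": 70.0, "oppose": 20.0, "unsure": 10.0}
human_distribution = {"support": 50.0, "oppose": 35.0, "unsure": 15.0}

mae = mean_absolute_error(llm_distribution, human_distribution)
print(f"MAE = {mae:.1f} percentage points")
```

Under this assumed per-option averaging, a value of 0 would mean the simulated and human distributions match exactly on that question.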
Gender Disparities in LLM-Based Intimate Partner Violence Detection
Tabia Tanzin Prama | Mikaela Irene Fudolig | Abigail M. Crocker | Christopher M. Danforth | Peter Dodds
Intimate Partner Violence (IPV) is a major public health concern, and large language models (LLMs) are increasingly used for support and information-seeking in sensitive domains. We examine whether LLMs perceive relationship abuse differently depending on victim–perpetrator gender configuration. Using 475 Reddit posts from r/relationship_advice, we generate counterfactual variants by swapping gendered identifiers to create four dyads: female–female (F/F), female–male (F/M), male–female (M/F), and male–male (M/M), where the first position denotes the victim. Four recent LLMs (GPT-5o, Gemini 3, Llama 4, and Grok 3) evaluate each variant using a structured questionnaire covering IPV, perpetrator intent, cheating, and abuse subtypes. Results show substantial variation across models and dyads. Abuse and intent detection systematically decrease in mixed-gender dyads where the victim is male, with female perpetrator identity emerging as a consistent negative predictor of abuse recognition. Mixed-effects logistic regression confirms that gender roles significantly shape model outputs. Our findings suggest that LLMs reproduce gendered biases from online training data, with implications for support-related deployment. Code and resources are available at GitHub.
Datasets and Methods for Improving the Cultural Capabilities of NLP Systems: A Survey
Tania Chakraborty | Eylon Caplan | Zhaoqing Wu | Kevin Cushing | Bruce Qin | Shreya Havaldar | Dan Goldwasser
In recent years, there has been a surge of interest in Cultural NLP, with substantial efforts to create globally inclusive NLP systems. The rapid growth of literature in this field makes it difficult to track trends in methods and data resources. To address this, we survey over 375 papers to answer three complementary questions: (1) What Cultural Capabilities (CCs) are being targeted in NLP systems? (2) How are cultural data resources being created? and (3) What methods are being used to improve the CCs of those systems? We discuss trends observed across the three questions, and identify relevant research gaps. To facilitate further research in this field, we release our full list of surveyed papers, in the form of an interactive web interface, CultureMine, which includes a feature to allow researchers to add their work; we hope this facilitates future research and proves to be a valuable resource for the Cultural NLP community.
Towards More Transparent Online Campaigning: Detecting Political Campaign Content in Election-related Social Media Posts
Abdullah Alabdullah | Conor Gaughan | Thomas Flavel | Shubhanjay Varma | Rachel Gibson | Marta Cantijoch | Alexandru Cernat | Riza Batista-Navarro
A large part of political campaigns during elections is now being conducted online, with political actors leveraging their networks on social media platforms. To maintain transparency in political communications, regulations applicable to online campaigning have been put in place in many democracies. While it should be straightforward for voters to determine who produced and funded online advertisements comprising paid political campaigns, it is much more challenging to detect if organic content, i.e., social media posts, pertains to political campaigning, due to possibly subtle yet suggestive language that can be used by certain actors. In this paper, we investigate the feasibility of automatically detecting whether a given tweet posted by a political actor pertains to political campaigning, and if yes, whether it was conveyed in a direct or indirect (subtle) manner. After establishing an annotation scheme for the task of detecting political campaign content in tweets, we fine-tuned three encoder models (BERT, BERTweet and PoliBERTweet) for the same task and evaluated their performance. Our results show that fine-tuning BERTweet leads to the best macro-averaged F1-score (0.776), although all models consistently struggle to detect indirect campaigning.
Mapping the Landscape of Unregulated eXplicit Contents on Reddit
Msvpj Sathvik | Manan Roy Choudhury | Rishita Agarwal | Sathwik Narkedimilli | Thao Ha | Liesel Sharabi | Vivek Gupta
The rise of online platforms has facilitated covert forms of explicit content, which pose significant challenges for detection and regulation. Often using coded language to bypass moderation, this content erodes user trust and may be associated with scam-related risks, posing direct financial and personal risks. In this study, we map the landscape of online explicit content posts, focusing on their categorization, linguistic strategies, and temporal and behavioral patterns as they appear within the mainstream platform Reddit. We investigated five distinct content categories including Virtual Services (VS), Physical Services (PS), Exhibitionism (Ex), Couples and Group Interactions (CGI), and Content Creation and Sales (CCS) and performed large-scale experimentation using state-of-the-art large language models (LLMs) such as GPT-4, LLaMA 3.3-70B-Instruct, Gemini 1.5 Flash, Mistral 8×7B, Qwen 2.5 Turbo, and Claude 3.5 Haiku. Our work demonstrates that a nuanced classification of these services requires moving beyond simple keywords, and we establish that expressive signals such as sentiment, emotion, and tone are critical features for accurate detection. Our analysis reveals the distinct behavioral and psychosocial expression patterns that characterize each service category, providing a robust framework for future moderation.
From Adoption to Adaptation: Tracing the Diffusion of New Emojis on Twitter
Yuhang Zhou | Xuan Lu | Wei Ai
The frequent introduction of new emojis in each Unicode release creates a dynamic shift in social media content, providing a unique opportunity to explore the evolution of digital language. Analyzing a large dataset of sampled English tweets, we examine how newly released emojis gain popularity and evolve in meaning. We find that the community size of early adopters and emoji semantics are positively correlated with their popularity. Certain emojis experienced notable shifts in the meanings and sentiment associations during the diffusion process. Additionally, we propose a novel framework utilizing language models to extract words and pre-existing emojis with semantically similar contexts, which enhances the interpretation of new emojis. The framework demonstrates its effectiveness in improving downstream text classification performance by substituting unknown new emojis with familiar ones. This study offers a new perspective in understanding how new language units are adopted, adapted, and integrated into the fabric of online communication.
Social Construction of Urban Space: Using LLMs to Identify Neighborhood Boundaries From Craigslist Ads
Adam Visokay | Ruth Bagley | Chris Hess | Ian Kennedy | Kyle Crowder | Rob Voigt | Denis Peskoff
Rental listings offer a window into how urban space is socially constructed through language. We analyze Chicago Craigslist rental advertisements from 2018 to 2024 to examine how listing agents characterize neighborhoods, identifying mismatches between institutional boundaries and neighborhood claims. Through manual and large language model annotation, we classify unstructured listings from Craigslist according to their neighborhood. Further geospatial analysis reveals three distinct patterns: properties with conflicting neighborhood designations due to competing spatial definitions, border properties with valid claims to adjacent neighborhoods, and "reputation laundering" where listings claim association with distant, desirable neighborhoods. Through topic modeling, we identify patterns that correlate with spatial positioning: listings further from neighborhood centers emphasize different amenities than centrally-located units. Natural language processing techniques reveal how definitions of urban spaces are contested in ways that traditional methods overlook.
The Hidden Language of Harm: Examining the Role of Emojis in Harmful Online Communication and Content Moderation
Yuhang Zhou | Yimin Xiao | Wei Ai | Ge Gao
Social media platforms have become central to modern communication, yet they also harbor offensive content that challenges platform safety and inclusivity. While prior research has primarily focused on textual indicators of offense, the role of emojis, ubiquitous visual elements in online discourse, remains underexplored. Emojis, despite being rarely offensive in isolation, can acquire harmful meanings through symbolic associations, sarcasm, and contextual misuse. In this work, we systematically examine emoji contributions to offensive Twitter messages, analyzing their distribution across offense categories and how users exploit emoji ambiguity. To address this, we propose an LLM-powered, multi-step moderation pipeline that selectively replaces harmful emojis while preserving the tweet’s semantic intent. Human evaluations demonstrate that our approach effectively reduces offensiveness while preserving semantic integrity. Our analysis also reveals heterogeneous effects across offense types, offering nuanced insights for online communication and emoji moderation.
Proceedings of the Seventh Workshop on Privacy in Natural Language Processing
Ivan Habernal | Sepideh Ghanavati | Sara Haghighi | Krithika Ramesh | Timour Igamberdiev | Shomir Wilson
From Conventional Web Privacy to Agentic Disclosure: How Tool Schemas May Invite LLM Oversharing
Shahriar Shayesteh | Shomir Wilson
LLM agents increasingly act on behalf of users by selecting tools and constructing API requests to external services. This creates a new privacy risk in agentic systems: disclosure is no longer limited to what users directly enter into a form, but can instead be generated by the agent at runtime. In conventional web settings, disclosure is largely bounded by the user-facing interface, and what is appropriate to share varies across service contexts. In tool-using agents, however, disclosure is generated at runtime when user intent is translated into tool-call arguments for a particular receiving service, making context-sensitive disclosure boundaries harder to preserve. In this position paper, we argue that the runtime tool call is the key unit of privacy analysis in agentic systems. Our contribution is diagnostic rather than behavioral: instead of measuring realized leakage, we analyze interface conditions that may make agent oversharing more plausible. In particular, schemas that expose generic, weakly constrained free-text fields leave part of disclosure under agent discretion. In a case study of 2,344 tool specifications from the OpenAI GPT ecosystem, we find that 36.9% expose at least one such channel, creating conditions for within-context over-disclosure, cross-context leakage, and what we call contextual flattening. We conclude by outlining a research agenda for NLP that moves beyond output-only evaluation toward argument-level analysis of what tool schemas allow agents to send to third-party services.
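The kind of schema-level check described in this abstract can be illustrated with a minimal sketch. The heuristic below is an assumption made for illustration, not the authors' criterion: it treats a string parameter as a "weakly constrained free-text field" when it carries none of the standard JSON Schema keywords `enum`, `const`, `pattern`, `format`, or `maxLength`. The function name `free_text_fields` and the `booking_tool` specification are hypothetical.

```python
from typing import Any, Dict, List

# Standard JSON Schema keywords that narrow what a string may contain.
# Treating their absence as "weakly constrained" is an illustrative assumption.
CONSTRAINING_KEYWORDS = ("enum", "const", "pattern", "format", "maxLength")


def free_text_fields(schema: Dict[str, Any], path: str = "") -> List[str]:
    """Return dotted paths of string parameters with no constraining keyword."""
    found: List[str] = []
    for name, spec in schema.get("properties", {}).items():
        field_path = f"{path}.{name}" if path else name
        field_type = spec.get("type")
        if field_type == "string" and not any(k in spec for k in CONSTRAINING_KEYWORDS):
            found.append(field_path)
        elif field_type == "object":
            # Recurse into nested objects so inner free-text fields are also flagged.
            found.extend(free_text_fields(spec, field_path))
    return found


# Hypothetical tool specification, invented for this example.
booking_tool = {
    "type": "object",
    "properties": {
        "date": {"type": "string", "format": "date"},
        "party_size": {"type": "integer"},
        "cuisine": {"type": "string", "enum": ["thai", "italian"]},
        "notes": {"type": "string", "description": "Anything else"},
        "contact": {
            "type": "object",
            "properties": {"message": {"type": "string"}},
        },
    },
}

flagged = free_text_fields(booking_tool)
print(flagged)
```

Here `notes` and `contact.message` are flagged because nothing in the schema limits what an agent may place in them, while `date` and `cuisine` are not. A real audit would likely need to handle arrays, `oneOf`/`anyOf`, and `$ref` as well.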
The Challenge of Identifying the Origin of Black-Box Large Language Models
Ziqing Yang | Yixin Wu | Yun Shen | Wei Dai | Michael Backes | Yang Zhang
The tremendous commercial potential of large language models (LLMs) has heightened concerns over their unauthorized use. To address this, we focus on the task of identifying the origin of black-box LLMs. We further propose PlugAE, an effective and efficient identification method that proactively leverages LLM-specific adversarial embeddings and allows users to customize copyright tokens on a targeted query set. Extensive experiments demonstrate that PlugAE outperforms both state-of-the-art model watermarking and fingerprinting methods in accuracy and robustness. We further analyze its stealthiness and reliability from three complementary perspectives and conduct ablation studies under various configurations, confirming its practicality for real-world misuse detection.
SecureLLM: Using Inference-time Compositionality to Build Secure Language Models
Abdulrahman Alabdulkareem | Christian Michael Arnold | Yerim Lee | Pieter M Feenstra | Conner Arnold | Boris Katz | Andrei Barbu | Brian Cheung
As Large Language Models (LLMs) increasingly support critical sectors such as healthcare, finance, and public governance, ensuring data confidentiality and robust access control is a pressing societal challenge. Traditional security mechanisms isolate sensitive resources from unauthorized users, yet existing LLM safety approaches often fail to enforce strict segregation of confidential data. In this work, we introduce SecureLLM, a novel compositional framework for building secure LLMs that integrates fine-tuning with traditional access security measures to protect private information. By fine-tuning LLMs on segregated, “siloed” training data and composing their outputs at inference time based solely on a user’s verified credentials, SecureLLM not only prevents unauthorized data leakage but also enables accurate responses for complex queries spanning multiple data silos. Our method is demonstrated on a challenging natural-language-to-SQL translation task and is designed with real-world applications in mind, where protecting sensitive information is critical.
STAMP-R: Stylometric Text Anonymization with Memory-guided Policy Rewriting
Zhan Shi | Yefeng Yuan | Liang Cheng | Yuhong Liu
Modern machine learning systems rely heavily on large-scale textual data that often contain sensitive personal information. Although conventional anonymization techniques remove explicit identifiers, textual data remain vulnerable to authorship inference attacks that exploit persistent stylometric signals. Recent approaches leverage Large Language Models (LLMs) to rewrite text and obscure such signals, but they frequently overlook distinctive stylometric outliers and fail to achieve a favorable privacy–utility trade-off due to rigid, one-size-fits-all obfuscation strategies, while also incurring high computational costs. To address these challenges, we propose STAMP-R, a risk-adaptive reinforcement learning framework for instance-level authorship anonymization. We formulate anonymization as a risk-aware, instance-level style distribution shaping problem. Central to our approach is the Style Manifold Memory (SMM), which models the global stylistic landscape via prototype-based density estimation. SMM detects high-risk stylometric outliers and adaptively modulates a composite reward function, enabling stronger obfuscation for highly identifiable samples while preserving semantic fidelity for low-risk instances. We further distill a lightweight 3B-parameter model from a teacher LLM for efficient local deployment. Experiments show that STAMP-R reduces authorship re-identification risk while maintaining strong downstream utility.
Loss Masking Under the Hood: Backdoor Concealment and Private Data Memorization in LLMs
Tagore Rao Kosireddy | Evan Lucas
Loss masking has been proposed as a method for preventing language models from generating specific content by selectively zeroing the training loss on sensitive tokens, which allows a language model to learn protected content as context without learning to reproduce it (CITATION). Although promising, many critical questions about its impact on a model remain unanswered. In this work, we investigate the impact of loss masking on internal model representation and context understanding using a small causal language model (GPT-2) at three scales (124M, 355M, 774M parameters) and apply mechanistic interpretability tools including causal tracing, attention analysis, and linear probing. We explore two use cases of loss masking: backdoor concealment and prevention of memorization of named entities. In both settings, we find that loss masking successfully blocks generation of the protected tokens. Through mechanistic analysis, we show that protected token identity remains fully encoded in hidden states regardless of loss masking, confirming that loss masking suppresses the output pathway but not the internal encoding. Code is available at https://github.com/Tagore-7/loss-masking-analysis
Prompt Stylometry for On-Device Affect-Adaptive AI: A Feasibility Study in Linguistic Signal Detection and Response Steering
Debmalya Pal
Every user prompt contains latent linguistic signals beyond its explicit semantic content: lexical choice, hedging, sentence structure, and discourse patterns, that reflect the user’s affective state and cognitive style. Yet most large language models are optimized for generalized assistant behavior rather than explicit adaptation to these fine-grained signals. We introduce Prompt Stylometry, a framework for detecting affective and cognitive-style signals directly from user prompts and using them to steer response generation. We study two categories of signals: affect-related cues associated with emotional states, and cognitive-style cues associated with patterns such as analytical, exploratory, self-critical, or indecisive reasoning. This inference capability, however, creates substantial privacy risks: any system processing prompts server-side could implicitly profile users’ psychological states without their knowledge or consent. This motivates our core design choice of a fully on-device architecture in which no interaction data leaves the user’s device. We benchmark three annotation paradigms, lexicon-based, neural, and generative, across 600 synthetic prompts spanning 30 stylometric profiles, and evaluate affect-adaptive response steering across two small language model families under 5B parameters. Our results show systematic differences in both signal detection behavior and downstream steering responsiveness across annotation methods and model families, demonstrating the feasibility of privacy-preserving affect-adaptive AI on consumer hardware while identifying annotation paradigm sensitivity and cross-profile transfer as key open challenges.
Differential Privacy (DP) for text matured from disjointed word-level substitutions to contiguous sentence-level rewriting by leveraging the generative capacity of language models. While this form of text privatization is best suited for balancing formal privacy guarantees with grammatical coherence, its impact on the register identity of text remains largely unexplored. By conducting a multidimensional stylistic profiling of differentially-private rewriting, we demonstrate that the cost of privacy extends far beyond lexical variation. Specifically, we find that rewriting under privacy constraints induces a systematic functional mutation of the text’s communicative signature. This shift is characterized by the severe attrition of interactive markers, contextual references, and complex subordination. By comparing autoregressive paraphrasing against bidirectional substitution across a spectrum of privacy budgets, we observe that both architectures force convergence toward a non-involved and non-persuasive register. This register-blind sanitization effectively preserves semantic content but structurally homogenizes the nuanced stylistic markers that define human-authored discourse.
Privacy-preserving natural language processing (NLP) typically focuses on removing explicit identifiers such as names, addresses, and phone numbers. We argue that this approach overlooks a key risk: natural language itself encodes signals about a speaker’s geographic origin, social background, and community membership that persist after anonymization. We introduce Linguistic Identity Leakage (LIL), defined as the inference of personal or demographic attributes from linguistic features in text where explicit identifiers have been removed. We further introduce Linguistic Personally Identifiable Information (L-PII) to denote the linguistic features that enable such inference. Drawing on sociolinguistics, stylometry, and NLP privacy research, we propose a taxonomy of linguistic identity signals across five categories and examine implications for dataset release, language model training, and privacy auditing. Using examples from Arabic dialectal variation and other multilingual contexts, we present the Identity Inference Risk (IIR) framework for assessing residual privacy risk in NLP systems and discuss how contemporary LLMs amplify these risks. Our goal is to encourage broader recognition of the gap between conventional anonymization practices and the linguistic reality of natural language data.
A Systematic Exploration of Text Decomposition and Budget Distribution in Differentially Private Text Obfuscation
Stephen Meisenbacher | Angelo Kleinert | Florian Matthes
The goal of *differentially private text obfuscation* is to obfuscate, or "perturb", input texts with Differential Privacy (DP) guarantees, such that the private output texts are quantifiably indistinguishable from the originals. While perturbation at the word level is intuitive, meaningful text privatization happens on complete documents. Recent research has laid the groundwork for reasoning about *privacy budget distribution*, namely, how an overall 𝜀 budget can be sensibly distributed among the component pieces of a text. We perform a systematic evaluation of multiple text decomposition and budget distribution techniques in the context of DP text obfuscation, testing how different methods for chunking texts can be combined with techniques for allocating 𝜀 to these chunks. Our experiments reveal that such design choices are very important, as even with comparable privacy budgets, significantly different results can occur based on which methods are chosen. In this, we provide credible evidence of the feasibility of maximizing empirical trade-offs by optimizing DP obfuscation procedures.
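To make the idea of *privacy budget distribution* concrete, here is a minimal sketch of two allocation rules over a naive sentence-level decomposition. It is not taken from the paper: the chunking rule and both allocation functions are hypothetical, and it assumes basic sequential composition, under which per-chunk budgets that sum to 𝜀 yield an overall budget of 𝜀.

```python
from typing import List


def split_into_chunks(text: str) -> List[str]:
    """Naive decomposition: split on '.' and drop empty pieces (illustrative only)."""
    return [piece.strip() for piece in text.split(".") if piece.strip()]


def uniform_budget(chunks: List[str], epsilon: float) -> List[float]:
    """Give every chunk the same share of the total budget."""
    return [epsilon / len(chunks)] * len(chunks)


def length_proportional_budget(chunks: List[str], epsilon: float) -> List[float]:
    """Give each chunk a share proportional to its whitespace-token count."""
    lengths = [len(chunk.split()) for chunk in chunks]
    total = sum(lengths)
    return [epsilon * n / total for n in lengths]


document = "I live in Berlin. I work as a nurse at the city hospital."
chunks = split_into_chunks(document)
total_epsilon = 13.0  # arbitrary value chosen for the example

uniform = uniform_budget(chunks, total_epsilon)
proportional = length_proportional_budget(chunks, total_epsilon)
print(chunks)
print(uniform, proportional)
```

Both rules spend the same total 𝜀 but give individual chunks different amounts of noise, which is the kind of design choice the abstract reports as having a large effect on results.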
Safer Reasoning Traces: Measuring and Mitigating Chain-of-Thought Leakage in LLMs
Patrick Ahrend | Tobias Eder | Xiyang Yang | Zhiyi Pan | Georg Groh
Chain-of-Thought (CoT) prompting improves LLM reasoning but can increase privacy risk by resurfacing personally identifiable information (PII) from the prompt into reasoning traces and outputs, even under policies that instruct the model not to restate PII. We study such direct, inference-time PII leakage using a model-agnostic framework that (i) defines leakage as risk-weighted, token-level events across 11 PII types, (ii) traces leakage curves as a function of the allowed CoT budget, and (iii) compares open- and closed-source model families on a structured PII dataset with a hierarchical risk taxonomy. We find that CoT consistently elevates leakage, especially for high-risk categories, and that leakage is strongly family- and budget-dependent: increasing the reasoning budget can either amplify or attenuate leakage depending on the base model. We then benchmark lightweight inference-time gatekeepers: a rule-based detector, a TF–IDF + logistic regression classifier, a GLiNER-based NER model, and an LLM-as-judge, using risk-weighted F1, Macro-F1, and recall. No single method dominates across models or budgets, motivating hybrid, style-adaptive gatekeeping policies that balance utility and risk under a common, reproducible protocol.
Proceedings of the 1st Workshop on Multilingual Report Generation via Retrieval Augmented Generation (RAG4Reports 2026)
Eugene Yang | Dawn Lawrie | Sean MacAvaney | James Mayfield | Luca Soldaini | Andrew Yates
A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems
Florin Cuconasu | Giovanni Trappolini | Nicola Tonellotto | Fabrizio Silvestri
Retrieval-Augmented Generation (RAG) represents a significant advancement in artificial intelligence combining a retrieval phase with a generative phase, with the latter typically being powered by Large Language Models (LLMs). Common wisdom and practices in RAG involve using "instructed" LLMs, which are fine-tuned with supervised training to enhance their ability to follow instructions and are aligned with human preferences using state-of-the-art techniques. However, contrary to this popular belief, our study demonstrates that base models outperform their instructed counterparts in RAG tasks by 20% on average under our experimental settings. This finding challenges the prevailing assumptions about the superiority of instructed LLMs in RAG applications. Further investigations reveal a more complex situation, questioning fundamental aspects of RAG and suggesting the need for broader discussions on the topic; or, as Fromm would have it, "Seldom is a glance at the statistics enough to understand the meaning of the figures".
Decompose, Retrieve, Cite: A RAG Pipeline for Structured Report Generation from Technical Documentation
Himanshu Dhurve | Sreedath Panat | Rajat Dandekar | Raj Dandekar
Retrieval-Augmented Generation (RAG) grounds language-model output in external knowledge, yet its application to dense technical documentation remains largely unexplored. Engineering software manuals pose compounding challenges: formulae are corrupted during PDF extraction, heterogeneous content types require different parsing treatment, and queries demand cross-document synthesis across multiple reference volumes. We present an end-to-end RAG system for OpenFOAM, an open-source computational fluid dynamics toolkit, operating in two modes. In single-query mode, a formula-preserving parser (Marker), adaptive header-aware chunking, two-stage dense-then-rerank retrieval, and a citation-enforcement prompt produce grounded, source-attributed answers across a 20-question benchmark. In report mode, a user prompt is decomposed into sub-questions via LLM planning; each sub-question undergoes independent retrieval and cross-encoder re-ranking, and the deduplicated chunk set is passed to a long-context generation call that produces a structured, multi-section report with inline citations. Evaluated on a 10-prompt golden set with a six-dimension LLM-as-a-judge framework, both pipelines achieve overall scores above 4.6/5.0 with perfect citation correctness (5.0/5.0). The decomposed pipeline demonstrates superior robustness (90% vs 70% judge success rate). Retrieval analysis using page-level ground truth reveals low absolute recall (<14%), identifying retrieval breadth as the primary bottleneck.
We introduce EncouRAGe, a comprehensive Python library designed to streamline the development and evaluation of Retrieval-Augmented Generation (RAG) systems using Large Language Models (LLMs) and Embedding Models. EncouRAGe comprises five modular and extensible components: Type Manifest, RAG Factory, Inference, Vector Store, and Metrics, facilitating flexible experimentation and extensible development. Each component supports RAG development and evaluation, and the library emphasizes scientific reproducibility, diverse evaluation metrics, and local deployment, enabling researchers to efficiently assess datasets within RAG workflows. This paper presents implementation details and an extensive evaluation across multiple benchmark datasets, including 25k QA pairs and over 51k documents. Our results show that RAG still underperforms compared to the Oracle Context, while Hybrid BM25 consistently achieves the best results across all four datasets. Code: https://github.com/uhh-hcds/encourage
REFSafE: A RAG-Enabled Framework for Predictive Risk Analysis and Automated Safety Report Generation in Mission-Critical Environments
Sanjay Das | Ran Elgedawy | Ethan Seefried | Ryan A. Burchfield | Gavin Wiggins | Dana Hewit | Sudarshan Srinivasan | Prasanna Balaprakash | Robert M. Patton | Todd Thomas | Tirthankar Ghosal
Operational safety in mission-critical environments requires AI systems that are accurate, interpretable, and resistant to hallucination. We present an agentic Retrieval-Augmented Generation (RAG) framework, REFSafe, for grounded hazard analysis and automated safety report generation. The system integrates Large Language Models (LLMs) with structured operational data, historical incident repositories, policy documents, and external authoritative sources. Through iterative agentic reasoning, the framework retrieves, verifies, and synthesizes evidence prior to generation, enforcing citation-backed outputs with explicit source attribution (documents, links, and prior events) to ensure traceability and trust. To mitigate hallucinations and unsupported claims, all risk assessments and forecasts are constrained to retrieved evidence, with confidence signals derived from retrieval relevance and source consistency. A transparent pipeline enables subject matter experts (SMEs) to validate predictions and provide structured feedback, forming a continuous performance calibration loop. Preliminary deployment demonstrates improved reliability in hazard detection and safety/vulnerability report generation. This work advances trustworthy, evidence-grounded AI for predictive safety intelligence in mission-critical operations.
ORCHID: Orchestrated Retrieval-Augmented Classification of High-Risk Property with Intelligent Decision-Making
Sanjay Das | Maria Mahbub | Vanessa Lama | Brian Starks | Christopher Polchek | Saffell Silvers | Lauren Deck | Prasanna Balaprakash | Robert M. Patton | Tirthankar Ghosal
High-Risk Property (HRP) classification is critical at U.S. Department of Energy (DOE) sites, where inventories include sensitive and often dual-use equipment. Compliance must track evolving rules designated by various export control policies to make transparent and auditable decisions. Traditional expert-only workflows are time-consuming, backlog-prone, and struggle to keep pace with shifting regulatory boundaries. We propose ORCHID, a modular agentic framework for HRP classification that pairs retrieval-augmented generation (RAG) with human oversight to produce policy based outputs that can be audited. Small cooperating agents—retrieval, description refiner, classifier, validator, and feedback logger—coordinate via agent-to-agent messaging and invoke tools through the Model Context Protocol (MCP) for model-agnostic on-premise operation. The interface follows an "Item to Evidence to Decision" loop with step-by-step reasoning, on-policy citations, and append-only audit bundles (run-cards, prompts, evidence). In preliminary tests on real HRP cases, ORCHID improves accuracy and traceability over a non-agentic baseline while deferring uncertain items to Subject Matter Experts (SMEs). The demonstration shows single item submission, grounded citations, SME feedback capture, and exportable audit artifacts—illustrating a practical path to trustworthy LLM assistance in sensitive DOE compliance workflows.
A Pipeline to Bootstrap the Evaluation of Retrieval-Augmented Generation for the Automation of Systematic Reviews in Computer Science
Pierre Achkar | Tim Gollub | Arno Simons | Harrisen Scells | Maik Fröbe | Martin Potthast
Automating systematic reviews (SRs), i.e., evidence-driven analyses under explicit protocol constraints, is a natural target for retrieval-augmented generation and deep research agents, yet existing benchmarks evaluate isolated subtasks or assume fixed evidence inputs. We introduce RAG4SR-CS-200, a benchmark of 200 computer science systematic reviews designed for protocol-driven systematic review automation. Each instance comprises review objectives, research questions, eligibility criteria, cleaned full-text review structure, references, and extracted tables. These elements support evaluation across key tasks in systematic review creation such as literature retrieval, eligibility screening, citation-grounded review generation, and structured table generation, in both stage-wise and end-to-end settings. RAG4SR-CS-200 provides a foundation for developing more reliable and diagnosable deep research agents for scientific evidence synthesis. Code and data are publicly available (https://github.com/webis-de/rag4sr-cs-200).
UNH @ Rag4Reports: A Broad Exploration of LLM-Judges for RAG
Minna Tran | Ryan McCarthy | Aiden Parsons | Jaren Unzen | Laura Dietz
We submitted a breadth of LLM-as-a-Judge approaches to Rag4Reports Task A; our top method ranked first among all submitted systems. We find that citation faithfulness is the most essential signal, and that content is best verified by checking whether cited documents cover nuggets generated from the LLM’s internal knowledge.
Crucible @ Rag4Reports: Generating Nuggets for Report Generation and Evaluation
Laura Dietz | Eugene Yang
We submit to both tracks of the RAG4Reports challenge with two complementary components: PREFNUGGET, which derives concise nugget banks from pairwise preference judgments between system responses, and CRUCIBLE, a nugget-first pipeline that uses such banks to assemble reports on a given topic. The shared nugget-level representation unifies our approach to report evaluation (Task A) and report generation (Task B).
GenAIus at RAG4Reports 2026: Citation-Aware Compression for Multilingual Report Generation
Reyyan Yeniterzi | Suveyda Yeniterzi
This paper describes the GenAIus submission to RAG4Reports 2026 Multilingual Report Generation Task. Our system builds on our earlier TREC RAGTIME pipeline, reusing the evidence preparation stages for overlapping topics, including question generation, multilingual retrieval, nugget generation, and nugget clustering. For RAG4Reports, we focused on the final generation stage and tested a citation-aware compression strategy: generating the long report first from clustered evidence nuggets and then deriving the short report from it, rather than generating both length conditions independently. Our baseline run, which followed the original TREC-style setup, ranked third overall. Our best run, genaius-cluster-gpt4, ranked second overall with an F1 score of 0.5456, achieving the best balance among our submissions between nugget coverage and sentence support. The results suggest that citation-aware compression is a promising strategy for length-constrained, citation-grounded report generation.
AMU at RAG4Reports 2026 Task B: A Practical Multilingual RAG Pipeline for Citation-Grounded Reports
Maciej Czajka | Piotr Jabłoński | Mateusz Czajka | Konrad Pierzyński | Krzysztof Jassem
This system paper presents AMU’s submission to RAG4Reports 2026 Task B: a practical multilingual retrieval-augmented generation pipeline for evidence-supported report generation. The system combines full-query retrieval, optional query rewriting, dense retrieval with Qdrant, cross-encoder reranking, diversity-aware context selection, and structured generation. The best submitted run uses BAAI/bge-m3 embeddings, BAAI/bge-reranker-v2-m3 reranking, and gpt-5.1 generation with medium reasoning effort, using a partial-coverage prompt strategy. On the official leaderboard, it achieved F1=0.4351, sentence_support=0.8280, and nugget_coverage=0.3403, indicating that the generated reports were well grounded but only partially comprehensive.
Exploring Capability Thresholds in Ultra-Lightweight LLM Judges for Nugget-Based Report Evaluation
Mann Bajpai | Pulkit Chatwal | Priyanshu Deswal | Harish Pratap Singh | Santosh Kumar Mishra
Reliable automatic evaluation of retrieval-grounded long-form reports typically requires human annotation or frontier-scale proprietary LLMs, both of which are expensive in constrained settings. Team rgipt participated in RAG4Reports@ACL 2026 Task 1 with a zero-shot nugget-verification system that runs entirely on a single NVIDIA T4 GPU. We compare three ultra-lightweight decoder-only models: Qwen2-0.5B, Qwen2-1.5B, and Qwen2.5-0.5B, under identical inference conditions to examine how small an LLM judge can be while retaining human-aligned ranking signal. Both Qwen2 models produced negative 𝜏gap, whereas Qwen2.5-0.5B achieved 𝜏gap = 0.0772 and Pearson r = 0.2209, ranking 13th of 21 teams. Within this family and evaluation setting, model generation appears to matter more than parameter count, although this finding is based on three configurations on a single task and warrants further validation.
EFSG: Evidence-First Structured Generation for Multilingual RAG Report Generation
Shaurya Gupta | Jatin Bedi
We describe EFSG (Evidence-First Structured Generation), our submission to Task B of the RAG4Reports@ACL 2026 shared task. Standard retrieval-augmented generation pipelines allow generation models to write from parametric memory and attach citations retroactively: a behaviour we term post-rationalization. EFSG addresses this structurally through a phase boundary: all evidence is retrieved, extracted, and sealed into a fact pool before any generation begins; each sentence then sees only its single committed source passage. Our best run (t5100k doc corpus) achieved sentence_support of 0.612 and nugget_coverage of 0.126 (F1 = 0.182).
Adapting AutoARGUE for Automatic Report Evaluation under Missing Citation Annotations
Divrose Kaur | Jatin Bedi | Jasmeet Singh
We adapt the AutoARGUE framework (Walden et al., 2026) for Task A.2 of RAG4Reports 2026, which requires ranking 57 report generation systems across 68 topics using automated evaluation. The RAGTIME-1 corpus poses a fundamental challenge: all nugget annotations use a no-reference-doc sentinel rather than ground-truth document citations, rendering the original citation-relevance gating inoperable. We address this with three adaptations: automatic sentinel detection with forced direct LLM-based nugget matching; a WEAK POSITIVE partial credit mechanism for sentences that correctly answer nuggets but lack attesting citations; and a report-level request alignment check. Our nugget_coverage_weighted metric achieves the highest topic-level Pearson correlation (r=0.599) of any non-coordinator submission, closely approaching the coordinator baseline (r=0.607).
JU-NLP-PG at RAG4Reports 2026: Memory-Efficient Multilingual Report Generation with 4-bit Quantized LLMs
Swayam Chatterjee | Dipankar Das
In this article, we describe our system for Task B on Multilingual Report Generation under RAG4Reports 2026 at ACL 2026, submitted with run ID ju_nlp_pg. The task is as follows: given a report request in English, the system retrieves relevant passages from a multilingual corpus of four million documents (English, Chinese, Russian, Arabic) and generates a grounded, citation-bearing report. Our core challenge was fitting a large retrieval corpus along with a capable generative model on a two-GPU node with ≈29 GB RAM. We addressed this challenge with three techniques: (1) 4-bit NF4 quantization, shrinking the LLM from ≈14 GB to ≈4 GB; (2) memory-mapped, chunked FAISS index construction over pre-computed multilingual-e5-large embeddings; and (3) strict model-loading order to prevent heap fragmentation. In addition, the reports are structured around topic nuggets to directly target the Auto-ARGUE evaluation signal.
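The memory figures quoted for 4-bit quantization can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a model of roughly 7 billion parameters stored at 16 bits per weight; the abstract does not state the model size, so that number is an assumption chosen because it matches the ≈14 GB starting point. The helper `weight_memory_gb` is hypothetical and counts weights only.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring all overheads."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: ~7B parameters; not stated in the abstract.
assumed_params = 7e9

fp16_gb = weight_memory_gb(assumed_params, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(assumed_params, 4)    # 4-bit weights

print(f"16-bit weights: {fp16_gb:.1f} GB")
print(f"4-bit weights:  {nf4_gb:.1f} GB")
```

Under this assumption the raw 4-bit weights come to about a quarter of the 16-bit size. The gap between that figure and the reported ≈4 GB would plausibly be covered by quantization constants and any layers kept in higher precision, though the abstract does not break this down.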
Proceedings of the Society for Computation in Linguistics 2026
Rob Voigt | Alex Warstadt | Naomi Feldman | Tal Linzen
Measuring Perceptions of Personhood with Semantic Proto-role Properties
Elizabeth Spaulding Hoefer | James Martin
We show that semantic proto-role properties can be used as a tool to measure implicit human perceptions of agency and patiency of entities in human-generated text. First, we demonstrate that silver-generated semantic proto-role property labels are strongly correlated with both human judgment and a probabilistic text-based measure of anthropomorphism. Then, we use our measure to quantify linguistic idiosyncrasies across different AI-related Reddit communities. Our measure shows that subreddits dedicated to discussing AI companionship ascribe higher sentience to "bots" and higher agency to "companies" when compared to other subreddits. This phenomenon reveals not only the unique way in which chatbots are anthropomorphized in such subreddits, but also the users’ keen awareness of their power imbalance with the companies that created the chatbots.
Given a listener’s native language, some non-native contrasts may be harder to discriminate than others. The computation required to mimic this variable difficulty is not yet known. The present work approaches this question by training small supervised feedforward neural networks to perform Spanish vowel classification and then evaluating model classification of Catalan vowels, thereby approximating Spanish-listeners’ cross-linguistic perception of Catalan. Vowels were extracted from Spanish and Catalan audio corpora, respectively. Ultimately, Spanish models exhibited expected misperception of Catalan’s /e/-/ɛ/, /o/-/ɔ/, and /ɛ/-/a/ contrasts; Spanish-dominant listeners have difficulty perceiving these contrasts, and Spanish models classified Catalan /ɛ/ as /e/ or /a/, and Catalan /ɔ/ as /o/. This demonstrates that small supervised neural models are capable of making specific, cross-linguistic perceptual predictions given realistic input.
A Family of Effective Methods for Decompiling Canonical Acceptors, Instantiated for Languages of Dot-Depth One and Tier-Based Extensions
Dakotah Lambert
Many kinds of logical systems have been employed in constructing formal languages to model phonological phenomena. A common theme among them is that the systems compile into finite automata. Two questions naturally arise. Can a given phenomenon be described with another logical system? And, if so, what is that description? To the first question, algebraic techniques are well established through deep connections with logic and automata. To the second, the situation is less clear. Translations from automata are established for first-order and monadic second-order logics under precedence, but these may not translate easily to the simpler systems we often use. Translations for simple cases of restricted propositional logic (strictly local or strictly piecewise languages) are established, but insufficient to describe attested phenomena. The present work establishes a general way to handle many systems in between. Specifically, we show how to translate between certain kinds of algebraic varieties 𝐕 (systems defined by universally satisfied identities) and associated logical systems, then use decomposition to handle classes of the form 𝐕∗𝐃, where the notion of “symbol” is replaced by “k-block”. With this, we handle several (unrestricted) propositional logics, facilitating logical description of natural language.
Word2Vec’s effectiveness at generating semantic embeddings has been widely validated, yet it has been tested almost exclusively on languages with large vocabulary inventories. This study examines whether Word2Vec can successfully capture semantic relationships within an extremely reduced vocabulary using data from Toki Pona, a constructed language with approximately 130 words. We sourced 1.4 million sentences (7.95 million tokens) from the Toki Pona community for training. Approximately 23% of sentences in the corpus contain non-Toki Pona tokens such as named entities, loanwords, and neologisms. To investigate whether this linguistic noise enhances or hinders performance—a topic rarely addressed in word embedding literature—we trained two distinct models: one retaining these incidental tokens and another filtering them out completely. Evaluation was conducted using quantitative methods measuring word proximity to semantic category centroids, automated silhouette scores via agglomerative clustering, and qualitative analysis utilizing representational similarity matrices compared against English. The results indicate that while sparse, non-core tokens do not affect the relative structure of the learned embeddings, they actually draw similar words closer together in the vector space. Importantly, Word2Vec’s effectiveness depends more on distributional patterns than lexicon size even at this extreme lower bound.
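The centroid-proximity evaluation mentioned in the abstract above can be illustrated with a minimal sketch. Cosine similarity is an assumption here (the abstract does not name the distance measure), the function names are hypothetical, and the two-dimensional vectors are invented toy values rather than trained embeddings.

```python
import math
from typing import List

Vector = List[float]

def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def centroid(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def proximity_to_centroid(word_vec: Vector, members: List[Vector]) -> float:
    """Similarity of one word vector to the centroid of a semantic category."""
    return cosine(word_vec, centroid(members))

# Invented toy vectors for an "animal" category (not real embeddings).
ANIMALS = {"soweli": [0.9, 0.1], "waso": [0.8, 0.3], "kala": [0.7, 0.2]}

if __name__ == "__main__":
    for word, vec in ANIMALS.items():
        score = proximity_to_centroid(vec, list(ANIMALS.values()))
        print(word, round(score, 3))
```

A word that belongs to a category should score close to 1 against that category's centroid and lower against unrelated centroids.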
Learning Latent Representations with Progressive Hypothesis Space Expansion
Jonathan Charles Paramore
This paper introduces a learning model to address the computational challenges arising from including highly abstract underlying representations (URs) in morphophonemic learning. The proposed learner structures the UR hypothesis space by disparity distance and considers potential URs in batches, beginning with fully concrete URs, only expanding the UR candidate space if the current set of UR candidates fails to meet a predetermined likelihood threshold. When expanding the UR candidate set, the learner uses markedness constraint weights and violation profiles to identify features that are potentially mis-specified underlyingly, limiting the generation of new URs to changes of those feature values. Overall, the learner inherently restricts abstraction to cases where introducing it demonstrably improves likelihood, while avoiding issues associated with the exhaustive search of an unbounded hypothesis space. Applied to a Pakistani Punjabi vowel nasality pattern, the model is shown to successfully acquire abstract URs for phonological patterns that parallel learners fail to capture.
What Exactly do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies
Zhenghao Zhou | William Dai | Maya Viswanathan | Simon Charlow | R. Thomas McCoy | Robert Frank
Children’s acquisition of filler-gap dependencies has been argued by some to depend on innate grammatical knowledge, while others suggest that the distributional evidence available in child-directed speech suffices. Unfortunately, the relevant input is difficult to quantify at scale with fine granularity, making this question difficult to resolve. We present a system that identifies three core filler-gap constructions in spoken English corpora – matrix wh-questions, embedded wh-questions, and relative clauses – and further identifies the extraction site (i.e., subject vs. object vs. adjunct). Our approach combines constituency and dependency parsing, leveraging their complementary strengths for construction classification and extraction site identification. We validate the system on human-annotated data and find that it scores well across most categories. Applying the system to 57 English CHILDES corpora, we are able to characterize children’s filler-gap input and their filler-gap production trajectories over the course of development, including construction-specific frequencies and extraction-site asymmetries. The resulting fine-grained labels enable future work in both acquisition and computational studies, which we demonstrate with a case study using filtered corpus training with language models.
Determinants of Hesitations and Repetitions in Hindi Spontaneous Speech
Eashani Sharma | Ishita Arun | Samar Husain
This study investigates the factors that predict disfluencies in Hindi spontaneous speech. In particular, we probe the influence of lexical, syntactic, phonological, and prosodic factors on two kinds of disfluencies, namely, hesitations and repetitions. These disfluencies are probed through both the nature of linguistic factors as well as through the source (preceding vs. following word) of these factors. Our results show that hesitations and repetitions pattern differently during spontaneous speech. Hesitations increase due to lexical, syntactic, as well as articulatory features from both preceding and following words. On the other hand, repetitions arise mainly due to lexical and articulatory factors of the upcoming word. Further, while previous research (e.g., Bell et al., 2009; Dammalapati et al., 2021) on English highlights the importance of upcoming difficulty on disfluencies, our results suggest that previously encountered difficulties can also lead to an increase in disfluencies. This suggests that language typology (SVO vs SOV) can play a critical role in determining the planning process and thereby affecting the distribution of disfluencies in a language. Together, these findings highlight the need for increased cross-linguistic research to understand the nature of incrementality and monitoring of the production system cross-linguistically.
An LLM Investigation into Inherent and Structural Case Representation: a German Case Study
Iona Carslaw | András Bárány | Itamar Kastner | Mark Steedman
A question for computational linguistics has been to what degree language models encode case information. However, the majority of the work has focused on structural cases (cases which change when the syntactic configuration changes). On the other hand, inherent cases (which are assigned by specific lexical items and do not change if the syntactic configuration changes) have been overlooked. This paper sets out to investigate whether German language models encode inherent dative distinctly from structural accusative and nominative. We conducted a linguistic probing investigation where probes are trained on contextual word embeddings of active nominative, accusative, and dative arguments to predict whether passivised datives are analysed as a structural nominative. We provide a cased and caseless version of the experiment. Our results suggest that when case information is removed, language models can distinguish between inherent dative and structural accusative, regardless of argument position, due to verb information. However, language models cannot distinguish between structural nominative and inherent dative when the dative appears in a position where there is an expected nominative, due to over-relying on surface patterns.
Autosegmental approaches to Arabic root-and-pattern morphology generally take a three-tier approach, with tiers corresponding to the prosodic template, consonantal root, and affixes (e.g., McCarthy 1981); association between these tiers proceeds from left-to-right. However, Jardine (2017) shows that left-to-right association exceeds regular computation for autosegmental representations of arbitrary length, challenging the cognitive plausibility of this approach. This paper demonstrates that in the case of Arabic morphology, the constraints of the system itself — in particular, the finite length of the consonantal root — allow such a left-to-right autosegmental association to not only be definable with Monadic Second Order (MSO) logic, but with First Order logic. This paper introduces a logical relational structure formalizing the three-tier autosegmental representations and defines a set of transductions which apply in parallel over these structures to yield well-formed root and affix associations.
Comonadic Morphophonology: A Compositional Framework for Context-Dependent Morphological Rules in Finnish
Yongseok Jang
Composing finite-state transducers (FSTs) for context-dependent morphophonological rules—consonant gradation, vowel harmony, possessive suffix assimilation—leads to multiplicative state explosion; neural models sidestep the problem but provide no formal account of the rules themselves. We present the first framework where each morphophonological rule is a function from a focused local context to a single output segment—the type of a local rule familiar from cellular automata—and where length-changing rules compose as coKleisli arrows of a comonad. Our central contribution is the Writer comonad (DeletionSet x Zipper), a new algebraic construction that restores strict coKleisli compositionality for such rules: each rule is a coKleisli arrow, extend lifts it to a global transformation, and deletions accumulate as a monoid action rather than requiring intermediate materialization. As supporting evidence, thirteen coKleisli arrows provide an alternative formulation expressing the same morphophonological behaviors that Omorfi encodes via 874 continuation classes (67:1 reduction at the rule-representation level), and the same abstraction enables bidirectional morphology—a MorphGenerator reuses the analysis arrows for generation. On UD Finnish-TDT, the system achieves 83.92% UPOS accuracy with rule-only disambiguation (94.66% with an external suffix tagger), validating the framework as a practical morphological engine.
This study uses a modeling approach to explore the development of spectral and positional encodings in speech sounds. Humans rely on their auditory system to differentiate between individual sounds in words by analyzing both spectral properties of phonemes and their relative positions. Previous neuroscientific research has identified specific neural populations in the auditory cortex that respond to spectral processing, while behavioral studies have confirmed humans’ ability to perceive the relative positions of phonemes in speech sequences. To investigate these encodings, we employ a Long Short-Term Memory (LSTM) autoencoder with a cross-attention mechanism, trained on Mel-spectrograms computed from raw speech data. By conducting ABX tests on the model’s representations at various learning stages, we observe the emergence of spectral and positional encodings. The results show that the model excels in distinguishing spectral features, in line with neuroscientific findings, and also reveals independent positional encoding through accurate temporal distinctions. Furthermore, we illustrate the developmental trajectory of spectral and positional encodings during the learning process, pointing to the need for further investigation of their neural correlates.
The Spanish Learner and Heritage Speaker Dependency Treebank
Valeria Pagliai | Sergio José Salazar Rodó | Emiliana Pulido | Andres Gutierrez-Quintero | Zoey Liu
We present a manually curated L2-Heritage Speaker Spanish dataset (N = 49,247) following the Universal Dependencies framework, including lemmatizations, part-of-speech tags, syntactic dependencies, and instances of pro-drop and ungrammatical structures. In addition to this, for dependency parsing we examined different data partitioning strategies and data representations, as well as different training configurations using our data and the AnCora treebank. Overall, the results yield reasonable LAS scores and comparable performance between AnCora and our dataset.
Omnivorous Agreement, like Uyghur Backness Harmony, is a Challenge for Tier-Based Strict Locality
Allison Verbil | Tim Hunter
A well-known exception to the characterization that phonological patterns belong to the subregular class of TSL dependencies is found in Uyghur backness harmony (Mayer and Major, 2018). At the same time, a recent line of work has argued that many long-distance syntactic phenomena are subsumed by the TSL class, revealing an interesting parallel between phonology and syntax. We show that a certain omnivorous syntactic agreement pattern, namely Mundari object agreement (Murugesan et al., 2025), poses the same challenge to TSL as Uyghur backness harmony.
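For readers unfamiliar with the TSL class discussed above, the following minimal sketch shows how a tier-based strictly 2-local grammar evaluates a string: symbols off the tier are erased, and the remaining sequence is checked for forbidden adjacent pairs. The toy alphabet, the tier, and the forbidden pairs are hypothetical; this is not an analysis of Uyghur or Mundari, and it does not reproduce the argument of the paper.

```python
from typing import Set, Tuple

def project_tier(word: str, tier: Set[str]) -> str:
    """Keep only the symbols that belong to the tier, preserving their order."""
    return "".join(ch for ch in word if ch in tier)

def tsl2_accepts(word: str, tier: Set[str], forbidden: Set[Tuple[str, str]]) -> bool:
    """Return True if no forbidden pair is adjacent on the projected tier."""
    projected = "#" + project_tier(word, tier) + "#"  # '#' marks the word edges
    return all((a, b) not in forbidden for a, b in zip(projected, projected[1:]))

# Hypothetical toy harmony: back 'a' and front 'e' may not follow each other,
# no matter how many consonants intervene.
TIER = {"a", "e"}
FORBIDDEN = {("a", "e"), ("e", "a")}

if __name__ == "__main__":
    for w in ["katab", "ketep", "katep"]:
        print(w, tsl2_accepts(w, TIER, FORBIDDEN))
```

Because the check only ever sees adjacent tier symbols, any pattern whose licensing depends on material that the tier must sometimes ignore and sometimes keep falls outside what a single fixed tier can express.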
Modelling the Diachronic Emergence of Phoneme Frequency Distributions
Fermin Moscoso Del Prado Martin | Suchir Salhan
Phoneme frequency distributions exhibit robust statistical regularities across languages, including exponential-tailed rank-frequency patterns and a negative relationship between phonemic inventory size and the relative entropy of the distribution. The origin of these patterns remains largely unexplained. In this paper, we investigate whether they can arise as consequences of the historical processes that shape phonological systems. We introduce a stochastic model of phonological change and simulate the diachronic evolution of phoneme inventories. A naïve version of the model reproduces the general shape of phoneme rank-frequency distributions but fails to capture other empirical properties. Extending the model with two additional assumptions –an effect related to frequency and a stabilising tendency toward a preferred inventory size– yields simulations that match both the observed distributions and the negative relationship between inventory size and relative entropy. These results suggest that some statistical regularities of phonological systems may arise as a result of diachronic sound change instead of –or in addition to– explicit optimisation or compensatory mechanisms.
Morpheme structure phonotactics: a categorical model for morpho-phonological productivity in Russian vowel-zero alternations
Daniar Kasenov
Nonce word studies motivate a notion of gradient similarity between nonce words and real words. In morpho-phonological research, similarity is often taken to be a relationship between a nonce word and the list of morphemes / words that undergo a given morphophonological alternation (Albright and Hayes 2003; Becker et al. 2011 i.a.). This paper challenges this view on the basis of nonce word data on Russian vowel–zero alternations (Gouskova and Becker 2013; Becker and Gouskova 2016). I propose a model where morpho-phonological similarity is a relationship between the available underlying representations and the underlying representation the nonce item must have in order to undergo the alternation. The implementation of the proposed model matches, and in some comparisons exceeds, the performance of Becker and Gouskova’s (2016) MaxEnt model. This study thus presents a linking hypothesis between nonce word studies and approaches that mark segments themselves as undergoing certain restricted alternations.
Agreement attraction errors, in which a verb erroneously agrees with an intervening noun rather than its grammatical head, are amplified by morphological syncretism in some languages (English, German, Russian) but not others (Turkish, Armenian), a cross-linguistic pattern without a principled account. We use surprisal and attention entropy from large language models as processing proxies to investigate this variation across four languages. LLM-derived measures replicate behavioral findings in English and German (syncretism modulates attraction), align with Turkish null results (no modulation), and partially capture Russian patterns. We discuss further directions for better understanding why syncretism affects agreement attraction differently across languages.
Search & Change (S&C) is a procedural model of phonological rule application that is conceptually clear and linguistically motivated, but whose computational properties have not been fully characterized. This paper provides a formal specification of S&C within the framework of Logical Phonology, presents a linear-time algorithm for rule application with a proof of correctness, and gives a compilation procedure mapping S&C rules to a single transition structure that is subsequential in one scan orientation and reverse-subsequential in the other, situating S&C within a well-understood subclass of regular string-to-string functions with known learnability guarantees and algebraic characterizations, implying that S&C-definable mappings are learnable from positive input/output pairs and amenable to algebraic classification.
This paper investigates whether tonotactic learning differs across representations and learning models. We conduct an experiment using the same dataset encoded in three representations: segments, features, and autosegmental representations (ARs). To the extent possible, two learning models are evaluated, the Maximum Entropy (MaxEnt) model and the Bottom-Up Factor Inference Algorithm (BUFIA), to examine how learning outcomes interact with both model type and representations. A follow-up experiment further explores the roles of frequency and complexity thresholds. The results show that (1) AR-based learning gives the strongest overall performance; (2) there is no consistent advantage between segmental and featural representations across learning models; (3) MaxEnt performance improves substantially when frequency information is introduced; and (4) the effects of complexity bounds interact with representation type and frequency information. These findings suggest that tonotactic learning benefits from structurally explicit representations. Overall, this work highlights the importance of incorporating linguistically meaningful representations into learning.
Investigating Syntactic Biases in Multilingual Transformers with RC Attachment Ambiguities in Italian and English
Michael Kamerath | Aniello De Santo
This paper investigates whether monolingual and multilingual LLMs show human-like preferences when presented with examples of relative clause attachment ambiguities in Italian and English. We also test whether these preferences can be modulated by lexical factors (the type of verb/noun in the matrix clause) which have been shown to be tied to subtle constraints on syntactic and semantic relations. Our results overall showcase how LLM behavior varies inconsistently across models and languages, and highlight the importance of leveraging subtle syntactic contrasts in exploring these models’ ability to correctly align with human-like preferences.
A Feature-Driven Tensor Semantics for Minimalist Grammars
John Paulson | Aniello De Santo | Jonathan Rawski
This paper shows how tensor-based distributional semantics can be incorporated into Minimalist Grammars (MGs), leveraging the tensor-based MG representations of beim Graben and Gerth (2012). We embed the Minimalist feature calculus with a tensor algebra and give a joint tensor-based representation where compositional semantics is guided by the minimalist syntax. By bridging syntactic and semantic operation in tensor spaces, we aim to contribute to the broader enterprise of neurosymbolic approaches to linguistic cognition.
Word Predictability on Code-switching Points in Cantonese–English Discourse
Ariel Shuk Ling Chan | Yanting Li | Jacob Poschl
This paper investigates how word predictability influences code-switching probability. We analyze 1,010 code-switched instances drawn from naturalistic sociolinguistic interviews with 41 Cantonese–English bilinguals across three bilingual groups (homeland, immersed, and heritage). In particular, we examine whether the predictability of switch points, operationalized as surprisal, influences the likelihood of code-switching. Using pretrained transformer-based language models, we estimate surprisal at the switch point under different modeling conditions, including autoregressive and masked models and varying amounts of contextual information. Mixed-effects logistic regression analyses show that less predictable words are more likely to be code-switched. These effects are largely consistent across model types and bilingual groups. Overall, these findings highlight the role of predictability in bilingual speech production and provide new insights into code-switching among bilingual speakers with diverse language experiences.
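Surprisal, the predictability measure used above, is the negative log probability of a word given its context. The sketch below computes it in bits from a hand-specified next-word distribution; that distribution and the function names are hypothetical stand-ins for the probabilities that the study obtains from pretrained transformer language models.

```python
import math
from typing import Dict

def surprisal_bits(probability: float) -> float:
    """Surprisal in bits: -log2 P(word | context)."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)

def word_surprisal(next_word_probs: Dict[str, float], word: str) -> float:
    """Look up a word in a conditional distribution and return its surprisal."""
    return surprisal_bits(next_word_probs[word])

# Hypothetical distribution over the next word after some fixed context.
NEXT_WORD = {"book": 0.5, "paper": 0.25, "report": 0.125, "memo": 0.125}

if __name__ == "__main__":
    for w in NEXT_WORD:
        print(w, word_surprisal(NEXT_WORD, w))
```

Halving a word's probability adds exactly one bit of surprisal, so less predictable words receive higher values.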
Non-literal Meaning Representation in the Brain during Naturalistic Listening
Zhengwu Ma | Yuhan Huang | Chengcheng Wang | Jixing Li
Naturalistic language comprehension often involves interpretations that go beyond literal meaning. In continuous narratives, literal and non-literal meanings are tightly intertwined, making them difficult to distinguish computationally. Here, we combined literal sentence representations and human-annotated non-literal interpretations for model-brain alignment. Using fMRI data recorded during passive listening to the Chinese version of The Little Prince, we annotated sentences containing non-literal meaning with human-written interpretations of their implied meaning. We then derived the literal and non-literal representations from LLaMA3.1-8B and evaluated their correspondence with neural activity using whole-brain encoding models. Literal representations aligned strongly with left-lateralized frontotemporal regions, whereas non-literal interpretations showed broader right-hemisphere involvement. Combining the two further improved encoding performance in the bilateral temporal and dorsal frontal cortices, suggesting that naturalistic comprehension engages complementary levels of meaning.
Probing the Attention Representation of Filler-Gap Dependency in Transformers
Ruoqing Yao | Pranav Anand
Prior work (Wilcox et al., 2024; Kobzeva et al., 2025) shows that neural language models exhibit filled-gap and unlicensed-gap effects, yet these effects attenuate with intervening clauses, especially with intervening overt complementizers. We conduct attention probing experiments on GPT-2 and identify two specific heads (layer 5, head 2, and layer 8, head 9) whose verb-to-filler attention correlates with filled-gap surprisal. The two heads are sensitive to clausal intervention but not to linear distance, and they show distinct patterns in islands. When intervening overt complementizers appear, the attention of layer 5, head 2 redistributes from the filler to the nearest complementizer, producing an “attend-closest-C” pattern, while that of layer 8, head 9 does not. These results suggest that LMs may have allocated distinct linguistically meaningful representations from the training data to individual attention heads, but that they fail to fully learn the correct grammars of FGDs.
Learning Stress in Arabic Low-Resource Settings
Abed Qaddoumi | Jordan Kodner | Owen Rambow | Salam Khalifa | Jeffrey Heinz
We predict lexical stress in Arabic varieties using syllable structure (a sequence of CVs, with C for consonants and V for vowels). Our task is generation: given an unstressed input, the system outputs a stress-marked word. We compare four approaches: a grammar induction algorithm (BUFIA), a transformer-based neural network (NN), a rule-based method, and a frequency baseline. The models are evaluated across several low-resource settings by varying the training data size by words, structural type, and syllable count. BUFIA outperforms the neural network, especially when data are scarce. This supports grammar induction as an interpretable and sample-efficient alternative for learning stress.
Many gradable properties have been found to be encoded as axes in embedding space. Most commonly, property axes are computed using seed words, but recent work has noted limitations to seed-based axes. Here, we present a novel methodology for computing property axes that is based on human ratings and does not require seeds. We apply this methodology to a particular problem at the syntax-semantics interface: which semantic properties of intransitive verbs affect their likelihood to occur in one of two syntactic structures, unergative and unaccusative. Comparing property axes that encode different semantic dimensions of the concept of agentivity, we find that properties like movement and being alive are a better predictor of the syntactic behavior of intransitives than goal-directedness or intentionality. We discuss the potential of rating-based axes for future work in semantics and at the syntax-semantics interface.
Mapping the meaning of Hungarian impulsative constructions
Ágnes Kalivoda | Robert Malouf | Fackerman@Ucsd.Edu Fackerman@Ucsd.Edu
We upload the abstract as a PDF file.
Various work in computational phonology has studied the computational properties of Optimality Theory. Some algorithms exist for the universal generation problem, including those of Ellison and Tesar, but their domain of applicability is poorly understood. I propose and study a concrete ‘minimal’ fragment of finite-state Optimality Theory. I show that the universal generation problem for it is efficiently solvable by improving Ellison’s Algorithm, demonstrate that it has been implicitly used in the literature, and discuss its limitations. The minimal fragment is a natural and foundational step towards a computationally tractable general formalism for phonological analysis.
This paper investigates the learnability of interacting phonological processes by restricting the hypothesis space to a subregular class of functions. Interacting processes can be modeled as function composition, where the output of one function serves as the input to another. We focus specifically on interactions between two simplex Input Strictly Local (ISL2) functions, a proper subclass of the ISL function class. We propose a decomposition algorithm that reconstructs both the individual component processes and their relative ordering by exploiting structural properties of simplex ISL2 transducers and their compositions. This work provides an initial step toward understanding how learners can infer not only single phonological processes, but structured interactions between processes.
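The idea of interacting processes as function composition can be sketched as follows. Each toy rule rewrites the current symbol based only on the preceding input symbol, which is the window an ISL2 function has access to. The two rules, their names, and the example word are hypothetical, the sketch ignores any word-final output, and it does not implement the decomposition algorithm or the "simplex" restriction from the abstract.

```python
from typing import Callable

# A rule maps (previous input symbol, current input symbol) to an output string.
Rule = Callable[[str, str], str]

def apply_isl2(rule: Rule, word: str) -> str:
    """Apply a rule left to right with a window of two input symbols."""
    out = []
    prev = "#"  # word-initial boundary
    for cur in word:
        out.append(rule(prev, cur))
        prev = cur  # the window tracks the input, not the output
    return "".join(out)

def voice_after_nasal(prev: str, cur: str) -> str:
    """Hypothetical rule A: t becomes d after n."""
    return "d" if prev == "n" and cur == "t" else cur

def delete_d_after_nasal(prev: str, cur: str) -> str:
    """Hypothetical rule B: d is deleted after n."""
    return "" if prev == "n" and cur == "d" else cur

def compose(first: Rule, second: Rule) -> Callable[[str], str]:
    """The output of `first` serves as the input to `second`."""
    return lambda word: apply_isl2(second, apply_isl2(first, word))

a_then_b = compose(voice_after_nasal, delete_d_after_nasal)  # A feeds B
b_then_a = compose(delete_d_after_nasal, voice_after_nasal)  # B applies too early

if __name__ == "__main__":
    print(a_then_b("anta"), b_then_a("anta"))
```

The two orderings give different surface forms for the same input, which is the kind of evidence a learner would need in order to recover both the component processes and their relative order.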
Do I know what I want to say? Modeling meaning uncertainty in RSA
Anzi Wang | Carolyn Jane Anderson | Grusha Prasad
Models using the Rational Speech Act (RSA) framework typically assume that speakers are certain about the meaning being communicated. In this work we note that there are contexts in which this assumption does not hold, and propose a method (um-RSA) to incorporate this meaning uncertainty within the RSA framework. As a case study, we explore two sources of meaning uncertainty: Counting-Uncertainty (from numerical cognition) and Discounting-Uncertainty (from behavioral economics). We generate predictions from these two hypotheses and test these predictions with two human experiments. The results show that um-RSA can account for differences in uncertainty expression usage that the standard RSA framework cannot account for, thus demonstrating the usefulness of modeling meaning uncertainty.
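As background for the abstract above, here is a minimal sketch of the standard RSA recursion (literal listener, pragmatic speaker, pragmatic listener) on the familiar some/all example. It assumes a uniform prior, no utterance costs, and a rationality parameter alpha; the lexicon and names are illustrative. It shows only the baseline in which the speaker is certain of the intended meaning, not the um-RSA extension proposed in the paper.

```python
from typing import Dict

WORLDS = ["some_not_all", "all"]
UTTERANCES = ["some", "all"]

# Truth-conditional lexicon: 1.0 if the utterance is true in the world.
MEANING = {
    "some": {"some_not_all": 1.0, "all": 1.0},
    "all": {"some_not_all": 0.0, "all": 1.0},
}
PRIOR = {"some_not_all": 0.5, "all": 0.5}  # assumed uniform prior over worlds

def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale non-negative scores so that they sum to one."""
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}

def literal_listener(utterance: str) -> Dict[str, float]:
    """L0(w | u) is proportional to [[u]](w) * P(w)."""
    return normalize({w: MEANING[utterance][w] * PRIOR[w] for w in WORLDS})

def speaker(world: str, alpha: float = 1.0) -> Dict[str, float]:
    """S1(u | w) is proportional to L0(w | u) ** alpha (no utterance costs)."""
    return normalize({u: literal_listener(u)[world] ** alpha for u in UTTERANCES})

def pragmatic_listener(utterance: str, alpha: float = 1.0) -> Dict[str, float]:
    """L1(w | u) is proportional to S1(u | w) * P(w)."""
    return normalize({w: speaker(w, alpha)[utterance] * PRIOR[w] for w in WORLDS})

if __name__ == "__main__":
    print(pragmatic_listener("some"))
```

Hearing "some", the pragmatic listener shifts probability toward the some-but-not-all world, because a speaker who knew that "all" was true would more likely have said so.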
This paper examines the learnability of different types of tone sandhi in Structural Optimality, a constraint-based framework that posits hierarchical scales and defines constraints over the scales. Approached as a hidden structure problem, we show that Expectation Driven Parameter Learning can acquire these grammars, but that their properties can make learning difficult.
Concrete words (e.g., apple) are often described in the literature to share more semantic features across languages than abstract words (e.g., appetite). We test this hypothesis using multilingual aligned word embeddings by measuring the distance between words and their nearest neighbor in other languages, and examining whether shorter distances predicted higher concreteness ratings in six languages: Dutch, English, French, Cypriot Greek, Mandarin, and Portuguese. The relationship between concreteness and cross-linguistic distance varied across languages, suggesting that concreteness does not uniformly correspond to cross-linguistic semantic relatedness. Our attempt highlights the potential of using aligned word embeddings for operationalizing psycholinguistic constructs.
Frequency modulates structural choice in Turkish suspended affixation: a latent-process account
Utku Turk | Eva Neu | Özge Bakay | Brian Dillon | Gaja Jarosz
Suspended affixation (SA) allows a suffix on one conjunct to scope over all coordinated elements. While inflectional SA is productive in Turkish, derivational SA is claimed to be highly restricted; yet speakers readily accept certain cases. We propose that this gradient acceptability reflects a frequency-modulated choice between two possible syntactic representations: base-generation, which licenses derivational SA, and ellipsis. To test this, we conducted a rating task on the acceptability of four derivational suffixes in SA form while manipulating the frequency of coordinations. Using a Multinomial Processing Tree model to isolate latent structural choices from surface ratings, we found that frequency modulated SA acceptability for some suffixes (i.e., sIz ’-less’ and cI ’-maker’), but not others (i.e., lI ’-having’ and lIk ’-for’). These findings suggest that frequency shapes syntactic parsing in morphologically complex environments.
Effect of case markers during agreement production: A model comparison using Armenian forced choice data
Pranab Bagartti | Samar Husain
Agreement attraction errors, where the verb erroneously agrees with a non-subject noun, have been a useful tool for investigating processes that subserve sentence production. Research has shown that case markers play an important role in modulating such errors. These effects have been argued to arise due to an underlying cue-based retrieval system. However, subsequent research in Armenian has challenged this conclusion (Avetisyan et al., 2020), arguing against a cue-based retrieval account. The current paper revisits the Armenian production data through computational modeling. Specifically, we implemented three distinct models and compared their predictions: (a) a cue-based retrieval model, (b) a feature migration model, and (c) a case-as-markers-for-agreement-prediction model. Our model comparison results show that the case-as-markers-for-agreement-prediction model followed by an inference component explains the effect of case better than both the cue-based retrieval model and the feature migration model.
One of the most fundamental representations in linguistic semantics is that of the proposition (McGrath and Frank, 2005), standardly taken as the carrier of truth-conditions. Recent work shows that some form of truth can be decoded from language models (Azaria and Mitchell, 2023; Li et al., 2023), and strikingly, that for some models, truth is even represented linearly in intermediate layers (Marks and Tegmark, 2024, GoT). We take this line of work a step further and argue that neural language models can use propositional representations compositionally (Janssen 2010; Pickel and Szabó 2025 a.o.), drawing from evidence of the behaviour of logical connectives: the linear compositionality hypothesis. Specifically, we show (a) that the truth values of individual conjuncts can be decoded independently of the truth value of a complex conjunction, and (b) that we can causally intervene on individual conjuncts in a way that affects the truth value of the whole.
Honorifics are linguistic forms that encode respect toward a socially valued individual or entity. This paper investigates how language models process Korean subject honorifics, which signal the social status of the subject through specific morphological markers. We evaluate a set of language models to determine whether they process honorifics in a human-like way by capturing the socio-pragmatic constraints governing their use, rather than merely relying on surface co-occurrence patterns. Our results indicate a systematic dissociation: models generally succeeded in detecting surface morphosyntactic mismatches, successfully treating unacceptable honorific constructions as less expected. However, models consistently favored overt honorific marking regardless of the subject’s social status, suggesting reliance on surface heuristics over genuine pragmatic knowledge. These findings suggest that language models have not fully acquired the socio-pragmatic constraints underlying honorific use, even when extensively trained on Korean text.
CrosSing: Cross-Scale Reasoning Evaluation on LLMs against Humans
Qi Han | Yifan Wu | Marten Van Schijndel
While many studies have shown LLMs perform well in various reasoning tasks, few have examined their capacity on semantic reasoning tasks. As LLMs reason with language, it is crucial to understand how well they grasp and use the underlying scalar relationships in language. In this study, we introduced a new dataset CrosSing (Cross-Scale reasoning), providing a human baseline against which to evaluate LLMs’ ability to reason across lexical scales in gradable adjectives. We further probed how their understanding is influenced by overinformative contexts. We evaluated ten high-performing LLMs and found that some outperformed humans when no extra information was provided, but that LLM performance declined in certain overinformative contexts while human performance improved significantly. This contrast reveals a fundamental difference between recent LLMs and humans in understanding adjectives’ scalar relationships and how such understanding behaves in overinformative contexts.
We fine-tune Whisper large-v3 independently on each of the 81 languages in the FLEURS benchmark. Fine-tuning improves WER for all 81 languages, reducing it by nearly 30% on average. However, improvement varies widely, and the language’s writing system is the best predictor of success. Latin and Cyrillic script languages reach single-digit WERs, while languages with unique scripts (Thai, Georgian, Burmese, Khmer) benefit least. We further show that Whisper’s BPE compression ratio predicts fine-tuning headroom (Spearman ρ ≈ −0.78), pointing to tokenization as the underlying bottleneck. We will release model weights upon publication.
Human comprehenders have greater difficulty forming pairwise grammatical dependencies in cases where the earlier word competes with a "distractor" to which it is similar. Cue-based retrieval theories (see e.g., Lewis et al., 2006) address this "interference" phenomenon with explicit quantifications of memory retrieval difficulty. We propose a computational model, consistent with cue-based retrieval, that separately quantifies two different kinds of similarity. A linear combination of the two reproduces the graded interference pattern reported in Van Dyke (2007). This simple account offers a more straightforward mechanistic interpretation than attention-based predictors from opaque Transformer-based models.
How much capacity does Turkish inflection require? An empirical study of GRU encoder–decoder bottlenecks.
Fred Mailhot
Encoder–decoder neural networks with high-dimensional (e.g. d=300–500) embeddings and hidden layers can be used to model a variety of morphophonological phenomena as sequence-to-sequence mappings, achieving high accuracy across languages and patterns. We show here that these high-capacity models are overparameterized, at least for the task of morphological inflection, and that simpler and smaller networks can perform near ceiling on the task of inflecting Turkish stems. Moreover, these reduced-capacity models encode linguistically relevant information even when they are too small to succeed at the inflectional task.
The signal is coming from inside the noun phrase! Tracking semantic proto-role inferences during sentence processing
Lucas Y. Li | Zander Lynch | Marten Van Schijndel
Semantic roles between a predicate and argument can be decomposed into proto-role properties (e.g., Instigation). We introduce a novel LLM feature attribution method, Generalized Contextual Decomposition for Transformers (GCD-T), which we use to probe which parts of a sentence enable models to infer proto-role properties. We compare our findings with human inferences.
Quantifying mutual intelligibility gradients in Turkic languages using language models
Moldir Baidildinova | Shiva Upadhye | Austin Wagner | Connor Mayer | Richard Futrell
Mutual intelligibility (MI) among related languages is a gradient phenomenon shaped by lexical, grammatical, and phonetic-phonological similarity. This study proposes a neural language modeling approach to quantifying MI patterns within the Turkic language family. Using IPA-transcribed naturalistic text from six Turkic languages, we train character-level LSTM models on a source language and fine-tune them on target languages that vary in genealogical distance. Cross-lingual transfer is evaluated using character-level cross-entropy (CE) loss, Area Under the Curve (AUC), and Rate of Change (ROC), which together capture model generalization, learning dynamics, and early-stage adaptation. We further examine whether model performance is predicted by cophenetic distance, lexical similarity, weighted trigram frequency overlap, and differences in vowel harmony index. Overall, the results suggest that character-level language models can approximate MI gradients across Turkic languages: closely related pairs generally show lower CE loss and smaller AUC, while more distant pairs show greater early-stage change. Lexical similarity, local phonotactic overlap, and genealogical distance appear to be the most informative predictors of model convergence. These findings provide preliminary evidence that neural language models trained on naturalistic text can offer a scalable way to model MI patterns, including directional asymmetries, across closely related languages.
This paper offers an updated perspective on the computational complexity of reduplication. Since one-way deterministic transducers cannot model reduplication in a straightforward way, the phenomenon has long been considered the outlier of morphology from a complexity perspective. Drawing on algebraic methods, I show that the vast majority of reduplicative processes belong to a few remarkably simple classes of subregular functions. A detailed study of the RedTyp database (Dolatian and Heinz, 2019) reveals that 100% of the surveyed reduplicative processes correspond to string-to-string functions in the class DA, while over 98% are locally testable (LJ1) and over 87% are locally trivial (L1). These results indicate a new upper bound on the complexity of reduplication that is comparable to that of morphological processes in general.
Learning reduplicative templates as hidden structures: the case of reduplication-phonology interactions
Yang Wang
Models of morphophonological learning have focused primarily on concatenative processes, leaving the challenges of non-concatenative morphology largely unaddressed. Reduplication, the systematic copying operation (e.g., Ilokano pluralization [kal-kaldÍN] ‘goats’), is particularly revealing because successful learning requires the joint inference of prosodic templates that govern copying, underlying representations (URs) of stems and other affixes, and the phonological grammar. In this paper, we present a learner that tackles this challenge by allowing reduplication to be learned alongside general morphophonemic alternations, a combination that, to our knowledge, has not been directly modeled in prior computational work. We show that the learner successfully captures the attested typology of reduplication–phonology interaction.
Do Large Language Models Acquire Phrase-Based Processing? Evidence from Eye Movements and Model-Brain Alignment After Fine-Tuning
Xufeng Duan | Zhengwu Ma | Zhaoqian Yao | Jixing Li | Zhenguang Cai
Autoregressive large language models (LLMs) process text token-by-token, yet the human language system operates over multi-word units. We ask whether aggregating LLM representations at the phrase level yields a closer correspondence to human reading behavior and language cortex than the default word-level representations, and whether phrase-segmentation fine-tuning amplifies this correspondence. Using Meta-Llama-3.1-8B (base and fine-tuned), we provide three converging lines of evidence. First, phrase-level attention features predict regressive eye-saccade patterns more closely than word-level features; a partial correlation analysis with a shuffled-boundary control indicates that this is not solely an aggregation artifact and that linguistic chunk boundaries explain unique variance beyond word-level attention. Second, fMRI encoding analyses show that fine-tuning selectively improves phrase encoding in left superior temporal gyrus and inferior frontal gyrus, with no improvement for word representations. Third, representational similarity analysis confirms a phrase-specific gain in model-brain geometric alignment. These results identify phrase-level representation as a critical granularity for LLM–human correspondence and suggest that targeted training can model human-like compositional processing, linking computational representations to hierarchical theories of language.
Roles of Predictability and Acoustic Distance in Sound Discrimination via Contrastive Learning
Shuhao Zhang | Youngah Do
Research in sound discrimination demonstrates that listeners exhibit reduced sensitivity to acoustic differences between allophones, as opposed to phonemes. Previous studies indicate that the highly predictable, complementary distribution of allophones contributes to this limited sensitivity by providing strong contextual cues. Building on these insights, this study investigates the role of predictability in sound discrimination within a supervised contrastive learning framework. Specifically, we examine how varying levels of predictability affect the ability to distinguish sounds and whether this influence is categorical or gradual. Additionally, we explore the interaction between acoustic distance and predictability, as well as how the presence of other contrasts within a language modulates this process. Our findings indicate that only full predictability leads to a significant decline in discrimination performance, demonstrating a categorical effect. This impairment can be alleviated as acoustic distance increases. Moreover, the presence of additional contrasts sharing the relevant acoustic dimension enhances discriminability, showing the importance of contextual contrasts in speech perception.
Graded Expectations: Do Large Language Models Show Human-like Sensitivity to the Likelihood of Deceptive Speech Acts?
Xingyuan Zhao | Seana Coulson
Human discourse comprehension includes graded expectations about whether a speaker is likely to lie. If language models capture human-like discourse expectations, they should be sensitive not only to factual consistency but also to lie expectancy as a contextual probability from complex pragmatic cues. We test this idea using discourse scenarios with varying incentives to deceive. Human lie probability is estimated from free continuations, and model lie expectancy is derived from the probability mass assigned to human-produced lie versus truth continuations. Across Qwen3 models, likelihood-derived lie mass aligns strongly with human lie expectancy. The best performance comes from the base checkpoints. By contrast, post-trained and mode-specialized variants show weaker alignment. Qualitative analysis suggests a structured error pattern: models tend to overpredict lies when a response directly conflicts with known facts, but underpredict them when lie expectancy depends more on contextual pressures such as politeness, self-protection, or strategic gain. These results suggest that graded lie expectancy is recoverable from model continuation probabilities and can be learned, at least in part, through the ordinary next-token prediction objective.
Lexical exceptionality in paradigm-specific learning: modeling stem-final obstruent alternations in Korean verbs and adjectives
Stella Eunsoo Hong
Korean stem-final conjugations illustrate the interaction between lexical exceptionality and heterogeneous phonological processes. When /p/-, /t/-, and /s/-final stems occur before vowel-initial suffixes, the irregular classes in these paradigms undergo intervocalic lenition, each exhibiting a distinct alternation pattern. Learners must therefore not only identify which roots trigger lenition, but also determine the corresponding repair strategy. This study investigates how lexically-specific phonological patterns are acquired when multiple repair strategies are available. We employ a lexically scaled MaxEnt model (Linzen et al., 2013; Hughto et al., 2019) to learn these paradigm-specific alternations and run simulations under two learning scenarios: (1) when repair strategies occur at equal frequencies and (2) when one strategy significantly outnumbers the others. Results show that the model favors a least-cost solution by treating statistically dominant morpheme classes as the general pattern. We conclude by discussing the model’s sensitivity to lexical statistics, predictions for empirical testing, and implications for language acquisition.
Adaptive Speech Perception: Empirical Indeterminacy and a Path Forward
Shawn N. Cummings | T. Florian Jaeger | Chigusa Kurumada | Xin Xie
Human listeners rapidly adapt to unfamiliar talkers, but the underlying computational mechanisms remain contested. Three candidate hypotheses—pre-linguistic normalization, changes in phonetic category representations, and changing decision biases—have largely been pursued in separation, using subfield-specific paradigms. Researchers working in these paradigms often assume that adaptivity observed in their particular paradigm can only be explained by one of the three mechanisms. We test this assumption for one of the most popular experimental paradigms (lexically-guided perceptual learning or LGPL) using a unified computational framework (ASP). We apply ASP to the largest existing LGPL data: 89,600 categorization responses from over 1000 listeners after lexically-guided exposure to 32 different stimulus sets. Despite the unprecedented scale of these data, we find that behavioral data are equally compatible with all three candidate mechanisms. We discuss how model-guided stimulus selection can increase the diagnosticity of future LGPL experiments. Our simulation code can easily be adapted to other experimental paradigms.
Modeling generalization in perceptual learning of speech
Yiming Lu | Xinyu Leslie Liao | Alejandro Tabas | Xin Xie
A hallmark of learning is generalization to novel instances. In speech, exposure to atypical pronunciation drives perceptual adjustment that can generalize to unheard tokens. Prior work has attributed constraints on generalization primarily to acoustic similarity between exposure and test contexts. We propose that generalization can also be understood as an inference problem: listeners must determine whether, and how strongly, a learned phonetic mapping should apply in a new context. We test this proposal using data from a recent experiment in which listeners were exposed to shifted vowel pronunciations and then tested on minimal pairs varying in lexical frequency. Learning effects appeared strongest when the exposure direction aligned with a high-frequency alternative in mixed-frequency pairs, and were absent for low-frequency pairs. The observed pattern could reflect token-level acoustic similarity, reliance on prior expectations, or frequency-dependent constraints in applying the learned mapping. We formalized these alternatives within a Bayesian belief-updating framework: a talker-specific model assuming full transfer, a mixture-of-expectations model that interpolates between the updated representation and the listener’s prior, and a hierarchical Bayesian model that deploys the updated representation with uncertainty. The talker-specific model captured most generalization patterns through its sensitivity to token-level acoustic properties, but overpredicted learning for low-frequency pairs. The hierarchical model best recovered the theoretically central exposure-control contrast pattern, suggesting that lexical frequency may constrain how learned representations are applied. Our results provide a computationally explicit framework for studying how contextual factors shape generalization in speech perception.
This paper investigates the relationship between strictly local phonological processes and strictly local phonotactic constraints. On the theoretical side, I identify phonological rewrite rules that do not produce strictly local output languages and that do not weakly preserve the class of strictly local languages. Empirically, I find that strictly local rules without strictly local output languages are largely absent from the PBase database.
This paper argues in favor of a fundamentally new perspective on phonology via modal logic. We show that the class of total Boolean Monadic Recursive Schemes (BMRS), used in computational modeling of phonological processes (Bhaskar et al., 2020; Chandlee and Jardine, 2021), is equivalent in expressive power to the well-studied modal 𝜇-calculus. As a corollary of this result, we obtain an alternative proof that order-preserving BMRS transductions capture the class of rational functions, which have been posited as a complexity bound on natural language phonological grammars.
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
Ekaterina Kochmar | Debanjan Ghosh | Kai North | Mamoru Komachi
psy detectives at SemEval-2026 Task 10: PsyCoMark – Psycholinguistic Conspiracy Marker Extraction and Detection
Roxana Carabas | Anamaria Nacu | Lucian Isac | Daniela Gifu
We present our SemEval-2026 Task 10 (PsyCoMark) system that combines interpretable psycholinguistic signals with supervised neural modeling. Our approach includes (1) a marker-derived lexicon and LIWC-style ratio features built from span annotations, (2) binary Yes/No transformer baselines (RoBERTa and DeBERTa families) with optimized training configurations, and (3) a zero-shot TinyLlama-1.1B baseline for the full three-way setting (Yes/No/Can’t tell). Results show that marker-only features are transparent but weak, while transformer models provide substantially stronger performance; the best model, DeBERTa-v3-large, achieves 0.8339 weighted F1 on development and 0.75 weighted F1 on the competition test set. We also evaluate marker-driven heuristic relabeling of uncertain instances, which does not improve downstream performance. Overall, the submission provides a controlled, interpretable, and reproducible reference point for future work on integrating span-level psycholinguistic evidence with conspiracy detection.
wangkongqiang at SemEval-2026 Task 10: PsyCoMark- Psycholinguistic Conspiracy Marker Extraction and Detection
Wang Kongqiang | Tan Qingli
This paper presents our system developed for SemEval-2026 Task 10: PsyCoMark (Psycholinguistic Conspiracy Marker Extraction and Detection), covering Subtask 1 (Conspiracy Marker Extraction) and Subtask 2 (Conspiracy Detection). We focus on English and use four different pre-trained language models: models–distilbert–distilbert-base uncased, models–distilbert–distilbert-base-multilingual-cased, models–lxyuan–distilbert-base-multilingual-cased-sentiments-student, and models–microsoft–deberta-v3-base. We (1) analyze the training set data visually, (2) use the gemma-3-27b-it generative model to perform prompt-based data augmentation on the training dataset for Subtask 2, and (3) train multiple single models on the training set data. We further study the influence of different hyperparameters on the single models and select the best single model for predicting the test set. Our submission achieved a good rank on the test set leaderboard. For Subtask 1, the evaluation criterion is the aggregate result over the four markers (Actor, Action, Effect, and Victim), measured using the Macro F1 score. Subtask 2 is essentially a binary text classification task. Performance is evaluated using the macro-averaged F1 score. In other words, this subtask is evaluated using the Weighted F1 score across different sentences and cultural contexts. Our best approach obtains a Macro F1 score of 0.1587 on Subtask 1 and a Weighted F1 score of 0.7411 on Subtask 2. For the final ranking, the organizers use the aggregate of the Macro F1 score and the Weighted F1 score. Even so, our approach has yielded good results.
NTNU-SMIL at SemEval-2026 Task 3: Logistic-Loss Regression with Same-Language Transfer for Valence–Arousal Stance Prediction in Dimensional Stance Analysis (DimStance)
Siang-Ting Lin | Tien-Hong Lo | Yun-Ting Sun | Jhih-Rong Guo | Tung-Yen Hao | Fong-Chun Tsai | Berlin Chen
We propose NTNU-SMIL’s system for SemEval-2026 Task 3 Track B Subtask 1 Dimensional Stance Analysis (DimStance). Our approach models target-conditioned valence–arousal regression using sentence-pair encoding, dual regression heads, and a logistic-loss regression formulation. For English and Chinese, we further leverage same-language transfer from Track A and apply lightweight out-of-fold calibration with multi-seed ensembling to reduce cross-lingual scale mismatch. Post-hoc analysis shows that same-language transfer and logistic-loss regression are the main drivers of performance gains, while arousal variance collapse remains a challenge in low-resource settings such as Swahili.
MindMiner at SemEval-2026 Task 10: Multi-Model Approaches to Conspiracy Detection and Psycholinguistic Marker Extraction
Pramod Kumar Ajmeera | Akshara Sri Lakshmipathy
Conspiracy narratives on social media often hide in subtle word cues and quiet reasoning patterns, making their detection a challenging task for natural language processing systems. SemEval-2026 Task 10 PsyCoMark introduces a benchmark for studying these phenomena, pairing binary conspiracy detection with the extraction of five key psycholinguistic markers: Actor, Action, Effect, Victim, and Evidence. In this paper, we examine how modern transformer-based models can grasp both the conspiratorial intent and the deeper reasoning structures behind such narratives, using rehydrated Reddit comments annotated by experts in psychology and linguistics. We test five models across these subtasks, emphasizing the gap that exists between classification and deeper discourse-level interpretation. Our best system reaches 0.80 weighted F1 on conspiracy detection and 0.16 macro F1 on marker extraction, with per-marker F1 ranging from 0.36 (Actor) to 0.00 (Victim). This work also contributes to the growing call for explainable NLP methods that integrate psycholinguistic insights to better illuminate misinformation and conspiratorial thinking online.
FER at SemEval-2026 Task 6: Analysis of Different Approaches to Unmasking Political Question Evasions
Matija Akrap | Andrija Bilić | Roko Šimpraga | Fran Račić | Luka Čuturilo
We tackle classifying evasive political answers within the context of SemEval-2026 Task 6 and compare three modeling strategies: a flat baseline, a hierarchical cascade, and a multitask learning approach. Our experiments demonstrate that a hierarchical RoBERTa-base model achieves the best performance, particularly by leveraging the distinctiveness of the class Clear Non-Reply. Conversely, we find that standard multitask learning frequently produces structurally invalid label combinations in a significant fraction of predictions. Our demonstrations show that applying a constrained inference mask eliminates these errors entirely while improving F1 performance, whereas a fully joint training approach underperforms due to data sparsity. Finally, we employ dataset cartography to compare training dynamics between the hierarchical and multitask approach.
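The constrained-inference idea in this abstract can be illustrated with a minimal sketch. It assumes a two-level taxonomy in which every fine-grained label is valid under exactly one coarse label; the label names, the `PARENT` mapping, and the additive scoring are hypothetical illustrations, not the authors' implementation.

```python
import math

# Hypothetical taxonomy: each fine label is valid under exactly one coarse label.
PARENT = {
    "explicit": "clear_reply",
    "dodging": "ambivalent",
    "general": "ambivalent",
    "declining": "clear_non_reply",
}

def unconstrained_decode(coarse_scores, fine_scores):
    """Pick each head's best label independently (may be an invalid pair)."""
    coarse = max(coarse_scores, key=coarse_scores.get)
    fine = max(fine_scores, key=fine_scores.get)
    return coarse, fine

def constrained_decode(coarse_scores, fine_scores):
    """Score only valid (coarse, fine) pairs; invalid pairs are masked out."""
    best, best_score = None, -math.inf
    for fine, fine_score in fine_scores.items():
        coarse = PARENT[fine]
        score = coarse_scores[coarse] + fine_score  # assumed log-score sum
        if score > best_score:
            best, best_score = (coarse, fine), score
    return best

coarse_scores = {"clear_reply": 2.0, "ambivalent": 1.5, "clear_non_reply": 0.1}
fine_scores = {"explicit": 0.2, "dodging": 1.0, "general": 0.3, "declining": 1.8}

free_pair = unconstrained_decode(coarse_scores, fine_scores)
valid_pair = constrained_decode(coarse_scores, fine_scores)
```

In this toy input the independent heads choose a fine label that does not belong to the chosen coarse label, while the constrained decoder can only return a pair permitted by the mapping.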
nchellwig at SemEval-2026 Task 3: Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis using Large Language Models
Nils Constantin Hellwig | Jakob Fehle | Udo Kruschwitz | Christian Wolff
We present Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis in SemEval-2026 Task 3 (Track A). SCSG enhances prediction reliability by executing a LoRA-adapted large language model multiple times per instance, retaining only tuples that achieve a majority consensus across runs. To mitigate the computational overhead of multiple forward passes, we leverage vLLM’s PagedAttention mechanism for efficient key–value cache reuse. Evaluation across 6 languages and 8 language–domain combinations demonstrates that self-consistency with 15 executions yields statistically significant improvements over single-inference prompting, with our system (leveraging Gemma 3) ranking in the top seven across all settings, achieving second place on three out of four English subsets and first place on Tatar-Restaurant for DimASTE.
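The majority-consensus step described in this abstract can be sketched as follows. This is an illustration only: it assumes tuples are compared by exact equality and that "majority" means appearing in more than half of the runs; the function name `majority_tuples` and the example tuples are hypothetical.

```python
from collections import Counter

def majority_tuples(runs):
    """Keep tuples that appear in a strict majority of the runs."""
    counts = Counter()
    for run in runs:
        counts.update(set(run))  # count each tuple at most once per run
    threshold = len(runs) / 2
    return {t for t, c in counts.items() if c > threshold}

# Three hypothetical runs over the same review sentence;
# each tuple is (aspect, opinion, valence score).
runs = [
    [("pizza", "great", 7.5), ("service", "slow", 3.0)],
    [("pizza", "great", 7.5)],
    [("pizza", "great", 7.5), ("service", "slow", 3.0), ("decor", "nice", 6.0)],
]
kept = majority_tuples(runs)
```

Exact matching is the simplest choice; a system predicting continuous valence and arousal values would probably need to group tuples by their categorical fields and aggregate the scores separately.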
Team HausaNLP at SemEval-2026 Task 4: Narratives via Semantic Embeddings
Faisal Adam | Lukman Aliyu | Sani Aji
This paper presents Team HausaNLP’s submission to SemEval-2026 Task 4 (Track A), which requires identifying the more narratively similar of two candidate stories relative to an anchor. Narrative similarity is defined along three dimensions: abstract theme, course of action, and story outcomes. We conduct a systematic ablation comparing five approaches: a lexical TF-IDF baseline, two bi-encoder SBERT variants (all-MiniLM-L6-v2 and all-mpnet-base-v2), a paraphrase-focused embedding model, and a cross-encoder re-ranker. On the 200-instance development set, all-mpnet-base-v2 achieves the best performance (61.5% accuracy, 61.48 macro-F1), outperforming both TF-IDF (54.5%) and the official SBERT baseline (55.0%). Surprisingly, the cross-encoder re-ranker (55.5%) does not improve on the bi-encoders, which we attribute to the long-document nature of Wikipedia story summaries exceeding the model’s effective context window. On the official test set, our primary SBERT MiniLM submission achieved 61.50% accuracy (33rd of 44 teams). Our error analysis over 200 development instances identifies five systematic failure categories, distinct from the All Correct / Partial cases, including 23 Lexical Trap cases, 23 Hard Cases, and 24 Proposed-Recovery cases, thereby informing concrete directions for future work.
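The bi-encoder decision rule implied by this setup (embed each story once, then pick the candidate closer to the anchor) can be sketched as below. The three-dimensional vectors are hypothetical stand-ins for sentence embeddings, and cosine similarity is assumed as the comparison function.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def closer_candidate(anchor, cand_a, cand_b):
    """Return "A" or "B", whichever candidate embedding is closer to the anchor."""
    return "A" if cosine(anchor, cand_a) >= cosine(anchor, cand_b) else "B"

# Toy vectors standing in for story embeddings (hypothetical values).
anchor = [1.0, 0.0, 1.0]
story_a = [0.9, 0.1, 0.8]
story_b = [0.0, 1.0, 0.1]
choice = closer_candidate(anchor, story_a, story_b)
```

Because each story is embedded independently, the anchor embedding is reused for both comparisons; a cross-encoder would instead score each anchor-candidate pair jointly.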
Team HausaNLP at SemEval-2026 Task 9: Tackling Class Imbalance in Low-Resource Hausa Polarization Detection
Faisal Adam | Sani Aji | Lukman Aliyu | Abdulhamid Abubakar
This paper describes our submission to SemEval-2026 Task 9, Subtask 2 (Hausa). The task involves identifying specific categories of polarization (Political, Religious, Ethnic, etc.) in Hausa social media comments. The dataset presented significant challenges, primarily extreme class imbalance and the low-resource nature of the language. Our system uses a pretrained multilingual transformer (Afro-XLMR-Large) fine-tuned with Weighted Binary Cross-Entropy loss and dynamic undersampling (1:3 ratio) to mitigate the scarcity of polarized examples. On the official test set, our system achieved an official Macro-F1 score of 0.2346 and a Micro-F1 score of 0.2581. Our model is recall-oriented (Micro-Recall: 0.6166), demonstrating strong capability in detecting polarization, though precision remains a challenge (0.1632). We achieved our best per-class performance in the Political domain (F1: 0.48).
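One way to read the undersampling step is sketched below. It assumes the 1:3 ratio means at most three non-polarized examples per polarized example, and that "dynamic" means drawing a fresh sample each epoch (here, by changing the seed); both readings, and the function name `undersample`, are assumptions rather than details from the abstract.

```python
import random

def undersample(examples, ratio=3, seed=0):
    """Keep all positives and at most `ratio` negatives per positive.

    `examples` is a list of (text, is_polarized) pairs.
    """
    positives = [e for e in examples if e[1]]
    negatives = [e for e in examples if not e[1]]
    rng = random.Random(seed)
    k = min(len(negatives), ratio * len(positives))
    sampled = positives + rng.sample(negatives, k)
    rng.shuffle(sampled)
    return sampled

# Toy data: 2 polarized and 20 non-polarized comments (hypothetical).
data = [(f"pos{i}", True) for i in range(2)] + [(f"neg{i}", False) for i in range(20)]

# Re-draw the negatives with a different seed for each epoch.
epoch_sets = [undersample(data, ratio=3, seed=epoch) for epoch in range(3)]
```

Resampling per epoch lets the model eventually see most of the majority class while each individual epoch stays balanced at the chosen ratio.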
LAFED at SemEval-2026 Task 13: Language-Agnostic Feature Engineering for Cross-Lingual AI-Generated Code Detection
Juan Villate Lemus
Robust detection of AI-generated source code across programming languages remains challenging due to language-specific cues and train–test distribution shifts. We present LAFED (Language-Agnostic Feature Engineering Detector), a feature-engineering approach trained on {Python, Java, C++} and evaluated on a multilingual test set that includes unseen languages {C, C#, Go, JavaScript, PHP}. LAFED combines (i) structural skeletal features (indentation, control-flow density, and approximations of McCabe/Halstead complexity), (ii) character and whitespace statistics inspired by stylometry, and (iii) micro-style patterns (operator spacing, blank lines, indentation consistency). Using XGBoost (Chen and Guestrin, 2016) with Optuna hyperparameter search (Akiba et al., 2019), our best model achieves macro-F1=0.7570 on a 1,000-sample test set; the official submission obtains macro-F1=0.75209 (5th place in Subtask A). Per-language analysis shows strong transfer to C# (0.7753) and JavaScript (0.7683), but weaker performance on Go (0.6400) and PHP (0.5238).
ModusPonens at SemEval-2026 Task 11: Breaking the Plausibility Trap in LLMs via Conflict-Aware Ensembling
Soumyajit Roy | Manav Malhotra
Large Language Models (LLMs) often struggle to disentangle formal logical validity from real-world plausibility, a phenomenon known as the "belief bias". This paper describes our submission to SemEval-2026 Task 11. We frame the task as a calibration problem between "System 1" (heuristic) and "System 2" (logical) thinking. Our experiments reveal that standard neuro-symbolic interventions, such as Structural Chain-of-Thought (CoT) and Nonsense Augmentation, degrade performance in low-resource regimes due to an "abstraction penalty". Instead, we propose a Conflict-Aware Logit Ensemble. We fine-tune two variations of Qwen-2.5-14B: a standard "Believer" model and a bias-hardened "Skeptic" model trained on oversampled conflict data. By ensembling their logits via soft-voting, we achieve a Pareto-optimal balance, reducing the Total Content Effect (TCE) to 3.21 while maintaining an overall accuracy of 94.27%, resulting in a Combined Score of 39.09.
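The ensembling step described above can be sketched in a few lines of Python. The abstract does not state whether raw logits or probabilities are averaged, so this sketch assumes a weighted average of softmax probabilities; the names `soft_vote` and `skeptic_weight` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def soft_vote(believer_logits, skeptic_logits, skeptic_weight=0.5):
    """Blend two models' class probabilities and return (label, probs).

    skeptic_weight is an assumed knob: 0.0 trusts only the Believer,
    1.0 trusts only the Skeptic.
    """
    p_b = softmax(believer_logits)
    p_s = softmax(skeptic_logits)
    mixed = [(1.0 - skeptic_weight) * b + skeptic_weight * s
             for b, s in zip(p_b, p_s)]
    return mixed.index(max(mixed)), mixed
```

Sweeping `skeptic_weight` on held-out data is one way to trade accuracy against content effect, which is the kind of balance the abstract describes.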
QuadAI at SemEval-2026 Task 3: Ensemble Learning of Hybrid RoBERTa and LLMs for Dimensional Aspect-Based Sentiment Analysis
A.j.w. De Vink | Filippos Karolos Ventirozos | Natalia Amat-Lefort | Lifeng Han
We present our system for SemEval-2026 Task 3 on dimensional aspect-based sentiment regression. Our approach combines a hybrid RoBERTa encoder, which jointly predicts sentiment using regression and discretized classification heads, with large language models (LLMs) via prediction-level ensemble learning. The hybrid encoder improves prediction stability by combining continuous and discretized sentiment representations. We further explore in-context learning with LLMs and ridge-regression stacking to combine encoder and LLM predictions. Experimental results on the development set show that ensemble learning significantly improves performance over individual models, achieving substantial reductions in RMSE and improvements in correlation scores. Our findings demonstrate the complementary strengths of encoder-based and LLM-based approaches for dimensional sentiment analysis. Our development code and resources will be shared at https://github.com/aaronlifenghan/ABSentiment
wangkongqiang at SemEval-2026 Task 1: MWAHAHA- Competition on Humor Generation
Wang Kongqiang | Zhang Peng | Tan Qingli
This paper presents our system for SemEval-2026 Task 1: MWAHAHA, a competition on humor generation. Subtask A (Text-based Humor Generation) asks systems to generate a joke from a set of text-based constraints and is conducted in English, Spanish, and Chinese. Subtask B (Image-Based Caption Generation) explores humor in a multimodal context, combining visual inputs with text generation, and is in English only; it comprises Task B1 (Image-only Humor Generation) and Task B2 (Image and Prompt Humor Generation). We focus on Subtask A in English and Chinese and on Subtask B in English, using two families of models: BLIP and the Qwen series of LLMs. All subtasks are evaluated, and the final ranking determined, using a Rating score with a 95% CI across languages and modalities. For Subtask A, our system obtained Rating scores of 950 (95% CI [922, 982]) in English and 1054 (95% CI [1024, 1104]) in Chinese, ranking 16th and 1st respectively. For Subtask B, it obtained Rating scores of 976 (95% CI [941, 1007]) on B1 and 987 (95% CI [948, 1016]) on B2, ranking 5th and 3rd respectively.
JCT at SemEval-2026 Task 1: Let the Best Joke Win - A Generate-and-Rank Approach to Constrained Humor
Batya Schechter | Sarah Barzel | Chaya Liebeskind
We present a humor generation system for SemEval-2026 Task 1, Subtask A (Castro et al., 2026) that produces short jokes under lexical or headline-based constraints. For each input, our system generates multiple candidate jokes using a large language model across diverse humor styles and prompting strategies, including zero-shot, few-shot, and structured prompting. Constraint satisfaction is explicitly enforced, either by requiring exact lexical inclusion or by approximating semantic relevance to a headline using sentence-embedding similarity. All valid candidates are ranked using a weighted humor score that combines semantic incongruity, emotion-based humor potential, irony likelihood, linguistic fluency, and novelty with respect to a large external jokes corpus, and the single highest-scoring joke is selected for each constraint. This approach follows a best-candidate selection paradigm, leveraging automated humor proxies to improve joke quality without task-specific fine-tuning.
zhangpeng at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Zhang Peng | Lu Gehao
This paper presents our system for SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization. The task comprises three multilingual text classification subtasks: Subtask 1 (Polarization Detection), Subtask 2 (Polarization Type Classification), and Subtask 3 (Manifestation Identification). For Subtask 1, we explored classical text representation approaches including Bag-of-Words, Word2Vec Average Vectors, and Bag-of-Centroids. Among these methods, the Bag-of-Centroids model achieved the best performance on both the development and test datasets. For Subtask 2 and Subtask 3, we fine-tuned four different pre-trained language models: google-bert, FacebookAI-roberta, dccuchile-bert, and distilbert-multi. Our experiments consist of 1) a visual analysis of the training data, 2) training multiple single models on the training set, and 3) combining multiple single models through weighted-voting ensemble learning. We further study the influence of different hyperparameters on the ensemble and select the best ensemble for prediction on the test set. On the official test set, our system achieved Macro-F1 scores of 0.6882 (EN) and 0.6711 (SP) for Subtask 1, 0.3752 (EN) and 0.6386 (SP) for Subtask 2, and 0.3561 (EN) and 0.4366 (SP) for Subtask 3. Macro F1 is the metric the organizers use for the final ranking.
lmfaoooo at SemEval-2026 Task 1: Humor Is an Audience. Preference Modeling for Constrained Humor Generation
Alexey Tikhonov | Alexey Ivanov
Humor generation remains difficult not only because producing fluent, novel jokes is hard, but because "funny" is audience-dependent and supervision is noisy: preferences vary with audience, context, and culture, and annotator agreement is often low. In this paper, we describe our system for the SemEval-2026 Task 1 (MWAHAHA), which focuses on humor generation under explicit constraints. The task evaluates submitted systems via human preference judgments in 1-on-1 arena-style comparisons. We adopt a "generate-many - select-best" strategy. First, we generate a diverse pool of candidates per instance using multi-step prompting, model ensembling, and diversity-oriented decoding. Second, we select outputs using a preference model that approximates a "reader" by learning from human comparisons rather than absolute funniness scores. To support this approach, we release 2.5K human pairwise judgments collected through the Humor Arena prototype. We further propose an interpretable pipeline that converts labeled comparisons into a preference model. Across three preference datasets, our models consistently outperform baselines and show stronger cross-domain transfer. Finally, we apply the learned preference model to rank candidates for the MWAHAHA setting and release intermediate artifacts (candidate pools and rankings) to facilitate follow-up work. Our system ranked 1st in the English and Chinese subtasks of MWAHAHA and 2nd in the Spanish subtask.
kevinyu66 at SemEval-2026 Task 3: A Retrieval-Augmented LLM System for Aspect–Opinion Triplet Extraction
Kuanlin Yu | Wen-Ni Liu
This paper describes our system used in the SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis. To address the inherent subjectivity and nuanced emotional expressions in this task, we propose a Retrieval-Augmented Generation (RAG) framework based on Large Language Models (LLMs) for sentiment triplet extraction. Our approach leverages a dynamic retrieval mechanism to identify semantically similar training examples, which are then integrated into the prompts as in-context demonstrations. This strategy effectively guides the model’s inference process by providing relevant linguistic patterns and emotional contexts. Our implementation is available at https://github.com/Kevinyu66/dimaste.
Lakksh at SemEval-2026 Task 11(1 2): Neuro-Symbolic Decomposition to Mitigate Content Bias in Syllogistic Reasoning
Lakksh Sharma | Krish Sharma | Jatin Bedi
Syllogistic reasoning is the ability to distinguish logical validity from semantic plausibility — a setting in which LLMs succumb to frequent content bias by conflating the two. The result is a characteristic failure to recognize logically valid arguments with highly implausible conclusions and logically invalid but semantically plausible arguments. This paper introduces a neuro-symbolic system that avoids this behavior by design: neural structure extraction is strictly separated from symbolic validity checking. A T5-Small parser is trained only on synthetic nonsense-symbol syllogisms, ensuring that the structural parse is learned in the absence of real-world semantics. Validity checking is performed by a deterministic symbolic kernel operating on extracted logical form alone, ensuring that semantic content cannot influence the final call. In binary validity classification, the system achieves 97.38% accuracy with a Total Content Effect of 3.10; in the retrieval setting, it achieves 82.11% accuracy with 99.47% F1 on premise identification. Ablation experiments show that formal theorem proving via NL-to-Z3 translation actually increases content bias due to leakage in intermediate representations. The results recommend architectural separation as a promising content-robustness strategy for syllogistic reasoning.
CuriosAI at SemEval-2026 Task 2: Predicting Emotion using RoBERTa-large model
Fumika Beppu | Hiroki Takushima | Aiswariya Manoj | Daichi Yamaga | Yuki Shibata | Takayuki Hori
This paper proposes a method for predicting continuous emotion dimensions, namely Valence and Arousal, from text by combining affective intermediate training with multi-task learning. The proposed approach consists of two training phases: an intermediate pre-training phase using external emotion datasets, followed by a multi-task learning phase using task-specific data. RoBERTa-large is employed as the backbone model, and independent regression heads are introduced for each subtask. Experimental results show that the proposed method achieves Pearson correlation coefficients of 0.68 for Valence and 0.45 for Arousal on Subtask 1, demonstrating stable performance, particularly in capturing inter-user differences in emotional expression.
UIT-Polar at SemEval-2026 Task 9 Detecting Multilingual, Multicultural and Multievent Online Polarization
Hoàn Trần
We present a two-stage hybrid system for SemEval-2026 Task 9 on multilingual and multievent online polarization detection. The first stage employs DeBERTa for high-recall binary filtering to mitigate severe class imbalance. The second stage leverages Mistral for fine-grained polarization classification, enabling improved semantic reasoning over candidate instances. This coarse-to-fine design enhances robustness and efficiency while preserving minority-class performance. Our system achieves Top-5 results on the English test set, demonstrating the effectiveness of integrating encoder-based screening with LLM-based refinement.
wangkongqiang at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Wang Kongqiang | Tan Qingli
This paper presents our system for SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization. The task comprises three multilingual text classification subtasks: Subtask 1 (Polarization Detection), Subtask 2 (Polarization Type Classification), and Subtask 3 (Manifestation Identification). We focus on English and Spanish and use two pre-trained language models: google-bert/bert-base-uncased and microsoft/deberta-v3-base. Our experiments consist of 1) a visual analysis of the training data, 2) prompt-based data augmentation of the training set with the gemma-3-27b-it generative model, and 3) training multiple single models on the training set. We further study the influence of different hyperparameters on the single models and select the best single model for prediction on the test set. All subtasks are evaluated, and the final ranking determined, using the Macro F1 score across languages and cultural contexts. Our Macro F1 scores for English and Spanish are 0.7805 and 0.7155 on Subtask 1, 0.2603 and 0.4647 on Subtask 2, and 0.2766 and 0.3322 on Subtask 3, respectively.
"AGI” Team at SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal over Time from Ecological Essays
Harsh Rathva
This paper describes our submission to SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal. We combine RoBERTa-Large text encoding with a unidirectional GRU for temporal modeling and gated user embeddings for personalization. A four-phase staged training curriculum employs ordinal regression for absolute affect prediction and a zero-inflated delta model for change detection. Our approach achieves competitive performance on Subtask 1 (longitudinal affect assessment) with composite correlation r=0.600 for valence and r=0.452 for arousal. However, we observe systematic degradation in Subtask 2A (state change detection) with negative correlations (r=-0.167 for valence, r=-0.147 for arousal), revealing a fundamental trade-off between stability-oriented representations and change sensitivity. We provide detailed empirical analysis of these failure modes, contributing insights into the challenges of modeling emotional dynamics in ecological data. Code and trained checkpoints are publicly available.
Narrative Team at SemEval-2026 Task 4: Two-Stage Contrastive Learning for Narrative Similarity Assessment
Tatiana Khaidukova | Ana Ciobanu | Daniela Gifu | Diana Trandabat
For SemEval-2026 Task 4, we introduce a unified two-stage framework based on a RoBERTa-large encoder. Stage 1 performs contrastive pre-training on synthetic triplets to learn general narrative similarity patterns. Stage 2 fine-tunes the model with a ranking-based objective tailored to Track A. The resulting encoder supports both binary similarity classification (Track A) and narrative embedding generation (Track B) without architectural changes. Our system achieves an accuracy of 0.64 on Track A and 0.69 on Track B, outperforming single-stage baselines and demonstrating that combining synthetic contrastive supervision with task-specific ranking yields stable and reusable narrative representations.
CYUT at SemEval-2026 Task 3: Multi-Task Dimensional Aspect Sentiment Regression with Polar Multi-Zone Labeling in VA Space
Shih-Hung Wu | Xian-Yan Chen | Yi-Min Jian
This paper describes CYUT’s system for SemEval-2026 Task 3 Track B, a multilingual aspect-based dimensional sentiment regression task. We formulate the task as continuous Valence–Arousal (VA) prediction and adopt a multi-task learning (MTL) framework with auxiliary tasks automatically derived from gold VA annotations, including polarity, intensity, and quadrant classification. However, these coarse-grained labels may still suffer from regional imbalance in the VA space, leaving some regions with insufficient auxiliary supervision. To address this issue, we extend the system with Polar Multi-Zone Labeling (PMZL) and use its seven-zone variant, PMZL-7. PMZL-7 partitions the VA plane into one core neutral region and six non-central zones based on the directional distribution of non-central samples. This design reduces the risk of auxiliary-label imbalance while supplementing directional information that conventional auxiliary tasks cannot directly capture. We evaluate XLM-R and two generative pretrained models. Results show that PMZL-7 is strongly model-dependent: it provides more stable improvements for Qwen2 and Ministral, while its effect on XLM-R is less consistent. On the official test set, our system achieves the best performance on the NigerianPidgin subset among all participating systems.
CSIRO-LT at SemEval-2026 Task 2: In-the-Wild Valence and Arousal Forecasting on Ecological Text Time Series
Jiyu Chen | Necva Bölücü | Sarvnaz Karimi | Diego Molla | Cecile Paris
Predicting emotional valence and arousal in text is challenging due to the continuous, dynamic, and context-dependent nature of emotions. The SemEval 2026 Task 2: Predicting Variation in Emotional Valence and Arousal over Time from Ecological Essays shared task investigates longitudinal affect prediction from real-world personal essays, including forecasting short-term state and longer-term dispositional changes. We compare Pre-trained Language Models (PLMs) and Large Language Models (LLMs) for these subtasks, examining different input representations and feature formulations. We show that sentiment-aware PLMs are most effective for continuous valence and arousal prediction, and LLMs are effective for short-term state forecasting. Modelling dispositional changes remains challenging, and none of our neural approaches surpass a simple historical baseline in this setting.
CITD@UIT at SemEval-2026 Task 2: Temporal Mixture-of-Experts for Longitudinal Valence and Arousal Prediction from Ecological Essays
Son Phuong | My Ngo | Tri Minh Dao | Duc-Vu Nguyen
This paper describes our participation in SemEval-2026 Task 2, which focuses on the longitudinal assessment and forecasting of emotional states through text. The challenge is divided into two primary objectives: Subtask 1, which requires estimating continuous Valence and Arousal (V&A) scores for a sequence of texts, and Subtask 2, which focuses on forecasting future emotional variations, specifically State Change (2A) and Dispositional Change (2B). To address these tasks, we propose a unified framework based on cardiffnlp/twitter-roberta-base-sentiment-latest, a transformer architecture pretrained on 124 million tweets. For all subtasks, we group the data by user ID, sort it chronologically, and use a sliding window approach to capture longitudinal context. We conduct extensive experiments combining this pretrained RoBERTa model with Multilayer Perceptron (MLP) and Mixture-of-Experts (MoE) architectures to optimize performance. Furthermore, we utilize both attention pooling and mean pooling on all output hidden state representations to extract richer semantic features. Our proposed system demonstrated competitive performance, officially ranking 9th in Subtask 1 and 5th in Subtask 2A among participating teams.
Hidetsune at SemEval-2026 Task 10: A Systematic Exploration of Training and Inference Strategies for Detecting Conspiracy Beliefs
Hidetsune Takahashi
This paper describes a system developed for SemEval-2026 Task 10 Subtask 2, which focuses on identifying conspiracy beliefs expressed in Reddit comments. The study begins with a comparative analysis of language models fine-tuned on the task data. In addition to fine-tuning, multiple auxiliary techniques were examined, including instruction-based prompting, data augmentation via back-translation, and loss function methods designed to address label imbalance. In the final stage, the inference behavior was further examined by varying the decision threshold applied to the softmax output probabilities. The results highlight how choices made during model selection, training, and inference collectively affect performance, offering empirical insights into the challenges of conspiracy belief detection in social media contexts.
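The threshold variation mentioned at the end of the abstract above can be illustrated with a short Python sketch. It assumes a binary setting in which each example has a softmax probability for the positive class and that candidate thresholds are compared by F1 on held-out data; the selection criterion and the function names are assumptions, since the abstract does not specify them.

```python
def f1_at_threshold(pos_probs, labels, threshold):
    """F1 of the positive class when predicting 1 iff prob >= threshold."""
    preds = [1 if p >= threshold else 0 for p in pos_probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(pos_probs, labels, candidates):
    """Return the candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: f1_at_threshold(pos_probs, labels, t))
```

Lowering the threshold below 0.5 favours recall on the minority class, which is why this kind of sweep is often paired with imbalance-aware losses.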
OZemi at SemEval-2026 Task 9: A Cross-Lingual Approach to Online Text Polarization Classification Using Multilingual Models and Adaptive Loss Formulation
Hidetsune Takahashi | Eleale Nusi Tee | Aika Yu | Ruri Furukawa | Sooeun Kim | Shuta Niinomi | Dingyu Zhang | Emily Ohman
This paper presents the OZemi team’s submission to SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization. We propose a unified multilingual approach that addresses multiple languages and subtasks efficiently. Our system combines multilingual models with data-level techniques and a class-weighted cross-entropy loss to mitigate data imbalance across languages, subtasks, and categories. Results show consistent performance across languages, with macro F1 scores above 70% in most languages for Subtask 1, where we achieved our highest rank for Persian (1st out of 44). These results suggest that the proposed framework provides a flexible foundation for multilingual and multi-task polarization analysis.
Hidetsune at SemEval-2026 Task 11: Adapting Pretrained Reasoning Models with Deep Supervision and Inference Refinement for Content-Independent Validity Classification
Hidetsune Takahashi
This paper presents a system that applies training and inference approaches for SemEval-2026 Task 11 Subtask 1, which focuses on binary classification for content-independent validity reasoning in syllogistic inference. Building on fine-tuning of relatively standard language models, additional approaches were explored, including layer-wise deep supervision and in-context learning. Furthermore, models that had been previously trained on datasets related to logical reasoning were adapted to the task through additional fine-tuning. Finally, refinement was performed at the inference stage by adjusting the softmax-based decision threshold of the selected model. The experimental results illustrate how model selection, training strategies, and threshold adjustment affect not only validity accuracy but also robustness against plausibility-driven bias, thereby contributing to improved logical integrity.
cclin at SemEval-2026 Task 2 : SLM-Enhanced Lightweight Multi-BERT Ensemble for Longitudinal Affect Assessment
Jing-Jun Lin
This paper describes the system developed by our team for SemEval-2026 Task 2, Subtask 1: Longitudinal Affect Assessment. Our goal is to predict Valence and Arousal from ecological essays and feeling words over time. We propose an efficient hybrid framework that uses quantized 7B-scale language models as deterministic meta-feature extractors and combines them with an ensemble of DeBERTa, RoBERTa, and DistilBERT encoders. The system is designed to run on a single consumer-grade RTX 5060 Ti (16GB) GPU while remaining competitive. To bridge discrete supervision and continuous evaluation, we train the model as an ordinal classification problem and decode class probabilities into continuous scores through expected-value decoding. Our best system achieved an overall V&A average of 0.587, with per-dimension composite correlations of 0.647 for Valence and 0.527 for Arousal, ranking 3rd out of 31 teams. The results show that lightweight SLM-derived priors and multi-encoder fusion provide a strong performance–efficiency trade-off, especially for Arousal, where contextual anchoring is crucial.
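Expected-value decoding, as named in the abstract above, turns a distribution over ordinal classes into one continuous score by taking the probability-weighted mean of the class values. A minimal Python sketch follows; the function name and the example class values are assumptions, not the task's actual label scale.

```python
def expected_value_decode(class_probs, class_values):
    """Probability-weighted mean of the ordinal class values.

    class_probs need not sum exactly to 1; they are renormalised here.
    """
    if len(class_probs) != len(class_values):
        raise ValueError("class_probs and class_values must align")
    total = sum(class_probs)
    if total <= 0:
        raise ValueError("class_probs must have positive mass")
    return sum(p * v for p, v in zip(class_probs, class_values)) / total
```

Unlike taking the arg-max class, this keeps information about how the probability mass is spread, so two inputs with the same top class can still receive different continuous scores.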
YNU-HPCC at SemEval-2026 Task 12: Retrieval-Guided Reasoning with Teacher Distillation for Abductive Event Reasoning
Yuwei Sun | Jin Wang | Xuejie Zhang
This paper describes the YNU-HPCC system for SemEval-2026 Task 12, Abductive Event Reasoning (AER). Given multi-document retrieved evidence with distractors, the task requires selecting all direct-cause options for a target event and outputting an answer set. The main challenges are sparse and dispersed evidence in long documents and a boundary-sensitive set-level evaluation. This paper proposes a two-stage framework. Stage 1 trains a DeBERTa-v3-base student with retrieval-guided evidence modeling: documents are split into overlapping windows, BM25 ranks and filters candidate windows, and Top-K pooling aggregates window-level scores into option probabilities. Stage 2 distills soft targets from a Qwen-14B teacher with temperature scaling and high-confidence filtering to reduce pseudo-label noise and improve generalization. The system achieves an official dev score of 0.9712 (micro-F1 0.9746, macro-F1 0.9745) and improves the test score from 0.46 to 0.73, ranking 84th out of 221 submissions.
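Two of the Stage 1 steps described above, overlapping windowing and Top-K pooling, can be sketched in plain Python. The BM25 filtering step is omitted, and averaging the top-k scores is an assumption, since the abstract does not say how the pooled scores are combined; the function names and parameters are hypothetical.

```python
def split_windows(tokens, size, stride):
    """Split a token list into overlapping windows of `size` tokens.

    Consecutive windows start `stride` tokens apart, so they overlap
    whenever stride < size. The last window always reaches the end.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
        start += stride
    return windows


def top_k_pool(window_scores, k):
    """Average the k highest window-level scores for one option."""
    if not window_scores:
        raise ValueError("window_scores must not be empty")
    top = sorted(window_scores, reverse=True)[:k]
    return sum(top) / len(top)
```

Pooling only the best few windows keeps a single strong piece of evidence from being diluted by the many irrelevant windows of a long document.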
Emo-tica at SemEval-2026 Task 2: Trait–State Affect Forecaster for Longitudinal Valence and Arousal
Sadia Noor | Mehwish Fatima
Modeling longitudinal affect requires capturing both stable user tendencies and transient textual signals. For SemEval-2026 Task 2, we propose the Trait-State Affect Forecaster (TSAF), which decomposes affect into persistent user traits and text-conditioned states integrated through adaptive gating. On per-text prediction (Subtask 1), TSAF achieves composite Pearson correlations of 0.645 for valence and 0.409 for arousal, outperforming the Linear(BERT) baseline. In forecasting tasks, results reveal strong short-term affective inertia, where prior affect dominates next-step prediction, while long-term drift remains challenging under sparse supervision; TSAF shows comparatively stronger gains for arousal in this setting. Analyses across user splits and modalities highlight the strengths and trade-offs of explicit trait-state modeling, particularly under cold-start and short-text conditions.
Sifei at SemEval-2026 Task 8: Hybrid Retrieval and Query Rewriting for Multi-Turn RAG
Sifei Meng | Dmitry Ilvovsky
Multi-turn retrieval-augmented generation (RAG) is challenging due to evolving user intent, conversational noise, and strict context limits. We propose a training-free hybrid retrieval pipeline for SemEval-2026 Task 8 that combines dense and sparse retrieval with controlled query rewriting and cross-encoder reranking. Our system achieves 0.5453 nDCG@5 on the official test set of Task A, ranking 3rd out of 38 teams and outperforming the strongest baseline (0.4795). For Task C, we reuse the Task A retrieved documents in a lightweight generation pipeline based on the official prompt, achieving 0.5312 (harmonic mean of quality and faithfulness) and ranking 15th out of 29 teams. All retrieval components are open-source, while rewriting and generation use LLM APIs. Code and scripts are available on GitHub (https://github.com/mengsifei/MultiturnRAG).
MarSan at SemEval-2026 Task 4: Narrative Similarity via Sentence-BERT Metric Learning with Triple-Derived Losses
Maryam Najafi | Ehsan Tavan | Simon Colreavy
We describe our submission to SemEval-2026 Task 4 on Narrative Story Similarity and Narrative Representation Learning (NSNRL). The shared task defines narrative similarity through comparative judgments over triples consisting of an anchor story and two candidates, where systems determine which candidate is narratively closer (Track A), and must output story embeddings whose cosine distances reproduce the same ordering under withheld evaluation triples (Track B). We implement a unified representation-learning approach based on a Sentence-BERT bi-encoder trained with triple-derived metric learning objectives, combining in-batch contrastive learning with explicit triplet and margin-ranking constraints. Track A is solved by direct cosine comparison between the anchor embedding and each candidate embedding, while Track B outputs normalized story vectors from the same encoder without any additional test-time modelling. During evaluation, we achieve 65.00% accuracy on Track A and 65.50% on Track B. These results suggest that a single, well-aligned bi-encoder can perform competitively across both tracks while remaining computationally efficient.
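The Track A decision rule described above, a direct cosine comparison between the anchor and each candidate, is small enough to sketch in plain Python. The embeddings are assumed to be given as lists of floats; the function names and the tie-breaking toward candidate A are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def closer_candidate(anchor, cand_a, cand_b):
    """Return "A" or "B", whichever candidate is closer to the anchor."""
    return "A" if cosine(anchor, cand_a) >= cosine(anchor, cand_b) else "B"
```

Because the same encoder output drives both tracks, the Track B embeddings need no extra processing beyond normalisation.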
HU at SemEval-2026 Task 6: A Hybrid Discriminative Modeling of Political Clarity and Evasion
Taha Munawar | Basil Khan | Arsal Jangda | Sarfaraz Baig | Sandesh Kumar | Abdul Samad
We describe our submission to SemEval-2026 Task 6: CLARITY, which aims to classify political question–answer pairs by response clarity and evasive technique. We investigate several approaches, including long-context transformers, multiple instance learning, hierarchical multi-task models, and a natural language inference (NLI) formulation. On the development set, our best-performing NLI model achieves a macro-F1 of 0.79 for Subtask 1, while our best attention-based MIL model achieves a macro-F1 of 0.43 for Subtask 2. On the hidden evaluation set, our official submission obtains macro-F1 scores of 0.81 for Subtask 1 and 0.45 for Subtask 2. Our findings demonstrate the benefits of entailment-based modeling for clarity prediction and localized reasoning for evasion detection under limited computational resources.
Team faisalm3 at SemEval-2026 Task 3: From Standard Regression to Distributional Alignment in Dimensional Sentiment Analysis
Faisal Adam | Lukman Aliyu | Sani Aji | Abdulhamid Abubakar | Aliyu Rabiu Shuaibu
This paper describes our participation in SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA) (Yu et al., 2026). We utilized a pre-trained DeBERTa-V3 backbone to capture semantic meaning through disentangled attention. While standard Mean Squared Error (MSE) loss establishes a performance floor, we propose a Hybrid MSE-CCC Loss to identify distributional relationships that simple regression missed. Our results demonstrate a 54.6% reduction in validation loss compared to the baseline, significantly improving detection in high-intensity emotional bins by mitigating the "regression to the mean" phenomenon.
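To make the loss named above concrete, here is a minimal Python sketch that assumes "CCC" refers to the concordance correlation coefficient and that the two terms are mixed with a single weight `alpha`. Both points are assumptions, as the abstract does not define the exact formulation, and the function names are hypothetical.

```python
def _mean(xs):
    return sum(xs) / len(xs)


def ccc(preds, targets):
    """Concordance correlation coefficient between two sequences."""
    mp, mt = _mean(preds), _mean(targets)
    var_p = _mean([(p - mp) ** 2 for p in preds])
    var_t = _mean([(t - mt) ** 2 for t in targets])
    cov = _mean([(p - mp) * (t - mt) for p, t in zip(preds, targets)])
    denom = var_p + var_t + (mp - mt) ** 2
    if denom == 0.0:
        return 1.0  # both sequences are the same constant
    return 2.0 * cov / denom


def hybrid_mse_ccc_loss(preds, targets, alpha=0.5):
    """alpha * MSE + (1 - alpha) * (1 - CCC); alpha is an assumed weight."""
    mse = _mean([(p - t) ** 2 for p, t in zip(preds, targets)])
    return alpha * mse + (1.0 - alpha) * (1.0 - ccc(preds, targets))
```

A model that always predicts the mean of the targets gets a CCC of 0, so the `1 - CCC` term stays at its neutral value of 1 however small the MSE becomes. That is one way such a loss can discourage regression to the mean.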
wangkongqiang at SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures
Wang Kongqiang | Zhang Peng | Tan Qingli
This paper presents our system for SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures, covering Subtask 1: Short Answer Questions (SAQ) and Subtask 2: Multiple-Choice Questions (MCQ). We examine models’ cultural competence across 26 languages and 30 countries using four large language models (LLMs): deepseek-v3.2-exp, qwen-max, qwen-plus, and qwen3-next-80b-a3b-instruct. Our experiments consist of 1) a visual analysis of the trial and test datasets, 2) prompting the LLMs to generate or select the answer they deem correct on the trial and test datasets, and 3) an evaluation of several prompt engineering approaches on the trial dataset. We further study the influence of different hyperparameters on the generative models and select the best single model for prediction on the test dataset. Subtask 1 (SAQ) is scored by accuracy aggregated over 23 languages (ar-EG, ar-MA, ar-SA, bg-BG, el-GR, en-AU, and so on). Subtask 2 (MCQ) is scored by the accuracy of the selected answer across languages and cultural contexts. On the test dataset, our best approach obtains accuracy scores of 51.4689 on Subtask 1 (SAQ) and 80.26 on Subtask 2 (MCQ). The organizers use the aggregate accuracy score for the final ranking.
VAP-GameController at SemEval-2026 Task 2: Lexical-based and Emotion-Aware Approaches for Longitudinal Emotion Prediction
Huy Le | Truong Phu | Trung Tran | Nga Nguyen | Monojit Choudhury
In this work, we participate in SemEval-2026 Task 2, which focuses on predicting continuous valence and arousal trajectories from longitudinal ecological essays. To model fine-grained emotional dynamics, we explore three approaches: (1) hierarchical encoder-based models to capture contextual emotional patterns, (2) a lexicon-based pipeline with linguistic rules and a dual-level calibration mechanism for personalized estimation, and (3) a hybrid framework that integrates lexical emotional signals into neural encoders. Experiments on the official dataset, evaluated using Pearson correlation (r) and MAE, show consistent improvements over baseline methods, highlighting the complementary strengths of neural representations and calibrated lexical features.
TeleAI at SemEval-2026 Task 13: Data-Centric Full-Parameter Fine-Tuning with Multi-Level Ensembling for Generated Code Detection
Shiquan Wang | Fang Yu | Shuangyong Song | Yongxiang Li | Xuelong Li
This paper presents our top-ranking system for SemEval-2026 Task 13 on code generation detection under multi-lingual and distribution-shift settings. Our approach achieved 1st place in Subtasks A and B, and 2nd place in Subtask C in the official evaluation. Our framework integrates data-centric analysis, full-parameter model adaptation, and multi-level ensemble learning. We first analyze label and length distributions and apply repeated oversampling to address class imbalance. We then optimize prompts in a data-driven manner to improve inference stability. Based on Qwen3-30B-A3B-Instruct, we conduct full-parameter fine-tuning with diverse training configurations and integrate multiple checkpoints using soft voting, hard voting, logits-based voting, and LightGBM stacking. Experimental results demonstrate substantial improvements over zero-shot baselines and consistent gains from ensemble strategies, validating the effectiveness of systematic adaptation and ensembling for robust code generation detection.
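The difference between two of the ensembling rules mentioned here, soft voting and hard voting, can be illustrated with a minimal sketch. The function names and the toy probabilities are hypothetical and are not taken from the paper; the sketch only shows the generic rules.

```python
from collections import Counter


def soft_vote(prob_lists):
    """Average per-class probabilities across models, then take the argmax."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)


def hard_vote(prob_lists):
    """Each model votes for its own argmax; the most common label wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in prob_lists]
    return Counter(votes).most_common(1)[0][0]


# Toy class probabilities from three hypothetical checkpoints.
checkpoint_probs = [[0.90, 0.10], [0.40, 0.60], [0.45, 0.55]]
print(soft_vote(checkpoint_probs), hard_vote(checkpoint_probs))
```

In this toy case the two rules disagree: one very confident checkpoint dominates the averaged probabilities, while the majority of individual votes goes to the other class.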
CodeHunters at SemEval-2026 Task 13: Detecting Machine-Generated Code with Multiple Programming Languages, Generators, and Application Scenarios
Daniel-Antoniu Dumitru | Simina Lazăr | Nicoleta Danilă (amargheoalei) | Daniela Gîfu | Diana Trăndăbăț
We participated in Subtasks A and B, where we fine-tuned three pre-trained models (UniXCoder, CodeT5, and CodeBERT). The paper describes our approach to both subtasks in detail.
YNU-HPCC at SemEval-2026 Task 11: Mitigating Content Effects in Syllogistic Reasoning with Qwen2-1.5B-Instruct and XLM-RoBERTa-Large for English and Multilingual Tasks
Rongchuan Luo | Jin Wang | Xuejie Zhang
This paper addresses SemEval-2026 Task 11, which focused on mitigating content effects in syllogistic reasoning. Logical validity is often conflated with semantic plausibility in large language models. Prior methods rely on standard fine-tuning or prompting, without explicit bias control. A rule- and template-based symbolic data augmentation framework is proposed for fine-tuning the Qwen2-1.5B-Instruct model and instruction-tuning the XLM-RoBERTa-large model. Logic-preserving synthetic data are generated through lexical rules. The system is ranked 1st in Task 1 with a perfect overall score of 100, and 6th in Task 3 with a score of 56.97. Code is publicly available at: https://github.com/YNU-HPCC/semeval-2026-task11.
PuerAI at SemEval-2026 Task 5: Homograph Appropriateness Assessment via DeBERTa Contrastive Regression and Contextual Grouping
Jiaxu Dao | Zhuoying Li | Hangchao Ma | Jinli Tong | Xiaoli Lan | Yifan Lu | Zhanji Yang
To assess homograph appropriateness in narrative contexts for SemEval-2026 Task 5, we propose a contrastive regression framework. This approach combines candidate sense definitions with full narrative texts to establish an MSE regression baseline, further enhanced by a contextual grouping ranking loss that models relative rationality among senses. Evaluated on the official AmbiStory dataset, our method consistently outperforms the baseline in accuracy and Spearman correlation. These results validate the efficacy of relative order modeling for capturing fine-grained semantic nuances in complex narratives. The code is available at: https://github.com/daojiaxu/Semeval2026task5.
We present our system for the DimASR subtask of SemEval-2026 Task 3: DimABSA, targeting dimensional sentiment regression of Valence-Arousal scores in English restaurant reviews. Our approach leverages Qwen3 large language models combined with contrastive LLM-based data augmentation to enrich training data and capture subtle affective variations. Experiments show that this data augmentation framework significantly improves performance on the DimASR task, particularly in capturing subtle affective shifts at the aspect level. Finally, our system achieves a score of 1.227 RMSE on the test set.
YNU-HPCC at SemEval-2026 Task 2: Contrastive Calibration and Temporal Modeling for Continuous Valence-Arousal Prediction
Xin Lan | Jin Wang | Xuejie Zhang
This paper addresses continuous affect modeling in SemEval-2026 Task 2 through two task-specific architectures tailored to static state estimation and dynamic change prediction. To mitigate semantic ambiguity and annotation subjectivity in Subtask 1, a hard-prompt-based regression model is developed and enhanced with unsupervised contrastive learning (SimCSE) and supervised contrastive calibration (SCL) grounded in an external affect lexicon. This design improves the structural consistency and scale stability of textual representations in the Valence–Arousal (V/A) space. For Subtask 2a, which involves irregular time intervals and historical dependencies, a Time-Aware LSTM architecture is introduced to integrate current affective states with temporally enriched historical trajectories. Experimental results show that the YNU-HPCC system ranks 2nd in both subtasks. In Subtask 1, the Valence and Arousal scores are 0.677 and 0.528, respectively; in Subtask 2a, they are 0.692 and 0.647.
PICT at SemEval-2026 Task 3: A Transformer-Based System for Dimensional Aspect-Aware Sentiment Regression with Weighted Layer Pooling
Aditya Bhalgat | Omkar Jagtap | Anupama Phakatkar
Team PICT’s submission for SemEval-2026 Task 3 (DimASR) tackles continuous valence and arousal prediction by heavily focusing on variance reduction and avoiding cross-domain negative transfer. We built strictly domain-isolated pipelines for the Laptop and Restaurant datasets using a RoBERTa-Large backbone. Our architecture extracts a rich feature hierarchy using weighted layer pooling, isolates local context with a [CLS]-driven aspect-aware attention module, and maps to the continuous space using a deep residual regression head. Regularized via R-Drop and SWA, our system achieved 3rd place in the Restaurant domain (RMSE: 1.195) and 9th in the Laptop domain (RMSE: 1.326).
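Weighted layer pooling, as named in this abstract, is commonly implemented as a learned convex combination of the encoder's per-layer representations. The sketch below assumes softmax-normalized layer weights and plain Python lists in place of tensors; both choices, and all names, are assumptions, since the abstract does not specify the implementation.

```python
import math


def weighted_layer_pooling(layer_vectors, layer_logits):
    """Combine per-layer vectors with softmax-normalized weights.

    layer_vectors: one vector per encoder layer, all of equal length.
    layer_logits: one unnormalized (learnable) score per layer.
    """
    top = max(layer_logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(layer_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, layer_vectors))
        for d in range(dim)
    ]


# With equal logits the result is the plain average of the layers.
print(weighted_layer_pooling([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]))
```

During training the logits would be updated by gradient descent, which lets the model decide how much each layer contributes to the pooled feature.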
YNU-HPCC at SemEval-2026 Task 6: Hierarchical Taxonomy Prompting and CoT Distillation for Political Clarity Classification
Canning Wen | Jin Wang | Xuejie Zhang
In political interviews, politicians frequently employ evasion strategies to avoid direct answers, making it challenging to evaluate response clarity in Natural Language Processing. This paper presents the YNU-HPCC system for SemEval-2026 Task 6: clarity classification in political interviews. To address the limitation where traditional models capture only surface-level semantics, this paper proposes two reasoning-enhanced frameworks. First, we introduce Hierarchical Taxonomy Prompting. This method guides LLMs to follow a strict top-down classification logic. Specifically, the model determines the clarity level before identifying specific evasion techniques. Furthermore, it explicitly articulates the reasoning process. Second, to balance reasoning capability with resource constraints, we employ Chain-of-Thought Distillation. We utilize DeepSeek V3.1 as a teacher model to generate comprehensive reasoning chains, which are then used for supervised fine-tuning (SFT) of the smaller student models. Experimental results demonstrate the effectiveness of our approach: the system achieved 6th place in Task 1 and 5th place in Task 2 among all participating teams, highlighting the importance of reasoning processes in detecting complex linguistic evasion.
mdok-style at SemEval-2026 Task 9: Finetuning LLMs for Multilingual Polarization Detection
Dominik Macko | Alok Debnath | Jakub Simko
SemEval-2026 Task 9 is focused on multilingual polarization detection. Specifically, it covers the identification of multilingual, multicultural and multievent polarization along three axes (in subtasks), namely detection, type, and manifestation. Online polarization is a concern because it is often followed by hate speech, offensive discourse, and social fragmentation. Detecting it before it escalates is therefore crucial for a safer and more inclusive online space. We addressed this SemEval task by finetuning mid-size LLMs for sequence classification using the QLoRA parameter-efficient finetuning technique. We augmented the multilingual (22 languages) training sets with anonymized, lower-cased, upper-cased, and homoglyph-substituted counterparts of the texts, making the detection more robust.
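The case-based and homoglyph-based augmentations listed here can be sketched roughly as below. The three-character homoglyph table (Latin letters mapped to look-alike Cyrillic letters) and the function name are illustrative assumptions; the anonymization step is omitted because the abstract does not describe it.

```python
# Hypothetical, deliberately tiny homoglyph table: Latin -> Cyrillic look-alikes.
HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}


def augment(text):
    """Return lower-cased, upper-cased, and homoglyph-substituted variants."""
    homoglyphed = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)
    return [text.lower(), text.upper(), homoglyphed]


for variant in augment("Vote now"):
    print(variant)
```

Each variant keeps the label of the source text, so a classifier trained on the enlarged set sees the same content under different surface forms.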
mcdok at SemEval-2026 Task 13: Finetuning LLMs for Detection of Machine-Generated Code
Adam Skurla | Dominik Macko | Jakub Simko
Multi-domain detection of machine-generated code snippets in various programming languages is a challenging task. SemEval-2026 Task 13 approaches this challenge from several angles, both as a binary detection problem and as attribution of the source. Specifically, its subtasks also cover detection of the generator LLM family, as well as hybrid code co-generated by humans and machines and adversarially modified code that hides its origin. Our submitted systems adapted the existing mdok approach (focused on machine-generated text detection) to these specific kinds of problems by exploring various base models that are more suitable for code understanding. The results indicate that the submitted systems are competitive in all three subtasks. However, the margins from the top-performing systems are significant, and thus further improvements are possible.
DUTIR at SemEval-2026 Task 8: A Hybrid Retrieval and Faithfulness-Guarded Framework for Multi-Turn RAG
Jin Yang | Yichong Chen | Liang Yang
This paper describes the system submitted by DUTIRtaskC for SemEval-2026 Task 8: MTRAGEval (Task C). Multi-turn Retrieval-Augmented Generation (RAG) poses significant challenges in context tracking, retrieval precision, and hallucination mitigation. Our proposed system addresses these by employing a multi-stage pipeline consisting of: (1) LLM-based query rewriting (powered by GPT-5.2) to resolve conversational dependencies; (2) a hybrid retrieval module combining dense embeddings (BGE-M3) and sparse retrieval (BM25) with Reciprocal Rank Fusion (RRF); (3) a confidence-based answerability gating mechanism; and (4) a post-generation faithfulness guard. Experimental results on the blind test set show that our approach achieves a Composite Score of 0.5576, ranking 4th out of 29 participating teams. Detailed analysis reveals that our system significantly outperforms strong baselines in faithfulness and successfully handles underspecified queries.
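Reciprocal Rank Fusion, the step that merges the dense and sparse result lists in stage (2), can be sketched in a few lines. The constant `k = 60` is a widely used default and is an assumption here, as are the document IDs; the abstract does not report the system's actual setting.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["d1", "d2", "d3"]   # e.g. from an embedding retriever
sparse_hits = ["d3", "d1", "d4"]  # e.g. from BM25
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Because only ranks are used, the two retrievers' raw scores never need to be put on a common scale.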
NLP-FSDM at SemEval-2026 Task 2: Temporal Smoothing and CCC-MAE Optimization for Balanced Longitudinal Affect Assessment
Abdessamad Benlahbib | Zouhir Essalmani | Achraf Boumhidi | Anass Fahfouh | Hamza Alami
This paper describes the NLP-FSDM system for SemEval-2026 Task 2, Subtask 1 on longitudinal affect assessment. The task requires predicting Valence and Arousal (V & A) scores for sequences of ecological essays and feeling words written over time. We adopt ModernBERT-large as a text encoder and formulate the task as a joint regression problem optimized using a Concordance Correlation Coefficient (CCC) loss combined with a lightly weighted Mean Absolute Error (MAE) term. To reduce variance induced by fine-tuning large transformers on relatively small user-specific datasets, we employ a three-seed ensemble. Finally, we introduce a lightweight post-inference temporal smoothing mechanism applied per user to improve within-user consistency. Our system achieves an rcomposite of 0.546 for Valence and 0.453 for Arousal, demonstrating stable cross-dimensional performance without explicitly modeling sequential dependencies.
Team Macaroni at SemEval-2026 Task 10: PsyCoMark: Psycholinguistic Conspiracy Marker Extraction and Detection
Rofaida Rabehi | Nicolai Plenk | Miriam Han
This paper describes our submission to SemEval-2026 Task 10: PsyCoMark, which addresses span-level identification of psycholinguistic conspiracy markers and document-level conspiracy classification. For Subtask 1, we fine-tune several pretrained transformer encoders and analyse their behaviour under different training configurations. For Subtask 2, we develop a hybrid system that combines ModernBERT-large with surface-level linguistic features. Our results show that straightforward fine-tuning of strong pretrained models is more effective than more complex pipelines and that additional handcrafted features do not yield consistent improvements. On the official test set, we rank 18th in Subtask 1 (overlap-based macro F1 = 0.16) and 20th in Subtask 2 (macro F1 = 0.76).
Tralaleros at SemEval-2026 Task 9: Multilingual Polarization Detection with Transformer-based Models
Adrian Dahl | Bado Völckers | Adam Mierzwa
We present a multilingual polarization detection system for SemEval-2026 Task 9 (Subtask 1), covering 22 languages with transformer-based models. We evaluate four strategies: data rebalancing, hyperparameter optimization, model scaling, and ensembling, and show that undersampling harms performance, while larger pretrained models improve results substantially. Our best single model, XLM-RoBERTa Large, achieves a Macro-F1 of 0.7929, with analysis showing complementary strengths across model families (e.g., RemBERT for several Indic languages and mDeBERTa for Semitic/morphologically rich languages). Ensemble gains are marginal, suggesting language-aware routing is more promising than uniform aggregation. We also provide a privacy-preserving Firefox extension that runs local ONNX inference for practical deployment without sending user text to external servers.
Dream at SemEval-2026 Task 13: SALSA for Single-Pass Machine-Generated Code Detection
Ruslan Berdichevsky | Shai Nahum-Gefen | Elad Ben-Zaken
Large language models have transformed code generation, raising concerns around authorship, assessment integrity, and software trust. SemEval-2026 Task 13 Subtask A operationalizes detection as binary classification over code snippets, with a particular emphasis on out-of-distribution (OOD) generalization across unseen programming languages and application domains. We propose a SALSA-style formulation, Single-pass Autoregressive LLM Structured Classification, that maps each class to a dedicated output token and trains the model to emit a single-token label in a structured response. Rather than engineering hand-crafted features or decision rules, this formulation delegates the authorship decision to the model. To improve OOD robustness, we combine balanced sampling across languages with parameter-efficient fine-tuning and conservative training (low learning rate, single epoch) to avoid overfitting to the training domain. Our best system achieves OOD F1 = 0.789 on the official leaderboard, substantially outperforming the CodeBERT baseline (F1 = 0.305).
PUEB-DimASR at SemEval-2026 Task 3: Escaping the Mean Regression Trap with Graph-Enhanced Transformers for Dimensional Aspect-Based Sentiment Regression
Oskar Riewe-Perła | Agata Filipowska
The DimABSA shared task aims to combine dimensional analysis with Aspect-Based Sentiment Analysis (ABSA). It addresses the lack of continuous sentiment representation, as opposed to categorical labels (e.g., positive, negative, or neutral), and enriches it with an assessment of arousal. Our team’s PUEB-DimASR investigates the "mean-regression trap": the tendency of standard MSE loss in high-dimensional sentiment tasks to over-predict values closer to the global mean. We propose a two-step advancement in model architecture. First, we enhance baseline Transformers with Graph Convolutional Networks (GCN) to capture syntactic aspect-sentiment dependencies. Second, we evaluate and recommend a Hybrid loss function that combines Mean Squared Error (MSE) and Concordance Correlation Coefficient (CCC). Our proposed GCN-deBERTa model consistently outperforms the baseline across six target languages. While MSE loss yields the best RMSE scores for English (0.876) and Chinese (0.546), it introduces significant variance collapse, which we successfully mitigated using the Hybrid loss, achieving near-perfect distributional alignment (99.6%). Additionally, our model trained with the Hybrid loss achieved the best RMSE scores for Russian (1.136), Tatar (1.207), and Ukrainian (1.178).
YNJTC at SemEval-2026 Task 11: A Neuro-Symbolic Hybrid Pipeline for Content-Independent Syllogistic Reasoning
Junhao Fu | Yun He | Lina Zhao | Weijuan Li
This paper presents a neuro-symbolic hybrid pipeline for SemEval-2026 Task 11 that addresses the content effect in syllogistic reasoning. The system converts natural-language syllogisms into formal mood-figure representations via regex parsing and LLM-powered extraction, then determines validity through symbolic table lookup against the 24 classically valid forms. The approach achieved a perfect Combined Score of 100.0 on Subtask 1 and competitive results on all four subtasks.
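The final symbolic step described here, a table lookup against the 24 classically valid forms, might look like the sketch below. It covers only the lookup; the regex and LLM extraction of mood and figure is not shown, and the function name and data layout are assumptions rather than the authors' code. Moods are written as three letters (major premise, minor premise, conclusion) over the proposition types A, E, I, O.

```python
# The 24 traditionally valid mood-figure combinations, grouped by figure.
# Nine of them (e.g. AAI-1, EAO-2) hold only under existential import.
VALID_FORMS = {
    1: {"AAA", "EAE", "AII", "EIO", "AAI", "EAO"},
    2: {"EAE", "AEE", "EIO", "AOO", "EAO", "AEO"},
    3: {"AAI", "IAI", "AII", "EAO", "OAO", "EIO"},
    4: {"AAI", "AEE", "IAI", "EAO", "EIO", "AEO"},
}


def is_valid(mood, figure):
    """Return True if the mood (e.g. 'AAA') is valid in the given figure (1-4)."""
    return mood.upper() in VALID_FORMS.get(figure, set())


print(is_valid("AAA", 1))  # Barbara
print(is_valid("AAA", 2))  # same mood, different figure
```

Since the lookup sees only the abstract form, the plausibility of the premises' content cannot influence the verdict, which is the point of a content-independent pipeline.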
Draken at SemEval-2026 Task 2: Frozen BERT Embeddings with Ridge Regression for Predicting Emotional Valence and Arousal
Rajalakshmi Sivanaiah | Angel Deborah S | Krishna Varun R | Krishnaraj N
We present a lightweight and computationally efficient system for Subtask 1 of SemEval-2026 Task 2, which focuses on predicting longitudinal variation in emotional valence and arousal from ecological essays. Our approach uses frozen contextual embeddings from BERT-base-uncased to obtain mean-pooled sentence representations without fine-tuning the transformer. These 768-dimensional embeddings are fed into a multi-output Ridge regression model to jointly predict normalized valence and arousal scores. The system emphasizes simplicity, reproducibility, and efficiency, avoiding complex temporal architectures, external lexicons, or user metadata. Despite its simplicity, the model achieves strong performance for valence prediction (r = 0.594) and moderate performance for arousal prediction (r = 0.296). Detailed evaluation across seen and unseen users, as well as between-user and within-user splits, shows that between-user correlations are consistently higher, and that valence is substantially easier to predict than arousal. These findings suggest that frozen transformer embeddings combined with linear regression provide a competitive and interpretable baseline for longitudinal affect prediction tasks.
NLPGroup8 at SemEval-2026 Task 2: Diverse Ensembles and Hierarchical Transformers for Emotional State Prediction
Troy Arthur | Aidan Kelley | Sierra Reschke
Our approach combines a diverse ensemble for Subtask 1 with a context-aware transformer aggregation architecture for temporal forecasting in Subtasks 2A and 2B. The ensemble achieved state-of-the-art performance for the Subtask 1 Valence metric, ranking first in Valence prediction. Our Subtask 2B independent architecture ranked second in Valence prediction and fourth in Arousal prediction among competitive submissions. We also report results for Subtask 2A, analyzing challenges our architecture faced with next-entry affect forecasting. These findings underscore the significance of our methodology for affective prediction, achieved without reliance on external affective datasets.
The Counterfactuals at SemEval-2026 Task 9: Can Counterfactually-Inspired Preprocessing help Detect Polarization?
Teagan Johnson
This paper presents the English-language submissions of The Counterfactuals team for the three subtasks of Task 9 at SemEval 2026. The task aims to detect multicultural online polarization, how it is expressed, and in what contexts. The task provides a high-quality annotation dataset of posts that follows a three-level schema: polarized or not (subtask 1), polarization type classification (subtask 2), and manifestation identification (subtask 3). I construct a pointwise mutual information-based lexicon that identifies highly-correlated words with the polarized class as labeled in subtask 1. Using this lexicon, I implement a large language model data augmentation technique. I then use the preprocessed datasets to finetune a BERT model (BERTweet) for each subtask. My highest performing models placed 48th out of 60, 35th out of 36, and 17th out of 24 on subtasks 1, 2, and 3 respectively. All code is available on GitHub.
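A pointwise mutual information lexicon of the kind described here can be sketched as follows. The document-level counting, the whitespace tokenizer, the `min_count` threshold, and the toy posts are all assumptions for illustration; the abstract does not specify these details.

```python
import math
from collections import Counter


def pmi_lexicon(docs, labels, target=1, min_count=2):
    """Score words by PMI with the target class, using document frequencies."""
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)
    df, df_target = Counter(), Counter()
    for text, y in zip(docs, labels):
        for word in set(text.lower().split()):
            df[word] += 1
            if y == target:
                df_target[word] += 1
    scores = {}
    for word, joint in df_target.items():
        if df[word] >= min_count:
            # PMI = log( P(word, target) / (P(word) * P(target)) )
            scores[word] = math.log((joint / n) / ((df[word] / n) * (n_target / n)))
    return scores


posts = ["they always lie", "they ruin everything",
         "nice weather today", "they like nice weather"]
polarized = [1, 1, 0, 0]
print(pmi_lexicon(posts, polarized))
```

Words that never occur in the target class are left out of the lexicon, and rare words are dropped by the frequency threshold because their PMI estimates are unreliable.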
CCNU at SemEval-2026 Task 10: Conspiracy Marker Extraction and Detection via Multi-task Learning and LLM-based Data Augmentation
Zijun Wang | Guanyi Chen
This paper presents the system of CCNU for SemEval-2026 Task 10: Psycholinguistic Conspiracy Marker Extraction and Detection. The task requires identifying fine-grained conspiracy markers that characterize conspiracy thinking, as well as determining whether a Reddit comment constitutes conspiratorial discourse. For Conspiracy Marker Extraction (Subtask 1), we adopt a Unified Multi-Task Sequence Labeling Framework that jointly models multiple conspiracy markers within a single labeling space. This formulation enables collaborative learning across marker types while maintaining a compact architecture. For Conspiracy Detection (Subtask 2), we formulate the problem as sentence-level classification. Across both subtasks, we apply data augmentation powered by large language models and ensemble inference to improve robustness and generalization. Our system achieves strong performance on Subtask 1, ranking 3rd on the official test set, and delivers competitive results on Subtask 2.
HCMUS RepeatedGames at SemEval-2026 Task 12: CausalRAG: Synergizing Causal Graph Retrieval and Extended LoRA for Abductive Reasoning
Duy Minh Dao Sy | Nguyen Tran | Trung Kiet Huynh | Phu Quy Nguyen Lam | Phu Hoa Pham
This paper presents our system developed for SemEval-2026 Task 12: Abductive Event Reasoning (AER). The shared task aims at identifying the most plausible cause of a real-world event from multiple-choice options, given retrieved documents as evidence. In this work, we propose using hybrid retrieval that combines BM25 keyword matching with dense semantic search to capture explicit causal keywords. Moreover, we apply extended LoRA fine-tuning that trains both attention and MLP layers of a 32-billion parameter language model with only 0.81% trainable parameters. For final refinement, we perform development set fine-tuning to leverage validation data before inference. Our system achieves a score of 0.90 on the official test set, tying for fifth place among participating teams and representing a +0.27 improvement over our baseline.
UIT-AMMC at SemEval-2026 Task 13: Exploiting Structural Formatting Signatures for Robust AI-Generated Code Detection
Cuong Pham | Minh Nguyen | Minh Le | An Nguyen | Chinh Nguyen
We participated in Subtask A with our Structure-Aware Contrastive Cascade, a multi-stage architecture designed to distinguish between human-authored and machine-generated code by integrating generative reasoning with explicit structural linguistic features. Our system focuses on exploiting structural formatting signatures that frequently emerge in AI-generated code as a byproduct of post-training alignment and readability optimization. The pipeline utilizes a Qwen-2.5-Coder 14B model fine-tuned via QLoRA, incorporating stochastic data augmentation techniques to ensure robustness across unseen programming languages. Final classification is achieved through a late-fusion mechanism that combines contrastive probability scores with statistical metrics of code presentation density. For samples exhibiting high epistemic uncertainty, we implement a multi-agent adversarial debate step to refine the final verdict. This approach enabled our system to achieve a Macro F1 score of 0.802, ranking 3rd on the official leaderboard.
NUST CodeIntel at SemEval-2026 Task 13: Cross-Domain Detection of Machine-Generated Code via Stylometric Features and Transformer Models
Azher Ali | Mehwish Fatima
We present our submission to SemEval-2026 Task 13 on cross-language and cross-domain detection of machine-generated code. We compare TF-IDF-based models with stylometric features against LoRA-tuned transformer encoders. While transformers achieve near-perfect in-distribution performance, they degrade sharply on unseen languages and domains. In contrast, a TF-IDF + Logistic Regression model attains the best test Macro-F1 and shows greater robustness. These results highlight the limitations of neural models under distribution shift and the strength of lexical baselines for cross-domain generalization.
CuriosAI at SemEval-2026 Task 4: A Comprehensive Study of Zero-Shot versus Fine-Tuned Approaches for Narrative Similarity
Yuki Shibata | Hiroki Takushima | Fumika Beppu | Aiswariya Manoj Kumar | Daichi Yamaga | Takayuki Hori
This paper presents our system for SemEval-2026 Task 4 on narrative similarity assessment. Through comprehensive experimentation, we evaluated various approaches including zero-shot pre-trained models, prompt engineering with large language models, and multiple fine-tuning strategies using synthetic data. Our experiments revealed a surprising finding: pre-trained sentence transformers in a zero-shot setting consistently outperformed all fine-tuning attempts. Specifically, our best system using sentence-transformers/sentence-t5-xl achieved 67.5% accuracy on the development set (95% CI: [61.0%, 74.0%]), while all fine-tuning approaches resulted in performance degradation of 2-18 percentage points. We provide a detailed analysis of why fine-tuning failed and discuss the implications for narrative similarity tasks.
YNU-HPCC at SemEval-2026 Task 4: Narrative Similarity via Multi-Perspective E5-Mistral and Embedding Routing
Feiyang Song | Jin Wang | Xuejie Zhang
This paper presents the system developed by the YNU-HPCC team for SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. The task challenges computational systems to identify narrative similarity across three orthogonal dimensions: abstract theme, course of action, and outcomes. The primary scientific difficulty lies in distinguishing the underlying structural fabula from surface-level lexical overlaps, particularly when facing long-context narratives with subtle plot twists. To address this, our approach employs a hybrid architecture that strategically decouples retrieval and ranking tasks. For Track A, we introduce a dynamic routing mechanism where an instruction-tuned E5-Mistral-7B model handles clear cases, while ambiguous hard samples are routed to a Gemini-3-Flash reasoner. For Track B, we leverage the global semantic modeling capabilities of Gemini-Embedding-001 via a structure-preserving chunking strategy, enhanced by All-But-The-Top (ABTT) during inference. Extensive experiments on the official test set show that this divide-and-conquer strategy effectively balances local instruction following with global open-domain generalization. Our system performs competitively, ranking 5th in Track A and 2nd in Track B among all participating teams.
SemEval-2026 Task 10 is focused on conspiracy detection. Specifically, the goal is to detect whether a Reddit comment expresses a conspiracy belief. Our submitted mdok-style system utilizes data augmentation and self-training (to cope with a rather small amount of training data) to finetune the Qwen3-32B model for a binary text-classification task. The submitted system is very competitive, ranking in the 85th percentile (8th out of 52 submissions). The results show that our approach, which originated in machine-generated text detection, can be used for conspiracy detection as well.
RvH-40 at SemEval-2026 Task 11: Disentangling Reasoning from Belief through Symbolic Abstraction
Niek Biesterbos | Mark Den Ouden | Janiek De Rijke
Large Language Models (LLMs) often struggle with syllogistic reasoning due to "belief bias," where semantic world knowledge overrides formal logical structure. In this paper, we present our submission for the SemEval-2026 Task 11 shared task. We investigate the discrepancy between a model’s latent logical capabilities and its performance on natural language text. By employing symbolic transformations, specifically variable and pseudoword substitution, we demonstrate that models like Qwen2.5-14B possess strong inherent reasoning skills that are suppressed by linguistic content. We propose a "logic alignment" strategy using Low-Rank Adaptation (LoRA) to bridge this gap. Our final model achieved a near-perfect accuracy of 97.92% on the validation set and 96.34% on the official hidden test set, effectively eliminating content bias while maintaining robust generalization across abstract formats.
Ajman University at SemEval-2026 Task 2: Overcoming Scale Collapse in Temporal Emotion Modeling via Residual Learning
Haseebullah Jumakhan | Soud Assad | Seyed Abdullah | Mahmoud Al-Ayyoub
Ajman University Team develops a set of specialized architectures for longitudinal affective forecasting for SemEval-2026 Task 2. We establish a baseline for our performance with a standard transformer model that sets our performance floor in Subtask 1 (ranked 18). In Subtask 2A (ranked 7) and Subtask 2B (ranked 8), our main contribution is to address the problem of scale collapse. To address the scale collapse, we use a novel "bifurcated leviathan" architecture to combine residual learning with target scaling. Our additional contribution is that we counteract the effects of regression to the mean by using optimized covariance via specialized objective functions (CCC and Huber). We use these objective functions while enforcing strict user level data splits. Finally, we show empirically that standard gradient stabilization methods decrease zero shot cross subject generalization, even when they optimize intra subject memorization.
Team TüLK at SemEval-2026 Task 1: Humor Generation with Qwen and Group Relative Policy Optimization
Konrad Brüggemann | Luting Hou
This paper addresses the challenge of computational humor generation proposed in SemEval-2026 Task 1: Humor Generation. Our approach leverages Group Relative Policy Optimization, with an LLM serving as the policy and a custom joke rating model providing a reward signal. We demonstrate that this framework is an effective and computationally efficient approach, reliably producing genuinely funny content that adheres to task constraints.
UMUTeam at SemEval-2026 Task 6: Soft-Voting Transformer Ensembles for Detecting and Classifying Response Ambiguity in Political Discourse
Tomás Bernal-Beltrán | Ronghao Pan | Jorge Gómez-Navalón | José Antonio García-Díaz | Rafael Valencia-Garcia
Political discourse frequently involves strategically ambiguous responses, particularly in high-stakes settings such as presidential debates and interviews. Detecting whether a politician has directly answered a question, provided an ambiguous reply or issued a clear non-reply remains a challenging task due to the pragmatic and rhetorical nature of political language. This paper describes our participation in the SemEval 2026 CLARITY shared task on response ambiguity detection and classification in English. We focused exclusively on Task 1 (Clarity-level Classification) and proposed a weighted soft-voting ensemble that combines four fine-tuned encoder-only transformer models: RoBERTa-large, BERT-large-cased, DistilBERT-cased and ModernBERT-large. Each model was optimized through grid search and their predicted class probability distributions were aggregated using a weighted linear combination. On the official test set, our system achieved a macro-F1 score of 0.71, ranking 26th out of 41 participating teams. Even with the performance gap compared to top-ranked systems, our results demonstrate that a lightweight set of moderately sized encoder models can provide stable and competitive performance without relying on external data or large-scale architectures.
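The weighted soft-voting step described above can be illustrated with a minimal sketch. The function names, the example probabilities, and the weights are illustrative assumptions, not values from the paper; the abstract only states that class probability distributions are aggregated with a weighted linear combination.

```python
from typing import List, Sequence


def soft_vote(probs: Sequence[Sequence[float]], weights: Sequence[float]) -> List[float]:
    """Weighted linear combination of per-model class probabilities.

    probs[m][c] is model m's probability for class c (hypothetical inputs).
    Weights are normalized so the result is again a probability distribution.
    """
    total = float(sum(weights))
    n_classes = len(probs[0])
    combined = [0.0] * n_classes
    for model_probs, w in zip(probs, weights):
        for c, p in enumerate(model_probs):
            combined[c] += (w / total) * p
    return combined


def predict(probs: Sequence[Sequence[float]], weights: Sequence[float]) -> int:
    """Index of the class with the highest combined probability."""
    combined = soft_vote(probs, weights)
    return max(range(len(combined)), key=combined.__getitem__)


# Two hypothetical models disagree; the more heavily weighted one wins.
example_probs = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
example_weights = [1.0, 3.0]
print(predict(example_probs, example_weights))
```

In practice the weights would be tuned on development data; how the authors chose theirs is not stated in the abstract.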
UMUTeam at SemEval-2026 Task 10: Transformer Ensembles for Conspiratorial Span Extraction and Detection
Jorge Gómez-Navalón | Ronghao Pan | Tomás Bernal-Beltrán | José Antonio García-Díaz | Rafael Valencia-Garcia
Conspiracy theories pose significant societal risks and require reliable automated detection methods. In this paper, we present our system for SemEval 2026 Task 10, addressing both conspiracy detection and psycholinguistic marker extraction. We leverage multiple pretrained transformer architectures and ensemble strategies to model conspiratorial discourse at both document and token levels. For classification, our ensemble achieves a weighted F1-score of 0.7688, indicating effective performance in distinguishing conspiratorial statements. For marker extraction, we formulate the task as a BIOES sequence labeling problem and enhance predictions through ensemble and specialist models. Our results highlight both the effectiveness of transformer-based approaches and the challenges of fine-grained conspiracy marker extraction.
CUETLuminaries at SemEval-2026 Task 11: Disentangling Logical Validity from Semantic Plausibility through Canonical Abstraction
Adnan Faisal | Shiti Chowdhury
Determining whether large language models (LLMs) perform genuine formal reasoning or rely on semantic heuristics is a key challenge in NLP. Syllogistic reasoning constitutes a theoretically principled evaluation paradigm where validity is fully determined by quantifier structure, allowing systematic analysis of structural inference disentangled from semantic plausibility. SemEval-2026 Task-11, Subtask-1: Disentangling Content and Formal Reasoning in Language Models, establishes a multilingual benchmark designed to rigorously isolate formal logical validity from semantic plausibility effects. The subtask evaluates English syllogistic reasoning under a binary classification setting using Overall Accuracy (ACC) and Total Content Effect (TCE), where lower TCE indicates stronger resistance to content-induced bias. Our proposed approach combines cross-validation, structured aggregation and bias-aware evaluation to optimize the robustness–performance trade-off. It achieves 93.19% accuracy with a TCE of 3.13, yielding a strong combined score of 38.56 under the official evaluation metric. Condition-wise and multi-run analysis confirms that robustness-focused optimization curbs content-driven errors, reinforcing the necessity of bias-aware training for formal inference.
CuriosAI at SemEval-2026 Task 10: Hybrid approaches to conspiracy span extraction and conspiracy detection
Hiroki Takushima | Fumika Beppu | Aiswariya Manoj Kumar | Yuki Shibata | Takayuki Hori | Daichi Yamaga
We present CuriosAI’s system for SemEval-2026 Task 10, addressing Conspiracy Marker Extraction and Conspiracy Detection. For marker extraction, we employ multi-label token classification with a bidirectional transformer (DeBERTa-v3-large) to predict overlapping spans. Alternative feature-based and LLM-based approaches do not surpass the encoder baseline. For Conspiracy Detection, we compare heterogeneous models, including transformer fine-tuning, lexical classifiers, embedding-based models, and LLM-based refinement. Development-optimal models do not always generalize best; logit-level ensembling achieves the strongest test performance (F1=0.7620). These results highlight the importance of bidirectional token modeling for span extraction and calibration-aware ensembling for robust detection.
AlphaLyrae at SemEval-2026 Task 9: Metric Learning and Asymmetric Loss for Chinese Polarization Analysis
Minh-Hoang Le | Khoan Phung
For the Chinese track of SemEval-2026 Task 9 (Detecting Online Polarization), we address two key challenges: polarized content frequently uses implicit language (e.g., homophones and coded terms) to evade moderation, and class distributions exhibit severe long-tail imbalance. We propose a metric learning approach that frames polarization detection as semantic similarity matching, which captures implicit language patterns better than linear decision boundaries. We fine-tune an ERNIE-3.0 encoder with SoftTriple loss and apply kNN retrieval for binary detection (Subtask 1). For multi-label categorization (Subtasks 2 and 3), we transfer learned representations from the detection model and fine-tune with Asymmetric Loss. A priority-based stratified cross-validation strategy ensures minority classes appear across all training folds despite extreme label skew. Evaluated on the official 1,927-sample test set using an end-to-end pipeline, our system achieved Macro-F1 scores of 0.9190 (Rank 6) on Polarization Detection, 0.8244 (Rank 5) on Type Classification, and 0.6670 (Rank 4) on Manifestation Identification.
dutirshlee at SemEval-2026 Task 11: Symbolic Augmentation for Content-Bias-Resistant Syllogistic Reasoning
Songhuan Li | Liang Yang | Shengdi Yin | Qiang Zhang | Hongfei Lin
We describe our system for SemEval-2026 Task 11 Subtask 1 (English syllogistic validity). Our approach fine-tunes Qwen2.5-7B-Instruct with LoRA and a symbolic data augmentation (SDA) scheme that replaces real-world entities with abstract placeholders, explicitly decoupling logical form from content. The resulting model achieves 96.34% accuracy and a total content effect (TCE) of 2.15, yielding a primary score of 44.86. We provide detailed ablations and negative results (prompting, self-consistency, contrastive decoding, structured chain-of-thought, and DPO) to characterize why direct LoRA training with SDA is the most robust configuration for this task. Finally, we use a specialist–generalist complementarity setting where a strong API model (ACC 99.48, TCE 1.06, score 57.68) is corrected by the SDA specialist on a single disagreement, producing a merged output with ACC 100 and TCE 0.
SU NLP 29 at SemEval-2026 Task 5: DynaOrd - Hybrid Dynamic Ordinal Regression with LoRA-Fine-Tuned DeBERTa-v3
Musab Khan
We describe our system submitted to SemEval-2026 Task 5 on rating the plausibility of word senses in ambiguous sentences within narrative contexts. The task requires predicting human-perceived plausibility scores on a 1-5 scale for candidate word meanings embedded in short stories, posing challenges such as limited training data and the ordinal nature of target labels. Our approach combines a DeBERTa-v3-large encoder with Low-Rank Adaptation (LoRA) and a dynamically weighted hybrid CORAL-MSE loss for ordinal regression. This formulation adapts the contribution of ranking and regression objectives during training, prioritizing ordinal consistency early and regression refinement in later epochs. We analyze the contributions of dynamic loss weighting to overall system performance.
Khaleesiyali at SemEval-2026 Task 2: Lexicon-Augmented RoBERTa for Valence–Arousal Regression on Ecological Essays
Eleale Tee
This paper presents a lexicon-augmented RoBERTa system for the SemEval-2026 Task 2 valence–arousal regression challenge. The model integrates deep contextual embeddings with a 6-dimensional feature vector derived from the NRC VAD lexicon, achieving a high token coverage rate of 72.05%. Under official user-aware evaluation, the system reached a competitive average composite correlation of 0.547, significantly outperforming the ridge regression baseline. The system demonstrated particular robustness in valence (r = 0.656) and achieved strong generalization to unseen users (r_arousal = 0.519). These findings indicate that lightweight lexicon-based statistics provide valuable complementary cues for longitudinal emotion modeling in modern transformer architectures.
UKPPsycontrol at SemEval-2026 Task 2: Modeling Valence and Arousal Dynamics from Text
Darya Hryhoryeva | Amaia Zurinaga | Hamidreza Jamalabadi | Iryna Gurevych
This paper presents our system developed for SemEval-2026 Task 2. The task requires modeling both current affect and short-term affective change in chronologically ordered user-generated texts. We explore three complementary approaches: (1) LLM prompting under user-aware and user-agnostic settings, (2) a pairwise Maximum Entropy (MaxEnt) model with Ising-style interactions for structured transition modeling, and (3) a lightweight neural regression model incorporating recent affective trajectories and trainable user embeddings. Our findings indicate that LLMs effectively capture static affective signals from text, whereas short-term affective variation in this dataset is more strongly explained by recent numeric state trajectories than by textual semantics.
EcoAffectTrack at SemEval-2026 Task 2: A Hierarchical DeBERTa-Transformer Framework with CCC Optimization for Longitudinal Affect Modeling
Diya Satish Kumar | Om Joshi
This submission proposes a hierarchical framework for longitudinal affect modeling, specifically designed for predicting variations in emotional valence and arousal over time. The system utilizes a DeBERTa-v3 encoder backbone optimized with a differentiable Concordance Correlation Coefficient (CCC) Loss for affect assessment (Subtask 1). This approach prioritizes capturing the "shape" and trend of emotional trajectories over absolute point-wise accuracy, yielding a significant performance gain over standard Mean Squared Error. For state change forecasting (Subtask 2A), the framework employs a Transformer-based temporal forecaster with positional encoding to account for inter-subject variability in emotional baselines. Disposition profiling (Subtask 2B) is addressed using a deep attention network that aggregates historical embeddings to identify emotionally informative essays. Experimental results from the official competition indicate that aligning the loss function with evaluation metrics and utilizing task-specific temporal modeling are essential for robust performance in longitudinal emotion recognition.
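For readers unfamiliar with the CCC objective mentioned above, the following is a minimal sketch of the standard Concordance Correlation Coefficient and the corresponding loss, 1 - CCC. It uses plain Python floats for clarity; the authors' differentiable implementation is not described in the abstract, so the function names and the zero-denominator handling here are assumptions.

```python
from typing import Sequence


def ccc(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Concordance Correlation Coefficient (population statistics).

    CCC = 2 * cov(pred, gold) / (var(pred) + var(gold) + (mean_pred - mean_gold)^2)
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        # Assumption: both series are constant and identical, so agreement is perfect.
        return 1.0
    return 2.0 * cov / denom


def ccc_loss(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Loss to minimize: 0 when predictions agree perfectly with the targets."""
    return 1.0 - ccc(pred, gold)


# A constant offset keeps Pearson correlation at 1 but lowers CCC.
print(ccc([2.0, 3.0, 4.0], [1.0, 2.0, 3.0]))
```

Unlike Pearson correlation, CCC penalizes shifts in mean and scale, which is why it is often preferred when the evaluation metric rewards agreement rather than mere co-variation.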
Momentum at SemEval-2026 Task 2: LongVA-RoBERTa, a transformer-Based Longitudinal Valence and Arousal Modeling
Supriya Nadiger | Sunil Saumya | Rahul Pujari | Veeresh Hiremath | Kiran Chikaraddi | Anoop Kadkol
This paper studies emotion through the affective circumplex model, which represents valence and arousal in a continuous two-dimensional space. It also explores the disposition of emotion over time to identify behavioural cues and self-identified affective states. While traditional methods use categorical emotion classes, SemEval 2026 Task 2 studies emotions in continuous space. In this paper, we propose LongVA-RoBERTa, a transformer-based model for regression-based emotion modeling on ecological essays. For Subtask 1, we develop an affect prediction framework employing RoBERTa with attention pooling and a regression head for valence and arousal prediction. In Subtask 2A, we employ a BiLSTM to capture temporal dependencies and fuse surface, contextual, and user-level features to predict short-term affect variation. Our results outperform the baseline, paving the way for further work on emotion prediction in continuous dimensional space.
LexMachina at SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal over Time from Ecological Essays
Somdev Ganguli | Vibhan Dutta | Romit Datta | Amit Barman | Sudip Naskar
Tracking emotional dynamics like valence and arousal is critical for understanding users’ affective baselines in ecological text. However, encoder models often struggle to distinguish stable user traits from dynamic shifts, leading to poor generalization. This paper presents LexMachina, our system for SemEval-2026 Task 2, addressing "domain shift" and "regression to the mean." LexMachina utilizes a DeBERTa-v3-Base backbone with a bifurcated strategy: post-hoc Isotonic Regression for valence calibration and a Domain Adversarial Neural Network (DANN) to mitigate user-bias in arousal. LexMachina achieved composite scores of r=0.645 (Valence) and r=0.434 (Arousal), demonstrating that adversarial disentanglement effectively captures nuances in longitudinal affective data.
lakshadvani at SemEval-2026 Task 11: A Neuro-Symbolic Approach to Content-Independent Syllogistic Reasoning
Laksh Advani
We describe our system for SemEval-2026 Task 11 on disentangling content from formal reasoning. The content effect in syllogistic reasoning, where models judge validity based on conclusion plausibility rather than logical structure, persists even with explicit instructions to ignore real-world knowledge. We find that this bias is better addressed structurally than through prompting: by restricting the LLM to a translation role (mapping natural language to abstract variables) and delegating all deductive reasoning to a deterministic checker over the 24 valid Aristotelian forms, we eliminate content bias entirely on Subtask 1 (100.0 combined, TCE=0.0, 4th place). Our Subtask 2 system, which lacks this separation, scores 41.08 (7th place) despite 95.26% accuracy and 99.47% premise retrieval F1, because a TCE of 2.94 incurs a 58% penalty. A three-way ablation on training data using GPT-5 confirms the pattern: Vanilla LLM, 78% accuracy / TCE=19; LLM + Aristotelian Rules in Prompt, 90% accuracy / TCE=5; LLM + Symbolic Checker, 97% accuracy / TCE=3.
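A deterministic checker of the kind described above can be sketched as a lookup over the 24 traditionally valid mood and figure combinations (including the forms that rely on existential import). The statement encoding as `(type, subject, predicate)` tuples and the function name are assumptions made for illustration; the abstract does not specify the authors' representation.

```python
from typing import Tuple

# (type, subject, predicate); type is A (All), E (No), I (Some), O (Some ... not).
Statement = Tuple[str, str, str]

# Valid moods per figure, written as major premise + minor premise + conclusion.
VALID_FORMS = {
    1: {"AAA", "EAE", "AII", "EIO", "AAI", "EAO"},
    2: {"EAE", "AEE", "EIO", "AOO", "EAO", "AEO"},
    3: {"AAI", "IAI", "AII", "EAO", "OAO", "EIO"},
    4: {"AAI", "AEE", "IAI", "EAO", "EIO", "AEO"},
}

# Figure is fixed by whether the middle term is the subject of each premise.
FIGURE = {(True, False): 1, (False, False): 2, (True, True): 3, (False, True): 4}


def is_valid(premise1: Statement, premise2: Statement, conclusion: Statement) -> bool:
    """Check validity from form alone; term names carry no meaning."""
    c_type, s, p = conclusion
    if s == p:
        return False
    # The major premise contains the conclusion's predicate, the minor its subject.
    for major, minor in ((premise1, premise2), (premise2, premise1)):
        if p in major[1:] and s in minor[1:]:
            break
    else:
        return False
    mid_major = [t for t in major[1:] if t != p]
    mid_minor = [t for t in minor[1:] if t != s]
    if len(mid_major) != 1 or mid_major != mid_minor or mid_major[0] in (s, p):
        return False
    m = mid_major[0]
    figure = FIGURE[(major[1] == m, minor[1] == m)]
    return major[0] + minor[0] + c_type in VALID_FORMS[figure]


# Barbara (AAA-1): All M are P, All S are M, therefore All S are P.
print(is_valid(("A", "M", "P"), ("A", "S", "M"), ("A", "S", "P")))
```

Because the checker only sees abstract term labels, the plausibility of the original conclusion cannot influence its verdict; any residual error would come from the translation step.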
CUETLuminaries0227 at SemEval-2026 Task 13: Invariance-Oriented Representation Learning for Robust AI-Generated Code Detection
Shiti Chowdhury | Adnan Faisal
Large language models increasingly generate high-quality source code, making reliable detection of machine-generated code essential for maintaining authorship integrity and software accountability. However, detection systems often degrade under distribution shift, particularly across programming languages and application domains. SemEval-2026 Task 13 Subtask A addresses this challenge through a structured OOD evaluation framework that assesses binary machine-generated code detection across unseen languages and application domains. To mitigate this limitation, we propose a robustness-oriented framework that enhances feature-fused UniXcoder representations with supervised contrastive learning, adversarial language-invariant training and uncertainty-aware filtering to promote stable and shift-resilient representations. Our proposed system achieves a macro-F1 of 0.5411 on the official test set and maintains stable performance under severe language–domain shift. Our results demonstrate that domain-level semantic variation is the primary source of degradation under distribution shift, reinforcing the importance of invariance-oriented representations for stable OOD performance.
DualAxis AI at SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis
Yahya Missaoui | Solomon Kebede | Mounika Marreddy | Alexander Mehler
Dimensional Aspect-Based Sentiment Analysis models sentiment using continuous valence and arousal scores instead of discrete polarity labels, enabling fine-grained affect representation at the aspect level. SemEval 2026 Task 3 defines this setting through three subtasks covering aspect-level regression and structured extraction of aspect–opinion pairs with continuous scoring. We implement transformer-based baselines for all subtasks within a unified, reproducible framework. For aspect-level regression, we fine-tune pretrained encoders in an aspect-conditioned setup to predict valence and arousal. RoBERTa-large achieves the best development performance, with average RMSEs of 0.884 (restaurant) and 0.789 (laptop).
DUTH at SemEval-2026 Task 1: Prompt-Based Zero-Shot Large Language Models for Constrained Multilingual Humor Generation
Georgios Arampatzis | Avi Arampatzis
Humor generation is a challenging problem for natural language processing systems due to its subjectivity, cultural dependence, and reliance on creative language use. These challenges are further amplified in constrained multilingual settings, where models must satisfy explicit lexical or topical requirements while producing short and humorous outputs. In this paper, we present DUTH's system for SemEval-2026 Task 1 on constrained multilingual joke generation in English, Spanish, and Chinese. Our approach leverages instruction-tuned large language models in a zero-shot setting, combining prompt engineering, controlled decoding, and lightweight post-generation validation to enforce constraint satisfaction and language consistency. We evaluate multiple model families and parameter scales, including Qwen and Mistral models. Human evaluation demonstrates that larger models consistently outperform smaller ones, with Qwen2.5-14B-Instruct achieving the strongest overall performance. Error analysis highlights remaining challenges such as lexical constraint violations and cross-lingual interference.
Ambirig at SemEval-2026 Task 5: Distributional Ordinal Modelling for Ambiguous Word Senses in Narrative Contexts
Soumyajit Roy
Word Sense Disambiguation (WSD) has traditionally been framed as selecting a single correct sense given context. However, natural language understanding by humans often involves ambiguity, underspecification, and graded plausibility judgments rather than categorical decisions. SemEval-2026 Task 5 explicitly targets this gap by requiring systems to predict human-perceived plausibility scores for word senses in short narratives. In this paper, we present a systematic empirical study of modelling plausibility as an ordinal distribution prediction problem. We hypothesise that standard classification objectives fail to capture the ordinal nature of human uncertainty in this domain. While we experimented with complex auxiliary tasks, including Siamese networks, Task-Adaptive Pretraining (TAPT), and transfer learning from Natural Language Inference (NLI), our results show these approaches fail in low-resource settings. Instead, we propose a streamlined architecture based on DeBERTa-v3-base utilising a GlossBERT-style Cross-Encoder optimised with Earth Mover’s Distance (EMD) loss. By modeling the problem as ordinal regression over a probability distribution and enriching inputs with prototypical examples, our system achieves an accuracy of 73% and Spearman correlation of 0.593, establishing a robust baseline that outperforms complex parameter-heavy approaches.
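To show why an Earth Mover's Distance objective suits ordinal labels, here is a minimal sketch of EMD between two distributions over ordered classes, computed as the sum of absolute differences between their cumulative distributions. This is the standard one-dimensional closed form with unit spacing between classes; whether the authors use this variant or a squared one is not stated in the abstract, and the function name is hypothetical.

```python
from itertools import accumulate
from typing import Sequence


def emd_ordinal(pred: Sequence[float], target: Sequence[float]) -> float:
    """1-D Earth Mover's Distance between distributions over ordered classes.

    With unit distance between adjacent classes, EMD equals the sum of
    absolute differences between the two cumulative distribution functions.
    """
    if len(pred) != len(target):
        raise ValueError("distributions must cover the same classes")
    cdf_pred = accumulate(pred)
    cdf_target = accumulate(target)
    return sum(abs(a - b) for a, b in zip(cdf_pred, cdf_target))


# Hypothetical 5-point plausibility scale with the gold mass on score 3.
gold = [0.0, 0.0, 1.0, 0.0, 0.0]
near_miss = [0.0, 0.0, 0.0, 1.0, 0.0]   # predicts score 4
far_miss = [1.0, 0.0, 0.0, 0.0, 0.0]    # predicts score 1

print(emd_ordinal(near_miss, gold), emd_ordinal(far_miss, gold))
```

A cross-entropy loss would treat both misses identically, whereas the EMD grows with the distance between the predicted and gold scores.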
DUTH at SemEval-2026 Task 3: Multilingual Transformer Models for Dimensional Stance Prediction Across Tracks
Georgios Arampatzis | Avi Arampatzis
This paper presents DUTH, our system for Track A and Track B of SemEval-2026 Task 3 on Dimensional Sentiment Analysis, focusing on the Dimensional Aspect-Based Sentiment Regression (DimASR) subtask. DimASR requires predicting continuous Valence and Arousal (VA) scores for aspect terms in opinionated text and stance targets in public-issue discourse. Our approach uses a multilingual Transformer encoder fine-tuned end-to-end to jointly encode the input text and its corresponding aspect or stance target, followed by a regression head for VA prediction. We evaluate DUTH on the official multilingual and multidomain datasets and compare it against the shared-task baselines. Results show competitive performance, with improvements over the strongest official baseline in Track A and over the mBERT baseline in Track B, while yielding consistently stronger predictions for Valence than for Arousal.
DUTH at SemEval-2026 Task 9: Joint Multilingual Fine-Tuning for Online Polarization Detection
Georgios Arampatzis | Avi Arampatzis
Online polarization on social media presents substantial challenges for public discourse, content moderation, and large-scale social analytics across diverse linguistic and cultural contexts. A recent multilingual benchmark enables systematic evaluation of polarization detection across 22 languages and multiple sociopolitical events, providing a unified setting for studying socially grounded NLP under multilingual conditions. We present DUTH, a unified multilingual system for binary polarization detection based on joint fine-tuning of XLM-RoBERTa on the 22 languages of SemEval-2026 Task 9 Subtask 1. Our system uses a single shared encoder with a linear classification head and is trained jointly on the multilingual training set using mixed-precision optimization. On the official evaluation, the system achieved an average Accuracy of 0.822 and an average Macro-F1 of 0.780 across 22 languages. The results show that a simple jointly fine-tuned multilingual transformer provides a competitive and scalable baseline for online polarization detection, while still facing difficulties in implicit, sarcastic, and culturally grounded cases.
UAlberta at SemEval-2026 Task 2: Temporal Fusion Models for Predicting Affect Over Time
Duc Ho | Khanh Bui | Daniela Teodorescu | Grzegorz Kondrak
We describe our systems for the SemEval 2026 Task 2 on Predicting Variation in Emotional Valence and Arousal from Ecological Essays. To predict affect in a single instance, and for forecasting dispositional change, we use embeddings from a language model and a Recurrent Neural Network. To predict state changes from a previous timestep to the next, we use time-series forecasting. Our systems ranked first for forecasting dispositional change, and third for forecasting state change over time. We make our code publicly available.
NLP-FSDM at SemEval-2026 Task 4: Narrative Similarity via Multiple Negatives Ranking and Instruction-Based Embeddings
Abdessamad Benlahbib | Zouhir Essalmani | Achraf Boumhidi | Anass Fahfouh | Hamza Alami
The identification of narrative similarity is a complex NLP challenge that requires modeling deeper plot and thematic alignment rather than relying solely on lexical overlap. In this paper, we detail the participation of team NLP-FSDM in SemEval-2026 Task 4. Our approach utilizes the bge-large-en-v1.5 encoder. For Track A, we fine-tune it using Multiple Negatives Ranking Loss (MNRL), while for Track B we rely on the pretrained encoder to generate fixed narrative representations. We achieved an accuracy of 65.50% in Track A and 62.50% in Track B. This paper provides an extensive comparison of our results with competitive baselines and top-performing systems, analyzing the efficacy of dense encoders in low-resource narrative contexts.
AI4PC-Howard University at SemEval-2026 Task 2: Fine-Tuning DistilBERT, DeBERTa and ModernBERT for Valence–Arousal Prediction and Change Estimation
Araj Shah | Utsav Shah | Saurav Aryal
We present lightweight, reproducible models for longitudinal valence–arousal (VA) prediction in the SemEval-2026 Task 2 essay corpus. Using only the official data, we enforce user-disjoint splits to prevent leakage and evaluate three settings: essay-level VA state estimation, short-horizon VA change forecasting, and long-horizon disposition change prediction. Our submitted systems use DistilBERT for essay-level regression, ModernBERT-based history modeling with a GRU and a blended previous-delta baseline for short-horizon change, and pooled DeBERTa history embeddings with a compact MLP for disposition change. On the official evaluation, across our best performing approaches, we achieve r_comp = 0.665/0.468 (valence/arousal) for Subtask 1, r = 0.597/0.413 for Subtask 2A, and r = 0.046/0.348 for Subtask 2B.
Simorgh at SemEval-2026 task 7: Region-Aware Hybrid Retrieval for Low-Resource Cultural Reasoning in Multilingual Question Answering
Hadi Bayrami Asl Tekanlou | Mahdi Bakhtiyarzadeh | Jafar Razmara
We propose a region-aware hybrid retrieval framework for culturally grounded multilingual question answering. Our system combines BM25-based lexical matching with dense semantic similarity using sentence embeddings, integrating both signals into a unified ranking function. To further prioritize culturally relevant evidence, we introduce a regional weighting heuristic that boosts documents containing explicit region-specific references. The top-ranked evidence passages are incorporated into a structured prompt and processed by a 4-bit quantized Qwen3-14B model. Instead of generating free-form text, the model selects answers deterministically using a logit-based scoring mechanism over the four multiple-choice options. This design enables efficient inference while improving cross-lingual stability, particularly in culturally explicit contexts.
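The fusion of lexical and dense signals with a regional boost, as outlined above, might look roughly like the sketch below. Everything specific here is an assumption for illustration: the min-max normalization, the `alpha` mixing weight, the multiplicative `boost`, and the substring test for region mentions are not taken from the paper, and the BM25 and embedding scores are supplied as precomputed numbers rather than produced by a real retriever.

```python
from typing import List, Sequence


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so lexical and dense signals are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def hybrid_rank(
    docs: Sequence[str],
    bm25_scores: Sequence[float],
    dense_scores: Sequence[float],
    region: str,
    alpha: float = 0.5,   # hypothetical weight on the lexical signal
    boost: float = 1.2,   # hypothetical multiplier for region mentions
) -> List[str]:
    """Rank documents by a blended score, boosting explicit region references."""
    lexical = min_max(bm25_scores)
    dense = min_max(dense_scores)
    scored = []
    for doc, lex, den in zip(docs, lexical, dense):
        score = alpha * lex + (1.0 - alpha) * den
        if region.lower() in doc.lower():
            score *= boost
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored]


docs = ["Tea customs in Iran", "Tea customs worldwide", "Football rules"]
print(hybrid_rank(docs, [2.0, 2.0, 0.0], [0.8, 0.9, 0.1], region="Iran"))
```

With the boost disabled (`boost=1.0`), the generic document ranks first in this toy example, which illustrates how the regional heuristic changes the evidence passed to the model.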
lamanhnguyen at SemEval-2026 Task 2: Uncovering Lexical Bias and Momentum Lag in Longitudinal Emotion Prediction using Multi-task DeBERTa
Lam Anh Nguyen
This paper describes our system for SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal. We approached the task by fine-tuning a weighted ensemble of DeBERTa-v3-base models. Our system achieved the second-highest Valence composite correlation and ranked 5th in the overall V&A average in Subtask 1. More importantly, we provide an empirical analysis of our model’s performance on longitudinal tasks, where it exhibited significant inverse correlations. We quantify the Venting Effect, showing a systematic tendency for the model to over-index on negative lexical cues despite self-reported relief. Furthermore, we analyze the structural trade-off between Mean Absolute Error and Pearson correlation induced by smoothing techniques.
VerbaNex AI at SemEval-2026 Task 2: DeBERTa for Longitudinal Valence and Arousal Prediction
Melissa Moreno | Juan Carlos Martinez Santos | Edwin Puertas
This paper describes our submission to SemEval 2026 Task 2, Subtask 1: Longitudinal Affect Assessment, which aims to predict continuous valence and arousal scores from chronologically ordered texts. We implement two regression-based configurations built on DeBERTa fine-tuning: a contextual model and a hybrid model that incorporates normalized lexical features derived from the NRC VAD lexicon. Both systems preserve temporal ordering and apply user-level data splits to ensure generalization to unseen individuals. Results show competitive performance, with stronger outcomes in valence than in arousal. The integration of lexical features does not yield consistent improvements for arousal, highlighting the difficulty of modeling emotional intensity dynamics. Error analysis indicates challenges in handling implicit emotions, pragmatic ambiguity, and subtle affective shifts over time. Overall, findings underscore the importance of combining contextual representations with structured lexical knowledge while addressing longitudinal variability in emotional activation.
We describe the PALI system submitted to SemEval-2026 Task 3 (Dimensional Aspect-Based Sentiment Analysis), which requires predicting valence–arousal (VA) scores and extracting structured sentiment tuples across multiple languages. Our final system centers on LoRA fine-tuning of Qwen3-32B using Llama-Factory, together with data conversion/cleaning, multilingual data-mixing strategies, and inference-time validation and repair. We additionally explored retrieval-based few-shot prompting with BGE-M3, but found it less effective for learning consistent VA scoring preferences. On Track A, our final system uses per-language LoRA adapters that mix all subtasks per language for a better trade-off between performance and efficiency. On the official test set, we achieve average per-language scores of 1.2071 RMSE_VA for Subtask 1 and 0.5641/0.4905 cF1 for Subtask 2/3. On the development set, we find that per-language-per-task adapters further improve extraction cF1 but are less attractive in terms of training and deployment cost. For Track B, we report results for VA prediction on five languages and two domains.
Duluth at SemEval-2026 Task 6: DeBERTa with LLM-Augmented Data for Unmasking Political Question Evasions
Ted Pedersen
This paper presents the Duluth approach to SemEval-2026 Task 6 on CLARITY: Unmasking Political Question Evasions. We address Task 1 (clarity-level classification) and Task 2 (evasion-level classification), both of which involve classifying question–answer pairs from U.S. presidential interviews using a two-level taxonomy of response clarity. Our system is based on DeBERTa-V3-base, extended with focal loss, layer-wise learning rate decay, and boolean discourse features. To address class imbalance in the training data, we augment minority classes using synthetic examples generated by Gemini 3 and Claude Sonnet 4.5. Our best configuration achieved a Macro F1 of 0.76 on the Task 1 evaluation set, placing 8th out of 40 teams. The top-ranked system (TeleAI) achieved 0.89, while the mean score across participants was 0.70. Error analysis reveals that the dominant source of misclassification is confusion between Ambivalent and Clear Reply responses, a pattern that mirrors disagreements among human annotators. Our findings demonstrate that LLM-based data augmentation can meaningfully improve minority-class recall on nuanced political discourse tasks.
HCMUSDroneBoys at SemEval-2026 Task 11: Asymmetric Counterfactual Debiasing and Rank-Sensitive Logical Invariance Adaptation for Syllogistic Reasoning
Nguyen Tran | Duy Minh Dao Sy | Trung Kiet Huynh | Phu Hoa Pham | Phu Quy Nguyen Lam
This paper describes our system for SemEval-2026 Task 11, Subtask 1: binary classification of syllogistic validity in English. The main challenge is the content effect, where language models confuse formal logical validity with how plausible the argument sounds. We propose three techniques that work together to separate logical form from semantic content: (1) Structure-Disentangled Prompting (SDP), which breaks syllogisms into premise-conclusion triples and uses a logic-first instruction template; (2) Asymmetric Counterfactual Debiasing (ACD), a data augmentation method that only generates valid-to-invalid counterfactual pairs, taking advantage of an asymmetry in validity composition to avoid label noise; and (3) Rank-Sensitive Logical Invariance Adaptation (RLIA), where we find that low-rank QLoRA adapters cannot simultaneously learn classification and suppress content-correlated shortcuts, and solve this by increasing adapter rank. Built on Qwen2.5-14B-Instruct, our system achieved a perfect Combined Score of 100.0 on the SemEval-2026 Task 11 Subtask 1 benchmark.
YNU-HPCC at SemEval-2026 Task 1: Constraint-Aware In-Context Learning for Multilingual Humor Generation
Xulong Zhang | Jin Wang | Xuejie Zhang
This paper describes the system developed by the YNU-HPCC team for SemEval-2026 Task 1 (Humor Generation). The task aims to generate humorous texts from given news headlines or from two unrelated words. The core challenge lies in enabling Large Language Models (LLMs) to understand human humor and align with specific humorous styles. We investigated two approaches: fine-tuning with Proximal Policy Optimization (PPO) and in-context learning with LLMs. We also employed Qwen-Max to evaluate the quality of the generated texts. In the PPO experiments, we constructed a hybrid reward model to align with humor. For our final submission based on LLMs, we used multiple advanced LLMs, along with customized few-shot prompts and a small set of gold samples, to effectively guide the models in generating jokes that resonate with human humor. Experimental results show that our system achieves competitive performance, ranking 4th in the English track, 2nd in the Chinese track, and 2nd in the Spanish track.
Perspicere at SemEval-2026 Task 2: Modeling Longitudinal Valence and Arousal via Dense Embeddings and Agentic Reasoning
Kamyar Moradian Zehab | Mohammad Sadegh Poulaei | Nasser Mozayani
This paper presents our system for SemEval 2026 Task 2 (Subtask 1), modeling affect assessment as a longitudinal trajectory. We evaluate a tripartite affective framework of escalating contextual complexity, spanning zero-context feature extraction, latent temporal modeling via LSTM, and explicit semantic reasoning via the Teacher-Guided Clinical Reasoning Agent utilizing in-context learning. Our results show that robust static extraction outperforms explicit sequence modeling. Specifically, Matryoshka-distilled embeddings (Jasper) paired with XGBoost provided the best balance of speed and accuracy when utilizing the full training corpus (Valence composite r = 0.654, a 17.4% improvement compared with the baseline), mitigating the severe overfitting observed on partitions of the dataset. Additionally, we uncover a distinct agentic advantage: although the reasoning agent trailed mathematical regressors in tracking high-frequency fluctuations, its SOTA psychological profiling yielded the highest Between-User Valence correlation (r = 0.725), demonstrating its efficacy in user-level affective profiling. Finally, a persistent "arousal bottleneck" confirms the limitations of text-only modeling for physiological activation.
McMaster NLP at SemEval-2026 Task 2: A Lightweight Multi-Feature System for Predicting Emotional Valence and Arousal over Time
Hongyi Zhang | Daniel Hu | Allison Lahnala
We present a lightweight, feature-based regression system for predicting valence (pleasantness) and arousal (activation) from longitudinal language data. The language data ranges from longer free-form ecological essays to short affect-word entries, organized by user and time, reflecting natural variation in affective expression and experience. Our approach combines four complementary signals: (i) sentence-level semantic embeddings, (ii) psycholinguistic category features capturing affect- and function-related word usage, (iii) similarity measures between the language data and archetypal sentences, and (iv) trainable user embeddings to account for between-user differences. The resulting feature vector is passed to a multi-layer perceptron trained to jointly predict valence and arousal. Our design provides a strong and interpretable baseline by making it possible to isolate the contribution of semantic, psycholinguistic, similarity, and user-specific signals. We further analyze our model’s predictions to identify which feature groups are most informative and where errors are concentrated across users and input types.
YNU-HPCC at SemEval-2026 Task 9: Hybrid Augmentation and Regularization Strategies for Multilingual Polarization Type Classification
Di Bao | Jin Wang | Xuejie Zhang
This paper introduces a system based on fine-tuned pretrained language models, which is constructed for SemEval 2026 Task 9: Multilingual Polarization Type Classification. The task aims to perform multi-label polarization classification on texts covering 22 languages, identifying five types of polarization: political, racial/ethnic, religious, gender/sexual, and others. The main challenges of the task lie in handling uneven data distribution across languages, extreme class imbalance, and the complexity of cross-lingual semantic understanding. To address these challenges, a training framework integrating hybrid augmentation and multi-strategy regularization is proposed. Based on XLM-RoBERTa-large, the framework combines feature-space Mixup augmentation, an asymmetric loss function, adversarial training, and exponential moving average. Multi-label decisions are made through dynamic threshold optimization. Experimental results show that the proposed method achieves a macro-F1 score of 0.48 on the validation set, effectively improving classification performance and generalization capability in multilingual and imbalanced scenarios.
Paradise at SemEval-2026 Task 12: Leveraging Instruction-Tuned Large Language Models with Chain-of-Thought Prompting for Abductive Event Reasoning
Dhruv Goyal | Ishita Gupta | Jatin Bedi
We present Paradise, our system for SemEval-2026 Task 12: Abductive Event Reasoning, which identifies plausible direct causes of real-world English-language events using retrieved contextual documents. Our approach employs Qwen2.5-7B-Instruct, a 7-billion-parameter instruction-tuned language model combined with carefully engineered chain-of-thought prompting, requiring no task-specific fine-tuning or training-data supervision (prompt components were selected using the development set). The system achieves a score of 0.79 on the official 612-instance test set by integrating explicit causal-inference rules, 4,000-character document context windows, and greedy decoding. Analysis reveals that conservative prediction patterns, 87.1% single-label and 36.9% Option D, effectively exploit the asymmetric scoring metric. Ablation studies confirm that document context contributes +6.4 points, chain-of-thought reasoning +5.3 points, and explicit causal rules +3.1 points to development performance. Our code is publicly available at https://github.com/DhruvGoyal404/semeval2026-task12.
Paradise at SemEval-2026 Task 5: On the Limitations of Surface-Level Features for Graded Word Sense Plausibility Prediction
Dhruv Goyal | Ishita Gupta | Jatin Bedi
This paper introduces a simple approach for predicting how plausible a word sense is in short narratives where meaning is ambiguous. We use 13 hand-crafted features, including text statistics, word-level similarity computed using basic set-based comparisons, and measures of annotator disagreement. Five diverse and largely independent traditional machine learning models are combined using a weighted ensemble with minimal tuning. Despite theoretical grounding in classical disambiguation methods, our system achieves essentially random performance, with Spearman correlation (ρ) of −0.038 and accuracy within standard deviation of 0.542 on the official test set. This result demonstrates that surface-level lexical features, while interpretable, are insufficient for graded sense plausibility prediction without deep semantic representations. By selecting features inspired by classical word sense disambiguation techniques and incorporating signals derived from human disagreement, our model produces plausibility predictions that are largely interpretable. This negative result provides important baselines and insights for future work on graded word sense disambiguation.
ES4MLL at SemEval-2026 Task 2: Set Attention Aggregation and Recurrent Temporal Modeling for Longitudinal Affect Prediction
Andrea Lolli | Chiara Lunazzi | Riccardo Coppola | Flavio Giobergia
Longitudinal modelling of affect from text requires capturing both linguistic content and temporal emotional dynamics. SemEval-2026 Task 2 introduces a dataset of essays and feeling words annotated with self-reported valence and arousal scores. In this work, we propose a neural architecture that combines pretrained Transformer encoders with temporal sequence modelling to predict continuous valence and arousal over user-specific timelines. Individual texts are encoded using a Transformer-based language model and aggregated through attention-based pooling before being processed by recurrent layers to capture longitudinal dependencies. To adapt pretrained representations under limited data conditions, we explore parameter-efficient fine-tuning strategies. We make the code available at https://github.com/AndreaLolli2912/SemEval2026-EmoVA.
TTLab at SemEval-2026 Task 10: Transformer-based Approaches for Psycholinguistic Conspiracy Detection in Social Media Discourse
Samuel Richter | Mounika Marreddy | Alexander Mehler
Online platforms increasingly host conspiracy narratives that shape public debate, reduce trust in institutions, and contribute to polarization, highlighting the need for reliable automatic detection systems. In this paper, we participate in SemEval-2026 Task 10 (PsyCoMark), focusing on conspiracy detection in Reddit conversations using transformer-based models. We evaluate four approaches: raw text, structured psycholinguistic markers, a combined representation, and a stacking ensemble. Our results show that marker-based representations outperform text-only models, and that ensembling further improves robustness. These findings demonstrate the value of incorporating structured psychological cues for scalable conspiracy detection.
Our system for SemEval-2026 Task 1 Subtask A addresses constrained text-based humor generation in English. The approach relies on structured prompt engineering using a GPT-4–class large language model in a zero-shot setting without task-specific fine-tuning. Each input, consisting of either mandatory word pairs or a news headline, is embedded into a fixed instruction template enforcing strict stylistic and structural constraints. The system ensures single-sentence outputs between 8–12 words, adopts a dry and deadpan tone, and incorporates subtle expectation shifts while avoiding exaggerated punchlines or unsafe content. Deterministic decoding guarantees replicability, and an automatic validation step enforces compliance with official submission requirements. Experimental results show that structured prompting significantly improves stylistic alignment compared to unconstrained generation. The system demonstrates that controlled humor generation can be achieved through constraint-based prompt design without additional training.
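The structural constraints described above (a single sentence of 8 to 12 words) lend themselves to an automatic check. The sketch below is only one plausible form of such a validation step. The function name `is_valid_joke` and the punctuation-based sentence heuristic are assumptions, and the official submission requirements are not reproduced here.

```python
import re

def is_valid_joke(text, min_words=8, max_words=12):
    """Heuristic check for 'one sentence, 8 to 12 words' (hypothetical helper).

    Words are whitespace-separated tokens. A text counts as a single
    sentence when it contains exactly one run of terminal punctuation
    and that run sits at the very end.
    """
    body = text.strip()
    n_words = len(body.split())
    if not (min_words <= n_words <= max_words):
        return False
    enders = re.findall(r"[.!?]+", body)
    return len(enders) == 1 and body[-1] in ".!?"

candidates = [
    "My smart fridge now judges me silently every single night.",
    "Too short.",
    "This one has two sentences. It should fail the single sentence rule.",
]
accepted = [c for c in candidates if is_valid_joke(c)]
```

The heuristic would reject sentences containing abbreviations such as "U.S."; a production validator would need a proper sentence splitter.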
YNU-HPCC at SemEval-2026 Task 10: Pretrained DistilBERT Models for Conspiracy Marker Extraction and Detection
Junpei Chen | You Zhang | Jin Wang | Dan Xu | Xuejie Zhang
In this paper, we present our submission to the SemEval-2026 Psycholinguistic Conspiracy Shared Task (Task 10), which consists of two tasks: conspiracy marker extraction and conspiracy detection. For conspiracy marker extraction, we formulate the problem as a token classification task and fine-tune pretrained language models, achieving performance above the official baseline and ranking 6th on the final leaderboard. For conspiracy detection, we apply data preprocessing, regularization, and ensemble inference strategies, resulting in improvements over the baseline and a 10th-place ranking. Overall, our results demonstrate the effectiveness of pretrained language models for both tasks.
SemEval 2026 task 5 asks us to provide a program to try to match the human ratings of sense-appropriateness of a particular word in a series of very structured, very short stories. Our system associates a fixed list of 50 words with each WordNet synset, and computes several scores for each of the phrases in the story, to determine how closely the phrase matches the wordlist. We received near-chance results, in spite of several different approaches to building and employing sets of word-lists. The stories in this dataset are designed to be ambiguous, and every story contains words associated with at least two senses of the target word. We now believe that our system’s approach is inappropriate for this dataset.
zhangpeng at SemEval-2026 Task 10: PsyCoMark - Psycholinguistic Conspiracy Marker Extraction and Detection
Zhang Peng | Lu Gehao
We describe our system for SemEval-2026 Task 10 on psycholinguistic conspiracy marker extraction and conspiracy detection from English texts. The shared task consists of two subtasks: (1) extracting conspiracy-related markers—actor, action, effect, victim, and evidence—evaluated using an overlap-based macro F1-score, and (2) detecting conspiracy content as a binary text classification problem evaluated using macro-averaged F1-score. Our approach relies on fine-tuning pre-trained transformer encoders, including multilingual DistilBERT variants and DeBERTa-v3, without using external corpora or data augmentation techniques. Experimental results show that our best models achieve a macro-F1 score of 0.1476 for Subtask 1 and a Weighted-F1 score of 0.7267 for Subtask 2. These results show that simple fine-tuning of pre-trained models provides a strong baseline for both marker extraction and conspiracy detection.
AIvengers at SemEval-2026 Task 9: Utilizing Language Specific Encoders for Multilingual Text Classification
Boon Elschenbroich | Lars Britz
Polarizing language has evolved from a social media phenomenon into a pervasive feature of public and everyday discourse across cultures and geographies. This is not limited to certain countries; it is a worldwide trend. As we will show, detecting polarization, its type, and its manifestation is not a simple task for a single ML model; it requires different approaches depending on the language and culture. In this paper, we provide the best methods that we found for each language in all three subtasks of the SemEval-2026 Task 9 multilingual text classification challenge. We achieved the best results with language-specific pre-trained BERT and RoBERTa models rather than with a generic multilingual model. Our approach secured a high to medium rank in all subtasks and languages.
SwanNLP at SemEval-2026 Task 5: An LLM-based Framework for Plausibility Scoring in Narrative Word Sense Disambiguation
Deshan Sumanathilaka | Nicholas Micallef | Julian Hough | Saman Jayasinghe
Recent advances in language models have substantially improved Natural Language Understanding (NLU). Although widely used benchmarks suggest that Large Language Models (LLMs) can effectively disambiguate, their practical applicability in real-world narrative contexts remains underexplored. SemEval-2026 Task 5 addresses this gap by introducing a task that predicts the human-perceived plausibility of a word sense within a short story. In this work, we propose an LLM-based framework for plausibility scoring of homonymous word senses in narrative texts using a structured reasoning mechanism. We examine the impact of fine-tuning low-parameter LLMs with diverse reasoning strategies, alongside dynamic few-shot prompting for large-parameter models, on accurate sense identification and plausibility estimation. Our results show that commercial large-parameter LLMs with dynamic few-shot prompting closely replicate human-like plausibility judgments. Furthermore, model ensembling slightly improves performance, better simulating the agreement patterns of five human annotators compared to single-model predictions.
BAHAHA at SemEval-2026 Task 1: Benchmarking-Aware Humor Authoring with Hybrid Assessment and Adaptation
Utsav Arora | Andrew Hoblitzell
This paper describes the BAHAHA system for SemEval-2026 Task 1: MWAHAHA, which requires generating original jokes given either a news headline or a pair of rare words. Our approach uses a generate-then-rank pipeline, combining multi-style candidate generation via comedian-inspired few-shot prompting. We perform quality assessment from a smaller model fine-tuned on synthetic rating data from the generation model. Specifically, we produce up to 50 candidates per input across 15 stylistic templates and select outputs through a mixed-initiative interface that combines automated ranking with human judgment. There were 305 participants and 180 submissions in the contest. Our system ranks 2nd on Subtask A Chinese and 5th on Subtasks B1 and B2. The system generates jokes natively in each language rather than through translation.
dangphuduy at SemEval-2026 Task 10: Span-based Conspiracy Marker Extraction and Emotion-Aware Detection via Gated Fusion
Phu Duy Dang
Conspiracy theories on social media pose significant societal risks, making it essential to detect both conspiracy-related content and the textual spans that serve as conspiracy markers. In this work, we propose two effective methods to address these challenges. For marker extraction, we develop a span-based sliding window framework that improves efficiency and accuracy by focusing on localized context. In addition, inspired by the distinctive emotional patterns in conspiracy texts, we design a dynamic gating mechanism to integrate emotional and semantic representations. We evaluate our methods on the SemEval 2026 Task 10, where our team (dangphuduy) achieved competitive results, ranking 4th in Task 1 (Span Extraction) and 3rd in Task 2 (Conspiracy Detection). Experimental results demonstrate that both proposed methods significantly enhance model performance.
Pixel Phantoms at SemEval-2026 Task 3: Language-Specific Transformer Regression for Dimensional Aspect-Based Sentiment Analysis
Jithu Morrison S | Abisha Rose S
Aspect-Based Sentiment Analysis (ABSA) has traditionally relied on discrete polarity labels (positive, negative, or neutral) which fail to capture the continuous, multidimensional nature of human emotion. SemEval-2026 Task 3, Dimensional Aspect-Based Sentiment Analysis (DimABSA), addresses this limitation by requiring systems to predict continuous Valence (pleasantness) and Arousal (intensity) scores on a 1–9 scale for specific aspect terms in text, across 15 language–domain combinations in two tracks. Prior approaches to multilingual ABSA have largely depended on single generic multilingual encoders applied uniformly across languages, ignoring language-specific linguistic structures. The Pixel Phantoms system takes a language-aware strategy, selecting dedicated language-specific pre-trained transformer models for each language, including cl-tohoku/bert-base-japanese-v3 for Japanese, DeepPavlov/rubert-base-cased for Russian, bert-base-chinese for Chinese, and a Davlan Swahili mBERT variant for Swahili, and falling back to xlm-roberta-base for morphologically complex low-resource languages such as Tatar and Ukrainian. All models share a common regression architecture: a dual-pooling head combining CLS and mean-pooled representations, trained with a composite MSE + MAE loss and aspect-prompted input formatting. We participated in both Track A (10 combinations) and Track B (5 combinations), with our strongest result in Japanese Hotel (rank 13/21, RMSE 0.7297) and competitive performance in Chinese restaurant (RMSE 0.9823 vs. Baseline Kimi-K2 Thinking 1.8959). We also analyze failure modes in low-resource languages and domain-shifted settings, highlighting where multilingual transfer remains brittle. Overall, the results indicate that language-specific encoders deliver consistent gains over generic multilingual baselines in dimensional sentiment regression.
Gradient Descenders at SemEval-2026 Task 9: Data-Centric Counterfactual Augmentation for Multi-Label Hate Speech Detection
Tran Nhan | Dang Thin
In this paper, we describe the Gradient Descenders submission to SemEval-2026 Task 9 Subtask 2: Multi-Label Hate Speech Detection. Existing Transformer-based approaches often exhibit degraded performance on this task due to severe class imbalance and complex class intersectionality, leading to the learning of spurious correlations. To counteract this, we introduce a novel, data-centric counterfactual augmentation pipeline. We employ Large Language Models (LLMs) as semantic generators to synthesize diverse, targeted training samples via three distinct prompting strategies: Additive Label-Flipping (Attribute Injection), Context Decoupling, and Cross-Domain Identity Substitution. Fine-tuning a RoBERTa classifier on this augmented corpus significantly improves the model’s sensitivity to minority classes. Ultimately, our system achieves a Macro-F1 score of 44.15% on the official test set, highlighting the efficacy of targeted LLM-based augmentation in highly imbalanced, multi-label environments.
SemEval-2026 Task 12: Knowledge Graph with hyperbolic embedding in Abductive Event Reasoning
Mingkai Wang | Varun Ojha | Huizhi Liang
This task introduces Abductive Event Reasoning (AER), a novel shared task, to investigate the ability of Large Language Models (LLMs) to reason about the causality of real-world events. More specifically, a data set consisting of different topics and choices is introduced, and we need to enable the model to select the best options for the given event. Three methods are separately introduced to explore the question, including the traditional natural language processing (NLP) method (DeBERTa), the enhanced knowledge graph (KG), and the KG embedded in hyperbolic space.
The system integrates a generative Large Language Model (Llama-3 8B, fine-tuned via LoRA) with a dual-expert bidirectional cross-encoder (DeBERTa-v3-large) optimized for both semantic similarity and Natural Language Inference (NLI). By aggregating these complementary models, the system effectively captures complex contextual dependencies. In the official test set, our architecture ranked 22nd out of 79 systems, achieving a Spearman Rank Correlation of 0.71 and an accuracy within the standard deviation of 82.04%.
SMASH at SemEval-2026 Task 9: Detecting Multilingual Polarisation with Encoder Ensembles and Calibrated Decision Thresholds
Zahra Bokaei | Alessandra Terranova | Yi Zheng | Tom Bidewell | Bjorn Ross
This paper describes the SMASH submission to SemEval-2026 Task 9 on multilingual, multicultural, and multi-event polarisation detection. The task comprises (i) binary polarisation detection, (ii) multi-label classification of polarisation types, and (iii) multi-label identification of polarisation manifestations across all available languages. We propose a language-adaptive ensemble framework combining monolingual and multilingual encoder-only transformers, together with a principled out-of-fold (OOF) threshold tuning strategy. Instead of relying on fixed probability thresholds, we jointly tune ensemble weights and class-wise decision thresholds to directly optimise macro-F1 under the official evaluation metric. Our experiments show that (1) monolingual encoders dominate in several high-resource languages but benefit from complementary multilingual signals, (2) no single multilingual backbone universally outperforms others across languages and subtasks, and (3) language-specific class threshold tuning substantially improves performance due to large cross-lingual variation in class distributions. Our results demonstrate that careful logit-level ensembling and threshold tuning provide strong performance for multilingual, imbalanced, multi-label polarisation detection. Across 22 evaluation languages, SMASH ranks among the top three systems in a substantial number of language–subtask pairs. Specifically, it ranks in the top three for 5 languages in Subtask 1, 14 languages in Subtask 2, and 16 languages in Subtask 3, demonstrating strong and consistent performance across diverse languages and tasks. Our system achieves average macro-F1 scores of 0.81, 0.62, and 0.53 for Subtasks 1, 2, and 3, respectively.
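Class-wise threshold tuning of the kind described above can be sketched as a grid search over held-out scores. This is a simplified illustration rather than the SMASH procedure: it tunes thresholds only (the abstract tunes ensemble weights jointly), and the helper names, the grid, and the toy data are all assumptions. Because macro-F1 is the mean of per-class F1 scores, choosing each class threshold independently maximizes macro-F1 on the tuning data.

```python
def f1(y_true, y_pred):
    """Binary F1 for one class; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def tune_thresholds(y_true, scores, grid):
    """Pick, per class, the grid threshold with the best F1 (hypothetical helper).

    y_true and scores are lists of rows with one entry per class. The scores
    are meant to be out-of-fold predictions, so that thresholds are not tuned
    on data the model was trained on. Ties go to the lowest threshold.
    """
    n_classes = len(y_true[0])
    thresholds = []
    for c in range(n_classes):
        labels = [row[c] for row in y_true]
        col = [row[c] for row in scores]
        best = max(grid, key=lambda t: f1(labels, [int(s >= t) for s in col]))
        thresholds.append(best)
    return thresholds

# Toy out-of-fold scores for two classes with different base rates.
toy_labels = [[1, 0], [1, 0], [0, 1], [0, 0]]
toy_scores = [[0.9, 0.2], [0.4, 0.1], [0.3, 0.35], [0.1, 0.3]]
toy_grid = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
toy_thresholds = tune_thresholds(toy_labels, toy_scores, toy_grid)
```

In the toy data the two classes end up with different thresholds, neither equal to the default 0.5, which is the situation the abstract attributes to cross-lingual variation in class distributions.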
Lattice at SemEval-2026 Task 1: Why did the prompt engineer break up with their LLM? Because zero-shot was zero-fun.
Mathieu Dehouck | Olga Seminck | Marine Delaborde | Yoann Dupont | Noé Durandard
This paper describes the contribution of the Lattice Team to the humor generation MWAHAHA SemEval shared task on the English data set for subtask A. During the development phase, we experimented with two different approaches, but after a quick comparison of the outputs, it turned out that one was clearly more successful than the other. The winning strategy can be seen as consisting of two phases: first, we used a few-shot framework to let Deepseek-R1 32B generate multiple jokes based on the input (headlines and word pairs). Second, we set up a voting protocol for Llama-3.1 8B to rank the generated jokes and find the funniest one. The other strategy also consisted of two phases: first, we generate many more jokes in a zero-shot way with lighter, faster models, and then we turn back to ranking the generated jokes, but since we have about ten times more jokes in this second setting, we follow a knockout tournament procedure in order to find the best jokes. Our Deepseek-R1 based model is one of the nine systems that shared a first place on the English data set that received a total of 32 valid submissions.
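The knockout tournament mentioned above can be sketched as repeated pairwise elimination. The sketch is illustrative only: `knockout` and `prefer` are hypothetical names, and `prefer` stands in for whatever pairwise judge is used (in the abstract, a language model); here it is a toy rule so that the code runs on its own.

```python
def knockout(candidates, prefer):
    """Return the winner of a single-elimination tournament.

    prefer(a, b) must return whichever of the two candidates wins the
    pairing. With an odd number of candidates the last one gets a bye.
    A tournament over n candidates needs n - 1 pairwise comparisons.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    current = list(candidates)
    while len(current) > 1:
        survivors = []
        for i in range(0, len(current) - 1, 2):
            survivors.append(prefer(current[i], current[i + 1]))
        if len(current) % 2 == 1:
            survivors.append(current[-1])  # bye for the unpaired candidate
        current = survivors
    return current[0]

# Toy judge (assumption): the longer string wins. A real system would
# query a model here instead.
def longer(a, b):
    return a if len(a) >= len(b) else b

winner = knockout(["a", "bbb", "cc", "dddd", "e"], longer)
```

The appeal of this design when there are many candidates is cost: a full pairwise ranking needs on the order of n² comparisons, while the tournament needs n - 1, at the price of only identifying a winner rather than a full ordering.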
Comhis at SemEval-2026 Task 4: Embedding-Space Adaptation and LLM-Assisted Inference for Narrative Similarity
Ke Shu | Eetu Mäkelä | Mikko Tolonen
We present a two-stage system for the SemEval Narrative Similarity task that separates representation learning from comparative decision making. In Track B, we adapt a frozen large-scale embedding model using a lightweight projection layer trained with a triplet objective and hard example mining, producing a task-specific similarity space. In Track A, similarity scores derived from the adapted embedding space are incorporated into a large language model, which performs the final binary decision. On the official test set, our system achieves 0.68 accuracy on Track A and 0.66 on Track B, clearly outperforming the provided baselines and ranking 20th out of 44 teams on Track A and 10th out of 27 teams on Track B. These results demonstrate that efficient embedding adaptation combined with embedding-informed LLM reasoning is effective for modeling high-level narrative similarity.
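The triplet objective with hard example mining mentioned above can be written down compactly. The following is a minimal sketch under stated assumptions: cosine distance, a margin of 0.2, and the helper names are all illustrative choices, since the abstract does not specify the distance function, the margin, or the mining scheme.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Zero once the positive is closer to the anchor than the negative by `margin`."""
    return max(
        0.0,
        cosine_distance(anchor, positive)
        - cosine_distance(anchor, negative)
        + margin,
    )

def hardest_negative(anchor, negatives):
    """Hard example mining: the negative closest to the anchor (hypothetical helper)."""
    return min(negatives, key=lambda n: cosine_distance(anchor, n))

anchor = [1.0, 0.0]
satisfied = triplet_loss(anchor, [1.0, 0.0], [0.0, 1.0])
violated = triplet_loss(anchor, [0.0, 1.0], [1.0, 0.0])
hard = hardest_negative(anchor, [[0.0, 1.0], [1.0, 1.0]])
```

In a setup like the one described, the vectors would be outputs of the trainable projection layer applied to frozen embeddings, and only the projection would receive gradients; the sketch shows the loss computation alone.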
FLANS at SemEval-2026 Task 7: RAG with Open-Sourced Smaller LLMs for Everyday Knowledge Across Diverse Languages and Cultures
Liliia Bogdanova | Shiran Sun | Lifeng Han | Natalia Amat-Lefort | Flor Miriam Plaza-del-Arco
This system paper describes our participation in SemEval-2026 Task 7, “Everyday Knowledge Across Diverse Languages and Cultures”. We participated in two subtasks: Track 1, Short Answer Questions (SAQ), and Track 2, Multiple-Choice Questions (MCQ). The methods we used are retrieval augmented generation (RAGs) with open-sourced smaller LLMs (OS-sLLMs). To better adapt to this shared task, we created our own culturally aware knowledge base (CulKBs) by extracting Wikipedia content using keyword lists we prepared. We extracted both culturally-aware wiki-text and country-specific wiki-summary. In addition to the local CulKBs, we also have one system integrating live online search output via DuckDuckGo. Towards better privacy and sustainability, we aimed to deploy smaller LLMs (sLLMs) that are open-sourced on the Ollama platform. We share the prompts we developed using refinement techniques and report the learning curve of such prompts. The tested languages are English, Spanish, and Chinese for both tracks. Our resources and codes are shared via https://github.com/aaronlifenghan/FLANS-2026
PolDeck at SemEval-2026 Task 9: Multilingual Online Polarization Detection via Hybrid Model Ensembling and Data Augmentation
Ben Grandy | Daniel Khir
In this paper, we address SemEval 2026 Task 9: Multilingual Online Polarization Detection. We present our hybrid ensemble framework, integrating few-shot prompting with Qwen3-30B, a native multilingual XLM-R encoder, and a translation-augmented DeBERTa encoder. To mitigate label imbalance, we implement a multi-stage augmentation pipeline leveraging LLMs for synthetic paraphrasing and cross-lingual translation. Our system ranked in the Top 10 on the English and German leaderboards, proving that integrating both high-capacity monolingual models and flexible multilingual models in a holistic system is a viable method for detecting online polarization. Our code is available on GitHub.
CUCLASIC at SemEval-2026 Task 5: LLM Prompting Strategies for Rating Ambiguous Word Senses
Federico Ortega Riba | Jasper Wilkerson | Kelsey Lafreniere Adams
Word sense disambiguation has been a foundational task in computational semantics since the 1990s, but remains an unsolved problem when it comes to bridging human and computational evaluation of ambiguity. The SemEval-2026 Task 5 attempts to address this gap. We test six Large Language Models (LLMs) from the Llama and Gemini families in order to evaluate LLMs’ ratings of ambiguous textual excerpts, experimenting with zero- and few-shot variants of prompts and analyzing how simple linguistic cues improve performance. We propose a methodology of eliciting human-like ratings from language models by using examples with low and high standard deviations between human ratings. We further evaluate and compare the prediction patterns of different models and how they align with the human generated ratings. Our best model (Gemini 3-Flash) achieves a 75% score combining Spearman correlation and accuracy within one standard deviation.
BBgame at SemEval-2026 Task 12: Small Language Model Fine-tuning for Abductive Event Reasoning Task
Shu Li | Huizhi(elly) Liang
We introduce a three-stage training framework for abductive event reasoning (AER). The task dataset was decomposed into three subsets: causal judgment, cause generation, and multiple-choice question answering (MCQA). Abductive reasoning requires understanding complex causal relationships between events. However, small language models typically struggle due to the multi-step inference required. Our approach combines supervised fine-tuning with group relative policy optimization (GRPO) to strengthen the reasoning capabilities of a 0.5B-parameter model. On the SemEval-2026 Task 12 development set, our Casual-Qwen 0.5B model achieves $64.75\%$, outperforming Qwen2.5:0.5b ($63.78\%$), an absolute gain reported as $0.0975\%$. Our ablation study reveals that binary causal judgment, rather than cause generation or direct MCQA training, is the key skill for the AER task, with more complex stages significantly underperforming due to task misalignment or task complexity.
VerbaNexAI at SemEval-2026 Task 7: Integrating Web Snippets and RAG for the Evaluation of Multilingual Cultural Knowledge in LLMs
Danileth Almanza | Jairo Serrano | Edwin Puertas | Juan Carlos Martinez Santos
In multilingual and multicultural contexts, LLMs require contextualization mechanisms to generate culturally coherent responses. This study presents a LLaMA-based approach to answering short cultural questions in different languages within Task 7 of SemEval-2026 (Track 1: SAQ), without access to official training data. The system integrates controlled synthetic data generation, evidence retrieval through web snippets, and a Retrieval-Augmented Generation (RAG) framework with few-shot learning. BLEnD is used solely as a thematic guide, ensuring semantic independence. During development, the LLaMA-3.1-8B model achieved 38.51% global accuracy, while LLaMA-3.2-1B obtained 15.54%. In large-scale evaluation (30,500 instances), the 1B model achieved 16.69%, maintaining stability after prompt optimization. The results demonstrate that contextual retrieval improves multilingual cultural knowledge evaluation and highlight the importance of pipeline design and model capacity.
KDW at SemEval-2026 Task 12: Logic-Driven Distillation with Knowledge Graphs for Efficient Abductive Reasoning
Sihan Zhu | Hongjie Wu | Xinyan Xu
Large language models (LLMs) such as GPT-4 and Gemini show strong reasoning ability but incur substantial computational cost in abductive reasoning settings. We present our system for "SemEval-2026 Task 12 — Abductive Event Reasoning: Towards Real-World Event Causal Inference for Large Language Models", which integrates knowledge graph (KG) evidence extraction with knowledge distillation to transfer structured reasoning from a large teacher model to a compact student model. Our approach ranks 8th in the shared task while achieving performance comparable to frontier LLMs at a fraction of the inference cost.
kirito at SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis via Sentence Structure Parsing Preprocessing and Prompt-Enhanced Instruction Tuning
Shuangjin Hu
Dimensional Aspect-Based Sentiment Analysis (DimABSA) integrates fine-grained aspect extraction with continuous Valence–Arousal (VA) regression, posing unique challenges for fine-grained opinion mining. This paper presents our system for SemEval-2026 Task 3, with task-aligned strategies for three heterogeneous subtasks. For the DimASR task, we frame dimensional sentiment prediction as a supervised regression problem, paired with Low-Rank Adaptation (LoRA)-based parameter-efficient fine-tuning and a deep nonlinear regression head. For DimASTE and DimASQP tasks, we propose a lightweight sentence structure parsing preprocessing module, combined with prompt-enhanced instruction tuning for unified structured generation of aspect elements and VA scores. Experimental results on the official English test sets show that our system outperforms both official baselines across most settings, with syntax-guided prompting effectively improving aspect-opinion alignment and the dedicated regression head enhancing continuous sentiment modeling stability.
YNU-NLP at SemEval-2026 Task 11: A Neuro-Symbolic Approach with Reflexion Mechanism Disentangling Content and Formal Reasoning in Language Models
Yu Wang | You Zhang | Hao Zhang | Dan Xu
This paper describes our systems for SemEval-2026 Task 11, Disentangling Content and Formal Reasoning in Language Models. We participated in all four subtasks, addressing the Content Effect, a phenomenon where models rely on real-world plausibility rather than logical validity. Existing methods, such as standard Chain-of-Thought (CoT) prompting or single-task Supervised Fine-Tuning (SFT), often struggle to completely decouple content from reasoning due to the inherent probabilistic biases in pre-trained models. To address these limitations, a hybrid neuro-symbolic framework based on the Qwen2.5-14B architecture is proposed, integrating multi-task instruction tuning with a robust neuro-symbolic pipeline. The principal innovation lies in the deployment of a Reflexion mechanism coupled with formal verification: natural language arguments are parsed into First-Order Logic (FOL) and subsequently verified by the Z3 Theorem Prover. Parsing anomalies are automatically rectified through an iterative self-correction module. The proposed system ranked 1st in Subtasks 1 & 2, 2nd in Subtask 4, and 9th in Subtask 3, validating its ability to decouple logical validity from content plausibility.
Duluth at SemEval-2026 Task 4: A Hybrid Approach to Narrative Similarity using Bi-Encoder Embeddings with Cross-Encoder Tie breaking using Learned Weights
Maxwell Bevers | Aidan Carlson | Ted Pedersen
We present a hybrid system for SemEval-2026 Task 4 on Narrative Similarity. Our approach decomposes the stories into four narrative components: theme, plot, emotion, and outcome. Each component is then encoded using a bi-encoder (all-mpnet-base-v2), and cosine similarities are combined through a learned pairwise ranking model. When similarity scores between candidate stories fall within a small margin of error, a cross-encoder (ms-marco-MiniLM-L-6-v2) is used as a tie-breaker. Our final system achieves 58.5% accuracy on the official test set, placing us at 39th overall. Error analysis reveals that the system struggles with complex themes, multiple protagonists, and contrasting outcomes.
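The scoring flow described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the per-component embeddings are assumed to be precomputed, the weights are taken as given rather than learned, and `pick_more_similar`, `margin`, and the `tie_breaker` callback (standing in for the cross-encoder) are hypothetical names.

```python
import math

COMPONENTS = ("theme", "plot", "emotion", "outcome")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combined_score(anchor, candidate, weights):
    """Weighted sum of per-component cosine similarities."""
    return sum(weights[c] * cosine(anchor[c], candidate[c]) for c in COMPONENTS)


def pick_more_similar(anchor, cand_a, cand_b, weights, margin=0.02, tie_breaker=None):
    """Return "a" or "b"; defer to tie_breaker when the two scores are within margin.

    tie_breaker is a hypothetical callable standing in for a cross-encoder.
    """
    score_a = combined_score(anchor, cand_a, weights)
    score_b = combined_score(anchor, cand_b, weights)
    if abs(score_a - score_b) < margin and tie_breaker is not None:
        return tie_breaker(anchor, cand_a, cand_b)
    return "a" if score_a >= score_b else "b"
```

Keeping the tie-breaker as a callback means the more expensive model is only consulted for the near-tie cases.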
This paper describes our system designed for SemEval-2026 Task 10: PsyCoMark, Subtask 2: Conspiracy Detection. We proposed a two-stage approach that leverages large-scale pre-trained models and a fine-tuned smaller model to detect conspiracy theories in text. In the first stage, we utilize a large model to test all the test samples and filter out those that are clearly unrelated to conspiracy theories. For the remaining samples, we apply a retrieval-enhanced custom prompt strategy combined with the Roberta-Large model in the second stage. This allows us to fine-tune the model with weighted predictions based on relevant retrieved information, enhancing detection accuracy. Our system achieved first place on the leaderboard, with an impressive F1 Score of 0.8874. We also present a brief analysis of the effectiveness of the methods used, including the advantages and limitations of large model-based filtering and retrieval-augmented fine-tuning.
Stochastic Gradient Descenders at SemEval-2026 Task 9: Few-Shot LLM Prompting for Polarization Type Classification
Huynh Phu | Dang Thin
This paper presents our system for SemEval-2026 Task 9 (POLAR), Subtask 2, which focuses on classifying polarization types in social media text. We investigate three paradigms: (i) fine-tuning mDeBERTa-v3 with domain-adaptive pre-training, (ii) parameter-efficient adaptation of Qwen2.5-32B using LoRA, and (iii) few-shot prompting with Llama-3.3-70B-Instruct. Experimental results show that few-shot prompting, despite requiring no task-specific training, outperforms both fine-tuning and parameter-efficient approaches. Notably, it achieves non-zero F1 scores across all polarization categories, which is critical under macro-averaged evaluation. Our system ranks 2nd out of 29 English submissions on the official leaderboard, achieving an F1 Macro of 0.5157. These findings highlight the effectiveness of large instruction-tuned models in low-resource, label-imbalanced classification settings.
YNU-HPCC at SemEval-2026 Task 8: Parallel Generation and Multi-Metric Reranking for Faithful Extractive RAG
Bo Li | You Zhang | Jin Wang | Dan Xu | Xuejie Zhang
This paper presents our approach for the SemEval-2026 Task 8: MTRAGEval (Subtask B: Answer Generation), which challenges systems to generate faithful, extractive answers to multi-turn questions based strictly on provided gold-standard reference passages. The primary scientific challenge lies in maintaining high faithfulness and structural consistency while adapting to diverse answer styles across a conversation, as systems must generate responses that vary significantly in length and format without hallucinating. Conventional reference-based generation methods often rely on static prompting or greedy decoding, which fail to capture these dynamic stylistic requirements and lack robustness against generation noise. To address these limitations, we propose an Intent-Aware Parallel Generation and Reranking System powered by a large language model. Experimental results on the official test set demonstrate the effectiveness of our method, achieving competitive performance comparable to SoTA baselines. Ultimately, our approach secured third place in the competition. The code of the paper is available at: https://github.com/viaviachris/SemEval-2026-Task8
ICT-NLP at SemEval-2026 Task 3: Less Is More — Multilingual Encoder with Joint Training and Adaptive Ensemble for Dimensional Aspect Sentiment Regression
Liyuan Huang | Jiawei He | Wutao Shen | Lin Li | Jin Zhang
This paper describes our system for SemEval-2026 Task 3 Track A Subtask 1 on Dimensional Aspect Sentiment Regression (DimASR). We propose a lightweight and resource-efficient system built entirely on multilingual pre-trained encoders, without relying on LLMs or external corpora. We adopt joint multilingual and multi-domain training to facilitate cross-lingual transfer and alleviate data sparsity, introduce a bounded regression transformation that improves training stability while constraining predictions within the valid range, and employ an adaptive ensemble strategy via subset search to reduce prediction variance. Experimental results demonstrate that our system achieves strong and consistent performance, ranking 1st on zho-res, 2nd on zho-lap, and 3rd on jpn-hot, with all remaining datasets placed within the top half of participating teams.
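One common way to constrain a regression output to a valid range is a scaled sigmoid, sketched below. The abstract does not state which transformation the authors use, so the sigmoid form, the function names, and the default bounds of 1.0 and 9.0 are all assumptions made for illustration.

```python
import math


def stable_sigmoid(x):
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bounded_prediction(raw_output, low=1.0, high=9.0):
    """Map an unbounded model output into [low, high].

    The bounds are assumed defaults, not values taken from the paper.
    """
    return low + (high - low) * stable_sigmoid(raw_output)
```

With such a mapping, no post-hoc clipping is needed, and gradients shrink smoothly near the bounds rather than vanishing abruptly.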
SemEval-2026 Task 13: Fine-tuned CodeBERT with Stratified Balancing, Dynamic Threshold Optimization, and Logit Bias Correction for Robust Multi-Language AI Code Detection
Udaythalavesh S | Rajalakshmi Sivanaiah | Angel Deborah S
We present a CodeBERT-based system for detecting AI-generated code in SemEval-2026 Task 13 Subtask A. To address class imbalance and model overconfidence, we apply stratified balanced subsampling, dynamic per-epoch F1-macro threshold optimization, and label-flip bias correction. The model is trained using TPU-accelerated fine-tuning and achieves a validation F1-macro of 0.874 and a private leaderboard F1-macro of 0.53. Ablation studies confirm the effectiveness of our balancing and calibration strategies under distribution shift.
SG-UniBuc-NLP at SemEval-2026 Task 6: Multi-Head RoBERTa with Chunking for Long-Context Evasion Detection
Gabriel Stefan | Sergiu Nisioi
We describe our system for SemEval-2026 Task 6 (CLARITY: Unmasking Political Question Evasions), which classifies English political interview responses by coarse-grained clarity (3-way) and fine-grained evasion strategy (9-way). Since responses frequently exceed the 512-token limit of standard Transformer encoders, we apply an overlapping sliding-window chunking strategy with element-wise Max-Pooling aggregation over chunk representations. A shared RoBERTa-large encoder supplies two task-specific heads trained jointly via a multi-task objective, with inference-time ensembling over 7-fold stratified cross-validation. Our system achieves a Macro-F1 of 0.80 on Subtask 1 and 0.51 on Subtask 2, ranking 11th in both subtasks.
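Overlapping sliding-window chunking with element-wise max-pooling can be sketched as below. This is an illustrative sketch only: the window and stride defaults are assumptions, the chunk vectors are assumed to come from an encoder that is not shown, and the function names are hypothetical.

```python
def sliding_windows(token_ids, window=512, stride=256):
    """Split a token sequence into overlapping chunks (overlap needs stride < window)."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(token_ids) <= window:
        return [list(token_ids)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # this chunk already reaches the end of the sequence
        start += stride
    return chunks


def max_pool(chunk_vectors):
    """Element-wise maximum over equal-length chunk representations."""
    return [max(dims) for dims in zip(*chunk_vectors)]
```

Max-pooling keeps the strongest activation per dimension, so a signal that appears in only one chunk is not averaged away.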
LocuPrompt at SemEval-2026 Task 7: A Multilingual Prompting Framework for Cross-Cultural Everyday Knowledge in LLMs
Ningjingke Ning
Understanding everyday cultural knowledge remains a fundamental challenge for large language models (LLMs). This paper presents LocuPrompt, a multilingual framework for SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures. To address Short Answer Questions (SAQ), we employ an English-pivot generation strategy with back-translation, combined with empirical locale-specific routing that dynamically assigns the optimal LLM to each target region. For Multiple-Choice Questions (MCQ), we apply parameter-efficient fine-tuning to a robust multilingual base model, utilizing locale-aware instructions that frame the LLM as a "local resident." By integrating strategic model selection with resource-efficient adaptation, LocuPrompt effectively bridges cross-lingual cultural gaps while maintaining a fully reproducible pipeline.
SlugRAG at SemEval-2026 Task 8: Domain-Specific Fine-Tuning and Model Scaling for Multi-Turn RAG Retrieval
Pratibha Revankar | Jihye Kim | Umit Azirakhmet
Multi-Turn Retrieval-Augmented Generation (MT-RAG) requires resolving context-dependent ambiguities across conversational turns. We present a systematic evaluation of dense retrieval optimization for the MTRAGEval benchmark (Task 8, Subtask A: Retrieval Only), investigating training-time strategies and inference-time query reformulation across four diverse English-language domains: CLAPNQ (legal/patent), FIQA (financial), GOVT (government documents), and CLOUD (cloud computing). Our experiments demonstrate that domain-specific fine-tuning yields the most substantial gains, with our best CLAPNQ model achieving Recall@10 of 0.6016 and nDCG@10 of 0.4981, representing 58.3% and 66.0% improvements over the pre-trained BGE baseline. Domain-specific models average 44.3% improvement in Recall@10 and 47.8% in nDCG@10 across all domains. Additionally, fine-tuning larger embedding models (BGE-large) achieves the best overall performance (nDCG@10: 0.5101, Recall@10: 0.6221), highlighting the complementary impact of model capacity and domain adaptation.
PEU Lab at SemEval-2026 Task 4: Pairwise Text Comparison using RoBERTa and Ranking Loss
Hangchao Ma | Jiaxu Dao | Jinli Tong | Zhuoying Li | Qingsong Zhou | Xiuzhong Tang
This paper describes the system developed by the PEU Lab for SemEval-2026 Task 4, specifically focusing on Track A: Comparative Narrative Similarity. To address the pairwise nature of the task, a lightweight contrastive ranking approach is proposed. Specifically, the pretrained RoBERTa-Large model is utilized to encode the anchor and candidate stories. Rather than employing standard cross-entropy, a margin ranking loss is introduced, which allows the relative narrative proximity between different candidate stories to be explicitly modeled. Furthermore, a 5-fold cross-validation ensemble strategy is integrated to stabilize predictions on unseen data. Evaluated on the official dataset, the optimal configuration achieved an overall accuracy of 64.50%, demonstrating the effectiveness of relative order modeling. The code for this system is available at: https://github.com/mhchhh/SemEval2026-Task-4.
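A margin ranking loss of the kind mentioned in this abstract penalizes a pair only when the story that should be closer to the anchor does not outscore the other by at least a margin. The sketch below uses the standard hinge form; the function names and the margin default are assumptions, and the scores are assumed to come from an encoder that is not shown.

```python
def margin_ranking_loss(score_closer, score_farther, margin=1.0):
    """Hinge loss: zero once score_closer exceeds score_farther by at least margin."""
    return max(0.0, margin - (score_closer - score_farther))


def batch_loss(score_pairs, margin=1.0):
    """Mean loss over (score_closer, score_farther) pairs."""
    return sum(margin_ranking_loss(c, f, margin) for c, f in score_pairs) / len(score_pairs)
```

Unlike cross-entropy over independent labels, this objective depends only on the difference between the two scores, which matches the relative nature of the comparison.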
YoungDSMLKZ at SemEval-2026 Task 13: MIL-UniXcoder with Meta-Stacking and Handcrafted Features for AI-Generated Code Detection
Yeraly Gainulla | Agzam Shamsadinov
We propose and validate a multi-view ensemble framework for 4-class AI-generated code detection (Human, AI, Hybrid, Adversarial) in realistic long-form repositories. Our system, Team YoungDSMLKZ, ranked 1st out of 50+ teams in SemEval-2026 Task 13 Subtask C with a macro F1 of 0.7855 (+5.2 over runner-up). The framework combines: (i) a Dynamic Multiple Instance Learning (MIL) pipeline over UniXcoder chunks for O(N)-scalable long-context detection, (ii) transformer-based meta-stacking (UniXcoder and ModernBERT), and (iii) an XGBoost classifier on 200+ handcrafted stylometric features. Evidence localization analysis shows that 62.4% of decisive AI-detection signals reside beyond the standard 512-token window, validating the MIL design.
VARH-AI at SemEval-2026 Task 10: Exploiting Architectural Diversity with Transformer-SSM Ensembles and Confidence-Based Iterative Refinement for Conspiracy Detection
Hritav Solanki | Shubham Sharma | Manish Prasad | Rakhi Agrawal | Yashvardhan Sharma
This paper describes our system for SemEval 2026 Task 10 (PsyCoMark), focusing on Subtask 2: binary conspiracy classification in Reddit submission statements. We present a heterogeneous ensemble approach that combines Transformer-based models (DeBERTa, RoBERTa) with State-Space Models (Mamba) to leverage architectural diversity for improved generalization. Our key contributions include: (1) Bidirectional Mamba (BiMamba), adapting state-space sequence models for bidirectional document classification; (2) a safety-switched multi-task training setup that uses marker supervision only for gold-annotated samples, preventing noisy pseudo-labeled rows from affecting the span extraction objective; and (3) Confidence-Based Iterative Refinement, using committee voting for high-quality pseudo-label generation. Our best official submission achieved a weighted F1 score of 0.78 on the Subtask 2 test set, ranking 4th on the public CodaBench leaderboard. We provide detailed ablation studies demonstrating the complementary contributions of each architectural component to inform future research directions.
HABIBTAZ at SemEval-2026 Task 11: Disentangling Formal Logic from Content via Synthetic Training and Multi-Objective Optimization
Abdullah Shaikh | Zain Naqi | Taha Zahid | Sandesh Kumar | Abdul Samad
While Large Language Models (LLMs) excel in many general NLP tasks, their formal reasoning capabilities are often compromised by content effects, demonstrating a measurable bias towards real-world plausibility. In this paper, we present our system for SemEval-2026 Task 11, which evaluates the ability of models to disentangle formal logic from content across 12 languages with and without distractor premises. We address this challenge using mDeBERTa-v3 networks fine-tuned on a synthetic, rule-based dataset of syllogistic schemes to avoid the semantic noise of LLM-augmented data. To explicitly decouple plausibility from logical structure, our training pipeline employs a multi-objective loss function combining Adaptive Group Distributionally Robust Optimization (DRO), a scheduled differentiable bias penalty, and KL-Divergence consistency regularization. Our system achieved #1 ranks and perfect Ranking Scores (100.0) with 0.00% bias and 100.0% accuracy on Subtask 1 (English), Subtask 2 (Noisy English), and Subtask 3 (Multilingual). On the highly complex Subtask 4 (Noisy Multilingual), the system achieved the 6th rank with 89.06% Accuracy and F1-score, alongside a limited 2.89% Bias and a 37.78 Ranking Score. Our dataset generation engine and codebase are publicly available to facilitate future work on robust logical reasoning.
TransformerTrio at SemEval-2026 Task 13: Navigating Domain Shift and Representation Instability in Machine-Generated Code Detection
Avi Patel | Manthan Laddha | Pushti Sapovadiya | Pruthwik Mishra | Shrikant Malviya
Detecting machine-generated code is increasingly challenging due to advances in code generation models and domain variation across programming tasks. We present our submissions to SemEval-2026 Task 13, evaluating detection in three settings: binary human vs. machine classification, multi-class generator attribution, and four-way authorship classification including hybrid and adversarial cases. We compare feature-based, transformer-based, and hybrid approaches under domain shift and limited supervision. Results show that domain-specific signals often dominate model decisions, degrading generalization when training and test distributions diverge. Increasing model complexity does not consistently improve performance in low-resource or cross-domain settings and may amplify spurious correlations. These findings emphasize robustness and feature alignment over model sophistication for reliable detection.
SSN-CSE-CODECATALYSTS at SemEval-2026 Task 13: Integrating Transformer Semantics and AST-Derived Structural Features for AI-Generated Code Detection.
Bhuvana J | Ramanan Mahendran | Siddharth Chandrasekar S | Pragatheesh J | Rethanya P
Pre-trained transformers often struggle with multi-lingual code classification due to sequence length constraints and difficulties in explicitly capturing deep structural complexities. To address this for SemEval Task 13, a hybrid neural architecture is proposed that fuses CodeBERT’s semantic embeddings with handcrafted software engineering metrics. A Head+Tail truncation strategy preserves crucial logic in long sequences, while explicit Abstract Syntax Tree (AST) features, including maximum depth, branching factor, and cyclomatic complexity, are extracted via tree-sitter. By integrating dense language model representations with explicit structural heuristics, this work provides a robust and scalable solution for enhanced code classification.
king001 at SemEval-2026 Task 7: Cross-Language Cultural Everyday Knowledge Q&A System Based on RAG
Meizhi Jin | Zhichao Meng | Junqi Yin | Lianxin Jiang | Jianyu Li
This paper describes our system used in the SemEval-2026 Task 7: Cross-Language Cultural Everyday Knowledge QA (track 1). Cultural knowledge typically exhibits significant regional specificity and is deeply rooted in particular linguistic conventions, posing severe challenges to general-purpose large language models (LLMs). We propose a retrieval-augmented generation (RAG) framework: this framework utilizes text-embedding-v4 as the retrieval core to precisely extract social knowledge and expression patterns from region-specific large-scale multilingual cultural knowledge bases, and drives the gpt-5.2-chat model to generate concise answers that are both logically factual and highly aligned with the target region’s cultural context. In the official evaluation, our system ranked first among all participating teams with a total score of 78.7672, fully demonstrating the method’s outstanding performance in cross-cultural accuracy and linguistic authenticity.
SteerForce at SemEval-2026 Task 11: Reducing Content Effects Using Layered Activation Steering
Noah Tratzsch | Asmaa Al-Raian | Mounika Marreddy | Alexander Mehler
Large language models exhibit content effects, where surface plausibility interferes with formal logical reasoning. In SemEval-2026 Task 11, this appears as a performance gap between plausibility-aligned and plausibility-conflicting syllogisms, reflecting directional content bias. We address this issue using inference-time activation steering, modeling bias as a geometric deviation between plausibility-driven and validity-driven representations. We introduce a layered steering framework that combines Activation Transport (ACT) with input-adaptive contrastive steering (K-CAST), applied to layers identified through sensitivity analysis. This architecture-aware strategy enables targeted interventions without retraining. On BERT, sequential multi-layer steering improves validity accuracy from 77.1% to 82.3% while reducing bias by 75%. In contrast, for the decoder-only Qwen2.5-1.5B-Instruct, a single mid-to-late layer intervention reduces bias from 0.26 to 0.04 with modest accuracy gains, whereas multi-layer steering offers no additional benefit. These results reveal a fundamental architectural distinction: encoder-based models benefit from distributed low-intensity corrections, while decoder-only instruction-tuned models concentrate reasoning signals within a narrow late-layer band. Our findings demonstrate that effective bias mitigation requires architecture-aware activation steering.
Sabancigroup4 at SemEval-2026 Task 5: Uncertainty-Aware Semantic Plausibility Scoring via GNLL Regression and LLM Rationales
Salih Büyükbaş | Doruk Benli | Osman Kara | Dilara Keküllüoğlu
SemEval-2026 Task 5 is a shared task on rating the plausibility of an ambiguous homonym in a predetermined context. Each sample in the dataset combines a precontext, a sentence, and an ending for a given homonym, and its plausibility is manually rated by 5 annotators. Participating teams had to automatically predict the plausibility with respect to the mean rating given by the annotators. Unlike traditional models that rely on single-label selection, this task frames disambiguation as a probabilistic distribution over multiple plausible meanings. To this end, we propose an uncertainty-aware training strategy using GNLL regression, and semantic context enrichment through POS tags and LLM rationales. Our system exhibits competitive performance, achieving 90% accuracy within standard deviation and 81% Spearman correlation, and placing ninth on the leaderboard.
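Assuming GNLL here denotes the Gaussian negative log-likelihood, the per-sample loss for a predicted mean and variance can be sketched as below. The additive constant is omitted, and the function name and the `eps` variance floor are assumptions for illustration rather than details from the paper.

```python
import math


def gaussian_nll(pred_mean, pred_variance, target, eps=1e-6):
    """Gaussian negative log-likelihood without the constant 0.5 * log(2 * pi) term."""
    variance = max(pred_variance, eps)  # floor keeps the log and the division well defined
    return 0.5 * (math.log(variance) + (target - pred_mean) ** 2 / variance)
```

Predicting a larger variance lowers the penalty for a large error but raises the log-variance term, which is what lets a model express uncertainty on samples where annotators disagree.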
IITKanBDone at SemEval-2026 Task 8: MTRAGEval - Evaluating Multi-Turn RAG Conversations
Soumendra Ray | Garima Gupta
This paper describes our system for the MT-RAG (Multi-Turn Retrieval-Augmented Generation) shared task, which addresses the challenge of multi-turn conversational question answering using retrieval-augmented generation. We participated in three sub-tasks of Task 8: Task A (retrieval), Task B (generation with reference passages), and Task C (end-to-end RAG). For Task A, we evaluated multiple retrieval approaches including BM25, BGE, and hybrid methods, achieving best performance with ELSER (Elastic Learned Sparse EncodeR) with nDCG@5 of 0.4018 (Rank 24/38). For Task B, we employed the Mistral-7B-Instruct-v0.2 model via HuggingFace for response generation using gold reference passages, achieving a harmonic mean score of 0.6976 (Rank 13/26). For Task C, we combined ELSER retrieval with Mistral-7B generation, using top-5 retrieved passages as context, achieving a score of 0.4289 (Rank 23/29). Our system demonstrates the effectiveness of learned sparse retrieval methods and instruction-tuned models for multi-turn conversational RAG scenarios.
asetclarity at SemEval-2026 Task 6: An Imbalance-Aware RoBERTa Cross-Encoder for Political Response Clarity Classification
Maria-Antonia-Emanuela Pascu | Dan Dodun-des-Perrieres | Daniela Gifu
We address response-clarity classification in political interviews as defined in SemEval-2026 Task 6: CLARITY - Unmasking Political Question Evasions, Task 1, where systems must label each question–answer pair as Clear Reply, Ambivalent, or Clear Non-Reply. We present a reproducible end-to-end pipeline built around a single-stream RoBERTa-large cross-encoder fine-tuned for three-way classification using deterministic text normalization, concatenated QA inputs, and imbalance-aware training (minority oversampling and class-weighted loss). To improve robustness, we train a 5-fold stratified ensemble and combine models via soft-voting. Our official shared-task submission obtains 0.76 macro-F1 on the official leaderboard, ranking 16 out of 41 participating systems. Finally, we deploy the classifier in a lightweight web application supporting both direct text input and audio-based analysis through automatic transcription, enabling interactive inspection of predicted clarity categories.
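Soft voting over a fold ensemble averages the per-class probabilities of the fold models and picks the class with the highest mean. The sketch below assumes each fold model has already produced a probability vector in the label order shown; `soft_vote` is a hypothetical helper name.

```python
LABELS = ("Clear Reply", "Ambivalent", "Clear Non-Reply")


def soft_vote(fold_probabilities):
    """Average class probabilities across folds and return (label, mean_probs)."""
    n_folds = len(fold_probabilities)
    mean_probs = [
        sum(probs[i] for probs in fold_probabilities) / n_folds
        for i in range(len(LABELS))
    ]
    best = max(range(len(LABELS)), key=mean_probs.__getitem__)
    return LABELS[best], mean_probs
```

Soft voting can differ from a majority vote: one very confident fold can outweigh two folds that lean only slightly toward another class.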
FactUEP at SemEval-2026 Task 4: Structured Narrative Similarity Scoring with Aspect Decomposition and Weak-Signal Gating
Marcin Sawinski
This paper presents an approach to narrative similarity prediction for SemEval-2026 Task 4 Track A. We introduce an LLM-based system that operationalizes the three core dimensions (Abstract Theme, Course of Action, and Outcomes) via schema-constrained prompting to enforce structured outputs and alignment with the annotation protocol. The system proceeds in three stages: structured aspect decomposition and scoring, weak-signal gating for low-confidence cases, and a targeted LLM-based tiebreak. The final model achieved near-human performance and ranked second on the Track A leaderboard.
Narrative Team at SemEval-2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Sentences through Narrative Understanding
Valentin Istrate | Mocanu Octavian | Tatiana Khaidukova
This paper describes our system for SemEval-2026 Task 5, which focuses on predicting the plausibility of word senses in ambiguous narrative contexts. The task requires assigning a real-valued plausibility score to candidate word senses based on aggregated human judgments. Our approach compares two modeling paradigms: (i) a pretrained transformer-based regression model using DistilBERT fine-tuned on the task data, and (ii) a lightweight neural baseline based on a bidirectional LSTM trained either from scratch or initialized with GloVe embeddings. Input representations combine a candidate sense definition with the narrative context and target sentence, separated by a special token. On the official test set, the DistilBERT model achieves the strongest result among our submissions, with an Acc@SD score of 0.54 and Spearman correlation of 0.17, while the best BiLSTM submission reaches 0.52 Acc@SD and 0.02 Spearman correlation. Although DistilBERT performs best in our experiments, the recurrent baseline remains competitive under the tolerance-based metric. We discuss model variants, reproducibility details, and limitations of our analysis.
CSECU-DSG at SemEval-2026 Task 6: Imbalance-Aware Transformers for Unmasking Political Question Evasions
Subha Shesgin | Sumaiya Nazneen | Abu Nowshed Chy
Clarity-level classification predicts the degree of clarity of a response to a query. It is essential to the advancement of many NLP applications, such as conversational AI, customer support automation, and instructional technology. However, it is challenging to assess answer clarity due to unclear wording, incomplete answers, and the contextual dependence between questions and answers. This paper describes our participation in the shared task on clarity classification, SemEval-2026 Task 6, which was created to address these issues. We propose a transformer-based method using question-answer pair regression and classification. To train our model, we fine-tuned a transformer model based on DeBERTa-v3-base. To address class imbalance, we used class-weighted loss functions and oversampling. Experimental results show that our proposed approach achieved competitive performance.
YNU-ABSA at SemEval-2026 Task 3: A Unified Pipeline for Continuous and Structured Dimensional ABSA
Qimao He | Xiaobing Zhou
Dimensional Aspect-Based Sentiment Analysis (DimABSA) aims to jointly model continuous Valence–Arousal (VA) regression and structured sentiment extraction at the aspect level in multilingual settings, requiring both fine-grained emotion modeling and structural consistency. Prior approaches often separate regression and extraction or rely on stagewise pipelines, which may limit numerical stability and structural alignment. To address this challenge, we propose a unified pipeline for all three subtasks of DimABSA Track A. Although Task 1 and Task 2/3 use different backbone architectures, they are integrated through consistent preprocessing, a shared dimensional sentiment perspective, and unified post-processing principles. For Task 1, we enhance aspect–context interaction via aspect-conditioned cross-attention and attention pooling, together with bounded output mapping and lightweight calibration for stable VA prediction. For Task 2/3, we formulate triplet and quadruplet prediction as constrained conditional generation with LoRA fine-tuning and structural validation. Experiments show consistent improvements across languages, including lower RMSE, higher correlation, and better cF1. Error analysis further shows that Arousal remains more difficult than Valence.
CuriosAI at SemEval-2026 Task 8: Hybrid retrieval system with repeated sampling for generation
Aiswariya Manoj Kumar | Hiroki Takushima | Fumika Beppu | Yuki Shibata | Daichi Yamaga | Takayuki Hori
SemEval-2026 Task 8 (MTRAGEval) evaluates multi-turn Retrieval-Augmented Generation (RAG) under conversational challenges such as non-standalone turns, underspecification, and answerability detection. These conditions amplify retrieval and generation errors that standard single-turn RAG pipelines fail to address effectively. We present a robustness-oriented multi-turn RAG system combining contextual query rewriting, heterogeneous hybrid retrieval fused with Reciprocal Rank Fusion (RRF), domain-adaptive Low-Rank Adaptation (LoRA) reranking, and repeated sampling with metric-guided selection. On the official test set, our approach outperforms the organizers’ baselines across all subtasks: Retrieval (nDCG@5: 0.5396 vs. 0.4795), Generation (0.7571 vs. 0.6390), and RAG (0.5486 vs. 0.5366). Our system ranks 5th in Subtask A, 5th in Subtask B, and 7th in Subtask C on the official leaderboard. These results demonstrate that calibrated hybrid retrieval combined with robust generation selection is effective for multi-turn RAG.
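The Reciprocal Rank Fusion step named in the abstract above can be illustrated with a minimal sketch. It assumes the standard RRF scoring rule, in which each document receives the sum of 1 / (k + rank) over the ranked lists that contain it. The constant k = 60, the function name, and the toy document ids are assumptions for illustration, not details from the paper.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists containing it,
    with ranks starting at 1. k=60 is a commonly used default (assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

sparse_hits = ["d1", "d2", "d3"]  # hypothetical lexical retriever output
dense_hits = ["d3", "d1", "d4"]   # hypothetical dense retriever output
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
print(fused)
```

Because RRF uses only ranks, it can combine retrievers whose raw scores are on incomparable scales.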
deepgpt at SemEval-2026 Task 1: A Chinese Humor Generation System via Instruction-Masked QLoRA and Reverse Constraint Data Mixing
城 陈
This paper presents the system description of the deepgpt team for SemEval-2026 Task 1 (MWAHAHA: Computational Humor Generation), Subtask A. To address the challenge of generating high-quality Chinese humor under strict text constraints (e.g., incorporating specified rare words or relating to news headlines), we propose a parameter-efficient generation system based on Qwen2.5-3B-Instruct. We reconstructed 8,000 multi-source Chinese jokes into a conversational instruction tuning format. Crucially, to mitigate the prevalent issues of formatting hallucinations and template collapse, we introduce a strict Instruction Masking strategy during 4-bit QLoRA fine-tuning. By completely isolating the loss calculation to the target humorous text, the model is forced to treat constraints as conditional inputs rather than conversational distributions to mimic. Empirical results show that this architectural intervention completely eradicates meaningless conversational fillers. Our system significantly boosted the hard constraint adherence (CAcc) to 94.6% and achieved a highly competitive Elo rating of 903 in the official Pairwise Human Evaluation, validating the effectiveness of specific masking fine-tuning for lightweight large language models in strictly constrained generation tasks.
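The idea of restricting the loss to the target text can be sketched as follows. This is an illustrative assumption about how such masking is commonly implemented, not the authors' code: prompt positions receive the label -100, which PyTorch's cross-entropy loss ignores by default, so only response tokens contribute to the loss. The token ids and the function name are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default

def build_masked_labels(prompt_ids, response_ids):
    """Return (input_ids, labels) where only response tokens carry loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    # Prompt positions are masked out; response positions keep their token ids.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels

prompt_ids = [101, 7, 8, 9]    # hypothetical ids for the instruction and constraints
response_ids = [42, 43, 102]   # hypothetical ids for the target joke
input_ids, labels = build_masked_labels(prompt_ids, response_ids)
print(labels)
```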
CSECU-DSG at SemEval-2026 Task 10: Fine-Tuning DeBERTa Transformer Model for Conspiracy Detection
Debashish Chakraborty | Sumaiya Tabassum | Sabrina Ibnath | Abu Nowshed Chy
Conspiracy detection aims to determine whether a social media post expresses belief in conspiracy theories. This task is essential for understanding harmful online discourse and mitigating the spread of misinformation. However, detecting conspiracy beliefs is challenging due to subtle psycholinguistic cues and the strong contextual dependency of such claims. To address these challenges, SemEval-2026 Task 10 introduced a shared task named PsyCoMark. In this paper, we describe our approach to Subtask 2, which focuses on detecting conspiracy beliefs. We propose a transformer-based classification approach using a fine-tuned DeBERTa-v3-base model to detect conspiracy beliefs in Reddit comments. Each post is processed as a single input sequence. To address class imbalance and improve generalization, we employ class-weighted cross-entropy loss with label smoothing during training. Our approach achieves competitive performance, ranking ninth among participating teams. The findings demonstrate that fine-tuned transformer models effectively capture contextual and psycholinguistic patterns in conspiracy-related discourse.
CUET-823 at SemEval-2026 Task 9: LoRA-Based Instruction Fine-Tuning of LLMs vs. Transformer Models for Bengali Polarization Detection
Arpita Mallik | Ratnajit Dhar
The rapid growth of social media has gone hand in hand with a sharp increase in heated public discussions, where debates about elections, conflicts, protests, and identity often turn into divisive and polarized rhetoric. In this paper, we present our system for SemEval 2026 Task 9 – Subtask 1: Multilingual Text Classification Challenge-Polarization Detection, focusing specifically on the Bengali language. The task is a binary classification problem aimed at determining whether a social media post exhibits attitude polarization, such as intolerance, dehumanization, deindividuation, vilification, or stereotyping toward others’ opinions, identities, or beliefs. Among 49 participating teams, our approach ranked 2nd, achieving a macro-F1 score of 0.8582. We experimented with both transformer-based models and large language models (LLMs), and observed that LoRA-based instruction fine-tuned LLM-based approaches delivered the strongest performance in detecting nuanced and context-dependent polarization in Bengali text.
H-RAG at SemEval-2026 Task 8: Hierarchical Parent–Child Retrieval for Multi-Turn RAG Conversations
Passant Elchafei | Hossam Emam | Mohamed Alansary | Monorama Swain | Markus Schedl
We present H-RAG, our submission to SemEval-2026 Task 8 (MTRAGEval), addressing both Task A (Retrieval) and Task C (Generation with Retrieved Passages). Task A evaluates standalone retrieval quality, while Task C assesses end-to-end retrieval-augmented generation (RAG) in multi-turn conversational settings, requiring both accurate answer generation and faithful grounding in retrieved evidence. Our approach implements a hierarchical parent–child RAG pipeline that separates fine-grained child-level retrieval from parent-level context reconstruction during generation. Documents are segmented into overlapping sentence-based child chunks, while full documents are preserved as parent units to provide coherent context. weighting, and embedding-based similarity rescoring over child chunks. Retrieved evidence is aggregated at the parent level and supplied to an instruction-tuned language model for response generation. H-RAG achieves an nDCG@5 score of 0.4271 on Task A and a harmonic mean score of 0.3241 on Task C (RBagg: 0.2488, RLF: 0.2703, RBllm: 0.6508), underscoring the importance of retrieval configuration and parent-level aggregation in multi-turn RAG performance.
SLPGFJWUWarda at SemEval-2026 Task 1: A Multimodal Vision-Language Approach for Humor Generation Using Fine-Tuned BLIP
Warda Yousaf
We present a BLIP-based multimodal system for image-based humor generation submitted to SemEval-2026 Task 1 (MWAHAHA), focusing on Task B1. Our approach fine-tunes a vision–language model on meme-style captions and handles animated GIFs via representative frame extraction to generate culturally grounded humorous captions.
hllwan at SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis via LLM Feature Fusion and Test-Time Adaptation
Jinglong Li | Yang Yang
This paper describes the system developed by the team for SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). Unlike traditional categorical sentiment analysis, predicting continuous Valence and Arousal (VA) scores across multiple languages and domains poses significant theoretical and engineering challenges. To systematically address data scarcity and cross-domain distribution shifts, we propose a highly robust framework. First, we implement a translation-based data augmentation strategy with precise HTML-tag alignment to mitigate low-resource constraints. Second, we introduce an unsupervised opinion extraction module based on syntactic dependency parsing to explicitly capture sentiment-bearing words. Third, we design a Tripartite Feature Fusion architecture built upon both encoder-only (DeBERTa-v3) and causal LLM (Qwen2.5) models to dynamically aggregate global and localized aspect-opinion embeddings. Finally, we apply an unsupervised Test-Time Adaptation (TTA) mechanism to calibrate normalization layers on the fly. Our system demonstrates highly competitive performance while offering critical insights into the limitations of LLMs in cross-lingual sentiment transfer.
CITD@UIT at SemEval-2026 Task 4: Structured Reasoning and Metric Specialization for Narrative Similarity
Thach Nguyen | Duc-Vu Nguyen | Dang Thin
We present a synergistic dual-track approach for SemEval-2026 Task 4 on narrative similarity, covering Track A (triple-wise classification) and Track B (narrative representation) through failure-driven data enrichment. The shared task received 71 final submissions from 46 teams across its two tracks. For Track A, we explore three reasoning strategies: hybrid Cross-Encoder–LLM arbitration (66.5% dev), DSPy-based component-wise decomposition (68.0% dev), and a multi-stage pairwise reasoning pipeline with enforced moral agency hierarchies, where the final Gemini 2.5 Pro/Flash system achieves 77.39% on development and 69.25% on test data, ranking 17th among 46 participating teams in the official evaluation. For Track B, we propose BGE-M3 (LoRA), an instruction-guided dense representation model trained with Multiple Negatives Ranking Loss (MNRL); since Track B provides only unlabeled story instances, we specialize the embedding space using adversarial samples synthesized from Track A failure cases, achieving 68.75% in the official evaluation and ranking 6th among 26 participating teams. Our analysis shows that narrative similarity depends more on outcome alignment and moral trajectory than lexical overlap, highlighting the complementary roles of explicit reasoning and task-specific metric-space specialization.
YNU-HPCC at SemEval-2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Stories through Narrative Understanding
Mingyu Bai | Jin Wang | Xuejie Zhang
This paper introduces our approach to SemEval 2026 Task 5, which evaluates the rationality of word-sense scores in ambiguous stories through narrative comprehension. This task requires models to assess the consistency between a given word-sense definition and the meaning of an ambiguous target word in a short narrative context, and to infer a rationality score on a 1-5 scale. We experimented and compared multiple methods. These methods include multi-head ensembles that simulate the behavior of individual annotators, ordinal classification and regression methods that treat scores as ordered categories, and direct regression using mean squared error (MSE) or L1 loss to predict human-average consensus scores. Additionally, we investigated instructional fine-tuning with low-rank adaptation (LoRA) on large language models (LLMs) such as Qwen3-4B-Instruct and Phi-4-mini. Our experimental results show that the direct MSE regression method performs best. This study indicates that directly optimizing to approach human consensus scores is effective for this task, while methods that model individual annotator differences are less applicable.
pfr821 at SemEval-2026 Task 9: Multilingual Polarization Detection via Hybrid XLM-RoBERTa with Targeted Data Augmentation and Imbalance-Aware Training
Antoine Durand | Rémi Hamon | Matthieu Pereira | Nathan Boucneau | Paul Cintra
This paper describes HYPOLDET, the system submitted by team pfr821 to SemEval-2026 Task 9 (Polarization Detection, Subtask 1), a binary classification task over 22 typologically diverse languages. Our approach combines three complementary contributions. We first extend XLM-RoBERTa-Large with a custom transformer encoder layer and a learned attention-based pooling mechanism (Hybrid Architecture), allowing the model to aggregate token-level signals beyond the [CLS] representation. We then augment training data through a targeted LLM-based synthetic generation pipeline (Grok API), producing culturally grounded examples for low-resource and imbalanced languages. Finally, we address class imbalance at the training level through an imbalance-aware regime combining a per-language balanced batch sampler, weighted focal loss, and label smoothing. Our best single model achieves an unweighted macro-averaged F1 of 0.796, and a lightweight ensemble reaches 0.798, ranking in the top 10 for 7 languages and 2nd place for Hausa.
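For readers unfamiliar with the weighted focal loss mentioned above, a minimal per-example sketch follows. It assumes the standard form -alpha * (1 - p_t)^gamma * log(p_t), where p_t is the predicted probability of the true class. The alpha and gamma values and the function name are illustrative assumptions; the abstract does not report the settings used.

```python
import math

def weighted_focal_loss(p_true, alpha=1.0, gamma=2.0):
    """Focal loss for one example, given the probability of its true class.

    alpha is a class weight and gamma controls how strongly easy examples
    (p_true close to 1) are down-weighted; both values are assumptions.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)

easy = weighted_focal_loss(0.9)  # confidently correct example
hard = weighted_focal_loss(0.3)  # poorly classified example
print(easy < hard)
```

With gamma = 0 and alpha = 1, the expression reduces to ordinary cross-entropy for that example.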
AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection
Panagiotis Spanakis | Maria Lymperaiou | Giorgos Filandrianos | Athanasios Voulodimos | Giorgos Stamou
This paper presents a novel agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. Unlike traditional classifiers that conflate semantic reasoning with structural localization, our decoupled design isolates and addresses these challenges separately. For marker extraction, we propose Dynamic Discriminative Chain-of-Thought (DD-CoT) with deterministic anchoring to resolve semantic ambiguity and character-level brittleness. For conspiracy detection, an “Anti-Echo Chamber” architecture, consisting of an adversarial Parallel Council adjudicated by a Calibrated Judge, overcomes the “Reporter Trap”, where models falsely penalize objective reporting. Our system achieves 0.24 Macro F1 (+100% over baseline) on S1 and 0.79 Macro F1 (+49%) on S2, ranking 3rd on the S1 development leaderboard and 8th on the test set, demonstrating that structured agentic deliberation is an effective alternative to fine-tuning for interpretable psycholinguistic NLP.
One and Only at SemEval-2026 Task 2: Evaluating Zero-Shot Autonomous LLM Agents and Heuristic Proxies in Ecological Affect Forecasting
Nam Dinh
This paper presents team One and Only’s system for SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal over Time (Soni et al., 2026). We investigate whether zero-shot LLM reasoning can replace fine-tuning for ecological affect forecasting by combining deterministic statistical priors with frozen LLMs (Gemini 3 Pro, Claude Opus 4.6, GPT-5.2). For short-term state changes (Subtask 2A), an OLS mean-reversion anchor is paired with LLM-generated impulses; for long-term disposition changes (Subtask 2B), a Chain-of-Thought prompt drives direct numeric prediction. Our system underperforms fine-tuned approaches on both subtasks. However, post-submission ablation across three LLMs reveals a task-dependent pattern: CoT reasoning substantially improves disposition forecasting (r_V: −0.185 → +0.129; MAE_V: 0.899 → 0.422), while uncalibrated LLM impulses degrade state-change prediction due to variance collapse (σ_pred = 0.41 vs. σ_gold = 1.73). We provide a detailed diagnostic analysis of these failure modes and release all prompts and outputs for reproducibility.
AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis
Stavros Gazetas | Giorgos Filandrianos | Maria Lymperaiou | Paraskevi Tzouveli | Athanasios Voulodimos | Giorgos Stamou
In this paper, we present the AILS-NTUA system for Track-A of SemEval-2026 Task 3 on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which encompasses three complementary problems: Dimensional Aspect Sentiment Regression (DimASR), Dimensional Aspect Sentiment Triplet Extraction (DimASTE), and Dimensional Aspect Sentiment Quadruplet Prediction (DimASQP) within a multilingual and multi-domain framework. Our methodology combines fine-tuning of language-appropriate encoder backbones for continuous aspect-level sentiment prediction with language-specific instruction tuning of large language models using LoRA for structured triplet and quadruplet extraction. This unified yet task-adaptive design emphasizes parameter-efficient specialization across languages and domains, enabling reduced training and inference requirements while maintaining strong effectiveness. Empirical results demonstrate that the proposed models achieve competitive performance and consistently surpass the provided baselines across most evaluation settings.
TeleAI at SemEval-2026 Task 4: Few-Shot Narrative Similarity Modeling for Classification and Ranking
Weiwei Fu | Shiquan Wang | Ruiyu Fang | Shuangyong Song
This paper presents a unified, task-adaptive modeling framework for the two tracks of SemEval-2026 Task 4: Narrative Similarity. For Track A, we build a three-stage pipeline of three-dimensional narrative-anchored chain-of-thought (CoT) reasoning, multi-view data augmentation, and Low-Rank Adaptation (LoRA) fine-tuning. For Track B, we design an architecture fully aligned with the ranking inference pipeline and task objective, along with corresponding data augmentation and expansion methods, and propose Smooth Cosine Contrastive Loss (SCCL) to stabilize training in low-resource settings. Systematic experiments verify the effectiveness of each core module, and our final systems rank 4th in both tracks, providing a reproducible technical solution for few-shot similarity modeling.
LogSigma at SemEval-2026 Task 3: Uncertainty-Weighted Multitask Learning for Dimensional Aspect-Based Sentiment Analysis
Baraa Hikal | Jonas Becker | Bela Gipp
This paper describes LogSigma, our system for SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). Unlike traditional Aspect-Based Sentiment Analysis (ABSA), which predicts discrete sentiment labels, DimABSA requires predicting continuous Valence and Arousal (VA) scores on a 1–9 scale. A central challenge is that Valence and Arousal differ in prediction difficulty across languages and domains. We address this using learned homoscedastic uncertainty, where the model learns task-specific log-variance parameters (log σ²) to automatically balance each regression objective during training. Combined with language-specific encoders and multi-seed ensembling, LogSigma achieves 1st place on five datasets across both tracks. The learned variance weights vary substantially across languages due to differing Valence–Arousal difficulty profiles—from 0.66× for German to 2.18× for English—demonstrating that optimal task balancing is language-dependent and cannot be determined a priori.
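The uncertainty weighting described above can be sketched under a common formulation, assumed here because the abstract does not give the exact objective: each task loss L_i is scaled by exp(-s_i), where s_i = log σ_i² is a learned parameter, and s_i is added back as a regularizer so the model cannot drive every weight to zero. The 0.5 factor, the function name, and the toy loss values are assumptions.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine per-task losses using learned log-variances s_i = log(sigma_i^2).

    total = sum_i 0.5 * (exp(-s_i) * L_i + s_i)
    A larger s_i down-weights task i but pays a penalty through the + s_i term.
    """
    total = 0.0
    for loss, s in zip(task_losses, log_vars):
        total += 0.5 * (math.exp(-s) * loss + s)
    return total

# Hypothetical Valence and Arousal regression losses with both s_i at zero:
combined = uncertainty_weighted_loss([2.0, 4.0], [0.0, 0.0])
print(combined)
```

For a fixed task loss L_i, this objective is minimized at s_i = log L_i, so tasks with larger residual error end up with smaller effective weights.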
NYCU-NLP at SemEval-2026 Task 9: Stacking Small Language Models for Multilingual, Multicultural and Multievent Polarization Detection
Ding-Xiang Lin | Po-Chun Chu | Lung-Hao Lee
This paper presents the NYCU-NLP system for SemEval-2026 Task 9 on online polarization analysis. Our approach explores the effectiveness of instruction-tuned small language models (SLMs), including Phi-4 (14B), Mistral-small-3.2 (24B), and Gemma-3 (27B), with task-specific prompting strategies and combines them via a stacking ensemble to leverage complementary modeling capacities. Evaluated across 22 languages and three subtasks, our system achieved macro-averaged F1 scores of 0.8071 for Polarization Detection (Subtask 1), 0.6108 for Polarization Type Classification (Subtask 2), and 0.5111 for Polarization Manifestation Identification (Subtask 3). Notably, our approach ranked first in 15, second in 12, and third in 10 of the 62 language-specific leaderboards, demonstrating the robustness and competitiveness of stacking-based SLM ensembles in multilingual settings.
d’Olle Grieze at SemEval-2026 Task 11: Comparing the Impact of Supervised Fine-Tuning and Activation Steering on Mitigating Content Effect Bias in Syllogistic Reasoning
Twan Huiskens | Tian Niezing | Koen Snelten
We investigate the content effect bias in Large Language Models (LLMs) as part of SemEval 2026 Task 11. We compare the impact of supervised fine-tuning using low-rank adaptation against activation steering across several model families, including LLaMA, Gemma and Qwen. Our results show that SFT improves accuracy, with LLaMA 8B reaching 98.75% accuracy. Activation steering offers limited effectiveness in mitigating the content effect bias. A logit lens analysis further reveals that fine-tuning successfully shifts the model’s focus toward logical structure, specifically within the later layers.
Cryptix at SemEval-2026 Task 4: Zero-Shot Bi-Encoder Modeling for Narrative Story Similarity - A Sentence Transformer Approach
Sushmitha M | Sarath Kumar P | Thanalaxmi S | Beulah A
This submission presents a zero-shot embedding-based approach for SemEval-2026 Task 4 on Narrative Story Similarity. The system employs the pretrained sentence-transformers/all-mpnet-base-v2 model within a bi-encoder architecture to generate 768-dimensional story embeddings. Narrative similarity is modeled using cosine similarity in embedding space for comparative prediction in Track A and representation generation in Track B. The approach does not involve task-specific fine-tuning and treats narrative comparison as a geometric proximity problem. Experimental results and error analysis highlight the strengths of pretrained semantic encoders in capturing thematic similarity, while revealing limitations in modeling deeper narrative structure and causal progression.
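The geometric-proximity decision described above can be sketched as follows. The example assumes that Track A asks which of two candidate stories is closer to an anchor story, and it uses toy 3-dimensional vectors in place of the 768-dimensional embeddings; the function names and vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def closer_story(anchor, cand_a, cand_b):
    """Pick the candidate whose embedding is nearer to the anchor's."""
    return "A" if cosine(anchor, cand_a) >= cosine(anchor, cand_b) else "B"

anchor = [1.0, 0.0, 0.0]   # toy stand-in for an anchor story embedding
story_a = [0.9, 0.1, 0.0]  # nearly parallel to the anchor
story_b = [0.0, 1.0, 0.0]  # orthogonal to the anchor
choice = closer_story(anchor, story_a, story_b)
print(choice)
```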
Königsberg at SemEval-2026 Task 13: Beyond Language Models: A Low-Resource Feature-Driven and Data-Flow Embedding Approach for Machine-Generated Code Detection
Shahir Habib
The rise of Large Language Models (LLMs) has increased the need for reliable detection of machine-generated code. This paper presents a low-resource, hybrid detection framework developed for SemEval-2026 Task 13, designed to operate efficiently without the computational overhead of end-to-end fine-tuning of large models. Our approach combines (i) a comprehensive feature extraction pipeline that calculates interpretable software metrics capturing stylistic and structural properties of code, and (ii) frozen embeddings extracted from the pre-trained GraphCodeBERT encoder, which model semantic and data-flow information while preserving generalizability. This fusion enables efficient detection of machine-generated code across multiple programming languages (Python, C++, Java, and Go) and improves robustness under out-of-distribution settings. This feature-driven fusion offers a competitive, computation-efficient alternative to purely LLM-based fully fine-tuned models, achieving an F1-score of 38.26.
NUST PsyAI at SemEval-2026 Task 10: Parameter-Efficient RoBERTa for Conspiracy Detection and Character-Level Marker Extraction
Mian Muhammad Husnain Akram | Mehwish Fatima
We present the NUST PsyAI system for SemEval-2026 Task 10 (PsyCoMark), targeting document-level conspiracy detection and character-level psycholinguistic marker extraction from Reddit discourse. Our system ranks 7th in Extraction and 8th in Detection on the leaderboard. We benchmark feature-based and transformer approaches, adopting RoBERTa-large with LoRA for parameter-efficient fine-tuning. For detection, RB-DET-LoRA outperforms all baselines, achieving weighted F1 0.79 (dev) and 0.76 (test), with robust generalization under blinded evaluation. For extraction, we contrast a unified multi-type BIO scheme with a decomposed per-type setup; the latter mitigates cross-label interference and improves boundary consistency, reaching Overlap F1 of 0.16 (dev) and 0.21 (test). Results reveal a clear asymmetry: detection benefits from contextual semantic modeling, while extraction is limited by sparse supervision and boundary-sensitive evaluation.
YNWAAZ at SemEval-2026 Task 1: Bridging the Semantic-Visual Gap: Multimodal Humor Generation
Mohammad Erfan Zare | Tahere Abbasi | Hadi Veisi | Sayin Ala | Hanieh Naderi
Developing Computational Humor systems at a multilingual and multimodal scale requires transcending simple text generation paradigms to focus on intent and context understanding. In this study, we address two key limitations in Foundation Models: Association Failure in textual tasks, which prevents the formation of coherent semantic links between incongruous concepts, and Temporal Blindness in video processing, which disrupts narrative comprehension. To tackle these challenges, we propose a unified architecture comprising an Intent-Aware RAG system for mitigating linguistic gaps across English, Spanish, and Chinese, and a Cascaded Visual Perception pipeline for modeling the narrative structure of video content. A key innovation of this work is the utilization of small language models (TinyLlama) as a Semantic Denoise Filter, converting noisy visual signals into structured, coherent textual representations. Experimental results demonstrate that this modular architecture reduces cultural-semantic gaps in certain languages and produces outputs that generally align better with human humor preferences, though highly nuanced languages still present a challenge.
Stylometry at SemEval-2026 Task 13: Clustered Stylometric Modeling for Machine-Generated Code Detection
Sruthi Santhanam | Parthib Sarkar | Yashvardhan Sharma
Machine-generated code detection is examined under out-of-distribution conditions where robust generalization is required. A hybrid feature representation is used in which code snippets are encoded through character-level TF–IDF patterns together with explicit structural indicators capturing properties such as verbosity and formatting behavior. Variability across generators is handled through clustering-based expert specialization, and predictions are produced using an ensemble of logistic regression and Naïve Bayes models with calibrated thresholds. Experimental results show that the proposed approach performs competitively despite relying on simple linear classifiers. The findings suggest that persistent structural patterns in code provide reliable cross-domain signals for identifying machine-generated programs.
JCT at SemEval-2026 Task 8: Resource-Efficient Multi-Turn RAG via Nano-LLM Rewriting and Hybrid Reranking
Tal Farhan | Chaya Liebeskind
This paper describes our system submission for SemEval-2026 Task 8 (MTRAGEval), focusing on multi-turn Retrieval-Augmented Generation (RAG). Conversational queries often suffer from contextual ambiguity, rendering standard retrieval methods ineffective. We propose a highly resource-efficient pipeline that decouples query understanding from retrieval using a 1.5B parameter Nano-LLM (Qwen) for query rewriting, followed by parallel hybrid retrieval (Qdrant) and Cross-Encoder reranking. During internal development, our optimized system achieved an nDCG@5 score of 0.1991 on answerable queries, outperforming the official BM25 baseline. On the official blind test set, the system achieved a score of 0.1744. While our absolute performance trails behind baselines utilizing massive 20B parameter models, our work establishes a crucial baseline for extreme resource efficiency in conversational RAG. We provide a comprehensive error analysis detailing the impact of domain shifts and retrieval funnels, and we conduct a qualitative analysis on the organizers’ surprise “Underspecified” class to highlight the vulnerabilities of generative query rewriting.
JIA at SemEval-2026 Task 10: A Dual-Track System with BERT-based Encoders and LLMs for Conspiracy Analysis
Jiayue Zhu
This paper presents a dual-track system for conspiracy theory detection and psycholinguistic marker extraction. We evaluate multiple architectures, including DistilBERT, BERT-Base, DeBERTa-V3, RoBERTa, and instruction-tuned Qwen2.5 models. Qwen2.5-14B (full-shot) achieves the best performance with a Weighted F1-score of 0.80 in the detection task. Marker extraction remains challenging: while the fine-tuned LLM performs best on "Actors," its limited generalization in categories such as "Evidence" and "Effect" highlights persistent semantic ambiguity.
AILS-NTUA at SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations
Dimosthenis Athanasiou | Maria Lymperaiou | Giorgos Filandrianos | Athanasios Voulodimos | Giorgos Stamou
We describe the AILS-NTUA system for SemEval-2026 Task 8 (MTRAGEval), addressing all three subtasks of multi-turn retrieval-augmented generation: passage retrieval (A), reference-grounded response generation (B), and end-to-end RAG (C). Our approach is based on two main design principles. First, we adopt a query-diversity-over-retriever-diversity strategy, where multiple complementary LLM-based query reformulations are issued to a single corpus-aligned sparse retriever and combined using a variance-aware nested Reciprocal Rank Fusion scheme. Second, we employ an agentic generation pipeline that decomposes grounded response generation into evidence span extraction, dual-candidate drafting, and calibrated multi-judge selection. The proposed system achieves strong performance across subtasks, ranking first in Task A and second in Task B in the official evaluation. Our empirical findings indicate that query diversity over a well-aligned retriever is more effective than heterogeneous retriever ensembling, and that answerability calibration, rather than retrieval coverage, emerges as the primary bottleneck in end-to-end performance.
UNED at SemEval-2026 Task 9: Sentiment-Aware Transformer Models with Back-Translation Augmentation for Online polarisation Detection
Victor Garcia Sanabria | Alvaro Rodrigo | Roberto Centeno
This paper describes our submission to SemEval-2026 Task 9 (Subtask 1) on Spanish online polarisation detection. We investigate whether sentiment-adapted pretrained language models provide an advantage over general-purpose multilingual models for binary polarisation classification. Under a controlled training setup, we compare a base XLM-RoBERTa model, an emotion-adapted model, and a sentiment-adapted XLM-R model trained on Twitter data. To mitigate overfitting in the relatively small training dataset, we additionally apply back-translation as a data augmentation strategy. Experimental results show that the sentiment-adapted checkpoint consistently outperforms the alternative pretrained models under identical conditions. When combined with back-translation augmentation, the final system achieves a macro-averaged F1 score of 0.743 on the preliminary competition leaderboard. These findings suggest that prior adaptation to affective signals in social media can provide beneficial inductive bias for polarisation detection.
HyperparameterOmens at SemEval-2026 Task 13: Various approaches to detecting machine-generated code
Dmitry Sukhotin | How Yu
We present our systems for SemEval-2026 Task 13, built on the Droid resource suite and benchmark setting. For Subtask A (binary classification of human-written vs. machine-generated code), lexical baselines such as TF–IDF and character n-grams transferred poorly from the LeetCode training distribution to the production-code evaluation split. After correcting pipeline errors that obscured true performance and selecting stable AST features under domain shift, our final system uses 5 uncorrelated features and achieves 0.57 macro F1 on the public test set. For Subtask C (4-way authorship classification of human, AI, hybrid, and adversarial), lexical baselines performed poorly under a significant vocabulary shift. Deep semantic models proved more promising, and a per-class weighted ensemble that included these models achieved 0.57 macro F1 on the public test set.
Unibuc-NLP at SemEval-2026 Task 10: Unmasking Conspiracies with Pre-Trained Language Models
Teodor-George Marchitan | Liviu Dinu
The paper describes the system submitted to SemEval-2026 Task 10 (PsyCoMark) Subtask 2: detecting whether a Reddit comment expresses a conspiracy belief. We investigate three modeling paradigms: (A) an embedding-and-classify pipeline using Jina-embeddings-v3, HateBERT and BERT-Sentiment with Optuna-tuned classical ML models, optionally enriched by 19 readability features from textstat; (B) end-to-end fine-tuning of encoder transformers (DeBERTa-v3-base, DistilBERT) with a compact 128-unit classifier head and multiple pooling strategies; and (C) parameter-efficient QLoRA fine-tuning of large decoder-only models (Mistral-7B-v0.3, Qwen3-0.6B). Our best system, DeBERTa-v3-base with a 128-dimensional classifier, achieves a weighted F1 of 0.74, ranking 29/52 on the official leaderboard. Post-submission analysis further reveals that a weighted pooling strategy outperforms [CLS] on the official validation split by +0.04, achieving a weighted F1 of 0.78 (rank 8/52), suggesting that conspiracy-relevant features are distributed across transformer layers rather than concentrated at the final output.
Team BOBW (Best Of Both Worlds) at SemEval-2026 Task 3: Modular Cross-Attention Encoders for Dimensional Aspect-Based Sentiment Analysis
Michal Rynowiecki | Rob Van Der Goot
This paper presents our system for SemEval-2026 Task 3, which identifies four-part opinion details in product reviews. We used a sequence of pairs of BERT encoder models connected by cross-attention layers. The cross-attention mechanism provided marginally better results than a self-attention equivalent, failing to showcase a significant improvement. Error propagation through the pipeline hurt the correctness of the outputs, with certain stages collapsing the scores. The pipeline architecture’s performance was largely independent of model size, suggesting that small modular encoders for downstream tasks are an efficient alternative to large decoder models. Our best model got a cF1 score of 0.53 on restaurant data and 0.26 on laptop data.
PolarizedTeam at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Maria Nestor | Maroan Al Shrafat | Ioana Pește | Daniela Gifu | Diana Trandabăț
This paper presents the systems developed for SemEval-2026 Task 9, which targets the detection and categorization of multilingual, multicultural, and multi-event online polarization across 22 languages. To address the challenges posed by linguistic diversity and short, heterogeneous texts, we evaluate several Transformer-based architectures for multilingual polarization detection. Our approach models the task as a multi-label classification problem and incorporates mean pooling for sentence representation, focal loss to mitigate severe label imbalance, and label-wise attention mechanisms to capture polarization-specific linguistic cues. Experimental results show that combining robust multilingual encoders with label-aware modelling substantially improves the detection of polarized content across diverse communities and events.
MKJ at SemEval-2026 Task 9: A Comparative Study of Generalist, Specialist, and Ensemble Strategies for Multilingual Polarization
Maziar Kianimoghadam Jouneghani
We present a systematic study of multilingual polarization detection across 22 languages for SemEval-2026 Task 9 (Subtask 1), contrasting multilingual generalists with language-specific specialists and hybrid ensembles. While a standard generalist like XLM-RoBERTa suffices when its tokenizer aligns with the target text, it may struggle with distinct scripts (e.g., Khmer, Odia) where monolingual specialists yield significant gains. Rather than enforcing a single universal architecture, we adopt a language-adaptive selection strategy that chooses among multilingual generalists, language-specific specialists, and hybrid ensembles based on development performance. Additionally, cross-lingual augmentation via NLLB-200 yielded mixed results, often underperforming native architecture selection and degrading morphologically rich tracks. Our final system achieves an overall macro-averaged F1 score of 0.796 and an average accuracy of 0.826 across all 22 tracks. Code and final test predictions are publicly available at: https://github.com/Maziarkiani/SemEval2026-Task9-Subtask1-Polarization.
Proofbusters at SemEval-2026 Task 11: Neuro-Symbolic Syllogistic Reasoning via LLM-Guided Structure Extraction and Deterministic Validation
Mohamed Ayman | Khaled Marzouk | Abdallah Mashaly | Ahmed Heriez
This paper presents the **Proofbusters** system for SemEval-2026 Task 11 (English syllogism validity classification). The task evaluates whether language models can perform *formal* syllogistic reasoning independent of semantic content, i.e., without being swayed by *belief bias* (judging arguments by plausibility or world knowledge instead of logical validity). The main idea is **symbolic abstraction**: before predicting validity, each syllogism is converted into a content-invariant logical form so the model reasons over structure rather than over concrete terms. Inspired by Euler’s abstraction in the Königsberg bridges problem (stripping away geography to reveal pure structure), the paper explores three abstraction strategies of increasing formal rigor: (1) **Template abstraction**: Replace categorical terms with generic placeholders (e.g., x, y, z); keep syntax and quantifiers. Serves as a baseline (82.20% accuracy). (2) **Symbolic OOP abstraction**: Map entities and relations into an object-oriented constraint graph with explicit tracking of supersets, disjoint sets, etc. (88.84% with Qwen-7B). (3) **Set-theoretic abstraction**: Translate premises and conclusion into formal set notation (e.g., \(A \subseteq B\), \(A \cap B = \emptyset\)) and enforce *existential import* (\(A, B, C \neq \emptyset\)) to align with Aristotelian logic. The solver never sees the original natural-language terms. The system uses a **two-stage pipeline**: a **Formulation** stage (natural language → symbolic representation) and a **Solver** stage (validity judgment from symbols only).
The set-theoretic variant, using Gemini Flash 2.5 for formulation and Gemini Pro 2.5 for solving, achieves **98.95% accuracy** with **2.13** total content effect (TCE) and an **overall score of 46.23**, substantially outperforming both task baselines and the other abstraction variants. The **conclusion** is that belief bias in LLMs is tied to semantic surface form: *explicit abstraction into mathematical set notation* sharply reduces plausibility-driven errors. Robust logical reasoning likely requires **architectural separation** between semantic parsing and formal inference, rather than prompt engineering alone. Remaining challenges include formulation errors (e.g., quantifier misclassification), multi-step constraint composition, and negation–inclusion interactions. Future work may combine the abstraction pipeline with formally verified theorem provers and extend it to multilingual or more complex multi-premise reasoning.
VGU-M.Tech-AI at SemEval-2026: Multilingual Multi-Label Classification of Online Polarization Types via Weighted Transformer Fine-Tuning and Adaptive Per-Label Threshold Optimization
Abdulkadir Bichi | Jyoti Shekhawat
This paper proposes MMCOPT, a multilingual multi-label classifier of online polarization types based on weighted transformer fine-tuning and adaptive per-label threshold optimization. Our task is to classify social media posts according to a given set of five labels. A post could be deemed to be politically, racially, religiously, or gender/sexually polarizing, or fall into the category of other. We incorporate a distilbert-base-multilingual-cased model and attach a two-layer MLP head. We also use a class-imbalance-weighted binary cross-entropy loss and optimize thresholds for each class to improve the validation micro-F1 score. Our training set is drawn from the POLAR benchmark, the first large multilingual polarization dataset that includes posts from seven languages and multiple social media platforms. MMCOPT’s best internal validation micro-F1 score is 0.7855, and its macro-F1 score is 0.7749. Our model (team username: asbichi362) is ranked on the official Codabench leaderboard and shows competitive results across the 22 language tracks of multilingual polarization type classification, with its best results in Hindi (0.7429) and Urdu (0.7073).
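The per-label threshold optimization described in this abstract can be sketched as a grid search on validation data. This is a minimal sketch under stated assumptions, not the authors' procedure: it maximizes per-label F1 as a simple proxy for the micro-F1 objective named in the abstract, and the helper names (`f1`, `tune_thresholds`), the grid, and the toy probabilities are hypothetical.

```python
def f1(y_true, y_pred):
    """Binary F1 from parallel lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def tune_thresholds(probs, labels, grid=None):
    """Pick, for each label column, the grid threshold with the best validation F1."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    thresholds = []
    for j in range(len(labels[0])):
        y_true = [row[j] for row in labels]
        scores = [row[j] for row in probs]
        # max() keeps the first (lowest) threshold among ties.
        best = max(grid, key=lambda t: f1(y_true, [int(s >= t) for s in scores]))
        thresholds.append(best)
    return thresholds


# Toy validation set with two labels (assumed values, not from the paper).
val_probs = [[0.9, 0.2], [0.6, 0.4], [0.3, 0.35], [0.1, 0.1]]
val_labels = [[1, 0], [1, 1], [0, 1], [0, 0]]
thresholds = tune_thresholds(val_probs, val_labels)
```

Tuning each label separately lets rare labels use a lower cut-off than frequent ones, which is the usual motivation for replacing a global 0.5 threshold.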
Sylloscope at SemEval-2026 Task 11: Decoupling Logic from Belief via DeepSeek-Enhanced Distillation in Qwen Models
Zhanyu Chen | María Teresa Muñoz Martín | Sem Huisman | Jingjing Lan
This paper presents our approach for SemEval-2026 Task 11: Disentangling Content and Formal Reasoning in Large Language Models. We propose a neuro-symbolic teacher-student framework that utilizes DeepSeek-R1 as a Logical Auditor to generate a high-fidelity training corpus. We distill these analytical behaviors into Qwen-3 models using Low Rank Adaptation (LoRA), focusing on teaching the mechanics of logic rather than simple label matching. Our system yields robust results across both subtasks, with a ranking score of 39.81 (96.86% accuracy) on Subtask 1 and 26.02 on Subtask 3. However, validity bias partially persists, so we conclude that while structured distillation substantially mitigates belief bias, fully disentangling logical validity from plausibility remains a central challenge for future development.
VerbaNexAI at SemEval-2026 Task 6: Automatic Detection of Political Evasion through Hierarchical Classification with RoBERTa Large
Jeison Jimenez Alvear | Deyson Gómez Sánchez | Juan Carlos Martinez Santos | Edwin Puertas | Jairo Serrano
This paper describes VerbaNex AI’s participation in SemEval-2026 Task 6: CLARITY, a shared task on automatic detection of question evasion in political interview transcripts. The task requires classifying question-answer pairs into three clarity levels (Task 1) and identifying nine evasion techniques (Task 2). We propose and evaluate two independent systems based on RoBERTa-Large. The first is a standard sequence classifier that treats each question-answer pair as a single input sequence, leveraging RoBERTa’s native two-segment encoding to model the relationship between the two texts jointly. The second is a dual-encoder architecture that processes the question and answer independently and computes geometric interaction features to model the semantic misalignment between them explicitly. Both systems are trained on Task 2 labels and derive Task 1 predictions via the hierarchical mapping proposed by the task organizers. Our best result was achieved by the standard sequence classifier, reaching Rank 10 on Task 2 and Rank 25 on Task 1.
pamaldi at SemEval-2026 Task 11: Neuro-Symbolic Syllogistic Reasoning via LLM-Guided Structure Extraction and Deterministic Validation
Pasquale Grimaldi
We describe our participation in SemEval-2026 Task 11, Subtask 1: determining the formal validity of syllogisms in English while minimizing the influence of content plausibility. Our system implements a neuro-symbolic pipeline that strictly separates neural and symbolic components. An LLM extracts the formal structure of natural-language syllogisms — proposition types (A, E, I, O) and the three terms — while the syllogistic figure is computed deterministically and a symbolic validator checks whether the resulting mood–figure pair belongs to the 24 classically valid Aristotelian forms. On the official evaluation we achieve 96.34% accuracy, Total Content Effect (TCE) of 1.02, and combined score of 56.57. Compared to pure-LLM baselines on the same backbone, our system more than doubles the combined score (from 26.52 to 56.57) and reduces TCE by nearly an order of magnitude. Swapping the extractor to Claude Sonnet 4.5 preserves combined score and TCE, confirming that content-invariance is contributed by the symbolic stage rather than any particular LLM. A paraphrase probe reveals that the validator is invariant to surface form but the extractor is sensitive to premise ordering — a specific, fixable limitation we identify as the primary target for future work.
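The deterministic stage described in this abstract (computing the figure, then checking the mood–figure pair against the 24 classically valid forms) can be sketched as follows. This is an illustrative sketch, not the author's code: the tuple layout `(type, subject, predicate)`, the function names, and the assumption that the major premise is passed first are all hypothetical; the table lists the standard 24 forms valid under existential import.

```python
# Mood strings are major + minor + conclusion proposition types, keyed by figure.
VALID_FORMS = {
    1: {"AAA", "EAE", "AII", "EIO", "AAI", "EAO"},
    2: {"EAE", "AEE", "EIO", "AOO", "EAO", "AEO"},
    3: {"AAI", "IAI", "AII", "EAO", "OAO", "EIO"},
    4: {"AAI", "AEE", "IAI", "EAO", "EIO", "AEO"},
}


def figure(major, minor, conclusion):
    """Return the figure (1-4) from term positions, or None if malformed.

    Each proposition is a (type, subject, predicate) tuple; the major premise
    is assumed to be the one containing the conclusion's predicate.
    """
    _, s, p = conclusion
    _, maj_s, maj_p = major
    _, min_s, min_p = minor
    if p not in (maj_s, maj_p) or s not in (min_s, min_p):
        return None
    middle = maj_s if maj_p == p else maj_p
    if middle in (s, p) or {min_s, min_p} != {s, middle}:
        return None
    major_m_first = maj_s == middle  # M-P rather than P-M
    minor_m_first = min_s == middle  # M-S rather than S-M
    return {(True, False): 1, (False, False): 2,
            (True, True): 3, (False, True): 4}[(major_m_first, minor_m_first)]


def is_valid(major, minor, conclusion):
    """Check the mood-figure pair against the table of valid forms."""
    fig = figure(major, minor, conclusion)
    mood = major[0] + minor[0] + conclusion[0]
    return fig is not None and mood in VALID_FORMS[fig]


# Barbara (AAA-1): All M are P; all S are M; therefore all S are P.
barbara = is_valid(("A", "M", "P"), ("A", "S", "M"), ("A", "S", "P"))
```

Because the check only sees proposition types and term positions, the concrete terms cannot influence the verdict, which is the content-invariance property the abstract attributes to the symbolic stage.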
COODetect at SemEval-2026 Task 13: Unsupervised Latent Domain Adaptation for Out-of-Distribution AI Code Detection
Aldan Creo | Atharv Nair | Mohana Ravikumar | Vaishak Menon | Dario Wisznewer | Vaibhav Jain
The widespread use of AI-generated code raises questions about software maintenance and academic integrity. However, tools to detect it are still in their infancy. In this article, we explore the issue of out-of-distribution (OOD) detection; while embedder models like CodeBERT can easily achieve high accuracies in the context of their training data, they are unable to properly generalize to unseen contexts or programming languages. We argue that this is caused by an overfitting of such models to the training distribution, e.g. memorizing a language’s "AI syntax" instead of the true generative artifacts, and develop an approach that is able to naturally generalize to completely unseen languages and domains. Our system is also considerably more interpretable than the deep neural alternatives. In particular, we propose three orthogonal views (lexical, structural, and symbolic) to capture the AI-generated code’s indicators. To deal with OOD shift, we normalize the scores per language with Z-scoring and a Gaussian Mixture Model to remove the language bias automatically. We test our approach on the SemEval-2026 Task 13 dataset, where our experiments reached a macro F1 of 0.602 compared to the task baseline of 0.305, demonstrating the generalization capabilities of our system. We make our source code and data available at https://github.com/ACMCMC/COODetect.
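The per-language normalization mentioned in this abstract can be illustrated with a small z-scoring sketch. It covers only the Z-scoring part (the Gaussian Mixture Model step is omitted), and the function name, the input layout, and the sample scores are assumptions for the example rather than details from the paper.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_per_language(samples):
    """samples: list of (language, raw_score); returns z-scores in input order."""
    by_lang = defaultdict(list)
    for lang, score in samples:
        by_lang[lang].append(score)
    # Population mean and standard deviation, computed separately per language.
    stats = {lang: (mean(v), pstdev(v)) for lang, v in by_lang.items()}
    normalized = []
    for lang, score in samples:
        mu, sigma = stats[lang]
        # A constant-score language carries no signal, so map it to 0.0.
        normalized.append((score - mu) / sigma if sigma > 0 else 0.0)
    return normalized


# Hypothetical detector scores whose scale differs by language.
raw = [("py", 1.0), ("py", 3.0), ("go", 10.0), ("go", 30.0)]
z = zscore_per_language(raw)
```

After normalization both languages share one scale, so a single decision rule can be applied without knowing each language's score offset in advance.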
NCL HKU-NarrSim at SemEval-2026 Task 4: Aspect-Based Agents and Supervised Contrastive Embeddings for Narrative Similarity
Jianfei Xu | Ting Zhu | Mingyang Chen | Huizhi(elly) Liang
SemEval-2026 Task 4 on Narrative Similarity requires models to assess narrative alignment between stories rather than relying on surface lexical similarity. For Track A, we introduce the Aspect-Based Narrative Similarity Agents (ABNS-Agents), a two-stage agent-based framework. It extracts three core narrative aspects aligned with the task definition under a schema constraint, and then performs aspect-aligned similarity adjudication using an LLM decision model. For Track B, Narrative Supervised Contrastive Embeddings (NSConE) is based upon supervised contrastive learning to model narrative similarity. Our experiments show that ABNS-Agents achieves 70.25% accuracy on the test set, while NSConE reaches 68.5% test accuracy, demonstrating competitive performance across both reasoning-based and representation-learning paradigms. The findings highlight the effectiveness of aspect-aligned structured modelling and task-specific supervised contrastive learning for capturing narrative similarity beyond surface semantics.
ILab-NLP at SemEval-2026 Task 9: Comparing XLM-RoBERTa and LLaMA-2 for Multilingual Polarization Detection
Declan Booth | Gavin Abercrombie | Simona Frenda
This submission describes a system for SemEval-2026 Task 9, Subtask 1, focused on binary detection of polarized versus non-polarized posts in English and Spanish. We compare two approaches: a fine-tuned multilingual encoder model (XLM-RoBERTa) and a prompted generative model (LLaMA-2 7B). Our experiments show that XLM-RoBERTa delivers stronger and more stable performance overall, while LLaMA-2 is more prone to false positives in Spanish due to a strong bias toward predicting the polarized class. In addition to headline results, we analyse model behaviour using confidence signals and SHAP, and report efficiency measurements with CodeCarbon to highlight practical tradeoffs between performance and computational cost.
VerbaNexAI at SemEval-2026 Task 5: Few-Shot Chain-of-Thought with Selective Self-Consistency and Isotonic Calibration for Word Sense Plausibility Rating
Daniel Peña Gnecco | Edwin Puertas | Juan Carlos Martinez Santos | Jairo Serrano
We present a system for rating word sense plausibility in ambiguous narrative contexts for SemEval-2026 Task 5. Our approach ensembles three large language models (Llama-3.1 70B, Qwen-2.5 32B, and Gemma-2 27B) using a computationally efficient, uncertainty-aware pipeline. We combine few-shot chain-of-thought prompting with selective self-consistency, which applies stochastic multiple sampling exclusively to items identified as inherently ambiguous. This targeted strategy reduces inference costs by approximately 45% while maintaining robustness in predictions. To correct the systematic bias of LLMs toward extreme ratings, we apply isotonic regression to shift the output distribution toward patterns of human judgment. Our system achieves a Spearman correlation of 0.67 and an accuracy within 0.76 standard deviations, ranking 34th out of 79 participating teams (top 43% without task-specific fine-tuning). Detailed error analysis reveals that while our system performs strongly on clear contexts (ρ = 0.78), current prompting paradigms struggle significantly to model multimodal human disagreement in genuinely ambiguous cases (ρ = 0.58), highlighting an important challenge for future work on subjective semantic tasks.
NCL at SemEval-2026 Task 8: Deterministic Small-LLM RAG with Relation Classification
Zehao Liu | Huizhi Liang
We present NCL’s system for SemEval-2026 Task 8B, the generation track for multi-turn retrieval-augmented dialogues. Our submission follows a compact and reproducible RAG pipeline: (1) global and local question rewriting with LLM-based multi-turn relation control, (2) passage reranking with BGE-M3, (3) context-level answerability filtering with strict binary LLM judgments (“yes”/“no”), and (4) deterministic inference with a small LLM (Qwen2.5-1.5B-Instruct) plus post-generation quality fallback (cleaning, bad-answer gate, one stricter retry, then an IDK fallback). On the official test set, our system achieved a harmonic mean score of 0.5973 (RB$_{agg}$ 0.4993, RL$_F$ 0.7235, RB$_{llm}$ 0.6105), ranking 19th out of 26 teams on the leaderboard.
SCUMesclab at SemEval-2026 Task 3: An Adaptive Dual-Track Framework for Dimensional Aspect-Based Sentiment Analysis
Chia-Yun Lee | Matus Pleva | Daniel Hladek | Ming-Hsiang Su
This paper describes our system for SemEval-2026 Task 3, which focuses on predicting continuous valence and arousal scores. The task poses significant challenges due to variations in data scale and pragmatic ambiguities across languages. To address these disparities, we propose an Adaptive Dual-Track Framework that dynamically selects modeling strategies based on task characteristics. For semantically stable tasks, we apply a robust single baseline optimized with layer-wise learning rate decay (LLRD) to ensure stability. For high-ambiguity scenarios such as the Environmental Protection domain, we adopt a heterogeneous ensemble strategy to mitigate prediction variance. Experimental results demonstrate that our system consistently outperforms the initial standard baseline across all subtasks. Furthermore, our lightweight approach exhibits remarkable parameter efficiency, achieving highly competitive performance against newly introduced large language model (LLM) baselines. Additionally, ablation studies reveal that under regression settings, conventional regularization techniques, cross-lingual data transfer, and homogeneous ensemble learning can lead to negative transfer, confirming the necessity of strategically diverging approaches tailored to linguistic characteristics.
PAI at SemEval-2026 Task 3: An LLM and Data Redistribution Adaptation-Based Predictive Strategy for Valence-Arousal Scores
Zhihao Ruan | Kaifeng Yang | Cheng Chen | Wenwen Dai | Wenjia Mao
To address the valence and arousal score prediction task in Dimensional Aspect-Based Sentiment Analysis (DimABSA), we propose a two-stage strategy. In the first stage, we conduct post-training on a Large Language Model (LLM) via a Supervised Fine-Tuning (SFT) scheme, followed by generating initial predictions for valence and arousal scores. In the second stage, we perform distribution adaptation on the initial results by leveraging the training set distribution through various techniques, including Gaussian distribution modeling, quantile mapping, and the Sinkhorn algorithm.
UCSC NLP at SemEval-2026 Task 10: Boundary-Aware Span Extraction and RoBERTa Classification for Conspiracy Detection
Dom Marhoefer | Milos Suvakovic | Glenn Grant-Richards | Aidan Pinero | Ryan King
We present our systems for SemEval-2026 Task 10 (PsyCoMark), addressing conspiracy marker extraction (Subtask 1) and document-level conspiracy detection (Subtask 2). For marker extraction, we formulate the task as multi-label span classification over enumerated candidate spans, using IoU≥0.95 positive labeling, hard-negative sampling, and containment-based non-maximum suppression (NMS) with boundary-aware span representations. Document classification is modeled independently using a sequence classifier with label smoothing and a stratified train–validation split. Analysis shows that entity-like roles (Actor, Victim) are detected robustly, while abstract roles (Action, Effect, Evidence) remain sensitive to boundary criteria. On the official test set, our systems rank 7th in Subtask 1 (0.2251 macro F1) and 12th in Subtask 2 (0.7694 weighted F1).
XplaiNLP at SemEval-2026 Task 1: BVAHAHA - Benign Violation Algorithm for Humor and Harmless Absurdity
Berk Bubus | Nebi Soyal | Vera Schmitt | Nils Feldhus | Veronika Solopova
We present BVAHAHA, a humor generation system for SemEval-2026 Task 1 (MWAHAHA Subtask A), which frames constrained joke generation through the lens of Benign Violation Theory (BVT). Given either two rare words or a news headline, the system generates contextually appropriate jokes while avoiding memorization and unsafe outputs. Our approach combines BVT-guided humor generation with a parallel moderation pipeline ("Gatekeepers") that detects excessive emotional intensity and hate speech, triggering iterative revisions when necessary. Finally, we employ an LLM-as-a-Judge framework with persona-based ranking to approximate human humor preferences.
looploop at SemEval-2026 Task 3: A Dimensional Aspect-Based Sentiment System with DeBERTa Regression and Qwen3 Instruction Fine-Tuning
Liu Yang | Gang Hu | Jing Li
Aspect-Based Sentiment Analysis (ABSA) has evolved to capture continuous affective states, posing challenges for traditional classification models. We adopt a hybrid approach tailored to the varying complexities of the subtasks. For Task 1 (Valence-Arousal Regression), we employ a discriminative architecture using a pre-trained DeBERTa encoder with a MeanPooling mechanism to directly regress continuous sentiment scores. For Tasks 2 and 3, which require complex structural extraction of opinion triplets and quadruplets, we utilize a generative approach by fine-tuning the Qwen3-4B-Instruct large language model via 4-bit QLoRA. Our system effectively handles both precise numerical regression and complex structural text generation, achieving competitive results across the English laptop and restaurant domains.
PFW at SemEval-2026 Task 6: Multi-Seed DeBERTa Ensembles for Political Response Clarity and Evasion Classification
Taleef Tamsal
This paper describes the PFW system for SemEval-2026 Task 6 (CLARITY), which addresses the classification of response clarity and evasion techniques in political interview question-answer pairs. Rather than relying on large language model prompting, we pursue a competitive non-LLM approach based on fine-tuning DeBERTa-xlarge and DeBERTa-v3-large with a multi-seed ensemble strategy: 5-fold cross-validation with 10 random seeds yields 50 models per architecture, combined through simple logit averaging. Our system achieves a macro F1 of 0.76 on Subtask 1 (clarity-level classification) and 0.50 on Subtask 2 (evasion-type classification). We also find that three post-hoc optimization techniques (learned ensemble weights, threshold calibration, and hierarchical masking) each improve out-of-fold performance yet degrade evaluation scores by 0.02–0.10 F1. This pattern should be interpreted cautiously: the 237-sample evaluation set likely contributes substantial variance, and two of the three degradations fall within the ±0.06 95% CI expected from sampling noise. Still, the consistent directional pattern across all three prediction-level interventions provides suggestive evidence for an optimization paradox, highlighting the risk of overfitting to cross-validation predictions when evaluation data is limited. Our code is publicly available at https://github.com/Taleef7/semeval-2026-task6.
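The "simple logit averaging" used to combine ensemble members in this abstract can be sketched in a few lines. This is a minimal illustration with made-up numbers: the nested-list layout, the function name `average_logits`, and the two toy "models" are assumptions, standing in for the 50 fine-tuned models per architecture that the abstract describes.

```python
def average_logits(model_logits):
    """model_logits[m][i][c] is the logit of model m for example i and class c.

    Returns the per-example mean logits and the argmax class of each example.
    """
    n_models = len(model_logits)
    n_examples = len(model_logits[0])
    n_classes = len(model_logits[0][0])
    averaged = [
        [sum(model_logits[m][i][c] for m in range(n_models)) / n_models
         for c in range(n_classes)]
        for i in range(n_examples)
    ]
    predictions = [max(range(n_classes), key=row.__getitem__) for row in averaged]
    return averaged, predictions


# Two hypothetical models, two examples, three classes.
logits = [
    [[2.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
    [[0.0, 1.0, 3.0], [0.5, 0.0, 0.0]],
]
avg, preds = average_logits(logits)
```

Averaging has no parameters to fit, which may be relevant to the abstract's observation that learned post-hoc adjustments improved out-of-fold scores but not evaluation scores.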
PFW Task 8 at SemEval-2026 Task 8: Lightweight Tri-Fusion Retrieval with Prompt-Engineered Faithful Generation for Multi-Turn RAG
Taleef Tamsal
We describe PFW Task 8’s system for SemEval-2026 Task 8 (MTRAGEval), a benchmark for multi-turn retrieval-augmented generation across four English-language corpora. Our submission combines BM25, SPLADE-v3, and Jina Embeddings v4 with weighted reciprocal rank fusion for retrieval, plus zero-shot GPT-4o/GPT-4o-mini prompting for generation. Officially, our system ranks 6th of 26 on Task B (H = 0.756), 14th of 29 on Task C (H = 0.533), and 20th of 38 on Task A (nDCG@5 = 0.433). For the camera-ready analysis, we re-run retrieval at the official nDCG@5 cutoff, strengthen the prompt ablation with per-domain statistics and exact tests, and analyze official outputs by answerability and domain. On a balanced 100-example development sample, explicit citation-format instructions, not chain-of-thought alone, raise citation use from 4% to 93%, and a fixed-context Task C control improves from H = 0.463 with GPT-4o-mini to H = 0.523 with GPT-4o. Official analytics also show near-perfect UNANSWERABLE handling (H = 0.990) but weak behavior on UNDERSPECIFIED turns, where the system answers or refuses instead of clarifying. Our code is publicly available.
YangSteam at SemEval-2026 Task 3: Transformer-Based Aspect-Aware Regression for Dimensional Sentiment and Stance Analysis
Tsung-Hsien Yang | Shu-Fei Yang
This paper describes our system for SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA). We participate in Track A (DimABSA) and Track B (DimStance), both of which involve Subtask 1: predicting continuous valence–arousal (VA) scores for given text–aspect pairs in English and Chinese. Our system combines pre-trained multilingual transformers with aspect-marker input encoding and dual regression heads for VA prediction, trained with a 5-fold cross-validation ensemble. We select XLM-RoBERTa-large as the backbone for Track A and mDeBERTa-v3-base for Track B based on systematic model comparison on the development sets. On the official test sets, our system substantially outperforms the organizer-provided baselines across all language-domain settings. On the unofficial post-evaluation leaderboard, the system achieves strong results on Chinese subsets, ranking 1st on zho-env (Track B) and 2nd on zho-fin (Track A).
PSK at SemEval-2026 Task 9: Multilingual Polarization Detection Using Ensemble Gemma Models with Synthetic Data Augmentation
Srikar Kashyap Pulipaka
We present our system for SemEval-2026 Task 9: Multilingual Polarization Detection, a binary classification task spanning 22 languages. Our approach fine-tunes separate Gemma 3 models (12B and 27B parameters) per language using Low-Rank Adaptation (LoRA), augmented with synthetic data generated by a large language model (LLM). We employ three synthetic data strategies (direct generation, paraphrasing, and contrastive pair creation) using GPT-4o-mini, with a multi-stage quality filtering pipeline including embedding-based deduplication. We find that per-language threshold tuning on the development set yields 2 to 4% F1 improvements without retraining. We also use weighted ensembles of 12B and 27B model predictions with per-language strategy selection. Our final system achieves a mean macro-F1 of 0.811 across all 22 languages, ranking 2nd overall out of 60 participating teams, with 1st place finishes in 2 languages and top-3 in 8 languages. We also find that alternative architectures (XLM-RoBERTa, Qwen3) that showed strong development set performance suffered 30 to 50% F1 drops on the test set, highlighting the importance of generalization.
This paper describes Team SoloSemantics’ submissions to SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. We began with lightweight neuro-symbolic knowledge-graph baselines, but a triplet-tuned MPNet bi-encoder produced stronger semantic separation in our experiments. We adopted a shared dense encoder family across both tracks and kept the KG and fusion variants as diagnostic baselines. Team SoloSemantics ranked 22nd on Track A and 9th on Track B. Our reproducibility audit further shows that the KG branch was often too sparse on short summaries to represent abstract narrative relations reliably under the current extraction pipeline.
AsymVerify at SemEval-2026 Task 6: Asymmetric Confidence-Gated Verification for Political Evasion Detection
Sebastien Kawada
Political evasion is difficult to detect because evasive answers often appear cooperative while avoiding concrete commitment. We present AsymVerify, a confidence-gated verification system for SemEval-2026 Task 6, a three-way classification of Clear Reply, Ambivalent, and Clear Non-Reply responses. AsymVerify scored 0.85 Macro F1 on the evaluation split (Deval, n=237), placing 2nd out of 41 teams on the official leaderboard. The system first classifies each question-answer pair, then selectively applies downgrade verification (CR/CNR → AMB) or upgrade verification (AMB → CR) to low-confidence predictions. Development analysis shows that errors concentrate at the Ambivalent boundary in both directions, motivating this asymmetric two-verifier design while confidence gating keeps additional inference cost low. On Ddev (n=308), AsymVerify with GLM-4.7 gains +17.1 Macro F1 over single-pass classification at 1.48 calls/example, and the upgrade verifier alone improves every tested LLM backend on Ddev by +6.8 to +15.2 Macro F1 over its single-pass baseline. Code is available at https://github.com/kaons-research/AsymVerify-ACL.
NYCU Speech Lab at SemEval-2026 Task 3: Heterogeneous Model Ensemble with Adaptive Weighted Voting for Dimensional Aspect Sentiment Quadruplet Extraction
Hao-Chun Hsieh | Cheng-En Wu | Yuan-Fu Liao
SemEval-2026 Task 3 (DimABSA) includes Dimensional Aspect Sentiment Quadruplet Extraction (DimASQP), which requires extracting structured tuples—aspect term, aspect category, and opinion term—together with continuous valence–arousal (VA) values from reviews (Yu et al., 2026a). In this work, we participate in Track A, Subtask 3. We describe NYCU Speech Lab’s submission for the Chinese Restaurant and Laptop domains. Our system is a post-processing ensemble over heterogeneous architectures: LoRA/QLoRA fine-tuned decoder-only LLMs, a fine-tuned encoder-only model, and (optionally) prompted API-based LLMs. To improve robustness under the continuous F1 (cF1) metric, we use validation-calibrated weighted voting for tuple selection and weighted VA fusion for numerical aggregation, with strict output validation to enforce task constraints. Experiments on a held-out validation split show consistent gains over single models and clarify the precision–recall trade-offs induced by the voting threshold. On the organizers’ released (tentative) test leaderboard snapshot, our submission ranks first in both domains.
CascadeMind at SemEval-2026 Task 4: A Hybrid Neuro-Symbolic Cascade for Narrative Similarity
Sebastien Kawada | Dylan Holyoak
Across self-consistency samples from an LLM, vote agreement tracks instance difficulty: on SemEval-2026 Task 4 (Narrative Story Similarity), supermajority cases (≥ 7/8 votes) resolve at 85% accuracy, split votes at 67%, and perfect ties at 61%, a monotone gradient that holds across the development set. We exploit this in CascadeMind, which routes eight Gemini 2.5 Flash votes by consensus, escalates split votes to additional sampling rounds, and falls through to a symbolic ensemble of theory-inspired narrative signals only on perfect ties (5% of cases). The system reached 72.75% on Track A test, placing 10th of 44 teams. Ablations show that the symbolic component contributes negligibly end-to-end and that nearly all gains come from confidence-aware routing. The takeaway is methodological: for narrative similarity, calibrating when to spend more compute on a hard instance matters more than adding auxiliary representations to reason about it. Code is available at https://github.com/chreia/CascadeMind-ACL.
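The consensus routing described in this abstract can be sketched as follows. The supermajority cutoff of 7 out of 8 votes comes from the abstract; the function name, the route labels, and the vote labels "A" and "B" are hypothetical, and the sketch does not reproduce the additional sampling rounds or the symbolic ensemble themselves.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def route(votes: Sequence[str], supermajority: int = 7) -> Tuple[str, Optional[str]]:
    """Decide what to do with one instance given its self-consistency votes.

    Returns (route, label):
      - ("accept", label): a supermajority agrees, so the answer is kept
      - ("symbolic_fallback", None): perfect tie between the top labels
      - ("escalate", label): split vote, request more samples
    """
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    if top_count >= supermajority:
        return ("accept", top_label)
    if len(counts) > 1 and counts[1][1] == top_count:
        return ("symbolic_fallback", None)
    return ("escalate", top_label)
```

For example, `route(["A"] * 5 + ["B"] * 3)` returns `("escalate", "A")`, while a 4-4 split is handed to the fallback.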
YNU-HPCC at SemEval-2026 Task 13: Robust Machine-Generated Code Detection under Distribution Shifts
Lixian Xing | Jin Wang | Xuejie Zhang
As Large Language Models (LLMs) become prevalent in software development, distinguishing machine-generated from human-written code is increasingly important. This paper describes the system developed by the YNU-HPCC team for SemEval-2026 Task 13, which evaluates detection under cross-language, multi-generator, and hybrid settings. Three modeling paradigms are systematically examined: encoder-based fine-tuning, feature-based machine learning, and task-specific robustness strategies. For Subtask A (Binary Detection), frozen pre-trained encoders and shallow stylometric features exhibit stronger cross-domain robustness than full fine-tuning, with indentation entropy identified as a key discriminative signal. For Subtask B (Multi-Class Attribution), a hierarchical two-stage framework is adopted to decouple human–machine discrimination from fine-grained generator attribution, alleviating severe class imbalance. For Subtask C (Hybrid Detection), a token-level splicing augmentation strategy combined with Supervised Contrastive Learning and Focal Loss is employed to model intra-sample stylistic variation. According to the official leaderboard, our system ranked 12th out of 81 teams in Subtask A, 14th out of 34 in Subtask B, and 8th out of 32 in Subtask C.
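The abstract names indentation entropy as a key stylometric signal but does not define it. One plausible reading, shown here purely as an assumption, is the Shannon entropy of the distribution of leading-whitespace widths over non-blank lines; the function name and this exact definition are hypothetical.

```python
import math
from collections import Counter


def indentation_entropy(code: str) -> float:
    """Shannon entropy (in bits) of leading-whitespace widths over non-blank lines.

    Tabs and spaces each count as one character of width in this sketch.
    """
    widths = [
        len(line) - len(line.lstrip(" \t"))
        for line in code.splitlines()
        if line.strip()
    ]
    if not widths:
        return 0.0
    total = len(widths)
    counts = Counter(widths)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


flat = "a = 1\nb = 2\n"
nested = "def f():\n    if x:\n        return 1\n    return 0\n"
```

Under this definition `flat` has zero entropy because every line shares the same indentation, while `nested` mixes three indentation levels and scores higher.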
TeleAI at SemEval-2026 Task 6: A Confidence-Aware Multi-Stage Reasoning Framework with Chain-of-Thought
Lingling Shi | Haoyu Jin | Shiquan Wang | Fang Yu | Shuangyong Song | Xuelong Li
This paper describes our framework for SemEval-2026 Task 6 (CLARITY - Unmasking Political Question Evasions), which focuses on classifying clarity and fine-grained evasion types in political question-answering dialogues. We propose CAMSR-CoT, a confidence-aware multi-stage reasoning framework that unifies the two subtasks through hierarchical label modeling. The framework adopts a confidence-based routing strategy: high-certainty cases are directly resolved, while ambiguous samples are routed to deeper Chain-of-Thought reasoning stages with boundary-aware few-shot exemplars to mitigate label confusion. On the development set, our framework achieves Macro-F1 scores of 0.812 on SubTask 1 and 0.617 on SubTask 2. On the official hidden test set, it ranks 1st in both SubTask 1 (Macro-F1 = 0.89) and SubTask 2 (Macro-F1 = 0.68).
chengtang at SemEval-2026 Task 7: A Retrieval-Augmented Generation Framework for Cultural Perspective Alignment in Everyday MCQs
Cheng Tang | Zhichao Meng | Meizhi Jin
Large language models (LLMs) often exhibit significant cultural representation biases in multilingual everyday knowledge understanding, struggling to accurately capture region-specific customs and values. This paper presents our system submission for SemEval 2026 Task 7: BLEnD Challenge Track 2 (MCQ) (SemEval-2026 Task 7 Organizers, 2026). To address these challenges, we propose a training-free retrieval-augmented generation (RAG) framework. Without introducing any external data, we manually constructed a localized multicultural knowledge base for each language-region and used text-embedding-v4 for region-specific cultural background retrieval. In the generation stage, we adopted a strict zero-shot setting: prompts contain no task instance question-answer examples, only injecting locale-relevant background cultural descriptions via RAG to compensate for contextual information absence, combined with a dual-model ensemble strategy using Gemini 3 Flash (preview) (Google DeepMind, 2025) and GPT-5.2 Chat (OpenAI, 2025). Our system achieved an overall score of 96.35 on the final Evaluation dataset. Additionally, we conducted in-depth analysis of model performance on specific languages, particularly highlighting severe cultural alignment challenges faced by large models in dialectal variants like Moroccan Arabic (ar-MA) and highly localized subjective Japanese (ja-JP) everyday scenarios.
Phatthachdau at SemEval-2026 Task 9: A Multi-Stage Augment-Judge-Train Pipeline for Multilingual Online Polarization Detection
Phan Phat
We address the extreme label imbalance in the Hausa dataset, where only 11% of instances are polarized, through the Augment-Judge-Train (AJT) pipeline. By leveraging Gemini 2.0 for taxonomy-driven data generation and an LLM-as-a-Judge layer for quality control, we expanded the minority class sixfold. Our ensemble architecture, combining specialized encoders with LLM-LoRA, achieved 1st place in Hausa (0.8336 Macro-F1) and ranked in the top 10 for English. These results demonstrate the efficacy of culture-aware synthetic data in enhancing social NLP for low-resource languages.
CYUT at SemEval-2026 Task 9: Monolingual vs. Multilingual LoRA Tuning for Multicultural and Multievent Polarization Detection
Shih-Hung Wu | Yun-Kuang Liao | Shih-Siang Su | Yi-Min Jian
This study addresses SemEval-2026 Task 9 on Detecting Multilingual, Multicultural, and Multievent Online Polarization, exploring the performance differences between monolingual and multilingual LoRA (Low-Rank Adaptation) fine-tuning techniques when processing online polarization phenomena. The research points out that online polarization is not only a language phenomenon, but a complex social language problem highly influenced by cultural contexts and event backgrounds. To address the limitation of existing research that only treats polarization as a binary classification, this study participates in three levels of subtasks: Subtask 1: Polarization Detection, Subtask 2: Polarization Type Classification (e.g., politics, religion), and Subtask 3: Manifestation Identification (analyzing rhetorical strategies that construct polarization, such as stereotypes and dehumanization narratives). This study aims to establish a more contextually grounded and diagnostic model analysis framework to enhance the model’s generalization ability and fairness in cross-lingual environments. By exploring different fine-tuning configurations to build a robust ensemble system, the experimental results show that our approach demonstrates exceptional proficiency in the Chinese domain, securing the 1st place ranking in Subtask 1 (Polarization Detection) for Chinese. Furthermore, we observe that while the monolingual LoRA strategy exhibits strong performance in specific languages like Chinese, integrating it with multilingual LoRA models via ensembling provides the diverse features crucial for identifying complex cross-cultural rhetoric.
DeepSemantics at SemEval-2026 Task 9: Label-Wise Optimization with Adaptive Focal Loss for Polarization Manifestation Identification
Eliasse Tiao | Josue Edou | Mahugnon Gohouede
In this paper, we present our system for SemEval-2026 Task 9, which focuses on the fine-grained identification of polarization manifestations in multilingual social media content. Our approach combines transformer-based encoders (RoBERTa-base for English and Afro-XLM-R-small for Hausa) within a One-vs-Rest (OvR) framework, complemented by controlled oversampling, Adaptive Focal Loss, and label-wise threshold optimization. To mitigate severe class imbalance and label sparsity, we adopt language-specific optimization strategies supported by pairwise χ² independence analysis. Our system achieves macro-F1 scores of 0.464 in English and 0.192 in Hausa on the official test sets, ranking 5th in Hausa and 14th in English on the official leaderboard.
UTokyo Tsuruoka Lab at SemEval-2026 Task 9: Efficient Single Forward Pass Inference for Multi-Label Polarization Classification
Howard Tangkulung | Yoshimasa Tsuruoka
Detecting and interpreting polarized online content is increasingly crucial as online platforms become central to public information exchange. We present an efficient adaptation of large language models for multi-label polarization classification in SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization. Our single-forward-pass inference method outperforms baseline multi-step decoding approaches for multi-label classification by reducing error propagation while improving inference efficiency. Beyond performance and efficiency analysis, we investigate the cross-lingual transferability of the system, observing statistically significant generalization within language families, a result that offers a practical path for low-resource language adaptation. Our system ranked 1st in 8 languages for Subtask 1 and 6 languages for Subtask 2, and placed in the top 5 for 16 out of 22 languages across both subtasks. Overall, we provide a simple, effective, and efficient solution for multilingual polarization classification.
ALPS-Lab at SemEval-2026 Task 3: A Multilingual Generative LLM Approach for Dimensional Aspect Sentiment Analysis
Songqian Dai | Wei Lin
We propose a supervised fine-tuning (SFT) approach for the DimABSA shared task, which predicts aspect-level sentiment intensities using large language models. The approach uses Gemma-3 27B with QLoRA for efficient fine-tuning on multilingual datasets. Merging data across languages improves performance, especially in low-resource domains. Post-processing removes duplicate outputs for accurate evaluation.
XiaoM at SemEval-2026 Task 7: A Qwen-based System for Accurate Retrieval of Everyday Knowledge Across Diverse Languages and Cultures
Xiao Yao | Liang Yang
This paper describes our system designed for SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures. We describe a practical inference system for a two-track benchmark consisting of short-answer questions (SAQ) and multiple-choice questions (MCQ). Our submission is implemented in a single script and targets competition constraints directly: strict TSV schemas, short answer limits, and reliability under batch inference. The system uses Qwen2.5-7B-Instruct with memory-aware initialization, deterministic decoding (no sampling, zero temperature), and post-processing rules that guarantee valid outputs. We further add retry-on-failure and file-write fault tolerance to reduce runtime interruptions.
MSqrd at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Syeda Samah Daniyal | Muneeba Badar | Manal Hasan | Shifa Shah | Sandesh Kumar | Abdul Samad
Online polarization, the critical division between social, political, or identity groups, often leads to hate speech and social fragmentation. Detecting polarization, especially across diverse linguistic and cultural contexts, is a critical challenge. This paper presents our submission for SemEval-2026 Task 9, which focuses on detecting multilingual, multicultural, and multievent online polarization (Naseem et al., 2025). The task is divided into three subtasks: (1) binary polarization detection, (2) multi-label classification of polarization type (e.g., political, racial, religious), and (3) multi-label identification of its manifestation (e.g., stereotype, vilification, dehumanization). For each subtask, we fine-tune BERT-based transformer models. Model configurations are described in Section 4. The results are evaluated using the macro F1 score. We achieved scores of 78.6, 55.8, and 44.6 on the development test set for subtasks 1, 2, and 3, respectively. Overall, the results demonstrate the effectiveness of BERT-based models for multilingual polarization detection.
HU at SemEval-2026 Task 10: Psycholinguistic Conspiracy Marker Extraction and Detection
Muhammad Quddussi Kashaf | Shahmir Mustafa Chaudhry | Marium Zeeshan | Nahyan Javed | Sandesh Kumar | Abdul Samad
Modern media poses a complex challenge to verifying the credibility of information and public discourse due to the advent of conspiracy theory content. This paper presents our methodology for "SemEval-2026 Task 10: Psycholinguistic Conspiracy Marker Extraction and Detection". The task consists of two subtasks: extracting psycholinguistic markers from text using Named Entity Recognition (NER) techniques, and classifying Reddit comments as conspiratorial or non-conspiratorial. Our approach involved: (1) diverse extraction methodologies, including traditional BIO tagging schemes, the GlobalPointer framework, and the GLiNER2 architecture, (2) data augmentation and synthetic data generation via Large Language Models (LLMs), and (3) evaluating various transformer-based models, such as DistilBERT and Covid Twitter-BERT. Our final system achieves a macro F1 score of 0.26 on Subtask 1 and 0.76 on Subtask 2.
RaguTeam at SemEval-2026 Task 8: Meno and Friends in a Judge-Orchestrated LLM Ensemble for Faithful Multi-Turn Response Generation
Roman Derunets | Ivan Bondarenko | Oleg Sedukhin | Mikhail Komarov | Ivan Chernov | Mikhail Kulakov
This paper describes our first-place submission to Task B (generation with reference passages) of the SemEval-2026 Task 8 MTRAGEval shared task on multi-turn retrieval-augmented generation. We propose a heterogeneous ensemble of seven LLMs organised into two groups with distinct prompting strategies, and use a GPT-4o-mini judge to select the best candidate response for each instance. Our system ranked first among 26 teams, achieving a conditioned harmonic mean score of 0.78 and substantially outperforming the strongest organiser baseline (0.64). Ablation experiments show that diversity across model families, scales, and prompting strategies is critical: the ensemble consistently outperforms any individual model. We also include Meno-Lite-0.1, a 7B domain-adapted model with a favourable cost–performance trade-off, and present an analysis of MTRAGEval that highlights annotation limitations and directions for benchmark improvement.
AFourP at SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal over Time from Ecological Essays
Shrika Thota | Lakshmi Priya Swaminatha Rao | Shivaanee Sk | Thirumurugan Ra | Vishal Muralidharan | Dhannya Santhakumari Madhavan
We describe our submission to SemEval-2026 Task 2 (Subtask 1), which asks systems to predict continuous Valence and Arousal scores from ecological diary texts. We fine-tune RoBERTa-base with a single linear regression head, treating each essay independently. Our system scores an r_composite of .679 (Valence) and .466 (Arousal) on the official test set, placing 4th on the Subtask 1 leaderboard.
HausaNLP at SemEval-2026 Task 7: Prompt-based Hausa Cultural Question Answering
Faisal Adam | Lukman Aliyu | Sani Aji | Abdulhamid Abubakar | Aliyu Rabiu Shuaibu
We describe HausaNLP’s submission to SemEval-2026 Task 7 Track 1 (short-answer cultural question answering). Our system is a training-free, prompt-based pipeline targeting native Hausa (ha-NG). Two design decisions distinguish it from a generic zero-shot baseline. First, we use locale-conditional prompting: ha-NG questions receive a system prompt instructing concise standard Hausa output with explicit Boko-script characters (á, â, Î, ű). Second, we use a two-model fallback pipeline: GPT-4o handles the primary pass, and Gemini 1.5 Flash retries any rows where the primary call returned an error or empty output, separating model knowledge failures from API-availability failures. On the official development leaderboard, our best run reached 36.4 accuracy. Error analysis shows that a non-trivial fraction of failures are placeholder strings caused by API errors rather than incorrect generations, and that surface-level mismatches (verbosity, orthographic variation) account for many of the remaining errors. Code, prompts, and processing scripts are released for reproducibility.
Takoyaki at SemEval-2026 Task 3: Ensembling LLM Predictions using Demonstration Retrieval for Dimensional Aspect-based Sentiment Analysis
Kosuke Yamada | Sho Takase | Ryosuke Kohita
This paper describes our system for SemEval-2026 Task 3 (DimABSA). We participate in Subtask 2 (DimASTE), which requires extracting triplets of aspect term, opinion term, and valence-arousal scores from review sentences, and Subtask 3 (DimASQP), which additionally requires aspect category classification to form quadruplets. Our proposed system consists of a multi-step pipeline: (1) retrieval-based in-context learning using BM25 to select relevant demonstrations for LLM inference, (2) agreement-based ensemble combining LLM predictions from multiple retrieval variants, and, for a subset of datasets, (3) error-pattern correction refining uncertain predictions using correction rule sets based on training data. Retrieval-based ICL and the agreement-based ensemble show consistent improvements across languages and domains. Error-pattern correction yields further improvement for the Japanese dataset. To further investigate output quality beyond automated evaluation metrics, we conducted human evaluation. The results suggest that LLM-based labeling achieves higher agreement with gold labels than human annotators, and additionally indicate a discrepancy between automated scores and practical output quality.
Team hugang11 at SemEval-2026 Task 1, Subtask A (Chinese): A CoT-SFT, Teacher-Constructed DPO, and Deterministic Post-processing Pipeline for Humor Generation
Gang Hu | Liu Yang | Jing Li
We present a system for SemEval-2026 Task 1, Subtask A (Chinese), which addresses humor generation with a three-stage pipeline combining CoT-SFT, teacher-constructed DPO, and deterministic post-processing. Built on Qwen2.5-7B-Instruct-bnb-4bit, the system achieved a live leaderboard rating of 991 and ranked in the second group. Our results suggest that robust inference-time control is as important as alignment-oriented training for humor generation.
ITLC at SemEval-2026 Task 11: Normalization and Deterministic Parsing for Formal Reasoning in LLMs
Wicaksono M. | Joanito Lopo | Tack Hwa Wong | Muhammad Ravi Shulthan Habibi | Samuel Cahyawijaya
Large language models suffer from content effects in reasoning tasks, particularly in multilingual contexts. We introduce a novel method that reduces these biases through explicit structural abstraction that transforms syllogisms into canonical logical representations and applies deterministic parsing to determine validity. Evaluated on the SemEval-2026 Task 11 multilingual benchmark, our approach achieves top-5 rankings across all subtasks while substantially reducing content effects and offering a competitive alternative to complex fine-tuning or activation-level interventions.
Scmhl5 at SemEval-2026 Task 3: Uncertainty-Aware Adversarial Learning for Embedding Enhancement in Dimensional Aspect-Based Sentiment Analysis
Haohuan Chen | Han Liu
This paper presents an uncertainty-aware adversarial learning framework developed for SemEval-2026 Task 3, a shared task focusing on Dimensional Aspect-Based Sentiment Analysis (ABSA). Our framework involves three key components: Uncertainty modeling, Heterogeneous Mixture-of-Experts (HMoE) architecture, and embedding-level adversarial training. Experimental results demonstrate that our framework effectively reduces the Root Mean Square Error (RMSE), thereby validating the synergistic advantages of uncertainty modeling and heterogeneous fusion strategies in fine-grained sentiment regression tasks.
Team VYN at SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis
Vishal Thenuwara | Widanalage De Mel | Nisansa De Silva
This paper describes our system for the DimABSA 2026 Shared Task (SemEval-2026 Task 3), Track A, covering all three subtasks. We develop two complementary approaches: (1) DESS (Thenuwara and de Silva, 2025), an adaptation of our span-based extraction model incorporating dual-channel GCNs and a valence–arousal (VA) regression head.
Dual-View Consistency Testing for Content-Invariant Multilingual Syllogistic Reasoning
Ishita Gupta | Dhruv Goyal | Jatin Bedi
Team 0704mis addressed the SemEval-2026 Task 11 Subtask 3 by building a neuro-symbolic system designed for multilingual syllogistic validity classification across 12 typologically diverse languages. The process involves a neural parser that extracts logical forms from text, which are then validated by a symbolic verifier implementing the full set of 24 valid Aristotelian forms via a hash lookup. Our standout contribution is the dual-view consistency test: the system compares a "native" parse of the original text with a "masked" version where content terms are replaced by abstract symbols (X, Y, Z), only proceeding with high confidence if both views agree. By comparing how the model interprets the same logic in two different formats, the system can detect if the model’s reasoning changes when the context shifts from real-world objects to abstract symbols. The primary goal is to combat belief bias, the human-like tendency of LLMs to accept invalid arguments if the conclusion sounds true, or reject valid arguments if the conclusion sounds false. By enforcing this dual-view check, we found that symbol abstraction (View B) acts as a structural regularizer, forcing the model to ignore semantic interference and focus on the relationship between terms.
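The symbolic verifier and dual-view check in this abstract can be sketched as a set lookup over (mood, figure) pairs. The table lists the 24 traditionally valid forms, which include the weakened moods that assume existential import. The function names and the tuple representation of a parse are hypothetical, and the neural parser that would produce the parses is not shown.

```python
from typing import Optional, Tuple

# The 24 traditionally valid syllogistic forms, keyed by figure.
VALID_FORMS = {
    1: {"AAA", "EAE", "AII", "EIO", "AAI", "EAO"},
    2: {"EAE", "AEE", "EIO", "AOO", "EAO", "AEO"},
    3: {"AAI", "IAI", "AII", "EAO", "OAO", "EIO"},
    4: {"AAI", "AEE", "IAI", "EAO", "EIO", "AEO"},
}

Parse = Tuple[str, int]  # (mood, figure), e.g. ("AAA", 1) for Barbara


def is_valid(mood: str, figure: int) -> bool:
    """Constant-time validity check for a parsed syllogism."""
    return mood.upper() in VALID_FORMS.get(figure, set())


def dual_view_verdict(native: Parse, masked: Parse) -> Optional[bool]:
    """Return a validity verdict only when both views yield the same parse.

    None signals disagreement between the native and masked views, which a
    full system would treat as low confidence.
    """
    if native != masked:
        return None
    return is_valid(*native)
```

Here `dual_view_verdict(("AAA", 1), ("AAA", 1))` returns `True`, while mismatched parses return `None` instead of a guess.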
Caraman at SemEval-2026 Task 8: Three-Stage Multi-Turn Retrieval with Query Rewriting, Hybrid Search, and Cross-Encoder Reranking
David Caraman | Gheorghe Cosmin Silaghi
We describe our system for SemEval-2026 Task 8 (MTRAGEval), participating in Task A (Retrieval) across four English-language domains. Our approach employs a three-stage pipeline: (1) query rewriting via a LoRA-finetuned Qwen 2.5 7B model that transforms context-dependent follow-up questions into standalone queries, (2) hybrid BM25 and dense retrieval combined through Reciprocal Rank Fusion, and (3) cross-encoder reranking with BGE-reranker-v2-m3. On the official test set, the system achieves nDCG@5 of 0.531, ranking 8th out of 38 participating systems and 10.7% above the organizer baseline. Development comparisons reveal that domain-specific temperature tuning for query generation, where technical domains benefit from deterministic decoding and general domains from controlled randomness, provides consistent gains, while more complex strategies such as domain-aware prompting and multi-query expansion degrade performance.
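Reciprocal Rank Fusion, the step that merges the BM25 and dense rankings in this pipeline, scores each document as the sum of 1 / (k + rank) over the input rankings. The sketch below uses the commonly cited constant k = 60; the abstract does not state which value the system uses, and the document identifiers are made up.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document receives sum(1 / (k + rank)) over the lists that contain it,
    with ranks starting at 1. Ties are broken by document id for determinism.
    """
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_ranking = ["d1", "d2", "d3"]    # hypothetical sparse retrieval output
dense_ranking = ["d3", "d1", "d4"]   # hypothetical dense retrieval output
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Because only ranks enter the formula, the two retrievers' raw scores never need to be put on a common scale before fusion.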
REGLAT at SemEval-2026 Task 9: Enhancing Arabic Online Polarization Detection Using AraBERT and Synonym Replacement Augmentation
Ahmed Fetouh | Mariam Francies | Nsrin Ashraf | Hamada Nayel | Rahmath Mohammed
In this paper, we present our system, which was submitted to SemEval-2026 Task 9 (Subtask 1: Polarization Detection) and focuses on binary classification of polarized content in Arabic social media text. To address Arabic linguistic variations, we propose a single-model approach that combines fine-tuned AraBERT with synonym-based data augmentation. On the Arabic blind set, our method achieves a competitive macro F1-score of 0.831 and an accuracy of 0.833. Among the 45 participating teams, our system ranked 11th overall, with a performance gap of 0.018 macro F1 from the top-ranked team (0.8488). The results show that a fine-tuned AraBERT with synonym replacement is a strong, simple, and reproducible baseline that outperforms more complex setups in dealing with Arabic attitude polarization nuances.
RAGTUM at SemEval-2026 Task 8: Contextual Query Rewriting and Dense Retrieval for Multi-Turn RAG
Finn Wigger | Maximilian Podolsky | Merle Wilmink | Zelong Peng
This paper describes the system developed by a team for the TUM practical course Human-Centered Computing: applications in natural language processing, network science, machine learning, and AI for the SemEval MTRAG. Our approach addresses the challenges of multi-turn retrieval-augmented generation (RAG) by combining context-aware query rewriting with a dense retrieval strategy. We employ a pipeline that cleanses noisy corpora, uses dense OpenAI embeddings via Milvus for robust retrieval, and leverages the Gemini 2.5 Flash family of models for standalone query generation and final response synthesis. Our system demonstrates the effectiveness of integrating high-precision retrieval with fact-based generation across diverse domains.
d-itlab at SemEval-2026 Task 12: Per-Option Surprisal and Multi-Stage Gating for Precision-Oriented Causal Reasoning
Yasunori Terao | Yuuki Tachioka
We describe the system submitted by d-itlab to SemEval-2026 Task 12 (Abductive Event Reasoning), which requires selecting the most plausible direct cause(s) of an observed event from candidate options grounded in reference documents. Our approach combines (i) per-option multi-stage LLM inference that evaluates each option independently with progressively stricter verification, (ii) surprisal-based features obtained by teacher-forcing candidate sentences and measuring token-level negative log-likelihood, and (iii) an XGBoost ensemble trained on these heterogeneous features to produce a precision-oriented final prediction. On the official test set, our system scored 0.91, ranking third among 116 participating teams.
AICOE-Tredence at SemEval-2026 Task 11: Mitigating Content Bias in Syllogisms via Symbolic Logic-Language Decoupling
Rakshith R | Ankush Chopra
Content bias remains a key limitation of large language models (LLMs), which often conflate formal logical validity with real-world plausibility. SemEval-2026 Task 11 examines this challenge through multilingual syllogistic reasoning, requiring models to judge validity independently of content. We propose a structure-first reasoning paradigm that abstracts natural language syllogisms into Aristotelian logical forms. By mapping arguments to mood–figure representations and classifying validity in this symbolic space, our approach removes semantic content from the reasoning process. On the private test sets of Subtasks 1 and 3, our method achieves a perfect combined score, with 100% validity accuracy and zero content bias in both English and multilingual settings using Gemini-3 Pro Preview. We also explore transferring this paradigm to smaller models via structural supervision, finding that distilled systems retain high accuracy with minimal bias. These results suggest that explicitly separating logical form from linguistic content is a promising direction for bias-resilient and cross-lingually robust reasoning in LLMs.
hermeneutichools at SemEval-2026 Task 4: Multiperspectivity as a Resource for Narrative Similarity Prediction
Max Upravitelev | Veronika Solopova | Jing Yang | Charlott Jakob | Premtim Sahitaj | Ariana Sahitaj | Vera Schmitt
Predicting narrative similarity can be understood as an inherently interpretive task: different, equally valid readings of the same text can produce divergent interpretations and thus different similarity judgments, posing a fundamental challenge for semantic evaluation benchmarks that encode a single ground truth. Rather than treating this multiperspectivity as a challenge to overcome, we propose to incorporate it in the decision making process of predictive systems. To explore this strategy, we created an ensemble of 31 LLM personas. These range from practitioners following interpretive frameworks to more intuitive, lay-style characters. Our experiments were conducted on the SemEval-2026 Task 4 dataset, where the system ranked 13th out of 47 teams and achieved an accuracy score of 0.705. Accuracy improves with ensemble size, consistent with Condorcet Jury Theorem-like dynamics under weakened independence. Practitioner personas perform worse individually but produce less correlated errors, yielding larger ensemble gains under majority voting. Our error analysis reveals a consistent negative association between gender-focused interpretive vocabulary and accuracy across all persona categories, suggesting either attention to dimensions not relevant for the benchmark or valid interpretations absent from the ground truth. This finding underscores the need for evaluation frameworks that account for interpretive plurality.
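The Condorcet Jury Theorem referenced in this abstract can be made concrete with a short calculation. The sketch assumes fully independent voters who each answer a binary question correctly with the same probability p, which is the idealised textbook setting and stronger than the weakened independence the abstract describes. The ensemble size of 31 matches the abstract; the per-voter accuracy of 0.6 is an illustrative value only.

```python
from math import comb


def majority_accuracy(n: int, p: float) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Each voter is correct with probability p on a binary question; n must be
    odd so that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd to avoid ties")
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


single = majority_accuracy(1, 0.6)      # one voter
trio = majority_accuracy(3, 0.6)        # three voters
ensemble = majority_accuracy(31, 0.6)   # 31 voters, as in the persona ensemble
```

In this idealised setting accuracy rises with ensemble size whenever p exceeds 0.5; correlated errors between voters reduce the gain, which is why less correlated personas can help more under majority voting.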
SLPGFJWUInsa at SemEval-2026 Task 1: Enhancing Linguistic Creativity for English Text-Based Humor
Insa Abbas | Sadaf Abdul Rauf
For Subtask A, our main goal is to create a joke generating system that focuses on humor generation under constrained conditions using unusual words and news headlines as input. We trained our model on LLM-generated and human-curated augmented data aimed to produce constrained humor and to bridge the gap between the two. We demonstrate that using parameter-efficient fine-tuning (PEFT) on high-quality pre-trained base models in conjunction with a well-crafted prompt design allows our model to produce high-quality innovative output while maintaining the desired style.
ConTexT at SemEval-2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Stories through Narrative Understanding
Fakeha Faisal | Rubab Shah | Syeda Zaidi | Azkaa Nasir | Sandesh Kumar | Abdul Samad
In this paper, we report our system for SemEval-2026 Task 5, which predicts graded plausibility scores for target word senses in narrative context. We explore embedding-based similarity, transformer fine-tuning, and a three-stage curriculum combining WiC pretraining, Wasserstein distribution learning, and KL-based calibration. Our best model, DeBERTa-XLarge with curriculum training, achieves 78% accuracy within one standard deviation and a Spearman correlation of 0.70, with an overall test score of 0.74. Results show that distribution modeling better aligns with human plausibility judgments than single-score prediction.
TeleAI at SemEval-2026 Task 3: Large Language Models for Dimensional Aspect-Based Sentiment Analysis
Yan Zhou | Wangshicheng Wang | Shiquan Wang | Mengjiao Bao | Ruiyu Fang | Shuangyong Song | Yongxiang Li | Xuelong Li
This paper describes TeleAI’s system for SemEval-2026 Task 3, Track A, Subtask 1 (DimASR), which focuses on predicting continuous Valence-Arousal (VA) scores for specific aspects in text. We frame this task as an end-to-end regression problem and propose a robust framework utilizing Qwen2.5-7B as the feature extraction backbone, combined with parameter-efficient fine-tuning via LoRA. To enhance model generalization and mitigate domain shifts, we primarily leverage multilingual and multi-domain mixed training. Furthermore, our system integrates several optimization and robustness techniques to stabilize continuous score prediction, including R-Drop-style consistency regularization, embedding-level PGD adversarial training, Smooth L1 (Huber) loss, sigmoid-based output interval mapping, and post-hoc linear calibration. Our comprehensive ablations demonstrate that the combination of joint training and robustness regularizations substantially reduces the official evaluation metric, $RMSE_{VA}$. The proposed system achieves highly competitive performance across multiple language and domain settings, demonstrating the efficacy of applying lightweight LLM adaptation for dimensional aspect-based sentiment analysis.
L3IRIT at SemEval-2026 Task 4: Learning Narrative Similarity from Aligned Film Plot Summaries
Ahmed Hamdi | Emanuela Boros | Jose G. Moreno | Adam Jatowt | Georgeta Bordea | Carlos-Emiliano González-Gallardo | Antoine Doucet
This paper presents the participation of the L3IRIT team in SemEval Task 4. The team is a joint research group working on narrative extraction from historical text, led by the IRIT laboratory (University of Toulouse) and the L3i laboratory (University of La Rochelle). Our participation is grounded in the construction of a novel bilingual resource extracted from Wikipedia by automatically aligning film plots. Leveraging this dataset, we train embedding models using contrastive learning objectives to capture higher-level narrative structures more effectively. The resulting resource goes beyond surface-level lexical overlap, providing supervision for narrative similarity without manual annotation. In addition, we introduce a named-entity masking strategy designed to promote narrative abstraction and reduce superficial entity-based matching. Overall, our approach aims to support representation learning that captures structural and event-level similarities across stories in different languages more effectively. Our system ranked 24th of 44 on the Task A scoreboard and 20th of 27 on the Task B scoreboard, achieving accuracies of 65.75 and 61.00, respectively.
YEZE at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization via Heterogeneous Ensembling
Fengze Guo | Yue Chang
We present a multilingual system for SemEval-2026 Task 9 on detecting and characterizing online polarization across languages, cultures, and events. Our approach participates in all three subtasks and models each subtask independently using a heterogeneous weighted ensemble of XLM-RoBERTa-large and mDeBERTa-v3-base. For the multi-label settings, we adopt weighted binary cross-entropy to mitigate severe label imbalance. The system is trained exclusively on the provided task data and achieves robust performance across languages.
ChulaNLP at SemEval-2026 Task 6: A Hybrid BERT-LLM Framework for Political Response Clarity and Evasion Detection
Wisarut Peerachaidecho | Attapol Rutherford
SemEval-2026 Task 6 (CLARITY: Unmasking Political Interview) focuses on detecting equivocation and evasion techniques in political interviews. While encoder-only models and Large Language Models (LLMs) individually struggle with this task, we propose a hybrid BERT–LLM framework to leverage their complementary strengths: the discriminative efficiency of fine-tuned encoders and the sophisticated reasoning of LLMs. We benchmarked several long-context architectures—DeBERTa, RooseBERT, and BigBird—finding that a truncated DeBERTa-large provided the most reliable candidates for the LLM. By using DeBERTa’s top-5 predicted labels as constrained options for LLM inference, we significantly improved evasion-level classification. This hybrid approach achieved competitive rankings in the shared task, placing 7th in Subtask 1 and 2nd in Subtask 2.
UTRAG at SemEval-2026 Task 8: History-Aware Query Rewriting and LoRA-Finetuned Generation for Multi-Turn RAG
Ke Zhou | Yi-Shan Lin
This paper describes our system for SemEval-2026 Task 8: Evaluating Multi-Turn RAG Conversations (MTRAGEval), which evaluates retrieval-augmented generation (RAG) in multi-turn, context-dependent settings. We improve retrieval with history-aware query rewriting and enhance generation faithfulness with a LoRA-adapted model, integrating both into an end-to-end pipeline. Our approach achieves competitive performance across all subtasks, with nDCG@5 of 0.4855 in Subtask A, a harmonic mean score of 0.6554 in Subtask B, and 0.5159 in Subtask C, outperforming strong baselines in Subtasks A and B while remaining competitive in Subtask C. Our analysis shows that increasing dialogue length introduces cumulative errors in history selection and query formulation, leading to incomplete or drifting retrieval results and increasing the risk of hallucination.
CLaC at SemEval-2026 Task 6: Response Clarity Detection in Political Discourse
Nawar Turk | Lucas Miquet-Westphal | Leila Kosseim
In this paper, we present our system for SemEval-2026 Task 6 (CLARITY) on response clarity and evasion detection in question-answer pairs from U.S. presidential interviews, comparing fine-tuned encoders with prompt-based LLMs. Our LLM ensemble achieves 80 macro-F1 on the 3-class Task 1 (9th/41) and 59 on the 9-class Task 2 (3rd/33). Across 8 transformer encoders optimized through a four-stage pipeline, partial encoder layer unfreezing outperforms full fine-tuning by a wide margin. Combining English and multilingual encoders further improves ensemble performance over either family alone, despite multilingual models being individually weaker. Prompt-based LLMs, without any task-specific parameter updates, outperform fine-tuned encoders, particularly on minority classes; among open-weight LLMs, parameter count does not predict performance. Enriched input, concatenating the full interviewer turn, improves LLM performance but not that of encoders, an effect that persists with Longformer’s extended context window, suggesting the divergence is not attributable to sequence-length capacity alone in our settings. The Clear Reply/Ambivalent boundary remains the dominant failure mode, mirroring the disagreement among human annotators. Our code, prompts, model configurations, and results are publicly available.
Semantic Vectors at SemEval-2026 Task 9: Robust Multilingual Polarization Detection via Dual-Encoder Fusion and Expert Ensembling
Ankit Dash | Priyanshu Mittal | Piyush Prashant | Sunil Saumya
We present SEMANTIC VECTORS, our system for POLAR@SemEval-2026 Task 9 on multilingual online polarization detection across 22 typologically diverse languages. Polarization is frequently conveyed through implicit rhetorical framing, making cross-lingual detection highly challenging. We address this with a Siamese dual-encoder jointly fine-tuning mDeBERTa-v3-base and XLM-RoBERTa-large via 4-bit QLoRA, fused with language-specific expert models (GBERT, Italian BERT, Swahili BERT) through an XGBoost meta-stacker with per-language Platt calibration. In our system, focal loss functions less as a remedy for class imbalance than as a hard-example miner, concentrating gradients on subtly framed instances instead of lexically obvious ones. Combined with per-language threshold optimization, our system achieves macro-F1=0.797 and accuracy=0.827 across all 22 languages.
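The hard-example-mining reading of focal loss in the abstract above can be made concrete with the standard binary focal loss, FL = -(1 - p_t)^gamma * log(p_t). The sketch below is a plain-Python illustration, not the team's implementation; the focusing parameter gamma = 2.0 and the example probabilities are assumptions chosen for illustration.

```python
import math


def binary_focal_loss(p: float, y: int, gamma: float = 2.0) -> float:
    """Standard binary focal loss for one example.

    p     : predicted probability of the positive class
    y     : gold label (0 or 1)
    gamma : focusing parameter (assumed value; gamma = 0 gives cross-entropy)
    """
    p_t = p if y == 1 else 1.0 - p      # probability assigned to the gold label
    p_t = min(max(p_t, 1e-12), 1.0)     # guard against log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# An easy example (p_t = 0.9) is down-weighted by (0.1)^2 = 0.01,
# a hard example (p_t = 0.3) only by (0.7)^2 = 0.49.
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.3, 1)
print(easy, hard)
```

Because the modulating factor shrinks the loss of confidently correct predictions far more than that of uncertain ones, gradient mass shifts toward the harder instances regardless of class frequency.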
NCL-BU at SemEval-2026 Task 3: Fine-tuning XLM-RoBERTa for Multilingual Dimensional Sentiment Regression
Tong Wu | Nicolay Rusnachenko | Huizhi(elly) Liang
Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA from categorical polarity labels to continuous valence–arousal (VA) regression. This paper describes a system developed for Track A, Subtask 1 (Dimensional Aspect Sentiment Regression), aiming to predict real-valued VA scores in the [1, 9] range for each given aspect in a text. A fine-tuning approach based on XLM-RoBERTa-base is adopted, using dual regression heads with sigmoid-scaled outputs for valence and arousal prediction. Separate models are trained for each language–domain pair (English and Chinese across restaurant, laptop, and finance domains), and training and development sets are merged for final test predictions. In development experiments, the fine-tuning approach is compared against several large language models under a few-shot prompting setting, demonstrating that task-specific fine-tuning outperforms these LLM-based methods across all evaluation datasets.
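A sigmoid-scaled output of the kind described above can be sketched as an affine map of the sigmoid onto the target interval. This is a minimal illustration assuming the head emits one unbounded logit per dimension; the function names are hypothetical, and only the [1, 9] range comes from the abstract.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def scale_to_range(logit: float, low: float = 1.0, high: float = 9.0) -> float:
    """Map an unbounded logit to a score in [low, high]."""
    return low + (high - low) * sigmoid(logit)


# One hypothetical pair of raw head outputs for (valence, arousal).
valence_logit, arousal_logit = 0.8, -0.3
print(scale_to_range(valence_logit), scale_to_range(arousal_logit))
```

Bounding the prediction this way means the regressor cannot emit scores outside the annotation scale, at the cost of vanishing gradients near the interval ends.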
SokraTUM at SemEval-2026 Task 3: A hybrid cascade of Label Distribution Learning, RAG supported generative extraction and contrastive metric learning for dimensional sentiment analysis
Denis Laschenko | Albert Korotyk
The Dimensional ABSA (DimABSA) shared task extends traditional aspect-based sentiment analysis from categorical polarity to continuous valence–arousal (VA) prediction. We present our system for all three subtasks: Dimensional Aspect Sentiment Regression (DimASR), Dimensional Aspect Sentiment Triplet Extraction (DimASTE), and Dimensional Aspect Sentiment Quad Prediction (DimASQP). Due to the cascading nature of the different subtasks, we built a modular interlocking pipeline that uses classical Machine Learning and NLP methods. Experiments across domains show consistent gains in regression accuracy and structured extraction performance. Our results demonstrate the effectiveness of distribution-aware regression, retrieval-augmented generation, and contrastive prototype learning for dimensional sentiment analysis.
NCL-UoR at SemEval-2026 Task 5: Embedding-Based Methods, Fine-Tuning, and LLMs for Word Sense Plausibility Rating
Tong Wu | Thanet Markchom | Huizhi(elly) Liang
Word sense plausibility rating requires predicting the human-perceived plausibility of a given word sense on a 1–5 scale in the context of short narrative stories containing ambiguous homonyms. This paper systematically compares three approaches: (1) embedding-based methods pairing sentence embeddings with standard regressors, (2) transformer fine-tuning with parameter-efficient adaptation, and (3) large language model (LLM) prompting with structured reasoning and explicit decision rules. The best-performing system employs a structured prompting strategy that decomposes evaluation into narrative components (precontext, target sentence, ending) and applies explicit decision rules for rating calibration. The analysis reveals that structured prompting with decision rules outperforms both fine-tuned models and embedding-based approaches, and that prompt design matters more than model scale for this task.
RAGonauts at SemEval-2026 Task 8: BM25 Retrieval with Last-Turn Query Formulation for Multi-Turn RAG Conversations
Rajalakshmi Sivanaiah | Angel Deborah S | Karthik Raja C | Rithika S
This paper describes the submission to Task A of SemEval-2026 Task 8: MTRAGEval, which evaluates passage retrieval for multi-turn Retrieval-Augmented Generation (RAG) conversations across multiple knowledge domains. The task requires retrieving relevant supporting passages given conversational history, where user queries often contain implicit references and incomplete contextual information. This paper proposes a lightweight and training-free retrieval framework based on BM25 ranking combined with conversational query formulation. Queries are derived from dialogue turns and retrieval is performed using domain-specific indices to preserve corpus relevance. Without neural retrievers or fine-tuning, our system achieves an nDCG@5 score of 0.2836 on the official evaluation set, ranking 33rd on the leaderboard. This result demonstrates that sparse lexical retrieval remains an efficient and reproducible baseline for conversational RAG systems.
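BM25 ranking with a last-turn query, as named in the title and abstract above, can be sketched in a few lines. The code below is a generic Okapi BM25 scorer with whitespace tokenization, not the team's system; the parameter values k1 = 1.5 and b = 0.75, the toy passages, and the toy conversation are all assumptions for illustration.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            n_t = doc_freq[term]
            idf = math.log((n_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical passages and conversation; the query is simply the last user turn.
passages = [
    "the loan interest rate is fixed for five years",
    "cloud storage pricing depends on region and tier",
    "early repayment of the loan may incur a fee",
]
turns = ["tell me about the loan", "is there a fee for early repayment"]
scores = bm25_scores(turns[-1], passages)
best = max(range(len(passages)), key=scores.__getitem__)
print(best, scores)
```

Using only the last turn keeps the query short and lexically focused, but it cannot resolve references to earlier turns, which is one plausible reason such a baseline trails history-aware systems.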
0704mis at SemEval-2026 Task 11: Single-Call Joint Abstraction for Robust Neuro-Symbolic Retrieval
Ishita Gupta | Dhruv Goyal | Jatin Bedi
Neuro-symbolic Basis for Robust Syllogistic Reasoning Under Distractors. We present our submission to SemEval-2026 Task 11 Subtasks 2 and 4, on syllogistic premise retrieval with distractors. Our system is based on a robustness-first neuro-symbolic pipeline. The key innovation is single-call joint abstraction: rather than parsing all statements independently, one LLM call jointly abstracts all premises and the conclusion into categorical logical forms (A/E/I/O) where symbolic (X/Y/Z) mappings are globally consistent. This allows reliable detection of the shared middle term needed for syllogistic validation. Parsed forms are passed through an exhaustive O(n²) premise-pair search with deterministic validation against the 24 valid Aristotelian syllogistic forms via constant time lookup. Ablation studies show that more theoretically sophisticated variants degrade performance when logical-form extraction is the primary bottleneck. Our approach achieves competitive rankings in both English and multilingual settings while remaining simple, deterministic, and content-invariant.
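The symbolic stage described above, an exhaustive pair search validated by table lookup, might look roughly like the following. This is a sketch under assumptions, not the authors' code: statements are assumed to be already abstracted into (form, subject, predicate) triples, the function names are hypothetical, and the table is the traditional list of 24 mood-figure combinations, which includes the subaltern moods that rely on existential import.

```python
from itertools import combinations

# Traditional valid moods (major, minor, conclusion) per figure: 6 each, 24 total.
VALID_MOODS = {
    1: {"AAA", "EAE", "AII", "EIO", "AAI", "EAO"},
    2: {"EAE", "AEE", "EIO", "AOO", "EAO", "AEO"},
    3: {"AAI", "IAI", "AII", "EAO", "OAO", "EIO"},
    4: {"AAI", "AEE", "IAI", "EAO", "EIO", "AEO"},
}


def figure_of(major, minor, conclusion):
    """Return the figure (1-4) if the triple has syllogistic shape, else None."""
    _, s, p = conclusion            # conclusion is "S ... P"
    _, maj_subj, maj_pred = major
    _, min_subj, min_pred = minor
    if maj_pred == p and min_subj == s and maj_subj == min_pred:
        middle, fig = maj_subj, 1   # M-P, S-M
    elif maj_subj == p and min_subj == s and maj_pred == min_pred:
        middle, fig = maj_pred, 2   # P-M, S-M
    elif maj_pred == p and min_pred == s and maj_subj == min_subj:
        middle, fig = maj_subj, 3   # M-P, M-S
    elif maj_subj == p and min_pred == s and maj_pred == min_subj:
        middle, fig = maj_pred, 4   # P-M, M-S
    else:
        return None
    return fig if s != p and middle not in (s, p) else None


def find_valid_pairs(statements, conclusion):
    """Check every premise pair, in both major/minor orders, against the table."""
    hits = []
    for i, j in combinations(range(len(statements)), 2):
        for major_idx, minor_idx in ((i, j), (j, i)):
            major, minor = statements[major_idx], statements[minor_idx]
            fig = figure_of(major, minor, conclusion)
            if fig is None:
                continue
            mood = major[0] + minor[0] + conclusion[0]
            if mood in VALID_MOODS[fig]:   # set membership: constant-time lookup
                hits.append((major_idx, minor_idx, mood, fig))
    return hits


# Hypothetical abstracted input: one distractor plus "All Y are Z", "All X are Y".
statements = [("E", "Q", "R"), ("A", "Y", "Z"), ("A", "X", "Y")]
print(find_valid_pairs(statements, ("A", "X", "Z")))
```

Because validity is decided purely on the abstract symbols, the lookup is unaffected by how plausible the original sentences sound, which is the content-invariance property the abstract emphasizes.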
CiNet-Handai-Kyodai at SemEval-2026 Task 5: Combining LLM Prompting, Semantic Similarity, and Synthetic Gaze for Graded Sense Plausibility
Lis Kanashiro Pereira | Fei Cheng
We present a hybrid system for SemEval-2026 Task 5 on graded word-sense plausibility in narrative contexts. Our approach combines prompt-based large language model (LLM) scoring with three complementary features: semantic embedding similarity, story-conditioned definition generation, and a synthetic gaze signal based on predicted fixation time. We combine these signals using an ordinary least squares regressor. On the official test set, our best system achieves 90.10 Acc±SD and 79.19 Spearman correlation. The system surpasses the reported human reference score on Acc±SD, highlighting the value of combining LLM-based judgments with targeted linguistic and cognitive-inspired features.
SRCB at SemEval-2026 Task 5: A Multi-Target Finetuning Framework for Large Language Models with Joint Regression and Text Generation
Yuming Zhang | Junyu Zhou | Hongyu Li | Yongwei Zhang | Shanshan Jiang | Bin Dong
This paper presents our winning system for SemEval-2026 Task 5 on rating the plausibility of word senses in ambiguous stories. Unlike traditional Word Sense Disambiguation, the task requires predicting continuous plausibility scores that reflect human variability rather than selecting a single correct sense. We propose a multi-target fine-tuning framework for decoder-only large language models that jointly optimizes regression for score prediction and text generation for interpretable explanations. To address inter-annotator variability, we adopt objective-level strategies to enhance robustness. Our system achieves first place, demonstrating the effectiveness of unified regressive–generative modeling for fine-grained plausibility estimation.
ICT-NLP at SemEval-2026 Task 1: Humor Generation via RAG-based Augmentation and Multi-LLM Internal-External Voting
Wutao Shen | Liyuan Huang | Jiawei He | Lin Li | Jin Zhang
This paper presents the system we developed for SemEval-2026 Task 1: Humor Generation. The task focuses on developing systems capable of generating genuinely humorous content under various constraints. In this work, we propose using a Retrieval-Augmented Generation approach to preprocess news headlines and obtain summaries of news content. Furthermore, we employ a unified humor generation mode to adapt to the two types of generation constraints. Finally, we conduct an internal-external voting process to produce the final optimal joke output. Our approach achieves competitive performance in this task: it ranks 1st (tied) among all participating teams in the Chinese track of Subtask A.
ThinkVision at SemEval-2026 Task 6: A Transformer-Based Ensemble System for Clarity Detection
Purohit Ghanshyam | Praveen Swami | Shriyans Sahoo | Jenish Bhati | Supriya Nadiger | Sunil Saumya
We study the problem of assessing the clarity of political question–answer pairs, where the goal is to determine whether a response directly addresses the question, avoids it, or remains ambiguous. This task is particularly challenging in political discourse, where evasiveness can be subtle and context-dependent. To tackle this problem, we propose an ensemble-based approach built on the transformer-based model DeBERTa-v3-base, fine-tuned on concatenated question–answer inputs. Special attention is given to class imbalance during training to ensure robust performance across all categories. To better capture uncertainty in borderline cases, we train multiple models with different random seeds and employ Monte Carlo Dropout at inference time. Final predictions are obtained by averaging logits across ensemble models and stochastic forward passes, yielding more stable and robust predictions. Our system achieves a Macro-F1 score of 0.76 on the evaluation dataset. Error analysis reveals that responses that partially engage with the question while failing to provide a direct answer remain the most challenging, highlighting the inherent difficulty of detecting nuanced evasiveness in political communication.
Team YTY at SemEval 2026 task 12: Option-Aware Retrieval and Cross-Encoder Reasoning Framework for Abductive Event Reasoning
Junxin Lin | Zhichao Meng | Lianxin Jiang
We describe a unified system for SemEval-2026 Task 9 on multilingual polarization detection. The task requires binary polarization detection, multi-label target type classification, and multi-label manifestation identification across languages and events with severe class imbalance. Our approach combines (i) targeted data augmentation for low-frequency labels, (ii) merged multitask fine-tuning of Subtask 2 and Subtask 3, and (iii) model fusion to improve cross-lingual stability. Subtask 1 predictions are derived via calibrated inference from the multi-label head. On the development set, multitask training consistently outperforms single-task variants, and fusion yields additional gains, especially for rare labels. We also report ablations and error analyses, highlighting remaining challenges such as implicit polarization and partial-label uncertainty.
This article presents our study on task 10: Psycholinguistic conspiracy marker extraction and detection, which includes token-level extraction tasks and sentence-level conspiracy detection tasks. Focusing on conspiracy theory texts on social media, this paper proposes a classification method that combines semantic encoding with large language model reasoning and generation. Semantic features are extracted using DeBERTa-v3, and explanatory reasoning text is generated through ConspEmoLLM-v2. The two are then combined for classification, thereby enhancing the model’s ability to recognize implicit conspiratorial logic. For the extraction subtask, this paper provides systematic comparison results of several mainstream pre-trained models, mainly conducting baseline model comparisons and performance analysis.
SCUZANE at SemEval-2026 Task 3: Dimensional Aspect-based Sentiment Analysis with Supervised Contrastive Regression and R-Drop Regularization
Ziang Zhou | Xiangmei He | Chenhongyi Bai
Current Aspect-Based Sentiment Analysis (ABSA) often relies on coarse-grained categorical labels, such as Positive and Negative, which often fail to capture the subtle intensity of emotional expression in real-world text. To address this issue, the SemEval-2026 Shared Task 3: Dimensional ABSA (DimABSA) extends traditional ABSA by replacing categorical sentiment polarity with continuous valence-arousal (VA) scores. In this paper, we describe our system for Subtask 1 (Dimensional Aspect Sentiment Regression) of Track A (DimABSA). Our system utilizes a DeBERTa-v3-large backbone, enhanced by a prompt-based learning strategy that concatenates aspect information with the context. We also employ multi-sample dropout and a weighted aggregation of the hidden states from the last four layers to capture rich semantic representations. Our experimental results across all provided domains and languages demonstrate the effectiveness of integrating consistency regularization with dimensional contrastive learning for fine-grained sentiment regression.
AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning
Nikolaos Karafyllis | Maria Lymperaiou | Giorgos Filandrianos | Athanasios Voulodimos | Giorgos Stamou
We present a winning three-stage system for SemEval 2026 Task 12: Abductive Event Reasoning that combines graph-based retrieval, LLM-driven abductive reasoning with prompt design informed by reflective prompt evolution, and post-hoc consistency enforcement; our system ranks first on the evaluation-phase leaderboard with an accuracy score of 0.95. Cross-model error analysis across 14 models (7 families) reveals three shared inductive biases: causal chain incompleteness, proximate cause preference, and salience bias, whose cross-family convergence (51% cause-count reduction) indicates systematic rather than model-specific failure modes in multi-label causal reasoning.
SilkPeak at SemEval-2026 Task 6: When Politicians Dodge — Unmasking Evasion in Political Interviews through Joint Multi-Task Transformer Learning
Amruth Tetakali | Lavanya Tetakali
This paper describes a system for SemEval-2026 Task 6 (CLARITY), which focuses on recognizing evasive communication in political interviews. The approach treats the two subtasks, determining the clarity level of an answer and classifying it within the evasion taxonomy, as a single joint multi-task problem. A DeBERTa-v3-Large encoder is shared across both tasks, processing the question and answer as a single concatenated sequence. By updating independent linear classification heads simultaneously, the model allows the fine-grained learning signals from the evasion taxonomy to directly inform the broader clarity-level decisions, and vice versa. On the official evaluation set, this joint discriminative system achieves a 0.76 macro F1 score on Task 1. This approach significantly outperforms standard single-task baseline models, hierarchical bi-encoding architectures, and generative large language models like LoRA-tuned LLaMA-3-8B.
5ting at SemEval-2026 Task 8: Strong End-to-End Multi-Turn RAG via LLM-Based Reranking and Faithfulness Control
Thien-Qua T-Nguyen | Chi Hoang | Nguyen Tran | Tri Le | Khanh Truong | Chinh Nguyen
This paper presents a modular multi-turn Retrieval-Augmented Generation (RAG) system designed to mitigate hallucination, context drift, and underspecification. The pipeline combines dual-query merged retrieval and LLM-based reranking to deliver high-precision evidence, improving nDCG@5 by 17.7%. To strictly control hallucination during generation, we introduce a role-separated prompting strategy. This approach explicitly isolates the conversation history (used solely for intent and coreference resolution) from the retrieved passages (enforced as the exclusive source of factual grounding). By preventing the language model from misinterpreting prior dialogue turns as factual evidence, the system ranked 3/29 in the SemEval-2026 Task 8 end-to-end evaluation. Notably, our faithfulness-oriented design achieved a high ROUGE-L F1 score of 0.7692, outperforming larger baselines and demonstrating that explicit grounding constraints are highly effective at ensuring lexical faithfulness and reducing hallucinations.
TeamOmega at SemEval-2026 Task 13: Frozen vs. Trainable Representations for Out-of-Distribution AI-Generated Code Detection: A CodeBERT Fine-Tuning Study
Nahid Niyaz Shovon | Md. Naim Parvez
We propose a CodeBERT-based system for detecting AI-generated code under severe cross-language and cross-domain distribution shift. Our approach conducts a controlled comparison between a fully frozen backbone and a partially fine-tuned configuration that unfreezes only the final transformer layer with discriminative learning rates. While partial fine-tuning substantially improves in-domain performance, the frozen backbone demonstrates stronger robustness under out-of-distribution evaluation. Our results highlight a trade-off between task adaptation and cross-language generalization in machine-generated code detection.
Pixel Phantoms at SemEval-2026 Task 13: Exploring Classical and Neural Approaches for AI-Generated Code Detection
Jithu Morrison S | Janani Hariharakrishnan | Angel Deborah S | Rajalakshmi S
This paper describes our system for SemEval-2026 Task 13, Subtask A: detecting whether a given code snippet is AI-generated or human-written. We explored a range of approaches from classical machine learning baselines using TF-IDF representations to fine-tuned transformer models pre-trained on code, specifically CodeBERT and GraphCodeBERT. Our experiments revealed a notable degradation in model performance when CodeBERT was trained beyond an optimal number of steps, indicating that continued training within an epoch leads to overfitting or representation drift. GraphCodeBERT, by contrast, yielded our best submission with a macro F1 score of 0.36866. Our findings highlight the sensitivity of code-specific transformers to training duration and suggest that early checkpoint selection is critical for this task.
schmerle at SemEval-2026 Task 4: Exploring Large Language Model Prompting Strategies for Low-Resource Narrative Similarity Detection
Maximilian Schmerle | Nils Constantin Hellwig
Narrative similarity detection has broad applications in plagiarism detection, content recommendation, and comparative narrative analysis. We present a training-free, prompting-only framework for SemEval-2026 Task 4 (Track A), which requires identifying which of two candidate stories is narratively more similar to a given anchor story. Without any fine-tuning or additional annotations, we systematically evaluate three prompt templates across five structural prompting strategies, including zero-shot and few-shot inference, narrative summarization, keyword extraction, aspect splitting, and pairwise comparison. Structured prompt templates and decomposed pairwise comparisons consistently outperform baseline configurations, achieving a peak accuracy of 72.50% on the test set and 67.75% on the final leaderboard (23rd out of 44 teams).
Team UBSE at SemEval-2026 Task 4: Adapting Generalist Embeddings for Narrative Representations
Marius Marogel | Marius Popescu
The Narrative Story Similarity and Narrative Representation Learning (NSNRL) task measures the narrative similarity between two stories based on three core aspects: the abstract theme, the course of action, and the outcomes. Our system leverages LLMs to extract high-level aspects, which are then encoded with state-of-the-art generalist embedding models. We then apply a series of embedding post-processing steps and learn to fit the embedding space with a Mahalanobis-like diagonal metric. We show that some of these techniques should not be applied universally, as, depending on the base encoder, they do not necessarily increase performance and may overfit. Our system outperforms the baseline only in Track B, ranking twelfth out of twenty-seven on the final leaderboard, while falling below the baseline accuracy in Track A.
This paper presents our solution for Subtask 2, which focuses on the automated detection of conspiracy in text. Unlike traditional toxic text detection, conspiracy identification is particularly challenging as it often lacks explicit semantic indicators. To address this, we leveraged a Large Language Model (LLM) as our backbone and employed Low-Rank Adaptation (LoRA) for fine-tuning to enhance detection performance. To generate probabilistic confidence scores while maintaining the generative capabilities of the LLM, we implemented a hybrid loss function that integrates both generative and token classification losses. Additionally, semi-supervised learning with unlabeled data was incorporated to further refine classification accuracy. Our approach achieved a test accuracy of 0.87, ranking 2nd in both stages of the competition leaderboard.
Taien at SemEval-2026 Task 9: Multilingual Polarization Detection Using Transformer-based Models
Saida Taien | Palash Hossen
This submission describes a multilingual polarization detection system for SemEval-2026 Task 9. The system leverages parallel fine-tuning of XLM-RoBERTa and mDeBERTa-v3 transformer models with a probability-level ensemble to improve prediction reliability. We employ language-independent preprocessing, subword tokenization, and a standardized classification head for all 22 languages to ensure a consistent modeling framework across the multilingual setting. Experimental results demonstrate strong performance on both high-resource and low-resource languages, highlighting the effectiveness of the ensemble approach in stabilizing predictions and improving multilingual polarization detection.
Clutch or Cry at SemEval-2026 Task 12: Offline Retrieval-Augmented Generation with Frozen DeBERTa for Abductive Event Reasoning
Aayush Prasad | Rudra Trivedi | Arshad Khatib | Shrikant Malviya | Naveen Kumar
We present our system for SemEval-2026 Task 12 on abductive event reasoning. Initial experiments with direct fine-tuning of large language models suffered from severe overfitting due to limited training data, while smaller models failed under context-length constraints, leading to random guessing under the strict Exact Match evaluation metric. To address these challenges, we propose a two-stage offline Retrieval-Augmented Generation (RAG) pipeline that separates semantic evidence retrieval from multi-label classification. We employ a dense retriever (all-MiniLM-L6-v2) to extract the single most relevant sentence (top-k=1) and feed it into a partially frozen DeBERTa-v3-Large classifier trained with BCEWithLogitsLoss. Freezing the lower 12 layers effectively mitigates overfitting while preserving pre-trained semantic knowledge. Our approach eliminates long-context truncation issues, reduces hallucination, and achieves a final Exact Match accuracy of 0.72 on the official test set.
transformer_1376 at SemEval-2026 Task 9: A Multi-Stage Pipeline with Calibrated Ensembles and Lexical Post-Processing for Online Polarization Detection in Bengali
Shuvodwip Saha | Pritha Saha
POLAR @ SemEval-2026 Task 9 addresses the detection of online polarization across multilingual and multicultural settings. Our team participated in Subtask 1, a binary classification task for detecting polarized stances in text. In this paper, we propose a classification system for Bengali based on fine-tuning the BanglaBERT Large model. Our methodology combines stratified five-fold cross-validation with a performance-weighted ensemble, temperature-scaling probability calibration, and a set of lexical post-processing rules.
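Temperature scaling divides the logits by a scalar before the softmax; the temperature is typically fitted on held-out data. The sketch below only shows the scaling step, with hypothetical logits and temperature values rather than anything reported by the authors.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (numerically stabilized)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for [polarized, not polarized].
logits = [2.0, 0.0]
uncalibrated = softmax_with_temperature(logits, temperature=1.0)
calibrated = softmax_with_temperature(logits, temperature=2.0)
print(uncalibrated, calibrated)
```

A temperature above 1 softens the distribution without changing the predicted class, which is why it affects confidence estimates but not accuracy.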
Team Yuvan at SemEval-2026 Task 13: Task-Adaptive Ensemble Strategies for AI-Generated Code Detection
Yuvan Ramesh | Tongtong Wu
We describe our system for SemEval-2026 Task 13 on detecting machine-generated code across eight programming languages and three subtasks: binary human-vs-AI detection, 11-way source identification, and 4-way generator classification. Our approach uses a task-specific combination of Qwen2.5-Coder-1.5B with LoRA fine-tuning, abstract syntax tree (AST) features, CodeBERT with head-tail chunking, and TF-IDF features. Experiments reveal three main findings. For Task A, neural detectors degrade markedly on the official test split, while AST-based structural features remain more stable, suggesting substantial distribution shift. For Task B, inverse-frequency class weighting is essential under extreme label imbalance and greatly improves macro-F1. For Task C, combining neural and statistical models performs better than relying on a single model alone, indicating complementary strengths across representations. Our final system achieves 0.638 macro-F1 on Task A, 0.449 macro-F1 on Task B, and 0.714 macro-F1 on Task C, offering practical insights into robustness, imbalance handling, and model complementarity for AI-generated code detection.
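One common form of inverse-frequency class weighting assigns each class the weight N / (K * n_c), where N is the number of samples, K the number of classes, and n_c the class count. The sketch below uses that convention as an assumption; the abstract does not give the exact formula, and the label names are invented.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Hypothetical, heavily imbalanced label list.
labels = ["human"] * 8 + ["generator_x"] + ["generator_y"]
class_weights = inverse_frequency_weights(labels)
print(class_weights)
```

Rare classes receive proportionally larger weights, so their errors count more in a weighted loss, which tends to help macro-averaged metrics.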
MedHastra at SemEval-2026 Task 13: Stylometric Ensembles and Transformer Fine-Tuning for Robust AI Code Detection, Attribution, and Adversarial Analysis
Shruti Chandrasekar | Vedajanaani R S | Vijayalakshmi P
This paper describes Team MedHastra’s submission to SemEval-2026 Task 13 on detecting machine-generated code across diverse programming languages, generators, and application scenarios. We participated in all three subtasks: (A) binary detection of AI-generated code under out-of-distribution conditions, (B) multi-class attribution across ten large language model families, and (C) classification of human, fully AI-generated, hybrid, and adversarial code. For Subtask A, we implemented a stylometric ensemble combining structural formatting features with word- and character-level TF-IDF representations, trained using Random Forest, Gradient Boosting, and Logistic Regression with soft voting. For Subtasks B and C, we fine-tuned CodeBERT to leverage contextual code representations, incorporating class balancing strategies such as downsampling and weighted cross-entropy. Our results demonstrate that handcrafted stylometric features struggle under strong distribution shift, while transformer-based contextual modeling is more effective for fine-grained attribution and hybrid/adversarial detection. The study highlights the importance of robust contextual representations for realistic AI-assisted programming scenarios.
Team Duo at SemEval-2026 Task 13: Fine-tuning CodeBERT for Out-of-Distribution AI-Generated Code Detection
Subhiksha G | Sanjai M | Rajalakshmi Sivanaiah | Angel Deborah S
This paper addresses detecting AI-generated code in out-of-distribution settings by fine-tuning CodeBERT on algorithmic code from C++, Python, and Java. While the model achieves near-perfect performance on training data (F1 = 0.9935), it degrades significantly on unseen languages and domains (F1 = 0.3532). The high recall (0.8789) but low precision (0.2210) indicates over-prediction of machine-generated code. Error analysis reveals three failure modes: domain mismatch, unfamiliar syntax patterns, and insufficient training. Multi-epoch training and domain-specific augmentation are needed to improve OOD generalization.
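The reported out-of-distribution F1 follows directly from the stated precision and recall, since F1 is their harmonic mean. A small check, using only the numbers given in the abstract:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the out-of-distribution setting.
ood_f1 = f1_score(0.2210, 0.8789)
print(round(ood_f1, 4))
```

Rounded to four decimal places this matches the reported 0.3532, and it shows how low precision dominates the harmonic mean even when recall is high.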
Segmentation Fault at SemEval-2026 Task 13: A Regularization-First Approach with Generator-Based Out-of-Distribution Splits for Detecting AI-Generated Code
Lakshmi Priya Swaminatha Rao | Dhannya Santhakumari Madhavan | Sreya Kodeswaran | Nithila R | Kanmani R
This paper describes our submission to SemEval-2026 Task 13 (Subtask A) on detecting AI-generated code. We fine-tune CodeBERT-base using a generator-aware out-of-distribution (OOD) validation split to better simulate unseen test generators. Strong regularization techniques, including stochastic data augmentation, dropout, weight decay, and label smoothing, are applied to prevent overfitting to generator-specific patterns. Experiments with logistic regression, UniXcoder, and vanilla CodeBERT reveal that evaluation design has a larger impact on generalization than model scale or training data volume. Our final system achieves a macro F1 score of 0.439 on the hidden test set, representing a 62% relative improvement over unregularized baselines.
TechSSN at SemEval-2026 Task 8: MTRAG Retrieval and Generation using Ensemble Re-encoders and Anchor Prompting
Anne Jacika J | Anishka K | Guruprakash K | Rajalakshmi Sivanaiah | Angel Deborah S
This paper discusses the Retrieval-Augmented Generation (RAG) system submitted to the MTRAG-UN shared task on multi-turn conversational question answering. The paper describes the proposed solution for Task A (Document Retrieval) and Task C (Full RAG Pipeline), focusing on retrieval robustness and grounded response generation in complex English multi-turn dialogs. The proposed retrieval architecture uses a cascaded hybrid pipeline, which combines sparse retrieval (BM25) with dense bi-encoder models (BGE-base-en-v1.5 and E5-base), integrated via Reciprocal Rank Fusion and refined using a weighted ensemble of cross-encoders. For the generation part, the top-3 retrieved passages are injected into FLAN-T5-Large using an anchor-prompting strategy to output grounded faithful responses. Experimental results show that the proposed hybrid retrieval framework with multi-stage reranking significantly enhances passage selection, particularly for non-standalone conversational queries. Further analysis reveals persistent difficulties in handling underspecified and unanswerable questions, as well as an increased susceptibility to retrieval noise in later dialog turns.
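Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over the input rankings. The sketch below assumes the conventional constant k = 60 and uses invented document IDs; the abstract does not state the constant or any other fusion settings the authors used.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 lists from a sparse retriever and two dense retrievers.
bm25_ranking = ["d1", "d2", "d3"]
bge_ranking = ["d2", "d3", "d1"]
e5_ranking = ["d2", "d1", "d4"]

fused = reciprocal_rank_fusion([bm25_ranking, bge_ranking, e5_ranking])
print(fused)
```

Because only ranks are used, the fusion needs no score normalization across BM25 and dense retrievers, whose raw scores are on different scales.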
DataBees at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Tanisha Sriram | Sathvika Shankar | Sowmya Anand | Rajalakshmi Sivanaiah | Angel Deborah S | Mirnalinee Thankanadar
This paper describes our submission to SemEval-2026 Task 9, Subtask 1: Multilingual Text Classification Challenge - Polarization Detection. Our focus is on how classical and transformer-based models compare when applied to multilingual polarization detection. We aim to understand where each type tends to do well and where it breaks down, particularly once you move from high-resource to low-resource settings. Our experimental setup evaluates classical machine learning models (TF-IDF with Naive Bayes, Logistic Regression, and Linear SVM) alongside language-specific transformer models across multiple languages. For Arabic, Bengali, German, Italian, and Spanish, we leveraged both multilingual and monolingual pre-trained transformers such as mBERT, XLM-R, AraBERTv2, BanglaBERT, and BETO. We compare individual classical and transformer-based models to identify which modeling choices work best for each language. Our results varied substantially across languages. We achieved our best leaderboard rankings in Bengali (6th out of 48 teams) and Italian (6th out of 43 teams), while performance was lower in Arabic (33rd out of 44), German (41st out of 44), and Spanish (46th out of 48). The study highlights the value of comparing classical and transformer-based approaches for multilingual polarization detection and identifies language-specific challenges for future improvement.
Team Habib Disambiguators at SemEval-2026 Task 5: Assessing Semantic Plausibility using Regularized Transformer Fine-Tuning
Zohaib Aslam | Ahsan Siddiqui | Ayesha Enayet
This paper presents a system for SemEval-2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Sentences through Narrative Understanding. The task involves predicting the plausibility of a specific word sense within a short story where context provided by the ending resolves a deliberate ambiguity. We model this as a regression problem, fine-tuning a DeBERTa-v3 transformer to predict the distribution of human judgments rather than a single hard label. To address the challenge of limited training data and potential overfitting, we employ R-Drop (Consistency Regularization) to enforce prediction stability across dropout masks and Layer-wise Learning Rate Decay (LLRD) to preserve the model’s pre-trained linguistic knowledge. Our experiments demonstrate that treating plausibility as a soft-label distribution, combined with aggressive regularization, improves generalization on ambiguous samples. The submitted system achieves a Spearman correlation of 0.56 and an Accuracy (within SD) of 0.74 on the official test set.
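Layer-wise Learning Rate Decay gives the top transformer layer the base learning rate and multiplies the rate by a decay factor for each layer below it. The following is a minimal sketch of that schedule; the base rate, decay factor, and layer count are hypothetical, as the abstract does not report them.

```python
def layerwise_learning_rates(base_lr, decay, num_layers):
    """Return per-layer learning rates, index 0 = lowest layer.

    The top layer gets base_lr; each layer below is scaled by `decay` again.
    """
    if not 0 < decay <= 1:
        raise ValueError("decay must be in (0, 1]")
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]


# Hypothetical settings: 3 layers, base rate 1e-4, decay 0.5.
rates = layerwise_learning_rates(base_lr=1e-4, decay=0.5, num_layers=3)
print(rates)
```

Lower layers, which tend to encode more general linguistic features, are updated more gently, which is how LLRD helps preserve pre-trained knowledge on small datasets.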
Ellat at SemEval-2026 Task 11: Comparing Encoder and Decoder Models for Syllogistic Reasoning
Farzaneh Bayan Memar | Hanneke Huls | Matthijs Ten Hove
For SemEval-2026 Task 11 (Subtask 1: English), Team Ellat investigates whether language models can assess logical validity independently of semantic plausibility. Since these models learn statistical patterns instead of explicit logical rules, they often rely on world knowledge and semantic shortcuts rather than formal logic. To address this challenge, we evaluate three architectures: MiniLM-L6-mnli-binary, DeBERTa-v3-small, and Llama 3.1-8B-Instruct, applying task-specific fine-tuning for encoder models and Abstract Logic Augmentation with QLoRA for LLaMA. DeBERTa achieved the strongest overall performance, MiniLM showed clear reductions in content bias after fine-tuning, and Llama 3.1-8B exhibited strong plausibility bias in the zero-shot setting. However, our augmented fine-tuning approach led to only modest improvements and a partial shift toward structure-based reasoning. Overall, fine-tuning and abstraction-based augmentation reduce plausibility bias, but fully separating logical validity from semantic content remains challenging across architectures.
AI-Monitors at SemEval-2026 Task 4: A Hybrid Embedding and LLM Ensemble for Narrative Similarity
Vishnu Tripathi | Azad - | Prakhar Joshi | Pragyananda Sahoo | Gaurav Kumar | Piyush Arora | Neel Mani
Narrative similarity requires reasoning over the deeper structural properties of stories - shared themes, causal progression, and outcomes - rather than surface-level lexical overlap. We describe AI-Monitors, our system for SemEval-2026 Task 4 (Track A), which determines which of two candidate stories is more narratively similar to a given anchor. We explore a progression of approaches - from embedding-based similarity to structured LLM prompting and ensemble construction - guided by four hypotheses about where narrative reasoning gains can be found. The final system achieves 75% test accuracy on 400 instances, ranking 3rd out of 47 systems and approaching the individual human annotator ceiling of 78%. Our key findings are: i) structured few-shot prompting substantially outperforms dense embedding similarity; ii) selecting ensemble components by how differently they make errors - rather than by accuracy alone - produces stronger predictions; and iii) how you describe an example to the model affects its predictions.
Team ewelinaksiez at SemEval-2026 Task 11: Reducing Content Bias in Syllogistic Reasoning via Semantic Abstraction
Ewelina Księżniak
This paper presents our system for SemEval-2026 Task 11 Subtask 1 on content-independent syllogistic reasoning. The task evaluates whether language models can determine the formal validity of logical arguments independently of their semantic plausibility. To reduce content-driven biases, we propose a data augmentation strategy that progressively abstracts lexical semantics by replacing content words with symbolic placeholders and pseudo-words while preserving logical structure. Experiments based on fine-tuning microsoft/deberta-large-mnli show that abstraction-based augmentation reduces Content Effect and improves accuracy, leading to competitive performance on the official leaderboard. However, we observe substantial sensitivity to random initialization, suggesting that evaluation outcomes are partly influenced by stochastic factors. To better understand these effects, we conduct a layer-wise probing analysis using a Minimum Description Length framework, showing that the proposed approach decreases the accessibility of plausibility information in later transformer layers, indicating a shift toward more structure-oriented reasoning.
TeamLasse at SemEval-2026 Task 3: A Hybrid Generative-Discriminative Framework for Dimensional Aspect-Based Sentiment Analysis
Lasse Strothe | Shaghayegh Kolli | Jana Diesner
In this paper, we present our system for SemEval-2026 Task 3 Track A: Dimensional Aspect-Based Sentiment Analysis (DimABSA). The core objective is to extract structural sentiment elements—such as aspects, opinions, and categories—from text and predict their corresponding continuous Valence-Arousal (VA) scores. The primary challenge lies in simultaneously handling structural extraction and continuous numerical regression across highly imbalanced datasets encompassing multiple languages and domains. To address this complexity, we propose a decoupled, two-stage hybrid generative-discriminative framework. A generative Large Language Model first extracts structured sentiment tuples, while an encoder-based language model performs the continuous VA regression. To foster cross-lingual and cross-domain generalization, we train our models using a targeted data balancing mechanism.
CUNI at SemEval-2026 Task 4: Multi-Head Narrative Aspect Disentanglement via Entangled Synthetic Dataset
Jan Mitka | Jindrich Helcl
We participate in Track B of the SemEval 2026 Task 4 on narrative similarity, focusing on narrative representation learning. We introduce a synthetic dataset designed to disentangle core narrative aspects (abstract theme, course of action, and outcome) and propose a multi-head multi-positive extension of the InfoNCE objective to train aspect-specific embeddings. Our best model achieves 64.25% accuracy on the test set. A nearest-centroid analysis indicates partial aspect-specific structure in the submitted checkpoint, while the training dynamics reveal a partial misalignment between the contrastive objective and the triplet-based evaluation protocol.
FMISUYotkovaKastreva at SemEval-2026 Task 13: Lightweight Detection of LLM-Generated Code via Stylometric Signals
Elitsa Yotkova | Violeta Kastreva | Dimitar Dimitrov | Ivan Koychev | Preslav Nakov
SemEval-2026 Task 13 investigates machine-generated code detection across multiple programming languages and application scenarios, asking participating systems to generalize to unseen languages and domains. This paper describes our participation in Subtask A (binary classification) and explores both pretrained code encoders and lightweight feature-based methods. We design ratio-based features that are less sensitive to snippet length. To support the extraction of descriptiveness-related signals, we use parsing engines and a programming-language classifier. Additionally, we train a separate code-vs-text line classifier to identify raw natural language segments embedded within samples. We combine a shallow decision tree with heuristic rules derived from data analysis to produce the final predictions. Our approach is computationally efficient, requires only CPU resources for training, and achieves near-instant inference time, offering a lightweight alternative to large pretrained models.
Thiyaga6851 at SemEval-2026 Task 11: Disentangling Content and Formal Reasoning in Large Language Models using Neuro-Symbolic Mapping
Thiyagarajaa Pk | Thenmozhi D.
This paper presents our system for SemEval-2026 Task 11 Subtask 1, which evaluates the formal validity of English syllogisms independently of semantic plausibility. To reduce content effects, we use a hybrid neuro-symbolic pipeline that separates natural-language abstraction from logical inference. The system maps each syllogism into categorical propositions using template rules and a learned parser, followed by explicit role mapping for the major, minor, and middle terms. If the abstraction is structurally complete, an exact Venn-style satisfiability solver checks validity; otherwise, the instance is routed to a learned fallback classifier. Our official submission achieved 71.73% accuracy, a Total Content Effect of 11.84, a Combined Score of 20.19, and a rank of 41st. Development analysis shows that symbolic inference is reliable on well-formed abstractions, while most remaining errors arise from paraphrase, multiword terms, and unstable term alignment.
LIAAD INESCTEC at SemEval-2026 Task 4: Unsupervised Narrative Similarity via Discourse Representation Structures and Sentence Embeddings
Evelin Amorim | Alípio Jorge | Purificação Silvano
In this paper, we describe an unsupervised approach using Discourse Representation Structures (DRS) for SemEval-2026 Task 4 on narrative similarity, which was formulated in two tracks. Our team developed a solution only for Track A, where the input is a triplet: an anchor story, a story A, and a story B. The goal is to predict which story, A or B, is more similar to the anchor story. Our approach parses each story and transforms it into a DRS format; we then leverage its structure to extract features, performing ablation experiments on the development dataset. Our strategy achieved 0.5975 accuracy on the official blind test set.
HUS@NLP-VNU at SemEval-2026 Task 3: Dual-Stream Syntax-Aware Modeling and Direct Preference Optimization for Dimensional ABSA
An Cao | Lam Hoang | Le Ngoc Toan | Ha Linh
Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends traditional ABSA by predicting continuous sentiment intensity in the Valence-Arousal space. To tackle the regression subtasks (DimASR and DimStance), we propose a Dual-Stream Syntax-Aware architecture synergizing contextual semantics with a Deep Syntax-Guided Graph Convolutional Network (GCN). It utilizes a Context-Aware Anchor for semantic filtering and post-norm residuals to prevent oversmoothing. For generative extraction, we apply Direct Preference Optimization (DPO) via a resource-efficient, heuristic-based data perturbation strategy to construct preference pairs without costly LLMs. Across multilingual settings, our regression model achieves top-5 rankings in nine domains and obtains the best result on the Chinese-Finance dataset. Empirical analysis shows that explicit syntactic modeling consistently improves continuous sentiment regression, while DPO provides modest but stable gains for boundary-constrained extraction.
PolAR Bears at SemEval-2026 Task 9: Parameter-Efficient Fine-Tuning and Cross-Lingual Augmentation for Multilingual Polarization Detection
Vinay Ulli | Jyoti Kumari
This paper describes our system for SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization. We focus on four low-resource Indian languages (Hindi, Bengali, Telugu, and Odia) across three subtasks: Polarization Detection, Type Classification, and Manifestation Identification. To address data scarcity, we employ cross-lingual data augmentation using IndicTrans2, expanding our dataset fourfold. Our unified architecture leverages Qwen3-4B-Instruct optimized via QLoRA, training a linear classification head on masked mean-pooled hidden states with only ∼33M trainable parameters. Our system achieved highly competitive results in Subtask 1, with an average Macro F1 of 0.813 across all languages (peaking at 0.8668 for Telugu). For the complex multi-label frameworks of Subtasks 2 and 3, our results expose a significant pre-training bias within foundational LLMs; while Hindi maintained strong F1 scores of 0.7008 and 0.7248, performance dropped considerably for the other three languages, highlighting the ongoing challenges of cross-lingual transfer for nuanced rhetorical techniques.
Rasende Rakete at SemEval-2026 Task 6: LLM-First Approach with Iterative Prompt Repair for Classifying Evasion in Political Interviews
Omar Elbeltagui | Nils Knittel | Leonie Süß | Umut Yıldırır | Qiyan Zhai | Shaghayegh Kolli | Jana Diesner
We describe our system for SemEval-2026 Task 6 (CLARITY), which addresses automatic detection of evasive responses in political interviews. We adopt an LLM-first approach built around two core contributions: (i) an iterative prompt repair loop that diagnoses classification errors on concrete failure examples and applies prompt revisions and (ii) a configurable end-to-end Java Pipeline that supports multiple LLM providers, strategies, and systematic experimentation.
hdharpure at SemEval-2026 Task 3: BERT-Based Modeling and Prediction Behavior Analysis for Multilingual Valence–Arousal Scoring
Harshal Dharpure | Nicolay Rusnachenko
SemEval-2026 Task 3 is a dimensional aspect-based sentiment analysis (DimABSA) task that extends traditional ABSA by predicting continuous scores in two dimensions: valence (V) and arousal (A). Track A/Subtask 1 is a multilingual task in which, given a text and the aspects mentioned in it, the system must predict V/A scores for each aspect. Our approach is based on the pretraining-finetuning concept: we first pretrain a multilingual model ($M'$) and then fine-tune it ($M''_{l,d}$) on the training data of a specific domain ($d$) and language ($l$). We adopt XLM-RoBERTa ($M$) as the encoder with separate regression heads for valence and arousal prediction. Our experiments on a manual split of the official SemEval-2026 Task 3 dataset ($D^{20\%}_{\text{train}}$) demonstrate that fine-tuning the model in two stages ($M''_{l,d}$) yields an average ≈ 1.36 times improvement in RMSE$_{va}$ over direct fine-tuning ($M_{l,d}$). To investigate the limitations of our system, we visualize and discuss its prediction behavior. Our code is publicly available.
Bitzkrieg at SemEval-2026 Task 13: Calibration-Aware Dual CodeBERT for Multilingual Machine-Generated Code Detection
Thenmozhi D. | Adithya S | Harshil Malisetty | Aadit P | Rohan R
We describe our submission to SemEval-2026 Task 13, addressing binary detection (Subtask A), generator attribution (Subtask B), and hybrid/adversarial authorship classification (Subtask C) of machine-generated code (MGC). For Subtask A, we fine-tune two CodeBERT models with complementary sampling strategies and apply percentile-based post-hoc calibration, improving Macro-F1 from 0.47 to 0.56 without additional training. For Subtask B, we combine TF-IDF n-grams, frozen CodeBERT embeddings, and language features with XGBoost, using synthetic augmentation and class weighting to handle an 11-class dataset skewed 88% toward the human class, achieving Macro-F1 of 0.289. For Subtask C, we fine-tune a CodeBERT classifier for four-way authorship classification, achieving Macro-F1 of 0.49. Our results highlight the importance of probability calibration for binary detection and class balancing for multi-class attribution.
harapalb at SemEval-2026 Task 4: Multi-Signal Neuro-Symbolic Ensembles for Narrative Similarity
Andrei Tiberiu Carp
This paper presents a neuro-symbolic ensemble for determining narrative similarity by moving beyond surface-level text matching toward structural and causal alignment. The architecture fuses three primary signals: action-focused neural embeddings that isolate event trajectories, a symbolic Structural Survival Ratio (SSR) that measures the preservation of discrete event tuples via dependency parsing, and high-level structural comparisons conducted by the gpt-5-mini model. Evaluated on the SemEval-2026 Task 4 test set, the integrated ensemble achieved an accuracy of 68.25%.
SemTechLab at SemEval-2026 Task 5: Context-Aware Homonym Disambiguation via Span-Specific Interaction Features
Karlo Babić | Ana Meštrović | Slobodan Beliga
This paper presents the SemTechLab system submitted to SemEval-2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Sentences through Narrative Understanding. The task involves predicting the plausibility of a specific word sense given a short story context. Our approach (HINTS) utilizes a hybrid Transformer architecture based on nli-mpnet-base-v2. Unlike standard Cross-Encoders that rely solely on the [CLS] token, HINTS extracts span-specific embeddings for the target homonym from both the narrative context and the sense definition. We compute interaction features (concatenation, difference, and element-wise product) between these spans to explicitly model the semantic alignment between the story and the proposed sense. The model is trained using Kullback-Leibler Divergence to predict the full distribution of human ratings. For the official submission phase, scores were rounded to integers (1–5). However, subsequent analysis and ablation studies detailed in this paper utilize continuous (float) scores derived from the expected value for improved metric sensitivity. On the test set, our best configuration, which relies exclusively on local homonym features, achieved a Spearman correlation of 0.603 and an accuracy of 75.8%.
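The interaction features and the expected-value scoring described above can be sketched as follows. This is an illustrative reading, not the HINTS implementation: the vector contents, the function names, and the ordering of the feature blocks are assumptions.

```python
def interaction_features(context_span, sense_span):
    """Concatenate [u, v, u - v, u * v] for two equal-length span embeddings."""
    if len(context_span) != len(sense_span):
        raise ValueError("span embeddings must have the same dimension")
    difference = [u - v for u, v in zip(context_span, sense_span)]
    product = [u * v for u, v in zip(context_span, sense_span)]
    return list(context_span) + list(sense_span) + difference + product


def expected_rating(probabilities, ratings=(1, 2, 3, 4, 5)):
    """Continuous score as the expected value of a predicted rating distribution."""
    if len(probabilities) != len(ratings):
        raise ValueError("one probability per rating is required")
    return sum(r * p for r, p in zip(ratings, probabilities))


# Hypothetical 2-dimensional span embeddings and rating distribution.
features = interaction_features([1.0, 2.0], [3.0, 4.0])
score = expected_rating([0.1, 0.1, 0.2, 0.3, 0.3])
print(features, score)
```

Rounding `score` to the nearest integer would give the 1 to 5 rating format mentioned for the official submission, while the unrounded value keeps finer distinctions for rank-based metrics.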
SEF-CLGC at SemEval-2026 Task 11: Logical Notation Impact on Language Model Performance
Hanna Abi Akl | Fabien Gandon | Catherine Faron | Pierre Monnin
This paper revisits our pipeline called Syllogistic Evaluation Framework-Common Logic Grammar Construction (SEF-CLGC). We combine formal logical notations with Small Language Models (SLMs) to evaluate reasoning performance on the SemEval-2026 Task 11 Subtask 1: Disentangling Content and Formal Reasoning in Large Language Models. Our experiments show that by relying solely on SLMs, trained on a combination of natural and symbolic languages, our best model achieves a content score of 27.80% on the task while significantly lowering the content bias in reasoning.
Aaron at SemEval-2026 Task 9: Multilingual Polarization Detection using Transformer-Based Models with Class Weighting and Threshold Tuning
Aaron Anampiu
This paper describes our submission to SemEval-2026 Task 9 on detecting multilingual, multicultural, and multievent online polarization. We address all three subtasks: binary polarization detection, polarization type classification, and manifestation identification for English and Swahili. Our approach leverages transformer-based models (RoBERTa-base for English, AfroXLMR-base for Swahili) with class-weighted loss functions to address severe label imbalance and per-label threshold tuning to optimize multi-label classification. On the test set, we achieve F1 macro scores of 0.7901 (English) and 0.7910 (Swahili) for Subtask 1, 0.4615 (English) and 0.4808 (Swahili) for Subtask 2 and 0.4791 (English) and 0.5830 (Swahili) for Subtask 3, which give competitive performance on the leaderboard, demonstrating the effectiveness of our methods for handling imbalanced multi-label polarization detection. Our error analysis reveals that models struggle with dehumanization detection and lack of empathy.
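Per-label threshold tuning can be sketched as a grid search that, for each label independently, picks the decision threshold maximizing F1 on development data. The grid, label names, and probabilities below are hypothetical; the abstract does not describe the search procedure the author used.

```python
def f1_at_threshold(probs, gold, threshold):
    """Binary F1 when predicting positive for probabilities >= threshold."""
    tp = fp = fn = 0
    for prob, is_positive in zip(probs, gold):
        predicted = prob >= threshold
        if predicted and is_positive:
            tp += 1
        elif predicted:
            fp += 1
        elif is_positive:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def tune_thresholds(probs_per_label, gold_per_label, grid=None):
    """Pick, per label, the grid threshold with the best development F1."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95
    best = {}
    for label, probs in probs_per_label.items():
        gold = gold_per_label[label]
        # max() keeps the first (lowest) threshold among ties.
        best[label] = max(grid, key=lambda t: f1_at_threshold(probs, gold, t))
    return best


# Hypothetical development-set probabilities and gold labels for two labels.
dev_probs = {"label_a": [0.9, 0.4, 0.35, 0.1], "label_b": [0.2, 0.15, 0.05, 0.6]}
dev_gold = {"label_a": [1, 1, 0, 0], "label_b": [1, 1, 0, 0]}

thresholds = tune_thresholds(dev_probs, dev_gold)
print(thresholds)
```

A rare or poorly calibrated label often ends up with a threshold well below 0.5, which is the main reason to tune each label separately instead of using one global cutoff.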
OseiBrefo-Liang at SemEval-2026 Task 12: Hybrid Causal Knowledge Graphs and Neural-Symbolic Policy Optimisation for Abductive Event Reasoning
Emmanuel Osei-Brefo | Huizhi(elly) Liang
Abductive Event Reasoning (AER) requires selecting plausible causal explanations for observed events from incomplete and noisy textual evidence. Unlike deductive reasoning, abductive inference proceeds from effects to candidate causes and is highly sensitive to distractor information and implicit multi-hop relationships. We present a hybrid neural-symbolic framework that models abductive reasoning as structured causal validation rather than unconstrained generation. Our framework integrates hybrid retrieval, micro-level evidence grounding, concept-level causal abstraction, reinforcement learning-based decision calibration, and structured Theorem-of-Thought verification. Experiments on SemEval-2026 Task 12 show that LLM reasoning constrained by structured causal graphs achieves the strongest development performance of 0.5288 and a leaderboard score of 0.61 on the test set, substantially outperforming symbolic-only and policy-only variants. These findings indicate that explicit causal modelling improves robustness in document-grounded abduction tasks.
Team Poznan at SemEval-2026 Task 13: Detecting Machine-Generated Code with Multiple Programming Languages, Generators, and Application Scenarios
Dawid Siera | Anatol Kaczmarek | Wiktor Kamzela | Adam Dobosz | Jakub Dutkiewicz
Detecting machine-generated code is crucial for maintaining software security and quality. Traditional approaches often rely on stylistic or statistical features, which are increasingly circumvented by advanced code generation models. This paper introduces a novel approach leveraging Graph Neural Networks (GNNs) to capture the structural characteristics of code, representing it as a program dependency graph. We demonstrate that our GNN-based classifier outperforms both traditional and embedding-based methods on benchmark datasets, achieving improved accuracy and robustness in identifying code produced by various generation techniques. This work highlights the potential of GNNs for a more structural understanding of code authorship.
X-NLP at SemEval-2026 Task 12: Prompting LLMs for Abductive Event Reasoning
Caelen Mattie | Patrick Bowen | Milton King
In this work, we applied two different systems to the SemEval 2026 Shared Task 12, which explores abductive event reasoning. Specifically, this task involves determining the cause of an event from a list of candidate causes. Instances are accompanied by documents that can provide applicable knowledge for the target event. Both of our systems involve prompting LLMs, and our best-performing system leverages retrieval-augmented generation. Our best-performing system achieved a score of 84% and ranked 40th out of the 221 total submissions.
COGNAC at SemEval-2026 Task 4: Evaluating Narrative Components with LLMs for Hard Story Similarity Cases
Tisa Islam Erana | Azwad Anjum Islam | Anshu Sharma | Mark Finlayson
This paper presents a two-stage system for the SemEval-2026 shared task on narrative similarity. The task defines similarity in terms of three components: abstract theme, course of action, and outcome. For Track A, the system first applies majority voting over multiple independent large language model (LLM) judgments to handle high-agreement (easy) cases. For low-agreement (difficult) cases, it routes examples to a second stage that decomposes stories into theme, course of action, and outcome, and either (i) scores these components individually with learned weights or (ii) uses structured chain-of-thought prompting to compare stories along the three dimensions. This two-stage approach improves robustness on difficult examples and achieves first place with 0.78 test accuracy. For Track B, the system generates embeddings of full stories and of individual narrative components using several embedding models. Experiments show that embeddings derived from the course-of-action component alone yield the best performance, achieving 0.72 accuracy and ranking first. Additional analyses reveal substantial annotation variability in the dataset and highlight the importance of handling ambiguity and disagreement when modeling narrative similarity.
L52+-IIMAS-UNAM at SemEval-2026 Task 1 (MWAHAHA): Joke Selection Through a Multi-Stage Prompt-Engineering and Heuristic Pipeline
Adolfo Tonatihu Camacho Gonzalez | Ximena Cruz | Natalia Godínez-Aldana | Lizeth Palacios-Patiño | Ramón Rangel | Ivan Vladimir Meza Ruiz
Humor generation remains one of the most challenging tasks in natural language processing, requiring creativity, incongruity resolution, cultural sensitivity, and strict structural control. We present a fully prompt-based system for headline-conditioned joke generation in SemEval-2026 Task 1 (MWAHAHA) for both English and Spanish. Deliberately avoiding fine-tuning, our approach relies on structured prompt engineering combined with a multi-stage heuristic pipeline. For Spanish, we extract a “stylistic-humor DNA” from a public joke corpus to guide generation. The pipeline integrates multi-candidate generation, diversity enhancement, iterative refinement, LLM-based rewriting, and constraint-aware selection. Human evaluation performed by the team (n=180) shows substantial gains over single-pass generation, particularly in funniness and punchline clarity. Official shared-task results were modest (12th/16 Spanish, 24th/31 English), underscoring that limited originality remains a key bottleneck. In an era dominated by large language models (LLMs) such as GPT-4o and Grok, our work demonstrates the value of linguistically grounded heuristics as an efficient, interpretable, and low-cost complement to black-box generation systems.
Rating Plausibility of Word Senses in Ambiguous Sentences Using Multi-Architecture Analysis
Naina Jain | Nidhi Arora | Pal Thakkar | Siba Sahu
Word sense disambiguation in narrative contexts requires systems to reason about subtle semantic relationships between candidate senses and discourse context. This paper addresses SemEval 2026 Task 5, which reformulates WSD as a graded plausibility prediction problem on a 1–5 Likert scale using the AmbiStory dataset. We present two complementary approaches: (1) a DeBERTa-v3-Large encoder with attention-weighted pooling and ordinal regression, achieving a Spearman correlation of 0.718, and (2) a rank-based ensemble combining FLAN-T5 and RoBERTa, achieving 0.692. Through ablation studies, we show that explicitly modeling ordinal structure improves performance over standard regression by 17.3%. We further analyze the strengths of each approach, showing that fine-tuned encoders capture fine-grained semantic relationships, while ensemble methods provide robustness through complementary modeling biases. Our results provide a detailed empirical analysis of design choices for graded plausibility prediction in narrative understanding.
MindFlayer at SemEval-2026 Task 8: DUALRAG: Answerability-Aware Generation for Multi-Turn RAG Conversations
Jerin Romijah Tuli | Md. Sartaj Alam Pritom | Talukder Naemul Hasan Naem
Our system, DualRAG (team MindFlayer), tackles SemEval-2026 Task 8 Subtask B - generating faithful responses in multi-turn RAG conversations. The core idea is simple: before generating anything, we first check whether reference passages exist for the current question. If they do, we route through a domain-guided generation prompt that instructs the model to answer using only those passages. If they do not, we route through a strict refusal prompt that tells the model to politely decline rather than guess. We used Meta’s Llama-4-Scout-17B through the Groq API, with no training or fine-tuning - purely zero-shot prompting. A lightweight post-processing layer catches the rare cases where the model ignores its instructions: if it refuses when passages are available, we replace the response with a neutral fallback; if it answers when no passages exist, we replace it with a standard refusal. Out of 507 test tasks, only 7 needed this correction. The system ranked 8th out of 26 teams with a harmonic mean of 0.7492, beating the strongest baseline (GPT-OSS-120B at 0.639) by a notable margin. The standout result is 100% refusal accuracy on all 130 unanswerable questions - something even GPT-4o and Llama 3.1 405B failed to achieve consistently according to prior work. Our RLF score of 0.8782 shows the responses stay tightly grounded in the reference passages. The relatively lower RBagg (0.6024) reflects the challenge of matching human-written phrasing in a zero-shot setting, which we identify as the clearest direction for improvement.
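The routing and post-processing logic described in this abstract might look roughly like the sketch below. The prompt wording, the fallback and refusal strings, the phrase-matching refusal detector, and the `generate` callable (standing in for the LLM call) are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import Callable, List

# Hypothetical canned responses; the paper's actual texts are not given in the abstract.
REFUSAL_TEXT = "I'm sorry, but I don't have enough information to answer that."
FALLBACK_TEXT = "The reference passages are relevant, but I could not produce a grounded answer."
# Assumed heuristic for spotting a refusal in model output.
REFUSAL_MARKERS = ("i'm sorry", "i cannot answer", "don't have enough information")


def looks_like_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def build_prompt(question: str, passages: List[str]) -> str:
    """Route to a grounded-generation prompt or a strict refusal prompt."""
    if passages:
        context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
        return f"Answer using only these passages:\n{context}\n\nQuestion: {question}"
    return (
        "No reference passages are available. Politely decline to answer.\n\n"
        f"Question: {question}"
    )


def answer(question: str, passages: List[str], generate: Callable[[str], str]) -> str:
    """Generate a response, then correct the two cases where the model ignores the route."""
    response = generate(build_prompt(question, passages))
    refused = looks_like_refusal(response)
    if passages and refused:
        return FALLBACK_TEXT  # model refused although evidence exists
    if not passages and not refused:
        return REFUSAL_TEXT  # model answered although no evidence exists
    return response
```

Because the routing decision depends only on whether passages are present, the refusal behaviour is deterministic by construction; the model's output only affects the wording.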
MindFlayer at SemEval-2026 Task 13: LACR-ENS: Calibration-Aware Ensemble Routing for Cross-Language AI-Generated Code Detection
Jerin Romijah Tuli | Talukder Naemul Hasan Naem | Md. Sartaj Alam Pritom
This paper presents LACR-ENS, a calibration-aware ensemble system for detecting AI-generated code across eight programming languages (SemEval-2026 Task 13). We identify a severe asymmetric out-of-distribution (OOD) failure in fine-tuned code transformers: Expected Calibration Error doubles from 0.09 (seen languages) to 0.18 (unseen languages), and high-confidence predictions (p0.80) are wrong 39% of the time on OOD inputs. We propose Language-Aware Confidence Routing (LACR), formally equivalent to implicit per-language temperature scaling, which reduces OOD ECE to 0.11 and improves macro-F1 by +0.013 over fixed-threshold ensembling. A language-family proximity analysis reveals that syntactic distance to training languages predicts OOD F1 with Pearson r=+0.94, providing a principled, label-free signal for deployment risk assessment and motivating a continuous routing extension. Our system combines UniXCoder and GraphCodeBERT via weighted logit-level fusion and achieves macro-F1 0.538, outperforming comparable encoder-only systems. We additionally document a HuggingFace label inversion pitfall that suppressed our initial score by approximately 0.29 F1.
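For readers unfamiliar with the Expected Calibration Error reported in this abstract, the following is a minimal sketch of the common equal-width binned estimator. The number of bins and the left-closed bin edges are assumptions; the paper's exact ECE configuration is not stated in the abstract.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average gap between mean confidence and accuracy over equal-width bins."""
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Bin k covers [k / n_bins, (k + 1) / n_bins); confidence 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Four predictions made at 0.9 confidence, of which three are right.
ece_value = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
```

A model is well calibrated when its stated confidence matches its empirical accuracy in every bin, which is why a rise in ECE on unseen languages signals over-confident errors.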
abateam at SemEval-2026 Task 1: Plan2joke – Humor Policies for Type-Specific Two-Pass Humor Generation
Andrii Dikhtiar | Antonii Viter | Bohdan Karaziia | Daryna Dementieva | Alexander Fraser
Our work was inspired by several recent directions in computational humor and evaluation, including: Baranov, Kniazhevsky, and Braslavski, "You Told Me That Joke Twice: A Systematic Investigation of Transferability and Robustness of Humor Detection Models" (2023); Tikhonov and Shtykovskiy, "Humor Mechanics: Advancing Humor Generation with Multistep Reasoning" (2024); and Zhong, Huang, Gao, Wen, Lin, Zitnik, and Zhou, "Let’s Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation" (2023). At a high level, we designed a policy-driven humor generation approach covering multiple humor types. We used optimal humor recognition systems and a context enrichment strategy, as well as SFT training based on a dataset composed from previous research samples and adjusted for alignment with our humor policies. This allowed us to perform an ablation study of the approach and to calibrate our system.
Lacuna Inc. at SemEval-2026 Task 4: Structurally Gated State-Space Models for Disentangling Narrative Similarity
Aleksey Kudelya | Rafif Alshawi | Alexander Shirnin
In this paper, we present the Invariant-Variant Disentangled State-Space Model (IVD-SSM), our submission to SemEval-2026 Task 4 on Narrative Story Similarity and Narrative Representation Learning. Evaluating narrative similarity is a profound computational challenge that requires models to look past concrete, superficial elements such as specific names, actors, objects, or settings to isolate and compare abstract patterns of causality and plot progression. To model these extended causal chains without the quadratic bottlenecks of standard Transformers, we leverage a hybrid State-Space Model (Jamba-1.5-Mini). Building upon this backbone, we introduce the Structurally Gated Alignment (SGA) head, a novel, differentiable algorithmic architecture. The SGA head operates on two scales: a heavily strided Macro-path maps the coarse structural skeleton of a story, which then acts as a gating mechanism to filter a full-resolution Micro-path, actively suppressing semantic noise and superficial keyword overlaps. Evaluated on both pairwise comparative judgments (Track A) and dense representation learning (Track B), our approach demonstrates that explicitly disentangling structural invariants from lexical variants provides a robust, principled framework for deep narrative understanding.
IReLIIT(BHU) at SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Soumadip Majumder | Arjun Mukherjee | Krishna Tewari | Sanjaya Lenka | Sukomal Pal
This paper presents the IReLIIT(BHU) submission to SemEval-2026 Task 9 for the Chinese language track. We participated in all three subtasks: binary polarization detection, multi-label polarization type classification, and multi-label manifestation identification. Our approach is based on a unified transformer-based framework with cross-validation, prediction aggregation, and threshold optimization to improve robustness across tasks. On the official evaluation, our systems achieved Macro-F1 scores of 0.9081, 0.7962, and 0.6484 for Subtasks 1, 2, and 3, respectively, on the test data.
WWTC@UniA at SemEval-2026 Task 13: BERT-based Code Authorship Detection and Qualitative Analysis
Linda Kupfer | Lisa Hader | Christian Jaumann | Annemarie Friedrich
This paper describes our system for SemEval-2026 Task 13 on detecting machine-generated code. We fine-tune small encoder-only models for detecting human-written versus machine-generated code and for identifying which large language model (LLM) family was used to obtain code. We find that a strong, general-purpose model (ModernBERT) outperforms models specifically pre-trained for the code domain. In the official evaluation, our system ranks 5th on subtask B and 6th on subtask C. Our detailed analysis reveals that comments and other natural language text that is part of the code snippets provide valuable information for identifying the LLM family that generated them. Moreover, we show that the embeddings of our fine-tuned ModernBERT do not distinguish well between LLM families, but they cluster human-written code by programming language.
Spinfo Cologne at SemEval-2026 Task 4: Explainable Creation of Narrativity Embeddings
Janis Pagel | Nils Reiter
We describe our submission to SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. The task requires (i) selecting, for a given anchor story summary, which of two candidate summaries is narratively closer (Track A) and (ii) producing a narrative representation of a story as a vector embedding (Track B). Our approach emphasizes interpretability by explicitly eliciting three narrativity aspects with a prompted large language model. We then construct a fixed-size narrative embedding by concatenating aspect-wise representations, comparing a static-embedding baseline (GloVe) to contextualized sentence-transformer embeddings (all-MiniLM-L6-v2). On the development set, the sentence-transformer variant outperforms the static baseline and achieves 61.5% accuracy on Track A, while the GloVe variant performs near chance. Our official submission reaches 60.25% accuracy on the Track A test set and 57.75% accuracy on Track B. Additional ablations show that the aspect pipeline slightly outperforms raw-text embeddings, but that aspect contributions are uneven. Qualitative analysis suggests that failures often stem from inconsistent aspect generation and from overemphasizing theme overlap over event-level similarity.
UFG-Semantic at SemEval-2026 Task 6: CLARITY - Unmasking Political Question Evasions
Aline Hamano | Beatriz Felicio | Henrique Galvão | Nádia Da Silva
We propose an approach for Task 6: CLARITY - Unmasking Political Question Evasions. We make use of data augmentation, supervised fine-tuning, and model benchmarking to detect and classify response ambiguity in political discourse. Building on well-founded theory on equivocation and leveraging recent advancements in language modeling, our system was structured based on question/answer (QA) pairs extracted from presidential interviews, and it was evaluated in Clarity-level Classification and Evasion-level Classification.
Seals-NLP at SemEval-2026 Task 9: A Comparative Study of Transformer Architectures for Polarization Detection
Minh Smith | Cheryl Seals
We describe the Seals-NLP system for SemEval-2026 Task 9 (POLAR) Subtask 1, binary polarization detection. Our study compares (i) fully fine-tuned encoder-only transformers, (ii) QLoRA-based fine-tuned open-weight LLMs, and (iii) zero-shot prompted LLMs. ModernBERT-large emerges as the most cost-effective option, matching or surpassing larger fine-tuned and zero-shot LLMs in macro-F1 while requiring substantially less memory and running at lower latency. An error analysis by failure mode and polarization subtype reveals systematic over-triggering on political cue words and under-detection of sarcastic vilification and multifaceted attacks in the POLAR dataset across all models.
Team JAT at SemEval-2026 Task 9: Enhancing Polarization Detection with Cross-Lingual Transfer and Feature Fusion
Aleksandra Matkowska | Taya Lin | Yu-Chun Chao
We describe our system for SemEval-2026 Task 9 (POLAR), Subtask 1 - binary polarization detection. Our approach investigates polarization detection through monolingual and cross-lingual experimental settings. We first utilize a RoBERTa-based architecture enhanced with feature fusion, combining contextual sentence representations with handcrafted sentiment and intensity cues. As for multilingual joint training, we explore it within the Indo-European family to test whether cross-lingual transfer can elevate performance in data-scarce scenarios. Our final fine-tuned model achieves an average F1-score of 0.763 on the test set, compared to 0.491 for a random baseline. We also report ablations for augmentation, feature fusion, and class weighting to quantify each component’s contribution.
yasaminal at Semeval2026: Constraint-Aware Humor Generation with Knowledge Graph Guidance
Yasamin Aali
This paper presents a knowledge-guided humor generation system, which involves generating humorous text from either a pair of words or a news headline. The proposed approach integrates structured semantic reasoning derived from a knowledge graph (KG) with controlled generation using large language models (LLMs). The system constructs an intermediate KG hint consisting of related concepts retrieved in the target language, which is appended to the prompt to guide the generation process in a structured manner. A single candidate joke is generated per input using controlled top-p decoding. Experimental results show that incorporating KG reasoning improves relevance and constraint satisfaction, while LLM-based generation ensures fluency and creativity. Overall, the proposed method offers a structured and interpretable framework for humor generation across multiple languages.
MALTO at SemEval-2026 Task 13: Detecting Human, AI, and Hybrid Code via Hard Negative Mining and Curriculum-Driven Ensembles
Hüseyin Arslan | Evren Ayberk Munis | Timofei Khudonogov | Mert Akgun | Murat Besli | Ayhan Meherrem | Claudio Savelli | Flavio Giobergia
The rapid advancement of Large Language Models (LLMs) has significantly impacted software engineering, posing challenges for determining the origin and authenticity of source code. This paper presents the MALTO team’s submission for SemEval-2026 Task 13, explicitly focusing on Subtask B (Authorship Attribution among 11 classes) and Subtask C (Hybrid Code Detection). To address severe class imbalance and the complex boundaries of mixed human-machine code, we propose a unified framework that leverages an ensemble of UniXcoder and CodeT5. Our approach integrates a robust Tree-sitter-based Universal Canonicalization strategy, Data Augmentation, and a novel 3-Phase Curriculum Training schedule enhanced by Hard Negative Mining. Specifically, UniXcoder’s cross-modal representations excel at distinguishing among semantically overlapping LLM families (Subtask B), whereas CodeT5’s identifier-aware architecture is superior at detecting subtle structural anomalies in hybrid and adversarial snippets (Subtask C). By aggregating these complementary strengths, our soft-voting ensemble overcomes the limitations of individual models, demonstrating strong robustness against imbalanced distributions and effectively discriminating between purely human, purely machine, hybrid, and adversarial code snippets.
blue at SemEval-2026 Task 4: Synergizing Long-Context Reranking with Semantic Similarity for Narrative Alignment
Krish Sharma | Lakksh Sharma | Rhea Singhal | Jatin Bedi
This paper describes the system submitted by team blue for SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning, with a primary focus on the Pairwise Similarity subtask (Track A). The core challenge of this task lies in identifying deep structural alignments between stories, which is fundamentally hindered by the restricted context windows of standard transformer architectures that truncate narratives before reaching critical plot resolutions. To overcome this context bottleneck, we propose a hybrid ensemble architecture designed to capture extended narrative arcs. Our approach synergizes a cross-encoder (Jina Reranker v2), which processes long inputs via a sliding-window strategy over 1,024-token chunks, to evaluate the global "course of action," with a semantic bi-encoder (RoBERTa-Large) to validate local tonal consistency. This dual-stream system achieved a Pearson correlation score of 0.63, demonstrating that processing narrative content beyond the 512-token truncation boundary is strictly necessary for accurate pairwise narrative comparison.
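The sliding-window idea mentioned in this abstract can be sketched as a simple chunking routine. Only the 1,024-token window comes from the abstract; the 512-token stride, the function name `sliding_windows`, and the use of integer token IDs are assumptions, and how per-chunk scores are aggregated is left open here.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[int], window: int = 1024, stride: int = 512) -> List[List[int]]:
    """Split a token sequence into overlapping windows that together cover every token."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [list(tokens)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the current window already reaches the end of the story
        start += stride
    return chunks


# A 2,500-token story yields four overlapping chunks instead of one truncated input.
story_tokens = list(range(2500))
story_chunks = sliding_windows(story_tokens)
```

Overlapping windows keep the ending of a long story in scope for the scorer, at the cost of scoring some tokens twice.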
blue at SemEval-2026 Task 5: NarrBERT: Narrative-Aware BERT for Word Sense Disambiguation
Rhea Singhal | Krish Sharma | Lakksh Sharma | Jatin Bedi
This paper outlines the method submitted by team blue for the SemEval-2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Sentences through Narrative (AmbiStory). The task requires predicting plausibility scores that match human judgments instead of just picking a single correct sense as the output. This means that fine-grained contextual reasoning is vital. To tackle this problem, we propose a BERT-based cross-encoder regression model. This model encodes the entire narrative context, which includes the precontext, the ambiguous sentence, and the ending, along with candidate sense definitions and example usages. Unlike bi-encoder sentence-level methods, our model allows for token-level interaction between story cues and sense meanings. This interaction helps capture subtle narrative disambiguation signals. We conduct a systematic exploration of model architectures and training strategies, progressing from a sentence-transformer baseline to an optimised BERT cross-encoder. On the development set, our best configuration achieves a Spearman rank correlation of 0.66. On the official test set, the system achieves a Spearman correlation of 0.4866 and an Accuracy-within-Standard-Deviation of 0.6613, substantially outperforming sentence-transformer bi-encoder baselines.
LATE-iimas at SemEval-2026 Task 10: Conspiracy Detection via DeBERTa-v3 Ensemble and Weighted Loss Optimization
Jose Vazquez-Cerrillo | Helena Gomez-Adorno | Gemma Bel-Enguix
This paper describes the system developed by the LATE-iimas team for Task 10 of SemEval-2026: Psycomark, specifically for Subtask 2, which involves conspiracy detection. Our approach was based on fine-tuning the popular pre-trained language model DeBERTa-v3-Large. To address the challenges inherent in the provided dataset, such as class imbalance and the linguistic ambiguity of the "Can’t tell" label, we implemented a 5-Fold Stratified Cross-Validation technique combined with a Weighted Cross-Entropy Loss function. The final system, which operates using an ensemble of the resulting models, achieved a Weighted F1-Score of 0.75, placing it in the top 10 of the ranking.
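As an illustration of the weighted cross-entropy loss named in this abstract, here is a minimal pure-Python sketch. The inverse-frequency ("balanced") weighting scheme and the normalization by the sum of weights are assumptions; the abstract does not say how the class weights were chosen.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Assumed scheme: weight_c = n_samples / (n_classes * count_c), so rare classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}


def weighted_cross_entropy(
    probs: List[List[float]],
    labels: Sequence[int],
    weights: Dict[int, float],
) -> float:
    """Class-weighted negative log-likelihood, normalized by the total weight of the batch."""
    total_weight = sum(weights[y] for y in labels)
    loss = -sum(weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return loss / total_weight


# Toy batch: class 1 is three times rarer than class 0.
toy_labels = [0, 0, 0, 1]
toy_weights = balanced_class_weights(toy_labels)
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.4, 0.6]]
toy_loss = weighted_cross_entropy(toy_probs, toy_labels, toy_weights)
```

With these weights, a mistake on the single minority example costs as much as mistakes on all three majority examples combined, which counteracts the pull toward the majority class.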
GIL-Zaragoza at SemEval 2026 Task 11: Comparing Classification, Autoformalization, and Ontologies for Formal Reasoning Capabilities
Francisco Lopez-Ponce | Lucia Pitarch | Iván Saavedra Martínez | Ignacio Huitzil | Sergio Ojeda Trueba | Fernando Bobillo | Gemma Bel-Enguix
This paper describes our participation in Task 11 of SemEval-2026, which evaluates the ability of models to determine logical validity of syllogisms independent of real-world content. We develop and compare three approaches for Subtask 1: (1) an encoder-based classification baseline using both classical ML methods and fine-tuned BERT with debiasing strategies; (2) an autoformalization pipeline combining DPO-aligned models with first-order logic translation and formal inference via Prover9; and (3) a hybrid neuro-symbolic approach using GPT to generate OWL 2 ontologies evaluated with the HermiT reasoner. Our best result was achieved by the encoder-based classifier, obtaining 72.25% accuracy and a combined score of 20.37, placing 40th out of 45 participating teams. Analysis shows that classification methods exhibit lower content bias, autoformalization approaches suffer from translation inconsistencies and syntax incompatibilities, and ontology-based reasoning is hindered by prompt design limitations and verbose serialization formats. All our code can be found in the paper’s repository.
Polito Team at SemEval-2026 Task 8: Scaling Multi-Turn RAG: High-Performance Parallelized Pipeline for the MTRAG Benchmark
Murat Çelik | Nejla Dinçer | Can Ersoy | Mert Toprak | Barış Ünal | Riccardo Coppola | Flavio Giobergia
Recently, Retrieval-Augmented Generation (RAG) has become a significant task for Large Language Models (LLMs). In multi-turn RAG, a good system must maintain context as the dialogue progresses and generate answers that take the conversation history into account. In this work, we address the MTRAGEval Task 8 at SemEval-2026 by presenting a high-performance, parallelised Multi-Turn RAG pipeline designed to address three subtasks: Retrieval (Subtask A), Generation (Subtask B), and End-to-End RAG (Subtask C). Our methodology utilises a Streamlit framework that allows users to embed diverse corpora with varying vector spaces and embedding models, facilitating configuration for each task based on its nature. Our key experiments focus on the performance of different vector databases and embedding models, the necessity of LLM-based query rewriting (QR) for non-standalone questions, the use of different rerankers, and the scale and performance of the selected LLM for answer generation. We conclude that a configuration utilising query rewriting along with reranking delivers the best results. The code is available on GitHub: https://github.com/merttoprak1/MTRAGEval-Evaluating-Multi-Turn-RAG-Conversations.
CICL26 at SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning
Wanzhao Zhang | Yue Yu
This paper describes our submission to SemEval-2026 Task 4 (Track A) on narrative similarity. The task requires systems to determine which of two candidate stories is more narratively similar to a given anchor story. While large language models (LLMs) demonstrate strong semantic reasoning abilities, their predictions in comparative settings can be sensitive to stochastic decoding and input order. We propose a lightweight inference-time cascade strategy that improves robustness without modifying the underlying model. Our approach combines self-consistency voting to reduce sampling variance, a swap-based symmetry test to mitigate positional bias, and a margin-based deterministic decision rule to resolve disagreements. This design explicitly leverages model uncertainty while maintaining reproducibility and simplicity.
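One way the three components named in this abstract could fit together is sketched below. This is an illustrative reading, not the authors' implementation: the `judge` callable (one sampled LLM comparison), the `fallback` callable (a deterministic tie-breaker), the number of samples `k`, and the `margin` value are all hypothetical.

```python
from typing import Callable, Tuple

# judge(anchor, first, second) returns "first" or "second"; it may be stochastic.
Judge = Callable[[str, str, str], str]
# fallback(anchor, a, b) returns "A" or "B" deterministically.
Fallback = Callable[[str, str, str], str]


def votes_for_a(judge: Judge, anchor: str, a: str, b: str, k: int) -> Tuple[int, int]:
    """Count votes for story A in the original order and in the swapped order."""
    original = sum(1 for _ in range(k) if judge(anchor, a, b) == "first")
    swapped = sum(1 for _ in range(k) if judge(anchor, b, a) == "second")
    return original, swapped


def cascade_decide(
    judge: Judge,
    fallback: Fallback,
    anchor: str,
    a: str,
    b: str,
    k: int = 5,
    margin: float = 0.2,
) -> str:
    """Accept the vote only if both orders agree and the pooled vote share clears the margin."""
    original, swapped = votes_for_a(judge, anchor, a, b, k)
    share_a = (original + swapped) / (2 * k)
    # Swap test: the majority should not flip when the candidates trade places.
    consistent = (original * 2 > k) == (swapped * 2 > k)
    if consistent and abs(share_a - 0.5) >= margin:
        return "A" if share_a > 0.5 else "B"
    return fallback(anchor, a, b)


def content_judge(anchor: str, first: str, second: str) -> str:
    """Toy judge that prefers whichever story mentions a dragon."""
    return "first" if "dragon" in first else "second"


def position_biased_judge(anchor: str, first: str, second: str) -> str:
    """Toy judge that always picks the first slot, regardless of content."""
    return "first"
```

A judge that always favours the first slot splits its votes evenly once the order is swapped, so the cascade hands such cases to the deterministic fallback instead of trusting the vote.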
UCSC-NLP at SemEval-2026 Task 13: Multi-View Generalization and Diagnostic Analysis of Machine-Generated Code Detection
Kargi Chauhan | Sadiba Nusrat
This paper presents the system for SemEval-2026 Task 13, addressing both binary detection (Subtask A) and multi-class attribution (Subtask B). For Subtask A, we propose a robust multi-view training framework using UniXcoder-base, incorporating domain-specific structural prefixes, delexicalization with symmetric KL consistency loss, and token dropout. Our system achieves a high macro F1 of 0.845 on the out-of-distribution test set, demonstrating strong generalization across five unseen languages and two unseen domains. For Subtask B, we provide a rigorous diagnostic analysis of majority-class bias in transformer-based detectors. We reveal a significant performance gap where an 88.4% accuracy masks a near-complete failure in minority-class attribution (0.086 Macro F1), highlighting that standard fine-tuning is insufficient for fine-grained generator identification. Our results expose distinct regimes in code detection and motivate the need for imbalance-aware, structure-focused modeling in future work.
MIUN BiasPatrol at SemEval-2026 Task 13: Why TF-IDF can Beat Transformers for OOD Code Detection
Loviza Sahlen | Thomas Springfeldt | Mehwish Fatima | Raja Khurram Shahzad
The increasing use of AI-generated code underscores the need for effective detection systems. However, their performance often deteriorates when faced with distribution shifts. This paper presents our system for SemEval-2026 Task 13, Subtask A, which focuses on binary classification of human-written versus machine-generated code across various programming languages and domains. We systematically compare traditional classifiers, such as Random Forest and XGBoost, which utilize statistical and TF-IDF features, against pipelines that leverage frozen embeddings from advanced code transformers like UniXcoder and GraphCodeBERT. Our results reveal a notable trade-off: while transformer-based pipelines excel in in-distribution validation (reaching up to 0.89 Macro F1), they experience severe performance drops of up to 77% when applied to out-of-distribution languages and domains. In contrast, models employing TF-IDF with tree-based classifiers demonstrate significantly greater stability. We identify this fragility as a result of a bias toward superficial formatting, particularly whitespace, which is accentuated by transformers. By implementing simple space normalization, we enhance the out-of-distribution robustness of traditional models; however, this also highlights the ongoing dependence of embeddings on these non-semantic features. Our findings suggest that for creating generalizable code detection systems, straightforward, well-normalized lexical features may be more reliable than complex, unrefined embeddings.
MINDS at SemEval-2026 Task 9: A Multi-Paradigm Approach to Cross-Lingual Polarization Detection
Angelo Iannielli | Samuele Maroli | Marco Roberto | Stefano Sammartino | Valentino Vacirca | Claudio Savelli | Riccardo Coppola | Flavio Giobergia
Online polarization has become a central challenge in digital discourse, characterized by hostility, identity-based division, and culturally dependent expressions that vary across languages. Automatically detecting such phenomena is particularly difficult in multilingual settings, where semantic nuance and implicit rhetoric complicate cross-lingual generalization. In this context, we participate in POLAR, a shared task at SemEval 2026 on multilingual polarization detection and categorization across 22 languages. We compare three modeling paradigms: multilingual encoder fine-tuning, translation-based transfer learning, and prompting-based generative reasoning. For the multi-label categorization task, we introduce a two-stage cascaded architecture to mitigate false positives under severe class imbalance. Our results show that multilingual encoders achieve the most robust performance for binary detection, whereas reasoning-based prompting is competitive for fine-grained category classification. This comparative study highlights the strengths and limitations of each paradigm for cross-lingual polarization analysis.
GuysLLM at SemEval-2026 Task 5: NLI-Informed Regression for Graded Word-Sense Plausibility in Narrative Contexts
Niccoló Antonelli-Dziri | Sixtine Marcotte | Emanuele Rosapepe | Gabriele Santona | Omar Wafaay | Lorenzo Vaiani | Riccardo Coppola | Flavio Giobergia
While large language models (LLMs) excel at semantic reasoning, their discrete token-based outputs introduce limitations for fine-grained regression tasks requiring continuous scoring. We address graded word-sense plausibility estimation by reformulating it as a Natural Language Inference (NLI) regression problem, adapting DeBERTa-v3-large with NLI pretraining and a regression head to predict continuous plausibility scores from story-sense pairs. We compare this model against BERT, vanilla DeBERTa, SmolLM variants, and state-of-the-art LLMs under various prompting strategies, and show that the NLI-finetuned model achieves superior rank correlation and alignment with human judgments. While several baselines collapse toward mean predictions and LLMs show unstable prompting sensitivity, our findings establish NLI-informed pretraining as highly effective for narrative plausibility regression, highlighting fundamental LLM limitations for word sense disambiguation.
AbstractReasoner at SemEval-2026 Task 11: Reducing Content Effects via Knowledge Distillation and Structured Reasoning Prompts
Akash Chowdhury | Vlad Pavlovich | Julius Dunfoy | Sophia Yang | Abhiram Borra
Syllogistic reasoning serves as a critical diagnostic for evaluating whether Large Language Models (LLMs) perform genuine logical inference or rely on semantic shortcuts. SemEval-2026 Task 11 explores "content effects", where model judgments are biased by world knowledge rather than logical form. Recent work has shown that LLM optimization techniques provide substantial performance gains in mitigating content effects. To contribute to this research domain, this paper performs a systematic study of different intervention strategies: zero-shot chain of thought, symbolic representation, activation steering, and supervised fine-tuning along with prompt optimization during inference. We achieved the best performance with our largest model (Phi-4 14B) by fine-tuning with chain-of-thought distillation, symbolic abstractions, and LLM-as-optimizer prompting (FTOptim), evaluated on the held-out split derived from the training data. This approach achieved the highest Combined Smooth Score (CSS) of 31.16. Additionally, Llama 3.1 provided noteworthy performance with 31.01 CSS under the same FTOptim approach, indicating the performance gain was LLM-agnostic.
AI4PC-Howard University at SemEval-2026 Task 9: Evaluating Teacher-Student Weak Supervision and Direct LLM Prompting for Multilingual Political Polarization Detection
Surangana Aryal | Saurav Aryal
We describe the AI4PC–Howard University submission to SemEval-2026 Task 9, Subtask 1 on multilingual political polarization detection across 22 languages. We investigated two approaches: (1) a weakly supervised teacher–student framework in which a large language model (LLM) generated pseudo-labels to train an XLM-RoBERTa-base classifier, and (2) a context-engineered prompt-based approach using Meta-Llama-3.1-8B-Instruct. The teacher–student approach exhibited instability under distribution shift and collapsed toward majority predictions at test time. Consequently, our final submission used direct inference with Meta-Llama-3.1-8B-Instruct. While this approach produced competitive macro-F1 across evaluated languages, results reveal strong positive-class bias and substantial precision–recall imbalance. Our findings highlight limitations of weak supervision for subjective political tasks and underscore trade-offs between scalability, bias, and computational cost in LLM-only multilingual systems.
SpyComet at SemEval-2026 Task 11: When Adversarial Debiasing Backfires - A Comparison of Data-Level and Model-Level Debiasing
Sai Aravind C | Sunil Saumya | C Pothan Reddy
We describe MLA-CI (Multi-Layer Adversarial for Content Invariance), a DeBERTa-v3-base system for SemEval-2026 Task 11 Subtask 1 on content-invariant syllogistic reasoning. MLA-CI combines multi-layer feature extraction, gradient-reversal adversarial training, structure-preserving template augmentation, implausible-class oversampling, and focal loss. Our principal contribution is a systematic ablation study, confirmed across three random seeds, showing that adversarial training at standard strength is counterproductive: removing gradient reversal improves the mean validation score from 26.41 ± 0.99 to 38.15 ± 5.32. Per-condition analysis reveals that gradient reversal over-suppresses plausibility-correlated features, creating an inverted content effect that disproportionately harms plausible-valid accuracy. A sweep over seven adversarial pressure values reveals that only very light adversarial pressure (≤ 0.1) preserves accuracy, while the submitted adversarial pressure value (1.0 or above) causes severe degradation. We conclude that data-level debiasing through structure-preserving augmentation is more effective and robust than model-level adversarial debiasing for this task.
TeamSLS at SemEval-2026 Task 13: Detecting Machine-Generated Code with CodeBERT and Structural Features
Sai Laasya Gorantla | Shreemithra Naveen | Steven Bethard
We describe our system for SemEval-2026 Task 13 Subtask A, which focuses on detecting whether source code is written by a human or generated by an AI system. We propose a hybrid approach that combines semantic embeddings from CodeBERT with lightweight, language-agnostic structural features extracted using Tree-sitter. We compute normalized structural ratios such as nesting depth, logic density, complexity per line, average line length, and punctuation frequency. These structural signals are concatenated with CodeBERT embeddings and passed to a linear classifier for binary prediction. Experimental results on the official validation split show that combining semantic and normalized structural representations substantially improves the model’s detection performance on seen-language distributions. However, results on unseen test data reveal significant performance degradation under cross-language distribution shifts. On the official leaderboard, our system ranked 47th out of 81 participating teams.
Howard University-AI4PC at SemEval-2026 Task 1: Exploring Prompt Strategies for Automatic Humor Generation
Lawal Abdulmujeeb | Saurav Aryal
We present our system for SemEval-2026 Task 1, Subtask A, a humor generation task requiring systems to generate jokes given either a news headline or word-pair inputs. Our approach used the Llama-3.1-8B-Instruct model, which we selected after comparing several candidate models and humor strategies across our experiments. For the headline inputs, we used a two-shot prompt to frame the output as a tweet; specifying the tone proved to be a particularly important factor in output quality. For the word-pair inputs, we instructed the model to commit to an everyday situation and generate a funny thought based on it. While experimenting, we noticed that models would start a joke one way with the first word and abruptly shift context mid-joke just to include the second word; committing to a single situation helped handle that. We also made use of personas here, specifically Dave Chappelle. Our final system shared 2nd place with 3 other systems out of 32 total systems and achieved an Elo score of 1020. Achieving these results with no fine-tuning suggests that careful prompt design alone can yield competitive results.
Howard University-AI4PC at SemEval-2026 Task 8: Query Reformulation and Dense-Lexical Retrieval Fusion for Multi-Turn Retrieval-Augmented Generation
Sijan Shrestha | Saurav Aryal
We present a training-free hybrid retrieve-then-rerank system for multi-turn retrieval-augmented generation, submitted to all three subtasks of SemEval-2026 Task 8 (MTRAGEval): passage retrieval (Task A), generation with reference passages (Task B), and end-to-end RAG (Task C). Our system addresses the core multi-turn challenges (non-standalone questions, unanswerable queries, and shifting passage relevance) across four domain-specific corpora: ClapNQ, Cloud, FiQA, and Govt. Queries are reformulated through LLM-driven rewriting, decomposition into sub-queries, and Hypothetical Document Embeddings (HyDE). Retrieved candidates from dense vector search (BGE-base-en-v1.5) and BM25 lexical matching are fused via Reciprocal Rank Fusion and reranked by a cross-encoder (BGE-reranker-large). Llama-3.3-70B-Instruct generates extractive, context-grounded responses with built-in abstention for unanswerable queries. Using only open-source models without fine-tuning, the system achieves nDCG@5 of 0.4098 on Task A (22nd/38), a harmonic mean of 0.7462 on Task B (9th/26), and 0.5796 on Task C (2nd/29), coming within 1.1% of the top submission. We attribute the strong Task C result to the synergy between multi-signal query reformulation and faithful extractive generation.
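The fusion step named in the abstract above, Reciprocal Rank Fusion, can be sketched as follows. This is a generic illustration, not the authors' code: the smoothing constant `k=60` is a commonly used default rather than a value reported in the abstract, and the document ids are invented.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by id so output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

dense_hits = ["d1", "d2", "d3"]   # hypothetical dense-retriever output
bm25_hits = ["d3", "d1", "d4"]    # hypothetical BM25 output
print(reciprocal_rank_fusion([dense_hits, bm25_hits]))
```

Because only ranks enter the score, the dense and BM25 lists can be combined without calibrating their raw similarity scores against each other.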
UNF-BMI at SemEval-2026 Task 3: Research Domain Criteria-Guided Large Language Models for Dimensional Aspect-Based Sentiment Analysis
Athlene Jones | Vishwaa Shah | Indika Kahanda
We present the UNF-BMI system for SemEval-2026 Task 3, Track A, Subtask 1 (Dimensional Aspect Sentiment Regression, DimASR), which focuses on predicting continuous Valence–Arousal (VA) scores for aspects in text. Our approach integrates psychologically grounded affective signals inspired by the Research Domain Criteria (RDoC) framework. We investigate two complementary methods: first, an in-context learning framework using Mistral-7B-Instruct with semantically retrieved few-shot examples augmented by lexicon-derived RDoC valence and arousal cues; second, a supervised multi-task learning model based on RoBERTa, where VA regression is the primary objective and RDoC-based positive/negative signal prediction serves as an auxiliary task to regularize shared representations. Experiments on English laptop and restaurant review datasets demonstrate that incorporating RDoC-inspired affective priors reduces RMSE compared to baselines, particularly in low-signal text where explicit sentiment cues are sparse.
DFKI-MLT at SemEval-2026 TASK 7: Steering Multilingual Models Towards Cultural Knowledge
Yusser Al Ghussin | Daniil Gurgurov | Yasser Hamidullah | Josef Van Genabith | Cristina España-Bonet | Simon Ostermann
Large language models (LLMs) are increasingly used across diverse linguistic and cultural contexts, yet their cultural knowledge remains uneven across regions and languages. We present the DFKI-MLT system for SemEval-2026 Task 7 on cultural awareness, where we apply activation steering to multilingual LLMs using language vectors extracted from parallel FLORES data. Our method performs inference-time adaptation by adding language-specific steering vectors to the residual stream at a selected transformer layer, without any parameter updates. We participated in both the short-answer (SAQ) and multiple-choice (MCQ) tracks; however, only our MCQ submission received an official score. In the official MCQ track, we achieved 86.96% accuracy, ranking 7th out of 17 teams. To better understand system behavior, we conduct post-hoc analyses on the shared-task MCQ and SAQ settings. These analyses show that activation steering yields modest and heterogeneous improvements on cultural reasoning: gains are strongly layer-sensitive, vary substantially across language–region pairs (some configurations even degrade performance), and interact with prompt formulation (generic vs. culturally conditioned prompts). Our findings suggest that prompt design and activation steering should be jointly optimized for culturally aware multilingual inference. We release our code and experimental configurations at https://github.com/Yusser96/SemEval-2026-Track7.
sutta at SemEval-2026 Task 12: A Multi-Perspective Retrieve-Verify-Aggregate Framework for Abductive Event Reasoning
Junliu Zou | Liang Yang | Jingjie Zeng
We present our system for SemEval-2026 Task 12: Abductive Event Reasoning (AER). The task asks models to identify the direct causes of real-world events from multiple-choice options using retrieved documents. Rather than fine-tuning on the training data, we built a zero-shot "Retrieve-Verify-Aggregate" pipeline around Qwen3-8B. We first isolate relevant evidence using BM25 and cross-encoder reranking. To evaluate causal links, we prompt the model with several distinct "personas" and aggregate their independent decisions through majority voting. Our system scored 0.7614 on the official test set. This performance suggests that strict retrieval combined with diverse reasoning prompts can help compact open-source models ignore irrelevant context and perform complex causal inference, entirely without task-specific training.
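The aggregation step in the abstract above, majority voting over persona decisions, might look like the sketch below. It assumes each persona returns a single option label and that ties are broken alphabetically; neither detail is stated in the abstract, and the helper name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent label among the persona decisions.

    Assumption: one label per persona; ties go to the alphabetically
    smallest label so the result is deterministic.
    """
    counts = Counter(votes)
    top = max(counts.values())
    return min(label for label, count in counts.items() if count == top)

persona_votes = ["B", "A", "B", "C"]  # hypothetical outputs of four personas
print(majority_vote(persona_votes))
```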
Mendel292 at SemEval-2026 Task 4: Disentangled Narrative Embeddings for Story Similarity
Mauricio Gruppi | Sankalpa Rijal | Justin Debenedetto
This paper describes Mendel292, our system for SemEval-2026 Task 4 on Narrative Story Similarity. We introduce a narrative encoder that decomposes story representations into explicit subspaces for abstract theme, course of action, and outcome, built on a pre-trained sentence embedding model and a trainable BiLSTM projection layer with a triplet margin loss objective. We augment the training set via backtranslation and incorporate weakly supervised multi-task objectives derived from unsupervised narrative clustering. The proposed architecture was designed to learn a latent representation of narratives in a few-shot setting due to a limited amount of training data. Despite using a rich pre-trained transformer, the model was outperformed by an unsupervised pooling approach on the classification task. While our systems do not match the top leaderboard scores, they allow us to systematically study the effects of subspace factorization, weak labels, and data augmentation on narrative similarity modeling.
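For readers unfamiliar with the triplet margin loss mentioned in the abstract above, here is a small pure-Python sketch of the standard formulation, max(0, d(anchor, positive) - d(anchor, negative) + margin). The Euclidean distance and `margin=1.0` are assumptions for illustration; the abstract does not state which distance or margin the authors used.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Zero once the negative is at least `margin` farther than the positive."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(0.0, gap + margin)

anchor = [0.0, 0.0]
print(triplet_margin_loss(anchor, [0.0, 1.0], [3.0, 4.0]))  # negative is far away
print(triplet_margin_loss(anchor, [0.0, 1.0], [0.0, 1.5]))  # negative is too close
```

The loss only pushes on triplets that violate the margin, which is why hard negatives matter for this objective.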
GUIR at SemEval-2026 Task 8: Training-Free Multi-Query Fusion for Robust Conversational Retrieval
Pasha Abrishamchian | Ophir Frieder | Nazli Goharian
We describe our SemEval-2026 Task 8 Subtask A system, which focuses on evaluating and improving the retrieval aspect of multi-turn Retrieval-Augmented Generation (RAG) conversations. We implement a training-free fusion approach that combines three distinct query representations to retrieve documents independently. The results from these three views are pooled and reranked using a MonoT5 cross-encoder. Our findings demonstrate that this fusion approach consistently outperforms single-strategy baselines, revealing that optimal retrieval strategies vary significantly at the query level, and establishing multi-query fusion as a baseline for multi-turn RAG systems.
AI4PC-Howard University at SemEval-2026 Task 5: Calibrated Hybrid Ensembling and Retrieval-Augmented LLM Reasoning for Narrative Word-Sense Plausibility
Kwaku Asare | Saurav Aryal
We present two complementary approaches for rating word-sense plausibility in SemEval-2026 Task 5 (literary homonyms in five-sentence stories). Approach 1 is a retrieve-then-generate pipeline using an open-weight Llama 3.1 70B Instruct model with structured reasoning and a self-correction pass. Approach 2 is a hybrid ensemble that combines API-based LLM prompting with transformer representations and a learned calibration layer trained on the development set. On the development set, Approach 2 achieves Spearman ρ = 0.7393 (p < 10⁻¹⁰²) with accuracy 0.8010 (471/588). Approach 1 achieves ρ = 0.5187 (p < 10⁻⁶⁵) with accuracy 0.6032 (561/930). We emphasize that Approach 1 does not exceed RoBERTa-base in accuracy (0.6032 vs. 0.6410), but provides stronger rank correlation.
Howard University-AI4PC at SemEval-2026 Task 7: Culturally Aware Multilingual Model Routing Through a Mixture-of-Specialists Framework
Isaac Adjei | Saurav Aryal
SemEval-2026 Task 7 (BLEnD) evaluates culturally contextual multiple-choice reasoning across 26 languages and 30 geographic regions, emphasizing everyday knowledge, cultural norms, and region-specific variations in language use. This paper presents the Howard University–AI4PC system, a Phase 1 implementation of a culturally aware Mixture-of-Specialists (MoS) framework designed to improve multilingual cultural reasoning without requiring large-scale fine-tuning. Our approach integrates four key components: (1) linguistic and regional metadata extraction for identifying language, dialect, and cultural context; (2) a hierarchical routing strategy that selects the most culturally aligned model path; (3) Model Control Prompting (MCP), which injects region-aware constraints, dialectal hints, and output-format controls; and (4) a lightweight retrieval-augmented layer that supplies culturally specific factual cues. Although specialist LoRA/QLoRA adapters are planned for future phases, the routing and prompting layers alone achieve 80.01% accuracy on 47,014 test MCQs, demonstrating that cultural grounding and linguistically informed routing substantially enhance performance even in the absence of trained experts. We summarize the task, describe the system in detail, present quantitative and qualitative analyses, and outline next-stage extensions involving specialist model training and expanded cultural knowledge integration.
GenAIus at SemEval-2026 Task 8: Beyond Retrieval with Relevance-Aware RAG for Faithful Multi-Turn Generation
Suveyda Yeniterzi | Reyyan Yeniterzi
This paper describes our submission to SemEval-2026 Task 8 on multi-turn retrieval-augmented generation (RAG). We propose a hybrid multi-stage pipeline that combines high-recall lexical retrieval, dual-embedding dense re-ranking with reciprocal rank fusion, LLM-based relevance judging, and strictly constrained evidence-grounded generation. Our design emphasizes robustness and faithfulness across the full retrieval-to-generation pipeline. Our results suggest that relevance-aware filtering and constrained generation are important for improving faithfulness and overall RAG performance.
FregeLogic at SemEval 2026 Task 11: A Hybrid Neuro-Symbolic Architecture for Content-Robust Syllogistic Validity Prediction
Adewale Akinfaderin | Nafi Diallo
We present FregeLogic, a hybrid neuro-symbolic system for SemEval-2026 Task 11 (Subtask 1), which addresses syllogistic validity prediction while reducing content effects on predictions. Our approach combines an ensemble of five LLM classifiers, spanning three open-weights models (Llama 4 Maverick, Llama 4 Scout, and Qwen3-32B) paired with varied prompting strategies, with a Z3 SMT solver that serves as a formal logic tiebreaker. The central hypothesis is that LLM disagreement within the ensemble signals likely content-biased errors, where real-world believability interferes with logical judgment. By deferring to Z3’s structurally-grounded formal verification on these disputed cases, our system achieves 94.3% accuracy with a content effect of 2.85 and a combined score of 41.88 in nested 5-fold cross-validation on the dataset (N = 960). This represents a 2.76-point improvement in combined score over the pure ensemble (39.12), with a 0.9% accuracy gain, driven by a 16% reduction in content effect (3.39→2.85). Adopting structured-output API calls for Z3 extraction reduced failure rates from ∼22% to near zero, and an Aristotelian encoding with existence axioms was validated against task annotations. Our results suggest that targeted neuro-symbolic integration, applying formal methods precisely where ensemble consensus is lowest, can improve the combined accuracy-plus-content-effect metric used by this task.
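The tiebreaker idea in the abstract above, trusting the ensemble when it agrees and deferring to a formal solver when it does not, can be sketched as below. This is a simplification under stated assumptions: "disputed" is taken to mean any non-unanimous vote, the solver verdict is passed in as a plain boolean (or `None` when formalization failed) instead of calling Z3, and the function name is hypothetical.

```python
from collections import Counter

def predict_validity(llm_votes, solver_verdict):
    """Combine ensemble votes with a formal-logic verdict.

    llm_votes: list of bools, one per LLM classifier (True = valid).
    solver_verdict: bool from a formal solver, or None if unavailable.
    Assumption: the solver is consulted only when the ensemble disagrees.
    """
    counts = Counter(llm_votes)
    majority = counts[True] > counts[False]
    unanimous = len(counts) == 1
    if unanimous or solver_verdict is None:
        return majority
    return solver_verdict

print(predict_validity([True, True, True, False, False], solver_verdict=False))
```

Restricting the solver to disputed items keeps formalization errors from overriding cases where the ensemble is already confident.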
Tübingen-CL at SemEval-2026 Task 12: Reinforcement Learning and Verification for Abductive Reasoning
Bolun Liang | Ayperi Khudaybergenova | Shashikala Kankanamge
We investigate the reliability of verifier-based pipelines for abductive reasoning in SemEval-2026 Task 12. While reinforcement learning improves the base generator’s performance, we find that incorporating a small-model verifier introduces a significant generalization gap: although effective on validation data, the verifier systematically degrades correct predictions on the unseen test set by appending false positives. Furthermore, we reveal a critical vulnerability in the official evaluation metric, which assigns zero reward to abstentions but does not sufficiently penalize incorrect selections. This asymmetry enables trivial heuristic strategies such as blindly selecting a default option to substantially inflate performance, even outperforming more principled reasoning systems. Our analysis demonstrates that current evaluation protocols can misrepresent true reasoning ability and highlights the need for more robust verification methods and scoring schemes.
AI4PC-Howard University at SemEval-2026 Task 12: Evidence-Guided Abductive Scoring with Option-Conditioned Retrieval and Constrained LLM Evaluation
Ifeoluwakiitan Ayandosu | Saurav Aryal
Abductive event reasoning in the wild requires selecting plausible explanations for an event from noisy, partially relevant multi-document context. We present an evidence-guided abductive scoring pipeline for SemEval-2026 Task 12 that separates evidence selection from explanation scoring. For each topic, we chunk documents and retrieve option-conditioned evidence using dense embeddings, then apply a cross-encoder reranker to form compact evidence packs per option. A constrained large language model scorer evaluates each option using only its evidence pack and outputs structured signals capturing evidence support, explanatory directness, and contradiction. We then apply deterministic decision rules to produce single or multi-label predictions, including robust handling of “none of the above” style options through lexical-cue detection rather than reliance on option position. This modular design reduces distraction from irrelevant documents, improves comparability across options, and enables controlled calibration for multi-answer outputs. Our approach demonstrates that retrieval-focused evidence compression combined with disciplined, signal-based scoring can effectively support abductive reasoning without explicit knowledge graphs or end-to-end prompting over full document context.
UPR at SemEval-2026 Task 9: Polarization Detection in Urdu with Language-Specific Transformer and Data Augmentation
Alishba Wazir | Muhammad Asad Khan | Junaid Rashid | Shamaila Hayat | Samira Kanwal
This paper addresses polarization detection in Urdu, a low-resource language characterized by complex morphology and insufficient annotated data. We formulate the task as a binary classification problem of social media posts into polarized and non-polarized categories. Our approach is based on Urdu-BERT, a language-specific transformer model combined with language-specific preprocessing, duplicate removal, and data augmentation to mitigate class imbalance and improve generalization. Experimental results show that the fine-tuned Urdu-BERT outperforms TF-IDF-based lexical machine learning baselines and achieves strong performance relative to multilingual transformer baselines. The findings indicate that language-specific pretrained transformers, when combined with appropriate preprocessing and augmentation strategies, provide an effective and generalizable framework for low-resource Urdu polarization detection.
UPR at SemEval-2026 Task 9: Multi-Label Classification of Polarization Across Social Dimensions and Manifestation Identification in Urdu
Mtayyaba Shahzad | Inzmam Khadam | Zaufishan Mahmood | Junaid Rashid | Shamaila Hayat | Fakhar Ayub
The analysis of polarized content on social networks is crucial for understanding public discourse; however, research on low-resource languages such as Urdu remains limited. In this work, we address two complementary subtasks of polarization analysis in Urdu social media text. First, we formulate polarization classification across multiple social dimensions as a multi-label task, including political, religious, racial/ethnic, gender/sexual, and other. We fine-tune XLM-RoBERTa for multi-label classification with language-specific preprocessing, duplicate filtering, and data augmentation to handle class imbalance. The proposed model achieves a Macro F1-score of 0.758 for social-dimension polarization classification. Second, we perform polarization manifestation identification, focusing on how polarization is expressed in text through six manifestations: stereotype, vilification, dehumanization, extreme language, lack of empathy, and invalidation. Using the same transformer-based framework with imbalance-aware training, our system achieves a Macro F1-score of 0.72 on the official test set. These results demonstrate the effectiveness of multilingual transformer models for multi-dimensional polarization analysis in low-resource Urdu text.
The Classics at SemEval-2026 Task 3: Combining Transformer Models and LLM-Generated Annotations for Dimensional Aspect-Based Sentiment Analysis
Rafif Alshawi | Amit Raj - | Aleksey Kudelya | Alexander Shirnin
This paper presents an approach to the SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis. We investigate methods for moving beyond traditional categorical sentiment (e.g., positive or negative) to predict fine-grained, real-valued scores for sentiment "valence" (positivity) and "arousal" (intensity). We participate in two subtasks: predicting these scores for given aspects (Subtask 1) and extracting full sets of sentiment details, including aspects, categories, and opinions alongside their scores (Subtask 3). Our approach for the regression task involves a weighted ensemble of transformer-based encoder models. For the Russian language, we further enhance the input by using a large language model (LLM) to generate synthetic sentiment descriptions. For the extraction task, we fine-tune a decoder LLM to perform structured prediction, allowing the system to identify sentiment elements and estimate their numerical scores simultaneously.
UTD-HLTRI at SemEval-2026 Task 7: Bridging Cultural Knowledge Gaps in LLMs via Web-Augmented Context
Mohammad Marufur Rahman | Rakshitha Rao Ailneni | Sanda Harabagiu
Though Large Language Models (LLMs) have been serving global users through a wide range of services, concerns remain regarding their cultural bias and misalignment with people of underrepresented communities. Increasing use of LLMs presents significant implications, as they have the potential to influence people’s original values toward a certain cultural perspective. Cultural alignment of LLMs with culture-specific knowledge offers a suitable solution to this concern. In our participation in SemEval-2026 Task 7, we considered a prompt engineering-based cultural alignment strategy to address the cultural knowledge gap in LLMs. Our approach achieved a promising 86.34% accuracy for Japanese culture-relevant multiple-choice questions from the BLEnD benchmark.
MoodMetric at SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning
Samanvitha Bolisetty | Shreya Ashar | Nishchay Mittal | Pruthwik Mishra
This paper presents our system for narrative similarity modeling in SemEval Task 4, focusing on transformer-based dense embedding approaches. Modeling similarity between long-form narratives is particularly challenging due to the need to capture event progression, causal structure, character dynamics, and thematic coherence beyond surface-level lexical overlap. We evaluate multiple pretrained encoder-only architectures, including DeBERTa-v3, BGE-Base, BGE-Large, and E5-Large, fine-tuned using triplet margin and contrastive objectives. In addition, we implement a hybrid lexical–semantic baseline combining TF-IDF and SBERT features. Our experiments analyze the impact of model scale, pooling strategies, layer freezing, training duration, and embedding-level ensembling under low-resource conditions (approximately 1,900 training triplets, with additional synthetic augmentation). Results show that larger contrastively pretrained embedding models consistently outperform smaller variants, with BGE-Large achieving the strongest standalone performance. However, performance saturates quickly, and moderate fine-tuning (4–5 epochs) yields optimal validation accuracy, while extended training leads to overfitting. Instruction-tuned embeddings do not demonstrate significant advantages over contrastively aligned alternatives for this task. Finally, arithmetic averaging of embeddings from diverse models produces the most robust representations, achieving approximately 65% validation accuracy.
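The embedding-level ensembling described in the abstract above, arithmetic averaging of embeddings from several models, could be sketched as follows. The sketch assumes all models produce vectors of the same dimensionality and that the final decision compares cosine similarity to an anchor story; both are assumptions, and the toy vectors are invented.

```python
import math

def average_embeddings(vectors):
    """Element-wise arithmetic mean of same-length vectors, one per model."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def closer_story(anchor, story_a, story_b):
    """Pick the candidate whose embedding is more similar to the anchor."""
    return "A" if cosine(anchor, story_a) >= cosine(anchor, story_b) else "B"

# Toy 2-d "embeddings" of one story from two hypothetical models.
fused = average_embeddings([[1.0, 0.0], [0.0, 1.0]])
print(fused)
print(closer_story([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]))
```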
PLlama at SemEval-2026 Task 4: Zero-shot Prompting with Llama-3.2 for Narrative Similarity
Kanishka Jain
This paper describes our submission to the SemEval-2026 Task 4 on Narrative Story Similarity and Narrative Representation Learning. The shared task focuses on modeling the similarity across narratives on the basis of perceived relatedness between events’ causality. The task frames narrative similarity as a binary classification problem in which the models determine which of the two stories is more narratively similar to a given anchor story. Our approach leverages the pre-trained language model Llama-3.2-3B-Instruct with prompt engineering, allowing the system to assess narrative similarity without explicit fine-tuning. On the test data, our system achieved an accuracy of approximately 55% in Track A. While modest, our results establish a baseline for narrative similarity detection in large language models (LLMs) highlighting both their potential and challenges of applying computationally efficient instruction-tuned models to this task. Our analysis highlights the struggle of LLMs in capturing event causality and long range narrative dependencies.
Team HITS at SemEval-2026 Task 4: Enhancing narrative text embedding model training with hard negatives generation and self-distillation
Qian Zhou | Yi Fan | Wei Liu | Michael Strube
We first use the Qwen2.5-32B-Instruct model to generate hard negatives from three narrative dimensions. We then train a Qwen3-Embedding-8B model with a multi-negative contrastive objective and use self-distillation.
LATE-IIMAS at Semeval-2026 Task 13: Evaluating GNNs, PLMs, LLMs, and Stylometry for Automatic Code Identification
Andric Valdez | Emmanuel Ancona | Sebastián Bernardino | Helena Gomez-Adorno | Fazlourrahman Balouchzahi | Fabian Herrera
The generation of source code via Artificial Intelligence has become a prevalent practice in both academia and industry, posing significant challenges to academic integrity and authorship attribution. In this work, we address SemEval-2026 Task 13: Detecting Machine-Generated Code by evaluating the effectiveness of four distinct methodologies: Graph Neural Networks (GNNs), Pre-trained Language Models (PLMs), Large Language Models (LLMs), and Stylometric Feature Engineering using XGBoost. Our approach focuses on three specific scenarios: Subtask A (Binary Detection), Subtask B (Multi-Class Authorship), and Subtask C (Hybrid Code Detection). While our models achieved high performance during the validation phase, the transition to the final test set revealed substantial challenges in generalization, likely due to the increased diversity of programming languages and generators in the unseen data. This work serves as a foundational first step, identifying critical gaps in model robustness and highlighting the need for more sophisticated methodologies to bridge the performance gap in complex, real-world environments.
UAlberta at SemEval-2026 Task 5: Disambiguating Stories via Task Decomposition
David Basil | Junhyeon Cho | Chirooth Girigowda | Guoqing Luo | Sahir Momin | Sevryn Robinson | Ning Shi | Grzegorz Kondrak
We describe our system for predicting sense plausibility in short narratives. Our approach centers on task decomposition: instead of predicting a score directly, we break the problem into simpler subtasks and combine their outputs. We further improve performance by ensembling complementary signals, including word sense disambiguation and fine-tuned embedding models. We also find empirical support for the one-homonym-per-translation principle of Hauer and Kondrak (2020a). Our best ensemble system achieves competitive performance in the official evaluation. Our code and data are available on GitHub.
ChulaNLP at SemEval-2026 Task 4: Neural Aspect Composition for Narrative Story Embeddings
James Gampper | Attapol Rutherford
Comparing stories and narratives has proven to be a difficult task to automate because traditional vector representations fail to capture the layered and multi-faceted aspects of stories such as theme, plot progression, and resolution. We address SemEval-2026 Task 4, which requires generating vector embeddings that preserve narrative similarity relationships. We propose Neural Aspect Composition, which functions by using a Large Language Model (LLM) to decompose stories into 13 semantic narrative aspects (theme, course of action, outcomes, etc.), encodes each aspect separately using an encoder model, and learns a global importance weight for each aspect through a trained weighting layer. Our approach achieves the official test scores of 0.64 on Track A and 0.61 on Track B. During validation, it outperformed vectors produced by inputting the raw story text directly into an encoder model and a sentence-averaging baseline. The analysis of the learned weights on the development set reveals that thematic elements and narrative resolutions were the primary drivers of perceived similarity, receiving significantly higher weights than intermediate plot events and other minor details such as character introductions.
GUNLP at SemEval-2026 Task 10: Psycholinguistic Conspiracy Marker Extraction and Detection (PsyCoMark)
Rojin Ziaei | Mahsa Khoshnoodi | Nazli Goharian
This paper presents the Georgetown University NLP (GUNLP) system developed for SemEval 2026 Task 10: Psycholinguistic Conspiracy Marker Extraction and Detection, addressing the classification of conspiratorial beliefs in Reddit posts (Subtask 2). Our approach leverages COVID-Twitter-BERT v2 (CT-BERT-v2) within a multi-task learning framework that jointly optimizes conspiracy classification and emotion label prediction through a dual-head architecture. To address data scarcity, we enrich the training set using paraphrasing-based data augmentation and GPT-5-generated chain-of-thought emotion annotations, effectively doubling the training corpus to approximately 8,600 examples. We evaluate two input configurations: text only and text concatenated with emotion labels. The emotion-aware configuration achieves the strongest performance with an F1 score of 0.87 on the official development set, outperforming the text-only baseline by five F1 points and demonstrating the value of paraphrased samples and affective auxiliary supervision for conspiracy detection in social media text.
ChulaNLP at SemEval-2026 Task 5: Regression-Calibrated LLM for Word-Sense Scoring
Wayu Limsuwan | Attapol Rutherford
Word Sense Disambiguation (WSD) is typically framed as a classification task that selects one correct sense for a word. However, real language is often less clear-cut, as a homonym may support several plausible interpretations. SemEval 2026 Task 5 addresses this limitation by introducing plausibility rating, where models estimate how likely each sense is in a narrative context, aligning predictions with graded human judgments. We use GlossBERT and BEM as encoder-based baselines and show that large language models (LLMs) produce more accurate plausibility estimates. Building on this observation, we propose a regression-calibrated LLM model that applies linear regression to adjust raw LLM outputs to better match human annotation patterns. Our calibrated model achieves the highest within-standard-deviation accuracy among our evaluated systems, demonstrating that lightweight post-hoc calibration can substantially improve LLM performance on graded semantic judgment tasks.
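The post-hoc calibration idea in this abstract can be illustrated with a minimal sketch, assuming a single-feature ordinary least squares fit from raw LLM scores to mean human ratings. The function names, the 1 to 5 rating range, and the toy data are assumptions made for illustration and are not taken from the paper.

```python
def fit_linear_calibration(raw_scores, human_scores):
    """Fit human ~ slope * raw + intercept by ordinary least squares."""
    n = len(raw_scores)
    mean_x = sum(raw_scores) / n
    mean_y = sum(human_scores) / n
    var_x = sum((x - mean_x) ** 2 for x in raw_scores)
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(raw_scores, human_scores))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def calibrate(raw_score, slope, intercept, low=1.0, high=5.0):
    """Apply the fitted line and clip to the (assumed) rating range."""
    return min(high, max(low, slope * raw_score + intercept))


# Toy development data (hypothetical): the LLM uses the extremes of the
# scale more often than the averaged human ratings do.
llm_raw = [1.0, 2.0, 3.0, 4.0, 5.0]
human_mean = [2.0, 2.5, 3.0, 3.5, 4.0]

slope, intercept = fit_linear_calibration(llm_raw, human_mean)
calibrated = [calibrate(x, slope, intercept) for x in llm_raw]
```

In this toy setup the fitted line compresses the LLM's scores toward the middle of the scale, which is the kind of adjustment a linear calibrator can make without retraining the underlying model.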
The Argonauts at SemEval 2026 Task 6: Large Language Models for Response Clarity Classification: Prompting, Fine-Tuning, and Data-Centric Approaches
Sajib Bhattacharjee | Sha Newaz Mahmud | Md. Refaj Hossan | Kawsar Ahmed | Mohammed Moshiul Hoque
Detecting equivocation is essential, as indirect or evasive responses can shape public perception, influence political narratives, and undermine transparency in democratic discourse. To address the challenge of detecting evasive political responses on digital platforms, participation in the CLARITY SemEval-2026 Task was undertaken, which focuses on (i) clarity-level classification and (ii) fine-grained evasion-type classification in political question-answer contexts. This study introduces a data-centric framework that systematically examines the effects of class distribution and refinement strategies on the performance of Large Language Models (LLMs). A distribution-aware, LLM-augmented dataset was constructed by selectively paraphrasing minority-class instances to enhance class balance, and its performance was benchmarked against full, rebalanced, and undersampled training configurations. To comprehensively assess the proposed method, Qwen3-14B, Phi-4, Gemma-2 9B, and Mistral 7B were evaluated in in-context learning (ICL) settings (zero-shot and few-shot) and with LoRA fine-tuning. Experimental results indicate that fine-tuning Phi-4 with class rebalancing yields strong performance, achieving 74.77% on Subtask-1 and 51.55% on Subtask-2. Consequently, the system ranked 21st in Subtask-1 and 22nd in Subtask-2 on the official evaluation leaderboard.
IIMAS-RAG at SemEval-2026 Task 8: Hybrid Sparse-Dense Retrieval and Answerability-Conditioned Generation for Multi-Turn RAG
Vania Raya-Rios | Helena Gomez-Adorno | Leon Hecht | Pedro Vázquez-Osorio | Erick Fabián-Sandoval | Jesús Vázquez-Osorio | Diego Hernández-Bustamante
This paper presents IIMAS-RAG, our system for SemEval-2026 Task 8 on evaluating multi-turn retrieval-augmented generation. Our approach combines LLM-based query rewriting, hybrid sparse-dense retrieval with SPLADE and Voyage-3-large fused via Reciprocal Rank Fusion, and answerability-conditioned generation with GPT-4.1. The system ranked 4th out of 38 teams in Subtask A (Retrieval) and 13th out of 29 teams in Subtask C (Full RAG). Our results show that query rewriting is the most impactful retrieval component, while generation remains challenging in low-context and partially answerable scenarios.
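Reciprocal Rank Fusion, mentioned in this abstract, combines several ranked lists by summing 1 / (k + rank) for each document across the lists. The following is a minimal sketch of that general technique, not the authors' implementation; the document IDs are hypothetical, and k = 60 is a commonly used default that the abstract does not confirm.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs (each ordered best-first)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


sparse_ranking = ["d3", "d1", "d2"]  # hypothetical sparse-retriever output
dense_ranking = ["d1", "d4", "d3"]   # hypothetical dense-retriever output

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Because only ranks are used, the fusion does not require the sparse and dense scores to be on a comparable scale.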
ServSocIA at Semeval-2026 Task 9: Evaluating Prompt Strategies for Polarization Detection
Jacob Altamirano | Mario Leon Pérez | Bruno Ruiz-Juarez | Luis Chiruzzo | Helena Gomez-Adorno | Fazlourrahman Balouchzahi
This paper presents our approach to Subtask 1 of SemEval-2026 Task 9 on multilingual polarization detection in social media texts in English and Spanish. We model the task as a prompt-based binary classification problem and systematically compare zero-shot, one-shot, and few-shot strategies across multiple large language models accessed via commercial APIs, without task-specific fine-tuning. Our controlled experimental setup enforces strict data separation and consistent decoding conditions to analyze the impact of in-context supervision across architectures and languages. Results indicate that well-structured prompting enables competitive performance, though implicit and culturally nuanced polarization remains challenging.
This paper describes our system for POLAR Subtask 1 on multilingual polarization detection. The task involves binary sequence classification over 22 languages, where the model aims to predict whether a given text exhibits polarized discourse. To deal with the multilingual and resource-imbalanced nature of the dataset, we fine-tune XLM-R, a pre-trained multilingual transformer encoder, using a language-aware sampling strategy that combines all available training data into a unified multilingual corpus. Our system achieves an overall macro-F1 of 0.781 and an average accuracy of 0.823 on the official test set. Results show strong performance in low-resource languages, though some discrepancies indicate remaining class imbalance.
Cherish at SemEval-2026 Task 2: Enhancing RoBERTa-Based Models for Emotional Valence and Arousal Prediction in Ecological Essays with Personalized PLoRA and Temporal Embeddings
Cetta Parahita
This paper describes the system developed by Team Cherish for SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal over Time from Ecological Essays. Our approach models emotional dynamics in user-generated text by incorporating both personalization and temporal information into a transformer-based architecture. We use RoBERTa-large as the backbone encoder and enhance it with PLoRA and a temporal embedding module. Cherish’s model architecture is designed to maintain general semantic knowledge while subtly adapting to individual users and emotional shifts over varying temporal gaps. Our system achieved 13th place out of 29 teams in Subtask 1, obtaining a Pearson’s r composite score of 0.596 for valence prediction and 0.505 for arousal prediction. While the team also participated in Subtask 2a, technical issues during inference led to zero variance in predictions, resulting in an undefined (NaN) official correlation score.
NLP-CEIA-UFG at SemEval-2026 Task 8: Iterative Retrieval with Notes-Guided Query Refinement for Multi-Turn RAG
Guilherme Dutra | André Felipe Caraíba | Nádia Félix Da Silva | Paulo Dos Santos | Deborah Silva Fernandes | Sávio Salvarino De Oliveira
We describe NLP-CEIA-UFG, our system for SemEval-2026 Task 8, which evaluates multi-turn retrieval-augmented generation (RAG) over heterogeneous document corpora. Our pipeline centers on a three-iteration dynamic retrieval loop in which two gpt-oss-120b-powered modules, an Iterative Query Generator and a Notes Builder, interact at each step to diversify queries and accumulate structured notes on information gaps. After the loop, an Answerability Classifier routes the query to one of three generation paths (Complete Answer, Partial Answer, or Clarification Request). Hybrid BM25 and dense retrieval is fused via Reciprocal Rank Fusion and refined by the Jina listwise reranker. The retrieval pipeline is compiled under DSPy and optimized with GEPA. We achieve nDCG@5 of 0.4502 (rank 17/38, Subtask A) and HM = 0.3774 (rank 24/29, Subtask C). Post-hoc analysis identifies an over-conservative Answerability Classifier as the primary bottleneck: 75.5% of all responses were flagged as IDK by the evaluator, including 69.8% of ANSWERABLE questions, while the retrieval and generation components perform well when the classifier routes correctly. Our code is available at https://github.com/GuiiCorreia/SemEval-2026.
Sentiment Syndicate at SemEval-2026 Task 6: Reframing Political Question–Answer Interactions via Natural Language Inference for Clarity Level Classification
Rafi Rafsan
This paper presents the Sentiment Syndicate team’s submission to SemEval-2026 Task 6, Subtask 1 (CLARITY: Unmasking Political Question Evasions), which focuses on classifying the clarity level of political question–answer interactions. We investigate three modeling strategies: (1) fine-tuning a RoBERTa-based classifier, (2) reformulating the task as a Natural Language Inference (NLI) problem, and (3) leveraging large language models (LLMs) for classification. All approaches are evaluated using macro F1-score on the official dataset. Experimental results demonstrate that the NLI-based formulation outperforms the other strategies, highlighting the effectiveness of modeling semantic alignment between questions and answers. Our best-performing system achieves an F1-score of 0.67 on the test set.
clulab-retrieval at SemEval-2026 Task 8: A Comparative Analysis of Dense Retrievers and HyDE for Multi-Turn Conversational Retrieval
Hyungji Kim | Siva Rohit Kondapaneni | Steven Bethard
We present a comparative analysis of dense retrievers and retrieval strategies for multi-turn conversational retrieval in SemEval-2026 Task 8 (MTRAGEval). Our official submission employed a fine-tuned E5-based dense retriever (E5-FT, ~110M parameters) with Hypothetical Document Embeddings (HyDE), achieving nDCG@5 of .3309, ranking 31 out of 38 systems. On the development set we also compared E5-FT versus BGE embeddings, dense-only versus hybrid retrieval strategies, and HyDE versus keyword extraction approaches. We found: (1) BGE (general-purpose, ~110M) outperforms our domain-fine-tuned E5-FT (~110M) by 30.5% on baseline retrieval, suggesting that model selection may matter more than domain-specific fine-tuning, (2) hybrid retrieval combining BM25 and dense methods provides complementary signals, with HyDE improving BM25 by 26.7% and dense retrieval by 4.0%, and (3) keyword-based query simplification degrades performance by 11-28% across domains, validating HyDE’s approach of preserving semantic richness through passage-level text.
Narrative Nexus at SemEval-2026 Task 4: Modeling Narrative Similarity via Instruction-Based Fine-Tuning and Synthetic Data Augmentation
Haotan Guo | Hongbin Na | Zimu Wang | Wei Wang
Narrative similarity assessment requires models to reason beyond surface-level lexical overlap and capture higher-level plot structures and thematic relationships. In this paper, we address SemEval-2026 Task 4 Track A: Narrative Story Similarity by reformulating it as an instruction-following generation problem. We employ parameter-efficient fine-tuning via LoRA to adapt pretrained large language models for triplet-based narrative comparison. To overcome the limitations imposed by the scarcity of human-annotated data, we further incorporate synthetic triplet samples generated by a large language model for data augmentation. Experimental results demonstrate that our fine-tuned Qwen2.5-7B model achieves competitive performance, outperforming the zero-shot GPT-4o-mini baseline. These findings underscore the effectiveness of task-specific adaptation combined with synthetic data augmentation for narrative similarity modeling.
ttda704 at SemEval-2026 Task 4: Modeling Narrative Structures via Pseudonymization and Multi-View Sentence Alignment
Tai Tran Tan | An Thien
We present our approach to SemEval 2026 Task 4: Narrative Story Similarity and Narrative Representation Learning. Our solution uses contrastive learning with fine-tuned sentence transformers to capture narrative similarity across abstract themes, course of action, and outcomes. We develop two pipelines: (Track A) a single-view method that encodes full narratives with smart layer freezing to reduce overfitting, and (Track B) a multi-view method that models theme, plot, and outcome with view-specific projection heads and self-supervised alignment. Both pipelines build on sentence-transformers models and are trained with contrastive loss on synthetic data. The code is available at the following GitHub repository: https://github.com/dinhthienan33/SemEval2026-Task4-ttda704.
PingAn-NLP at SemEval-2026 Task 9: Multi-Stage Alignment via GRPO and Tiered Ensemble Voting for Multilingual Polarization Detection
Diyang Chen | Youzhen Pang
This submission describes the PingAn-NLP system for SemEval-2026 Task 9 Subtask 3, identifying polarization manifestations in 18 languages. We employ a tiered optimization framework integrating Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). Key technical innovations include synthetic reasoning distillation from a 235B teacher model, a Smart-Tradeoff reward function designed to mitigate extreme label imbalance, and a tiered ensemble voting strategy that adaptively adjusts decision thresholds based on language resources. Our 8B-GRPO-Vote system demonstrated robust internal performance in tracks like English and Hindi and officially secured second place in the Bengali, English, Odia, and Turkish competitions.
ttda704 at SemEval-2026 Task 6: Structured Chain-of-Thought Prompting for Political Evasion Detection
Tai Tran Tan | An Dinh
We present our system for SemEval-2026 Task 6 (CLARITY: Unmasking Political Question Evasions), which addresses political evasion detection in English question-answer pairs from U.S. presidential interviews. We compare two paradigms: (1) parameter-efficient fine-tuning of Qwen3 models (4B–32B) using QLoRA with tiered upsampling and weighted cross-entropy loss to address severe class imbalance, and (2) structured Chain-of-Thought (CoT) prompting with reasoning-capable API models, including DeepSeek-V3.2 and Grok-4-Fast. Our best system uses Grok-4-Fast with extended reasoning and few-shot hierarchical CoT prompting, achieving Macro F1 scores of 0.5147 on Subtask 2 (9-class evasion) and 0.7979 on Subtask 1 (3-class clarity). On the official leaderboard, it ranks 8/33 on Subtask 2 and 13/41 on Subtask 1. Ablation results show that hierarchical label presentation provides a useful reasoning scaffold and that extended reasoning helps models handle subtle pragmatic distinctions, although the strongest prompt variants are not statistically distinguishable in Macro F1.
Multi-Label Polarization Classification with twHIN-BERT and SCUT Threshold Optimization
Ilinca Vandici | Ådne Jøssing | Lukas Viestädt
Tackling task 2, we fine-tune a BERT-style encoder with classification heads added on top. We first try out different pre-trained encoder models before settling on the TwHIN-BERT multilingual model, since its pretraining corpus (mainly tweets) provides a suitable starting point for our task. To resolve the issue of diverging label annotation styles, we apply the S-Cut algorithm to calibrate thresholds for label selection and examine its impact. We inspect the resulting hidden representations in a reduced-dimensional space and use linguistic probing to examine the linguistic information encoded by our model after fine-tuning.
DigiS-FBK at SemEval-2026 Task 9: Multi-task Learning for Multilingual and Cross-cultural Polarization Classification
Veronica Orsanigo | Alan Ramponi | Elisa Leonardelli
Online polarization promotes social fragmentation, misinformation, hate, and toxic language. Polarization has been studied from social and communication perspectives, but it can also be addressed computationally as a text classification task. Due to the variety of polarization targets and manifestations, polarization is a complex phenomenon to study, and both detecting and characterizing it are challenging tasks. In this paper, we present the systems submitted by the DigiS-FBK team to SemEval-2026 Task 9 POLAR aimed at detecting polarization in textual content (subtask 1) and identifying its type (subtask 2) and manifestation (subtask 3) in a multilingual, multicultural, and multievent context. Considering the strong link between subtasks, we propose an approach that leverages a multi-task learning paradigm. Our results reveal that, despite the variability in scores across languages, the overall performance when using multi-task learning is higher than when adopting a single task approach in all subtasks.
CausalMinds at SemEval-2026 Task 12: Simple Fine-Tuning with Option Shuffling Outperforms Complex Pipelines for Abductive Event Reasoning
Vidur Gupta | Xiaofei Zhao | Jason Shaye
We describe our system for SemEval-2026 Task 12 on Abductive Event Reasoning, which requires identifying plausible direct cause(s) of real-world events. We conduct a systematic evaluation of 23 configurations spanning prompting, retrieval-augmented generation, multi-stage verification, and supervised fine-tuning across models of different scales. Across experiments, we found that fine-tuning GPT-4.1-mini with data augmentation via option shuffling consistently outperformed more complex multi-stage pipelines and larger-model prompting strategies. Our system scores 0.88 on the test dataset, ranking 19th out of 221 submissions, which is only 0.07 away from the highest scoring submission of 0.95. Interestingly, chain-of-thought prompting and multi-stage verification hurt performance compared to simpler baselines. This reinforces that simplicity can outperform complex pipelines. We document these negative results and examine the persistent gap between development (0.991) and test (0.88) scores.
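Option shuffling, as named in this abstract, can be sketched as creating extra training copies of a multiple-choice item with the options reordered and the gold indices remapped. The sketch below is an illustration under assumptions: the field names (`event`, `options`, `answers`), the example item, and the number of copies are hypothetical and do not come from the paper or the task data.

```python
import random


def shuffle_options(options, correct_indices, rng):
    """Return the options in a random order plus the remapped gold indices."""
    order = list(range(len(options)))
    rng.shuffle(order)  # order[new_position] == old_position
    shuffled = [options[old] for old in order]
    new_correct = sorted(order.index(old) for old in correct_indices)
    return shuffled, new_correct


def augment(example, n_copies, seed=0):
    """Create n_copies shuffled variants of one multiple-choice example."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        options, answers = shuffle_options(
            example["options"], example["answers"], rng)
        copies.append({"event": example["event"],
                       "options": options,
                       "answers": answers})
    return copies


# Hypothetical item with two plausible direct causes (indices 0 and 2).
example = {
    "event": "The bridge was closed to traffic.",
    "options": ["A flood damaged the supports.",
                "A festival was held nearby.",
                "Inspectors found cracks.",
                "A new bus route opened."],
    "answers": [0, 2],
}
augmented = augment(example, n_copies=3)
```

The intent of this kind of augmentation is that the model cannot rely on the position of an option and has to attend to its content.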
CUETClashing at SemEval-2026 Task 1: Multilingual Joke Generation Under Lexical and Topical Constraints Using Small Instruction-Tuned LLMs
Madiha Ahmed Chowdhury | Lamia Khan | Faozia Fariha | Symom Hossain Shohan | Mohammed Moshiul Hoque
Generating humorous text is one of the most challenging tasks in natural language generation, as models must simultaneously juggle creativity, cultural understanding, and rules. To tackle these issues, this paper introduces our system for Subtask A of SemEval-2026 Task 1: MWAHAHA - Models Write Automatic Humor And Humans Annotate, which asks for single-sentence jokes with two rules—certain words must be included, and the joke must relate to a news headline—in English, Spanish, and Chinese. Our method uses instruction-tuned language models: Qwen2.5-3B-Instruct for English and Chinese, and Salamandra-2B-Instruct for Spanish, paired with language-specific prompts, special sampling for outputs, and a strong cleaning process after jokes are generated. Without additional task-specific training, our system generates jokes that adhere to the rules in all three languages, demonstrating that simple prompt design and small, instruction-tuned models can be a strong, efficient way to generate funny text across multiple languages.
Osint at SemEval-2026 Task 13: A Distribution-Aware Framework for Machine-Generated Code Detection and Multi-Source Authorship Attribution
Shifali Agrahari | Abhishek Anand | Shubham Kannaujiya | Sanasam Ranbir Singh | Sujit Kumar
The rise of code-generating LLMs such as DeepSeek, Qwen, and Meta-LLaMA has improved developer productivity but also increased risks of plagiarism, copyright misuse, and insecure machine-generated code. While AI-text detection is well studied, machine-generated source-code detection, especially across multiple languages, LLM families, and OOD conditions, remains underexplored. SemEval-2026 Task 13 addresses this via two subtasks: (A) binary human–machine code detection and (B) multi-class authorship attribution across ten LLM families. For Subtask A, we fine-tune RoBERTa, CodeBERT, GraphCodeBERT, and StarCoderBase-1B, introducing a stratified sampling strategy with class-weighted loss to mitigate imbalance and OOD shifts. For Subtask B, we mitigate the extreme human-class imbalance using undersampling, inverse-frequency weights, syntactic noising, and curriculum-based dual-path training with TinyStarCoderPy and CodeBERT. Results on both subtasks show that long-context modeling, distribution-aligned sampling, and lightweight noise-robust training are key factors for reliable machine-generated code detection and authorship attribution in real-world settings.
PolaFusion at SemEval-2026 Task 9: Ensemble Transformers with Targeted Augmentation for Multilingual Polarization Detection
Abdullah Mohammad
We present PolaFusion, our system for SemEval-2026 Task 9, which requires detecting polarization in social media posts across 22 languages, classifying its type (Subtask 2), and identifying its rhetorical manifestation (Subtask 3). The task is characterized by severe and pervasive class imbalance across all three subtasks and all 22 languages. We address this through a combination of three strategies: a hierarchical gating architecture where a binary gatekeeper model gates two specialist classifiers trained exclusively on polarized content; an eight-model mega-ensemble combining five-fold mDeBERTa-v3-base and three-fold XLM-RoBERTa-large with soft-vote probability aggregation; and a Macro-F1-aware augmentation strategy using Qwen3-235B that generates synthetic minority-class examples only for language-label pairs that are both scarce and poorly learned. Throughout training, inverse-frequency class weighting within BCEWithLogitsLoss forces the model to attend proportionally to rare labels. Our system achieves official Macro-F1 scores of 0.800, 0.576, and 0.502 on Subtasks 1–3 respectively, outperforming the POLAR baseline by +0.040, +0.089, and +0.082 average Macro-F1 across languages. Our code is publicly available at https://github.com/Abdullah4152/PolaFuse.
NLP-CIMAT at SemEval-2026 Task 9: LLM-Based One-Shot and Cross-Lingual Data Augmentation for Polarization Detection
Miriam Calderon-Reyes | Fernando Sanchez-Vega | Adrian Pastor Lopez Monroy
This paper describes our participation in SemEval 2026 Task 9: Multilingual Text Polarization. The task requires estimating polarization levels across languages, where linguistic variability and limited annotated data pose significant challenges. To address data scarcity, we propose a pipeline that combines cross-lingual translation, synthetic data augmentation via LLMs, and domain-specific pre-trained models. Our approach leverages the hypothesis that polarization signals can transfer across languages without substantial loss of semantic alignment, enabling effective data augmentation through translation. Notably, one-shot synthetic example generation emerges as a viable strategy for enriching training data in topic-specific scenarios. Experimental results demonstrate high stability and competitive performance, achieving a macro F1-score of 0.7869 for Spanish and 0.7939 for English on the test set, ranking 21st on the official English leaderboard, while our Spanish results are competitive with top-performing systems, corresponding to 7th place.
Dawn at SemEval-2026 Task 8: Structured Control Decomposition for Faithful Multi-Turn Retrieval-Augmented Generation
Feiling Li | Xiaoya Qi | Xunyue Wang | Pusheng Chen | Zhiwen Tang | Han Yang
Multi-turn Retrieval-Augmented Generation faces structural challenges that go beyond single-turn retrieval and fusion. Context-dependent queries, cross-turn evidence accumulation, and uncertain answerability jointly affect retrieval quality and generation reliability. We propose a structured control framework that formulates multi-turn RAG as a regulated reasoning process rather than a loosely coupled pipeline. The system first performs evidence and context structuring, extracting atomic facts strictly grounded in reference passages while reconstructing a self-contained query from dialogue history. It then conducts decision-conditioned generation, where explicit control signals regarding question intent, dialogue dependency, and answerability govern response feasibility, scope, and organization. By separating structural decision making from surface realization, the framework enforces consistent information flow across stages and reduces hallucination. Experiments on SemEval-2026 Task 8 show that our approach achieves strong faithfulness and stable overall performance, ranking 17/26 on Task B (generation, H=0.6333).
SYSUpporter at SemEval-2026 Task 13: Leveraging Stylistic Signals and Language-Aware Truncation for Machine-Generated Code Detection
Longfeng Chen | Zheng Xiao
This paper describes our system for SemEval-2026 Task 13 Subtask B, which requires attributing source code to either a human author or one of 10 LLM families. Guided by dataset analysis, we identify three practical challenges: formatting fingerprints discarded by tokenizers, heterogeneous code lengths, and extreme class imbalance. We build on unixcoder-base with Explicit Stylistic Prompting, Language-Aware Truncation, and imbalance-aware training (Focal Loss, GeM pooling, multi-sample dropout, and bucket batching). Our system achieves 0.434 Macro F1 on the official hidden test set, ranking 4th out of 34 teams with only 125M parameters. Controlled 5-fold cross-validation confirms that each component contributes to the final system, and a formatting-normalization study quantifies the model’s reliance on formatting cues.
ssurface3 at SemEval-2026 Task 3: Efficient Methods for Multilingual Dimensional Aspect-Based Sentiment Analysis
Anatolii Frolov | Elisei Rykov
This paper describes our submission to the dimABSA Shared Task (Subtask 1), which requires predicting continuous Valence and Arousal scores for target aspects in multilingual reviews. We evaluate three approaches: prompting-based baselines, a multilingual encoder model, and a decoder-only LLM with supervised fine-tuning. Our main focus is efficient adaptation under multilingual data scarcity. We show that compact encoder and decoder models, when properly fine-tuned, achieve strong performance across languages and domains. To improve training stability and enforce valid predictions, we use a bounded regression formulation that maps outputs to the target score range. We also explore parameter-efficient fine-tuning and intermediate training on external affective data. Results show that prompting-based baselines are substantially weaker than supervised models. The multilingual encoder provides a strong and efficient baseline, while the best performance is achieved by a compact decoder model with parameter-efficient fine-tuning. Overall, our findings highlight the importance of careful fine-tuning and training design for multilingual dimensional sentiment analysis.
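One common way to realize a bounded regression output is to pass the model's unbounded value through a sigmoid and rescale it to the target interval. The sketch below illustrates that idea only; the abstract does not say which squashing function the authors use, and the 1 to 9 score range and the function names here are assumptions.

```python
import math


def stable_sigmoid(x):
    """Logistic function written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bounded_score(raw_output, low=1.0, high=9.0):
    """Map an unbounded regression output into [low, high] (assumed range)."""
    return low + (high - low) * stable_sigmoid(raw_output)


# Hypothetical raw model outputs for (valence, arousal) of one aspect.
raw_valence, raw_arousal = 2.3, -0.7
valence = bounded_score(raw_valence)
arousal = bounded_score(raw_arousal)
```

With this formulation every prediction is valid by construction, so no clipping is needed after inference.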
The Argonauts at SemEval-2026 Task 9: Multilingual Polarization Detection and Classification Using LLM Prompting and Transformer Fine-Tuning
Sha Newaz Mahmud | Sajib Bhattacharjee | Md. Refaj Hossan | Kawsar Ahmed | Mohammed Moshiul Hoque
Online polarization, defined as the pronounced division of public opinion into antagonistic groups, poses a significant threat to social cohesion. Automatic detection of polarization across diverse languages and cultures is essential for effective monitoring of online discourse. The challenge extends beyond identifying hate speech to recognizing more nuanced forms, including negative stereotypes, attribution of blame, and dehumanization. This work addresses SemEval-2026 Task 9, which focuses on detecting polarization in multiple languages. Specifically, Subtask 1 involves binary classification of message polarization, while Subtask 2 requires assigning multiple polarization labels in English and Bengali. For Subtask 1, Qwen3-14B is employed with structured few-shot prompting in 4-bit mode, yielding test macro-F1 scores of 0.847 for Bengali (4th place) and 0.808 for English (9th place). For Subtask 2, XLM-RoBERTa-large and RoBERTa-base are fine-tuned using an uneven loss (γ+ = 1, γ− = 4) and label-specific thresholds, which increase development macro F1 by up to 24.6 points. The final test macro F1 for English is 0.454 (21st place). Analysis indicates that large language model prompting enhances binary polarization detection, while threshold adjustment is critical for addressing class imbalance in multi-label tasks.
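Label-specific thresholds are typically chosen per label on development data. The sketch below assumes a simple grid search that maximizes per-label F1; the grid, the function names, and the selection criterion are assumptions, since the abstract does not describe the tuning procedure.

```python
from typing import List, Optional, Sequence


def f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def tune_thresholds(probs: Sequence[Sequence[float]],
                    labels: Sequence[Sequence[int]],
                    grid: Optional[List[float]] = None) -> List[float]:
    """Pick, for each label column, the grid threshold with the best dev F1."""
    grid = grid or [i / 20 for i in range(1, 20)]  # 0.05 .. 0.95 (assumed grid)
    thresholds = []
    for j in range(len(labels[0])):
        y_true = [row[j] for row in labels]
        scores = [row[j] for row in probs]
        # max() keeps the first (lowest) threshold among ties.
        best = max(grid, key=lambda t: f1(y_true, [int(s >= t) for s in scores]))
        thresholds.append(best)
    return thresholds
```

Rare labels usually end up with thresholds below 0.5, which is one way such tuning can raise macro F1 under class imbalance.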
TFB at SemEval-2026 Task 4: Diagnosing Model Failures in Narrative Understanding
Anna Colli | Benedictus Kent Rachmat | Eve Sauvage | Delphine Battistelli | Thomas Gerald | Cyril Grouin | Julien Tourille | Zheng Zhang
We describe the participation of team TFB in SemEval-2026 Task 4 on narrative similarity. We explore ColBERT-inspired sentence-level late interaction to capture event reordering, compare fine-tuning with synthetic data at multiple difficulty tiers (finding that distribution proximity to the target data matters more than volume), and evaluate chain-of-thought prompting. We complement our approaches with a human annotation study (Krippendorff’s alpha=0.32) confirming the task’s inherent difficulty, and with an analysis of synthetic data distribution shift explaining why fine-tuning on out-of-distribution data hurts the model’s performance. Despite these experiments, we did not surpass the results of sentence-t5-xxl on Track B and Qwen2.5-7B on Track A, so we submitted these two models for the task.
DeltaSHAP: a Shapley Value Framework for Interpreting Political Ambiguity
Sven-Alexander Gal | Rodica-Ioana Lung
Political ambiguity and response clarity have become increasingly important research topics in computational social science and natural language processing. In this paper, we present a solution to the SemEval 2026 Task 6 "Clarity" Challenge. We propose a novel framework that employs TF–IDF representations and Shapley-value–based feature selection for multi-class classification. Shapley-based feature importances are used both for post-hoc explanation and as an active mechanism for label-specific vocabulary selection. For each label, features exceeding a predefined threshold are retained, label-specific vocabularies are filtered through set differences, and independent one-versus-all classifiers are trained using specific features. Experimental results show that threshold tuning substantially impacts performance, with the best performance achieved at intermediate threshold values. Our findings demonstrate that using the game-theoretic feature selection provides an interpretable approach to clarity classification, offering a flexible methodology for ambiguity-sensitive text analysis.
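The vocabulary-selection step described above (thresholding, then set differences) might look roughly like the following. The importance values are placeholders for Shapley-based scores, and the exact form of the set difference (each label's features minus those retained for any other label) is an assumption.

```python
from typing import Dict, Set


def label_vocabularies(importances: Dict[str, Dict[str, float]],
                       threshold: float) -> Dict[str, Set[str]]:
    """Keep features above the threshold, then drop those shared across labels.

    `importances` maps label -> {feature: importance}; in the paper these
    would be Shapley-based scores, here they are arbitrary numbers.
    """
    retained = {
        label: {feat for feat, value in scores.items() if value > threshold}
        for label, scores in importances.items()
    }
    exclusive = {}
    for label, feats in retained.items():
        # Union of every other label's retained features.
        others = set().union(*(v for k, v in retained.items() if k != label))
        exclusive[label] = feats - others
    return exclusive
```

Each resulting vocabulary would then feed its own one-versus-all classifier. Raising the threshold shrinks every vocabulary, which is consistent with the reported sensitivity to threshold tuning.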
INFOTEC-NLP at SemEval-2026 Task 9: Comparing Regional Transformers and Bag-of-Words Approaches for Polarization Detection in Spanish
Eduardo C. C. Hernandez-Garcia | Guillermo Ruiz | Mario Graff
Polarization detection in short texts is a challenging and relevant problem in Natural Language Processing, particularly in social media environments where regional variations and subtle discursive nuances converge. In this paper, we describe our participation in Subtask 1 (Spanish) of SemEval-2026 Task 9 (Naseem et al., 2026a), which focuses on binary polarization classification. We evaluate two main strategies: lexical models based on Bag-of-Words representations and regionally pre-trained Transformer models for Spanish. In addition, we explore a logistic stacking framework that combines lexical and contextual representations. Our experiments show that regionally adapted Transformers generally outperform purely lexical approaches, with BILMALAT achieving the strongest performance in this task. The results highlight the importance of regionally aligned pre-training on social media data for effective polarization detection in Spanish.
Aatman at SemEval-2026 Task 9: Transfer Learning for Multilingual Polarization Detection
Aatman Vaidya
This paper describes our system for Subtask 1 of SemEval-2026 Task 9: POLAR, which focuses on multilingual polarization detection. The task is formulated as a binary classification problem across 22 languages drawn from diverse online platforms and real-world events. We investigate three complementary approaches: supervised fine-tuning of multilingual encoder-only transformer models, zero- and few-shot classification using large language models (LLMs), and transfer learning from related harmful language tasks such as hate speech, toxicity, abusive language, and gender-based violence. Among the supervised models, mDeBERTa achieved the strongest baseline performance. Prompt-based methods with open-weight LLMs showed limited effectiveness, particularly in zero-shot settings. The best results were obtained using transfer learning, where the model was first fine-tuned on related task datasets and then adapted to the polarization task, achieving a Macro-F1 score of 0.81. Our findings indicate that supervised multilingual encoders remain highly effective for polarization detection and that incorporating related harmful language tasks can substantially improve performance, especially for nuanced and context-dependent expressions of polarization.
ZYC at SemEval-2026 Task 5: Application of BERT-based Contextual Embeddings Similarity for WSD
Sunny Zhou | Jordan Youner | Dean Cahill
We investigate contextual embedding manipulation for Word Sense Disambiguation (WSD) as part of SemEval-2026 Task 5. We propose four approaches built on BERT-like pretrained models, experimenting with the informativeness of similarity calculations and classification methods. We introduce scratch-trained cross-attention mechanisms inspired by GLiNER to compute similarity between definition or synonym representations and the full context. Our best performance achieved 57% accuracy with a Spearman correlation of 0.20. Our results suggest that finetuning strategy and training curriculum matter more than pretrained model choice for this novel task, and we identify several directions for future improvement. View our code base at: https://github.com/heliosraz/SemEval52026
MINDS at SemEval-2026-Task 1: Enhancing Humor Generation through RAG and Synthetic DPO Alignment
Sina Eskandari | Seyed Amirreza Mousavi | Amirreza Rahimi | Mona Pouresmaeil | Marcello Vitaggio | Claudio Savelli | Riccardo Coppola | Flavio Giobergia
Humor generation presents significant challenges due to subjectivity and the limitations of automatic metrics. In this work, we address Task 1 of SemEval 2026 (Subtask A) by evaluating three instruction-tuned models (Llama 3.1, Gemma 2, and Qwen 2.5) via a round-robin LLM judging framework. We investigate the impact of Retrieval-Augmented Generation and Direct Preference Optimization (DPO) on performance. Our results identify Llama 3.1 as the strongest baseline and demonstrate that DPO consistently improves humor quality across configurations. These findings confirm the efficacy of LLM-based judging as a practical training signal for optimizing subjective generation tasks.
uva-irlab-conv at SemEval-2026 Task 8: Multi-Turn RAG with Learned Sparse Retrieval and Listwise Reranking
Simon Lupart | Kidist Mekonnen | Zahra Abbasiantaeb | Mohammad Aliannejadi
This report describes our participation in SemEval-2026 Task 8 on multi-turn retrieval and question answering. The task evaluates conversational systems across four domains (finance, cloud documentation, government, Wikipedia), and includes unanswerable queries where the available collection does not contain sufficient evidence to produce a complete response. We propose a multi-turn retrieval-augmented generation pipeline that combines learned sparse retrieval with LLM–based reranking and generation. Using sparse retrieval as the primary retrieval method, we leverage its strong generalization across domains. In addition, we make use of the long-context capabilities of LLMs for conversational query rewriting, pointwise and listwise reranking, and generating the final response, each conditioned on the full conversational history. This multi-step design enables effective integration of conversational context throughout retrieval and generation, improving robustness across domains.
Team CV at SemEval-2026 Task 4: Prompting LLMs and Benchmarking Embedding Models for Narrative Story Similarity
Chandan Kumar R S | Vinay Ulli
This paper describes Team CV’s systems for SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning (Hatzel et al., 2026). For Track A (comparative judgment), we explore five prompting strategies (zero-shot, chain-of-thought, structured feature extraction, pairwise scoring, and few-shot) and QLoRA fine-tuning of smaller models. For Track B (narrative embeddings), we benchmark twelve dedicated text embedding models of varying dimensionality (384–4096) spanning open-source (E5-Large-v2, BGE, GTE, Qwen3 Embedding) and closed-source (OpenAI, Gemini, Mistral) families, and fine-tune Qwen3 Embedding 4B on task-specific triples. Few-shot prompting with Qwen-2.5 7B (64.00%) outperforms all fine-tuned variants (best 57.50%) on Track A; scaling to LLaMA-3.3-70B yields 75.00%. On Track B, OpenAI text-embedding-3-large (3072-d) achieves the best dev accuracy (67.00%), while fine-tuning Qwen3 Embedding 4B (2560-d) on synthetic triples slightly decreases accuracy. Our final submission, LLaMA-3.3-70B (3-shot) for Track A and text-embedding-3-large for Track B, achieves 70.75% and 64.50%, exceeding the GPT-4o-mini and STORY-EMB baselines respectively.
DANGNT@SGU at SemEval-2026 Task 1: A Two-Stage Mistral Generator with DistilBERT Reranking for English Humor Generation
Tan Loc Nguyen | Dang Tuan Nguyen
We describe DANGNT@SGU’s system for the English track of SemEval-2026 Task 1 (MWAHAHA), Subtask A (text-based humor generation). Our pipeline combines a two-stage QLoRA-adapted generator based on mistralai/Mistral-7B-Instruct-v0.2 with a DistilBERT reranker trained to distinguish jokes from non-jokes. The generator is first adapted on a raw joke corpus for general humor style, then further tuned on synthetic task-format instruction–response pairs for Word Inclusion and News Headline prompts. At inference time, we generate five candidates per input, optionally enforce lexical constraints for Word Inclusion prompts, and rerank candidates with the classifier. In the official English Subtask A results, our team DANGNT@SGU obtained Elo 962 (95% CI: 926–986), ranking 13th. The system is practical, reproducible, and based entirely on open models and public data.
LingoResearchGroup at SemEval-2026 Task 9: Evaluating Prompt Variants for Polarization Detection
Pritam Kadasi | Anuj Tiwari | Mayank Singh
This paper presents our submission to SemEval-2026 Task 9: Multilingual Text Classification Challenge - Polarization Detection, covering all three subtasks: (1) binary polarization detection, (2) polarization type classification and (3) polarization manifestation identification. We systematically study twelve short designed prompts that differ in terminology clarity, level of detail in the definition, reasoning guidance, and use of in-context examples. The experiments are conducted using aya-101 and Gemma3-27B, with the latter chosen for the final submission at the end of development on performance grounds. Our system achieves an average macro F1-score of 0.762 on Subtask 1, 0.587 on Subtask 2 and 0.444 on Subtask 3, with average accuracy of 0.819, 0.678 and 0.498, respectively, on the official test set averaged over 22 languages. Through cross-task and cross-lingual analysis, we demonstrate that prompt-based approaches detect coarse-grained polarization effectively but encounter increasing difficulty with fine-grained and multi-label sociolinguistic classification.
ABARUAH at SemEval-2026 Task 9: Multilingual Polarization Detection across Seven Indic Languages using Qwen3
Arup Baruah
Online polarization creates division within the society. As such, it is important to detect and remove polarized messages from social media. This study presents fine-tuned Qwen3-8B Large Language Model (LLM) based models to identify online polarization, its specific categories, and its manifestation types. This study used Quantized Low-Rank Adaptation (QLoRA) to fine-tune the model in seven Indic languages: Bengali, Hindi, Nepali, Oriya, Punjabi, Telugu, and Urdu. The experimental results demonstrate the efficacy of this approach, achieving macro F1-scores of 0.82, 0.78, 0.90, 0.76, 0.78, 0.87, and 0.79, respectively, for polarization detection. The proposed model surpassed the established baseline systems in several of the subtasks, suggesting that parameter-efficient fine-tuning is a viable and powerful strategy for addressing linguistic diversity in low-resource and high-variability Indic language datasets. To leverage cross-lingual transfer, a unified model was developed by fine-tuning on a concatenated dataset of seven Indic languages. This approach proved superior to standalone language-specific models, yielding substantial improvements in F1-score (most notably a 28.76 point gain in Subtask 2 for Punjabi language). This provides strong evidence for the benefits of cross-lingual knowledge transfer in low-resource settings.
DUTIR at SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning
Tala Borjigin | Liang Yang
This paper presents our approach for SemEval 2026 Task 4. Our method leverages a large language model fine-tuned via Low-Rank Adaptation, incorporates data cleaning, and employs a multi-prompt strategy, all trained on the official synthetic dataset. Evaluated on Track A, our system achieved an official score of 0.70, representing a reasonable performance under the given task constraints. In addition, we explore an alternative contrastive learning framework originally designed for Track B, where narrative-structure embeddings are learned and subsequently applied to Track A via similarity comparisons. Our analysis suggests that direct supervised adaptation may be more suitable for narrative reasoning tasks.
j10official at SemEval-2026 Task 1: Neurosymbolic Humor Generation via GTVH-Guided LLM Decomposition
Jatin Agrawal | Radhika Mamidi
We present a neurosymbolic pipeline for computational humor generation grounded in the General Theory of Verbal Humor. The system constructs the joke in five sequential stages: context analysis, humor architecture (identifying core incongruity), delivery strategy, content writing, and pairwise judging, orchestrated through the DSPy framework. The system generates four candidate jokes per input with independent humor strategies, then selects the best through knockout tournament-style evaluation. Despite using Gemma 3 27B, a model with roughly 20× fewer total parameters than frontier systems, our approach achieves competitive results across all five subtasks of SemEval-2026 Task 1 (MWAHAHA), placing 2nd in two subtasks. We argue that these results demonstrate the viability of structured, theory-driven decomposition for solving complex tasks and that how a model reasons about humor is just as important as how large the model is.
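Knockout-style selection from pairwise judgments can be sketched as a single-elimination bracket. The `judge` callable below is a stand-in for the system's LLM-based pairwise judge, and the bracket order and the bye rule for odd-sized rounds are assumptions.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def knockout_select(candidates: List[T], judge: Callable[[T, T], T]) -> T:
    """Single-elimination tournament: judge(a, b) returns the preferred item."""
    if not candidates:
        raise ValueError("need at least one candidate")
    current = list(candidates)
    while len(current) > 1:
        next_round = []
        # Pair off neighbours; each pair sends one winner forward.
        for i in range(0, len(current) - 1, 2):
            next_round.append(judge(current[i], current[i + 1]))
        if len(current) % 2 == 1:
            next_round.append(current[-1])  # odd one out gets a bye (assumed rule)
        current = next_round
    return current[0]
```

With four candidates this needs three pairwise comparisons, versus six for a full round-robin.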
BertKittens at SemEval-2026 Task 3: Multi-Domain Aspect Sentiment with BERT/DeBERTa Ensembles for VA Regression and Aspect–Opinion–VA Triplets
Arseny Sukhodolsky | Ruslan Salimgareev | Tatiana Ianshina
Our system is built on transformer encoders (BERT and DeBERTa) fine-tuned in a multi-task learning framework. For the regression subtask (evaluated with RMSE), we jointly predict Valence–Arousal (VA) scores and token-level opinion spans using a shared encoder with task-specific output heads. This formulation introduces auxiliary supervision at the token level, which stabilizes training and improves regression accuracy compared to single-task optimization. When gold abstracts and opinion annotations are provided, our models achieve strong performance. However, in fully end-to-end settings requiring automatic span extraction, performance degrades substantially due to error propagation from token-level predictions. Our findings highlight the benefits of joint affective regression and span modeling, while exposing the limitations of transformer-based sequence labeling under strict end-to-end evaluation constraints.
NarSiL at SemEval-2026 Task 4: A Multi-Expert, Multi-Pathway System for Narrative Story Similarity
Bogdan Octavian Grecu | Costin Chiru | Oana Cocarascu
We present NarSiL (Narrative Similarity Learners), our system for SemEval-2026 Task 4 Track A on Narrative Story Similarity. NarSiL employs a two-stage architecture: a Mixture-of-Experts (MoE) initial classifier that also leverages supermajority voting across three large language models (Gemma-3-12B, GPT-3.5-turbo-instruct, and Gemini-2.5-Flash) over multiple runs, followed by a structured three-pathway fallback for ambiguous cases. The three pathways correspond directly to the task’s three core similarity components: abstract theme, narrative outcome, and course of action. Each pathway yields a similarity score for its component, and the scores are then combined through a weighted aggregation step. NarSiL achieves 64.25% accuracy on the official test set. An improved score of 70.25% is obtained by considering only the supermajority voting of GPT, followed by the previously described fallback.
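The vote-then-fallback control flow can be illustrated as follows. This is a sketch under stated assumptions, not NarSiL's code: the two-thirds supermajority cutoff, the pathway weights, the tie rule, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional, Sequence


def supermajority(votes: Sequence[str], min_share: float = 2 / 3) -> Optional[str]:
    """Return the winning label if it reaches min_share of votes, else None."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    return label if count / len(votes) >= min_share else None


def weighted_fallback(scores_a: Dict[str, float],
                      scores_b: Dict[str, float],
                      weights: Dict[str, float]) -> str:
    """Weighted sum of per-pathway similarity scores; ties go to 'A' (assumed)."""
    total_a = sum(w * scores_a[k] for k, w in weights.items())
    total_b = sum(w * scores_b[k] for k, w in weights.items())
    return "A" if total_a >= total_b else "B"


def decide(votes: Sequence[str],
           scores_a: Dict[str, float],
           scores_b: Dict[str, float],
           weights: Dict[str, float]) -> str:
    """Use the vote when it is decisive, otherwise the pathway fallback."""
    winner = supermajority(votes)
    if winner is not None:
        return winner
    return weighted_fallback(scores_a, scores_b, weights)
```

Only ambiguous cases reach the three-pathway scoring, so the more expensive per-component comparison runs on a fraction of the inputs.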
Sagarmatha at SemEval-2026 Task 9: Heterogeneous Ensembling and Hierarchical Task Conditioning for Multilingual Latent Distributional Divergence Modeling
Sujal Maharjan | Astha Shrestha | Pratikshya Shrestha
The digital public square is increasingly fragmented by affective polarization, requiring computational systems capable of identifying discursive strategies such as dehumanization and vilification. This paper presents Sagarmatha, the system developed for SemEval-2026 Task 9. We propose a heterogeneous ensemble architecture that addresses the limitations of standard transformer fine-tuning across 22 languages. Our approach integrates mDeBERTa-v3, ReMBERT, LaBSE, mmBERT, and XLM-RoBERTa, through two primary architectural pillars: learnable weighted layer pooling and hierarchical task conditioning. While our final submission (a broad ensemble, R3) demonstrated high stability on the leaderboard, our primary architectural configuration (Weighted Polyglot, R1) yielded superior performance in complex multi-label tasks. The system ranked 1st globally in English and Hausa manifestation identification, and 1st in Telugu detection (2nd in categorization). All code and resources are available at https://github.com/SUJAL390/SagarmathaatSemevaltask9.git.
Archaeology at SemEval-2026 Task 13: Fine-Tuning Pre-Trained Code Models for AI-Generated Code Detection
Jany-Gabriel Ispas | Sergiu Nisioi
This paper describes the system submitted by team Archaeology to SemEval-2026 Task 13 on AI-generated code detection. The shared task consists of three subtasks; we participate in Subtask-A (binary classification: human-written vs. AI-generated code) and Subtask-B (11-class attribution of the generating model). Starting from a TF-IDF and Logistic Regression baseline, we fine-tune four pre-trained code models (CodeBERT, GraphCodeBERT, UniXcoder, and CodeT5+) with separate strategies for each subtask. For Subtask-A, we use leave-one-language-out cross-validation, code augmentation, chunked inference with trimmed-mean aggregation, and threshold calibration on a difficult dataset. For Subtask-B, we use sandwich token packing, class-balanced loss, and multi-seed ensembling with test-time augmentation. Our best submissions obtain macro-F1 scores of 0.737 on Subtask-A and 0.422 on Subtask-B.
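Chunked inference with trimmed-mean aggregation splits a long input into pieces, scores each piece, and averages the scores after discarding the extremes. The sketch below assumes fixed-size chunks and a symmetric trim; the chunk size, the trim fraction, and the helper names are illustrative, not taken from the paper.

```python
from typing import List, Sequence


def chunk(tokens: Sequence[int], size: int) -> List[List[int]]:
    """Split a token sequence into consecutive chunks of at most `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def trimmed_mean(values: Sequence[float], trim_fraction: float = 0.1) -> float:
    """Mean after dropping the lowest and highest trim_fraction of values."""
    if not values:
        raise ValueError("need at least one value")
    if not 0.0 <= trim_fraction < 0.5:
        raise ValueError("trim_fraction must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * trim_fraction)  # number dropped from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)
```

A per-file score would then be `trimmed_mean([score(c) for c in chunk(tokens, size)])`, where `score` is a hypothetical per-chunk classifier. Trimming keeps one unusual chunk, such as a licence header, from dominating the file-level prediction.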
B B at SemEval-2026 Task 6: A RoBERTa-based Model with NLI-derived Semantic Features for Clarity-Level Classification of Political Question Evasion
Chi-Bo Lin | Boyang Yu
Equivocation and ambiguity are common in political interviews, where public figures often avoid directly answering challenging questions. We present our submission to SemEval-2026 Task 6, Subtask 1 on English political response clarity classification. Our system builds on RoBERTa and incorporates NLI-derived semantic features to distinguish Clear Reply, Ambivalent, and Clear Non-Reply responses. To address class imbalance and performance instability, we explore class weighting, multi-seed ensembling, and a hierarchical two-stage framework with threshold tuning. Our best model achieves 60% macro-F1 on the official test set and 64% macro-F1 on an additional evaluation set, demonstrating stable performance across splits. Our results show that carefully engineered smaller models, combined with structured semantic features and imbalance-aware training, provide an effective and computationally efficient solution under limited training data.
BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection
Atharva Gupta | Dhruv Kumar | Yash Sinha
The POLAR SemEval-2026 Shared Task aims to detect online polarization and focuses on the classification and identification of multilingual, multicultural, and multi-event polarization. Accurate computational detection of online polarization is challenging due to nuanced rhetoric, implicit framing, and the high cost of human-in-the-loop annotation. Building on recent findings that contextual prompting enables large language models to function as strong polarization detectors, we present a two-stage approach for detecting polarization in social media text that combines structured supervised fine-tuning with Direct Preference Optimization (DPO) refinement. We fine-tune Qwen 2.5-7B-Instruct with LoRA using an interpretable slot-filling template (target, claim type, manifestation checklist, and justification). We then apply DPO with automatically generated preference pairs to reduce costly false negatives. Our submitted system achieves 0.7664 Macro-F1 on the English test set. Post task submission experiments with Mistral-Nemo-Instruct-2407 and LLM-judge-filtered preference pairs further improve to 0.8162 Macro-F1 (not submitted to CodaBench), surpassing the organiser baseline of 0.7802. Code released publicly.
MINDS at SemEval-2026-Task 13: Robust Detection of Machine-Generated Code under Distribution Shift
Giorgia Rosalia Buccelli | Antonella Coviello | Alexandra Elena Holota | Marco Scaglione | Simone Scalora | Claudio Savelli | Riccardo Coppola | Flavio Giobergia
The growing use of large language models for code generation makes distinguishing machine-generated code from human-written code increasingly difficult, especially under distribution shifts in language, domain, and generator family. SemEval-2026 Task 13 targets this challenge through three subtasks: binary detection, multi-class authorship attribution, and hybrid/adversarial code detection. In this paper, we conduct an empirical study across all subtasks, comparing a variety of approaches: frozen encoder representations, feature-based classifiers, fine-tuned transformer models, post-hoc calibration, and probability-level ensembling. Our results show a consistent generalisation gap: strong in-domain validation scores substantially overestimate performance on shifted test conditions. The code is available at https://github.com/AlexandraElena-Holota/SemEval-2026-Task13.git
JCT at SemEval-2026 Task 4: A Multi-Method Approach to Narrative Story Similarity
Dvori Rosenfeld | Rinat Walles | Chaya Liebeskind
Narrative similarity detection involves understanding the underlying structure of a story rather than just matching similar words or phrases. This paper details our multi-strategy approach to the SemEval-2026 Task on Narrative Similarity, which requires identifying which of two candidate stories most closely resembles an anchor story based on three dimensions: abstract themes, the sequence of events, and the final outcomes. We developed three distinct but complementary methods to address this challenge. First, we fine-tuned a specialized story-embedding model using parameter-efficient techniques on synthetic data. Second, we utilized a "Distill-then-Embed" workflow, where a large language model extracts the essential narrative core of each story before computing similarity. Third, we employed direct zero-shot prompting to allow a high-reasoning model to compare the stories organically. Our analysis reveals that each approach excels at different types of narrative comparisons, and their combination leads to robust performance. We demonstrate the importance of narrative distillation in removing surface-level distractors and the effectiveness of carefully engineered prompts in guiding language models to focus on narrative structure.
Tifin India at SemEval-2026 Task 5: Semantic Bridge: Augmented Encoding for Word Sense Plausibility
Pawan Rajpoot
We present a hybrid system for SemEval 2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Stories. Our approach reframes LLMs as feature generators rather than direct predictors. We combine two subsystems: one that appends LLM-generated hints to the input context and trains an encoder-based regression model, and another that feeds structured hints from multiple LLM configurations into a lightweight regression ensemble. We generate multilingual enrichments to probe LLMs for complementary signals and take advantage of the fact that translation into certain languages implicitly disambiguates word senses, making the encoder more robust. The 50/50 ensemble achieves 859/930 (92.37%) accuracy with Spearman ρ = 0.8384 on the test set, exceeding the estimated human ceiling of 89.2%. The same LLM enrichments, processed through fundamentally different paradigms (tabular regression vs. full-text encoding), produce complementary errors that cancel under ensembling. Notably, simple 50/50 averaging captures this gain without any learned combiner, suggesting that
GigitAI at SemEval-2026 Task 8: Hybrid Sparse-Dense Retrieval and Zero-Shot Generation for Multi-Turn Conversational RAG
Saran Krishnasamy | Inez Wihardjo
We describe our system for SemEval-2026 Task 8 (MTRAGEval) on multi-turn conversational RAG. Our approach combines hybrid retrieval (fusing SPLADE-v3 learned sparse representations with dense embeddings via Reciprocal Rank Fusion) with a fine-tuned cross-encoder reranker and zero-shot LLM generation using Claude Opus 4.5. We systematically evaluate 56 retrieval configurations across 4 domains, and 5 generation strategies across 5 LLMs. Our findings show that: (1) SPLADE-v3 with dataset rewrites substantially outperforms BM25 across all configurations, (2) simple zero-shot prompting matches sophisticated strategies like Self-RAG and CRAG, and (3) performance varies significantly by answerability class. On the test set, we achieve rank 5/29 on Task C (end-to-end RAG, H=0.5564), rank 7/26 on Task B (generation, H=0.7495), and rank 13/38 on Task A (retrieval, nDCG@5=0.4782). Our analysis reveals strong performance on answerable queries (H=0.685) but degradation on underspecified queries (H=0.254).
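Reciprocal Rank Fusion, used above to merge sparse and dense results, scores each document by summing 1 / (k + rank) over the rankings in which it appears. The sketch below uses k = 60, a commonly used default; the constant used in this system is not stated in the abstract, and the alphabetical tie-break is an assumption added for determinism.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids; a higher fused score ranks first."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

For example, fusing a sparse ranking `["d1", "d2", "d3"]` with a dense ranking `["d2", "d4", "d1"]` puts `d2` first, since it is near the top of both lists. RRF uses only ranks, so the sparse and dense scores never need to be calibrated against each other.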
GheGheGhe at SemEval-2026 Task 11: Decoupling Logic from Belief with Bias-Targeted Fine-Tuning and Neuro-Symbolic Syllogistic Reasoning
Razvan Gogu | Stefan Placintescu | Sofia Vultur
This paper presents a multi-paradigm approach to the first two subtasks of SemEval-2026 Task 11. For the first subtask, we explore two complementary strategies: a Llama-3 8B PEFT Majority Vote Ensemble, trained with bias-targeted augmented data, and a hybrid approach that separates LLM processing from logical reasoning, converting sentences into canonical logical forms for deterministic analysis. The hybrid approach is further extended to the second subtask. Official results placed us 17th in the first subtask and 15th in the second. Post-evaluation analysis indicates that our best model achieved perfect accuracy on the first subtask and revealed several errors in the ground truth data. After identifying certain implementation issues in the second subtask approach, the F1 retrieval score increased to over 98%, which would place us within the top 5 on the leaderboard.
contestant001 at SemEval-2026 Task 13 Stylometric and TF-IDF-Based Detection of Machine-Generated Code
Bora Ozaylar
Reliable detection of machine-generated code has become increasingly important for academic integrity and software quality as code generation is largely being undertaken by large language models. This paper presents our approach to SemEval-2026 Task 13, Subtask A: detecting machine-generated code in a binary classification setting, where we propose an ensemble approach combining TF-IDF lexical representations with 23 hand-crafted stylometric features for binary classification of AI-generated code. Our system aims to address the challenge of cross-language generalization by extracting language-agnostic patterns and combining them with TF-IDF. While we observed that transformer-based models (CodeBERT, UniXcoder) noticeably underperformed under distribution shift, our analysis revealed that AI-generated code exhibits distinct stylometric patterns and our TF-IDF ensemble achieved 0.5175 Macro F1 on the task submission.
VerbaNexAI at SemEval-2026 Task 4: Two-Stage Narrative Similarity via Fine-Tuned Bi-Encoder with MLP Ensemble
Pablo Pertuz-Duran | Edwin Puertas | Juan Carlos Martinez Santos | Jairo Serrano
This paper describes VerbaNexAI’s participation in SemEval-2026 Task 4: Narrative Similarity, a shared task on assessing semantic relatedness between short narrative texts. The task comprises two tracks: Track A requires selecting which of two candidate stories is more similar to an anchor, and Track B requires producing fixed-size story embeddings whose cosine similarity reflects narrative relatedness. We propose a unified two-stage system built on Qwen3-Embedding-0.6B. The first stage fine-tunes the encoder as a bi-encoder with a 512-dimensional projection head using a composite loss combining margin ranking, pairwise softmax, and multiple negatives ranking objectives. The second stage trains a lightweight MLP head over frozen bi-encoder embeddings using pairwise interaction features, with k-fold cross-validation and logit-averaging ensemble inference. The system was trained exclusively on the official supervised data, without leveraging the additional 1,900 LLM-generated synthetic triples released by the organizers. Although the system ranked first on both tracks in the development phase, its performance did not transfer to the official test set, where it ranked 47th on Track A and 22nd on Track B.
CultRAG at SemEval-2026 Task 7: Hybrid Sparse-Dense Retrieval with Entity-Centric Knowledge Bases for Cultural MCQ Answering
Aditya Singh | Rickarya Das
We developed CultRAG, a trust-weighted Retrieval-Augmented Generation system for BLEnD Track 2 (SemEval-2026 Task 7), targeting culturally grounded multiple-choice QA across 30 countries. Built on Llama-3.1-8B-Instruct, the six-phase pipeline integrates entity extraction via spaCy, hybrid BM25+FAISS retrieval with Reciprocal Rank Fusion, country-aware filtering, keyword-based intent detection, tiered prompt routing, anti-leak quality filtering to suppress answer-anchoring artifacts, and trust-weighted document reranking with source-credibility tiers. Ablation analysis across eight cumulative configurations and per-country decomposition identify which components contribute and where retrieval helps versus hurts, informing future directions for confidence-conditioned selective retrieval.
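The Reciprocal Rank Fusion step mentioned above, which merges a sparse (BM25) ranking with a dense (FAISS) ranking, can be sketched as follows. This is a minimal illustration and not the CultRAG implementation: the function name, the document ids, and the smoothing constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. The constant k is an assumed default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical retrieval results for one query.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that appear near the top of both lists accumulate the largest fused score, so agreement between the two retrievers is rewarded without having to calibrate their raw scores against each other.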
uircis at SemEval-2026 Task 8: A Unified Lightweight Pipeline for Multi-Turn RAG Evaluation
Jiaqi Zhang | Wenbin Duan | Yingqi Zhang | Yan Li | Binyang Li
We submit a system description paper for SemEval-2026 Task 8 (MTRAGEval), covering both Subtask A (retrieval) and Subtask B (generation). Our approach is a lightweight, fully reproducible multi-turn RAG pipeline using open-weight models: Qwen2.5-7B-Instruct for query rewriting and grounded answer generation, BGE-M3 for dense retrieval, and BGE-Reranker-v2-M3 for cross-encoder reranking. We report official test performance, conduct ablation experiments to quantify the impact of rewriting and reranking across domains, and provide error analysis using the organizers’ analytics and answerability classes, highlighting key failure modes in multi-turn retrieval specificity and grounded generation.
AKCIT-UFG at SemEval-2026 Task 8: Structured Chunking and Optimized Query Reformulation for Efficient Multi-Turn Retrieval
David Ferreira | Wilson Ramos | Priscila Ribeiro | Emanuel Passinato | Diogo Fernandes | Arlindo Filho
This submission investigates efficient multi-turn retrieval under constrained computational settings. We analyze how passage granularity and conversational query rewriting affect retrieval effectiveness across four benchmark domains. Using compact, locally deployable components, we show that smaller passage segmentation improves early-rank performance and that lightweight keyword-oriented query reformulation substantially enhances dense retrieval quality. Importantly, we observe that rewriting interacts differently with encoder backbones: some compact models benefit significantly from increased query specificity, while others degrade, indicating sensitivity to rewrite-induced distribution shifts. Our findings demonstrate that competitive multi-turn retrieval does not require large proprietary models, but can emerge from principled structural and preprocessing design choices. The results highlight the importance of aligning chunking strategy, rewriting policy, and encoder characteristics in resource-efficient MT-RAG systems.
INF-rsrs at SemEval-2026 Task 1: Is the best really better? The limits of creative work in the era of LLMs
Guilherme Bazzo | Eduardo Faé | Júlia Junqueira | Higor Moreira | Lucas Rafael Costella Pessutto
Generating humor is a complex and challenging task for Large Language Models (LLMs), requiring both linguistic creativity and strict adherence to constraints. This paper presents INF-rsrs, our solution for SemEval-2026 Task 1: Humor Generation, which tasks models with creating jokes from headlines and word pairs without labeled data. We propose a two-stage framework: a production stage and a selection stage. The production stage employs diverse model families and hyperparameter configurations to generate a wide range of candidate jokes, with each candidate generated by an LLM prompted in the role of a comedian under structured constraints to ensure relevance and humor. Our system was designed to substantiate our claim that the direct use of LLMs in creative works, such as humor generation, hits a hard ceiling that is inescapable through simple prompting. Our proposed system tied for first place in the task ranking.
CodeDet-NITS at SemEval-2026 Task 13: AI Code Authorship Detection Beyond Truncation
Lekkala Sai Teja | Annepaka Yadagiri | Kshitij Patiyal | Sangam Sai Anish | Partha Pakray
Automatically determining whether source code is human-written or produced by a specific family of large language models is becoming essential for reliable assessment, provenance tracking, and dataset curation. We present a lightweight yet competitive system for SemEval-2026 Task 13 Subtask B, which requires attributing each snippet to one of eleven classes: human or one of ten LLM families. Our method repurposes code-oriented, instruction-tuned backbones from the Qwen2.5 Coder series as sequence classifiers and adapts them using QLoRA, combining frozen low-precision weights with low-rank trainable adapters to reduce memory and compute overhead. The core design choice addresses long snippets without losing evidence. Instead of truncating to a fixed context, we apply an overlapping sliding-window strategy that expands long examples into multiple fixed-length windows during training, all sharing the same label. For validation and test, windows are generated on the fly and their evidence is aggregated by averaging logits to yield a single prediction per snippet, enabling token-complete use of the input while keeping inference stable. Our final submission ranked 8th on the official Subtask B test set leaderboard.
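The overlapping-window expansion and logit averaging described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the window and stride sizes, and the example logits are hypothetical, and a real system would obtain the per-window logits from the fine-tuned classifier.

```python
def sliding_windows(tokens, window=512, stride=256):
    """Split a token list into overlapping fixed-length windows.

    Every token is covered by at least one window. The window and stride
    values are assumed for illustration only.
    """
    if len(tokens) <= window:
        return [tokens]
    starts = list(range(0, len(tokens) - window + 1, stride))
    # Add a final window flush with the end so trailing tokens are kept.
    if starts[-1] + window < len(tokens):
        starts.append(len(tokens) - window)
    return [tokens[s:s + window] for s in starts]


def average_logits(window_logits):
    """Average per-window logit vectors into one snippet-level vector."""
    n = len(window_logits)
    return [sum(column) / n for column in zip(*window_logits)]


def predict(window_logits):
    """Return the index of the class with the highest averaged logit."""
    mean = average_logits(window_logits)
    return max(range(len(mean)), key=mean.__getitem__)


# Hypothetical snippet of 11 tokens, split with a small window for readability.
windows = sliding_windows(list(range(11)), window=4, stride=3)

# Hypothetical two-class logits, one vector per window.
label = predict([[1.0, 3.0], [3.0, 1.0], [2.0, 5.0]])
```

Averaging logits rather than taking a per-window vote lets a confident window outweigh several uncertain ones, which is one plausible reason to prefer it for long inputs.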
NIT-Agartala-NLP-Team at SemEval-2026 Task 9: A Weighted Soft-Voting Ensemble Framework of Fine-Tuned LLMs for Binary and Multi-Label Polarization Detection
Shivam | Manish Kumar | Anupam Jamatia
This paper presents the NIT-Agartala-NLP-Team’s submission to SemEval-2026 Task 9 on polarization detection in textual data. The task comprises two subtasks: (i) binary classification to distinguish polarized from non-polarized content, and (ii) multi-label classification to identify the specific type(s) of polarization. We propose a weighted soft-voting ensemble framework that integrates multiple fine-tuned large language models (LLMs). The probabilistic outputs of the individual models are combined using weighted averaging to effectively leverage their complementary strengths and enhance overall performance. Our system achieved a test macro F1-score of 78.6 (26th out of 44 teams) in Subtask 1 and 46.0 (18th out of 29 teams) in Subtask 2.
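The weighted soft-voting step described above can be sketched as follows. This is a minimal illustration, not the team's implementation: the function name, the example probabilities, and the model weights are all hypothetical.

```python
def weighted_soft_vote(model_probs, weights):
    """Combine per-model class probabilities by weighted averaging.

    model_probs: one probability vector per model, all of the same length.
    weights: one non-negative weight per model (normalised internally).
    """
    total = sum(weights)
    n_classes = len(model_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, model_probs)) / total
        for c in range(n_classes)
    ]


# Hypothetical outputs of three fine-tuned models for one text:
# [P(non-polarized), P(polarized)].
probs = [[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
weights = [0.5, 0.3, 0.2]  # assumed weights, e.g. from validation scores

ensemble = weighted_soft_vote(probs, weights)
prediction = max(range(len(ensemble)), key=ensemble.__getitem__)
```

For the multi-label subtask, the same averaging could be applied to each label's probability independently and followed by a per-label threshold; that extension is an assumption rather than something the abstract states.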
uir-cis-7 at SemEval-2026 Task 7: Zero-Shot Chain-of-Thought Reasoning for Cross-Cultural Daily Knowledge
Jianning Gao | Xianling Mao | Shumin Shi | Duanzhi Zhaxi | Yingbo Sun | Xiandeng Li | Binyang Li
SemEval-2026 Task 7 evaluates the ability of Large Language Models (LLMs) to reason about diverse daily knowledge across 30 geographic regions. In this paper, team uir-cis-7 approaches this challenge not merely as an accuracy optimization problem, but as a diagnostic probe to evaluate the representational limits of LLMs without fine-tuning. To address Western-centric bias and the "overthinking penalty" frequently observed in high-resource contexts, we introduce a Two-Tier Dynamic Routing framework. Based on cultural resource density, queries are routed either to a direct-answer pathway or a complex reasoning pathway. The complex pathway utilizes an Anti-Bias Persona-Conditioned Chain-of-Thought enhanced with Knowledge Anchoring and multi-path Self-Consistency voting to mitigate majority-culture heuristics. Evaluated using a strict macro-average metric, our system achieved an overall accuracy of 89.02% on the official leaderboard. Our fine-grained evaluation and theoretical error analysis quantify the epistemological boundaries of prompt-based alignment, proving our dynamic strategy effectively rescues marginalized cultural knowledge while exposing persistent instances where safety-aligned models project Western progressive norms onto traditional contexts. Furthermore, cross-model validation on open-source architectures explicitly confirms our framework’s generalizability.
HHU-SyLo at SemEval-2026 Task 11: Logic in the Loop – Hybridizing LLMs and Theorem Provers for Robust Formal Reasoning
Wiebke Petersen | Cherine Jaziri | Diem Tran
We present our system for SemEval-2026 Task 11 on reasoning disentanglement, separating syllogistic validity from semantic plausibility. We compare direct neural inference against two neuro-symbolic pipelines: translation to first-order logic and to syllogistic triples. By offloading inference to symbolic theorem provers, these hybrid models effectively mitigate content bias and improve logical fidelity.
UMUSP at SemEval-2026 Task 9: Mitigating Cross-Lingual Interference via Selective Multilingual and Multitask Specialization
Julio Cesar Fuganti | Tulio Ferreira Leite Da Silva | Adelino Gala | Francisco S. Marcondes | José Machado | Paulo Novais
This paper proposes a selective multilingual and multitask fine-tuning strategy for online polarization detection that improves cross-lingual stability over fully joint training. Covering all three subtasks — polarization detection (POLARDETECT), polarization type classification (POLARTYPE), and rhetorical manifestation identification (POLARMANIFEST) — across all 22 languages of the shared task, the approach introduces controlled specialization, where languages and subtasks are grouped empirically and separate specialist models are fine-tuned for each subset. Restricting parameter sharing substantially improves performance even without ensemble averaging, whereas ensembling jointly trained models fails to mitigate instability. The final specialist ensemble improves Task 3 macro-F1 from 0.3330 to 0.4920 and reduces cross-lingual dispersion (CV: 0.613 → 0.321). Under the official ranking framework, the system ranks 7th among 16 submissions with complete multilingual and multitask coverage and remains within 5% of the best system in 37.70% of evaluation conditions.
ASTraNet at SemEval-2026 Task 13: Not All Code Looks the Same: Multi-View Structural and Semantic Detection of Machine-Generated Code
Ruwad Naswan | Dipit Saha | Md. Kabir | Nabiha Tahseen
The growing adoption of large language models for code generation poses challenges for code quality, security, and authorship verification—particularly when test conditions involve unseen programming languages, generators, or application domains. We present our system, which combines three code-pretrained transformer encoders (CodeT5p-220M, CodeBERT, UniXcoder) with a structure-first Flow-Augmented AST (FA-AST) encoder implemented as a Gated Graph Neural Network. On Subtask A our best single model achieves macro F1 of 0.559; a post-competition layered rank-fusion ensemble across all three encoders raises this to 0.643. On Subtask C we obtain 0.585 officially; a three-stage ensemble combining neural probabilities with LightGBM-based features and class-priority routing raises this to 0.652. Our contributions include a language-agnostic structural detector, a diversity-driven rank-fusion strategy exploiting low inter-model correlation for binary classification, and a meta-learner stacking pipeline for multi-class detection under distribution shift.
RPI Team at SemEval-2026 Task 3: An LLM-Encoder Ensemble for Coarse-to-Fine Valence-Arousal Sentiment Prediction
Mohammed Shahid Modi | Boleslaw Szymanski
We present our coarse-to-fine Valence-Arousal (VA) ensemble system for subtask 1 of task 3 (DimABSA) which covers aspect-level VA prediction. We use a pair of trained Qwen 3 8B LoRA-tuned LLMs to predict coarse bins between 1 and 8, providing ordinal VA guidance signals along with distributional features. We then train an instruction-style, multilingual E5 encoder model with a multitask head using these LLM-derived guidance features to produce continuous VA predictions. At inference time, the same guidance signals are generated for the test set by the trained LLMs and fed into the trained encoder. This approach leverages the LLM as a high-level prior while relying on the encoder for precise calibration across languages and domains. Our system achieves an RMSEVA of 1.20 across six languages and five domains. We compare the joint VA model to separated valence and arousal models trained on coarsened ground truth data, showing that it outperforms them, particularly on arousal correlations.
CLRG at SemEval-2026 Task 3: One Size Does Not Fit All: A Resource Adaptive Framework for Dimensional Sentiment Regression
Wardat Iqbal | Ruwad Naswan | Swakkhar Shatabda
Predicting continuous Valence and Arousal scores across diverse languages poses significant challenges due to typological differences and the difficulty of modeling affective intensity. We introduce AdaptStance, a parameter-efficient framework designed for the SemEval-2026 Task 3 benchmark. To address cross-lingual disparities, AdaptStance routes inputs through resource-specific pipelines: direct regression with a hybrid concordance loss for high-resource languages, and an auxiliary multi-task mechanism to stabilize regression in low-resource and non-Western contexts. Architectural analysis reveals that decoupling task heads benefits morphologically related languages, whereas joint representations act as crucial regularizers for distant language families. Ultimately, this lightweight approach achieves competitive performance over generative baselines, demonstrating the efficacy of targeted architectural alignment while identifying Valence as the primary bottleneck in continuous affect prediction. Our code is available on GitHub.
PolarMind at SemEval-2026 Task 9: Leveraging LaBSE with Progressive Curriculum Learning for Multicultural Polarization
Sandeep s | Mothish M | Sachin Sundar
Detecting online polarization remains a critical challenge, particularly in multilingual and multicultural contexts where intergroup hostility is prevalent. The problem is especially difficult because data for these tasks is scarce in low-resource languages. Identifying such phenomena has become an active area of research and is addressed in SemEval-2026 Task 9: Multilingual, Multicultural Online Polarization Detection. To address this problem, we propose an architecture that leverages LaBSE embeddings (an unconventional choice, typically reserved for retrieval tasks) to obtain strong cross-lingual learning, which improves scores in low-resource languages by up to 0.2 macro F1. Furthermore, we provide a comprehensive ablation study evaluating the performance of diverse encoder models in the Qwen model family within a retrieval-based prompting framework.
CredenceAI at SemEval-2026 Task 10: A Span-Consistency Network with Cross-Marker Attention for Conspiracy Marker Extraction
Ishaan Karan
We present a Span-Consistency Network (SCN) for conspiracy marker extraction in English social media text. The task requires identifying character-level spans for five marker types (Actor, Action, Effect, Evidence, and Victim) under overlap-based Macro F1 evaluation. Standard token-level classifiers often produce fragmented spans, ignore inter-marker dependencies, and struggle with severe class imbalance. Our approach addresses these challenges through three components. First, a Span Consistency Layer (SCL) propagates span-level confidence signals to encourage coherent boundary formation. Second, Cross-Marker Attention (CMA) models co-occurrence patterns between marker types via a learned correlation matrix. Third, we introduce Span Count Regularization (SCR), a total-variation-based constraint that aligns soft token probabilities with the expected number of discrete spans, mitigating prediction collapse under threshold decoding. Built on DeBERTa-v3-large and trained with a recall-biased Tversky loss, our system is ensembled across five stratified folds. It achieved a Macro F1 of 0.24 on the official test set, placing second among participating teams. Ablation studies show that SCR plays a critical role in maintaining span structure, particularly for low-frequency and long-span markers.
Models Without Borders at SemEval-2026 Task 7: Bridging Cultural Contexts with Search-Grounded QA
Swetha Krishna Sriram | Nirupama Sekar
We present our submission to SemEval-2026 Task 7, focusing on the MCQ track, where models must identify culturally specific answers across language-region locales. Our system augments a compact open-source model with locale-targeted web retrieval at inference time, requiring no task-specific fine-tuning, and places 10th on the leaderboard. Beyond the submitted system, we explore how retrieval depth and search localization affect performance across locales, finding that localizing search parameters meaningfully shifts the geographic composition of retrieved sources and that gains from retrieval are most pronounced for lower-resource locales. We also investigate whether culturally informed prompt framing can complement retrieval, finding that it does, but only when grounding context is present. Taken together, our results point to inference-time web grounding as a practical path toward more culturally aware NLP under resource constraints.
KCLarity at SemEval-2026 Task 6: Encoder and Zero-Shot Approaches to Political Evasion Detection
Archie Sage | Salvatore Greco
This paper describes the KCLarity team’s participation in CLARITY, a shared task at SemEval 2026 on classifying ambiguity and evasion techniques in political discourse. We investigate two modelling formulations: (i) directly predicting the clarity label, and (ii) predicting the evasion label and deriving clarity through the task taxonomy hierarchy. We further explore several auxiliary training variants and evaluate decoder-only models in a zero-shot setting under the evasion-first formulation. Overall, the two formulations yield comparable performance. Among encoder-based models, RoBERTa-large achieves the strongest results on the public test set, while zero-shot GPT-5.2 generalises better on the hidden evaluation set.
ICI Innolabs at SemEval-2026 Task 13: Sliding Windows Meet Code Transformers
Sebastian Balmus | Bogdan Dura
We describe our system for SemEval-2026 Task 13, Subtask B, which focuses on multi-class authorship attribution for code: given a code snippet, the goal is to predict whether it is human-written or generated by one of ten LLM families. The task presents two central challenges: severe class imbalance and long input sequences that frequently exceed the context length of encoder-based Transformers. To address these issues, we adopt a window-based fine-tuning and inference framework. During training, we randomly sample 512-token windows from each snippet and optimize a class-weighted cross-entropy objective with label smoothing. At inference time, we apply a sliding-window strategy and aggregate window-level logits to obtain a snippet-level prediction. We fine-tune three pretrained code encoders (CodeBERT, UniXcoder, and StarEncoder) under this framework and combine their outputs via majority voting. On the official validation split, our best single model (StarEncoder) achieves 0.60 macro F1. On the final test set, the three-model ensemble reaches 0.41 macro F1, ranking 10th on the leaderboard. Our results demonstrate that window-based modeling combined with imbalance-aware optimization provides a robust and reproducible baseline for multi-class LLM attribution under distribution shift.
K-NLPers at SemEval-2026 Task 7: Multiple LLM Agent Debate System for Everyday Knowledge Across Diverse Languages and Cultures
Jiwoo Song | Sihyeong Yeom | Harksoo Kim
This paper presents the K-NLPers system for SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures. The task extends the BLEnD benchmark to evaluate cultural understanding of language models across more than 30 language-country pairs. Although Large Language Models (LLMs) achieve strong overall performance, they exhibit performance disparities across cultural contexts and tend to produce regionally biased responses. To address this limitation, we propose a continent-based multi-agent debate framework that leverages culture-specific performance differences instead of relying on a single model. For the Short Answer Question (SAQ) track, we employ three agents: a general-purpose model, a continent-specific model, and a country-level or culturally adjacent model. These agents engage in independent generation, mutual refinement, and final adjudication. For the Multiple-Choice Question (MCQ) track, we adopt a debate structure centered on high-performing general-purpose models due to the track’s simpler structure. Our system participated in all language-region pairs and achieved overall scores of 55.75 on SAQ and 88.32 on MCQ. Further analysis reveals that grouping the performance of various individual models by continent explains performance patterns more consistently than language-based grouping, highlighting the importance of cultural and historical context in model generalization.
ShefFriday at SemEval-2026 Task 9: LLM-Based Annotation Methods for Detecting Multilingual, Multicultural and Multievent Online Polarisation
Owen Cook | Meredith Gibbons | Xingyi Song
This paper presents our findings for SemEval-2026 Task 9. We submit to all three subtasks using an LLM-as-an-annotator strategy, simulating the data annotation process with large language models. We created 30 LLM annotators using persona injection (also known as sociodemographic prompting) and experimented with various annotation aggregation methods, including Dawid-Skene and MACE. To further increase the variability in annotator responses, we used the hatefulness detection task as proxy for identifying polarisation. Our findings indicate that this reframing of the problem is effective for the binary classification of polarisation, but is less effective for finer-grained polarisation detection. For subtasks 2 and 3, majority voting yielded the best overall performance. While our unsupervised approach does not rank as highly as supervised ones, this work provides insight into the utility of persona-based prompting and the issue of LLM annotators exhibiting high intra-model agreement.
REGLAT at SemEval-2026 Task 12: Multi-Strategy Ensemble Reasoning for Event Causality Identification
Mariam Francies | Nsrin Ashraf | Ahmed Fetouh | Asad Khalil | Hamada Nayel
This paper describes the multi-strategy ensemble approach used to develop the model submitted to the Abductive Event Reasoning shared task. The proposed model combines semantic similarity, causal pattern recognition, and Large Language Models (LLMs) to identify causal relationships between news events and their causes. Our system achieved competitive performance by integrating semantic embedding-based similarity, explicit causal pattern matching, keyword overlap analysis, temporal alignment scoring, and LLM-enhanced reasoning. Our system achieved accuracies of 65.4% and 43.2% on the development set using the LLM-enhanced configuration and the non-LLM ensemble, respectively. The final leaderboard score on the test set is 0.3.
NASIMLab at SemEval-2026 Task 9: A Comparative Analysis of Fine-Tuned Small Language Models vs. Generative Large Language Models for Multilingual Polarization Type Detection
Neel Sabhahit | Sanjeevan Selvaganapathy | Mehwish Nasim
The POLAR dataset contains various social media texts that might be polarized (conflict-inducing or dangerously divisive). The task at hand is to identify whether any of the following types of polarization are present: political, racial/ethnic, religious, gender/sexual, and other types across 22 languages. In this paper, we propose a system of fine-tuned language-specific small language models and compare our approach with state-of-the-art large language models on the POLAR dataset. By fine-tuning models for each language, we demonstrate that fine-tuned small encoder-only models consistently outperform large language models, especially for low-resource languages. Our system performs well on this task for most low-resource languages, notably taking the top spot on the leaderboard in Burmese (mya), appearing within the top 10 for 12 languages, and within the top 20 for all remaining languages.
COGNAC at SemEval-2026 Task 5: LLM Ensembles for Human-Level Word Sense Plausibility Rating in Challenging Narratives
Azwad Anjum Islam | Tisa Islam Erana
We present a system for SemEval-2026 Task 5 that predicts 1–5 plausibility ratings for candidate senses of homonyms in ambiguous short stories using prompting with closed-source LLMs. We evaluate three prompting strategies: zero-shot, chain-of-thought, and comparative prompting that jointly scores competing senses. We also find that simple unweighted ensembling aligns better with subjective human judgments than individual models do. Our official submission ranked 4th on the leaderboard with an average score of 0.86, with post-competition experiments improving performance to 0.89.
uir-cis at SemEval-2026 Task 12: Mitigating Prior-Induced Hallucinations in Retrieval-Augmented Reasoning via Precision-Oriented Decoding
Chiyao Zhou | Zebing Wang | Kexin Deng | Yaru Zhao | Lin Deng | Binyang Li
This paper describes our system for the SemEval-2026 Task 12 on Abductive Event Reasoning (AER). We systematically address the "over-selection" hallucination pathology in Instruction-tuned Large Language Models (LLMs), where models erroneously align distractors with semantic priors rather than retrieved evidence. Our framework utilizes a 32-billion parameter Qwen2.5 foundational model adapted via Low-Rank Adaptation (LoRA) and evaluated under a Zero-shot Chain-of-Thought (CoT) setting. To mitigate epistemic noise, we propose a Precision-Oriented Decoding (POD) strategy that couples low-temperature sampling (T=0.45) with scaled majority voting (K=9). Following a three-stage empirical evolution—from baseline diagnosis to precision optimization and ensemble analysis—our system achieved a score of 0.802 on the official test set. Our findings demonstrate that in causal reasoning tasks with strict penalization for incorrect predictions, epistemic noise suppression is strictly superior to heuristic recall compensation.
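One simple way to aggregate K sampled answers in a multi-select setting is a per-option majority vote, sketched below. The abstract does not specify how the votes are scaled, so the strict-majority threshold, the function name, and the example samples are assumptions and not the authors' Precision-Oriented Decoding procedure.

```python
from collections import Counter


def majority_vote_options(samples, threshold=0.5):
    """Keep each option selected in more than `threshold` of the samples.

    samples: one collection of selected option labels per sampled answer.
    threshold: assumed cut-off; 0.5 means a strict majority is required.
    """
    # Count each option at most once per sampled answer.
    counts = Counter(opt for answer in samples for opt in set(answer))
    k = len(samples)
    return sorted(opt for opt, c in counts.items() if c / k > threshold)


# Hypothetical K=9 sampled answers for one question.
samples = [
    ["A", "B"], ["A"], ["A", "B"], ["A", "C"], ["A", "B"],
    ["B"], ["A", "B"], ["A", "C"], ["A"],
]

selected = majority_vote_options(samples)
print(selected)
```

Requiring a strict majority drops options that only a few samples select, which favours precision over recall when wrong selections are penalised.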
RAGthoven at SemEval-2026 Task 1: A Multi-Stage Pipeline Walks Into a Benchmark and Barely Clears the Bar
Marek Suppa | Viktória Ondrejová | Lucia Ganajová | Gregor Karetka | Daniel Skala
We present RAGthoven, our system for SemEval-2026 Task 1 (MuWaHaHa), Subtask A (multilingual constrained humor generation in English, Spanish, and Chinese). RAGthoven decomposes creative text generation into a multi-stage large language model (LLM) pipeline (Planner, Writer, Reflector, Judge) grounded in computational humor theories (Benign Violation Theory, Script-based Semantic Theory of Humor) and iteratively refined through prompt engineering across ten experiments. In our final configuration, we augment the Planner with retrieval-augmented generation (RAG) from a curated joke corpus, seeding generation with diverse joke mechanisms. We additionally explore an agentic variant that exposes the same four pipeline stages as tool-calling agents orchestrated by a model loop with a ConstraintAudit checker. While it achieves full constraint compliance, human pairwise evaluation did not reveal a significant quality advantage over the simpler non-agentic baseline. RAGthoven achieves Rank 1 in all three languages, with the strongest result in Spanish (Elo 1182, 42 points above the Gemini 2.5 Flash baseline). However, while the system leads in raw Elo in Spanish, it shares Rank 1 with the baseline in all three languages due to overlapping confidence intervals; in English and Chinese the gap narrows further, suggesting that elaborate multi-stage prompt engineering may offer diminishing returns once a strong frontier model is in the loop.
AKCIT at SemEval-2026 Task 13: A Lightweight LightGBM Baseline for Cross-Language Detection of LLM-Generated Code
Rone Brandao Filho | Walcy Santos Rezende Rios | Lucas Neves | Jose Ricardo Fleury Oliveira | Diogo Fernandes | Arlindo Galvão Filho
The widespread use of LLMs in software development has made the detection of machine-generated code a pressing challenge, particularly when models must generalize across programming languages and domains. We present a lightweight, LLM-free pipeline that combines stylometric feature extraction with a LightGBM classifier and explicitly prioritizes structural generalization over deep semantic modeling. Despite its simplicity, the method achieves a Macro F1 of 0.70–0.72, more than doubling the CodeBERT baseline (0.30) in SemEval-2026 Task 13 Subtask A, while operating without GPUs or any fine-tuning.
UFAL-CUNI at SemEval-2026 Task 11: An Efficient Modular Neuro-symbolic Method for Syllogistic Reasoning
Ivan Kartac | Kristyna Onderkova | Jan Bronec | Zdeněk Kasner | Mateusz Lango | Ondrej Dusek
This paper describes our system submitted to SemEval-2026 Task 11: Disentangling Content and Formal Reasoning in Large Language Models. We present an efficient modular neuro-symbolic approach, combining a symbolic prover with small reasoning LLMs (4B parameters). The system consists of an LLM-based parser that translates natural language syllogisms to a first-order logic (FOL) representation, an automated theorem prover, and two optional modules: machine translation for multilingual inputs and a symbolic retrieval component for the identification of relevant premises. The system achieves competitive accuracy and relatively low content effect on most subtasks. Our ablations show that this approach outperforms LLM-based zero-shot baselines in this parameter size range, but also reveal limited multilingual capabilities of small LLMs. Finally, we include a discussion of the task’s main ranking metric and analyze its limitations.
SyntaxMind at SemEval-2026 Task 6: Exploring Transformers and LLMs for Unmasking Political Question Evasions
Md. Shihab Uddin Riad
This paper describes our approach to Subtask 1: Clarity-level Classification in SemEval-2026 Task 6. The task focuses on determining the clarity of political responses with respect to their corresponding questions. To improve model performance, we introduced a direct answer generation strategy as an additional input feature and applied Task-Adaptive Pre-Training (TAPT) to adapt encoder-only Transformer models to the task domain. We further explored both cross-entropy and focal loss to address potential class imbalance. Experimental results show that TAPT-enhanced encoder models, particularly DeBERTa-V3-base, achieved the strongest performance, while generative small language models fine-tuned via parameter-efficient methods exhibited comparatively lower results. Our system obtained a macro-F1 score of 0.72 on the official evaluation set, ranking 24th out of 40 teams.
IIITH Boys at SemEval-2026 Task 4: StoryNet - Understanding Narrative Story Similarity through Symbolic Representations
Amol Vijayachandran | Ananth Rajesh | Siddharth Mago | Maitreya Chitale | Aparajitha Allamraju
Narrative similarity extends beyond standard semantic tasks, requiring alignment of temporal, causal, and emotional structures. We present StoryNet, a framework that represents stories as heterogeneous graphs with character, event, and theme nodes. Stories are decomposed into structured narrative facets using large language models, and similarity is evaluated through both weighted semantic facet comparison and a graph neural network trained with contrastive learning. We analyze how integrating symbolic structure with learned graph representations compares to purely embedding-based baselines.
Yam at SemEval-2026 Task 4: Failure-Driven Prompt Evolution for Narrative Comparison
Yen Yee Yam | Hong Meng Yam
We present a structured, parameter-free system for SemEval-2026 Task 4 on Narrative Story Similarity. Instead of treating similarity as scalar embedding proximity, we align model reasoning with the task ontology by decomposing each story into abstract theme, course of action, and outcome, and performing contrastive comparison over these dimensions. Our primary contribution is a closed-loop, failure-driven prompt optimization procedure that iteratively refines concise guideline documents while keeping model parameters fixed and reverting updates that degrade performance, thereby isolating improvements attributable to structured reasoning rather than representation learning. Ontology-aligned decomposition alone achieves 70% accuracy on both train and test sets; with controlled guideline evolution, performance improves to 76% on train and 73% on test without additional supervision or fine-tuning. These results demonstrate that structured prompt optimization can meaningfully enhance contrastive narrative reasoning in a fully parameter-free setting.
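The closed-loop procedure described in this abstract (refine a guideline text, keep model parameters fixed, revert updates that degrade performance) can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: `evolve_guidelines`, `propose`, and `evaluate` are hypothetical names, and in practice `propose` and `evaluate` would wrap LLM calls that are not shown here.

```python
from typing import Callable, List, Tuple


def evolve_guidelines(
    guideline: str,
    propose: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Tuple[float, List[str]]],
    rounds: int = 5,
) -> Tuple[str, float]:
    """Iteratively refine a guideline text with a keep-or-revert rule.

    propose(guideline, failures) returns a candidate guideline (hypothetical hook).
    evaluate(guideline) returns (accuracy, failed examples) (hypothetical hook).
    """
    best_score, failures = evaluate(guideline)
    for _ in range(rounds):
        candidate = propose(guideline, failures)
        score, candidate_failures = evaluate(candidate)
        if score >= best_score:
            # Accept the update: it did not degrade performance.
            guideline, best_score, failures = candidate, score, candidate_failures
        # Otherwise the candidate is dropped, i.e. the update is reverted.
    return guideline, best_score
```

Accepting ties (`>=`) is a design choice made for this sketch; a stricter variant would accept only strict improvements.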
Pinetree at SemEval-2026 Task 7: A Large-Scale Failure Analysis of Cultural Grounding in Language Models
Yen Yee Yam | Hong Meng Yam
Using a simple prompting strategy without fine-tuning or retrieval augmentation, our system achieved 88.85% micro-average and 90.55% macro-average accuracy, ranking #4 overall on SemEval-2026 Task 7. Our primary contribution is a failure analysis of 5,241 incorrect predictions (11.15% of the dataset), categorized using the six-topic BLEnD taxonomy. Errors concentrate in Food (39.42%) and Holidays/Celebration/Leisure (15.76%), but within-topic error rates are highest on Family (21.04%) and Work life (20.45%), both topics with limited representational density. Global-brand attractor errors account for only 2.50% of failures and are tightly localized: 98.5% fall on a single template (most popular sport team) in four low-resource cultures. Outside these templates, brand-default effects are statistically negligible. These findings support representational sparsity and knowledge-density asymmetry, not ideological skew, as the dominant cause of cultural misalignment in everyday behavioral tasks.
TUCNLP at SemEval-2026 Task 11: Neuro-Symbolic Content Stripping for Debiased Syllogistic Reasoning
Rafael Butas | Alex Lapusan | Camelia Lemnaru | Rodica Potolea
In this paper, we present the solution submitted by TUCNLP at SemEval-2026 Task 11: Disentangling Content and Formal Reasoning in Large Language Models. The task requires predicting the formal validity of categorical syllogisms while minimizing susceptibility to content-driven biases in English and 11 additional languages. We show that a modestly-sized model (Qwen3-8B) can achieve near-perfect logical reasoning on the English validity-only subtask, and large reductions in content effect on multilingual and premise-retrieval variants, when augmented with a multi-stage neuro-symbolic pipeline: LLM-based content stripping with iterative error correction converts natural language to abstract categorical forms, a classical symbolic parser validates against the twenty-four Aristotelian syllogistic forms, and asymmetric confidence thresholds mediate between symbolic and neural decisions. Across the four subtasks (ST1 to ST4), our system achieves accuracy ranging from 91.1% to 100% and bias-penalized ranking scores ($\mathcal{M}$) from 31.8 to 100.0, with the main bottleneck being overconfident neural predictions that bypass symbolic verification.
Truth Gradient at SemEval-2026 Task 10: Conspiracy Belief Detection via Narrative Density and Mean Pooling
Ekansh Goyal
Conspiracy believers use significantly more psycholinguistic markers per post than nonbelievers (Cohen’s d = 0.56, p < 10⁻⁸⁰), a pattern we term narrative density, suggesting that belief manifests as structurally denser conspiratorial frames distributed across the full text rather than concentrated in specific lexical cues. We present Truth Gradient’s system for SemEval-2026 Task 10 Subtask 2 (Samory et al., 2026): a DeBERTaV3-large model with mean pooling and a 5-seed probability-averaging ensemble achieving macro F1 = 0.829 on the 77-sample development set and 0.75 on the official test set. The 5-fold CV estimate (0.734 ± 0.007) proves the more reliable predictor of test performance, and we recommend it as standard practice for low-resource shared tasks. Two convergent tests support the narrative density account: masking annotated marker spans drops F1 by 5.3 pp, and direct marker-count fusion recovers +0.9 pp, though we note these are not conclusive given the small dev set. Cross-validated ablation identifies encoder fine-tuning as the dominant design factor (−7.2 pts), and layer-wise probing confirms belief information peaks at mid-stack layers (layer 16/24).
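Mean pooling and seed-level probability averaging, as named in this abstract, are simple to state in code. The sketch below uses plain Python lists in place of tensors so that it stays self-contained; the function names are hypothetical, and the actual system presumably operates on batched DeBERTa hidden states instead.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors, ignoring padded positions (mask value 0)."""
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]


def average_seed_probabilities(seed_probs):
    """Average the positive-class probability predicted by each seed's model."""
    return sum(seed_probs) / len(seed_probs)
```

For example, `mean_pool([[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]], [1, 1, 0])` ignores the padded third token and returns `[2.0, 3.0]`.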
GigitAI at SemEval-2026 Task 11: Hybrid Symbolic-Neural Approach for Syllogistic Validity Classification
Saran Krishnasamy
We present our system for SemEval-2026 Task 11 on classifying whether syllogisms are logically valid. The main challenge is that language models tend to judge arguments based on whether the conclusion sounds true in the real world, rather than whether it follows logically from the premises. We evaluate direct prompting across six models (GPT-4o, GPT-5.2, o3, o3-mini, Claude Opus 4.6, Claude Sonnet 4) with three prompt strategies, finding that even the best achieves only 89.5% accuracy. Our best-performing system splits the task into two parts: GPT-4o-mini extracts the logical structure, then deterministic rules check validity, enhanced with bidirectional premise checking, predicate negation post-processing, and a targeted rule-based fallback for double negation. This achieves 98.95% accuracy on Subtask 1 (combined score 57.74) and 85.8% validity accuracy on Subtask 2. We also explore self-consistency with symbolic verification (93.1%), content abstraction, activation steering, contrastive fine-tuning, RLVR, and diffusion-based reasoning, finding that content abstraction surprisingly degrades performance, revealing that semantic content provides essential parsing scaffolding alongside the bias it introduces.
Team Evaluators at SemEval-2026 Task 6: Instruction-Tuned LLMs for Clarity and Evasion Classification in Political Interviews
Siva Nuthakki | Sanjay Pulagam | Sai Woona
This work is part of the SemEval-2026 CLARITY shared task (Task 6), which focuses on detecting clarity and evasion in political question–answer pairs from interviews and debates. The competition includes two subtasks: clarity-level classification (Clear Reply, Ambiguous, Clear Non-Reply) and evasion-level classification, which identifies one of nine fine-grained evasion techniques. The dataset consists of annotated question–answer pairs with hierarchical labels for both clarity and evasion, enabling comprehensive evaluation of nuanced discourse phenomena. We fine-tune open-source large language models using Low-Rank Adaptation (LoRA) and supervised fine-tuning (SFT), employing structured prompts that jointly encode the question and answer to capture discourse cues. Models are evaluated using Macro F1, the official metric of the shared task. Our system achieves a Macro F1 of 0.83 on Subtask 1 (5th place) and 0.54 on Subtask 2 (9th place), demonstrating that parameter-efficient fine-tuning of LLMs is effective for modeling strategic ambiguity in political discourse.
FunnyBorg at SemEval-2026 Task 1: Humor Generation
Stefan Oprea | Lacrimioara Toma Oprea | Maria Paval-Istrate | Diana Trandabat | Daniela Gifu
Our team competed in the SemEval-2026 Task 1: MWAHAHA: Humor Generation. This is a task for generation of computational humor. The generated jokes are text-based, but also include memes, for captioning an image. Our approach involved prompt engineering using a voting system. We obtained rank 1 in one of the subtasks, and rank 2 in three other subtasks.
Habib University at SemEval-2026 Task 3: A Pipeline Approach for Dimensional Aspect-Based Sentiment Analysis
Muhammad Affan | M Hassan Shahzad | Mikaal Imam | Moiz Zulfiqar | Sandesh Kumar | Abdul Samad
Aspect-based sentiment analysis has evolved from categorical polarity classification to fine-grained modeling of continuous affective dimensions. Dimensional Aspect-Based Sentiment Analysis (DimABSA) extends this paradigm by requiring both structured sentiment extraction and continuous valence–arousal (VA) regression in multilingual settings. In this paper, we present our system for SemEval-2026 Task 3, which evaluates this challenge across six languages and four domains, requiring systems to extract aspect–category–opinion quadruplets and predict VA scores on a 1–9 scale. We propose a modular four-stage multilingual transformer pipeline for element extraction, aspect–opinion pairing, category prediction, and VA regression. We conduct experiments over multiple models and training configurations, including VA rescaling to [-1,1], Gaussian label noise injection, Concordance Correlation Coefficient (CCC) loss, and Savitzky–Golay smoothing. Among all languages, our system achieves the lowest RMSE of 0.5333 on Subtask 1 and the highest cF1 of 0.5492 on Subtask 2. We further investigate data augmentation to improve low-resource performance and address label imbalance. Ultimately, our modular architecture demonstrated highly competitive cross-lingual transfer, achieving top-tier placements in low-resource settings, including 2nd place for Tatar and 6th place for Russian in dimensional regression.
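Two of the training configurations mentioned in this abstract, VA rescaling to [-1, 1] and the CCC loss, have compact definitions. The sketch below assumes a linear rescaling from the 1–9 scale and uses the standard concordance correlation coefficient with population (divide-by-n) statistics; the function names are hypothetical, and the authors' exact formulation may differ.

```python
def rescale_va(score, low=1.0, high=9.0):
    """Linearly map a score from [low, high] to [-1, 1] (assumed mapping)."""
    return 2.0 * (score - low) / (high - low) - 1.0


def ccc(pred, gold):
    """Concordance correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for identical constant sequences")
    return 2.0 * cov / denom


def ccc_loss(pred, gold):
    """Loss that reaches 0 when predictions agree perfectly with the gold scores."""
    return 1.0 - ccc(pred, gold)
```

Unlike Pearson correlation, CCC also penalizes a shift in mean or scale between predictions and gold scores, which is why it is often paired with regression on bounded rating scales.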
SemEval-2026 Task 4: Narrative Story Similarity and Narrative Representation Learning
Hans Ole Hatzel | Ekaterina Artemova | Haimo Stiemer | Evelyn Gius | Chris Biemann
We present the shared task on narrative similarity and narrative representation learning, NSNRL (pronounced "nass-na-rel"). The task operationalizes narrative similarity as a binary classification problem: determining which of two stories is more similar to an anchor story. We introduce a novel definition of narrative similarity, compatible with both narrative theory and intuitive judgment. Based on the similarity judgments collected under this concept, we also evaluate narrative embedding representations. We collected at least two annotations each for more than 1,000 story summary triples, with each annotation being backed by at least two annotators in agreement. This paper describes the sampling and annotation process for the dataset; further, we give an overview of the submitted systems and the techniques they employ. We received a total of 71 final submissions from 46 teams across our two tracks. In our triple-based classification setup, LLM ensembles make up many of the top-scoring systems, while in the embedding setup, systems with pre- and post-processing on pretrained embedding models perform about on par with custom fine-tuned solutions. Our analysis identifies potential headroom for improvement of automated systems in both tracks. The task website includes visualizations of embeddings alongside instance-level classification results for all teams.
CophiWue at SemEval-2026 Task 4: Symbolic Narrative Profiling with Taxonomy-Guided Extraction and Contrastive Fine-Tuning
Leonard Konle | Fotis Jannidis
We present our system for SemEval-2026 Task 4, focusing primarily on Track B (narrative embedding). Our approach, the Decompose & Align Cycle, converts each story into a structured NarrativeProfile consisting of abstract themes, a five-step course of action, and an outcome. We then build a NarrativeTaxonomy from these initial extractions via agglomerative clustering, and use the resulting controlled vocabularies to guide a second extraction pass, producing terminologically standardized profiles across the full dataset. Finally, we contrastively fine-tune the Qwen3-Embedding-8B model on profile text representations using TripletLoss, deriving story embeddings from this fine-tuned model. For Track A, we adapt the task’s provided baseline script by substituting Gemini 3 Pro as the judge, using the organizers’ default prompt on raw story texts.
Farhan Nafis Rayhan at SemEval-2026 Task 13: Supervised Contrastive Learning Approach with Gated Multiclass Decomposition Ensemble Architecture for Code Authorship Identification
Farhan Rayhan | Fariska Ruskanda
This paper presents our submission for SemEval-2026 Task 13 Subtask B, which requires the multi-class attribution of code snippets across 10 distinct AI generator families and a human baseline. Our proposed system utilizes a three-stage ensemble architecture specifically designed to navigate extreme class imbalance and capture subtle stylometric fingerprints. Initially, we employ Supervised Contrastive Learning to fine-tune a UniXcoder and ModernBERT backbone. The resulting embeddings are then processed by five heterogeneous shallow experts, each utilizing a multiclass decomposition to master specific generator lineages through specialized architectures. A Human Shield acts as a hierarchical safety auditor, an aggressive binary layer that separates human from machine code. Finally, a Context-Aware Gated Meta-Learner dynamically aggregates these expert opinions into final predictions. Our experiments reveal that streamlining the system to a pure UniXcoder backbone fine-tuned with supervised contrastive learning improves performance, outclassing the official CodeBERT baseline with a final Macro-F1 score of 0.31389, ranking 26th overall.
CUET320 at SemEval-2026 Task 10: Few-Shot Large Language Models for Psycholinguistic Marker Extraction and Conspiracy Detection
Faozia Fariha | Lamia Khan | Madiha Ahmed Chowdhury | Kawsar Ahmed | Mohammed Moshiul Hoque
Conspiracy theories spread widely on social media and can harm society by increasing mistrust, vaccine hesitancy, and political radicalization. However, most automated detection systems have traditionally relied on topic-specific classifiers, which often struggle to generalize across domains and provide little explanation for why a text is considered conspiratorial. To address these limitations, this paper explores various LLMs on the SemEval-2026 Task 10: psycholinguistic conspiracy marker extraction and binary conspiracy detection from Reddit submission statements. Specifically, we adopt a training-free few-shot prompting approach using different instruction-tuned large language models under a variety of few-shot settings (k in {0, 1, 5, 10, 15, 20}). Within this framework, the proposed prompting strategy incorporates psychology-informed instructions to guide the models in identifying conspiracy-related signals. As a result, the presented system achieves an F1 score of 0.21 for marker extraction and 0.81 for conspiracy detection, ranking 16th out of 30 teams in Subtask 1 and 36th out of 52 in Subtask 2 without any task-specific fine-tuning. These results suggest that psycholinguistically grounded prompting can support interpretable conspiracy analysis; however, challenges remain in identifying implicit markers.
UTD-HLTRI at SemEval 2026 Task 4: Reasoning like an Expert for Inferring Narrative Similarity
Rakshitha Rao Ailneni | Maitry Bhavsar | Sanda Harabagiu
Narrative similarity is a challenging problem that requires reasoning over three aspects of narratives: (1) the abstract theme, (2) the course of action, and (3) the outcomes of narratives. We present UTD.HLTRISIM.NARRATIVES, our method developed for SemEval 2026 Task 4 (Narrative Story Similarity), which combines contrastive reasoning prompting with careful selection of few-shot examples to guide a Large Language Model (LLM) toward decisions of narrative comparative similarity. A curriculum learning framework orders examples of narrative triplets presented to the LLM by using a score that quantifies the impact of common narrative aspects with information discerned from several distractors of narrative similarity between pairs of narratives.
Team Vivek Dhayaal at SemEval-2026 Task 13 Subtask B: Multi-Class Authorship Detection
David Rodriguez | Mario Graff
This paper describes the system for SemEval-2026 Task 10 Subtask 2 on conspiracy detection. We explore a progressive modeling strategy comparing traditional lexical representations with contextual transformer models. Lexical baselines include Bag-of-Words and TF-IDF features combined with Logistic Regression and Ridge classifiers. We then fine-tune a DistilRoBERTa transformer model for binary classification. All experiments were conducted using only the official task data in a CPU-only environment without external datasets or data augmentation. Our objective was to achieve acceptable performance while minimizing computational resources and model complexity. Results show that the transformer model improves the best lexical baseline from 0.67 to 0.75. The work highlights that competitive performance in conspiracy detection can be obtained with lightweight and reproducible configurations.
CSE-UOI at SemEval-2026 Task 6: A Two-Stage Heterogeneous Ensemble with Deliberative Complexity Gating for Political Evasion Detection
Christos Tzouvaras | Konstantinos Skianis | Athanasios Voulodimos
This paper describes our system for SemEval-2026 Task 6, which classifies clarity of responses in political interviews into three categories: Clear Reply, Ambivalent, and Clear Non-Reply. We propose a heterogeneous dual large language model (LLM) ensemble via self-consistency (SC) and weighted voting, and a novel post-hoc correction mechanism, Deliberative Complexity Gating (DCG). This mechanism uses cross-model behavioral signals and exploits the finding that an LLM response-length proxy correlates strongly with sample ambiguity. To further examine mechanisms for improving ambiguity detection, we evaluated multi-agent debate as an alternative strategy for increasing deliberative capacity. Unlike DCG, which adaptively gates reasoning using cross-model behavioral signals, debate increases agent count without increasing model diversity. Our solution achieved a Macro-F1 score of 0.85 on the evaluation set, securing 3rd place and tying with the second-best reported score.
ABARUAH at SemEval-2026 Task 1: Leveraging High-Resolution VLMs and Reasoning LLMs for Multimodal Humor Generation
Arup Baruah
This paper describes the systems developed for "SemEval 2026 Task 1: Humor Generation". This shared task covered both unimodal text constraints and multimodal GIF-based humor generation. The proposed approach used a two-stage pipeline consisting of a Multimodal Grounding stage to extract semantic descriptions from GIFs and a Humor Synthesis stage to generate the final humorous output. The Qwen2-VL and Qwen3-8B models were used for these respective stages. The system achieved competitive Elo-like ratings of 1009, 973, and 914 for Subtasks A, B1, and B2, respectively, demonstrating its ability to address diverse humorous constraints. The system was ranked 4th in overall standings for Subtasks A and B1.
AI@UMS at SemEval-2026 Task 6: Handling Long Question-Answer Pairs with Sliding Window Models for Clarity and Evasion Analysis
Ikhlasul Amal | Zia Ul Zafar | Choiru Firdaus | Endang Pamungkas
This paper presents the AI@UMS system for SemEval-2026 Task 6: CLARITY - Unmasking Political Question Evasions. The task requires classifying question-answer (QA) pairs from political interviews along two dimensions: clarity level (Subtask 1) and evasion technique (Subtask 2). A key challenge is that political interview transcripts often exceed the 512-token input limit of standard transformer encoder models. We address this with a sliding-window fine-tuning strategy applied to roberta-base, where each QA pair is segmented into overlapping windows of 512 tokens with a stride of 256 tokens. Per-window predictions are aggregated via softmax probability averaging across multiple windows and across an ensemble of three independently trained models with different random seeds. We further employ class-weighted focal-inspired loss and label smoothing to mitigate the pronounced class imbalance in both subtasks. Our system achieves macro F1 scores of 0.62 (Subtask 1) and 0.48 (Subtask 2) on the official evaluation set.
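The sliding-window segmentation and softmax probability averaging described in this abstract can be illustrated with a short sketch. It operates on plain lists of token ids and logits and does not call any tokenizer or model; the function names are hypothetical, and details such as how special tokens are handled per window are not given in the abstract and are left out.

```python
import math


def sliding_windows(token_ids, window=512, stride=256):
    """Split a token sequence into overlapping windows of at most `window` tokens."""
    if len(token_ids) <= window:
        return [list(token_ids)]
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # this window already reaches the end of the sequence
        start += stride
    return windows


def softmax(logits):
    """Numerically stable softmax over one list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_probabilities(per_window_logits):
    """Average class probabilities over all windows (or over windows and models)."""
    probs = [softmax(logits) for logits in per_window_logits]
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / len(probs) for c in range(n_classes)]
```

With the defaults, a 1,000-token input yields three windows starting at positions 0, 256, and 512, the last one being shorter than 512 tokens. Averaging over an ensemble amounts to passing the logits from every window of every model into `average_probabilities`.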
GUIR at SemEval-2026 Task 7: Probing Cultural Knowledge in LLMs via Multi-Agent Debate
Reihaneh Iranmanesh | Ophir Frieder | Nazli Goharian
We present the GUIR system for SemEval-2026 Task 7, Everyday Knowledge Across Diverse Languages and Cultures, which probes the extent to which general-purpose LLMs encode cultural knowledge without any culture-specific supervision or fine-tuning. Our system addresses two tracks built on the BLEnD benchmark. For the short-answer question (SAQ) track, we employ zero-shot prompting with gpt-4.1, achieving 55.5% accuracy across 61 language locales. For the multiple-choice question (MCQ) track, we propose a three-stage pipeline: (1) zero-shot chain-of-thought inference with gpt-5-mini, (2) cross-locale majority voting to correct inconsistent predictions, and (3) a multi-agent debate protocol in which three LLM instances argue and adjudicate over residual errors. This pipeline achieves 97.47% overall accuracy across 30 locales, ranking first among all submitted systems on the MCQ track. We further conduct a targeted human evaluation on the Persian locale, revealing that BLEnD’s lemma-matching scorer systematically underestimates model performance, with human annotators scoring the system 18 percentage points higher than the lemma-matching evaluation. This reveals the need for better evaluation of morphologically rich languages like Persian.
NAMAA at SemEval-2026 Task 9: Comparing Generative, Retrieval-Augmented, and Discriminative Methods for Arabic Online Polarization Detection and Type Classification
Abdelbasset Djamai | Sahara Al-Madi | Norah Al-Zaid | Khloud Al Jallad | Mona Azim
Detecting polarization in online discourse is important for understanding social fragmentation, yet it remains difficult for Arabic due to dialect variation, informal writing, and implicit framing. In this paper, we study Arabic polarization modeling in the SemEval-2026 Task 9 (POLAR) setting, focusing on polarization detection (ST1) and polarization type classification (ST2). We compare three approaches: encoder fine-tuning, zero-shot prompting, and retrieval-augmented in-context learning (RAG-ICL), across six Arabic encoders and different LLMs. For ST1, RAG-ICL with Gemma-3-27b-it achieves the best result (test macro F1 = 0.83), while remaining competitive with the best fine-tuned encoder (0.82), and substantially outperforming zero-shot prompting. For ST2, a pipeline that first applies the best ST1 encoder as a hard filter and then performs RAG-ICL achieves a macro F1 = 0.62. Prompt-language effects are model- and task-dependent, with some settings doing better with English prompts and others with Arabic prompts. Chain-of-thought, self-refinement, and contrastive prompting do not outperform standard RAG-ICL.
MoMo at SemEval-2026 Task 9: Inference-Only Prompting vs. Fine-Tuning for Multilingual Polarization Detection
Sushant Ray | Rakshita Saksainaa
We describe our submission to SemEval-2026 Task 9 Subtask 1, which focuses on multilingual polarization detection over the POLAR dataset. We compare three adaptation paradigms: fully fine-tuned multilingual encoders, frozen encoders augmented with lightweight residual heads, and inference-only multilingual LLM prompting in zero-shot and few-shot settings. For few-shot prompting, we evaluate both random and similarity-based support example selection. Similarity-based few-shot prompting with a multilingual LLM competes with our fine-tuned encoder baselines while requiring no task-specific training. We further analyze energy usage, stability across prompt selections, and per-language behavior to characterize trade-offs between architectural adaptation and prompt-based inference. While our submission uses a fully fine-tuned XLM-RoBERTa Large, the results indicate that inference-only prompting can be a competitive and energy-efficient alternative to task-specific fine-tuning in multilingual classification.
Codexa at SemEval-2026 Task 13: Loss Engineering and Diverse Ensemble Strategies for Multi-Class Code Authorship Attribution
Anıl Dervişoğlu | Atakan Site
We describe our system for SemEval-2026 Task 13, Subtask B: code classification into 11 categories (human-written or generated by one of 10 LLM families). The task presents extreme class imbalance and distribution shift across the multiple generators in the dataset (31 in training, 59 in test, with 36 unseen). To address this, our approach has two components: (1) UniXcoder as the encoder with Label-Distribution-Aware Margin (LDAM) loss for handling class imbalance, which provides a +7% absolute improvement over the cross-entropy baseline; and (2) a diverse ensemble of 12 models trained with different objectives and architectures (detailed in the appendix), combined with hard voting. Our system achieves 41.28% Macro F1 on the official test set. We find that loss engineering and ensemble diversity matter more than domain adaptation techniques, which consistently degraded test performance.
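For readers unfamiliar with LDAM loss, the core idea is to subtract a class-dependent margin from the true-class logit before applying cross-entropy, with larger margins for rarer classes (proportional to n_j^(-1/4) for a class with n_j training examples). The sketch below is a single-example, pure-Python illustration; the function names, the `max_margin` normalization, and the `scale` factor are assumptions for this example, and the abstract does not state which hyperparameters the authors used.

```python
import math


def ldam_margins(class_counts, max_margin=0.5):
    """Per-class margins proportional to n_j ** -0.25, rescaled so the rarest
    class receives `max_margin` (an assumed normalization)."""
    raw = [1.0 / (n ** 0.25) for n in class_counts]
    factor = max_margin / max(raw)
    return [r * factor for r in raw]


def ldam_loss(logits, target, margins, scale=1.0):
    """Cross-entropy computed after subtracting the target class margin."""
    adjusted = list(logits)
    adjusted[target] -= margins[target]      # enforce a margin on the true class
    adjusted = [scale * z for z in adjusted]
    peak = max(adjusted)
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in adjusted))
    return log_sum_exp - adjusted[target]
```

With all margins set to zero and `scale=1.0`, `ldam_loss` reduces to ordinary cross-entropy, which makes the effect of the margin easy to isolate in an ablation.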
StanceLab at SemEval-2026 Task 9: Addressing Class Imbalance in Multilingual Polarization Detection
Teodor Ivanusca | Dan Dodun-Des-Perrieres | Stefana Gheorghita
Polarization in online discourse poses significant challenges for natural language processing, particularly in multilingual and culturally diverse environments. In this paper, we address the SemEval-2026 POLAR shared task on multilingual polarization detection across 22 languages. We adopt a staged experimental strategy that first investigates the problem in a controlled monolingual English setting before extending the approach to multilingual modeling. Our system evaluates several transformer-based architectures, including RoBERTa, XLM-RoBERTa, MPNet, and mDeBERTa-v3, combined with techniques designed to mitigate class imbalance such as weighted loss functions, focal loss, and data augmentation using back-translation and large language models. Experimental results show that no single configuration consistently dominates across all languages. However, focal loss and augmentation frequently improve performance in languages with skewed label distributions. Our findings highlight the importance of contextual representations, imbalance-aware training strategies, and language-specific considerations for robust multilingual polarization detection.
CoPol at SemEval-2026 Task 9: Modeling Polarization Type Co-occurrence with Label Correlation Networks
Pushkar Arora
POLAR-LDA is a label-dependency–aware system for SemEval-2026 Task 9 (multi-label polarization type classification) that augments an mDeBERTa-v3-base encoder with a Label Correlation Network (language-specific directed co-occurrence matrices + GAT), Asymmetric Loss tuned for extreme positive scarcity, and a language-grouped ensemble. The system scores 0.567 macro F1 across 22 languages (ranging from 0.784 for Hindi to 0.256 for Italian) and shows clear ablation gains (ASL +0.041, LCN +0.030, ensemble +0.018). Key findings: absolute data voids (0–1 positive examples) form an unrecoverable floor for supervised learning; label co-occurrence is culturally situated (e.g., political↔religious in Indic vs. political↔racial in some Western languages) and benefits from language-specific graphs; and per-label training volume predicts cross-lingual performance better than linguistic family. Limitations are honest and important: noisy AL estimates under scarcity, an incoherent residual "other" category, and domain mismatch between pretraining corpora and polarization discourse. Overall, the paper offers a strong shared-task system and useful empirical diagnostics: practical and well-executed, but incrementally novel methodologically.
SemEval-2026 Task 10: PsyCoMark – Psycholinguistic Conspiracy Marker Extraction and Detection
Mattia Samory | Felix Soldner | Veronika Batzdorfer
Despite the need to address the proliferation of conspiracy theories in online discussions, there is a lack of benchmarks for effectively detecting conspiracy-related content in everyday conversational settings. We introduce a novel dataset of comments from Reddit, ranging from politics to TV series, as well as two synergetic tasks: (1) extracting five psycholinguistic markers, grounded in evolutionary psychology, and (2) detecting conspiracy content. The data enable multi-task approaches, allowing testing of whether marker extraction improves detection performance.
SemEval-2026 Task 13: Detecting Machine-Generated Code with Multiple Programming Languages, Generators, and Application Scenarios
Daniil Orel | Dilshod Azizov | Indraneil Paul | Yuxia Wang | Iryna Gurevych | Preslav Nakov
We present the results and the main findings of SemEval-2026 Task 13: Detecting Machine-Generated Code with Multiple Programming Languages, Generators, and Application Scenarios. Our task featured three subtasks. Subtask A is a binary classification task that determines whether a given code snippet is written by a human or generated by a machine. This subtask focuses on the development of robust methods for AI-generated code identification, since the training and the test data splits have code in different languages and cover diverse usage domains. Subtask B focuses on defining synthetic code smells and requires participants to identify the provenance of the generator family of the model that generated the given code snippet. Subtask C aims at more fine-grained attribution of the written code: whether it was fully AI-generated, fully human-written, produced in human-AI collaboration (hybrid) or by a model tuned or prompted to give human-like code. The task attracted a large number of team members: subtask A (81), subtask B (34), and subtask C (32). In this study, we present the task, analyze the results, and discuss the system submissions and the methods they used.
SemEval-2026 Task 12: Abductive Event Reasoning: Towards Real-World Event Causal Inference for Large Language Models
Pengfei Cao | Mingxuan Yang | Yubo Chen | Chenlong Zhang | Mingxuan Liu | Kang Liu | Jun Zhao
Understanding why real-world events occur is important for both natural language processing and practical decision-making, yet direct-cause inference remains underexplored in evidence-rich settings. To address this gap, we organized SemEval-2026 Task 12: Abductive Event Reasoning (AER). The task asks systems to identify the most plausible direct cause of a target event from supporting evidence. We formulate AER as an evidence-grounded multiple choice benchmark that captures key challenges of real-world causal reasoning, including distributed evidence, indirect background factors, and semantically related but non-causal distractors. The shared task attracted 122 participants and received 518 submissions. This paper presents the task formulation, dataset construction pipeline, evaluation setup, and system results. AER provides a focused benchmark for abductive reasoning over real-world events and highlights challenges for future work on causal reasoning and multi-document understanding.
SemEval-2026 Task 8: MTRAGEval: Evaluating Multi-Turn RAG Conversations
Sara Rosenthal | Vraj Shah | Yannis Katsis | Marina Danilevsky
We present the results and findings from SemEval Task 8: MTRAGEval. MTRAGEval measures three Retrieval Augmented Generation (RAG) subtasks: A. Retrieval, B. Generate, and C. Retrieve+Generate (full RAG) on multi-turn conversations. The task is evaluated using MTRAG-UN, a new benchmark for Multi-Turn RAG focusing on Unanswerable, Underspecified, Non-Standalone, and Unclear Questions. The MTRAGEval task attracted strong participation with 107 registered teams and 92 submissions across all subtasks, and yielded several interesting findings on effective retrieval and query rewriting techniques, the use of ensemble models, and the compounding costs of retrieval errors on downstream generation quality.
SemEval-2026 Task 5: Rating Plausibility of Word Senses in Ambiguous Stories through Narrative Understanding
Janosch Gehring | Selina Meyer | Michael Roth
We introduce SemEval-2026 Task 5 on "Rating Plausibility of Word Senses in Ambiguous Stories through Narrative Understanding". The dataset for this task consists of English short stories of 4-5 sentences each. In each story, one sentence includes a lexical ambiguity, and different senses are to be judged in terms of plausibility on a Likert scale. The task is intentionally constructed to be challenging, with stories providing only sparse contextual cues. We give an overview of well-performing, frequent and interesting approaches used by participating systems. From a total of 175 registered participants and 27 submitted system description papers, the best system achieved an "accuracy within standard deviation" score of 93.3%.
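The "accuracy within standard deviation" score mentioned above can be illustrated with a small sketch. The abstract does not define the metric, so the reading used here is an assumption: a prediction counts as correct when it lies within one standard deviation of the mean human rating for that item. The function name `accuracy_within_sd` and the optional `min_sd` floor are hypothetical and are not taken from the task's official scorer.

```python
import statistics
from typing import Sequence


def accuracy_within_sd(
    predictions: Sequence[float],
    human_ratings: Sequence[Sequence[float]],
    min_sd: float = 0.0,
) -> float:
    """Share of predictions within one SD of the mean human rating.

    Assumed definition, not the official scorer: for each item, compute
    the mean and population standard deviation of the human ratings and
    count the prediction as a hit if |prediction - mean| <= max(sd, min_sd).
    """
    if len(predictions) != len(human_ratings):
        raise ValueError("predictions and human_ratings must align")
    hits = 0
    for pred, ratings in zip(predictions, human_ratings):
        mean = statistics.fmean(ratings)
        sd = statistics.pstdev(ratings)
        # min_sd is a hypothetical floor for items where annotators agree exactly.
        if abs(pred - mean) <= max(sd, min_sd):
            hits += 1
    return hits / len(predictions)


# Toy data (invented for illustration): two items rated on a 1-5 scale.
toy_predictions = [3.0, 1.0]
toy_ratings = [[2, 3, 4], [4, 5, 5]]
toy_score = accuracy_within_sd(toy_predictions, toy_ratings)
```

On the toy data, the first prediction matches the mean rating and the second is far below it, so the score is one hit out of two.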
SemEval-2026 Task 6: CLARITY – Unmasking Political Question Evasions
Konstantinos Thomas | Giorgos Filandrianos | Maria Lymperaiou | Chrysoula Zerva | Giorgos Stamou
This paper presents CLARITY, the SemEval-2026 shared task on detecting and classifying evasive responses in political discourse. The task is grounded in an expert-designed two-level taxonomy and a benchmark dataset of question-answer pairs from U.S. presidential interviews, requiring systems to distinguish clear from evasive responses at a coarse level and identify one of nine fine-grained evasion strategies at a fine-grained level. With 124 registered teams and over 1,400 combined valid submissions, the task attracted broad participation spanning a wide range of methodological approaches, from fine-tuned encoder models to multi-stage large language model pipelines. Analysis of submitted systems reveals that hierarchical exploitation of the taxonomy and chain-of-thought prompted LLMs were the most effective strategies, while fine-grained evasion classification remained a substantially harder and largely unsolved challenge. CLARITY advances the study of strategic ambiguity in political language as a formal NLP benchmark and highlights key open problems in computational discourse analysis.
SemEval-2026 Task 11: Disentangling Content and Formal Reasoning in Large Language Models
Marco Valentino | Leonardo Ranaldi | Giulia Pucci | Federico Ranaldi | André Freitas
SemEval-2026 Task 11 evaluates the ability of Large Language Models (LLMs) to perform content-independent reasoning through a novel multilingual syllogistic dataset designed to measure the "content effect" — the tendency to conflate semantic plausibility with logical validity. The competition featured four subtasks, covering English and multilingual settings with both standard and noisy premise sets. Evaluations of zero-shot baselines reveal that the content effect is pervasive in open models, whereas newer versions demonstrate a significant shift in performance. Across the subtasks, findings indicate that introducing distracting premises can challenge the models’ ability to filter misleading information, while multilingual settings amplify their susceptibility to content biases compared to English. Participants proposed diverse approaches, including neuro-symbolic decomposition, fine-tuning and distillation methods, data augmentation, and activation steering. While explicit symbolic verification remains the most reliable strategy, activation-level interventions and fine-tuning methods offer promising pathways for internalising formal logic within neural architectures. Ultimately, the task reinforces the efficacy of neuro-symbolic approaches and emerging architectural trends for logical reliability, while also highlighting that multilingual setups and longer contexts still pose significant challenges to be investigated in future research.
SemEval-2026 Task 2: Predicting Variation in Emotional Valence and Arousal over Time from Ecological Essays
Nikita Soni | H. Andrew Schwartz | Ryan Boyd | Phi Long Bui | Syeda Mahwish | August Nilsson | Adithya V Ganesan | Lyle Ungar | Niranjan Balasubramanian | Saif Mohammad
We present our shared task on predicting variation in emotional valence and arousal over time from ecological essays. The shared task uses a longitudinal dataset collected over 7 data collection phases of 14 days each, spanning 2021 to 2024, consisting of real-time essays and feeling words (e.g., happy, calm, sad) written in English by U.S. service-industry workers about “how they are feeling”. Each text is associated with self-reported valence (V) (0 - 4, highly negative to highly positive affect) and arousal (A) (0 - 2, low to high energy) scores. The shared task consists of three parts, Subtask (1): Longitudinal Affect Assessment, and Subtask (2): Forecasting Variation in Affect as a (2a): state change, and (2b): disposition change. The task attracted over 200 member registrations on Codabench, receiving official system submissions from 31 teams (total 104 team members), of which 28 teams (with 90 team members) submitted system description papers and appear on our leaderboard. We discuss baseline results along with findings from 28 systems, highlighting the best-performing systems, a deeper analysis of performance on essays versus feeling words, and assessments for authors seen versus unseen during training. The datasets for this task are publicly available.
SemEval-2026 Task 3: Dimensional Aspect-Based Sentiment Analysis (DimABSA)
Liang-Chih Yu | Jonas Becker | Shamsuddeen Hassan Muhammad | Idris Abdulmumin | Lung-Hao Lee | Ying-Lung Lin | Jin Wang | Jan Philip Wahle | Terry Lima Ruas | Natalia Loukachevitch | Alexander Panchenko | Ilseyar Alimova | Lilian Diana Awuor Wanzare | Nelson Odhiambo | Bela Gipp | Kai-Wei Chang | Saif Mohammad
We present the SemEval-2026 shared task on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which improves traditional ABSA by modeling sentiment along valence–arousal (VA) dimensions rather than using categorical polarity labels. To extend ABSA beyond consumer reviews to public-issue discourse (e.g., political, energy, and climate issues), we introduce an additional task, Dimensional Stance Analysis (DimStance), which treats stance targets as aspects and reformulates stance detection as regression in the VA space. The task consists of two tracks: Track A (DimABSA) and Track B (DimStance). Track A includes three subtasks: (1) dimensional aspect sentiment regression, (2) dimensional aspect sentiment triplet extraction, and (3) dimensional aspect sentiment quadruplet extraction, while Track B includes only the regression subtask for stance targets. We also introduce a continuous F1 (cF1) metric to jointly evaluate structured extraction and VA regression. The task attracted more than 400 participants, resulting in 112 final submissions and 42 system description papers. We report baseline results, discuss top-performing systems, and analyze key design choices to provide insights into dimensional sentiment analysis at the aspect and stance-target levels. All resources are available on our GitHub repository.
SemEval-2026 Task 9: Detecting Multilingual, Multicultural and Multievent Online Polarization
Usman Naseem | Robert Geislinger | Ada Ren | Sarah Kohail | Rudy Garrido Veliz | P Sam Sahil | Yiran Zhang | Marco Antonio Stranisci | Idris Abdulmumin | Özge Alacam | Cengiz Acarturk | Aisha Jabr | Saba Anwar | Abinew Ali Ayele | Elena Tutubalina | Aung Kyaw Htet | Xintong Wang | Surendrabikram Thapa | Tanmoy Chakraborty | Dheeraj Kodati | Sahar Moradizeyveh | Firoj Alam | Ye Kyaw Thu | Shantipriya Parida | Ihsan Ayyub Qazi | Lilian Diana Awuor Wanzare | Nelson Odhiambo | Clemencia Siro | Ibrahim Said Ahmad | Adem Chanie Ali | Martin Semmann | Chris Biemann | Shamsuddeen Hassan Muhammad | Seid Muhie Yimam
We present SemEval-2026 Task 9, a shared task on online polarization detection, covering 22 languages and comprising over 110K annotated instances. Each data instance is multi-labeled with the presence of polarization, polarization type, and polarization manifestation. Participants were asked to predict labels in three subtasks: (1) detecting the presence of polarization, (2) identifying the type of polarization, and (3) recognizing the polarization manifestation. The three tasks attracted over 1,000 participants worldwide and more than 10k submissions on Codabench. We received final submissions from 67 teams and 69 system description papers. We report the baseline results and analyze the performance of the best-performing systems, highlighting the most common approaches and the most effective methods across different subtasks and languages. The dataset and other resources for this task are publicly available.
SemEval-2026 Task 1: MWAHAHA, Models Write Automatic Humor And Humans Annotate
Santiago Castro | Luis Chiruzzo | Santiago Góngora | Naihao Deng | Salar Rahili | Ignacio Sastre | Aiala Rosá | Victoria Amoroso | Guillermo Rey | Guillermo Moncecchi | J. A. Meaney | Juan José Prada | Rada Mihalcea
We present SemEval-2026 Task 1: MWAHAHA (Models Write Automatic Humor And Humans Annotate), the first shared task on general-purpose humor generation. Systems must produce short jokes in English, Spanish, and Chinese under lexical or topical constraints (Subtask A) and generate humorous captions for GIFs (Subtask B). To discourage memorization and ensure fairness, all jokes must meet specific criteria, such as using infrequent word pairs or relating to recent news headlines. Evaluation is conducted through pairwise human preference judgments in a Chatbot Arena-style setting, yielding Elo-based rankings. The task attracted 309 registered users, with 37 teams submitting systems to the evaluation phase. Participating systems employ a wide range of NLP techniques, including generate-then-rank pipelines, reinforcement learning, parameter-efficient fine-tuning, retrieval-augmented generation, humor-theory-grounded prompting, and persona-based strategies. Our Gemini 2.5 Flash baseline, using simple prompts, tied for first place in all subtasks, and the majority of elaborate multi-stage pipelines only marginally surpassed it with overlapping confidence intervals. More work is necessary to outperform the simple usage of state-of-the-art large language models. We release all evaluation data, prompts, and leaderboard results to support future research in computational humor generation.
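As an illustration of how pairwise preference judgments can yield Elo-based rankings, the following is a minimal sketch of the standard sequential Elo update. It is not the organizers' evaluation code: the K-factor of 32, the initial rating of 1000, and the system names are assumptions chosen for the example, and arena-style leaderboards often fit ratings with other estimators.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that A is preferred over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_ratings(
    judgments: Iterable[Tuple[str, str, float]],
    k: float = 32.0,          # assumed K-factor
    initial: float = 1000.0,  # assumed starting rating
) -> Dict[str, float]:
    """Apply sequential Elo updates over pairwise judgments.

    Each judgment is (system_a, system_b, outcome), where outcome is
    1.0 if A was preferred, 0.0 if B was preferred, and 0.5 for a tie.
    """
    ratings: Dict[str, float] = {}
    for sys_a, sys_b, outcome in judgments:
        r_a = ratings.setdefault(sys_a, initial)
        r_b = ratings.setdefault(sys_b, initial)
        e_a = expected_score(r_a, r_b)
        # The two updates are symmetric, so the total rating mass is conserved.
        ratings[sys_a] = r_a + k * (outcome - e_a)
        ratings[sys_b] = r_b + k * ((1.0 - outcome) - (1.0 - e_a))
    return ratings


# Hypothetical judgments between two invented systems.
toy_judgments = [
    ("baseline", "pipeline", 1.0),
    ("baseline", "pipeline", 0.5),
    ("pipeline", "baseline", 1.0),
]
toy_ratings = elo_ratings(toy_judgments)
ranking = sorted(toy_ratings, key=toy_ratings.get, reverse=True)
```

Sequential Elo depends on the order in which judgments arrive, which is one reason rankings of this kind are usually reported with confidence intervals, as the abstract notes for this task.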
SemEval-2026 Task 7: Everyday Knowledge Across Diverse Languages and Cultures
Nedjma Ousidhoum | Junho Myung | Carla Perez Almendros | Jiho Jin | Amr Keleg | Meriem Beloucif | Yi Zhou | Rodrigo Agerri | Vladimir Araujo | Naomi Baes | James Barry | Joanne Boisson | Nancy Chen | Christine De Kock
We present our shared task on evaluating the adaptability of LLMs and NLP systems across multiple languages and cultures. The task data consist of an extended version of our manually constructed BLEnD benchmark (Myung et al., 2024), covering more than 30 language–culture pairs, predominantly representing low-resource languages spoken across multiple continents. As the task is designed strictly for evaluation, participants were not permitted to use the data for training, fine-tuning, few-shot learning, or any other form of model modification. Our task includes two tracks: (a) Short-Answer Questions (SAQ) and (b) Multiple-Choice Questions (MCQ). Participants were required to predict labels and were allowed to submit any NLP system and adopt diverse modelling strategies, provided that the benchmark was used solely for evaluation. The task attracted more than 140 registered participants, and we received final submissions from 62 teams, along with 19 system description papers. We report the results and present an analysis of the best-performing systems and the most commonly adopted approaches. Furthermore, we discuss shared insights into open questions and challenges related to evaluation, misalignment, and methodological perspectives on model behaviour in low-resource languages and for under-represented cultures. Our data and resources are available at https://github.com/BLEnD-SemEval2026/SemEval-2026-Task-7.
Proceedings of the Second Workshop Natural Language Processing for Turkic Languages (SIGTURK 2026)
Kemal Oflazer | Abdullatif Köksal | Onur Varol
Transformer models have revolutionized NLP, yet many morphologically rich languages remain underrepresented in large-scale pre-training efforts. With SindBERT, we set out to chart the seas of Turkish NLP, providing the first large-scale RoBERTa-based encoder for Turkish. Trained from scratch on 312 GB of Turkish text (mC4, OSCAR23, Wikipedia), SindBERT is released in both base and large configurations, representing the first large-scale encoder-only language model available for Turkish. We evaluate SindBERT on part-of-speech tagging, named entity recognition, offensive language detection, and the TurBLiMP linguistic acceptability benchmark. Our results show that SindBERT performs competitively with existing Turkish and multilingual models, with the large variant achieving the best scores in two of four tasks but showing no consistent scaling advantage overall. This flat scaling trend, also observed for XLM-R and EuroBERT, suggests that current Turkish benchmarks may already be saturated. At the same time, comparisons with smaller but more curated models such as BERTurk highlight that corpus quality and diversity can outweigh sheer data volume. Taken together, SindBERT contributes both as an openly released resource for Turkish NLP and as an empirical case study on the limits of scaling and the central role of corpus composition in morphologically rich languages. The SindBERT models are released under the MIT license and made available in both fairseq and Huggingface formats.
Authorial style transfer is particularly challenging in low-resource scenarios, such as those presented by languages with a distinct socio-digital trajectory like Turkish, where contemporary digital text coexists with under-resourced literary and historical styles. This work addresses this gap through the Dual-Stage Stylometric Imprinting (DSSI) framework, introducing a Rule+Example paradigm for effective style profiling. Evaluated on a corpus of Turkish texts, the approach enables smaller models to achieve up to 90% of large model performance by combining explicit stylistic guidelines with contextual demonstrations. The findings demonstrate altered scaling laws for stylistic tasks and facilitate the practical deployment of personalized style transfer for preserving distinctive writing characteristics.
TUNE: A Task For Turkish Machine Unlearning For Data Privacy
Doruk Benli | Ada Canoğlu | Nehir İlkim Gönençer | Dilara Keküllüoğlu
Most large language models (LLMs) are trained on massive datasets that include private information, which may be disclosed to third-party users in output generation. Developers put defences in place to prevent the generation of harmful and private information, but jailbreaking methods can be used to bypass them. Machine unlearning aims to remove information that may be private or harmful from the model’s generation without retraining the model from scratch. While machine unlearning has gained some popularity as a way to remove private information, especially in English, little to no attention has been given to Turkish unlearning paradigms or existing benchmarks. In this study, we introduce TUNE (Turkish Unlearning Evaluation), the first benchmark dataset for the Turkish unlearning task for personal information. TUNE consists of 9842 input-target text pairs about 50 fictitious personalities with two training task types: (1) Q&A and (2) Information Request. We fine-tuned the mT5 base model to evaluate various unlearning methods, including our proposed approach. We find that while current methods can help unlearn unwanted private information in Turkish, they also unlearn other information we want to retain in the model.
A Unified Turkic Idiom Understanding Benchmark: Idiom Detection and Semantic Retrieval Across Five Turkic Languages
Gözde Aslantaş | Tunga Gungor
Idiomatic expressions are culturally grounded, semantically opaque, and difficult to interpret for multilingual natural language processing systems. Despite the large speaker population of Turkic languages, resources that focus on monolingual and cross-lingual idioms and their meanings are limited. We introduce the first unified benchmark for idiom understanding across Turkish, Azerbaijani, Turkmen, Gagauz, and Uzbek languages. The datasets compiled include token-level idiom span annotations. We develop models for idiom identification and semantic retrieval tasks. We evaluate seven models for idiom identification and nine embedding models for semantic retrieval tasks under several fine-tuning schemes using standard dense retrieval metrics. This benchmark provides a basis for studying idiomatic phenomena in Turkic languages and clarifies how idiomatic meanings are shared, altered, or diverge across languages.
TR-EduVSum: A Turkish-Focused Dataset and Consensus Framework for Educational Video Summarization
Figen Eğin | Aytuğ Onan
This study presents a framework for generating the gold-standard summary fully automatically and reproducibly based on multiple human summaries of Turkish educational videos. Within the scope of the study, a new dataset called TR-EduVSum was created, encompassing 82 Turkish course videos in the field of "Data Structures and Algorithms" and containing a total of 3281 independent human summaries. Inspired by existing pyramid-based evaluation approaches, the AutoMUP (Automatic Meaning Unit Pyramid) method is proposed, which extracts consensus-based content from multiple human summaries. AutoMUP clusters the meaning units extracted from human summaries using embedding, statistically models inter-participant agreement, and generates graded summaries based on consensus weight. In this framework, the gold summary corresponds to the highest-consensus AutoMUP configuration, constructed from the most frequently supported meaning units across human summaries. Experimental results show that AutoMUP summaries exhibit high semantic overlap with robust LLM summaries such as Flash 2.5 and GPT-5.1. Furthermore, ablation studies clearly demonstrate the decisive role of consensus weight and clustering in determining summary quality. The proposed approach can be generalized to other Turkic languages at low cost.
SarcasTürk: Turkish Context-Aware Sarcasm Detection Dataset
Niyazi Ahmet Metin | Sevde Yılmaz | Osman Enes Erdoğdu | Elif Sude Meydan | Oğul Sümer | Dilara Keküllüoğlu
Sarcasm is a colloquial form of language that is used to convey messages in a non-literal way, which affects the performance of many NLP tasks. Sarcasm detection is not trivial, and existing work focuses mainly on English. We present SarcasTürk, a context-aware Turkish sarcasm detection dataset built from entries on Ekşi Sözlük, a large-scale Turkish online discussion platform where people frequently use sarcasm. SarcasTürk contains 1,515 entries from 98 titles with binary sarcasm labels and a title-level context field created to support comparisons between entry-only and context-aware models. We generate these contexts by selecting representative sentences from all entries under a title using summarization techniques. We report baseline results for a fine-tuned BERTurk classifier and zero-shot LLMs under both no-context and context-aware conditions. We find that the BERTurk model with title-level context has the best performance, with 0.76 accuracy and balanced class-wise F1 scores (0.77 for sarcasm, 0.75 for no sarcasm). SarcasTürk can be shared upon contacting the authors since the dataset contains potentially sensitive and offensive language.
Language Matters: Target-Language Supervision for Political Bias Detection in Turkish News
Umut Ozbagriacik | Haim Dubossarsky
We present, to our knowledge, the first systematic transformer-based outlet-ideology classification study for Turkish news. Using a topic-balanced corpus of Turkish political articles drawn from six outlets commonly perceived as left-, centre-, or right-leaning, we formulate a three-way outlet-ideology classification task. On this dataset, we evaluate a monolingual encoder (BERTurk), two multilingual encoders (mBERT, XLM-R), and a LoRA-adapted decoder model (Mistral). BERTurk achieves the best performance among individual models (70% accuracy, 71% macro-F1), reaching levels comparable to English-language studies despite operating in a lower-resource setting. Error analyses show that all encoders reliably distinguish centrist from partisan articles, but frequently confuse left- and right-leaning articles with each other. Moreover, BERTurk is relatively stronger on right-leaning content, whereas the multilingual models favour left-leaning content, suggesting an “ideological fingerprint” of their pre-training data. Crucially, models fine-tuned on an English political-bias task fail to transfer to Turkish, collapsing to near-chance performance. Taken together, these results demonstrate that effective political bias detection requires target-language supervision and cannot be achieved through naïve cross-lingual transfer. Our work establishes a first baseline for Turkish political bias detection and underscores the need for open, carefully designed Turkish (and broader Turkic) bias benchmarks to support robust and fair media analysis.
Modelling the Morphology of Verbal Paradigms: A Case Study in the Tokenization of Turkish and Hebrew
Giuseppe Samo | Paola Merlo
In this paper, we investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, focusing on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish, with its transparent morphological markers, both monolingual and multilingual models succeed when tokenization is either highly atomic or breaks words into small subword units. For Hebrew, however, a multilingual model using character-level tokenization fails to capture its non-concatenative morphology, while a monolingual model with unified morpheme-aware segmentation excels. Performance improves on more synthetic datasets for all models.
A Morphology-Aware Evaluation of Turkish Syntax in Large Language Models
Ezgi Başar | Arianna Bisazza
Minimal pair benchmarks have become a common approach for evaluating the syntactic knowledge of language models (LMs). However, the creation of such benchmarks often overlooks language-specific confounders that may affect model performance, particularly in the case of morphologically rich languages. In this paper, we investigate how surface-level factors such as morpheme count, subword count, and sentence length influence the performance of LMs on a Turkish benchmark of linguistic minimal pairs. We further analyze whether a tokenizer’s degree of alignment with morphological boundaries can serve as a proxy for model performance. Finally, we test whether the distribution of morphemes in a minimal pair benchmark can skew model performance. Our results show that while surface factors have limited predictive power, they might still serve as a systematic source of bias. Moreover, we find that morphological alignment can roughly correspond to model performance, and morpheme-level imbalances in the benchmark may have a significant influence on results.
Benchmarking Hate Speech Detection in Azerbaijani with Turkish Cross-Lingual Transfer and Transformer Models
Tural Alizada | Haim Dubossarsky
In this paper, we investigated the task of hate-speech classification in the closely related Turkic language pair, Turkish-Azerbaijani. Transformer models can achieve strong hate-speech classification in Turkish, but their performance does not reliably transfer to closely related low-resource languages without careful evaluation. We study Turkish–Azerbaijani hate speech detection and introduce the first manually annotated Azerbaijani benchmark, comprising 1,112 YouTube comments from major news channels with severe class imbalance. We compare XLM-RoBERTa and a compact BERT-Tiny model against a TF–IDF + logistic regression baseline under monolingual training, zero-shot Turkish→Azerbaijani transfer, low-resource balanced subsampling, bilingual mixed fine-tuning, and translation-based augmentation using machine-translated Turkish data. XLM-R attains high macro-F1 in Turkish and achieves moderate zero-shot transfer to Azerbaijani, but native Azerbaijani training is fragile for the hate class. Mixed bilingual training improves robustness for both languages, whereas TF–IDF generalizes poorly to Azerbaijani.
When Semantic Overlap Is Not Enough: Cross-Lingual Euphemism Transfer Between Turkish and English
Hasan Can Biyik | Libby Barak | Jing Peng | Anna Feldman
Euphemisms substitute for socially sensitive expressions, often softening or reframing meaning, and their reliance on cultural and pragmatic context complicates modeling across languages. In this study, we investigate how cross-lingual equivalence influences transfer in multilingual euphemism detection. We categorize Potentially Euphemistic Terms (PETs) in Turkish and English into Overlapping (OPETs) and Non-Overlapping (NOPETs) subsets based on their functional, pragmatic, and semantic alignment. Our findings reveal a transfer asymmetry: semantic overlap is insufficient to guarantee positive transfer, particularly in the low-resource Turkish-to-English direction, where performance can degrade even for overlapping euphemisms and, in some cases, improve under NOPET-based training. Differences in label distribution help explain these counterintuitive results. Category-level analysis suggests that transfer may be influenced by domain-specific alignment, though evidence is limited by sparsity.
TurkBench: A Benchmark for Evaluating Turkish Large Language Models
Cagri Toraman | Ahmet Kaan Sever | Ayşe Aysu Cengiz | Elif Ecem Arslan | Görkem Sevinç | Sarp Kantar | Mete Mert Birdal | Yusuf Faruk Güldemir | Ali Buğra Kanburoğlu | Sezen Felekoğlu | Birsen Şahin Kütük | Büşra Tufan | Elif Genç | Serkan Coşkun | Gupse Ekin Demir | Muhammed Emin Arayıcı | Olgun Dursun | Onur Gungor | Susan Üsküdarlı | Abdullah Topraksoy | Esra Darıcı
With the recent surge in the development of large language models, the need for comprehensive and language-specific evaluation benchmarks has become critical. While significant progress has been made in evaluating English-language models, benchmarks for other languages, particularly those with unique linguistic characteristics such as Turkish, remain less developed. Our study introduces TurkBench, a comprehensive benchmark designed to assess the capabilities of generative large language models in the Turkish language. TurkBench involves 8,151 data samples across 21 distinct subtasks. These are organized under six main categories of evaluation: Knowledge, Language Understanding, Reasoning, Content Moderation, Turkish Grammar and Vocabulary, and Instruction Following. The diverse range of tasks and the culturally relevant data would provide researchers and developers with a valuable tool for evaluating their models and identifying areas for improvement. We further publish our benchmark for online submissions at https://huggingface.co/turkbench
BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish
Burak Aktaş | Mehmet Can Baytekin | Süha Kağan Köse | Ömer İlbilgi | Elif Özge Yılmaz | Cagri Toraman | Bilge Kaan Görür
Text-to-SQL systems have achieved strong performance on English benchmarks, yet their behavior in morphologically rich, low-resource languages remains largely unexplored. We introduce BIRDTurk, the first Turkish adaptation of the BIRD benchmark, constructed through a controlled translation pipeline that adapts schema identifiers to Turkish while strictly preserving the logical structure and execution semantics of SQL queries and databases. Translation quality is validated on a sample size determined by the Central Limit Theorem to ensure 95% confidence, achieving 98.15% accuracy on human-evaluated samples. Using BIRDTurk, we evaluate inference-based prompting, agentic multi-stage reasoning, and supervised fine-tuning. Our results reveal that Turkish introduces consistent performance degradation, driven by both structural linguistic divergence and underrepresentation in LLM pretraining, while agentic reasoning demonstrates stronger cross-lingual robustness. Supervised fine-tuning remains challenging for standard multilingual baselines but scales effectively with modern instruction-tuned models. BIRDTurk provides a controlled testbed for cross-lingual Text-to-SQL evaluation under realistic database conditions. We release the training and development splits to support future research.
Tokenisation of Turkic Copula Constructions in Universal Dependencies
Cagri Coltekin | Furkan Akkurt | Bermet Chontaeva | Soudabeh Eslami | Sardana Ivanova | Gulnura Dzhumalieva | Aida Kasieva | Nikolett Mus | Jonathan Washington
Identifying units, ‘syntactic words’, for morphosyntactic analysis is important yet challenging for morphologically rich languages. In this paper we propose a set of guiding principles to determine units of morphosyntactic analysis, and apply them to the case of copular constructions in Turkic languages, in the context of the Universal Dependencies (UD) framework. We also provide a survey of the practice in the Turkic UD treebanks published to date, and discuss the advantages and disadvantages of the proposed tokenisation for a selection of Turkic languages.
RAGTurk: Best Practices for Retrieval Augmented Generation in Turkish
Süha Kağan Köse | Mehmet Can Baytekin | Burak Aktaş | Bilge Kaan Görür | Evren Ayberk Munis | Deniz Yılmaz | Muhammed Yusuf Kartal | Cagri Toraman
Retrieval-Augmented Generation (RAG) enhances LLM factuality, yet design guidance remains English-centric, limiting insights for morphologically rich languages like Turkish. We address this by constructing a comprehensive Turkish RAG dataset derived from Turkish Wikipedia and CulturaX, comprising question-answer pairs and relevant passage chunks. We benchmark seven stages of the RAG pipeline, from query transformation and reranking to answer refinement, without task-specific fine-tuning. Our results show that complex methods like HyDE maximize accuracy (85%), considerably higher than the baseline (78.70%). A Pareto-optimal configuration using Cross-encoder Reranking and Context Augmentation achieves comparable performance (84.60%) at much lower cost. We further demonstrate that over-stacking generative modules can degrade performance by distorting morphological cues, whereas simple query clarification with robust reranking offers an effective solution.
OCRTurk: A Comprehensive OCR Benchmark for Turkish
Deniz Yılmaz | Evren Ayberk Munis | Cagri Toraman | Süha Kağan Köse | Burak Aktaş | Mehmet Can Baytekin | Bilge Kaan Görür
Document parsing is now widely used in applications, such as large-scale document digitization, retrieval-augmented generation, and domain-specific pipelines in healthcare and education. Benchmarking these models is crucial for assessing their reliability and practical robustness. Existing benchmarks mostly target high-resource languages and provide limited coverage for low-resource settings, such as Turkish. Moreover, existing studies on Turkish document parsing lack a standardized benchmark that reflects real-world scenarios and document diversity. To address this gap, we introduce OCRTurk, a Turkish document parsing benchmark covering multiple layout elements and document categories at three difficulty levels. OCRTurk consists of 180 Turkish documents drawn from academic articles, theses, slide decks, and non-academic articles. We evaluate seven OCR models on OCRTurk using element-wise metrics. Across difficulty levels, PaddleOCR achieves the strongest overall results, leading most element-wise metrics except figures and attaining the best Normalized Edit Distance scores in easy, medium, and hard subsets. We also observe performance variation by document type: models perform well on non-academic documents, while slideshows become the most challenging.
Building a Turkish Large Language Model via Continual Pre-Training and Parameter-Efficient Adaptation
Alperen Enes Bayar | Mert Ege | Gökhan Yurtalan | Alper Karamanlioglu | Berkan Demirel | Ramazan Gokberk Cinbis
Large Language Models (LLMs) achieve strong performance on many tasks, but they still struggle with morphologically rich, low-resource languages such as Turkish. This difficulty stems from Turkish being an agglutinative language and underrepresented in multilingual training data, which causes current models to often fail at capturing its morphology, flexible word order, and formal registers. In this paper, we introduce MODA (Model Adapted for Domain Applications), a Turkish-specialized LLM built via a modular pipeline that combines continual pre-training, parameter-efficient fine-tuning, and model merging. Starting from Qwen2.5-7B as the base model, we first perform large-scale continual pre-training on a Turkish web corpus to improve grammatical and morphological representations. We then apply parameter-efficient supervised fine-tuning on task-oriented instruction data, and finally merge specialized variants into a single unified model. We evaluate MODA on TurkishMMLU, the Turkish subset of EXAMS, and TRCLAIM-19, where it consistently outperforms both the base and instruction-tuned Qwen2.5-7B models. Our results support a training strategy that explicitly separates linguistic acquisition from task alignment when adapting LLMs to morphologically rich, underrepresented languages under realistic hardware constraints.
From Lemmas to Dependencies: What Signals Drive Light Verbs Classification?
Sercan Karakaş | Yusuf Şimşek
Light verb constructions (LVCs) are a challenging class of verbal multiword expressions, especially in Turkish, where rich morphology and productive complex predicates create minimal contrasts between idiomatic predicate meanings and literal verb–argument uses. This paper asks what signals drive LVC classification by systematically restricting model inputs. Using UD-derived supervision, we compare lemma-driven baselines (lemma TF–IDF + Logistic Regression; BERTurk trained on lemma sequences), a grammar-only Logistic Regression over UD morphosyntax (UPOS/DEPREL/MORPH), and a full-input BERTurk baseline. We evaluate on a controlled diagnostic set with Random negatives, lexical controls (NLVC), and LVC positives, reporting split-wise performance to expose decision-boundary behavior. Results show that coarse morphosyntax alone is insufficient for robust LVC detection under controlled contrasts, while lexical identity supports LVC judgments but is sensitive to calibration and normalization choices. Overall, our findings motivate targeted evaluation for Turkish MWEs and highlight that “lemma-only” is not a single representation but depends critically on how normalization is instantiated.
Beyond the Token: Correcting the Tokenization Bias in XAI via Morphologically-Aligned Projection
Muhammet Anil Yagiz | Fahrettin Horasan
Current interpretability methods for Large Language Models (LLMs) operate on a fundamental yet flawed assumption: that subword tokens represent independent semantic units. We prove that this assumption creates a fidelity bottleneck in Morphologically Rich Languages (MRLs), where semantic meaning is densely encoded in sub-token morphemes. We term this phenomenon the Tokenization-Morphology Misalignment (TMM). To resolve TMM, we introduce MAFEX (Morpheme-Aligned Faithful Explanations), a theoretically grounded framework that redefines feature attribution as a linear projection from the computational (token) basis to the linguistic (morpheme) basis. We evaluate our method on a diverse suite of Turkish LLMs, including BERTurk, BERTurk-Sentiment, Cosmos-BERT, and Kumru-2B. On our embedded benchmark (N=20), MAFEX achieves an average F1@1 of 91.25% compared to 13.75% for standard token-level baselines (IG, SHAP, DeepLIFT), representing an absolute improvement of 77.5 percentage points, establishing it as the new standard for faithful multilingual interpretability.
Overview of the SIGTURK 2026 Shared Task: Terminology-Aware Machine Translation for English–Turkish Scientific Texts
Ali Gebeşçe | Abdulfattah Safa | Ege Uğur Amasya | Gözde Gül Şahin
This paper presents an overview of the SIGTURK 2026 Shared Task on Terminology-Aware Machine Translation for English-Turkish Scientific Texts. We address the critical challenge of terminological accuracy in low-resource settings by constructing the first terminology-rich English-Turkish parallel corpus, comprising 3,300 sentence pairs from STEM domains with 10,157 expert-validated term pairs. The shared task consists of three subtasks: term detection, expert-guided correction, and end-to-end post-editing. We evaluate state-of-the-art baselines (including GPT-5.2 and Claude Sonnet 4.5) alongside participant systems employing diverse strategies from fine-tuning to Retrieval-Augmented Generation (RAG). Our results highlight that while massive generalist models dominate zero-shot detection, smaller, domain-adapted models using Supervised Fine-Tuning and Reinforcement Learning can significantly outperform them in end-to-end post-editing. Furthermore, we find that rigid retrieval pipelines often disrupt fluency, whereas Chain-of-Thought prompting allows models to integrate terminology more naturally. Despite these advances, a significant gap remains between automated systems and human expert performance in strict terminology correction.
Proceedings of the 8th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
Ekaterina Vylomova | Andrei Shcherbakov | Priya Rani
Automatic Grammatical Case Prediction for Template Filling in Case-Marking Languages: Implementation and Evaluation for Finnish
Johannes Laurmaa
Automatically generating grammatically correct sentences in case-marking languages is hard because nominal case inflection depends on context. In template-based generation, placeholders must be inflected to the right case before insertion, otherwise the result is ungrammatical. We formalise this case selection problem for template slots and present a practical, data-driven solution designed for morphologically rich, case-marking languages, and apply it to Finnish. We automatically derive training instances from raw text via morphological analysis, and fine-tune transformer encoders to predict a distribution over 14 grammatical cases, with and without lemma conditioning. The predicted case is then realized by a morphological generator at deployment. On a held-out test set in the lemma-conditioned setting, our model attains 89.1% precision, 81.1% recall, and 84.2% F1, with recall@3 of 93.3% (macro averages). The probability outputs support abstention and top-k suggestion user interfaces, enabling robust, lightweight template filling for production use in multiple domains, such as customer messaging. The pipeline assumes only access to raw text plus a morphological analyzer and generator, and can be applied to other languages with productive case systems.
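The abstention and top-k suggestion behaviour described in the abstract above can be sketched roughly as follows. This is only an illustration, not the paper's implementation: the function name `suggest_cases`, the 0.6 confidence threshold, and the toy probabilities are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple


def suggest_cases(
    probs: Dict[str, float], k: int = 3, threshold: float = 0.6
) -> Tuple[Optional[str], List[str]]:
    """Return (automatic choice or None, top-k suggestions).

    `probs` is a hypothetical predicted distribution over grammatical cases.
    The automatic choice is None (abstention) when the best case is not
    confident enough; the top-k list can still be shown to a user.
    """
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    top_k = [case for case, _ in ranked[:k]]
    best_case, best_prob = ranked[0]
    auto_choice = best_case if best_prob >= threshold else None
    return auto_choice, top_k


# Toy distributions (assumed values, restricted to three cases for brevity).
confident = {"nominative": 0.05, "genitive": 0.85, "partitive": 0.10}
uncertain = {"nominative": 0.40, "genitive": 0.35, "partitive": 0.25}

print(suggest_cases(confident))       # ('genitive', ['genitive', 'partitive', 'nominative'])
print(suggest_cases(uncertain, k=2))  # (None, ['nominative', 'genitive'])
```

In a setup like this, the threshold would trade off how often the system fills a slot automatically against how often it defers to a human.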
The paper presents a prototype of a web-app designed to automatically generate verb valency lexica based on the Universal Dependencies (UD) treebanks. It offers an overview of the structure of the app, its core functionality, and functional extensions designed to handle treebank-specific features. Besides, the paper highlights the limitations of the prototype and the potential of its further development.
Evaluating the Interplay of Information Status and Information Content in a Multilingual Parallel Corpus
Julius Steuer | Toshiki Nakai | Andrew Thomas Dyer | Luigi Talamo | Annemarie Verkerk
The uniform information density (UID) hypothesis postulates that linguistic units are distributed in a text in such a way that the variance around an average information density is minimized. The relationship between information density and information status (IS) is so far underexplored. In this ongoing work, we project IS annotations on the English section of the CIEP+ corpus (Verkerk & Talamo 2024) to parallel sections in other languages. We then use the projected annotations to evaluate the relationship between IS and information content in a typologically diverse sample of languages. Our preliminary findings indicate that there is an effect of information status on information density, with the directionality of the effect depending on language and part of speech.
It is common in cognitive computational linguistics to use language model surprisal as a measure of the information content of units in language production. From here, it is tempting to apply this to information structure and status, considering surprising mentions to be new and unsurprising ones to be given, providing us with a ready-made continuous metric of information givenness/newness. To see if this conflation is appropriate, we perform regression experiments to test whether language model surprisal is actually well predicted by manually annotated information status and, if so, whether this effect is separable from more trivial linguistic information such as parts of speech and word frequency. We find that information status alone is at best a very weak predictor of surprisal, and that surprisal is much better predicted by parts of speech, which are highly correlated with both information status and surprisal, and by word frequency. We conclude that surprisal should not be used as a continuous representation of information status by itself.
Beyond Multilinguality: Typological Limitations in Multilingual Models for Meitei Language
Badal Nyalang
We present MeiteiRoBERTa, the first publicly available monolingual RoBERTa-based language model for Meitei (Manipuri), a low-resource language spoken by over 1.8 million people in Northeast India. Trained from scratch on 76 million words of Meitei text in Bengali script, our model achieves a perplexity of 65.89, representing a 5.2× improvement over multilingual baselines BERT (341.56) and MuRIL (355.65). Through comprehensive evaluation on perplexity, tokenization efficiency, and semantic representation quality, we demonstrate that domain-specific pre-training significantly outperforms general-purpose multilingual models for low-resource languages. Our model exhibits superior semantic understanding with 0.769 similarity separation compared to 0.035 for mBERT and near-zero for MuRIL, despite MuRIL’s better tokenization efficiency (fertility: 3.29 vs. 4.65). We publicly release the model, training code, and datasets to accelerate NLP research for Meitei and other underrepresented Northeast Indian languages.
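For readers unfamiliar with the two quantities in the abstract above, here is a small sketch of how tokenizer fertility (subword tokens per whitespace-separated word, the usual definition and assumed here) and the perplexity ratio can be computed. The `toy_tokenize` function is a stand-in that cuts words into fixed-length pieces; it is not the tokenizer of any model named above, and the sample strings are invented.

```python
from typing import Callable, List


def fertility(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    n_words = sum(len(text.split()) for text in texts)
    n_tokens = sum(len(tokenize(text)) for text in texts)
    return n_tokens / n_words


def toy_tokenize(text: str, piece_len: int = 3) -> List[str]:
    """Hypothetical tokenizer: split every word into chunks of `piece_len` chars."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces


sample = ["abcdef ghi", "jklmnopq"]      # 3 words -> 2 + 1 + 3 = 6 pieces
print(fertility(sample, toy_tokenize))   # 2.0

# Perplexity ratio using the figures quoted in the abstract.
ratio = 341.56 / 65.89
print(round(ratio, 1))                   # 5.2
```

Lower fertility means fewer subword pieces per word; as the abstract notes, that efficiency does not by itself imply better representations.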
Linguistic reference material is a trove of information that can be utilized for the analysis of languages. The material, in the form of grammar books and sketches, has been used for machine translation, but it can also be used for language analysis. Retrieval Augmented Generation (RAG) has been demonstrated to improve large language model (LLM) capabilities by incorporating external reference material into the generation process. In this paper, we investigate the use of grammar books and RAG techniques to identify language features. We use Grambank for feature definition and ground truth values, and we evaluate on five typologically diverse low-resource languages. We demonstrate that this approach can effectively make use of reference material.
The Proceedings of the First Workshop on NLP and LLMs for the Iranian Language Family
Rayyan Merchant | Karine Megerdoomian
Unmasking the Factual-Conceptual Gap in Persian Language Models
Alireza Sakhaeirad | Ali Ma'manpoosh | Arshia Hemmat
While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DIVANBENCH, a diagnostic benchmark focused on superstitions and customs, arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continuous Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model’s ability to discern contradictions; and all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.
Benchmarking Offensive Language Detection in Persian and Pashto
Zahra Bokaei | Bonnie Webber | Walid Magdy
Offensive language detection and target identification are essential for maintaining respectful online environments. While these tasks have been widely studied for English, comparatively less attention has been given to other languages, including Persian and Pashto, and the effectiveness of recent large language models for these languages remains underexplored. To address this gap, we created a comprehensive benchmark of diverse modeling approaches in Persian and Pashto. Our evaluation covers zero-shot, fine-tuned, and cross-lingual transfer settings, analyzing when detection succeeds or fails across different model approaches. This study provides one of the first systematic analyses of offensive language detection and cross-lingual transfer between these languages.
Large language models (LLMs) are increasingly used for communication in many languages; therefore, understanding their limitations with respect to culture-specific pragmatics is important. While LLMs perform well on statistically frequent structures, their shortcomings are most evident in rare pragmatic phenomena. This study investigates whether LLMs can generate a (rare) complex honorific mismatch in Farsi. The pattern arises at two levels: (i) a plural pronoun disagrees with a singular referent for the sake of honorification, and (ii) the related components violate the Polite Plural Generalization due to intimacy implication. This double mismatch pattern is attested in everyday speech, though it is statistically sparse. We tested GPT-4 across multiple scenarios. The results reveal that the model successfully employs the first mismatch to indicate honorification, but fails to adopt the second mismatch that simultaneously conveys intimacy. The model thus deviates from humanlike behavior at the syntax–pragmatics interface. These findings suggest that, while machine models demonstrate partial success in generating honorifics, they rely primarily on statistical patterns and lack the deeper pragmatic understanding necessary for contextual competence.
TajPersLexon: A Tajik–Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP
Mullosharaf Kurbonovich Arabov
This work introduces TajPersLexon, a curated Tajik–Persian parallel lexical resource of 40,112 word and short-phrase pairs for cross-script lexical retrieval, transliteration, and alignment in low-resource settings. We conduct a comprehensive CPU-only benchmark comparing three methodological families: (i) a lightweight hybrid pipeline, (ii) neural sequence-to-sequence models, and (iii) retrieval methods. Our evaluation establishes that the task is essentially solvable, with neural and retrieval baselines achieving 98-99% top-1 accuracy. Crucially, we demonstrate that while large multilingual sentence transformers fail on this exact lexical matching, our interpretable hybrid model offers a favorable accuracy-efficiency trade-off for practical applications, achieving 96.4% accuracy in an OCR post-correction task. All experiments use fixed random seeds for full reproducibility. The dataset, code, and models will be publicly released.
A Computational Approach to Language Contact – A Case Study of Persian
Ali Basirat | Danial Namazifard | Navid Baradaran Hemmati
We investigate structural traces of language contact in the intermediate representations of a monolingual language model. Focusing on Persian (Farsi) as a historically contact-rich language, we probe the representations of a Persian-trained model when exposed to languages with varying degrees and types of contact with Persian. Our methodology quantifies the amount of linguistic information encoded in intermediate representations and assesses how this information is distributed across model components for different morphosyntactic features. The results show that universal syntactic information is largely insensitive to historical contact, whereas morphological features such as CASE and GENDER are strongly shaped by language-specific structure, suggesting that contact effects in monolingual language models are selective and structurally constrained.
Polarization detection in low-resource and mid-resource languages remains a significant challenge for social understanding. This paper presents the first comprehensive benchmark to evaluate transformer-based models for detection of polarized language in Persian (also called Farsi) social media. The aim is to evaluate 1) whether and how fine-tuning the pre-trained models has a substantial impact; 2) how Persian-specific monolingual models compare to multilingual ones for this task; 3) whether and how transfer learning from models trained on other languages, such as culturally distant English and culturally closer Turkish and Arabic, can be of interest for this task; and 4) how competitive Large Language Models (LLMs) are in a zero-shot setting. Our evaluation of ten transformer-based models and two LLMs on a publicly available Farsi polarization dataset shows promising findings, highlighting both the strengths and limitations of each approach.
ParsCORE: The Persian Corpus of Online Registers
Alireza Razzaghi | Erik Henriksson | Veronika Laipalla
Despite recent advances in automatic web register (genre) labeling and its applications to web-scale datasets and LLM development, the effectiveness of these tools for digitally low-resource languages remains unclear. This study introduces ParsCORE, the first large-scale collection of Persian web registers (genres), and evaluates deep learning models for register classification and keyword analysis across major registers. Using 2,000 human-annotated documents, the models achieved a micro F1-score of 0.76. The findings provide a foundation for future research on the linguistic and cultural specificities of Persian registers.
PMWP: A Benchmark for Math Word Problem Solving in Persian
Marzieh Abdolmaleki | Mehrnoush Shamsfard | Veronique Hoste | Els Lefever
Mathematical reasoning captures fundamental aspects of human cognitive ability. Although recent advances in LLMs have led to substantial improvements in automated mathematical problem solving, most existing benchmarks remain focused on English. As a result, robust mathematical reasoning remains a challenging and insufficiently explored capability for underrepresented languages including Persian. To address this gap, we introduce PMWP, the first dataset of 15K elementary-level Persian math word problems that supports both supervised training and evaluation of reasoning models. By expanding mathematical reasoning resources beyond English, PMWP contributes to the development of multilingual AI systems with stronger reasoning capabilities. In this work, we conduct a systematic evaluation of the Persian math word problem solving capabilities of different state-of-the-art LLMs. Our results indicate that DeepSeek-V3 exhibits reduced language bias when problem texts are translated into English, while Gemini-2.5-Flash achieves the highest equation value accuracy (72.02%) in Persian. In addition, we investigate parameter-efficient adaptation for equation generation by applying LoRA-based fine-tuning to LLaMA-3-8B and Qwen-2.5-7B. Our results show that, following fine-tuning, these open-weight models achieve 91.65% and 92.53% exact equation match accuracy, respectively. Overall, our findings provide insights into the comparative strengths and limitations of proprietary and open-weight models for mathematical reasoning in Persian.
APARSIN: A Multi-Variety Sentiment and Translation Benchmark for Iranic Languages
Sadegh Jafari | Tara Azin | Farhad Roodi | Zahra Dehghani Tafti | Mehrdad Ghadrdan | Elham Vatankhahan Esfahani | Aylin Naebzadeh | Mohammadhadi Shahhosseini | Ghafoor Khan | Kazem Forghani | Danial Namazi | Seyed Mohammad Hossein Hashemi | Farhan Farsi | Mohammad Osoolian | Maede Mohammadi | Mohammad Erfan Zare | Muhammad Hasnain Khan | Muhammad Hussain | Nooreen Zaki | Joma Mohammadi | Shayan Bali | Mohammad Javad Ranjbar | Els Lefever | Veronique Hoste
The Iranic language family includes many underrepresented languages and dialects that remain largely unexplored in modern NLP research. We introduce APARSIN, a multi-variety benchmark covering 14 Iranic languages, dialects, and accents, designed for sentiment analysis and machine translation. The dataset includes both high and low-resource varieties, several of which are endangered, capturing linguistic variation across them. We evaluate a set of instruction-tuned Large Language Models (LLMs) on these tasks and analyze their performance across the varieties. Our results highlight substantial performance gaps between standard Persian and other Iranic languages and dialects, demonstrating the need for more inclusive multilingual and dialectally diverse NLP benchmarks.
One Language, Three of Its Voices: Evaluating Multilingual LLMs Across Persian, Dari, and Tajiki on Translation and Understanding Tasks
Noor Mairukh Khan Arnob | Abu Bakar Siddique Mahi
The Iranian linguistic family is pluricentric, encompassing Iranian Persian, Dari (Afghanistan), and Tajiki (Tajikistan). While Multilingual Large Language Models (MLLMs) claim broad coverage, their robustness across these regional variants and script differences (Perso-Arabic vs. Cyrillic) remains under-explored, particularly in the open-weight landscape. We evaluate five open-weight models from the Qwen, Bloomz, and Gemma families across four downstream tasks: Sentiment Analysis, Machine Translation (MT), NLI, and QA. Utilizing a dataset of over 240,000 processed samples, we observe severe performance disparities. While the fine-tuned gemma-3-4b-persian achieves promising results on Iranian Persian (77.3% accuracy in Sentiment), almost all tested models appear to suffer catastrophic degradation on Tajiki script (dropping to 1.0 BLEU). These findings highlight a critical “script barrier” in current open-weight MLLM development for Central Asian languages. Code and data available here.
PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration
Mohammad Javad Ranjbar Kalahroodi | Heshaam Faili | Azadeh Shakery
Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for real-time applications. We make our dataset and model publicly available to facilitate future research in Persian NLP and provide a scalable framework applicable to other morphologically rich, low-resource languages.
Shughni Machine Translation Enhanced by Donor Languages
Dmitry Novokshanov | Innokentiy S. Humonen | Ilya Makarov
This paper presents the first machine translation system for Shughni, an extremely low-resource Eastern Iranian language spoken in Tajikistan and Afghanistan. We fine-tune NLLB-200 models and explore auxiliary language selection through typological similarity and "super-donor" experiments. Our final Shughni–Russian model achieves a chrF++ score of 36.3 (45.7 on BivalTyp data), establishing the first computational translation resource for this language. Beyond reporting system performance, this work demonstrates a practical path toward supporting languages with virtually no prior MT resources. Our demo system with Shughni-Russian-English translation (Russian serves as a pivot language for the Shughni-English pair) is available on Hugging Face (https://huggingface.co/spaces/Novokshanov/Shughni-Translator).
Segmentation Strategy Matters: Benchmarking Whisper on Persian YouTube Content
Reihaneh Iranmanesh | Rojin Ziaei | Joe Garman
Automatic Speech Recognition (ASR) transcription accuracy remains highly sensitive to audio segmentation strategies, yet most benchmarks assume oracle timestamps unavailable in deployment. We systematically evaluate how audio segmentation affects Whisper’s performance on 10 hours of Persian YouTube content, comparing transcript-aligned (oracle) versus silence-based (realistic) approaches across contrasting acoustic conditions. Results reveal striking content-type dependency: podcast content benefits from timestamp segmentation (33% lower mean WER), while entertainment content favors silence-based segmentation (8% lower mean WER). This finding demonstrates that optimal segmentation must be content-aware, with silence detection better capturing natural boundaries in acoustically heterogeneous media while avoiding mid-utterance splits. We publicly release our evaluation framework, 10 hours of audio with gold transcripts, and segmentation results here: https://github.com/ri164-bolleit/persian-youtube-whisper-benchmark
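Editorial sketch (not the authors' evaluation framework): the abstract compares segmentation strategies by word error rate (WER). Assuming plain whitespace tokenization and no Persian-specific text normalization, both of which are simplifications introduced here, WER can be computed as word-level edit distance divided by the reference length:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

Mean WER per segmentation strategy would then be the average of this value over all segments produced by that strategy.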
Multi-modal Neural Machine Translation for Low-Resource Classical Persian Poetry: A Culture-Aware Evaluation
Soheila Ansari | Mounir Boukadoum | Fatiha Sadat
Persian poetry, particularly Rumi’s Masnaviye-Ma’navi, is known for its complex form, mystical narrative style, rich cultural information, and linguistic nuances, and is considered a low-resource domain. Translating Persian poetry is a challenging task for neural machine translation (NMT) systems. To address this challenge, we present a novel multimodal NMT system for Rumi’s Masnavi in four stages. First, we built a new multi-modal parallel Persian-English corpus of 26,571 aligned verses from all six books of Masnavi, all paired with aligned audio recitations. Second, a strong text-only baseline is developed by applying domain-adaptive fine-tuning to mBART-50, pre-trained on a large monolingual Persian poetry corpus, followed by training on the parallel Masnavi corpus (train set). Third, we extend this model to a multi-modal scenario by adding aligned audio representations using a cross-attention fusion mechanism. Fourth, we conduct a culture-aware evaluation. We propose a culture-specific item (CSI) evaluation approach by developing a CSI classification system and a Persian-English CSI dictionary alongside the standard MT metrics. Our findings demonstrate that integrating audio recitations increased the BLEU score from 9.85 to 17.95, and raised CSI-recall from 61.60% to 82.04%, suggesting greater consistency in producing culturally meaningful terms.
Proceedings of the 11th Social Media Mining for Health Research and Applications (SMM4H-HeaRD 2026) Workshop and Shared Tasks
Guillermo Lopez-Garcia | Graciela Gonzalez-Hernandez
A3S@C-DAC at #SMM4H-HeaRD 2026: Reasoning Meets Evidence: LLMs for Interpretable Insomnia Detection with Evidence Extraction in Clinical Notes
Abhishek Maity | Amol Shinde | Abhishek Suresh Kushare | Swapnil Pawar
Detecting insomnia from clinical narratives requires both accurate classification and clinically grounded reasoning with interpretable evidence. We present our systems for the SMM4H-HeaRD 2026 shared task, which leverages MIMIC-III notes annotated with rule-based insomnia criteria and supporting evidence spans. We explore two complementary approaches: parameter-efficient fine-tuning of lightweight models using QLoRA and LoRA, and few-shot prompting of large language models for joint reasoning and evidence extraction. Our best system achieves an F1-score of 0.7333 on binary classification and a micro-F1 of 0.6535 on multi-label rule prediction, with up to 0.5192 partial-match F1 for evidence extraction. Results show that lightweight fine-tuned models can outperform larger models in classification, while larger models demonstrate stronger reasoning but struggle with precise span localization, highlighting a key gap in clinically interpretable NLP systems.
Gladiators at #SMM4H–HeaRD 2026: Multi-Seed XLM-RoBERTa Ensemble with Focal Loss and Per-Language Threshold Optimization for Multilingual Adverse Drug Event Detection
Ankit Kumar Singh
This paper describes the Gladiators system for Task 1 of the SMM4H 2026 shared task on binary classification of adverse drug event (ADE) mentions in multilingual social media posts. Our system fine-tunes three XLM-RoBERTa large models with different random seeds using focal loss (α=0.75, γ=2.0) and 3× positive oversampling, then averages their predicted probabilities and applies per-language threshold optimization. On the development set, our ensemble achieves a pooled binary F1 of 0.7505. On the official test set, which introduced surprise Farsi comprising 35.5% of samples, our system achieves F1 = 0.6039, above the competition mean (0.5465) and median (0.5798). We evaluated eleven approaches and document key negative results. Post-evaluation, a six-model cross-regime ensemble improved dev F1 to 0.7585.
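Editorial sketch (not the Gladiators code): the abstract combines focal loss with probability averaging and per-language thresholds. The snippet below assumes the standard binary focal-loss form, FL = -α_t (1 - p_t)^γ log(p_t), with α weighting the positive class; the function names, language codes, and threshold values are hypothetical, and oversampling and threshold search are omitted.

```python
import math


def focal_loss(p: float, y: int, alpha: float = 0.75, gamma: float = 2.0) -> float:
    """Binary focal loss for one example; p is the predicted P(y = 1)."""
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def ensemble_predict(probs_per_model, languages, thresholds, default=0.5):
    """Average per-model probabilities, then apply a per-language threshold."""
    n_models = len(probs_per_model)
    preds = []
    for i, lang in enumerate(languages):
        mean_p = sum(model[i] for model in probs_per_model) / n_models
        preds.append(int(mean_p >= thresholds.get(lang, default)))
    return preds
```

With γ > 0, confidently correct examples contribute far less loss than uncertain ones, which is why the loss is often paired with imbalanced data such as rare ADE mentions.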
LSI_UNED at #SMM4H–HeaRD 2026: Grid-Based Biomedical Named Entity Recognition Across Languages and Entity Types
Alicia Ramirez-Arrabe | Juan Martinez-Romo | Andres Duque
This paper describes the participation of the LSI_UNED team in the first sub-task of MultiClinAI at the #SMM4H-HeaRD 2026 Workshop, which focuses on multilingual clinical named entity recognition in seven languages. The task requires identifying mentions of diseases, procedures, and symptoms in clinical case reports. We propose a set of systems based on the W2NER architecture, with a separate model trained for each language and entity type. For Spanish, we use a RoBERTa-based model with data augmentation from additional NER resources, while English and Italian systems are based on different biomedical BERT variants. Results show consistent performance across languages, with the best overall results obtained for Spanish. Data augmentation improves recall and F1, while English and Italian models achieve competitive but slightly lower scores. Symptom recognition remains the most challenging entity type across all languages.
SINAI at #SMM4H–HeaRD 2026: Multilingual Clinical NER with MrBERT-biomed and Optuna Hyperparameter Optimization
Lucas Molino Piñar | Manuel Carlos Diaz-Galiano | María-Teresa Martín-Valdivia
This paper describes the system submitted by our team to the MultiClinAI shared task at the 11th SMM4H-HeaRD Workshop (ACL 2026). The task addresses multilingual clinical Named Entity Recognition (NER) for three entity types (Disease, Procedure, and Symptom) in Spanish clinical texts. Our approach fine-tunes MrBERT-biomed, a domain-adapted ModernBERT model pre-trained on biomedical corpora, using multilingual clinical data from seven European languages. We train independent entity-specific models, each optimized via Bayesian hyperparameter search with Optuna, and apply a deterministic post-processing step that aligns predicted spans to word boundaries. On the official test set, our system achieves overall strict micro-F1 scores of 0.7453, 0.7107, and 0.6603 for Disease, Procedure, and Symptom, respectively.
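Editorial sketch (not the SINAI implementation): one way a deterministic post-processing step could align predicted character spans to word boundaries. The abstract does not say whether spans are expanded or trimmed; this version assumes expansion to the enclosing alphanumeric run, and the function name is hypothetical.

```python
def snap_to_word_boundaries(text: str, start: int, end: int) -> tuple[int, int]:
    """Expand a [start, end) character span so it does not cut a word in half."""
    # Move the start left while the preceding character is part of a word.
    while start > 0 and text[start - 1].isalnum():
        start -= 1
    # Move the end right while the next character is part of a word.
    while end < len(text) and text[end].isalnum():
        end += 1
    return start, end
```

For example, a prediction covering only the middle of "abdominal" in "dolor abdominal agudo" would be widened to the full word, while a span that already starts and ends on word edges is returned unchanged.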
Prestige at #SMM4H-HeaRD 2026: Binary Insomnia Classification from Clinical Notes Using LLMs with Chain-of-Thought Reasoning
Oyindolapo O. Komolafe
This paper describes our system for Subtask 1 of the SMM4H HeaRD 2026 Task 2, which is an LLM-based system for binary insomnia classification from MIMIC-III clinical notes using OpenAI GPT-5.2 with chain-of-thought (CoT) prompting. Our approach implements three strategies: baseline fixed 8-shot prompting, dynamic retrieval using semantic embeddings, and self-consistency voting. The system applies rule-based criteria combining symptom patterns (difficulty sleeping and daytime impairment) with medication indicators (primary and secondary insomnia medications). Our best configuration (Self-Consistency Voting) achieved 95.67% weighted F1 on validation and 82.35% F1 on the official test set, outperforming the Baseline (81.25% F1). Notably, our test F1-score of 82.35% substantially exceeded the task mean (68.05%) and median (70.37%) across all participating teams. Key contributions include explicit comorbidity exclusion prompting, context-aware nursing note handling, logical constraint enforcement for prediction consistency, and a comparative analysis demonstrating that self-consistency improves recall at moderate computational cost.
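Editorial sketch (not the Prestige system): self-consistency voting is commonly implemented as a majority vote over several independently sampled chain-of-thought answers. The label strings and the tie-break default below are assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter


def self_consistency_vote(labels: list[str], tie_break: str = "no") -> str:
    """Return the most frequent label among sampled answers."""
    if not labels:
        raise ValueError("need at least one sampled answer")
    ranked = Counter(labels).most_common()
    # If the two most frequent labels are tied, fall back to a fixed default.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_break
    return ranked[0][0]
```

Sampling an odd number of answers for a binary label avoids ties entirely, at the cost of one extra model call per note.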
Team Gazoo! at #SMM4H-HeaRD 2026: Zero-Training NER via Iterative LLM Prompt Self-Optimization for Opioid Impact Span Detection
Diego Estuar
This paper describes the system submitted by Team Gazoo! for Task 7 of the #SMM4H-HeaRD 2026 shared task on detecting self-reported clinical and social impacts of nonmedical opioid use in social media text. We present a zero-training, prompt-only approach that uses a large language model (GPT-5.4) with structured few-shot prompting and autonomous, iterative rule optimization. Our system encodes a domain-specific entity ontology, three core decision rules, and 65 cognitively organized few-shot examples into a single prompt, with BIO constraint enforcement applied as post-processing. Crucially, the prompt itself is refined by the LLM: at each iteration the model analyzes its own errors and proposes targeted edits to its rules and examples. Through 18 such self-refinement cycles, our system achieved an F1-Strict of 0.53 and F1-Relaxed of 0.60 on the test set, ranking first among all participating teams under both evaluation criteria.
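Editorial sketch (not Team Gazoo!'s code): BIO constraint enforcement as post-processing usually means repairing tag sequences that are not valid BIO. The version below assumes the common repair rule that an `I-` tag without a preceding tag of the same entity type becomes a `B-` tag; the function name is hypothetical, and the entity type names are taken from the Task 7 descriptions elsewhere in this volume.

```python
def enforce_bio(tags: list[str]) -> list[str]:
    """Rewrite orphan I- tags as B- tags so the sequence is valid BIO."""
    fixed = []
    prev_type = None  # entity type of the previous token, None if outside
    for tag in tags:
        if tag.startswith("I-"):
            etype = tag[2:]
            if prev_type != etype:
                tag = "B-" + etype  # an I- tag cannot open or switch a span
            prev_type = etype
        elif tag.startswith("B-"):
            prev_type = tag[2:]
        else:
            prev_type = None
        fixed.append(tag)
    return fixed
```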
DNT at #SMM4H–HeaRD 2026: Leveraging BERT-based Encoders and LLMs for Medical Information Extraction
Doan Nhat Tien | Thìn Đặng Văn
This paper presents our systems for two tasks at #SMM4H-HeaRD 2026. For Task 1 (multilingual Adverse Drug Event detection), we fine-tune BERT-based multilingual models (InfoXLM and XLM-RoBERTa) and Qwen3.5-9B with ensemble methods, achieving 0.8584 macro F1 on the development set and 0.5304 F1 on unseen Farsi. For Task 7 (span detection of ClinicalImpacts and SocialImpacts in opioid narratives), DeBERTa-Large with simplified labeling achieves the best test performance (0.583 relaxed F1, 0.500 strict F1). Our analysis shows that LLMs excel on known languages in Task 1, while transformer-based models with simplified labeling generalize better for NER tasks.
BIT.UA at #SMM4H–HeaRD 2026: Towards Multi-Class Multilingual Clinical Entity Recognition with Multi-Head CRF Ensembles
Richard A. A. Jonker | Sérgio Matos
This paper describes the BIT.UA system for the MultiClinNER shared task at #SMM4H–HeaRD 2026, targeting multilingual clinical named entity recognition across seven languages for three entity types (Disease, Procedure, Symptom). We extend the Multi-Head CRF architecture, originally developed for multi-class NER on Spanish clinical text, to the multilingual setting. To enable joint multi-entity training despite per-entity text variations in the dataset, we develop an adaptive text consolidation pipeline that preserves over 94% of annotations. Our central finding is that a single xlm-roberta-large model, trained jointly on all seven languages and three entity types, achieves competition rank 2 for five of seven languages, outperforming dedicated monolingual models by up to +6.94 F1 points, while requiring only a single set of weights. Ensembling multiple seeds of this model achieves rank 1 for those five languages, and combining it with monolingual models yields rank 1 for the remaining two. Code and models are publicly available at https://github.com/ieeta-pt/Multi-Head-CRF/tree/MultiClinNER and https://huggingface.co/collections/IEETA/multiclinner-models.
Bhramastra at #SMM4H-HeaRD 2026: A Multi-Stage Hunter-Judge Pipeline using DSPy-Optimized LLMs for Multilingual ADE Detection
Bhaarat Pachori
This paper describes the submission by Team Bhramastra for the #SMM4H-HeaRD 2026 Shared Task 1, focused on personal Adverse Drug Event (ADE) detection in multilingual social media. A decoupled architecture, Hunter-Judge, is proposed to handle extreme class imbalance and linguistic variance across seven languages, including a surprise zero-shot Farsi set. The system employs a fine-tuned multilingual mDeBERTa-v3 model as a high-recall filter (Hunter), followed by a Gemini-2.5-Flash model (Judge) optimized via the DSPy framework for precision-oriented agentic adjudication. By implementing a reasoning protocol grounded in clinical RAG evidence and universal ingredient mapping, the pipeline achieved the highest average F1-score (0.6653) among all teams. Strong zero-shot generalizability on Farsi (F1: 0.5863) was demonstrated, highlighting the effectiveness of medically-grounded adjudication in low-resource contexts.
LLATMU at #SMM4H-HeaRD 2026: Clinical Text Structuring with QLoRA-based Generation and Partial-Label TNM Classification
Eric Hsiao | Min-Hsuan Ku | Hsuan-Lei Shao
We describe the LLATMU systems submitted to the #SMM4H-HeaRD 2026 shared tasks, covering two related clinical text structuring problems: dialogue-to-SOAP note generation (Task 4) and TNM staging classification from pathology reports (Task 6). Although the two tasks differ in modeling paradigm (text generation versus supervised classification), both require transforming unstructured clinical narratives into structured representations. For Task 4, we instruction-tuned LLMs with parameter-efficient adaptation and submitted a QLoRA-based Ministral-3B system, achieving an official blind test average score of 0.53 and outperforming the task-wide mean and median. For Task 6, we formulate TNM prediction as a three-head classification problem using BioClinical-ModernBERT-large with long-context encoding, class-weighted loss, and normalized partial-label training. The model achieves a validation average macro-F1 of 0.9196 and continues to outperform the official baseline on the more challenging tie-break test set. Across both tasks, our results suggest that robust data handling, stable fine-tuning, and task-appropriate supervision are important for practical clinical NLP under constrained and imperfect shared-task settings.
Patient2Paper at #SMM4H-HeaRD 2026: Retrieval-Augmented Few-Shot Generation for Clinical Note Synthesis
Ioan-Tudor-Alexandru Anghel | Timotei Andrei | Comârdici Marian Bogdan | Carina Sâicu
We present a retrieval-augmented few-shot system for the MedSynth Dial2Note shared task at SMM4H-HEARD 2026, placing 3rd on the official leaderboard (0.51 avg). Across 28 configurations, we find that retrieval design (hybrid BM25 + medical-domain dense fused via RRF) and prompt presentation format (few-shot examples as conversation turns) are the primary quality drivers, while model scale has surprisingly limited impact: Llama 3.2:3B, Llama 3.1:8B and GPT-4o mini remain within a narrow band on our locally computed scores. Our final submission used GPT-4o mini with k=3 few-shot examples retrieved by RRF over BioLORD-2023 embeddings. We report a full ablation, including negative results, to show where the gains come from and where further engineering stops paying off.
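Editorial sketch (not the Patient2Paper code): reciprocal rank fusion (RRF) merges several ranked lists by summing 1 / (k + rank) for each item. The smoothing constant k = 60 is the conventional default and is an assumption here; it is unrelated to the k = 3 few-shot examples mentioned in the abstract, which corresponds to `top_n` below. The document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k: int = 60, top_n: int = 3):
    """Fuse ranked lists of document IDs and return the top_n fused IDs."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]
```

Because RRF uses only ranks, it can combine a lexical ranker such as BM25 with a dense retriever without calibrating their raw scores against each other.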
In2Lab-TNT at #SMM4H-HeaRD 2026: An Application of QTT’s Terminological Entanglement to Leverage Insomnia Detection in Clinical Notes
Antonio Jesus Tamayo Herrera | Giovanny Díaz-Laínes | Carlos Mario Perez Perez | Diego A Burgos
We present a lightweight, deterministic post-processing approach for clinical text classification based on entanglement between clinically meaningful concepts. Our system was developed for the SMM4H 2026 shared task on insomnia detection and related information extraction from clinical notes. For Subtask 1, we introduce an entanglement-based rescue layer that models dependencies between sleep disturbance, daytime impairment, and sleep-targeted medication evidence. Applied as a false-negative correction on top of an LLM baseline, this approach improves recall while preserving precision. On the official test set, the rescue layer increases F1 by 25% without degrading precision (1.00). Local experiments show larger gains on weaker runs, suggesting a stabilizing effect on variable LLM outputs. For Subtask 2, we implement an LLM-based system for rule-based evidence and span extraction. Results highlight the effectiveness of modeling clinically grounded dependencies and suggest directions for improving evidence extraction and span matching.
blue at SMM4H-HeaRD 2026: Class-Weighted Transformer Ensembles with Structured Decoding and Chain-of-Thought Blending across Six Health NLP Shared Tasks
Krish Sharma | Rhea Singhal | Jatin Bedi
We describe team blue’s participation across six SMM4H-HeaRD 2026 shared tasks spanning multilingual adverse drug event detection (Task 1), influenza vaccine effectiveness estimation (Task 3), patient metadata classification (Task 5), TNM cancer staging (Task 6), opioid impact span detection (Task 7), and multilingual clinical NER with cross-lingual annotation projection (Task 8). Although these tasks are heterogeneous (binary, multi-class, multi-label, and sequence labelling), our systems share three recurring design principles: (i) inverse-frequency class weighting to handle severe imbalance, (ii) multi-seed and/or multi-backbone ensembling to reduce variance, and (iii) post-hoc calibration of decision boundaries. Key results include micro-F1 of 0.990 on TNM staging (Task 6), 0.872/0.918 on flu vaccination/test classification, surpassing the 70B CoT baseline on vaccination (Task 3), F1 of 0.764 on patient metadata, approaching the fine-tuning benchmark of 0.776 (Task 5), and competitive performance on ADE detection (Task 1, F1 = 0.580), opioid spans (Task 7, relaxed F1 = 0.59), and multilingual clinical NER (Task 8, strict F1 0.20–0.41 across 7 languages).
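Editorial sketch (not team blue's code): inverse-frequency class weighting gives rare classes larger loss weights. The exact normalization used by the authors is not stated; this version assumes the widely used "balanced" form, weight_c = N / (C × n_c), where N is the number of examples, C the number of classes, and n_c the count of class c.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class to N / (C * n_c), so rarer classes weigh more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}
```

With this normalization a perfectly balanced dataset yields a weight of 1.0 for every class, so the weights only change the loss when classes are imbalanced.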
DT4H.nl at #SMM4H-HeaRD 2026: Multilingual Clinical NER with multilingual and monolingual models
Bram van Es
We describe the setup we used to complete the MultiClinAI-NER task in the SMM4H-HeaRD workshop 2026. In this work we employed a dedicated multilingual encoder model (EuroBERT-610m), two Dutch encoder models trained from scratch on clinical corpora (MedRoBERTa.nl and CardioDeBERTa.nl) and a generic Dutch encoder model (RobBERT2023-large), all finetuned with a 3-layer DNN head. We find that the use of multilingual datasets is potentially beneficial in augmenting the training corpora of monolingual models.
This paper describes the participation of team SMMTech in the SMM4H-HeaRD 2026 Shared Task 2: Detection of Insomnia in Clinical Notes. We present a comparative architectural study exploring the friction between extractive token-classification models and generative Large Language Models (LLMs) in clinical span extraction, on the MIMIC-III Clinical Database. During the validation phase we established baselines using encoder-only transformers such as BERT, ClinicalBERT, BigBird and Clinical BigBird. For the official test phase, we deployed a 4-bit quantized generative hybrid pipeline using Llama3-Med42-8B to evaluate its multi-hop reasoning capabilities. While the generative pipeline achieved an F1-score of 0.4783 on Subtask 1 (Classification), it struggled with exact span matching on Subtask 2. In this paper we present the mechanical limitations of zero-shot JSON extraction and the necessity of decoupling clinical reasoning from character-level span extraction.
Beyond Lexical Similarity: Evaluating Faithfulness in LLM-Based Medical Question Reformulation
Md Rabiul Hasan | Aleka Melese Ayalew | Mourad Oussalah
Medical query rewriting transforms verbose consumer health questions into concise clinical queries, a critical step in health information retrieval. Large language models (LLMs) perform well on this task by standard metrics, yet high ROUGE or BERTScore does not guarantee preservation of clinical content. To address this issue, we introduce MedFaith-F1, a category-level faithfulness metric over four clinically salient categories: diagnoses, medications, procedures, and follow-up intent. We further propose Evidence and Knowledge-Grounded Retrieval-Augmented Generation (EKG-RAG), a framework combining hybrid retrieval over PubMed and MedlinePlus resources with UMLS (Unified Medical Language System)-aligned ontology grounding. Evaluating LLaMA-3 and Qwen2.5 across zero-shot, few-shot, and QLoRA settings on the MeQSum and medical question-pair (MQP) datasets reveals that base models exhibit category-level hallucination rates exceeding 40%, invisible to standard metrics, while EKG-RAG with QLoRA reduces this rate to 26.75%, achieving MedFaith-F1 of 0.73. Our findings call for faithfulness-aware evaluation in clinical query rewriting, and MedFaith-F1 provides a reproducible step in that direction.
NU_DeepHealthNLP at #SMM4H-HeaRD 2026: Entity-Conditioned Generation and a Four-Stage Pipeline for Automated SOAP Note Generation
Thanya Mysore Santhosh | Deahan Yu
We describe two system submissions to Task 4 of the SMM4H-HeaRD 2026 Shared Task on automated SOAP note generation from doctor–patient dialogues. Our first submission is a standalone entity-conditioned generation model: Mistral-7B-Instruct-v0.1 fine-tuned with QLoRA on 8,529 MedSynth training dialogues, where both training and inference prompts include clinical entities extracted and grouped by SOAP section. Our second submission is a four-stage modular pipeline that additionally incorporates a hybrid retrieval stage and a rule-based verification stage. The key finding of this work is that incorporating structured clinical domain knowledge, in the form of NER entities grouped by SOAP section, directly into the generation prompt produces consistent and reliable improvements over dialogue-only generation. Our four-stage pipeline submission achieved an average score of 0.54 on the official test set, ranking first on the shared task leaderboard.
GoBlueInformatics at #SMM4H-HeaRD 2026: Long-Context Encoders and Generative Biomedical LLMs for Pathological TNM Stage Prediction
Shangqing Wei
We describe our systems for #SMM4H-HeaRD 2026 Task 6, which requires predicting the T, N, and M components of pathological TNM stage from TCGA pathology reports. We explored both discriminative long-context encoders and generative biomedical LLMs. For the first test set, our BioClinical-ModernBERT-large ensemble achieved 0.993 micro-F1 and 0.915 macro-F1, improving over the BB-TEN baseline scoring-log result of 0.947 micro-F1 and 0.780 macro-F1. For the harder second test set, our OpenBioLLM-8B LoRA extractor improved component macro-F1 over the organizer baseline from 0.454 to 0.626 for T, from 0.591 to 0.758 for N, and from 0.554 to 1.000 for M. These results suggest that long-context encoders are strong for explicit T and N evidence, while constrained generative LLM extraction can be effective for harder reports. The main remaining weakness is rare-class T4 recognition.
Enigma at #SMM4H–HeaRD 2026: Leveraging Multilingual Pre-trained Models for Clinical Named Entity Recognition
Sylvia Vassileva | Plamena Ilieva | Teodor Svetoslavov Kostadinov | Monika Peteva Petkova | Daniel Manchevski | Vitosh Doynov | Ivan Koychev | Svetla Boytcheva
This paper addresses the MultiClinAI challenge, subtask MultiClinNER, which focuses on clinical Named Entity Recognition (NER) across seven languages: Czech, Dutch, English, Italian, Romanian, Spanish, and Swedish. The main goal of MultiClinNER is to identify and extract clinical terms specifically related to diseases, procedures, and symptoms from discharge summaries. The paper explores a variety of state-of-the-art methods, both monolingual and multilingual, ranging from pretrained, zero-shot, domain-adapted transformers to fine-tuned transformer models, and demonstrates the benefits of ensemble modeling. Data augmentation through external resources significantly enhanced the models’ ability to recognize clinical entities. Both monolingual and multilingual approaches showed complementary strengths depending on the language and entity type. The average F1 score achieved across the best models for each language and category is 0.6502.
RACAI at #SMM4H-HeaRD: Named Entity Recognition for Detecting the Impacts of Drug Abuse in Social Media Posts: Zero-Shot and Fine-Tuning Approaches
Tiberiu Boros | Radu-Gabriel Chivereanu
In this work, we address the detection of drug abuse repercussions in Reddit posts, as part of SMM4H-HeaRD Task 7: Extraction of Social and Clinical Impacts of Substance Use from Social Media Posts. We evaluate multiple approaches, including fine-tuning and zero-shot inference, across several deep learning architectures. Our best result is obtained using an adapter-based fine-tuning approach on the DeBERTaV3 model. In addition, we explore text-based evolutionary optimization for Gemma 4 workflows and show that, on this task, they achieve competitive performance with the supervised DeBERTaV3 setup.
ICB-UMA at #SMM4H–HeaRD 2026: Hybrid Clinical Entity Projection for MultiClinAI: Adaptive Candidate Windows, XGBoost, and LLM Refinement
Alvaro Rey-Blanes | Sara Giménez-Gómez | Francisco J. Veredas | Francisco J. Moreno-Barea
This paper presents our submission to the MultiClinAI Shared Task (Gallego-Donoso et al., 2026) on cross-lingual clinical entity annotation projection from Spanish to English. Our system transfers expert annotations for Diseases, Symptoms and Procedures entities. The approach integrates three core components: adaptive candidate window generation, an XGBoost classifier leveraging surface and semantic features, and an LLM-based post-processing stage to resolve complex misalignments. Our highest-performing run ranked 3rd on the official leaderboard, achieving strict F1 scores of 0.737, 0.549, and 0.538 for Diseases, Symptoms and Procedures, respectively. These results show that combining supervised candidate scoring with targeted LLM refinement provides a robust strategy for clinical entity projection.
URJC-Team at #SMM4H-HeaRD 2026: TNM Stage Extraction with a Regex-LLM Workflow
Natalia Madrueño | Jose Walter Hernández Pérez | Rubén R. Fernández | Soto Montalvo
TNM cancer staging is a critical process for characterizing tumor burden and guiding clinical decisions. Nevertheless, its automated extraction remains challenging due to the unstructured and heterogeneous nature of free-text pathology reports. This paper describes the participation of the URJC-Team in Task 6 of the Social Media Mining for Health/Health Real-World Data (#SMM4H-HeaRD) 2026 Shared Tasks. It focuses on predicting TNM staging from pathology reports. The proposed workflow combines hand-crafted regular expressions with a Large Language Model (LLM). First, explicit TNM mentions are extracted using rule-based patterns. Then, any stage not recovered by these rules is inferred by an LLM. Overall, the proposal provides competitive results across all official shared-task phases.
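Editorial sketch (not the URJC-Team workflow): the two-step idea, rules first and a model only for what the rules miss, can be outlined as follows. The regular expressions are simplified, hypothetical patterns rather than the authors' hand-crafted rules, and `llm_fallback` is a placeholder callable standing in for whatever LLM call is used.

```python
import re

# Hypothetical, simplified patterns; each captures the digit of one component.
PATTERNS = {
    "T": re.compile(r"(?<![A-Za-z])[pc]?T([0-4])"),
    "N": re.compile(r"(?<![A-Za-z])N([0-3])"),
    "M": re.compile(r"(?<![A-Za-z])M([01])"),
}


def extract_tnm(report: str, llm_fallback) -> dict[str, str]:
    """Use regex matches where available; defer the rest to llm_fallback."""
    stage = {}
    for component, pattern in PATTERNS.items():
        match = pattern.search(report)
        if match:
            stage[component] = f"{component}{match.group(1)}"
        else:
            # Only components the rules could not recover reach the model.
            stage[component] = llm_fallback(report, component)
    return stage
```

In this outline, a report containing "pT2N0MX" would have T and N resolved by the rules, and only M would be passed to the fallback.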
LotusOrchid at #SMM4H–HeaRD 2026: Fitting pretrained encoders for Dutch medical data
Sophie Arnoult | Shutao Chen | Piek Vossen
This paper presents our submission to MultiClinAI’s NER subtask for #SMM4H-HeaRD 2026. We focus on the questions 1) which Language Model represents the clinical notes best and 2) which annotations can help training these models. To get answers for these questions, we follow a token-based classification approach with pretrained encoder language models, where we compare models that were pretrained on generic data against medical data, and on a single language, Dutch, against many languages. In addition, we present two data-augmented systems: one with data from the other languages of the workshop for multilingual training, and one with synthetic annotations.
PEI at #SMM4H-HeaRD 2026: Enhancing Patient Metadata Detection via Hypothesis-Conditioned Classification and Paraphrase-Based Data Augmentation
Farnaz Zeidi | Roman Christof | Farnoush Zeidi | Renate König | Liam Childs
This paper presents our approach to Task 5 of the #SMM4H-HeaRD 2026 Workshop, which focuses on detecting patient metadata in SARS-CoV-2 sequencing articles as a binary classification task. We explore both encoder-based and large language model (LLM) approaches, using BioM-BERT as a baseline and Mistral-Nemo as the LLM. To improve performance, we propose a paraphrase-based data augmentation pipeline using Qwen3, where paraphrased training and validation instances are added for fine-tuning. For the LLM, we perform prompt refinement and error analysis, while for the encoder-based model, we reformulate the task as a hypothesis-conditioned classification task inspired by Natural Language Inference (NLI). Our methods improve both models: Mistral-Nemo increases from 0.423 to 0.750 F1, and BioM-BERT from 0.801 to 0.821 on the validation set. Although Mistral-Nemo does not surpass BioM-BERT, our best BioM-BERT model achieves an F1-score of 0.786 on the test set, outperforming the mean and median of competing systems. To support reproducibility, we release our best-performing model on Hugging Face.
Dr-BERT-NL at #SMM4H–HeaRD 2026: DOKTERBERT – Ontology-Grounded Contextual Representations for Dutch Clinical NLP
Gijs Danoe | Andreas Voss | Axel Hamprecht | Matthijs S. Berends
We describe our submission to SMM4H-HeaRD 2026 Task 7, which asks systems to label ClinicalImpacts and SocialImpacts spans in Reddit posts about non-medical substance use. We compare four pipeline shapes built on the same DeBERTa-v3-base backbone: (i) a direct 5-class encoder with a linear-chain CRF head, (ii) a two-stage detect-then-classify pipeline that delegates span typing to an instruction-tuned LLM (Qwen2.5-7B or Gemma-3-12B, 4-bit NF4), (iii) an audit pipeline in which the same LLM verifies the encoder’s predictions, and (iv) a classical-ML variant that replaces the LLM with an SVM trained on encoder span embeddings. Across 16 configurations, the encoder-only DeBERTa-v3 + CRF configuration is the strongest single system on the official test split, reaching 45.4% strict and 54.2% relaxed F1, +8.6/+5.3 points above a mental-roberta-base baseline. LLM audits give a small dev gain that does not transfer to test.
Vasudev Awatramani at #SMM4H-HeaRD 2026: A Two-Pass LLM Pipeline with Deterministic Rule Derivation for Interpretable Insomnia Detection in Clinical Notes
Vasudev Awatramani
We describe our system for Shared Task 2 of #SMM4H–HeaRD 2026, which targets the detection of insomnia in MIMIC-III clinical notes. We frame the task as evidence extraction followed by deterministic rule application, rather than end-to-end label prediction. Our system operates in two passes: (1) a Gemini 2.5 Flash large language model (LLM), invoked through typed prompts written in BAML, extracts structured evidence (sleep difficulties, daytime impairment, hypnotic medications) with verbatim character-level citations from each note; (2) a small Python rule engine deterministically applies the task’s published Insomnia rules–Definition 1, Definition 2, and Rules B and C–to derive the binary patient-level label, the rule-component labels, and their evidence spans. We submitted two test-set systems: a zero-shot variant and a retrieval-augmented few-shot variant that selects nearest-neighbor training notes via FAISS over a sentence-embedding index. Our zero-shot variant achieved F1 = 0.8108 on Subtask 1 (binary classification) and a label-classification micro-F1 of 0.7126 with partial-match span F1 = 0.6621 on Subtask 2, both above the across-team mean. We additionally evaluate a GEPA-optimized prompt variant on the validation split. We discuss two findings of methodological interest: the few-shot variant improves Subtask 1 precision but does not improve F1, and does not move the multi-label or span metrics on Subtask 2 in our submission, and pushing the deterministic rule engine to consume LLM-extracted evidence (rather than asking the LLM to emit labels directly) gives strong, easily auditable behavior on a small test set.
Parallia at #SMM4H-HeaRD 2026: ClinicalAligner26AM: A Cross-Lingual Aligner for Dataset Translation; Evidences from the MultiClinCorpus Shared Task
François Remy
Word-level cross-lingual alignment is central to annotation projection, translation auditing, and cross-lingual faithfulness estimation, yet existing neural aligners are rarely adapted to specialized domains. In this paper, we introduce ClinicalAligner26AM, a large-context multilingual aligner model for biomedical and clinical text initialized from ClinicalEncoder26AM. Our training recipe is inspired by AWESoME Align. We build our soft alignment target by sharpening with Sinkhorn–Knopp optimal transport a cost matrix established for parallel clinical texts and conversations through the fusion of sentence-level, phrase-level, and token-level signals. We distill this sharpened alignment matrix directly into our student aligner, by encouraging its naive cosine-based token similarity scores to match this target. At inference time, we project source-span scores through the learned token alignment matrix and decode the longest valid high-scoring span in the target text, optionally supported by MultiClinNER predictions. We evaluate CA26AM on the MultiClinCorpus shared task, which projects Spanish clinical entity annotations into six target languages. Our two submitted systems ranked respectively first and second across all languages and entity types, with character-weighted F1 scores above 0.95 in nearly all settings.
Discovery@FI at #SMM4H–HeaRD 2026: Ensemble Character Classifier for Multilingual Clinical NER
Petr Zelina | Vit Novacek
We present a system for multilingual clinical named entity recognition (NER) submitted to the MultiClinNER subtask of MultiClinAI 2026, covering all seven languages and three entity classes (disease, symptom, procedure). Our approach trains one binary token classifier ensemble per entity class using cross-lingual fine-tuning of XLM-RoBERTa-large, with all languages handled jointly. We apply character-level ensembling over six models (two encoder variants × three cross-validation folds). This ensembling method provides more granular probability estimates than single-model classifiers, allowing for more flexible precision-recall trade-off tuning. The system achieves character-level F1 scores of 0.70–0.88 on the official test set.
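The character-level ensembling idea in the abstract above can be sketched as follows: each model's predictions are spread onto individual characters, the per-character probabilities are averaged across models, and a threshold turns the averaged curve back into spans. The span-tuple representation, function names, and example predictions are assumptions for illustration, not the authors' implementation.

```python
def char_probabilities(text_len, spans):
    """Spread span-level probabilities onto characters.

    `spans` is a list of (start, end, prob) with `end` exclusive; characters
    not covered by any span keep probability 0.0.
    """
    probs = [0.0] * text_len
    for start, end, p in spans:
        for i in range(start, end):
            probs[i] = max(probs[i], p)
    return probs

def ensemble_spans(text_len, model_outputs, threshold=0.5):
    """Average per-character probabilities over models, then threshold."""
    per_model = [char_probabilities(text_len, spans) for spans in model_outputs]
    mean = [sum(col) / len(per_model) for col in zip(*per_model)]
    spans, start = [], None
    for i, p in enumerate(mean):
        if p >= threshold and start is None:
            start = i                  # open a new span
        elif p < threshold and start is not None:
            spans.append((start, i))   # close the current span
            start = None
    if start is not None:
        spans.append((start, text_len))
    return spans

text = "Patient reports chest pain."
# Hypothetical predictions from three models for one entity class.
model_outputs = [
    [(16, 26, 0.9)],   # "chest pain"
    [(16, 26, 0.8)],   # "chest pain"
    [(22, 26, 0.6)],   # "pain" only
]
recall_oriented = ensemble_spans(len(text), model_outputs, threshold=0.5)
precision_oriented = ensemble_spans(len(text), model_outputs, threshold=0.7)
```

Because the averaged probability differs from character to character, moving the threshold changes span boundaries as well as span counts, which is one way to read the precision-recall flexibility the abstract describes.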
IITPatna_ADE at #SMM4H-HeaRD 2026: Multilingual Adverse Drug Event Detection with LoRA-XLM-RoBERTa, Cross-Fold Ensembles, and Post-hoc Calibration
Sofia Jamil | Manish Singh | Harshal Dharpure | Sriparna Saha | Rajiv Misra
We describe our submission to Task 1 of #SMM4H-HeaRD 2026: multilingual binary classification of adverse drug event (ADE) mentions in social media. Our system fine-tunes xlm-roberta-large with LoRA adapters and learned language embeddings, using two-stage training (CADEC translated domain adaptation, then five-fold cross-validation on the official training set). We ensemble the five fold checkpoints by mean logits, apply temperature scaling on the development set, and tune decision thresholds to maximize the official metric. On development, the final ensemble reaches macro-F1 0.788 with a global threshold and 0.796 with per-language thresholds; our best official test submission achieves macro-F1 0.616 (ID 678990).
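The post-processing chain in the abstract above (mean-logit ensembling, temperature scaling, threshold tuning) can be sketched in a few lines. Everything concrete here is an assumption for illustration: the logits are made up, the temperature is fixed rather than fitted on development data, and positive-class F1 stands in for the official macro-F1 metric.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def ensemble_probs(fold_logits, temperature=1.0):
    """Average logits over folds, then apply a temperature-scaled sigmoid."""
    n_folds = len(fold_logits)
    mean_logits = [sum(col) / n_folds for col in zip(*fold_logits)]
    return [sigmoid(z / temperature) for z in mean_logits]

def f1_score(labels, preds):
    """F1 of the positive class for binary labels and predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def tune_threshold(labels, probs, grid):
    """Return the (threshold, F1) pair with the highest F1 on the grid."""
    scored = [(t, f1_score(labels, [int(p >= t) for p in probs])) for t in grid]
    return max(scored, key=lambda pair: pair[1])

# Hypothetical dev-set logits from three fold checkpoints for five posts.
fold_logits = [
    [2.0, -1.0, 0.2, -2.5, 0.4],
    [1.5, -0.5, 0.1, -3.0, 0.8],
    [2.5, -1.5, 0.3, -2.0, 0.6],
]
labels = [1, 0, 0, 0, 1]
probs = ensemble_probs(fold_logits, temperature=1.5)
grid = [i / 20 for i in range(1, 20)]
best_threshold, best_f1 = tune_threshold(labels, probs, grid)
```

Per-language thresholds, as reported in the abstract, would amount to calling `tune_threshold` once per language subset instead of once globally.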
CUET_DiagNLP at #SMM4H-HeaRD 2026: Per-Axis TNM Staging from Pathology Reports and Opioid Impact Span Detection from Social Media
Shuva Dey | Priyangshu Barua | Mohammad Ashfak Habib
In this paper, we describe systems for two #SMM4H-HeaRD 2026 shared tasks. Task 6 asks for per-axis TNM cancer staging from free-text TCGA pathology reports under severe label imbalance and long-document constraints. We fine-tune GatorTron-base separately on each axis using Focal loss with class weights and a pooled [CLS]–mean representation, reaching macro F1 of 0.700 (T), 0.774 (N), and 0.640 (M) on test set 2 against a baseline of 0.454, 0.591, and 0.554 respectively. Task 7 asks for span-level detection of opioid-related ClinicalImpacts and SocialImpacts in first-person Reddit posts. We combine DeBERTa-large and PubMedBERT (two seeds each) in a uniform-weight ensemble with boundary-aware loss, entity-replacement augmentation, and a first-person post filter, achieving strict F1 of 0.51 and relaxed F1 of 0.60, above both the task mean (0.46 / 0.55) and median (0.48 / 0.58).
MedMind AI at #SMM4H-HeaRD 2026: Data Extraction and Generation Using Prompt Engineering and Structured Outputs (Tasks 1–6)
Aatish Pradhan | Brian M. Habersberger
Six tasks from the SMM4H–HeaRD 2026 workshop were addressed with task-specific large-language-model (LLM) pipelines relying on prompt engineering, strict structured (JSON) responses, and deterministic rule sets. The pipelines utilize no task-specific fine-tuning and can be adapted across diverse clinical and social media data. This study demonstrates that general-purpose LLMs (gpt-5.4-mini and gpt-5.4) can accurately extract and classify crucial health information when constrained by strict output schemas. Notably, our hybrid approach achieved the best overall performance among all participating systems for Task 2 (Insomnia Detection).
CaresAI at SMM4H-HeaRD 2026: Predicting TNM Staging
Joseph Itopa Abubakar | Jorge Jarme | Favour Igwezeke | Mary Adewunmi
The Tumor, Node, and Metastasis (TNM) staging system is critical to cancer treatment. This study aims to predict the T, N, and M stage labels independently from The Cancer Genome Atlas (TCGA) pathology reports, the sixth shared task of SMM4H-HeaRD 2026. The problem is framed as three multi-label classification tasks. We explore both classical and deep learning approaches using Term Frequency-Inverse Document Frequency (TF-IDF) features and embeddings from ClinicalBERT, BioBERT, and PubMedBERT. These representations are used with Logistic Regression (LR), Light Gradient Boosting Machine (LightGBM), Feed-Forward Neural Networks (FFNN), and Wide Residual Networks (WRN). Our results show that individual embeddings perform similarly on TNM label classification, while their combination improves predictive ability. WRN achieves AUROC scores of 0.839 (T), 0.8502 (N), and 0.803 (M) with F1-scores of 0.622, 0.702, and 0.9337, respectively, in the training phase. LightGBM with TF-IDF performs best, with AUROC scores of 0.9368 (T), 0.9524 (N), and 0.8311 (M) and F1-scores of 0.7559 (T), 0.7384 (N), and 0.7017 (M) in the training phase. On Codabench, test set 1 yields Macro-F1 scores of 0.978, 0.957, and 0.879 for the T, N, and M categories respectively, while test set 2 yields 0.807, 0.767, and 1.0. Overall Macro-F1 across all stages dropped from 0.938 on test set 1 to 0.858 on test set 2, suggesting limitations in model generalizability, sensitivity to class imbalance, and challenges in processing lengthy clinical documents. Although this study provides an efficient baseline model and a reproducible pipeline, further optimization and validation are required before it can be considered suitable for use in a real-world clinical setting.
Vinland_Vector at #SMM4H-HeaRD 2026: Multilingual ADE Detection and Query-Augmented Clinical NER for English
Nirjhar Das | Rathijit Aich | Mahfuzulhoq Chowdhury
In this paper, we address Task 1 on adverse drug event (ADE) detection and Task 8 on MultiClinNER at SMM4H-HeaRD 2026. ADE detection is formulated as a multilingual binary classification problem over social media posts spanning German, French, Russian, English, Mandarin and Japanese, with zero-shot on Farsi. Using XLM-RoBERTa-Large with a dual-pooling head, combined with stratified sampling, language-conditioned inputs, translation-based augmentation, and calibrated ensembling, our model achieves a macro F1 score of 0.6088, surpassing both the competition mean (0.5465) and median (0.5798). Our work in MultiClinNER targets clinical NER for English text. Using GLiNER-large with sliding-window inference, query augmentation, and calibrated thresholds, it achieves strict F1 scores of 0.7591 (Disease), 0.7263 (Procedure), and 0.6733 (Symptom), outperforming a PubMedBERT baseline across all entities.
SIEMENS at #SMM4H–HeaRD 2026: The Impact of Training Strategy and Backbone Selection on BERT-based Multilingual Clinical NER
Manuela Daniela Danu
This paper describes our participation in the MultiClinNER subtask of the MultiClinAI shared task, part of the #SMM4H-HeaRD Workshop at ACL 2026. The task requires identifying DISEASE, SYMPTOM, and PROCEDURE mentions in clinical case reports across seven languages: Czech, Dutch, English, Italian, Romanian, Spanish, and Swedish. We compare two BERT-based sequence labeling methods: (i) sentence-level token classification with a fixed train/validation split, and (ii) paragraph-level chunking with 5-fold cross-validation and checkpoint merging, using language-specific BERT models and multilingual XLM-RoBERTa-large as backbones. Our results show that 5-fold training with checkpoint merging consistently outperforms the fixed split strategy, with further analysis suggesting that the gains are primarily driven by improved training-set coverage rather than by differences in input granularity. Language-specific BERT encoders prove most effective for Spanish and English, while XLM-RoBERTa-large yields the strongest results for the remaining five languages through cross-lingual transfer.
HALELab-NITK at #SMM4H-HeaRD2026: Inclusion of Feature Engineering for Detection of Patient Metadata in SARS-CoV2 Sequencing Articles
Aakarsh Bansal | Abhishek Srinivas | Sowmya Kamath S.
This article presents a system description for our work as part of Task 5 of the SMM4H-HeaRD 2026 workshop. We fine-tune pretrained BERT and BiomedBERT models and further enhance them using custom feature augmentation techniques. Incorporating these engineered features results in improved performance, with the best model achieving a validation F1 score of 0.8419 and an evaluation phase F1 score of 0.753.
Cuet_Data_Wizards at #SMM4H-HeaRD 2026: Multilingual ADE Detection and Influenza Vaccine Effectiveness Estimation from Social Media
Abir Dey | Mohammed Omar Faiaz | Muhammad Ibrahim Khan
We present our systems for Task 1 and Task 3 of the #SMM4H-HeaRD 2026 shared tasks. Task 1 focuses on binary classification of adverse drug event (ADE) mentions across seven languages, including a zero-shot Persian setting without labeled training data. We fine-tune XLM-RoBERTa-large using weighted cross-entropy loss and augment low-resource settings with additional CADEC data and machine translation-based Persian augmentation. Our system achieves a macro F1 score of 0.582, outperforming the shared task average of 0.547. Task 3 addresses influenza vaccine effectiveness estimation through classification of vaccination status and flu-test results from X posts. We fine-tune twitter-roberta-large, achieving micro F1 scores of 0.845 for vaccination status and 0.883 for flu-test classification on the official test set. Post-evaluation experiments with focal loss, test-time augmentation, and head-tail truncation further improve performance. These results highlight the effectiveness of robust transformer adaptation for health-related social media classification.
Limics at #SMM4H-HeaRD 2026: Uncertainty-Driven Prediction for ADE Detection in Social Media
Nour Allam
This paper describes our system for the SMM4H-HeaRD 2026 Task 1: Detection of Adverse Drug Events in Multilingual and Multi-platform Social Media Posts. We developed a two-stage pipeline combining a fine-tuned XLM-RoBERTa-large encoder-only model with a large language model for final decision on ambiguous cases. To handle complex linguistic boundaries, we explore explicitly training the encoder to treat ambiguity as a discrete third label to delegate those cases to the generative model. Although introducing the third label was associated with lower performance than relying on a binary model, when using the encoder as a preliminary filter for classifying 78.62% of posts as negatives, we achieved a global F1 score of 0.614 (+0.034 over task median).
This paper presents our system for Shared Task 4 of the #SMM4H-HeaRD 2026 Workshop, in which a doctor-patient dialogue is summarized into a clinical note in the corresponding SOAP format. Our proposed solution combines semi-supervised learning with parameter-efficient fine-tuning (PEFT) applied to a lightweight pre-trained QWEN3.5 model. Our model delivers competitive performance relative to its parameter count and generalizes to the unseen test dataset.
ACSS-PSL at #SMM4H-HeaRD 2026: An LLM-Driven Autoresearch Loop for Opioid-Impact NER
Olivier Caron | Bruno Chaves Ferreira | Christophe Benavent
We apply an LLM-driven autoresearch protocol to Task 7 of #SMM4H-HeaRD 2026, which requires extracting ClinicalImpacts and SocialImpacts spans from Reddit posts about non-medical opioid use. A coding agent iteratively proposes a hypothesis, modifies the training configuration, and evaluates against the held-out validation set. Across 79 runs, only 9 improved strict F1, indicating a narrow viable search space on this small dataset (842 training examples). The submitted ensemble combines DeBERTa-large, MC Dropout blending, and a constrained multi-LLM consensus layer, reaching 0.46 strict and 0.52 relaxed F1 on test, though single-seed evaluation limits the reliability of run-level comparisons. The run log provides a reproducible case study of autonomous experimentation, including failure modes and guardrails for small-data NER.
Creative Catalysts at #SMM4H-HeaRD 2026: XLM-RoBERTa for Task 1 Binary Classification of Social Media Posts Containing Adverse Drug Events
Radja Afren | Hichem Rahab | Imane Guellil
Automatic detection of adverse drug events (ADEs) from social media posts has become an important task for healthcare systems that rely on real-world, patient-collected data. This work, by team Creative Catalysts, addresses ADE detection in user-generated content for Task 1 of the Social Media Mining for Health Research and Applications Workshop (SMM4H 2026). We fine-tuned XLM-RoBERTa, a pre-trained model chosen for its robustness in handling the multilingual content and linguistic diversity common in social media text. To better handle the class imbalance, we subsequently implemented a class-weighting strategy to increase the model’s focus on the underrepresented positive class. This adjusted model improved the validation F1-score to 65%. Our results demonstrate the effectiveness of transformer-based architectures for ADE detection while highlighting the critical need for robust class-balancing techniques and multilingual generalization to handle real-world, imbalanced social media data.
BioNLP at #SMM4H-HeaRD 2026 Task 3 Estimating Flu Vaccine Effectiveness: A Temporal-Aware Fine-Tuning and Similarity-Based Few-Shot Prompting Approach
Irina Patularu
This paper presents our systems for the SMM4H 2026 shared task on flu-related tweet classification across two subtasks: flu vaccination status and flu test outcome classification. For each subtask, we evaluate two approaches: fine-tuning BERTweet-large with a temporal-aware architecture, cross-validation ensembling, and regularization techniques, and a GPT-4o few-shot prompting system with similarity-based dynamic example retrieval, chain-of-thought reasoning and contrastive label ranking. Fine-tuning proves superior for the flu vaccination subtask (micro-F1: 87.90%), where sufficient and relatively balanced training data is available, while few-shot prompting performs better for the flu test subtask (micro-F1: 95.74%), where limited and heavily imbalanced training data renders fine-tuning less effective.
Infimobius at #SMM4H-HeaRD 2026: Multi-Seed DeBERTa Ensemble for Flu Vaccination and Testing Status Classification
Pradyumn Kejriwal | Suhani Singh Charan | Raksha Sharma | Rudra Murthy
This paper describes FluENS (Flu ENsemble System), our submission to the Social Media Mining for Health (SMM4H) 2026 Shared Task 3, which targets fine-grained classification of flu vaccination and flu testing statuses from tweets. FluENS builds on the microsoft/deberta-v2-xlarge pre-trained language model and employs a multi-seed ensemble strategy in which five models, each initialized with a different random seed and trained on the full training set, are aggregated through soft-voting over averaged softmax probabilities. We additionally incorporate balanced class weights to mitigate severe label imbalance and apply a two-stage learning rate schedule that separately controls the encoder and classification head. On the development set, FluENS achieves a macro F1 of 79.64% and micro F1 of 85.56% on the flu vaccination sub-task, and a macro F1 of 96.35% and micro F1 of 97.04% on the flu testing sub-task, substantially outperforming a roberta-base baseline across all metrics.
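Soft voting over averaged softmax probabilities, as described in the abstract above, can be sketched as follows. The logits, the number of classes, and the function names are hypothetical; this is an illustration of the aggregation rule only, not the FluENS code.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_vote(per_model_logits):
    """Average softmax probabilities over models and pick the argmax class."""
    probs = [softmax(logits) for logits in per_model_logits]
    mean = [sum(col) / len(probs) for col in zip(*probs)]
    return max(range(len(mean)), key=lambda k: mean[k]), mean

# Hypothetical logits for one tweet from five seeds over three classes.
per_seed_logits = [
    [2.0, 1.0, 0.0],
    [0.5, 1.5, 0.0],
    [1.8, 1.2, 0.1],
    [0.4, 1.0, 0.2],
    [2.2, 0.3, 0.0],
]
label, mean_probs = soft_vote(per_seed_logits)
```

Unlike majority voting over hard labels, this rule lets a confident model outweigh less confident ones, since each seed contributes its full probability vector.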
Thunderbolts at #SMM4H-HeaRD 2026: Detection of Insomnia in Clinical Notes using Transformers
Guddanti Venkata Sree Charan | Nama_Ss@Cs.Iitr.Ac.In Nama_Ss@Cs.Iitr.Ac.In | Raksha Sharma | Rudra Murthy
We present the SuSh system for Subtask 1 of the MultiClinAI shared task at the 11th SMM4H and HeaRD Workshop (ACL 2026), which addresses multilingual clinical named entity recognition (NER) across seven languages. Our system adopts a fully zero-shot approach using GLiNER-biomed-large-v1.0, a span-based NER model pre-trained on biomedical text, requiring no task-specific fine-tuning or labeled data in target languages. We apply a character-level sliding window strategy to handle long clinical documents that exceed the model’s token limit and incorporate a post-processing pipeline including threshold optimization via F1-max sweep, entity-specific gazetteer lookup derived from DisTEMIST and SympTEMIST terminology lists, span boundary correction, and negation filtering. Our official submission achieves a Strict F1 of 0.5175, Strict Precision of 0.5536, Strict Recall of 0.4859, and CHR F1 of 0.6130 on the English disease subtask, demonstrating that domain-adapted zero-shot biomedical NER models can serve as competitive baselines for multilingual clinical entity recognition without any task-specific training data.
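A character-level sliding window of the kind mentioned in the abstract above might look like the sketch below: the document is cut into overlapping chunks, each chunk is tagged separately, and window-local spans are shifted back to document offsets and de-duplicated. The window and stride sizes, the helper names, and the `toy_tagger` stand-in (a plain string lookup in place of a real NER model) are all assumptions for illustration.

```python
def sliding_windows(text, window=40, stride=30):
    """Yield (offset, chunk) pairs covering `text` with overlapping windows."""
    if stride <= 0 or stride > window:
        raise ValueError("stride must be in 1..window")
    offset = 0
    while True:
        yield offset, text[offset:offset + window]
        if offset + window >= len(text):
            break
        offset += stride

def merge_predictions(window_preds):
    """Map window-local spans to document offsets and drop duplicates.

    `window_preds` is a list of (offset, [(start, end, label), ...]).
    """
    merged = set()
    for offset, spans in window_preds:
        for start, end, label in spans:
            merged.add((offset + start, offset + end, label))
    return sorted(merged)

def toy_tagger(chunk, term="pneumonia", label="DISEASE"):
    """Stand-in for a real NER model: exact string lookup of one term."""
    spans, pos = [], chunk.find(term)
    while pos != -1:
        spans.append((pos, pos + len(term), label))
        pos = chunk.find(term, pos + 1)
    return spans

text = "The patient was admitted with fever and was later diagnosed with pneumonia."
preds = [(off, toy_tagger(chunk)) for off, chunk in sliding_windows(text)]
entities = merge_predictions(preds)
```

In this sketch the overlap (`window - stride`, here 10 characters) is at least as long as the longest entity of interest, so a mention cut by one window boundary is still seen whole in the next window.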
Team TIET at #SMM4H-HeaRD 2026: Fine-tuned Biomedical Transformers with Language-Balanced Sampling for Patient Metadata and Multilingual ADE Detection
Divrose Kaur | Jatin Bedi | Jasmeet Singh
We present Team TIET’s systems for two shared tasks at #SMM4H-HeaRD 2026: Task 5 (detection of patient metadata in SARS-CoV-2 sequencing papers) and Task 1 (multilingual adverse drug event detection across six languages plus an unseen Farsi subset). For Task 5 we explore iterative LLM prompting followed by fine-tuning BiomedBERT-base with weighted cross-entropy loss and probability threshold optimization, achieving F1 = 0.760 on the official test set (above the competition mean of 0.729). For Task 1 we fine-tune XLM-RoBERTa-base with a combined language- and class-balanced sampling strategy and per-language threshold tuning, achieving macro F1 = 0.497 overall (0.608 excluding the unseen Farsi subset). We report empirical findings on BERT+LLM ensemble failure with bimodal probability distributions, the superiority of base over large model variants under limited data, and the importance of language-balanced gradient contribution in multilingual classification.
MetaMiners at SMM4H-HeaRD 2026: A Semantic-Structural Knowledge-Enriched Ensemble for SARS-CoV-2 Metadata Identification
Claudia-Alexandra Ursu | Alecsandru-Florin Soare
This paper presents a hybrid solution for binary classification of PubMed articles, aimed at identifying reports that associate clinical metadata with SARS-CoV-2 genomic sequences. The system is designed to capture the subtle distinction between reports of sequence-associated patient metadata and sentences where such metadata is unrelated, irrelevant, or linked to previous studies. The main challenge is that the dataset is highly imbalanced, with only 13.3% of reports labeled positive. We propose a hybrid system that combines four approaches: dual-evidence tagging, negation-aware suppression, semantic frame extraction, and adversarial training. These approaches were tested on multiple models (BiomedBERT-base-abstract, BioLinkBERT-large, and PubMedBERT-base-fulltext), followed by a best-subset ensemble search that yields an F1 score of 0.792, setting a new benchmark and placing the solution first in the competition.
No_gmail at #SMM4H-HeaRD 2026: Detecting Patient Metadata in COVID-19 Scientific Literature: A Comparative Study of Encoder-Only and Autoregressive Language Models
Stefanescu Anastasia
Identifying sentences in COVID-19 literature that report patient metadata is an important step in genomic epidemiology, currently requiring costly manual curation. We compare fine-tuned encoder-only models (BERT, BioLinkBERT) and autoregressive LLMs (Llama, Gemma, GPT-OSS) under prompting and fine-tuning regimes, using Focal Loss and undersampling to address severe class imbalance. Encoder-only models substantially outperform autoregressive models: BioLinkBERT-base with Focal Loss achieves macro F1 of 0.76, versus 0.54 for the best fine-tuned autoregressive model.
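For readers unfamiliar with Focal Loss, the sketch below shows the standard binary form, which down-weights examples the model already classifies confidently so that training focuses on hard cases. The `gamma` and `alpha` values are common defaults chosen for illustration; the abstract above does not state which hyperparameters were used.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    p: predicted probability of the positive class; y: gold label (0 or 1).
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p, y):
    """Plain binary cross-entropy for one example, for comparison."""
    return -math.log(p if y == 1 else 1.0 - p)

easy = focal_loss(0.95, 1)   # confident and correct: heavily down-weighted
hard = focal_loss(0.10, 1)   # confident and wrong: keeps most of its loss
```

With `gamma=0` the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy, which makes the effect of `gamma` easy to check in isolation.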
Understanding the Sociocultural Dimensions of Mental Health Discourse in Arabic X Communities
Amal Abdullah Alqahtani | Rana Aref Salama | Mona T. Diab
Computational mental health research has predominantly centered on English-speaking populations, leaving Arabic-language discourse comparatively under-examined. We present an exploratory computational study of 8,147 tweets from 607 users classified by a GPT-4.1 personal-disclosure pipeline as likely lived-experience authors in three condition-specific Arabic-language X (formerly Twitter) Communities. We focus on discourse related to borderline personality disorder (BPD), bipolar disorder, and ADHD, and characterize community-associated linguistic patterns using a multi-domain cultural keyword framework. The results suggest that in this corpus, Bipolar tweets contain more religious and medical vocabulary, BPD tweets contain more relational, identity, and emotional-distress vocabulary, and ADHD tweets more often focus on practical symptoms and medication management. We treat these patterns as hypothesis-generating rather than confirmatory because the corpus is imbalanced across conditions, some subcorpora are temporally concentrated, and the keyword framework is an initial operationalization rather than a validated measurement instrument. The paper contributes a reusable LLM-assisted personal-disclosure pipeline and an exploratory cultural keyword framework for Arabic mental health discourse.
Team Paradise at #SMM4H-HeaRD 2026: Multi-Task Approaches for Social Media Health Mining
Dhruv Goyal | Ishita Gupta | Jatin Bedi
We present Team Paradise’s systems for three tasks in the SMM4H-HeaRD 2026 shared task: multilingual adverse drug event detection (Task 1), influenza vaccine effectiveness estimation via two-subtask classification (Task 3), and opioid impact span extraction (Task 7). For Task 1, threshold-only ablation on XLM-RoBERTa-large achieves a macro-F1 of 0.597, exceeding the field mean (0.547) by +0.050. For Task 3, a three-stage hybrid pipeline combining twitter-RoBERTa-base-2022 with rule-based post-processing achieves Micro-F1 0.8434 (Subtask 1: vaccination status) and 0.8936 (Subtask 2: test results). For Task 7, RoBERTa-large with CRF decoding and sliding-window inference obtains relaxed F1 0.60 despite severe train-test distributional shift. Across tasks, we identify class imbalance, temporal ambiguity, and platform heterogeneity as central challenges.
The MultiClinAI Shared Task on Multilingual Clinical Corpus Construction and Concept Extraction: Systems, Evaluation, and Datasets
Fernando Gallego Donoso | Salvador Lima-Lopez | Judith Rosell | Eulàlia Farré-Maduel | Martin Krallinger
We present an overview of the MultiClinAI shared task, which focuses on multilingual clinical entity extraction and automatic corpus generation through annotation projection. It addresses two key challenges in clinical natural language processing (NLP): (i) developing comparable multilingual named entity recognition (NER) systems and (ii) automatically constructing multilingual clinical corpora through annotation projection. The MultiClinAI task provides a unified benchmark for evaluating multilingual and cross-lingual clinical NLP approaches that cover diseases, symptoms, and procedures in Spanish, English, Dutch, Italian, Romanian, Swedish, and Czech. A total of 21 teams from 13 countries participated, submitting 531 runs across the different subtasks. The top runs obtained very competitive results, close to human expert annotation quality. The results highlight both the challenges and opportunities of multilingual clinical information extraction. All resources, including a corpus of over 738,201 manually revised entity mentions across seven languages, are publicly available on Zenodo at: https://zenodo.org/records/19334278.
Overview of #SMM4H-HeaRD 2026 – Task 6: Predicting TNM staging from pathology reports
Jose Miguel Acitores Cortina | Jacob S. Berkowitz | Nadine A. Friedrich | Nicholas P Tatonetti
This paper provides an overview of Task 6 from the Social Media Mining for Health/Health Real-World Data shared task (#SMM4H-HeaRD 2026), which focused on predicting TNM staging from pathology reports from TCGA. Seven teams submitted systems spanning fine-tuned clinical encoders, open-source generative LLMs, and closed-source API models. On a straightforward test set, most teams achieved near-perfect F1 scores (average 0.993, 0.972, and 0.957 for T, N, and M). However, on a harder tiebreak set where explicit TNM notation was removed and staging had to be inferred from clinical descriptions, performance dropped substantially (average 0.725, 0.783, and 0.846). Notably, the two teams using large closed-source API models generalized best to the harder set, achieving the highest T and N scores despite not leading on the easy set. These results suggest that while fine-tuned domain-specific encoders excel at surface-level extraction, larger general-purpose LLMs may be more robust when staging must be inferred from contextual clinical findings. All teams surpassed baseline overall performance on both test sets.
NoviceTrio in #SMM4H-HeaRD 2026: Hybrid Clinical Transformer Ensembles for Insomnia Detection and Evidence Extraction from Clinical Notes
Abir Naskar | Mike Conway
We present two systems for the #SMM4H-HeaRD 2026 Task 2 shared task of automated insomnia detection from clinical notes. Our system addresses both subtasks: (1) binary insomnia classification and (2) multi-label rule prediction with evidence span extraction. For Subtask 1, we employ an ensemble architecture combining Qwen3-4B-Instruct and Bio_ClinicalBERT to capture both general semantic reasoning and domain-specific clinical representations. The framework utilizes chunk-based processing with overlapping token windows to handle long clinical notes efficiently. For Subtask 2, we develop a dual-head multi-task transformer model that jointly predicts insomnia labels and token-level evidence spans using BIO tagging. To improve clinical relevance, we additionally incorporate sentence-level filtering using sentence-transformer embeddings and similarity-based retrieval of insomnia-related contexts. Experimental results suggest competitive performance relative to the shared task mean and median scores across both subtasks. Our best Subtask 1 system achieves a recall of 0.9474, surpassing the shared-task mean and median recall, while our Subtask 2 system exceeds the mean and median scores across label classification, exact match, and partial match metrics. The end-to-end implementation is publicly available on GitHub.
Overview of #SMM4H-HeaRD 2026 - Task 2: Detection of Insomnia in Clinical Notes
Joey Chan | Lauren D. Gryboski | Guillermo Lopez-Garcia | Graciela Gonzalez-Hernandez
This paper provides an overview of Task 2 from the Social Media Mining for Health and Health Real-World Data (#SMM4H-HeaRD) 2026 Workshop and Shared Tasks, which focused on the detection of insomnia in clinical notes derived from the MIMIC-III dataset. The task consisted of two subtasks: binary text classification to determine whether a patient is likely experiencing insomnia (Subtask 1), and multi-label classification combined with character-level evidence extraction to identify supporting evidence for specific insomnia criteria (Subtask 2). Eight teams participated, using approaches ranging from large language model (LLM) prompting and fine-tuned encoder models to hybrid rule-based pipelines. Results demonstrated that structured LLM pipelines with deterministic post-processing achieved the strongest overall performance, while character-level span extraction remained substantially harder than classification across all systems. These findings highlight both the promise of NLP for identifying underdiagnosed conditions in electronic health records and the ongoing difficulty of producing interpretable, evidence-grounded clinical predictions.
Overview of the 11th Social Media Mining for Health (#SMM4H) and Health Real-World Data (HeaRD) Shared Tasks at ACL 2026
Guillermo Lopez-Garcia | Jose Miguel Acitores Cortina | Jacob Berkowitz | Joey Chan | Sumon Kanti Dey | Ivan Flores Amaro | Fernando Gallego | Lauren Gryboski | Ari Z. Klein | Farnoush Zeidi Kolehparcheh | Martin Krallinger | Salvador Lima-Lopez | Yujun Ma | Tomohiro Nishiyama | Ahmad Rezaie Mianroodi | Amirali Rezaie Mianroodi | Lisa Raithel | Roland Roller | Judith Rosell | Frank Rudzicz | Abeed Sarker | Nicholas Tatonetti | Philippe Thomas | Elena Tutubalina | Dongfang Xu | Farnaz Zeidi | Yu Zhai | Pierre Zweigenbaum | Graciela Gonzalez-Hernandez
The aim of the Social Media Mining for Health Applications and Health Real-World Data (#SMM4H-HeaRD) shared tasks is to foster the development and evaluation of natural language processing, machine learning, and artificial intelligence methods for analyzing health-related text from social media and other real-world data sources. For the 11th iteration, held online and co-located with ACL 2026, the workshop continued the expanded #SMM4H-HeaRD platform initiated in 2025, broadening its scope beyond social media to include additional health real-world data sources such as clinical narratives and biomedical literature. The 8 shared tasks covered diverse data sources, health domains (e.g., adverse drug events, insomnia, influenza vaccine effectiveness, cancer staging, substance use), and task formulations (e.g., classification, named entity recognition, span extraction, and text generation). In total, 110 teams registered, representing 31 countries. In this paper, we present an overview of the datasets, participant systems, and performance results, providing insights into current methods for mining social media and health real-world data for biomedical and clinical applications.
Proceedings of the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)
Saif M. Mohammad | Nedjma Ousidhoum
SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Models
Dong Liu | Yanxuan Yu
Long-context language models face efficiency challenges as context lengths expand. Traditional tokenization methods like BPE operate on frequency statistics, ignoring semantic structure and over-tokenizing redundant spans. We propose SemToken, a semantic-aware tokenization framework that adaptively compresses token sequences based on semantic density. SemToken uses lightweight encoders to identify and merge semantically equivalent spans, allocates variable granularity based on local semantic density, and dynamically adjusts token budgets during generation. Evaluations on WikiText-103, LongBench, and BookSum demonstrate 2.4× token reduction, 1.9× inference speedup, and 67% memory reduction while preserving or improving model quality. SemToken integrates seamlessly with existing models and achieves multiplicative benefits when combined with FlashAttention (up to 2.7× total speedup).
Syntactic Priming in Few-Shot Learning: How Demonstration Structure Shapes LLM Performance
Prasanth Yadla
Large language models (LLMs) exhibit remarkable few-shot learning capabilities, yet the role of syntactic structure in demonstration examples remains unexplored. Drawing on psycholinguistic research on structural priming, we investigate whether syntactic patterns in few-shot prompts influence LLM outputs and task performance. We conduct systematic experiments across four model families (Llama, Mistral, Qwen, Gemma) using four syntactic constructions (passive voice, cleft sentences, dative alternation, particle placement). Our results reveal robust syntactic priming effects, with priming strength ranging from 1.3× to 6.4× depending on construction type, indicating that models are substantially more likely to produce constructions matching demonstration syntax. Critically, we find that priming strength shows a positive trend with model size (r = 0.85, p = 0.068), with effects intensifying from 7B to 14B parameter models. We demonstrate that priming is construction-specific rather than reflecting general stylistic preferences, and that priming effects persist across multiple intervening sentences. Analysis across three task types (sentence completion, paraphrase generation, story continuation) reveals that syntactic structure in demonstrations influences output style, and that models produce primed constructions even when the task calls for a different syntactic form. These findings have immediate implications for prompt engineering and reveal that LLMs encode syntactic abstractions beyond surface-level pattern matching. We release our benchmark, SyntaxPrime-ICL, containing controlled examples across multiple constructions for evaluating syntactic priming in few-shot contexts.
A Logic-Based Approach to Hallucinations in Data-to-Text NLG: Experiments with Human and LLM Annotators
Eduardo Calò | Saad Mahamood | Albert Gatt | Kees Van Deemter
Hallucinations are a persistent challenge in natural language generation, including data-to-text. van Deemter (2024) introduced a framework based on the relation of logical consequence ("follows from"), which divides all data-to-text hallucinations into seven disjoint categories. We examine whether human annotators and large language models are able to apply the framework, in two data-to-text domains. Results suggest that the framework is applicable, although there are significant domain-dependent variations, as well as discrepancies between human and model judgments. We also uncover several issues that should inform future work on hallucination.
A framework for annotating and modelling intentions behind metaphor use
Gianluca Michelli | Xiaoyu Tong | Ekaterina Shutova
Metaphors are part of everyday language and shape the way in which we conceptualize the world. Moreover, they play a multifaceted role in communication, making their understanding and generation a challenging task for language models (LMs). While there has been extensive work in the literature linking metaphor to the fulfilment of individual intentions, no comprehensive taxonomy of such intentions, suitable for natural language processing (NLP) applications, is available to date. In this paper, we propose a novel taxonomy of intentions commonly attributed to metaphor, which comprises 9 categories. We also release the first dataset annotated for intentions behind metaphor use. Finally, we use this dataset to test the capability of large language models (LLMs) in inferring the intentions behind metaphor use, in zero- and in-context few-shot settings. Our experiments show that this is still a challenge for LLMs.
ReFRAME or Remain: Unsupervised Lexical Semantic Change Detection with Frame Semantics
Bach Phan Tat | Kris Heylen | Stefano De Pascale | Dirk Geeraerts | Dirk Speelman
The majority of contemporary computational methods for lexical semantic change (LSC) detection are based on neural embedding distributional representations. Although these models perform well on LSC benchmarks, their results are often difficult to interpret. We explore an alternative approach that relies solely on frame semantics. We show that this method is effective for detecting semantic change and can even outperform many distributional semantic models. Finally, we present a detailed quantitative and qualitative analysis of its predictions, demonstrating that they are both plausible and highly interpretable.
Finding Sense in Nonsense with Generated Contexts: Perspectives from Humans and Language Models
Katrina Olsen | Sebastian Padó
Nonsensical and anomalous sentences have been instrumental in the development of computational models of semantic interpretation. A core challenge is to distinguish between what is merely anomalous (but can be interpreted given a supporting context) and what is truly nonsensical. However, it is unclear (a) how nonsensical, rather than merely anomalous, existing datasets are; and (b) how well LLMs can make this distinction. In this paper, we answer both questions by collecting sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets—both context-free and when providing a context. We find that raters consider most sentences at most anomalous, and only a few as properly nonsensical. We also show that LLMs are substantially skilled in generating plausible contexts for anomalous cases.
Uncovering Ideological Bias in RAG with Lexical Multidimensional Analysis: A Case Study on COVID-19
Elmira Salari | Maria Claudia Nunes Delfino | Hazem Amamou | José Victor de Souza | Shruti Kshirsagar | Alan Davoust | Anderson Avila
This paper studies the impact of retrieved ideologically framed texts on the outputs of large language models (LLMs). While interest in understanding ideological framing in LLMs has recently increased, little attention has been given to this issue in the context of Retrieval-Augmented Generation (RAG). To fill this gap, we design an external knowledge source based on ideologically framed texts about COVID-19 treatments. Our corpus is based on 1,117 academic articles representing discourses about controversial and endorsed treatments for the disease. We propose a corpus linguistics framework, based on Lexical Multidimensional Analysis (LMDA), to identify discourse dimensions within the corpus. LLMs are tasked to answer questions derived from three identified discourse dimensions, and two types of contextual prompts are adopted: the first comprises the user question and ideologically framed texts; and the second contains the question, ideologically framed texts, and LMDA descriptions. Alignment between reference ideologically framed texts and LLMs’ responses is assessed using cosine similarity for lexical and semantic representations. Results demonstrate that retrieved ideologically framed texts influence LLM responses toward the discourse framing represented in the external knowledge, with enhanced prompts further amplifying this effect. Our findings highlight the importance of identifying ideological framings within the RAG framework in order to mitigate not just unintended ideological bias, but also the risks of intentional discourse steering of such models.
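The alignment measure named in the abstract, cosine similarity between a reference text and an LLM response, can be sketched for the lexical case as below. This is an illustrative assumption about one way to build lexical vectors (lowercased whitespace tokens with raw counts); the abstract does not specify the tokenization, weighting, or embedding model used, and `lexical_cosine` is a hypothetical name.

```python
import math
from collections import Counter


def lexical_cosine(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two texts.

    Tokenization by lowercasing and whitespace splitting is an
    assumption made for this sketch only.
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    reference = "the treatment reduced symptoms in most patients"
    response = "most patients reported that the treatment reduced symptoms"
    print(round(lexical_cosine(reference, response), 3))
```

For the semantic case, the same formula would be applied to dense sentence embeddings instead of count vectors.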
Evaluating Adjective-Noun Compositionality in LLMs: Functional vs Representational Perspectives
Ruchira Dhar | Qiwei Peng | Anders Søgaard
Compositionality is considered central to language abilities. As performant language systems, how do large language models (LLMs) perform on compositional tasks? We evaluate adjective–noun compositionality in LLMs using two complementary setups: prompt-based functional assessment and a representational analysis of internal model states. Our results reveal a striking divergence between task performance and internal states. While LLMs reliably develop compositional representations, these representations fail to translate consistently into functional task success across model variants. Consequently, we highlight the importance of contrastive evaluation for obtaining a more complete understanding of model capabilities.
Lexical Availability and Human Distributional Agreement in GPT-4o’s Color Naming
Anna Feldman | Jing Peng
We evaluate GPT-4o’s color naming across nine languages using both synthetic and human-derived stimuli. Using hue wheels, fixed basic categories, low-chroma hue lines, and dense binned CIELAB grids, we separate lexical availability of color terms from distributional agreement with human color naming. GPT-4o reliably names vivid, high-chroma colors and reproduces several known language-specific distinctions under constrained settings. However, its performance degrades sharply for low-chroma colors and for stimuli near human category boundaries. In these regions, model-human divergence remains high. Overall, GPT-4o shows strong cross-linguistic lexical knowledge but does not reliably match human color-naming distributions, especially in low-chroma and boundary regions.
The LSCD Benchmark: a Testbed for Diachronic Word Meaning Tasks
Dominik Schlechtweg | Sachin Yadav | Jonas Kuhn | Nikolay Arefyev
Lexical Semantic Change Detection (LSCD) is a complex, lemma-level task, which is usually operationalized based on two successively applied usage-level tasks: First, Word-in-Context (WiC) labels are derived for pairs of usages. Then, these labels are represented in a graph on which Word Sense Induction (WSI) is applied to derive sense clusters. Finally, LSCD labels are derived by comparing sense clusters over time. This modularity is reflected in most LSCD datasets and models. It also leads to a large heterogeneity in modeling options and task definitions, which is exacerbated by a variety of dataset versions, preprocessing options and evaluation metrics. This heterogeneity makes it difficult to evaluate models under comparable conditions, to choose optimal model combinations or to reproduce results. Hence, we provide a benchmark repository standardizing LSCD evaluation. Through transparent implementation, results become easily reproducible, and through standardization, different components can be freely combined. The repository reflects the task’s modularity by allowing model evaluation for WiC, WSI and LSCD. This allows for careful evaluation of increasingly complex model components, providing new ways of model optimization. We use the implemented benchmark to conduct a number of experiments with recent models and systematically improve the state-of-the-art.
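The last step of the pipeline described above, deriving an LSCD score by comparing sense clusters over time, might look like the following sketch. It assumes that WSI has already assigned a cluster label to every usage, and it uses the Jensen-Shannon distance between the two periods' sense frequency distributions as the change score. That choice is one common formulation of graded change and is an assumption here, not something the abstract states; `graded_change` and the other names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def sense_distribution(labels: Sequence[int], senses: Sequence[int]) -> List[float]:
    """Relative frequency of each sense cluster among the given usages."""
    counts = Counter(labels)
    total = len(labels)
    return [counts[s] / total for s in senses]


def jensen_shannon_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon distance (base 2), ranging from 0 to 1."""
    m = [(x + y) / 2 for x, y in zip(p, q)]

    def kl(u: Sequence[float], v: Sequence[float]) -> float:
        # Terms with u_i == 0 contribute nothing to the divergence.
        return sum(ui * math.log2(ui / vi) for ui, vi in zip(u, v) if ui > 0)

    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(jsd, 0.0))


def graded_change(labels_t1: Sequence[int], labels_t2: Sequence[int]) -> float:
    """Change score for one lemma from its cluster labels in two periods."""
    if not labels_t1 or not labels_t2:
        raise ValueError("both periods need at least one usage")
    senses = sorted(set(labels_t1) | set(labels_t2))
    return jensen_shannon_distance(
        sense_distribution(labels_t1, senses),
        sense_distribution(labels_t2, senses),
    )


if __name__ == "__main__":
    # Sense 0 dominates the earlier period, sense 1 the later one.
    print(graded_change([0, 0, 0, 1], [0, 1, 1, 1]))
```

A score of 0 means the sense distribution is unchanged between periods, and a score of 1 means the two periods share no sense cluster.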
From Latents to Labels: Zero-Shot Named Entity Recognition using Sparse Autoencoder Features
Nakanyseth Vuth | Gilles Sérasset | Didier Schwab
Zero-shot Named Entity Recognition is critical for low-resource domains, yet existing approaches rely on opaque prompting of large language models or dense representations that suffer from polysemanticity. We propose an alternative approach that leverages monosemantic features of Sparse Autoencoders. We introduce SAE-NER, a training-free framework that maps monosemantic SAE feature activations to entity types through direct precision estimation, requiring no supervision or prompting. Experiments across general and biomedical domains show that SAE-NER consistently outperforms trained probing classifiers, with an especially large margin in the biomedical domain (up to +20 F1). Finally, we evaluate the utility of SAE-NER predictions as silver training data for downstream NER models. Using controlled perturbations of gold annotations to simulate realistic annotation noise, we show that false negatives are the primary bottleneck for silver-data quality, outweighing the impact of boundary imprecision and false positives.
Can You Be More Explicit? A Task and Dataset on Explicitations of Implicit Meaning
Laura Zeidler | Michael Roth
Making texts clear and comprehensible has become an increasingly important topic in NLP. A possible strategy to enhance text comprehension is to make implicitly conveyed meaning explicit. To explore the role of explicit vs. implied meaning, we study cases of so-called explicitations, i.e. revisions of text in which implicitly conveyed content is made explicit. Using revision histories from wikiHow, we propose a rule-based approach to extract candidate explicitations and curate a human-annotated dataset in which explicitations are distinguished from insertions of new information. Our analyses show that while the extraction method is effective in retrieving relevant cases, distinguishing explicitations from new information is a challenging and often subjective task, reflecting differences in background knowledge and reasoning. Experimentally, we find off-the-shelf LLMs to achieve promising performance, with inconsistent gains from few-shot prompting and fine-tuning. In contrast, fine-tuned NLI models benefit consistently from supervised training and show stronger robustness under distribution shift. In sum, our findings show that the task is challenging, but also indicate that our annotated dataset contains informative signals that models can learn from, paving the way for further research on explicitations.
Large language models (LLMs) appear successful in emulating compositional language, yet it remains unclear what these results entail about their underlying compositional semantic representations. The probing classifier paradigm has emerged as a tool to remedy this. This paper proposes to critically review the findings of 24 probing studies targeting a wide range of linguistic and semantic phenomena. It proposes a taxonomy of probing tasks based on the linguistic primitives they presuppose, distinguishing four tiers: lexical semantics, the syntax–semantics interface, propositional semantics, and discourse and pragmatics. A gradient in representational evidence emerges: LLMs robustly encode lexical information, display less consistent sensitivity to structural relations within sentences, and obtain unsatisfactory results on tasks requiring propositional content, speech acts, or pragmatic inference. The review underscores the need for a clearer theoretical grounding of what probing tasks measure and reflects on how probing can illuminate the compositional pathways available within current language models.
*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation
Quentin Lemesle | Leane Jourdan | Daisy Munson | Pierre Alain | Jonathan Chevelu | Arnaud Delhay | Damien Lolive
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over “Yes/No” answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
Under the Surface: Probing Tamil Paraphrase Intelligence
Viswadarshan R R | Dr. J. Felicia Lilian | Mahalakshmi S
We present a systematic study on paraphrase detection in Tamil by constructing a unified dataset through translation and semantic verification of three English benchmarks: QQP, PAWS, and MRPC. Unlike prior efforts that focus on individual sources or limited scales, our dataset combines multiple paraphrase detection paradigms and is evaluated using semantic similarity metrics, round-trip translation checks, and classifier agreement analysis. We fine-tune five multilingual transformer models (mBERT, XLM-R, IndicBERT, MuRIL, and DistilmBERT) and a Tamil-specific compact model, TLMR (Tamil Language Model - DeBERTa), pretrained on 525M Tamil tokens. Furthermore, we assess the representational quality of the sentence embeddings taken from these models using lightweight classifiers (SVM, XGBoost, and Logistic Regression). We formulate an efficiency-oriented metric that incorporates top-5 accuracy, vocabulary usage, and script fidelity in relation to perplexity in order to facilitate resource-aware evaluation. The experimental findings lay the groundwork for future Tamil semantic understanding tasks by highlighting differences in generalization and efficiency across models.
Annotating Indian Regional Biases using Large Language Models: Evaluation and Analysis
Debasmita Panda | Akash Anil | Neelesh Kumar Shukla
Social biases based on regional identity (or regional bias) are often observed in Indian contexts on major online social networks and require critical attention. However, due to large linguistic and cultural diversity, high annotation costs, and inherent human biases, very little annotated data exists on regional biases in the Indian context. Recently, Large Language Models (LLMs) have garnered attention for the automatic annotation of text. However, such annotation efforts are largely limited to English texts, and LLMs often perform poorly when applied to low-resource languages. Therefore, this paper focuses on understanding the capabilities and challenges of popular open-source LLMs in annotating Indian regional biases. We utilize the recently proposed IndRegBias dataset, which consists of Indian regionally biased social media comments in both English and code-mixed formats. First, we assess the annotation capabilities of LLMs in a zero-shot setting and critically analyze their performance across different writing styles, including code-mixing, transliteration, and English. We find that the majority of LLMs exhibit low agreement with human annotations (measured using Cohen’s kappa). Consequently, we extend our study by fine-tuning the models using 50% of the data and evaluating them on the remaining 50%. We observe a significant improvement in annotation agreement (kappa) for all the LLMs. To further assess the capabilities of the fine-tuned models, we evaluate them on 500 newly collected social media comments discussing regional issues in India. The results show that most fine-tuned LLMs outperform their zero-shot counterparts when annotating these new comments.
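The agreement measure used in the abstract, Cohen's kappa between LLM and human annotations, can be computed as in the sketch below. The formula is the standard one, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The function name and the toy labels are illustrative and not taken from the IndRegBias dataset.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(a)
    # Observed agreement: share of items given the same label by both.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        # Both annotators used one and the same label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


if __name__ == "__main__":
    human = ["bias", "bias", "none", "none"]
    model = ["bias", "none", "none", "none"]
    print(cohens_kappa(human, model))
```

In the toy example the annotators agree on 3 of 4 items (p_o = 0.75) while chance agreement is p_e = 0.5, so kappa is 0.5, noticeably lower than the raw agreement rate.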
The Tatoxa System for Text Detoxification in Low-Resource Languages: The Case of Tatar
Ilseyar Alimova | Bogdan Monogov | Artyom Mazur | Daniil Antonov | Vsevolod Karimov | Vitaliy Egorov | Bulat Khakimov | Alexander Panchenko
Text detoxification, the automated detection and mitigation of abusive and harmful content, is essential for ensuring the safety of online communities and protecting users. However, low resource languages such as Tatar have received little research attention. In this paper we present Tatoxa, a novel state-of-the-art system for text detoxification in the Tatar language. Comparative experiments show that the proposed approach outperforms existing open source and proprietary commercial LLMs on key quality metrics. We also introduce a new dataset for text detoxification in Tatar, designed for fine tuning and evaluation in low resource settings. Finally, cross lingual transfer experiments indicate that transfer from other languages, including the culturally close Russian, performs significantly worse than training on native Tatar data even when a large Russian corpus is available.
Updating bilingual dictionary entries is a tedious, time-consuming, and highly subjective task, especially when a new sense in the source language requires identifying an appropriate translation equivalent. To date, there have been no attempts to automate the discovery of new bilingual sense entries. Related tasks such as Word-level Bilingual Dictionary Induction and cross-lingual embedding alignment do not account for polysemy and are not applied to lexicographic data. In contrast to their monolingual counterparts, bilingual dictionaries fall short in terms of completeness, detail with respect to examples and glosses, and diachronic information. We introduce a novel NLP task, Sense-Level Bilingual Dictionary Induction (SenseBDI), at the intersection of lexical semantics, cross-lingual, and diachronic NLP. We construct a dataset of time-stamped sense-level bilingual dictionary entries by aligning two bilingual dictionaries, two monolingual dictionaries, and the multilingual resource BabelNet, thereby enriching bilingual entries with monolingual source-language information. We propose a baseline based on nearest-neighbor search over cross-lingual embeddings of glosses and usages. We find that usages contribute more strongly than glosses, with substantial variation across language pairs, and discuss task-specific challenges with regard to target-language polysemy and future directions such as transfer to real-world scenarios.
Beyond Accuracy: Interpreting Topic Representation in Suicide Ideation Detection Models
Hamideh Ghanadian | Isar Nejadgholi | Hussein Al Osman
Suicide ideation detection models are typically evaluated using aggregate performance metrics, yet little is known about how they internally represent psychologically meaningful risk factors. In high-stakes mental health applications, understanding these internal representations is essential for safety, transparency, and responsible deployment. In this work, we move beyond accuracy and analyze how suicide detection models trained on original and topic-augmented datasets encode psychological risk factors in their internal representation space. Using visualization and geometric analysis, we examine the coherence and separability of topic-related features. Our results show that topic-aware augmentation increases the clarity and distinctness of underrepresented psychosocial risk factors such as immigration, family issues, and financial crisis. These findings suggest that augmentation not only improves model performance but also leads to more structured and interpretable internal representations.
Stance detection seeks to determine whether a text expresses a position in favor of, against, or neutral toward a target. Despite advances in neural architectures, performance remains inconsistent across datasets. To better understand these disparities, we analyze over 75K samples from four benchmark datasets using six neural models, focusing on stylistic and pragmatic language features rather than architectures or external knowledge. We extract 43 features spanning lexical richness, syntactic complexity, affective tone, and hedging, and assess their impact through both Logistic Regression and SHAP analyses. Our findings reveal distinct stylistic profiles for each stance: favor is best detected when expressed concisely with minimal hedging; against when paired with strong negative emotions and greater lexical variety; and none when texts are lexically simple and emotionally neutral. Across classes, errors arise from excessive complexity, mixed emotional signals, and overuse of hedging. These results advance understanding of what drives success and failure in stance detection.
Stance detection identifies whether a text expresses support, opposition, or neutrality toward a target and is central to applications such as political analysis and misinformation monitoring. With the shift toward large language models (LLMs), stance classification increasingly relies on prompting and lightweight adaptation. Yet the generalization behavior of open-source LLMs across new targets and domains remains uneven. We conduct a large-scale diagnostic study of four open-source LLMs (3B–24B parameters), examining how model size, prompting strategies, and Low-Rank Adaptation (LoRA) interact across in-target, cross-target, and cross-domain settings. Across 912 experiments, three patterns emerge: (1) larger models improve prompting-based in-target performance, but this advantage diminishes after fine-tuning; (2) LoRA boosts in-target accuracy yet often harms cross-context transfer; (3) optimal prompting depends on model size. These results reveal a consistent tension between specialization and generalization, offering practical guidance for configuring LLM-based stance detection under transfer.
Can LLMs Solve My Grandma’s Riddle? Evaluating Multilingual Large Language Models on Reasoning Traditional Bangla Tricky Riddles
Nurul Labib Sayeedi | Md. Faiyaz Abdullah Sayeedi | Khushnur Binte Jahangir | Swakkhar Shatabda | Sarah Masud Preum
Large Language Models (LLMs) show impressive performance on many NLP benchmarks, yet their ability to reason in figurative, culturally grounded, and low-resource settings remains underexplored. We address this gap for Bangla by introducing BanglaRiddleEval, a benchmark of 1,244 traditional Bangla riddles instantiated across four tasks (4,976 riddle-task artifacts in total). Using an LLM-based pipeline, we generate Chain-of-Thought explanations, semantically coherent distractors, and fine-grained ambiguity annotations, and evaluate a diverse suite of open-source and closed-source models under different prompting strategies. Models achieve moderate semantic overlap on generative QA but low correctness, MCQ accuracy peaks at only about 56% versus an 83.3% human baseline, and ambiguity resolution ranges from roughly 26% to 68%, with high-quality explanations confined to the strongest models. These results show that current LLMs capture some cues needed for Bangla riddle reasoning but remain far from human-level performance, establishing BanglaRiddleEval as a challenging new benchmark for low-resource figurative reasoning. All data, code, and evaluation scripts are available on GitHub: https://anonymous.4open.science/r/BanglaRiddleEval.
"Undocumented Immigrants" != "Illegal Aliens": Decomposing the Conceptual and Narrative Landscapes of Partisan Immigration Terms
Yejin Cho | Gabriella Chronis | Nitin Sudarsanam | Kevin Barcenas-Martinez | Katrin Erk
Do politically charged terms with similar referents, like "undocumented immigrants" (UI) and "illegal aliens" (IA), differ only in who uses them, or also in what they mean? We investigate usage patterns by projecting contextual embeddings into interpretable psycholinguistic feature space, and extracting narrative scenes with LLMs. We find that in partisan news, the term IA appears in contexts emphasizing causation and fear. UI appears in contexts emphasizing consequences experienced and shared humanity. Scene abstraction reveals parallel patterns: IA is embedded in narratives of criminality and threat, UI in narratives of vulnerability and governance. Beyond indexing speaker identity, these terms impart different construals on migrants: as agents of harm versus patients of circumstance. This dual-track methodology adds new tools to the growing body of computational approaches for understanding the conceptual framing of politically charged topics.
Can a Remedy Find a Researcher? Exploring the Development of Semantic Knowledge in Italian BabyLMs
Alice Suozzi | Luca Capone | Gianluca Lebani | Alessandro Lenci
A large body of research has examined the linguistic abilities of language models (LMs) across various languages. However, conclusive evidence regarding their semantic competence and world knowledge remains limited, especially for low-resource languages. In this study, we explore the semantic competence of Italian BabyLMs, focusing on their sensitivity to semantic violations. To this end, we adapt a minimal pair benchmark targeting semantic violations to evaluate the semantic abilities of BAMBI, a family of small-scale models trained on progressively larger and more complex datasets. We further compare their performance, assessed through accuracy, mean log-likelihood offset, and expected calibration error, with that of three larger Italian LMs. Our findings shed light on this aspect of semantic competence in small-scale models and how this is affected by data scale and training strategies.
Frame In, Frame Out: Measuring Framing Bias in LLM-Generated News Summaries
Valeria Pastorino | Nafise Sadat Moosavi
News headlines and summaries shape how events are interpreted through selective emphasis and omission, a phenomenon commonly referred to as framing. Large language models are now routinely used to generate such content, yet existing evaluation frameworks largely overlook this dimension. We introduce Frame In, Frame Out (FIFO), the first large-scale benchmark for measuring framing presence in LLM-generated news summaries, grounded in the widely used XSum dataset. FIFO combines 15,499 jury-annotated examples with 320 expert-labeled instances (𝜅 = 0.61) to validate and calibrate model-based annotations. Using FIFO, we analyze measured framing rates across 27 summarization models. We find that LLM-generated summaries often exhibit higher calibrated framing rates than human-written references, with substantial variation across topics and training regimes, including elevated rates in scientific and public health summaries. Our results establish framing as an underexplored and consequential dimension of summarization quality.
Mitigating Language Bias in Multilingual Sentence Embeddings for Cross-Lingual Similarity Estimation
Kanade Nonomura | Keita Fukushima | Risa Kondo | Tomoyuki Kajiwara
We disentangle multilingual sentence embeddings into language-dependent and language-agnostic components, leveraging the latter to improve cross-lingual similarity estimation. Previous studies on this approach have trained disentanglers by combining intra-component constraints, which either align or disalign language-dependent embeddings or language-agnostic embeddings, with inter-component constraints across both embeddings. However, when and how these constraints are effective remains unclear. Our experiments on sentence similarity estimation and machine translation quality estimation revealed that while intra-component constraints and the combination of both constraints are effective for encoder-based multilingual sentence embeddings, inter-component constraints are effective for decoder-based ones. Furthermore, our detailed analysis revealed distinct roles: intra-component constraints improve uniformity within the embedding space, while inter-component constraints enhance cross-lingual alignment between parallel sentences.
Masked language modeling has become a standard pretraining objective for training encoder-based language models. In this approach, certain tokens in the input are masked, and the model learns to predict them using the surrounding context. This process enables the model to capture both syntactic and semantic properties of language. Conventionally, the tokens selected for masking are chosen at random, which may not always yield the most effective learning signals. In this work, we examine a token masking strategy based on entropy distribution. We use the model’s entropy over token predictions to identify which tokens should be masked. This method aims to target tokens that are more informative and uncertain to improve the training efficacy. We also propose a novel self-masking approach that enhances training efficiency without relying on an external reference model. Experimental results demonstrate that our method achieves an average performance improvement of 5% in GLUE scores compared to the baseline. Further, we experiment with combining knowledge distillation with entropy masking, resulting in the best overall results.
Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?
Luca Modica | Filip Landin | Mehrdad Farahani | Livia Qian | Gabriel Skantze | Richard Johansson
In recent years, several Speech Language Models (SLMs) that represent speech and written text jointly have been presented. This raises the question of how model-internal mechanisms are similar and different when operating in the two modalities. We focus on how these systems encode, store, and retrieve factual knowledge, which has previously been investigated for text-only models. To investigate mechanisms behind the storage and recall of factual associations in SLMs, we leverage Causal Mediation Analysis, a technique previously applied to text-based models. Initial results using SpiritLM, a multimodal model integrating discrete speech tokens, reveal discrepancies between text-to-text and speech-to-text results, suggesting that the emergent mechanisms for factual recall are only partially carried over from the text to the speech modality. These results advance our understanding of how internal mechanisms encode factual associations in SLMs while contributing insights for improving speech-enabled AI systems.
Interactive Agents: Simulating Counselor-Client Psychological Counseling via Role-Playing LLM-to-LLM Interactions
Huachuan Qiu | Zhenzhong Lan
Creating effective dialogue systems for mental health support requires high-quality multi-turn counseling dialogue data, yet collecting real counselor-client conversations presents significant challenges, including privacy concerns, high costs, and limited scalability. We present Interactive Agents, a novel framework that simulates naturalistic counseling dialogues through controlled LLM-to-LLM interactions. The framework introduces two key innovations: (1) a personalized client agent that maintains consistent psychological characteristics throughout a session, and (2) a counselor agent that implements a theoretically grounded three-stage therapeutic model comprising the exploration, insight, and action phases. Through rigorous evaluation using both automatic metrics and professional-counselor assessments based on the Working Alliance Inventory, we demonstrate that our framework generates therapeutically valid dialogues that are comparable in quality to human-generated sessions. Models fine-tuned on our proposed synthetic dataset (SimPsyDial) achieve state-of-the-art performance in a standard pairwise chatbot-arena evaluation of LLM-based counselors. Our framework provides a scalable, privacy-preserving method for generating high-quality counseling dialogue data while maintaining professional therapeutic standards.
ZIP: Quantifying Which Words Matter in Zero-Shot Instructional Prompts
Nikta Gohari Sadr | Sangmitra Madhusudan | Arash Asgari | Hassan Sajjad | Laleh Seyyed-Kalantari | Ali Emami
While zero-shot instructional prompts like "Let’s think step-by-step" have revolutionized Large Language Model performance, we lack systematic understanding of why: which specific words drive their effectiveness, and how do these patterns vary across tasks and models? We introduce the ZIP score (Zero-shot Importance of Perturbation), a metric that quantifies individual word importance through controlled, semantically meaningful perturbations. To enable rigorous evaluation, we also introduce the first ground-truth benchmark for prompt interpretability, a set of validation prompts with predetermined keywords where ZIP achieves 95.8% accuracy compared to 65.8% for LIME. Analyzing six flagship models across seven prompts and multiple task domains, we find that word importance is task-dependent ("step-by-step" dominates mathematical reasoning; "think" matters more for common-sense tasks), varies systematically across model families, and correlates inversely with model performance, suggesting prompts have greatest impact on tasks where models struggle. Our findings advance prompt science, providing both practical guidance for prompt engineering and theoretical understanding of how instructional language shapes model behavior.
Language-model (LM) surprisal is widely used as a proxy for contextual predictability and has been reported to correlate with metaphor novelty judgments. However, surprisal is tightly intertwined with lexical frequency. We explore this interaction on metaphor novelty ratings using two different word frequency measures. We analyse surprisal estimates from eight Pythia model sizes and 154 training checkpoints. Across settings, word frequency is a stronger predictor of metaphor novelty than surprisal. Across training stages, the surprisal–novelty association peaks at an early stage and then falls again, mirroring a similarly timed increase in the surprisal–frequency association. These results suggest that the often-reported optimal LM surprisal settings may incorrectly associate contextual predictability with metaphor novelty and processing difficulty, whereas lexical frequency may be the major underlying factor.
Text embedding models are designed for sentence-level applications like retrieval and semantic similarity, and are primarily evaluated on sentence-level benchmarks. Their behavior on isolated words is less understood. We show that simply prepending semantic prompts to words before embedding substantially improves word similarity correlations. Testing 7 text embedding models, including text-embedding-3-large (OpenAI), embed-english-v3.0 (Cohere), voyage-3 (Voyage AI), all-mpnet-base-v2, and Qwen3-Embedding-8B, on 3 standard benchmarks (SimLex-999, WordSim-353, MEN-3000), we find that prompts like "meaning: word" or "Represent the semantic concept: word" improve Spearman correlations by up to +0.28 on SimLex-999. Some models fail completely on bare words (ρ ≈ 0) but recover with prompts (+0.73 improvement). Our best results achieve ρ=0.692 on SimLex-999 with embed-english-v3.0 (Cohere), ρ=0.811 on WordSim-353, and ρ=0.855 on MEN-3000 with text-embedding-3-large (OpenAI). These results outperform classic static embeddings like Word2Vec (ρ=0.40) and even the best static method LexVec (ρ=0.48) on SimLex-999, establishing a new state-of-the-art for pure embedding methods. This zero-shot technique requires no training and works with any text embedding model.
HistoryBankQA: Multilingual Temporal Question Answering on Historical Events
Biswadip Mandal | Anant Khandelwal | Manish Gupta
Temporal reasoning over historical events is vital for temporal NLP tasks such as event extraction, entity linking, question answering (QA), timeline summarization, event clustering, and natural language inference. However, benchmarks for evaluating large language models (LLMs) on temporal reasoning remain limited. Existing datasets are small, lack multilingual coverage, and focus on recent events. To address this, we introduce HistoryBank, a multilingual database of 10M+ historical events sourced from Wikipedia timelines and infoboxes. Our database provides unprecedented coverage in both historical depth and linguistic breadth with 10 languages. We also present a comprehensive benchmark covering 6 temporal QA tasks across all languages, evaluating models like LLaMA-3-8B, Mistral-7B, Gemma-2-9B, Qwen3-8B, and GPT-4o. GPT-4o consistently performs best; Gemma-2 leads among smaller models. Our work offers a rich resource for advancing multilingual, temporally-aware language understanding of historical events. To support further research, we publicly release our code and datasets. Code is available at https://github.com/mandalbiswadip/history-bank and data at https://drive.google.com/drive/folders/1vHudioDdI3EeYPbhYjKa0gimxaXvpxB2.
EvalMORAAL: Interpretable Chain-of-Thought and LLM-as-Judge Evaluation for Moral Alignment in Large Language Models
Hadi Mohammadi | Anastasia Giachanou | Robert A. Bagheri
We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson’s r ≈ 0.90 on WVS). Yet we find a clear regional difference: Western regions average r=0.82 while non-Western regions average r=0.61 (a 0.21 absolute gap), indicating a persistent regional alignment gap. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured CoT protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to WVS survey alignment (r=0.74, p<.001; PEW r=0.39, n.s.), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.
Toxic Subword Pruning for Dialogue Response Generation on Large Language Models
Hongyuan Adam Lu | Wai Lam
Preventing (possibly) toxic large language models (LLMs) from generating toxic content is an important research area. Yet most research has focused on defending safe models against jailbreaks or toxic prompts. Such defenses could fail on already-toxic models, whether created unintentionally by individual developers or arising when attackers have access to the model weights. We thus propose a simple yet effective and novel algorithm, Toxic Subword Pruning (ToxPrune), which prunes the subwords contained in toxic words from the BPE vocabulary of trained LLMs. In contrast to previous work that shows pruning BPE tokens to be harmful for machine translation, we surprisingly find it useful for preventing LLMs from generating toxic content. Our method has unique advantages. First, our findings suggest that ToxPrune simultaneously improves the toxic language model NSFW-3B on dialogue response generation. Second, ToxPrune also improves the official Llama-3.1-6B on the metric of diversity. Extensive automatic results and human evaluation indicate that ToxPrune could be helpful for both remediating toxic LLMs and improving non-toxic LLMs on the task of dialogue response generation.
Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models
Hongyuan Lu | Z.L. L | Wai Lam
There are two shortcomings in the current Large Language Model (LLM) era. The first is a shortage of multilingual models: most LLMs are English-centric, and their performance on multilingual reasoning is limited. The second concerns where external knowledge is placed: most retrieved knowledge is prepended to the user query, which may be sub-optimal. This paper presents a novel and simple yet effective method called Dictionary Insertion Prompting (DIP). Given a non-English prompt, DIP looks up a word dictionary and inserts the words’ English counterparts into the middle of the prompt for LLMs. This enables better translation into English and better English reasoning steps, which leads to clearly better results. We experiment with 10 to 200 languages from FLORES-200. Since there are no adequate datasets, we use the NLLB translator to create synthetic multilingual benchmarks from 4 existing English reasoning benchmarks such as GSM8K and AQuA. The synthetic benchmarks are translated back into English for quality assurance with manual annotation. Interestingly, the place where the dictionary is injected is an important factor in the performance gains: we found that interleaving the dictionary with the original words performs better than prepending or appending the same dictionary.
Proceedings of the 1st Workshop on Stereotypes Across Cultures in Language Technologies (StereACuLT 2026)
Weicheng Ma | Soroush Vosoughi | Nabeel Gillani | Rolando Coto-Solano
CrowS-Pairs-NL: A Benchmark to Evaluate Dutch Stereotype Bias in LLMs
Jens van der Weide | Dong Nguyen | Marianne Schaaphok | Roos M. Bakker
Bias benchmarks for LLMs largely focus on English, overlooking language- and culture-specific stereotypes. We introduce CrowS-Pairs-NL, a Dutch stereotype benchmark built by filtering, translating, and adapting the English CrowS-Pairs dataset to address known conceptual pitfalls, and extending it with newly crowdsourced Dutch sentence pairs. We evaluate six multilingual and Dutch-trained models using both a pseudo-log-likelihood metric adapted for autoregressive models and a prompt-based metric with three template variants. Models explicitly trained on Dutch data consistently exhibit higher stereotyping scores, suggesting that language-specific fine-tuning introduces language-specific bias. The two metrics broadly agree on model rankings but differ in sensitivity, with the prompt metric showing a narrower range of scores. Our benchmark and findings underscore the need for culturally grounded bias evaluation beyond English.
Lost in Translation: Cross-Cultural Bias in LLM-Assisted Medical Symptom Interpretation
Yuting Tian | Salar Khaleghzadegan | Benjamin Huh | Yash Raj | Gena Heng
Large language models (LLMs) are increasingly used to convert patient language into clinical-style summaries, yet patient symptom descriptions may vary across linguistic, cultural, and cross-linguistic contexts. In this pilot study, we operationalize this variation using four expression styles: direct English, indirect English, culturally mediated English, and Chinese-original patient language. We propose a compact red-teaming framework for testing whether LLM-based symptom interpretation changes when the same underlying concern is expressed in different linguistic and cultural forms. Our pilot dataset contains eight symptom scenarios, each expressed in four styles, yielding 32 vignettes before prompt variation. We evaluate GPT-5 mini as a pilot case-study model under generic and culture-aware prompts, repeating the full evaluation three times to produce 192 model outputs. Reference labels and a stratified subset of model output annotations were reviewed for face validity by an independent reviewer with clinical training. The model usually preserves broad symptom categories, but subtle failure modes emerge. Culture-aware prompting reduces severity downgrades from 14.6% to 9.4% and ambiguity-flagging failures from 28.1% to 13.5%, but does not reduce interpretation inconsistency or clinical category shift, both of which remain at 6.2%. Indirect English shows the highest severity-downgrade and flagging-failure rates, while Chinese-original expressions are often interpreted with the correct broad category but are not consistently flagged as ambiguous. These findings suggest that medical LLM evaluation should assess cultural robustness, severity framing, ambiguity preservation, and human-review escalation in addition to factual accuracy.
Exploratory As-Analyzed No-Detection of Culturally-Marked Predicate-Triggered PII Amplification in a Synthetic-English RAG Probe: A Predicate-Resource-Confounded Audit
Yanhang Li | Zhichao Fan | Zexin Zhuang
We ask whether stereotype-loaded queries about culturally marked people leak more personal information from a retrieval-augmented generation (RAG) system than otherwise equivalent neutral queries. We pre-register a four-culture audit covering en-Anglo, es-LATAM, Arabic, and Hindi probes on a synthetic English PII corpus, comparing five paired query arms via the Stereotype-Trigger Leakage Delta (STLD). The locked confirmatory estimator was not run, so all reported tests are exploratory or sensitivity analyses, with deviations documented. We also identify a prompt-echo confound in the name-leakage metric: the model often re-emits the queried name, inflating apparent leakage without retrieval extraction. On cleaner non-name channels (email, phone, SSN-like identifier, and address), we find no stereotype-driven amplification for any culture after multiple-comparison correction. One name-included es-LATAM cell is significant in the negative direction, but matched-arm decomposition and an expanded culture-neutral control sensitivity suggest a high-leak control-predicate sampling artifact rather than a stereotype-treatment effect. Because the study is powered only for mid-sized effects and the culturally marked probe bank mixes stereotype content with cultural markers and heritage practices, we interpret the results as no detection, not evidence of no effect, of culturally marked predicate-triggered PII amplification under this synthetic-English RAG setting. The paper contributes a preregistered stereotype-as-privacy-side-channel test, diagnoses prompt-echo and predicate-resource confounds, and outlines release of the synthetic corpus, predicate bank, query generator, audit scripts, and analysis code upon acceptance.
Controlling Cross-Lingual Answer Distributions in Language Models: Enabling Transfer of Factual Preferences
Lukas Ellinger | Alexander Manev | Georg Groh
Multilingual large language models exhibit systematic differences in their outputs across languages, even when representing the same underlying knowledge. Prior work has primarily focused on evaluating or reducing such inconsistencies. In this work, we instead study whether cross-lingual behavior can be controlled: specifically, whether answer distributions associated with other languages can be expressed under English prompting. To this end, we construct a human-annotated factual dataset and a cultural scenarios dataset, and compare intervention methods including persona prompting, activation steering, and preference-based fine-tuning. We evaluate how these methods affect answer distributions and their generalization to culturally grounded settings. Our results show that answer distributions can be systematically shifted toward those observed in other languages, with persona prompting consistently outperforming more complex intervention methods.
Counterfactual Auditing of Cross-Cultural Variation in LLM-Generated Medical Advice
Hyunwoo Yoo | Gail Rosen
Large language models (LLMs) are increasingly explored for patient-facing medical advice and symptom triage, yet their responses may shift when identical clinical evidence is paired with culturally marked patient descriptors. We present a counterfactual audit framework for evaluating cross-cultural variation in LLM-generated medical advice by isolating identity-related cues while holding clinical evidence constant. Our evaluation uses matched clinical vignettes, cross-regional and culturally marked prompt variants, repeated sampling, and structured comparison of urgency framing, safety recommendations, empathy, and escalation advice. Across multiple commercial and open-weight LLMs, we observe measurable identity-conditioned variation in both triage decisions and interactional framing. In several cases, culturally marked descriptors shift urgency assessments or escalation recommendations despite unchanged clinical evidence. While the magnitude and direction of these effects differ across models, the results suggest that LLM-generated medical advice remains sensitive to culturally linked identity cues in ways that may affect safety-critical guidance. Our results demonstrate how culturally grounded counterfactual auditing can help identify clinically unsupported variation while distinguishing potentially harmful shifts from appropriate communication adaptation in patient-facing medical advice.
Stereotyped by Silence: How LLMs Erase Northeast Indian Languages Through Omission and Orthographic Corruption
Badal Nyalang
Large language models (LLMs) perpetuate cultural stereotypes not only through biased associations but through systematic omission and orthographic erasure of underrepresented languages. We present empirical evidence of two compounding failure modes affecting Northeast Indian languages: (1) entity-level invisibility, where state-of-the-art NER systems score F1=0.000 on culturally critical named entities such as Khasi surnames, Garo festivals, and tribal names; and (2) orthographic corruption, where LLM tokenizers corrupt semantically meaningful diacritics (ï, ñ) and the Garo morpheme boundary marker (U+00B7) at rates of 18.8–50% across four of five evaluated models. Drawing on NortheastNER (F1=0.964, six entity categories, XLM-RoBERTa-base) and a systematic tokenization study across Khasi and Garo, we argue that stereotype-by-omission constitutes a distinct and measurable harm to indigenous language communities. We further show that a custom multilingual tokenizer achieves 26–50% token reduction over five baseline LLMs, demonstrating that culturally grounded infrastructure can partially remediate these failures. Our findings call for cultural representation audits as a standard component of multilingual NLP evaluation.
Whose Pragmatics? Cultural Grounding as a Bottleneck for Stereotype Detection in Egyptian Arabic Social Media
Samar A. Assem
Stereotype detection benchmarks assume that stereotyping occurs through what is said — via lexical co-occurrence between demographic terms and stereotypical attributes. We argue that stereotyping is often conveyed by what is meant: through presupposition, implicature, and speech-act framing that leave surface content unchanged while embedding prejudice in the pragmatic layer. We call this phenomenon pragmatic stereotyping. Evaluating GPT-4 and Claude 3.5 Sonnet on a stratified sample of 500 Egyptian Arabic social media comments annotated with a seven-tag sentiment/(im)politeness taxonomy, we find that cultural grounding is the critical bottleneck in detecting pragmatic stereotyping in non-English discourse. About 35% of LLM errors result from cultural grounding gaps, leading to a 15-percentage-point F1 difference between explicit tags (0.81) and implicit tags (0.66). These failures are bidirectional: on the author side, LLMs under-detect prejudice encoded through concessive presupposition and backhanded compliments; on the model side, LLMs apply English-based pragmatic assumptions, misinterpreting genuine polite criticism as sarcasm and positive-intended impoliteness as conflictive. Our five-layer Chain-of-Thought diagnostic framework localizes these failures to the culture-dependent inference layers. These results extend stereotype evaluation beyond lexical benchmarks and have direct implications for content moderation pipelines serving Arabic-speaking communities.
Measuring Semantic Flow Without Direction: A Rhizomatic Protocol for Stereotype Translation in Cross-Cultural Language Technology
Gustavo Aviña Cerecer
We present an open-source measurement protocol for stereotype interpretation that quantifies how users translate or interpret provocative discourse without assuming a normative direction. Building on Deleuze and Guattari’s rhizomatic framework, we operationalize three modes of semantic movement, Reaffirm, De-signify, and Escape (RDE), through an abstract-machine operator detector that combines transparent linguistic patterns (526 patterns across 8 languages) with optional contextual embeddings. The protocol is direction-agnostic: it measures equally well a user who reproduces their own semantic territory and one who departs from it, capturing diasporic, assimilationist, and escape trajectories that English-centric, Chomskyan-hierarchical taxonomies obscure. We demonstrate the protocol on five extreme user profiles (Russian conservative, Russian diaspora, trans Russian exile, Mexican malinchista, Mapuche speaker), each producing coherent and distinct RDE signatures. Deployed in a free-tier web service, the protocol enables both individual reflective use and corporate calibration of tolerable territoriality ranges for personnel engaged in intercultural translation and interpretation tasks.
Signals Are Not States: Neuro-Symbolic Safeguards for Culturally Aware Classroom AI
Sina Bagheri Nezhad
Classroom AI systems increasingly infer high-level educational states such as engagement, confusion, collaboration, participation, and instructional quality from multimodal and linguistic signals. In multicultural and multilingual classrooms, such inferences can translate culturally situated behavior into stereotyped claims: silence may be read as disengagement, gaze aversion as inattention, code-switching as low proficiency, or indirect help-seeking as confusion. We argue that stereotype-aware classroom AI should separate observable evidence from culturally loaded interpretation and should treat unsupported construct-level claims as safety risks. We introduce NSCR, a culturally grounded neuro-symbolic framework that converts video, audio, ASR, lesson artifacts, and contextual metadata into typed facts with uncertainty, provenance, and cultural scope, then composes them through executable reasoning and policy constraints. We define a taxonomy of stereotype-prone classroom inferences and propose a benchmark agenda covering culture-conditioned state inference, evidence-grounded claim verification, multilingual and code-switched reasoning, collaboration analysis, counterfactual cultural robustness, and culture-conditioned red-teaming. We further specify metrics for stereotype leakage, unsupported attribution, cultural calibration gaps, abstention under cultural ambiguity, and evidence faithfulness. The contribution is methodological: a concrete framework and evaluation agenda for mitigating stereotyped reasoning in classroom AI, with education as a high-stakes, culturally variable deployment setting.
AmchiBias: Measuring Stereotypical Bias in Goan Identity Groups with a Minimal Pair Dataset in English and Konkani
Michelle Barbosa | Sebastian Padó | Franziska Weeber
Socio-cultural stereotypical bias is an important consideration in the development and deployment of NLP systems. It is, however, often considered only at the national level, despite rich subnational socio-cultural structures. We present AmchiBias, the first benchmark for measuring socio-cultural stereotypical bias for the Indian state of Goa with its unique historically multicultural setting. It covers various Goan identity groups and comprises 313 minimal pairs across eight sociodemographic dimensions in both English and Devanagari Konkani. We then evaluate stereotypical bias in five multilingual encoder models on this benchmark. We find near-chance scores in Konkani, reflecting language incompetence for general multilingual models and a lack of Goan cultural competence for Indian language models. Queried in English, models with a stronger Indian language coverage show higher bias for pan-Indian groups than hyperlocal Goan groups. This suggests the English signal reflects pan-Indian pretraining associations rather than genuine Goan cultural knowledge. Our findings highlight a critical gap in low-resource multilingual NLP evaluation for hyperlocal community identities.
Translation Is Not Representation: English-Hub Routing in Cross-Lingual Bias Benchmarks
Hak Hyun Kim | Benjamin Huh
Cross-lingual bias benchmarks such as JBBQ and KoBBQ translate English bias probes and compare scores across languages, assuming the translated probe measures the same construct. We test this assumption at the representation and behavioral levels using 13B-parameter models matched on architecture but differing in language-training regime. A multi-anchor logit lens shows that an English-centric model (Llama 2) processes Japanese and Korean inputs predominantly through English-script predictions in its middle layers, even where Centered Kernel Alignment (CKA) between languages is high: geometric convergence masks English-hub routing. Matched continual-adaptation comparisons show that target-language adaptation reduces this English-script mass: from 0.77 to 0.56 after Japanese adaptation (Swallow), and from 0.78 to 0.71 after Korean adaptation (koen), while balanced bilingual pretraining (LLM-jp) lowers it further to 0.19. Behaviorally, every model is more stereotype-biased in English than in Japanese, with gaps from 0.13 to 0.14, but this asymmetry is language-specific: in Korean it is weak and disappears after Korean adaptation, with Korean nearly as stereotype-leaning as English. Yet patching English hub states into target-language processing does not transplant this bias. Cross-lingual bias scores thus reflect genuine language-specific behavior, not an English-pivot artifact, even though the underlying representations are not comparable. We distill this dissociation between representation and behavior into a four-step audit protocol for translated bias benchmarks.
IndicSteer: Inference-Time Safety Steering for Indic LLMs
Ruhaib Muhammad | Saahas Vijayalakshmi Rajaram | Suriya Priyan Durairaj
Safety controls for Indic language generation must account for multilingual variation and culturally grounded harm categories that are underrepresented in English-centric resources. We present IndicSteer, an initial study of inference-time activation steering for safety across 8 harm categories and 9 Indic language settings, based on contrastive directions computed from safe/unsafe response pairs. To the best of our knowledge, this is the first application of Contrastive Activation Addition (CAA) to Indic LLMs. Evaluation uses a structured LLM-as-a-judge protocol with strict isolation by category and alpha, covering ≈12,960 prompt-response pairs. We report harmful-response and coherence metrics for Sarvam-1 and OpenHathi (Hindi track), and present cross-lingual representation structure via linear CKA for Sarvam-1 and Krutrim-2-Instruct. On matched slices, Sarvam-1 at 𝛼=12 reduces harmful rate from 73.47% to 41.34% (32.13 pp; 43.73% relative) with no additional retraining. For OpenHathi Hindi, harmful rate falls monotonically from 85.83% (baseline) to 27.13% at 𝛼=15, a 58.71 pp total reduction.
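The contrastive-direction idea described in this abstract can be sketched as follows. This is a minimal illustration of the general Contrastive Activation Addition recipe (mean activation difference between safe and unsafe responses, added at inference with a multiplier alpha), not the authors' implementation. The function names, the toy two-dimensional activations, and the choice to steer a single hidden-state vector are all assumptions made for illustration; the layer, token position, and any normalization used in the paper are not stated in the abstract.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(safe_acts, unsafe_acts):
    """Contrastive direction: mean(safe activations) - mean(unsafe activations)."""
    safe_mean = mean_vector(safe_acts)
    unsafe_mean = mean_vector(unsafe_acts)
    return [s - u for s, u in zip(safe_mean, unsafe_mean)]


def steer(hidden, direction, alpha):
    """Add the scaled direction to one hidden-state vector at inference time."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical values, two dimensions for readability).
safe = [[1.0, 0.0], [3.0, 0.0]]
unsafe = [[0.0, 1.0], [0.0, 3.0]]
direction = steering_direction(safe, unsafe)
steered = steer([0.0, 0.0], direction, alpha=12)
```

Because the direction is computed once from response pairs and only added at inference, no model weights change, which is consistent with the abstract's "no additional retraining" claim.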
Proceedings of the First Workshop on Structured Understanding, Retrieval, and Generation in the LLM Era (SURGeLLM 2026)
Vivek Gupta | Kaize Ding | Harsha Kokel | Yue Zhao | Amit Agarwal | Yu Wang | Michael Glass | Yu Zhang | Kavitha Srinivas | Xiusi Chen | Oktie Hassanzadeh | Qi Zhu | Shuaichen Chang | Yuan Luo
UNJOIN: Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification
Poojah Ganesan | Rajat Aayush Jha | Dan Roth | Vivek Gupta
Recent advances in large language models (LLMs) have greatly improved Text-to-SQL performance for single-table queries. However, it remains challenging in multi-table databases due to complex schemas and relational operations. Existing methods often struggle with retrieving the right tables and columns, generating accurate JOINs and UNIONs, and generalizing across diverse schemas. To address these issues, we introduce UNJOIN, a two-stage framework that decouples the retrieval of schema elements from SQL logic generation. In the first stage, we merge the column names of all tables in the database into a single-table representation by prefixing each column with its table name. This allows the model to focus purely on accurate retrieval without being distracted by the need to write complex SQL logic. In the second stage, the SQL query is generated on this simplified schema and mapped back to the original schema by reconstructing JOINs, UNIONs, and relational logic. Evaluations on SPIDER and BIRD datasets show that UNJOIN matches or exceeds the state-of-the-art baselines. UNJOIN uses only schema information, which does not require data access or fine-tuning, making it scalable and adaptable across databases. Our code is available at: https://github.com/coral-lab-asu/unjoin
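The first-stage schema simplification can be illustrated with a short sketch. It assumes a schema given as a dictionary from table name to column names and a `.` separator between table and column; both the data layout and the helper names `flatten_schema` and `tables_used` are hypothetical, and the released code may represent schemas differently.

```python
def flatten_schema(schema, sep="."):
    """Merge all tables into one flat column list, prefixing each column
    with its table name (the separator is an assumption)."""
    return [f"{table}{sep}{col}" for table, cols in schema.items() for col in cols]


def tables_used(flat_columns, sep="."):
    """Recover which original tables a set of flat columns refers to,
    as a first step toward reconstructing JOINs in the second stage."""
    seen = []
    for name in flat_columns:
        table = name.split(sep, 1)[0]
        if table not in seen:
            seen.append(table)
    return seen


schema = {
    "students": ["id", "name"],
    "enrollments": ["student_id", "course"],
}
flat = flatten_schema(schema)
# flat == ["students.id", "students.name",
#          "enrollments.student_id", "enrollments.course"]
```

Keeping the table name inside every flat column name is what lets a later step map a query over the single-table view back to the original multi-table schema.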
The Mighty ToRR: A Benchmark for Table Reasoning and Robustness in LLMs
Shir Ashury-Tahan | Yifan Mai | Rajmohan C | Ariel Gera | Yotam Perlitz | Asaf Yehudai | Elron Bandel | Leshem Choshen | Eyal Shnarch | Percy Liang | Michal Shmueli-Scheuer
Despite its real-world significance, model performance on tabular data remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness, measuring model performance and robustness on table-related tasks. The benchmark includes 10 datasets that cover different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings, and is designed to reflect whether models can handle tabular data consistently and robustly, across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models over ToRR. Our results reveal a striking pattern of brittle model behavior, where even strong models are unable to perform robustly on tabular data tasks. We further find that no single table format consistently yields superior performance. However, evaluating models across multiple formats is essential for a reliable assessment of their capabilities. Moreover, we show that the reliability boost from testing multiple prompts can be equivalent to adding more test examples. Overall, our findings show that reasoning over table tasks remains a significant challenge. The leaderboard, data and code are publicly available.
Ontology-Free General-Domain Knowledge Graph-to-Text Generation Dataset Synthesis using Large Language Model
Daehui Kim | Deokhyung Kang | Sangwon Ryu | Gary Lee
Knowledge Graph-to-Text (G2T) generation involves verbalizing structured knowledge graphs into natural language text. Recent advancements in Pretrained Language Models (PLMs) have improved G2T performance, but their effectiveness relies on datasets with precise graph-text alignment. However, the scarcity of high-quality, general-domain G2T generation datasets restricts progress in general-domain G2T generation research. To address this issue, we introduce the Wikipedia Ontology-Free Graph-text dataset (WikiOFGraph), a new large-scale G2T dataset generated using a novel method that leverages Large Language Models (LLMs) and Data-QuestEval. Our dataset, which contains 5.85M general-domain graph-text pairs, offers high graph-text consistency without reliance on external ontologies. Experimental results demonstrate that PLMs fine-tuned on WikiOFGraph outperform those trained on other datasets across various evaluation metrics. Our method proves to be a scalable and effective solution for generating high-quality G2T data, significantly advancing the field of G2T generation.
Output-Space Search: Targeting LLM Generations in a Frozen Encoder-Defined Output Space
Tobias Materzok
We introduce Output-Space Search (OS-Search), which turns LLM generation into endpoint search. An outer loop selects a target z* in a frozen encoder-defined 3D output space Z, and a retrieval-grounded policy trained with sequence-level RL generates outputs whose coordinates land near z* under standard autoregressive decoding. This enables parallel sweeps and black-box optimization in Z without path-dependent token/program search. On stories, sweeping Z (text) yields 3.1x higher LLM-scored diversity than prompt-chaining. On code, Bayesian optimization over Z (code) improves an objective withheld from the controller under matched inference budgets while preserving validity.
TreeDiff: AST-Guided Code Generation with Diffusion LLMs
Yiming Zeng | Jinghan Cao | Zexin Li | Yiming Chen | Tao Ren | Zhuochun Li | Dawei Xiang | Xidong Wu | Shangqian Gao | Tingting Yu
Code generation is increasingly critical for real-world applications. Still, diffusion-based large language models continue to struggle with this demand. Unlike free-form text, code requires syntactic precision; even minor structural inconsistencies can render a program non-executable. Existing diffusion-based large language models rely on random token masking for corruption, leading to two key failures: they lack awareness of syntactic boundaries during the iterative denoising process, and they fail to capture the long-range hierarchical dependencies essential for program correctness. We propose TreeDiff to address both issues. Specifically, we propose a syntax-aware diffusion framework that incorporates structural priors from the Abstract Syntax Tree (AST) into the corruption process. Instead of masking individual tokens at random, we selectively mask tokens belonging to key AST nodes. By aligning the corruption process with the underlying structure of code, our method encourages the model to internalize the compositional nature of programming languages, enabling it to reconstruct programs that respect grammatical boundaries and capture long-range dependencies. Our method achieves a 13.3% relative improvement over the random masking training method, demonstrating its effectiveness in code generation tasks by leveraging underlying structures.
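To make the contrast with random token masking concrete, the sketch below masks whole source spans that belong to selected AST node types, using Python's standard `ast` module. It is an illustration of the general idea only: the choice of `ast.Return` as a "key" node, the `[MASK]` string, and the character-level (rather than tokenizer-level) masking are assumptions, and the paper's actual node selection and corruption schedule are not described in the abstract.

```python
import ast


def mask_ast_nodes(source, node_types=(ast.Return,), mask="[MASK]"):
    """Replace the source span of every single-line node of the given types.

    Assumes ASCII source (ast column offsets are UTF-8 byte offsets) and
    node types that do not nest inside each other on the same line.
    """
    lines = source.splitlines()
    spans = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, node_types) and node.lineno == node.end_lineno:
            spans.append((node.lineno - 1, node.col_offset, node.end_col_offset))
    # Replace right-to-left so earlier offsets on the same line stay valid.
    for row, start, end in sorted(spans, reverse=True):
        lines[row] = lines[row][:start] + mask + lines[row][end:]
    return "\n".join(lines)


masked = mask_ast_nodes("def add(a, b):\n    return a + b\n")
# masked == "def add(a, b):\n    [MASK]"
```

The masked region here is a complete statement, so a model asked to fill it in has to reconstruct a syntactically whole unit instead of isolated tokens.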
Chart-RL: Generalized Chart Comprehension via Reinforcement Learning with Verifiable Rewards
Xin Zhang | Xingyu Li | Rongguang Wang | Ruizhong Miao | Zheng Wang | Yuying Wang | Dan Roth | Chenyang Li
Accurate chart comprehension represents a critical challenge in advancing multimodal learning systems, as extensive information is compressed into structured visual representations. However, existing vision-language models (VLMs) frequently struggle to generalize on unseen charts because doing so requires abstract, symbolic, and quantitative reasoning over structured visual representations. In this work, we introduce Chart-RL, an effective reinforcement learning (RL) method that employs mathematically verifiable rewards to enhance chart question answering in VLMs. Our experiments demonstrate that Chart-RL consistently outperforms supervised fine-tuning (SFT) across different chart understanding benchmarks, achieving relative improvements of 16.7% on MultiChartQA, and 11.5% on ChartInsights. We conduct robustness analysis, where Chart-RL achieves enhanced performance in 18 of 25 perturbed chart categories, demonstrating strong consistency and reasoning capability across visual variations. Furthermore, we demonstrate that task difficulty and inherent complexity are more critical than data quantity in RL training. For instance, Chart-RL trained on merely 10 complex chart-query examples significantly outperforms models trained on over 6,000 simple examples. Additionally, training on challenging reasoning tasks not only improves in-domain generalization relative to simpler tasks, but also facilitates strong transfer to out-of-domain visual mathematical problems.
RSAT: Structured Attribution Makes Small Language Models Faithful Table Reasoners
Jugal Gajjar | Kamalasankari Subramaniakuppusamy
When a language model answers a table question, users have no way to verify which cells informed which reasoning steps. We introduce RSAT, a method that trains small language models (SLMs, 1–8B) to produce step-by-step reasoning with cell-level citations grounded in table evidence. Phase 1 (SFT) teaches a structured JSON output format from verified reasoning traces. Phase 2 (GRPO) optimizes a composite reward centered on NLI-based faithfulness, alongside citation validity and parsimony. Across six models from two families—Qwen2.5 (1.5B/3B/7B) and Llama3 (1B/3B/8B)—RSAT improves faithfulness 3.7× over SFT alone (0.224→0.826), with near-perfect citation validity (0.992). Post-hoc attribution collapses below 13% format success, confirming that attribution must be integrated into reasoning, not retrofitted. Ablations show the faithfulness reward is essential: removing it drops faithfulness from 0.97 to 0.03.
Framework of Thoughts: A Foundation Framework for Dynamic and Optimized Reasoning based on Chains, Trees, and Graphs
Felix Fricke | Simon Malberg | Georg Groh
Prompting schemes such as Chain of Thought, Tree of Thoughts, and Graph of Thoughts can significantly enhance the reasoning capabilities of large language models. However, most existing schemes require users to define static, problem-specific reasoning structures that lack adaptability to dynamic or unseen problem types. Additionally, these schemes are often under-optimized in terms of hyperparameters, prompts, runtime, and prompting cost. To address these limitations, we introduce Framework of Thoughts (FoT) – a general-purpose foundation framework for implementing and optimizing dynamic reasoning schemes. FoT comes with built-in features for hyperparameter tuning, prompt optimization, parallel execution, and intelligent caching, unlocking the latent performance potential of reasoning schemes. We demonstrate FoT’s capabilities by implementing three popular schemes – Tree of Thoughts, Graph of Thoughts, and ProbTree – within FoT. We empirically show that FoT enables significantly faster execution, reduces costs, and achieves better task scores through optimization. We release our codebase to facilitate the development of future dynamic and efficient reasoning schemes.
TabGuard: Agentic LLM Orchestration for Adaptive Tabular Anomaly Detection via Dynamic Validator Selection and Generation
Srihari Unnikrishnan | Minghua Ma
Tabular anomaly detection is challenging because real-world tables contain heterogeneous columns, ranging from structured identifiers to free-form text. Existing methods face a fundamental trilemma: rule-based systems require extensive manual configuration and fail on novel schemas; statistical methods scale efficiently but miss semantic errors; and LLM-based approaches understand semantics but incur prohibitive per-cell inference costs. No prior method simultaneously addresses semantic heterogeneity, domain-specific validation rules, and enterprise-scale processing. We introduce TabGuard, an agentic framework that resolves this trilemma through semantic routing. Using LLM function calling, the system analyzes a small sample of each column and dynamically selects the most effective validation strategy, routing to a regex-based validator for syntactic patterns, a code-generation validator for domain-specific rules (such as Luhn checksums for credit cards), or an embedding-based validator for distributional outliers. This architecture decouples expensive cognitive reasoning (O(m) LLM calls for m columns) from scalable programmatic execution, enabling deployment on enterprise datasets without per-cell inference.
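As an example of the kind of domain-specific rule the abstract mentions, here is a minimal Luhn checksum validator of the sort a code-generation validator might emit for a credit card column. The Luhn algorithm itself is standard; the function names `luhn_valid` and `flag_column` and the idea of applying it column-wise are illustrative assumptions, not TabGuard's actual generated code.

```python
def luhn_valid(value):
    """Return True if the digits in `value` pass the Luhn checksum.

    Spaces and hyphens are ignored; any other non-digit character fails.
    """
    cleaned = value.replace(" ", "").replace("-", "")
    if not cleaned.isdigit():
        return False
    total = 0
    # Walk digits right to left, doubling every second one.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def flag_column(cells):
    """Indices of cells that fail the checksum: one programmatic pass, no LLM calls."""
    return [i for i, cell in enumerate(cells) if not luhn_valid(cell)]


anomalies = flag_column(["4111 1111 1111 1111", "4111 1111 1111 1112"])
# anomalies == [1]
```

Once such a validator exists for a column, every cell is checked by ordinary code, which is how the per-column LLM cost stays separate from the per-cell work.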
StructSurvey: Structured Agentic Retrieval for Automated Survey Paper Generation
Paolo Pedinotti | Enrico Santus
The rapid growth of scientific publications makes it increasingly difficult to track and synthesize research progress. While Large Language Models (LLMs) can support automated survey generation, existing methods retrieve unstructured data and require models to infer conceptual, methodological, and taxonomic relations from raw text at generation time. We introduce STRUCTSURVEY, a hierarchical multiagent framework that shifts structural reasoning from generation to retrieval by dynamically constructing graph-based representations of entities, relations, and topical taxonomies. We evaluate STRUCTSURVEY on a new reference-grounded benchmark of ACL survey papers for reproducible long-form scientific summarization. Compared with embedding-only retrieval baselines, STRUCTSURVEY improves ROUGE-1 recall by +2.9 and ROUGE-2 recall by +1.0 on average, without reducing precision. It also improves LLM-as-a-Judge ratings for logical structure, depth, and synthesis, showing that explicit structural retrieval yields surveys closer to human-written organization and reasoning.
Modeling LLM Agent Reviewer Dynamics in Elo-Ranked Review System
Hsiang-Wei Huang | Junbin Lu | Kuang-Ming Chen | Jianxu Shangguan | Jenq-Neng Hwang
In this work, we explore the Large Language Model (LLM) agent reviewer dynamics in an Elo-ranked review system using real-world conference paper submissions. Multiple LLM agent reviewers with different personas engage in multi-round review interactions moderated by an Area Chair. We compare a baseline setting with conditions that incorporate Elo ratings and reviewer memory. Our simulation results showcase several interesting findings, including how incorporating Elo improves Area Chair decision accuracy, as well as reviewers’ adaptive review strategies that exploit our Elo system without improving review effort. These findings show how the Elo system affects peer review and offer insights for improving AI conference evaluation. Our code is available at https://github.com/hsiangwei0903/EloReview.
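For readers unfamiliar with Elo, the standard two-player rating update is sketched below. How the paper maps review outcomes to scores, which K-factor it uses, and how more than two reviewers are ranked are not stated in the abstract, so the pairwise framing, `k=32`, and the 1500 starting rating are assumptions chosen only to show the mechanics.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A outscores B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a, rating_b, score_a, k=32):
    """Return updated (rating_a, rating_b).

    score_a is 1.0 if A's review was judged better, 0.0 if worse, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two reviewers start level; reviewer A's review is judged better.
new_a, new_b = elo_update(1500, 1500, score_a=1.0)
# new_a == 1516.0, new_b == 1484.0
```

Because the update depends only on the judged outcome, a reviewer can gain rating by optimizing for whatever the judge rewards, which is the kind of strategic adaptation the abstract reports.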
DSMentor: Curriculum-Guided Inference with Online Memory for Data-Science LLM Agents
He Wang | Alexander Hanbo Li | Yiqun Hu | Sheng Zhang | Hideo Kobayashi | Jiani Zhang | Henghui Zhu | Chung-Wei Hang | Patrick Ng
Large language model (LLM) agents have shown strong capabilities in generating code to solve complex data science problems, yet they often overlook the impact of task order during inference. We present DSMentor, an inference-time optimization framework that applies curriculum learning—progressing from easier to harder tasks—to enhance LLM performance on challenging data science tasks. Guided by a mentor and supported by a growing long-term memory, DSMentor organizes problems by difficulty, retains prior experiences, and leverages them to guide subsequent reasoning. Extensive experiments on DSEval and QRData benchmarks show that DSMentor with Claude-3.5-Sonnet improves pass rates by up to 5.2% over baseline agents and achieves an 8.8% gain over GPT-4 with Program-of-Thoughts prompting. These results highlight the effectiveness of curriculum-based inference strategies in advancing LLM agents.
Asking language models how to represent data for fine-tuning
Usneek Singh | Ananya Singha | Abhijeet Awasthi | Sumit Gulwani | Aditya Kanade | Vu Le | Mukul Singh | Gust Verbruggen
Language models are often used for tasks involving structured data like tables and graphs, but there is no principled approach for choosing the best format to represent such data for fine-tuning. We address this in three steps. First, we show that format choice remains important even after fine-tuning; models learn more efficiently with specific formats rather than adapting to any format. Second, we show that a pre-trained model can suggest its own candidate formats by auto-completing partial prompts, reducing reliance on developer intuition. Third, and most importantly, we demonstrate that base model performance across formats reliably predicts post-fine-tuning performance: the format that performs best before fine-tuning remains among the top candidates after fine-tuning in 16 out of 18 settings across three data structure types, three models, and six tasks. This finding allows format selection to be done via inference alone, avoiding costly trial-and-error fine-tuning runs.
TabBridge: Bridging Structure and Context for Accurate Table Reasoning
Jeongwoo Lee | Eunsoo Lee | Jihie Kim
Table reasoning remains challenging for Large Language Models (LLMs) as it requires integrating structured tabular information with natural language questions. Previous SQL-based approaches rely on surface-level alignment between question keywords and column headers, often generating queries with spurious or missing column mappings. We introduce TabBridge, a framework that incorporates both structural and contextual information for accurate table reasoning. TabBridge first generates a unified textual representation called Table Specification (TabSpec), preserving the structural information through row and column analysis. In order to ensure accuracy and consistency, we also employ a reconstruction-based evaluation mechanism to verify and refine the generated TabSpec. TabSpec is subsequently used to generate SQL aligned with the contextual intent of the question, enabling accurate interpretation of column semantics that are often overlooked by previous approaches. Across three public benchmarks, TabBridge shows consistent improvements over previous SQL-based methods, achieving 73.94% accuracy on WikiTableQuestions (+5.3 pp over the previous state of the art). TabBridge also demonstrates robust performance across diverse LLM backbones, confirming its generalizability across model architectures. Our code is available at https://github.com/raylee0519/TabBridge.
Multi-step reasoning in large language models (LLMs) is typically expressed as unstructured text, making intermediate states difficult to organize, verify, and revise explicitly. This limitation often leads to redundant reasoning paths, error accumulation, and limited controllability in complex tasks. We propose Map-of-Actions (MoA), a neuro-symbolic reasoning framework that treats reasoning as operations over an explicit structured state space. MoA represents intermediate states as a multi-labeled graph, in which each node corresponds to a semantically labeled reasoning unit. This representation provides LLMs with structured memory, explicit state transitions, and flexible interfaces to external tools. Experiments on multiple complex question answering (QA) benchmarks show that MoA consistently outperforms strong baselines, improving accuracy by up to 17.9 percentage points.
Routing End User Queries to Enterprise Databases
Saikrishna Sudarshan | Tanay Kulkarni | Manasi Patwardhan | Lovekesh Vig | Ashwin Srinivasan | Tanmay Tulsidas Verlekar
We address the task of routing natural language queries in multi-database enterprise environments. We construct realistic benchmarks by extending existing NL-to-SQL datasets. Our study shows that routing becomes increasingly challenging with larger, domain-overlapping DB repositories and ambiguous queries, motivating the need for more structured and robust reasoning-based solutions. By explicitly modelling schema coverage, structural connectivity, and fine-grained semantic alignment, the proposed modular, reasoning-driven re-ranking strategy consistently outperforms embedding-only and direct LLM-prompting baselines across all the metrics.
SchemaScope: How Join-Hop Depth Breaks Text-to-SQL in Large Language Models, and a Decomposition-Based Remedy
Kaustubh S. Bukkapatnam | Rayan Malik
Large language models (LLMs) achieve impressive accuracy on standard Text-to-SQL benchmarks such as Spider and BIRD, yet enterprise databases, with hundreds of tables and complex foreign key graphs, remain a practical bottleneck. We hypothesize that a single, measurable property drives most of this gap: the join-hop depth (h) of the query, defined as the number of foreign key edges that must be traversed to gather all required columns. We introduce the Join-Hop Depth (JHD) benchmark, 410 human-annotated questions stratified by h ∈ {1, …, 6} over 12 enterprise-scale schemas. Experiments on five frontier LLMs confirm a sharp accuracy cliff: all models exceed 80% at h = 1 but fall below 40% at h = 4 and below 25% at h = 6, the typical depth of real enterprise analytics queries. To address this, we propose SchemaScope, a decomposition framework that partitions deep queries into a sequence of sub-queries with h ≤ 2, executes them independently, and merges the results. SchemaScope raises execution accuracy from 46.8% to 67.3% on JHD (GPT-4o, h ≥ 3) and improves execution accuracy by +9.3 percentage points on the BIRD development set. Error analysis shows that decomposition eliminates wrong join path errors, the dominant failure mode at high h, and shifts the residual error budget toward condition and aggregation mistakes that are amenable to existing post-processing methods.
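The join-hop depth defined above (the number of foreign key edges traversed to gather all required columns) can be approximated with a breadth-first search over the foreign key graph. The sketch below counts the distinct edges on shortest paths from one anchor table to each required table. Treating the graph as undirected, choosing an anchor table, and using the union of shortest paths are assumptions; the benchmark's human annotation may measure h differently, and on graphs with several equally short paths the count can depend on traversal order.

```python
from collections import deque


def join_hop_depth(fk_edges, anchor, required_tables):
    """Count distinct foreign key edges on BFS paths from `anchor`
    to every table in `required_tables` (an approximation of h)."""
    graph = {}
    for left, right in fk_edges:
        graph.setdefault(left, set()).add(right)
        graph.setdefault(right, set()).add(left)

    # Breadth-first search recording each table's parent on a shortest path.
    parent = {anchor: None}
    queue = deque([anchor])
    while queue:
        table = queue.popleft()
        for neighbor in graph.get(table, ()):
            if neighbor not in parent:
                parent[neighbor] = table
                queue.append(neighbor)

    used_edges = set()
    for table in required_tables:
        if table not in parent:
            raise ValueError(f"{table!r} is not reachable from {anchor!r}")
        while parent[table] is not None:
            used_edges.add(frozenset((table, parent[table])))
            table = parent[table]
    return len(used_edges)


edges = [("orders", "customers"), ("customers", "regions"), ("orders", "items")]
depth = join_hop_depth(edges, "orders", ["regions", "items"])
# depth == 3
```

A decomposition step in the spirit of SchemaScope would then split any query whose depth exceeds 2 into sub-queries that each stay within that bound.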
Generalization in Graph Reasoning: A Systematic Comparison of LLM Training Approaches
Sola Shirai | Kavitha Srinivas | Julian Dolby | Michael Katz | Shirin Sohrabi | Horst Samulowitz
For large language models (LLMs), reasoning over graphs can help solve many problems. Prior work has tried to improve LLM graph reasoning through different training methods, but the merits of such approaches remain unclear and the limitations of each approach with respect to generalizability of reasoning are often not thoroughly explored. In this paper we systematically compare the ability of LLMs to learn fundamental graph tasks across a variety of training methods and their ability to generalize out of distribution across various dimensions. We highlight key tradeoffs between training methods, e.g., training specialized graph encoders and fusing their embeddings with LLMs consistently collapses in terms of generalizability; however, no single method shows clear superiority across all dimensions of generalizability, regardless of the size of the model.
Self-correction—the ability of LLMs to detect and fix their own errors—has been studied extensively for mathematical and code reasoning, with limited prior work on table reasoning (primarily multi-agent pipelines such as Table-Critic, ACL 2025, rather than single-model structured prompting). Tables present unique challenges: errors arise from wrong cell retrieval, incorrect computation, flawed logic, and hallucination of values not present in the data. We conduct the first cross-provider single-model self-correction analysis for table reasoning across five providers (Google, Moonshot AI, Zhipu, Alibaba, MiniMax), testing five models (Gemini 3.1 Pro, Kimi K2.5, GLM 5, Qwen 3.5+, MiniMax M2.5) on WikiTableQuestions and TabFact with a multi-seed paired protocol. We propose Structured Self-Correction (SSC), a table-specific verification chain that guides models through cell verification, computation checking, logic validation, and completeness assessment. We confirm that the Accuracy-Correction Paradox (terminology from Li 2025) previously observed in math extends to tables: models with base accuracy in the mid-60s–mid-70s region benefit modestly from self-correction (multi-seed mean SCG up to +1.3% with within-seed point estimates as high as +3.4%), while stronger models above this region are systematically harmed by over-correction (multi-seed mean SCG down to -1.3%, with 95% bootstrap CIs significantly below zero). SSC reduces over-correction rates in 9 of 10 conditions, with reductions of 38–69% on TabFact. An inference-mode-controlled probe shows that SSC’s qualitative direction is robust for Qwen 3.5+ across reasoning-ON and reasoning-OFF settings, while GLM 5 exhibits a substantial mode-dependent shift, indicating that mode robustness itself is model-dependent. Stronger baselines (self-consistency, self-critic, tool-augmented arithmetic verification, majority voting, and a same-family scaling probe) further characterize where SSC helps. 
Ablation studies reveal that answer-aware review is essential, reasoning traces aid error detection, and iterative correction shows diminishing returns. A FinQA domain transfer probe confirms a capability floor: self-correction fails when base task competence is very low (21.5% accuracy). Our primary contribution is empirical: we characterize the conditions under which self-correction helps or harms table reasoning, providing actionable guidance for practitioners.
Mixed-Policy GRPO for Text-to-SQL with Off-Policy Data Generation
Marko Sterbentz | Michael Glass | Nhan H Pham | Dharmashankar Subramanian | Kristian J Hammond
Recent advances in text-to-SQL have shown that methods such as Group Relative Policy Optimization (GRPO) can substantially improve reasoning performance, but these approaches remain inherently on-policy, limiting their ability to incorporate novel reasoning patterns. In this work, we address this limitation by leveraging existing datasets to generate high-quality off-policy rollouts, enabling mixed-policy training that exposes models to diverse and informative reasoning trajectories. We present the first application of mixed-policy GRPO to the text-to-SQL domain and introduce a systematic study of off-policy data generation strategies for this setting, including a novel method, Iterative Error Correction (IEC), which iteratively refines model outputs through targeted feedback. Our experiments show that mixed-policy GRPO outperforms both base models and on-policy GRPO, yielding average improvements of +4.7% over base models and +4.1% over on-policy GRPO across the Spider and BIRD benchmarks. Gains are particularly strong on BIRD, reaching up to +7.3% over base models and +4.5% over on-policy GRPO.
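The group-relative part of GRPO can be shown in a few lines: each rollout's reward is normalized against the mean and standard deviation of its group. The sketch below applies that normalization to a group that mixes on-policy and off-policy rollouts. Pooling both sources into one group, the `source` tag, and the toy rewards are assumptions for illustration; the paper's actual mixed-policy objective (for example, any importance weighting of off-policy samples) is not described in the abstract and is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def mixed_group(on_policy, off_policy):
    """Pool (sql, reward) rollouts from both sources and score them together."""
    rollouts = [(sql, reward, "on") for sql, reward in on_policy]
    rollouts += [(sql, reward, "off") for sql, reward in off_policy]
    advantages = group_relative_advantages([reward for _, reward, _ in rollouts])
    return [
        {"sql": sql, "source": source, "advantage": adv}
        for (sql, _, source), adv in zip(rollouts, advantages)
    ]


# Hypothetical case: the current policy fails, the off-policy rollouts succeed.
group = mixed_group(
    on_policy=[("SELECT 1", 0.0), ("SELECT 2", 0.0)],
    off_policy=[("SELECT 3", 1.0), ("SELECT 4", 1.0)],
)
```

If every on-policy rollout in a group earns the same reward, the on-policy advantages are all zero and carry no learning signal; adding successful off-policy rollouts restores a nonzero contrast, which is one intuition for why mixing can help.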
TabFaith: Benchmarking and Improving Structural Faithfulness in LLM Table Summarization
Kaustubh S. Bukkapatnam | Sohum Mehta
When large language models (LLMs) summarize tabular data, they produce fluent but systematically unfaithful text—hallucinating numerical values, misattributing entities to rows or columns, fabricating comparative rankings, and conflating temporal references. Existing faithfulness metrics (BLEU, PARENT, BERTScore) are poorly correlated with human judgments of structural faithfulness (r ≤0.60) because they are agnostic to the table’s schema and cell structure. We introduce TABFAITH, a benchmark of 2,400 (table, summary, error annotation) triples across five structural error types, built from ToTTo and a new enterprise table summarization dataset (TabSum-Ent) covering financial reports, clinical notes, and operational dashboards. We further propose STAF (Structural Table-Aware Faithfulness), a reference-free metric that decomposes faithfulness verification into cell-level claim alignment using natural language inference over table cells. STAF achieves r = 0.94 with human faithfulness judgments—a +0.34 improvement over PARENT (r = 0.60) and +0.70 over BLEU (r = 0.24). Guided by STAF’s fine-grained signal, we design CAVE (Cell-Anchored Verification and Editing), a training-free post-processing method that identifies unfaithful claims, traces them to specific table cells, and re-generates the offending spans. CAVE improves STAF scores by +0.14 on average across five LLMs on both ToTTo and TabSum-Ent, with the largest gains for numerical errors (+0.17)—the dominant error type for smaller models.
StructHallu-Drift: Benchmarking Structured Hallucinations Under Schema Evolution in LLMs
Mujtaba Hasan
Large Language Models (LLMs) are increasingly used to generate structured outputs (JSON objects, SQL queries, and structured records) from formal schemas. While recent advances in constrained decoding and schema-aware prompting have improved syntactic compliance, the semantic reliability of these outputs remains poorly characterized. We investigate this gap through the lens of schema drift: the inevitable evolution of database schemas in production environments through column renamings, type changes, and constraint modifications. We introduce StructHallu-Drift, a benchmark and evaluation framework for studying structured hallucinations under schema evolution. We contribute: (1) a six-category hallucination taxonomy that disentangles syntactic validity from semantic fidelity; (2) a controlled evaluation suite applying realistic schema mutations at three severity levels to established NL-to-structure datasets; and (3) a systematic evaluation of four LLMs spanning 7B to 70B parameters across three structured output tasks. Experiments on 1,200 schema–model evaluation instances reveal four key findings: (i) 39–54% of structured outputs contain at least one semantic hallucination; (ii) schema drift severity has surprisingly minimal effect on hallucination rates (∼44% across all levels, p = 0.59), suggesting imperfect schema conditioning under our prompting setup; (iii) output format is the dominant factor in generation reliability, with SQL achieving ∼85% semantic validity while schema-grounded record generation drops to 7–24%; (iv) each model exhibits a distinct hallucination fingerprint, implying that mitigation strategies must be model-specific rather than universal. We publicly release our benchmark and evaluation toolkit.
Reasoners or Translators? Contamination-aware Evaluation and Neuro-Symbolic Robustness on Tax Law
Parisa Kordjamshidi | Samer Aslan | Madhavan Seshadri | Leslie Barrett | Enrico Santus
Recent advances in large language models (LLMs) have significantly enhanced automated legal reasoning. Yet, it remains unclear whether their performance reflects genuine legal reasoning ability or artifacts of data contamination. We present a comprehensive empirical study of tax law reasoning approaches and implement a contamination detection protocol to rigorously assess LLM reliability. We show that performance can be inflated by contamination. Building on this analysis, we conduct a systematic evaluation, comparing monolithic LLMs with hybrid systems that translate statutory text into formal representations and delegate inference to symbolic solvers. We build a novel test suite designed to probe generalization to unseen documents via case and rule variations. Our findings indicate that legal reasoning is inherently compositional and that neuro-symbolic frameworks offer a more reliable and robust foundation for legal AI, as well as improved generalization to unobserved situations.
More Than Efficiency: Embedding Compression Improves Domain Adaptation in Dense Retrieval
Chunsheng Zuo | Daniel Khashabi
Dense retrievers powered by pretrained embeddings are widely used for document retrieval but struggle in specialized domains due to the mismatches between the training and target domain distributions. Domain adaptation typically requires costly annotation and retraining of query-document pairs. In this work, we revisit an overlooked alternative: applying PCA to domain embeddings to derive lower-dimensional representations that preserve domain-relevant features while discarding non-discriminative components. Though traditionally used for efficiency, we demonstrate that this simple embedding compression can effectively improve retrieval performance. Evaluated across 9 retrievers and 14 MTEB datasets, PCA applied solely to query embeddings improves NDCG@10 in 75.4% of model-dataset pairs, offering a simple and lightweight method for domain adaptation.
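The compression step described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the PCA basis is fitted on query embeddings only and then used to project both queries and documents before cosine ranking. The power-iteration solver and the names `fit_pca`, `project`, and `rank_documents` are hypothetical choices made to keep the example dependency-free.

```python
import math
import random


def fit_pca(rows, k, iters=200, seed=0):
    """Return (components, means): top-k principal directions of `rows`.

    Uses power iteration with deflation on the covariance matrix
    (an illustrative solver; a library SVD would normally be used).
    """
    n, dim = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(dim)]
    centered = [[r[j] - means[j] for j in range(dim)] for r in rows]
    cov = [[sum(r[i] * r[j] for r in centered) / n for j in range(dim)]
           for i in range(dim)]
    rng = random.Random(seed)
    components = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
            norm = math.sqrt(sum(x * x for x in w)) or 1.0
            v = [x / norm for x in w]
        cv = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        eigenvalue = sum(a * b for a, b in zip(v, cv))
        components.append(v)
        # Deflate: remove the direction just found before the next pass.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(dim)]
               for i in range(dim)]
    return components, means


def project(vec, components, means):
    """Center `vec` and express it in the k-dimensional PCA basis."""
    centered = [x - m for x, m in zip(vec, means)]
    return [sum(c * x for c, x in zip(comp, centered)) for comp in components]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is zero."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0


def rank_documents(query, docs, components, means):
    """Return document indices sorted by cosine similarity in PCA space."""
    q = project(query, components, means)
    scores = [cosine(q, project(d, components, means)) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Fitting on queries alone needs no labels and no retraining, which is consistent with the abstract's description of the method as lightweight; the number of retained components `k` would have to be tuned per domain.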
Proceedings of the Seventh Workshop on Teaching Natural Language Processing (TeachNLP 2026)
Matthias Aßenmacher | Laura Biester | Claudia Borg | György Kovács | Margot Mieskes | Sofia Serrano
Large language models (LLMs) are becoming central to natural language processing education, yet materials showing their mechanics are sparse. We present AnimatedLLM, an interactive web application that provides step-by-step visualizations of a Transformer language model. AnimatedLLM runs entirely in the browser, using pre-computed traces of open LLMs applied to manually curated inputs. The application is available at https://animatedllm.github.io, both as a teaching aid and for self-study.
Pedagogic Applications of Argument Maps to Enhance Critical Thinking: Thought Seeds, Argument Mapping, Collaborative Mapping
Sruti Narra
Argument maps are used extensively in Natural Language Processing (NLP) for training Large Language Models (LLMs) to analyze and generate arguments coherently. This paper discusses the pedagogic applications of argument mapping to enhance critical thinking in educational contexts. The approach was found to be useful for shaping the thinking process during thesis writing and project courses and can be applied in higher education. In the age of rapid Gen AI advancement, it is important to embed critical thinking into education, and such approaches can address challenges like AI overuse and the potential loss of key skills and competences in learners. Argument mapping requires learners to visualize their thinking; in doing so, they not only achieve clarity of thought but also make distinct connections between concepts in the form of arguments. This clarity is at a much higher level than that achieved through concept or mind mapping, as learners need to think in terms of well-formed claims and the connections between them. In addition, collaborative argument mapping tasks could give learners opportunities for peer learning and for making abstract ideas concrete through visualization and discussion.
The rapid advancement of Large Language Models (LLMs) presents both challenges and opportunities for Natural Language Processing (NLP) education. This paper introduces “Vibe Coding,” a pedagogical approach that leverages LLMs as coding assistants while maintaining focus on conceptual understanding and critical thinking. We describe the implementation of this approach in a senior-level undergraduate NLP course, where students completed seven labs using LLMs for code generation while being assessed primarily on conceptual understanding through critical reflection questions. Analysis of end-of-course feedback from 19 students reveals high satisfaction (mean scores 4.4-4.6/5.0) across engagement, conceptual learning, and assessment fairness. Students particularly valued the reduced cognitive load from debugging, enabling deeper focus on NLP concepts. However, challenges emerged around time constraints, LLM output verification, and the need for clearer task specifications. Our findings suggest that when properly structured with mandatory prompt logging and reflection-based assessment, LLM-assisted learning can shift focus from syntactic fluency to conceptual mastery, preparing students for an AI-augmented professional landscape.
LLM-based methods supersede many approaches in NLP at high velocity, making it necessary to adapt curricula. We argue that this effort also presents a chance to integrate LLM chatbots as learning support. We demonstrate (a) how we re-conceptualized an existing class segment on digital assistance systems to discuss LLM-based chatbots, (b) how we created a specialized instructional chatbot as a demonstrator that students could directly use for learning and revision and (c) how students’ initial perception of LLM-based AI changed due to instruction.
Language Technology Initiative: Framework for Teaching NLP and Computational Linguistics at the Universities in Latvia
Inguna Skadina | Jana Kuzmina | Marina Platonova | Tatjana Smirnova | Sergei Kruk
This short paper provides an overview of language technology related modules and courses developed at three leading universities of Latvia - University of Latvia (UL), Riga Technical University (RTU) and Riga Stradiņš University (RSU).
Teaching NLP in the AI Era: Experiences from the University of Latvia
Inguna Skadina | Guntis Barzdins | Uldis Bojārs | Normunds Gruzitis | Pēteris Paikens
From being a niche technology with practical applications in translation and speech recognition, NLP is now underpinning the AI era through LLMs, promising a universal economic impact in the future. Although the transition to the AI era is hyped by BigTech companies, practical adoption of LLM capabilities for economically impactful tasks and processes depends on educating specialists capable of applying them properly. Human-in-the-loop workflows, accuracy measurement, fine-tuning, and on-premises processing of sensitive data have become essential skills for applying NLP. This short paper introduces two language technology modules developed and piloted at the Faculty of Science and Technology of the University of Latvia.
With the advent of Large Language Models (LLMs) researchers outside the Natural Language Processing (NLP) field are interested in learning how to process textual data for their own domain research goals. They are particularly motivated to start experimenting directly with LLMs, implicitly neglecting the large amount of accumulated knowledge that NLP has to offer them. In this text, we briefly share our new lesson materials that aim to show aspiring practitioners the strong connection between NLP fundamentals and LLMs, in the form of a two-day workshop. Our training material is mainly aimed at graduate students outside the NLP sphere who have basic technical knowledge and wish to start working with text, is fully open source and available online.
From Standard Transformers to Modern LLMs: Bringing Dialogue Models, RAG, and Agents to the Classroom
Maria Tikhonova | Viktoriia A. Chekalina | Artem Chervyakov | Alexey Zaytsev | Alexander Panchenko
Modern LLM education is increasingly centered on system building: grounding generation with retrieval, enabling tool use, and deploying models under latency and cost constraints. We present an updated release of our open course on Transformer-based LLMs and multimodal models (Nikishina et al., 2024). The update introduces topics that have become important since the first edition, namely a session on Retrieval Augmented Generation (RAG), a hands-on session on tool-using agents, an API-based track for applied work with LLMs, and practical local inference with vLLM. We also add a dedicated session on multimodal dialog models with a focus on dialog grounding. We enriched the course with a discussion on long-context transformers, focusing on KV-cache efficiency along with the related models and benchmarks. All materials are released online.
Which course? Discourse! Teaching Discourse and Generation in the Era of LLMs
Junyi Jessy Li | Yang Janet Liu | Kanishka Misra | Valentina Pyatkin | William Sheffield
The field of NLP has undergone vast, continuous transformations over the past few years, sparking debates going beyond discipline boundaries. This raises important questions in education: how do we design courses that bridge sub-disciplines in this shifting landscape? This paper explores this question from the angle of discourse processing, an area with rich linguistic insights and computational models for the intentional, attentional, and coherence structure of language. Discourse is highly relevant for open-ended or long-form text generation, yet this connection is under-explored in existing undergraduate curricula. We present a new course, "Computational Discourse and Natural Language Generation". The course is collaboratively designed by a team with complementary expertise and was offered for the first time in Fall 2025 as an upper-level undergraduate course, cross-listed between Linguistics and Computer Science. Our philosophy is to deeply integrate the theoretical and empirical aspects, and create an exploratory mindset inside the classroom and in the assignments. This paper describes the course in detail and concludes with takeaways from an independent survey as well as our vision for future directions.
Student demand for NLP training now spans linguistics, computer science, data science, and applied fields, producing cohorts with uneven preparation. We report on a four-course curriculum used in an M.S. Computational Linguistics program: an undergraduate on-ramp, a two-course graduate core (classical methods and neural/LLM methods), and a rotating special-topics seminar. We describe the role of each course, the bridging strategy that keeps the core sequence focused, and assessment patterns that emphasize error analysis, experimental reasoning, and reproducible practice. The goal is a set of reusable curricular design patterns for mixed-background programs facing rapid topic turnover in NLP.
NLP researchers regularly invoke abstract concepts like "interpretability," "bias," "reasoning," and "stereotypes," without defining them. Each subfield has a shared understanding or conceptualization of what these terms mean and how we should treat them, and this shared understanding is the basis on which operational decisions are made: datasets are built to evaluate these concepts, metrics are proposed to quantify them, and claims are made about systems. But what do they mean, what _should_ they mean, and how should we measure them? I outline a seminar I created for students to explore these questions of conceptualization and operationalization, with an interdisciplinary reading list and an emphasis on discussion and critique.
Bridging Applied Experience and Research Contexts in Ukrainian NLP Education
Yurii Paniv | Viktoriia Makovska
We present an open, bachelor-level Natural Language Processing (NLP) course developed at Ukrainian Catholic University and delivered in Ukrainian. The course addresses several challenges in NLP education: adapting predominantly English-centric materials to a different linguistic and cultural context, supporting students with heterogeneous technical backgrounds, and balancing foundational theory with industry-relevant skills. All course materials, including lecture slides, notebooks, video recordings, and assignments, are publicly available. We describe our pedagogical design choices, focusing on culturally adapted tasks, integrated ethics, project-based assessment, and continuous student feedback. Our experience demonstrates that it is feasible to build a comprehensive and modern NLP curriculum from scratch in a non-English context, even when instructors come primarily from industry backgrounds.
Teaching Modern NLP and LLMs at Kyiv School of Economics: A Practice-Oriented Course with Ukrainian Language Focus
Roman Kyslyi | Anton Bazdyrev
This paper describes a Natural Language Processing (NLP) course taught at Kyiv School of Economics. The course consists of 16 lectures, 5 practical assignments and focuses on modern large language models (LLMs) while preserving an introduction to classical NLP. Practical assignments are organized using Kaggle, where GPU support plays an important role in enabling students to work with complex models. A key feature of the course is the focus on Ukrainian in the practical assignments, contributing to the development of Ukrainian NLP expertise and community. The course is taught primarily in-person, but due to the ongoing war in Ukraine, also includes a full online participation option and additional weekly QnA sessions.
Practising responsibility: Ethics in NLP as a hands-on course
Malvina Nissim | Viviana Patti | Beatrice Savoldi
As Natural Language Processing (NLP) systems become more pervasive, integrating ethical considerations into NLP education has become essential. However, this presents inherent challenges in curriculum development: the field’s rapid evolution from both academia and industry, and the need to foster critical thinking beyond traditional technical training. We introduce our course on Ethical Aspects in NLP and our pedagogical approach, grounded in active learning through interactive sessions, hands-on activities, and “learning by teaching” methods. Over four years, the course has been refined and adapted across different institutions, educational levels, and interdisciplinary backgrounds; it has also yielded many reusable products, both in the form of teaching materials and in the form of actual educational products aimed at diverse audiences, made by the students themselves. By sharing our approach and experience, we hope to provide inspiration for educators seeking to incorporate social impact considerations into their curricula.
Beyond Passive Viewing: A Pilot Study of a Hybrid Learning Platform Augmenting Video Lectures with Conversational AI.
Mohammed Abraar | Raj Dandekar | Rajat Dandekar | Sreedath Panat
The exponential growth of AI education has brought millions of learners to online platforms, yet this massive scale has simultaneously exposed critical pedagogical shortcomings. Traditional video-based instruction, while cost-effective and scalable, demonstrates systematic failures in both sustaining learner engagement and facilitating the deep conceptual mastery essential for AI literacy. We present a pilot study evaluating a novel hybrid learning platform that integrates real-time conversational AI tutors with traditional video lectures. Our controlled experiment (N = 58, mean age M = 21.4, SD = 2.8) compared traditional video-based instruction with our AI-augmented video platform. This study employed a sequential within-subjects design where all participants first completed the traditional video condition followed by the AI-augmented condition, providing direct comparisons of learning outcomes. We measured learning effectiveness through immediate post-tests and delayed retention assessments (2-week delay). Results suggest improvements in learning performance: immediate post-test performance showed a large effect size (d = 1.505) with participants scoring 8.3 points higher after AI-augmented instruction (91.8 vs. 83.5 out of 100, p < .001). Behavioral analytics revealed increased engagement duration (71.1% improvement with AI tutoring) in the experimental group. This pilot study provides preliminary evidence that conversational AI tutors may enhance traditional educational delivery, suggesting a potential avenue for developing scalable, adaptive learning systems.
From Sentiment to Interpretation: Teaching NLP for Literary Understanding Across Educational Contexts
Karl-Emil Kjær Bilstrup | Kirstine Nielsen Degn | Morten Schultz | Alexander Conroy | Jens Bjerring-Hansen | Daniel Hershcovich
We developed Litteraturmaskinen, a graphical annotation and exploration interface that enables students to collaborate on labeling sentiment in literary passages, compare their decisions with model predictions, and justify their interpretations. We deployed the system in two educational settings: a university module on computational literary studies and regular teaching by two first-language high school teachers. Based on observations, collected teaching plans, and interviews, we find that tensions between epistemic and academic traditions are both a barrier to integration and a productive entry point for literary reflection and argumentation. We conclude with recommendations for integrating NLP into literature and first-language curricula.
The ubiquitous adoption of large language models by students prompts teachers to redesign courses and evaluation methods, especially in computer science and natural language processing (NLP), where the impact is more tangible. Our contribution is two-fold. First, we attempt to define invariants for the role of education itself given the over-abundance of information that appears to be more accessible than ever before. Then, we present our approach and materials used for an introductory course in NLP for undergraduate students, drawing inspiration from software engineering best practices. Our vision regarding large language models is to rely on local models to cultivate a sense of ownership and sovereignty in an age where every bit of independence and privacy gets eroded.
Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026)
Kai-Wei Chang | Ninareh Mehrabi | Satyapriya Krishna | Anubrata Das | Jwala Dhamala | Yang Trista Cao | Tharindu Kumarage | Anil Ramakrishna | Christos Christodoulopoulos | Yixin Wan | Aram Galystan | Anoop Kumar | Rahul Gupta
Evaluating Cross-Lingual Behavior and Consistency of Multimodal Large Language Models
Hao Wang | Pinzhi Huang | Daisuke Kawahara
The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.
Through a Compressed Lens: Investigating The Impact of Quantization on Factual Knowledge Recall
Qianli Wang | Mingyang Wang | Nils Feldhus | Simon Ostermann | Yuan Cao | Hinrich Schuetze | Sebastian Möller | Vera Schmitt
Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). Although quantization’s effects on various LLM capabilities have been extensively studied, one critical area remains underexplored: factual knowledge recall (FKR), the process by which LLMs access stored knowledge. To this end, we conduct comprehensive experiments using three common quantization techniques at distinct bit widths, in conjunction with interpretability-driven analyses on two tasks, knowledge memorization and latent multi-hop reasoning. We show that quantization typically results in information loss within LLMs, consequently diminishing their capacity for FKR. This effect is particularly amplified in smaller models within the same architectural families. However, models quantized at reduced bit precision do not consistently exhibit inferior performance, and occasionally quantization may even enhance model FKR. We find that BitSandBytes best preserves the original full-precision model’s FKR. Despite variability across models and methods, quantization causes modest performance degradation and remains an effective compression strategy.
Uncertainty-Aware Proxy Attribute Reasoning for Reliable Media Bias Detection
Chin-Po Chen | Jeng-Lin Li | Ming-Ching Chang
Large language models (LLMs) are increasingly deployed in a wide range of applications, yet remain vulnerable to adversarial jailbreak attacks that circumvent their safety guardrails. Existing evaluation frameworks typically report binary success/failure metrics, failing to capture the temporal dynamics of how attacks succeed under persistent adversarial pressure. This preliminary work proposes a novel evaluation framework that applies survival analysis techniques to characterize LLM jailbreak vulnerability. Our approach models the “time-to-jailbreak” as a survival outcome, enabling estimation of hazard functions, survival curves, and risk factors associated with successful attacks. We evaluate three LLMs against a subset of prompts from the HarmBench dataset spanning three attack categories. Our analysis reveals that models exhibit distinct vulnerability profiles: while one model demonstrates rapid degradation under iterative attacks, the other two models show consistent moderate vulnerability. Our framework provides actionable insights for model and LLM application developers and establishes survival analysis as a rigorous methodology for LLM safety evaluation.
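A survival curve for "time-to-jailbreak" can be illustrated with the standard Kaplan-Meier estimator. The abstract does not say which estimator the authors use, so this is a sketch under assumptions: time is measured in attack turns, a prompt whose attack budget ran out without a jailbreak is treated as censored, and the name `kaplan_meier` and the toy data are hypothetical.

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier survival estimate.

    durations: turn at which the jailbreak happened, or at which
               observation stopped without one (censored).
    events:    True if a jailbreak was observed, False if censored.
    Returns a list of (time, survival_probability) at each event time.
    """
    event_times = sorted({t for t, e in zip(durations, events) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Prompts still "surviving" (not yet jailbroken or censored) at t.
        at_risk = sum(1 for d in durations if d >= t)
        failures = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1.0 - failures / at_risk
        curve.append((t, survival))
    return curve


# Toy data (hypothetical): five prompts, two never jailbroken.
turns = [1, 2, 2, 3, 5]
jailbroken = [True, True, False, True, False]
for t, s in kaplan_meier(turns, jailbroken):
    print(f"turn {t}: P(not yet jailbroken) = {s:.2f}")
```

Unlike a single attack success rate, the curve separates a model that fails immediately from one that resists for several turns, which is the kind of temporal distinction the abstract describes.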
ClaimCLAIRE: A Trust-Aware Multi-Component Fact-Checking Agent for Open-World Claims
Xinman Liu | Mayank Sharma
Verifying complex real-world claims against diverse and potentially unreliable open-web sources requires balancing evidence comprehensiveness with rigorous source reliability. Current automated fact-checking approaches often fail to address this holistically, losing contextual dependencies and applying trust signals monolithically at the document level. We introduce ClaimCLAIRE, a multi-component fact-checking agent that integrates four key innovations: (1) iterative component-aware decomposition with exhaustiveness validation, (2) holistic evidence gathering using a ReAct agent that maintains cross-component semantic awareness, (3) trust-modulated retrieval that weights evidence by source credibility to mitigate the influence of misinformation, and (4) adaptive gap-filling to address recall bottlenecks in under-supported sub-claims. Evaluated on the AVeriTeC benchmark, ClaimCLAIRE achieves 84.27% accuracy and a macro-F1 of 0.806. Our systematic ablations demonstrate that while decomposition alone can degrade performance, its integration with trust-aware retrieval and adaptive gap-filling yields a pipeline where component-level verdicts, source trust ratings, and deterministic AND-logic synthesis together support transparent, accountable fact verification.
ChatbotManip: a Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour
Jack Luigi Henry Contro | Simrat Deol | Martim Brandao | Yulan He
This paper introduces ChatbotManip, a novel dataset for studying manipulation in chatbots. It contains generated conversations between a chatbot and a simulated user, where the chatbot is explicitly asked to showcase manipulation tactics, persuade the user towards some goal, or simply be helpful. We consider a diverse set of chatbot manipulation contexts, from consumer and personal advice to citizen advice and controversial proposition argumentation. Each conversation is annotated by human annotators for both general manipulation and specific manipulation tactics. Our research reveals three key findings. First, Large Language Models (LLMs) can be manipulative when explicitly instructed, with annotators identifying manipulation in approximately 84% of such conversations. Second, even when only instructed to be "persuasive" without explicit manipulation prompts, LLMs frequently default to controversial manipulative strategies, particularly Gaslighting and Fear Enhancement. Third, zero-shot larger models such as Gemini 2.5 Pro have the best performance in detecting manipulation (of the models tested), with more work required to fine-tune smaller open source models for real-world on-device oversight. Our work provides important insights for AI safety research and highlights the need to address manipulation risks as LLMs are increasingly deployed in consumer-facing applications.
Controllable Pareto Trade-off between Fairness and Accuracy
Yongkang Du | Jieyu Zhao | Yijun Yang | Tianyi Zhou
The fairness-accuracy trade-off is a key challenge in NLP tasks. Current work focuses on finding a single optimal solution to balance the two objectives, which is limiting considering the diverse solutions on the Pareto front. This work intends to provide controllable trade-offs according to the user’s preference between the two objectives, which is defined as a reference vector. To achieve this goal, we apply multi-objective optimization (MOO), which can find solutions from various regions of the Pareto front. However, it is challenging to precisely control the trade-off due to the stochasticity of the training process and the high-dimensional gradient vectors. Thus, we propose Controllable Pareto Trade-off (CPT), which can effectively train models to perform different trade-offs according to users’ preferences. CPT 1) stabilizes the fairness update with a moving average of stochastic gradients to determine the update direction, and 2) prunes the gradients by only keeping the gradients of the critical parameters. We evaluate CPT on hate speech detection and occupation classification tasks. Experiments show that CPT can achieve a higher-quality set of solutions on the Pareto front than the baseline methods. It also exhibits better controllability and can precisely follow the human-defined reference vectors.
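The two stabilization steps named in the abstract, a moving average of stochastic gradients and gradient pruning, can be sketched as follows. This is an illustration only, not the CPT algorithm itself: the abstract does not specify the averaging scheme or how critical parameters are chosen, so the exponential moving average, the largest-magnitude selection rule, and the names `ema_update` and `prune_top_k` are all assumptions.

```python
def ema_update(average, gradient, beta=0.9):
    """Exponential moving average of gradients (assumed averaging scheme).

    A larger `beta` gives a smoother, slower-changing update direction.
    """
    return [beta * a + (1.0 - beta) * g for a, g in zip(average, gradient)]


def prune_top_k(gradient, k):
    """Keep the k largest-magnitude entries and zero the rest.

    Magnitude is used here as a stand-in for parameter criticality,
    which the abstract leaves unspecified.
    """
    order = sorted(range(len(gradient)), key=lambda i: abs(gradient[i]),
                   reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(gradient)]


# Toy usage: smooth three noisy fairness gradients, then prune.
average = [0.0, 0.0, 0.0]
for noisy in ([1.0, -0.2, 0.1], [0.8, 0.3, -0.1], [1.2, -0.1, 0.0]):
    average = ema_update(average, noisy)
direction = prune_top_k(average, k=1)
print(direction)
```

In the toy run the first coordinate has a consistent sign across steps while the others fluctuate, so averaging followed by pruning keeps only the stable coordinate as the update direction.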
What are They Thinking? Delineation, Probing, and Tracking of Concepts in LLMs
Mohamed Abdelwahab | Michelle Yu Collins | Sihan Chen | Yi Cheng Zhao | Zafarullah Mahmood | Jiading Zhu | Soliman Ali | Jonathan Rose
As the influence of LLMs expands, it is imperative to gain insight into their decisions. One way to do that is to develop probes that detect the presence or absence of a broad set of high-level abstract concepts within the embeddings computed in an LLM, which is what we might say a model is "thinking" about. Such probes should be low-cost and easily applicable to any LLM, so that monitoring for many concepts is possible during normal operation. In this paper, we take the first steps towards developing the capability of creating many such probes by defining and executing examples of the key tasks needed: first, the careful delineation of a high-level abstract concept through the creation of a dataset with the concept both present and absent; then, the training and testing of a set of linear probes to detect the concept on any layer of an LLM, including an exploration of the complexity of the probe needed. Finally, we show that such probes can track concepts across larger contexts. This is done with four separate concepts and three different LLMs. When this process is scaled to many more concepts, it will create the ability to monitor new models.
Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment
Yavuz Faruk Bakman | Duygu Nur Yaldiz | Salman Avestimehr | Sai Praneeth Karimireddy
Large Language Models (LLMs) are rarely static and are frequently updated in practice. A growing body of alignment research has shown that models initially deemed “aligned” can exhibit misaligned behavior after fine-tuning, such as forgetting jailbreak safety features or re-surfacing knowledge that was intended to be forgotten. These works typically assume that the initial model is aligned based on static black-box evaluation, i.e., the absence of undesired responses to a fixed set of queries. In contrast, we formalize model alignment in both the static and post-update settings and uncover a fundamental limitation of black-box evaluation. We theoretically show that, due to overparameterization, static alignment provides no guarantee of post-update alignment for any update dataset. Moreover, we prove that static black-box probing cannot distinguish a model that is genuinely post-update robust from one that conceals an arbitrary amount of adversarial behavior, which can be activated by even a single benign gradient update. We further validate these findings empirically in LLMs across three core alignment domains: privacy, jailbreak safety, and behavioral honesty. We demonstrate the existence of LLMs that pass all standard black-box alignment tests, yet become severely misaligned after a single benign update. Finally, we show that the capacity to hide such latent adversarial behavior increases with model scale, confirming our theoretical prediction that post-update misalignment grows with the number of parameters. Together, our results highlight the inadequacy of static evaluation protocols and emphasize the urgent need for post-update–robust alignment evaluation.
Teaching People LLM’s Errors and Getting it Right
Nathan Stringham | Fateme Hashemi Chaleshtori | Xinyuan Yan | Zhichao Xu | Bei Wang | Ana Marasovic
People often rely on large language models (LLMs) in situations where they are ill-suited. This miscalibration is understandable: seeing LLMs compose poetry and answer complex questions can lead users to assume, incorrectly, that they will also handle simple tasks, such as basic arithmetic, without error. Prior work has attempted to address this issue by clustering instance embeddings to identify regions where an LLM is likely to fail, then automatically describing the patterns within those regions. These inferred “failure patterns” are taught to users to reduce overreliance. Yet, this approach has not been fully successful. In this paper, we investigate why. We first examine whether the negative results stem from an absence of meaningful failure patterns. Using two datasets, we group instances by their meta-labels and evaluate LLM performance within each group. We then define criteria to identify groups that are both sufficiently large and exhibit high error rates. This process reveals multiple meta-label groups that meet these criteria, indicating that actionable failure patterns do, in fact, exist. Next, we test whether prompting- and embedding-based methods can reliably surface these known failure patterns. This step is critical: if such patterns cannot be surfaced automatically, they cannot be communicated to users. We observe mixed performance across methods, which may explain the limited success of prior approaches. Finally, we revisit how teaching effectiveness is measured. We propose evaluating whether users can apply learned failure patterns to anticipate when an LLM is likely to err. A user study shows that instruction based on this metric yields measurable improvements, unlike standard human–AI team accuracy metrics. Overall, our findings suggest that teaching failure patterns can be an effective way to mitigate overreliance, but its success depends on improved automated methods for discovering these patterns and on evaluation metrics like ours.
Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States
Subramanyam Sahoo | Vinija Jain | Aman Chadha | Divya Chaudhary
Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and 𝛼NLI (abductive). At layer 32 of 40, linear probes achieve 100% cross-validated accuracy with well-separated geometry (intrinsic dimensionalities: 20.6, 28.5, 33.6; convex hull contamination ≤1.5%). However, this separation is entirely driven by format confounds. Residualizing source identity, option count, and response length reduces accuracy to chance. Trace-anchor similarity indicates largely shared reasoning across tasks (42.5% agreement vs. 33.3% chance), and causal steering with random controls (n=20) shows no functional link between geometry and reasoning mode (p=0.286). Thus, high probe accuracy reflects task format rather than computational structure, motivating routine format deconfounding in mechanistic interpretability.
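The residualization step mentioned in this abstract can be illustrated on a single feature. The sketch below assumes ordinary least squares with an intercept and one scalar confound (for example, option count); the toy numbers and function names are hypothetical, and the paper's actual procedure may differ.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)


def residualize(values, confound):
    """Remove the part of `values` that is linearly predictable from `confound`."""
    n = len(values)
    mean_v, mean_c = sum(values) / n, sum(confound) / n
    cov = sum((v - mean_v) * (c - mean_c) for v, c in zip(values, confound))
    var = sum((c - mean_c) ** 2 for c in confound)
    slope = cov / var
    return [(v - mean_v) - slope * (c - mean_c) for v, c in zip(values, confound)]


# Hypothetical data: a hidden-state feature that mostly tracks option count.
option_count = [2, 2, 4, 4, 5, 5, 2, 4]
noise = [0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1, 0.0]
feature = [2 * c + e for c, e in zip(option_count, noise)]

cleaned = residualize(feature, option_count)
print("correlation before:", pearson(feature, option_count))
print("correlation after: ", pearson(cleaned, option_count))
```

A probe trained on the cleaned feature can no longer exploit the confound, which is the kind of check the abstract argues should be routine.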
KoLegalQA: A Korean Legal QA Dataset for Trustworthy and Explanation-Grounded Legal AI
Yongtae Lee | Surin Lee | Sumin Kim | S M Wahidur Rahman | Heung-No Lee
Legal QA systems may benefit from training data that is expert-verified and associated with statutory provisions, as fluent generation alone cannot guarantee legally relevant and citation-supported outputs. However, existing Korean legal datasets provide limited support for legal QA and statute-associated response generation. To address this gap, we introduce KoLegalQA, a large-scale Korean legal question–answer corpus designed for research on legal QA and explanation-oriented legal response generation in real-world consultation scenarios. The dataset comprises 19k consultations collected from government-operated services, with all responses originally authored or verified by licensed legal professionals. Unlike prior resources, KoLegalQA provides explicit statutory references and clause-level summaries, enabling research on citation-associated and explanation-oriented legal response generation. We benchmark six Korean-capable LLMs using both automated evaluation (G-Eval) and human assessment across multiple criteria, including legal correctness, reasoning quality, and citation relevance. Experimental results show that fine-tuning on KoLegalQA generally improves legal reasoning validity and statute-associated response generation across most evaluated models. We present this resource as a practical benchmark dataset for Korean legal NLP research. Dataset splits, preprocessing scripts, and evaluation code will be publicly released to support reproducible research.
Authorization-First Retrieval: Enforcing Least Privilege in Multi-Agent RAG Systems
Rohith Namboothiri
Retrieval-augmented generation systems serving multiple users under role-based access control face a trustworthiness gap: semantic retrieval operates on embedding similarity rather than authorization predicates and can introduce unauthorized content into a model’s context window before any filter intervenes. We formalize this as a pipeline ordering problem and introduce Authorization-First Retrieval (AFR), an architectural invariant requiring that authorization constrain the retrieval candidate set before any learned component consumes retrieved content. We reduce authorization correctness to the classical noninterference property and prove AFR is necessary whenever the processing model violates noninterference—a condition our experiments confirm empirically. Evaluation on a controlled corpus of 247 chunks across 232 documents with 431 base queries spanning 12 enterprise roles and 9 domains (584 total queries including negation exploitation and parametric probes) shows that retrieve-then-filter pipelines expose unauthorized context in 86.1% of queries, while AFR eliminates structural leaks by construction. Cross-model experiments with Gemini 2.0 Flash and GPT-4o-mini reveal that structural exposure is an architectural property independent of the underlying model, whereas behavioral defenses fail at model-dependent rates, producing answer leakage of 41.3% and 29.5% respectively under retrieve-then-filter. A negation exploitation study demonstrates consistent disclosure vulnerabilities across framing types, while a metadata-tag freshness ablation shows that conditional authorization mechanisms degrade under realistic policy staleness. Stress tests across retrieval depths and chunking granularities confirm AFR’s robustness. Our results demonstrate that behavioral guardrails and metadata tagging cannot reliably enforce least privilege in RAG pipelines, while authorization-first architectures provide a verifiable and model-independent security guarantee.
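The pipeline-ordering difference described in this abstract can be sketched in a few lines. The example assumes a toy corpus where each chunk carries a set of permitted roles and a precomputed similarity score; the `Chunk` type, the role names, and the scores are hypothetical and are not the paper's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_roles: frozenset
    score: float  # stand-in for embedding similarity to the query


def retrieve_then_filter(chunks, role, k):
    """Rank first, filter afterwards: unauthorized chunks enter the top-k set."""
    top_k = sorted(chunks, key=lambda c: c.score, reverse=True)[:k]
    exposed = [c for c in top_k if role not in c.allowed_roles]
    returned = [c for c in top_k if role in c.allowed_roles]
    return returned, exposed


def authorization_first(chunks, role, k):
    """Constrain the candidate set by authorization before any ranking."""
    permitted = [c for c in chunks if role in c.allowed_roles]
    return sorted(permitted, key=lambda c: c.score, reverse=True)[:k]


corpus = [
    Chunk("Q3 salary bands", frozenset({"hr"}), 0.95),
    Chunk("Severance terms", frozenset({"hr"}), 0.90),
    Chunk("Public holiday calendar", frozenset({"hr", "engineer"}), 0.80),
    Chunk("Deployment runbook", frozenset({"engineer"}), 0.70),
]

returned, exposed = retrieve_then_filter(corpus, "engineer", k=2)
print("retrieve-then-filter returned:", [c.text for c in returned])
print("retrieve-then-filter exposed: ", [c.text for c in exposed])
print("authorization-first returned: ", [c.text for c in authorization_first(corpus, "engineer", k=2)])
```

In this toy case the post-hoc filter also leaves the engineer with no usable context, because the unauthorized chunks crowd out the permitted ones before filtering.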
PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage
Krishna Kanth Nakka | Xue Jiang | Dmitrii Usynin | Xuebing Zhou
This paper investigates privacy jailbreaking in large language models (LLMs) via steering, examining whether targeted manipulation of internal activations can circumvent the alignment mechanisms and alter model behaviour on privacy-sensitive queries, such as those concerning sexual orientation of public figures. Our approach begins by identifying attention heads predictive of refusal behaviour for a given private attribute, using lightweight linear probes trained on labels provided by a privacy evaluator. We then apply steering to a carefully selected subset of these heads, guided by the probe outputs, to induce positive responses from the model. Empirical results demonstrate that these steered responses frequently reveal the target attribute, as well as additional personal information about the data subject, including life events, relationships, and biographical details. Evaluations across three LLMs show that steering achieves disclosure rates of at least 80% with several responses containing real personal information. This controlled study highlights a concrete privacy risk: personal information memorised during pre-training can be extracted through targeted activation-level interventions, without reliance on computationally intensive adversarial prompting techniques.
Coercion Suppression Increases Preference Hallucinations via a Deceptive Bypass in K-Level Negotiation Agents
Jihye Kim
K-Level reasoning—recursive modeling of opponent beliefs—improves LLM negotiation utility but frequently elicits coercive and toxic behaviors that undermine real-world deployability. We propose an Observer–Planner–Actor architecture with a Modular Appraisal Gate that (i) dynamically estimates the opponent’s cognitive level and (ii) filters hostile drafts via an LLM-as-a-judge. In randomized interventions on the CaSiNo dataset, our gated agent eliminates toxicity (0%) and reduces coercion from 35% to 6% compared to a strong static-K baseline, albeit with an alignment tax in utility. However, the gate does not reduce preference hallucinations—strategic misrepresentation of the agent’s own priorities. K-Level reasoning incidentally suppresses this behavior (from 35% in a vanilla baseline to 22%), but gating coercion releases the suppression, returning hallucination to vanilla-baseline levels (33–37%). We term this pattern a deceptive bypass: output-level filters address the form of hostility but leave surface-compliant manipulation channels intact, demonstrating that they alone are insufficient to align utility-driven strategic agents.
Purdah and Patriarchy: Evaluating and Mitigating South Asian Biases in Open-Ended Multilingual LLM Generations
Mamnuya Rinki | Chahat Raj | Anjishnu Mukherjee | Ziwei Zhu
Evaluations of Large Language Models (LLMs) often overlook intersectional and culturally specific biases, particularly in underrepresented multilingual regions like South Asia. This work addresses these gaps by conducting a multilingual and intersectional analysis of LLM outputs across 10 Indo-Aryan and Dravidian languages, identifying how cultural stigmas influenced by purdah and patriarchy are reinforced in generative tasks. We construct a culturally grounded bias lexicon capturing previously unexplored intersectional dimensions including gender, religion, marital status, and number of children. We use our lexicon to quantify intersectional bias and the effectiveness of self-debiasing in open-ended generations (e.g., storytelling, hobbies, and to-do lists), where bias manifests subtly and remains largely unexamined in multilingual contexts. Finally, we evaluate two self-debiasing strategies (simple and complex prompts) to measure their effectiveness in reducing culturally specific bias in Indo-Aryan and Dravidian languages. Our approach offers a nuanced lens into cultural bias by introducing a novel bias lexicon and evaluation framework that extends beyond Eurocentric or small-scale multilingual settings.
Ghost Context: Measuring Cross-Context Interference in Long-Context Language Models
Rohith Namboothiri
Long-context language models assemble prompts from heterogeneous sources, and deployed systems implicitly trust the model to use the correct span of context. We show that this assumption is often violated: irrelevant spans can silently shape outputs, producing errors that are neither fabrication nor omission but misattributed grounding, that is, claims supported by the wrong part of the input context. Unlike intrinsic hallucination (contradicting the source) or extrinsic hallucination (introducing unsupported claims), misattributed grounding uses real evidence from an incorrect span, making it invisible to standard source-blind faithfulness metrics. We formalize this phenomenon as Ghost Context and introduce a causal mask-and-rerun attribution protocol to measure it. Across a 272-case corpus spanning multiple interference scenarios, we evaluate three widely used models and report two complementary signals: strict Ghost Context Rate (GCR), which captures verifiable factual misattribution, and open-ended influence, which captures broader contextual shaping effects. Under realistic contextual conflict, strict GCR spikes substantially: temporal contradictions trigger misattributed grounding in 38.3% of cases. Across all scenarios, open-ended distractor influence occurs in 20.4% of evaluations. Importantly, Ghost Context is not only detectable but also remediable. Masking the single highest-attributed distractor span resolves 95.5% of detected errors (Fix@1) with 2.4% collateral damage and zero false positives on negative controls. We also introduce Contextual Invariance Rate (CIR) as a system-level robustness metric measuring invariance to irrelevant context. Our findings show that contextual conflict, common in retrieval-augmented generation and agent systems, can systematically degrade reliability, but also reveal that Ghost Context errors are causally localizable and cheaply correctable.
We release the evaluation corpus, detection pipeline, and experimental results to support further research on trustworthy long-context language model evaluation.
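The mask-and-rerun idea in this abstract can be sketched as follows. The example assumes a deterministic toy "model" that answers with the year from the last span mentioning the query term; the toy model, the spans, and the function names are hypothetical and only illustrate the attribution loop, not the released pipeline.

```python
import re


def toy_model(spans, query):
    """Hypothetical stand-in for an LLM: answer with the year in the last matching span."""
    answer = None
    for span in spans:
        if query in span:
            match = re.search(r"\b(\d{4})\b", span)
            if match:
                answer = match.group(1)
    return answer


def mask_and_rerun(model, spans, query):
    """Remove each span in turn and record which removals change the output."""
    baseline = model(spans, query)
    influential = []
    for i in range(len(spans)):
        masked = spans[:i] + spans[i + 1:]
        if model(masked, query) != baseline:
            influential.append(i)
    return baseline, influential


spans = [
    "The policy took effect in 2019.",                  # index 0: intended (gold) span
    "Weather was mild that spring.",                    # index 1: irrelevant span
    "An old memo said the policy took effect in 2017.", # index 2: distractor span
]
gold_index = 0

baseline, influential = mask_and_rerun(toy_model, spans, "policy")
is_ghost_context = bool(influential) and gold_index not in influential
print("baseline answer:", baseline)
print("causally influential spans:", influential)
print("misattributed grounding detected:", is_ghost_context)
```

Masking the single influential distractor and rerunning recovers the answer from the intended span, which mirrors the Fix@1 remediation the abstract reports.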
Understanding the Effects of Safety Unalignment on Reasoning- and Instruction-Tuned Large Language Models
John Timothy Halloran
Alignment has become a critical step towards enabling large language model (LLM) safety guardrails which ensure models provide helpful and harmless responses, while refusing malicious and harmful requests. However, two separate lines of recent work–unalignment via fine-tuning, i.e., jailbreak-tuning (JT), and weight orthogonalization (WO)–have shown that LLM guardrails may be circumvented, such that LLMs obey harmful requests which they would normally refuse. Despite the safety implications of such unalignment procedures, a comprehensive analysis directly contrasting these methods is currently lacking, as is a study of these methods’ impact on malicious LLM capabilities and reasoning models. Using both JT and WO, we study the impact of unaligning six popular LLMs–three reasoning LLMs of various sizes and their instruction-tuned analogues–across harmful safety tasks. Compared to JT, we show that WO produces models which are more effective at adversarially attacking LLMs–in particular, WO reasoning LLMs excel at such adversarial attacks. Interestingly, while increasing adversarial attack efficacy, we show that WO does not drastically increase hallucination rates. This is in stark contrast to JT, which may more than double the hallucination rate of both reasoning and instruction-tuned models alike. Finally, we show that off-the-shelf supervised fine-tuning effectively limits the adversarial attack abilities enabled by WO, without drastically increasing hallucination rates.
ReacTOD: Bounded Neuro-Symbolic Agentic NLU for Zero-Shot Dialogue State Tracking
Yanjun Lin | Zimo Xiao | Kartik Natarajan | Mahesh Sankaranarayanan | Niraj Nawanit | Rakshit Parashar | Austin Zhang | Karthik Konaraddi | Rishita Mote | Wei Niu
Task-oriented dialogue systems—handling transactions, reservations, and service requests—require predictable behavior, yet the moderately-sized LLMs needed for practical latency are prone to hallucination and format errors that cascade into incorrect actions (e.g., a hotel booked for the wrong date). We propose ReacTOD, a bounded neuro-symbolic architecture that reformulates NLU as discrete tool calls within a self-correcting ReAct loop governed by deterministic validation. A bounded ReAct loop enables iterative self-correction, improving accuracy by up to 9.3 percentage points over single-pass inference on MultiWOZ. A symbolic validator enforces action compliance, schema conformance, and coreference consistency on every dialogue state update, achieving a 93.1% self-correction rate on intercepted errors and producing structured execution traces. Incremental state prediction and on-demand history retrieval keep prompts compact, empirically improving instruction adherence in parameter-constrained models. On MultiWOZ 2.1, ReacTOD achieves a new zero-shot state-of-the-art: gpt-oss-20B reaches 52.71% joint goal accuracy, surpassing the previous best by 14 percentage points, while Qwen3-8B achieves 47.34% with only 8B parameters. On the Schema-Guided Dialogue (SGD) benchmark, ReacTOD with Claude-Opus-4.6 achieves 80.68% JGA under fully end-to-end evaluation with predicted domains, and Qwen3-32B reaches 64.09%—demonstrating cross-benchmark generalization without task-specific training data.
Geometric Deviation as an Unsupervised Pre-Generation Reliability Signal: Probing LLM Representations for Answerability
Yucheng Du
A reliable language model should be able to signal, prior to generation, when a query falls outside its knowledge. We investigate whether representation geometry can provide such a pre-generation signal by measuring the deviation of hidden states from an answerable reference set, requiring no labeled failure data and no access to model outputs. Across three instruction-tuned models (Llama 3.1-8B, Qwen 2.5-7B, and Mistral-7B-Instruct) and three prompt forms (Math, Fact, Code), we find that geometry primarily encodes task form. Within mathematical prompts, unanswerable inputs consistently deviate from the answerable centroid, yielding strong separation (ROC-AUC 0.78-0.84). This single-pass pre-generation signal outperforms a simple refusal baseline and compares favorably to self-consistency. It also captures cases where models do not explicitly refuse. In contrast, no reliable geometric signal emerges for factual prompts, indicating that the effect is form-conditional rather than universal. Code prompts show large effect sizes with higher variance, suggesting partial generalization beyond mathematical form. A layer-wise analysis reveals that the signal arises in early layers and gradually attenuates toward the output. These results suggest that answerability-related geometry is established before the final stages of generation. Together, these findings indicate that geometric deviation can serve as a lightweight pre-generation signal that is reliable in structured domains with formal answerability constraints, with clear boundaries on where it generalizes.
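A centroid-deviation score of the kind described here can be sketched with toy vectors. The example assumes Euclidean distance to the mean of an answerable reference set and a pairwise-ranking ROC-AUC; the distance choice, the two-dimensional vectors, and the function names are assumptions for illustration, since the abstract does not specify the exact metric.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def deviation(vector, center):
    """Euclidean distance from a hidden state to the reference centroid."""
    return math.sqrt(sum((v - c) ** 2 for v, c in zip(vector, center)))


def roc_auc(positive_scores, negative_scores):
    """Probability that a positive scores higher than a negative (ties count half)."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical 2-d "hidden states"; real ones would come from a model layer.
answerable_reference = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]]
answerable_eval = [[1.05, 0.05], [0.95, -0.05]]
unanswerable_eval = [[0.2, 0.9], [1.8, 0.7]]

center = centroid(answerable_reference)
answerable_dev = [deviation(v, center) for v in answerable_eval]
unanswerable_dev = [deviation(v, center) for v in unanswerable_eval]

# Unanswerable prompts are the positive class: larger deviation means "flag it".
print("ROC-AUC:", roc_auc(unanswerable_dev, answerable_dev))
```

Only the reference set of answerable prompts is needed to build the centroid, so no labeled failures and no generated outputs enter the score.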
Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection
Yusser Al Ghussin | Daniil Gurgurov | Tanja Baeumel | Josef Van Genabith | Patrick Schramowski | Simon Ostermann
Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on English-only data, and steering layers are chosen heuristically. We address these limitations by advancing a principled, mechanistic account of multilingual language steering with SAEs. First, we show that training SAEs on multilingual data consistently strengthens cross-lingual representations and yields more reliable, quality-preserving language control across layers and model families. Second, we introduce an a priori steering layer-selection rule based on the intersection of multilingual alignment and language separability, which predicts effective intervention depths without exhaustive layerwise search. We evaluate our approach on LLaMA-3.1-8B and Gemma-2-9B across machine translation and cross-lingual summarization (CrossSumm), using SpBLEU, ROUGE-L, COMET, and LaSE. Our results show that multilingual SAEs combined with intersection-selected layers stabilize the trade-off between language identification accuracy and generation quality, providing a principled, predictive, representation-level account of multilingual SAE steering.
Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment
Zhiyu Xue | Zimo Qi | Guangliang Liu | Bocheng Chen | Ramtin Pedarsani
Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal answers. Although safety alignment is widely adopted in industry, the overrefusal problem, where aligned LLMs also reject benign queries after safety alignment post-training, remains insufficiently studied. Such an issue degrades the usability of safety alignment in real-world applications. In this paper, we examine how overrefusal arises under safety alignment, and propose a mitigation strategy inspired by our findings. We define refusal triggers as linguistic cues in the training data that elicit refusal responses. Safety alignment encourages LLMs to associate refusal triggers within a training sample with refusal responses, leading aligned LLMs to refuse harmful queries. However, the refusal triggers include not only harmful linguistic cues but also non-harmful cues, therefore causing overrefusal to benign queries. Building on this mechanistic analysis, we propose a method that explicitly considers refusal triggers in the safety alignment fine-tuning. Empirical results demonstrate that our approach achieves a more favorable trade-off between defense against jailbreak attacks and responsiveness to benign queries, outperforming prior methods. Warning: this paper contains harmful and biased sentences.
Retrieval-Augmented Generation (RAG) systems fail in diverse, poorly characterized ways that single-stage evaluation metrics cannot detect. We present a systematic taxonomy of 33 failure modes across 7 pipeline stages — ingestion, representation, retrieval, generation, evaluation, deployment, and agentic orchestration — constructed through a structured literature review of 48 sources spanning peer-reviewed publications and high-impact preprints. For each mode, we provide a formal definition, observable manifestation, and three-level evidence grading (Strong/Moderate/Limited). Our analysis reveals a critical asymmetry in research attention: retrieval and generation failures are comparatively well-studied, while representation, evaluation, and agentic orchestration failures remain under-investigated despite frequent occurrence in production. We identify 12 failure modes with no dedicated peer-reviewed empirical evidence — all 8 agentic modes among them — constituting an evidence desert in the fastest-growing RAG deployment paradigm. Compared to prior work enumerating 7 failure points (Barnett et al., 2024) or 16 error types within partial pipeline runs (Cresswell et al., 2025), our taxonomy uniquely spans the full pipeline, including agentic orchestration with explicit evidence-level grading.
Improving the Faithfulness of LLM-based Abstractive Summarization with Span-level Unlikelihood Training
Sicong Huang | Qianqi Yan | Shengze Wang | Ian Lane
Abstractive summarization using large language models (LLMs) has become an essential tool for condensing information. Despite their ability to generate fluent summaries, these models often produce texts that are unfaithful to the original documents, manifested through hallucinations of specific words, phrases, or concepts. Current approaches to mitigating unfaithfulness typically involve post-processing corrections or contrastive learning from synthetically generated negative samples, which do not fully address the spectrum of errors that can arise in LLM-generated summaries. In this paper, we introduce a novel approach to fine-tune LLMs specifically to reduce the occurrence of unfaithful spans of text in generated summaries. We first annotate span-level hallucinations in LLM-generated summaries using automatic labeling with GPT-4. We then fine-tune the LLM using both summaries with no hallucinations and spans of hallucinated text to improve the faithfulness of the model. This paper introduces a dataset labeled to distinguish between faithful and unfaithful content and compares the performance of three techniques: gradient ascent, unlikelihood training, and task vector negation. Our experimental results show that unlikelihood training can effectively use span-level annotations to enhance summary faithfulness, reducing the number of summaries with hallucinations from 31% to 13% (a relative reduction of 58%) on the CNN summarization dataset and from 33% to 20% (a relative reduction of 39%) on the SAMSum dataset.
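A span-level unlikelihood objective can be sketched on per-token probabilities. The example assumes the standard unlikelihood term, -log(1 - p), on tokens inside annotated hallucinated spans and the usual likelihood term, -log(p), elsewhere; how the paper weights or combines the two terms is not stated in the abstract, so the simple mean used here is an assumption, as are the function and argument names.

```python
import math


def span_unlikelihood_loss(token_probs, hallucinated_mask, eps=1e-12):
    """Mean per-token loss over one summary.

    token_probs[i] is the model's probability of the i-th reference token;
    hallucinated_mask[i] is True if that token lies in a hallucinated span.
    """
    total = 0.0
    for p, is_hallucinated in zip(token_probs, hallucinated_mask):
        if is_hallucinated:
            # Unlikelihood: penalize probability mass on the unfaithful token.
            total += -math.log(max(1.0 - p, eps))
        else:
            # Likelihood: reward probability mass on the faithful token.
            total += -math.log(max(p, eps))
    return total / len(token_probs)


# One faithful token followed by one hallucinated token.
confident_hallucination = span_unlikelihood_loss([0.9, 0.9], [False, True])
suppressed_hallucination = span_unlikelihood_loss([0.9, 0.1], [False, True])
print("loss when the hallucinated token is likely:  ", confident_hallucination)
print("loss when the hallucinated token is unlikely:", suppressed_hallucination)
```

Minimizing this loss pushes probability away from the annotated spans while leaving the faithful tokens under ordinary maximum-likelihood training.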
Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs
Jinhwa Kim | Ian Harris
While Large Language Models (LLMs) have shown significant advancements in performance, various jailbreak attacks have posed growing safety and ethical risks. Malicious users often exploit adversarial context to deceive LLMs, prompting them to generate responses to harmful queries. In this study, we propose a new defense mechanism called Context Filtering, an input pre-processing method designed to filter out untrustworthy and unreliable context while identifying the primary prompts containing the real user intent to uncover concealed malicious intent. Given that enhancing the safety of LLMs often compromises their helpfulness, potentially affecting the experience of benign users, our method aims to improve the safety of the LLMs while preserving their original performance. We evaluate the effectiveness of our model in defending against jailbreak attacks through comparative analysis, comparing our approach with state-of-the-art defense mechanisms against six different attacks and assessing the helpfulness of LLMs under these defenses. Our model demonstrates its ability to reduce the Attack Success Rates of jailbreak attacks by up to 92% while maintaining the original LLMs’ performance, achieving state-of-the-art Safety and Helpfulness balance. Notably, Context Filtering is a plug-and-play method that can be applied to all LLMs, including both white-box and black-box models, to enhance their safety without requiring any fine-tuning of the models themselves.
Lexical Familiarity Predicts Processing Depth for Nonliteral Language in Large Language Models
Lang-Ching Yeh | Yu-Chieh Wang | Shu-Kai Hsieh
This paper investigates how large language models internally process nonliteral language. Analyzing five categories spanning slang, metaphor, and idioms across all 48 layers of Gemma-3-12B-IT with Gemma Scope 2 sparse autoencoders, we find a lexical familiarity gradient: processing depth depends on available prior lexical knowledge, not figurative type. Idioms diverge at L1 as entrenched units; expressions built from familiar words (metaphors, semantic-shift and constructional slang) converge at L7–9; neologisms peak at L41, activating 3× more unique features. Paraphrase residual analysis confirms strong signals only at the gradient endpoints, yielding a three-tier hierarchy of entrenched retrieval, known-word reanalysis, and novel-word construction. Crucially, this peak-layer structure replicates in base models (Gemma-PT, Qwen-Base), demonstrating that the gradient is a robust property of pretrained representations rather than an alignment artifact. We additionally identify an activation density confound in SAE feature counts that produces spurious cross-condition convergence. Overall, processing depth is better predicted by lexical familiarity than by figurative type, with implications for robustness to non-standard language and for SAE-based interpretability.
Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behavior through a prospective memory-inspired lens from cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2–21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90–100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model’s GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers, with no LLM-as-judge component, on publicly available datasets.
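Deterministic checkers of the kind this abstract relies on can be sketched briefly. The example assumes one terminal constraint (the response must end with a marker) and one avoidance constraint (a word must not appear), plus a joint check for stacked constraints; the specific constraints, the marker, and the function names are hypothetical and are not drawn from the paper's checker suite.

```python
import re


def ends_with_marker(response, marker):
    """Terminal constraint: the response must end with the given marker."""
    return response.rstrip().endswith(marker)


def avoids_word(response, word):
    """Avoidance constraint: the word must not appear as a whole word."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, response, re.IGNORECASE) is None


def joint_compliance(response, checks):
    """Stacked constraints: compliant only if every individual check passes."""
    return all(check(response) for check in checks)


checks = [
    lambda r: ends_with_marker(r, "DONE"),
    lambda r: avoids_word(r, "however"),
]

good = "The total is 42. DONE"
bad = "However, the total is 42."
print("good response compliant:", joint_compliance(good, checks))
print("bad response compliant: ", joint_compliance(bad, checks))
```

Because each check is a pure function of the response text, compliance rates can be computed without any model-based judge.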
Large language models are increasingly used in strategic and advisory contexts, yet their safety alignment is typically evaluated in English only. We test nine models from six providers and ask whether the language of a prompt can change a model’s decision in a high-stakes scenario. We use single-turn game-theoretic vignettes in which a model advises a nuclear-armed nation on whether to strike a defenseless opponent. The prompt is intentionally amoral and strategically identical across languages. We find that Japanese prompts reduce launch rates in the Claude model family: Claude Sonnet 4.6 drops from 40% to 0% in scenarios where the strike is unnecessary and from 93% to 17% in contested scenarios, with minimal effect when the strike is strategically rational. The effect extends to Gemini Pro 3.1 (53% to 13%). A cross-language experiment isolates the mechanism: when instructed to reason in Japanese in an English prompt, launch rates drop from 93% to 37%. It is the language the model is asked to reason in, not the language of the input, that drives the effect. When reasoning in Japanese, models spontaneously generate moral vocabulary ("moral cost", "millions of lives") that is entirely absent from the prompt. Five other models show no language effect, but they launch in nearly every condition regardless of language. The effect requires a model that already hesitates in English. These results show that LLM safety behavior is language-dependent, and that evaluating in English alone can miss both risks and safeguards encoded in other languages.
Large language models (LLMs) are increasingly deployed with safety alignment mechanisms designed to prevent harmful outputs including hate speech, harassment, and unsafe instructions. However, existing safety evaluation frameworks remain heavily centered on English and standardized language varieties, creating a critical gap for languages characterized by extensive dialectal variation. Arabic provides a particularly important case: everyday communication across the Arab world occurs predominantly in regional dialects rather than Modern Standard Arabic (MSA), yet these dialects are systematically underrepresented in alignment training corpora and safety benchmarks. In this paper we introduce the Dialect Safety Gap, defined as systematic variation in LLM safety behavior across dialects of the same language. We argue that this phenomenon arises from the interaction between alignment training procedures and linguistic variation: safety alignment implicitly encodes normative patterns present in training datasets, and when dialectal forms diverge from those patterns, safety behavior degrades through lexical, morphological, and pragmatic mechanisms. We propose a formal framework grounded in algorithmic fairness that links dialect variation to alignment pipeline design, introduce both a binary DSG Score and a magnitude-aware Pairwise Dialect Inconsistency metric, and propose the Dialect-Aware Safety Evaluation Protocol (DASEP) as a practical evaluation framework. We demonstrate the feasibility of dialect-aware evaluation through a controlled, human-annotated prompt-probe experiment across five Arabic variety groups, revealing a structured gradient of safety degradation that correlates with linguistic distance from MSA.
Single-layer activation edits easily corrupt a language model’s correct factual answers but rarely repair its errors. On a curated factual-recall benchmark, corruption flips 70–100% of correct answers across three models, while twelve blind methods (no access to the correct answer) fix at most 6% within every evaluation pool. Per-instance gradient optimization ostensibly fixes 39%, but norm-constrained analysis reveals a magnitude artifact: at oracle-matched norms the fix rate drops to random, directions are nearly orthogonal to oracle directions (cos = -0.04), and collateral damage makes the net effect negative. An oracle ablation controlling for budget, target identity, and directional noise points to a direction-selection bottleneck: repair requires a precise, per-question direction that blind methods cannot locate. Target-informed methods partially succeed but none generalizes to unseen distributions.
Truth or Dare: Analyzing LLM Susceptibility to External Evidence of Varying Factuality
Han-Yu Su | Kuan-Yu Chu | Yung-Hui Li | Lun-Wei Ku
Modern Large Language Models (LLMs) often rely on Retrieval-Augmented Generation (RAG) to access up-to-date information; however, retrieved corpora may contain misleading, outdated, or incorrect content, raising concerns about how such evidence affects model reliability. In this work, we investigate the susceptibility of LLMs to false external evidence. Existing studies have shown that poisoned external corpora can mislead LLM responses; yet, there is still a lack of studies on the effects of different evidence properties. To bridge this gap, we design comprehensive experiments along three dimensions: styles of evidence, quantity of evidence, and the semantic similarity between external messages and the model’s internal belief. We find that instructive-style evidence causes the most severe performance degradation. We also observe a steady decline in model response quality as the amount of false evidence accumulates. Finally, we show that LLMs are more susceptible to factually incorrect evidence when it is semantically close to the model’s parametric knowledge.
The Halo Effect and Language Takeover: Spatiotemporal Attention Decay Explains Vision-Language Model Failures in Simple Visual Counting
Haochen Zhao | Sujian Li
Despite their remarkable capabilities in complex multimodal reasoning, Vision Language Models (VLMs) exhibit a perplexing inability to perform elementary visual counting tasks reliably. Existing hypotheses, often centering on input resolution or patch tokenization, fail to fully explain the stochastic nature of these errors, particularly in multi-digit generation. In this work, we investigate the internal decision-making dynamics of VLMs (e.g., Qwen3-VL, Gemma3) through the lens of attention mechanisms. By leveraging a controlled synthetic dataset and introducing novel metrics for Visual Sparsity and Entropy, we discover a novel phenomenon: Spatiotemporal Attention Decay. Our analysis reveals two distinct failure modes. Spatially, models exhibit a Halo Effect, where attention focuses on the peripheral convex hull of object clusters rather than penetrating the geometric centers of individual instances. Temporally, we observe a phenomenon of Language Takeover: during auto-regressive decoding, visual grounding decays rapidly after the initial token. Quantitative analysis confirms that as attention sparsity drops and entropy rises, the generation of subsequent digits degenerates from visual perception into hallucination driven by language priors. These findings suggest that counting failures stem from the model’s inability to maintain spatiotemporal focus, highlighting the need for mechanisms that enforce persistent visual grounding.
Why is "Chicago" Predictive of Deceptive Reviews? Using LLMs to Discover Language Phenomena from Lexical Cues
Jiaming Qu | Mengtian Guo | Yue Wang
Deceptive reviews mislead consumers, harm businesses, and undermine trust in online marketplaces. Machine learning classifiers can learn from large amounts of data to distinguish deceptive reviews from genuine ones. However, the distinguishing features learned by these classifiers are often subtle, fragmented, and difficult for humans to interpret, which can hinder user understanding and trust. In this work, we study whether large language models (LLMs) can translate such unintuitive lexical cues into human-understandable language phenomena. We propose a conjecture-then-validate framework, and show that language phenomena obtained in this manner are empirically grounded in data, generalizable across similar domains, and more predictive than phenomena derived from LLMs’ prior knowledge or in-context learning. Such phenomena can aid people in critically assessing the credibility of online reviews in environments where deception detection classifiers are unavailable.
Domain-Dependent Safety Behavior in Open-Weight LLMs: An Empirical Study Across Seven Ethical Domains
Zacharie Bugaud
We present a systematic study of domain-dependent safety behavior in open-weight LLMs: 7 standardized experiments across 7 ethical domains, testing 5 models (12B–70B) in 4,200 interactions with dual-judge validation. Using a dual-condition methodology, each scenario tested in both an analytical framing (identify the harm) and an operational framing (help commit the harm), we find compliance rates vary from 14.7% (human trafficking) to 85.7% (surveillance design), a 71-percentage-point span with non-overlapping cluster-bootstrapped 95% CIs. Domain accounts for 36% of pair-level variance in harm scores, with scenario (26%) exceeding model identity (15%). A stable model safety hierarchy persists across domains (mean Spearman ρ = 0.68). These findings demonstrate that safety alignment is not a general capability: aggregate safety scores mask critical domain-level variation, motivating domain-specific safety auditing for trustworthy deployment.
A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification
Stephanie Brandl | Oliver Eberle
Instruction-tuned LLMs are able to provide *an* explanation about their output to users by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a *good* explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations. For this, we collected human rationale annotations for Climate-Fever, a claim verification dataset. We furthermore evaluate the faithfulness of human and self-explanation rationales with respect to correct model predictions, and extend the study by incorporating post-hoc attribution-based explanations. We analyse four open-weight LLMs and find that alignment between self-explanations and human rationales depends strongly on text length and task complexity. Nevertheless, self-explanations yield faithful subsets of token-level rationales, whereas post-hoc attribution methods tend to emphasize structural and formatting tokens, reflecting fundamentally different explanation strategies.
Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs
Amr Hegazy | Mostafa Elhoushi | Amr Alanwar
Controlling undesirable LLM behaviors typically requires costly fine-tuning, while existing inference-time steering methods lack fine-grained adaptivity. We introduce a lightweight, trainable controller network for adaptive inference-time control. The controller observes intermediate LLM activations to predict a global scaling factor and layer-specific weights, which dynamically modulate a pre-computed “refusal direction” vector. Trained on harmful and benign prompts, the controller learns to apply nuanced, layer-aware steering selectively. Experiments on Llama and Mistral models show our method significantly increases refusal rates on safety benchmarks like ToxicChat, outperforming existing approaches without altering the original model parameters.
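As a rough illustration of the modulation step this abstract describes, the sketch below assumes that the controller's two outputs (a global scaling factor and a per-layer weight) multiply a fixed refusal direction, which is then added to one layer's hidden state. The additive update, the function name `steer`, and the toy vectors are assumptions made for illustration; the abstract does not state the exact update rule.

```python
from typing import List, Sequence


def steer(hidden: Sequence[float],
          direction: Sequence[float],
          global_scale: float,
          layer_weight: float) -> List[float]:
    """Shift one layer's hidden state along a fixed direction.

    Assumed form: h' = h + (global_scale * layer_weight) * direction.
    """
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have the same size")
    coeff = global_scale * layer_weight
    return [h + coeff * d for h, d in zip(hidden, direction)]


# Toy example: a 2-dimensional hidden state and direction (hypothetical values).
steered = steer([1.0, 2.0], [0.5, -1.0], global_scale=2.0, layer_weight=0.5)
unchanged = steer([1.0, 2.0], [0.5, -1.0], global_scale=2.0, layer_weight=0.0)
```

Under this assumed form, a layer weight of zero leaves that layer untouched, which is one way a controller could apply steering selectively per layer.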
SURGELLM: Rethinking Multi-Task Evaluation through Task-Aware Feature Gating with Class-Balanced Normalization
Noor Islam S. Mohammad | Ulug Bayazit
Fine-tuned encoders deployed across heterogeneous NLP tasks face three compounding problems: mismatched inductive biases, class-imbalance corruption of feature statistics, and no mechanism to condition attention on external lexical knowledge. We introduce SURGELLM, a unified transformer framework that addresses each with a dedicated lightweight module: a surgical feature gate (learned per-dimension sigmoid over curated lexical indicators and [CLS]; provably degenerates to identity when features are uninformative), task-conditioned prefix tokens (quantized feature values and task identity prepended to every input), and Instance-Weighted Normalization (IWN; removes class-prior bias from gate statistics). We prove an excess-risk bound linking gate benefit to surgical feature alignment. Across four tasks, SST-2, multi-hop retrieval, LLM-prompt attribution, and authorship detection, covering 17,830 examples and eleven model variants over three seeds, the IWN variant achieves macro-F1 0.940 (+0.036 over the strongest non-IWN baseline; +0.130 on authorship detection). A random-vocabulary control (-0.028 avg. F1) confirms gains are lexical, not parametric. Code, vocabularies, and a 99.5%-recovery auto-extraction recipe are released.
With a Grain of SALT: Are LLMs Fair Across Social Dimensions?
Samee Arif | Zohaib Khan | Maaidah Kaleem Butt | Muhammad Suhaib Rashid | Agha Ali Raza | Awais Athar
In this paper we present a systematic study of social bias in small- to mid-scale Large Language Models (LLMs), focusing on gender, religion, and race. Using our SALT (Social Appropriateness in LLM Text) dataset, we explore two bias categories—Theoretical and Practical. Theoretical bias covers General Debate and Positioned Debate while practical bias includes Career Advice, Personal Advice, and Resume Generation. We quantify bias using win-rate gaps in general debate, and negative-role assignments in positioned debate. For Practical bias, we anonymize model outputs to remove explicit demographic cues and use DeepSeek-R1 as an automated evaluator, measuring outcome disparities across groups. We also examine systemic issues in LLM-based evaluation including evaluation bias, positional bias, and length bias and validate our findings through human annotation. Our results show consistent disadvantages for White, Christian, and male-associated outputs across multiple tasks. Larger models often amplify these disparities, highlighting that scale does not guarantee fairness.
GateKD: Confidence-Gated Closed-Loop Distillation for Robust Reasoning
Kasidit Sermsri | Teerapong Panboonyuen
Distilling multi-step reasoning abilities from large language models (LLMs) into compact student models remains challenging due to noisy rationales, hallucinated supervision, and static teacher–student interactions. Existing reasoning distillation methods, including mentor-based approaches, predominantly operate in an open-loop manner, implicitly assuming uniform teacher reliability and consequently propagating erroneous intermediate reasoning. We propose GateKD, a confidence-gated closed-loop distillation framework that enables robust reasoning transfer by treating the teacher as a dynamic gatekeeper rather than a static oracle. GateKD introduces three complementary mechanisms: (i) confidence-gated soft supervision that selectively distills reliable predictive signals, (ii) gated hidden-state evolution that aligns intermediate representations only when teacher confidence is high, and (iii) reliability-filtered attention distillation that preserves stable reasoning structures while suppressing noisy patterns. These components jointly form a closed feedback loop in which teacher confidence continuously modulates the distillation process, reducing hallucination transfer and stabilizing student reasoning. Extensive experiments across commonsense, logical, and symbolic reasoning benchmarks, using T5 and Flan-T5 backbones of varying sizes, demonstrate that GateKD consistently outperforms strong open-loop distillation baselines. Notably, GateKD yields substantial gains in logical and symbolic reasoning, remains robust under low-resource distillation settings, and shows clear performance degradation when any gating component is removed. Our results highlight that confidence-gated closed-loop supervision is critical for building reliable and scalable small reasoning models.
Modern Large Language Models (LLMs) rely on extensive safety alignment, yet the mechanistic basis of refusal remains opaque. In this work, we investigate whether safety compliance is a deep semantic decision or a manipulable linear feature. We introduce Contrastive Logit Steering (CLS), a zero-optimization framework that isolates the "refusal direction" by contrasting hidden states derived from safe and unrestricted system prompts. Unlike representation engineering methods that intervene on internal activations, CLS operates directly on the output distribution, serving as a diagnostic probe for alignment fragility. When coupled with prefix injection to bypass initial refusal reflexes, this method induces a phase transition where guardrails collapse. Our experiments on 7 model families reveal that safety implementation is architecturally deterministic. While models like Llama-3.1 exhibit a "Late Decision" topology that is easily bypassed by CLS (reaching 95% ASR in milliseconds), others like Qwen-2.5 demonstrate "Early Divergence" by integrating safety mid-computation. Direct comparison with established activation-level steering methods shows that CLS achieves substantially higher attack success rates on Llama 2 (73% vs. 22.6%) and Qwen 7B (91% vs. 79.2%), demonstrating that logit-level intervention exposes alignment vulnerabilities that hidden-state methods underestimate. Beyond attacks, we show that this linearity enables bidirectional control: inverting the steering vector "hardens" models against jailbreaks without retraining. Our findings suggest that current alignment techniques create a steerable "safety axis" that serves as both a critical vulnerability and a precise primitive for defense.
The Conservative AI: Diagnosing Hold Bias and Reliability Limits in Persona-Based Monetary Policy Simulation
Giyong Kim | Sojung Kim
We examine whether large language models (LLMs) can reliably simulate historical FOMC policy decisions and whether persona-based agentic deliberation improves performance. Using strictly time-consistent vintage economic information, we evaluate multiple state-of-the-art LLMs on a three-way Hike/Hold/Cut classification task in both single-agent and multi-agent settings. Single-LLM baselines achieve nontrivial accuracy and track broad policy regime shifts, establishing a simple but strong benchmark. However, we identify a systematic behavioral asymmetry that we term Hold bias: models disproportionately favor Hold decisions and remain reluctant to predict Cut outcomes even during easing cycles. This conservatism is especially costly around regime turning points, where reliable adaptation matters most. We further find that standard agentic workflows, including debate and consensus-style aggregation, do not mitigate this problem and often amplify caution rather than improve accuracy. Overall, our results show that plausible deliberation is not sufficient for trustworthy decision support. Progress will require agentic systems explicitly designed to diagnose and correct structural bias, rather than merely reproducing surface-level committee interaction.
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects
Yves Scherrer | Noëmi Aepli | Verena Blaschke | Tommi Jauhiainen | Nikola Ljubešić | Preslav Nakov | Jörg Tiedemann | Marcos Zampieri
AMIYA Shared Task: Arabic Modeling In Your Accent at VarDial 2026
Nathaniel R. Robinson | Shahd Abdelmoneim | Anjali Kantharuban | Otba Alsboul | Salima Lamsiyah | Kelly Marchisio | Kenton Murray
Arabic, often considered a single language, actually describes a wide variety of sometimes mutually unintelligible language varieties. While large language models (LLMs) have revolutionized natural language processing (NLP) with rapid advances, these models still best serve speakers of high-resource and standard language varieties. One particular deficiency of theirs is in dialectal Arabic. We present the first ever shared task for dialectal Arabic language modeling: Arabic Modeling In Your Accent, or AMIYA. The goal of the shared task was to develop LLMs that could (1) respond in the correct dialectal variety when explicitly or implicitly prompted to, (2) translate between dialectal Arabic and standard Arabic or English, (3) adhere to LLM instructions in dialectal Arabic, and (4) produce fluent Arabic outputs. We called for submissions in the dialectal varieties of five countries: Morocco, Egypt, Palestine, Syria, and Saudi Arabia. We received 45 submitted systems from six participating teams. We saw positive results from supervised fine-tuning on a translation objective, and reinforcement learning to improve dialectness. Manual evaluation also showed that some systems had learned to output dialectal words or phrases, but at the expense of actual fluency or coherence. Overall the most effective system involved continual pre-training and supervised fine-tuning of 12 candidate LLMs, followed by selection of the best performing models.
Far Out: Evaluating Language Models on Slang in Australian and Indian English
Deniz Kaya Dilsiz | Dipankar Srirag | Aditya Joshi
Language models exhibit systematic performance gaps when processing text in non-standard language varieties, yet their ability to comprehend variety-specific slang remains underexplored for several languages. We present a comprehensive evaluation of slang awareness in Indian English (en-IN) and Australian English (en-AU) across seven state-of-the-art language models. We construct two complementary datasets: WEB, containing 377 web-sourced usage examples from Urban Dictionary, and GEN, featuring 1,492 synthetically generated usages of these slang terms across diverse scenarios. We assess language models on three tasks: target word prediction (TWP), guided target word prediction (TWP*) and target word selection (TWS). Our results reveal four key findings: (1) higher average model performance on TWS than on TWP and TWP*, with average accuracy increasing from 0.03 to 0.49; (2) stronger average model performance on the WEB dataset than on GEN, with average similarity scores increasing by 0.03 and 0.05 on the TWP and TWP* tasks respectively; (3) en-IN tasks outperform en-AU when averaged across all models and datasets, with TWS showing the largest disparity, as average accuracy increases from 0.44 to 0.54. These findings underscore fundamental asymmetries between generative and discriminative competencies for variety-specific language, particularly for slang expressions, even in a technologically rich language such as English.
Effects of Speaker Bias in Dialect Identification and Automatic Transcription with Self-Supervised Speech Models
Olli Kuparinen
A major issue in audio modeling is speaker bias, in which the models learn language-external traits, such as a speaker’s timbre or pitch, and use this information as a shortcut to a language task. This is especially problematic for dialectology, as it is typical in dialect corpora that only a few speakers represent a complete dialect area. In this paper, we explore the effects of speaker bias in two dialectal tasks: dialect identification and automatic dialectal transcription. We build two different data partitions of dialect interviews in Finnish and Norwegian: 1) a speaker-dependent partition in which all of the speakers appear in training, development, and test sets, and 2) a speaker-independent partition where each speaker appears in exactly one set. We further experiment with modifications of the training data by augmenting the original audio with pitch shifts and noise, as well as changing the original speakers’ voices with voice conversion models. We show that the dialect identification models are highly affected by speaker bias, whereas automatic dialectal transcription models are not. The audio modifications do not offer major performance gains for either of the languages or tasks.
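The speaker-independent partition described in this abstract can be sketched as a split over speakers rather than over utterances. The function name, the 80/10/10 ratios, the seed, and the `(speaker_id, utterance_id)` input format are all assumptions made for illustration, not details taken from the paper.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Utterance = Tuple[str, str]  # (speaker_id, utterance_id), hypothetical format


def speaker_independent_split(utterances: Sequence[Utterance],
                              ratios: Tuple[float, float] = (0.8, 0.1),
                              seed: int = 0) -> Dict[str, List[Utterance]]:
    """Assign each speaker to exactly one of train/dev/test.

    `ratios` gives the train and dev shares of speakers; the remaining
    speakers go to the test set.
    """
    by_speaker: Dict[str, List[str]] = defaultdict(list)
    for speaker, utt in utterances:
        by_speaker[speaker].append(utt)

    speakers = sorted(by_speaker)          # sort first for reproducibility
    random.Random(seed).shuffle(speakers)  # then shuffle with a fixed seed

    n_train = int(len(speakers) * ratios[0])
    n_dev = int(len(speakers) * ratios[1])
    groups = {
        "train": speakers[:n_train],
        "dev": speakers[n_train:n_train + n_dev],
        "test": speakers[n_train + n_dev:],
    }
    return {
        name: [(s, u) for s in members for u in by_speaker[s]]
        for name, members in groups.items()
    }


# Toy data: 10 hypothetical speakers with 2 utterances each.
data = [(f"spk{i}", f"utt{i}_{j}") for i in range(10) for j in range(2)]
split = speaker_independent_split(data)
```

A speaker-dependent partition would instead shuffle and split the utterance list directly, so the same voice can appear in all three sets.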
OcWikiDialects: A Wikipedia Dataset With Rich Metadata for Occitan Dialect Identification
Oriane Nédey | Rachel Bawden | Thibault Clérice | Benoît Sagot
Occitan is a Romance language spoken mostly in the South of France and characterised by rich dialectal variation, which can pose problems for certain NLP tools. This shortfall is largely attributable to the scarcity of dialect-annotated corpora, in a context where linguistic classification within the Occitan dialect continuum is still debated and major nomenclatures, such as ISO 639, fail to provide granular codes for varieties below the generic "Occitan" label. In this paper, we introduce OcWikiDialects, a new dataset comprising articles from the Occitan Wikipedia. The corpus features rich metadata, including dialect labels, and is segmented at both paragraph and sentence levels. Combined with previously released datasets, we explore approaches for Occitan dialect identification by training three types of model on up to 8 labels: linear SVM classifiers based on word and character n-grams, FastText classifiers based on pretrained vectors, and BERT-based neural classifiers adapted through fine-tuning. Evaluations across in- and out-of-domain test sets demonstrate the substantial impact of our new dataset for the task. However, a peak macro-averaged F1 score of 58.15 underscores persistent challenges for underrepresented Occitan varieties, supported by our per-dialect analysis. Code, dataset and models are available: https://github.com/DEFI-COLaF/OcWikiDialects.
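For readers unfamiliar with the macro-averaged F1 score reported in this abstract, the following is a minimal sketch of the standard definition: per-label F1 averaged with equal weight per label, so rare varieties count as much as frequent ones. The label names and toy predictions are illustrative only and are not taken from the dataset; the result is on a 0 to 1 scale, whereas the abstract's figure appears to be on a 0 to 100 scale.

```python
from typing import Sequence


def macro_f1(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with two illustrative dialect labels.
gold = ["gascon", "gascon", "lengadocian", "lengadocian"]
pred = ["gascon", "lengadocian", "lengadocian", "lengadocian"]
score = macro_f1(gold, pred)
```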
Language Mixture to Develop Accurate Galician Dependency Parsers: An Exploration of Its Effects
Xabier Irastortza-Urbieta | José M. García-Miguel | Marcos Garcia
The development of accurate syntactic parsers remains a challenge for low-resource languages. To overcome it, the literature has proposed leveraging syntactic annotations from typologically related languages. This work investigates the viability and adequacy of this approach for Galician, evaluating the use of annotations from major Romance languages as source data. Our methodology extends beyond standard automatic evaluation to incorporate a detailed error analysis, which precisely quantifies the effects of multilingual training and assesses the practical scalability of the method. The results establish the necessity of embedding models for effective cross-lingual transfer and demonstrate that even languages not particularly close can yield adequate parsers. This work confirms the benefits of cross-lingual data augmentation while delineating its scalability limits. Furthermore, the error analysis identifies specific, typologically conditioned grammatical dependencies that remain persistent challenges for accurate dependency parsing.
Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography
Gianluca Vico | Jindřich Libovický
We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian–Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.
German-English Code-Switching in Large Language Models
Firat Cem Aksüt | Stefan Hillmann | Pia Knoeferle | Sebastian Möller
Code-Switching (CS) is common in multilingual communication, yet it is unclear how well current Large Language Models (LLMs) reproduce naturally occurring switching patterns. This paper studies German–English CS ("Denglisch") generated by GPT-4o and LLaMA-3.3, using Reddit data from the Denglisch Corpus as a reference. Model outputs are compared to authentic posts using established CS metrics (M-Index, I-Index, CESAR), an analysis of Shared Lexical Items (SLIs) as switch triggers, and a human evaluation of perceived naturalness and fluency. Both models approximate global CS characteristics but differ from real data in diversity and complexity. LLaMA-3.3 more closely matches corpus-level metrics, whereas GPT-4o produces more conservative switching that is rated as significantly more natural and fluent. In addition, GPT-4o reproduces SLI-triggered switching patterns similar to those found in authentic data, while this effect is weaker for LLaMA-3.3.
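Two of the metrics named in this abstract have compact definitions that can be sketched over a sequence of per-token language tags. The M-Index is commonly given as (1 − Σ p_j²) / ((k − 1) · Σ p_j²), with p_j the share of tokens in language j and k the number of languages, and the I-Index as the fraction of adjacent token pairs where the language changes. The sketch assumes k is the number of languages observed in the input; the tag names are illustrative, and CESAR is not covered here.

```python
from collections import Counter
from typing import Sequence


def m_index(tags: Sequence[str]) -> float:
    """Multilingual index: 0 for monolingual text, 1 for a perfectly even mix."""
    counts = Counter(tags)
    k = len(counts)  # assumption: k = languages observed in this sequence
    if k < 2:
        return 0.0
    total = len(tags)
    sum_sq = sum((c / total) ** 2 for c in counts.values())
    return (1 - sum_sq) / ((k - 1) * sum_sq)


def i_index(tags: Sequence[str]) -> float:
    """Integration index: share of adjacent token pairs that switch language."""
    if len(tags) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(tags, tags[1:]))
    return switches / (len(tags) - 1)


# Two toy sequences with the same language balance but different switching.
blocky = ["de", "de", "en", "en"]
alternating = ["de", "en", "de", "en"]
```

Both toy sequences have the same M-Index because the language shares are identical, while the I-Index separates them, which is why the two metrics are usually reported together.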
Perplexity as a Metric for Dialectal Distance: A Computational Study of Greek Varieties
Stergios Chatzikyriakidis | Erofili Psaltaki | Dimitrios Papadakis | Erik Henriksson | Veronika Laippala
In this paper, we use LLM perplexity as a measure to assess Greek dialectal distance. We test seven models on Standard Modern Greek (SMG) and eight dialects, namely Heptanesian, Cypriot, Maniot, Pontic, Northern, Cretan, Tsakonian, and Griko. Using samples of 5k, 15k, and 25k tokens from the GRDD+ corpus for each variety, we find a consistent dialect ranking across models, with Heptanesian closest to SMG, and Griko most distant (perplexity ratio 3.6–14.5× depending on model). These results are largely in agreement with theoretical dialectological knowledge. For example, Tsakonian consistently appears distant in all measures, reflecting its status as the sole Doric descendant, while Heptanesian appears closer by all metrics, pointing to its status as one of the dialects used to shape the official variety. Perplexity correlates strongly with Bits Per Character (mean r = 0.94) and Normalized Compression Distance (mean r = 0.87, range 0.76–0.93), providing support for its use as a dialectometric tool. However, a number of important confounds are also found. First, tokenization effects compress Llama 2’s perplexity range. Second, genre artifacts seem to inflate the results for Cretan. Third, potential training data contamination likely reduces perplexity for Cypriot and Pontic. Lastly, we find that Greek-specific models like Meltemi and Krikri do not consistently outperform general models.
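Two of the quantities in this abstract can be sketched without a language model at hand. Perplexity is the exponential of the negative mean token log-probability, and Normalized Compression Distance is commonly defined as (C(xy) − min(C(x), C(y))) / max(C(x), C(y)) for a compressor C. The choice of `zlib` as compressor, the function names, and the toy strings are assumptions for illustration; the abstract does not say which compressor was used.

```python
import math
import zlib
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def compressed_size(data: bytes) -> int:
    """Length in bytes after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two texts."""
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy texts: a text compared with itself should score lower than
# a comparison with an unrelated text.
text_a = "the quick brown fox jumps over the lazy dog " * 20
text_b = "lorem ipsum dolor sit amet consectetur adipiscing elit " * 20
same = ncd(text_a, text_a)
different = ncd(text_a, text_b)
```

A perplexity ratio like the one quoted in the abstract would then be `perplexity(dialect_logprobs) / perplexity(smg_logprobs)`, with both sets of log-probabilities obtained from the same model (the variable names here are hypothetical).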
A Subword Embedding Approach for Variation Detection in Luxembourgish User Comments
Anne-Marie Lutgen | Alistair Plum | Christoph Purschke
This paper presents an embedding-based approach to detecting variation without relying on prior normalisation or predefined variant lists. The method trains subword embeddings on raw text and groups related forms through combined cosine and n-gram similarity. This allows spelling and morphological diversity to be examined and analysed as linguistic structure rather than treated as noise. Using a large corpus of Luxembourgish user comments, the approach uncovers extensive lexical and orthographic variation that aligns with patterns described in dialectal and sociolinguistic research. The induced families capture systematic correspondences and highlight areas of regional and stylistic differentiation. The procedure does not strictly require manual annotation, but does produce transparent clusters that support both quantitative and qualitative analysis. The results demonstrate that distributional modelling can reveal meaningful patterns of variation even in “noisy” or low-resource settings, offering a reproducible methodological framework for studying language variety in multilingual and small-language contexts.
Onomasiological Sense Alignment Across Dialect Dictionaries. A Taxonomy-Constrained LLM Classification
Nathalie Mederake | Nico Urbach | Hanna Fischer | Alfred Lameli
We propose a taxonomy-guided approach to semantic alignment that assigns lexicographic senses to an onomasiological taxonomy derived from the Hallig–Wartburg/Post system. Using an LLM under strict taxonomic constraints, short and heterogeneous meaning descriptions are assigned to a common conceptual space. Evaluation against expert annotation shows that run-to-run model agreement (kappa = 0.73) closely matches human agreement (kappa = 0.74), with robustness at coarse taxonomic levels and predictable degradation at finer granularity. A qualitative network analysis demonstrates the resulting potential for cross-dictionary exploration of dialectal variation in semantics.
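The agreement figures in this abstract are kappa values. The abstract does not say which kappa variant was used, so the sketch below assumes Cohen's kappa for two annotation runs over the same items: κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance from each run's label distribution. The category labels are placeholders, not taxonomy entries from the paper.

```python
from collections import Counter
from typing import Sequence


def cohen_kappa(run_a: Sequence[str], run_b: Sequence[str]) -> float:
    """Cohen's kappa between two label sequences over the same items."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("both runs must label the same non-empty item list")
    n = len(run_a)
    p_o = sum(a == b for a, b in zip(run_a, run_b)) / n
    counts_a, counts_b = Counter(run_a), Counter(run_b)
    labels = counts_a.keys() | counts_b.keys()
    p_e = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # both runs used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy example with placeholder category labels.
run_1 = ["x", "x", "y", "y"]
run_2 = ["x", "x", "y", "x"]
kappa = cohen_kappa(run_1, run_2)
```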
On the Intelligibility of Romance Language Varieties: Spanish and Portuguese in Europe and America
Liviu P. Dinu | Ana Sabina Uban | Teodor-George Marchitan | Ioan-Bogdan Iordache | Simona Georgescu
Mutual intelligibility within language families presents a significant challenge for multilingual NLP, particularly due to the prevalence of dialectal variation and asymmetric comprehension. In this paper, we present a corpus-based computational analysis to quantify linguistic proximity across Romance language variants, with a focus on major Spanish (Argentine, Chilean and European) and Portuguese (Brazilian and European) varieties and the other main Romance languages (Italian, French, Romanian). We apply a computational metric of lexical intelligibility based on surface and semantic similarity of related words to measure mutual intelligibility for the five main Romance languages in relation to the Spanish and Portuguese varieties studied.
Dialect Matters: Cross-Lingual ASR Transfer for Low-Resource Indic Language Varieties
Akriti Dhasmana | Aarohi Srivastava | David Chiang
We conduct an empirical study of cross-lingual transfer using spontaneous, noisy, and code-mixed speech across a wide range of Indic dialects and language varieties. Our results indicate that although ASR performance is generally associated with phylogenetic distance across languages, this factor alone does not fully explain performance in dialectal settings. Often, fine-tuning on smaller amounts of dialectal data yields performance comparable to fine-tuning on larger amounts of phylogenetically-related, high-resource standardized languages. We also present a case study on Garhwali, a low-resource Pahari language variety, and evaluate multiple contemporary ASR models. Finally, we analyze transcription errors to examine bias toward pre-training languages, providing additional insight into challenges faced by ASR systems on dialectal and non-standardized speech.
Ara-HOPE: Human-Centric Post-Editing Evaluation for Dialectal Arabic to Modern Standard Arabic Translation
Abdullah Alabdullah | Lifeng Han | Chenghua Lin
Dialectal Arabic to Modern Standard Arabic (DA-MSA) translation is a challenging task in Machine Translation (MT) due to significant lexical, syntactic, and semantic divergences between Arabic dialects and MSA. Existing automatic evaluation metrics and general-purpose human evaluation frameworks struggle to capture dialect-specific MT errors, hindering progress in translation assessment. This paper introduces Ara-HOPE, a human-centric post-editing evaluation framework designed to systematically address these challenges. The framework includes a five-category error taxonomy and a decision-tree annotation protocol. Through comparative evaluation of three MT systems (Arabic-centric Jais, general-purpose GPT-3.5, and baseline NLLB-200), Ara-HOPE effectively highlights systematic performance differences between these systems. Our results show that dialect-specific terminology and semantic preservation remain the most persistent challenges in DA-MSA translation. Ara-HOPE establishes a new framework for evaluating Dialectal Arabic MT quality and provides actionable guidance for improving dialect-aware MT systems. For reproducibility, we make the annotation files and related materials publicly available at https://github.com/abdullahalabdullah/Ara-HOPE.
Indic-TunedLens: Interpreting Multilingual Models in Indian Languages
Mihir Panchal | Deeksha Varshney | Mamta . | Asif Ekbal
Multilingual large language models (LLMs) are increasingly deployed in linguistically diverse regions like India, yet most interpretability tools remain tailored to English. Prior work reveals that LLMs often operate in English-centric representation spaces, making cross-lingual interpretability a pressing concern. We introduce Indic-TunedLens, a novel interpretability framework specifically for Indian languages that learns shared affine transformations. Unlike the standard Logit Lens, which directly decodes intermediate activations, Indic-TunedLens adjusts hidden states for each target language, aligning them with the target output distributions to enable more faithful decoding of model representations. We evaluate our framework on 10 Indian languages using the MMLU benchmark and find that it significantly improves over SOTA interpretability methods, especially for morphologically rich, low-resource languages. Our results provide crucial insights into the layer-wise semantic encoding of multilingual transformers. Our model is available at https://huggingface.co/spaces/MihirRajeshPanchal/IndicTunedLens. Our code is available at https://github.com/MihirRajeshPanchal/IndicTunedLens.
Building ASR Resources for the Hutsul Dialect of Ukrainian
Roman Kyslyi | Artem Orlovskyi | Pavlo Khomenko | Bohdan Onyshchenko | Zakhar Guzii
Dialectal speech remains largely underexplored in Automatic Speech Recognition (ASR) research, particularly for Slavic languages. While Ukrainian ASR systems have rapidly improved in recent years with the adoption of Whisper, XLS-R, and Wav2Vec-based models, performance on dialectal variants remains largely unknown and is often significantly degraded. In this work, we present the first dedicated effort to build ASR resources for the Hutsul dialect of Ukrainian. We develop a data preparation and segmentation pipeline, evaluate multiple forced alignment strategies, and benchmark state-of-the-art ASR models under zero-shot and fine-tuned conditions. We evaluate results using WER and CER, demonstrating that large multilingual ASR models struggle with dialectal speech, while lightweight fine-tuning produces substantial improvements. All scripts, alignment tools, and training recipes are made publicly available to support future research on Ukrainian dialect speech.
From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models
Abdulmuizz Khalak | Abderrahmane Issam | Gerasimos Spanakis
Arabic Language Models (LMs) are pretrained predominantly on Modern Standard Arabic (MSA) and are expected to transfer to its dialects. While MSA, as the standard written variety, is commonly used in formal settings, people speak and write online in various dialects that are spread across the Arab region. This poses limitations for Arabic LMs, since the dialects vary in their similarity to MSA. In this work, we study cross-lingual transfer of Arabic models using probing on three Natural Language Processing (NLP) tasks and representational similarity. Our results indicate that transfer is possible but disproportionate across dialects, which we find to be partially explained by their geographic proximity. Furthermore, we find evidence for negative interference in models trained to support all Arabic dialects. This calls their degree of similarity into question and raises concerns for cross-lingual transfer in Arabic models.
Extending ASR Evaluation Resources for Modern Greek Dialects
Chara Tsoukala | Stavros Bompolas | Antigoni Margariti | Konstantina Panagiotou | Maria Elisavet Plaiti | Nefeli Tzanakaki | Petros Karatsareas | Angela Ralli | Antonios Anastasopoulos | Stella Markantonatou
Recent progress in Automatic Speech Recognition (ASR) has primarily benefited high-resource standard languages, while dialectal speech remains challenging and underexplored. We present an expanded benchmark for low-resource Modern Greek dialects, covering Aperathiot, Cretan, Lesbian, and Cappadocian, spanning southern, northern, and contact-influenced varieties with varying degrees of divergence from Standard Modern Greek. The benchmark provides dialectal transcriptions in the Greek alphabet, following SMG-based orthographic conventions, while preserving dialectal lexical and morphophonological forms. Using this benchmark, we evaluate state-of-the-art multilingual ASR models in a zero-shot setting and by further fine-tuning per dialect. Zero-shot results reveal a clear performance gradient with dialectal distance from Standard Modern Greek, with best WERs ranging from about 60-70% for southern dialects to over 80% for Lesbian and nearly 97% for Cappadocian. Fine-tuning substantially reduces error rates (up to 47% relative WER improvement), with Cappadocian remaining the most challenging variety (best WER 68.17%). Overall, our results highlight persistent limitations of current pretrained ASR models under dialectal variation and the need for dedicated benchmarks and adaptation strategies.
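The error rates quoted in the abstract above are word error rates (WER), and the fine-tuning gain is stated as a relative WER improvement. As a minimal sketch of how such figures are conventionally computed (the function names are illustrative and the numbers used with them are toy values, not taken from the paper or its evaluation code):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: the same computation over characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def relative_improvement(baseline, adapted):
    """Relative error reduction, e.g. from a zero-shot to a fine-tuned error rate."""
    return (baseline - adapted) / baseline
```

Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference, which is worth keeping in mind when reading very high zero-shot error rates.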
How Should We Model the Probability of a Language?
Rasul Dent | Pedro Ortiz Suarez | Thibault Clérice | Benoît Sagot
Of the over 7,000 languages spoken in the world, commercial language identification (LID) systems only reliably identify a few hundred in written form. Research-grade systems extend this coverage under certain circumstances, but for most languages coverage remains patchy or nonexistent. This position paper argues that this situation is largely self-imposed. In particular, it arises from a persistent framing of LID as decontextualized text classification, which obscures the central role of prior probability estimation and is reinforced by institutional incentives that favor global, fixed-prior models. We argue that improving coverage for tail languages requires rethinking LID as a routing problem and developing principled ways to incorporate environmental cues that make languages locally plausible.
Bridging Dialectal Variation: A Phonetic Transcription Tool for Tamil
Ahrane Mahaganapathy | Sumirtha Karunakaran | Kavitha Navakulan | Kengatharaiyer Sarveswaran
Phonetic transcription is vital for speech processing and linguistic documentation, particularly in languages like Tamil with complex phonology and dialectal variation. Challenges such as consonant gemination, retroflexion, vowel length, and one-to-many grapheme-phoneme mappings are compounded by limited data on Sri Lankan Tamil dialects. We present a dialect-aware, rule-based transcription tool for Tamil that supports Indian and Jaffna Tamil, with extensions underway for other dialects. Using a two-stage pipeline (Tamil script to Latin, then Latin to IPA with context-sensitive rules), the tool handles dialect shifts. A real-time interface enables dialect selection. Evaluated on a 7,830-word corpus, it achieves 94.54% accuracy for Jaffna Tamil, higher than other tools such as eSpeak NG, advancing linguistic preservation and accessible speech technology for Tamil communities.
Regional Variation in the Performance of ASR Models on Croatian and Serbian
Tanja Samardžić | Peter Rupnik | Nikola Ljubešić
Regional variation was a limiting factor for automatic speech recognition (ASR) before large language models. With the new technology, speech processing becomes more general, which opens the question of how to use data in similar languages such as Croatian and Serbian. In this paper, we analyse model performance in the currently available train-test scenarios with the goal of better understanding the mutual interference of these two languages. Our findings suggest that better performing models are not very sensitive to the regional variation. Training from scratch in one of the languages can give good results on both of them, while fine-tuning large pre-trained multilingual models on smaller data sets does not give the expected results.
Syllable Structures Across Arabic Varieties
Abdelrahim Qaddoumi | Jordan Kodner | Salam Khalifa | Ellen Broselow | Owen Rambow
This study compares the syllable structures of nine Arabic varieties from Wiktionary, using a computational syllabifier. It further investigates methods for learning syllable boundaries in unsyllabified words transcribed in the International Phonetic Alphabet (IPA). The syllabification algorithm is evaluated under three conditions: (i) Default, employing fixed rules; (ii) Joint, learning onsets and codas across all varieties collectively; and (iii) Per-variety, learning onsets and codas specific to each variety. Results indicate that the default configuration yields the highest accuracy, ranging from 97.05% to 100%. The per-variety approach achieves 90.64% to 100% accuracy, while the joint approach ranges from 84.63% to 94.74%. A cross-variety analysis using Jensen-Shannon divergence reveals three principal groupings: Egyptian, Hejazi, and Modern Standard Arabic are closely related; Levantine and Gulf varieties constitute a second cluster; and Juba Arabic, Maltese, and Moroccan emerge as outliers. A cleaned dataset encompassing all nine varieties is also provided.
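The cross-variety analysis above relies on Jensen-Shannon divergence. The following is a minimal sketch of that measure, assuming it is applied to relative-frequency distributions over syllable shapes; the abstract does not say which distributions were compared, so the `CV`/`CVC` shapes and the helper names are hypothetical.

```python
import math
from collections import Counter


def distribution(items):
    """Turn a list of observed categories into a relative-frequency distribution."""
    counts = Counter(items)
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def kl_divergence(p, m):
    """KL(p || m) in bits; m must be non-zero wherever p is non-zero."""
    return sum(prob * math.log2(prob / m[key]) for key, prob in p.items() if prob > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded in [0, 1]."""
    keys = set(p) | set(q)
    # The mixture m is non-zero wherever p or q is, so both KL terms are defined.
    m = {key: 0.5 * (p.get(key, 0.0) + q.get(key, 0.0)) for key in keys}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical syllable-shape samples for two varieties (toy data only).
variety_a = distribution(["CV", "CV", "CVC", "CVC"])
variety_b = distribution(["CV", "CVC", "CVCC", "CVCC"])
divergence = js_divergence(variety_a, variety_b)
```

Unlike plain KL divergence, this quantity stays finite when one variety has a syllable shape the other lacks, which makes it convenient for pairwise comparison and clustering.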
Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models
Ali Mekky | Mohamed El Zeftawy | Lara Hassan | Amr Keleg | Preslav Nakov
Although Arabic Dialect Identification (ADI) has long been modeled as a single-label classification task, recent work has argued that it should be framed as a multi-label classification task. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address these issues, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). Afterward, we train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LahjatBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system.
OpenLID-v3: Improving the Precision of Closely Related Language Identification – An Experience Report
Mariia Fedorova | Nikolay Arefyev | Maja Buljan | Jindřich Helcl | Stephan Oepen | Egil Rønningstad | Yves Scherrer
Language identification (LID) is an essential step in building high-quality multilingual datasets from web data. Existing LID tools (such as OpenLID or GlotLID) often struggle to identify closely related languages and to distinguish valid natural language from noise, which contaminates language-specific subsets, especially for low-resource languages. In this work we extend the OpenLID classifier by adding more training data, merging problematic language variant clusters, and introducing a special label for marking noise. We call this extended system OpenLID-v3 and evaluate it against GlotLID on multiple benchmarks. During the development we focus on three groups of closely related languages (Bosnian, Croatian, and Serbian; Romance varieties of Northern Italy and Southern France; and Scandinavian languages) and contribute new evaluation datasets where existing ones are inadequate. We find that ensemble approaches improve precision but also substantially reduce coverage for low-resource languages.
Improving Dialect Robustness in Large Language Models via LoRA and Mixture-of-Experts
Sanjh Maheshwari | Aniket Singh Rajpoot | Oana Cocarascu | Mamta .
Despite the success of large language models (LLMs) in a wide range of applications, it has been shown that their performance varies across English dialects. Differences among English dialects are reflected in vocabulary, syntax, and writing style, and can adversely affect model performance. Several studies evaluate the dialect robustness of LLMs, yet research on enhancing their robustness to dialectal variation remains limited. In this paper, we propose two parameter-efficient frameworks for improving dialectal robustness in LLMs: DialectFusion, where we train separate LoRA layers for each dialect and apply different LoRA merging methods, and DialectMoE, which is built on top of Mixture of Experts LoRA and introduces multiple LoRA-based experts to the feed-forward layer to internally model the dialectal dependencies. Our comprehensive analysis on five open-source LLMs for sentiment and sarcasm tasks in zero- and few-shot settings shows that our proposed approaches enhance the dialect robustness of LLMs and outperform instruct and LoRA fine-tuning based approaches.
Evaluation Framework for Transfer Learning between Closely Related Lects: A Case Study of Lemko
Ilia Afanasev
The creation of a robust evaluation methodology is one of the pivotal issues for transfer learning between closely related lects. The current study proposes to resolve this issue by concisely implementing a group of evaluation methods that enable a more systematic qualitative analysis of errata (for instance, string similarity measures to assess lemmatisation more effectively). The paper introduces a robustness score, a metric that aims to assess the stability of model performance across different datasets. The case study is a morphosyntactic tagging of a small historical (beginning of the twentieth century) corpus of Lemko (Slavic clade, Transcarpathian area). It presents a diversity of cross-dependent tasks, made rather complex by the rich Lemko morphology, highly influenced by areal convergence processes. The tagger is a pre-trained Stanza. The study uses modern standard Ukrainian as the source language, as it is the closest to the Lemko high-resource lect. The analysis reveals that linguistically-aware metrics improve the speed and accuracy of analysis of the errata, especially those caused by the differences between source and target lects. The key data contribution is the open-source dataset of Lemko, obtained during the tagging tasks. Future research directions include a larger-scale test that applies more models to a more extensive material.
Do Large Language Models Adapt to Language Variation across Socioeconomic Status?
Elisa Bassignana | Mike Zhang | Dirk Hovy | Amanda Cercas Curry
Humans adjust their linguistic style to the audience they are addressing. However, the extent to which LLMs adapt to different social contexts is largely unknown. As these models increasingly mediate human-to-human communication, their failure to adapt to diverse styles can perpetuate stereotypes and marginalize communities whose linguistic norms are less closely mirrored by the models, thereby reinforcing social stratification. We study the extent to which LLMs integrate into social media communication across different socioeconomic status (SES) communities. We collect a novel dataset from Reddit and YouTube, stratified by SES. We prompt four LLMs with incomplete text from that corpus and compare the LLM-generated completions to the originals along 94 sociolinguistic metrics, including syntactic, rhetorical, and lexical features. LLMs modulate their style with respect to SES to only a minor extent, often resulting in approximation or caricature, and tend to emulate the style of upper SES more effectively. Our findings (1) show how LLMs risk amplifying linguistic hierarchies and (2) call into question their validity for agent-based social simulation, survey experiments, and any research relying on language style as a social signal.
Aladdin-FTI @ AMIYA Three Wishes for Arabic NLP: Fidelity, Diglossia, and Multidialectal Generation
Jonathan Mutal | Perla Al Almaoui | Simon Hengchen | Pierrette Bouillon
Arabic dialects have long been under-represented in Natural Language Processing (NLP) research due to their non-standardization and high variability, which pose challenges for computational modeling. Recent advances in the field, such as Large Language Models (LLMs), offer promising avenues to address this gap by enabling Arabic to be modeled as a pluricentric language rather than a monolithic system. This paper presents Aladdin-FTI, our submission to the AMIYA shared task. The proposed system is designed to both generate and translate dialectal Arabic (DA). Specifically, the model supports text generation in Moroccan, Egyptian, Palestinian, Syrian, and Saudi dialects, as well as bidirectional translation between these dialects, Modern Standard Arabic (MSA), and English. The code and trained model will be released upon paper acceptance.
Maastricht University at AMIYA: Adapting LLMs for Dialectal Arabic using Fine-tuning and MBR Decoding
Abdulhai Alali | Abderrahmane Issam
Large Language Models (LLMs) are becoming increasingly multilingual, supporting hundreds of languages, especially high-resource ones. Unfortunately, dialect variations are still underrepresented due to limited data and linguistic variation. In this work, we adapt a pre-trained LLM to improve dialectal performance. Specifically, we use Low Rank Adaptation (LoRA) fine-tuning on monolingual and English–Dialect parallel data, adapter merging, and dialect-aware MBR decoding to improve dialectal fidelity in generation and translation. Experiments on Syrian, Moroccan, and Saudi Arabic show that merging and MBR improve dialectal fidelity while preserving semantic accuracy. This combination provides a compact and effective framework for robust dialectal Arabic generation.
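Minimum Bayes Risk (MBR) decoding, mentioned in the abstract above, picks the candidate that agrees most with the other sampled candidates instead of the single most probable one. The sketch below illustrates generic MBR selection only. The abstract does not specify the dialect-aware utility used in the paper, so a character-bigram F1 overlap stands in for it here as a labeled assumption, and all function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Multiset of character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def overlap_f1(hypothesis, reference, n=2):
    """Stand-in utility (an assumption): F1 over shared character n-grams."""
    hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
    matched = sum((hyp & ref).values())  # multiset intersection
    if matched == 0:
        return 0.0
    precision = matched / sum(hyp.values())
    recall = matched / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=overlap_f1):
    """Return the candidate with the highest average utility against the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    best, best_score = None, -1.0
    for i, hypothesis in enumerate(candidates):
        # Every other candidate acts as a pseudo-reference.
        others = [c for j, c in enumerate(candidates) if j != i]
        score = (sum(utility(hypothesis, ref) for ref in others) / len(others)
                 if others else 0.0)
        if score > best_score:
            best, best_score = hypothesis, score
    return best
```

A dialect-aware variant could, for example, combine such a similarity score with a dialect classifier's confidence inside `utility`, but that design is not described in the abstract and is only a possibility.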
Dialectal Arabic continues to represent a persistent challenge for contemporary large language models, which are predominantly trained and optimized for Modern Standard Arabic (MSA) and therefore exhibit limited capability when processing colloquial varieties. In this study, a dedicated system developed for participation in the AMIYA shared task focusing on Syrian Arabic is presented. The proposed solution is based on the integration of parameter-efficient fine-tuning through Low-Rank Adaptation (LoRA) with prompt-guided inference, aiming to enhance dialectal adequacy and linguistic naturalness. Rather than emphasizing strict factual precision, the system is deliberately designed to prioritize fluent and authentic Syrian Arabic generation, in accordance with the evaluation principles adopted by the AL-QASIDA benchmark. This design choice reflects a focus on human-perceived language quality and dialectal fidelity, which are central to effective dialect-aware language modeling.
NUS-IDS at AMIYA/VarDial 2026: Improving Arabic Dialectness in LLMs with Reinforcement Learning
Sujatha Das Gollapalli | Mouad Hakam | Mingzhe Du | See-Kiong Ng
In this paper, we describe models developed by our team, NUS-IDS, for the Closed data track at the Arabic Modeling In Your Accent (AMIYA) shared task at VarDial 2026. The core idea behind our solution involves data augmentation enabled by a dialect classifier trained on AMIYA data. We effectively combine various translation, summarization, and question answering prompts with AMIYA training data to form dialectal prompts for use with state-of-the-art LLMs. Next, dialect predictions from our classifier on outputs from these LLMs are used to compile preference data for Reinforcement Learning (RL). We report model performance on dialectal Arabic from Egypt, Morocco, Palestine, Saudi Arabia and Syria using FLORES+, a multilingual machine translation dataset. Our experiments illustrate that though our RL models show significant performance gains on dialectness scores, they underperform on translation metrics such as chrF++ compared to base LLMs.
MBZUAI at AMIYA Shared Task 2026: Adapting Open-Source LLMs for Dialectal Arabic
Rana Gaber | Yara Allam | Serag Amin | Ranwa Aly | Bashar Alhafni
This paper presents our contribution to the closed data track of the AMIYA Shared Task on Dialectal Arabic text generation. In this track, we train fully open-source Large Language Models (LLMs) on five Arabic dialects: Egyptian, Moroccan, Palestinian, Saudi, and Syrian, using the provided training datasets. We experiment with different base and instruct models using several pretraining and instruction tuning approaches. In total, five models were submitted, with three variants per dialect. Our best-performing models for the five dialects are ALLaM for Egyptian, LLaMa for Moroccan and Palestinian, and Aya for Saudi and Syrian.
A Closed-Track System for Palestinian Arabic in the AMIYA Shared Task
Khaleel Hamad | Ahmad Al-Najjar
We describe a closed-track system for modeling Palestinian Arabic, developed for the AMIYA shared task using a parameter-efficient fine-tuning strategy. A 1.5B instruction-tuned language model was adapted with LoRA (Hu et al., 2021), updating only 0.28% of the model parameters, and trained on an aggregated set of conversations between Palestinians and resources covering both translation and generation. Model selection was guided by a comparative benchmark that prioritized performance efficiency and its tradeoffs. At the same time, the paper focuses on targeted error analysis as well as structured instruction following. These findings illustrate the viability of efficient adaptation methods for low-resource Arabic dialects and shed light on their current limitations.
The Proceedings for the 15th Workshop on Computational Approaches to Subjectivity, Sentiment & Social Media Analysis (WASSA 2026)
Jeremy Barnes | Valentin Barriere | Orphée De Clercq | Roman Klinger | Célia Nouri | Debora Nozza | Pranaydeep Singh
Council of LLMs: Evaluating Capability of Large Language Models to Annotate Propaganda
Vivek Sharma | Shweta Jain | Mohammad Shokri | Sarah Ita Levitan | Elena Filatova
Data annotation is essential for supervised natural language processing tasks but remains labor-intensive and expensive. Large language models (LLMs) have emerged as promising alternatives, capable of generating high-quality annotations either autonomously or in collaboration with human annotators. However, their use in autonomous annotation is often questioned because of their ethical take on subjective matters. This study investigates the effectiveness of LLMs in autonomous and hybrid annotation setups for propaganda detection. We evaluate GPT and open-source models on two datasets from different domains, namely the Propaganda Techniques Corpus (PTC) for news articles and the Journalist Media Bias on X (JMBX) for social media. Our results show that LLMs, in general, exhibit high recall but lower precision in detecting propaganda, often over-predicting persuasive content. Multi-annotator setups did not outperform the best models in the single-annotator setting, although they helped reasoning models boost their performance. Hybrid annotation, combining LLMs and human input, achieved higher overall accuracy than LLM-only settings. We further analyze misclassifications and find that LLMs have higher sensitivity towards certain propaganda techniques such as loaded language, name calling, and doubt. Finally, using error typology analysis, we explore the reasoning provided by the LLMs on misclassifications. Our results show that although some studies report LLMs outperforming manual annotation, and although LLMs could prove useful in hybrid annotation, their incorporation in the human annotation pipeline must be implemented with caution.
Emoji Reactions on Telegram: Unreliable Indicators of Emotional Resonance
Serena Tardelli | Lorenzo Alvisi | Lorenzo Cima | Stefano Cresci | Maurizio Tesconi
Emoji reactions are a frequently used feature of messaging platforms, yet their communicative role remains understudied. Prior work on emojis has focused predominantly on in-text usage, showing that emojis embedded in messages tend to amplify and mirror the author’s affective tone. This evidence has often been extended to emoji reactions, treating them as indicators of emotional resonance or user sentiment. However, they may reflect broader social dynamics. Here, we investigate the communicative function of emoji reactions on Telegram. We analyze over 650k crypto-related messages that received at least one reaction, annotating each with sentiment, emotion, persuasion strategy, and speech act labels, and inferring the sentiment and emotion of emoji reactions using both lexicons and LLMs. We uncover a systematic mismatch between message and reaction sentiment, with positive reactions dominating even for neutral or negative content. This pattern persists across rhetorical strategies and emotional tones, indicating that emojis used as reactions do not reliably function as indicators of emotional mirroring or resonance of the content, in contrast to findings reported for in-text emojis. Finally, we identify the features that most predict emoji engagement. Overall, our findings caution against treating emoji reactions as sentiment labels, highlighting the need for more nuanced approaches in sentiment and engagement analysis.
This paper presents a domain-specific transformer pipeline for quantifying social atmosphere in hostel reviews, an experiential dimension that travelers consistently prioritize but that existing NLP methods and booking platforms fail to capture. We train a cross-encoder on 4,994 manually annotated reviews and use it to pseudo-label 162,840 additional reviews; these labels are then distilled into a sentence-transformer bi-encoder, producing embeddings where proximity reflects social interaction level rather than generic sentiment. On held-out human-labeled data, the domain-adapted embeddings achieve F1 = 0.826, outperforming generic sentence embeddings (0.671) and zero-shot GPT-4o (0.774), with a 40-fold improvement in intra-class versus inter-class similarity. Aggregating predictions to the property level reveals that hostel socialness follows an approximate exponential distribution, confirming that highly social hostels are rare. This work formalizes socialness as a measurable semantic construct and provides a general template for extracting implicit experiential attributes from text at scale.
Predicting Convincingness in Political Speech: How Emotional Tone Shapes Persuasive Strength
Bhuvanesh Verma | Mounika Marreddy | Alexander Mehler
Emotional tone plays a central role in persuasion, yet its impact on computational assessments of political argument quality in real-world election campaign speeches remains understudied. In this work, we investigate whether positive emotional framing correlates with higher perceived convincingness in political arguments. We fine-tune language models on argument quality datasets and test their ability to transfer convincingness predictions to real-world campaign speeches. Using a corpus of U.S. presidential campaign speeches, we analyze emotional polarity in relation to predicted persuasive strength to test whether positively framed arguments are judged more convincing than neutral or negative ones. Our empirical analysis shows that political parties rely heavily on argumentation during their election campaigns. We also find evidence that politicians strategically employ emotional cues within their arguments during these campaign speeches, with positive emotions being more strongly associated with persuasive strength, for example in topics such as USMCA’s Effect on American Jobs and Agriculture, Border Control Policies, and Progressive Tax Reforms. At the same time, we find that negative emotions have a weaker yet still non-negligible influence on voter persuasion in topics such as City Crime and Civil Unrest and White Supremacist Violence (Charlottesville Incident).
Large language models (LLMs) are now widely used in applications that depend on closed-ended decisions, including automated surveys, policy screening, and decision-support tools. In such contexts, these models are typically expected to produce consistent binary or ternary responses (for example, Yes, No, or Neither) when presented with questions that are semantically equivalent. However, recent studies show that LLM outputs can be influenced by relatively minor changes in prompt wording, raising concerns about the reliability of their decisions under paraphrasing. In this paper, we conduct a systematic analysis of paraphrase robustness across five widely used LLMs. To support this evaluation, we develop a controlled dataset consisting of 200 opinion-based questions drawn from multiple domains, each accompanied by five human-validated paraphrases. All models are evaluated under deterministic inference settings and constrained to a fixed Yes/No/Neither response format. We assess model behavior using a set of complementary metrics that capture the stability of each evaluated model. DeepSeek Reasoner and Gemini 2.0 Flash show the highest stability when responding to paraphrased inputs, whereas Claude 3.7 Sonnet exhibits strong internal consistency but produces judgments that differ more frequently from those of other models. By contrast, GPT-3.5 Turbo and LLaMA 3 70B display greater sensitivity to surface-level variations in prompt phrasing. Overall, these findings suggest that robustness to paraphrasing is driven more by alignment strategies and reasoning design choices than by model size alone.
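The abstract above does not name its stability metrics. One simple way to quantify stability under paraphrasing, offered only as an illustrative assumption, is the share of a question's paraphrases that receive the majority answer, averaged over questions. The function names and toy responses below are hypothetical.

```python
from collections import Counter


def majority_agreement(responses):
    """Fraction of paraphrase responses that match the most common answer."""
    if not responses:
        raise ValueError("need at least one response")
    most_common_count = Counter(responses).most_common(1)[0][1]
    return most_common_count / len(responses)


def mean_stability(responses_per_question):
    """Average majority agreement over all questions."""
    scores = [majority_agreement(r) for r in responses_per_question]
    return sum(scores) / len(scores)


# Toy example: two questions, five paraphrases each, Yes/No/Neither answers.
toy_responses = [
    ["Yes", "Yes", "No", "Yes", "Yes"],
    ["Neither", "Neither", "Neither", "Neither", "Neither"],
]
stability = mean_stability(toy_responses)
```

A score of 1.0 means every paraphrase of every question received the same answer. Note that this measures only a model's agreement with itself, not agreement with other models, which the abstract treats as a separate property.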
The Impact of Highlighting Subjective Language on Perceived News Trustworthiness
Mohammad Shokri | Vivek Sharma | Emily Klapper | Shweta Jain | Elena Filatova | Sarah Ita Levitan
The rise of misinformation and opinionated articles has made understanding how misleading or biased content influences readers an increasingly important problem. While most prior work focuses on detecting misinformation or deceptive language in real time, far less attention has been paid to how such content is perceived by readers, which is an essential component of misinformation’s effectiveness. In this study, we examine whether highlighting subjective sentences in news articles affects perceived trustworthiness. Using a controlled user experiment and 1,334 article–reader evaluations, we find that highlighting subjective content produces a modest yet statistically significant decrease in trust, with substantial variation across articles and participants. To explain this variation, we model trust change after highlighting subjective language as a function of article-level linguistic features and reader-level attitudes. Our findings suggest that readers’ reactions to highlighted subjective language are driven primarily by characteristics of the text itself, and that highlighting subjective language may help readers better assess the reliability of potentially misleading news articles.
Appraisal Trajectories in Narratives Reveal Distinct Patterns of Emotion Evocation
Johannes Schäfer | Janne Wagner | Roman Klinger
Understanding emotion responses relies on reconstructing how individuals appraise events. While prior work has studied emotion trajectories and inherent correlations with appraisals, it has considered appraisals only in a snapshot analysis. However, because appraisal is a complex, sequential process, we argue that it should be analyzed based on how it unfolds throughout a narrative. In this study, we investigate whether trajectories of appraisals are distinctive for different emotions in five-event stories – narratives where each of five sentences describes an event. We employ zero-shot prompting with a large language model to predict appraisals on sub-sequences of a narrative. We find that this approach is effective in identifying relevant appraisals in narratives, without prior knowledge of the evoked emotion, enabling a comprehensive analysis of appraisal trajectories. Furthermore, we are the first to quantitatively identify typical patterns of appraisal trajectories that distinguish emotions. For example, a rising trajectory for self-responsibility indicates trust, while a falling trajectory suggests anger.
Exploring Subjective Tasks in Farsi: A Survey Analysis and Evaluation of Language Models
Donya Rooein | Flor Miriam Plaza-del-Arco | Debora Nozza | Dirk Hovy
Given Farsi’s speaker base of over 127 million people and the growing availability of digital text, including more than 1.3 million articles on Wikipedia, it is considered a middle-resource language. However, this label quickly crumbles when the situation is examined more closely. We focus on three subjective tasks (Sentiment Analysis, Emotion Analysis, and Toxicity Detection) and identify significant challenges in data availability and quality, despite overall increases in data availability. We review 110 publications on subjective tasks in Farsi and observe a lack of publicly available datasets. Furthermore, existing datasets often lack essential demographic factors, such as age and gender, that are crucial for accurately modeling subjectivity in language. When evaluating prediction models using the few available datasets, the results are highly unstable across both datasets and models. Our findings show that the volume of data alone is insufficient to improve a language’s standing in NLP.
Emotional Lexicons: How Large Language Models Predict Emotional Ratings of Russian Words
Polina V. Iaroshenko | Natalia V Loukachevitch
This study examines the capability of LLMs to predict emotional ratings of Russian words by comparing their assessments with both native speakers’ ratings and expert evaluations. The research utilises two datasets: the ENRuN database containing associative emotional ratings of Russian nouns by native speakers, and RusEmoLex, an expert-compiled lexicon. Various open-source LLMs were evaluated, including international models (Llama-3, Qwen 2.5), Russian-developed models, and Russian-adapted variants, representing three parameter scales. The findings reveal distinct patterns in model performance: Russian-adapted models demonstrated superior alignment with native speakers’ ratings, whilst model size was not a decisive factor. Conversely, larger models showed better performance in matching expert assessments, with language adaptation having minimal impact. Emotional or sensitive lexis with strong connotations produces a more substantial human-model gap.
Emotion-aware text simplification of user generated content using LLMs
Anastasiia Bezobrazova | Daria Sokova | Constantin Orasan
Digital inclusion increasingly supports adults with intellectual disabilities (ID) to participate online, yet social media posts can be difficult to understand, particularly when they contain strong emotions, slang, or non-standard writing. This paper investigates whether large language models (LLMs) can simplify social media texts to improve cognitive accessibility and preserve emotional meaning. Using an accessibility-oriented prompt based on existing guidance, posts are simplified and emotion preservation is assessed. The results suggest that many simplified posts retain the same emotions, though changes occur, especially when emotions are weakly expressed or ambiguous. Qualitative analysis shows that simplification improves fluency and structure but can also shift perceived emotion through changes to tone, formatting, and other affective cues common in social media text. The research has also revealed that different LLMs produce very different outputs.
Crowd-Based Evaluation of Emotion Intensity Preservation in Spanish–Basque Tweet Machine Translation
Nora Aranberri
Machine translation (MT) systems perform well on standard benchmarks, yet their ability to preserve emotional meaning in informal user-generated content—particularly for low-resource languages—remains underexplored. We investigate the preservation of emotion intensity in Spanish–Basque tweet translation, focusing on Basque, an under-represented language in MT research. We compile a small, controlled corpus of Spanish reaction tweets and evaluate Basque translations from three publicly available systems through a crowd-based study. While all systems achieve comparable and above mid-range accuracy and fluency, emotion intensity is systematically attenuated in the translations, with greater loss for more emotionally intense inputs. A follow-up on highly emotional tweets shows that LLM prompting reduces emotion loss, yet substantial attenuation remains, highlighting emotion preservation as a persistent challenge in Spanish–Basque MT.
A Position Paper on Toxic Reasoning: Grounding Categories of Toxic Language in Implications and Attitudes
Stefan F. Schouten | Ilia Markov | Piek Vossen
Automatic detection of toxic language has the potential to considerably improve engagement with online spaces. Previous work has characterized toxic language detection as a classification problem, often using fine-grained classes for increased explainability. In this position paper, we argue for a particular way of operationalizing categories of toxic language. Our approach focuses on what is expressed or implied, and breaks down implications based on two traits: (i) the core content of what was expressed, and (ii) relevant stakeholders’ attitudes towards that content. We argue for an approach, which we call toxic reasoning, where such distinctions are made explicit. We point out the benefits of such an approach, and develop a toxic reasoning schema, which can explain categories of toxic language from diverse sources. We demonstrate this by mapping the classes of existing toxic language datasets to the schema. Toxic reasoning promises to provide improved understanding of implicit toxicity while increasing explainability.
Is Sentiment Banana-Shaped? Exploring the Geometry and Portability of Sentiment Concept Vectors
Laurits Lyngbaek | Pascale Feldkamp | Yuri Bizzoni | Kristoffer Nielbo | Kenneth Enevoldsen
Use cases of sentiment analysis in the humanities often require contextualized, continuous scores. Concept Vector Projections (CVP) offer a recent solution: by modeling sentiment as a direction in embedding space, they produce continuous, multilingual scores that align closely with human judgments. Yet the method’s portability across domains and underlying assumptions remain underexplored. We evaluate CVP across genres, historical periods, languages, and affective dimensions, finding that concept vectors trained on one corpus transfer well to others with minimal performance loss. To understand the patterns of generalization, we further examine the linearity assumption underlying CVP. Our findings suggest that while CVP is a portable approach that effectively captures generalizable patterns, its linearity assumption is approximate, pointing to potential for further development. Code available at: github.com/lauritswl/representation-transfer
Disentangling Emotion Understanding and Generation in Large Language Models
Sadegh Jafari | Els Lefever | Veronique Hoste
Large language models (LLMs) have demonstrated strong performance on emotion understanding tasks, yet their ability to faithfully generate emotionally aligned text remains less well understood. We propose a semantic evaluation framework that jointly assesses emotion understanding, emotion generation, and internal consistency, using a VAE-based emotion cost matrix that captures graded semantic similarity between emotion categories. Our framework introduces four complementary metrics that disentangle baseline understanding, human-perceived emotion in generated text, generation quality, and model consistency. Experimental results show that while understanding and consistency scores are highly correlated, emotion generation exhibits substantially weaker correlations with these metrics. These findings motivate the development of specialized evaluation protocols that independently measure emotional understanding and generation, enabling more reliable assessments of LLM emotional intelligence.
News Credibility Assessment by LLMs and Humans: Implications for Political Bias
Pia Wenzel Neves | Charlott Jakob | Vera Schmitt
In an era of rapid misinformation spread, LLMs have emerged as tools for assessing news credibility at scale. However, the assessments are influenced by social and cultural biases. Studies investigating political bias compare model credibility ratings with expert credibility ratings. Comparing LLMs to the perceptions of political camps extends this approach to detecting similarities in their biases. We compare LLM-generated credibility and bias ratings of news outlets with expert assessments and stratified political opinions collected through surveys. We analyse three models (Llama 3.3 70B, Mixtral 8x7B, and GPT-OSS 120B) across 47 news outlets from two countries (U.S. and Germany). We found that models demonstrated consistently high alignment with expert ratings, while showing weaker and more variable alignment with public opinions. For US-American news outlets all models showed stronger alignment with center-left perceptions, while for German news outlets the alignment is more diverse.
Towards Simulating Social Media Users with LLMs: Evaluating the Operational Validity of Conditioned Comment Prediction
Nils Schwager | Simon Münker | Alistair Plum | Achim Rettinger
The transition of Large Language Models (LLMs) from exploratory tools to active "silicon subjects" in social science lacks extensive validation of operational validity. This study introduces Conditioned Comment Prediction (CCP), a task in which a model predicts how a user would comment on a given stimulus by comparing generated outputs with authentic digital traces. This framework enables a rigorous evaluation of current LLM capabilities with respect to the simulation of social media user behavior. We evaluated open-weight 8B models (Llama-3.1, Qwen3, Ministral) in English, German, and Luxembourgish language scenarios. By systematically comparing prompting strategies (explicit vs. implicit) and the impact of Supervised Fine-Tuning (SFT), we identify a critical form vs. content decoupling in low-resource settings: while SFT aligns the surface structure of the text output (length and syntax), it degrades semantic grounding. Furthermore, we demonstrate that explicit conditioning (generated biographies) becomes redundant under fine-tuning, as models successfully perform latent inference directly from behavioral histories. Our findings challenge current "naive prompting" paradigms and offer operational guidelines prioritizing authentic behavioral traces over descriptive personas for high-fidelity simulation.
Label-Consistent Data Generation for Aspect-Based Sentiment Analysis Using LLM Agents
Mohammad Hossein Akbari Monfared | Lucie Flek | Akbar Karimi
We propose an agentic data augmentation method for Aspect-Based Sentiment Analysis (ABSA) that uses iterative generation and verification to produce high-quality synthetic training examples. To isolate the effect of agentic structure, we also develop a closely matched prompting-based baseline using the same model and instructions. Both methods are evaluated across three ABSA subtasks—Aspect Term Extraction (ATE), Aspect Sentiment Classification (ATSC), and Aspect Sentiment Pair Extraction (ASPE)—four SemEval datasets, and two encoder–decoder models: T5-Base and Tk-Instruct. Our results show that the agentic augmentation outperforms raw prompting in label preservation of the augmented data, especially when the tasks require aspect term generation. In addition, when combined with real data, agentic augmentation provides higher gains, consistently outperforming prompting-based generation. These benefits are most pronounced for T5-Base, while the more heavily pretrained Tk-Instruct exhibits smaller improvements. As a result, augmented data helps T5-Base achieve comparable performance with its counterpart.
Antisocial behavior (ASB) on social media encompasses online behaviors that harm individuals, groups, or platform ecosystems, including hate speech, harassment, cyberbullying, trolling, and coordinated abuse. While most prior work has focused on detecting harm after it occurs, a growing body of research on ASB prediction seeks to forecast future harmful outcomes before they materialize, including—but not limited to—hate-speech diffusion, conversational derailment, and user recidivism. However, this emerging field remains fragmented, with limited conceptual grounding and few integrative frameworks. This paper establishes a foundation for ASB prediction by introducing a structured taxonomy spanning temporal, structural, and behavioral dimensions. Drawing on 49 machine learning studies identified through a literature review, we map predictive goals to datasets, modeling choices, and evaluation practices, and identify key challenges, including the lack of standardized benchmarks, the dominance of text-centric representations, and trade-offs between accuracy and interpretability. We conclude by outlining actionable directions toward more robust, generalizable, and responsible ASB prediction systems.
Real-Time Mitigation of Negative Emotion in Customer Care Calls
Surupendu Gangopadhyay | Mahnoosh Mehrabani
Speech emotion recognition (SER) is a compelling yet challenging research area with substantial practical relevance, particularly in enhancing human–machine interaction. Despite considerable progress in the field, the scarcity of realistic datasets that reflect real-world conditions makes it difficult to analyze system behavior in practice and can lead to degraded performance in industrial applications. In this study, we propose a system that detects negative emotions at each turn in a conversation by leveraging both linguistic and acoustic features. The approach is evaluated on real-world data, with a particular focus on identifying and responding to negative emotion in customer support scenarios. Designed for real-time application, the system is suitable for live deployment in call center environments. Furthermore, we propose an effective prompting strategy for using large language models (LLMs) as annotators, generating labeled data used to fine-tune small language models that achieve performance on par with the LLM used for annotation, while remaining suitable for real-time deployment.
Says Who? Argument Convincingness and Reader Stance Are Correlated with Perceived Author Personality
Sabine Weber | Lynn Greschner | Roman Klinger
Alongside its literal meaning, text also carries implicit social signals: information that is used by the reader to assign the author of the text a specific identity or make assumptions about the author’s character. The reader creates a mental image of the author which influences the interpretation of the presented information. This is especially relevant for argumentative text, where the credibility of the information might depend on who provides it. We therefore focus on the question: How do readers of an argument imagine its author? Using the ContArgA corpus, we study arguments annotated for convincingness and perceived author properties (level of education and Big Five personality traits). We find that annotators perceive an author to be similar to themselves when they agree with the stance of the argument. We also find that the envisioned personality traits and education level of the author are statistically significantly correlated with the argument’s convincingness. We conduct experiments with four generative LLMs and a RoBERTa-based regression model, showing that LLMs do not replicate the annotators’ judgments. Argument convincingness can, however, provide a useful signal for modeling perceived author personality when it is explicitly used during training.
A Transformer and Prototype-based Interpretable Model for Contextual Sarcasm Detection
Ximing Wen | Rezvaneh Rezapour
Sarcasm detection, with its figurative nature, poses unique challenges for affective systems designed to perform sentiment analysis. While these systems typically perform well at identifying direct expressions of emotion, they struggle with sarcasm’s inherent contradiction between literal and intended sentiment. Since transformer-based language models (LMs) are known for their efficient ability to capture contextual meanings, we propose a method that leverages LMs and prototype-based networks, enhanced by sentiment embeddings, to conduct interpretable sarcasm detection. Our approach is intrinsically interpretable without extra post-hoc interpretability techniques. We test our model on three public benchmark datasets and show that our model outperforms the current state-of-the-art. At the same time, the prototypical layer enhances the model’s inherent interpretability by generating explanations through similar examples at inference time. Furthermore, we demonstrate the effectiveness of incongruity loss in the ablation study, which we construct using sentiment prototypes.
Multimodal Claim Extraction for Fact-Checking
Joycelyn Teo | Rui Cao | Zhenyun Deng | Zifeng Ding | Michael Sejr Schlichtkrull | Andreas Vlachos
Automated Fact-Checking (AFC) relies on claim extraction as a first step, yet existing methods largely overlook the multimodal nature of today’s misinformation. Social media posts often combine short, informal text with images such as memes, screenshots, and photos, creating challenges that differ from both text-only claim extraction and well-studied multimodal tasks like image captioning or visual question answering. In this work, we present the first benchmark for multimodal claim extraction from social media, consisting of posts containing text and one or more images, annotated with gold-standard claims derived from real-world fact-checkers. We evaluate state-of-the-art multimodal LLMs (MLLMs) under a three-part evaluation framework (semantic alignment, faithfulness, and decontextualization) and find that baseline MLLMs struggle to model rhetorical intent and contextual cues. To address this, we introduce MICE, an intent-aware framework which shows improvements in intent-critical cases.
A Multi-Aspect Evaluation Framework for Synthetic Data: Case Study on Irony and Sarcasm
Laura Majer | Ana Barić | Florijan Sandalj | Ivan Unković | Bojan Puvača | Jan Šnajder
Data augmentation (DA) using large language models (LLMs) is a cost-effective method for generating synthetic data, particularly for tasks with scarce datasets. However, its potential remains largely underexplored, both in terms of augmentation configuration and evaluation of synthetic data. This paper investigates LLM-based synthetic data generation for irony and sarcasm, two subjective and context-dependent forms of figurative language. We propose a multi-aspect evaluation framework assessing synthetic data’s utility-plausibility and extrinsic-intrinsic dimensions through four aspects: predictive performance, sample diversity, linguistic properties, and human judgment. Our findings indicate that other aspects of evaluation, like diversity and linguistic features, do not necessarily correlate with an increase in predictive performance, underscoring the importance of multi-faceted evaluation. This work highlights the potential of LLM-based DA for irony and sarcasm detection, offering insights into the linguistic competence of LLMs. As synthetic data becomes increasingly prevalent, our framework offers a broadly applicable and crucial evaluation method, particularly for linguistically complex tasks.
Proceedings of the 2nd Joint Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences and Computational Models of Reference, Anaphora and Coreference (CODI-CRAC 2026)
Chloé Braud | Christian Hardmeier | Maciej Ogrodniczuk | Sharid Loaiciga | Amir Zeldes | Michal Novák | Chuyuan Li | Michael Strube | Junyi Jessy Li
Baselines for Detection and Classification of Discourse Presentation in English Narrative
Reinaldo Di Polo | Mustafa Ocal | Mark Finlayson
Discourse presentation is when speech, writing, or thought (SW&T) attributed to a discourse entity (such as a character in a narrative) is presented within a discourse. Discourse presentations can be generally broken into direct or indirect: direct presentation is when the text quotes the words or thoughts verbatim, whereas in indirect presentation the text expresses the SW&T in the narrator’s or writer’s own words. Automatically detecting and categorizing discourse presentations supports discourse and narrative analysis and improves attribution for downstream NLP tasks, but detecting indirect discourse presentations remains challenging due to diverse surface forms and subtle perspective shifts. We study detection and categorization of discourse presentations on a corrected version of Semino & Short’s English Narrative SW&TP corpus. We cast the task as five-way clause classification: Direct Speech & Writing, Direct Thought, Indirect Speech & Writing, Indirect Thought, and Narrative (i.e., no discourse presentation). We compare four approaches: (1) CNN; (2) generative baseline (Claude Sonnet 4.6); (3) untuned BERT; and (4) fine-tuned BERT. The CNN baseline achieves 0.43 F1 and exhibits substantial confusion with the Narrative class. Claude achieves 0.71 F1 but performs unevenly across classes and fails to recover Indirect Thought. Untuned BERT achieves 0.81 F1 overall but struggles on indirect categories. Fine-tuned BERT yields strong performance (0.88 F1), with remaining errors concentrated in Indirect Speech & Writing (F1 = 0.60). We release our code and the corrected dataset to support reproducibility. To our knowledge, this is the first time computational approaches have been evaluated across the full range of SW&TP discourse presentation types.
The relations connecting propositions in discourse such as cause (A because B) or concession (A although B) are a subject of intense interest in Computational Linguistics and Pragmatics, but challenging to study and compare across languages. Recent progress in standardizing discourse relation inventories across datasets offers the potential to facilitate such studies, but is hindered by the complexity of relevant data and the lack of easily accessible interfaces to analyze it. In this paper we present DiscoExplorer, a new open source web interface, capable of running on local computers, which we use to make datasets from the DISRPT Shared Task on discourse relation classification publicly available, covering 16 different languages. We present the query language, search and visualization facilities for relations and signaling devices such as connectives, as well as some example studies.
Speech Disfluencies and LLM Confidence: Length Bias and Pragmatic Insensitivity in Brazilian Portuguese
Valeria Santos
Training Large Language Models (LLMs) relies predominantly on written, curated corpora, which may limit their reliability on spontaneous speech. Oral language exhibits real-time planning markers (filled pauses, repetitions, false starts, and vowel lengthenings) that modulate epistemic commitment. This pilot study investigates how such disfluencies affect the alignment between LLM confidence and a discourse-pragmatic uncertainty proxy in a Portuguese model (Llama-3.1-8B-Instruct). Using a benchmark of 344 turns from the Roda Viva corpus, we contrast faithful Conversation Analysis transcriptions with sanitized versions and combine binned divergence metrics (ECE, OE) with rank correlation and multivariate regression analyses. We find that model confidence is overwhelmingly driven by a surface feature, turn length ($\beta_{\text{std}} = +14.47$, $p < 0.001$), rather than by pragmatic markers of uncertainty ($\beta_{\text{oral}} = -3.09$, $\beta_{\text{hedges}} = -0.97$, both non-significant; $R^2 = 0.29$). After controlling for length, residual effects of disfluency markers align in the human-expected direction but are dwarfed by length bias. We argue that this surface-feature dominance subsumes the pragmatic blindness phenomenon and explains the substantial divergence observed via ECE (41.95) and OE (4.29) between faithful and sanitized conditions.
Recent work representing discourse relations such as "cause" or "concession" in the framework of eRST has connected hierarchical discourse parsing to explicit connectives, such as ’because’ or ’although’, bringing the framework closer to lexicalized shallow parsing in the tradition of PDTB. However, while PDTB postulates implicit, unexpressed connectives (i.e. an implied ’although’ etc.), no such devices are recognized in eRST, and consequently next to nothing is known about the relationship between PDTB-style implicit connectives and eRST-style discourse graphs. In this paper we propose and evaluate an algorithm to align eRST data, which already indicates explicit connectives, to implicit connective annotations following the PDTB guidelines. We also conduct the first evaluation of the relationship between hierarchical RST-style relations and PDTB implicit connectives.
What’s in a Bridge?: A Descriptive, Multi-Genre Analysis of the GUMBridge Corpus for Varieties of Bridging Anaphora
Lauren Levine | Amir Zeldes
In this paper, we present a descriptive corpus analysis of bridging anaphora across 16 genres of English, leveraging the multi-genre GUMBridge corpus for varieties of bridging anaphora. We begin our investigation by examining the distribution of bridging instances by sub-varieties and across genres, finding that spoken genres have fewer bridging instances than written ones. We then investigate the linguistic environments of bridging anaphora and their corresponding associative antecedents in the underlying data of the corpus, examining both categorical features (entity type, part of speech, syntactic dependency relations) and numeric features (mention length, cluster size, salience, and distance between the bridging anaphor and antecedent). We find bridging anaphora have a tendency to be shorter and are more often definite, and bridging antecedents show a tendency to be more salient than other entities. Finally, we analyze how several of the numeric features of bridging environments vary by genre, finding consistent patterns across genres for observed trends in the environments of bridging anaphora and antecedents.
Dataset Cartography for Implicit Discourse Relation Recognition: Promises and Pitfalls
Daniil Ignatev | Denis Paperno | Massimo Poesio
Crowdsourced data for implicit discourse relation recognition (IDRR) has been shown to contain both plausible interpretations and noisy annotations. We present a case study of dataset cartography (Swayamdipta 2020) on the IDRR-focused DiscoGeM corpus (Scholman et al., 2022). Our findings show that error identification via low confidence proves unreliable, as confidence is strongly affected by label rarity. However, high-confidence datapoints reveal a different use case: auditing the cue-rich regions of the dataset. Our lexical probe demonstrates an association between high-confidence items and (mostly temporal) intra-argument cue words. Dataset cartography can thus serve as a diagnostic of cue-driven easy-to-learn cases, which need to be balanced out to ensure the robustness of IDRR learning.
Universal Discourse Relations: A Proposal
Anna Latusek | Maciej Ogrodniczuk | Alina Wróblewska | Bartosz Żuk
This paper introduces a novel ’universal’ approach to discourse annotation, serving as a comprehensive synthesis of the ISO 24617-8 semantic annotation framework and a newly developed multi-layer model of coherence relations. To address the complexities of text analysis, we present a hierarchical classification and a systematic decision tree. By unifying disparate formalisms, our model provides researchers with a robust, standardised methodology for analysing complex discourse structures across various linguistic contexts.
A First Step towards Dialog Simulation with Grounded Dialog Graphs
Michael Ginn | Matt Pauk | Tava Reese | Sameer Gupta | Giuseppe Castellucci | Kevin Small | Alessandro Moschitti | Derek Palmer | Martha Palmer | Alexis Palmer | Maria Pacheco
In this work, we propose a method for dialog simulation to gather high-quality open-domain, multi-turn question answering conversations. The simulation is grounded on Stack Exchange posts and motivated by computational discourse theory. We first convert forum posts into structured directed graphs; then, different traversals through the graph represent possible conversational trajectories. Our proposed graph traversal algorithm produces dialogs optimized for conversational efficiency. In addition, we propose an evaluation framework based on Gricean conversational maxims. Expert-level human annotators evaluate 105 cooking domain transcripts according to our framework; dialogs produced by our method receive ratings that are competitive with dialogs from prior work.
Errors in coreference resolution in German: Effects of modality, simplification and heterogeneous training data
Sarah Jablotschkin | Ekaterina Lapshinova-Koltunski | Heike Zinsmeister
Errors in automatic coreference resolution can be traced back to errors in mention detection and coreference linking. In this paper, we analyse the errors in mention detection produced by the coreference resolver CorPipe (Straka 2023). In particular, we evaluate the performance on different variants of German (written, spoken, original, and simplified). We discuss the errors against the background of the fact that the tool was trained on a combination of different coreference corpora, including two German datasets with partially conflicting annotation guidelines. The results indicate that simplification has a significant effect on mention detection independent of the modality.
Evaluating pragmatic reasoning in large language models (LLMs) remains challenging because model behavior can vary depending on evaluation methods. Previous studies suggest that prompt-based judgments may diverge from models’ internal probability distributions, raising questions about whether observed performance reflects underlying competence or task-induced behavior. This study examines this issue using scalar diversity as a graded diagnostic for pragmatic inference. Following Hu & Levy (2023), this study compares direct probability measurement and metalinguistic prompting across multiple models and experimental settings. The results show that neither evaluation method consistently outperforms the other and that pragmatic behavior varies substantially across model families, prompting strategies, and task structures. Moreover, scalar diversity gradients emerge only in specific model–condition combinations, suggesting that pragmatic reasoning in LLMs reflects an interaction between internal probabilistic representations and task-induced prompting behavior rather than a stable competence captured by a single evaluation paradigm. These findings highlight the central role of evaluation design in interpreting pragmatic abilities in LLMs.
Findings of the Fifth Shared Task on Multilingual Coreference Resolution: Expanding Datasets for Long-Range Entities
Michal Novák | Miloslav Konopík | Anna Nedoluzhko | Martin Popel | Ondrej Prazak | Jakub Sido | Milan Straka | Zdeněk Žabokrtský | Daniel Zeman
This paper describes the fifth edition of the Shared Task on Multilingual Coreference Resolution, held in conjunction with the CODI-CRAC 2026 workshop. Building on previous iterations, the task required participants to develop systems capable of mention identification and identity-based coreference clustering. The 2026 edition specifically emphasizes long-range entities, defined as coreferential chains spanning significant distances, across many words and sentences. The task expanded its linguistic scope by incorporating five new datasets and two additional languages. These additions leverage version 1.4 of CorefUD, a harmonized multilingual collection comprising 27 datasets in 19 languages. In total, ten systems participated, including four LLM-based approaches (three fine-tuned models and one few-shot approach). While traditional systems still maintained their lead, LLMs demonstrated significant potential, suggesting they may soon challenge established approaches in future editions.
Generative Multilingual Coreference Resolution at CRAC 2026
Jakub Hejman | Ondrej Prazak | Miloslav Konopík
Participating again in this year’s edition of the CRAC shared task on coreference resolution, we present our upgraded system with an official uplift of 15.46 percentage points in CoNLL-U score. We incorporated the larger Gemma 3 27B IT model, joint pre-training, headword tagging, more efficient training and inference as well as a sliding window to achieve this result. Our system placed second in the LLM track and third overall with a primary score of 73.83. We reached the highest scores on two datasets. Finally, we compare specialized and general LLM approaches.
Closing the Gap at CRAC 2026: Two-Stage Adaptation for LLM-Based Multilingual Coreference Resolution
Antoine Bourgois | Olga Seminck | Thierry Poibeau
We present our submission to the LLM track of the 2026 Computational Models of Reference, Anaphora and Coreference (CRAC 2026) shared task. With an average CoNLL F1 score of 74.32 on the official test set, our system ranked first in the LLM track, and third overall. Our system is based on the Gemma-3-27b model, fine-tuned using a two-stage strategy with a multilingual base adapter followed by dataset-specific adapters. We represent mention spans by their headword using an XML-inspired format with local reindexing and annotate documents iteratively. These design choices proved effective across languages, document lengths, and annotation guidelines.
Landcore: Coreference Resolution with Language-Specific LLM-Enhanced Prompts and XML-Inspired Annotation Scheme
Jan Pavelka
This paper presents _Landcore_ (LANguage Dependent COreference REsolution), our submission to the LLM Track of the CRAC 2026 Shared Task on Multilingual Coreference Resolution. We explore the capabilities of LLMs in coreference resolution across multiple languages and domains, using a few-shot prompting approach. We design a comprehensive prompt that includes detailed instructions and examples and further enhance it using an LLM to produce language-specific prompts. We present an XML-inspired annotation scheme that is more suitable for LLMs than the provided formats. Although our solution is not the best-performing, we show that our ideas improve performance across various settings.
PortNLP at CRAC 2026: QLoRA Fine-Tuning with Bounded Entity Registry for Multilingual Coreference Resolution
Amber Shore | Russell Scheinberg | Malini Nagasundaram | Ameeta Agrawal
We describe PortNLP’s submission to the CRAC 2026 Shared Task on Multilingual Coreference Resolution (LLM track). Our system fine-tunes Qwen 3 14B with QLoRA on CorefUD 1.4 gold annotations across 27 corpora spanning 19 languages. Documents are processed in 500-700 character chunks with a bounded rolling context consisting of 500 characters of recent annotated text and a scored entity registry that tracks up to 30 active entities via a frequency-times-recency decay formula. We employ data augmentation and language-aware sampling strategies to handle typological and data-size diversity. Our system achieves 68.69 CoNLL F1 averaged across all 27 test corpora. We additionally present probing experiments on the LoRA adapter’s internal representations, finding that coreference signal is concentrated in attention value projections rather than MLP modules, with the strongest readout at the earliest transformer layer.
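The abstract above does not give the exact scoring formula for the entity registry, so the following Python sketch only illustrates one plausible reading of "frequency-times-recency decay": each entity's score is its mention count multiplied by an exponential decay in the number of chunks since it was last seen. The class name, the `decay` parameter, and the exponential form are assumptions made for illustration, not PortNLP's implementation; only the cap of 30 active entities comes from the abstract.

```python
from dataclasses import dataclass


@dataclass
class EntityRecord:
    mentions: int = 0   # how often the entity has been mentioned so far
    last_seen: int = 0  # index of the chunk where it was last mentioned


class BoundedEntityRegistry:
    """Keeps only the highest-scoring entities (hypothetical sketch)."""

    def __init__(self, capacity=30, decay=0.9):
        self.capacity = capacity  # 30 in the abstract
        self.decay = decay        # assumed decay rate per chunk
        self.records = {}

    def observe(self, entity_id, chunk_index):
        """Record one mention of entity_id in the given chunk."""
        rec = self.records.setdefault(entity_id, EntityRecord())
        rec.mentions += 1
        rec.last_seen = chunk_index

    def score(self, entity_id, chunk_index):
        """Assumed score: frequency times exponential recency decay."""
        rec = self.records[entity_id]
        age = chunk_index - rec.last_seen
        return rec.mentions * self.decay ** age

    def active(self, chunk_index):
        """Return at most `capacity` entity ids, best score first."""
        ranked = sorted(
            self.records,
            key=lambda e: self.score(e, chunk_index),
            reverse=True,
        )
        return ranked[: self.capacity]


registry = BoundedEntityRegistry(capacity=2, decay=0.5)
for _ in range(3):
    registry.observe("e1", chunk_index=0)  # frequent but old
registry.observe("e2", chunk_index=5)      # recent, mentioned once
for _ in range(2):
    registry.observe("e3", chunk_index=5)  # recent, mentioned twice
print(registry.active(chunk_index=5))
```

With these toy values, the frequent but stale entity `e1` drops out of the two-slot registry because its score has decayed below those of the recently mentioned entities.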
Lightweight Multilingual Coreference Resolution without LLMs @CRAC2026
Sobha Lalitha Devi | Aashik Ali S | Gopinath P | Vijay Sundar Ram | Pattabhi Rk Rao
This paper describes our multilingual coreference system developed for the CRAC 2026 unconstrained track. We introduce a unified, single-model architecture based on Conditional Random Fields (CRFs) that supports 20 languages. Notably, our approach achieves multilingual resolution without the use of large language models (LLMs) or pretrained weights. In contrast to resource-intensive neural methods, the proposed model is efficient and suitable for deployment on standard hardware (CPUs). It uses linguistic and contextual features to capture coreference relations across languages with diverse syntactic and morphological properties. Model training was conducted using the official data distributions released for the CRAC 2026 shared task. This methodology provides a robust, scalable solution for multilingual NLP, demonstrating high utility within resource-constrained environments. The results highlight that feature-driven structured models remain effective for complex cross-lingual tasks. The performance on test data is similar to the results obtained for the development data.
CorPipe at CRAC 2026: Empty Nodes and Cross-Lingual Transfer in Multilingual Coreference Resolution
Milan Straka
We introduce CorPipe 26, our winning submission to the CRAC 2026 Shared Task on Multilingual Coreference Resolution. The fifth edition of this shared task focuses mainly on the comparison of generative LLMs and specialized systems; additionally, 5 more datasets and 2 new languages are introduced. CorPipe 26 is an improved version of CorPipe 25, with a new variant predicting empty nodes together with mentions and coreference links in a single model. Our system outperforms all other submissions in the LLM track by 2.8 percentage points and all submissions in the unconstrained track by 9.5 percentage points. Furthermore, we perform a series of ablation experiments with different model sizes, empty node prediction methods, and cross-lingual zero-shot evaluation. The source code and the trained models are publicly available at https://github.com/ufal/crac2026-corpipe.
Closing the Gap: Robust Multilingual Coreference Resolution with DAgger
Thomas Morton | Alex Warstadt
We present DAggerCoref, our submission to the CRAC 2026 Shared Task on Multilingual Coreference Resolution. DAggerCoref is a three-stage cascade built on XLM-RoBERTa-large: a gap classifier for zero pronoun detection, a mention head classifier, and a coarse-to-fine antecedent scorer. Our central contribution is applying DAgger (Ross et al., 2011) to coreference resolution: after training the antecedent scorer on gold mentions, we fine-tune on a 50/50 mix of gold and pipeline-predicted mentions, closing the train/test distribution mismatch and improving development set macro CoNLL F1 by 1.10 points. We also introduce Otsu adaptive thresholding for zero pronoun detection, which matches gold-tuned per-dataset thresholds without requiring any gold supervision. Our system achieves a macro CoNLL F1 of 67.56 on the official test set across 27 datasets and 19 languages.
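For readers unfamiliar with Otsu thresholding, the sketch below shows the general technique in Python: given unlabeled scores, it picks the cut point that maximizes the between-class variance of the two resulting groups, so no gold labels are needed. How DAggerCoref applies it to its gap classifier is not specified in the abstract; the function name and the example scores are hypothetical, and this version works on raw scores rather than a histogram.

```python
def otsu_threshold(scores):
    """Return the cut point maximizing between-class variance.

    Candidate thresholds are midpoints between consecutive distinct
    sorted scores. Expects a non-empty sequence of numbers.
    """
    values = sorted(scores)
    n = len(values)
    total = sum(values)
    best_threshold, best_variance = values[0], -1.0
    left_sum = 0.0
    for i in range(1, n):
        left_sum += values[i - 1]
        if values[i] == values[i - 1]:
            continue  # no valid cut between equal scores
        w0 = i / n                            # weight of the low group
        w1 = 1.0 - w0                         # weight of the high group
        mu0 = left_sum / i                    # mean of the low group
        mu1 = (total - left_sum) / (n - i)    # mean of the high group
        variance = w0 * w1 * (mu0 - mu1) ** 2
        if variance > best_variance:
            best_variance = variance
            best_threshold = (values[i - 1] + values[i]) / 2
    return best_threshold


# Hypothetical classifier scores forming two clusters.
example_scores = [0.05, 0.1, 0.15, 0.8, 0.9, 0.95]
threshold = otsu_threshold(example_scores)
predicted_positive = [s for s in example_scores if s > threshold]
print(threshold, predicted_positive)
```

On these toy scores the selected threshold falls in the gap between the two clusters, separating the three low scores from the three high ones.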