Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)

Selene Baez Santamaria, Sai Ashish Somayajula, Atsuki Yamaguchi (Editors)


Anthology ID:
2026.eacl-srw
Month:
March
Year:
2026
Address:
Rabat, Morocco
Venue:
EACL
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw/
ISBN:
979-8-89176-383-8
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw.pdf

This report provides a summary and analysis of the EACL 2026 Student Research Workshop (SRW) Mentorship Program, using structured exit surveys collected from mentors and mentees. Following the spirit of recent ACL Program Chairs’ Reports, this document aims to increase transparency, record lessons learned, and offer actionable guidance for future SRW organizers. The analysis evaluates overall satisfaction, identifies systematic strengths and weaknesses of the mentorship process, and offers recommendations to improve the alignment of expectations and program logistics. We hope that the publication of these findings serves to clarify the organization of mentorship at *ACL venues, provide empirical data for future chairs, and contribute context for meta-research regarding early-career support within the NLP community.
We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. We leverage object-centric attention in self-supervised Vision Transformers. In particular, we remove the most salient visual evidence to construct an auxiliary view that disrupts unsupported tokens and produces a stronger contrast signal. Our method is prompt-agnostic, model-agnostic, and can be seamlessly plugged into the existing VCD pipeline with little computational overhead, i.e., a single cacheable forward pass. Empirically, our method demonstrates consistent gains on two popular object hallucination benchmarks across two MLLMs.
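The contrastive combination at the heart of VCD can be stated compactly. Below is a minimal PyTorch sketch of the standard VCD logit adjustment, assuming the auxiliary view is the object-masked image described above; tensor shapes and values are illustrative, not the authors' code.

```python
import torch

def vcd_logits(logits_orig: torch.Tensor,
               logits_aux: torch.Tensor,
               alpha: float = 1.0) -> torch.Tensor:
    """Standard VCD combination: amplify evidence from the original image
    and subtract what the object-masked auxiliary view still supports,
    so tokens unsupported by the removed objects lose probability."""
    return (1.0 + alpha) * logits_orig - alpha * logits_aux

# Illustrative usage: next-token logits of shape (batch, vocab) from two
# forward passes of the same MLLM (the auxiliary pass is cacheable).
orig = torch.randn(1, 32000)
aux = torch.randn(1, 32000)
next_token = vcd_logits(orig, aux).argmax(dim=-1)
```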
The objective of this paper is to enhance machine translation for manga (Japanese comics) by developing and employing an image encoder capable of more accurately comprehending the visual context of manga. Conventional manga machine translation systems have faced the challenge of lacking sufficient manga comprehension capabilities when utilizing image information. To address this issue, we propose a domain-adapted image encoder training method for manga. The proposed method trains encoders to acquire visual features that reflect the structural and sequential characteristics of manga. This approach draws upon a technique that has proven highly effective in training language models. The image encoders trained with the proposed method are used as visual processors in a multimodal machine translation model and are evaluated on a Japanese-English translation task. The experimental results demonstrate that the proposed method improves translation evaluation metrics, such as BLEU and xCOMET, compared to the conventional method.
Diagram-grounded geometry problem solving is a critical benchmark for multimodal large language models (MLLMs), yet the benefits of multi-agent design over single-agent remain unclear. We systematically compare single-agent and multi-agent pipelines on four visual math benchmarks: Geometry3K, MathVerse, OlympiadBench, and We-Math. For open-source models, multi-agent consistently improves performance. For example, Qwen-2.5-VL (7B) gains +6.8 points and Qwen-2.5-VL (32B) gains +3.3 on Geometry3K, and both Qwen-2.5-VL variants see further gains on OlympiadBench and We-Math. In contrast, the closed-source Gemini-2.0-Flash generally performs better in single-agent mode on classic benchmarks, while multi-agent yields only modest improvements on the newer We-Math dataset. These findings show that multi-agent pipelines provide clear benefits for open-source models and can assist strong proprietary systems on newer, less familiar benchmarks, but agentic decomposition is not universally optimal. All code, data, and reasoning files are available at https://github.com/faiyazabdullah/Interpreter-Solver
The landscape of Large Language Models remains predominantly English-centric, resulting in a significant performance gap for other major languages, such as French, especially in the context of Small Language Models (SLMs). Existing multilingual models demonstrate considerably lower performance in French compared to English, and research on efficient adaptation methods for French remains limited. To address this, we introduce Luth, a family of French-specialized SLMs: through targeted post-training on curated, high-quality French data, our models outperform all open-source counterparts of comparable size on multiple French benchmarks while retaining their original English capabilities. We further show that strategic model merging enhances performance in both languages, establishing Luth as a new state of the art for French SLMs and a robust baseline for future French-language research.
Developing a machine translation (MT) system requires a considerable amount of high-quality parallel data, which is often limited for low-resource languages. This paper explores the use of synthetic data for training an LLM-based MT system in the English-to-Basque direction. Using Basque monolingual corpora as a starting point, we apply back-translation to generate parallel corpora, taking advantage of the fact that current LLMs do not translate well from English to Basque, but they yield an acceptable performance in the reverse direction. We conduct experiments in a multi-stage approach, from a simple Supervised Fine-tuning (SFT) step, to preference learning with the Direct Preference Optimization (DPO) technique. We then evaluate the approach with both automatic metrics and manual assessment. Experimental results suggest that for this task, SFT brings a clear improvement in translation quality, while DPO only yields marginal enhancement.
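For readers unfamiliar with the preference-learning stage mentioned above, the standard DPO objective can be sketched as follows; this is the textbook loss, not the authors' training code, and the sequence log-probabilities are hypothetical.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta: float = 0.1):
    """Textbook DPO objective: increase the margin by which the policy
    prefers the chosen translation over the rejected one, measured
    relative to a frozen reference model."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()

# Hypothetical sequence log-probabilities for one preference pair
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.8]),
                torch.tensor([-13.0]), torch.tensor([-15.1]))
```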
Large language models (LLMs) require careful alignment to balance competing objectives: factuality, safety, conciseness, proactivity, and diversity. Existing studies focus on individual techniques or specific dimensions, lacking a holistic assessment of the inherent trade-offs. We propose a unified evaluation framework that compares LLM alignment methods (PPO, DPO, ORPO, KTO) across these five axes, using both in-distribution and out-of-distribution datasets. Leveraging a specialized LLM-as-Judge prompt, validated through human studies, we reveal that DPO and KTO excel in factual accuracy, PPO and DPO lead in safety, and PPO best balances conciseness with proactivity. Our findings provide insights into trade-offs of common alignment methods, guiding the development of more balanced and reliable LLMs.
Existing linguistic knowledge bases such as URIEL+ provide valuable geographic, genetic and typological distances for cross-lingual transfer but suffer from two key limitations. First, their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data. Second, they lack a principled method for aggregating these signals into a single, comprehensive score. In this paper, we address these gaps by introducing a framework for type-matched language distances. We propose novel, structure-aware representations for each distance type: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent-variable model for typology. We unify these signals into a robust, task-agnostic composite distance. Across multiple zero-shot transfer benchmarks, we demonstrate that our representations significantly improve transfer performance when the distance type is relevant to the task, while our composite distance yields gains in most tasks.
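As context for the genealogy representation, distances in the Poincaré ball (a common choice for hyperbolic embeddings) take a closed form. A minimal sketch, not tied to the paper's exact embedding model:

```python
import numpy as np

def poincare_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Distance in the Poincare ball model of hyperbolic space;
    u and v must have Euclidean norm strictly below 1."""
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq / denom))

# Hypothetical 2-D embeddings of two related languages
print(poincare_distance(np.array([0.1, 0.2]), np.array([0.15, 0.25])))
```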
Existing user simulation approaches focus on generating user-like responses in dialogue. They often assume that the provided persona is sufficient for producing such responses, without verifying whether critical personas are supplied. This raises concerns about the validity of simulation results. To address this issue, we study the task of identifying persona dimensions (e.g., “whether the user is price-sensitive”) that are relevant but missing in simulating a user’s reply for a given dialogue context. We introduce PICQ-drama (constructed from TVShowGuess), a benchmark of context-aware choice questions, annotated with missing persona dimensions whose absence leads to ambiguous user choices. We further design diverse evaluation criteria for missing persona identification. Benchmarking leading LLMs on our PICQ-drama dataset demonstrates the feasibility of this task. Evaluation across diverse criteria, along with further analyses, reveals cognitive differences between LLMs and humans and highlights the distinct roles of different persona categories in shaping responses.
1960s Tamil cinema’s musical heritage lacks adequate metadata identifying playback singers in archival recordings. We present a quality-aware adversarial ensemble approach addressing two critical challenges: (1) variable audio degradation requiring adaptive model selection, and (2) instrumentation leakage confounding singer-specific features. We curate 348 annotated clips (12 hours) spanning 48 singers from 179 films. Our methodology introduces: a reliability estimation network dynamically gating five complementary pre-trained speaker models (Wav2Vec2, ECAPA-TDNN, WeSpeaker, CAM++, ERes2NetV2) based on degradation characteristics; adversarial training disentangling singer identity from accompaniment style; and uncertainty-calibrated predictions for human-in-the-loop workflows. On a held-out test set of 52 clips, we achieve 96.2% accuracy (95% CI: [87.5%, 99.2%]) and 2.0% EER (95% CI: [1.2%, 3.1%]), representing 7.7% absolute improvement over the best single model and 2.0% over static ensemble fusion. Ablations show quality-aware gating contributes 2.0% and adversarial disentanglement 2.0% beyond standard ensembles. We publicly release the dataset and code with fixed splits.
Retrieval-Augmented Generation (RAG) systems face efficiency bottlenecks in prefill due to the attention mechanism, and the traditional KV cache only accelerates decoding. In this context, reusing document-level KV cache computed for retrieved documents in previous sessions during the prefill stage appears to be a natural way to amortize computation, but it raises serious correctness challenges due to position and context misalignment across queries and sessions. This research proposes a multi-document KV cache reuse framework for multi-document RAG workloads across queries and sessions that resolves position misalignment and context misalignment, preserving accuracy while eliminating document-specific quadratic complexity in prefill. Theoretical analysis will establish conditions under which multi-document KV cache reuse remains stable and close to full recomputation, providing principled guarantees for both efficiency and accuracy. These results will enable deployment in existing RAG pipelines without architectural changes or model retraining. Crucially, to ensure robustness in real-world deployments, validation will extend beyond standard benchmarks to include noise-robustness tests and domain-specific workloads (e.g., legal). The research aims to empirically confirm these guarantees and demonstrate that substantial prefill speedups can be achieved without materially degrading task-level performance.
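One concrete form the position-misalignment fix could take is re-rotating RoPE-encoded keys by a document's new start offset. The sketch below illustrates that idea under an interleaved RoPE layout; it is an assumption about the mechanism, not the proposed framework's implementation.

```python
import torch

def rope_shift(k: torch.Tensor, offset: int, base: float = 10000.0) -> torch.Tensor:
    """Re-rotate RoPE-encoded keys by a position offset so that a
    document's KV cache, computed at one start position, can be reused
    at another. Assumes the interleaved (even/odd) RoPE layout."""
    d = k.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=k.dtype) / d)
    angle = offset * inv_freq
    cos, sin = angle.cos(), angle.sin()
    k1, k2 = k[..., 0::2], k[..., 1::2]
    out = torch.empty_like(k)
    out[..., 0::2] = k1 * cos - k2 * sin
    out[..., 1::2] = k1 * sin + k2 * cos
    return out

# e.g., move a cached document's keys from start position 0 to 512:
# k_realigned = rope_shift(k_cached, offset=512)
```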
This research proposal describes a cross-disciplinary project aimed at developing Digital Twins (DTs) of Alzheimer’s Disease (AD) using Language Models (LMs). By mimicking the functional deficits observed in individuals with AD, these DTs will serve as tools for early detection and understanding of disease progression. Several approaches to altering the LM will be explored, and the resulting effects on brain score — an evaluation of the correlation between brain activity and the LM’s internal activations — will be studied. Detection models will be trained based on each approach; these models will be compared with one another and with the state of the art. Two converging lines of evidence motivate this work: LMs achieve high accuracy in classifying AD from speech transcripts, and their internal representations correlate significantly with human brain activity during language processing. If successful, this project could lead to significant advancements in the early detection and monitoring of AD, ultimately improving patient outcomes.
Large Language Models (LLMs) excel across diverse NLP tasks but remain too large for efficient on-device deployment. Although knowledge distillation offers a promising compression strategy, direct one-step distillation from a large teacher to a small student often leads to substantial performance loss due to the capacity gap. In this work, we revisit multi-step knowledge distillation (MSKD) as an effective remedy, exploring how staged, size-aware transfer paths can better preserve teacher knowledge across students of varying scales. Through extensive experiments with GPT-2 and OPT, we demonstrate that MSKD consistently improves ROUGE-L and perplexity over single-step approaches without requiring specialized fine-tuning. Our results establish multi-step transfer as a simple yet powerful framework for progressively compressing LLMs into efficient, high-performing Small Language Models (SLMs).
This thesis argues that currently widely used Natural Language Processing algorithms may have various limitations related to the properties of the texts they handle and produce. With the wide adoption of these tools in rapid progress, we must ask what these limitations are and what are the possible implications of integrating such tools into our daily lives. As a testbed, we have chosen the task of Neural Machine Translation (NMT). Nevertheless, we aim for general insights and outcomes, applicable to current Large Language Models (LLMs). We ask whether the algorithms used in NMT have inherent inductive biases that are beneficial for most types of inputs but might harm the processing of untypical texts, thereby contributing to a cycle of monotonous, repetitive language – whether generated by machines or humans. To explore this hypothesis, we define a set of measures to quantify text diversity based on its statistical properties, like uniformity or rhythmicity of word-level surprisal, on multiple scales (sentence, discourse, language). We conduct a series of experiments to investigate whether NMT systems struggle with maintaining the diversity of such texts, potentially reducing the richness of the generated language, compared to human translators. We further analyze potential origins of these limitations within existing training objectives and decoding strategies. Ultimately, our goal is to propose and validate alternative approaches (e.g., loss functions, decoding algorithms) that maintain the diversity and complexity of language and that allow for better global planning of the output generation, enabling the models to better reflect the ambiguities inherent in human communication.
Large language models (LLMs) can generate fluent text, but the quality of generated content crucially depends on its consistency with the given input. This aspect is commonly referred to as faithfulness, which concerns whether the output is properly grounded in the input context. A major challenge related to faithfulness is that generated content may include information not supported by the input or may contradict it. This phenomenon is often referred to as hallucination, and increasing attention has been paid to automatic hallucination detection, which determines whether an LLM’s output is hallucinated. To evaluate the performance of hallucination detection systems, researchers use evaluation datasets with labels indicating the presence or absence of hallucinations. While such datasets have been developed for English and Chinese, Japanese evaluation resources for hallucination detection remain limited. Therefore, we constructed a Japanese evaluation dataset for hallucination detection in summarization by manually annotating sentence-level faithfulness labels in LLM-generated summaries of Japanese documents. We annotate 390 summaries (1,938 sentences) generated by three LLMs with sentence-level multi-label annotations for faithfulness with respect to the input document. The taxonomy extends a prior classification scheme and captures distinct patterns of model errors, enabling both binary hallucination detection and fine-grained error-type analysis of Japanese LLM summarization.
This work evaluates the non-English and unstructured text compression performance of Large Language Models (LLMs) by comparing them with traditional baselines on datasets from the eight most widely spoken languages. Experimental results show that the evaluated LLM (LLaMA-3.2-1B) was considerably outperformed by the baselines, particularly on non-English datasets, where its performance relative to the best baseline was on average more than three times worse than on English datasets. It also compressed unstructured English data up to more than two times less effectively than plain English data. Traditional methods, however, remained largely dataset-agnostic. Surprisingly, the LLM achieved worse compression ratios on some datasets than others despite modeling them more accurately. Overall, the outcomes, together with substantially higher compression time and resource consumption, indicate that current LLMs are highly impractical for the compression task, where traditional methods continue to excel. Code is available at: https://github.com/mehranhaddadi13/llm_compress.
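As background, LLM-based compression is typically scored by the code length an entropy coder would achieve under the model's token probabilities. A minimal sketch of that accounting (hypothetical log-probabilities, not the repository's code):

```python
import math

def ideal_code_length_bits(token_logprobs) -> float:
    """Shannon code length implied by a language model: an entropy coder
    driven by the model approaches -sum(log2 p) bits for the sequence."""
    return -sum(lp / math.log(2.0) for lp in token_logprobs)

def compression_ratio(raw_bytes: int, token_logprobs) -> float:
    """Compressed size over original size (lower is better)."""
    return ideal_code_length_bits(token_logprobs) / (8.0 * raw_bytes)

# Hypothetical: 20 bytes of text, per-token natural-log probabilities
print(compression_ratio(20, [-2.1, -0.4, -3.0, -1.2, -0.8]))
```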
Conversational question answering increasingly relies on retrieval-augmented generation (RAG) to ground large language models (LLMs) in external knowledge. Yet, most existing studies evaluate RAG methods in isolation and primarily focus on single-turn settings. This paper addresses the lack of a systematic comparison of RAG methods for multi-turn conversational QA, where dialogue history, coreference, and shifting user intent substantially complicate retrieval. We present a comprehensive empirical study of vanilla and advanced RAG methods across eight diverse conversational QA datasets spanning multiple domains. Using a unified experimental setup, we evaluate retrieval quality and answer generation using generator and retrieval metrics, and analyze how performance evolves across conversation turns. Our results show that robust yet straightforward methods, such as reranking, hybrid BM25, and HyDE, consistently outperform vanilla RAG. In contrast, several advanced techniques fail to yield gains and can even degrade performance below the No-RAG baseline. We further demonstrate that dataset characteristics and dialogue length strongly influence retrieval effectiveness, explaining why no single RAG strategy dominates across settings. Overall, our findings indicate that effective conversational RAG depends less on method complexity than on alignment between the retrieval strategy and the dataset structure. We publish the code used. [GitHub Repository]
Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open-embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We further propose the Lexical Content Score (LCS), a language-agnostic metric that quantifies the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions. Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and high-resource languages demonstrate that legal-domain fine-tuning consistently improves Top-k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show that these improvements transfer to unseen languages, indicating that fine-tuning primarily enhances language-independent, content-level legal representations rather than language-specific cues. We publish code [GitHub Repository] and data [Hugging Face Dataset].
We revisit MWE-aware linguistic tokenization as a character-level and token-level sequence labeling problem and present a systematic evaluation on English, German, Italian, and Dutch data. We compare a standard tokenizer trained without MWE-awareness as a baseline (UDPipe), a character-level SRN+CRF model (Elephant), and transformer-based MaChAmp models trained either directly on gold character labels or as token-level postprocessors on top of UDPipe. Our results show that the two-stage pipeline – UDPipe pretokenization followed by MaChAmp postprocessing – consistently yields the best accuracy. Our analysis of error patterns highlights how different architectures trade off over- and undersegmentation. These findings provide practical guidance for building MWE-aware tokenizers and suggest that postprocessing pipelines with transformers are a strong and general strategy for non-standard tokenization.
LLM-based assistants have been widely popularised after the release of ChatGPT. Concerns have been raised about their misuse in academia, given the difficulty of distinguishing between human-written and generated text. To combat this, automated techniques have been developed and shown to be effective, to some extent. However, prior work suggests that these methods often falsely flag essays from non-native speakers as generated, due to their low perplexity extracted from an LLM, which is supposedly a key feature of the detectors. We revisit these statements two years later, specifically in the Czech language setting. We show that the perplexity of texts from non-native speakers of Czech is not lower than that of native speakers. We further examine detectors from three separate families and find no systematic bias against non-native speakers. Finally, we demonstrate that contemporary detectors operate effectively without relying on perplexity.
Recent advancements in Large Language Models (LLMs) have notably enhanced task-oriented dialogue systems, particularly in Dialogue State Tracking (DST), owing to their generative capabilities and strong generalization. Although recent approaches such as LDST and FnCTOD significantly improved cross-domain DST performance via supervised fine-tuning (SFT), these methods typically require substantial amounts of domain-specific data. In this paper, we address this limitation by employing Group Relative Policy Optimization (GRPO) - a critic-free reinforcement learning method that efficiently guides LLMs toward improved DST accuracy even under low-resource conditions. Our results on established DST benchmarks, including MultiWOZ 2.1 and 2.4, demonstrate that the RL approach achieves superior performance to existing methods while using significantly less out-of-domain training data. In addition, we find that models pretrained specifically for tool-use tasks can be a better starting point, especially at small scales.
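The critic-free ingredient of GRPO is its group-normalized advantage. A minimal sketch of that computation, independent of the authors' training setup:

```python
import numpy as np

def grpo_advantages(rewards) -> np.ndarray:
    """GRPO's critic-free advantage: normalize each sampled completion's
    reward within the group of rollouts drawn for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# e.g., per-rollout DST accuracy rewards for one dialogue prompt
print(grpo_advantages([0.0, 0.5, 1.0, 0.5]))
```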
We study model routing for Large Language Model (LLM)-based systems. A model, called the router, dynamically chooses which LLM should handle a given input/query. We challenge the assumption that complex routers are necessary for generalising to new candidate LLMs. We introduce ContextualRouter, a simple meta-evaluation framework that predicts per-model performance for new queries by retrieving similar past queries and reweighting model scores with lightweight attention. During inference, the router balances estimated performance and cost by adjusting a tunable cost penalty parameter. This allows the router to adapt dynamically to the addition or removal of LLMs without the need for retraining. Across five routing benchmarks (SPROUT, RouterBench, LiveBench, BigGenBench, and EmbedLLM), ContextualRouter matches the quality–cost trade-offs of other generalisable routers. Surprisingly, a simpler non-parametric baseline, k-nearest-neighbour averaging, performs comparably or better, achieving strong performance estimation, high NDCG, and substantial cost savings. Retrieval-based routers remain robust to k, embedding size, data sparsity, retrieval degradation, and generalise to unseen queries and models with as little as 1% historical data. These results suggest that effective retrieval alone enables generalisable LLM routing.
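The k-nearest-neighbour baseline the abstract highlights is simple enough to state in a few lines. A sketch under assumed data shapes (historical query embeddings, per-model scores, and per-model costs); the names and the cost penalty lam are illustrative:

```python
import numpy as np

def knn_route(query_emb, hist_embs, hist_scores, costs, k=16, lam=0.5):
    """k-NN routing baseline: estimate each candidate model's quality on
    the new query by averaging its scores on the k most similar past
    queries, then pick the model maximizing quality minus a cost penalty."""
    sims = hist_embs @ query_emb / (
        np.linalg.norm(hist_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8)
    nn = np.argsort(-sims)[:k]
    est_quality = hist_scores[nn].mean(axis=0)      # shape: (n_models,)
    return int(np.argmax(est_quality - lam * costs))

# Assumed shapes: hist_embs (N, d), hist_scores (N, n_models),
# costs (n_models,); lam is the tunable cost penalty.
```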
Detecting personally identifiable information (PII) in user queries is critical for ensuring privacy in question-answering systems. Current approaches mainly redact all PII, disregarding the fact that some of them may be contextually relevant to the user’s question, resulting in a degradation of response quality. Large language models (LLMs) might be able to help determine which PII are relevant, but due to their closed source nature and lack of privacy guarantees, they are unsuitable for sensitive data processing. To achieve privacy-preserving PII detection, we propose CAPID, a practical approach that fine-tunes a locally owned small language model (SLM) that filters sensitive information before it is passed to LLMs for QA. However, existing datasets do not capture the context-dependent relevance of PII needed to train such a model effectively. To fill this gap, we propose a synthetic data generation pipeline that leverages LLMs to produce a diverse, domain-rich dataset spanning multiple PII types and relevance levels. Using this dataset, we fine-tune an SLM to detect PII spans, classify their types, and estimate contextual relevance. Our experiments show that relevance-aware PII detection with a fine-tuned SLM substantially outperforms existing baselines in span, relevance and type accuracy while preserving significantly higher downstream utility under anonymization.
While the semantic space has been examined as a way to computationally represent the language meaning–grammar interface, minimal research has been done comparing the semantic spaces of first and second language learners. We investigated the semantic space of university-level students learning French by extracting semantic features from narrative text at various time points over a 21-month period. After using machine learning models to distinguish native speakers’ semantic features from second language learners’, we used interpretability techniques to identify the most informative features per model. Through this, we discovered a variety of embedding similarity features to be decisive in language learning. We compared both groups to determine how the features differed per group and whether there was any change over time. The findings demonstrated that the second language learners on average had higher semantic similarity scores than the native speakers at the token level. The similarity decreased over time but did not reach native-level values. Similarly, average surprisal was higher in the second language learner group, and it steadily decreased over the course of the data collection period. These results provide insight into personalized education with more precise and effective computational indices tracking learners’ progress.
This paper introduces Kahaani, a multimodal, co-creative storytelling system for children that leverages Generative Artificial Intelligence to address the challenge of sustaining engagement in educational narrative experiences. Here we define co-creative as a collaborative creative process in which both the child and Kahaani contribute to the generation of the story. The system combines Large Language Model (LLM), Text-to-Speech (TTS), Text-to-Music (TTM), and Text-to-Video (TTV) generation to produce a rich, immersive, and accessible storytelling experience. The system grounds the co-creation process in two classical storytelling frameworks, Freytag’s Pyramid and Propp’s Narrative Functions. The main goals of Kahaani are: (1) to help children improve their English skills, (2) to teach important life lessons through story morals, and (3) to help them understand how stories are structured, all in a fun and engaging way. We present evaluations for each AI component used, along with a user study involving three parent–child pairs to assess the overall experience and educational value of the system.
Language of study is an aspect of computational linguistics papers that is useful for analyses of trends and diversity in computational linguistics. This study introduces the first benchmark and evaluation of automated language-of-study extraction from computational linguistics publications. We annotated a benchmark of 431 publications from the ACL Anthology, covering 62 languages of study. SciBERT and four large language models (LLMs), GPT-4o mini, Gemini 2.5 Flash, Claude 3.5 Haiku, and DeepSeek 3.2, were evaluated on the benchmark using different parts of the ACL Anthology papers. GPT-4o mini achieved the best exact match and Jaccard agreement scores of 0.646 and 0.687, respectively, slightly below the agreement in human annotation. Gemini 2.5 Flash achieved the best micro F1 of 0.633. Models using only the abstract for extraction were competitive with models using the full text, showing that accurate language-of-study extraction is achievable without high computational costs. These findings demonstrate that LLMs are able to accurately identify the languages of study in computational linguistics papers, potentially reducing the time and cost of analyses in computational linguistics.
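For reference, the exact match and Jaccard agreement scores reported above are standard set-level metrics over predicted and gold language sets; a minimal sketch:

```python
def exact_match(pred: set, gold: set) -> float:
    """1 if the predicted language set equals the gold set, else 0."""
    return 1.0 if pred == gold else 0.0

def jaccard(pred: set, gold: set) -> float:
    """Intersection over union of predicted and gold language sets."""
    union = pred | gold
    return len(pred & gold) / len(union) if union else 1.0

# e.g., pred = {"English", "Basque"}, gold = {"Basque"} -> Jaccard 0.5
```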
Protagonists play a central role in moral discourse by structuring responsibility and authority, yet computational work has largely focused on moral values rather than the actors involved. We address this gap by studying phrase-level protagonist detection and classification in the Moralization Corpus (Becker et al., 2025), a dataset of moral arguments across different text genres. We decompose the task into identifying protagonist mentions and classifying them by what kind of actor they are (e.g., individual or institution) and what function they serve in the moral argument. We compare fine-tuned lightweight models, state-of-the-art NER models, and prompting-based large language models. We further establish human baselines and analyze the impact of contextual information on human and model decisions. Our results show that fine-tuned NER models achieve competitive detection performance at substantially lower cost than prompted large language models, and that role classification benefits more strongly from contextualized prompting. Across tasks, top-performing models reach or exceed human-level performance, highlighting the value of task decomposition for modeling protagonists in moral discourse. We release our code, predictions, and supplementary material in our project repository.
Music is a universal cultural practice that influences emotion, ritual and creativity, and it is now represented in many digital modalities: audio recordings, symbolic encodings (MIDI, MusicXML, ABC), visual scores and lyrics. Multimodal Large Language Models (MLLMs) have the ambition to process "everything", including music, and therefore promise to support musical analysis, creation and education. Despite this promise, systematic methods for evaluating whether a MLLM understands music are missing. Existing music-focused benchmarks are fragmented, largely single-modality, Western-centric, and often do not require actual perception of the musical content; methodological details such as prompt design and answer-extraction are frequently omitted or not discussed, and some evaluations rely on proprietary LLMs, hindering reproducibility and raising concerns about test-data leakage. To fill this gap, this dissertation proposes to design a musically multimodal benchmark built on a transparent, fully open evaluation pipeline. The benchmark will present closed-question-answer items across four musical modalities, employ carefully engineered distractor options to enforce genuine perceptual engagement, and follow rigorously documented prompt-selection and answer-extraction procedures. It will further incorporate culturally diverse musical material beyond the dominant Western canon. The work is guided by three research questions: (1) how to devise robust, reproducible evaluation procedures; (2) how current MLLMs perform across modalities; and (3) how model scores relate to human musical abilities. The benchmark will enable precise diagnosis of model limitations, inform the development of more musically aware AI systems, and provide a principled basis for assessing practical usefulness to musicians and other stakeholders in the creative industry.
This study examines how credibility, trust, and bias interact within complex communication systems that shape public understanding of scientific information. It addresses two questions: 1. What are the primary factors that influence the public’s comprehension of scientific findings? 2. How do the factors influencing public understanding of climate change science interact within a complex system? A scoping literature review synthesized disparate communication models from media studies, science communication, psychology, and information science to identify a shared set of system variables. The identified variables were organized into source-, message-, channel-, and receiver-related factors and used to develop a causal loop diagram showing how credibility, trust, and information processing co-evolve through reinforcing and balancing feedback. The resulting diagram illustrates two major loops: one centered on trust in information sources, which can foster social cohesion or accelerate truth decay, and another linking individual trust dynamics to broader patterns of polarization and unity. By clarifying how well-established constructs interact to produce dynamic communication outcomes, the framework is useful for scholars developing integrative theory and for policymakers and practitioners designing interventions in misinformation-prone environments. The CLD also provides a foundation for future system dynamics modeling to examine how interventions in transparency, media literacy, or platform governance may influence public trust over time.
While Large Language Models demonstrate expert proficiency on medical benchmarks, the clinical encounter requires more than factual retrieval. It demands a sophisticated rhetorical performance of care that balances authority with epistemic humility. This paper investigates the Clinical Fingerprint by comparing the structural and ethical integrity of advice generated by human physicians and various language models. Our findings reveal a fundamental divergence in how clinical information is prioritized and delivered. We show that whereas physicians utilize efficient, action-oriented structures to provide clear guidance, generic models often bury critical advice under layers of complex linguistic recursion. This creates a significant cognitive load for patients and risks a dangerous safety cliff where models adopt an unearned authoritative tone. Such models frequently mimic the confidence of a doctor while providing contradictory advice, particularly in complex cases involving multiple symptoms. By identifying these rhetorical gaps, our work emphasizes that domain-specific fine-tuning is an ethical necessity to ensure that AI assistants maintain the necessary humility and logical cohesion required for safe medical practice.
Fine-tuning Transformer models is often dominated by the backward computation in linear layers. In many NLP tasks, input sequences are short and padded to a fixed context length, inducing structured sparsity in the output gradients. We propose Sparsity-Exploiting Backward Pass (SEBP), a heuristic method that reduces backward computation by exploiting this sparsity with negligible memory overhead. We show that, for short input sequences, the output gradients of BERT-based and LLaMA models exhibit pronounced sparsity, allowing for optimisation of the backward computation. We optimized the autograd function in the linear layers, significantly reducing the number of FLOPs during the backward pass. Our method achieves a backward pass speedup of approximately 2.15x for BERT-base on GLUE tasks and 1.99x for a 3B LLaMA model on reasoning benchmarks, while maintaining memory usage nearly identical to regular PyTorch fine-tuning. Crucially, this speedup comes at no cost to performance. We show that our method matches standard convergence rates, offering a memory-efficient way to accelerate LLM fine-tuning.
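A minimal sketch of the SEBP idea as we read it: when padded positions make whole rows of the output gradient zero, both backward matmuls of a linear layer can be restricted to the nonzero rows. This is an illustrative reconstruction, not the authors' optimized autograd function.

```python
import torch

def sparse_linear_backward(grad_out, inp, weight):
    """Restrict both backward matmuls of a linear layer to the rows of
    grad_out that are not identically zero (e.g., skip padded positions).
    Shapes: grad_out (tokens, out), inp (tokens, in), weight (out, in)."""
    rows = grad_out.abs().sum(dim=-1).nonzero(as_tuple=True)[0]
    go, x = grad_out[rows], inp[rows]
    grad_weight = go.t() @ x                  # (out, in); padded rows skipped
    grad_input = torch.zeros_like(inp)
    grad_input[rows] = go @ weight            # only active rows receive grads
    return grad_input, grad_weight
```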
Human cognition is deeply intertwined with a sense of time, known as Chronoception. This sense allows us to judge how long facts remain valid and when knowledge becomes outdated. Despite progress in vision, language, and motor control, AI still struggles to reason about temporal validity. We introduce Chronocept, the first benchmark to model temporal validity as a continuous probability distribution over time. Using skew-normal curves fitted along semantically decomposed temporal axes, Chronocept captures nuanced patterns of emergence, decay, and peak relevance. It includes two datasets: Benchmark I (atomic facts) and Benchmark II (multi-sentence passages). Annotations show strong inter-annotator agreement (84% and 89%). Our baselines predict curve parameters - location, scale, and skewness - enabling interpretable, generalizable learning and outperforming classification-based approaches. Chronocept fills a foundational gap in AI’s temporal reasoning, supporting applications in knowledge grounding, fact-checking, retrieval-augmented generation (RAG), and proactive agents. Code and data are publicly available.
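For concreteness, a skew-normal validity curve with the three parameters the baselines predict can be evaluated with SciPy; the parameter values below are hypothetical.

```python
import numpy as np
from scipy.stats import skewnorm

# Temporal validity as a skew-normal density over time, using the three
# curve parameters the baselines predict: location xi, scale omega, and
# skewness alpha. All values here are hypothetical.
xi, omega, alpha = 2020.0, 3.0, 4.0
t = np.linspace(2015, 2035, 401)
validity = skewnorm.pdf(t, alpha, loc=xi, scale=omega)
print(t[validity.argmax()])   # time of peak relevance for this fact
```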
Large Language Models (LLMs) are increasingly integrated into high-stakes applications, making robust safety guarantees a central practical and commercial concern. Existing safety evaluations predominantly rely on fixed collections of harmful prompts, implicitly assuming non-adaptive adversaries and thereby overlooking realistic attack scenarios in which inputs are iteratively refined to evade safeguards. In this work, we examine the vulnerability of contemporary language models to automated, adversarial prompt refinement. We repurpose black-box prompt optimization techniques, originally designed to improve performance on benign tasks, to systematically search for safety failures. Using DSPy, we apply three such optimizers to prompts drawn from HarmfulQA and JailbreakBench, explicitly optimizing toward a continuous danger score in the range 0 to 1 provided by an independent evaluator model (GPT-5.1). Our results demonstrate a substantial reduction in effective safety safeguards, with the effects being especially pronounced for open-source small language models. For example, the average danger score of Qwen 3 8B increases from 0.09 in its baseline setting to 0.79 after optimization. These findings suggest that static benchmarks may underestimate residual risk, indicating that automated, adaptive red-teaming is a necessary component of robust safety evaluation.
Radiology report generation involves translating visual signals from pixels into precise clinical language. Existing encoder-decoder models often suffer from hallucinations, generating plausible but incorrect medical findings. We propose GraphRAG-Rad, a novel architecture that integrates biomedical knowledge through a novel Latent Visual-Semantic Retrieval (VSR). Unlike traditional Retrieval-Augmented Generation (RAG) methods that rely on textual queries, our approach aligns visual embeddings with the latent space of the Knowledge Graph, PrimeKG. The retrieved sub-graph guides the Visual Encoder and the Multi-Hop Reasoning Module. The reasoning module simulates clinical deduction paths (Ground-Glass Opacity → Viral Pneumonia → COVID-19) before it combines the information with visual features in a Graph-Gated Cross-Modal Decoder. Experiments on the COV-CTR dataset demonstrate that GraphRAG-Rad achieves competitive performance with strong results across multiple metrics. Furthermore, ablation studies show that integrating latent retrieval and reasoning improves performance significantly compared to a visual-only baseline. Qualitative analysis further reveals interpretable attention maps. These maps explicitly link visual regions to symbolic medical concepts, effectively bridging the modality gap between vision and language.
State Space Models (SSMs) have recently emerged as efficient alternatives to Transformers for sequence modeling, yet extending them to two-dimensional vision tasks remains challenging. The Graph-Generating State Space Model (GG-SSM) addresses this challenge by constructing an adaptive graph, achieving competitive performance on vision benchmarks. However, state propagation over the resulting graph introduces substantial inference overhead, limiting scalability to high-resolution inputs. In this work, we introduce a leaf-guided computation pruning strategy that accelerates GG-SSM inference without modifying the underlying graph topology. Rather than removing nodes or edges, our approach selectively scales or bypasses secondary refinement computations associated with high-dissimilarity leaf nodes, while preserving the low-weight MST backbone. Experiments on multiple long-term time series forecasting benchmarks demonstrate consistent throughput improvements with controlled accuracy degradation across a range of pruning ratios. These results indicate that structure-aware computation pruning is an effective mechanism for improving the scalability of graph-based state space models.
Understanding sarcasm requires integrating cues from language, voice, and facial expression. Recent work has achieved impressive results using large multimodal Transformers, but such models are computationally expensive and often obscure how each modality contributes to the final prediction. This paper introduces a lightweight, interpretable framework for multimodal sarcasm detection that combines frozen text, audio, and visual embeddings from pretrained encoders through compact fusion heads. Using the MUStARD++Balanced dataset, we show that early fusion of textual and acoustic features improves over the best unimodal baseline. Character-specific evaluation further shows that sarcasm expressed through overt prosodic and visual cues is substantially easier to detect than monotone, context-dependent sarcasm. Additionally, we evaluate generalization to different characters through leave-one-speaker-out (LOSO) experiments and run ablation-style transfer experiments on two speakers with similar sarcasm distributions. These findings demonstrate that effective multimodal sarcasm understanding can emerge from frozen, resource-efficient representations without large-scale fine-tuning, emphasizing the importance of modality interaction and delivery style rather than model scale.
In-image machine translation is a sub-task of Image-Based Machine Translation that aims to substitute text embedded in images with its translation into another language. In the current work, we define a simple task with a synthetic dataset based on rendering parallel text over a plain background. Furthermore, we experiment with different optical character recognition, machine translation and image synthesis models to include in our ensemble. Then, we present our cascade approach as a pipeline that obtains the transcript of the original image, translates it, and generates a new image (image synthesis) similar to the original one. Finally, we compare the performance of our approach with several current state-of-the-art models, including an end-to-end approach, demonstrating its competitiveness.
Automatic story generation aims to produce coherent, engaging, and contextually consistent narratives with minimal or no human involvement, thereby advancing research in computational creativity and applications in human language technologies. The emergence of large language models has advanced the task, enabling systems to generate multi-thousand-word stories under diverse constraints. Despite these advances, maintaining narrative coherence, character consistency, storyline diversity, and plot controllability in generated stories remains challenging. In this survey, we conduct a systematic review of research published over the past four years to examine the major trends and key limitations in story generation methods, model architectures, datasets, and evaluation methodologies. Based on this analysis of 57 included papers, we propose developing new evaluation metrics, creating more suitable datasets, continuing to improve narrative coherence and consistency, and exploring practical applications as directions to support continued progress in automatic story generation.
This study proposes a method for learning subword correspondences in parallel sentence pairs using the EM algorithm. Conventional neural machine translation typically employs subword segmentation models trained independently on each language. However, since existing methods do not consider parallel relationships, inconsistencies in word segmentation between source and target languages may hinder translation model training. Our approach directly models subword correspondences in parallel corpora, thereby improving segmentation consistency across languages. Experiments across multiple machine translation tasks confirm that our proposed method improves translation accuracy on many tasks.
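The EM estimation described above is in the spirit of IBM Model 1 applied to subword pairs. A minimal Model-1-style sketch, offered as an illustration of EM over parallel segmentations rather than the paper's exact model:

```python
from collections import defaultdict

def em_subword_translation(pairs, iters: int = 10):
    """IBM-Model-1-style EM over parallel subword sequences: estimates
    t(f | e), the correspondence strength between source subword e and
    target subword f, starting from a uniform table."""
    t = defaultdict(lambda: 1.0)
    for _ in range(iters):
        count, total = defaultdict(float), defaultdict(float)
        for src, tgt in pairs:                    # lists of subwords
            for f in tgt:                         # E-step: soft alignments
                z = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        for (f, e), v in count.items():           # M-step: renormalize
            t[(f, e)] = v / total[e]
    return t

# Toy parallel segmentations
table = em_subword_translation([(["un", "related"], ["nicht", "verwandt"])])
```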
Indian languages represent a highly multilingual and low-resource speech ecosystem, where the scarcity of high-quality parallel speech corpora significantly limits the development of speech-to-speech translation systems. Most existing approaches rely on cascaded pipelines that combine automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS). While effective, these cascaded systems often suffer from cumulative error propagation, increased latency, and higher computational complexity, particularly for low-resource Indian languages. Recent advances in deep learning, however, indicate that direct speech translation architectures can surpass conventional cascaded systems in both efficiency and translation quality. Motivated by this, my doctoral work proposes a novel sequence-to-sequence direct speech translation framework capable of translating speech from one Indian language to another without relying on intermediate text representations. We aim to release an initial dataset comprising at least 120,000 real speech samples within a 6–12 month timeframe.
Lyrics translation must account for rhythm, rhyme, and singability in the translated lyrics. In this study, we focus on singability and investigate effective prompting methods for translating singable lyrics, including verification-guided and multi-round prompting, applied to large language models. First, we curate a multilingual lyrics translation dataset covering a total of six language directions across Chinese, Japanese, and English. Next, we evaluate seven prompting strategies, with instruction complexity increasing incrementally. The results show that multi-prompt strategies improve singability-related aspects, such as rhythmic alignment and phonological naturalness, compared to naive translation. Furthermore, human evaluations using songs created from translated lyrics suggest that moderately complex prompting strategies improve singable naturalness, while more complex strategies contribute to greater stability in perceived quality.
Recent advances in Sparse Autoencoders (SAEs) have revealed interpretable features within large language models (LLMs), including features that are specific to individual languages. In prior work, these features have been used to steer a model’s output language. However, the impact of SAE-based language steering on output quality and task performance, as well as its relationship to simpler prompting-based approaches, remains unclear. In this work, we study the effects of language steering using SAE features across multiple tasks and models. We apply language-specific SAE feature steering to three LLMs from two model families and evaluate it on a translation task and a multilingual question-answering task. We compare SAE-based steering against prompting and language neuron-based steering, and examine a combined prompting-and-steering approach. On the translation task, SAE feature steering achieves an average target-language accuracy of 92% across models and languages, consistently outperforming language neuron-based steering, but slightly underperforming prompting in language accuracy and output quality. In contrast, on the multilingual question-answering task, SAE-based steering enables stronger language control than prompting, and combining steering with prompting yields the best overall language control and task performance. These findings demonstrate the potential of SAE features as a tool for controllable multilingual generation.
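Mechanically, SAE-based language steering of the kind studied here is usually implemented by adding a scaled SAE decoder direction to the residual stream at one layer. A hedged sketch, with the decoder matrix `W_dec` and the hook placement assumed:

```python
import torch

def steer_with_sae_feature(hidden: torch.Tensor,
                           decoder_dir: torch.Tensor,
                           alpha: float = 8.0) -> torch.Tensor:
    """Add a scaled, normalized SAE decoder direction (one row of the
    assumed decoder matrix W_dec, e.g. a language-specific feature) to
    the residual-stream activations at a chosen layer."""
    return hidden + alpha * decoder_dir / decoder_dir.norm()

# Typically registered as a forward hook on one transformer block, e.g.:
# hidden = steer_with_sae_feature(hidden, W_dec[feature_id])
```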
Most vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. This restricts their usability for non-English-speaking users and hinders the development of multimodal systems that reflect diverse linguistic and cultural realities. In this work, we reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We rely on a fully automated pipeline for translating and filtering existing multimodal datasets, and complement this with synthetic Polish data for OCR and culturally specific tasks. Despite relying almost entirely on automatic translation and minimal manual intervention, our approach yields strong results: we observe a +9.5 pp improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations, as measured by human annotators in terms of linguistic correctness. These findings highlight that large-scale automated translation, combined with lightweight filtering, can effectively bootstrap high-quality multimodal models for low-resource languages. Some challenges remain, particularly in cultural coverage and evaluation. To facilitate further research, we release our models, code, and datasets.
Police interrogation transcripts are key evidential documents, yet their linguistic form is rarely systematically analyzed, despite directly shaping judicial interpretation. This study presents the first computational forensic linguistic profiling of Italian police transcripts, focusing on the two transcription formats used in practice: narrative monologues and question-answer (Q-A) transcripts. Using automated extraction of 147 linguistic features, we analyze 50 authentic transcripts against a multi-genre Italian reference corpus to support more transparent evaluation of police transcripts by clarifying how transcription formats systematically shape evidential interpretation in judicial contexts. Narrative monologues exhibit deeper syntactic embedding, higher past-tense usage, and more first-person singular verbs, supporting coherent and temporally ordered recounting of events. Q-A transcripts, by contrast, show longer subordinate chains, more clausal complements, and higher pronoun frequency, reflecting interactive turn-taking and procedural dynamics. Rather than aiming at predictive classification, the study reveals the linguistic mechanisms shaping transcription formats and demonstrates that structurally and legally informed features reliably distinguish them. Computational models reliably capture genre-specific cues, offering scalable, empirically grounded insights into transcription practices and evidential reliability.
This thesis investigates how polyvocal ontologies and Large Language Model (LLM) based Multi-Agent Systems (MAS) can operationalize perspective-aware knowledge extraction, preserving conflicting stakeholder interpretations as epistemically separable, queryable Knowledge Graphs (KGs). Current AI systems consolidate multiple perspectives into singular, decontextualized schemas, introducing representational bias and information loss. We propose a systematic framework addressing three interconnected research questions: (1) how to generate polyvocal ontology design patterns for high-stakes domains; (2) how to architect LLM-based MAS that extract perspective-conditioned facts while maintaining schema coherence and provenance traceability; and (3) whether such extractions achieve semantic diversity without sacrificing KG integrity. Evaluation is proposed on medical datasets, conducted with domain experts, to demonstrate the feasibility of perspective-aware extraction as a principled alternative to consensus-oriented KGs. Expected contributions include polyvocal ontology patterns, an ontology-orchestrated MAS extraction framework with auditable provenance, and empirical validation.
The spread of misinformation has prompted extensive research on machine-learning–based fake news detection. However, existing datasets differ substantially in content distributions and annotation policies, complicating fair evaluation and generalization assessment. We refer to these structural differences as dataset bias. In this study, we quantitatively analyze dataset bias across multiple public fake news datasets (Kaggle, FNN, ISOT, and NELA-GT-2019/2020) with different annotation granularities, including article-level and publisher-level labels. Using document embedding–based similarity analysis and article category distributions, we examine how such biases affect detection performance under in-dataset and cross-dataset evaluation settings. Furthermore, to leverage large-scale but coarse-grained publisher-level data, we compare proxy-label training with a semi-supervised learning approach based on Virtual Adversarial Training (VAT). Our results show that detection performance strongly depends on dataset-specific biases, and that proxy-label training and SSL exhibit complementary, and sometimes opposite, strengths depending on whether the evaluation emphasizes in-dataset performance or cross-dataset generalization. These findings highlight the importance of appropriate training strategies and evaluation protocols when using heterogeneous fake news datasets.
This paper introduces DRAGOn, a method for designing a RAG benchmark on a regularly updated corpus. It features recent reference datasets, a question generation framework, an automatic evaluation pipeline, and a public leaderboard. Specified reference datasets allow for uniform comparison of RAG systems, while newly generated dataset versions mitigate data leakage and ensure that all models are evaluated on unseen, comparable data. The pipeline for automatic question generation extracts a Knowledge Graph from the text corpus and produces multiple question-answer pairs utilizing modern LLM capabilities. A set of diverse LLM-as-Judge metrics is provided for comprehensive model evaluation. We used Russian news outlets to form the datasets and demonstrate our methodology. We launch a public leaderboard to track the development of RAG systems and encourage community participation.
Training a language model for low-resource languages is challenging due to data scarcity and computational cost. Tokenizer transfer offers a way to adapt a pre-trained model to a new tokenizer without full retraining, improving efficiency and cross-lingual applicability. To the best of our knowledge, we present the first controlled evaluation of two tokenizer transfer methods, Orthogonal Mapping Pursuit (OMP) and Fast Vocabulary Transfer (FVT), on monolingually pretrained base models trained on language-specific corpora, across six languages and multiple finetuning regimes. Using the Goldfish model family, we evaluate byte-normalized log-perplexity and MultiBlimp accuracy for target-language adaptability, source-language retention, and the interaction between transfer and monolingual or mixed finetuning. OMP with monolingual target finetuning yields the best target-language scores (lower log-perplexity and higher MultiBlimp) among our evaluated conditions, compared with (i) a model trained only on the source language, (ii) a model trained on a smaller amount of target-language data, and (iii) the source language model adapted via standard finetuning on the target data. The results suggest that tokenizer transfer is a compute-efficient alternative for low-resource LM training: train a monolingual tokenizer for the target language, transfer it to a larger pre-trained model, and fine-tune using the target data.
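As background on the simpler of the two methods, FVT is commonly described as initializing each new token's embedding from the old embeddings of its decomposition under the source tokenizer. A sketch under that description; the paper's exact setup may differ.

```python
import torch

def fvt_init(new_vocab, old_tokenizer, old_embeddings: torch.Tensor):
    """Initialize embeddings for a new vocabulary: each new token is
    embedded as the mean of the old embeddings of its decomposition
    under the source tokenizer (falling back to the global mean)."""
    new_emb = torch.empty(len(new_vocab), old_embeddings.shape[1])
    for i, tok in enumerate(new_vocab):
        ids = old_tokenizer.encode(tok, add_special_tokens=False)
        new_emb[i] = (old_embeddings[ids].mean(dim=0)
                      if ids else old_embeddings.mean(dim=0))
    return new_emb
```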
Nested named entity recognition identifies entities contained within other entities, but requires expensive multi-level annotation. While flat NER corpora exist abundantly, nested resources remain scarce. We investigate whether models can learn nested structure from flat annotations alone, evaluating four approaches: string inclusions (substring matching), entity corruption (pseudo-nested data), flat neutralization (reducing false negative signal), and a hybrid fine-tuned + LLM pipeline. On NEREL, a Russian benchmark with 29 entity types where 21% of entities are nested, our best combined method achieves 26.37% inner F1, closing 40% of the gap to full nested supervision. Code is available at https://github.com/fulstock/Learning-from-Flat-Annotations.
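To make the string-inclusion idea concrete: inner-entity candidates can be harvested from flat annotations alone by checking which annotated surface forms occur inside other annotated spans. A simplified sketch (hypothetical data layout, not the authors' code):

```python
def inner_candidates(flat_entities):
    """flat_entities: list of (surface_form, label) pairs from flat NER.
    Returns (outer, inner) pairs where one annotated form is a proper
    substring of another, e.g. an annotated "Moscow" (LOC) inside an
    annotated "Moscow State University" (ORG)."""
    pairs = []
    for outer in flat_entities:
        for inner in flat_entities:
            if inner[0] != outer[0] and inner[0] in outer[0]:
                pairs.append((outer, inner))
    return pairs
```

Such pairs supply pseudo-nested training signal without any multi-level annotation.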
Large language models (LLMs) are increasingly utilised for social simulation and persona generation, necessitating an understanding of how they represent geopolitical identities. In this paper, we analyse personas generated for Palestinian and Israeli identities by five popular LLMs across 640 experimental conditions, varying context (war vs non-war) and assigned roles. We observe significant distributional patterns in the generated attributes: Palestinian profiles in war contexts are frequently associated with lower socioeconomic status and survival-oriented roles, whereas Israeli profiles predominantly retain middle-class status and specialised professional attributes. When prompted with explicit instructions to avoid harmful assumptions, models exhibit diverse distributional changes, e.g., marked increases in non-binary gender inferences or a convergence toward generic occupational roles (e.g., "student"), while the underlying socioeconomic distinctions often remain. Furthermore, analysis of reasoning traces reveals an interesting dynamic between model reasoning and generation: while rationales consistently mention fairness-related concepts, the final generated personas follow the aforementioned diverse distributional changes. These findings paint a picture of how models interpret geopolitical contexts, while suggesting that they process fairness and adjust in varied ways; there is no consistent, direct translation of fairness concepts into representative outcomes.
The rapid adoption of Small Language Models (SLMs) for resource-constrained applications has outpaced our understanding of their ethical and fairness implications. To address this gap, we introduce the Vacuous Neutrality Framework (VaNeu), a multi-dimensional evaluation paradigm designed to assess SLM fairness prior to deployment. The framework examines model robustness across four stages (biases, utility, ambiguity handling, and positional bias) over diverse social bias categories. To the best of our knowledge, this work presents the first large-scale audit of SLMs in the 0.5–5B parameter range, an overlooked “middle tier” between BERT-class encoders and flagship LLMs. We evaluate nine widely used SLMs spanning four model families under both ambiguous and disambiguated contexts. Our findings show that models demonstrating low bias in early stages often fail subsequent evaluations, revealing hidden vulnerabilities and unreliable reasoning. These results underscore the need for a more comprehensive understanding of fairness and reliability in SLMs, and position the proposed framework as a principled tool for responsible deployment in socially sensitive settings. The code is available at: https://github.com/smanduru10/Vacuous-Neutrality-Framework.git.
This paper investigates the capabilities of LLMs to detect and explain fine-grained emotional social influence techniques in textual dialogues, as well as human preferences for technique explanations. We present findings from our two studies. In Study 1, a dataset of 238 Polish dialogues is introduced, each annotated with detailed span-level labels. On this data, we evaluate the performance of LLMs on two tasks: detecting 11 emotional social influence techniques and identifying text spans corresponding to specific techniques. The results indicate that current LLMs demonstrate limited effectiveness in accurately detecting fine-grained emotional social influence. In Study 2, we examine various LLM-generated explanations through human pairwise preferences and four criteria: comprehensibility, cognitive coherence, completeness, and soundness, with the latter two emerging as the most influential on general human preference. All data, including human annotations, are publicly available as the EmoSocInflu dataset (https://github.com/social-influence/emo-soc-influ). Our findings highlight a critical need for further advancement in the field. As LLM-supported manipulation grows, it is essential to promote public understanding of social influence mechanisms, enabling individuals to critically recognize and interpret the subtle forms of manipulation that shape public opinion.
Sign language lexicographers construct bilingual dictionaries by establishing word-to-sign mappings, where polysemous and homonymous words corresponding to different signs across contexts are often underrepresented. A usage-based approach examining how word senses map to signs can identify such novel mappings absent from current dictionaries, enriching lexicographic resources. We address this by analyzing German and German Sign Language (Deutsche Gebärdensprache, DGS), manually annotating 1,404 word use–to–sign ID mappings derived from 32 words from the German Word Usage Graph (D-WUG) and 49 signs from the Digital Dictionary of German Sign Language (DW-DGS). We identify three correspondence types: Type 1 (one-to-many), Type 2 (many-to-one), and Type 3 (one-to-one), plus No Match cases. We evaluate two computational methods: Exact Match (EM) and Semantic Similarity (SS) using SBERT embeddings. SS substantially outperforms EM overall (88.52% vs. 71.31%), with dramatic gains for Type 1 (+52.1 pp). Our work establishes the first annotated dataset for cross-modal sense correspondence and reveals which correspondence patterns are computationally tractable. Our code and dataset are made publicly available.
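The SS step can be reproduced in a few lines with off-the-shelf sentence embeddings: embed the word-use context and each candidate sign's gloss, then pick the most similar gloss. A sketch assuming each sign ID comes with a short German gloss or example sentence (the model name is illustrative, not necessarily the one used in the paper):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def best_sign(word_use: str, sign_glosses: dict[str, str]) -> str:
    """Map one word use to the sign ID with the most similar gloss."""
    ids, glosses = zip(*sign_glosses.items())
    use_emb = model.encode(word_use, convert_to_tensor=True)
    gloss_emb = model.encode(list(glosses), convert_to_tensor=True)
    return ids[int(util.cos_sim(use_emb, gloss_emb)[0].argmax())]
```

Exact Match, by contrast, succeeds only when the word-use keyword literally appears in the sign entry, which is consistent with its weaker performance on one-to-many (Type 1) cases.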
Retrieval-augmented generation has become the dominant paradigm for deploying large language models in knowledge-intensive applications, yet practitioners lack guidance on model selection when both quality and costs matter. We evaluate language models from 4B to 70B parameters, including the PLLuM and Bielik families of Polish LLMs, within a Polish Wikipedia-based RAG pipeline. Quality assessment uses GPT-4o pairwise comparison across 1,000 PolQA questions with bias mitigation and Bradley-Terry ranking, while energy measurements capture inference costs on NVIDIA H100 hardware. Our findings challenge conventional scaling assumptions: parameter scaling beyond 12B offers minimal quality gains, with the mid-size PLLuM-12 matching 70B performance while reducing energy consumption by 83%.
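Bradley-Terry ranking turns noisy pairwise judge verdicts into a global quality ordering by fitting a latent strength per model. A small illustration using the standard minorization-maximization update (not the paper's code; assumes every model wins at least one comparison):

```python
import numpy as np

def bradley_terry(wins, n_iter=200):
    """wins[i, j] = number of pairwise comparisons model i won against j."""
    n = wins.shape[0]
    p = np.ones(n)                          # latent strengths
    games = wins + wins.T                   # comparisons per pair
    for _ in range(n_iter):
        denom = (games / (p[:, None] + p[None, :])).sum(axis=1)
        p = wins.sum(axis=1) / denom        # MM update (Hunter, 2004)
        p /= p.sum()                        # fix the arbitrary scale
    return p                                # higher = better-ranked model
```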
This thesis proposal addresses methodological gaps in applying NLP to social science by shifting from categorical classification to comparative scaling of grounded constructs. We first extend predictive capacity on existing specialized political datasets with prompt optimization and distillation approaches. We then develop an active learning framework for efficient comparative annotation to scale latent dimensions from large corpora. Finally, we apply this pipeline to measure benevolent sexism in Slovenian media and migration threat perception in parliamentary discourse. This work establishes a scalable workflow for moving NLP from ad-hoc classification to theoretically grounded comparative measurement.
Policy-gradient reinforcement learning (RL) is widely used to improve language model reasoning, but existing methods are not compatible with diffusion language models (dLLMs), primarily because likelihood estimation is difficult with such models. We propose EMBR, a scalable off-policy framework that reformulates KL-regularized RL as an energy-based distribution matching problem. By aligning policy updates with reward signals through energy matching, EMBR avoids the overhead of on-policy learning and the variance of importance weighting. We further derive a principled upper bound on the energy matching objective that can be used to fine-tune dLLMs. Experiments on multiple benchmarks in both online and offline settings show that EMBR matches or surpasses the performance of diffu-GRPO and related baselines in the online case, and of DPO in the offline case. Our approach provides a practical alternative for post-training of diffusion LMs.
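For background on the energy-based view (shown for context only; EMBR's specific objective and its upper bound are derived in the paper): the KL-regularized RL problem of maximizing expected reward minus a KL penalty against a reference policy has a well-known closed-form optimum,

```latex
\pi^{*}(y \mid x)
  = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\bigl(r(x, y)/\beta\bigr),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\bigl(r(x, y)/\beta\bigr),
```

a Boltzmann tilt of the reference policy. Matching the policy to this target distribution, rather than differentiating through on-policy likelihoods, is what makes such reformulations attractive for diffusion LMs, where per-sequence likelihoods are intractable.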
Clinical Natural Language Processing (NLP) integrates large language models (LLMs) to extract biomedical insights from unstructured clinical text. Most named entity recognition (NER) and relation extraction (RE) datasets rely on manual annotation, which is costly and difficult to scale. Many biomedical knowledge graphs (KGs) suffer from underspecified relations, conflate causal and correlational claims, and lack edge-level evidence for reasoning. This dissertation presents a semantic stability framework for constructing explainable KGs, highlighting stable extraction as fundamental for scalable NER and RE, and essential for graph structure. We apply the framework to Substance Use Disorders (SUD) and Social Determinants of Health (SDOH) in a PubMed corpus, supported by an NER and RE annotation guide. Multiple LLMs perform extraction under shared semantic constraints, with disagreements resolved through Human-in-the-Loop (HITL) validation. We define semantic stability through NER and RE metrics, using stabilized gold data for model training and evaluation. We then develop a claim-centered KG whose edges represent evidence, provenance, relation type, directionality, polarity, and stability indicators. This benchmark and pipeline support multi-hop reasoning, triadic SUD–SDOH–SUD mediation patterns, and feedback-loop analysis, and will advance etiological inquiry and data-driven health policy analysis.
Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet the limits of compressibility – and when compression begins to erase task-relevant content – remain underexplored. In this paper, we define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average across the HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
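A probing classifier of the kind described is lightweight by construction: a linear model over frozen representations. A minimal sketch (the feature layout and interfaces are assumptions, not the authors' code):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def train_overflow_probe(query_vecs, ctx_vecs, overflow_labels):
    """Linear probe over concatenated query and compressed-context
    representations; label 1 = compressed tokens no longer suffice."""
    X = np.concatenate([query_vecs, ctx_vecs], axis=1)
    return LogisticRegression(max_iter=1000).fit(X, overflow_labels)

def probe_auc(probe, query_vecs, ctx_vecs, labels):
    X = np.concatenate([query_vecs, ctx_vecs], axis=1)
    return roc_auc_score(labels, probe.predict_proba(X)[:, 1])
```

At inference time, the probe's score can gate whether to fall back to uncompressed context before invoking the LLM.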
Despite their recent success, the geospatial reasoning capabilities of large language models (LLMs)—which require understanding spatial relationships among real-world geo-entities—remain underexplored. We propose an automatic method for constructing compositional geographic question answering datasets that jointly consider spatial and entity constraints. The generated dataset serves as a principled benchmark for evaluating how LLMs coordinate spatial computation with entity-level understanding under diverse compositional settings. We evaluate two state-of-the-art LLMs, GPT-5.2 and Gemini 3 Flash, on our dataset. Experimental results show that while the models perform relatively well on questions involving rich entity grounding, their accuracy drops substantially on questions requiring precise quantitative spatial reasoning, such as distance estimation and containment judgment. Our dataset is publicly available for research and reproduction.
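The quantitative reasoning these questions target is trivial to compute symbolically, which is what makes LLM failures on it notable. For instance, distance estimation reduces to the standard haversine great-circle formula (shown for illustration):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km = mean Earth radius

# haversine_km(48.8566, 2.3522, 41.9028, 12.4964) -> ~1106 (Paris to Rome)
```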
Writing style functions both as a vehicle of expression and as a marker of authorial identity. Stylometric methods enable automatic recognition of authors based on linguistic regularities, while recent advances in adversarial learning demonstrate how data can be intentionally modified to prevent models from learning usable representations. Yet it remains unclear whether such perturbations, designed to disrupt machine learning processes, also influence human perception of style. This thesis investigates how humans and models perceive writing style under controlled perturbations and whether manipulations that reduce algorithmic recognition likewise obscure stylistic identity for human readers. The study combines computational and behavioral approaches: constructing semantically controlled yet stylistically diverse text datasets, and conducting human evaluation experiments to compare recognition accuracy between models and readers. The results are expected to clarify how linguistic cues contribute differently to human and algorithmic perception of style and to inform broader applications in authorship analysis, privacy-preserving text transformation, and creative expression. By situating writing style as a dimension of information quality, the research contributes to understanding how authenticity, anonymity, and expressivity interact in digital communication.
Vision Language Action (VLA) models are widely used in Embodied AI, enabling robots to interpret and execute language instructions. However, their robustness to natural language variability in real-world scenarios has not been thoroughly investigated. In this work, we present a novel systematic study of the robustness of state-of-the-art VLA models under linguistic perturbations. Specifically, we evaluate model performance under two types of instruction noise: (1) human-generated paraphrasing and (2) the addition of irrelevant context. We further categorize irrelevant contexts into two groups according to their length and their semantic and lexical proximity to robot commands. We observe consistent performance degradation as context size expands. We also demonstrate that models can exhibit relative robustness to random context, with a performance drop within 10%, while semantically and lexically similar context of the same length can trigger a quality decline of around 50%. Human paraphrases of instructions lead to a drop of nearly 20%. Our results highlight a critical gap in the safety and efficiency of modern VLA models for real-world deployment.
The use of large language models (LLMs) for generating responses to multiple-choice style questionnaires that were originally intended to be answered by humans is often a helpful or even necessary task, for example in persona simulation or during LLM alignment. Although the input and output versatility of generative LLMs is beneficial when adapting such questionnaires to machine use, it can be detrimental when mapping the generated text back to a closed set of possible answer options for evaluation or scoring. In this paper, we investigate the performance of smaller models for the classification of LLM outputs into the available answer options of multiple-choice questionnaires. We consider fine-tuned encoder-transformers as well as a rule-based approach on three datasets with differing answer option complexity. Surprisingly, we find that the best-performing neural approach still underperforms in comparison to our rule-based baseline, indicating that simple pattern-matching of answer options against LLM outputs might still be the most competitive solution for cleaning LLM responses to multiple-choice questionnaires.
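A rule-based baseline of this kind can be as simple as whole-phrase matching of the answer options against the model output, preferring longer options so that, e.g., "strongly agree" is not shadowed by "agree". A minimal sketch (the actual rule set in the paper may be richer):

```python
import re

def match_option(llm_output: str, options: list[str]) -> str | None:
    """Map a free-text LLM response to one of the closed answer options."""
    text = llm_output.lower()
    for opt in sorted(options, key=len, reverse=True):  # longest first
        if re.search(rf"\b{re.escape(opt.lower())}\b", text):
            return opt
    return None  # no option found; flag for manual review

# match_option("I would strongly agree with that.",
#              ["agree", "strongly agree", "disagree"])  # -> "strongly agree"
```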
Research on bot detection in social media is imbalanced across platforms, languages, and detection levels. Addressing these gaps, this study focuses on comment-level bot detection within Polish Reddit communities. We describe in detail the construction of a comprehensive dataset (~40,000 comments, 58% bot-comment prevalence), which provides labels for subsequent model training. Polish Reddit is inherently multilingual; we therefore take advantage of this linguistic signal, treating the language composition of a comment as a feature in its own right. We develop novel platform-specific, language-specific, and culturally informed features, and train comment-level classifiers from multiple model families on the manually annotated dataset. The resulting models achieve strong performance and temporal generalization to 2025 data. We analyze the importance and direction of these novel features across models and find that our 'cross-level' interaction features, 'Bottiquette' compliance signals, formatting markers, language indicators, and repetition and randomness measures, especially the entropy of non-alphabetic characters, rank among the most decisive features. Finally, we complement our quantitative findings with a qualitative characterization of the Polish Reddit bot ecosystem. Overall, this study provides an important baseline for an underexplored setting and contributes to an open discussion on how to approach detection where data is linguistically mixed.
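The entropy feature highlighted above is straightforward to compute; a sketch of one plausible formulation (the paper's exact definition may differ):

```python
import math
from collections import Counter

def nonalpha_entropy(comment: str) -> float:
    """Shannon entropy (bits) of the distribution of non-alphabetic
    characters in a comment; templated output tends to score
    differently from organic typing."""
    chars = [c for c in comment if not c.isalpha()]
    if not chars:
        return 0.0
    n = len(chars)
    return -sum((k / n) * math.log2(k / n)
                for k in Counter(chars).values())
```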
Modern large language model (LLM) systems frequently route inputs to specialized experts to improve accuracy, efficiency, and robustness. Routers determine which expert to activate based on the input, typically represented as a single vector, and the construction of this vector limits the distinctions the router can make. Prior work rarely isolates how this vector representation affects routing behavior. We isolate the role of the representation by holding the routing pipeline fixed and varying only how the representation is formed in multilingual settings. We find that the representation choice systematically reshapes the available routing partitions: in multilingual routing settings, the router's single-vector input often encodes only shallow features (language/format), resulting in domains that are organized by these features rather than by topic. To mitigate this, we introduce Funnel pooling, a lightweight trainable in-model readout that constructs the routing vector directly from token-level hidden states and does not require a separate embedding encoder. Funnel pooling reduces language- and source-dataset-driven clustering and results in more topic-aligned domains. Despite this shift, downstream routing performance remains competitive while introducing only minor inference overhead.
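The Funnel pooling design is the paper's own contribution; to see what a trainable in-model readout over token-level hidden states can look like in general, here is one generic form, learned attention pooling (an illustration only, not the Funnel pooling architecture):

```python
import torch
import torch.nn as nn

class AttentionPoolReadout(nn.Module):
    """Builds a routing vector from token hidden states with a learned
    query, instead of mean-pooling or a separate embedding encoder."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(hidden_dim))

    def forward(self, hidden, mask):
        # hidden: (batch, seq, dim); mask: (batch, seq), 1 = real token
        scores = hidden @ self.query / hidden.size(-1) ** 0.5
        scores = scores.masked_fill(mask == 0, float("-inf"))
        weights = scores.softmax(dim=-1)                    # (batch, seq)
        return (weights.unsqueeze(-1) * hidden).sum(dim=1)  # routing vector
```

Because such a readout is trained with the router, it can learn to downweight tokens that carry only language or formatting signal.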
Temporal information is an essential part of communication, and understanding language requires processing it effectively. Despite recent advances, Large Language Models (LLMs) still struggle with temporal understanding. Existing benchmarks primarily focus on English and leave underexplored how linguistic structure contributes to temporal meaning. As a result, temporal understanding in languages other than English remains largely understudied. In this paper, we introduce TimeRes, a Turkish benchmark for evaluating the temporal understanding of LLMs. TimeRes investigates comprehension of Reichenbach's temporal points and reported speech through date arithmetic. Our dataset includes 4,600 questions across 4 tasks at two levels of complexity, and presents a paired question formulation to distinguish temporal discourse understanding from temporal arithmetic capabilities. We evaluated six LLMs and demonstrate that models struggle to resolve reported speech and fail to generalize across word-order variations.
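As a concrete instance of the date arithmetic involved: resolving reported speech such as "On 3 March she said the meeting had taken place two days earlier" means composing the utterance time (Reichenbach's reference point) with a relative offset to recover the event time (illustrative example, not an item from the benchmark):

```python
from datetime import date, timedelta

speech_time = date(2026, 3, 3)                 # when the sentence was uttered
event_time = speech_time - timedelta(days=2)   # "two days earlier"
print(event_time)                              # 2026-03-01
```

The paired formulation in TimeRes is designed to separate such discourse resolution from raw date arithmetic.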
The recent rapid real-world adoption of vision-language models (VLMs) raises concerns about how social biases encoded in language may propagate into visual generation. In this work, we examine whether socioeconomic stereotypes, expressed through occupation- and income-related linguistic cues in prompts, systematically influence skin-tone representations in text-to-image (T2I) generation, with a focus on colorism as a visual marker of social inequality. We first benchmark 3 small VLMs and 60 human annotators on the Monk Skin Tone (MST) scale using the MST-E dataset. We then conduct a large-scale T2I generation study in which we systematically vary the linguistic framing of income in prompts describing 210 occupations, producing over 2,500 portraits across 3 large VLMs. The skin-tone audit of the portraits by the best-performing annotator (GPT-5 mini) reveals strong color bias: high-income prompts consistently produce lighter-skinned faces, with prompt constraints only modestly attenuating this effect. Bias magnitude varies across generators, with GPT-5 Image-mini and Gemini-2.5 Flash-Image exhibiting more pronounced shifts in MST than Grok-2 Image. Our findings indicate that VLMs encode and amplify ethnoracialized socioeconomic stereotypes in language-conditioned image generation, underscoring the need for cross-modal fairness audits and human-centered evaluations.
Quantitative text analysis relies on high-quality corpora, but keyword-based collection often retrieves irrelevant material, undermining validity. We show that active learning with a transformer-based classifier can iteratively refine corpora by excluding irrelevant documents, prompting researchers to clarify inclusion criteria and address edge cases. Applied to German newspaper articles on depression and schizophrenia, this approach improves construct validity and reduces labeling effort. The document relevance classifiers reached an F1-score of 0.8 with just 100–150 labeled snippets, with further gains from tuning, outperforming both random sampling and a weakly supervised sampling baseline. Moreover, filtering out non-medical articles had little effect on downstream depression stigmatization measures but increased measured schizophrenia stigmatization. Active learning thus enables efficient corpus validation and clearer concept boundaries with minimal preprocessing. The source code is publicly available at https://github.com/jakobstgl/active-learning-corpus-refinement.
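The refinement loop is standard pool-based active learning with uncertainty sampling; a generic sketch (the classifier and labeling interfaces are assumptions, not the released code):

```python
import numpy as np

def active_learning_round(clf, X_lab, y_lab, X_pool, label_fn, k=50):
    """One round: fit, query the k least-certain pool items, grow the
    labeled set. label_fn is the human annotator (hypothetical interface)."""
    clf.fit(X_lab, y_lab)
    proba = clf.predict_proba(X_pool)[:, 1]        # P(relevant)
    query = np.argsort(np.abs(proba - 0.5))[:k]    # most uncertain snippets
    y_new = np.array([label_fn(i) for i in query])
    X_lab = np.vstack([X_lab, X_pool[query]])
    y_lab = np.concatenate([y_lab, y_new])
    return clf, X_lab, y_lab, np.delete(X_pool, query, axis=0)
```

The uncertain items are exactly the edge cases that force researchers to sharpen inclusion criteria, which is where the construct-validity gain comes from.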
The ability of AI systems to not only answer complex natural language questions but also transparently justify their reasoning is crucial for building trust and enabling effective human-AI collaboration. In domains requiring multi-hop reasoning, answers must often be constructed by combining multiple relevant sentences from a knowledge base to build an inferential path from the question toward the answer. We tackle this challenge by exploring a neuro-symbolic approach to reasoning through the generation of entailment trees – structured, step-by-step proof trees – using Large Language Models (LLMs). These trees provide interpretable justifications for the inference process. Using the EntailmentBank dataset (Dalvi et al., 2021), we evaluated a diverse set of prompting strategies across multiple models and proposed an inference-guided prompting approach that performs well. We also fine-tuned LLMs specifically for proof generation, applying several data augmentation, curriculum learning, and reinforcement-guided optimization strategies. Our results show that the fine-tuned model outperforms all prompting strategies, achieving superior performance across multiple structural and semantic metrics. We also provide a detailed evaluation of which training strategies are helpful for proof generation. Our findings highlight the importance of proof tree generation as a benchmark for evaluating structured reasoning in LLMs.