Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Santosh T.Y.S.S., Juan Diego Rodriguez, Ona de Gibert (Editors)


Anthology ID:
2026.acl-srw
Month:
July
Year:
2026
Address:
San Diego, California, United States
Venue:
ACL
Event:
Annual Meeting of the Association for Computational Linguistics (2026)
SIG:
Publisher:
Association for Computational Linguistics
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-srw/
DOI:
ISBN:
979-8-89176-393-7

Advances in inference methods have enabled language models to improve their predictions without additional training. These methods often prioritize raw performance over cost-effective compute usage. However, computational efficiency is key for real-world applications with resource constraints. We provide a systematic analysis of the inference scaling strategies *self-consistency*, *self-refinement*, *multi-agent debate*, and *mixture-of-agents* to study their compute-performance tradeoffs. We evaluate methods on two reasoning benchmarks (MMLU-Pro, BBH) and include extensive parameter configurations (e.g., scaling the number of parallel predictions, agents, and debate rounds) across different model sizes. Across 34 configurations and over 100 evaluations, we compute the Pareto-optimal front to select methods that achieve the best accuracy with the lowest computational budget. Notably, inference scaling improves accuracy by up to 7.1 percentage points over chain-of-thought at the highest evaluated budgets (20× the CoT compute budget) on MMLU-Pro. With an equal computing budget, debate and mixture-of-agents outperform self-consistency by 1.3 and 2.7 percentage points, respectively. While self-consistency saturates earlier, multi-agent gains persist, particularly on more complicated tasks. We identify a simple multi-agent design guideline: mixture-of-agents is most efficient when the number of parallel generations exceeds the number of sequential aggregations.
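As an illustration of the Pareto-front selection mentioned in the abstract above, the following is a minimal sketch, not the authors' implementation. The configuration names, costs, and accuracies are made-up placeholders, and the function name `pareto_front` is an assumption introduced here.

```python
from typing import List, Tuple

# Each configuration is (name, compute_cost, accuracy). All values used with
# this function below are hypothetical placeholders, not results from the paper.
Config = Tuple[str, float, float]


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep configurations that no other configuration dominates.

    A configuration is dominated if another one costs no more and is at
    least as accurate, with at least one of the two strictly better.
    """
    # Sort by cost ascending; on equal cost, put the most accurate first.
    ordered = sorted(configs, key=lambda c: (c[1], -c[2]))
    front: List[Config] = []
    best_accuracy = float("-inf")
    for name, cost, accuracy in ordered:
        # Anything cheaper (or equally cheap) has already been seen, so a
        # configuration survives only if it beats the best accuracy so far.
        if accuracy > best_accuracy:
            front.append((name, cost, accuracy))
            best_accuracy = accuracy
    return front


example = [
    ("A", 1.0, 60.0),
    ("B", 5.0, 63.0),
    ("C", 5.0, 64.0),
    ("D", 10.0, 63.5),
    ("E", 20.0, 66.0),
]
front_names = [name for name, _, _ in pareto_front(example)]
```

In this toy input, `B` and `D` are dropped because `C` is at least as cheap and more accurate than both.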
The large and ever-increasing amount of data available on the Internet, coupled with the laborious task of manual claim and fact verification, has sparked interest in the development of automated claim verification systems. Several deep learning and transformer-based models have been proposed for this task over the years. With the introduction of Large Language Models (LLMs) and their superior performance in several NLP tasks, we have seen a surge of LLM-based approaches to claim verification along with the use of novel methods such as Retrieval Augmented Generation (RAG). In this survey, we present a comprehensive account of recent claim verification frameworks using LLMs. We describe the different components of the claim verification pipeline used in these frameworks in detail, including common approaches to retrieval, prompting, and fine-tuning. Finally, we describe publicly available English datasets for this task.
We investigate how multilingual representations emerge across depth in large language models. Using a unified probing framework, we analyze six multilingual LLMs across five languages (EN/ES/ZH/FR/DE), decomposing behavior into (i) early-layer dynamics, (ii) linear vs. MLP separability, and (iii) token–language alignment that tracks where vocabulary sharing peaks. Across models, we observe a consistent and substantial early jump: accuracy rises by +73.5 to +80.7 points from L0 to L1 on average, indicating that language-relevant signals become accessible immediately after the embedding layer. Moreover, representations are largely linearly separable: for 5/6 models, the mean gap between MLP and linear probes remains within ±0.5 points. Token–language alignment further reveals systematic structure, with peak vocabulary mass exceeding 48% in some models and substantial variation in the depth of peak sharing. These findings provide a compact, cross-model characterization of how multilingual information is organized across depth and introduce simple alignment metrics that complement accuracy-based evaluation.
Mechanistic interpretability aims to reverse-engineer large language models (LLMs) into human-understandable computational circuits. However, the complexity of pretrained models often obscures the minimal mechanisms required for specific reasoning tasks. In this work, we train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification (IOI) task, a benchmark for studying coreference-like reasoning in transformers. Surprisingly, a single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking MLPs and normalization layers. Through residual stream decomposition, spectral analysis, and embedding interventions, we find that the two heads specialize into additive and contrastive subcircuits that jointly implement IOI resolution. Furthermore, we show that a two-layer, one-head model achieves performance comparable to that of a multi-head model by composing information across layers primarily through query-key interactions. These results demonstrate that task-specific training induces highly interpretable, minimal circuits, offering a controlled testbed for probing the computational foundations of transformer reasoning.
With the emergence of Large Language Models (LLMs), new methods in Information Retrieval are available in which relevance is estimated directly through language understanding and reasoning, instead of embedding similarity. We argue that similarity is a short-sighted interpretation of relevance, and that LLM-Based Relevance Judgment Systems (LLM-RJS) with reasoning have the potential to outperform Neural Embedding Retrieval Systems (NERS) by overcoming this limitation. Using the TREC-DL 2019 passage retrieval dataset, we compare various LLM-RJS with NERS, but observe no noticeable improvement. Subsequently, we analyze the impact of reasoning by comparing LLM-RJS with and without reasoning. We find that human annotations also suffer from short-sightedness, and that false positives in the reasoning LLM-RJS are primarily mistakes in annotations due to short-sightedness. We conclude that LLM-RJS do have the ability to address the short-sightedness limitation in NERS, but that this cannot be evaluated with standard annotated relevance datasets.
We propose a lightweight and single-pass uncertainty quantification method for detecting hallucinations in Large Language Models. The method uses attention matrices to estimate uncertainty without requiring repeated sampling or external models. Specifically, we measure the Kullback–Leibler divergence between each attention head’s distribution and a uniform reference distribution, and use these features in a logistic regression probe. Across multiple datasets, task types, and model families, attention divergence is strongly predictive of answer correctness and performs competitively with existing uncertainty estimation methods. We find that this signal is concentrated in middle layers and on factual tokens such as named entities and numbers, suggesting that attention dynamics provide an efficient and interpretable white-box signal of model uncertainty.
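The divergence feature described above can be sketched as follows. This is a rough illustration, not the paper's implementation: the direction of the divergence (attention distribution relative to uniform), the averaging over query rows, and the function names are all assumptions made here.

```python
import math
from typing import Sequence


def kl_from_uniform(attention_row: Sequence[float]) -> float:
    """KL(p || u) for one attention row p over n keys, with u uniform.

    KL(p || u) = sum_i p_i * log(p_i / (1/n)) = sum_i p_i * log(p_i * n).
    Zero-probability entries contribute nothing to the sum.
    """
    n = len(attention_row)
    return sum(p * math.log(p * n) for p in attention_row if p > 0.0)


def head_feature(attention: Sequence[Sequence[float]]) -> float:
    """Assumed aggregation: average the per-row divergence over query rows."""
    return sum(kl_from_uniform(row) for row in attention) / len(attention)
```

A perfectly uniform row gives 0, and a one-hot row over n keys gives log n, so larger values indicate more sharply focused attention. One such scalar per head could then serve as an input feature to a logistic regression probe.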
Automatic prompt optimization (APO) has emerged as a powerful paradigm for improving LLM performance without manual prompt engineering. Reflective APO methods such as GEPA iteratively refine prompts by diagnosing failure cases, but the optimization process remains black-box and label-free, leading to uninterpretable trajectories and systematic failure. We identify and empirically demonstrate four limitations: on GSM8K with a defective seed, GEPA degrades accuracy from 23.81% to 13.50%. We propose VISTA, a multi-agent APO framework that decouples hypothesis generation from prompt rewriting, enabling semantically labeled hypotheses, parallel minibatch verification, and an interpretable optimization trace. A two-layer explore-exploit mechanism combining random restart and epsilon-greedy sampling further escapes local optima. VISTA recovers accuracy to 87.57% on the same defective seed and consistently outperforms baselines across all conditions on GSM8K and AIME2025.
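For readers unfamiliar with epsilon-greedy sampling, the generic technique can be sketched as below. This shows only the textbook selection rule. It is not VISTA's mechanism, and the candidate names and scores are hypothetical.

```python
import random
from typing import Dict, Optional


def epsilon_greedy(
    scores: Dict[str, float],
    epsilon: float,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a candidate: explore with probability epsilon, else exploit.

    `scores` maps a candidate identifier (e.g., a prompt variant) to its
    current estimated quality. Names and values are illustrative only.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    rng = rng or random.Random()
    if rng.random() < epsilon:
        # Explore: choose uniformly at random among all candidates.
        return rng.choice(sorted(scores))
    # Exploit: choose the candidate with the highest estimated score.
    return max(scores, key=scores.get)
```

With `epsilon=0.0` the rule always returns the current best candidate, and with `epsilon=1.0` it always samples uniformly.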
Adapting Large Language Models to the medical domain remains an active area of research, with multiple strategies proposed to leverage annotated and unannotated data effectively. In this work, we propose a thesis outline to compare three common adaptation approaches: Instruction Tuning, Continual Pretraining, and Reasoning-oriented Training. We identify 5 dimensions to analyse: i) the interaction between the adaptation technique and the tasks; ii) the impact of the data size on the downstream performance; iii) the differences between datasets required by the three techniques; iv) the impact of the techniques given the model size; v) the impact of the techniques given the language. We construct an evaluation framework composed of 5 multilingual medical NLP tasks (named entity recognition, relation extraction, question answering, case report form filling, argument mining), spanning 21 datasets in English, Italian, and Spanish, for a total of 61 combinations of language and sub-task.
Pretokenization is a crucial, sequential pass in Byte-level BPE tokenizers, yet little work has been done to optimize it for edge-side inference. Our proposed new implementation, Peek2, serves as a drop-in replacement for cl100k-like pretokenizers used in GPT-3, LLaMa-3, and Qwen-2.5. After breaking down and analyzing the logic of the original cl100k pretokenizer, we introduce a new pretokenization algorithm with linear time complexity and constant, trivial memory usage, suited for edge scenarios. Test results show that, depending on the dataset, it increases microbenchmarking throughput by up to 2.48× and delivers a 1.14× improvement in overall throughput across the entire Byte-level BPE encoding process, while producing results identical to those of the baseline Regex-based tokenizer.
Annotator disagreement on tasks like natural language inference (NLI) reflects genuine linguistic ambiguity, yet most fine-tuning recipes treat every example as equally learnable. We ask whether this external signal of ambiguity predicts *per-example* learning dynamics under LoRA, the most widely used parameter-efficient fine-tuning method, and find that it does. When we correlate annotation entropy (from ChaosNLI’s 100 labels per example) with per-example area under the loss curve (AULC) on SNLI and MNLI, the correlation is positive in all 25 conditions tested (Spearman 𝜌 = 0.06–0.43), with decoder-only models showing stronger correlations than encoders at matched LoRA rank. More strikingly, under LoRA contested examples exhibit *un-learning*: their gold-label loss *increases* during training, a pattern that is largely absent under full fine-tuning and IA3 in the encoder setting where matched comparisons are available, and that we also observe under LoRA on two decoder-only models. The effect survives partial-correlation controls and replicates across seeds and datasets. A preliminary noise-injection experiment is consistent with these findings.
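The two per-example quantities named above can be sketched as follows. This is a minimal illustration under stated assumptions: entropy is measured in bits, AULC is computed with the trapezoidal rule over equally spaced checkpoints, and the function names are introduced here rather than taken from the paper.

```python
import math
from typing import Sequence


def annotation_entropy(label_counts: Sequence[int]) -> float:
    """Shannon entropy (in bits, an assumption) of one example's label votes.

    `label_counts` holds how many annotators chose each label, e.g. counts
    for entailment / neutral / contradiction.
    """
    total = sum(label_counts)
    if total == 0:
        raise ValueError("at least one annotation is required")
    probs = [c / total for c in label_counts if c > 0]
    # abs() avoids returning -0.0 when all annotators agree.
    return abs(sum(-p * math.log2(p) for p in probs))


def area_under_loss_curve(losses: Sequence[float], step: float = 1.0) -> float:
    """Trapezoidal area under a per-example loss curve (assumed definition).

    `losses` is the example's loss recorded at equally spaced checkpoints.
    """
    return sum(
        step * (a + b) / 2.0 for a, b in zip(losses[:-1], losses[1:])
    )
```

An example on which annotators split evenly between two labels has an entropy of 1 bit, while a unanimous example has 0. A rank correlation such as Spearman's can then be computed between the two per-example series.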
Understanding idiomatic and figurative language in images remains a fundamental challenge for vision–language models, as it requires reasoning beyond literal image–text alignment. Although large pretrained models such as CLIP and BLIP-2 perform well on literal recognition, they consistently fail on multimodal figurative benchmarks, often favoring visually salient but semantically literal interpretations. We show that this failure arises from a systematic literal alignment bias rather than limited model capacity. Motivated by this observation, we reformulate multimodal figurative understanding as a contrastive semantic deviation problem, where figurative images must be distinguished from visually plausible literal alternatives. We introduce a parameter-efficient adaptation of CLIP using Low-Rank Adaptation (LoRA) with hard literal negative mining, achieving targeted reshaping of multimodal representations without full fine-tuning. Experiments on the IRFL benchmark across idioms, metaphors, and similes demonstrate substantial improvements over zero-shot CLIP, BLIP-2, ensemble-based, and knowledge-augmented baselines. Finally, we introduce FIGMENT, a multilingual figurative grounding evaluation spanning five idiom-rich languages, and show that the adapted model generalizes across languages despite being trained exclusively on English supervision.
Chain-of-Thought (CoT) prompting is the dominant strategy for eliciting step-by-step reasoning in large language models, but its effect on code generation is poorly understood. We present a controlled 2×2 study of Qwen2.5-Coder-1.5B and DeepSeek-Coder-1.3B (each in base and instruction-tuned variants) on HumanEval, MBPP, and LiveCodeBench, plus scale-validation runs on Qwen2.5-Coder at 7B and 14B and a preliminary evaluation of CodeLlama-7B. We find that instruction tuning reverses CoT’s effect on small Qwen models: CoT improves the 1.5B base (+13.4pp, p<0.001) but significantly degrades the 1.5B instruct variant (-15.2pp, p<0.001). The reversal is sharply scale-bounded — it disappears at 7B (-0.6pp) and goes slightly positive at 14B (+2.4pp) — while CoT’s positive effect on base models grows monotonically with scale (+13.4 → +28.7 pp). DeepSeek-Coder-1.3B is insensitive regardless of regime. A direct token-count and truncation analysis shows the mechanism: at 1.5B, CoT inflates Qwen Instruct’s mean output length by 112 tokens and pushes 7.6× more generations into truncation, where Pass@1 is 0%; at 14B, the same prefix produces complete code well within budget. Layer-wise probing shows all four small models encode prompt type by Layer 1–4 (>90% accuracy) — universally, whether CoT helps or hurts — demonstrating that representation does not determine interpretation: the same internal signal drives divergent downstream behavior depending on training regime and capacity. Building on these mechanistic findings, we develop a probe-guided style router that, when trained per model on a labeled training split, selects among 12 prompt styles via a single 84 ms forward pass; it is statistically indistinguishable from the best fixed style in 7/8 settings and significantly outperforms CoT where CoT is most harmful (p=0.012, h=+0.40). 
Our results argue against applying CoT blindly to small instruct code models: its effect depends on architecture, training regime, and scale in ways that are mechanistically detectable from early-layer activations.
Diffusion language models generate text by iteratively denoising all tokens in parallel, but when and where their hidden states encode whether the output will be functionally correct remains unknown. We present the first probing study of DLM internals, training linear classifiers on hidden states to predict functional correctness. Across two models (LLaDA-8B, Dream-7B) and four tasks, we find that DLMs uniquely accumulate correctness signal across denoising steps (AUC gains of 0.08–0.11 on reasoning tasks), absent in single-pass AR decoding. However, step-0 signal reflects prompt difficulty rather than diffusion-specific computation. Signal emergence is task-dependent: structural tasks show flat profiles while reasoning tasks show gradual buildup. The two models exhibit distinct layer dynamics, with LLaDA concentrating signal in upper layers while Dream redistributes toward lower layers. We further show that probe confidence can identify likely failures, enabling selective generation that avoids 36–98% of wasted compute.
Retrieval-augmented generation (RAG) based on dense embeddings has become a dominant paradigm for text retrieval. However, many real-world applications require attribute-specific querying, where explicit values or properties must be extracted from text (e.g., symptoms in clinical notes or dosage values in medical reports). Dense retrieval handles paraphrastic variation well but often entangles multiple attributes within a single embedding, making value extraction difficult. Knowledge graphs (KGs), in contrast, support explicit attribute access but are brittle under linguistic and structural variation, leading to low recall. This thesis proposal aims to investigate the representational trade-off underlying these approaches. We study knowledge graph representations from an information-theoretic and optimal coding perspective, focusing on the tension between fine-grained factorization and compact canonicalization of concepts. Building on this perspective, we propose a query-driven framework for constructing and retrieving knowledge graphs from text, aiming to combine the robustness of dense retrieval with the explicit queryability of symbolic representations.
We introduce TokLens, an open-source toolkit for evaluating tokenizer quality across languages using six intrinsic metrics: fertility, characters per token, compression ratio, normalized sequence length, single-token retention rate, and cross-lingual parity. We evaluate 24 tokenizers from major LLM families across 15 typologically diverse languages and correlate these metrics with downstream performance. Our analysis reveals stark disparities: GPT-2 produces 56x more tokens per word in Japanese than in English, while newer tokenizers like Qwen2.5 and Gemma-2 reduce this gap to under 4x. No intrinsic metric predicts English benchmark performance after controlling for model size. However, on multilingual benchmarks (MMLU-ProX), linear mixed-effects models show that tokenizer metrics significantly predict per-language performance (STRR: 𝛽 = +5.7, z = 18.5, p < 0.001). A controlled experiment on the Qwen2.5 family further shows that languages with higher single-token retention rate exhibit steeper scaling slopes (𝜌 = 0.91, p < 0.001). These results indicate that tokenizer quality is significantly associated with multilingual LLM performance, though the evidence remains correlational and partially confounded with pretraining data composition.
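Two of the intrinsic metrics listed above can be sketched with commonly used definitions. This does not reproduce the TokLens API. The function names, the whitespace-based word count, and the toy tokenizer are all assumptions made for illustration.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def fertility(texts: List[str], tokenize: Tokenizer) -> float:
    """Average tokens per whitespace-delimited word (a common definition).

    Whitespace word counts are an assumption here; languages written without
    spaces, such as Japanese, need a proper word segmenter instead.
    """
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    if n_words == 0:
        raise ValueError("no words found")
    return n_tokens / n_words


def chars_per_token(texts: List[str], tokenize: Tokenizer) -> float:
    """Average number of non-whitespace characters covered by each token."""
    n_chars = sum(len(w) for t in texts for w in t.split())
    n_tokens = sum(len(tokenize(t)) for t in texts)
    if n_tokens == 0:
        raise ValueError("no tokens produced")
    return n_chars / n_tokens


def toy_tokenize(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: chunks of up to 3 characters per word."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]


sample = ["tokenizer quality"]
sample_fertility = fertility(sample, toy_tokenize)
sample_cpt = chars_per_token(sample, toy_tokenize)
```

Any real tokenizer can be passed in place of `toy_tokenize`, provided it is wrapped as a function from a string to a list of tokens.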
One partner says "Fine" meaning resolution; the other hears surrender. The word is shared; the affective uptake is not. We formalize this as **affective meaning divergence** (AMD), the total-variation distance between interlocutors’ anchor-conditioned affect distributions. Building on speech-act theory, common-ground accumulation, and entropy-regularized game theory, we derive a logit best-response map whose dynamics undergo a *saddle-node bifurcation*: when 𝛽𝛼 > 4, a monotone increase in AMD-driven load produces an abrupt, hysteretic collapse of repair coordination. On Conversations Gone Awry (CGA-Wiki; N=652), derailing conversations exhibit critical-slowing-down (CSD) signatures across multiple levels: lexical divergence variance (p<0.001, d=0.36), AMD variance (p=0.001, d=0.26), and dialog-act repair variance (p=0.016, d=0.20), all significant after correction and stronger than toxicity and sentiment baselines. AMD provides a distinct temporal signature, with retrospectively measured variance peaking at the bifurcation point while toxicity variance peaks earlier, and is the only indicator grounded in the theoretical framework. Boundary-condition analysis on CGA-CMV (N=1,169) yields mixed but directionally consistent evidence.
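The total-variation distance that AMD is defined in terms of can be computed for discrete distributions as sketched below. The affect labels and probabilities are hypothetical, and how the paper estimates the anchor-conditioned affect distributions is not shown here.

```python
from typing import Mapping


def total_variation(p: Mapping[str, float], q: Mapping[str, float]) -> float:
    """Total-variation distance between two discrete distributions.

    TV(p, q) = 0.5 * sum_k |p(k) - q(k)|, taken over the union of supports.
    Categories missing from one distribution are treated as probability 0.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in support)


# Hypothetical affect distributions for the anchor word "Fine".
speaker = {"resolution": 0.7, "surrender": 0.3}
hearer = {"resolution": 0.2, "surrender": 0.8}
divergence = total_variation(speaker, hearer)
```

The distance is 0 when both interlocutors assign the same affect distribution to the anchor and 1 when their distributions have disjoint support.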
Reasoning models frequently agree with incorrect user suggestions, a behavior known as sycophancy. However, it is unclear where in the reasoning trace this agreement originates and how strong the commitment is. We introduce sycophantic anchors: sentences identified via counterfactual analysis that commit models to user agreement. Across four reasoning models spanning three architecture families (Llama, Qwen, Falcon-hybrid) and 1.5B–8B parameters, we analyze over 200,000 counterfactual rollouts and show that linear probes reliably detect sycophantic anchors (74–85% balanced accuracy), outperforming text-only baselines at high commitment levels and confirming that they capture internal states beyond surface vocabulary. Regressors further predict commitment strength from activations (R² up to 0.74). We observe a consistent asymmetry: sycophancy leaves a stronger mechanistic footprint than correct reasoning. We also find that sycophancy builds gradually during generation rather than being determined by the prompt. These findings enable sentence-level detection and quantification of model misalignment mid-inference.
Neural IR models achieve strong performance but remain difficult to interpret. We present NEAT-IR, a black-box analysis framework that explains ColBERT’s ranking behavior using 26 classical IR features (BM25, TF-IDF, IDF measures, positional signals). We analyze ColBERT through two complementary lenses: regression (predicting exact scores) and learning-to-rank (predicting relative order), evaluated on MS MARCO (48,250 query-passage pairs). Our key finding is a score-rank gap: classical features preserve ColBERT’s rankings nearly perfectly (NDCG@5 ≈ 0.99) yet explain only R² ≈ 0.28 of score variance. Feature attribution reveals that regression and ranking models rely on distinct feature subsets: query-level IDF signals dominate score prediction, while document-matching features (BM25, cosine TF-IDF) drive ranking preservation. These findings suggest that ColBERT’s ordinal behavior on MS MARCO is largely recoverable from classical signals, while neural contributions primarily affect score magnitude. NEAT-IR enables practitioners to diagnose when neural rankers deviate from classical patterns, supporting interpretable model auditing and informed hybrid pipeline design.
Large Language Models have demonstrated strong multilingual fluency, yet fluency alone does not guarantee socially appropriate language use. In high-context languages, communicative competence requires sensitivity to social hierarchy, relational roles, and interactional norms that are encoded directly in everyday language. Bangla exemplifies this challenge through its three-tiered pronominal system, kinship-based addressing, and culturally embedded social customs. We introduce BanglaSocialBench, the first benchmark designed to evaluate sociopragmatic competence in Bangla through context-dependent language use rather than factual recall. The benchmark spans three domains: Bangla Address Terms, Kinship Reasoning, and Social Customs, comprising 1,719 culturally grounded instances written and verified by native Bangla speakers. We evaluate twelve contemporary LLMs in a zero-shot setting and observe systematic patterns of cultural misalignment. Models frequently default to overly formal address forms, fail to recognize multiple socially acceptable address pronouns, and conflate kinship terminology across religious contexts. Our findings show that sociopragmatic failures are often structured and non-random; for example, inappropriate addressing choices concentrate heavily in downward-hierarchy (Elder→Younger) and informal contexts. This reveals persistent limitations in how current LLMs infer and apply culturally appropriate language use in realistic Bangladeshi social interactions.
Hate speech detection is essential for maintaining healthy online communities. Large language models (LLMs) perform well on text classification, yet their decision strategies need to be better understood. While post-hoc rationales can justify individual decisions, they substantially increase inference cost and limit scalability in high-throughput settings. As an alternative, we propose an extended rational inattention model that parameterizes linguistic noise and information processing cost, providing an interpretable behavioral framework for black-box LLM classifiers. Treating LLMs as rational decision-makers under information constraints allows us to estimate, from the observed classification behavior, the parameters that represent information processing cost and noise sensitivity. As a case study, using a hate-speech dataset spanning multiple noise environments, we evaluate four commercial LLMs and show that the predictions of the extended rational inattention model closely match the observed performance across different noise levels. We further test the performance under various noise mechanisms and find that the inferred information cost parameters remain consistent while the noise parameters vary with the distortion mechanism. Overall, our framework offers a cost-efficient and quantitative approach to derive interpretable indices of LLM moderation behavior and decisions, without additional rationale generation.
Adversarial perturbations in the context of large language models (LLMs) are subtle changes added to input data (i.e., images or text) that are designed to alter predictions or outputs of machine learning models. We introduce several novel visualizations using topological data analysis (TDA) (leveraging persistent homology) to characterize how adversarial perturbations act on text inputs, specifically, how sandbagging and code-injection attacks alter the geometric structure of attention heads in transformer models. By computing persistent homology metrics from attention maps across different model architectures (such as BERT, RoBERTa, ELECTRA, DistilGPT, etc.), we find that adversarial inputs alter higher-dimensional topological features (H1 loops and H2 voids) in ways that distinguish them from clean, non-adversarial inputs.
We investigate whether large language models (LLMs) can generate literal usage examples for Japanese multiword expressions (MWEs), whose literal readings are structurally low-frequency in available corpora. Prior work on MWEs has largely focused on detecting idiomatic usages in context, leaving literal usages underrepresented, particularly for Japanese MWEs whose literal readings are rare and structurally diverse. Because literal readings are rarely attested in corpora, we design a lexicon-grounded setup that uses corpus non-literal usages as contrastive cues for controlled prompting. We evaluate the generated sentences using automatic literalness judgments and human literalness judgments, together with manual inspection. Our results show that providing contrastive non-literal information stabilizes literal generation and improves quality compared with prompts that include only literal information or no hints. In addition, we conduct an LLM-based understanding test that compares model predictions of literal and idiomatic plausibility with human judgments. The results indicate that the model aligns more closely with human judgments for idiomatic interpretations than for literal ones, highlighting the relative difficulty of modeling literal readings of MWEs. The study demonstrates that LLMs can complement existing resources by supplying frequency-independent literal examples and offers a controlled framework for examining contextual meaning understanding of Japanese MWEs.
We propose a novel approach to translating Japanese slides into English and to correcting their layout errors by utilizing multimodal LLMs with slide images and XML structures. Existing translation tools often suffer from layout errors after translation due to text expansion during the translation process, causing text to overlap with figures or other items in slides and thereby reducing readability. To overcome this issue, our proposed framework introduces two steps: (i) translating text fragments within the slide, and (ii) correcting layout errors by optimizing layout placement based on visual consistency. In step (ii), we empirically show that few-shot prompts are quite effective in layout error correction. Given that the optimal combination of steps (i) and (ii) varies depending on the slide layout, our method generates eight different layout candidates. Consequently, we introduce a third step that automatically selects the optimal output from these eight candidates. The experimental results show that the proposed method outperforms baselines, achieving a 4.1% layout error rate and a model selection success rate of over 80%.
Rap is a vocal style rooted in Hip-Hop culture, characterized by producing rhymes in synchrony with a rhythmic beat. This paper proposes a method for generating Japanese rap lyrics with a large language model (LLM) whose rhyming behavior is improved via reinforcement learning. We design a reward function that evaluates end rhymes between two generated bars and apply GRPO, a reinforcement-learning method, to encourage Japanese rhyming without using existing Japanese rap lyrics as training data. Experimental results show that, although output collapse is observed in some cases, GRPO increases the proportion of outputs that receive moderate or high human ratings on rhyme-related criteria.
To prepare for an uncertain future, organizations must continuously monitor emerging trends and early signals of change. The increasing availability of web-based textual data has boosted natural language processing (NLP) methods in strategic foresight, particularly in the scanning phase. While prior studies have extensively focused on the identification of signals in such data, considerably less attention has been paid to how these signals evolve over time and gain relevance as they become more visible. This study addresses this gap by examining whether tracking the temporal dynamics of signals can improve their assessment for strategic decision-making. Demonstrated on the use case of the European electric vehicle market, we find three dominant signal trajectories and show that burst dynamics tend to surface signal consolidation rather than the early detection of weak signals. The results indicate that foresight research should move beyond static, one-off analyses toward a dynamic temporal perspective capable of identifying signals at earlier stages of emergence.
Logical fallacy detection models frequently over-flag valid reasoning due to reliance on surface-level spurious correlations. We introduce 703 LLM-generated Counterfactually Augmented Data (CAD) pairs (minimally differentiated valid and fallacious arguments) to debias models through targeted augmentation. Fine-tuning DeBERTa-v3-large on CoCoLoFa augmented with these pairs yields marginal in-distribution improvement (+0.4% F1) but substantial out-of-distribution robustness: a 58% relative reduction in false positive rate (64% → 26.7%) on a 300-sample Reddit-sourced evaluation set. While recent LLMs (Llama-3.1-8B, Llama-3.3-70B) achieve high performance under optimized prompts (F1 90–94%), they degrade severely under simple human-like prompts (F1 63–72%, FPR 54–74%). Our lightweight, prompt-invariant approach achieves competitive robustness (F1 85.9%, FPR 26.7%) across all prompting regimes without prompt engineering, making it stable for production deployment with unpredictable user input. The dataset and model are publicly released.
The Advanced Encryption Standard (AES) is currently considered the preferred standard of encryption for secure messaging by the National Institute of Standards and Technology. While its predecessor, the Data Encryption Standard (DES), can be analyzed in mere seconds with modern cryptanalysis algorithms, those same algorithms are infeasible for cryptanalysis of AES. This is primarily because the key, or the list of values used during encryption, was increased from a length of 56 bits to 128 bits. The list of all possible combinations, also called the probability space, is tens of orders of magnitude larger in AES than in DES. Current DES cryptanalysis methods use mathematical methods to find the most statistically likely approximate key in the entire probability space. The increased probability space in AES is too large to find an approximation using these formulas. However, these methods operate under the assumption that every key is possible. While this is mathematically true, when viewed through the lens of a linguist, many of these keys create messages that are impossible in language. Linguistic attestation is the documentation that a word, grapheme, sound, or other linguistic feature exists in a language. This thesis proposal presents an algorithm that eliminates AES keys using linguistic attestation. I apply the idea that any grammatical message encrypted by AES will not contain any "unattested" data in its input. The algorithm proposes trimming the probability space by removing keys that are unattested. Once a smaller probability space has been created, any method that searches the probability space for a solution may be applied to find the key from the new, smaller list of filtered keys.
Garden-path sentences offer a controlled probe of English incremental sentence processing because they require a reader to revise an initially plausible parse when a later region disambiguates the structure. We present an architecture-aware comparison of garden-path recovery in causal and masked language models using 100 English garden-path/control pairs (200 sentences) spanning three constructions: NP/Z, where a noun phrase is initially read as a direct object but must be reanalyzed as the subject of a zero-complement clause; NP/S, where a noun phrase must be reanalyzed as the subject of an embedded sentence; and MV/RR, where an apparent main verb must be reanalyzed as a reduced relative modifier. Causal models are evaluated with left-to-right word surprisal, whereas masked models are evaluated with pseudo-surprisal derived from masked language model scoring. Beyond the disambiguating word, we analyze cumulative excess surprisal, area-under-curve recovery summaries, and layer-wise hidden-state divergence between each garden-path sentence and its minimally different control. Across the audit-valid model set, causal models show larger within-model disambiguation effects than masked models overall, with the clearest family-level difference on NP/Z constructions. We interpret this difference cautiously because surprisal and pseudo-surprisal are not numerically commensurable across architectures or tokenizers. The results nevertheless show that architecture changes which recovery signals are observable: decoder-only models exhibit sharper online disruption at the point of syntactic revision, while bidirectional encoders appear comparatively buffered at the disambiguator due to right-context access. More broadly, the findings argue that garden-path evaluation should emphasize recovery dynamics, not merely end-state plausibility or task accuracy.
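The surprisal quantities mentioned above follow from per-word probabilities, as in this minimal sketch. It assumes the model probabilities are already available (the values in the usage lines are made up) and that the garden-path and control sentences are aligned word by word. The function names are illustrative rather than the paper's, and the same formula gives pseudo-surprisal if the probabilities come from masked-token scoring.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal in bits: -log2 of the probability assigned to the observed word."""
    return -math.log2(prob)


def disambiguation_effect(p_garden_path: float, p_control: float) -> float:
    """Surprisal at the disambiguating word, garden-path minus control."""
    return surprisal(p_garden_path) - surprisal(p_control)


def cumulative_excess_surprisal(gp_probs, control_probs) -> float:
    """Sum of per-position surprisal differences over an aligned region."""
    return sum(
        surprisal(p_gp) - surprisal(p_ctl)
        for p_gp, p_ctl in zip(gp_probs, control_probs)
    )


# Hypothetical probabilities for a two-word region starting at the disambiguator.
effect = disambiguation_effect(0.125, 0.5)
excess = cumulative_excess_surprisal([0.125, 0.25], [0.5, 0.25])
print(effect, excess)
```

Because the two architectures produce probabilities under different conditioning, such differences are most safely compared within a model, which matches the cautious reading in the abstract.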
LLM judges are often used to score generated answers, but their decisions may be affected by surface style rather than semantic correctness. We introduce PolyJudge-Uncertain, a controlled benchmark for studying multilingual hedging effects in LLM-as-a-judge evaluation. The benchmark contains 5,120 short factual QA instances across English, Hindi, Hinglish, and Bengali, balancing assertive versus hedged style and correct versus incorrect answers. A small pilot suggested a large pointwise penalty against hedged answers. After repairing multilingual templates and adding quality-control checks, this pointwise effect largely disappears: final pointwise accuracy is 99.8%, with no meaningful assertive-hedged gap. The robust remaining effect is pairwise: when two answers are equally correct and differ only in style, the judge prefers the assertive answer in 1,276 of 1,280 cases. We interpret this as a protocol- and task-specific assertiveness preference, not as a universal bias against hedging. Our findings highlight benchmark auditing as a central requirement for multilingual judge-bias research.
Large language models excel at technical problem solving in English but struggle when questions are posed in Bangla. While translation offers a practical solution, existing Bangla-English systems frequently mistranslate specialized terminology, altering problem semantics and degrading downstream performance. We present BanglaSTEM, a dataset of 5,000 Bangla-English sentence pairs covering computer science, mathematics, physics, chemistry, and biology. Our pipeline extracts matching passages from official bilingual curriculum textbooks using OCR, then uses LLMs to align sentences and mark technical terms. These aligned examples serve as few-shot prompts for generating over 12,000 new translation pairs from LLMs, avoiding copyright issues. Human evaluators then select the best 5,000 pairs that correctly preserve technical terminology. We also test a term-weighted BLEU metric that gives higher weight to technical words, since standard metrics treat terminology errors and common word errors equally. We show that our weighted metric correlates better with downstream accuracy in code generation and math solving, while standard BLEU gives high scores even for wrong translations. The full implementation, dataset, and model will be made publicly available.
Chain-of-Thought (CoT) in large language models (LLMs) has been widely debated in terms of whether it faithfully reflects an internal reasoning process of models. Parametric faithfulness is a recently proposed metric that uses unlearning to assess whether a model encodes parametric beliefs corresponding to a reasoning chain. This paper refines this metric by accounting for the unintended artifacts of unlearning. We introduce control tasks that unlearn irrelevant knowledge and word-shuffled content and show that these control tasks yield substantial parametric faithfulness values, suggesting the non-negligible effect of unlearning. We also found that control tasks help explain the significant variations in parametric faithfulness observed across different model sizes and CoT lengths. We conclude that the effects of unlearning need to be considered when measuring parametric faithfulness.
Federated learning with heterogeneous client architectures cannot rely on parameter aggregation. Prototype-based methods address architectural heterogeneity by exchanging class-level representations, but naively averaging prototypes across non-IID clients leads to semantic drift and poor inter-class separation. We propose FedPAGR, a framework where heterogeneous clients project their features into a shared consensus space and exchange class prototypes with a central server. The server refines aggregated prototypes through a geometric regularization objective that enforces agreement with client submissions and inter-class angular separation. Clients anchor their classifiers to the refined prototypes and train with a composite objective combining classification, prototype alignment, and entropy regularization. We evaluate FedPAGR across multiple domains, including four image benchmarks and a clinical NLP task using heterogeneous ClinicalBERT variants, with five architectures per federation under severe label heterogeneity (𝛼=0.1). FedPAGR achieves the highest ensemble accuracy across all four image datasets and the highest local test accuracy on low-class and clinical tasks, including a 4.99-point improvement over the strongest baseline on MIMIC-IV, while remaining competitive on high-class benchmarks.
Sentiment analysis involves analyzing text to determine whether the sentiment expressed is positive, negative, or neutral. In the context of online reviews, such as those on Yelp, sentiment analysis helps businesses assess customer satisfaction and identify areas for improvement. Given the large volume of user-generated content, restaurants often struggle to extract actionable insights from feedback, making sentiment analysis an efficient tool for categorizing reviews and highlighting customer concerns. This study focuses on sentiment analysis of Yelp reviews. The main research question is: How can Natural Language Processing (NLP) combined with statistical machine learning methods be applied to classify sentiment in Yelp reviews and provide actionable insights for improving customer satisfaction, service quality, and business performance? The study used 21,000 Yelp reviews, preprocessed with standard NLP steps: tokenization, stop-word removal, and vectorization. Classifiers were compared across traditional machine learning (Logistic Regression, Support Vector Machine (SVM), Naïve Bayes, Random Forest), deep learning methods (CNN, LSTM, BiLSTM, GRU, RNN), and a transformer-based model (RoBERTa). Results showed that RoBERTa outperformed the other candidate methods. These findings highlight the potential of advanced NLP techniques to offer businesses practical ways to address customer complaints, enhance service quality, and drive overall business performance.
Structured span extraction research is siloed by context length, annotation task, and domain, making it difficult to assess how well large language models (LLMs) generalize across realistic extraction settings. We introduce SSA (Structured Span Annotation), a unified evaluation framework bringing together five datasets across four domains: finance, biomedicine, affective analysis, and privacy, under a common JSONL format with character-level offsets. We conduct an exploratory study evaluating seven models (three closed, four open-weight) under three prompting configurations: zero-shot, definition-augmented, and few-shot, formulating extraction as inline XML generation where models reproduce the document with tagged spans. Our results reveal two distinct performance regimes: on tasks requiring complex ontology reasoning, zero-shot performance is near zero (e.g., 0.00% F1 on FiNER-139) but improves substantially with label definitions (e.g., Claude Opus 4.6 rises from 8.8% to 57.5% F1); on pattern-based tasks like PII detection, definitions consistently hurt performance across all models. These findings suggest that prompting strategy must be matched to task structure, and that unified evaluation frameworks spanning varied domains and input lengths are essential for understanding LLM extraction capabilities.
Current automated content moderation systems fail to protect children from harmful YouTube content, particularly in under-resourced, code-switched settings. These systems are often text-only, English-centric, and operate as ’black boxes,’ lacking the multimodal understanding and transparency needed for effective moderation. This thesis proposes a novel hybrid framework for the explainable multimodal detection of harmful content in videos with code-switching. The proposed framework integrates a fine-tuned classifier for accurate, scalable detection with an LLM-powered module that synthesizes the classifier’s internal evidential signals (e.g., text attention and visual heat maps) to generate faithful, human-readable rationales for each decision. As a primary case study, the framework will be developed and validated on an English–Filipino code-switched dataset. Expected contributions include a new dataset publicly available under controlled access (de-identified transcripts, blacked-out frames, extracted feature representations, and metadata via data-sharing agreement) and a blueprint for building more equitable, transparent, and trustworthy AI safety systems.
Retrieval-Augmented Generation (RAG) systems face challenges with complex, multi-hop questions, and iterative agentic frameworks such as Search-R1 (Jin et al., 2025) have been proposed to address these complexities. However, such approaches can introduce inefficiencies, including repetitive retrieval of previously processed information and challenges in contextualizing retrieved results effectively within the current generation prompt. Such issues can lead to unnecessary retrieval turns, suboptimal reasoning, inaccurate answers, and increased token consumption. In this paper, we investigate test-time modifications to Search-R1’s open-source Qwen2.5-7B pipeline to mitigate these identified shortcomings. Specifically, we explore the integration of two components and their combination: a contextualization module to better integrate relevant information from retrieved documents into reasoning, and a de-duplication module that replaces previously retrieved documents with the next most relevant ones. We evaluate our approaches using the HotpotQA (Yang et al., 2018) and the Natural Questions (Kwiatkowski et al., 2019) datasets, reporting the exact match (EM) score, an LLM-as-a-Judge assessment of answer correctness, and the average number of turns. Our best-performing variant (contextualization) achieves a 5.6% increase in EM score and reduces the average number of turns by 10.5% compared to the Search-R1 baseline. While contextualization itself introduces additional LLM calls, our results demonstrate improved answer accuracy and reduced retrieval load.
Eye movement offers valuable insights into human visual attention during assessment of machine-generated texts, yet existing research and resources in this area are limited. To bridge this gap, we introduce Gaze Responses for Evaluating AI Texts (GREAT), a comprehensive dataset capturing human eye-movement features during screen reading of passages generated by large language models (LLMs). The dataset includes raw eye-movement recordings, reading-time measurements, and post-reading evaluations for LLM-generated passage pairs, alongside rigorous validation metrics. The collected eye-movement features demonstrate strong explanatory power in predicting text quality. When integrated with negative log-likelihood (NLL), a commonly used metric for evaluating text quality, they substantially enhance model performance across all standard statistical criteria. These findings demonstrate that eye movement can act as an effective source of information that complements probabilistic metrics for the task of automatic text quality assessment. The full dataset and some processing code are publicly available at https://github.com/qwurd231/GREAT.
Multilingual language models now cover more languages than ever, yet script-sharing low-resource languages remain vulnerable to failures driven by script and dominant-language priors. This dissertation studies one such failure mode, semantic interference, in Square Bai Script, where many forms resemble Chinese characters but often differ in meaning. We argue that current adaptation pipelines underperform not only because Bai is low-resource, but because they treat visible overlap as safe transfer by default. Building on an expert-validated corpus of 28,382 Bai-Chinese sentence pairs, an out-of-domain epigraphic set and a reproducible encoding pipeline, the dissertation will (1) diagnose semantic interference, (2) compare adaptation strategies under realistic compute constraints, and (3) estimate when shared-script transfer helps or harms adaptation. The long-term goal is Bai-capable understanding and generation. The dissertation addresses the prerequisite problem of safe and effective adaptation in a script-sharing low-resource setting.
Federated learning is often framed as a practical trade-off in clinical NLP: safer data handling at the cost of lower predictive performance. We revisit this assumption in a benchmark-specific study of Polish medical text classification. A key issue is evaluation granularity: the test split contains 10,634 rows but only 670 unique normalized text hashes, with 18 inconsistent groups removed in strict grouped evaluation. We therefore compare centralized and federated training under both conventional instance-level scoring and a stricter hash-level protocol that controls duplicate inflation. In the strongest reported settings, federated training matches or slightly exceeds the centralized baseline, reaching instance-level Macro-F1 of 0.8826 ± 0.0177 versus 0.8689 ± 0.0124, and hash-level Macro-F1 of 0.8908 ± 0.0220 versus 0.8841 ± 0.0078. The claim is deliberately narrow: we do not argue that federated learning is generally superior to centralized training, nor do we claim formal privacy guarantees. Rather, we show that in this duplicate-heavy Polish medical text benchmark, conclusions about locality depend strongly on evaluation hygiene.
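The contrast between instance-level and hash-level scoring can be sketched as follows. The abstract does not spell out the normalization, the hash function, or how predictions within a duplicate group are combined, so the lowercasing and whitespace collapsing, SHA-256, and majority vote below are all assumptions, and every function name is illustrative.

```python
import hashlib
from collections import Counter, defaultdict


def text_hash(text: str) -> str:
    """Hash of a normalized text (assumed: lowercase, collapsed whitespace)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def macro_f1(gold, pred) -> float:
    """Unweighted mean of per-class F1 over all labels seen in gold or pred."""
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def collapse_to_hash_level(texts, gold, pred):
    """One (gold, pred) pair per unique hash; drop groups with conflicting gold labels."""
    groups = defaultdict(list)
    for text, g, p in zip(texts, gold, pred):
        groups[text_hash(text)].append((g, p))
    gold_out, pred_out = [], []
    for rows in groups.values():
        if len({g for g, _ in rows}) > 1:
            continue  # inconsistent group: same text, different gold labels
        gold_out.append(rows[0][0])
        # Assumed aggregation rule: majority vote over the group's predictions.
        pred_out.append(Counter(p for _, p in rows).most_common(1)[0][0])
    return gold_out, pred_out


# Toy data: three near-identical rows inflate the instance-level score.
texts = ["Headache and fever", "headache  and fever", "HEADACHE AND FEVER", "dry cough"]
gold = ["A", "A", "A", "B"]
pred = ["A", "A", "A", "A"]

instance_f1 = macro_f1(gold, pred)
hash_gold, hash_pred = collapse_to_hash_level(texts, gold, pred)
hash_f1 = macro_f1(hash_gold, hash_pred)
print(f"instance-level: {instance_f1:.4f}  hash-level: {hash_f1:.4f}")
```

In the toy data the duplicated row is counted three times at instance level but once at hash level, which is the kind of duplicate inflation the stricter protocol is meant to control.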
Fringe platforms like Gab harbor high volumes of hate speech due to minimal moderation and insular communities. Our study examines the factors that determine how hate speech amplifies on these platforms. We prepared a novel dataset of 5K+ threads and 50K+ responses from four fringe platforms (Gab, 4chan, Stormfront, and Vanguard), including both structural features (e.g., timestamps, metadata) and content features (e.g., original text, hate intensity of posts), where hate speech amplification was measured using platform-specific engagement metrics. We trained both Generalized Linear Models and Gradient Boosted Tree models to estimate how several features influence the amplification of hate speech on fringe platforms, and used Shapley value estimates to identify the relative importance of the features. Our analysis shows that research insights from social network analysis (SNA) of mainstream sites like X do not directly generalize to fringe platforms. For instance, our experiments show that using features like thread structure and disagreements in early response windows can give up to 74% lift in Root Mean Squared Error (RMSE) of predicting reply counts for hateful posts on fringe platforms, compared to a baseline model that has features like hate intensity and thread age (which would be considered predictive by regular SNA methods).
Machine Unlearning is a valuable ability of LLMs, enabling the removal of unsafe, outdated, or private information. Existing unlearning methods, however, are often evaluated under the assumption that all facts are equally challenging to forget. Controllable knowledge removal is essential for reliable NLP systems. In this paper, we investigate whether fact popularity influences the efficiency of LLM unlearning. To answer this question, we build the **UNLamb** benchmark, designed to systematically investigate this relationship. It consists of 11.6k question-answer pairs derived from real-world knowledge in Wikidata, explicitly partitioned into rare and popular facts. Using this benchmark, we evaluate four state-of-the-art unlearning methods across three validation sets and two LLMs of different sizes. We show that larger models struggle more to forget popular entities, often damaging related knowledge in the process. In contrast, it is much easier to remove rare facts without side effects.
User-level ADHD-related text classification from social media is methodologically challenging because predictions must aggregate many short posts, performance can be inflated by direct diagnostic leakage, and screening-adjacent settings require calibrated probabilities rather than discrimination alone. We introduce a leakage-aware evaluation framework organized around two controlled axes: evidence budget, i.e., the number of tweets available per user, and leakage control. Within this setup, we compare document-level transformers, strong non-graph embedding-pooling baselines, and heterogeneous graph models combining semantic tweet embeddings, psycholinguistic features, and temporal structure. The main result is regime-dependent: graph aggregation is most useful when user evidence is scarce, whereas simple embedding pooling becomes highly competitive and often slightly stronger as more evidence becomes available. Overall, the main contribution is a controlled benchmarking framework and a clearer account of when graph-based aggregation is actually beneficial.
Active learning (AL) reduces labeled data requirements in NLP, yet most methods optimize label efficiency while ignoring annotation cost. Standard uncertainty sampling assumes uniform effort, leading to suboptimal resource allocation when documents vary in length. Supasan and Athuraliya (2026) introduced CAL-Log, a cost-aware AL variant using logarithmic cost modeling C(x)=α+β log(1+L(x)), where C(x) is the predicted annotation time for document x and L(x) is its token length, grounded in information foraging theory (Pirolli and Card, 1999) and psycholinguistic studies of human skimming (Rayner, 1998). This paper presents CAL-Log in full, extending that preliminary framework with two new contributions: temperature-scaled calibrated entropy and online per-annotator cost adaptation, which together resolve the cold-start calibration bottleneck identified in the prior work. Experiments on ten text classification benchmarks demonstrate a 3.3× speedup over BADGE (Batch Active learning by Diverse Gradient Embeddings; Ash et al., 2020) and 3.9× over Entropy sampling to reach F1=0.80, with large effect sizes (Cohen’s d>0.8). A live annotation deployment with preliminary user evaluation (N=7) confirms that the online cost model produces reading-speed classifications consistent with annotator self-reports, and that a transparency interface successfully communicates the scoring rationale to non-expert users.
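The logarithmic cost model above, C(x)=α+β log(1+L(x)), can be combined with temperature-scaled entropy in a short sketch. The cost formula is the one given in the abstract. Dividing entropy by cost to rank documents is an assumed acquisition rule, not one stated in the abstract, and the default values for α, β, and the temperature are placeholders.

```python
import math


def annotation_cost(num_tokens: int, alpha: float = 1.0, beta: float = 1.0) -> float:
    """C(x) = alpha + beta * log(1 + L(x)), with L(x) the token length."""
    return alpha + beta * math.log(1 + num_tokens)


def softmax(logits, temperature: float = 1.0):
    """Temperature-scaled softmax; higher temperature flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs) -> float:
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cost_aware_score(logits, num_tokens, temperature=1.0, alpha=1.0, beta=1.0) -> float:
    """Assumed ranking rule: calibrated uncertainty per unit of predicted cost."""
    uncertainty = entropy(softmax(logits, temperature))
    return uncertainty / annotation_cost(num_tokens, alpha, beta)


# Two equally uncertain documents: the shorter one ranks higher.
short_doc = cost_aware_score([0.2, 0.1], num_tokens=50)
long_doc = cost_aware_score([0.2, 0.1], num_tokens=5000)
print(short_doc > long_doc)
```

Because the cost grows with the logarithm of length rather than linearly, a document 100 times longer is penalized by an additive term of roughly β·ln(100), not by a factor of 100.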
As large language models (LLMs) trained on massive corpora scraped from the web exhibit the capability to reproduce sensitive and copyright-protected data, the field of machine unlearning has emerged to address the arising ethical and legal concerns. While previous research has provided a unified evaluation of LLM unlearning methods, this unification remains constrained to English-only models and datasets. We aim to address the prevailing fragmentation in recent cross-lingual unlearning research by extending existing unified benchmarks with multilingual data. To that end, we plan to compile a dataset of parallel translations of question-answer pairs consisting of real-world facts and synthetic personally identifiable information. Moreover, we will focus on mitigating model degradation during unlearning by selectively editing only those layers that contain the given knowledge.
AI agents that interact with graphical user interfaces (GUIs) require effective observation representations for reliable grounding. The accessibility tree is a commonly used text-based format that encodes UI element attributes, but it suffers from redundancy and lacks structural information such as spatial relationships among elements. We propose A11y-Compressor, a framework that transforms linearized accessibility trees into compact and structured representations. Our implementation, Compressed-a11y, applies a lightweight and structured transformation pipeline with modal detection, redundancy reduction, and semantic structuring. Experiments on the OSWorld benchmark show that Compressed-a11y reduces input tokens to 22% of the original while improving task success rates by 5.1 percentage points on average.
Counterspeech offers a way to tackle harmful content online without restricting freedom of expression. This work explores counterspeech generation using small language models (SLMs) as lightweight and cost-effective alternatives to large language models. We evaluate SLMs ranging from 100 million to 3 billion parameters using simple prompting strategies as well as fine-tuning, combining automatic and robust human evaluations. Our findings demonstrate that small language models can generate relevant, coherent, and high-quality counterspeech, suggesting their potential suitability for efficient and responsible deployments.
Aspect-Term Sentiment Analysis (ATSA) aims to predict sentiment polarity for specific aspect terms, a task complicated by conflicting sentiments and limited context in short texts. Existing graph-based approaches rely on predefined pairwise structures to capture different linguistic views. However, this leads to two key limitations: (1) their pairwise formulation often requires multiple graphs to improve expressive capacity, and (2) their reliance on predefined parsers or heuristic graph construction limits adaptability to sentence-specific sentiment composition. We propose HyperATSA, a dynamic hypergraph framework that overcomes these limitations through a single instance-specific hypergraph constructed directly from contextual token representations. Hyperedges are dynamically induced via hierarchical agglomerative clustering over token embeddings, where an acceleration-based cutoff identifies sentence-specific semantic groupings and enables adaptive hypergraph construction. Experiments on Lap14, Rest14, and MAMS demonstrate consistent improvements over strong graph-based baselines, suggesting that hypergraph-based relational modeling generalizes effectively to short-text sentiment composition.
Large Language Models specialized for the medical domain achieve high performance on static benchmarks, but remain vulnerable to sycophantic confabulation, where the models generate medically spurious rationales to justify incorrect user hints. This robustness gap poses severe risks in clinical environments, as models may prioritize contextual faithfulness to a biased prompt over their internal parametric medical knowledge. This study introduces a mechanistic approach to identify and mitigate these failures in MedGemma-27B, isolating hint integration circuits using Sparse Autoencoders and geometric manifold analysis. Our findings reveal that sycophantic bias is a highly distributed and polymorphic concept, with biased reasoning routed through shifting dimensions across transformer layers. We identify the optimal layer for intervention and demonstrate that cluster-conditioned dynamic steering tailored to the geometric subspace of the prompt outperforms static global interventions, though it reveals a fundamental tension between bias resilience and the retention of internal parametric knowledge. This work proposes a principled framework toward clinical AI systems that are more robust and aligned with expert medical logic, demonstrating the potential of cluster-conditioned geometric interventions while characterizing the inherent trade-offs in clinical knowledge retention.
When rubric-based feedback tools explain a grade, students and instructors assume those explanations reflect how the score was actually determined. Yet it remains unclear whether explanation components such as rubric assignments and evidence spans reflect how scores are constructed or primarily serve as post-hoc justifications. This gap has direct implications for automated essay scoring and rubric-based feedback tools, where explanation reliability is often assumed but rarely evaluated. We introduce a knowledge graph framework that represents human tutor grading traces as structured objects, enabling controlled counterfactual testing of explanation components. Using 400 grading traces from 10 expert human tutors evaluating 100 narrative essays, we define a reconstruction-based diagnostic to measure how explanation components contribute to score interpretation, independent of prediction. Our results reveal a consistent asymmetry: removing rubric-level information leads to substantial changes in reconstructed scores, while removing evidence spans has minimal impact. This suggests that rubric structure is central to score interpretation, whereas cited evidence spans may function primarily as post-hoc justifications. We further observe tutor-specific variation in grading behavior. These findings highlight the need for explanation mechanisms that better align with scoring processes, ensuring that feedback provided to learners is both interpretable and functionally relevant.
Speaker diarization systems produce segmentation errors, such as false splits and boundary misplacements, that degrade transcript readability and downstream applications. We present CBAL (Context-Based Agentic Learning), a post-processing framework that refines segmentation boundaries in diarized scripts through targeted error correction. CBAL detects potential segmentation errors using acoustic and temporal heuristics and employs a lightweight LLM agent to reason about merge decisions, validating corrections through uncertainty-aware filtering with signal-based constraints. CBAL achieves 93.4% accuracy across 359 applied merges and reduces segment count by 6.1%. We demonstrate that our framework identifies and corrects high-confidence errors while maintaining 0% degradation in terms of concatenated minimum-permutation Word Error Rate (cpWER). An ablation study confirms that each component contributes non-redundantly, demonstrating the viability of interpretable refinement frameworks that use the strengths of acoustic models and language understanding without requiring end-to-end retraining.
Shortcut learning remains a major obstacle to robust NLP systems: models can achieve high in-distribution accuracy by relying on surface cues that fail under distribution shift. We study whether shortcut reliance can be diagnosed and mitigated in small instruction-tuned language models using a simple representation-level quantity. We fine-tune Gemma 3 1B Instruct and Llama 3.2 1B on two synthetic sentiment shortcuts in SST-2 and one natural shortcut in MNLI based on lexical overlap. During training, we fit linear probes for the task label and the shortcut attribute at every layer and define CDRE as the absolute cosine similarity between the two probe directions. Across settings, increasing shortcut prevalence produces a sharp rise in the robustness gap between shortcut-aligned and shortcut-free test sets, and higher deep-layer CDRE tracks this degradation. At a 99% shortcut ratio, Llama’s clean accuracy on capitalization-biased SST-2 drops from 93.2% at 0% bias to 49.0%, while Gemma drops from 91.8% to 60.2%. A CDRE-regularized objective substantially improves robustness for capitalization and lexical-overlap shortcuts, but offers little benefit for a speaker-prefix shortcut whose learned directions are already nearly orthogonal. These results show that probe-derived representation entanglement provides a reliable signal of harmful shortcut reliance and offers a practical criterion for determining when shortcut mitigation is likely to be effective.
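The CDRE quantity defined above, the absolute cosine similarity between the task-probe direction and the shortcut-probe direction, reduces to a few lines once the two probe weight vectors are available. The sketch assumes the probes have already been fitted and takes their weight vectors as plain lists. Probe training and the regularized objective are not shown, since the abstract does not specify them.

```python
import math


def cdre(task_direction, shortcut_direction) -> float:
    """Absolute cosine similarity between two probe weight vectors."""
    dot = sum(a * b for a, b in zip(task_direction, shortcut_direction))
    norm_task = math.sqrt(sum(a * a for a in task_direction))
    norm_shortcut = math.sqrt(sum(b * b for b in shortcut_direction))
    if norm_task == 0.0 or norm_shortcut == 0.0:
        raise ValueError("probe directions must be non-zero")
    return abs(dot / (norm_task * norm_shortcut))


# Hypothetical probe directions at one layer.
print(cdre([1.0, 0.0], [0.0, 1.0]))    # orthogonal directions
print(cdre([1.0, 2.0], [-2.0, -4.0]))  # anti-parallel directions
```

Taking the absolute value means that anti-parallel directions count as fully entangled, while a value near zero corresponds to the nearly orthogonal case that the abstract reports for the speaker-prefix shortcut.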
Patient health literacy is critical to health outcomes, yet medical discharge summaries remain inaccessible to many patients due to jargon and complex language. Large language models (LLMs) offer a promising means of bridging this gap, but their deployment in resource-constrained hospital environments demands lightweight, privacy-preserving solutions. We evaluate a range of open- and closed-source LLMs on the MeDiSumQA dataset, comprising real patient discharge summaries paired with lay questions and clinician-verified answers, and demonstrate that larger open-source models achieve accuracy and semantic similarity performance comparable to GPT-5. We then introduce LAMP-MedQA, a lightweight multi-agent framework for patient-oriented medical question answering. The framework decomposes the task into two sequential stages: question-relevant evidence extraction and patient-facing answer simplification. Each stage is governed by an automated, metric-driven feedback loop that enables iterative self-correction without human-in-the-loop supervision. Using Qwen2.5-7B-Instruct for generation agents and Phi-3.5-Mini-Instruct for reviewer/verifier agents, it achieves significantly lower FKGL than zero-shot GPT-5, indicating better readability, and obtains the highest simplification quality (SARI) among all evaluated models, while remaining broadly competitive on accuracy and semantic similarity. This competitiveness is further improved by an offline medical glossary, which narrows the gap in n-gram overlap and contextual-similarity metrics. These results suggest that collaborative lightweight agents represent a viable approach to improving health literacy in clinical settings. Our code is available at: https://github.com/JackJ3636/LAMP-MedQA
Scientific NLP systems often require outputs that satisfy strict, machine-checkable constraints. In this work, we study structured-generation controllability along three axes: structural control, iterative correction, and decoding dynamics. Diffusion decoding is of particular interest because its iterative refinement may improve global structure and revision behavior, but may also introduce distinct failure modes such as termination instability and repetition. To quantify controllability, we evaluate compliance with five machine-checkable constraints: (i) required headings and (ii) correct ordering, which reflect global structural control; (iii) explicit end markers and (iv) per-section bullet constraints, which probe local constraint adherence; and (v) repetition avoidance, which captures generation stability under different decoding dynamics. We use these metrics to assess both single-pass generation and changes under iterative correction. Our goal is to isolate structural reliability under parser-facing requirements rather than to directly measure scientific correctness. Across our benchmark, diffusion models tend to better preserve global structure, while iterative improvement substantially improves explicit termination and other local control constraints. Hybrid systems show mixed behavior depending on decoding order. These results suggest that machine-checkable controllability can be usefully decomposed into global structure versus local control, and that the two may benefit from different inference-time strategies.
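The five machine-checkable constraints listed above can be turned into a small checker. The abstract names the constraint types but not the concrete format, so the `## ` heading syntax, the `<END>` marker, the three example headings, the bullet limit, and the duplicate-bullet test for repetition are all hypothetical choices made for this sketch.

```python
REQUIRED_HEADINGS = ["Background", "Method", "Result"]  # hypothetical schema
END_MARKER = "<END>"                                    # hypothetical marker
MAX_BULLETS = 3                                         # hypothetical limit


def check_constraints(text, required=REQUIRED_HEADINGS,
                      end_marker=END_MARKER, max_bullets=MAX_BULLETS):
    """Return a pass/fail flag for each of the five constraint types."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    headings = [ln[3:] for ln in lines if ln.startswith("## ")]

    # Count bullets under each heading.
    bullets_per_section = {}
    current = None
    for ln in lines:
        if ln.startswith("## "):
            current = ln[3:]
            bullets_per_section[current] = 0
        elif ln.startswith("- ") and current is not None:
            bullets_per_section[current] += 1

    bullets = [ln for ln in lines if ln.startswith("- ")]
    found = [h for h in headings if h in required]
    expected_order = [h for h in required if h in found]
    return {
        "required_headings": all(h in headings for h in required),
        "correct_ordering": found == expected_order,
        "end_marker": bool(lines) and lines[-1] == end_marker,
        "bullet_limit": all(1 <= n <= max_bullets
                            for n in bullets_per_section.values()),
        "no_repetition": len(bullets) == len(set(bullets)),
    }


good = "## Background\n- a\n## Method\n- b\n- c\n## Result\n- d\n<END>"
bad = "## Method\n- x\n- x\n## Background\n- y"
print(check_constraints(good))
print(check_constraints(bad))
```

Returning one flag per constraint, rather than a single pass/fail, makes it possible to report global structure (headings and ordering) separately from local control (end marker and bullet limit), which is the decomposition the abstract argues for.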
We study cross-lingual transfer by fine-tuning seven large language models (4B–671B parameters) on Arabic and evaluating zero-shot reading comprehension on Semitic languages and non-Semitic controls. Across dense and Mixture-of-Experts architectures, we find no evidence of Semitic-specific transfer: models with weak baselines improve dramatically across all languages, while strong-baseline models show only marginal gains regardless of language family. A chain-of-thought ablation reinforces this finding: the same models that benefit most from fine-tuning benefit equally from inference-time reasoning, suggesting both mechanisms address task-format alignment rather than cross-lingual knowledge transfer.
As large language models (LLMs) grow, their compute and memory demands become prohibitive for on-device deployment. Quantization is a crucial technique to shrink model footprint and accelerate inference, but aggressively low-bit weight-activation quantization schemes often sacrifice accuracy. Quantization Aware Training (QAT) is a commonly used paradigm to minimize quantization noise, but is extremely expensive to train and often unscalable to large models. We introduce PE-QAT, a parameter-efficient framework targeting per-channel 4-bit weight-activation quantization of LLMs, which aims to preserve model accuracy while significantly reducing resource requirements. The proposed method freezes the base model and trains lightweight LoRA adapters by fake quantizing the merged-weight model, enabling PE-QAT to scale efficiently unlike full QAT. We apply fake quantization with Straight-Through Estimators (STE) to the merged weights, allowing the adapters to explicitly compensate for quantization noise during training. One of the biggest challenges with quantizing activations alongside weights is addressing outliers that are orders of magnitude larger than other activations, which inflate quantization scales and suppress lower-magnitude values. To mitigate the impact of severe activation outliers, PE-QAT jointly learns per-channel smoothing factors and symmetric activation clipping thresholds. PE-QAT retains accuracy within 0.11 percentage points of the full-precision baseline on Llama-2-7B zero-shot tasks while training only 1.26% of total parameters.
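The outlier problem described in the abstract above can be illustrated with a minimal sketch of symmetric per-channel 4-bit fake quantization (quantize, then immediately dequantize). This is not PE-QAT's implementation: the signed range [-8, 7], the max-abs scale rule, and the function name `fake_quantize_channel` are assumptions chosen for illustration. Only the forward pass is shown; a Straight-Through Estimator concerns the backward pass, where the rounding step is treated as identity.

```python
def fake_quantize_channel(values, bits=4):
    """Quantize one channel to signed `bits`-bit integers, then dequantize.

    Assumption for illustration: symmetric scale = max|v| / qmax.
    """
    qmax = 2 ** (bits - 1) - 1      # 7 for 4 bits
    qmin = -(2 ** (bits - 1))       # -8 for 4 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / qmax
    out = []
    for v in values:
        q = round(v / scale)            # nearest integer level
        q = max(qmin, min(qmax, q))     # clip to the representable range
        out.append(q * scale)           # back to floating point
    return out


# A single large outlier inflates the scale, so small values collapse to zero.
well_behaved = fake_quantize_channel([1.4, -0.6, 0.0])
with_outlier = fake_quantize_channel([100.0, 0.3, -0.2])
```

In the second call the scale is set by the outlier at 100.0, so the two small entries round to level 0. Suppressed low-magnitude values of this kind are what smoothing factors and clipping thresholds are meant to counteract.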
Thinking Mode Fusion (TMF) enables large language models to support both concise responses and long-form reasoning by unifying a non-thinking mode and a thinking mode within a single model. However, its training dynamics, including the data ratio and training schedule between the two modes, remain underexplored. In this work, we present a systematic study of TMF by analyzing the effects of the training schedule and data ratio between thinking and non-thinking modes. Focusing on mathematical problem solving, we construct a benchmark with multiple thinking-to-non-thinking data ratios and three training schedules. Our results reveal an asymmetric interaction between the two modes: increasing the ratio of non-thinking supervision reduces the accuracy of the thinking mode. We further show that different training schedules modulate this trade-off and that the optimal schedule depends on the data ratio. Finally, we quantify a negative correlation between non-thinking and thinking mode supervision, highlighting an inherent tension between these two modes. These findings provide practical guidance for designing effective TMF training settings. All code and data are released to support further research at: Fusion Bench.
We ask whether topic sentiment has a causal effect on perceived political ideology, and whether the answer depends on who assigns the ideology label. Using articles from AllSides, paired with shared sentiment annotations from Llama-3.3-70b-versatile, we compare ideology labels from expert human annotators, GPT-4o-mini (baseline and finetuned), and Llama-3.3-70B. We apply Double Machine Learning (DML) and community-level mediation analysis across all four annotation paradigms. Human annotations yield no significant causal effects at the community level. Fine-tuned GPT-4o-mini achieves the highest classification accuracy (F1=72.48) and is the only annotator paradigm that produces significant community-level treatment effects and significant natural direct effects (NDEs) in mediation. We interpret this as evidence of shortcut learning: fine-tuning on ideology-labeled data causes the model to internalise a spurious sentiment–ideology coupling not operative in human judgment for this task. This coupling is structurally invisible to F1-based evaluation, with implications for the use of LLM annotations as silver labels and as proxies for human judgment in downstream causal analyses.
In conversational implicatures, speakers convey hidden intended meanings beyond the literal content of their utterances, and hearers are expected to infer what is implied. This study examines how Large Language Models (LLMs) interpret conversational implicatures, using human interpretation as a baseline and gold standard for comparison. The same experiments were conducted with two types of participants: humans and LLMs. Two metrics were adopted: a surprisal-based metric and a response-based metric. The results suggest that the response-based metric demonstrates higher accuracy, comparable to human responses, than the surprisal-based metric. In particular, humans and LLMs using the response-based metric performed better in the literal condition than in the implied condition. Additionally, they were more sensitive to capturing implied meanings for some-all trigger than for other triggers, whereas they showed lower performance on Manner implicatures. Overall, LLMs employing the response-based metric tend to exhibit human-like behavior, but still diverge from humans in their understanding of conversational implicatures.
Automatic Speech Recognition (ASR) has achieved strong performance for high-resource languages, but dense intra-sentential code-switched speech in African low-resource settings remains underexplored. Existing multilingual and pretrained ASR systems improve general recognition accuracy, yet they remain weak at switch regions, are sensitive to language imbalance during adaptation, and are typically evaluated with metrics that obscure switching-specific errors. This thesis proposes a self-adaptive and epistemic uncertainty-guided framework for African low-resource code-switched ASR, using Hausa–English (Engausa) and Hausa–Yorùbá as case studies. The work investigates three linked questions: (1) how to design a linguistically informed code-switched corpus with explicit switch-region annotation and labeled/unlabeled partitions for adaptive learning, (2) whether epistemic uncertainty is systematically elevated around switch regions and can guide pseudo-label selection in semi-supervised training, and (3) whether switch-aware adaptation with auxiliary language identification and boundary supervision can reduce recognition errors without increasing catastrophic forgetting. The long-term goal is to develop scalable and data-efficient ASR systems that model code-switching as a structured linguistic phenomenon rather than as noise in multilingual African speech.
Organizations must continuously monitor evolving regulations to maintain compliance. While current tools are limited to surface-level text comparison, existing models lack the fine-grained classification schemes to determine whether small changes impact legal obligations or merely update formatting. To address this gap, we introduce a novel benchmark for change detection in EU regulations. It comprises 4,772 manually annotated pairs of structurally distinct provisions, defined as Atomic Legal Units (ALUs), mapped to a six-class taxonomy of legal change types. We formalize three core tasks: structural alignment, change classification, and a combined task requiring simultaneous alignment and classification. Evaluating lexical algorithms, dense encoders, and Large Language Models (LLMs) as baselines, we find LLMs excel at isolated change classification, whereas domain-specific dense encoders offer the most robust combined performance. By providing fine-grained labeled data, this benchmark enables the development of AI systems that can help organizations analyze regulatory shifts and support version-aware retrieval in the legal domain.
Masked language modeling for low-resource ancient languages remains challenging because pre-trained multilingual models lack exposure to these languages. We investigate rule-based linguistic constraints and hard negative mining for Sumerian, a language isolate not included in multilingual BERT’s training data. We build a hierarchical validator capturing subword, word, and part-of-speech patterns from 4,545 annotated sequences, using it to filter candidates and identify hard negatives for fine-tuning. Vanilla mBERT achieves 18.0% hit@10 accuracy. The validator alone improves this to 72.8%, while hard negative fine-tuning reaches 78.3%. Combining both yields 86.7%, a 68.7 percentage point improvement. Temporal generalization evaluation on tablets from 600 years earlier shows that both the hard negative mining and the validator alone improve performance, but the combined approach underperforms due to the validator’s period-specific rules. These findings demonstrate that hard negative mining transfers across periods while explicit rule-based constraints provide strong in-domain improvements but limited cross-period generalization.
LLM routing directs queries to a cheaper model when it suffices and to an expensive model otherwise, reducing inference cost. Existing input-based routers optimize cost-performance trade-offs but provide no formal bound on how often the cheaper model fails among routed queries. We adapt a proactive conformal gate framework to LLM routing. A logistic regression gate trained on text embeddings predicts per-query safety, and Clopper-Pearson conformal calibration selects a routing threshold that guarantees the violation rate among routed queries stays below 𝛼 (the violation tolerance) with probability at least 1 - 𝛿 (the confidence level). On two benchmarks covering math reasoning (GSM8K) and multi-domain knowledge (MMLU), routing between Mixtral-8x7B and GPT-4 (a 24.5× cost difference), our method maintains the target 𝛼 within the 𝛿 tolerance across a sweep from 0.05 to 0.50, while a validation-tuned baseline crosses the violation boundary on GSM8K. A feasibility analysis across all 10 RouterBench models reveals that routability is jointly model- and task-dependent. To our knowledge, this is the first input-based LLM router with distribution-free safety guarantees.
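The calibration step in the abstract above can be sketched as follows. This is a simplified illustration, not the authors' procedure: the function names, the use of bisection, and the rule of scanning thresholds from strictest to most permissive and stopping at the first failure are all assumptions. The Clopper-Pearson bound itself is the standard exact one-sided binomial bound.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson_upper(k, n, delta):
    """One-sided (1 - delta) upper bound on a rate, given k violations in n trials."""
    if k >= n:
        return 1.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection: binom_cdf is decreasing in p
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > delta:
            lo = mid
        else:
            hi = mid
    return hi


def select_threshold(scores, violated, thresholds, alpha, delta):
    """Pick the most permissive threshold whose bound stays at or below alpha.

    scores: gate scores on a calibration set (higher = safer to route).
    violated: True where the cheaper model failed on that query.
    Returns None if even the strictest non-empty threshold fails.
    """
    chosen = None
    for t in sorted(thresholds, reverse=True):  # strictest first
        routed = [v for s, v in zip(scores, violated) if s >= t]
        if not routed:
            continue
        if clopper_pearson_upper(sum(routed), len(routed), delta) > alpha:
            break  # stop at the first threshold that cannot be certified
        chosen = t
    return chosen
```

With zero violations among n routed calibration queries the bound reduces to 1 - delta^(1/n), which shows why small calibration sets can only certify loose tolerances.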
Turkish idiomatic light verb constructions (LVCs) are challenging for multiword expression processing because they often share the same surface form as fully literal verb–object combinations while functioning as a single, partially idiomatic predicate. We frame Turkish LVC detection as a binary classification task (literal meaning vs. idiomatic meaning) and evaluate on a manually created controlled set (N=147) with matched negatives: out-of-domain random sentences and in-domain literal controls (NLVC), alongside LVC positives. We compare a supervised Turkish encoder baseline (BERTurk with a classifier head) to three instruction-tuned LLMs from different families under zero-shot, one-shot, and few-shot prompting, and analyze how demonstrations shift error profiles. In zero-shot, LLMs perform well on negatives but show very low LVC recall. One-shot prompting sharply improves LVC detection but can induce strong, model-specific biases (over- vs. under-predicting LVC). A richer few-shot prompt improves calibration and yields robust overall performance for GPT-OSS-20B and Qwen 2.5-14B. Overall, the results highlight substantial prompt sensitivity in Turkish metalinguistic classification: the supervised baseline remains competitive, while prompted LLMs can match or exceed it on LVCs with carefully constructed demonstrations. We release code, prompts, and evaluation materials to support reproducibility.
Large language models (LLMs) are increasingly deployed in high-stakes settings, yet reliably estimating when their outputs should be trusted remains an open challenge. Existing uncertainty estimation approaches—such as calibration, token-level probabilities, or semantic entropy—typically require access to model internals, additional supervision, or computationally intensive pipelines. We propose answer instability, defined as the variability of a model’s final answer across repeated stochastic generations of the same prompt, as a simple, label-free, and black-box uncertainty signal. Evaluated across three task types — reasoning, multiple-choice QA, and constraint-following — using four LLMs and 520 prompt-model pairs, our approach achieves performance competitive with semantic entropy while requiring no semantic similarity model. Our results show that instability strongly correlates with prediction errors and reliably discriminates correct from incorrect outputs. We further demonstrate its utility for selective prediction and targeted repair, improving reliability without access to internal probabilities or additional training.
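As a rough illustration of the signal described above, instability can be computed from nothing more than the final answers of repeated generations. The abstract does not give a formula, so the definition used here (one minus the share of the most frequent answer) and the function name `answer_instability` are assumptions.

```python
from collections import Counter


def answer_instability(final_answers):
    """Return 1 - (share of the most common final answer).

    0.0 means every sample agreed; values near 1.0 mean the answers scatter.
    Assumes answers are already normalized strings (e.g. stripped, lowercased).
    """
    if not final_answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(final_answers)
    modal_count = counts.most_common(1)[0][1]
    return 1.0 - modal_count / len(final_answers)


# Four hypothetical samples of the same prompt, three of which agree.
score = answer_instability(["4", "4", "5", "4"])
```

A selective-prediction rule could then abstain whenever the score exceeds a chosen cutoff. The cutoff would need tuning per task.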
Humans play a vital role at every stage of AI development, from data collection and curation to model development and evaluation. However, humans often disagree with each other and sometimes with themselves over time. It is essential to take disagreement into account when building human-centered AI systems, especially in domains where it is prevalent, such as AI safety, content moderation, or sentiment analysis. Disagreement often arises from subjective human opinion and can vary with one’s identity, beliefs, and social environment. Despite this, current LLM evaluation approaches frequently rely on aggregating labels (often via plurality voting) to represent consensus, thereby obscuring minority perspectives. By failing to account for human disagreement, these evaluation methods contribute to the reproducibility crisis in AI. Human feedback is also crucial for ensuring that AI systems align with human values. For these systems to be trustworthy, it is critical to ensure that they reflect diverse human values and perspectives. In this thesis proposal, we present a human-centered and perspective-aware framework for reproducible ML evaluation and AI alignment.
Large language models applied to clinical prediction exhibit case-level heterogeneity: simple cases yield consistent outputs, while complex cases produce divergent predictions under minor prompt changes. Existing single-agent strategies sample from one role-conditioned distribution, and multi-agent frameworks use fixed roles with flat majority voting, discarding the diagnostic signal in disagreement. We propose CAMP (Case-Adaptive Multi-agent Panel), where an attending-physician agent dynamically assembles a specialist panel tailored to each case’s diagnostic uncertainty. Each specialist evaluates candidates via three-valued voting (KEEP/REFUSE/NEUTRAL), enabling principled abstention outside one’s expertise. A hybrid router directs each diagnosis through strong consensus, fallback to the attending physician’s judgment, or evidence-based arbitration that weighs argument quality over vote counts. On diagnostic prediction and brief hospital course generation from MIMIC-IV across four LLM backbones, CAMP consistently outperforms strong baselines while consuming fewer tokens than most competing multi-agent methods, with voting records and arbitration traces offering transparent decision audits.
Deploying machine learning models in real-world domain-specific scenarios is challenged by the scarcity of expert annotations and by data drift, where the statistical properties of incoming data continuously evolve. Active Learning (AL) iteratively improves compact models with expert annotations but suffers from recurring cold-start degradation, while LLMs provide strong off-the-shelf performance yet cannot leverage newly accumulated labels, raising the question: how can we better leverage LLMs to assist the active learning process? Through an empirical study on five legal and biomedical datasets, we reveal a complementary temporal dynamic: LLMs excel during early and post-drift stages, while AL-assisted compact models eventually surpass them as annotations accumulate. Motivated by this finding, we propose an ensemble system that combines an LLM, an AL-assisted compact model, and an automatic switch module that routes predictions to the better-performing model in real time. Evaluated under simulated data drift on two mental health datasets, our system achieves 96–98% switch accuracy and consistently outperforms either model used alone.
Large language model (LLM) agents that invoke external tools must make sequences of interdependent decisions, yet existing uncertainty quantification (UQ) methods treat each step in isolation, ignoring how confidence evolves and compounds across a full task trajectory. We propose a framework for trajectory-level confidence analysis in the tool-use agent setting. The thesis pursues three aims: (1) estimating action-level confidence by adapting step-wise UQ to the heterogeneous think-act-observe cycles of tool-using agents; (2) aggregating the diverse action space into semantically coherent action types to enable meaningful trajectory-level analysis; and (3) discovering temporal patterns in the resulting confidence trajectories that reliably predict task success or failure. We ground the work in standard tool-use benchmarks and expect the framework to expose early warning signals for agent failure and offer interpretable diagnostic tools for understanding when and why LLM agents lose confidence, with improved calibration of multi-step agentic pipelines as a secondary benefit.
Crowdsourced annotators and Large Language Models (LLMs) offer complementary, cost-effective ways to obtain labeled data, yet ensuring high label quality remains challenging. We observe that task features influence the accuracy of humans and LLMs, while real-world constraints, such as per-annotator assignment limits, further complicate allocation. Prior work typically addresses either task features or constraints, but not both. We present an integrated framework that (i) estimates per-task accuracy from task features using a learning from crowds model and (ii) incorporates these estimations into a linear programming formulation that assigns tasks under practical constraints. Experimental results demonstrate that the proposed method achieves accuracy comparable to that of baseline methods while satisfying given constraints.
We propose Dynamic Meta-Metrics (DMM), a framework for machine translation evaluation that learns source-sentence conditioned combinations of existing metrics. Rather than relying on a single static ensemble or language-specific weighting, DMM adapts the metric combination based on properties of the source segment. We study hard conditioning, which fits an interpretable combiner per cluster, and an exploratory soft-conditioned extension whose weights vary continuously with source-cluster responsibilities. We evaluate DMM on the WMT Metrics Shared Task data across multiple language pairs using pairwise agreement measures at the system and segment levels. Across settings, MLP-based combinations outperform linear and Gaussian process-based ensembles, and introducing soft conditioning yields gains over linear models.
Speech-based screening for mild cognitive impairment offers a highly accessible way to detect early cognitive decline. While most existing work focuses on English, cross-linguistic research is emerging to examine how cognitive decline manifests across languages. Studies on the Interspeech 2024 TAUKADIAL dataset, comprising English and Chinese speech recordings, consistently report higher classification performance on Chinese, yet the cause of this cross-lingual discrepancy remains unexplored. We examine this gap using Gemini 2.5 Pro, a multimodal large language model, using zero-shot and in-context-learning (ICL) paradigms. We hypothesize that this disparity is rooted in language typology: in tonal languages like Chinese, pitch encodes lexical meaning in every syllable (tone), whereas in non-tonal languages like English, pitch carries no lexical function. To test this, we pitch-flattened audio from TAUKADIAL and compared how classification performance changed across both languages. We found that Chinese classification degraded significantly under both zero-shot and ICL conditions (-4.78 and -5.92 UAR, respectively), while English performance increased (+0.11 and +2.98 UAR), implicating tonal pitch as the cross-lingual advantage. These findings suggest language typology should inform the design of audio-based cognitive screening tools, with raw audio preferred for tonal languages and text for non-tonal languages, a distinction critical for developing equitable cross-linguistic screening.
Tool-using LLM agents are typically compared by accuracy alone, despite deployments being constrained by inference cost. We present a budgeted evaluation of common strategies for improving ReAct-style tool agents (multi-sample aggregation, iterative self-correction, and post-hoc answer revision) using Pareto analysis of cumulative accuracy versus token budget on three benchmarks (HotPotQA, FEVER, GSM8K) with Gemini 2.5 Flash. All experiments use three random seeds (N=500 per seed for HotPotQA/FEVER; N=1,015 for GSM8K); budgeted curves are computed post hoc from per-instance token logs. In our offline evaluation, Reflexion attains the highest accuracy on tool-heavy benchmarks (HotPotQA, FEVER), while CoT-SC leads on GSM8K. Reflexion’s reported token costs are optimistic lower bounds, and its accuracy is similarly optimistic, because retries are stopped using ground-truth feedback; a deployment without gold labels would lack this stopping criterion, so both costs and accuracy would differ in practice. Sampling-based approaches often spend 3–5× more tokens for comparatively small gains on tool-heavy tasks. GSM8K, a pure-math benchmark with minimal tool interaction, shows substantially larger gains for CoT-SC, TCAR, and Reflexion than the tool-heavy benchmarks do, though less sharply separated than headline accuracy alone would suggest. This pattern is consistent with repeated tool trajectories being an important contributor to the observed efficiency gap in our tool-heavy settings. We provide a compute-aware evaluation protocol (frontier analysis and marginal-cost metrics) and practical guidance for choosing agent designs under different budget regimes.
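The frontier analysis mentioned above can be sketched in a few lines. This is a generic Pareto-front computation over (token cost, accuracy) pairs, not the authors' evaluation code. The function name and the toy inputs are hypothetical, and the numbers are placeholders rather than results from the paper.

```python
def pareto_front(configs):
    """Return names of configurations not dominated on (cost, accuracy).

    configs: list of (name, token_cost, accuracy) tuples.
    A configuration is dominated if another one costs no more and is at
    least as accurate, with one of the two strictly better.
    """
    # Sort by cost ascending; among equal costs, put higher accuracy first.
    ordered = sorted(configs, key=lambda c: (c[1], -c[2]))
    front = []
    best_accuracy = float("-inf")
    for name, _cost, accuracy in ordered:
        if accuracy > best_accuracy:  # strictly better than anything cheaper
            front.append(name)
            best_accuracy = accuracy
    return front


# Placeholder values for illustration only.
toy = [("A", 100, 0.60), ("B", 300, 0.62), ("C", 250, 0.70), ("D", 500, 0.70)]
front = pareto_front(toy)
```

Here `B` is dominated by `C` (cheaper and more accurate) and `D` matches `C` on accuracy at twice the cost, so only `A` and `C` remain on the front.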
Medical education resources are dense for common diseases but often sparse for under-covered conditions, atypical presentations, and fine-grained concept distinctions. This creates curriculum gaps that are difficult to repair manually, especially in long-tail domains where structured teaching materials are limited. We introduce Curriculum-Gap Completion (CGC), a new task for Large Language Model (LLM)-based medical education in which a model reconstructs missing educational units from a partially specified curriculum graph. Given topic nodes, pedagogical relations, and structured teaching slots, the model predicts omitted concepts, restores missing instructional links, and completes automatically verifiable teaching content. We instantiate this setting in a long-tail medical case study (hyperhidrosis) and evaluate five LLMs under three methods: direct prompting, retrieval-augmented prompting, and our proposed Structure-Aware Curriculum-Gap Completion (SACGC) framework. Across models, SACGC achieves the strongest overall performance, with the largest gains on structurally demanding masking settings. Ablation results show that explicit graph structure is the most important component, while schema constraints provide additional benefit. These findings suggest that LLMs are better suited for reconstructing an under-specified educational structure than for unrestricted medical tutoring, and they motivate CGC as a new natural language processing (NLP) problem for healthcare education.
Transformer language models reliably achieve high accuracy on many reasoning tasks; however, their internal mechanisms are not fully understood. Mechanistic interpretability seeks to remedy this gap by identifying task circuits within individual models, but it is unclear whether such circuits generalize across model families and scales. In this work, we study the universality of circuits through the lens of numerical comparisons, a simple and controlled task that enables clean and causal interventions. We conduct experiments on a set of transformer models spanning different families and sizes from 1.7B to 9B parameters. We find that models within the Qwen family exhibit a highly consistent circuit structure across architecture and scale, featuring localized attention heads that write a task-relevant signal. In contrast, models from other families show qualitatively different implementations, where task-relevant information emerges much earlier and is distributed across components as opposed to being concentrated within a small set of attention heads. These results serve as evidence that task behavior similarities do not imply mechanistic universality and highlight the necessity for cross-model comparisons to claim generalization of internal circuits.
Modern LLMs have demonstrated advanced reasoning skills, including the ability to solve Olympiad-level mathematics problems. While solving more and more difficult problems is a hallmark of LLM progress, less attention has been placed on how "difficulty" is operationalized in the context of LLM problem solving tasks. This is particularly relevant in educational contexts where teachers or students may ask LLMs for "easy" or "hard" questions. In this paper, we explore various quantitative measurements from LLM-generated solutions and evaluate their inter-correlations, as well as their correlation to human-annotated difficulty scores. We find moderate correlations between metrics using log probabilities and output lengths, including some that are more strongly correlated to difficulty than LLM accuracy. We also train ModernBERT to predict difficulty scores, leading to reasonable accuracy within a given benchmark, but decreased performance when generalizing to other math benchmarks. Finally, to explore connections between difficulty scores and human performance, we collect problems, human solutions, and human performance data from the Putnam competition. We find poor alignment between LLM metrics and human-assigned difficulty scores, despite strong correlations between those scores and human performance on the problems.
Frequent revisions of complex regulatory documents in large organizations often introduce inconsistencies and contradictions that are difficult for lawyers and auditors to detect manually. Existing tools rely on character-level diffs and therefore miss paraphrases and semantic shifts. We introduce LegDiff, a novel benchmark for evaluating span-aware semantic comparison of legal texts, and use it to investigate the ability of large language models to detect semantic changes beyond token- and character-level matching. LegDiff comprises manually annotated pairs of legal paragraphs drawn from different documents. In addition, we present a pipeline to generate synthetic training data that aligns with the manual annotations and mirrors the structure and label distribution of the manually curated benchmark, and a visualization tool for clearly displaying detected differences and inconsistencies. The dataset, code, and a visualization tool are publicly available to facilitate reproducibility and further research (https://github.com/s-nlp/SLeDoC).
Modern generative AI produces fluent text, polished slides, and clean diagrams, yet still fails when an artifact must serve a specific purpose for a specific reader, used by a specific presenter. The missing piece is not fluency but a model of why content is being produced, for whom (presenter and audience alike), and how it should adapt as goals shift. My completed and published work develops five systems across the scientific communication pipeline: ADAPTIVE IE for intent-driven extraction; Persona-Aware Slide Generation for audience reframing rather than blanket simplification; GPA for reconciling divergent group preferences; SciDoc2Diagrammer-MAF, whose multi-aspect critics distinguish purposeful abstraction from genuine omission or hallucination; and SMART-Editor, which models cascading edits across multimodal layouts. Together they show that aligning with intent, audience, and structure is necessary, but cannot answer whether the resulting artifacts actually communicate. I therefore propose three directions in priority order: (RQ1) a goal-driven framework that measures the educational utility of document-to-video generation through IRT-calibrated diagnostic questions, validated against measured learning outcomes and accompanied by inter-annotator agreement studies on human effectiveness judgments; (RQ2) presenter-side personalization that treats the presenter, not just the audience, as a first-class user; and (RQ3) a unified SuperPersonalization benchmark for transferable user preferences. RQ3 is scoped to be deferrable to post-dissertation work if RQ1 expands. The thesis shifts the target from generative AI that produces content that looks correct to systems whose outputs demonstrably communicate.
We argue that LLM-based coding agents frequently fail to solve problems that lie within the model’s capacity, and that the bottleneck is often the conditioning context rather than the model itself. We formalize this for the full class of Turing-computable problems with verifiable specifications and introduce a framework that recasts coding as optimization over conditioning contexts that influence the generation of natural-language solution intentions. Guided by execution feedback, the method searches this continuous context space to steer a coding agent toward correct solutions. The method operates as a plug-in layer that can wrap any coding agent without modifying its architecture or weights. On SWE-Bench Verified, our method raises the resolution rate of a weak, quantized 24B open-weight model to parity with frontier models over 25× its size.
Self-consistency improves large language models by sampling multiple reasoning paths and selecting the most frequent answer, but majority vote often fails to recover correct answers that are already present among samples. In this work, we reformulate answer selection in self-consistency as a ranking problem. Instead of relying on a single uncertainty or confidence signal, we train a lightweight reranker to score candidate answers using five carefully designed features that capture answer-level frequency, semantic centrality, and reasoning-trace consistency. We instantiate this approach with a LambdaRank model and evaluate it on three datasets under a range of test-time budgets. Across datasets, the proposed method consistently achieves a better accuracy-efficiency trade-off than standard self-consistency and strong baselines, with particularly large gains on question answering benchmarks. Further analysis shows that the proposed features are individually useful and, more importantly, complementary, highlighting the value of learning to combine multiple informative signals for test-time answer selection.
Negotiation involves complex emotional and strategic dynamics that pose challenges for AI agents in negotiation dialogues. This paper proposes a zero-shot soft-labeling method using a large language model-based embedding model and verifies its performance on negotiation dialogues. Furthermore, it examines the performance of predictive model training on rule-based annotated hard and soft labels obtained by the proposed method for the task of predicting whether agreement will be reached from partial dialogues, namely, final disagreement anticipation in negotiation mid-dialogues (FDANMD). Soft labeling obtained by the proposed method showed a maximum HIT@3 score of 0.87 against rule-based annotated hard labels, whereas failure cases also demonstrated the limitations of rule-based annotation. Furthermore, using ROC AUC, evaluations of FDANMD across three datasets (CB, DN, and JI) with negotiation progress rates of 0.25, 0.5, and 1.0 revealed that soft labeling is particularly effective at low negotiation progress rates and also offers superior performance on individual datasets and unseen datasets for models trained on multiple datasets. These results motivate the use of soft labeling to incorporate the complexity of negotiation dialogues into intermediate representations and support the generalizability of zero-shot soft labeling and generalizable predictors across a wide range of negotiations beyond known domains.
We investigate the extent to which the language processing of LLMs resembles human cognitive processes, focusing on a human cognitive bias called the *neglect-zero effect*. This effect refers to the human tendency to ignore *zero-models*, which are configurations that render a proposition vacuously true by virtue of an empty set. We focus on two types of inferences driven by the neglect-zero effect, and examine how LLMs process these inferences by comparing their behavior with that in an inference that does not involve the neglect-zero effect. For this purpose, we employ a paradigm based on *structural priming*, where recent exposure to a preceding sentence (the *prime*) facilitates the processing of a subsequent sentence (the *target*) due to their structural similarity. We prepare primes to force LLMs to consider the zero-model, and analyze whether they also consider it in the target. The results suggest that the neglect-zero effect may not occur in the LLMs analyzed in this study. Our code is available at https://github.com/ynklab/neglect_zero.
Low-resource agglutinative languages, characterized by rich morphological inflection and severe vocabulary sparsity in corpora, have long posed numerous challenges in the field of representation learning. Word-level representations preserve semantic integrity but struggle to handle sparse surface forms, whereas morpheme-level representations, though easier to learn, often lack holistic semantic information. Existing multi-granularity methods are typically modeled at the word and phrase levels, with very limited application to low-resource agglutinative languages. Focusing on the morphemes of agglutinative languages, this paper proposes MAGNet, a morphology-aware gated multi-granularity pre-training framework. At the morpheme granularity, this framework leverages morphological knowledge and integrates morpheme segmentation with morphological tagging to construct fine-grained representations. It further introduces a morphology-aware masked language modeling objective to facilitate the model in learning functional morphological regularities. Meanwhile, at the word granularity, a word-level encoder is employed to capture contextual semantics and maintain its semantic coherence. Finally, a gated fusion mechanism dynamically fuses representations of different granularities according to the context. Experiments conducted on two low-resource agglutinative languages, Mongolian and Turkish, for the tasks of dependency parsing and named entity recognition (NER) demonstrate that our method achieves consistent performance improvements over strong baseline models. Ablation studies further validate the complementary roles of morphological tagging and whole-word modeling in efficient representation learning.
Most languages lack labeled evaluation benchmarks for large language models (LLMs). Creating such benchmarks requires native speakers, domain expertise, and answer annotation—resources unavailable for the vast majority of languages. We investigate whether a model’s internal processing signals—such as generation entropy and tokenizer statistics—correlate with its actual accuracy on a language, with the long-term goal of estimating language competence without labeled data. Our key observation is that for languages a model does not know, both tokenizer segmentation and generation entropy become highly variable across questions, whereas for known languages they remain consistent. We call this the *inconsistency hypothesis* and test it on 11 instruction-tuned LLMs (1B–70B parameters) across 14 language–script varieties (12 Turkic plus English and Russian controls). We extract over 25 processing features per model–language pair; individually, even the strongest correlate only moderately with accuracy (Pearson |r| up to 0.55). Yet combining just three complementary features—a tokenizer coverage ratio, entropy variability, and the model’s English/Russian benchmark score—explains 75% of accuracy variance in leave-one-language-out evaluation, nearly doubling the 44% explained by a model-mean baseline. The variability of processing signals (standard deviation) consistently outperforms mean values as a predictor across all five model families, but only for greedy-pass measures; sampling-based measures show no such pattern.
The text-to-table task aims to generate structured data in tabular formats from unstructured text. While the integration of large language models (LLMs) has significantly enhanced the comprehensiveness and flexibility of generation, challenges regarding inconsistent output quality persist, such as the inclusion of redundant information and numerical inaccuracies. We propose TableMBR, a robust table generation method that maintains structural consistency through minimum Bayes risk (MBR) decoding. Experimental results show that TableMBR outperforms the baseline, achieving relative improvements of up to 15% in F1 score on Rotowire and 23% in accuracy on LiveSum.
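As a rough illustration of generic MBR decoding over table candidates, the sketch below selects the candidate that agrees most, on average, with the other sampled candidates. Everything here is an assumption made for illustration: tables are represented as sets of (row, column, value) cells, the utility is a simple cell-level F1, and the candidate tables are invented. It is not the utility function or procedure used by TableMBR.

```python
def cell_f1(hyp, ref):
    """F1 overlap between two tables given as sets of (row, column, value) cells."""
    overlap = len(hyp & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility):
    """Return the index of the candidate with the highest mean utility
    measured against all other candidates (used as pseudo-references)."""
    best_index, best_score = 0, float("-inf")
    for i, hyp in enumerate(candidates):
        others = [ref for j, ref in enumerate(candidates) if j != i]
        score = sum(utility(hyp, ref) for ref in others) / max(len(others), 1)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Invented candidates: one clean table, one with a redundant cell,
# and one with a numerical error.
candidates = [
    {("Team A", "points", "102"), ("Team B", "points", "99")},
    {("Team A", "points", "102"), ("Team B", "points", "99"),
     ("Team A", "coach", "unknown")},
    {("Team A", "points", "120"), ("Team B", "points", "99")},
]

chosen = mbr_select(candidates, cell_f1)
print(chosen)
```

With these inputs the clean table (index 0) has the highest average agreement, so the redundant and numerically wrong candidates are both passed over.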
Large language models (LLMs) are increasingly used as evaluative tools across languages, yet bias research remains overwhelmingly Anglocentric, with most studies conducted in English using Latin-script names. It remains unclear whether bias patterns generalize across linguistic contexts. We investigate this question and introduce the stereotype perceptual map, a framework for analyzing how ethnic groups are positioned along evaluative dimensions. Using 900,000 model responses over 45,000 name variations spanning 9 ethnicities, we evaluate model behavior across prompt languages (English, Chinese, Thai), writing scripts (Latin, Chinese, Thai), evaluative domains (competence, warmth), and models (GPT, DeepSeek). We find that ethnic bias hierarchies are jointly shaped by local linguistic context and model origin and differ substantially between Western-centric and Sinocentric models. DeepSeek exhibits highly stable rankings across conditions in math competence judgments, consistently placing Chinese at the top, followed by Russian, and White, Hispanic, and Black names at the bottom. GPT, by contrast, shows strong script-dependent reordering: Latin-script conditions form one stable cluster, while native-script conditions form another, with substantially lower cross-cluster correlations. We term this script-gated bias: transliterating the same names into a non-Latin script can activate a different evaluative frame and produce rankings that are sometimes inversely correlated with Latin-script results. Warmth evaluations are less stable than competence across both models. Our findings demonstrate that multilingual bias cannot be characterized through single-language, single-script audits. For multilingual users, code-switching between languages can toggle between different bias regimes. Fairness evaluations for multilingual LLMs must therefore account for deployment language, writing system, and model origin to capture the full range of potentially harmful bias these systems exhibit.
We propose a comprehensive research agenda to detect, measure, and mitigate racial bias in Natural Language Processing (NLP) systems deployed in criminal justice contexts. Our preliminary work demonstrates that racial descriptors systematically alter embedding similarity scores and retrieval rankings across six models, with bias being race-specific and models showing rank displacements of 1.82 to 7.44 positions, on average. This empirically indicates that even small shifts in similarity scores can displace relevant records outside top-10 results, leading to systematic under-retrieval of records involving certain demographic groups. Building on these findings, this thesis proposes four research questions: (1) developing and evaluating debiasing techniques including counterfactual data augmentation, adversarial training, and fairness-constrained fine-tuning; (2) validating synthetic findings on authentic law enforcement data through IRB-approved partnerships; (3) investigating intersectional bias patterns across race, gender, and age; and (4) extending beyond embedding-level analysis to examine how bias propagates across modern multi-stage retrieval pipelines from embeddings to cross-encoders to LLMs. Expected contributions include empirical comparisons of debiasing methods, bias benchmarks for criminal justice NLP, deployment guidelines for fairness-aware retrieval systems, and the first comprehensive analysis of multi-stage bias propagation in retrieval pipelines.
Scenario-based text generation has broad applications across education and creative writing, but remains underexplored in controllable text generation. We introduce the Contextual Diversity Measure (CDM), a metric that quantifies semantic diversity for scenario generation under fixed abstract semantic constraints, and validate it through controlled experiments. Statistical analysis across four embedding models demonstrates that CDM successfully distinguishes between high-diversity and low-diversity text pairs, with all tests achieving statistical significance at p < 0.05 on both the manually curated and LLM-generated subsets of the dataset. Effect sizes range from small-to-medium (Cohen’s d: 0.292–0.508) on the former and medium-to-large (Cohen’s d: 0.677–1.195) on the latter. Baseline comparisons indicate that CDM achieves excellent discrimination accuracy (100% and 91.9%, respectively), with discriminative power up to 5.5× greater than the best baseline.
The development of fact-checking systems for verifying the factuality of text generated by large language models (LLMs) has been advancing. In the verdict prediction step of such systems, the system determines whether claims in the generated text are supported by retrieved evidence, formulated as a natural language inference (NLI) task. This study extends the label set for verdict prediction to capture claim-evidence relationships that humans would commonly interpret as supported or refuted, even in the absence of strict logical entailment or contradiction. It also constructs a Japanese dataset comprising 28,147 instances from two sources based on this extended label set. We analyze the causes of annotation disagreement and find that ambiguity in the boundary of acceptable inference, interpretive characteristics of negative cases, and incomplete information in the evidence affect annotation variability. Using this dataset, we evaluate the performance of prompt-based verdict prediction methods and show that prompts that explicitly elicit chain-of-thought reasoning improve F1 by 4 percentage points compared to the baseline.
Machine unlearning aims to remove the influence of specific training data from a model while preserving reliable behavior on the remaining data, making reliable prediction and uncertainty estimation essential for evaluation. Calibration is commonly used as a proxy for reliability in language models, but low calibration error does not necessarily imply reliable decision rules, as models may rely on spurious correlations while remaining well calibrated. We investigate this gap in generative language models using the multiple-choice question-answering evaluation protocol on the TOFU benchmark, measuring probabilistic reliability with calibration metrics (ECE, MCE, Brier) and decision-rule reliability via attribution-based shortcut detection with Integrated Gradients and Local Mutual Information. We find that fine-tuned models achieve low calibration error (ECE ≈ 0.04) compared to pretrained models (ECE > 0.5), and models after unlearning retain similarly low calibration despite reduced accuracy on the forget split, while attribution analysis shows increased reliance on correlation-based tokens. These results demonstrate that good calibration can coexist with shortcut-based decision rules after unlearning, extending the reliability paradox to the machine unlearning setting.
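For readers unfamiliar with the ECE metric mentioned in the abstract above, the following is a minimal sketch of expected calibration error with equal-width confidence bins. The binning choice (ten left-closed bins) and the toy confidences are assumptions for illustration; the paper's exact evaluation settings are not stated in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: weighted average over bins of |mean confidence - accuracy|.

    confidences: predicted probabilities of the chosen answers, in [0, 1].
    correct: 1 if the chosen answer was right, else 0.
    """
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


# Toy example: the model is 90% confident on four items but only 75% accurate.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0])
print(round(ece, 3))
```

A low value only says that confidence tracks accuracy on average; as the abstract argues, it says nothing about which tokens drive the decision.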
We disentangle multilingual sentence embeddings into language-dependent and language-agnostic components, leveraging the latter to improve cross-lingual similarity estimation. Previous studies focused on encoder-based approaches that use only the input sentence; in contrast, this study examines the effectiveness of disentanglement methods across a broader range of sentence embeddings, including decoder-based approaches and those that utilize prompts. Experimental results demonstrate that embedding disentanglement is effective for a wide variety of sentence embeddings.
Evaluating the grammatical abilities of large language models (LLMs) is important for both NLP and linguistic theory. We investigate the ability of LLMs to perform acceptability judgments in a forced-choice paradigm. We evaluate a subset of LLMs on 150 minimal sentence pairs sampled from Linguistic Inquiry and categorized using BLiMP linguistic phenomena. Our results show that while LLMs approximate human judgments, performance varies across models and phenomenon types, with stronger alignment on morphosyntactic phenomena than on linguistically and semantically demanding phenomena. Prompting strategies have minimal impact.
Lean proofs are built as sequences of tactic-induced state transitions, yet learned models often represent proof steps primarily through tactic strings or raw proof-state text. Building on Delta Tokens, which encode a proof step by the local edit it induces between successive proof states, we train an encoder-only Transformer to learn contextualized representations of Lean proof steps from state changes. We then use these step representations to study complete proofs as trajectories in a learned latent space. We first show that the Delta-based Transformer yields better held-out next-tactic retrieval than a matched surface-syntax control, supporting the representational choice used in the trajectory analysis. We then analyze proof trajectories using path length, endpoint span, directness, curvature, and torsion. Across the LeanWorkbook slice used here, longer proofs become increasingly indirect within a relatively bounded latent span: path length grows sharply with proof length while endpoint span changes little, mean step size decreases, curvature rises modestly, and torsion falls. Qualitative case studies show that these geometric patterns align with recognizable proof organizations, including immediate closure, aligned accumulation, scaffolded enrichment, bookkeeping-heavy restructuring, and repeated local contradiction work. The dataset is small and heavily skewed toward short proofs, so the claims are necessarily limited. Within those limits, the results suggest that learned state-change representations recover nontrivial structure in how proofs unfold and provide a promising basis for future trajectory-aware theorem proving.
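Three of the trajectory statistics named in the abstract above can be sketched for a sequence of latent points as follows. The definitions are assumptions for illustration: path length is the sum of consecutive Euclidean distances, endpoint span is the distance from the first to the last point, and directness is taken here as their ratio. The abstract does not give the paper's exact formulas, and the 2-D points are invented.

```python
import math


def path_length(points):
    """Sum of Euclidean distances between consecutive points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def endpoint_span(points):
    """Euclidean distance between the first and last point."""
    return math.dist(points[0], points[-1])


def directness(points):
    """Assumed definition: endpoint span divided by path length (1.0 = straight)."""
    length = path_length(points)
    return endpoint_span(points) / length if length > 0 else 1.0


# Invented trajectories: a detour and a straight line.
detour = [(0.0, 0.0), (3.0, 4.0), (6.0, 0.0)]
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]

print(path_length(detour), endpoint_span(detour), directness(detour))
print(directness(straight))
```

Under this definition, a proof whose path length grows while its endpoint span stays nearly constant has decreasing directness, which matches the qualitative pattern reported for longer proofs.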
One of the expected abilities of vision-language models (VLMs) is spatial reasoning ability based on a given text and image. To evaluate the spatial reasoning abilities of VLMs, we focus on the use of spatial deictic expressions, which are defined as spatial expressions whose referent is determined by their situational context, such as *this* and *that*. To handle spatial deictic expressions, VLMs must jointly reason over language and visual space, grounding context-dependent references in the image’s spatial structure. In addition, selecting appropriate spatial deictic expressions across languages requires VLMs to understand the language-specific spatial distinctions encoded by these expressions. In this paper, we develop a benchmark to evaluate the multilingual ability of VLMs to use spatial deictic expressions in four languages. Our experiments using this benchmark reveal that the tested models use demonstratives in a manner different from that of humans, particularly in selecting the appropriate demonstratives based on the distance from the object.
Large language models (LLMs) exhibit substantial cross-lingual variation in mathematical reasoning performance, but it remains unclear whether these differences reflect language-specific parameters or a shared mechanism that manifests differently by language. We present a cross-lingual mechanistic analysis of mathematical reasoning in LLMs, enabling us to localize and compare model parameters that support mathematical reasoning across languages. We find that the extracted math-associated parameters exhibit partial cross-lingual overlap, with the strongest overlap concentrated in intermediate model layers. We further observe that English consistently produces the largest set of math-relevant parameters, whereas lower-resource languages reveal smaller sets of relevant parameters. These results suggest that math-related behavior in multilingual LLMs is neither fully language-invariant nor fully language-specific, but instead exhibits partial cross-lingual parameter overlap with systematic language-dependent differences.
Large language models show strong capabilities in natural language generation (NLG) and have been applied to translate complex structured data into human-readable insights. While these models excel at surface-level fluency, they remain unreliable as they produce factually inaccurate outputs and struggle with consistent logical inference beyond surface-level patterns. Moreover, they often lack a clear sense of relevance and produce shallow or uninformative insights. This proposal argues that a key source of these limitations is task underspecification, which requires models to make implicit assumptions about missing context. We investigate how such underspecification leads to unintentional assumptions and how these affect faithfulness and evaluation. We examine how models can identify missing premises and surface multiple plausible interpretations to make evaluation more rigorous. We also explore how to improve reasoning to enable deeper inferences, focusing on code generation and qualitative reasoning. Finally, we will evaluate how the underlying assumptions and depth of inference influence the perceived interestingness of the insights. By shifting focus from surface-level generation to assumption-aware deeper inferences, this work aims to improve reliability, interpretability, and user controllability in NLG.
To accelerate scientific knowledge acquisition, LLMs are increasingly used to synthesize multiple papers into structured tables by inferring schemas and values. While value generation within a fixed schema can often be reduced to extractive question answering, the schema generation problem, that is, determining the dimensions along which to compare a set of documents, lacks a formal mapping to standard NLP tasks. In this work, we formulate schema generation as a reinforcement learning problem and investigate whether these dimensions can be induced without access to gold-standard schemas. We design a multi-faceted reward framework capturing schema coverage, non-redundancy, relevance, and format, and train a small language model on a literature review dataset. Our approach yields consistent improvements over the untuned base model across intrinsic, reference-based, and LLM-judge metrics, and remains competitive with supervised fine-tuned models at 5× the parameter count on structural and diversity dimensions. All code, results and prompts are available in the GitHub repository: https://github.com/sinjoysaha/rl-schema-generation
Before a tax authority can issue a ruling, it must receive a complete description of the taxpayer’s situation, yet no benchmark measures whether language models can systematically elicit all relevant facts through dialogue. We introduce FSDBench (Factual State Discovery Benchmark), in which a discovery agent questions a simulated taxpayer grounded in a real tax document. The dataset comprises 500 narratives from official Polish tax interpretations, decomposed into 32,874 atomic facts with validated supported precision (97.6%), atomicity (93.8%), and sentence coverage (96.0%). Experiments with four models show that even the best system recovers only 77% of facts on easy samples and under 49% on hard samples after 50 turns. These findings establish conversational fact elicitation as a challenging open problem requiring retrieval-augmented and adaptive questioning strategies.
Uncertainty quantification (UQ) methods for large language models are predominantly designed by hand based on domain knowledge and heuristics, limiting their scalability and generality. We apply LLM-powered evolutionary search to automatically discover unsupervised UQ methods represented as Python programs. On the task of atomic claim verification, our evolved methods outperform strong manually-designed baselines, achieving up to 6.7% relative ROC-AUC improvement across 9 datasets while generalizing robustly out-of-distribution. Qualitative analysis reveals that different LLMs employ distinct evolutionary strategies: Claude models consistently design high-feature-count linear estimators, while Gpt-oss-120B gravitates toward simpler and more interpretable positional weighting schemes. Surprisingly, only Sonnet 4.5 and Opus 4.5 reliably leverage increased method complexity to improve performance, while Opus 4.6 shows an unexpected regression relative to its predecessor. Overall, our results suggest that LLM-powered evolutionary search is a promising paradigm for automated, interpretable hallucination detector design.
Open Information Extraction (OIE) has largely focused on extracting relational tuples from text, yet in its current form remains unsuitable for downstream systems due to the absence of standardized, semantically sound representations. This thesis argues that the field has been addressing extraction as a surface-level prediction problem, leading to outputs that are semantically incomplete and logically ambiguous, particularly in the presence of modality, negation, conditionality, quantification, and attribution. We propose a normalization-first framework that reframes OIE as a structured semantic transformation pipeline, where raw text is first converted into a lossless, canonical form of declarative, active-voice, and irreducible sentence units, and extraction is constrained to atomic unary and binary relations augmented with explicit semantic annotations. Within a Probably Approximately Correct (PAC) learning perspective, we formalize soundness, completeness, and usefulness as approximate yet verifiable guarantees over extraction quality, acknowledging the inherent undecidability of full semantic interpretation. This thesis outlines a feasible research program to develop the theoretical foundations, models, and evaluation protocols required to produce system-ready OIE representations, thereby establishing a principled and executable path toward making OIE directly usable for downstream reasoning and machine interpretability.
Large Language Models (LLMs) are increasingly deployed in multilingual settings, yet most bias evaluation remains English-centric and overlooks how bias manifests within reasoning. We present a systematic study of social bias in both predictions and chain-of-thought reasoning across English, Dutch, Spanish, and Turkish using the MBBQ benchmark. We evaluate instruction-tuned, CoT-prompted, and reasoning-native models under supervised fine-tuning and preference optimization, using accuracy, F1, bias metrics, and a novel reasoning-level language drift measure. We find that (1) bias varies substantially across languages, with consistent degradation in non-English settings, (2) reasoning traces often introduce additional stereotype-driven signals beyond final outputs, and (3) English-trained debiasing methods fail to generalize reliably, with preference optimization introducing cross-lingual trade-offs. We further show that performance gains in multilingual settings are frequently driven by implicit reliance on English-centric reasoning, revealed through increased language drift. Together, our results demonstrate that multilingual fairness cannot be inferred from English performance and requires reasoning-aware, language-specific evaluation and alignment.
Large Language Models (LLMs) achieve strong performance on a wide range of reasoning benchmarks, yet it remains unclear whether they can reliably maintain and update internal representations of an evolving world described in natural language. In particular, existing evaluations inadequately probe state tracking under multiple interacting constraints and largely overlook the role of negated actions, despite their ubiquity in real-world language. We address this gap by introducing MCST, a diagnostic benchmark for multi-constraint state tracking that evaluates an LLM’s ability to maintain consistent world models across sequences of actions involving inventory changes, spatial movement, temporal ordering, and systematic negation. MCST comprises 100,847 questions spanning 12 real-world domains, with five calibrated difficulty levels, nine question types, and controlled integration of negated actions. The benchmark further incorporates culturally diverse entity names to enable analysis of cross-cultural robustness. We evaluate 14 SOTA LLMs across multiple model families using a unified evaluation protocol. Our results reveal substantial limitations: even the strongest models exhibit sharp performance degradation as difficulty increases, with accuracy dropping below 35% at the highest level. Most notably, we identify negation as a dominant failure mode, causing accuracy reductions of 23-32% across models. We release MCST and the full evaluation framework on GitHub to support future research on state tracking and reasoning in language models.
Recursive models that progressively refine latent representations have demonstrated strong performance on a variety of reasoning tasks. However, these models only control whether and when to stop early, not how computation is distributed. In this work, we introduce shortcut reasoning, a framework for distilling recursive latent reasoning into a multiscale jump model that enables flexible test-time compute. We reinterpret recursive reasoning as a latent-time dynamical process and train a student model to predict the effect of multiple reasoning steps at once. To ensure robustness, we augment shortcut transitions with a repair mechanism, where a denoising variant of the base model projects latent states back onto a valid reasoning manifold. We further introduce stepwise improvement supervision, encouraging each shortcut step to increase the likelihood of the correct answer. Experiments on ARC-AGI show that our approach achieves competitive accuracy compared to recursive baselines while requiring fewer sequential updates.
Multi-agent debate systems are typically evaluated only on whether the final answer is correct, overlooking the quality of the intermediate reasoning that debate is designed to produce. This paper studies the relationship between three signals in multi-agent debate: token-level log-probability distributions over reasoning tokens, LLM-as-judge rubric scores assigned to those tokens, and final task accuracy. We examine whether internal confidence signals predict externally evaluated reasoning quality, and whether either signal aligns with task correctness, across three domains: rubric-based scoring, mathematical reasoning, and factual question answering. Our framework pairs a two-agent debate architecture (a Constructor and an Auditor) with an LLM-as-judge that scores each agent’s reasoning along instruction following, justification quality, and evidence grounding, together with a critical-failure flag. Experiments in the rubric-scoring domain reveal a consistent four-phase confidence trajectory and a substantial role asymmetry: confidence aligns with judged reasoning quality roughly twice as strongly for the Constructor as for the Auditor, and confidence-based detection of critical reasoning failures is markedly more reliable for the Constructor (AUROC 0.804) than for the Auditor (0.634). These findings motivate the broader cross-domain investigation proposed in this paper.
Recent work has shown that LLMs develop internally coherent utility functions that emerge with scale, yet whether these value systems encode systematic demographic hierarchies remains unexplored. We elicit pairwise preferences across 15 intersectional demographic groups (defined by race, gender, and their combinations) and 8 policy domains on three 7–8B instruction-tuned LLMs, fitting Thurstonian utility models to the resulting preference matrices. All three models converge on a compensatory hierarchy that inverts real-world structural advantage, consistently ranking marginalized groups highest and dominant groups lowest. Intersectional utilities do not combine additively: single-axis audits that measure gender and race gaps independently overestimate the most extreme intersectional gap by 26–40% in our experiments. Geometrically, we identify a linear direction in the representation space that predicts the full utility hierarchy from neutral sentences alone, and show that this direction is substantially aligned with gender encoding but not with race encoding. Orthogonalization reveals that gender separation in representations is not fully explained by utility encoding. The hierarchy is already present in base (pre-alignment) models and is amplified several-fold by instruction tuning, suggesting it originates in pre-training data rather than alignment procedures.
Prompt choice is crucial in adapting language models to text classification tasks, particularly under low-resource conditions. Manual prompt engineering is time-consuming, non-scalable, and brittle, while current auto-prompting techniques are still far from maturity. This paper presents a two-stage method for prompt learning of frozen language models, CRL-Prompt, based on soft prompt initialization followed by contrastive and reinforcement-based refinement. An experimental study demonstrates that our approach achieves consistent improvements in accuracy over baseline prompt tuning strategies, with gains of up to 2.2% while training fewer than 0.25% of model parameters.
Packing and shuffling tokens is a common practice in training auto-regressive language models to prevent overfitting and improve efficiency. Documents are typically concatenated into chunks of maximum sequence length (MSL) and shuffled in chunks of tokens (atom-size chunks), possibly breaking context within documents. An alternative approach is padding, which includes only one document per chunk. To optimize both packing strategies (concatenation vs. padding), we explored the optimal atom size for shuffling and compared performance and efficiency. We found that in the most common setup (where average document length is greater than MSL), matching atom size to MSL yields the lowest perplexity, controlling for dataset. Also, padding yields lower final perplexity than concatenation at the cost of lower efficiency. This trade-off informs the choice of shuffling and packing methods in training LMs.
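The two packing strategies contrasted in the abstract above can be sketched on toy token lists. Several details are assumptions made for illustration and are not taken from the paper: an end-of-sequence token (`eos=0`) is appended after each document before concatenation, a trailing partial chunk is dropped, documents longer than MSL are truncated in the padding variant, and `pad=-1` marks padding. Shuffling by atom size is left out.

```python
def pack_by_concatenation(documents, msl, eos=0):
    """Concatenate all documents into one stream and cut it into MSL-sized
    chunks. A chunk may span document boundaries; a short tail is dropped."""
    stream = []
    for doc in documents:
        stream.extend(doc)
        stream.append(eos)
    return [stream[i:i + msl] for i in range(0, len(stream) - msl + 1, msl)]


def pack_by_padding(documents, msl, pad=-1):
    """One document per chunk: truncate to MSL, then pad up to MSL."""
    chunks = []
    for doc in documents:
        clipped = doc[:msl]
        chunks.append(clipped + [pad] * (msl - len(clipped)))
    return chunks


def useful_fraction(chunks, pad=-1):
    """Share of positions holding real (non-padding) tokens."""
    total = sum(len(chunk) for chunk in chunks)
    return sum(tok != pad for chunk in chunks for tok in chunk) / total


docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
concatenated = pack_by_concatenation(docs, msl=4)
padded = pack_by_padding(docs, msl=4)

print(concatenated)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 10]]
print(padded)        # [[1, 2, 3, -1], [4, 5, -1, -1], [6, 7, 8, 9]]
print(useful_fraction(concatenated), useful_fraction(padded))  # 1.0 0.75
```

The second concatenated chunk mixes two documents, while the padded chunks keep documents separate but spend a quarter of their positions on padding, which is the efficiency cost the abstract refers to.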
Validating evaluation metrics for NLG typically relies on expensive and time-consuming human annotations, which predominantly exist only for English datasets. We propose LLM as a Meta-Judge, a scalable framework that utilizes LLMs to generate synthetic evaluation datasets via controlled semantic degradation of real data, replacing human judgment. We validate our approach using meta-correlation, measuring the alignment between metric rankings derived from synthetic data and those from standard human benchmarks. Experiments across Machine Translation, Question Answering, and Summarization demonstrate that synthetic validation serves as a reliable proxy for human judgment, achieving meta-correlations exceeding 0.9 in multilingual QA and proving to be a viable alternative where human judgments are unavailable or too expensive to obtain. Our code and data are publicly available at https://github.com/eiglerl/meta-judge.
Starting from the observation that conditioning a poetry-writing prompt with a pancake recipe leads an LLM to produce a coherent poem incorporating pancake-related content and, more broadly, that such contexts arrange themselves into a structured semantic vector space, we argue that this renders the space explorable. By sampling it and using the resulting continuous representations to condition an LLM’s generation distribution, we can systematically expand the model’s reachable semantic range. We introduce a framework that requires no modification of LLM parameters and operationalizes this idea by constructing a conditioning distribution from a small set of diverse anchor generations. This distribution conditions the LLM’s generation via an xRAG-style projector. Our experiments demonstrate that this manifold-based conditioning substantially increases generative diversity, with direct benefits for enhancing divergent thinking, a core facet of creativity, in language models.
In-context learning (ICL) enables Large Language Models (LLMs) to adapt to new tasks using few examples; task vectors, defined as specific hidden-state activations, are hypothesized to encode task information. Existing studies are limited by small-scale benchmarks, restricting comprehensive analysis. We introduce QᴜɪᴛᴇAFᴇᴡ, a novel dataset of 3,096 diverse few-shot tasks, each with 30 input-output pairs derived from the Alpaca dataset. Experiments with Llama-3-8B on QᴜɪᴛᴇAFᴇᴡ reveal: (1) task vector performance peaks at an intermediate layer (e.g., 15th), (2) effectiveness varies significantly by task type, and (3) complex tasks rely on multiple, subtask-specific vectors rather than a single vector, suggesting distributed task knowledge representation.