Joint Conference on Lexical and Computational Semantics (2026)


Proceedings of the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)

Long-context language models face efficiency challenges as context lengths expand. Traditional tokenization methods like BPE operate on frequency statistics, ignoring semantic structure and over-tokenizing redundant spans. We propose SemToken, a semantic-aware tokenization framework that adaptively compresses token sequences based on semantic density. SemToken uses lightweight encoders to identify and merge semantically equivalent spans, allocates variable granularity based on local semantic density, and dynamically adjusts token budgets during generation. Evaluations on WikiText-103, LongBench, and BookSum demonstrate 2.4× token reduction, 1.9× inference speedup, and 67% memory reduction while preserving or improving model quality. SemToken integrates seamlessly with existing models and achieves multiplicative benefits when combined with FlashAttention (up to 2.7× total speedup).
Large language models (LLMs) exhibit remarkable few-shot learning capabilities, yet the role of syntactic structure in demonstration examples remains unexplored. Drawing on psycholinguistic research on structural priming, we investigate whether syntactic patterns in few-shot prompts influence LLM outputs and task performance. We conduct systematic experiments across four model families (Llama, Mistral, Qwen, Gemma) using four syntactic constructions (passive voice, cleft sentences, dative alternation, particle placement). Our results reveal robust syntactic priming effects, with priming strength ranging from 1.3× to 6.4× depending on construction type, indicating that models are substantially more likely to produce constructions matching demonstration syntax. Critically, we find that priming strength shows a positive trend with model size (r = 0.85, p = 0.068), with effects intensifying from 7B to 14B parameter models. We demonstrate that priming is construction-specific rather than reflecting general stylistic preferences, and that priming effects persist across multiple intervening sentences. Analysis across three task types (sentence completion, paraphrase generation, story continuation) reveals that syntactic structure in demonstrations influences output style, and that models produce primed constructions even when the task calls for a different syntactic form. These findings have immediate implications for prompt engineering and reveal that LLMs encode syntactic abstractions beyond surface-level pattern matching. We release our benchmark, SyntaxPrime-ICL, containing controlled examples across multiple constructions for evaluating syntactic priming in few-shot contexts.
Hallucinations are a persistent challenge in natural language generation, including data-to-text. van Deemter (2024) introduced a framework based on the relation of logical consequence ("follows from"), which divides all data-to-text hallucinations into seven disjoint categories. We examine whether human annotators and large language models are able to apply the framework, in two data-to-text domains. Results suggest that the framework is applicable, although there are significant domain-dependent variations, as well as discrepancies between human and model judgments. We also uncover several issues that should inform future work on hallucination.
Metaphors are part of everyday language and shape the way in which we conceptualize the world. Moreover, they play a multifaceted role in communication, making their understanding and generation a challenging task for language models (LMs). While there has been extensive work in the literature linking metaphor to the fulfilment of individual intentions, no comprehensive taxonomy of such intentions, suitable for natural language processing (NLP) applications, is available to date. In this paper, we propose a novel taxonomy of intentions commonly attributed to metaphor, which comprises 9 categories. We also release the first dataset annotated for intentions behind metaphor use. Finally, we use this dataset to test the capability of large language models (LLMs) in inferring the intentions behind metaphor use, in zero-shot and in-context few-shot settings. Our experiments show that this is still a challenge for LLMs.
The majority of contemporary computational methods for lexical semantic change (LSC) detection are based on neural embedding distributional representations. Although these models perform well on LSC benchmarks, their results are often difficult to interpret. We explore an alternative approach that relies solely on frame semantics. We show that this method is effective for detecting semantic change and can even outperform many distributional semantic models. Finally, we present a detailed quantitative and qualitative analysis of its predictions, demonstrating that they are both plausible and highly interpretable.
Nonsensical and anomalous sentences have been instrumental in the development of computational models of semantic interpretation. A core challenge is to distinguish between what is merely anomalous (but can be interpreted given a supporting context) and what is truly nonsensical. However, it is unclear (a) how nonsensical, rather than merely anomalous, existing datasets are; and (b) how well LLMs can make this distinction. In this paper, we answer both questions by collecting sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets, both without context and with a supporting context. We find that raters consider most sentences at most anomalous, and only a few as properly nonsensical. We also show that LLMs are substantially skilled in generating plausible contexts for anomalous cases.
This paper studies the impact of retrieved ideologically framed texts on the outputs of large language models (LLMs). While interest in understanding ideological framing in LLMs has recently increased, little attention has been given to this issue in the context of Retrieval-Augmented Generation (RAG). To fill this gap, we design an external knowledge source based on ideologically framed texts about COVID-19 treatments. Our corpus is based on 1,117 academic articles representing discourses about controversial and endorsed treatments for the disease. We propose a corpus linguistics framework, based on Lexical Multidimensional Analysis (LMDA), to identify discourse dimensions within the corpus. LLMs are tasked to answer questions derived from three identified discourse dimensions, and two types of contextual prompts are adopted: the first comprises the user question and ideologically framed texts; and the second contains the question, ideologically framed texts, and LMDA descriptions. Alignment between reference ideologically framed texts and LLMs’ responses is assessed using cosine similarity for lexical and semantic representations. Results demonstrate that retrieved ideologically framed texts influence LLM responses toward the discourse framing represented in the external knowledge, with enhanced prompts further amplifying this effect. Our findings highlight the importance of identifying ideological framings within the RAG framework in order to mitigate not just unintended ideological bias, but also the risks of intentional discourse steering of such models.
Compositionality is considered central to language abilities. As performant language systems, how do large language models (LLMs) do on compositional tasks? We evaluate adjective–noun compositionality in LLMs using two complementary setups: prompt-based functional assessment and a representational analysis of internal model states. Our results reveal a striking divergence between task performance and internal states. While LLMs reliably develop compositional representations, these representations fail to translate consistently into functional task success across model variants. Consequently, we highlight the importance of contrastive evaluation for obtaining a more complete understanding of model capabilities.
We evaluate GPT-4o’s color naming across nine languages using both synthetic and human-derived stimuli. Using hue wheels, fixed basic categories, low-chroma hue lines, and dense binned CIELAB grids, we separate lexical availability of color terms from distributional agreement with human color naming. GPT-4o reliably names vivid, high-chroma colors and reproduces several known language-specific distinctions under constrained settings. However, its performance degrades sharply for low-chroma colors and for stimuli near human category boundaries. In these regions, model-human divergence remains high. Overall, GPT-4o shows strong cross-linguistic lexical knowledge but does not reliably match human color-naming distributions, especially in low-chroma and boundary regions.
Lexical Semantic Change Detection (LSCD) is a complex, lemma-level task, which is usually operationalized based on two subsequently applied usage-level tasks: First, Word-in-Context (WiC) labels are derived for pairs of usages. Then, these labels are represented in a graph on which Word Sense Induction (WSI) is applied to derive sense clusters. Finally, LSCD labels are derived by comparing sense clusters over time. This modularity is reflected in most LSCD datasets and models. It also leads to a large heterogeneity in modeling options and task definitions, which is exacerbated by a variety of dataset versions, preprocessing options and evaluation metrics. This heterogeneity makes it difficult to evaluate models under comparable conditions, to choose optimal model combinations or to reproduce results. Hence, we provide a benchmark repository standardizing LSCD evaluation. Through transparent implementation, results become easily reproducible, and through standardization, different components can be freely combined. The repository reflects the task’s modularity by allowing model evaluation for WiC, WSI and LSCD. This allows for careful evaluation of increasingly complex model components, providing new ways of model optimization. We use the implemented benchmark to conduct a number of experiments with recent models and systematically improve the state-of-the-art.
Zero-shot Named Entity Recognition is critical for low-resource domains, yet existing approaches rely on opaque prompting of large language models or dense representations that suffer from polysemanticity. We propose an alternative approach that leverages monosemantic features of Sparse Autoencoders. We introduce SAE-NER, a training-free framework that maps monosemantic SAE feature activations to entity types through direct precision estimation, requiring no supervision or prompting. Experiments across general and biomedical domains show that SAE-NER consistently outperforms trained probing classifiers, with an especially large margin in the biomedical domain (up to +20 F1). Finally, we evaluate the utility of SAE-NER predictions as silver training data for downstream NER models. Using controlled perturbations of gold annotations to simulate realistic annotation noise, we show that false negatives are the primary bottleneck for silver-data quality, outweighing the impact of boundary imprecision and false positives.
Making texts clear and comprehensible has become an increasingly important topic in NLP. A possible strategy to enhance text comprehension is to make implicitly conveyed meaning explicit. To explore the role of explicit vs. implied meaning, we study cases of so-called explicitations, i.e. revisions of text in which implicitly conveyed content is made explicit. Using revision histories from wikiHow, we propose a rule-based approach to extract candidate explicitations and curate a human-annotated dataset in which explicitations are distinguished from insertions of new information. Our analyses show that while the extraction method is effective in retrieving relevant cases, distinguishing explicitations from new information is a challenging and often subjective task, reflecting differences in background knowledge and reasoning. Experimentally, we find off-the-shelf LLMs to achieve promising performance, with inconsistent gains from few-shot prompting and fine-tuning. In contrast, fine-tuned NLI models benefit consistently from supervised training and show stronger robustness under distribution shift. In sum, our findings show that the task is challenging, but also indicate that our annotated dataset contains informative signals that models can learn from, paving the way for further research on explicitations.
Large language models (LLMs) appear successful in emulating compositional language, yet it remains unclear what these results entail about their underlying compositional semantic representations. The probing classifier paradigm has emerged as a tool to remedy this. This paper proposes to critically review the findings of 24 probing studies targeting a wide range of linguistic and semantic phenomena. It proposes a taxonomy of probing tasks based on the linguistic primitives they presuppose, distinguishing four tiers: lexical semantics, the syntax–semantics interface, propositional semantics, and discourse and pragmatics. A gradient in representational evidence emerges: LLMs robustly encode lexical information, display less consistent sensitivity to structural relations within sentences, and obtain unsatisfactory results on tasks requiring propositional content, speech acts, or pragmatic inference. The review underscores the need for a clearer theoretical grounding of what probing tasks measure and reflects on how probing can illuminate the compositional pathways available within current language models.
Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over “Yes/No” answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
We present a systematic study on paraphrase detection in Tamil by constructing a unified dataset through translation and semantic verification of three English benchmarks: QQP, PAWS, and MRPC. Unlike prior efforts that focus on individual sources or limited scales, our dataset combines multiple paraphrase detection paradigms and is evaluated using semantic similarity metrics, round-trip translation checks, and classifier agreement analysis. We fine-tune five multilingual transformer models (mBERT, XLM-R, IndicBERT, MuRIL, and DistilmBERT) and a Tamil-specific compact model, TLMR (Tamil Language Model - DeBERTa), pretrained on 525M Tamil tokens. Furthermore, we assess the representational quality of the sentence embeddings taken from these models using lightweight classifiers (SVM, XGBoost, and Logistic Regression). We formulate an efficiency-oriented metric that incorporates top-5 accuracy, vocabulary usage, and script fidelity in relation to perplexity in order to facilitate resource-aware evaluation. The experimental findings lay the groundwork for future Tamil semantic understanding tasks by highlighting differences in generalization and efficiency across models.
Social biases based on regional identity (or regional bias) are often observed in Indian contexts on major online social networks and require critical attention. However, due to large linguistic and cultural diversity, high annotation costs, and inherent human biases, very little annotated data exists on regional biases in the Indian context. Recently, Large Language Models (LLMs) have garnered attention for the automatic annotation of text. However, such annotation efforts are largely limited to English texts, and LLMs often perform poorly when applied to low-resource languages. Therefore, this paper focuses on understanding the capabilities and challenges of popular open-source LLMs in annotating Indian regional biases. We utilize the recently proposed IndRegBias dataset, which consists of Indian regionally biased social media comments in both English and code-mixed formats. First, we assess the annotation capabilities of LLMs in a zero-shot setting and critically analyze their performance across different writing styles, including code-mixing, transliteration, and English. We find that the majority of LLMs exhibit low agreement with human annotations (measured using Cohen’s kappa). Consequently, we extend our study by fine-tuning the models using 50% of the data and evaluating them on the remaining 50%. We observe a significant improvement in annotation agreement (kappa) for all the LLMs. To further assess the capabilities of the fine-tuned models, we evaluate them on 500 newly collected social media comments discussing regional issues in India. The results show that most fine-tuned LLMs outperform their zero-shot counterparts when annotating these new comments.
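The agreement measure named in the abstract above, Cohen's kappa, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `cohens_kappa` and the toy labels are assumptions made purely for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical, not taken from IndRegBias).
human = ["bias", "none", "bias", "none", "bias", "none"]
model = ["bias", "none", "none", "none", "bias", "bias"]
print(round(cohens_kappa(human, model), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is preferred over plain accuracy when label distributions are skewed.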
Text detoxification, the automated detection and mitigation of abusive and harmful content, is essential for ensuring the safety of online communities and protecting users. However, low resource languages such as Tatar have received little research attention. In this paper we present Tatoxa, a novel state-of-the-art system for text detoxification in the Tatar language. Comparative experiments show that the proposed approach outperforms existing open source and proprietary commercial LLMs on key quality metrics. We also introduce a new dataset for text detoxification in Tatar, designed for fine tuning and evaluation in low resource settings. Finally, cross lingual transfer experiments indicate that transfer from other languages, including the culturally close Russian, performs significantly worse than training on native Tatar data even when a large Russian corpus is available.
Updating bilingual dictionary entries is a tedious, time-consuming, and highly subjective task, especially when a new sense in the source language requires identifying an appropriate translation equivalent. To date, there have been no attempts to automate the discovery of new bilingual sense entries. Related tasks such as Word-level Bilingual Dictionary Induction and cross-lingual embedding alignment do not account for polysemy and are not applied to lexicographic data. In contrast to their monolingual counterparts, bilingual dictionaries fall short in terms of completeness, detail with respect to examples and glosses, and diachronic information. We introduce a novel NLP task, Sense-Level Bilingual Dictionary Induction (SenseBDI), at the intersection of lexical semantics, cross-lingual, and diachronic NLP. We construct a dataset of time-stamped sense-level bilingual dictionary entries by aligning two bilingual dictionaries, two monolingual dictionaries, and the multilingual resource BabelNet, thereby enriching bilingual entries with monolingual source-language information. We propose a baseline based on nearest-neighbor search over cross-lingual embeddings of glosses and usages. We find that usages contribute more strongly than glosses, with substantial variation across language pairs, and discuss task-specific challenges with regard to target language polysemy and future directions such as transfer to real-world scenarios.
Suicide ideation detection models are typically evaluated using aggregate performance metrics, yet little is known about how they internally represent psychologically meaningful risk factors. In high-stakes mental health applications, understanding these internal representations is essential for safety, transparency, and responsible deployment. In this work, we move beyond accuracy and analyze how suicide detection models trained on original and topic-augmented datasets encode psychological risk factors in their internal representation space. Using visualization and geometric analysis, we examine the coherence and separability of topic-related features. Our results show that topic-aware augmentation increases the clarity and distinctness of underrepresented psychosocial risk factors such as immigration, family issues, and financial crisis. These findings suggest that augmentation not only improves model performance but also leads to more structured and interpretable internal representations.
Stance detection seeks to determine whether a text expresses a position in favor of, against, or neutral toward a target. Despite advances in neural architectures, performance remains inconsistent across datasets. To better understand these disparities, we analyze over 75K samples from four benchmark datasets using six neural models, focusing on stylistic and pragmatic language features rather than architectures or external knowledge. We extract 43 features spanning lexical richness, syntactic complexity, affective tone, and hedging, and assess their impact through both Logistic Regression and SHAP analyses. Our findings reveal distinct stylistic profiles for each stance: favor is best detected when expressed concisely with minimal hedging; against when paired with strong negative emotions and greater lexical variety; and none when texts are lexically simple and emotionally neutral. Across classes, errors arise from excessive complexity, mixed emotional signals, and overuse of hedging. These results advance understanding of what drives success and failure in stance detection.
Stance detection identifies whether a text expresses support, opposition, or neutrality toward a target and is central to applications such as political analysis and misinformation monitoring. With the shift toward large language models (LLMs), stance classification increasingly relies on prompting and lightweight adaptation. Yet the generalization behavior of open-source LLMs across new targets and domains remains uneven. We conduct a large-scale diagnostic study of four open-source LLMs (3B–24B parameters), examining how model size, prompting strategies, and Low-Rank Adaptation (LoRA) interact across in-target, cross-target, and cross-domain settings. Across 912 experiments, three patterns emerge: (1) larger models improve prompting-based in-target performance, but this advantage diminishes after fine-tuning; (2) LoRA boosts in-target accuracy yet often harms cross-context transfer; (3) optimal prompting depends on model size. These results reveal a consistent tension between specialization and generalization, offering practical guidance for configuring LLM-based stance detection under transfer.
Large Language Models (LLMs) show impressive performance on many NLP benchmarks, yet their ability to reason in figurative, culturally grounded, and low-resource settings remains underexplored. We address this gap for Bangla by introducing BanglaRiddleEval, a benchmark of 1,244 traditional Bangla riddles instantiated across four tasks (4,976 riddle-task artifacts in total). Using an LLM-based pipeline, we generate Chain-of-Thought explanations, semantically coherent distractors, and fine-grained ambiguity annotations, and evaluate a diverse suite of open-source and closed-source models under different prompting strategies. Models achieve moderate semantic overlap on generative QA but low correctness, MCQ accuracy peaks at only about 56% versus an 83.3% human baseline, and ambiguity resolution ranges from roughly 26% to 68%, with high-quality explanations confined to the strongest models. These results show that current LLMs capture some cues needed for Bangla riddle reasoning but remain far from human-level performance, establishing BanglaRiddleEval as a challenging new benchmark for low-resource figurative reasoning. All data, code, and evaluation scripts are available on GitHub: https://anonymous.4open.science/r/BanglaRiddleEval.
Do politically charged terms with similar referents, like "undocumented immigrants" (UI) and "illegal aliens" (IA), differ only in who uses them, or also in what they mean? We investigate usage patterns by projecting contextual embeddings into interpretable psycholinguistic feature space, and extracting narrative scenes with LLMs. We find that in partisan news, the term IA appears in contexts emphasizing causation and fear. UI appears in contexts emphasizing consequences experienced and shared humanity. Scene abstraction reveals parallel patterns: IA is embedded in narratives of criminality and threat, UI in narratives of vulnerability and governance. Beyond indexing speaker identity, these terms impart different construals on migrants: as agents of harm versus patients of circumstance. This dual-track methodology adds new tools to the growing body of computational approaches for understanding the conceptual framing of politically charged topics.
A large body of research has examined the linguistic abilities of language models (LMs) across various languages. However, conclusive evidence regarding their semantic competence and world knowledge remains limited, especially for low-resource languages. In this study, we explore the semantic competence of Italian BabyLMs, focusing on their sensitivity to semantic violations. To this end, we adapt a minimal pair benchmark targeting semantic violations to evaluate the semantic abilities of BAMBI, a family of small-scale models trained on progressively larger and more complex datasets. We further compare their performance, assessed through accuracy, mean log-likelihood offset, and expected calibration error, with that of three larger Italian LMs. Our findings shed light on this aspect of semantic competence in small-scale models and how this is affected by data scale and training strategies.
News headlines and summaries shape how events are interpreted through selective emphasis and omission, a phenomenon commonly referred to as framing. Large language models are now routinely used to generate such content, yet existing evaluation frameworks largely overlook this dimension. We introduce Frame In, Frame Out (FIFO), the first large-scale benchmark for measuring framing presence in LLM-generated news summaries, grounded in the widely used XSum dataset. FIFO combines 15,499 jury-annotated examples with 320 expert-labeled instances (𝜅 = 0.61) to validate and calibrate model-based annotations. Using FIFO, we analyze measured framing rates across 27 summarization models. We find that LLM-generated summaries often exhibit higher calibrated framing rates than human-written references, with substantial variation across topics and training regimes, including elevated rates in scientific and public health summaries. Our results establish framing as an underexplored and consequential dimension of summarization quality.
We disentangle multilingual sentence embeddings into language-dependent and language-agnostic components, leveraging the latter to improve cross-lingual similarity estimation. Previous studies on this approach have trained disentanglers by combining intra-component constraints, which either align or disalign language-dependent embeddings or language-agnostic embeddings, with inter-component constraints across both embeddings. However, when and how these constraints are effective remains unclear. Our experiments on sentence similarity estimation and machine translation quality estimation revealed that while intra-component constraints and the combination of both constraints are effective for encoder-based multilingual sentence embeddings, inter-component constraints are effective for decoder-based ones. Furthermore, our detailed analysis revealed distinct roles: intra-component constraints improve uniformity within the embedding space, while inter-component constraints enhance cross-lingual alignment between parallel sentences.
Masked language modeling has become a standard pretraining objective for training encoder-based language models. In this approach, certain tokens in the input are masked, and the model learns to predict them using the surrounding context. This process enables the model to capture both syntactic and semantic properties of language. Conventionally, the tokens selected for masking are chosen at random, which may not always yield the most effective learning signals. In this work, we examine a token masking strategy based on entropy distribution. We use the model’s entropy over token predictions to identify which tokens should be masked. This method aims to target tokens that are more informative and uncertain to improve the training efficacy. We also propose a novel self-masking approach that enhances training efficiency without relying on an external reference model. Experimental results demonstrate that our method achieves an average performance improvement of 5% in GLUE scores compared to the baseline. Further, we experiment with combining knowledge distillation with entropy masking, resulting in the best overall results.
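As a rough illustration of the entropy-based masking idea in the abstract above, the sketch below ranks token positions by the Shannon entropy of a predicted distribution and selects the most uncertain ones. It is not the authors' method: picking the top-entropy positions, the `mask_ratio` parameter, the function names, and the toy probabilities are all assumptions.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one predicted token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_mask_positions(prob_rows, mask_ratio=0.15):
    """Return sorted indices of the highest-entropy positions.

    prob_rows holds one probability distribution per token position.
    Choosing the top-entropy positions is an assumption of this sketch.
    """
    n = len(prob_rows)
    if n == 0:
        return []
    k = max(1, round(n * mask_ratio))
    ranked = sorted(
        range(n), key=lambda i: token_entropy(prob_rows[i]), reverse=True
    )
    return sorted(ranked[:k])


# Toy distributions over a 4-word vocabulary (hypothetical values).
rows = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform prediction, maximal entropy
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.30, 0.20, 0.10],
]
print(select_mask_positions(rows, mask_ratio=0.5))
```

In a self-masking setup, the distributions would come from the model being trained rather than from an external reference model; here they are hard-coded so the example runs on its own.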
In recent years, several Speech Language Models (SLMs) that represent speech and written text jointly have been presented. The question then emerges of how model-internal mechanisms are similar and different when operating in the two modalities. We focus on how these systems encode, store, and retrieve factual knowledge, which has previously been investigated for text-only models. To investigate mechanisms behind the storage and recall of factual associations in SLMs, we leverage Causal Mediation Analysis, a technique previously applied to text-based models. Initial results using SpiritLM, a multimodal model integrating discrete speech tokens, reveal discrepancies between text-to-text and speech-to-text results, suggesting that the emergent mechanisms for factual recall are only partially carried over from the text to the speech modality. These results advance our understanding of how internal mechanisms encode factual associations in SLMs while contributing insights for improving speech-enabled AI systems.
Creating effective dialogue systems for mental health support requires high-quality multi-turn counseling dialogue data, yet collecting real counselor-client conversations presents significant challenges, including privacy concerns, high costs, and limited scalability. We present Interactive Agents, a novel framework that simulates naturalistic counseling dialogues through controlled LLM-to-LLM interactions. The framework introduces two key innovations: (1) a personalized client agent that maintains consistent psychological characteristics throughout a session, and (2) a counselor agent that implements a theoretically grounded three-stage therapeutic model comprising the exploration, insight, and action phases. Through rigorous evaluation using both automatic metrics and professional-counselor assessments based on the Working Alliance Inventory, we demonstrate that our framework generates therapeutically valid dialogues that are comparable in quality to human-generated sessions. Models fine-tuned on our proposed synthetic dataset (SimPsyDial) achieve state-of-the-art performance in a standard pairwise chatbot-arena evaluation of LLM-based counselors. Our framework provides a scalable, privacy-preserving method for generating high-quality counseling dialogue data while maintaining professional therapeutic standards.
While zero-shot instructional prompts like "Let’s think step-by-step" have revolutionized Large Language Model performance, we lack a systematic understanding of why: which specific words drive their effectiveness, and how do these patterns vary across tasks and models? We introduce the ZIP score (Zero-shot Importance of Perturbation), a metric that quantifies individual word importance through controlled, semantically meaningful perturbations. To enable rigorous evaluation, we also introduce the first ground-truth benchmark for prompt interpretability, a set of validation prompts with predetermined keywords where ZIP achieves 95.8% accuracy compared to 65.8% for LIME. Analyzing six flagship models across seven prompts and multiple task domains, we find that word importance is task-dependent ("step-by-step" dominates mathematical reasoning; "think" matters more for common-sense tasks), varies systematically across model families, and correlates inversely with model performance, suggesting prompts have greatest impact on tasks where models struggle. Our findings advance prompt science, providing both practical guidance for prompt engineering and theoretical understanding of how instructional language shapes model behavior.
Language-model (LM) surprisal is widely used as a proxy for contextual predictability and has been reported to correlate with metaphor novelty judgments. However, surprisal is tightly intertwined with lexical frequency. We explore this interaction on metaphor novelty ratings using two different word frequency measures. We analyse surprisal estimates from eight Pythia model sizes and 154 training checkpoints. Across settings, word frequency is a stronger predictor of metaphor novelty than surprisal. Across training stages, the surprisal–novelty association peaks at an early stage and then falls again, mirroring a similarly timed increase in the surprisal–frequency association. These results suggest that the often-reported optimal LM surprisal settings may incorrectly associate contextual predictability with metaphor novelty and processing difficulty, whereas lexical frequency may be the major underlying factor.
Text embedding models are designed for sentence-level applications like retrieval and semantic similarity, and are primarily evaluated on sentence-level benchmarks. Their behavior on isolated words is less understood. We show that simply prepending semantic prompts to words before embedding substantially improves word similarity correlations. Testing 7 text embedding models, including text-embedding-3-large (OpenAI), embed-english-v3.0 (Cohere), voyage-3 (Voyage AI), all-mpnet-base-v2, and Qwen3-Embedding-8B, on 3 standard benchmarks (SimLex-999, WordSim-353, MEN-3000), we find that prompts like "meaning: word" or "Represent the semantic concept: word" improve Spearman correlations by up to +0.28 on SimLex-999. Some models fail completely on bare words (ρ ≈ 0) but recover with prompts (+0.73 improvement). Our best results achieve ρ=0.692 on SimLex-999 with embed-english-v3.0 (Cohere), ρ=0.811 on WordSim-353, and ρ=0.855 on MEN-3000 with text-embedding-3-large (OpenAI). These results outperform classic static embeddings like Word2Vec (ρ=0.40) and even the best static method LexVec (ρ=0.48) on SimLex-999, establishing a new state-of-the-art for pure embedding methods. This zero-shot technique requires no training and works with any text embedding model.
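A minimal sketch of the evaluation loop described above: wrap each word in a prompt template, embed it, score word pairs by cosine similarity, and compare against human ratings with Spearman's ρ. The embedder `toy_embed` (letter counts) and the three word pairs are hypothetical placeholders, not the paper's models or data; a real run would plug in an actual embedding model and a benchmark such as SimLex-999.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


def prompted_word_similarity(
    pairs: Sequence[Tuple[str, str, float]],
    embed: Callable[[str], Sequence[float]],
    template: str = "{word}",
) -> float:
    """Correlate cosine similarities of templated words with gold ratings."""
    predicted = [
        cosine(embed(template.format(word=a)), embed(template.format(word=b)))
        for a, b, _ in pairs
    ]
    gold = [rating for _, _, rating in pairs]
    return spearman(predicted, gold)


def toy_embed(text: str) -> List[float]:
    """Hypothetical embedder (letter counts); stands in for a real model."""
    return [float(text.lower().count(chr(ord("a") + i))) for i in range(26)]


toy_pairs = [("cat", "cats", 9.0), ("cat", "dog", 5.0), ("cat", "sky", 1.0)]
rho_bare = prompted_word_similarity(toy_pairs, toy_embed)
rho_prompted = prompted_word_similarity(toy_pairs, toy_embed, "meaning: {word}")
```

The toy embedder says nothing about whether prompting helps; it only shows where the template enters the pipeline, so that bare and prompted conditions can be compared under an identical scoring procedure.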
Temporal reasoning over historical events is vital for temporal NLP tasks such as event extraction, entity linking, question answering (QA), timeline summarization, event clustering, and natural language inference. However, benchmarks for evaluating large language models (LLMs) on temporal reasoning remain limited. Existing datasets are small, lack multilingual coverage, and focus on recent events. To address this, we introduce HistoryBank, a multilingual database of 10M+ historical events sourced from Wikipedia timelines and infoboxes. Our database provides unprecedented coverage in both historical depth and linguistic breadth with 10 languages. We also present a comprehensive benchmark covering 6 temporal QA tasks across all languages, evaluating models like LLaMA-3-8B, Mistral-7B, Gemma-2-9B, Qwen3-8B, and GPT-4o. GPT-4o consistently performs best; Gemma-2 leads among smaller models. Our work offers a rich resource for advancing multilingual, temporally-aware language understanding of historical events. To support further research, we publicly release our code and datasets. Code is available at https://github.com/mandalbiswadip/history-bank and data at https://drive.google.com/drive/folders/1vHudioDdI3EeYPbhYjKa0gimxaXvpxB2.
We present EvalMORAAL, a transparent chain-of-thought (CoT) framework that uses two scoring methods (log-probabilities and direct ratings) plus a model-as-judge peer review to evaluate moral alignment in 20 large language models. We assess models on the World Values Survey (55 countries, 19 topics) and the PEW Global Attitudes Survey (39 countries, 8 topics). With EvalMORAAL, top models align closely with survey responses (Pearson’s r ≈ 0.90 on WVS). Yet we find a clear regional difference: Western regions average r=0.82 while non-Western regions average r=0.61 (a 0.21 absolute gap), indicating a persistent regional alignment gap. Our framework adds three parts: (1) two scoring methods for all models to enable fair comparison, (2) a structured CoT protocol with self-consistency checks, and (3) a model-as-judge peer review that flags 348 conflicts using a data-driven threshold. Peer agreement relates to WVS survey alignment (r=0.74, p<.001; PEW r=0.39, n.s.), supporting automated quality checks. These results show real progress toward culture-aware AI while highlighting open challenges for use across regions.
Preventing (possibly) toxic large language models (LLMs) from generating toxic content is an important research area. Yet most research has focused on defending safe models against jailbreak or toxic prompts. Such defenses could fail on already-toxic models, whether the toxicity was introduced unintentionally by individual developers or by attackers who have access to the model weights. We thus propose a simple yet effective and novel algorithm, Toxic Subword Pruning (ToxPrune), which prunes the subwords contained in toxic words from the BPE vocabulary of trained LLMs. In contrast to previous work showing that pruning BPE tokens harms machine translation, we surprisingly find it useful for preventing LLMs from generating toxic content. Our method has unique advantages. First, our findings suggest that ToxPrune simultaneously improves the toxic language model NSFW-3B on dialogue response generation. Second, ToxPrune also improves the official Llama-3.1-6B on the metric of diversity. Extensive automatic and human evaluation results indicate that ToxPrune could be helpful both for remediating toxic LLMs and for improving non-toxic LLMs on the task of dialogue response generation.
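One way to picture subword pruning is as a mask over the output vocabulary at decoding time. The sketch below is an assumption-laden illustration, not the paper's implementation: `greedy_tokenize` is a toy longest-match segmenter standing in for a real BPE tokenizer, the four-entry vocabulary and the placeholder word "badword" are invented, and masking logits to negative infinity is just one possible way to realize pruning.

```python
import math
from typing import Dict, Iterable, List, Set


def greedy_tokenize(word: str, vocab: Dict[str, int]) -> List[str]:
    """Toy longest-match segmentation; a stand-in for a real BPE tokenizer."""
    tokens: List[str] = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            i += 1  # no vocabulary entry starts here; skip this character
    return tokens


def toxic_token_ids(toxic_words: Iterable[str], vocab: Dict[str, int]) -> Set[int]:
    """Collect the ids of every subword that appears inside a listed word."""
    return {vocab[t] for w in toxic_words for t in greedy_tokenize(w, vocab)}


def prune_logits(logits: List[float], banned: Set[int]) -> List[float]:
    """Set banned token logits to -inf so they can never be sampled."""
    return [-math.inf if i in banned else x for i, x in enumerate(logits)]


# Hypothetical toy vocabulary and next-token logits.
vocab = {"good": 0, "bad": 1, "word": 2, "day": 3}
banned = toxic_token_ids(["badword"], vocab)
pruned = prune_logits([0.1, 2.0, 1.5, 0.3], banned)
next_token_id = max(range(len(pruned)), key=lambda i: pruned[i])
```

In this toy run the two highest-scoring tokens are both banned, so decoding falls back to the best remaining token. A practical version would need to decide which short or common subwords to exempt, a choice this sketch leaves open.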
The current era of Large Language Models (LLMs) has two shortcomings. The first is a shortage of multilingual models: most LLMs are English-centric and perform poorly on multilingual reasoning. The second concerns where external knowledge is placed: most retrieved knowledge is prepended to the user query, which may be sub-optimal. This paper presents a novel, simple yet effective method called Dictionary Insertion Prompting (DIP). Given a non-English prompt, DIP looks up a word dictionary and inserts the words’ English counterparts into the middle of the prompt. This enables better translation into English and better English reasoning steps, which leads to clearly better results. We experiment with 10 to 200 languages from FLORES-200. Since there are no adequate datasets, we use the NLLB translator to create synthetic multilingual benchmarks from 4 existing English reasoning benchmarks such as GSM8K and AQuA. The synthetic benchmarks are translated back into English for quality assurance with manual annotation. Interestingly, where the dictionary is injected is an important factor in the performance gains: interleaving the dictionary with the original words performs better than prepending or appending the same dictionary.
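The contrast between interleaving and prepending a dictionary can be sketched as follows. The parenthesized format, the `key = value` glossary format, the function names, and the Spanish example are illustrative assumptions; the abstract does not specify the exact prompt layout used by DIP.

```python
from typing import Dict, List


def _normalize(word: str) -> str:
    """Lowercase and strip surrounding punctuation before dictionary lookup."""
    return word.strip(".,;:!?¿¡").lower()


def interleave_dictionary(prompt: str, dictionary: Dict[str, str]) -> str:
    """Insert each known word's English counterpart right after the word."""
    out: List[str] = []
    for word in prompt.split():
        out.append(word)
        key = _normalize(word)
        if key in dictionary:
            out.append(f"({dictionary[key]})")
    return " ".join(out)


def prepend_dictionary(prompt: str, dictionary: Dict[str, str]) -> str:
    """Baseline: list all dictionary entries before the unchanged prompt."""
    entries: List[str] = []
    for word in prompt.split():
        key = _normalize(word)
        if key in dictionary:
            entries.append(f"{key} = {dictionary[key]}")
    return "; ".join(entries) + "\n" + prompt


# Hypothetical Spanish-to-English dictionary and prompt.
es_en = {"tiene": "has", "tres": "three", "manzanas": "apples"}
prompt = "Juan tiene tres manzanas"
interleaved = interleave_dictionary(prompt, es_en)
prepended = prepend_dictionary(prompt, es_en)
```

Both variants carry the same dictionary content; only its position differs, which is the factor the abstract reports as mattering for performance.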