Workshop on Multilinguality in the Era of Large Language Models (2026)

Volumes

Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026) 32 papers

Proceedings of the 1st Workshop on Multilinguality in the Era of Large Language Models (MeLLM 2026)
Kaiyu Huang | Fengran Mo | Pinzhen Chen | Meng Jiang

pdf bib abs

Lost in Dialect: The Annotation Gap in Multilingual LLM Safety
Wajdi Zaghouani

Large Language Models are increasingly used as safety infrastructure for detecting harmful online content and moderating social media across multiple languages. Yet their effectiveness remains uneven across linguistic communities. This unevenness reflects not only disparities in training data availability but also structural problems in annotation design. We argue that a central source of multilingual safety failure lies in the annotation gap underlying existing hate speech datasets. Most annotation guidelines and safety benchmarks are developed for English and standard language varieties, overlooking dialectal variation and culturally embedded forms of hostility. Using Arabic dialectal discourse as a case study, we show how harmful speech expressed through dialects, sarcasm, code-switching, and culturally specific expressions often remains undetected by current annotation schemes. We introduce the concept of the Multilingual Safety Annotation Gap (MSAG), identifying four sources of bias: language coverage gaps, dialect representation gaps, cultural semantic gaps, and annotation guideline gaps. We discuss implications for LLM safety alignment and outline directions for culturally grounded multilingual annotation. This paper is primarily a conceptual and methodological position paper; rather than introducing a new benchmark or empirical evaluation, we aim to formalize the MSAG as a framework for analyzing systematic weaknesses in multilingual safety annotation pipelines.

pdf bib abs

Evidence-Augmented Generation Reasoning for Extremely Low-Resource Language Decipherment
Xiaoyu Zhu | Long Yuan | Rui Qi | Jinan Xu

Inspired by linguistic Olympiads, extremely low-resource language reasoning presents a unique challenge that requires models to solve problems without prior knowledge. This task mirrors the Rosetta Stone decipherment process, where the goal is to induce and apply linguistic rules from minimal context. Existing methods mainly rely on naive in-context learning that fails to handle the complexity and diversity of language rules. To mitigate this issue, we propose a framework that combines dynamic knowledge construction with task-aware evidence augmentation. First, we use large language models (LLMs) to generate a diverse set of task-specific examples that instantiate potential linguistic rules for the target low-resource language. Second, we apply a semantic retrieval mechanism to select the most relevant examples as evidence for each test query, preventing context overload and ensuring focused, analogical reasoning. Our method shifts from learning language distributions to dynamically discovering and applying rules. Experimental results on the LINGOLY and Linguini benchmarks show that our approach achieves competitive performance across various LLMs, outperforming existing baselines. More importantly, our framework advances extremely low-resource reasoning and provides a generalizable approach to rule induction under knowledge constraints.

pdf bib abs

With the increasing utilization of multilingual text information, Cross-Lingual Information Retrieval (CLIR) has become a crucial research area. However, the impact of training data composition on CLIR and Mono-Lingual Information Retrieval (Mono-IR) performance remains underexplored. To investigate this data-centric aspect, we construct linguistically parallel Korean-English datasets and train multilingual retrieval models with various language combinations. Our experiments reveal that the language composition of training data significantly influences IR performance, exhibiting important inter-lingual correlations: using specific language pairs improves CLIR performance but degrades Mono-IR performance. Our work demonstrates that simple weight-averaged model merging can effectively mitigate this trade-off, achieving strong CLIR results while preserving Mono-IR capabilities. Our findings highlight the effects of the linguistic configuration of training data on both CLIR and Mono-IR, and present model merging as a viable strategy to optimize performance across these tasks.
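
As a minimal sketch of what weight-averaged merging means in general, the snippet below averages two checkpoints parameter by parameter. The plain-dict checkpoint layout, the function name `merge_checkpoints`, and the mixing weight `alpha` are illustrative assumptions; the abstract does not describe the authors' implementation.

```python
from typing import Dict, List

# Hypothetical checkpoint layout: parameter name -> flat list of weights.
Checkpoint = Dict[str, List[float]]


def merge_checkpoints(a: Checkpoint, b: Checkpoint, alpha: float = 0.5) -> Checkpoint:
    """Return alpha * a + (1 - alpha) * b, computed per parameter."""
    if a.keys() != b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged: Checkpoint = {}
    for name in a:
        if len(a[name]) != len(b[name]):
            raise ValueError(f"size mismatch for parameter {name!r}")
        merged[name] = [alpha * x + (1.0 - alpha) * y for x, y in zip(a[name], b[name])]
    return merged


# Toy checkpoints standing in for a CLIR-tuned and a Mono-IR-tuned model.
clir_model = {"encoder.weight": [1.0, 2.0], "encoder.bias": [0.0]}
mono_model = {"encoder.weight": [3.0, 4.0], "encoder.bias": [1.0]}
merged = merge_checkpoints(clir_model, mono_model)
```

With `alpha = 0.5` every merged parameter is the plain mean of the two inputs; other values of `alpha` would bias the merged model toward one of the two checkpoints.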

pdf bib abs

Multilingual embedding models often exhibit uneven representational quality, heavily favoring high-resource languages like English. However, conventional retrieval systems that rely exclusively on source-language queries fail to exploit the superior semantic expressiveness of these high-resource subspaces. To address this, we propose Query-Synergy, a training-free approach to improving retrieval performance using multilingual embeddings. Our method utilizes additional queries in English to complement source language queries and integrates similarity scores from both queries, effectively enhancing retrieval performance. We evaluate our approach across five languages (Arabic, Chinese, Greek, Thai, and Turkish) using four multilingual embedding models on two datasets. Our experiments show that this approach outperforms conventional source query retrieval methods, achieving superior nDCG scores across various configurations and translation settings. These results confirm that Query-Synergy is a simple yet effective method for retrieval across multiple languages.
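
One way to picture the score integration described above is a weighted sum of the two cosine similarities. This is only a sketch: the fusion rule, the `weight` parameter, the function names, and the toy two-dimensional embeddings are assumptions, since the abstract does not say how Query-Synergy combines the scores.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def fused_ranking(
    src_query: Sequence[float],
    en_query: Sequence[float],
    docs: Dict[str, Sequence[float]],
    weight: float = 0.5,
) -> List[str]:
    """Rank documents by a weighted sum of source-query and English-query similarity."""
    scores = {
        doc_id: weight * cosine(src_query, emb) + (1.0 - weight) * cosine(en_query, emb)
        for doc_id, emb in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings: the source-language and English queries point in different directions.
docs = {"d1": [1.0, 0.0], "d2": [1.0, 1.0], "d3": [-1.0, 0.0]}
ranking = fused_ranking(src_query=[1.0, 0.0], en_query=[0.0, 1.0], docs=docs)
```

In this toy case `d2` ranks first because it is moderately similar to both queries, while `d1` matches only the source-language query.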

pdf bib abs

Kyrgyz Text Normalization: A Comparative Study of Neural and Rule-Based Approaches
Zarina Uvalieva | Bektemir Kumarbai Uulu | Adilet Metinov | Tynchtykbek Tashbaltaev | Nurtilek Alibekov

Text normalization, the task of converting noisy, informal text into a standardized form, is a fundamental preprocessing step for many NLP applications. Despite the growing need for Kyrgyz language processing tools, to the best of our knowledge, no prior work has addressed automatic text normalization for Kyrgyz, a morphologically rich, low-resource Turkic language. In this paper, we present the first systematic study of Kyrgyz text normalization. We collect a dataset of 1.67 million noisy–clean text pairs sourced from YouTube comments, Instagram posts, and Telegram channels, where users frequently write without punctuation, capitalization, or standard spelling. Pairs were annotated with Gemini 3 Pro; the 1,000-example test set was fully verified by two native Kyrgyz speakers with adjudication, and a random subset of the training data was spot-checked, while the full 1.67M training set was not verified exhaustively. For continual pre-training, we additionally use a 538 MB Kyrgyz corpus compiled from news portals and books. We evaluate five systems: a rule-based baseline, zero-shot mT5, a fine-tuned mT5-small model, a continually pre-trained mT5-small followed by fine-tuning, and zero-shot Gemma 4. Our experiments show that fine-tuned mT5-small achieves a CER of 0.0796, outperforming the rule-based baseline (CER 0.2029), zero-shot mT5 (CER 0.9887), and zero-shot Gemma 4 (CER 0.1620), a roughly 32× larger model in a fine-tuned vs. zero-shot setting. Human evaluation by two native Kyrgyz speakers confirms these results, with fine-tuned mT5-small rated as correct in 99.8% of cases. We further analyze why continual pre-training with span corruption does not improve over direct fine-tuning, finding hallucination in 35/40 of the inspected failure cases (87.5%, 95% Wilson CI [74%, 95%]).
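
The Wilson interval quoted for the 35/40 hallucination count can be reproduced with the standard Wilson score formula. The sketch below assumes z = 1.96 for the 95% level; the function name `wilson_interval` is illustrative and not taken from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(35, 40)
print(f"{low:.0%} to {high:.0%}")  # rounds to the reported [74%, 95%]
```

Unlike the simpler normal-approximation interval, the Wilson interval stays inside [0, 1] and behaves reasonably for small samples such as n = 40.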

pdf bib abs

MultiHaluDet: Multilingual Hallucination Detection via LLM Hidden State Probing
Riasad Alvi | Nurul Labib Sayeedi | Md. Faiyaz Abdullah Sayeedi

Hallucinations in Large Language Models (LLMs) represent a critical barrier to their reliable deployment, a vulnerability heavily exacerbated in non-English and resource-constrained contexts. Existing detection approaches that rely on output confidence heuristics or single-layer internal representations frequently fail to capture deep, complex factual inconsistencies across diverse languages. To address this, we introduce MultiHaluDet, a novel three-stage stacking framework that detects multilingual hallucinations by probing the full hidden state trajectories of frozen LLMs without requiring language-specific fine-tuning. Our method extracts sequential features across multiple layers and processes them via a hybrid architecture using multi-scale attention and self-attention pooling. By generating out-of-fold embeddings that feed into a calibrated classical classifier ensemble, MultiHaluDet captures both fine-grained and coarse-grained patterns of factual inconsistency. Extensive experiments demonstrate that our framework achieves state-of-the-art detection performance, reaching up to 98.55% AUROC on the English HaluEval and TriviaQA benchmarks using Mistral-7B and LLaMA2-7B architectures. Crucially, we rigorously evaluate our framework’s cross-lingual generalization across high (French), medium (Bangla), and low-resource (Amharic) languages. MultiHaluDet demonstrates exceptional representational robustness, consistently outperforming baselines and successfully transferring hallucination detection capabilities across typologically diverse linguistic tiers.

pdf bib abs

Financial misinformation poses significant threats to financial market stability and individuals’ investment decisions. The multilingual environment and the inherent complexity of financial information present substantial challenges for Multilingual Financial Misinformation Detection (MFMD). Existing LLM-based approaches for financial misinformation detection primarily focus on English and a single financial misinformation detection task, which limits their ability to capture multilingual contexts and complex features. In this paper, we propose MFMDQwen, the first open-source LLM designed for MFMD tasks. Furthermore, we introduce MFMD4Instruction, the first instruction dataset supporting MFMD with LLMs, covering English, Chinese, Greek, and Bengali. We also construct MFMDBench, a benchmark dataset for evaluating the MFMD capabilities of LLMs. Experimental results on MFMDBench demonstrate that our model outperforms existing open-source LLMs.

pdf bib abs

Multilingual Chain-of-Thought Compression via Cross-Lingual Distillation
Jiarui Wan | Songming Zhang | Yufeng Chen

Chain-of-thought reasoning improves the performance of large language models on complex tasks but often produces overly verbose outputs, leading to increased inference cost. This issue is exacerbated in multilingual settings, where differences in tokenization and linguistic structure result in inconsistent compression performance across languages. Existing methods are largely English-centric and tend to suffer from accuracy degradation, especially in low-resource languages. We propose Multilingual Chain-of-thought Compression via Cross-lingual Distillation (MCD), a unified framework that addresses these challenges through both data construction and optimization. MCD builds a cross-lingually aligned dataset using a translation-with-verification pipeline and difficulty-aware sampling, and employs a reinforcement training strategy that combines supervised fine-tuning with direct preference optimization to encourage concise yet sufficient reasoning. Experiments on multilingual mathematical benchmarks show that MCD consistently reduces reasoning length while maintaining competitive accuracy, and significantly improves robustness in low-resource languages.

pdf bib abs

The problem of extractive multilingual QA with LLMs is characterized by complex interactions among retrieval mechanisms, knowledge source configurations, prompting techniques, and script biases. Despite high retrieval quality, multilingual RAG often degrades performance, revealing a gap between retrieved evidence and its effective utilization. To address this issue, this paper offers an extensive empirical study that examines these components by comparing retrieval-augmented generation (RAG) with a non-RAG baseline across 21 typologically diverse languages and 5 leading LLMs. Our analysis includes five prompting strategies and multiple retrieval configurations, which enable a unified evaluation across diverse linguistic settings. We observe an evidence utilization gap in RAG settings, where RAG underperforms despite high retrieval hit rates because models leverage the retrieved evidence inefficiently. We also introduce lightweight inference-time metrics to better characterize retrieval usage and conflict patterns. We further highlight script fidelity as a crucial factor in our experiments, as non-Latin-script languages experience significant performance drops and increased hallucinations without proper grounding. In addition, we analyze generator language preferences, systematically examine conflicts, and identify mechanisms for the effective detection and resolution of conflicts. The study also details how prompting strategies affect language families and script types, offering a detailed analysis for optimizing future multilingual RAG settings.

pdf bib abs

DIMAS-OMOP: A Deliberative Intelligence-Based Multi-Agent System for Chinese Medical Text Standardization toward OMOP
Hanlin Lv | Xiao Wang | Kesong Wu | Lei Li | Lei Wang

Standardizing Chinese clinical imaging reports within the Observational Medical Outcomes Partnership (OMOP) framework is hindered by linguistic complexity and output inconsistency in existing methods. We propose DIMAS-OMOP, a Deliberative Intelligence-based Multi-Agent System designed for high-fidelity medical concept mapping toward OMOP standardization. Moving beyond single-model architectures, DIMAS-OMOP employs a hybrid three-stage workflow that integrates traditional natural language processing modules with selective Large Language Model reasoning and Retrieval-Augmented Generation. The core innovation lies in a hierarchical six-agent proposer-skeptic deliberation mechanism, complemented by a dynamic concept resolution approach and a four-dimensional quality control framework. Experimental results on 1,250 imaging reports demonstrate that DIMAS-OMOP achieves 95.2% mapping accuracy, significantly outperforming rule-based methods (+21.8 percentage points) and single-AI baselines (+8.1 percentage points). The system maintains a throughput of 1,200 reports/hour, with the multi-agent deliberation stage alone contributing an 8.9% relative accuracy gain. Furthermore, pilot deployment shows a 160.6% return on investment and a 31.5% increase in workflow efficiency. This study provides a novel, robust methodology for integrating unstructured non-English clinical data into the global Observational Health Data Sciences and Informatics (OHDSI) ecosystem through deliberative intelligence.

pdf bib abs

Beyond Accuracy: A Structured Error Analysis of Multilingual LLMs on Marathi Script Variation and Syntax
Tejas Patil | Barnali Chetia

Evaluation of multilingual large language models has grown rapidly in recent years, yet Marathi, spoken by over 83 million people across India, has received almost no systematic probing beyond surface-level benchmark tests. Most existing multilingual evaluations either omit Marathi entirely or rely on machine-translated test sets that fail to capture the morphological complexity that defines the language. We evaluate four models, namely Llama-3.1-8B, Llama-3.3-70B, Mistral-7B, and Qwen3-32B, on our manually curated Marathi dataset across three probing dimensions: Devanagari versus Romanized script, Marathi-English code-mixing, and syntactic structures including SOV word order, vibhakti case markers, verb gender agreement, and postpositions. Models are tested under English and Marathi instruction conditions across translation, similarity, grammaticality, and case marker tasks. Translation quality is evaluated using both token-level F1 and BERTScore to capture paraphrase equivalence beyond surface word overlap. All models drop between 7.9% and 20.5% on Romanized input. The negative subjunctive marker nasta is ignored by every model. Vibhakti case markers are consistently replaced with Hindi equivalents, revealing that multilingual training has not produced separate internal representations for Hindi and Marathi despite their distinct morphological systems. These findings reveal structural gaps in how current multilingual LLMs handle morphologically rich, low-resource Indic languages and point to specific areas where dedicated Marathi pretraining data would most benefit future work.

pdf bib abs

Cross-Lingual Sentiment Misalignment: Auditing Multilingual Language Models for Inversion Risk, Dialectal Representation, and Affective Stability
Nusrat Jahan Lia | Shubhashis Roy Dipta

Recent advances in multilingual representation learning aim to bridge the performance gap between high- and low-resource languages, yet their ability to preserve affective meaning across languages remains underexplored, particularly for underrepresented languages like Bengali. This research addresses cross-lingual sentiment misalignment between Bengali and English by introducing a controlled benchmarking framework evaluating four multilingual transformer models on parallel Bengali-English sentence pairs, stratified by dialect, to assess their representational stability. We demonstrate that a compressed model architecture exhibits a 28.7% "Sentiment Inversion Rate," fundamentally misinterpreting positive semantics as negative (or vice versa). Consequently, we identify a cross-lingual sentiment skew that we call "Asymmetric Empathy", where models systematically dampen or artificially amplify the affective weight of Bengali text relative to its exact English counterpart. Finally, we expose a key vulnerability regarding dialectal representation: a "Modern Bias" in the regional model, which exhibits a 57% increase in alignment error when processing the formal Bengali register compared to modern colloquial text. As foundational encoders continue to serve as safety classifiers and reward models for LLM pipelines, cross-lingual reliability becomes a critical concern. We therefore advocate for the integration of "Affective Stability" metrics into future cross-lingual benchmarks to detect and penalize polarity inversions, particularly in low-resource settings.
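
To make the idea of a polarity inversion concrete, the sketch below counts parallel pairs whose predicted labels are strictly opposite (positive vs. negative). This definition, the label names, and the function name are assumptions for illustration; the abstract does not give the exact formula behind the reported "Sentiment Inversion Rate".

```python
from typing import Iterable, Tuple

# Label pairs treated as an inversion under the assumed definition.
OPPOSITE = {("positive", "negative"), ("negative", "positive")}


def sentiment_inversion_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (Bengali label, English label) pairs with opposite polarity."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one prediction pair")
    flips = sum(1 for pair in pairs if pair in OPPOSITE)
    return flips / len(pairs)


# Toy predictions for four parallel Bengali-English sentence pairs.
predictions = [
    ("positive", "positive"),  # consistent
    ("negative", "positive"),  # inversion
    ("neutral", "positive"),   # mismatch, but not a polarity inversion
    ("positive", "negative"),  # inversion
]
rate = sentiment_inversion_rate(predictions)
```

Under this definition a neutral/positive mismatch is not counted, which separates outright polarity flips from milder dampening or amplification of sentiment.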

pdf bib abs

GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation
Yunsu Kim | Kaden Uhlig | Joern Wuebker

Agent benchmarks remain largely English-centric, while their multilingual versions are often built with machine translation (MT) and limited post-editing. We argue that, for agentic tasks, this minimal workflow can easily break benchmark validity through query-answer misalignment or culturally off-target context. We propose a refined workflow for adapting English benchmarks into multiple languages with explicit functional alignment, cultural alignment, and difficulty calibration using both automated checks and human review. Using this workflow, we introduce GAIA-v2-LILT, a re-audited multilingual extension of GAIA covering five non-English languages. In experiments, our workflow improves agent success rates by up to 32.7% over minimally translated versions, bringing the closest audited setting to within 3.1% of English performance while substantial gaps remain in many other cases. This indicates that a substantial share of the multilingual performance gap is benchmark-induced measurement error, motivating task-level alignment when adapting English benchmarks across languages. The data is available as part of the MAPS package. We also release the code used in our experiments.

pdf bib abs

Do Thoughts Depth Affect Multilingual Reasoning?
Linjian Yang | Xinyan Wang | Kunpeng Liu

Chain-of-Thought (CoT) is commonly used to improve reasoning performance in large language models. We investigate its impact in multilingual contexts by systematically constraining reasoning steps across languages with varying resource levels. This study evaluates two models on two benchmarks with seven languages, comparing constrained CoT depth against zero-shot and free-CoT baselines. We demonstrate that increasing the number of reasoning steps does not consistently improve accuracy across various languages. While high-resource and mid-resource languages remain stable, low-resource languages often experience a decline in performance as the number of reasoning steps increases. We attribute this decline to error accumulation and reasoning noise, which are amplified under deeper reasoning in low-resource languages. These findings indicate that CoT is not inherently beneficial, but its effectiveness is significantly influenced by the interaction between reasoning steps and language resource availability.

pdf bib abs

On the Limits of Model Merging for Multilinguality in Pre-Training
Seth Aycock | Fedor Vitiugin | Aleksandr Umnov | Christof Monz | Khalil Sima’an

Endowing models with consistent multilingual performance can be achieved by _mixing_ pre-training data, or post-training approaches such as language-specific model _merging_. In this work, we test whether merging can be applied to monolingually pre-trained models. We conduct a controlled study on the efficacy of mixed, merged, and monolingual pre-training setups. We find that while monolingual pre-training results in strong in-language performance, merging any combination of monolingual models leads to performance collapse due to interference. Our analysis suggests representational similarity is a prerequisite for model merging. We therefore conclude that the flexibility of merging in fine-tuning does not extend trivially to language-specific pre-training.

pdf bib abs

mmPISA-bench: Do LLMs Reason Equally Well Across 43 Languages?
Yerzhan Sapenov | Jaromir Savelka

We introduce mmPISA-bench, a compact high-quality multilingual reasoning benchmark derived from the OECD Programme for International Student Assessment (PISA). The benchmark consists of 25 multiple-choice questions that require reasoning in order to be answered correctly. Each question is provided in official human translations to 43 languages and complemented with machine-translated counterparts (i.e., 2,150 data points in total). We evaluate two mainstream proprietary LLMs across languages, reasoning effort levels, and translation types in terms of their ability to answer the questions correctly. Our results show that modern LLMs can reason effectively across all evaluated languages and achieve accuracy comparable to human test-takers, with some performance variation across the covered languages. We further find that machine-translated questions do not degrade accuracy relative to official human translations, which suggests that high-quality machine translation (synthetic data) might often be adequate for large-scale multilingual reasoning evaluations where official translations are not available. Finally, we analyze token usage and related inference cost and find that LLM usage in some languages is simultaneously more expensive and less accurate.

pdf bib abs

Cross-Lingual Bias in Large Language Models: A Comparative Analysis of English and Swahili
Ruolei Zhang | Teddy Njuguna | Yue Feng

Large language models are increasingly deployed in multilingual contexts, yet safety alignment and bias evaluation remain overwhelmingly English-centric. We investigate whether social biases generalise across languages by submitting 4,900 symmetric English–Swahili prompt pairs to GPT-5.2 and Gemini 2.5 Flash across nine demographic bias axes, yielding 19,600 completions evaluated for stereotype prevalence, sentiment, refusal behaviour, and cross-lingual semantic similarity. Our findings show that bias transforms rather than transfers: stereotype rates shifted by up to 12 percentage points on specific axes, Gemini’s neutral-sentiment rate doubled in Swahili, and GPT-5.2 refused 169 prompts in English and zero in Swahili, indicating safety mechanisms functionally anchored to English-language tokens. Over 55% of prompt pairs produced semantically dissimilar completions across both models. These results reinforce the idea that English-only bias audits do not provide adequate coverage for multilingual deployment.

pdf bib abs

Polyglot-Lion: Efficient Multilingual ASR for Singapore via Balanced Fine-Tuning of Qwen3-ASR
Quy-Anh Dang | Chris Ngo

We present Polyglot-Lion, a family of compact multilingual automatic speech recognition (ASR) models tailored for the linguistic landscape of Singapore, covering English, Mandarin, Tamil, and Malay. Our models are obtained by fine-tuning Qwen3-ASR-0.6B and Qwen3-ASR-1.7B exclusively on publicly available speech corpora, using a balanced sampling strategy that equalizes the number of training utterances per language and deliberately omits language-tag conditioning so that the model learns to identify languages implicitly from audio. On 12 benchmarks spanning the four target languages, Polyglot-Lion-1.7B achieves an average error rate of 14.85, competitive with MERaLiON-2-10B-ASR (14.32), a model 6x larger, while incurring a training cost of 81 on a single RTX PRO 6000 GPU. Inference throughput is approximately 20x faster than MERaLiON at 0.10 s/sample versus 2.02 s/sample. These results demonstrate that linguistically balanced fine-tuning of moderate-scale pretrained models can yield deployment-ready multilingual ASR at a fraction of the cost of larger specialist systems.

pdf bib abs

The Multilingual Curse at the Retrieval Layer: Evidence from Amharic
Yosef Worku Alemneh | Kidist Amde Mekonnen | Maarten de Rijke

Multilingual retrieval increasingly underpins cross-lingual question answering and retrieval-augmented generation. Strong zero-shot scores on multilingual benchmarks are often taken as evidence that current encoders transfer reliably across many languages. We argue that this assumption breaks down for underrepresented, morphologically rich languages, and use Amharic as a diagnostic case. Under a shared passage retrieval protocol covering dense, late-interaction, learned sparse, and cross-encoder paradigms, we compare zero-shot multilingual retrievers, Amharic-fine-tuned multilingual retrievers, and monolingual Amharic retrievers. The strongest zero-shot multilingual retriever underperforms the strongest monolingual Amharic first-stage retriever by 23% relative MRR@10. Fine-tuning two recent multilingual embedding models on the same Amharic supervision yields 32–60% relative MRR@10 gains over zero-shot, but the best Amharic-fine-tuned multilingual model remains below the strongest monolingual Amharic retriever. These findings indicate that zero-shot multilingual retrieval is not a sufficient proxy for equitable information access in the LLM era: for underrepresented languages, retrieval must be evaluated and adapted in language rather than inferred from aggregate multilingual benchmarks. To foster future research, we publicly release our trained models, dataset, and codebase at https://github.com/rasyosef/amharic-neural-ir.
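
For readers unfamiliar with the metric, the sketch below shows how MRR@10 and a relative gain are conventionally computed. The function names, the dict-based input layout, and the toy queries are illustrative assumptions and are not taken from the released codebase.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant passage within the top k results."""
    if not rankings:
        raise ValueError("need at least one query")
    total = 0.0
    for query_id, ranked in rankings.items():
        for rank, passage_id in enumerate(ranked[:k], start=1):
            if passage_id in relevant.get(query_id, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


def relative_gain(new: float, baseline: float) -> float:
    """Relative improvement of `new` over `baseline`, e.g. 0.2 means +20%."""
    return (new - baseline) / baseline


# Toy run: q1 hits at rank 2, q2 at rank 1, q3 has no relevant passage in the top k.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y"], "q3": ["m", "n"]}
relevant = {"q1": {"b"}, "q2": {"x"}, "q3": {"z"}}
score = mrr_at_k(rankings, relevant)
```

A relative gain is computed against the baseline score, so it is not the same as an absolute difference in MRR@10 points.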

pdf bib abs

Emotion detection is an important text classification task with applications in sentiment analysis, social media monitoring, human-computer interaction, and affective language understanding. However, Punjabi written in the Shahmukhi script remains severely under-resourced for emotion detection, with limited benchmark-style resources available for supervised evaluation. This paper introduces ShahiEmotion, a new Punjabi Shahmukhi emotion detection dataset containing 30,379 sentence-level instances annotated with seven emotion categories: sadness, surprise, happiness, anger, neutral, fear, and disgust. The dataset is designed to support research in a low-resource setting characterized by script-specific challenges, lexical variation, and substantial class imbalance. We establish baseline results using several pretrained transformer-based models and formulate emotion detection as a sentence-level classification task. In particular, we fine-tune multilingual BERT, multilingual DistilBERT, XLM-RoBERTa, and Urdu RoBERTa under the same training and evaluation setting using standard cross-entropy loss. Experimental results show that XLM-RoBERTa provides the strongest overall performance among the compared models. The best model achieves 77.95% accuracy, 58.47% macro-F1, and 77.60% weighted-F1 on the test set. The dataset, evaluation protocol, and baseline results introduced in this work are intended to support future research on Punjabi Shahmukhi emotion analysis and low-resource NLP.
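
The gap between macro-F1 and weighted-F1 reported above is what one would expect under class imbalance. The sketch below illustrates the two averages on a toy two-class example; the labels, counts, and function names are made up for illustration and do not come from ShahiEmotion.

```python
from collections import Counter
from typing import Dict, List


def per_class_f1(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """F1 score for each label seen in either the gold or predicted labels."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Per-class F1 weighted by the number of gold instances of each class."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy imbalanced data: a frequent class and a rare class that is half missed.
y_true = ["neutral"] * 8 + ["fear"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "fear"]
macro, weighted = macro_f1(y_true, y_pred), weighted_f1(y_true, y_pred)
```

Because macro-F1 gives the rare class the same weight as the frequent one, errors on rare classes pull it down much more than they pull down weighted-F1 or accuracy.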

pdf bib abs

Evaluating Multilingual Tokenization under Worst-N Parity-Aware BPE
Vani Kanjirangat | David Kletz | Tanja Samardzic | Ljiljana Dolamic | Fabio Rinaldi

Improving the fairness of a language model is a goal that applies at every level of the model. In this paper, we evaluate a method targeting a foundational level: tokenization. We present a multilingual evaluation of parity-aware tokenization under worst-N optimization, extending PA-BPE to jointly optimize over the N worst-compressed languages. We evaluate this formulation for N > 1 across vocabulary sizes of 16K and 32K on the languages from the flores+ benchmark, using metrics that capture both efficiency and structural alignment. Our results reveal that the effects of increasing N are inconsistent across metrics and do not lead to major gains. Efficiency-oriented and boundary-level metrics show a modest tendency to improve at higher values of N, while structural alignment metrics (such as AST alignment and boundary crossing) exhibit no clear pattern, suggesting that compression fairness and linguistic structure are mainly orthogonal objectives. Script-level analysis further reveals uneven effects across writing systems, with several non-Latin scripts showing greater sensitivity to increasing N.

pdf bib abs

MLingualFC: Evaluating Jailbreak Vulnerabilities in Multilingual Vision-Language Models
Rishabh Makwana | Mamta Mamta | Deeksha Varshney | Oana Cocarascu

Vision-Language Models (VLMs) have demonstrated strong performance across multimodal tasks, yet their safety robustness remains an open challenge. While prior work has shown that structured visual prompts such as flowcharts can effectively jailbreak VLMs, existing studies are largely limited to English-centric settings. In this paper, we introduce MLingualFC, a multilingual multimodal benchmark designed to evaluate jailbreak vulnerabilities of VLMs across diverse languages using structured flowchart representations. MLingualFC encodes harmful instructions into flowchart images across five languages (Hindi, Punjabi, Spanish, Romanian, and German). We evaluate state-of-the-art multilingual VLMs, including Qwen2.5-VL, Gemma-4, and Pangea, under a black-box threat model. Our results reveal significant multilingual safety gaps. Flowchart-based attacks achieve high attack success rates (ASR) for Latin-script languages, demonstrating that visual encoding of harmful content effectively bypasses safety alignment across languages. In contrast, non-Latin-script languages such as Punjabi exhibit substantially lower ASR, suggesting potential limitations in visual text recognition rather than stronger safety alignment. These findings highlight that current VLM safety mechanisms fail to generalize across languages and modalities.

pdf bib abs

As Large Language Models (LLMs) become embedded in everyday communication, capturing regional linguistic variation is essential for reliable and equitable language use. In Portuguese, European (pt-PT) and Brazilian (pt-BR) varieties remain unevenly represented, with pt-BR dominating in data quantity, while LLM preference for Portuguese variants remains underexplored. To address this gap, we introduce P3B3, an expert-curated variety-agnostic benchmark of conversational prompts, along with an evaluation framework for measuring variety bias and controllability. Experiments on several models show that most LLMs exhibit a strong bias toward pt-BR, with variation in controllability across models. These results highlight the need for more balanced multilingual representation across language varieties.

pdf bib

SN-WER: Script-Normalized WER for Multi-Script Indic ASR Evaluation
Priyaranjan Pattnayak

pdf bib abs

Causal Localization of the English Pivot in LLaVA: Mechanistic VLM Analysis and Training-Free Multilingual Steering
Abrar Zahin Raihan | Aurchi Chowdhury

Multilingual vision-language models (VLMs) consistently underperform on non-English visual queries, yet the internal mechanism behind this disparity remains unknown. As a focused case study on LLaVA-1.5-7B, we apply logit-lens analysis and causal activation patching to show that non-English visual queries are routed through an English-biased representational bottleneck in layers 5–17, extending the English-pivot phenomenon of Wendler et al. (2024) to the multimodal setting. Peak causal influence occurs at layer 8 (mean AIE = 0.49, averaged across languages), with all measurable pivot signal running through text-token positions. Without meaningful visual content (blank-image condition), language-specific representations do not emerge at any layer, showing that the pivot is image-content-dependent rather than triggered by any visual input. Building on these findings, we derive training-free language-steering vectors at the mechanistically identified pivot layers, improving Russian VQA by +6.5 pp and Portuguese by +4.0 pp on MMMB without any fine-tuning, the latter surpassing the English baseline. Within this case study, our results are consistent with the English pivot being a structural property of the LLM backbone that multimodal pre-training does not mitigate; extending this mechanistic methodology to other VLMs and language families remains an important direction for future work.

pdf bib abs

Multilingual Disparities in LLM-Based Safety Judgments: Evidence from Brand Safety Applications
Songjiang Liu | Riley Grossman | Mike Smith | Cristian Borcea | Yi Chen

Multilingual LLMs are increasingly used as context-aware judges in real-world information systems under the assumption that equivalent content receives equivalent judgments across languages. We examine this assumption through brand safety, a global application where automated ratings can affect advertisers’ reputations, publishers’ revenues, and users’ access to news. We construct a benchmark of LLM-generated safety ratings for 10,467 semantically aligned news articles across 13 languages. We find systematic cross-lingual disagreement appearing in more than 96% of cases where at least one language receives a non-zero risk rating. Suitability ratings differ significantly by language, controlling for run, category, and article. In the main model, English, German, and French content is generally rated more strictly, while Polish, Hungarian, Greek, Turkish, and Persian content is rated more leniently. Robustness checks with two additional LLMs show that significant language effects persist, though directional patterns vary by model. These findings show that multilingual LLM safety judgments can produce unequal outcomes for semantically equivalent content.
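
As an illustration of the disagreement statistic described in this abstract, the following is a minimal Python sketch. It assumes (this is not stated in the abstract) that an article counts as a disagreement case whenever its per-language ratings are not all identical, and that a rating of 0 means no risk. The function name `cross_lingual_disagreement_rate` and the toy data are hypothetical.

```python
def cross_lingual_disagreement_rate(ratings):
    """Share of flagged articles whose ratings differ across languages.

    ratings: dict mapping article id -> dict mapping language -> integer
    risk rating (assumption: 0 means "no risk").
    An article is "flagged" if at least one language has a non-zero rating.
    """
    flagged = 0
    disagreeing = 0
    for per_language in ratings.values():
        values = list(per_language.values())
        if not any(v != 0 for v in values):
            continue  # no language raised a risk, so the article is excluded
        flagged += 1
        if len(set(values)) > 1:
            disagreeing += 1  # at least two languages received different ratings
    return disagreeing / flagged if flagged else float("nan")


# Toy data (hypothetical): three aligned articles rated in three languages.
toy_ratings = {
    "a1": {"en": 2, "de": 2, "pl": 0},  # flagged, languages disagree
    "a2": {"en": 0, "de": 0, "pl": 0},  # never flagged, excluded
    "a3": {"en": 1, "de": 1, "pl": 1},  # flagged, languages agree
}
print(cross_lingual_disagreement_rate(toy_ratings))
```

Conditioning on flagged articles keeps the large number of uniformly zero-rated articles from diluting the rate.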

pdf bib abs

Benchmarking Byte-Pair Encoding Tokenizers on Different Languages with Bits per Byte
Soham Chowdhury | Warren Woolf

Tokenization significantly affects the cross-lingual performance of language models, yet recent tokenizer variants such as SuperBPE and MorphBPE have not been systematically evaluated across typologically diverse languages. We conduct the first extrinsic cross-language comparison of BPE, SuperBPE, and MorphBPE tokenizers on English, Mandarin, and Hungarian, using bits per byte (BPB) normalized perplexity as our metric, with vocabulary sizes of 8K, 16K, and 32K. We find that SuperBPE matches BPE for English but underperforms by 0.01–0.06 BPB for Hungarian and Mandarin, suggesting that cross-whitespace merging is counterproductive for non-English languages. MorphBPE performs worse than BPE across all settings, with gaps of 0.02–0.04 BPB at the 32K vocabulary size. These results suggest that linguistic theory alone does not guarantee practical improvements in tokenizer design, and that standard BPE remains a surprisingly effective baseline across typologically diverse languages.
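
Bits per byte makes perplexity comparable across tokenizers because it normalizes by the byte length of the text rather than by the token count. A minimal Python sketch of the conversion follows, assuming per-token negative log-likelihoods in nats and UTF-8 byte counts; the function name `bits_per_byte` and the example values are hypothetical and are not taken from the paper.

```python
import math


def bits_per_byte(token_nlls_nats, text):
    """Convert token-level negative log-likelihoods (nats) to bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must be non-empty")
    total_bits = sum(token_nlls_nats) / math.log(2)  # nats -> bits
    return total_bits / n_bytes


# Two hypothetical tokenizations of the same string: the token counts differ,
# but the score depends only on the total NLL and the byte length.
text = "szép nap"
coarse = [3.0, 2.5]           # 2 tokens
fine = [1.5, 1.5, 1.0, 1.5]   # 4 tokens, same total NLL of 5.5 nats
print(bits_per_byte(coarse, text), bits_per_byte(fine, text))
```

Because the denominator is fixed by the text itself, a tokenizer cannot improve its score merely by emitting fewer, longer tokens.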

pdf bib abs

Where Privacy Risk Lives in English-Source Multilingual RAG: A Stage-Decomposed Audit Across Five Query Languages
Yanhang Li | Zhichao Fan | Zexin Zhuang

A common assumption holds that switching to a non-English language makes a multilingual RAG system easier to attack for personal information. On an English-source synthetic-PII corpus with five query languages and a two-stage defence (LLM input judge + regex output filter), the output-stage point estimates do not support that assumption: English has the highest observed unstructured-PII leak rate, and only English-vs-Swahili separates cleanly under our document-level bootstrap intervals. Once the input judge is added, residual leaks remain on Arabic and Swahili in this Qwen-mediated pipeline, and back-translating the query does not close the gap. Translator, judge, and generator share one model family, so we treat this as pipeline-conditional rather than a causal language ranking. As an oracle diagnostic on a separate n=17 multilingual-prompted-judge residual corner, attaching the gold corpus document to the input judge blocks 15/17 residual cells — a follow-up direction, not a deployed mitigation, since all BLOCK/ALLOW rates are on adversarial queries only and we measure no benign-query FPR or utility. The anonymous supplement contains code, corpora, queries, and per-trial JSONLs.
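
The abstract does not specify how its document-level bootstrap intervals are computed. The sketch below shows one common variant, a percentile bootstrap that resamples whole documents with replacement so that trials from the same document stay together. The function name `document_bootstrap_ci`, the defaults, and the toy data are all hypothetical.

```python
import random


def document_bootstrap_ci(leaks_by_doc, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a leak rate, resampling documents.

    leaks_by_doc: list with one non-empty inner list per document, holding
    0/1 outcomes (1 = leak) for the trials run against that document.
    """
    rng = random.Random(seed)
    n_docs = len(leaks_by_doc)
    rates = []
    for _ in range(n_resamples):
        # Draw documents with replacement; each draw keeps all of its trials.
        sample = [leaks_by_doc[rng.randrange(n_docs)] for _ in range(n_docs)]
        trials = sum(len(doc) for doc in sample)
        leaks = sum(sum(doc) for doc in sample)
        rates.append(leaks / trials)
    rates.sort()
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Toy data (hypothetical): four documents with three trials each.
toy = [[1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
print(document_bootstrap_ci(toy))
```

Resampling at the document level rather than the trial level avoids understating uncertainty when trials on the same document are correlated.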

pdf bib abs

The Broken Telephone Changes Tone: Examining Nuanced Linguistic Cues in LLM Chains-of-Translation
Quang Minh Nguyen | Maida Aizaz | Braahmi Padmakumar

As LLM-generated content proliferates online, texts are increasingly subject to repeated processing and translation by models, making it critical to understand how such iterative reprocessing reshapes language. Prior work has shown that this degrades factual content and reduces diversity, but the fine-grained linguistic shifts underlying these effects remain unexplored. We track changes in epistemic markers, grammatical voice, degree adverbs, and nominalisation density across 12 iterations of round-trip translation applied to 600 BBC News articles, varying intermediate language, translation model, and chain topology across 17 experimental configurations. We find a consistent epistemic shift: evidential and factive markers increase while hedges decline, potentially causing tentative claims to read as more certain. Concurrently, texts undergo register-level formalisation—informal degree adverbs give way to formal alternatives, active-voice density drops, by-phrase passives attrite disproportionately, and nominalisation density rises. We also record clear model-specific patterns for certain settings. These shifts erode the markers of source, register, and agency, offering a fine-grained account of the factual degradation reported in previous studies.

pdf bib abs

Group-Merger: A LoRA-based Framework for Multilingual Continual Learning
Weijian Yi | Hongliang Li | Jinan Xu

Multilingual continual learning (MCL) is crucial for enabling language models to adapt across diverse linguistic environments while retaining knowledge over time. Existing parameter-isolation methods allocate language-specific modules but fail to leverage cross-lingual transfer, leading to inefficient parameter growth and poor generalization. Approaches based on model merging suffer from severe performance degradation as the number of language-specific tasks increases, due to interference between linguistic and task-specific knowledge. To address these challenges, we propose Group-Merger, a framework that employs group-wise merging to balance parameter efficiency and continual learning performance. Our framework mitigates catastrophic forgetting across languages while enabling knowledge transfer. Extensive experiments on multilingual evaluation benchmarks demonstrate superior performance compared to existing methods.

pdf bib abs

When English Isn’t the Best Teacher: Source Language Effects in Cross-Lingual In-Context Learning
Fred Philippy | Siwen Guo | Jacques Klein | Tegawendé F. Bissyandé

Cross-lingual transfer in multilingual NLP has been widely explored in supervised fine-tuning contexts, where factors like data availability and linguistic similarity largely determine transfer quality. As the field shifts toward few-shot In-Context Learning (ICL), it is often presumed that insights from fine-tuning carry over unchanged. Yet this assumption has not been rigorously evaluated, leaving open the question of how to choose source languages for cross-lingual ICL. We conduct a broad empirical study of cross-lingual transfer in ICL spanning seven tasks, six models, and a typologically diverse set of languages. We further analyze language confusion, a key obstacle for generative tasks in cross-lingual ICL. Our results show that conventional fine-tuning-based expectations do not consistently apply in the ICL regime and point to alternative heuristics for selecting source languages effectively.