Beyond English: Natural Language Processing for All Languages in an Era of Large Language Models (2025)



pdf (full)
bib (full)
Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models

pdf bib
Proceedings of the Workshop on Beyond English: Natural Language Processing for all Languages in an Era of Large Language Models
Sudhansu Bala Das | Pruthwik Mishra | Alok Singh | Shamsuddeen Hassan Muhammad | Asif Ekbal | Uday Kumar Das

pdf bib
Towards the Creation of a Collao Quechua–Spanish Parallel Corpus Using Optical Character Recognition
Gian Carlo Orcotoma Mormontoy | Lida Leon Nuñez | Hugo Espetia Huamanga

The Quechua language is a fundamental element of Peru’s social and cultural identity and carries deep linguistic and cultural significance. However, it faces substantial challenges in digital representation. One major limitation is the scarcity of resources such as parallel corpora, which hampers the development of technology for its analysis and practical application. This study addresses this gap through a methodology for building a parallel corpus using Optical Character Recognition (OCR). We digitized a collection of texts from a common origin to create a corpus that enables reliable access. The resulting corpus serves as a valuable asset for linguistic and Natural Language Processing (NLP) research, as well as for Quechua speakers. The source material derives from works produced by graduate students of the Academia Mayor de la Lengua Quechua and validated by its academic staff, ensuring grammatical, syntactic, and semantic integrity.

pdf bib
Prompt Balance Matters: Understanding How Imbalanced Few-Shot Learning Affects Multilingual Sense Disambiguation in LLMs
Deshan Koshala Sumanathilaka | Nicholas Micallef | Julian Hough

Recent advances in Large Language Models (LLMs) have significantly reshaped the landscape of Natural Language Processing (NLP). Among the various prompting techniques, few-shot prompting has gained considerable attention for its practicality and effectiveness. This study investigates how few-shot prompting strategies impact the Word Sense Disambiguation (WSD) task, particularly focusing on the biases introduced by imbalanced sample distributions. We use the GLOSSGPT prompting method, an advanced approach for English WSD, to test its effectiveness across five languages: English, German, Spanish, French, and Italian. Our results show that imbalanced few-shot examples can cause incorrect sense predictions in the non-English languages, whereas this issue does not appear in English. To assess model behavior, we evaluate both GPT-4o and LLaMA-3.1-70B. The results highlight the sensitivity of multilingual WSD to sample distribution in few-shot settings, emphasizing the need for balanced and representative prompting strategies.

pdf bib
Development of a Low-Cost Named Entity Recognition System for Odia Language using Deep Active Learning
Tusarkanta Dalai | Tapas Kumar Mishra | Pankaj Kumar Sa | Prithviraj Mohanty | Chittaranjan Swain | Ajit Kumar Nayak

pdf bib
Non-Contextual BERT or FastText? A Comparative Analysis
Abhay Shanbhag | Suramya Jadhav | Amogh Thakurdesai | Ridhima Bhaskar Sinare | Raviraj Joshi

Natural Language Processing (NLP) for low-resource languages, which lack large annotated datasets, faces significant challenges due to limited high-quality data and linguistic resources. The selection of embeddings plays a critical role in achieving strong performance in NLP tasks. While contextual BERT embeddings require a full forward pass, non-contextual BERT embeddings rely only on table lookup. Existing research has primarily focused on contextual BERT embeddings, leaving non-contextual embeddings largely unexplored. In this study, we analyze the effectiveness of non-contextual embeddings from BERT models (MuRIL and MahaBERT) and FastText models (IndicFT and MahaFT) for tasks such as news classification, sentiment analysis, and hate speech detection in one such low-resource language—Marathi. We compare these embeddings with their contextual and compressed variants. Our findings indicate that non-contextual BERT embeddings extracted from the model’s first embedding layer outperform FastText embeddings, presenting a promising alternative for low-resource NLP.
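The table-lookup distinction the abstract draws can be sketched in a few lines. This is an illustrative toy, not the paper's code: the vocabulary and dimensions below are invented, and a real model's table would come from its trained first embedding layer.

```python
import numpy as np

# Toy illustration of why non-contextual embeddings are cheap: the model's
# first embedding layer is just a matrix, so a token's vector is a single
# row lookup rather than a full transformer forward pass.
rng = np.random.default_rng(0)
vocab = {"[PAD]": 0, "[UNK]": 1, "chitrapat": 2, "marathi": 3}
embed_dim = 8
embedding_table = rng.standard_normal((len(vocab), embed_dim))

def non_contextual_embedding(token: str) -> np.ndarray:
    """Return the static vector for a token via table lookup."""
    return embedding_table[vocab.get(token, vocab["[UNK]"])]

vec = non_contextual_embedding("marathi")
# The same token always maps to the same vector, regardless of context.
```

A contextual embedding for the same token would instead require running the full model on the surrounding sentence, which is what makes the non-contextual variant attractive for low-resource deployments.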

pdf bib
Kantika: A Knowledge-Radiant Framework for Dermatology QA using IR-CoT and RAPTOR-Augmented Retrieval
Deep Das | Vikram Mehrolia | Rahul Dixit | Rohit Kumar

This paper presents an improved Retrieval-Augmented Generation (RAG) approach for domain-specific question-answering in dermatology and cosmetic science. The proposed system integrates RAPTOR-style hierarchical indexing with Iterative Retrieval Chain-of-Thought (IR-CoT) reasoning and CRAG-style interleaved retrieval-generation to better handle complex, clinically grounded queries. It leverages multi-source dermatology data, including peer-reviewed research, product formulations, user reviews, and ingredient safety databases. By decomposing queries into rationale-driven substeps and applying subgoal-specific retrieval, the system improves answer depth, accuracy, and relevance—particularly for ingredient interactions and personalized dermatological guidance. Empirical results show notable gains over standard RAG baselines in both precision and clinical coherence, establishing the effectiveness of this approach in specialized medical QA tasks. With 100% user satisfaction and 99.07% overall accuracy across all document categories, the system sets a strong benchmark for domain-specific medical QA in dermatology.

pdf bib
GeistBERT: Breathing Life into German NLP
Raphael Scheible-Schmitt | Johann Frei

Advances in transformer-based language models have highlighted the benefits of language-specific pre-training on high-quality corpora. In this context, German NLP stands to gain from updated architectures and modern datasets tailored to the linguistic characteristics of the German language. GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus and optimizing model performance across various NLP tasks. We pre-trained GeistBERT using fairseq, following the RoBERTa base configuration with Whole Word Masking (WWM), and initialized from GottBERT weights. The model was trained on a 1.3 TB German corpus with dynamic masking and a fixed sequence length of 512 tokens. For evaluation, we fine-tuned the model on standard downstream tasks, including NER (CoNLL 2003, GermEval 2014), text classification (GermEval 2018 coarse/fine, 10kGNAD), and NLI (German XNLI), using F1 score and accuracy as evaluation metrics. GeistBERT achieved strong results across all tasks, leading among base models and setting a new state-of-the-art (SOTA) in GermEval 2018 fine text classification. It also outperformed several larger models, particularly in classification benchmarks. To support research in German NLP, we release GeistBERT under the MIT license.
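The Whole Word Masking scheme mentioned above groups all subword pieces of a word so they are masked together. The grouping idea can be sketched as follows; this is a toy illustration over WordPiece-style tokens, not the fairseq implementation used for GeistBERT.

```python
import random

def whole_word_mask(tokens, mask_rate=0.15, seed=0):
    """Group WordPiece tokens ('##' marks continuations) and mask whole words."""
    rng = random.Random(seed)
    # Collect index groups, one group per surface word.
    words, cur = [], []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and cur:
            cur.append(i)          # continuation piece joins the current word
        else:
            if cur:
                words.append(cur)
            cur = [i]              # start a new word
    if cur:
        words.append(cur)
    # Mask entire words, never isolated pieces.
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_rate:
            for i in word:
                masked[i] = "[MASK]"
    return masked

# With mask_rate=1.0 every piece of every word is masked together.
masked = whole_word_mask(["Haupt", "##bahn", "##hof", "ist", "da"], mask_rate=1.0)
```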

pdf bib
Identifying Contextual Triggers in Hate Speech Texts Using Explainable Large Language Models
Dheeraj Kodati | Bhuvana Sree Lakkireddy

The pervasive spread of hate speech on online platforms poses a significant threat to social harmony, necessitating not only high-performing classifiers but also models capable of transparent, fine-grained interpretability. Existing methods often neglect the identification of influential contextual words that drive hate speech classification, limiting their reliability in high-stakes applications. To address this, we propose LLM-BiMACNet (Large Language Model-based Bidirectional Multi-Channel Attention Classification Network), an explainability-focused architecture that leverages pretrained language models and supervised attention to highlight key lexical indicators of hateful and offensive intent. Trained and evaluated on the HateXplain benchmark—comprising class labels, target community annotations, and human-labeled rationales—LLM-BiMACNet is optimized to simultaneously enhance both predictive performance and rationale alignment. Experimental results demonstrate that our model outperforms existing state-of-the-art approaches, achieving an accuracy of 87.3%, AUROC of 0.881, token-level F1 of 0.553, IOU-F1 of 0.261, AUPRC of 0.874, and comprehensiveness of 0.524, thereby offering highly interpretable and accurate hate speech detection.

pdf bib
PortBERT: Navigating the Depths of Portuguese Language Models
Raphael Scheible-Schmitt | Henry He | Armando B. Mendes

Transformer models dominate modern NLP, but efficient, language-specific models remain scarce. In Portuguese, most focus on scale or accuracy, often neglecting training and deployment efficiency. In the present work, we introduce PortBERT, a family of RoBERTa-based language models for Portuguese, designed to balance performance and efficiency. Trained from scratch on over 450 GB of deduplicated and filtered mC4 and OSCAR23 from CulturaX using fairseq, PortBERT leverages byte-level BPE tokenization and stable pre-training routines across both GPU and TPU processors. We release two variants, PortBERT base and PortBERT large, and evaluate them on ExtraGLUE, a suite of translated GLUE and SuperGLUE tasks. Both models perform competitively, matching or surpassing existing monolingual and multilingual models. Beyond accuracy, we report training and inference times as well as fine-tuning throughput, providing practical insights into model efficiency. PortBERT thus complements prior work by addressing the underexplored dimension of compute-performance tradeoffs in Portuguese NLP. We release all models on Huggingface and provide fairseq checkpoints to support further research and applications.

pdf bib
Quality Matters: Measuring the Effect of Human-Annotated Translation Quality on English-Slovak Machine Translation
Matúš Kleštinec | Daša Munková

This study investigates the influence of human-annotated translation quality on the performance of machine translation (MT) models for a low-resource language pair—English to Slovak. We collected and categorized 287 student translations from a national competition, annotated by expert translators into three quality levels. Using the mT5-large model, we trained six neural MT models: three on the full dataset without validation splitting, and three using training/validation splits. The models were evaluated using a suite of automatic metrics (BLEU, METEOR, chrF, COMET, BLEURT, and TER), with TER serving as the validity criterion. Statistical analyses revealed that data quality had no significant effect when training without validation, but did have a significant impact under fine-tuning conditions (p < 0.05). Our results suggest that fine-tuning in combination with validation splitting increases the model’s sensitivity to the quality of training data. While the overall effect size is modest, the findings underscore the importance of high-quality, annotated corpora and modern training strategies for improving MT in low-resource languages.
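TER, the metric used above as the validity criterion, is an edit-distance-based score. A minimal sketch of the idea follows; note that full TER also counts block shifts as single edits, which this simplification omits (toolkits such as sacreBLEU implement the complete metric).

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance over word lists."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[m][n]

def simplified_ter(hypothesis: str, reference: str) -> float:
    """Word-level edits divided by reference length (lower is better)."""
    hyp, ref = hypothesis.split(), reference.split()
    return word_edit_distance(hyp, ref) / max(len(ref), 1)

# One insertion against a 6-word reference -> 1/6.
score = simplified_ter("the cat sat on mat", "the cat sat on the mat")
```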

pdf bib
Spatio-Temporal Mechanism in Multilingual Sentiment Analysis
Adarsh Singh Jadon | Vivek Tiwari | Chittaranjan Swain | Deepak Kumar Dewangan

This study investigates the effectiveness of various deep learning models for sentiment analysis on code-mixed Hinglish text, a hybrid language widely used in digital communication. Hinglish presents unique challenges due to its informal nature, frequent code-switching, and complex linguistic structure. This research leverages the HinGE, SemEval-2020 Task 9, and E-Commerce Reviews datasets, and employs models such as RNN (LSTM), BERT-LSTM, CNN, and a proposed BiLSTM model with data augmentation.

pdf bib
Automatic Animacy Classification for Latvian Nouns
Ralfs Brutāns | Jelke Bloem

We introduce the first automatic animacy classifier for the Latvian language. Animacy, a linguistic feature indicating whether a noun refers to a living entity, plays an important role in Latvian grammatical structures and syntactic agreement, but remains unexplored in Latvian NLP. We adapt and extend existing methods to develop type-based animacy classifiers that distinguish between human and non-human nouns. Due to the limited utility of Latvian WordNet, the classifier’s training data was derived from the WordNets of Lithuanian, English, and Japanese. These lists were intersected and mapped to Latvian nouns from the Tēzaurs dictionary through automatic translation. The resulting dataset was used to train classifiers with fastText and LVBERT embeddings. Results show good performance from an MLP classifier using the last four layers of LVBERT, with Lithuanian data contributing more than English. This demonstrates a viable method for animacy classification in languages lacking robust lexical resources and shows potential for broader application in morphologically rich, under-resourced languages.

pdf bib
Bootstrapping a Sentence-Level Corpus Quality Classifier for Web Text using Active Learning
Maximilian Bley | Thomas Eckart | Christopher Schröder

The quality of training data is an essential factor for training large language models (LLMs) as it directly impacts their performance. While high-quality data is crucial for training competitive LLMs, existing preprocessing pipelines still partly rely on rules, which are computationally cheap but also inherently limited to simpler patterns. Model-based filtering, on the other hand, is more flexible and can detect finer-grained patterns and semantics, but often requires substantial amounts of labeled data. While there are existing models for common problems (such as toxicity classification), this is often only the case for resource-rich languages and well-studied problems—leaving gaps in coverage for other languages, problems, or combinations thereof. In this work, we investigate the feasibility of model-based preprocessing despite the absence of labeled data. We use active learning to bootstrap a sentence-level multi-label classifier that detects textual problems missed by traditional text cleaning approaches. With only 498 examples, the final classifier reaches macro- and micro-F1 scores of 0.80 and 0.84, making it suitable for practical use. Moreover, we find that it captures subtle errors that a rule-based baseline misses. We publish the training code, a labeled corpus quality classification dataset, and the resulting classifier.
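Pool-based active learning of the kind described above can be sketched generically. This is an illustrative loop with least-confidence sampling (one common query strategy, not necessarily the one the paper uses); the pool, oracle, and probability function are stand-ins.

```python
def least_confident(pool, predict_proba, k):
    """Select the k unlabeled examples whose top class probability is lowest."""
    return sorted(pool, key=lambda x: max(predict_proba(x)))[:k]

def active_learning_loop(pool, oracle, retrain, predict_proba, rounds, k):
    labeled = []
    for _ in range(rounds):
        batch = least_confident(pool, predict_proba, k)
        for x in batch:
            pool.remove(x)
            labeled.append((x, oracle(x)))  # human annotation step
        retrain(labeled)                    # refit classifier on all labels so far
    return labeled

# Dummy setup so the loop runs end to end: a pool of 100 items, an oracle
# labeling even/odd, and a fake confidence function.
pool = list(range(100))
labeled = active_learning_loop(
    pool,
    oracle=lambda x: x % 2,
    retrain=lambda labeled: None,
    predict_proba=lambda x: [0.5 + (x % 10) / 20, 0.5 - (x % 10) / 20],
    rounds=3,
    k=5,
)
```

The point of the strategy is that annotation effort goes to the examples the current model is least sure about, which is how a usable classifier can emerge from only a few hundred labels.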

pdf bib
Fine-Grained Arabic Offensive Language Classification with Taxonomy, Sentiment, and Emotions
Natalia Vanetik | Marina Litvak | Chaya Liebeskind

Offensive language detection in Arabic is a challenging task because of the unique linguistic and cultural characteristics of the Arabic language. This study introduces a high-quality annotated dataset for classifying offensive language in Arabic, based on a structured taxonomy, categorizing offensive content across seven levels, capturing both explicit and implicit expressions. Utilizing this taxonomy, we re-annotate the FARAD-500 dataset, creating reFarad-500, which provides fine-grained labels for offensive texts in Arabic. A thorough dataset analysis reveals key patterns in offensive language distribution, emphasizing the importance of target type, offense severity, and linguistic structures. Additionally, we assess text classification techniques to evaluate the dataset’s effectiveness, exploring the impact of sentiment analysis and emotion detection on classification performance. Our findings highlight the complexity of Arabic offensive language and underscore the necessity of extensive annotation frameworks for accurate detection. This paper advances Arabic natural language processing (NLP) in resource-constrained settings by enhancing the recognition of hate speech and fostering a deeper understanding of the linguistic and emotional dimensions of offensive language.

pdf bib
Measuring Prosodic Richness in LLM-Generated Responses for Conversational Recommendation
Darshna Parmar | Pramit Mazumdar

This paper presents a novel framework for stylistic evaluation in conversational recommendation systems (CRS), focusing on the prosodic and expressive qualities of generated responses. While prior work has predominantly emphasized semantic relevance and recommendation accuracy, the stylistic fidelity of model outputs remains underexplored. We introduce the prosodic richness score (PRS), a composite metric that quantifies expressive variation through structural pauses, emphatic lexical usage, and rhythmic variability. Using PRS, we conduct both sentence-level and turn-level analyses across six contemporary large language models (LLMs) on two benchmark CRS datasets: ReDial, representing goal-oriented dialogue, and INSPIRED, which incorporates stylized social interaction. Empirical results reveal statistically significant differences (p < 0.01) in PRS between human and model-generated responses, highlighting the limitations of current LLMs in reproducing natural prosodic variation. Our findings advocate for broader evaluation of stylistic attributes in dialogue generation, offering a scalable approach to enhance expressive language modeling in CRS.

pdf bib
Assessing the Accuracy of AI-Generated Idiom Translations
Marijana Gasparovic | Marija Brala Vukanovic | Marija Brkic Bakaric

Idioms pose unique challenges for machine translation due to their metaphorical nature and cultural nuances. Consequently, they often present a translation problem even for humans. This longitudinal study evaluates the performance of ChatGPT in translating idiomatic expressions between English and Croatian, comparing results across two time points. The test set comprises 72 idioms in each translation direction, divided into three categories based on equivalence: complete, partial, and zero, with each category representing one-third of the set. The evaluation considers three layers: translation of the isolated idiom, translation of an online excerpt containing the idiom, and translation of a self-constructed example sentence. As expected, accuracy generally declined with decreasing equivalence. However, a follow-up study conducted six months later highlighted the need for continuous monitoring of machine translation tools.

pdf bib
From Pixels to Prompts: Evaluating ChatGPT-4o in Face Recognition, Age Estimation, and Gender Classification
Jashn Jain | Praveen Kumar Chandaliya | Dhruti P. Sharma

This study investigates the biometric capabilities of ChatGPT-4o, evaluating its performance on age estimation, gender classification, and identity verification across two challenging datasets: the ITWCC (images of children aged 6–17) and a pediatric surgery dataset. By leveraging tailored prompts that bypass safety filters, ChatGPT-4o outperformed conventional CNN-based models such as DeepFace, achieving higher accuracy and offering interpretable, rationale-rich outputs. Specifically, it delivered a mean absolute error of 1.8 years in age estimation, 96–100% gender classification accuracy, and over 85% identity continuity recognition, even across surgical transformations. The findings demonstrate the potential of multimodal LLMs to complement or exceed traditional approaches in face analysis tasks, though the study notes the importance of expanding demographic diversity, refining prompt strategies, and ensuring fairness and robustness in real-world settings.

pdf bib
DRISHTI: Drug Recognition and Integrated System for Helping the visually Impaired with Tag-based Identification
Sajeeb Das | Srijit Paul | Ucchas Muhury | Akib Jayed Islam | Dhruba Jyoti Barua | Sultanus Salehin | Prasun Datta

DRISHTI is a novel RFID-vision integrated assistive medication-verification system that combines RFID contactless scanning, quantized AI-based vision processing, and adaptive audio feedback to provide comprehensive medication-safety assurance. The architecture integrates an MFRC522 RFID reader for rapid drug-container identification, a Raspberry Pi–mounted camera running a quantized Gemma3-4B vision model for prescription-document analysis, and a hierarchical validation engine employing confidence-weighted scoring across five critical safety dimensions. Operating entirely offline, the system processes compressed medication data through multi-criteria classification while preserving user privacy and eliminating cloud dependencies. In evaluations across 149 test scenarios, DRISHTI achieved 86.57% overall accuracy and 100% detection of safety-critical cases, including expired medications, dosage mismatches, and drug interactions. The system delivers sub-millisecond response times with real-time, urgency-differentiated audio feedback, offering a practical solution for enhancing independence and reducing healthcare risks for visually impaired individuals.

pdf bib
What Language(s) Does Aya-23 Think In? How Multilinguality Affects Internal Language Representations
Katharina A. T. T. Trinley | Toshiki Nakai | Tatiana Anikina | Tanja Baeumel

Large language models (LLMs) excel at multilingual tasks, yet their internal language processing remains poorly understood. We analyze how Aya-23-8B, a decoder-only LLM trained on balanced multilingual data, handles code-mixed, cloze, and translation tasks compared to predominantly monolingual models like Llama 3 and Chinese-LLaMA-2. Using logit lens and neuron specialization analyses, we find: (1) Aya-23 activates typologically related language representations during translation, unlike English-centric models that rely on a single pivot language; (2) code-mixed neuron activation patterns vary with mixing rates and are shaped more by the base language than the mixed-in one; and (3) Aya-23’s language-specific neurons for code-mixed inputs concentrate in final layers, diverging from prior findings on decoder-only models. Neuron overlap analysis further shows that script similarity and typological relations impact processing across model types. These findings reveal how multilingual training shapes LLM internals and inform future cross-lingual transfer research. The code and dataset are publicly available.

pdf bib
FedCliMask: Context-Aware Federated Learning with Ontology-Guided Semantic Masking for Clinical NLP
Srijit Paul | Sajeeb Das | Ucchas Muhury | Akib Jayed Islam | Dhruba Jyoti Barua | Sultanus Salehin | Prasun Datta

Clinical federated learning faces critical challenges from statistical heterogeneity across healthcare institutions and privacy requirements for sensitive medical data. This work implements the foundational components of FedCliMask and proposes a comprehensive framework for privacy-preserving federated learning in clinical settings that combines ontology-guided semantic masking with context-aware federated aggregation. Our framework addresses the dual challenges of privacy preservation and statistical heterogeneity through two key innovations: (1) ontology-guided semantic masking using UMLS hierarchies to provide graduated privacy protection while preserving clinical semantics, and (2) context-aware federated aggregation that considers hospital-specific features including medical specialties, data complexity, privacy levels, and data volume. The semantic masking component is implemented and evaluated on synthetic clinical data, demonstrating effective privacy-utility tradeoffs across four masking levels. The context-aware analysis component is also implemented, successfully profiling 12,996 synthetic clinical notes across 6 diverse hospitals to demonstrate meaningful hospital differentiation. The complete framework is designed to enable privacy-preserving clinical trial recruitment through federated learning while adapting to institutional heterogeneity.

pdf bib
A Study on Language-Independent Stemmers in Indian Language IR
Siba Sankar Sahu | Sukomal Pal

We explore and evaluate the effect of different language-independent stemmers on information retrieval (IR) tasks in Indian languages such as Hindi and Gujarati, as well as in English. The issue was examined from two points of view. Does a language-independent stemmer improve retrieval effectiveness in Indian-language IR? Which language-independent stemmer is the most suitable for different Indian languages? We observe that stemming enhances retrieval effectiveness in the different Indian languages compared with no-stemming approaches. Among the stemmers experimented with, the co-occurrence-based stemmer (SNS) performs best in Hindi and Gujarati, improving the mean average precision (MAP) score by 2.98% and 20.78% respectively, whereas the graph-based stemmer (GRAS) performs best in English, improving the MAP score by 5.83%.
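MAP, the metric these improvements are reported in, averages per-query average precision over all queries. A minimal sketch of the standard definition (not the paper's evaluation code):

```python
def average_precision(ranked_relevance):
    """AP for one query; input is a list of 0/1 relevance flags in rank order."""
    hits, precisions = 0, []
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant hit
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(runs):
    """MAP: mean of per-query average precision."""
    return sum(average_precision(r) for r in runs) / len(runs)

# Two queries: relevant docs at ranks 1 and 3 for the first, rank 2 for the second.
map_score = mean_average_precision([[1, 0, 1], [0, 1]])
```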

pdf bib
Checklist Engineering Empowers Multilingual LLM Judges
Mohammad Ghiasvand Mohammadkhani | Hamid Beigy

Automated text evaluation has long been a central issue in Natural Language Processing (NLP). Recently, the field has shifted toward using Large Language Models (LLMs) as evaluators—a trend known as the LLM-as-a-Judge paradigm. While promising and easily adaptable across tasks, this approach has seen limited exploration in multilingual contexts. Existing multilingual studies often rely on proprietary models or require extensive training data for fine-tuning, raising concerns about cost, time, and efficiency. In this paper, we propose Checklist Engineering based LLM-as-a-Judge (CE-Judge), a training-free framework that uses checklist intuition for multilingual evaluation with an open-source model. Experiments across multiple languages and three benchmark datasets, under both pointwise and pairwise settings, show that our method generally surpasses the baselines and performs on par with the GPT-4o model.
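The checklist intuition can be sketched as pointwise scoring: each checklist item becomes a yes/no question put to a judge model, and the score is the fraction of items satisfied. This is a hypothetical illustration, not CE-Judge itself; the checklist items are invented and `ask_judge` stands in for a real LLM call.

```python
def checklist_score(response, checklist, ask_judge):
    """Fraction of checklist items the judge answers 'yes' to for this response."""
    answers = [ask_judge(item, response) for item in checklist]
    return sum(1 for a in answers if a == "yes") / len(checklist)

# Invented example items for illustration.
checklist = [
    "Is the answer written in the same language as the question?",
    "Does the answer address every part of the question?",
    "Is the answer free of factual contradictions?",
]

# Stub judge that always answers "yes", just to make the sketch runnable;
# in practice this would be a prompt to an open-source LLM.
score = checklist_score("...", checklist, ask_judge=lambda item, resp: "yes")
```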

pdf bib
C A N C E R: Corpus for Accurate Non-English Cancer-related Educational Resources
Anika Harju | Asma Shakeel | Tiantian He | Tianqi Xu | Aaro Harju

Improving the quality of cancer terminology through Machine Translation (MT) in non-English languages remains an under-researched area despite its critical role in supporting self-management and advancing multilingual patient education. Existing computational tools encounter significant limitations in accurately translating cancer terminologies, particularly for low-resource languages, primarily due to data scarcity and morphological complexity. To address the gap, we introduce a dedicated terminology resource — Corpus for Accurate Non-English Cancer-related Educational Resources (C A N C E R), a manually annotated dataset in Finnish (FI), Chinese (ZH), and Urdu (UR), curated from publicly available existing English (EN) data. We also examine the impact of data quality versus quantity and compare the performance of the Opus-mt-en-fi, Opus-mt-en-zh, and Opus-mt-en-ur models with the SMaLL-100 multilingual MT model. We assess translation quality using automatic and human evaluation. Results demonstrate that high-quality parallel data, though sparse, combined with fine-tuning, substantially improves the translation of cancer terminology across both high- and low-resource language pairs, positioning the C A N C E R corpus as a foundational resource for improving multilingual patient education.