Swakkhar Shatabda
2026
Can LLMs Solve My Grandma’s Riddle? Evaluating Multilingual Large Language Models on Reasoning Traditional Bangla Tricky Riddles
Nurul Labib Sayeedi | Md. Faiyaz Abdullah Sayeedi | Khushnur Binte Jahangir | Swakkhar Shatabda | Sarah Masud Preum
Proceedings of the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)
Large Language Models (LLMs) show impressive performance on many NLP benchmarks, yet their ability to reason in figurative, culturally grounded, and low-resource settings remains underexplored. We address this gap for Bangla by introducing BanglaRiddleEval, a benchmark of 1,244 traditional Bangla riddles instantiated across four tasks (4,976 riddle-task artifacts in total). Using an LLM-based pipeline, we generate Chain-of-Thought explanations, semantically coherent distractors, and fine-grained ambiguity annotations, and evaluate a diverse suite of open-source and closed-source models under different prompting strategies. Models achieve moderate semantic overlap on generative QA but low correctness, MCQ accuracy peaks at only about 56% versus an 83.3% human baseline, and ambiguity resolution ranges from roughly 26% to 68%, with high-quality explanations confined to the strongest models. These results show that current LLMs capture some cues needed for Bangla riddle reasoning but remain far from human-level performance, establishing BanglaRiddleEval as a challenging new benchmark for low-resource figurative reasoning. All data, code, and evaluation scripts are available on GitHub: https://anonymous.4open.science/r/BanglaRiddleEval.
CLRG at SemEval-2026 Task 3: One Size Does Not Fit All: A Resource Adaptive Framework for Dimensional Sentiment Regression
Wardat Iqbal | Ruwad Naswan | Swakkhar Shatabda
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
Predicting continuous Valence and Arousal scores across diverse languages poses significant challenges due to typological differences and the difficulty of modeling affective intensity. We introduce AdaptStance, a parameter-efficient framework designed for the SemEval-2026 Task 3 benchmark. To address cross-lingual disparities, AdaptStance routes inputs through resource-specific pipelines: direct regression with a hybrid concordance loss for high-resource languages, and an auxiliary multi-task mechanism to stabilize regression in low-resource and non-Western contexts. Architectural analysis reveals that decoupling task heads benefits morphologically related languages, whereas joint representations act as crucial regularizers for distant language families. Ultimately, this lightweight approach achieves competitive performance over generative baselines, demonstrating the efficacy of targeted architectural alignment while identifying Valence as the primary bottleneck in continuous affect prediction. Our code is available on GitHub.
MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning
Mahbub E Sobhani | Md. Faiyaz Abdullah Sayeedi | Tasnim Mohiuddin | Md Mofijul Islam | Swakkhar Shatabda
Findings of the Association for Computational Linguistics: EACL 2026
Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses 2,890 parallel Bangla-English gold standard artifacts, totaling ≈30K aligned question–answer pairs across thirteen languages, representing an extensive coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple types of problem settings, and solution synthesizing capabilities. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models under zero-shot, chain-of-thought (CoT), perturbated reasoning, and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs’ ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All code and data are available on GitHub: https://github.com/mahbubhimel/MathMist
Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains
Md. Faiyaz Abdullah Sayeedi | Subhey Sadi Rahman | Md. Mahbub Alam | Md. Adnanul Islam | Jannatul Ferdous Deepti | Tasnim Mohiuddin | Md Mofijul Islam | Swakkhar Shatabda
Findings of the Association for Computational Linguistics: ACL 2026
The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify different biases present in their training data, posing serious concerns for fairness, especially in low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 bidirectional language pairs across multiple domains using different metrics. We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs. The code and dataset are accessible on GitHub: https://github.com/faiyazabdullah/TranslationTangles
BnMMLU: Measuring Massive Multitask Language Understanding in Bengali
Saman Sarker Joy | Swakkhar Shatabda
Findings of the Association for Computational Linguistics: ACL 2026
Large-scale multitask benchmarks have driven rapid progress in language modeling, yet most emphasize high-resource languages such as English, leaving Bengali underrepresented. We present BnMMLU, a comprehensive benchmark for measuring massive multitask language understanding in Bengali. BnMMLU spans 41 domains across STEM, humanities, social sciences, and general knowledge, and contains 134,375 multiple-choice question–option pairs, the most extensive Bengali evaluation suite to date. The dataset preserves mathematical content via MathML, and includes BnMMLU-HARD, a compact subset constructed from questions most frequently missed by top systems to stress difficult cases. We benchmark 24 model variants across 11 LLM families, spanning open-weights general/multilingual, Bengali-centric open-weights, and proprietary models, covering multiple parameter scales and instruction-tuned settings. We evaluate models under standardized protocols covering two prompting styles (Direct vs. Chain-of-Thought) and two context regimes (0-shot vs. 5-shot), reporting accuracy consistently across families. Our analysis highlights persistent gaps in reasoning and application skills and indicates sublinear returns to scale across model sizes. We release the dataset and evaluation templates to support rigorous, reproducible assessment of Bengali language understanding and to catalyze progress in multilingual NLP.
Do Multi-Agents Solve Better Than Single? Evaluating Agentic Frameworks for Diagram-Grounded Geometry Problem Solving and Reasoning
Mahbub E Sobhani | Md. Faiyaz Abdullah Sayeedi | Mohammad Nehad Alam | Proma Hossain Progga | Swakkhar Shatabda
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Diagram-grounded geometry problem solving is a critical benchmark for multimodal large language models (MLLMs), yet the benefits of multi-agent design over single-agent remain unclear. We systematically compare single-agent and multi-agent pipelines on four visual math benchmarks: Geometry3K, MathVerse, OlympiadBench, and We-Math. For open-source models, multi-agent consistently improves performance. For example, Qwen-2.5-VL (7B) gains +6.8 points and Qwen-2.5-VL (32B) gains +3.3 on Geometry3K, and both Qwen-2.5-VL variants see further gains on OlympiadBench and We-Math. In contrast, the closed-source Gemini-2.0-Flash generally performs better in single-agent mode on classic benchmarks, while multi-agent yields only modest improvements on the newer We-Math dataset. These findings show that multi-agent pipelines provide clear benefits for open-source models and can assist strong proprietary systems on newer, less familiar benchmarks, but agentic decomposition is not universally optimal. All code, data, and reasoning files are available at https://github.com/faiyazabdullah/Interpreter-Solver
2023
Advancing Bangla Punctuation Restoration by a Monolingual Transformer-Based Method and a Large-Scale Corpus
Mehedi Hasan Bijoy | Mir Fatema Afroz Faria | Mahbub E Sobhani | Tanzid Ferdoush | Swakkhar Shatabda
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)
Punctuation restoration is the endeavor of reinstating and rectifying missing or improper punctuation marks within a text, thereby eradicating ambiguity in written discourse. The Bangla punctuation restoration task has received little attention and exploration, despite the rising popularity of textual communication in the language. The primary hindrances in the advancement of the task revolve around the utilization of transformer-based methods and an openly accessible extensive corpus, challenges that we discovered remained unresolved in earlier efforts. In this study, we propose a baseline by introducing a monolingual transformer-based method named Jatikarok, where the effectiveness of transfer learning has been meticulously scrutinized, and a large-scale corpus containing 1.48M source-target pairs to resolve the previous issues. Jatikarok attains accuracy rates of 95.2%, 85.13%, and 91.36% on the BanglaPRCorpus, Prothom-Alo Balanced, and BanglaOPUS corpora, thereby establishing itself as the state-of-the-art method through its superior performance compared to BanglaT5 and T5-Small. Jatikarok and BanglaPRCorpus are publicly available at: https://github.com/mehedihasanbijoy/Jatikarok-and-BanglaPRCorpus