Minhajur Rahman Chowdhury Mahim
2025
BLUCK: A Benchmark Dataset for Bengali Linguistic Understanding and Cultural Knowledge
Daeen Kabir | Minhajur Rahman Chowdhury Mahim | Sheikh Shafayat | Adnan Sadik | Arian Ahmed | Eunsu Kim | Alice Oh
Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)
In this work, we introduce BLUCK, a new dataset designed to measure the performance of Large Language Models (LLMs) in Bengali linguistic understanding and cultural knowledge. Our dataset comprises 2366 multiple-choice questions (MCQs) carefully curated from compiled collections of several college- and job-level examinations and spans 23 categories covering Bangladesh’s culture and history as well as Bengali linguistics. We benchmarked BLUCK using 6 proprietary and 3 open-source LLMs, including GPT-4o, Claude-3.5-Sonnet, Gemini-1.5-Pro, Llama-3.3-70B-Instruct, and DeepSeekV3. Our results show that while these models perform reasonably well overall, they struggle in some areas of Bengali phonetics. Although current LLMs’ performance on Bengali cultural and linguistic contexts is still not comparable to that of mainstream languages like English, our results indicate Bengali’s status as a mid-resource language. Importantly, BLUCK is also the first MCQ-based evaluation benchmark centered around native Bengali culture, history, and linguistics.
2024
BEnQA: A Question Answering Benchmark for Bengali and English
Sheikh Shafayat | H M Quamran Hasan | Minhajur Rahman Chowdhury Mahim | Rifki Afina Putri | James Thorne | Alice Oh
Findings of the Association for Computational Linguistics: ACL 2024
In this study, we introduce BEnQA, a dataset comprising parallel Bengali and English exam questions for middle and high school levels in Bangladesh. Our dataset consists of approximately 5K questions covering several subjects in science with different types of questions, including factual, application, and reasoning-based questions. We benchmark several Large Language Models (LLMs) with our parallel dataset and observe a notable performance disparity between the models in Bengali and English. We also investigate some prompting methods and find that Chain-of-Thought prompting is beneficial mostly on reasoning questions, but not so much on factual ones. We also find that appending an English translation helps to answer questions in Bengali. Our findings point to promising future research directions for improving the performance of LLMs in Bengali and, more generally, in low-resource languages.