Nazia Tasnim


2025

pdf bib
From Facts to Folklore: Evaluating Large Language Models on Bengali Cultural Knowledge
Nafis Chowdhury | Moinul Haque | Anika Ahmed | Nazia Tasnim | Md. Istiak Hossain Shihab | Sajjadur Rahman | Farig Sadeque
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Recent progress in NLP research has demonstrated remarkable capabilities of large language models (LLMs) across a wide range of tasks. While recent multilingual benchmarks have advanced cultural evaluation for LLMs, critical gaps remain in capturing the nuances of low-resource cultures. Our work addresses these limitations through a Bengali Language Cultural Knowledge (BLanCK) dataset including folk traditions, culinary arts, and regional dialects. Our investigation of several multilingual language models shows that while these models perform well in non-cultural categories, they struggle significantly with cultural knowledge and performance improves substantially across all models when context is provided, emphasizing context-aware architectures and culturally curated training data.

pdf bib
Are ASR foundation models generalized enough to capture features of regional dialects for low-resource languages?
Tawsif Tashwar Dipto | Azmol Hossain | Rubayet Sabbir Faruque | Md. Rezuwan Hassan | Kanij Fatema | Tanmoy Shome | Ruwad Naswan | Md.Foriduzzaman Zihad | Mohaymen Ul Anam | Nazia Tasnim | Hasan Mahmud | Md Kamrul Hasan | Md. Mehedi Hasan Shawon | Farig Sadeque | Tahsin Reasat
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Conventional research on speech recognition modeling relies on the canonical form for most low-resource languages while automatic speech recognition (ASR) for regional dialects is treated as a fine-tuning task. To investigate the effects of dialectal variations on ASR we develop a 78-hour annotated Bengali Speech-to-Text (STT) corpus named Ben-10. Investigation from linguistic and data-driven perspectives shows that speech foundation models struggle heavily in regional dialect ASR, both in zero-shot and fine-tuned settings. We observe that all deep learning methods struggle to model speech data under dialectal variations, but dialect specific model training alleviates the issue. Our dataset also serves as a out-of-distribution (OOD) resource for ASR modeling under constrained resources in ASR algorithms. The dataset and code developed for this project are publicly available.

2022

pdf bib
TEAM-Atreides at SemEval-2022 Task 11: On leveraging data augmentation and ensemble to recognize complex Named Entities in Bangla
Nazia Tasnim | Md. Istiak Shihab | Asif Shahriyar Sushmit | Steven Bethard | Farig Sadeque
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

Many areas, such as the biological and healthcare domain, artistic works, and organization names, have nested, overlapping, discontinuous entity mentions that may even be syntactically or semantically ambiguous in practice. Traditional sequence tagging algorithms are unable to recognize these complex mentions because they may violate the assumptions upon which sequence tagging schemes are founded. In this paper, we describe our contribution to SemEval 2022 Task 11 on identifying such complex Named Entities. We have leveraged the ensemble of multiple ELECTRA-based models that were exclusively pretrained on the Bangla language with the performance of ELECTRA-based models pretrained on English to achieve competitive performance on the Track-11. Besides providing a system description, we will also present the outcomes of our experiments on architectural decisions, dataset augmentations, and post-competition findings.