Akmmahbubur Rahman


2025

Generative Data Augmentation for Improving Semantic Classification
Shadman Rohan | Mahmud Elahi Akhter | Ibraheem Muhammad Moosa | Nabeel Mohammed | Amin Ahsan Ali | Akmmahbubur Rahman
Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)

We study sentence-level generative data augmentation for Bangla semantic classification across four public datasets and three pretrained model families (BanglaBERT, XLM-Indic, mBERT). We evaluate two widely used, reproducible techniques—paraphrasing (mT5-based) and round-trip backtranslation (Bn–En–Bn)—and analyze their impact under realistic class imbalance. Overall, augmentation often helps, but gains are tightly coupled to label quality: paraphrasing typically outperforms backtranslation and yields the most consistent improvements for the monolingual model, whereas multilingual encoders benefit less and can be more sensitive to noisy minority-class expansions. A key empirical observation is that the neutral class appears to be a major source of annotation noise, which degrades decision boundaries and can cap the benefits of augmentation even when positive/negative classes are clean and polarized. We provide practical guidance for Bangla sentiment pipelines: (i) use simple sentence-level augmentation to rebalance classes when labels are reliable; (ii) allocate additional curation and higher inter-annotator agreement targets to the neutral class. Our results indicate when augmentation helps and suggest that data quality—not model choice alone—can become the limiting factor.
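
As a rough illustration of the round-trip backtranslation technique the abstract describes, the sketch below uses Hugging Face translation pipelines; the checkpoint names and the `minority_samples` structure are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of Bn-En-Bn round-trip backtranslation for minority-class augmentation.
# Checkpoint names are illustrative assumptions, not the paper's exact models.
from transformers import pipeline

bn_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-bn-en")
en_to_bn = pipeline("translation", model="Helsinki-NLP/opus-mt-en-bn")  # assumed checkpoint

def backtranslate(sentence: str) -> str:
    """Bangla -> English -> Bangla; the round trip yields a paraphrase-like variant."""
    english = bn_to_en(sentence, max_length=256)[0]["translation_text"]
    return en_to_bn(english, max_length=256)[0]["translation_text"]

# Rebalance by augmenting only minority-class examples (hypothetical data).
minority_samples = [("...", "neutral")]  # (sentence, label) pairs
augmented = [(backtranslate(s), lab) for s, lab in minority_samples]
```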

HateNet-BN at BLP-2025 Task 1: A Hierarchical Attention Approach for Bangla Hate Speech Detection
Mohaymen Ul Anam | Akm Moshiur Rahman Mazumder | Ashraful Islam | Akmmahbubur Rahman | M Ashraful Amin
Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)

The rise of social media in Bangladesh has increased abusive and hateful content, which is difficult to detect due to the informal nature of Bangla and limited resources. The BLP-2025 shared task addressed this challenge with Subtask 1A (multi-label abuse categories) and Subtask 1B (target identification). We propose a parameter-efficient model using a frozen BanglaBERT backbone with hierarchical attention to capture token-level importance across hidden layers. Context vectors are aggregated for classification, combining syntactic and semantic features. On Subtask 1A, our frozen model achieved a micro-F1 of 0.7178, surpassing the baseline of 0.7100, while the unfrozen variant scored 0.7149. Our submissions ranked 15th (Subtask 1A) and 12th (Subtask 1B), showing that layer-wise attention with a frozen backbone can effectively detect abusive Bangla text.
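
A schematic PyTorch sketch of the frozen-backbone, layer-wise attention idea follows; the checkpoint name, label count, and attention parameterization are assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class HierarchicalAttentionClassifier(nn.Module):
    """Sketch: frozen encoder + attention over tokens within each hidden layer,
    then attention over the per-layer context vectors (sizes are assumptions)."""
    def __init__(self, name="csebuetnlp/banglabert", num_labels=5):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name, output_hidden_states=True)
        for p in self.encoder.parameters():
            p.requires_grad = False           # frozen backbone
        h = self.encoder.config.hidden_size
        self.token_attn = nn.Linear(h, 1)     # token-level importance
        self.layer_attn = nn.Linear(h, 1)     # layer-level importance
        self.classifier = nn.Linear(h, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        layers = torch.stack(out.hidden_states[1:], dim=1)      # (B, L, T, H)
        mask = attention_mask[:, None, :, None]
        scores = self.token_attn(layers).masked_fill(mask == 0, -1e9)
        ctx = (layers * scores.softmax(dim=2)).sum(dim=2)       # per-layer context (B, L, H)
        w = self.layer_attn(ctx).softmax(dim=1)                 # weight the layers
        pooled = (ctx * w).sum(dim=1)                           # (B, H)
        return self.classifier(pooled)                          # multi-label logits
```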

BANMIME: Misogyny Detection with Metaphor Explanation on Bangla Memes
Md Ayon Mia | Akm Moshiur Rahman Mazumder | Khadiza Sultana Sayma | Md Fahim | Md Tahmid Hasan Fuad | Muhammad Ibrahim Khan | Akmmahbubur Rahman
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Detecting misogyny in multimodal content remains a notable challenge, particularly in culturally conservative and low-resource contexts like Bangladesh. While existing research has explored hate speech and general meme classification, the nuanced identification of misogyny in Bangla memes, rich in metaphor, humor, and visual-textual interplay, remains severely underexplored. To address this gap, we introduce BanMiMe, the first comprehensive Bangla misogynistic meme dataset comprising 2,000 culturally grounded samples where each meme includes misogyny labels, humor categories, metaphor localization, and detailed human-written explanations. We benchmark the performance of a range of open- and closed-source vision-language models (VLMs) under zero-shot and prompt-based settings and evaluate their capacity for both classification and explanation generation. Furthermore, we systematically explore multiple fine-tuning strategies, including standard, data-augmented, and Chain-of-Thought (CoT) supervision. Our results demonstrate that CoT-based fine-tuning consistently enhances model performance, both in terms of accuracy and in generating meaningful explanations. We envision BanMiMe as a foundational resource for advancing explainable multimodal moderation systems in low-resource and culturally sensitive settings.
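
For a sense of what the zero-shot classification-plus-explanation setup might look like, here is a hedged prompt-construction sketch; `query_vlm` is a hypothetical stand-in for whatever VLM API is used, and the prompt wording is an assumption.

```python
# Illustrative zero-shot prompt for misogyny classification with explanation.
def build_prompt(caption: str) -> str:
    return (
        "You are shown a Bangla meme (image attached) and its caption.\n"
        f"Caption: {caption}\n"
        "1. Is the meme misogynistic? Answer Yes or No.\n"
        "2. If a metaphor is used, quote the metaphorical phrase.\n"
        "3. Explain your decision in one or two sentences."
    )

def query_vlm(image: str, prompt: str) -> str:
    """Hypothetical stand-in for a real vision-language model API call."""
    raise NotImplementedError("wire this to the VLM of your choice")

response = query_vlm(image="meme.png", prompt=build_prompt("..."))
```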

DM-Codec: Distilling Multimodal Representations for Speech Tokenization
Md Mubtasim Ahasan | Md Fahim | Tasnim Mohiuddin | Akmmahbubur Rahman | Aman Chadha | Tariq Iqbal | M Ashraful Amin | Md Mofijul Islam | Amin Ahsan Ali
Findings of the Association for Computational Linguistics: EMNLP 2025

Recent advancements in speech-language models have yielded significant improvements in speech tokenization and synthesis. However, effectively mapping the complex, multidimensional attributes of speech into discrete tokens remains challenging. This process demands acoustic, semantic, and contextual information for precise speech representations. Existing speech representations generally fall into two categories: acoustic tokens from audio codecs and semantic tokens from speech self-supervised learning models. Although recent efforts have unified acoustic and semantic tokens for improved performance, they overlook the crucial role of contextual representation in comprehensive speech modeling. Our empirical investigations reveal that the absence of contextual representations results in elevated Word Error Rate (WER) and Word Information Lost (WIL) scores in speech transcriptions. To address these limitations, we propose two novel distillation approaches: (1) a language model (LM)-guided distillation method that incorporates contextual information, and (2) a combined LM and self-supervised speech model (SM)-guided distillation technique that effectively distills multimodal representations (acoustic, semantic, and contextual) into a comprehensive speech tokenizer, termed DM-Codec. The DM-Codec architecture adopts a streamlined encoder-decoder framework with a Residual Vector Quantizer (RVQ) and incorporates the LM and SM during the training process. Experiments show DM-Codec significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset.
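
A minimal sketch of the LM-guided distillation idea, assuming codec and LM time steps are pre-aligned: project the codec's quantized features into the LM's hidden space and pull them toward the LM's contextual representations of the transcript. Shapes, the projection, and the cosine objective are simplifying assumptions, not DM-Codec's exact loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, H_codec, H_lm = 2, 50, 256, 768        # assumed dimensions
proj = nn.Linear(H_codec, H_lm)              # codec -> LM hidden space

codec_feats = torch.randn(B, T, H_codec)     # quantized codec representations
lm_feats = torch.randn(B, T, H_lm)           # contextual LM hidden states

# Distillation term: maximize cosine similarity between aligned features.
distill = 1 - F.cosine_similarity(proj(codec_feats), lm_feats, dim=-1).mean()
# A full codec objective would add reconstruction and VQ commitment terms, e.g.:
# loss = recon_loss + vq_loss + lambda_distill * distill
```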

SOMAJGYAAN: A Dataset for Evaluating LLMs on Bangla Culture, Social Knowledge, and Low-Resource Language Adaptation
Fariha Anjum Shifa | Muhtasim Ibteda Shochcho | Abdullah Ibne Hanif Arean | Mohammad Ashfaq Ur Rahman | Akm Moshiur Rahman Mazumder | Ahaj Mahhin Faiak | Md Fahim | M Ashraful Amin | Amin Ahsan Ali | Akmmahbubur Rahman
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Despite significant progress in large language models (LLMs), their knowledge and evaluation continue to be centered around high-resource languages, leaving critical gaps in low-resource settings. This raises questions about how effectively LLMs handle subjects that require locally relevant knowledge. To address this challenge, we need a robust dataset that reflects the knowledge of underrepresented regions such as Bangladesh. In this paper, we present ***SOMAJGYAAN***, a Bangla multiple-choice dataset consisting of 4,234 questions, annotated across five levels of difficulty. The questions are drawn from Bangladesh’s National Curriculum and Global Studies textbooks, covering a wide range of domains including History, Geography, Economics, Social Studies, Politics and Law, and Miscellaneous topics. Difficulty levels were assigned by four expert annotators to minimize annotation bias. The experiments reveal that closed-source LLMs perform better than open-source LLMs. While fine-tuning open-source models on SOMAJGYAAN improves their performance, they still fall short of matching closed-source LLMs. Our findings highlight the importance of culturally grounded evaluation datasets and task-specific adaptation to improve LLM performance in low-resource language settings.
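
A hedged sketch of how multiple-choice accuracy per difficulty level could be computed for such a benchmark; `ask_llm` and the record layout are hypothetical, not the paper's evaluation harness.

```python
from collections import defaultdict

def ask_llm(prompt: str) -> str:
    """Hypothetical wrapper: call an LLM and parse out its option key ('A'..'D')."""
    raise NotImplementedError("plug in your LLM client here")

def evaluate(questions):
    """Accuracy grouped by difficulty; q: {question, options, answer, difficulty}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for q in questions:
        prompt = q["question"] + "\n" + "\n".join(
            f"({k}) {v}" for k, v in q["options"].items())
        totals[q["difficulty"]] += 1
        hits[q["difficulty"]] += int(ask_llm(prompt) == q["answer"])
    return {d: hits[d] / totals[d] for d in totals}  # per-difficulty accuracy
```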

CMBan: Cartoon-Driven Meme Contextual Classification Dataset for Bangla
Newaz Ben Alam | Akm Moshiur Rahman Mazumder | Mir Sazzat Hossain | Mysha Samiha | Md Alvi Noor Hossain | Md Fahim | Amin Ahsan Ali | Ashraful Islam | M Ashraful Amin | Akmmahbubur Rahman
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Social networks extensively feature memes, particularly cartoon images, as a prevalent form of communication that often conveys complex sentiments or harmful content. Detecting such content, particularly when it involves Bengali and English text, remains a multimodal challenge. This paper introduces ***CMBan***, a novel and culturally relevant dataset of 2,641 annotated cartoon memes. It addresses meme classification by sentiment across five key categories: Humor, Sarcasm, Offensiveness, Motivational Content, and Overall Sentiment, incorporating both image and text features. Our curated dataset specifically aids in detecting nuanced offensive content and in navigating the complexities of pure Bengali, pure English, and code-mixed Bengali-English language. Through rigorous experimentation involving over 12 multimodal models, including monolingual, multilingual, and proprietary architectures, and utilizing prompting methods such as Chain-of-Thought (CoT), our findings suggest that this cartoon-based, code-mixed meme content poses substantial understanding challenges. Experimental results demonstrate that closed models outperform open models, and that LoRA fine-tuning narrows the performance gap across model architectures and improves classification of the more challenging aspects. This work advances meme classification by providing an effective resource for detecting harmful content in multilingual meme contexts.
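
For readers unfamiliar with the LoRA fine-tuning strategy mentioned above, here is a minimal sketch using the PEFT library; the backbone checkpoint and `target_modules` are assumptions that depend on the chosen VLM, not the paper's exact recipe.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("your-vlm-checkpoint")  # placeholder
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"])          # assumed modules
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the small adapter matrices train
```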

2023

Contextual Bangla Neural Stemmer: Finding Contextualized Root Word Representations for Bangla Words
Md Fahim | Amin Ahsan Ali | M Ashraful Amin | Akmmahbubur Rahman
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

Stemmers are commonly used in NLP to reduce words to their root form. However, this process may discard important information and yield incorrect root forms, affecting the accuracy of NLP tasks. To address these limitations, we propose a Contextual Bangla Neural Stemmer for the Bangla language to enhance word representations. Our method involves splitting words into characters within the Neural Stemming Block, obtaining vector representations for both stem words and unknown vocabulary words. A loss function aligns these representations with Word2Vec representations, followed by contextual word representations from a Universal Transformer encoder. Mean pooling generates sentence-level representations that are aligned with BanglaBERT’s representations using an MLP layer. The proposed model also aims to build good representations for out-of-vocabulary (OOV) words. Experiments with our model on five Bangla datasets show around a 5% average improvement over the vanilla approach. Notably, our method avoids BERT retraining, focusing instead on root-word detection and addressing OOV and sub-word issues. By incorporating our approach into a large corpus-based language model, we expect further improvements in aspects such as explainability.
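
A minimal sketch of the character-level encoding idea: a word, including an OOV word, gets a vector from its characters, which can then be aligned to pretrained Word2Vec targets. The GRU choice and all sizes are illustrative assumptions, not the paper's exact stemming block.

```python
import torch
import torch.nn as nn

class NeuralStemmingBlock(nn.Module):
    """Encode a word from its character sequence so OOV words still get a vector."""
    def __init__(self, n_chars=128, dim=300):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)

    def forward(self, char_ids):                 # (batch, max_chars)
        _, h = self.rnn(self.char_emb(char_ids))
        return h.squeeze(0)                       # (batch, dim) stem/word vector

block = NeuralStemmingBlock()
char_ids = torch.randint(0, 128, (4, 12))         # dummy character indices
w2v_targets = torch.randn(4, 300)                 # pretrained Word2Vec vectors
loss = nn.MSELoss()(block(char_ids), w2v_targets) # alignment loss
```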

Investigating the Effectiveness of Graph-based Algorithm for Bangla Text Classification
Farhan Dehan | Md Fahim | Amin Ahsan Ali | M Ashraful Amin | Akmmahbubur Rahman
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

In this study, we examine and analyze the behavior of several graph-based models for Bangla text classification tasks. Graph-based algorithms create heterogeneous graphs from text data: each node represents either a word or a document, and each edge indicates a relationship between two words or between a word and a document. We applied the BERT model and different graph-based models, including TextGCN, GAT, BertGAT, and BertGCN, on five different datasets for Bangla text: SentNoB, Sarcasm detection, BanFakeNews, Hate speech detection, and Emotion detection. The BERT model outperformed the TextGCN and GAT models by a large margin in terms of accuracy, macro F1 score, and weighted F1 score. BertGCN and BertGAT are shown to outperform both the standalone graph models and the BERT model. BertGAT excelled on the Emotion detection dataset and achieved a 1%-2% performance boost over BERT on the Sarcasm detection, Hate speech detection, and BanFakeNews datasets, whereas BertGCN outperformed BertGAT by 1% on the SentNoB and BanFakeNews datasets and by 2% on the Sarcasm detection, Hate speech, and Emotion detection datasets. We also examined different variations in graph structure and analyzed their effects.
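
To make the graph-convolution step concrete, here is a minimal sketch of one TextGCN-style layer, X' = ReLU(Â X W), over the heterogeneous word/document graph; the adjacency construction (PMI for word-word edges, TF-IDF for word-document edges) is only summarized in comments, and the dummy inputs are assumptions.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution: X' = ReLU(A_hat @ X @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, a_hat, x):   # a_hat: normalized (N, N) adjacency
        return torch.relu(a_hat @ self.w(x))

n_nodes, dim = 100, 64             # dummy word + document nodes
a_hat = torch.eye(n_nodes)         # placeholder; TextGCN uses PMI/TF-IDF weights
x = torch.randn(n_nodes, dim)      # BertGCN would use BERT embeddings as features
out = GCNLayer(dim, 32)(a_hat, x)
```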

BaTEClaCor: A Novel Dataset for Bangla Text Error Classification and Correction
Nabilah Oshin | Syed Hoque | Md Fahim | Amin Ahsan Ali | M Ashraful Amin | Akmmahbubur Rahman
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

In the context of the dynamic realm of Bangla communication, online users are often prone to bending the language or making errors due to various factors. We attempt to detect, categorize, and correct those errors by employing several machine learning and deep learning models. To contribute to the preservation and authenticity of the Bangla language, we introduce a meticulously categorized organic dataset encompassing 10,000 authentic Bangla comments from a commonly used social media platform. Through rigorous comparative analysis of distinct models, our study highlights BanglaBERT’s superiority in error-category classification and underscores the effectiveness of BanglaT5 for text correction. BanglaBERT achieves accuracies of 79.1% and 74.1% for binary and multiclass error-category classification, respectively, when fine-tuned and tested on our proposed dataset. Moreover, BanglaT5 achieves the best ROUGE-L score (0.8459) when fine-tuned and tested with our corrected ground truths. Beyond algorithmic exploration, this endeavor represents a significant stride toward enhancing the quality of digital discourse in the Bangla-speaking community, fostering linguistic precision and coherence in online interactions. The dataset and code are available at https://github.com/SyedT1/BaTEClaCor.
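
A hedged sketch of what correction inference with a fine-tuned seq2seq model might look like; the checkpoint path is a placeholder for a BanglaT5 model fine-tuned on BaTEClaCor, and the generation settings are assumptions.

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tok = AutoTokenizer.from_pretrained("path/to/banglat5-finetuned")       # placeholder
model = AutoModelForSeq2SeqLM.from_pretrained("path/to/banglat5-finetuned")

noisy_comment = "..."  # an erroneous Bangla comment from the dataset
inputs = tok(noisy_comment, return_tensors="pt", truncation=True)
out = model.generate(**inputs, max_new_tokens=64, num_beams=4)
corrected = tok.decode(out[0], skip_special_tokens=True)
```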