BiCap: Bangla Image Captioning Using Attention-based Encoder-Decoder Architecture

Md Aminul Kader Bulbul


Abstract
Automatic image captioning has gained significant attention at the intersection of computer vision and natural language processing, yet research in low-resource languages such as Bangla remains limited. This work introduces BiCap, an attention-based encoder–decoder framework designed for Bangla image captioning. The model leverages a pretrained ResNet-50 as the encoder to extract rich visual features and a Long Short-Term Memory (LSTM) network as the decoder to sequentially generate Bangla captions. To overcome the fixed-length bottleneck of traditional encoder–decoder architectures, we integrate Bahdanau attention, enabling the decoder to dynamically focus on salient image regions while producing each word. The model is trained and evaluated on the Chitron dataset, with extensive preprocessing including vocabulary construction, tokenization, and word embedding. Experimental results demonstrate that BiCap outperforms existing works (Masud et al., 2025; Hossain et al., 2024; Das et al., 2023; Humaira et al., 2021), yielding higher BLEU, METEOR, ROUGE, and CIDEr scores. Improved fluency in human evaluation further confirms that the model generates more contextually accurate and semantically coherent captions, although occasional challenges remain with complex scenes. Recent advances in Vision–Language Models (VLMs), such as CLIP, BLIP, Flamingo, LLaVA, and MiniGPT-4, have redefined state-of-the-art captioning performance in high-resource settings. However, these models require large multimodal corpora and extensive pretraining that are currently unavailable for Bangla. BiCap therefore offers a resource-efficient, interpretable, and practically deployable solution tailored to low-resource multimodal learning.
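The Bahdanau attention step described in the abstract can be sketched as follows. This is a minimal illustrative NumPy implementation, not the paper's code: it assumes the encoder output is a grid of region features (e.g., a flattened 7×7 ResNet-50 feature map giving 49 vectors of dimension 2048) and that the decoder state is the LSTM hidden state; all dimensions and weight names here are hypothetical.

```python
import numpy as np

def bahdanau_attention(decoder_state, encoder_features, W_s, W_h, v):
    """Additive (Bahdanau) attention over encoder region features.

    decoder_state:    (d_dec,)        previous LSTM hidden state s_{t-1}
    encoder_features: (n, d_enc)      one feature vector per image region
    W_s, W_h, v:      learned projections into a shared attention space

    Returns the context vector (weighted sum of region features) and
    the attention weights, which indicate where the decoder "looks"
    while emitting the next word.
    """
    # score_i = v^T tanh(W_s s + W_h h_i)  -- one scalar per region
    scores = np.tanh(decoder_state @ W_s + encoder_features @ W_h) @ v
    # Softmax over regions (stabilized by subtracting the max score)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted sum of region features
    context = weights @ encoder_features
    return context, weights
```

At each decoding step the context vector is concatenated with the current word embedding and fed to the LSTM, so the decoder conditions on a different mixture of image regions for every generated word; this is what removes the fixed-length bottleneck of a single global image vector.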
Anthology ID:
2025.banglalp-1.6
Volume:
Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Firoj Alam, Sudipta Kar, Shammur Absar Chowdhury, Naeemul Hassan, Enamul Hoque Prince, Mohiuddin Tasnim, Md Rashad Al Hasan Rony, Md Tahmid Rahman
Venues:
BanglaLP | WS
Publisher:
Association for Computational Linguistics
Pages:
80–90
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.banglalp-1.6/
Cite (ACL):
Md Aminul Kader Bulbul. 2025. BiCap: Bangla Image Captioning Using Attention-based Encoder-Decoder Architecture. In Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025), pages 80–90, Mumbai, India. Association for Computational Linguistics.
Cite (Informal):
BiCap: Bangla Image Captioning Using Attention-based Encoder-Decoder Architecture (Bulbul, BanglaLP 2025)
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.banglalp-1.6.pdf