Mst. Sanjida Jamal Priya
2025
AniSan@NLU of Devanagari Script Languages 2025: Optimizing Language Identification with Ensemble Learning
Anik Mahmud Shanto
|
Mst. Sanjida Jamal Priya
|
Mohammad Shamsul Arefin
Proceedings of the First Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2025)
Identifying languages written in Devanagari script, including Hindi, Marathi, Nepali, Bhojpuri, and Sanskrit, is essential in multilingual contexts but challenging due to the high overlap between these languages. To address this, a shared task on “Devanagari Script Language Identification” has been organized, with a dataset available for subtask A to test language identification models. This paper introduces an ensemble-based approach that combines mBERT, XLM-R, and IndicBERT models through majority voting to improve language identification accuracy across these languages. Our ensemble model has achieved an impressive accuracy of 99.68%, outperforming individual models by capturing a broader range of language features and reducing model biases that often arise from closely related linguistic patterns. Additionally, we have fine-tuned other transformer models as part of a comparative analysis, providing further validation of the ensemble’s effectiveness. The results highlight the ensemble model’s ability in distinguishing similar languages within the Devanagari script, offering a promising approach for accurate language identification in complex multilingual contexts.
MDC3: A Novel Multimodal Dataset for Commercial Content Classification in Bengali
Anik Mahmud Shanto
|
Mst. Sanjida Jamal Priya
|
Fahim Shakil Tamim
|
Mohammed Moshiul Hoque
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)
Identifying commercial posts in resource-constrained languages among diverse and unstructured content remains a significant challenge for automatic text classification tasks. To address this, this work introduces a novel dataset named MDC3 (Multimodal Dataset for Commercial Content Classification), comprising 5,007 annotated Bengali social media posts classified as commercial and noncommercial. A comprehensive annotation guideline accompanying the dataset is included to aid future dataset creation in resource-constrained languages. Furthermore, we performed extensive experiments on MDC3 considering both unimodal and multimodal domains. Specifically, the late fusion of textual (mBERT) and visual (ViT) models (i.e., ViT+mBERT) achieves the highest F1 score of 90.91, significantly surpassing other baselines.