Deeparghya Dutta Barua
2025
BanTH: A Multi-label Hate Speech Detection Dataset for Transliterated Bangla
Fabiha Haider
|
Fariha Tanjim Shifat
|
Md Farhan Ishmam
|
Md Sakib Ul Rahman Sourove
|
Deeparghya Dutta Barua
|
Md Fahim
|
Md Farhad Alam Bhuiyan
Findings of the Association for Computational Linguistics: NAACL 2025
The proliferation of transliterated texts in digital spaces has emphasized the need for detecting and classifying hate speech in languages beyond English, particularly in low-resource languages. As online discourse can perpetuate discrimination based on target groups, e.g. gender, religion, and origin, multi-label classification of hateful content can help in understanding hate motivation and enhance content moderation. While previous efforts have focused on monolingual or binary hate classification tasks, no work has yet addressed the challenge of multi-label hate speech classification in transliterated Bangla. We introduce BanTH, the first multi-label transliterated Bangla hate speech dataset. The samples are sourced from YouTube comments, where each instance is labeled with one or more target groups, reflecting the regional demographic. We propose a novel translation-based LLM prompting strategy that translates or transliterates under-resourced text to higher-resourced text before classifying the hate group(s). Experiments reveal further pre-trained encoders achieving state-of-the-art performance on the BanTH dataset while translation-based prompting outperforms other strategies in the zero-shot setting. We address a critical gap in Bangla hate speech and set the stage for further exploration into code-mixed and multi-label classification in underrepresented languages.
2024
BanglaTLit: A Benchmark Dataset for Back-Transliteration of Romanized Bangla
Md Fahim
|
Fariha Tanjim Shifat
|
Fabiha Haider
|
Deeparghya Dutta Barua
|
MD Sakib Ul Rahman Sourove
|
Md Farhan Ishmam
|
Md Farhad Alam Bhuiyan
Findings of the Association for Computational Linguistics: EMNLP 2024
Low-resource languages like Bangla are severely limited by the lack of datasets. Romanized Bangla texts are ubiquitous on the internet, offering a rich source of data for Bangla NLP tasks and extending the available data sources. However, due to the informal nature of romanized text, they often lack the structure and consistency needed to provide insights. We address these challenges by proposing: (1) BanglaTLit, the large-scale Bangla transliteration dataset consisting of 42.7k samples, (2) BanglaTLit-PT, a pre-training corpus on romanized Bangla with 245.7k samples, (3) encoders further-pretrained on BanglaTLit-PT achieving state-of-the-art performance in several romanized Bangla classification tasks, and (4) multiple back-transliteration baseline methods, including a novel encoder-decoder architecture using further pre-trained encoders. Our results show the potential of automated Bangla back-transliteration in utilizing the untapped sources of romanized Bangla to enrich this language. The code and datasets are publicly available: https://github.com/farhanishmam/BanglaTLit.
Search
Fix data
Co-authors
- Md Farhad Alam Bhuiyan 2
- Md Fahim 2
- Fabiha Haider 2
- Md Farhan Ishmam 2
- Fariha Tanjim Shifat 2
- show all...