2025
Whispering in Ol Chiki: Cross-Lingual Transfer Learning for Santali Speech Recognition
Atanu Mandal | Madhusudan Ghosh | Pratick Maiti | Sudip Kumar Naskar
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
India, a country with a large population, has two official and twenty-two scheduled languages, making it one of the most linguistically diverse nations. Despite being one of the scheduled languages, Santali remains a low-resource language. Although Ol Chiki is recognized as the official script for Santali, many speakers continue to use the Bengali, Devanagari, Odia, and Roman scripts. In tribute to the upcoming centenary of the Ol Chiki script, we present an Automatic Speech Recognition (ASR) system for Santali in the Ol Chiki script. Our approach uses cross-lingual transfer learning: we fine-tune Whisper models pre-trained on Bengali and Hindi for Santali, using Ol Chiki script transcriptions. Adapting the Bengali pre-trained model achieved a Word Error Rate (WER) of 28.47%, whereas adapting the Hindi pre-trained model resulted in a WER of 34.50%. Both results were obtained with the Whisper Small model.
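As described, the recipe amounts to initializing from a Whisper Small checkpoint adapted to Bengali (or Hindi) and continuing supervised training on Santali audio paired with Ol Chiki transcriptions. A minimal sketch with Hugging Face transformers follows; the checkpoint name, dataset columns, and hyperparameters are illustrative assumptions, not the authors' exact setup.

```python
# Hedged sketch of the cross-lingual recipe: start from a Whisper Small
# checkpoint already adapted to Bengali, then continue training on Santali
# audio paired with Ol Chiki transcriptions. Checkpoint name, dataset
# columns, and hyperparameters below are illustrative assumptions.
import torch
from dataclasses import dataclass
from transformers import (
    WhisperProcessor,
    WhisperForConditionalGeneration,
    Seq2SeqTrainingArguments,
)

CKPT = "openai/whisper-small"  # stand-in; the paper starts from Bengali/Hindi-adapted models

processor = WhisperProcessor.from_pretrained(CKPT, task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(CKPT)

def prepare(example):
    """Map raw 16 kHz audio + Ol Chiki text to Whisper inputs and labels."""
    audio = example["audio"]
    example["input_features"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    example["labels"] = processor.tokenizer(example["transcription"]).input_ids
    return example

@dataclass
class Collator:
    """Pad audio features and label ids separately; mask label padding with -100."""
    def __call__(self, features):
        inputs = [{"input_features": f["input_features"]} for f in features]
        batch = processor.feature_extractor.pad(inputs, return_tensors="pt")
        labels = [{"input_ids": f["labels"]} for f in features]
        labels = processor.tokenizer.pad(labels, return_tensors="pt")
        batch["labels"] = labels["input_ids"].masked_fill(
            labels["attention_mask"].ne(1), -100
        )
        return batch

args = Seq2SeqTrainingArguments(
    output_dir="whisper-small-santali",
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    max_steps=4000,  # illustrative schedule, not the paper's
    fp16=torch.cuda.is_available(),
)

# train_ds = <a datasets.Dataset with "audio" and "transcription" columns>.map(prepare)
# trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_ds,
#                          data_collator=Collator(), tokenizer=processor)
# trainer.train()
```

The reported WER could then be computed on held-out transcriptions, e.g. with jiwer.wer(references, hypotheses).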
2023
MLlab4CS at SemEval-2023 Task 2: Named Entity Recognition in Low-resource Language Bangla Using Multilingual Language Models
Shrimon Mukherjee | Madhusudan Ghosh | Girish | Partha Basuchowdhuri
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
Extracting named entities from low-resource languages and recognizing their types is one of the important tasks in the entity extraction domain, and many studies have recently been conducted in this area. In our study, we introduce a system for identifying complex entities and recognizing their types in the low-resource language Bangla, developed for SemEval-2023 Task 2 (MultiCoNER II). For this sequence labeling task, we use a pre-trained language model built on a natural language processing framework. Our team name in this competition is MLlab4CS. Our MuRIL-based model produces a macro-averaged F-score of 76.27%, a competitive result for this task.
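The abstract frames the task as sequence labeling with MuRIL. A minimal sketch of that setup is shown below, using the public google/muril-base-cased checkpoint; the label set is an illustrative placeholder rather than the full MultiCoNER II fine-grained taxonomy, and the classification head would need fine-tuning on the Bangla track data before its predictions mean anything.

```python
# Hedged sketch: Bangla NER as token classification with MuRIL.
# The checkpoint name is the public one; LABELS is a placeholder subset,
# not the MultiCoNER II fine-grained taxonomy.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC"]  # illustrative subset

tokenizer = AutoTokenizer.from_pretrained("google/muril-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "google/muril-base-cased", num_labels=len(LABELS)
)

def tag(sentence: str):
    """Return (subword token, predicted label) pairs for one Bangla sentence."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits  # shape: (1, seq_len, num_labels)
    preds = logits.argmax(-1)[0].tolist()
    toks = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return [(t, LABELS[p]) for t, p in zip(toks, preds)]

# Predictions are random until the head is fine-tuned on annotated data.
print(tag("বাংলা একটি সমৃদ্ধ ভাষা"))
```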
2022
Astro-mT5: Entity Extraction from Astrophysics Literature using mT5 Language Model
Madhusudan Ghosh | Payel Santra | Sk Asif Iqbal | Partha Basuchowdhuri
Proceedings of the First Workshop on Information Extraction from Scientific Publications
Scientific research requires reading and extracting relevant information from existing scientific literature effectively. To gain insights over a collection of such scientific documents, extracting entities and recognizing their types is considered one of the important tasks. Numerous studies have been conducted in this area of research. In our study, we introduce a framework for entity recognition and identification on the NASA astrophysics dataset, which was published as part of the DEAL shared task. We use a pre-trained multilingual model, based on a natural language processing framework, for the given sequence labeling tasks. Experiments show that our model, Astro-mT5, outperforms the existing baseline in astrophysics-related information extraction.
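Since mT5 is a text-to-text model, one plausible way to cast the DEAL labeling task is to generate a tagged copy of the input sentence. The sketch below illustrates that framing with the public google/mt5-small checkpoint; the task prefix and output format are hypothetical, and the paper's exact formulation may differ.

```python
# Hedged sketch: entity extraction cast as text-to-text generation with mT5,
# in the spirit of Astro-mT5. The "label entities:" prefix and the output
# format are hypothetical assumptions, not the paper's documented setup.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

def extract_entities(text: str) -> str:
    """Generate a tagged version of the input sentence."""
    prompt = f"label entities: {text}"  # hypothetical task prefix
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=64)
    return tokenizer.decode(out[0], skip_special_tokens=True)

# Output is only meaningful after fine-tuning on DEAL-style annotations.
print(extract_entities(
    "We observed the supernova SN 1987A with the Hubble Space Telescope."
))
```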