2023
Subwords to Word Back Composition for Morphologically Rich Languages in Neural Machine Translation
Telem Joyson Singh | Sanasam Ranbir Singh | Priyankoo Sarmah
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation
Assamese Back Transliteration - An Empirical Study Over Canonical and Non-canonical Datasets
Hemanta Baruah | Sanasam Ranbir Singh | Priyankoo Sarmah
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation
2022
TEAM: A multitask learning based Taxonomy Expansion approach for Attach and Merge
Bornali Phukon | Anasua Mitra | Ranbir Sanasam | Priyankoo Sarmah
Findings of the Association for Computational Linguistics: NAACL 2022
Taxonomy expansion is a crucial task. Most automatic taxonomy expansion operations are of two types, attach and merge. In a taxonomy like WordNet, both merge and attach are integral parts of the expansion operations, but the majority of studies consider them separately. This paper proposes a novel multi-task learning-based deep learning method known as Taxonomy Expansion with Attach and Merge (TEAM) that performs both the merge and attach operations. To the best of our knowledge, this is the first study that integrates both merge and attach operations in a single model. The proposed models have been evaluated on three separate WordNet taxonomies, viz., Assamese, Bangla, and Hindi. Across the various experimental setups, TEAM outperforms its state-of-the-art counterparts on the attach operation and also provides highly encouraging performance on the merge operation.
AsNER - Annotated Dataset and Baseline for Assamese Named Entity recognition
Dhrubajyoti Pathak | Sukumar Nandi | Priyankoo Sarmah
Proceedings of the Thirteenth Language Resources and Evaluation Conference
We present AsNER, a named entity annotation dataset for the low-resource Assamese language, together with a baseline Assamese NER model. The dataset contains about 99k tokens comprising text from the speech of the Prime Minister of India and an Assamese play. It also contains person names, location names and addresses. The proposed NER dataset is likely to be a significant resource for deep neural network based Assamese language processing. We benchmark the dataset by training NER models and evaluating them using state-of-the-art architectures for supervised named entity recognition (NER) such as Fasttext, BERT, XLM-R, FLAIR, and MuRIL. We implement several baseline approaches with the state-of-the-art sequence tagging Bi-LSTM-CRF architecture. The best baseline achieves an F1-score of 80.69% when using MuRIL as the word embedding method. The annotated dataset and the top-performing model are made publicly available.
2020
Lexical Tone Recognition in Mizo using Acoustic-Prosodic Features
Parismita Gogoi | Abhishek Dey | Wendy Lalhminghlui | Priyankoo Sarmah | S R Mahadeva Prasanna
Proceedings of the Twelfth Language Resources and Evaluation Conference
Mizo is an under-studied Tibeto-Burman tonal language of North-East India. Preliminary research findings have confirmed that four distinct tones (High, Low, Rising and Falling) appear in the language. In this work, an attempt is made to automatically recognize the four phonological tones of Mizo using acoustic-prosodic parameters as features. Six features computed from Fundamental Frequency (F0) contours are considered, and two classifier models, based on Support Vector Machine (SVM) and Deep Neural Network (DNN) respectively, are implemented for the automatic tone recognition task. The Mizo database consists of 31950 iterations of the four Mizo tones, collected from 19 speakers using trisyllabic phrases. A four-way classification of tones is attempted with a balanced dataset (equal number of iterations per tone category). It is observed that the DNN-based classifier shows performance comparable to that of the SVM-based classifier in correctly recognizing the four phonological Mizo tones.