2024
NLP Progress in Indigenous Latin American Languages
Atnafu Tonja | Fazlourrahman Balouchzahi | Sabur Butt | Olga Kolesnikova | Hector Ceballos | Alexander Gelbukh | Thamar Solorio
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
This paper focuses on the marginalization of indigenous language communities in the face of rapid technological advancement. We highlight the cultural richness of these languages and the risk they face of being overlooked in the realm of Natural Language Processing (NLP). We aim to bridge the gap between these communities and researchers, emphasizing the need for inclusive technological advancements that respect indigenous community perspectives. We chart the NLP progress of indigenous Latin American languages and present a survey covering the status of indigenous languages in Latin America, their representation in NLP, and the challenges and innovations required for their preservation and development. The paper contributes to the literature on the needs and progress of NLP for the indigenous communities of Latin America, and for low-resource and indigenous communities in general.
EthioLLM: Multilingual Large Language Models for Ethiopian Languages with Task Evaluation
Atnafu Lambebo Tonja | Israel Abebe Azime | Tadesse Destaw Belay | Mesay Gemeda Yigezu | Moges Ahmed Ah Mehamed | Abinew Ali Ayele | Ebrahim Chekol Jibril | Michael Melese Woldeyohannis | Olga Kolesnikova | Philipp Slusallek | Dietrich Klakow | Seid Muhie Yimam
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Large language models (LLMs) have gained popularity recently due to their outstanding performance in various downstream Natural Language Processing (NLP) tasks. However, low-resource languages still lag behind current state-of-the-art (SOTA) developments in NLP due to insufficient resources to train LLMs. Ethiopian languages exhibit remarkable linguistic diversity, encompassing a wide array of scripts, and are imbued with profound religious and cultural significance. This paper introduces EthioLLM, multilingual large language models for five Ethiopian languages (Amharic, Ge’ez, Afan Oromo, Somali, and Tigrinya) and English, and Ethiobenchmark, a new benchmark dataset for various downstream NLP tasks. We evaluate the performance of these models across five downstream NLP tasks. We open-source our multilingual language models, the new benchmark datasets, and the task-specific fine-tuned models, and discuss their performance. Our datasets and models are available at https://huggingface.co/EthioNLP.
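As a hedged illustration (not from the paper), this is how one of the released checkpoints could be loaded from the EthioNLP hub; the model id below is a hypothetical placeholder, and an encoder-style architecture is assumed.

```python
# Minimal sketch: loading an EthioLLM checkpoint from the EthioNLP hub.
# NOTE: the model id is a hypothetical placeholder; substitute an actual
# id listed at https://huggingface.co/EthioNLP before running.
from transformers import AutoModel, AutoTokenizer

model_id = "EthioNLP/<model-name>"  # hypothetical, replace before running
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)  # assumes an encoder checkpoint

inputs = tokenizer("placeholder Amharic text", return_tensors="pt")
outputs = model(**inputs)  # hidden states for downstream task heads
print(outputs.last_hidden_state.shape)
```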
EthioMT: Parallel Corpus for Low-resource Ethiopian Languages
Atnafu Lambebo Tonja | Olga Kolesnikova | Alexander Gelbukh | Jugal Kalita
Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024
Recent research in natural language processing (NLP) has achieved impressive performance in tasks such as machine translation (MT), news classification, and question-answering in high-resource languages. However, the performance of MT leaves much to be desired for low-resource languages. This is due to the smaller size of available parallel corpora in these languages, if such corpora are available at all. NLP in Ethiopian languages suffers from the same issues due to the unavailability of publicly accessible datasets for NLP tasks, including MT. To help the research community and foster research for Ethiopian languages, we introduce EthioMT – a new parallel corpus for 15 languages. We also create a new benchmark by collecting a dataset for better-researched languages in Ethiopia. We evaluate the newly collected corpus and the benchmark dataset for 23 Ethiopian languages using transformer and fine-tuning approaches.
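As a hedged illustration of the kind of MT evaluation such a benchmark supports, the sketch below scores candidate translations against references with sacreBLEU; the sentences are invented placeholders, not EthioMT data.

```python
# Minimal sketch: scoring MT output with sacreBLEU (pip install sacrebleu).
import sacrebleu

hypotheses = ["the house is small", "the cat sleeps"]  # system output
# One reference stream: one reference string per hypothesis.
references = [["the house is small", "the cat is sleeping"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```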
Social Media Fake News Classification Using Machine Learning Algorithm
Girma Bade | Olga Kolesnikova | Grigori Sidorov | José Oropeza
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
The rise of social media has facilitated easier communication, information sharing, and current-affairs updates. However, the prevalence of misleading and deceptive content, commonly referred to as fake news, poses a significant challenge. This paper focuses on the classification of fake news in Malayalam, a Dravidian language, using natural language processing (NLP) techniques. We trained a random forest model on the dataset provided by the DravidianLangTech@EACL 2024 shared task. Evaluated on the separate test dataset, our model achieved a macro F1 score of 0.71.
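A minimal sketch of the random-forest classifier described above, assuming TF-IDF bag-of-words features (the abstract does not specify the feature set); the toy texts stand in for the Malayalam shared-task data.

```python
# Minimal sketch: random forest for binary fake-news classification.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

# Toy stand-ins for the Malayalam shared-task data.
train_texts = ["example genuine post", "example fake post"]
train_labels = [0, 1]  # 0 = genuine, 1 = fake
test_texts, test_labels = ["another genuine post"], [0]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),  # assumed featurization
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
clf.fit(train_texts, train_labels)
preds = clf.predict(test_texts)
print("macro F1:", f1_score(test_labels, preds, average="macro"))
```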
Habesha@DravidianLangTech 2024: Fake News Detection in Dravidian Languages using Deep Learning
Mesay Yigezu | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
This research tackles fake news detection using an RNN-LSTM deep learning model with hyperparameters optimized through grid search. Despite its success in binary classification, the model’s performance in multi-label classification is hindered by unbalanced data. We achieved a score of 0.82 on the binary classification task, whereas on the multi-class task the score was 0.32. We suggest that researchers who take this task further incorporate data-balancing techniques to improve results in the multi-class setting.
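A minimal sketch of grid-searching LSTM hyperparameters as described above; the grid values, architecture sizes, and toy data are illustrative assumptions, not the paper's settings.

```python
# Minimal sketch: manual grid search over LSTM hyperparameters (Keras).
import itertools
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Toy integer-encoded sequences standing in for the shared-task data.
X = np.random.randint(1, 1000, size=(64, 30))  # 64 texts, 30 tokens each
y = np.random.randint(0, 2, size=(64,))        # binary fake/real labels

grid = {"units": [32, 64], "dropout": [0.2, 0.5]}
best = (None, -1.0)
for units, dropout in itertools.product(grid["units"], grid["dropout"]):
    model = Sequential([
        Embedding(input_dim=1000, output_dim=32),
        LSTM(units, dropout=dropout),
        Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    hist = model.fit(X, y, epochs=2, validation_split=0.25, verbose=0)
    acc = hist.history["val_accuracy"][-1]
    if acc > best[1]:
        best = ((units, dropout), acc)
print("best (units, dropout):", best[0], "val accuracy:", best[1])
```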
Social Media Hate and Offensive Speech Detection Using a Machine Learning Method
Girma Bade | Olga Kolesnikova | Grigori Sidorov | José Oropeza
Proceedings of the Fourth Workshop on Speech, Vision, and Language Technologies for Dravidian Languages
Although improper use of social media is increasing nowadays, technology also offers solutions. Here, improper use means posting hate and offensive speech that might harm an individual or a group. Hate speech refers to an insult toward an individual or group based on their identity, and spreading it on social media platforms is a serious problem for society. The solution, on the other hand, lies in natural language processing (NLP) technology capable of detecting and handling such content. This paper presents the detection of hate and offensive speech in code-mixed Telugu social media text. The task and gold-standard dataset were provided by the shared task organizers (DravidianLangTech@EACL 2024). We employed the TF-IDF technique for numeric feature extraction and a random forest algorithm to model hate speech detection. Finally, the developed model was evaluated on the test dataset and achieved a macro F1 score of 0.492.
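Since the pipeline mirrors the random-forest sketch shown earlier, the snippet below illustrates only the macro F1 metric used here, which averages per-class F1 scores with equal weight regardless of class frequency.

```python
# Macro F1 averages per-class F1 scores with equal weight, so
# minority-class performance counts as much as majority-class.
from sklearn.metrics import classification_report, f1_score

y_true = [0, 0, 0, 1, 1, 2]  # toy gold labels (3 classes)
y_pred = [0, 0, 1, 1, 1, 0]  # toy system predictions
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred, zero_division=0))
```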
2023
Natural Language Processing in Ethiopian Languages: Current State, Challenges, and Opportunities
Atnafu Lambebo Tonja | Tadesse Destaw Belay | Israel Abebe Azime | Abinew Ali Ayele | Moges Ahmed Mehamed | Olga Kolesnikova | Seid Muhie Yimam
Proceedings of the Fourth Workshop on Resources for African Indigenous Languages (RAIL 2023)
This survey delves into the current state of natural language processing (NLP) for four Ethiopian languages: Amharic, Afaan Oromo, Tigrinya, and Wolaytta. Through this paper, we identify key challenges and opportunities for NLP research in Ethiopia. Furthermore, we provide a centralized GitHub repository containing publicly available resources for various NLP tasks in these languages, which can be updated periodically with contributions from other researchers. Our objective is to disseminate this information to NLP researchers interested in Ethiopian languages and to encourage future research in this domain.
Parallel Corpus for Indigenous Language Translation: Spanish-Mazatec and Spanish-Mixtec
Atnafu Lambebo Tonja | Christian Maldonado-Sifuentes | David Alejandro Mendoza Castillo | Olga Kolesnikova | Noé Castro-Sánchez | Grigori Sidorov | Alexander Gelbukh
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)
In this paper, we present a parallel Spanish-Mazatec and Spanish-Mixtec corpus for machine translation (MT) tasks, where Mazatec and Mixtec are two indigenous Mexican languages. We evaluated the usability of the collected corpus using three different approaches: transformer, transfer learning, and fine-tuning pre-trained multilingual MT models. Fine-tuning the Facebook M2M100-48 model outperformed the other approaches, with BLEU scores of 12.09 and 22.25 for Mazatec-Spanish and Spanish-Mazatec translations, respectively, and 16.75 and 22.15 for Mixtec-Spanish and Spanish-Mixtec translations, respectively. The results indicate that translation performance is influenced by the dataset size (9,799 sentences in Mazatec and 13,235 sentences in Mixtec) and is more effective when indigenous languages are used as target languages. The findings emphasize the importance of creating parallel corpora for indigenous languages and fine-tuning models for low-resource translation tasks. Future research will investigate zero-shot and few-shot learning approaches to further improve translation performance in low-resource settings.
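A hedged sketch of the M2M-100 entry point in the transformers library; "facebook/m2m100_418M" stands in for the paper's checkpoint, and the Spanish language code is reused for illustration since Mazatec and Mixtec have no codes in M2M-100's inventory. Both choices are assumptions.

```python
# Minimal sketch: translating with an M2M-100 checkpoint (transformers).
# Assumptions: facebook/m2m100_418M stands in for the paper's checkpoint,
# and "es" is reused as a stand-in code since Mazatec has no M2M-100 code.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

tokenizer.src_lang = "es"
encoded = tokenizer("Hola, ¿cómo estás?", return_tensors="pt")
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.get_lang_id("es"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```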
Enhancing Translation for Indigenous Languages: Experiments with Multilingual Models
Atnafu Lambebo Tonja | Hellina Hailu Nigatu | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh | Jugal Kalita
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)
This paper describes CIC NLP’s submission to the AmericasNLP 2023 Shared Task on machine translation systems for indigenous languages of the Americas. We present system descriptions for three methods: two multilingual models, M2M-100 and mBART50, and one bilingual (one-to-one) model, the Helsinki-NLP Spanish-English translation model, with different transfer learning setups. We experimented with 11 languages of the Americas and report the setups we used as well as the results we achieved. Overall, the mBART setup improved upon the baseline for three of the eleven languages.
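A hedged sketch of the corresponding mBART-50 entry point, assuming the many-to-many checkpoint; as with M2M-100 above, the indigenous languages have no mBART language codes, so the Spanish code serves as a stand-in.

```python
# Minimal sketch: generation with mBART-50 (transformers).
# Assumptions: the many-to-many checkpoint; indigenous languages have no
# mBART language codes, so "es_XX" is reused as a stand-in target code.
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

ckpt = "facebook/mbart-large-50-many-to-many-mmt"
model = MBartForConditionalGeneration.from_pretrained(ckpt)
tokenizer = MBart50TokenizerFast.from_pretrained(ckpt, src_lang="es_XX")

encoded = tokenizer("Hola, ¿cómo estás?", return_tensors="pt")
generated = model.generate(
    **encoded, forced_bos_token_id=tokenizer.lang_code_to_id["es_XX"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```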
LIDOMA@DravidianLangTech: Convolutional Neural Networks for Studying Correlation Between Lexical Features and Sentiment Polarity in Tamil and Tulu Languages
Moein Tash | Jesus Armenta-Segura | Zahra Ahani | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages
With the prevalence of code-mixing among speakers of Dravidian languages, DravidianLangTech proposed a shared task on sentiment analysis in Tamil and Tulu at RANLP 2023. This paper presents the submission of LIDOMA, which combines lexical features and Convolutional Neural Networks (CNNs) to address the challenge. A fine-tuned six-layer CNN model is employed, achieving macro F1 scores of 0.542 for Tulu and 0.199 for Tamil.
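A minimal sketch of a CNN text classifier in the spirit of the six-layer model described above; the layer sizes, class count, and toy data are illustrative assumptions.

```python
# Minimal sketch: a small six-layer CNN for sentence sentiment (Keras).
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv1D, Dense, Embedding,
                                     GlobalMaxPooling1D, MaxPooling1D)

X = np.random.randint(1, 5000, size=(128, 40))  # toy token-id sequences
y = np.random.randint(0, 3, size=(128,))        # 3 sentiment classes

model = Sequential([
    Embedding(input_dim=5000, output_dim=64),
    Conv1D(128, kernel_size=3, activation="relu"),
    MaxPooling1D(pool_size=2),
    Conv1D(64, kernel_size=3, activation="relu"),
    GlobalMaxPooling1D(),
    Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=2, batch_size=16, verbose=0)
```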
Habesha@DravidianLangTech: Utilizing Deep and Transfer Learning Approaches for Sentiment Analysis
Mesay Gemeda Yigezu | Tadesse Kebede | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages
This paper focuses on sentiment analysis of Tamil and Tulu texts using a BERT model and an RNN model. The pretrained BERT model achieved satisfactory performance for Tulu, with a macro F1 score of 0.352, while the RNN model showed good performance on Tamil sentiment analysis, obtaining a macro F1 score of 0.208. As future work, we aim to fine-tune the models to further improve these results.
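A minimal sketch of the BERT-based setup, assuming a multilingual BERT backbone and a three-way sentiment label set, neither of which is specified in the abstract.

```python
# Minimal sketch: multilingual BERT as a 3-way sentiment classifier.
# Assumptions: bert-base-multilingual-cased as the backbone and
# positive/neutral/negative as the label set.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=3)

inputs = tokenizer("a Tulu sentence would go here", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print("predicted class:", logits.argmax(dim=-1).item())
```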
Habesha@DravidianLangTech: Abusive Comment Detection using Deep Learning Approach
Mesay Gemeda Yigezu | Selam Kanta | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh
Proceedings of the Third Workshop on Speech and Language Technologies for Dravidian Languages
This research focuses on identifying abusive language in comments. We utilize deep learning models, including Long Short-Term Memory (LSTM) networks and Recurrent Neural Networks (RNNs), to analyze linguistic patterns. Specifically, the LSTM model, a type of RNN, captures long-term dependencies and intricate patterns in the input sequences to understand context. The LSTM model achieves better accuracy and is enhanced through the addition of a dropout layer and early stopping. For detecting abusive language in Telugu and Tamil-English, an LSTM model is employed, while for Tamil a word-level RNN is developed to identify abusive words. These models process text sequentially, considering the overall content and capturing contextual dependencies.
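A minimal sketch of the LSTM recipe described above, with the added dropout layer and early stopping on validation loss; dimensions and toy data are assumptions.

```python
# Minimal sketch: LSTM abusive-comment classifier with a dropout layer
# and early stopping on validation loss (Keras).
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.layers import Dense, Dropout, Embedding, LSTM

X = np.random.randint(1, 3000, size=(96, 50))  # toy token-id sequences
y = np.random.randint(0, 2, size=(96,))        # abusive vs. not abusive

model = Sequential([
    Embedding(input_dim=3000, output_dim=64),
    LSTM(64),
    Dropout(0.5),  # the added dropout layer
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
early = EarlyStopping(monitor="val_loss", patience=2,
                      restore_best_weights=True)
model.fit(X, y, validation_split=0.2, epochs=20,
          callbacks=[early], verbose=0)
```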
First Attempt at Building Parallel Corpora for Machine Translation of Northeast India’s Very Low-Resource Languages
Atnafu Lambebo Tonja | Melkamu Mersha | Ananya Kalita | Olga Kolesnikova | Jugal Kalita
Proceedings of the 20th International Conference on Natural Language Processing (ICON)
This paper presents the creation of initial bilingual corpora for thirteen very low-resource languages of India, all from Northeast India, together with the results of initial translation efforts in these languages. It provides the first-ever parallel corpora for these languages along with initial benchmark neural machine translation results. We intend to extend these corpora to include a large number of low-resource Indian languages and to integrate the effort with our prior work on African and American-Indian languages, creating corpora that cover a large number of languages from across the world.
2022
CIC NLP at SMM4H 2022: a BERT-based approach for classification of social media forum posts
Atnafu Lambebo Tonja | Olumide Ebenezer Ojo | Mohammed Arif Khan | Abdul Gafar Manuel Meque | Olga Kolesnikova | Grigori Sidorov | Alexander Gelbukh
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task
This paper describes our submissions to the Social Media Mining for Health (SMM4H) 2022 shared tasks. We participated in two tasks: (a) Task 4, classification of tweets self-reporting exact age, and (b) Task 9, classification of Reddit posts self-reporting exact age. We evaluated two transformer-based models, BERT and RoBERTa, on both tasks. RoBERTa-Large achieved an F1 score of 0.846 on the Task 4 test set, and BERT-Large achieved an F1 score of 0.865 on the Task 9 test set.
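A minimal sketch of one fine-tuning step for the RoBERTa-Large setup described above; the toy batch, label convention, and learning rate are illustrative assumptions.

```python
# Minimal sketch: one fine-tuning step for RoBERTa on binary
# self-reported-age classification. Toy texts stand in for tweets.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

ckpt = "roberta-large"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)
model.train()

texts = ["I turned 25 today!", "nice weather out"]
labels = torch.tensor([1, 0])  # 1 = self-reports exact age (assumed)
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
print("training loss:", loss.item())
```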
Transformer-based Model for Word Level Language Identification in Code-mixed Kannada-English Texts
Atnafu Lambebo Tonja | Mesay Gemeda Yigezu | Olga Kolesnikova | Moein Shahiki Tash | Grigori Sidorov | Alexander Gelbukh
Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts
This paper describes our system for the CoLI-Kanglish 2022 shared task on word-level language identification in code-mixed Kannada-English texts. The goal of the task is to label each word with one of six categories: Kannada, English, Mixed-Language, Location, Name, or Others. The code-mixed data was compiled by the CoLI-Kanglish 2022 organizers from social media posts. We use two classification techniques, KNN and SVM, achieve an F1 score of 0.58, and place third out of nine competitors.
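A minimal sketch of the KNN/SVM word-level approach, assuming character n-gram TF-IDF features (the abstract does not state the feature set); the toy words and tags stand in for the CoLI-Kanglish data.

```python
# Minimal sketch: word-level language ID with char n-grams + KNN and SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

words = ["hello", "great", "mane", "beku", "helloog", "banni"]
labels = ["en", "en", "kn", "kn", "mixed", "kn"]  # toy tags

vec = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
X = vec.fit_transform(words)

for clf in (KNeighborsClassifier(n_neighbors=3), LinearSVC()):
    clf.fit(X, labels)
    print(type(clf).__name__, clf.predict(vec.transform(["hogona"])))
```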
Word Level Language Identification in Code-mixed Kannada-English Texts using Deep Learning Approach
Mesay Gemeda Yigezu | Atnafu Lambebo Tonja | Olga Kolesnikova | Moein Shahiki Tash | Grigori Sidorov | Alexander Gelbukh
Proceedings of the 19th International Conference on Natural Language Processing (ICON): Shared Task on Word Level Language Identification in Code-mixed Kannada-English Texts
The goal of code-mixed language identification (LID) is to determine which language is used in a given segment of speech or text, whether a word, a sentence, or a document. Our task is to identify English, Kannada, and mixed-language words in the provided data. To train a model, we used the CoLI-Kenglish dataset, which contains English, Kannada, and mixed-language words. We conducted several experiments to obtain the best-performing model, then implemented the best model using a Bidirectional Long Short-Term Memory (Bi-LSTM) network, which outperformed the other trained models with an F1 score of 0.61.
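A minimal sketch of a character-level Bi-LSTM word classifier in the spirit described above; the vocabulary size, padding scheme, and toy data are assumptions.

```python
# Minimal sketch: char-level Bi-LSTM for word-level language ID (Keras).
import numpy as np
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Bidirectional, Dense, Embedding, LSTM

# Toy words as padded character-id sequences (0 = padding).
X = np.random.randint(1, 60, size=(80, 12))  # 80 words, max 12 chars
y = np.random.randint(0, 3, size=(80,))      # en / kn / mixed

model = Sequential([
    Embedding(input_dim=60, output_dim=32, mask_zero=True),
    Bidirectional(LSTM(32)),  # reads characters in both directions
    Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=2, verbose=0)
```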