2023
Challenges and Issue of Gender Bias in Under-Represented Languages: An Empirical Study on Inuktitut-English NMT
Ngoc Tan Le | Oussama Hansal | Fatiha Sadat
Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages
Towards the First Named Entity Recognition of Inuktitut for an Improved Machine Translation
Ngoc Tan Le | Soumia Kasdi | Fatiha Sadat
Proceedings of the Workshop on Natural Language Processing for Indigenous Languages of the Americas (AmericasNLP)
Named Entity Recognition (NER) is a crucial step to ensure good quality performance of several Natural Language Processing applications and tools, including machine translation and information retrieval. Moreover, it is considered a fundamental module of many Natural Language Understanding tasks such as question-answering systems. This paper presents a first study on NER for an under-represented Indigenous Inuit language of Canada, Inuktitut, which lacks linguistic resources and large labeled data. Our proposed NER model for Inuktitut is built by transferring linguistic characteristics from English to Inuktitut, based on either rules or bilingual word embeddings. We provide an empirical study based on a comparison with state-of-the-art models as well as intrinsic and extrinsic evaluations. In terms of Recall, Precision and F-score, the obtained results show the effectiveness of the proposed NER methods. Furthermore, they improved the performance of Inuktitut-English Neural Machine Translation.
2022
Indigenous Language Revitalization and the Dilemma of Gender Bias
Oussama Hansal | Ngoc Tan Le | Fatiha Sadat
Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
Natural Language Processing (NLP), through its several applications, has been considered one of the most valuable fields in interdisciplinary research, as well as in computer science. However, it is not without its flaws, and one of the most common is bias. This paper examines the main linguistic challenges of Inuktitut, an indigenous language of Canada, and focuses on gender bias identification and mitigation. We explore the unique characteristics of this language to understand which techniques can be used to identify and mitigate implicit biases. We quantify the gender bias present in Inuktitut word embeddings, then mitigate that bias and evaluate the performance of the debiased embeddings. We then explain how approaches for detecting and reducing bias in English embeddings may be transferred to Inuktitut embeddings by properly taking into account the language’s particular characteristics. Next, we compare the effect of the debiasing techniques on Inuktitut and English. Finally, we highlight some directions for future research.
Deep Learning-Based Morphological Segmentation for Indigenous Languages: A Study Case on Innu-Aimun
Ngoc Tan Le | Antoine Cadotte | Mathieu Boivin | Fatiha Sadat | Jimena Terraza
Proceedings of the Third Workshop on Deep Learning for Low-Resource Natural Language Processing
Recent advances in the field of deep learning have led to a growing interest in the development of NLP approaches for low-resource and endangered languages. Nevertheless, relatively little NLP research has been conducted on indigenous languages. These languages present complexities and challenges that make their study very difficult in the NLP and AI fields. This paper focuses on the morphological segmentation of indigenous languages, an extremely challenging task because of polysynthesis, dialectal variations with rich morpho-phonemics, misspellings and limited resources. The proposed approach to the morphological segmentation of Innu-Aimun, an extremely low-resource indigenous language of Canada, is based on deep learning. Experiments and evaluations have shown promising results compared to state-of-the-art rule-based and unsupervised approaches.
2021
Towards a First Automatic Unsupervised Morphological Segmentation for Inuinnaqtun
Ngoc Tan Le | Fatiha Sadat
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
Low-resource polysynthetic languages pose many challenges in NLP tasks, such as morphological analysis and Machine Translation, due to the scarcity of available resources and tools and to their morphological complexity. This research focuses on morphological segmentation, adapting an unsupervised approach based on Adaptor Grammars to a low-resource setting. Experiments and evaluations on Inuinnaqtun, a member of the Inuit language family in Northern Canada that is considered likely to be extinct in less than two generations, have shown promising results.
Towards a Low-Resource Neural Machine Translation for Indigenous Languages in Canada
Ngoc Tan Le | Fatiha Sadat
Traitement Automatique des Langues, Volume 62, Numéro 3 : Diversité Linguistique [Linguistic Diversity in Natural Language Processing]
2018
Improving the neural network-based machine transliteration for low-resourced language pair
Ngoc Tan Le | Fatiha Sadat
Proceedings of the 32nd Pacific Asia Conference on Language, Information and Computation
Low-Resource Machine Transliteration Using Recurrent Neural Networks of Asian Languages
Ngoc Tan Le | Fatiha Sadat
Proceedings of the Seventh Named Entities Workshop
Grapheme-to-phoneme models are key components in automatic speech recognition and text-to-speech systems. With low-resource language pairs that do not have available and well-developed pronunciation lexicons, grapheme-to-phoneme models are particularly useful. These models are based on initial alignments between grapheme source and phoneme target sequences. Inspired by sequence-to-sequence recurrent neural network-based translation methods, the current research presents an approach that applies an alignment representation for input sequences and pre-trained source and target embeddings to overcome the transliteration problem for a low-resource language pair. We participated in the English-Vietnamese transliteration task of the NEWS 2018 shared task.
2017
Translittération automatique pour une paire de langues peu dotée (Automatic Transliteration for a Low-Resource Language Pair)
Ngoc Tan Le | Fatiha Sadat | Lucie Ménard
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 3 - Démonstrations
Transliteration phonetically converts words in a source language (e.g., French) into equivalent words in a target language (e.g., Vietnamese). This conversion requires a considerable number of rules defined by expert linguists to determine how phonemes are aligned and to take into account the phonological system of the target language. For low-resource language pairs, the difficulty stems from the scarcity of linguistic resources. In this work, we present a demonstration of grapheme-to-phoneme conversion that addresses the transliteration problem for a low-resource language pair, applied to French-Vietnamese. Our system requires a small bilingual phonetic training corpus. We obtained promising results, with a gain of +4.40% in BLEU score over the baseline system, which uses a statistical machine translation approach.
2016
UQAM-NTL: Named entity recognition in Twitter messages
Ngoc Tan Le | Fatma Mallek | Fatiha Sadat
Proceedings of the 2nd Workshop on Noisy User-generated Text (WNUT)
This paper describes our system used in the 2nd Workshop on Noisy User-generated Text (WNUT) shared task for Named Entity Recognition (NER) in Twitter, held in conjunction with COLING 2016. Our system is based on supervised machine learning, applying Conditional Random Fields (CRF) to train two classifiers for two evaluations. The first evaluation aims at predicting the 10 fine-grained types of named entities, while the second aims at detecting named entities without predicting their types. The experimental results show that our method has significantly improved Twitter NER performance.
2015
Building a Bilingual Vietnamese-French Named Entity Annotated Corpus through Cross-Linguistic Projection
Ngoc Tan Le | Fatiha Sadat
Actes de la 22e conférence sur le Traitement Automatique des Langues Naturelles. Démonstrations
The creation of high-quality named entity annotated resources is a time-consuming and expensive process. Most of the gold standard corpora are available for English but not for less-resourced languages such as Vietnamese. For Asian languages, this task remains problematic. This paper focuses on the automatic construction of named entity annotated corpora for Vietnamese-French, a less-resourced pair of languages. We incrementally apply different cross-projection methods using parallel corpora, such as perfect string matching and edit distance similarity. Evaluations on the Vietnamese-French language pair show good accuracy (F-score of 94.90%) in identifying named entity pairs and building a named entity annotated parallel corpus.