Cheikh M. Bamba Dione

Also published as: Cheikh Bamba Dione


2022

pdf
Low-resource Neural Machine Translation: Benchmarking State-of-the-art Transformer for Wolof<->French
Cheikh M. Bamba Dione | Alla Lo | Elhadji Mamadou Nguer | Sileye Ba
Proceedings of the Thirteenth Language Resources and Evaluation Conference

In this paper, we propose two neural machine translation (NMT) systems (French-to-Wolof and Wolof-to-French) based on sequence-to-sequence with attention and Transformer architectures. We trained our models on the parallel French-Wolof corpus (Nguer et al., 2020) of about 83k sentence pairs. Because of the low-resource setting, we experimented with advanced methods for handling data sparsity, including subword segmentation, backtranslation and the copied corpus method. We evaluate the models using BLEU score and find that the transformer outperforms the classic sequence-to-sequence model in all settings, in addition to being less sensitive to noise. In general, the best scores are achieved when training the models on subword-level based units. For such models, using backtranslation proves to be slightly beneficial in low-resource Wolof to high-resource French language translation for the transformer-based models. A slight improvement can also be observed when injecting copied monolingual text in the target language. Moreover, combining the copied method data with backtranslation leads to a slight improvement of the translation quality.

2021

pdf
Multilingual Dependency Parsing for Low-Resource African Languages: Case Studies on Bambara, Wolof, and Yoruba
Cheikh M. Bamba Dione
Proceedings of the 17th International Conference on Parsing Technologies and the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies (IWPT 2021)

This paper describes a methodology for syntactic knowledge transfer between high-resource languages to extremely low-resource languages. The methodology consists in leveraging multilingual BERT self-attention model pretrained on large datasets to develop a multilingual multi-task model that can predict Universal Dependencies annotations for three African low-resource languages. The UD annotations include universal part-of-speech, morphological features, lemmas, and dependency trees. In our experiments, we used multilingual word embeddings and a total of 11 Universal Dependencies treebanks drawn from three high-resource languages (English, French, Norwegian) and three low-resource languages (Bambara, Wolof and Yoruba). We developed various models to test specific language combinations involving contemporary contact languages or genetically related languages. The results of the experiments show that multilingual models that involve high-resource languages and low-resource languages with contemporary contact between each other can provide better results than combinations that only include unrelated languages. As far genetic relationships are concerned, we could not draw any conclusion regarding the impact of language combinations involving the selected low-resource languages, namely Wolof and Yoruba.

2020

pdf
SENCORPUS: A French-Wolof Parallel Corpus
Elhadji Mamadou Nguer | Alla Lo | Cheikh M. Bamba Dione | Sileye O. Ba | Moussa Lo
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we report efforts towards the acquisition and construction of a bilingual parallel corpus between French and Wolof, a Niger-Congo language belonging to the Northern branch of the Atlantic group. The corpus is constructed as part of the SYSNET3LOc project. It currently contains about 70,000 French-Wolof parallel sentences drawn on various sources from different domains. The paper discusses the data collection procedure, conversion, and alignment of the corpus as well as it’s application as training data for neural machine translation. In fact, using this corpus, we were able to create word embedding models for Wolof with relatively good results. Currently, the corpus is being used to develop a neural machine translation model to translate French sentences into Wolof.

pdf
Implementation and Evaluation of an LFG-based Parser for Wolof
Cheikh M. Bamba Dione
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper reports on a parsing system for Wolof based on the LFG formalism. The parser covers core constructions of Wolof, including noun classes, cleft, copula, causative and applicative sentences. It also deals with several types of coordination, including same constituent coordination, asymmetric and asyndetic coordination. The system uses a cascade of finite-state transducers for word tokenization and morphological analysis as well as various lexicons. In addition, robust parsing techniques, including fragmenting and skimming, are used to optimize grammar coverage. Parsing coverage is evaluated by running test-suites of naturally occurring Wolof sentences through the parser. The evaluation of parsing coverage reveals that 72.72% of the test sentences receive full parses; 27.27% receive partial parses. To measure accuracy, the parsed sentences are disambiguated manually using an incremental parsebanking approach based on discriminants. The evaluation of parsing quality reveals that the parser achieves 67.2% recall, 92.8% precision and an f-score of 77.9%.

pdf
From LFG To UD: A Combined Approach
Cheikh M. Bamba Dione
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

This paper reports on a systematic approach for deriving Universal Dependencies from LFG structures. The conversion starts with a step-wise transformation of the c-structure, combining part-of-speech (POS) information and the embedding path to determine the true head of dependency structures. The paper discusses several issues faced by existing algorithms when applied on Wolof and presents the strategies used to account for these issues. An experimental evaluation indicated that our approach was able to generate the correct output in more than 90% of the cases, leading to a substantial improvement in conversion accuracy compared to the previous models.

2019

pdf
Developing Universal Dependencies for Wolof
Cheikh Bamba Dione
Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019)

2014

pdf
Pruning the Search Space of the Wolof LFG Grammar Using a Probabilistic and a Constraint Grammar Parser
Cheikh M. Bamba Dione
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper presents a method for greatly reducing parse times in LFG by integrating a Constraint Grammar parser into a probabilistic context-free grammar. The CG parser is used in the pre-processing phase to reduce morphological and lexical ambiguity. Similarly, the c-structure pruning mechanism of XLE is used in the parsing phase to discard low-probability c-structures, before f-annotations are solved. The experiment results show a considerable increase in parsing efficiency and robustness in the annotation of Wolof running text. The Wolof CG parser indicated an f-score of 90% for morphological disambiguation and a speedup of ca. 40%, while the c-structure pruning method increased the speed of the Wolof grammar by over 36%. On a small amount of data, CG disambiguation and c-structure pruning allowed for a speedup of 58%, however with a substantial drop in parse accuracy of 3.62.

2013

pdf
ParGramBank: The ParGram Parallel Treebank
Sebastian Sulger | Miriam Butt | Tracy Holloway King | Paul Meurer | Tibor Laczkó | György Rákosi | Cheikh Bamba Dione | Helge Dyvik | Victoria Rosén | Koenraad De Smedt | Agnieszka Patejuk | Özlem Çetinoğlu | I Wayan Arka | Meladel Mistica
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

2012

pdf
A Morphological Analyzer For Wolof Using Finite-State Techniques
Cheikh M. Bamba Dione
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

This paper reports on the design and implementation of a morphological analyzer for Wolof. The main motivation for this work is to obtain a linguistically motivated tool using finite-state techniques. The finite-state technology is especially attractive in dealing with human language morphologies. Finite-state transducers (FST) are fast, efficient and can be fully reversible, enabling users to perform analysis as well as generation. Hence, I use this approach to construct a new FST tool for Wolof, as a first step towards a computational grammar for the language in the Lexical Functional Grammar framework. This article focuses on the methods used to model complex morphological issues and on developing strategies to limit ambiguities. It discusses experimental evaluations conducted to assess the performance of the analyzer with respect to various statistical criteria. In particular, I also wanted to create morphosyntactically annotated resources for Wolof, obtained by automatically analyzing text corpora with a computational morphology.

2010

pdf
Design and Development of Part-of-Speech-Tagging Resources for Wolof (Niger-Congo, spoken in Senegal)
Cheikh M. Bamba Dione | Jonas Kuhn | Sina Zarrieß
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we report on the design of a part-of-speech-tagset for Wolof and on the creation of a semi-automatically annotated gold standard. In order to achieve high-quality annotation relatively fast, we first generated an accurate lexicon that draws on existing word and name lists and takes into account inflectional and derivational morphology. The main motivation for the tagged corpus is to obtain data for training automatic taggers with machine learning approaches. Hence, we took machine learning considerations into account during tagset design and we present training experiments as part of this paper. The best automatic tagger achieves an accuracy of 95.2% in cross-validation experiments. We also wanted to create a basis for experimenting with annotation projection techniques, which exploit parallel corpora. For this reason, it was useful to use a part of the Bible as the gold standard corpus, for which sentence-aligned parallel versions in many languages are easy to obtain. We also report on preliminary experiments exploiting a statistical word alignment of the parallel text.