Serge Léger

Also published as: Serge Leger

2021

pdf bib abs
N-gram and Neural Models for Uralic Language Identification: NRC at VarDial 2021
Gabriel Bernier-Colborne | Serge Leger | Cyril Goutte
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

We describe the systems developed by the National Research Council Canada for the Uralic language identification shared task at the 2021 VarDial evaluation campaign. We evaluated two different approaches to this task: a probabilistic classifier exploiting only character 5-grams as features, and a character-based neural network pre-trained through self-supervision, then fine-tuned on the language identification task. The former method turned out to perform better, which casts doubt on the usefulness of deep learning methods for language identification, where they have yet to convincingly and consistently outperform simpler and less costly classification algorithms exploiting n-gram features.

2019

pdf bib abs
Improving Cuneiform Language Identification with BERT
Gabriel Bernier-Colborne | Cyril Goutte | Serge Léger
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

We describe the systems developed by the National Research Council Canada for the Cuneiform Language Identification (CLI) shared task at the 2019 VarDial evaluation campaign. We compare a state-of-the-art baseline relying on character n-grams and a traditional statistical classifier, a voting ensemble of classifiers, and a deep learning approach using a Transformer network. We describe how these systems were trained, and analyze the impact of some preprocessing and model estimation decisions. The deep neural network achieved 77% accuracy on the test data, which turned out to be the best performance at the CLI evaluation, establishing a new state-of-the-art for cuneiform language identification.

2017

pdf bib abs
Exploring Optimal Voting in Native Language Identification
Cyril Goutte | Serge Léger
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

We describe the submissions entered by the National Research Council Canada in the NLI-2017 evaluation. We mainly explored the use of voting, and various ways to optimize the choice and number of voting systems. We also explored the use of features that rely on no linguistic preprocessing. Long ngrams of characters obtained from raw text turned out to yield the best performance on all textual input (written essays and speech transcripts). Voting ensembles turned out to produce small performance gains, with little difference between the various optimization strategies we tried. Our top systems achieved accuracies of 87% on the essay track, 84% on the speech track, and close to 92% by combining essays, speech and i-vectors in the fusion track.

2016

pdf bib abs
Advances in Ngram-based Discrimination of Similar Languages
Cyril Goutte | Serge Léger
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

We describe the systems entered by the National Research Council in the 2016 shared task on discriminating similar languages. Like previous years, we relied on character ngram features, and a mixture of discriminative and generative statistical classifiers. We mostly investigated the influence of the amount of data on the performance, in the open task, and compared the two-stage approach (predicting language/group, then variant) to a flat approach. Results suggest that ngrams are still state-of-the-art for language and variant identification, and that additional data has a small but decisive impact.

pdf bib abs
Discriminating Similar Languages: Evaluations and Explorations
Cyril Goutte | Serge Léger | Shervin Malmasi | Marcos Zampieri
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We present an analysis of the performance of machine learning classifiers on discriminating between similar languages and language varieties. We carried out a number of experiments using the results of the two editions of the Discriminating between Similar Languages (DSL) shared task. We investigate the progress made between the two tasks, estimate an upper bound on possible performance using ensemble and oracle combination, and provide learning curves to help us understand which languages are more challenging. A number of difficult sentences are identified and investigated further with human annotation

Serge Léger

2021

2019

2017

2016

2015

2014

2013

Co-authors

Venues