Desislava Aleksandrova

2023

CEFR-based Contextual Lexical Complexity Classifier in English and French
Desislava Aleksandrova | Vincent Pouliot
Proceedings of the 18th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2023)

This paper describes a CEFR-based classifier of single-word and multi-word lexical complexity in context from a second language learner perspective in English and in French, developed as an analytical tool for the pedagogical team of the language learning application Mauril. We provide an overview of the required corpora and the way we transformed them into rich contextual representations that allow the disambiguation and accurate labelling in context of polysemous occurrences of a given lexical item. We report evaluation results for all models, including two multi-lingual lexical classifiers evaluated on novel French datasets created for this experiment. Finally, we share the perspective of Mauril’s pedagogical team on the limitations of such systems.

2022

RCML at TSAR-2022 Shared Task: Lexical Simplification With Modular Substitution Candidate Ranking
Desislava Aleksandrova | Olivier Brochu Dufour
Proceedings of the Workshop on Text Simplification, Accessibility, and Readability (TSAR-2022)

This paper describes the lexical simplification system RCML submitted to the English language track of the TSAR-2022 Shared Task. The system leverages a pre-trained language model to generate contextually plausible substitution candidates which are then ranked according to their simplicity as well as their grammatical and semantic similarity to the target complex word. Our submissions secure 6th and 7th places out of 33, improving over the SOTA baseline for 27 out of the 51 metrics.

A French Corpus of Québec’s Parliamentary Debates
Pierre André Ménard | Desislava Aleksandrova
Proceedings of the Workshop ParlaCLARIN III within the 13th Language Resources and Evaluation Conference

Parliamentary debates offer a window on political stances as well as a repository of linguistic and semantic knowledge. They provide insights and reasons for laws and regulations that impact electors in their everyday life. One such resource is the transcribed debates available online from the Assemblée Nationale du Québec (ANQ). This paper describes the effort to convert the online ANQ debates from various HTML formats into a standardized ParlaMint TEI annotated corpus and to enrich it with annotations extracted from related unstructured lists of members and political parties. The resulting resource includes 88 years of debates over a span of 114 years with more than 33.3 billion words. The addition of linguistic annotations is detailed as well as a quantitative analysis of part-of-speech tags and distribution of utterances across the corpus.

2019

Multilingual sentence-level bias detection in Wikipedia
Desislava Aleksandrova | François Lareau | Pierre André Ménard
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

We propose a multilingual method for the extraction of biased sentences from Wikipedia, and use it to create corpora in Bulgarian, French and English. Sifting through the revision history of the articles that at some point had been considered biased and later corrected, we retrieve the last tagged and the first untagged revisions as the before/after snapshots of what was deemed a violation of Wikipedia’s neutral point of view policy. We extract the sentences that were removed or rewritten in that edit. The approach yields sufficient data even in the case of relatively small Wikipedias, such as the Bulgarian one, where 62k articles produced 5k biased sentences. We evaluate our method by manually annotating 520 sentences for Bulgarian and French, and 744 for English. We assess the level of noise and analyze its sources. Finally, we exploit the data with well-known classification methods to detect biased sentences. Code and datasets are hosted at https://github.com/crim-ca/wiki-bias.