Franck Sajous


2020

This paper introduces ENGLAWI, a large, versatile, XML-encoded machine-readable dictionary extracted from Wiktionary. ENGLAWI contains 752,769 articles encoding the full body of information included in Wiktionary: simple words, compounds and multiword expressions, lemmas and inflectional paradigms, etymologies, phonemic transcriptions in IPA, definition glosses and usage examples, translations, semantic and morphological relations, spelling variants, etc. It is fully documented, released under a free license and supplied with G-PeTo, a series of scripts allowing easy information extraction from ENGLAWI. Additional resources extracted from ENGLAWI, such as an inflectional lexicon, a lexicon of diatopic variants and the inclusion dates of headwords in Wiktionary’s nomenclature, are also provided. The paper describes the content of the resource and illustrates how it can be used and how it has been used in previous studies. Finally, we introduce ongoing work that computes lexicographic word embeddings from ENGLAWI’s definitions.
Glawinette is a derivational lexicon of French that will be used to feed the Démonette database. It has been created from the GLAWI machine-readable dictionary. We collected pairs of words from the definitions and the morphological sections of the dictionary and then selected the ones that form regular formal analogies and that instantiate sufficiently frequent formal patterns. The graph structure of the morphological families was then used to identify, for each pair of lexemes, derivational patterns that are close to the intuition of morphologists.

2016

GLAWI is a free, large-scale and versatile Machine-Readable Dictionary (MRD) that has been extracted from the French language edition of Wiktionary, called Wiktionnaire. In (Sajous and Hathout, 2015), we introduced GLAWI, gave the rationale behind the creation of this lexicographic resource and described the extraction process, focusing on the conversion and standardization of the heterogeneous data provided by this collaborative dictionary. In the current article, we describe the content of GLAWI and illustrate how it is structured. We also suggest various applications, ranging from linguistic studies and NLP applications to psycholinguistic experimentation. All of them can take advantage of the diversity of the lexical knowledge available in GLAWI. Besides this diversity and extensive lexical coverage, GLAWI is also remarkable because it is the only free lexical resource of contemporary French that contains definitions. This unique material opens the way to the renewal of MRD-based methods, notably the automated extraction and acquisition of semantic relations.

2015

2014

This paper introduces GLAFF, a large-scale versatile French lexicon extracted from Wiktionary, the collaborative online dictionary. GLAFF contains, for each entry, inflectional features and phonemic transcriptions. It distinguishes itself from the other available French lexicons by its size, its potential for constant updating and its copyleft license. We explain how we built GLAFF and compare it to other known resources in terms of coverage and quality of the phonemic transcriptions. We show that its size and quality are strong assets that could allow GLAFF to become a reference lexicon for French NLP and linguistics. Moreover, other derived lexicons can easily be based on GLAFF to satisfy the specific needs of various fields such as psycholinguistics.

2013

2011

2009