Sacha Beniamine


2021

Multiple alignments of inflectional paradigms
Sacha Beniamine | Matías Guzmán Naranjo
Proceedings of the Society for Computation in Linguistics 2021

2020

Automated Parsing of Interlinear Glossed Text from Page Images of Grammatical Descriptions
Erich Round | Mark Ellison | Jayden Macklin-Cordes | Sacha Beniamine
Proceedings of the Twelfth Language Resources and Evaluation Conference

Linguists seek insight from all human languages; however, accessing information from most of the full store of extant global linguistic descriptions is not easy. One of the most common kinds of information that linguists have documented is vernacular sentences, as recorded in descriptive grammars. Typically these sentences are formatted as interlinear glossed text (IGT). Most descriptive grammars, however, exist only as hardcopy or scanned PDF documents. Consequently, parsing IGTs in scanned grammars is a priority, in order to significantly increase the volume of documented linguistic information that is readily accessible. Here we demonstrate fundamental viability for a technology that can assist in making a large number of linguistic data sources machine readable: the automated identification and parsing of interlinear glossed text from scanned page images. For example, we attain high median precision and recall (>0.95) in the identification of example sentences in IGT format. Our results will be of interest to those who are keen to see more of the existing documentation of human language, especially for less-resourced and endangered languages, become more readily accessible.

Opening the Romance Verbal Inflection Dataset 2.0: A CLDF lexicon
Sacha Beniamine | Martin Maiden | Erich Round
Proceedings of the Twelfth Language Resources and Evaluation Conference

We introduce the Romance Verbal Inflection Dataset 2.0, a multilingual lexicon of Romance inflection covering 74 varieties. The lexicon provides verbal paradigm forms in broad IPA phonemic notation. Both lexemes and paradigm cells are organized to reflect cognacy. Such multilingual inflected lexicons annotated for two dimensions of cognacy are necessary to study the evolution of inflectional paradigms and to test linguistic hypotheses systematically. However, these resources seldom exist, and when they do, they are not usually encoded in computationally usable ways. The Oxford Online Database of Romance Verb Morphology provides this kind of information; however, it is no longer maintained and is only available as a web service without interfaces for machine-readability. We collect its data, then clean and correct it for consistency using both heuristics and expert annotator judgements. Most resources used to study language evolution computationally rely strictly on multilingual contemporary information, and lack information about prior stages of the languages. To provide such information, we augment the database with Latin paradigms from the LatInFlexi lexicon. Finally, to make it widely available, the resource is released under a GPLv3 license in CLDF format.

2017

Une approche universelle pour l’abstraction automatique d’alternances morphophonologiques (A universal algorithm for the automatical abstraction of morphophonological alternations)
Sacha Beniamine
Actes des 24ème Conférence sur le Traitement Automatique des Langues Naturelles. Volume 2 - Articles courts

This article presents an implemented algorithm for inferring patterns of morphophonological alternation between word forms. It is universal in the sense that it yields classifications that are comparable from one language to another without presupposing the types of alternations involved. The patterns constitute a first step for quantitative work in the Word and Paradigm approach to morphology.