Eddie Antonio Santos


Gi2Pi Rule-based, index-preserving grapheme-to-phoneme transformations
Aidan Pine | Patrick William Littell | Eric Joanis | David Huggins-Daines | Christopher Cox | Fineen Davis | Eddie Antonio Santos | Shankhalika Srikanth | Delasie Torkornoo | Sabrina Yu
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages

This paper describes the motivation and implementation details for a rule-based, index-preserving grapheme-to-phoneme engine ‘Gi2Pi’ implemented in pure Python and released under the open source MIT license. The engine and interface have been designed to prioritize the developer experience of potential contributors without requiring a high level of programming knowledge. ‘Gi2Pi’ already provides mappings for 30 (mostly Indigenous) languages, and the package is accompanied by a web-based interactive development environment, a RESTful API, and extensive documentation to encourage the addition of more mappings in the future. We also present three downstream applications of ‘Gi2Pi’ and show results of a preliminary evaluation.
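To illustrate what "rule-based, index-preserving" can mean in practice, here is a minimal sketch in Python. It is not the Gi2Pi implementation or its API: the function name `transduce`, the rule table, and the span format are all hypothetical, and the toy rules are English-like placeholders rather than a mapping from any language the package supports. The sketch applies longest-match-first rewrite rules and records, for every rule application, which input character range produced which output character range.

```python
def transduce(text, rules):
    """Rewrite `text` with longest-match-first rules, keeping index alignments.

    `rules` maps non-empty input strings to output strings (hypothetical format).
    Returns (output, spans), where each span is
    ((in_start, in_end), (out_start, out_end)) with end-exclusive indices.
    """
    # Try longer graphemes first so that "sh" wins over a bare "s".
    ordered = sorted((src for src in rules if src), key=len, reverse=True)
    output = ""
    spans = []
    i = 0
    while i < len(text):
        match = next((src for src in ordered if text.startswith(src, i)), None)
        if match is None:
            # No rule applies: copy the character through unchanged.
            src, dst = text[i], text[i]
        else:
            src, dst = match, rules[match]
        out_start = len(output)
        output += dst
        spans.append(((i, i + len(src)), (out_start, len(output))))
        i += len(src)
    return output, spans


# Toy, made-up rules for illustration only.
toy_rules = {"sh": "ʃ", "ng": "ŋ", "a": "ɑ"}
phonemes, alignment = transduce("shang", toy_rules)
```

Keeping the spans alongside the output is what allows a downstream tool to map a position in the phoneme string back to the original orthography.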


On the Computational Modelling of Michif Verbal Morphology
Fineen Davis | Eddie Antonio Santos | Heather Souter
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

This paper presents a finite-state computational model of the verbal morphology of Michif. Michif, the official language of the Métis peoples, is a uniquely mixed language with Algonquian and French origins. It is spoken across the Métis homelands in what is now called Canada and the United States, but it is highly endangered with fewer than 100 speakers. The verbal morphology is remarkably complex, as the already polysynthetic Algonquian patterns are combined with French elements and unique morpho-phonological interactions. The model presented in this paper, LI VERB KAA-OOSHITAHK DI MICHIF, handles this complexity by using a series of composed finite-state transducers to model the concatenative morphology and phonological rule alternations that are unique to Michif. Such a rule-based approach is necessary as there is insufficient language data for an approach that uses machine learning. A language model such as LI VERB KAA-OOSHITAHK DI MICHIF furthers the goals of Indigenous computational linguistics in Canada while also supporting the creation of tools for documentation, education, and revitalization that are desired by the Métis community.


OCR evaluation tools for the 21st century
Eddie Antonio Santos
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)


Training & Quality Assessment of an Optical Character Recognition Model for Northern Haida
Isabell Hubert | Antti Arppe | Jordan Lachler | Eddie Antonio Santos
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We are presenting our work on the creation of the first optical character recognition (OCR) model for Northern Haida, also known as Masset or Xaad Kil, a nearly extinct First Nations language spoken in the Haida Gwaii archipelago in British Columbia, Canada. We are addressing the challenges of training an OCR model for a language with an extensive, non-standard Latin character set as follows: (1) We have compared various training approaches and present the results of practical analyses to maximize recognition accuracy and minimize manual labor. An approach using just one or two pages of Source Images directly performed better than the Image Generation approach, and better than models based on three or more pages. Analyses also suggest that a character’s frequency is directly correlated with its recognition accuracy. (2) We present an overview of currently available OCR accuracy analysis tools. (3) We have ported the once de facto standard OCR accuracy tools so that they can cope with Unicode input. Our work adds to a growing body of research on OCR for particularly challenging character sets, and contributes to creating the largest electronic corpus for this severely endangered language.
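As a rough illustration of what a Unicode-aware OCR accuracy measurement involves, the following Python sketch computes character accuracy as (n - errors) / n, where n is the length of the ground-truth text and errors is the Levenshtein edit distance to the OCR output. This is not the ported tooling described in the abstract: the function names are hypothetical, and normalizing both strings to NFC before comparison is an assumption made here so that precomposed and decomposed forms of the same accented character are not counted as errors.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion from ref
                cur[j - 1] + 1,          # insertion into ref
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_accuracy(ground_truth, ocr_output):
    """Return (n - errors) / n after NFC normalization (an assumption here)."""
    ref = unicodedata.normalize("NFC", ground_truth)
    hyp = unicodedata.normalize("NFC", ocr_output)
    if not ref:
        raise ValueError("ground truth must not be empty")
    return (len(ref) - edit_distance(ref, hyp)) / len(ref)
```

Note that this counts Unicode code points, not user-perceived characters. A letter written with a combining diacritic that has no precomposed form still counts as two code points, so a grapheme-level tool would need an additional segmentation step.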