Alejandrina Cristia


2024

Long-Form Recordings to Study Children’s Language Input and Output in Under-Resourced Contexts
Joseph R. Coffey | Alejandrina Cristia
Proceedings of the Fifth Workshop on Resources for African Indigenous Languages @ LREC-COLING 2024

A growing body of research suggests that young children’s early speech and language exposure is associated with later language development (including delays and diagnoses), school readiness, and academic performance. The last decade has seen increasing use of child-worn devices by educators, economists, and developmental psychologists to collect long-form audio recordings. The most commonly used system for analyzing these data is LENA, which was trained on North American English child-centered data and generates estimates of children’s speech-like vocalization counts, adult word counts, and child-adult turn counts. Recently, cheaper, open-source non-LENA alternatives with multilingual training have been proposed. Both kinds of systems have been employed in under-resourced, sometimes multilingual contexts, including in Africa, where access to printed or digital linguistic resources may be limited. In this paper, we describe each kind of system (LENA, non-LENA), provide information on audio data collected with them that is available for reuse, review evidence of the accuracy of extant automated analyses, and note potential strengths and shortcomings of their use in African communities.

2020

Seshat: a Tool for Managing and Verifying Annotation Campaigns of Audio Data
Hadrien Titeux | Rachid Riad | Xuan-Nga Cao | Nicolas Hamilakis | Kris Madden | Alejandrina Cristia | Anne-Catherine Bachoud-Lévi | Emmanuel Dupoux
Proceedings of the Twelfth Language Resources and Evaluation Conference

We introduce Seshat, a new, simple, open-source software tool for efficiently managing annotations of speech corpora. Seshat allows users to easily customise and manage annotations of large audio corpora while ensuring compliance with the formatting and naming conventions of the annotated output files. In addition, it includes procedures for checking the content of annotations against specific rules that can be implemented in personalised parsers. Finally, we propose a double-annotation mode, for which Seshat automatically computes the associated inter-annotator agreement using the gamma measure, which takes into account both categorisation and segmentation discrepancies.

2019

Is Word Segmentation Child’s Play in All Languages?
Georgia R. Loukatou | Steven Moran | Damian Blasi | Sabine Stoll | Alejandrina Cristia
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

When learning language, infants need to break down the flow of input speech into minimal word-like units, a process best described as unsupervised bottom-up segmentation. Proposed strategies include several segmentation algorithms, but only cross-linguistically robust algorithms could be plausible candidates for human word learning, since infants have no initial knowledge of the ambient language. We report on the stability in performance of 11 conceptually diverse algorithms on a selection of 8 typologically distinct languages. The results constitute evidence that some segmentation algorithms are cross-linguistically valid and thus could be considered potential strategies employed by all infants.

2018

Modeling infant segmentation of two morphologically diverse languages
Georgia-Rengina Loukatou | Sabine Stoll | Damian Blasi | Alejandrina Cristia
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

A rich literature explores unsupervised segmentation algorithms infants could use to parse their input, mainly focusing on English, an analytic language where word, morpheme, and syllable boundaries often coincide. Synthetic languages, where words are multi-morphemic, may present unique difficulties for segmentation. Our study tests corpora of two languages selected to differ in the complexity of their morphological structure, Chintang and Japanese. We use three conceptually diverse word segmentation algorithms and evaluate them on both word- and morpheme-level representations. As predicted, results for the simpler Japanese are better than those for the more complex Chintang. However, the difference is small compared to the effect of the algorithm (with the lexical algorithm outperforming sub-lexical ones) and the level (scores were lower when evaluating on words versus morphemes). There are also important interactions between language, model, and evaluation level, which ought to be considered in future work.

2017

The Role of Prosody and Speech Register in Word Segmentation: A Computational Modelling Perspective
Bogdan Ludusan | Reiko Mazuka | Mathieu Bernard | Alejandrina Cristia | Emmanuel Dupoux
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This study explores the role of speech register and prosody in the task of word segmentation. Since these two factors are thought to play an important role in early language acquisition, we aim to quantify their contribution to this task. We study a Japanese corpus containing both infant- and adult-directed speech and apply four different word segmentation models, with and without knowledge of prosodic boundaries. The results show that the difference between registers is smaller than previously reported and that prosodic boundary information helps adult-directed speech more than infant-directed speech.

2016

La reconnaissance des mots dans la parole accentuée : Une étude en laboratoire et à l’extérieur. (Mispronunciations slow down word recognition: A study using touchscreens in the lab and the real world)
Delphine Deï | Page Piccinini | Isabelle Dautriche | Marieke Van Heugten | Alejandrina Cristia
Actes de la conférence conjointe JEP-TALN-RECITAL 2016. volume 1 : JEP

Recent work suggests that children and adults are initially slowed in their comprehension of words that are not pronounced in a standard way. Nevertheless, when faced with a speaker who has accented speech, they rapidly develop specific strategies that allow them to understand even atypical pronunciations. However, these results typically come from laboratory research, where participants’ attention is focused on a single task that demands few resources. To overcome these limitations, we conducted a word recognition experiment on a touchscreen tablet, testing children and adults, in the laboratory and in each group’s natural environment. We found that pronunciation deviations in accented speech slow down word recognition, in children and adults, both in the laboratory and in natural environments.

Word comprehension and multilingualism among toddlers: A study using touch screens in daycares
Laia Fibla | Charlotte Maniel | Alejandrina Cristia
Proceedings of the joint workshop on NLP for Computer Assisted Language Learning and NLP for Language Acquisition