Valeria Pagliai
2026
The Spanish Learner and Heritage Speaker Dependency Treebank
Valeria Pagliai | Sergio José Salazar Rodó | Emiliana Pulido | Andres Gutierrez-Quintero | Zoey Liu
Proceedings of the Society for Computation in Linguistics 2026
We present a manually curated L2-Heritage Speaker Spanish dataset (N = 49,247) that follows the Universal Dependencies framework and includes lemmas, part-of-speech tags, syntactic dependencies, and annotations of pro-drop and ungrammatical structures. For dependency parsing, we also examine different data partitioning strategies and data representations, as well as different training configurations that use our data and the AnCora treebank. Overall, the parsers reach reasonable labeled attachment scores (LAS), with comparable performance on AnCora and on our dataset.
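As a reminder of what the LAS metric measures, here is a minimal sketch, not the authors' evaluation code. It assumes each token is represented as a (head index, dependency relation) pair; the function name `attachment_scores` and the toy sentence are hypothetical. A token counts toward the unlabeled score (UAS) when its predicted head is correct, and toward LAS when both head and relation are correct.

```python
from typing import List, Tuple

# Assumed representation: one (head index, dependency relation) pair per token.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) for two aligned lists of arcs."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted arcs must be aligned token by token")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    n = len(gold)
    # UAS: the predicted head matches the gold head.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head and the relation label match.
    las_hits = sum(g == p for g, p in zip(gold, pred))
    return uas_hits / n, las_hits / n


# Toy four-token sentence (hypothetical data, for illustration only).
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
pred = [(2, "det"), (3, "obj"), (0, "root"), (2, "obj")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

Standard evaluation scripts typically aggregate over all tokens in a test set rather than per sentence, and may exclude punctuation; this sketch does neither.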
A Dataset for Oral Reading in Young English Readers
Madison Rose | Michael Bennie | Valeria Pagliai | Hatice Kubra Karakis | Qian Shen | Xinyi Tai | Walter L. Leite | Zoey Liu
Proceedings of the 30th Conference on Computational Natural Language Learning
Among English child speech corpora, very few focus on oral reading. Existing resources such as the CMU Kids Corpus (Ellis Weismer et al., 2013) are limited by a lack of grade-appropriate, curriculum-aligned reading texts, by their annotation scope and quality, and, most crucially, by the absence of a comprehensive annotation scheme for characterizing children’s reading errors. This study presents a multi-layered, fully manually annotated corpus of oral reading from 63 1st-3rd grade students residing in the U.S. who grow up hearing and speaking English. Additionally, we contribute methodologically rigorous annotation guidelines that define 10 reading error categories and 26 sublevel error labels. Using a digital reading platform supported by GPT-4o-mini (OpenAI, 2024), children read stories on topics of their own interest, while the system records their speech and logs their interactions with embedded digital supports. Each recording is paired with detailed demographic and educational metadata and is annotated at three linguistic levels: (1) sentence- and word-level time alignment; (2) phonemic transcription; (3) reading errors.
2025
Predictability Effects of Spanish-English Code-Switching: A Directionality and Part of Speech Analysis
Josh Higdon | Valeria Pagliai | Zoey Liu
Proceedings of the Third Workshop on Quantitative Syntax (QUASY, SyntaxFest 2025)
Research on code-switching (CS), the spontaneous alternation between two or more languages within a discourse, remains relatively new and is often limited by the use of elicited production tasks, with some exceptions leveraging naturalistic corpora. This study analyses the effects of language directionality and part-of-speech (POS) tags on Spanish-English CS production across corpus modalities and speech communities. We use data from two spoken corpora, the Miami Bangor Corpus (MBC; N = 261,711) and the Spanish in Texas Corpus (STC; N = 416,784), as well as the written LinCE Corpus (N = 278,093). Bootstrap analyses indicate that Spanish serves as the matrix language (i.e., the most used) for MBC and LinCE, while English does so for STC. Logistic regression analyses show that the particle-coordinating conjunction combination was the strongest POS predictor of a code-switch. The results suggest that corpus modality and speech community affect matrix language proportions and that both previously attested and unseen POS combinations modulate the production of Spanish-English CS.
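To illustrate the kind of bootstrap estimate mentioned in the abstract, the sketch below resamples language-tagged tokens with replacement and reports a percentile interval for the share of one language. It is only an assumption about the general technique, not the authors' procedure: the resampling unit (tokens here), the labels `"spa"` and `"eng"`, the function name `bootstrap_share`, and the toy data are all hypothetical.

```python
import random
from typing import List, Tuple


def bootstrap_share(
    labels: List[str], target: str, n_resamples: int = 1000, seed: int = 0
) -> Tuple[float, float, float]:
    """Return (mean share, lower bound, upper bound) of `target` among `labels`.

    The bounds are the 2.5th and 97.5th percentiles of the resampled shares.
    """
    rng = random.Random(seed)
    n = len(labels)
    shares = []
    for _ in range(n_resamples):
        # Draw n labels with replacement and record the share of the target.
        sample = rng.choices(labels, k=n)
        shares.append(sample.count(target) / n)
    shares.sort()
    lower = shares[int(0.025 * n_resamples)]
    upper = shares[int(0.975 * n_resamples) - 1]
    mean = sum(shares) / n_resamples
    return mean, lower, upper


# Toy corpus: 70 Spanish-tagged and 30 English-tagged tokens (hypothetical).
tokens = ["spa"] * 70 + ["eng"] * 30
mean, lower, upper = bootstrap_share(tokens, "spa")
print(f"Spanish share: {mean:.3f} (95% interval {lower:.3f} to {upper:.3f})")
```

Under this simplified reading, a language would be treated as the most used one when the whole interval for its share lies above 0.5.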