Sabine Stoll


The ACQDIV Corpus Database and Aggregation Pipeline
Anna Jancso | Steven Moran | Sabine Stoll
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present the ACQDIV corpus database and aggregation pipeline, a tool developed as part of the European Research Council (ERC) funded project ACQDIV, which aims to identify the universal cognitive processes that allow children to acquire any language. The corpus database represents 15 corpora from 14 typologically maximally diverse languages. Here we give an overview of the project, database, and our extensible software package for adding more corpora to the current language sample. Lastly, we discuss how we use the corpus database to mine for universal patterns in child language acquisition corpora and we describe avenues for future research.


Is Word Segmentation Child’s Play in All Languages?
Georgia R. Loukatou | Steven Moran | Damian Blasi | Sabine Stoll | Alejandrina Cristia
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

When learning language, infants need to break down the flow of input speech into minimal word-like units, a process best described as unsupervised bottom-up segmentation. Proposed strategies include several segmentation algorithms, but only cross-linguistically robust algorithms could be plausible candidates for human word learning, since infants have no initial knowledge of the ambient language. We report on the stability in performance of 11 conceptually diverse algorithms on a selection of 8 typologically distinct languages. The results constitute evidence that some segmentation algorithms are cross-linguistically valid, and thus could be considered as potential strategies employed by all infants.

On the Distribution of Deep Clausal Embeddings: A Large Cross-linguistic Study
Damian Blasi | Ryan Cotterell | Lawrence Wolf-Sonkin | Sabine Stoll | Balthasar Bickel | Marco Baroni
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Embedding a clause inside another (“the girl [who likes cars [that run fast]] has arrived”) is a fundamental resource that has been argued to be a key driver of linguistic expressiveness. As such, it plays a central role in fundamental debates on what makes human language unique, and how it might have evolved. Empirical evidence on the prevalence and the limits of embeddings has however been based on either laboratory setups or corpus data of relatively limited size. We introduce here a collection of large, dependency-parsed written corpora in 17 languages that allow us, for the first time, to capture clausal embedding through dependency graphs and assess their distribution. Our results indicate that there is no evidence for hard constraints on embedding depth: the tail of depth distributions is heavy. Moreover, although deeply embedded clauses tend to be shorter, suggesting processing load issues, complex sentences with many embeddings do not display a bias towards less deep embeddings. Taken together, the results suggest that deep embeddings are not disfavoured in written language. More generally, our study illustrates how resources and methods from latest-generation big-data NLP can provide new perspectives on fundamental questions in theoretical linguistics.
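
As a rough illustration of how clausal embedding depth might be read off a dependency graph, the Python sketch below counts clause-introducing relations on each token's path to the root. It is not the authors' procedure: the set of Universal Dependencies relations treated as clausal, the function names, and the hand-built parse of the example sentence are all assumptions made for this sketch.

```python
from typing import List

# Relations assumed (for this sketch only) to introduce an embedded clause.
# These are Universal Dependencies labels; the study may use a different set.
CLAUSAL_RELATIONS = {"acl", "acl:relcl", "advcl", "ccomp", "xcomp", "csubj"}


def embedding_depths(heads: List[int], deprels: List[str]) -> List[int]:
    """Count clausal relations on each token's path to the root.

    heads[i] is the 1-based index of the head of token i+1 (0 for the root),
    and deprels[i] is the relation linking token i+1 to that head.
    """
    depths = []
    for i in range(len(heads)):
        depth = 0
        node = i + 1
        steps = 0
        while node != 0:
            if deprels[node - 1] in CLAUSAL_RELATIONS:
                depth += 1
            node = heads[node - 1]
            steps += 1
            if steps > len(heads):
                # A well-formed tree never needs more steps than tokens.
                raise ValueError("cycle in dependency graph")
        depths.append(depth)
    return depths


def max_embedding_depth(heads: List[int], deprels: List[str]) -> int:
    """Deepest clausal embedding found in one sentence (0 if none)."""
    return max(embedding_depths(heads, deprels), default=0)


# Hand-built (hypothetical) parse of:
# the girl who likes cars that run fast has arrived
heads = [2, 10, 4, 2, 4, 7, 5, 7, 10, 0]
deprels = ["det", "nsubj", "nsubj", "acl:relcl", "obj",
           "nsubj", "acl:relcl", "advmod", "aux", "root"]

depths = embedding_depths(heads, deprels)
deepest = max_embedding_depth(heads, deprels)
```

Under these assumptions, the tokens of “that run fast” sit at depth 2, those of “who likes cars” at depth 1, and the main clause at depth 0. Collecting the per-sentence maximum over a corpus would give a depth distribution of the kind the abstract describes.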


Cross-linguistically Small World Networks are Ubiquitous in Child-directed Speech
Steven Moran | Danica Pajović | Sabine Stoll
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

Modeling infant segmentation of two morphologically diverse languages
Georgia-Rengina Loukatou | Sabine Stoll | Damian Blasi | Alejandrina Cristia
Actes de la Conférence TALN. Volume 1 - Articles longs, articles courts de TALN

A rich literature explores unsupervised segmentation algorithms infants could use to parse their input, mainly focusing on English, an analytic language where word, morpheme, and syllable boundaries often coincide. Synthetic languages, where words are multi-morphemic, may present unique difficulties for segmentation. Our study tests corpora of two languages selected to differ in the extent of complexity of their morphological structure, Chintang and Japanese. We use three conceptually diverse word segmentation algorithms and we evaluate them on both word- and morpheme-level representations. As predicted, results for the simpler Japanese are better than those for the more complex Chintang. However, the difference is small compared to the effect of the algorithm (with the lexical algorithm outperforming sub-lexical ones) and the level (scores were lower when evaluating on words versus morphemes). There are also important interactions between language, model, and evaluation level, which ought to be considered in future work.


Automatic interlinear glossing as two-level sequence classification
Tanja Samardžić | Robert Schikowski | Sabine Stoll
Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH)