Micha Elsner

2021

Formalizing Inflectional Paradigm Shape with Information Theory
Grace LeFevre | Micha Elsner | Andrea D. Sims
Proceedings of the Society for Computation in Linguistics 2021

What transfers in morphological inflection? Experiments with analogical models
Micha Elsner
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

We investigate how abstract processes like suffixation can be learned from morphological inflection task data using an analogical memory-based framework. In this framework, the inflection target form is specified by providing an example inflection of another word in the language. We show that this model is capable of near-baseline performance on the SigMorphon 2020 inflection challenge. Such a model can make predictions for unseen languages, allowing us to perform one-shot inflection on natural languages and investigate morphological transfer with synthetic probes. Accuracy for one-shot transfer can be unexpectedly high for some target languages (88% in Shona) and language families (53% across Romance). Probe experiments show that the model learns partially generalizable representations of prefixation, suffixation and reduplication, aiding its ability to transfer. We argue that the degree of generality of these process representations also helps to explain transfer results from previous research.

2020

Stop the Morphological Cycle, I Want to Get Off: Modeling the Development of Fusion
Micha Elsner | Martha Johnson | Stephanie Antetomaso | Andrea Sims
Proceedings of the Society for Computation in Linguistics 2020

Interpreting Sequence-to-Sequence Models for Russian Inflectional Morphology
David King | Andrea Sims | Micha Elsner
Proceedings of the Society for Computation in Linguistics 2020

Acquiring language from speech by learning to remember and predict
Cory Shain | Micha Elsner
Proceedings of the 24th Conference on Computational Natural Language Learning

Classical accounts of child language learning invoke memory limits as a pressure to discover sparse, language-like representations of speech, while more recent proposals stress the importance of prediction for language learning. In this study, we propose a broad-coverage unsupervised neural network model to test memory and prediction as sources of signal by which children might acquire language directly from the perceptual stream. Our model embodies several likely properties of real-time human cognition: it is strictly incremental, it encodes speech into hierarchically organized labeled segments, it allows interactive top-down and bottom-up information flow, it attempts to model its own sequence of latent representations, and its objective function only recruits local signals that are plausibly supported by human working memory capacity. We show that much phonemic structure is learnable from unlabeled speech on the basis of these local signals. We further show that remembering the past and predicting the future both contribute to the linguistic content of acquired representations, and that these contributions are at least partially complementary.

The Paradigm Discovery Problem
Alexander Erdmann | Micha Elsner | Shijie Wu | Ryan Cotterell | Nizar Habash
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This work treats the paradigm discovery problem (PDP), the task of learning an inflectional morphological system from unannotated sentences. We formalize the PDP and develop evaluation metrics for judging systems. Using currently available resources, we construct datasets for the task. We also devise a heuristic benchmark for the PDP and report empirical results on five diverse languages. Our benchmark system first makes use of word embeddings and string similarity to cluster forms by cell and by paradigm. Then, we bootstrap a neural transducer on top of the clustered data to predict words to realize the empty paradigm slots. An error analysis of our system suggests clustering by cell across different inflection classes is the most pressing challenge for future work.

2019

Measuring the perceptual availability of phonological features during language acquisition using unsupervised binary stochastic autoencoders
Cory Shain | Micha Elsner
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

In this paper, we deploy binary stochastic neural autoencoder networks as models of infant language learning in two typologically unrelated languages (Xitsonga and English). We show that the drive to model auditory percepts leads to latent clusters that partially align with theory-driven phonemic categories. We further evaluate the degree to which theory-driven phonological features are encoded in the latent bit patterns, finding that some (e.g. [±approximant]) are well represented by the network in both languages, while others (e.g. [±spread glottis]) are less so. Together, these findings suggest that many reliable cues to phonemic structure are immediately available to infants from bottom-up perceptual characteristics alone, but that these cues must eventually be supplemented by top-down lexical and phonotactic information to achieve adult-like phone discrimination. Our results also suggest differences in degree of perceptual availability between features, yielding testable predictions as to which features might depend more or less heavily on top-down cues during child language acquisition.

Practical, Efficient, and Customizable Active Learning for Named Entity Recognition in the Digital Humanities
Alexander Erdmann | David Joseph Wrisley | Benjamin Allen | Christopher Brown | Sophie Cohen-Bodénès | Micha Elsner | Yukun Feng | Brian Joseph | Béatrice Joyeux-Prunel | Marie-Catherine de Marneffe
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Scholars in interdisciplinary fields like the Digital Humanities are increasingly interested in semantic annotation of specialized corpora. Yet, under-resourced languages, imperfect or noisily structured data, and user-specific classification tasks make it difficult to meet their needs using off-the-shelf models. Manual annotation of large corpora from scratch, meanwhile, can be prohibitively expensive. Thus, we propose an active learning solution for named entity recognition, attempting to maximize a custom model’s improvement per additional unit of manual annotation. Our system robustly handles any domain or user-defined label set and requires no external resources, enabling quality named entity recognition for Humanities corpora where such resources are not available. Evaluating on typologically disparate languages and datasets, we reduce required annotation by 20-60% and greatly outperform a competitive active learning baseline.

2018

Lexical Networks in !Xung
Syed-Amad Hussain | Micha Elsner | Amanda Miller
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology

We investigate the lexical network properties of the large-phoneme-inventory Southern African language Mangetti Dune !Xung as it compares to English and other commonly studied languages. Lexical networks are graphs in which nodes (words) are linked to their minimal pairs; global properties of these networks are believed to mediate lexical access in the minds of speakers. We show that the network properties of !Xung are within the range found in previously studied languages. By simulating data (“pseudolexicons”) with varying levels of phonotactic structure, we find that the lexical network properties of !Xung diverge from previously studied languages when fewer phonotactic constraints are retained. We conclude that lexical network properties are representative of an underlying cognitive structure which is necessary for efficient word retrieval and that the phonotactics of !Xung may be shaped by a selective pressure which preserves network properties within this cognitively useful range.

2017

Speech segmentation with a neural encoder model of working memory
Micha Elsner | Cory Shain
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We present the first unsupervised LSTM speech segmenter as a cognitive model of the acquisition of words from unsegmented input. Cognitive biases toward phonological and syntactic predictability in speech are rooted in the limitations of human memory (Baddeley et al., 1998); compressed representations are easier to acquire and retain in memory. To model the biases introduced by these memory limitations, our system uses an LSTM-based encoder-decoder with a small number of hidden units, then searches for a segmentation that minimizes autoencoding loss. Linguistically meaningful segments (e.g. words) should share regular patterns of features that facilitate decoder performance in comparison to random segmentations, and we show that our learner discovers these patterns when trained on either phoneme sequences or raw acoustics. To our knowledge, ours is the first fully unsupervised system to be able to segment both symbolic and acoustic representations of speech.

Click reduction in fluent speech: a semi-automated analysis of Mangetti Dune !Xung
Amanda Miller | Micha Elsner
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages

Breaking NLP: Using Morphosyntax, Semantics, Pragmatics and World Knowledge to Fool Sentiment Analysis Systems
Taylor Mahler | Willy Cheung | Micha Elsner | David King | Marie-Catherine de Marneffe | Cory Shain | Symon Stevens-Guille | Michael White
Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems

This paper describes our “breaker” submission to the 2017 EMNLP “Build It Break It” shared task on sentiment analysis. In order to cause the “builder” systems to make incorrect predictions, we edited items in the blind test data according to linguistically interpretable strategies that allow us to assess the ease with which the builder systems learn various components of linguistic structure. On the whole, our submitted pairs break all systems at a high rate (72.6%), indicating that sentiment analysis as an NLP task may still have a lot of ground to cover. Of the breaker strategies that we consider, we find our semantic and pragmatic manipulations to pose the most substantial difficulties for the builder systems.

2016

Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
Micha Elsner | Sandra Kuebler
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

Automatic discovery of Latin syntactic changes
Micha Elsner | Emily Lane
Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

Although spanning thousands of years and genres as diverse as liturgy, historiography, lyric and other forms of prose and poetry, the body of Latin texts is still relatively sparse compared to English. Data sparsity in Latin presents a number of challenges for traditional Named Entity Recognition techniques. Solving such challenges and enabling reliable Named Entity Recognition in Latin texts can facilitate many down-stream applications, from machine translation to digital historiography, enabling Classicists, historians, and archaeologists for instance, to track the relationships of historical persons, places, and groups on a large scale. This paper presents the first annotated corpus for evaluating Named Entity Recognition in Latin, as well as a fully supervised model that achieves over 90% F-score on a held-out test set, significantly outperforming a competitive baseline. We also present a novel active learning strategy that predicts how many and which sentences need to be annotated for named entities in order to attain a specified degree of accuracy when recognizing named entities automatically in a given text. This maximizes the productivity of annotators while simultaneously controlling quality.

Joint Word Segmentation and Phonetic Category Induction
Micha Elsner | Stephanie Antetomaso | Naomi Feldman
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

2015

Abstract Representations of Plot Structure
Micha Elsner
Linguistic Issues in Language Technology, Volume 12, 2015 - Literature Lifts up Computational Linguistics

Since the 18th century, the novel has been one of the defining forms of English writing, a mainstay of popular entertainment and academic criticism. Despite its importance, however, there are few computational studies of the large-scale structure of novels—and many popular representations for discourse modeling do not work very well for novelistic texts. This paper describes a high-level representation of plot structure which tracks the frequency of mentions of different characters, topics and emotional words over time. The representation can distinguish with high accuracy between real novels and artificially permuted surrogates; characters are important for eliminating random permutations, while topics are effective at distinguishing beginnings from ends.