Sarah Moeller

Also published as: Sarah R. Moeller

2023

pdf bib
Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages
Atticus Harrigan | Aditi Chaudhary | Shruti Rijhwani | Sarah Moeller | Antti Arppe | Alexis Palmer | Ryan Henke | Daisy Rosenblum
Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages

pdf
Morphological Data Generation from FLEx
Shengyu Liao | Sarah Moeller | Beth Bryson
Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages

2022

pdf abs
Disambiguation of morpho-syntactic features of African American English – the case of habitual be
Harrison Santiago | Joshua Martin | Sarah Moeller | Kevin Tang
Proceedings of the Second Workshop on Language Technology for Equality, Diversity and Inclusion

Recent research has highlighted that natural language processing (NLP) systems exhibit a bias againstAfrican American speakers. These errors are often caused by poor representation of linguistic features unique to African American English (AAE), which is due to the relatively low probability of occurrence for many such features. We present a workflow to overcome this issue in the case of habitual “be”. Habitual “be” is isomorphic, and therefore ambiguous, with other forms of uninflected “be” found in both AAE and General American English (GAE). This creates a clear challenge for bias in NLP technologies. To overcome the scarcity, we employ a combination of rule-based filters and data augmentation that generate a corpus balanced between habitual and non-habitual instances. This balanced corpus trains unbiased machine learning classifiers, as demonstrated on a corpus of AAE transcribed texts, achieving .65 F₁ score at classifying habitual “be”.

2021

pdf abs
To POS Tag or Not to POS Tag: The Impact of POS Tags on Morphological Learning in Low-Resource Settings
Sarah Moeller | Ling Liu | Mans Hulden
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Part-of-Speech (POS) tags are routinely included as features in many NLP tasks. However, the importance and usefulness of POS tags needs to be examined as NLP expands to low-resource languages because linguists who provide many annotated resources do not place priority on early identification and tagging of POS. This paper describes an empirical study about the effect that POS tags have on two computational morphological tasks with the Transformer architecture. Each task is tested twice on identical data except for the presence/absence of POS tags, using published data in ten high- to low-resource languages or unpublished linguistic field data in five low-resource languages. We find that the presence or absence of POS tags does not have a significant bearing on performance. In joint segmentation and glossing, the largest average difference is an .09 improvement in F1-scores by removing POS tags. In reinflection, the greatest average difference is 1.2% in accuracy for published data and 5% for unpublished and noisy field data.

pdf bib
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)
Antti Arppe | Jeff Good | Atticus Harrigan | Mans Hulden | Jordan Lachler | Sarah Moeller | Alexis Palmer | Miikka Silfverberg | Lane Schwartz
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

pdf
Integrating Automated Segmentation and Glossing into Documentary and Descriptive Linguistics
Sarah Moeller | Mans Hulden
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

2020

Despite recent advances in natural language processing and other language technology, the application of such technology to language documentation and conservation has been limited. In August 2019, a workshop was held at Carnegie Mellon University in Pittsburgh, PA, USA to attempt to bring together language community members, documentary linguists, and technologists to discuss how to bridge this gap and create prototypes of novel and practical language revitalization technologies. The workshop focused on developing technologies to aid language documentation and revitalization in four areas: 1) spoken language (speech transcription, phone to orthography decoding, text-to-speech and text-speech forced alignment), 2) dictionary extraction and management, 3) search tools for corpora, and 4) social media (language learning bots and social media analysis). This paper reports the results of this workshop, including issues discussed, and various conceived and implemented technologies for nine languages: Arapaho, Cayuga, Inuktitut, Irish Gaelic, Kidaw’ida, Kwak’wala, Ojibwe, San Juan Quiahije Chatino, and Seneca.

pdf abs
The Russian PropBank
Sarah Moeller | Irina Wagner | Martha Palmer | Kathryn Conger | Skatje Myers
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper presents a proposition bank for Russian (RuPB), a resource for semantic role labeling (SRL). The motivating goal for this resource is to automatically project semantic role labels from English to Russian. This paper describes frame creation strategies, coverage, and the process of sense disambiguation. It discusses language-specific issues that complicated the process of building the PropBank and how these challenges were exploited as language-internal guidance for consistency and coherence.

pdf abs
IGT2P: From Interlinear Glossed Texts to Paradigms
Sarah Moeller | Ling Liu | Changbing Yang | Katharina Kann | Mans Hulden
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

An intermediate step in the linguistic analysis of an under-documented language is to find and organize inflected forms that are attested in natural speech. From this data, linguists generate unseen inflected word forms in order to test hypotheses about the language’s inflectional patterns and to complete inflectional paradigm tables. To get the data linguists spend many hours manually creating interlinear glossed texts (IGTs). We introduce a new task that speeds this process and automatically generates new morphological resources for natural language processing systems: IGT-to-paradigms (IGT2P). IGT2P generates entire morphological paradigms from IGT input. We show that existing morphological reinflection models can solve the task with 21% to 64% accuracy, depending on the language. We further find that (i) having a language expert spend only a few hours cleaning the noisy IGT data improves performance by as much as 21 percentage points, and (ii) POS tags, which are generally considered a necessary part of NLP morphological reinflection input, have no effect on the accuracy of the models considered here.

2019

pdf abs
Linguistic Analysis Improves Neural Metaphor Detection
Kevin Stowe | Sarah Moeller | Laura Michaelis | Martha Palmer
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

In the field of metaphor detection, deep learning systems are the ubiquitous and achieve strong performance on many tasks. However, due to the complicated procedures for manually identifying metaphors, the datasets available are relatively small and fraught with complications. We show that using syntactic features and lexical resources can automatically provide additional high-quality training data for metaphoric language, and this data can cover gaps and inconsistencies in metaphor annotation, improving state-of-the-art word-level metaphor identification. This novel application of automatically improving training data improves classification across numerous tasks, and reconfirms the necessity of high-quality data for deep learning frameworks.

pdf
Improving Low-Resource Morphological Learning with Intermediate Forms from Finite State Transducers
Sarah Moeller | Ghazaleh Kazeminejad | Andrew Cowell | Mans Hulden
Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers)

2018

pdf
Morphological Reinflection in Context: CU Boulder’s Submission to CoNLL–SIGMORPHON 2018 Shared Task
Ling Liu | Ilamvazhuthy Subbiah | Adam Wiemerslage | Jonathan Lilley | Sarah Moeller
Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

pdf bib abs
A Neural Morphological Analyzer for Arapaho Verbs Learned from a Finite State Transducer
Sarah Moeller | Ghazaleh Kazeminejad | Andrew Cowell | Mans Hulden
Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages

We experiment with training an encoder-decoder neural model for mimicking the behavior of an existing hand-written finite-state morphological grammar for Arapaho verbs, a polysynthetic language with a highly complex verbal inflection system. After adjusting for ambiguous parses, we find that the system is able to generalize to unseen forms with accuracies of 98.68% (unambiguous verbs) and 92.90% (all verbs).

pdf abs
Automatic Glossing in a Low-Resource Setting for Language Documentation
Sarah Moeller | Mans Hulden
Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages

Morphological analysis of morphologically rich and low-resource languages is important to both descriptive linguistics and natural language processing. Field documentary efforts usually procure analyzed data in cooperation with native speakers who are capable of providing some level of linguistic information. Manually annotating such data is very expensive and the traditional process is arguably too slow in the face of language endangerment and loss. We report on a case study of learning to automatically gloss a Nakh-Daghestanian language, Lezgi, from a very small amount of seed data. We compare a conditional random field based sequence labeler and a neural encoder-decoder model and show that a nearly 0.9 F1-score on labeled accuracy of morphemes can be achieved with 3,000 words of transcribed oral text. Errors are mostly limited to morphemes with high allomorphy. These results are potentially useful for developing rapid annotation and fieldwork tools to support documentation of morphologically rich, endangered languages.

Semantic relations are often signaled with prepositional or possessive marking—but extreme polysemy bedevils their analysis and automatic interpretation. We introduce a new annotation scheme, corpus, and task for the disambiguation of prepositions and possessives in English. Unlike previous approaches, our annotations are comprehensive with respect to types and tokens of these markers; use broadly applicable supersense classes rather than fine-grained dictionary definitions; unite prepositions and possessives under the same class inventory; and distinguish between a marker’s lexical contribution and the role it marks in the context of a predicate or scene. Strong interannotator agreement rates, as well as encouraging disambiguation results with established supervised methods, speak to the viability of the scheme and task.