Miriam Butt

2024

pdf abs
GRIT: A Dataset of Group Reference Recognition in Italian
Sergio E. Zanotto | Qi Yu | Miriam Butt | Diego Frassinelli
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

For the analysis of political discourse a reliable identification of group references, i.e., linguistic components that refer to individuals or groups of people, is useful. However, the task of automatically recognizing group references has not yet gained much attention within NLP. To address this gap, we introduce GRIT (Group Reference for Italian), a large-scale, multi-domain manually annotated dataset for group reference recognition in Italian. GRIT represents a new resource for automatic and generalizable recognition of group references. With this dataset, we aim to establish group reference recognition as a valid classification task, which extends the domain of Named Entity Recognition by expanding its focus to literal and figurative mentions of social groups. We verify the potential of achieving automated group reference recognition for Italian through an experiment employing a fine-tuned BERT model. Our experimental results substantiate the validity of the task, implying a huge potential for applying automated systems to multiple fields of analysis, such as political text or social media analysis.

In this paper we focus on a subclass of multi-word expressions, namely compound formation in German. The automatic detection of compounds is a known problem and we argue that its resolution should be given more urgency in light of a new role we uncovered with respect to ad hoc compound formation: the systematic expression of attitudinal meaning and its potential importance for the down-stream NLP task of stance detection. We demonstrate that ad hoc compounds in German indeed systematically express attitudinal meaning by adducing corpus linguistic and psycholinguistic experimental data. However, an investigation of state-of-the-art dependency parsers and Universal Dependency treebanks shows that German compounds are parsed and annotated very unevenly, so that currently one cannot reliably identify or access ad hoc compounds with attitudinal meaning in texts. Moreover, we report initial experiments with large language models underlining the challenges in capturing attitudinal meanings conveyed by ad hoc compounds. We consequently suggest a systematized way of annotating (and thereby also parsing) ad hoc compounds that is based on positive experiences from within the multilingual ParGram grammar development effort.

2022

pdf abs
Automatized Detection and Annotation for Calls to Action in Latin-American Social Media Postings
Wassiliki Siskou | Clara Giralt Mirón | Sarah Molina-Raith | Miriam Butt
Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Voter mobilization via social media has shown to be an effective tool. While previous research has primarily looked at how calls-to-action (CTAs) were used in Twitter messages from non-profit organizations and protest mobilization, we are interested in identifying the linguistic cues used in CTAs found on Facebook and Twitter for an automatic identification of CTAs. The work is part of an on-going collaboration with researchers from political science, who are investigating CTAs in the period leading up to recent elections in three different Latin American countries. We developed a new NLP pipeline for Spanish to facilitate their work. Our pipeline annotates social media posts with a range of linguistic information and then conducts targeted searches for linguistic cues that allow for an automatic annotation and identification of relevant CTAs. By using carefully crafted and linguistically informed heuristics, our system so far achieves an F1-score of 0.72.

2021

pdf abs
Is that really a question? Going beyond factoid questions in NLP
Aikaterini-Lida Kalouli | Rebecca Kehlbeck | Rita Sevastjanova | Oliver Deussen | Daniel Keim | Miriam Butt
Proceedings of the 14th International Conference on Computational Semantics (IWCS)

Research in NLP has mainly focused on factoid questions, with the goal of finding quick and reliable ways of matching a query to an answer. However, human discourse involves more than that: it contains non-canonical questions deployed to achieve specific communicative goals. In this paper, we investigate this under-studied aspect of NLP by introducing a targeted task, creating an appropriate corpus for the task and providing baseline models of diverse nature. With this, we are also able to generate useful insights on the task and open the way for future research in this direction.

2020

pdf abs
Dependency Parsing for Urdu: Resources, Conversions and Learning
Toqeer Ehsan | Miriam Butt
Proceedings of the Twelfth Language Resources and Evaluation Conference

This paper adds to the available resources for the under-resourced language Urdu by converting different types of existing treebanks for Urdu into a common format that is based on Universal Dependencies. We present comparative results for training two dependency parsers, the MaltParser and a transition-based BiLSTM parser on this new resource. The BiLSTM parser incorporates word embeddings which improve the parsing results significantly. The BiLSTM parser outperforms the MaltParser with a UAS of 89.6 and an LAS of 84.2 with respect to our standardized treebank resource.

pdf abs
Representation Problems in Linguistic Annotations: Ambiguity, Variation, Uncertainty, Error and Bias
Christin Beck | Hannah Booth | Mennatallah El-Assady | Miriam Butt
Proceedings of the 14th Linguistic Annotation Workshop

The development of linguistic corpora is fraught with various problems of annotation and representation. These constitute a very real challenge for the development and use of annotated corpora, but as yet not much literature exists on how to address the underlying problems. In this paper, we identify and discuss five sources of representation problems, which are independent though interrelated: ambiguity, variation, uncertainty, error and bias. We outline and characterize these sources, discussing how their improper treatment can have stark consequences for research outcomes. Finally, we discuss how an adequate treatment can inform corpus-related linguistic research, both computational and theoretical, improving the reliability of research results and NLP models, as well as informing the more general reproducibility issue.

2019

pdf abs
Using Meta-Morph Rules to develop Morphological Analysers: A case study concerning Tamil
Kengatharaiyer Sarveswaran | Gihan Dias | Miriam Butt
Proceedings of the 14th International Conference on Finite-State Methods and Natural Language Processing

This paper describes a new and larger coverage Finite-State Morphological Analyser (FSM) and Generator for the Dravidian language Tamil. The FSM has been developed in the context of computational grammar engineering, adhering to the standards of the ParGram effort. Tamil is a morphologically rich language and the interaction between linguistic analysis and formal implementation is complex, resulting in a challenging task. In order to allow the development of the FSM to focus more on the linguistic analysis and less on the formal details, we have developed a system of meta-morph(ology) rules along with a script which translates these rules into FSM processable representations. The introduction of meta-morph rules makes it possible for computationally naive linguists to interact with the system and to expand it in future work. We found that the meta-morph rules help to express linguistic generalisations and reduce the manual effort of writing lexical classes for morphological analysis. Our Tamil FSM currently handles mainly the inflectional morphology of 3,300 verb roots and their 260 forms. Further, it also has a lexicon of approximately 100,000 nouns along with a guesser to handle out-of-vocabulary items. Although the Tamil FSM was primarily developed to be part of a computational grammar, it can also be used as a web or stand-alone application for other NLP tasks, as per general ParGram practice.

pdf abs
ParHistVis: Visualization of Parallel Multilingual Historical Data
Aikaterini-Lida Kalouli | Rebecca Kehlbeck | Rita Sevastjanova | Katharina Kaiser | Georg A. Kaiser | Miriam Butt
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

The study of language change through parallel corpora can be advantageous for the analysis of complex interactions between time, text domain and language. Often, those advantages cannot be fully exploited due to the sparse but high-dimensional nature of such historical data. To tackle this challenge, we introduce ParHistVis: a novel, free, easy-to-use, interactive visualization tool for parallel, multilingual, diachronic and synchronic linguistic data. We illustrate the suitability of the components of the tool based on a use case of word order change in Romance wh-interrogatives.

pdf abs
Visualizing Linguistic Change as Dimension Interactions
Christin Schätzle | Frederik L. Dennig | Michael Blumenschein | Daniel A. Keim | Miriam Butt
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

Historical change typically is the result of complex interactions between several linguistic factors. Identifying the relevant factors and understanding how they interact across the temporal dimension is the core remit of historical linguistics. With respect to corpus work, this entails a separate annotation, extraction and painstaking pair-wise comparison of the relevant bits of information. This paper presents a significant extension of HistoBankVis, a multilayer visualization system which allows a fast and interactive exploration of complex linguistic data. Linguistic factors can be understood as data dimensions which show complex interrelationships. We model these relationships with the Parallel Sets technique. We demonstrate the powerful potential of this technique by applying the system to understanding the interaction of case, grammatical relations and word order in the history of Icelandic.

pdf abs
lingvis.io - A Linguistic Visual Analytics Framework
Mennatallah El-Assady | Wolfgang Jentner | Fabian Sperrle | Rita Sevastjanova | Annette Hautli-Janisz | Miriam Butt | Daniel Keim
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

We present a modular framework for the rapid-prototyping of linguistic, web-based, visual analytics applications. Our framework gives developers access to a rich set of machine learning and natural language processing steps, through encapsulating them into micro-services and combining them into a computational pipeline. This processing pipeline is auto-configured based on the requirements of the visualization front-end, making the linguistic processing and visualization design, detached independent development tasks. This paper describes the constellation and modality of our framework, which continues to support the efficient development of various human-in-the-loop, linguistic visual analytics research techniques and applications.

pdf abs
Complex Predicates and Multidimensionality in Grammar
Miriam Butt
Linguistic Issues in Language Technology, Volume 17, 2019

This paper contributes to the on-going discussion of how best to analyze and handle complex predicate formations, commenting in particular on the properties of Hindi N-V complex predicates as set out by Vaidya et al. (2019). I highlight features of existing LFG analyses and focus in particular on the modular architecture of LFG, its attendant multidimensional lexicon and the analytic consequences which follow from this. I point out where the previously existing LFG proposals have been misunderstood as viewed from the lens of theories such as LTAG and HPSG, which assume a very different architectural set-up and provide a comparative discussion of the issues.

2018

pdf
A Multilingual Approach to Question Classification
Aikaterini-Lida Kalouli | Katharina Kaiser | Annette Hautli-Janisz | Georg A. Kaiser | Miriam Butt
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf
HistoBankVis: Detecting Language Change via Data Visualization
Christin Schätzle | Michael Hund | Frederik Dennig | Miriam Butt | Daniel Keim
Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language

2015

2014

pdf
Automatic Detection of Causal Relations in German Multilogs
Tina Bögel | Annette Hautli-Janisz | Sebastian Sulger | Miriam Butt
Proceedings of the EACL 2014 Workshop on Computational Approaches to Causality in Language (CAtoCL)

The paper presents a design schema and details of a new Urdu POS tagset. This tagset is designed due to challenges encountered in working with existing tagsets for Urdu. It uses tags that judiciously incorporate information about special morpho-syntactic categories found in Urdu. With respect to the overall naming schema and the basic divisions, the tagset draws on the Penn Treebank and a Common Tagset for Indian Languages. The resulting CLE Urdu POS Tagset consists of 12 major categories with subdivisions, resulting in 32 tags. The tagset has been used to tag 100k words of the CLE Urdu Digest Corpus, giving a tagging accuracy of 96.8%.

2013

pdf bib
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations
Miriam Butt | Sarmad Hussain
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations

2012

pdf bib
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH
Miriam Butt | Sheelagh Carpendale | Gerald Penn | Jelena Prokić | Michael Cysouw
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

pdf bib
Introduction
Miriam Butt | Jelena Prokić | Thomas Mayer | Michael Cysouw
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

pdf bib
Lexical Semantics and Distribution of Suffixes - A Visual Analysis
Christian Rohrdantz | Andreas Niekler | Annette Hautli | Miriam Butt | Daniel A. Keim
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

pdf
Identifying Urdu Complex Predication via Bigram Extraction
Miriam Butt | Tina Bögel | Annette Hautli | Sebastian Sulger | Tafseer Ahmed
Proceedings of COLING 2012

pdf abs
A Reference Dependency Bank for Analyzing Complex Predicates
Tafseer Ahmed | Miriam Butt | Annette Hautli | Sebastian Sulger
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

When dealing with languages of South Asia from an NLP perspective, a problem that repeatedly crops up is the treatment of complex predicates. This paper presents a first approach to the analysis of complex predicates (CPs) in the context of dependency bank development. The efforts originate in theoretical work on CPs done within Lexical-Functional Grammar (LFG), but are intended to provide a guideline for analyzing different types of CPs in an independent framework. Despite the fact that we focus on CPs in Hindi and Urdu, the design of the dependencies is kept general enough to account for CP constructions across languages.

2011

pdf
Discovering Semantic Classes for Urdu N-V Complex Predicates
Tafseer Ahmed | Miriam Butt
Proceedings of the Ninth International Conference on Computational Semantics (IWCS 2011)

pdf
Towards a Computational Semantic Analyzer for Urdu
Annette Hautli | Miriam Butt
Proceedings of the 9th Workshop on Asian Language Resources

pdf
Towards Tracking Semantic Change by Visual Analytics
Christian Rohrdantz | Annette Hautli | Thomas Mayer | Miriam Butt | Daniel A. Keim | Frans Plank
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

2010

In this paper, we present a system for transliterating the Arabic-based script of Urdu to a Roman transliteration scheme. The system is integrated into a larger system consisting of a morphology module, implemented via finite state technologies, and a computational LFG grammar of Urdu that was developed with the grammar development platform XLE (Crouch et al. 2008). Our long-term goal is to handle Hindi alongside Urdu; the two languages are very similar with respect to syntax and lexicon and hence, one grammar can be used to cover both languages. However, they are not similar concerning the script -- Hindi is written in Devanagari, while Urdu uses an Arabic-based script. By abstracting away to a common Roman transliteration scheme in the respective transliterators, our system can be enabled to handle both languages in parallel. In this paper, we discuss the pipeline architecture of the Urdu-Roman transliterator, mention several linguistic and orthographic issues and present the integration of the transliterator into the LFG parsing system.

pdf
Consonant Co-Occurrence in Stems across Languages: Automatic Analysis and Visualization of a Phonotactic Constraint
Thomas Mayer | Christian Rohrdantz | Frans Plank | Peter Bak | Miriam Butt | Daniel A. Keim
Proceedings of the 2010 Workshop on NLP and Linguistics: Finding the Common Ground