Proceedings of the 21st International Workshop on Treebanks and Linguistic Theories (TLT, GURT/SyntaxFest 2023)

Daniel Dakota, Kilian Evang, Sandra Kübler, Lori Levin (Editors)

Washington, D.C.
TLT | SyntaxFest
Association for Computational Linguistics

Corpus-Based Multilingual Event-type Ontology: Annotation Tools and Principles
Eva Fučíková | Jan Hajič | Zdeňka Urešová

In the course of building a multilingual Event-type Ontology resource called SynSemClass, it was necessary to provide the maintainers and the annotators with a set of tools to facilitate their job, achieve data format consistency, and in general obtain high-quality data. We have adapted a previously existing tool (Urešová et al., 2018b) that was developed to assist in capturing bilingual synonymy. This tool had to be both substantially expanded with new features and fundamentally changed to support developing the resource for more languages, which necessarily has to be done in parallel. We thus present the tool, the new data structure design, which had to change at the same time, and the associated workflow.

Spanish Verbal Synonyms in the SynSemClass Ontology
Cristina Fernández-Alcaina | Eva Fučíková | Jan Hajič | Zdeňka Urešová

This paper presents ongoing work in the expansion of the multilingual semantic event-type ontology SynSemClass (Czech-English-German) to include Spanish. As in previous versions of the lexicon, Spanish verbal synonyms have been collected from a sentence-aligned parallel corpus and classified into classes based on their syntactic-semantic properties. Each class member is linked to a number of syntactic and/or semantic resources specific to each language, thus enriching the annotation and enabling interoperability. This paper describes the procedure for the data extraction and annotation of Spanish verbal synonyms in the lexicon.

Hedging in diachrony: the case of Vedic Sanskrit iva
Erica Biagetti | Oliver Hellwig | Sven Sellmer

The rhetorical strategy of hedging serves to attenuate speech acts and their semantic content, as in English ‘kind of’ or ‘somehow’. While hedging has recently met with increasing interest in linguistic research, most studies deal with modern languages, preferably English, and take a synchronic approach. This paper complements this research by tracing the diachronic syntactic flexibilization of the Vedic Sanskrit particle iva from a marker of comparison (‘like’) to a full-fledged adaptor. We discuss the outcomes of a diachronic Bayesian framework applied to iva constructions in a Universal Dependencies treebank, and supplement these results with a qualitative discussion of relevant text passages.

Is Japanese CCGBank empirically correct? A case study of passive and causative constructions
Daisuke Bekki | Hitomi Yanaka

The Japanese CCGBank serves as training and evaluation data for developing Japanese CCG parsers. However, since it is automatically generated from the Kyoto Corpus, a dependency treebank, its linguistic validity still needs to be sufficiently verified. In this paper, we focus on the analysis of passive/causative constructions in the Japanese CCGBank and show that, together with the compositional semantics of ccg2lambda, a semantic parsing system, it yields empirically wrong predictions for the nested construction of passives and causatives.

ICON: Building a Large-Scale Benchmark Constituency Treebank for the Indonesian Language
Ee Suan Lim | Wei Qi Leong | Ngan Thanh Nguyen | Dea Adhista | Wei Ming Kng | William Chandra Tjh | Ayu Purwarianti

Constituency parsing is an important task that captures how words are combined to form sentences. While constituency parsing in English has seen significant progress in the last few years, tools for constituency parsing in Indonesian remain few and far between. In this work, we publish ICON (Indonesian CONstituency treebank), the hitherto largest publicly available, manually annotated benchmark constituency treebank for the Indonesian language, with a size of 10,000 sentences and approximately 124,000 constituents and 182,000 tokens, which can support the training of state-of-the-art transformer-based models. We establish strong baselines on the ICON dataset using the Berkeley Neural Parser with transformer-based pre-trained embeddings, with the best performance of 88.85% F1 score coming from our own version of SpanBERT (IndoSpanBERT). We further analyze the predictions made by our best-performing model to reveal certain idiosyncrasies in the Indonesian language that pose challenges for constituency parsing.

Parsing Early New High German: Benefits and limitations of cross-dialectal training
Christopher Saap | Daniel Dakota | Elliot Evans

Historical treebanking within the generative framework has gained in popularity. However, there are still many languages and historical periods yet to be represented. For German, a constituency treebank exists for historical Low German, but not for Early New High German. We begin to fill this gap by presenting our initial work on the Parsed Corpus of Early New High German (PCENHG). We present the methodological considerations and workflow for the treebank’s annotations and development. Given the limited amount of currently available PCENHG treebank data, we treat it as a low-resource language and leverage a larger, closely related variety, Middle Low German, to build a parser to help facilitate faster post-annotation correction. We present an analysis of annotation speeds and conclude with a small pilot use-case, highlighting potential for future linguistic analyses. In doing so, we highlight the value of the treebank’s development for historical linguistic analysis and demonstrate the benefits and challenges of developing a parser using two closely related historical Germanic varieties.

Semgrex and Ssurgeon, Searching and Manipulating Dependency Graphs
John Bauer | Chloé Kiddon | Eric Yeh | Alex Shan | Christopher D. Manning

Searching dependency graphs and manipulating them can be a time-consuming and challenging task to get right. We document Semgrex, a system for searching dependency graphs, and introduce Ssurgeon, a system for manipulating the output of Semgrex. The compact language used by these systems allows for easy command-line or API processing of dependencies. Additionally, integration with publicly released toolkits in Java and Python allows for searching text relations and attributes over natural text.

Mapping AMR to UMR: Resources for Adapting Existing Corpora for Cross-Lingual Compatibility
Julia Bonn | Skatje Myers | Jens E. L. Van Gysel | Lukas Denk | Meagan Vigus | Jin Zhao | Andrew Cowell | William Croft | Jan Hajič | James H. Martin | Alexis Palmer | Martha Palmer | James Pustejovsky | Zdenka Urešová | Rosa Vallejos | Nianwen Xue

This paper presents detailed mappings between the structures used in Abstract Meaning Representation (AMR) and those used in Uniform Meaning Representation (UMR). These structures include general semantic roles, rolesets, and concepts that are largely shared between AMR and UMR, but with crucial differences. While UMR annotation of new low-resource languages is ongoing, AMR-annotated corpora already exist for many languages, and these AMR corpora are ripe for conversion to UMR format. Rather than focusing on semantic coverage that is new to UMR (which will likely need to be dealt with manually), this paper serves as a resource (with illustrated mappings) for users looking to understand the fine-grained adjustments that have been made to the representation techniques for semantic categories present in both AMR and UMR.