Other Workshops and Events (2023)

Recent work has formulated the task for computational construction grammar as producing a constructicon given a corpus of usage. Previous work has evaluated these unsupervised grammars using both internal metrics (for example, Minimum Description Length) and external metrics (for example, performance on a dialectology task). This paper instead takes a linguistic approach to evaluation, first learning a constructicon and then analyzing its contents from a linguistic perspective. This analysis shows that a learned constructicon can be divided into nine major types of constructions, of which Verbal and Nominal are the most common. The paper also shows that both the token and type frequency of constructions can be used to model variation across registers and dialects.

pdf bib abs
Constructions, Collocations, and Patterns: Alternative Ways of Construction Identification in a Usage-based, Corpus-driven Theoretical Framework
Gábor Simon

There is a serious theoretical and methodological dilemma in usage-based construction grammar: how to identify constructions based on corpus pattern analysis. The present paper provides an overview of this dilemma, focusing on argument structure constructions (ASCs) in general. It seeks to answer the question of how a data-driven construction grammatical description can be built on the collocation data extracted from corpora. The study is of meta-scientific interest: it compares theoretical proposals in construction grammar regarding how they handle co-occurrences emerging from a corpus. Discussing alternative bottom-up approaches to the notion of construction, the paper concludes that there is no one-to-one correspondence between corpus patterns and constructions. Therefore, a careful analysis of the former can empirically ground both the identification and the description of constructions.

pdf abs
CALaMo: a Constructionist Assessment of Language Models
Ludovica Pannitto | Aurélie Herbelot

This paper presents a novel framework for evaluating Neural Language Models’ linguistic abilities using a constructionist approach. Not only is the usage-based model in line with the un- derlying stochastic philosophy of neural architectures, but it also allows the linguist to keep meaning as a determinant factor in the analysis. We outline the framework and present two possible scenarios for its application.

pdf abs
High-dimensional vector spaces can accommodate constructional features quite conveniently
Jussi Karlgren

Current language processing tools presuppose input in the form of a sequence of high-dimensional vectors with continuous values. Lexical items can be converted to such vectors with standard methodology and subsequent processing is assumed to handle structural features of the string. Constructional features do typically not fit in that processing pipeline: they are not as clearly sequential, they overlap with other items, and the fact that they are combinations of lexical items obscures their ontological status as observable linguistic items in their own right. Constructional grammar frameworks allow for a more general view on how to understand lexical items and their configurations in a common framework. This paper introduces an approach to accommodate that understanding in a vector symbolic architecture, a processing framework which allows for combinations of continuous vectors and discrete items, convenient for various downstream processing using e.g. neural processing or other tools which expect input in vector form.

pdf abs
Constructivist Tokenization for English
Allison Fan | Weiwei Sun

This paper revisits tokenization from a theoretical perspective, and argues for the necessity of a constructivist approach to tokenization for semantic parsing and modeling language acquisition. We consider two problems: (1) (semi-) automatically converting existing lexicalist annotations, e.g. those of the Penn TreeBank, into constructivist annotations, and (2) automatic tokenization of raw texts. We demonstrate that (1) a heuristic rule-based constructivist tokenizer is able to yield relatively satisfactory accuracy when gold standard Penn TreeBank part-of-speech tags are available, but that some manual annotations are still necessary to obtain gold standard results, and (2) a neural tokenizer is able to provide accurate automatic constructivist tokenization results from raw character sequences. Our research output also includes a set of high-quality morpheme-tokenized corpora, which enable the training of computational models that more closely align with language comprehension and acquisition.

pdf abs
Fluid Construction Grammar: State of the Art and Future Outlook
Katrien Beuls | Paul Van Eecke

Fluid Construction Grammar (FCG) is a computational framework that provides a formalism for representing construction grammars and a processing engine that supports construction-based language comprehension and production. FCG is conceived as a computational operationalisation of the basic tenets of construction grammar. It thereby aims to establish more solid foundations for constructionist theories of language, while expanding their application potential in the fields of artificial intelligence and natural language understanding. This paper aims to provide a brief introduction to the FCG research programme, reflecting on what has been achieved so far and identifying key challenges for the future.

pdf abs
An Argument Structure Construction Treebank
Kristopher Kyle | Hakyung Sung

In this paper we introduce a freely available treebank that includes argument structure construction (ASC) annotation. We then use the treebank to train probabilistic annotation models that rely on verb lemmas and/ or syntactic frames. We also use the treebank data to train a highly accurate transformer-based annotation model (F1 = 91.8%). Future directions for the development of the treebank and annotation models are discussed.

pdf abs
Investigating Stylistic Profiles for the Task of Empathy Classification in Medical Narrative Essays
Priyanka Dey | Roxana Girju

One important aspect of language is how speakers generate utterances and texts to convey their intended meanings. In this paper, we bring various aspects of the Construction Grammar (CxG) and the Systemic Functional Grammar (SFG) theories in a deep learning computational framework to model empathic language. Our corpus consists of 440 essays written by premed students as narrated simulated patient–doctor interactions. We start with baseline classifiers (state-of-the-art recurrent neural networks and transformer models). Then, we enrich these models with a set of linguistic constructions proving the importance of this novel approach to the task of empathy classification for this dataset. Our results indicate the potential of such constructions to contribute to the overall empathy profile of first-person narrative essays.

pdf abs
UMR annotation of Chinese Verb compounds and related constructions
Haibo Sun | Yifan Zhu | Jin Zhao | Nianwen Xue

This paper discusses the challenges of annotating the predicate-argument structure of Chinese verb compounds in Uniform Meaning Representation (UMR), a recent meaning representation framework that extends Abstract Meaning Representation (AMR) to cross-linguistic settings. The key issue is to decide whether to annotate the argument structure of a verb compound as a whole, or to annotate the argument structure of their component verbs as well as the relations between them. We examine different types of Chinese verb compounds, and propose how to annotate them based on the principle of compositionality, level of grammaticalization, and productivity of component verbs. We propose a solution to the practical problem of having to define the semantic roles for Chinese verb compounds that are quite open-ended by separating compositional verb compounds from verb compounds that are non-compositional or have grammaticalized verb components. For compositional verb compounds, instead of annotating the argument structure of the verb compound as a whole, we annotate the argument structure of the component verbs as well as the semantic relations between them as creating an exhaustive list of such verb compounds is infeasible. Verb compounds with grammaticalized verb components also tend to be productive and we represent grammaticalized verb compounds as either attributes of the primary verb or as relations.

Construction Grammar (CxG) has recently been used as the basis for probing studies that have investigated the performance of large pretrained language models (PLMs) with respect to the structure and meaning of constructions. In this position paper, we make suggestions for the continuation and augmentation of this line of research. We look at probing methodology that was not designed with CxG in mind, as well as probing methodology that was designed for specific constructions. We analyse selected previous work in detail, and provide our view of the most important challenges and research questions that this promising new field faces.

pdf abs
Modeling Construction Grammar’s Way into NLP: Insights from negative results in automatically identifying schematic clausal constructions in Brazilian Portuguese
Arthur Lorenzi | Vânia Gomes de Almeida | Ely Edison Matos | Tiago Timponi Torrent

This paper reports on negative results in a task of automatic identification of schematic clausal constructions and their elements in Brazilian Portuguese. The experiment was set up so as to test whether form and meaning properties of constructions, modeled in terms of Universal Dependencies and FrameNet Frames in a Constructicon, would improve the performance of transformer models in the task. Qualitative analysis of the results indicate that alternatives to the linearization of those properties, dataset size and a post-processing module should be explored in the future as a means to make use of information in Constructicons for NLP tasks.

pdf (full)
bib (full) Proceedings of the Seventh International Conference on Dependency Linguistics (Depling, GURT/SyntaxFest 2023)

pdf bib
Proceedings of the Seventh International Conference on Dependency Linguistics (Depling, GURT/SyntaxFest 2023)
Owen Rambow | François Lareau

pdf bib abs
The development of dependency length minimization in early child language: A case study of the dative alternation
Zoey Liu | Stefanie Wulff

How does the preference for dependency length minimization (DLM) develop in early child language? This study takes up this question with the dative alternation in English as the test case. We built a large-scale dataset of dative constructions using transcripts of naturalistic child-parent interactions. Across different developmental stages of children, there appears to be a strong tendency for DLM. The tendency emerges between the age range of 12-18 months, slightly decreases until 30-36 months, then becomes more pronounced afterwards and approaches parents’ production preferences after 48 months. We further show the extent of DLM depends on how a given dative construction is realized: the tendency for shorter dependencies is much more pronounced in double object structures, whereas the prepositional object structures are associated with longer dependencies.

pdf bib abs
Which Sentence Representation is More Informative: An Analysis on Text Classification
Necva Bölücü | Burcu Can

Text classification is a popular and well-studied problem in Natural Language Processing. Most previous work on text classification has focused on deep neural networks such as LSTMs and CNNs. However, text classification studies using syntactic and semantic information are very limited in the literature. In this study, we propose a model using Graph Attention Network (GAT) that incorporates semantic and syntactic information as input for the text classification task. The semantic representations of UCCA and AMR are used as semantic information and the dependency tree is used as syntactic information. Extensive experimental results and in-depth analysis show that UCCA-GAT model, which is a semantic-aware model outperforms the AMR-GAT and DEP-GAT, which are semantic and syntax-aware models respectively. We also provide a comprehensive analysis of the proposed model to understand the limitations of the representations for the problem.

pdf abs
Formal Semantics for Dependency Grammar
Dag T. T. Haug | Jamie Y. Findlay

In this paper, we provide an explicit interface to formal semantics for Dependency Grammar, based on Glue Semantics. Glue Semantics has mostly been developed in the context of Lexical Functional Grammar, which shares two crucial assumptions with Dependency Grammar: lexical integrity and allowance of nonbinary-branching syntactic structure. We show how Glue can be adapted to the Dependency Grammar setting and provide sample semantic analyses of quantifier scope, control infinitives and relative clauses.

pdf abs
Predicates and entities in Abstract Meaning Representation
Antoine Venant | François Lareau

Nodes in Abstract Meaning Representation (AMR) are generally thought of as neo-Davidsonian entities. We review existing translation into neo-Davidsonian representations and show that these translations inconsistently handle copula sentences. We link the problem to an asymmetry arising from a problematic handling of words with no associated PropBank frames for the underlying predicate. We introduce a method to automatically and uniformly decompose AMR nodes into an entity-part and a predicative part, which offers a consistent treatment of copula sentences and quasi- predicates such as brother or client.

pdf abs
Character-level Dependency Annotation of Chinese
Li Yixuan

In this paper, we propose a new model for annotating dependency relations at the Mandarin character level with the aim of building treebanks to cope with the unsatisfactory performance of existing word segmentation and syntactic analysis models in specific scientific domains, such as Chinese patent texts. The result is a treebank of 100 sentences annotated according to our scheme, which also serves as a training corpus that facilitates the subsequent development of a joint word segmenter and dependency analyzer that enables downstream tasks in Chinese to be separated from the non-standardized pre-processing step of word segmentation.

pdf abs
What quantifying word order freedom can tell us about dependency corpora
Maja Buljan

Building upon existing work on word order freedom and syntactic annotation, this paper investigates whether we can differentiate between findings that reveal inherent properties of natural languages and their syntax, and features dependent on annotations used in computing the measures. An existing quantifiable and linguistically interpretable measure of word order freedom in language is applied to take a closer look at the robustness of the basic measure (word order entropy) to variations in dependency corpora used in the analysis. Measures are compared at three levels of generality, applied to corpora annotated according to the Universal Dependencies v1 and v2 annotation guidelines, selecting 31 languages for analysis. Preliminary results show that certain measures, such as subject-object relation order freedom, are sensitive to slight changes in annotation guidelines, while simpler measures are more robust, highlighting aspects of these metrics that should be taken into consideration when using dependency corpora for linguistic analysis and generalisation.

pdf abs
Word order flexibility: a typometric study
Sylvain Kahane | Ziqian Peng | Kim Gerdes

This paper introduces a typometric measure of flexibility, which quantifies the variability of head-dependent word order on the whole set of treebanks of a language or on specific constructions. The measure is based on the notion of head-initiality and we show that it can be computed for all of languages of the Universal Dependency treebank set, that it does not require ad-hoc thresholds to categorize languages or constructions, and that it can be applied with any granularity of constructions and languages. We compare our results with Bakker’s (1998) categorical flexibility index. Typometric flexibility is shown to be a good measure for characterizing the language distribution with respect to word order for a given construction, and for estimating whether a construction predicts the global word order behavior of a language.

pdf abs
Measure words are measurably different from sortal classifiers
Yamei Wang | Géraldine Walther

Nominal classifiers categorize nouns based on salient semantic properties. Past studies have long debated whether sortal classifiers (related to intrinsic semantic noun features) and mensural classifiers (related to quantity) should be considered as the same grammatical category. Suggested diagnostic tests rely on functional and distributional criteria, typically evaluated in terms of isolated example sentences obtained through elicitation. This paper offers a systematic re-evaluation of this long-standing question: using 981,076 nominal phrases from a 489 MB dependency-parsed word corpus, corresponding extracted contextual word embeddings from a Chinese BERT model, and information-theoretic measures of mutual information, we show that mensural classifiers can be distributionally and functionally distinguished from sortal classifiers justifying the existence of distinct syntactic categories for mensural and sortal classifiers. Our study also entails broader implications for the typological study of classifier systems.

pdf abs
A Pipeline for Extracting Abstract Dependency Templates for Data-to-Text Natural Language Generation
Simon Mille | Josep Ricci | Alexander Shvets | Anya Belz

We present work in progress that aims to address the coverage issue faced by rule-based text generators. We propose a pipeline for extracting abstract dependency template (predicate-argument structures) from Wikipedia text to be used as input for generating text from structured data with the FORGe system. The pipeline comprises three main components: (i) candidate sentence retrieval, (ii) clause extraction, ranking and selection, and (iii) conversion to predicate-argument form. We present an approach and preliminary evaluation for the ranking and selection module.

pdf (full)
bib (full) Proceedings of the 21st International Workshop on Treebanks and Linguistic Theories (TLT, GURT/SyntaxFest 2023)

pdf bib
Proceedings of the 21st International Workshop on Treebanks and Linguistic Theories (TLT, GURT/SyntaxFest 2023)
Daniel Dakota | Kilian Evang | Sandra Kübler | Lori Levin

pdf bib abs
Corpus-Based Multilingual Event-type Ontology: Annotation Tools and Principles
Eva Fučíková | Jan Hajič | Zdeňka Urešová

In the course of building a multilingual Event-type Ontology resource called SynSemClass, it was necessary to provide the maintainers and the annotators with a set of tools to facilitate their job, achieve data format consistency, and in general obtain high-quality data. We have adapted a previously existing tool (Urešová et al., 2018b), developed to assist the work in capturing bilingual synonymy. This tool needed to be both substantially expanded with some new features and fundamentally changed in the context of developing the resource for more languages, which necessarily is to be done in parallel. We are thus presenting here the tool, the new data structure design which had to change at the same time, and the associated workflow.

pdf bib abs
Spanish Verbal Synonyms in the SynSemClass Ontology
Cristina Fernández-Alcaina | Eva Fučíková | Jan Hajič | Zdeňka Urešová

This paper presents ongoing work in the expansion of the multilingual semantic event-type ontology SynSemClass (Czech-English-German) to include Spanish. As in previous versions of the lexicon, Spanish verbal synonyms have been collected from a sentence-aligned parallel corpus and classified into classes based on their syntactic-semantic properties. Each class member is linked to a number of syntactic and/or semantic resources specific to each language, thus enriching the annotation and enabling interoperability. This paper describes the procedure for the data extraction and annotation of Spanish verbal synonyms in the lexicon.

pdf abs
Hedging in diachrony: the case of Vedic Sanskrit iva
Erica Biagetti | Oliver Hellwig | Sven Sellmer

The rhetoric strategy of hedging serves to attenuate speech acts and their semantic content, as in English ‘kind of’ or ‘somehow’. While hedging has recently met with increasing interest in linguistic research, most studies deal with modern languages, preferably English, and take a synchronic approach. This paper complements this research by tracing the diachronic syntactic flexibilization of the Vedic Sanskrit particle iva from a marker of comparison (‘like’) to a full-fledged adaptor. We discuss the outcomes of a diachronic Bayesian framework applied to iva constructions in a Universal Dependencies treebank, and supplement these results with a qualitative discussion of relevant text passages.

pdf abs
Is Japanese CCGBank empirically correct? A case study of passive and causative constructions
Daisuke Bekki | Hitomi Yanaka

The Japanese CCGBank serves as training and evaluation data for developing Japanese CCG parsers. However, since it is automatically generated from the Kyoto Corpus, a dependency treebank, its linguistic validity still needs to be sufficiently verified. In this paper, we focus on the analysis of passive/causative constructions in the Japanese CCGBank and show that, together with the compositional semantics of ccg2lambda, a semantic parsing system, it yields empirically wrong predictions for the nested construction of passives and causatives.

Constituency parsing is an important task of informing how words are combined to form sentences. While constituency parsing in English has seen significant progress in the last few years, tools for constituency parsing in Indonesian remain few and far between. In this work, we publish ICON (Indonesian CONstituency treebank), the hitherto largest publicly-available manually-annotated benchmark constituency treebank for the Indonesian language with a size of 10,000 sentences and approximately 124,000 constituents and 182,000 tokens, which can support the training of state-of-the-art transformer-based models. We establish strong baselines on the ICON dataset using the Berkeley Neural Parser with transformer-based pre-trained embeddings, with the best performance of 88.85% F1 score coming from our own version of SpanBERT (IndoSpanBERT). We further analyze the predictions made by our best-performing model to reveal certain idiosyncrasies in the Indonesian language that pose challenges for constituency parsing.

pdf abs
Parsing Early New High German: Benefits and limitations of cross-dialectal training
Christopher Sapp | Daniel Dakota | Elliott Evans

Historical treebanking within the generative framework has gained in popularity. However, there are still many languages and historical periods yet to be represented. For German, a constituency treebank exists for historical Low German, but not Early New High German. We begin to fill this gap by presenting our initial work on the Parsed Corpus of Early New High German (PCENHG). We present the methodological considerations and workflow for the treebank’s annotations and development. Given the limited amount of currently available PCENHG treebank data, we treat it as a low-resource language and leverage a larger, closely related variety—Middle Low German—to build a parser to help facilitate faster post-annotation correction. We present an analysis on annotation speeds and conclude with a small pilot use-case, highlighting potential for future linguistic analyses. In doing so we highlight the value of the treebank’s development for historical linguistic analysis and demonstrate the benefits and challenges of developing a parser using two closely related historical Germanic varieties.

pdf abs
Semgrex and Ssurgeon, Searching and Manipulating Dependency Graphs
John Bauer | Chloé Kiddon | Eric Yeh | Alex Shan | Christopher D. Manning

Searching dependency graphs and manipulating them can be a time consuming and challenging task to get right. We document Semgrex, a system for searching dependency graphs, and introduce Ssurgeon, a system for manipulating the output of Semgrex. The compact language used by these systems allows for easy command line or API processing of dependencies. Additionally, integration with publicly released toolkits in Java and Python allows for searching text relations and attributes over natural text.

This paper presents detailed mappings between the structures used in Abstract Meaning Representation (AMR) and those used in Uniform Meaning Representation (UMR). These structures include general semantic roles, rolesets, and concepts that are largely shared between AMR and UMR, but with crucial differences. While UMR annotation of new low-resource languages is ongoing, AMR-annotated corpora already exist for many languages, and these AMR corpora are ripe for conversion to UMR format. Rather than focusing on semantic coverage that is new to UMR (which will likely need to be dealt with manually), this paper serves as a resource (with illustrated mappings) for users looking to understand the fine-grained adjustments that have been made to the representation techniques for semantic categoriespresent in both AMR and UMR.

pdf (full)
bib (full) Proceedings of the Sixth Workshop on Universal Dependencies (UDW, GURT/SyntaxFest 2023)

pdf bib
Proceedings of the Sixth Workshop on Universal Dependencies (UDW, GURT/SyntaxFest 2023)
Loïc Grobol | Francis Tyers

pdf bib abs
Building a Universal Dependencies Treebank for a Polysynthetic Language: the Case of Abaza
Alexey Koshevoy | Anastasia Panova | Ilya Makarchuk

In this paper, we discuss the challenges that we faced during the construction of a Universal Dependencies treebank for Abaza, a polysynthetic Northwest Caucasian language. We propose an alternative to the morpheme-level annotation of polysynthetic languages introduced in Park et al. (2021). Our approach aims at reducing the number of morphological features, yet providing all the necessary information for the comprehensive representation of all the syntactic relations. Besides, we suggest to add one language-specific relation needed for annotating repetitions in spoken texts and present several solutions that aim at increasing cross-linguistic comparability of our data.

pdf bib abs
Universalising Latin Universal Dependencies: a harmonisation of Latin treebanks in UD
Federica Gamba | Daniel Zeman

This paper presents the harmonisation process carried out on the five treebanks available for Latin in Universal Dependencies, with the aim of eliminating the discrepancies in their annotation styles. Indeed, this is the first issue to be addressed when parsing Latin, as significant drops in parsing accuracy on different Latin treebanks have been repeatedly observed. Latin syntactic variability surely accounts for this, but parsing results are as well affected by divergent annotation choices. By analysing where annotations differ, we propose a Python-based alignment of the five UD treebanks. Consequently, the impact of annotation choices on accuracy scores is assessed by performing parsing experiments with UDPipe and Stanza.

pdf abs
Sinhala Dependency Treebank (STB)
Chamila Liyanage | Kengatharaiyer Sarveswaran | Thilini Nadungodage | Randil Pushpananda

This paper reports the development of the first dependency treebank for the Sinhala language (STB). Sinhala, which is morphologically rich, is a low-resource language with few linguistic and computational resources available publicly. This treebank consists of 100 sentences taken from a large contemporary written text corpus. These sentences were annotated manually according to the Universal Dependencies framework. In this paper, apart from elaborating on the approach that has been followed to create the treebank, we have also discussed some interesting syntactic constructions found in the corpus and how we have handled them using the current Universal Dependencies specification.

pdf abs
Methodological issues regarding the semi-automatic UD treebank creation of under-resourced languages: the case of Pomak
Stella Markantonatou | Nicolaos Th. Constantinides | Vivian Stamou | Vasileios Arampatzakis | Panagiotis G. Krimpas | George Pavlidis

Pomak is an endangered oral Slavic language of Thrace/Greece. We present a short description of its interesting morphological and syntactic features in the UD framework. Because the morphological annotation of the treebank takes advantage of existing resources, it requires a different methodological approach from the one adopted for syntactic annotation that has started from scratch. It also requires the option of obtaining morphological predictions/evaluation separately from the syntactic ones with state-of-the-art NLP tools. Active annotation is applied in various settings in order to identify the best model that would facilitate the ongoing syntactic annotation.

pdf abs
Analysis of Corpus-based Word-Order Typological Methods
Diego Alves | Božo Bekavac | Daniel Zeman | Marko Tadić

This article presents a comparative analysis of four different syntactic typological approaches applied to 20 different languages. We compared three specific quantitative methods, using parallel CoNLL-U corpora, to the classification obtained via syntactic features provided by a typological database (lang2vec). First, we analyzed the Marsagram linear approach which consists of extracting the frequency word-order patterns regarding the position of components inside syntactic nodes. The second approach considers the relative position of heads and dependents, and the third is based simply on the relative position of verbs and objects. From the results, it was possible to observe that each method provides different language clusters which can be compared to the classic genealogical classification (the lang2vec and the head and dependent methods being the closest). As different word-order phenomena are considered in these specific typological strategies, each one provides a different angle of analysis to be applied according to the precise needs of the researchers.

pdf abs
Rule-based semantic interpretation for Universal Dependencies
Jamie Y. Findlay | Saeedeh Salimifar | Ahmet Yıldırım | Dag T. T. Haug

In this paper, we present a system for generating semantic representations from Universal Dependencies syntactic parses. The foundation of our pipeline is a rule-based interpretation system, designed to be as universal as possible, which produces the correct semantic structure; the content of this structure can then be filled in by additional (sometimes language-specific) post-processing. The rules which generate semantic resources rely as far as possible on the UD parse alone, so that they can apply to any language for which such a parse can be given (a much larger number than the number of languages for which detailed semantically annotated corpora are available). We discuss our general approach, and highlight areas where the UD annotation scheme makes semantic interpretation less straightforward. We compare our results with the Parallel Meaning Bank, and show that when it comes to modelling semantic structure, our approach shows potential, but also discuss some areas for expansion.

pdf abs
Are UD Treebanks Getting More Consistent? A Report Card for English UD
Amir Zeldes | Nathan Schneider

Recent efforts to consolidate guidelines and treebanks in the Universal Dependencies project raise the expectation that joint training and dataset comparison is increasingly possible for high-resource languages such as English, which have multiple corpora. Focusing on the two largest UD English treebanks, we examine progress in data consolidation and answer several questions: Are UD English treebanks becoming more internally consistent? Are they becoming more like each other and to what extent? Is joint training a good idea, and if so, since which UD version? Our results indicate that while consolidation has made progress, joint models may still suffer from inconsistencies, which hamper their ability to leverage a larger pool of training data.

pdf abs
Introducing Morphology in Universal Dependencies Japanese
Chihiro Taguchi | David Chiang

This paper discusses the need for including morphological features in Japanese Universal Dependencies (UD). In the current version (v2.11) of the Japanese UD treebanks, sentences are tokenized at the morpheme level, and almost no morphological feature annotation is used. However, Japanese is not an isolating language that lacks morphological inflection but is an agglutinative language. Given this situation, we introduce a tentative scheme for retokenization and morphological feature annotation for Japanese UD. Then, we measure and compare the morphological complexity of Japanese with other languages to demonstrate that the proposed tokenizations show similarities to synthetic languages reflecting the linguistic typology.

pdf (full)
bib (full) Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

pdf bib abs
Analyzing Zero-Shot transfer Scenarios across Spanish variants for Hate Speech Detection
Galo Castillo-lópez | Arij Riabi | Djamé Seddah

Hate speech detection in online platforms has been widely studied inthe past. Most of these works were conducted in English and afew rich-resource languages. Recent approaches tailored forlow-resource languages have explored the interests of zero-shot cross-lingual transfer learning models in resource-scarce scenarios.However, languages variations between geolects such as AmericanEnglish and British English, Latin-American Spanish, and EuropeanSpanish is still a problem for NLP models that often relies on(latent) lexical information for their classification tasks. Moreimportantly, the cultural aspect, crucial for hate speech detection,is often overlooked.In this work, we present the results of a thorough analysis of hatespeech detection models performance on different variants of Spanish,including a new hate speech toward immigrants Twitter data set we built to cover these variants. Using mBERT and Beto, a monolingual Spanish Bert-based language model, as the basis of our transfer learning architecture, our results indicate that hate speech detection models for a given Spanish variant are affected when different variations of such language are not considered. Hate speech expressions could vary from region to region where the same language is spoken.Our new dataset, models and guidelines are freely available.

pdf bib abs
Optimizing the Size of Subword Vocabularies in Dialect Classification
Vani Kanjirangat | Tanja Samardžić | Ljiljana Dolamic | Fabio Rinaldi

Pre-trained models usually come with a pre-defined tokenization and little flexibility as to what subword tokens can be used in downstream tasks. This problem concerns especially multilingual NLP and low-resource languages, which are typically processed using cross-lingual transfer. In this paper, we aim to find out if the right granularity of tokenization is helpful for a text classification task, namely dialect classification. Aiming at generalizations beyond the studied cases, we look for the optimal granularity in four dialect datasets, two with relatively consistent writing (one Arabic and one Indo-Aryan set) and two with considerably inconsistent writing (one Arabic and one Swiss German set). To gain more control over subword tokenization and ensure direct comparability in the experimental settings, we train a CNN classifier from scratch comparing two subword tokenization methods (Unigram model and BPE). For reference, we compare the results obtained in our analysis to the state of the art achieved by fine-tuning pre-trained models. We show that models trained from scratch with an optimal tokenization level perform better than fine-tuned classifiers in the case of highly inconsistent writing. In the case of relatively consistent writing, fine-tuned models remain better regardless of the tokenization level.

pdf abs
Murreviikko - A Dialectologically Annotated and Normalized Dataset of Finnish Tweets
Olli Kuparinen

This paper presents Murreviikko, a dataset of dialectal Finnish tweets which have been dialectologically annotated and manually normalized to a standard form. The dataset can be used as a test set for dialect identification and dialect-to-standard normalization, for instance. We evaluate the dataset on the normalization task, comparing an existing normalization model built on a spoken dialect corpus and three newly trained models with different architectures. We find that there are significant differences in normalization difficulty between the dialects, and that a character-level statistical machine translation model performs best on the Murreviikko tweet dataset.

pdf abs
Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages
Verena Blaschke | Hinrich Schütze | Barbara Plank

One of the challenges with finetuning pretrained language models (PLMs) is that their tokenizer is optimized for the language(s) it was pretrained on, but brittle when it comes to previously unseen variations in the data. This can for instance be observed when finetuning PLMs on one language and evaluating them on data in a closely related language variety with no standardized orthography. Despite the high linguistic similarity, tokenization no longer corresponds to meaningful representations of the target data, leading to low performance in, e.g., part-of-speech tagging. In this work, we finetune PLMs on seven languages from three different families and analyze their zero-shot performance on closely related, non-standardized varieties. We consider different measures for the divergence in the tokenization of the source and target data, and the way they can be adjusted by manipulating the tokenization during the finetuning step. Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data (the isplit word ratio difference/i) is the strongest predictor for model performance on target data.

pdf abs
Temporal Domain Adaptation for Historical Irish
Oksana Dereza | Theodorus Fransen | John P. Mccrae

The digitisation of historical texts has provided new horizons for NLP research, but such data also presents a set of challenges, including scarcity and inconsistency. The lack of editorial standard during digitisation exacerbates these difficulties.This study explores the potential for temporal domain adaptation in Early Modern Irish and pre-reform Modern Irish data. We describe two experiments carried out on the book subcorpus of the Historical Irish Corpus, which includes Early Modern Irish and pre-reform Modern Irish texts from 1581 to 1926. We also propose a simple orthographic normalisation method for historical Irish that reduces the type-token ratio by 21.43% on average in our data.The results demonstrate that the use of out-of-domain data significantly improves a language model’s performance. Providing a model with additional input from another historical stage of the language improves its quality by 12.49% on average on non-normalised texts and by 27.02% on average on normalised (demutated) texts. Most notably, using only out-of-domain data for both pre-training and training stages allowed for up to 86.81% of the baseline model quality on non-normalised texts and up to 95.68% on normalised texts without any target domain data. Additionally, we investigate the effect of temporal distance between the training and test data. The hypothesis that there is a positive correlation between performance and temporal proximity of training and test data has been validated, which manifests best in normalised data. Expanding this approach even further back, to Middle and Old Irish, and testing it on other languages is a further research direction.

pdf abs
Variation and Instability in Dialect-Based Embedding Spaces
Jonathan Dunn

This paper measures variation in embedding spaces which have been trained on different regional varieties of English while controlling for instability in the embeddings. While previous work has shown that it is possible to distinguish between similar varieties of a language, this paper experiments with two follow-up questions: First, does the variety represented in the training data systematically influence the resulting embedding space after training? This paper shows that differences in embeddings across varieties are significantly higher than baseline instability. Second, is such dialect-based variation spread equally throughout the lexicon? This paper shows that specific parts of the lexicon are particularly subject to variation. Taken together, these experiments confirm that embedding spaces are significantly influenced by the dialect represented in the training data. This finding implies that there is semantic variation across dialects, in addition to previously-studied lexical and syntactic variation.

pdf abs
PALI: A Language Identification Benchmark for Perso-Arabic Scripts
Sina Ahmadi | Milind Agarwal | Antonios Anastasopoulos

The Perso-Arabic scripts are a family of scripts that are widely adopted and used by various linguistic communities around the globe. Identifying various languages using such scripts is crucial to language technologies and challenging in low-resource setups. As such, this paper sheds light on the challenges of detecting languages using Perso-Arabic scripts, especially in bilingual communities where “unconventional” writing is practiced. To address this, we use a set of supervised techniques to classify sentences into their languages. Building on these, we also propose a hierarchical model that targets clusters of languages that are more often confused by the classifiers. Our experiment results indicate the effectiveness of our solutions.

pdf abs
Get to Know Your Parallel Data: Performing English Variety and Genre Classification over MaCoCu Corpora
Taja Kuzman | Peter Rupnik | Nikola Ljubešić

Collecting texts from the web enables a rapid creation of monolingual and parallel corpora of unprecedented size. However, unlike manually-collected corpora, authors and end users do not know which texts make up the web collections. In this work, we analyse the content of seven European parallel web corpora, collected from national top-level domains, by analysing the English variety and genre distribution in them. We develop and provide a lexicon-based British-American variety classifier, which we use to identify the English variety. In addition, we apply a Transformer-based genre classifier to corpora to analyse genre distribution and the interplay between genres and English varieties. The results reveal significant differences among the seven corpora in terms of different genre distribution and different preference for English varieties.

pdf abs
Reconstructing Language History by Using a Phonological Ontology. An Analysis of German Surnames
Hanna Fischer | Robert Engsterhold

This paper applies the ontology-baseddialectometric technique of Engsterhold(2020) to surnames. The method wasoriginally developed for phonetic analyses.However, as will be shown, it is also suitedfor the study of graphemic representations.Based on data from the German SurnameAtlas (DFA), the method is optimized forgraphemic analysis and illustrated with anexample case.

pdf abs
BENCHić-lang: A Benchmark for Discriminating between Bosnian, Croatian, Montenegrin and Serbian
Peter Rupnik | Taja Kuzman | Nikola Ljubešić

Automatic discrimination between Bosnian, Croatian, Montenegrin and Serbian is a hard task due to the mutual intelligibility of these South-Slavic languages. In this paper, we introduce the BENCHić-lang benchmark for discriminating between these four languages. The benchmark consists of two datasets from different domains - a Twitter and a news dataset - selected with the aim of fostering cross-dataset evaluation of different modelling approaches. We experiment with the baseline SVM models, based on character n-grams, which perform nicely in-dataset, but do not generalize well in cross-dataset experiments. Thus, we introduce another approach, exploiting only web-crawled data and the weak supervision signal coming from the respective country/language top-level domains. The resulting simple Naive Bayes model, based on less than a thousand word features extracted from web data, outperforms the baseline models in the cross-dataset scenario and achieves good levels of generalization across datasets.

pdf abs
Comparing and Predicting Eye-tracking Data of Mandarin and Cantonese
Junlin Li | Bo Peng | Yu-yin Hsu | Emmanuele Chersoni

Eye-tracking data in Chinese languages present unique challenges due to the non-alphabetic and unspaced nature of the Chinese writing systems. This paper introduces the first deeply-annotated joint Mandarin-Cantonese eye-tracking dataset, from which we achieve a unified eye-tracking prediction system for both language varieties. In addition to the commonly studied first fixation duration and the total fixation duration, this dataset also includes the second fixation duration, expressing fixation patterns that are more relevant to higher-level, structural processing. A basic comparison of the features and measurements in our dataset revealed variation between Mandarin and Cantonese on fixation patterns related to word class and word position. The test of feature usefulness suggested that traditional features are less powerful in predicting the second-pass fixation, to which the linear distance to root makes a leading contribution in Mandarin. In contrast, Cantonese eye-movement behavior relies more on word position and part of speech.

pdf abs
A Measure for Linguistic Coherence in Spatial Language Variation
Alfred Lameli | Andreas Schönberg

Based on historical dialect data we introduce a local measure of linguistic coherence in spatial language variation aiming at the identification of regions which are particularly sensitive to language variation and change. Besides, we use a measure of global coherence for the automated detection of linguistic items (e.g., sounds or morphemes) with higher or lesser language variation. The paper describes both the data and the method and provides analyses examples.

pdf abs
Dialect and Variant Identification as a Multi-Label Classification Task: A Proposal Based on Near-Duplicate Analysis
Gabriel Bernier-colborne | Cyril Goutte | Serge Leger

We argue that dialect identification should be treated as a multi-label classification problem rather than the single-class setting prevalent in existing collections and evaluations.In order to avoid extensive human re-labelling of the data, we propose an analysis of ambiguous near-duplicates in an existing collection covering four variants of French.We show how this analysis helps us provide multiple labels for a significant subset of the original data, therefore enriching the annotation with minimal human intervention.The resulting data can then be used to train dialect identifiers in a multi-label setting.Experimental results show that on the enriched dataset, the multi-label classifier produces similar accuracy to the single-label classifier on test cases that are unambiguous (single label), but it increases the macro-averaged F1-score by 0.225 absolute (71% relative gain) on ambiguous texts with multiple labels. On the original data, gains on the ambiguous test cases are smaller but still considerable (+0.077 absolute, 20% relative gain), and accuracy on non-ambiguous test cases is again similar in this case.This supports our thesis that modelling dialect identification as a multi-label problem potentially has a positive impact.

pdf abs
Fine-Tuning BERT with Character-Level Noise for Zero-Shot Transfer to Dialects and Closely-Related Languages
Aarohi Srivastava | David Chiang

In this work, we induce character-level noise in various forms when fine-tuning BERT to enable zero-shot cross-lingual transfer to unseen dialects and languages. We fine-tune BERT on three sentence-level classification tasks and evaluate our approach on an assortment of unseen dialects and languages. We find that character-level noise can be an extremely effective agent of cross-lingual transfer under certain conditions, while it is not as helpful in others. Specifically, we explore these differences in terms of the nature of the task and the relationships between source and target languages, finding that introduction of character-level noise during fine-tuning is particularly helpful when a task draws on surface level cues and the source-target cross-lingual pair has a relatively high lexical overlap with shorter (i.e., less meaningful) unseen tokens on average.

pdf abs
Lemmatization Experiments on Two Low-Resourced Languages: Low Saxon and Occitan
Aleksandra Miletić | Janine Siewert

We present lemmatization experiments on the unstandardized low-resourced languages Low Saxon and Occitan using two machine-learning-based approaches represented by MaChAmp and Stanza. We show different ways to increase training data by leveraging historical corpora, small amounts of gold data and dictionary information, and discuss the usefulness of this additional data. In the results, we find some differences in the performance of the models depending on the language. This variation is likely to be partly due to differences in the corpora we used, such as the amount of internal variation. However, we also observe common tendencies, for instance that sequential models trained only on gold-annotated data often yield the best overall performance and generalize better to unknown tokens.

pdf abs
The Use of Khislavichi Lect Morphological Tagging to Determine its Position in the East Slavic Group
Ilia Afanasev

The study of low-resourced East Slavic lects is becoming increasingly relevant as they face the prospect of extinction under the pressure of standard Russian while being treated by academia as an inferior part of this lect. The Khislavichi lect, spoken in a settlement on the border of Russia and Belarus, is a perfect example of such an attitude.We take an alternative approach and study East Slavic lects (such as Khislavichi) as separate systems. The proposed method includes the development of a tagged corpus through morphological tagging with the models trained on the bigger lects. Morphological tagging results may be used to place these lects among the bigger ones, such as standard Belarusian or standard Russian. The implemented morphological taggers of standard Russian and standard Belarusian demonstrate an accuracy higher than the accuracy of multilingual models by 3 to 15{%. The study suggests possible ways to adapt these taggers to the Khislavichi dataset, such as tagset unification and transcription closer to the actual sound rather than the standard lect pronunciation. Automatic classification supports the hypothesis that Khislavichi is a border East Slavic lect that historically was Belarusian but got russified: the algorithm places it either slightly closer to Russian or to Belarusian.

pdf abs
DiatopIt: A Corpus of Social Media Posts for the Study of Diatopic Language Variation in Italy
Alan Ramponi | Camilla Casula

We introduce DiatopIt, the first corpus specifically focused on diatopic language variation in Italy for language varieties other than Standard Italian. DiatopIt comprises over 15K geolocated social media posts from Twitter over a period of two years, including regional Italian usage and content fully written in local language varieties or exhibiting code-switching with Standard Italian. We detail how we tackled key challenges in creating such a resource, including the absence of orthography standards for most local language varieties and the lack of reliable language identification tools. We assess the representativeness of DiatopIt across time and space, and show that the density of non-Standard Italian content across areas correlates with actual language use. We finally conduct computational experiments and find that modeling diatopic variation on highly multilingual areas such as Italy is a complex task even for recent language models.

pdf abs
Dialect Representation Learning with Neural Dialect-to-Standard Normalization
Olli Kuparinen | Yves Scherrer

Language label tokens are often used in multilingual neural language modeling and sequence-to-sequence learning to enhance the performance of such models. An additional product of the technique is that the models learn representations of the language tokens, which in turn reflect the relationships between the languages. In this paper, we study the learned representations of dialects produced by neural dialect-to-standard normalization models. We use two large datasets of typologically different languages, namely Finnish and Norwegian, and evaluate the learned representations against traditional dialect divisions of both languages. We find that the inferred dialect embeddings correlate well with the traditional dialects. The methodology could be further used in noisier settings to find new insights into language variation.

pdf abs
VarDial in the Wild: Industrial Applications of LID Systems for Closely-Related Language Varieties
Fritz Hohl | Soh-eun Shim

This report describes first an industrial use case for identifying closely related languages, e.g.dialects, namely the detection of languages of movie subtitle documents. We then presenta 2-stage architecture that is able to detect macrolanguages in the first stage and languagevariants in the second. Using our architecture, we participated in the DSL-TL Shared Task of the VarDial 2023 workshop. We describe the results of our experiments. In the first experiment we report an accuracy of 97.8% on a set of 460 subtitle files. In our second experimentwe used DSL-TL data and achieve a macroaverage F1 of 76% for the binary task, and 54% for the three-way task in the dev set. In the open track, we augment the data with named entities retrieved from Wikidata and achieve minor increases of about 1% for both tracks.

pdf abs
Two-stage Pipeline for Multilingual Dialect Detection
Ankit Vaidya | Aditya Kane

Dialect Identification is a crucial task for localizing various Large Language Models. This paper outlines our approach to the VarDial 2023 shared task. Here we have to identify three or two dialects from three languages each which results in a 9-way classification for Track-1 and 6-way classification for Track-2 respectively. Our proposed approach consists of a two-stage system and outperforms other participants’ systems and previous works in this domain. We achieve a score of 58.54% for Track-1 and 85.61% for Track-2. Our codebase is available publicly (https://github.com/ankit-vaidya19/EACL_VarDial2023).

pdf abs
Using Ensemble Learning in Language Variety Identification
Mihaela Gaman

The present work describes the solutions pro- posed by the UnibucNLP team to address the closed format of the DSL-TL task featured in the tenth VarDial Evaluation Campaign. The DSL-TL organizers provided approximately 11 thousand sentences written in three different languages and manually tagged with one of 9 classes. Out of these, 3 tags are considered common label and the remaining 6 tags are variety-specific. The DSL-TL task features 2 subtasks: Track 1 - a three-way and Track 2 - a two-way classification per language. In Track 2 only the variety-specific labels are used for scoring, whereas in Track 1 the common label is considered as well. Our team participated in both tracks, with three ensemble-based sub- missions for each. The meta-learner used for Track 1 is XGBoost and for Track 2, Logis- tic Regression. With each submission, we are gradually increasing the complexity of the en- semble, starting with two shallow, string-kernel based methods. To the first ensemble, we add a convolutional neural network for our second submission. The third ensemble submitted adds a fine-tuned BERT model to the second one. In Track 1, ensemble three is our highest ranked, with an F1 − score of 53.18%; 5.36% less than the leader. Surprisingly, in Track 2 the en- semble of shallow methods surpasses the other two, more complex ensembles, achieving an F 1 − score of 69.35%.

pdf abs
SIDLR: Slot and Intent Detection Models for Low-Resource Language Varieties
Sang Yun Kwon | Gagan Bhatia | Elmoatez Billah Nagoudi | Alcides Alcoba Inciarte | Muhammad Abdul-mageed

Intent detection and slot filling are two critical tasks in spoken and natural language understandingfor task-oriented dialog systems. In this work, we describe our participation in slot and intent detection for low-resource language varieties (SID4LR) (Aepli et al., 2023). We investigate the slot and intent detection (SID) tasks using a wide range of models and settings. Given the recent success of multitask promptedfinetuning of the large language models, we also test the generalization capability of the recent encoder-decoder model mT0 (Muennighoff et al., 2022) on new tasks (i.e., SID) in languages they have never intentionally seen. We show that our best model outperforms the baseline by a large margin (up to +30 F1 points) in both SID tasks.

This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023. The campaign is part of the tenth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2023. Three separate shared tasks were included this year: Slot and intent detection for low-resource language varieties (SID4LR), Discriminating Between Similar Languages – True Labels (DSL-TL), and Discriminating Between Similar Languages – Speech (DSL-S). All three tasks were organized for the first time this year.

bib (full) Proceedings of the Computational Sanskrit & Digital Humanities: Selected papers presented at the 18th World Sanskrit Conference

pdf bib
Proceedings of the Computational Sanskrit & Digital Humanities: Selected papers presented at the 18th World Sanskrit Conference
Amba Kulkarni | Oliver Hellwig

pdf bib
Evaluating Neural Word Embeddings for Sanskrit
Jivnesh Sandhan | Om Adideva Paranjay | Komal Digumarthi | Laxmidhar Behra | Pawan Goyal

pdf
Validation and Normalization of DCS corpus and Development of the Sanskrit Heritage Engine’s Segmenter
Krishnan Sriram | Amba Kulkarni | Gérard Huet

pdf
Pre-annotation Based Approach for Development of a Sanskrit Named Entity Recognition Dataset
Sarkar Sujoy | Amrith Krishna | Pawan Goyal

pdf
Disambiguation of Instrumental, Dative and Ablative Case suffixes in Sanskrit
Malay Maity | Sanjeev Panchal | Amba Kulkarni

pdf
Creation of a Digital Rig Vedic Index (Anukramani) for Computational Linguistic Tasks
A V S D S Mahesh | Arnab Bhattacharya

pdf
Skrutable: Another Step Toward Effective Sanskrit Meter Identification
Tyler Neill

pdf
Chandojnanam: A Sanskrit Meter Identification and Utilization System
Hrishikesh Terdalkar | Arnab Bhattacharya

pdf
Development of a TEI standard for digital Sanskrit texts containing commentaries: A pilot study of Bhaṭṭti’s Rāvaṇavadha with Mallinātha’s commentary on the first canto
Tanuja P Ajotikar | Peter M Scharf

pdf
Rāmopākhyāna: A Web-based reader and index
Peter M Scharf | Dhruv Chauhan

pdf
Semantic Annotation and Querying Framework based on Semi-structured Ayurvedic Text
Hrishikesh Terdalkar | Arnab Bhattacharya | Madhulika Dubey | S Ramamurthy | Bhavna Naneria Singh

pdf
The Vedic corpus as a graph. An updated version of Bloomfields Vedic Concordance
Oliver Hellwig | Sven Sellmer | Kyoko Amano

pdf
The transmission of the Buddha’s teachings in the digital age
Sumachaya Harnsukworapanich | Phatchareporn Supphipat

pdf
Distinguishing Commentary from Canon: Experiments in Pāli Computational Linguistics
Dan Zigmond