Adam Ek

2023

Vector Norms as an Approximation of Syntactic Complexity
Adam Ek | Nikolai Ilinykh
Proceedings of the Second Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2023)

Internal representations in transformer models can encode useful linguistic knowledge about syntax. Such knowledge could help optimise the data annotation process. However, identifying and extracting such representations from big language models is challenging. In this paper we evaluate two multilingual transformers for the presence of knowledge about the syntactic complexity of sentences and examine different vector norms. We provide a fine-grained evaluation of different norms in different layers and for different languages. Our results suggest that no single part of the models is the primary source of knowledge about syntactic complexity. However, some norms show a higher degree of sensitivity to syntactic complexity, depending on the language and model used.

2022

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation, and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements on several fronts that were made in the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 66 new languages, including 24 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g., missing gender and macrons information. We have amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.

In this paper we examine different meaning representations that are commonly used in different natural language applications today and discuss their limits, both in terms of the aspects of the natural language meaning they are modelling and in terms of the aspects of the application for which they are used.

In this paper, we present a number of fine-grained resources for Natural Language Inference (NLI). In particular, we present a number of resources and validation methods for Greek NLI and a resource for precise NLI. First, we extend the Greek version of the FraCaS test suite to include examples where the inference is directly linked to the syntactic/morphological properties of Greek. The new resource contains an additional 428 examples, making it in total a dataset of 774 examples. Expert annotators created the additional resource, and the original Greek version of FraCaS was extensively validated by non-expert and expert subjects. Next, we continue the work initiated by (CITATION), in which a subset of the RTE problems was labeled for missing hypotheses, and we present a dataset an order of magnitude larger, annotating the whole SuperGLUE/RTE dataset with missing hypotheses. Lastly, we provide a de-dropped version of the Greek XNLI dataset, where the pronouns that are missing due to the pro-drop nature of the language are inserted. We then run some models to see the effect of that insertion and report the results.

2021

Can the Transformer Learn Nested Recursion with Symbol Masking?
Jean-Philippe Bernardy | Adam Ek | Vladislav Maraev
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Can predicate-argument relationships be extracted from UD trees?
Adam Ek | Jean-Philippe Bernardy | Stergios Chatzikyriakidis
Proceedings of the Joint 15th Linguistic Annotation Workshop (LAW) and 3rd Designing Meaning Representations (DMR) Workshop

In this paper we investigate the possibility of extracting predicate-argument relations from UD trees (and enhanced UD graphs). Concretely, we apply UD parsers on an English question answering/semantic-role labeling data set (FitzGerald et al., 2018) and check if the annotations reflect the relations in the resulting parse trees, using a small number of rules to extract this information. We find that 79.1% of the argument-predicate pairs can be found in this way, on the basis of Udify (Kondratyuk and Straka, 2019). Error analysis reveals that half of the error cases are attributable to shortcomings in the dataset. The remaining errors are mostly due to predicate-argument relations not being extractible algorithmically from the UD trees (requiring semantic reasoning to be resolved). The parser itself is only responsible for a small portion of errors. Our analysis suggests a number of improvements to the UD annotation schema: we propose to enhance the schema in four ways, in order to capture argument-predicate relations. Additionally, we propose improvements regarding data collection for question answering/semantic-role labeling data.

This year’s iteration of the SIGMORPHON Shared Task on morphological reinflection focuses on typological diversity and cross-lingual variation of morphosyntactic features. In terms of the task, we enrich UniMorph with new data for 32 languages from 13 language families, with most of them being under-resourced: Kunwinjku, Classical Syriac, Arabic (Modern Standard, Egyptian, Gulf), Hebrew, Amharic, Aymara, Magahi, Braj, Kurdish (Central, Northern, Southern), Polish, Karelian, Livvi, Ludic, Veps, Võro, Evenki, Xibe, Tuvan, Sakha, Turkish, Indonesian, Kodi, Seneca, Asháninka, Yanesha, Chukchi, Itelmen, Eibela. We evaluate six systems on the new data and conduct an extensive error analysis of the systems’ predictions. Transformer-based models generally demonstrate superior performance on the majority of languages, achieving >90% accuracy on 65% of them. The languages on which systems yielded low accuracy are mainly under-resourced, with a limited amount of data. Most errors made by the systems are due to allomorphy, honorificity, and form variation. In addition, we observe that systems especially struggle to inflect multiword lemmas. The systems also produce misspelled forms or end up in repetitive loops (e.g., RNN-based models). Finally, we report a large drop in systems’ performance on previously unseen lemmas.

Training Strategies for Neural Multilingual Morphological Inflection
Adam Ek | Jean-Philippe Bernardy
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper presents the submission of team GUCLASP to SIGMORPHON 2021 Shared Task on Generalization in Morphological Inflection Generation. We develop a multilingual model for Morphological Inflection and primarily focus on improving the model by using various training strategies to improve accuracy and generalization across languages.

2020

How Much of Enhanced UD Is Contained in UD?
Adam Ek | Jean-Philippe Bernardy
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

In this paper, we present the submission of team CLASP to the IWPT 2020 Shared Task on parsing enhanced universal dependencies. We develop a tree-to-graph transformation algorithm based on dependency patterns. This algorithm can transform gold UD trees to EUD graphs with an ELAS score of 81.55 and a EULAS score of 96.70. These results show that much of the information needed to construct EUD graphs from UD trees is present in the UD trees. Coupled with a standard UD parser, the method applies to the official test data and yields an ELAS score of 67.85 and a EULAS score of 80.18.

Composing Byte-Pair Encodings for Morphological Sequence Classification
Adam Ek | Jean-Philippe Bernardy
Proceedings of the Fourth Workshop on Universal Dependencies (UDW 2020)

Byte-pair encoding is a method for splitting a word into sub-word tokens; a language model then assigns contextual representations separately to each of these tokens. In this paper, we evaluate four different methods of composing such sub-word representations into word representations. We evaluate the methods on morphological sequence classification, the task of predicting grammatical features of a word. Our experiments reveal that using an RNN to compute word representations is consistently more effective than the other methods tested across a sample of eight languages with different typology and varying numbers of byte-pair tokens per word.

Proceedings of the Probability and Meaning Conference (PaM 2020)
Christine Howes | Stergios Chatzikyriakidis | Adam Ek | Vidya Somashekarappa
Proceedings of the Probability and Meaning Conference (PaM 2020)

How does Punctuation Affect Neural Models in Natural Language Inference
Adam Ek | Jean-Philippe Bernardy | Stergios Chatzikyriakidis
Proceedings of the Probability and Meaning Conference (PaM 2020)

Natural Language Inference models have reached almost human-level performance, but their generalisation capabilities have not yet been fully characterized. In particular, sensitivity to small changes in the data is a current area of investigation. In this paper, we focus on the effect of punctuation on such models. Our findings can be broadly summarized as follows: (1) Irrelevant changes in punctuation are correctly ignored by the recent transformer models (BERT), while older RNN-based models were sensitive to them. (2) All models, both transformers and RNN-based models, are incapable of taking into account small relevant changes in the punctuation.

2019

Language Modeling with Syntactic and Semantic Representation for Sentence Acceptability Predictions
Adam Ek | Jean-Philippe Bernardy | Shalom Lappin
Proceedings of the 22nd Nordic Conference on Computational Linguistics

In this paper, we investigate the effect of enhancing lexical embeddings in LSTM language models (LM) with syntactic and semantic representations. We evaluate the language models using perplexity, and we evaluate the performance of the models on the task of predicting human sentence acceptability judgments. We train LSTM language models on sentences automatically annotated with universal syntactic dependency roles (Nivre, 2016), dependency depth and universal semantic tags (Abzianidze et al., 2017) to predict sentence acceptability judgments. Our experiments indicate that syntactic tags lower perplexity, while semantic tags increase it. Our experiments also show that neither syntactic nor semantic tags improve the performance of LSTM language models on the task of predicting sentence acceptability judgments.

Synthetic Propaganda Embeddings To Train A Linear Projection
Adam Ek | Mehdi Ghanimifard
Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda

This paper presents a method of detecting fine-grained categories of propaganda in text. Given a sentence, our method aims to identify a span of words and predict the type of propaganda used. To detect propaganda, we explore a method for extracting features of propaganda from contextualized embeddings without fine-tuning the large parameters of the base model. We show that by generating synthetic embeddings we can train a linear function with ReLU activation to extract useful labeled embeddings from an embedding space generated by a general-purpose language model. We also introduce an inference technique to detect continuous spans in sequences of propaganda tokens in sentences. A result of the ensemble model is submitted to the first shared task in fine-grained propaganda detection at NLP4IF as Team Stalin. In this paper, we provide additional analysis regarding our method of detecting spans of propaganda with synthetically generated representations.
