Grzegorz Kondrak

2023

pdf abs
Correcting Sense Annotations Using Wordnets and Translations
Arnob Mallik | Grzegorz Kondrak
Proceedings of the 12th Global Wordnet Conference

Acquiring large amounts of high-quality annotated data is an open issue in word sense disambiguation. This problem has become more critical recently with the advent of supervised models based on neural networks, which require large amounts of annotated data. We propose two algorithms for making selective corrections on a sense-annotated parallel corpus, based on cross-lingual synset mappings. We show that, when applied to bilingual parallel corpora, these algorithms can rectify noisy sense annotations, and thereby produce multilingual sense-annotated data of improved quality.

pdf abs
Bridging the Gap Between BabelNet and HowNet: Unsupervised Sense Alignment and Sememe Prediction
Xiang Zhang | Ning Shi | Bradley Hauer | Grzegorz Kondrak
Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics

As the minimum semantic units of natural languages, sememes can provide precise representations of concepts. Despite the widespread utilization of lexical resources for semantic tasks, use of sememes is limited by a lack of available sememe knowledge bases. Recent efforts have been made to connect BabelNet with HowNet by automating sememe prediction. However, these methods depend on large manually annotated datasets. We propose to use sense alignment via a novel unsupervised and explainable method. Our method consists of four stages, each relaxing predefined constraints until a complete alignment of BabelNet synsets to HowNet senses is achieved. Experimental results demonstrate the superiority of our unsupervised method over previous supervised ones by an improvement of 12% overall F1 score, setting a new state of the art. Our work is grounded in an interpretable propagation of sememe information between lexical resources, and may benefit downstream applications which can incorporate sememe information.

pdf abs
Don’t Trust ChatGPT when your Question is not in English: A Study of Multilingual Abilities and Types of LLMs
Xiang Zhang | Senyu Li | Bradley Hauer | Ning Shi | Grzegorz Kondrak
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) have demonstrated exceptional natural language understanding abilities, and have excelled in a variety of natural language processing (NLP) tasks. Despite the fact that most LLMs are trained predominantly on English, multiple studies have demonstrated their capabilities in a variety of languages. However, fundamental questions persist regarding how LLMs acquire their multilingual abilities and how performance varies across different languages. These inquiries are crucial for the study of LLMs since users and researchers often come from diverse language backgrounds, potentially influencing how they use LLMs and interpret their output. In this work, we propose a systematic way of qualitatively and quantitatively evaluating the multilingual capabilities of LLMs. We investigate the phenomenon of cross-language generalization in LLMs, wherein limited multilingual training data leads to advanced multilingual capabilities. To accomplish this, we employ a novel prompt back-translation method. The results demonstrate that LLMs, such as GPT, can effectively transfer learned knowledge across different languages, yielding relatively consistent results in translation-equivariant tasks, in which the correct output does not depend on the language of the input. However, LLMs struggle to provide accurate results in translation-variant tasks, which lack this property, requiring careful user judgment to evaluate the answers.

pdf
One Sense per Translation
Bradley Hauer | Grzegorz Kondrak
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf abs
UAlberta at SemEval-2023 Task 1: Context Augmentation and Translation for Multilingual Visual Word Sense Disambiguation
Michael Ogezi | Bradley Hauer | Talgat Omarov | Ning Shi | Grzegorz Kondrak
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

We describe the systems of the University of Alberta team for the SemEval-2023 Visual Word Sense Disambiguation (V-WSD) Task. We present a novel algorithm that leverages glosses retrieved from BabelNet, in combination with text and image encoders. Furthermore, we compare language-specific encoders against the application of English encoders to translated texts. As the contexts given in the task datasets are extremely short, we also experiment with augmenting these contexts with descriptions generated by a language model. This yields substantial improvements in accuracy. We describe and evaluate additional V-WSD methods which use image generation and text-conditioned image segmentation. Some of our experimental results exceed those of our official submissions on the test set. Our code is publicly available at https://github.com/UAlberta-NLP/v-wsd.

pdf abs
Grounding the Lexical Substitution Task in Entailment
Talgat Omarov | Grzegorz Kondrak
Findings of the Association for Computational Linguistics: ACL 2023

Existing definitions of lexical substitutes are often vague or inconsistent with the gold annotations. We propose a new definition which is grounded in the relation of entailment; namely, that the sentence that results from the substitution should be in the relation of mutual entailment with the original sentence. We argue that the new definition is well-founded and supported by previous work on lexical entailment. We empirically validate our definition by verifying that it covers the majority of gold substitutes in existing datasets. Based on this definition, we create a new dataset from existing semantic resources. Finally, we propose a novel context augmentation method motivated by the definition, which relates the substitutes to the sense of the target word by incorporating glosses and synonyms directly into the context. Experimental results demonstrate that our augmentation approach improves the performance of lexical substitution systems on the existing benchmarks.

pdf abs
Taxonomy of Problems in Lexical Semantics
Bradley Hauer | Grzegorz Kondrak
Findings of the Association for Computational Linguistics: ACL 2023

Semantic tasks are rarely formally defined, and the exact relationship between them is an open question. We introduce a taxonomy that elucidates the connection between several problems in lexical semantics, including monolingual and cross-lingual variants. Our theoretical framework is based on the hypothesis of the equivalence of concept and meaning distinctions. Using algorithmic problem reductions, we demonstrate that all problems in the taxonomy can be reduced to word sense disambiguation (WSD), and that WSD itself can be reduced to some problems, making them theoretically equivalent. In addition, we carry out experiments that strongly support the soundness of the concept-meaning hypothesis, and the correctness of our reductions.

2022

pdf abs
Lexical Resource Mapping via Translations
Hongchang Bao | Bradley Hauer | Grzegorz Kondrak
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Aligning lexical resources that associate words with concepts in multiple languages increases the total amount of semantic information that can be leveraged for various NLP tasks. We present a translation-based approach to mapping concepts across diverse resources. Our methods depend only on multilingual lexicalization information. When applied to align WordNet/BabelNet to CLICS and OmegaWiki, our methods achieve state-of-the-art accuracy, without any dependence on other sources of semantic knowledge. Since each word-concept pair corresponds to a unique sense of the word, we also demonstrate that the mapping task can be framed as word sense disambiguation. To facilitate future work, we release a set of high-precision WordNet-CLICS alignments, produced by combining three different mapping methods.

pdf abs
UAlberta at LSCDiscovery: Lexical Semantic Change Detection via Word Sense Disambiguation
Daniela Teodorescu | Spencer von der Ohe | Grzegorz Kondrak
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

We describe our two systems for the shared task on Lexical Semantic Change Discovery in Spanish. For binary change detection, we frame the task as a word sense disambiguation (WSD) problem. We derive sense frequency distributions for target words in both old and modern corpora. We assume that the word semantics have changed if a sense is observed in only one of the two corpora, or the relative change for any sense exceeds a tuned threshold. For graded change discovery, we follow the design of CIRCE (Pömsl and Lyapin, 2020) by combining both static and contextual embeddings. For contextual embeddings, we use XLM-RoBERTa instead of BERT, and train the model to predict a masked token instead of the time period. Our language-independent methods achieve results that are close to the best-performing systems in the shared task.

pdf abs
UAlberta at SemEval 2022 Task 2: Leveraging Glosses and Translations for Multilingual Idiomaticity Detection
Bradley Hauer | Seeratpal Jaura | Talgat Omarov | Grzegorz Kondrak
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

We describe the University of Alberta systems for the SemEval-2022 Task 2 on multilingual idiomaticity detection. Working under the assumption that idiomatic expressions are noncompositional, our first method integrates information on the meanings of the individual words of an expression into a binary classifier. Further hypothesizing that literal and idiomatic expressions translate differently, our second method translates an expression in context, and uses a lexical knowledge base to determine if the translation is literal. Our approaches are grounded in linguistic phenomena, and leverage existing sources of lexical knowledge. Our results offer support for both approaches, particularly the former.

pdf abs
Improving HowNet-Based Chinese Word Sense Disambiguation with Translations
Xiang Zhang | Bradley Hauer | Grzegorz Kondrak
Findings of the Association for Computational Linguistics: EMNLP 2022

Word sense disambiguation (WSD) is the task of identifying the intended sense of a word in context. While prior work on unsupervised WSD has leveraged lexical knowledge bases, such as WordNet and BabelNet, these resources have proven to be less effective for Chinese. Instead, the most widely used lexical knowledge base for Chinese is HowNet. Previous HowNet-based WSD methods have not exploited contextual translation information. In this paper, we present the first HowNet-based WSD system which combines monolingual contextual information from a pretrained neural language model with bilingual information obtained via machine translation and sense translation information from HowNet. The results of our evaluation experiment on a test set from prior work demonstrate that our new method achieves a new state of the art for unsupervised Chinese WSD.

pdf abs
WiC = TSV = WSD: On the Equivalence of Three Semantic Tasks
Bradley Hauer | Grzegorz Kondrak
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The Word-in-Context (WiC) task has attracted considerable attention in the NLP community, as demonstrated by the popularity of the recent MCL-WiC SemEval shared task. Systems and lexical resources from word sense disambiguation (WSD) are often used for the WiC task and WiC dataset construction. In this paper, we establish the exact relationship between WiC and WSD, as well as the related task of target sense verification (TSV). Building upon a novel hypothesis on the equivalence of sense and meaning distinctions, we demonstrate through the application of tools from theoretical computer science that these three semantic classification problems can be pairwise reduced to each other, and therefore are equivalent. The results of experiments that involve systems and datasets for both WiC and WSD provide strong empirical evidence that our problem reductions work in practice.

2021

pdf bib abs
On Universal Colexifications
Hongchang Bao | Bradley Hauer | Grzegorz Kondrak
Proceedings of the 11th Global Wordnet Conference

Colexification occurs when two distinct concepts are lexified by the same word. The term covers both polysemy and homonymy. We posit and investigate the hypothesis that no pair of concepts are colexified in every language. We test our hypothesis by analyzing colexification data from BabelNet, Open Multilingual WordNet, and CLICS. The results show that our hypothesis is supported by over 99.9% of colexified concept pairs in these three lexical resources.

pdf abs
Homonymy and Polysemy Detection with Multilingual Information
Amir Ahmad Habibi | Bradley Hauer | Grzegorz Kondrak
Proceedings of the 11th Global Wordnet Conference

Deciding whether a semantically ambiguous word is homonymous or polysemous is equivalent to establishing whether it has any pair of senses that are semantically unrelated. We present novel methods for this task that leverage information from multilingual lexical resources. We formally prove the theoretical properties that provide the foundation for our methods. In particular, we show how the One Homonym Per Translation hypothesis of Hauer and Kondrak (2020a) follows from the synset properties formulated by Hauer and Kondrak (2020b). Experimental evaluation shows that our approach sets a new state of the art for homonymy detection.

pdf abs
UAlberta at SemEval-2021 Task 2: Determining Sense Synonymy via Translations
Bradley Hauer | Hongchang Bao | Arnob Mallik | Grzegorz Kondrak
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

We describe the University of Alberta systems for the SemEval-2021 Word-in-Context (WiC) disambiguation task. We explore the use of translation information for deciding whether two different tokens of the same word correspond to the same sense of the word. Our focus is on developing principled theoretical approaches which are grounded in linguistic phenomena, leading to more explainable models. We show that translations from multiple languages can be leveraged to improve the accuracy on the WiC task.

pdf abs
Semi-Supervised and Unsupervised Sense Annotation via Translations
Bradley Hauer | Grzegorz Kondrak | Yixing Luan | Arnob Mallik | Lili Mou
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

Acquisition of multilingual training data continues to be a challenge in word sense disambiguation (WSD). To address this problem, unsupervised approaches have been proposed to automatically generate sense annotations for training supervised WSD systems. We present three new methods for creating sense-annotated corpora which leverage translations, parallel bitexts, lexical resources, as well as contextual and synset embeddings. Our semi-supervised method applies machine translation to transfer existing sense annotations to other languages. Our two unsupervised methods refine sense annotations produced by a knowledge-based WSD system via lexical translations in a parallel corpus. We obtain state-of-the-art results on standard WSD benchmarks.

pdf abs
Dorabella Cipher as Musical Inspiration
Bradley Hauer | Colin Choi | Abram Hindle | Scott Smallwood | Grzegorz Kondrak
Proceedings of the Workshop on Speech and Music Processing 2021

The Dorabella cipher is an encrypted note of English composer Edward Elgar, which has defied decipherment attempts for more than a century. While most proposed solutions are English texts, we investigate the hypothe- sis that Dorabella represents enciphered music. We weigh the evidence in favor of and against the hypothesis, devise a simplified music nota- tion, and attempt to reconstruct a melody from the cipher. Our tools are n-gram models of mu- sic which we validate on existing music cor- pora enciphered using monoalphabetic substi- tution. By applying our methods to Dorabella, we produce a decipherment with musical qual- ities, which is then transformed via artful com- position into a listenable melody. Far from ar- guing that the end result represents the only true solution, we instead frame the process of decipherment as part of the composition pro- cess.

2020

pdf abs
Improving Word Sense Disambiguation with Translations
Yixing Luan | Bradley Hauer | Lili Mou | Grzegorz Kondrak
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

It has been conjectured that multilingual information can help monolingual word sense disambiguation (WSD). However, existing WSD systems rarely consider multilingual information, and no effective method has been proposed for improving WSD by generating translations. In this paper, we present a novel approach that improves the performance of a base WSD system using machine translation. Since our approach is language independent, we perform WSD experiments on several languages. The results demonstrate that our methods can consistently improve the performance of WSD systems, and obtain state-ofthe-art results in both English and multilingual WSD. To facilitate the use of lexical translation information, we also propose BABALIGN, an precise bitext alignment algorithm which is guided by multilingual lexical correspondences from BabelNet.

pdf abs
Low-Resource G2P and P2G Conversion with Synthetic Training Data
Bradley Hauer | Amir Ahmad Habibi | Yixing Luan | Arnob Mallik | Grzegorz Kondrak
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

This paper presents the University of Alberta systems and results in the SIGMORPHON 2020 Task 1: Multilingual Grapheme-to-Phoneme Conversion. Following previous SIGMORPHON shared tasks, we define a low-resource setting with 100 training instances. We experiment with three transduction approaches in both standard and low-resource settings, as well as on the related task of phoneme-to-grapheme conversion. We propose a method for synthesizing training data using a combination of diverse models.

pdf abs
UAlberta at SemEval-2020 Task 2: Using Translations to Predict Cross-Lingual Entailment
Bradley Hauer | Amir Ahmad Habibi | Yixing Luan | Arnob Mallik | Grzegorz Kondrak
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We investigate the hypothesis that translations can be used to identify cross-lingual lexical entailment. We propose novel methods that leverage parallel corpora, word embeddings, and multilingual lexical resources. Our results demonstrate that the implementation of these ideas leads to improvements in predicting entailment.

2019

pdf abs
Joint Approach to Deromanization of Code-mixed Texts
Rashed Rubby Riyadh | Grzegorz Kondrak
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

The conversion of romanized texts back to the native scripts is a challenging task because of the inconsistent romanization conventions and non-standard language use. This problem is compounded by code-mixing, i.e., using words from more than one language within the same discourse. In this paper, we propose a novel approach for handling these two problems together in a single system. Our approach combines three components: language identification, back-transliteration, and sequence prediction. The results of our experiments on Bengali and Hindi datasets establish the state of the art for the task of deromanization of code-mixed texts.

pdf bib abs
Cognate Projection for Low-Resource Inflection Generation
Bradley Hauer | Amir Ahmad Habibi | Yixing Luan | Rashed Rubby Riyadh | Grzegorz Kondrak
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

We propose cognate projection as a method of crosslingual transfer for inflection generation in the context of the SIGMORPHON 2019 Shared Task. The results on four language pairs show the method is effective when no low-resource training data is available.

2018

pdf abs
Comparison of Assorted Models for Transliteration
Saeed Najafi | Bradley Hauer | Rashed Rubby Riyadh | Leyuan Yu | Grzegorz Kondrak
Proceedings of the Seventh Named Entities Workshop

We report the results of our experiments in the context of the NEWS 2018 Shared Task on Transliteration. We focus on the comparison of several diverse systems, including three neural MT models. A combination of discriminative, generative, and neural models obtains the best results on the development sets. We also put forward ideas for improving the shared task.

pdf abs
String Transduction with Target Language Models and Insertion Handling
Garrett Nicolai | Saeed Najafi | Grzegorz Kondrak
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology

Many character-level tasks can be framed as sequence-to-sequence transduction, where the target is a word from a natural language. We show that leveraging target language models derived from unannotated target corpora, combined with a precise alignment of the training data, yields state-of-the art results on cognate projection, inflection generation, and phoneme-to-grapheme conversion.

pdf
Combining Neural and Non-Neural Methods for Low-Resource Morphological Reinflection
Saeed Najafi | Bradley Hauer | Rashed Rubby Riyadh | Leyuan Yu | Grzegorz Kondrak
Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

2017

pdf abs
Identifying Cognate Sets Across Dictionaries of Related Languages
Adam St Arnaud | David Beck | Grzegorz Kondrak
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We present a system for identifying cognate sets across dictionaries of related languages. The likelihood of a cognate relationship is calculated on the basis of a rich set of features that capture both phonetic and semantic similarity, as well as the presence of regular sound correspondences. The similarity scores are used to cluster words from different languages that may originate from a common proto-word. When tested on the Algonquian language family, our system detects 63% of cognate sets while maintaining cluster purity of 70%.

pdf abs
Morphological Analysis without Expert Annotation
Garrett Nicolai | Grzegorz Kondrak
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

The task of morphological analysis is to produce a complete list of lemma+tag analyses for a given word-form. We propose a discriminative string transduction approach which exploits plain inflection tables and raw text corpora, thus obviating the need for expert annotation. Experiments on four languages demonstrate that our system has much higher coverage than a hand-engineered FST analyzer, and is more accurate than a state-of-the-art morphological tagger.

pdf abs
Bootstrapping Unsupervised Bilingual Lexicon Induction
Bradley Hauer | Garrett Nicolai | Grzegorz Kondrak
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

The task of unsupervised lexicon induction is to find translation pairs across monolingual corpora. We develop a novel method that creates seed lexicons by identifying cognates in the vocabularies of related languages on the basis of their frequency and lexical similarity. We apply bidirectional bootstrapping to a method which learns a linear mapping between context-based vector spaces. Experimental results on three language pairs show consistent improvement over prior work.

pdf
If you can’t beat them, join them: the University of Alberta system description
Garrett Nicolai | Bradley Hauer | Mohammad Motallebi | Saeed Najafi | Grzegorz Kondrak
Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection

2016

pdf
Morphological Reinflection via Discriminative String Transduction
Garrett Nicolai | Bradley Hauer | Adam St Arnaud | Grzegorz Kondrak
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

pdf
Morphological Segmentation Can Improve Syllabification
Garrett Nicolai | Lei Yao | Grzegorz Kondrak
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

pdf abs
Decoding Anagrammed Texts Written in an Unknown Language and Script
Bradley Hauer | Grzegorz Kondrak
Transactions of the Association for Computational Linguistics, Volume 4

Algorithmic decipherment is a prime example of a truly unsupervised problem. The first step in the decipherment process is the identification of the encrypted language. We propose three methods for determining the source language of a document enciphered with a monoalphabetic substitution cipher. The best method achieves 97% accuracy on 380 languages. We then present an approach to decoding anagrammed substitution ciphers, in which the letters within words have been arbitrarily transposed. It obtains the average decryption word accuracy of 93% on a set of 50 ciphertexts in 5 languages. Finally, we report the results on the Voynich manuscript, an unsolved fifteenth century cipher, which suggest Hebrew as the language of the document.

pdf
Integrating Morphological Desegmentation into Phrase-based Decoding
Mohammad Salameh | Colin Cherry | Grzegorz Kondrak
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
Leveraging Inflection Tables for Stemming and Lemmatization.
Garrett Nicolai | Grzegorz Kondrak
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We evaluate several orthographic word similarity measures in the context of bitext word alignment. We investigate the relationship between the length of the words and the length of their longest common subsequence. We present an alternative to the longest common subsequence ratio (LCSR), a widely-used orthographic word similarity measure. Experiments involving identification of cognates in bitexts suggest that the alternative method outperforms LCSR. Our results also indicate that alignment links can be used as a substitute for cognates for the purpose of evaluating word similarity measures.

pdf
Computing Word Similarity and Identifying Cognates with Pair Hidden Markov Models
Wesley Mackay | Grzegorz Kondrak
Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL-2005)