Hinrich Schütze

Also published as: Hinrich Schuetze, Hinrich Schutze

2024

mPLM-Sim: Better Cross-Lingual Similarity and Transfer in Multilingual Pretrained Language Models
Peiqin Lin | Chengzhi Hu | Zheyu Zhang | Andre Martins | Hinrich Schuetze
Findings of the Association for Computational Linguistics: EACL 2024

Recent multilingual pretrained language models (mPLMs) have been shown to encode strong language-specific signals, which are not explicitly provided during pretraining. It remains an open question whether it is feasible to employ mPLMs to measure language similarity, and subsequently use the similarity results to select source languages for boosting cross-lingual transfer. To investigate this, we propose mPLM-Sim, a language similarity measure that induces the similarities across languages from mPLMs using multi-parallel corpora. Our study shows that mPLM-Sim exhibits moderately high correlations with linguistic similarity measures, such as lexicostatistics, genealogical language family, and geographical sprachbund. We also conduct a case study on languages with low correlation and observe that mPLM-Sim yields more accurate similarity results. Additionally, we find that similarity results vary across different mPLMs and different layers within an mPLM. We further investigate whether mPLM-Sim is effective for zero-shot cross-lingual transfer by conducting experiments on both low-level syntactic tasks and high-level semantic tasks. The experimental results demonstrate that mPLM-Sim is capable of selecting better source languages than linguistic measures, resulting in a 1%-2% improvement in zero-shot cross-lingual transfer performance.

OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining
Yihong Liu | Peiqin Lin | Mingyang Wang | Hinrich Schuetze
Findings of the Association for Computational Linguistics: NAACL 2024

Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining. However, this method usually randomly initializes the embeddings of new subwords and introduces substantially more embedding parameters to the model, thus weakening the efficiency. To address these issues, we propose a novel framework: One For All (OFA), which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively. OFA takes advantage of external well-aligned multilingual static word vectors and injects the alignment knowledge into the subword embeddings. In addition, OFA applies matrix factorization and replaces the cumbersome embeddings with two lower-dimensional matrices, which largely reduces the number of parameters. We show OFA accelerates the convergence of continued pretraining, which is environmentally friendly as much fewer carbon footprints are generated. Through extensive experiments, we demonstrate OFA can achieve competitive or better performance than default continued pretraining baselines on a wide range of crosslingual downstream tasks. We make our code and models publicly available.
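The parameter saving from replacing one embedding matrix with two lower-dimensional matrices can be illustrated with simple arithmetic. The sketch below is not taken from the OFA code; the vocabulary size, hidden size, and low dimension are hypothetical values chosen only to show the effect.

```python
def embedding_params(vocab_size: int, hidden_dim: int) -> int:
    """Parameters in a full embedding matrix of shape (vocab_size, hidden_dim)."""
    return vocab_size * hidden_dim


def factorized_params(vocab_size: int, hidden_dim: int, low_dim: int) -> int:
    """Parameters when the embedding is factorized into two matrices:
    (vocab_size, low_dim) and (low_dim, hidden_dim)."""
    return vocab_size * low_dim + low_dim * hidden_dim


if __name__ == "__main__":
    # Hypothetical sizes, for illustration only.
    V, D, d = 400_000, 768, 100
    full = embedding_params(V, D)
    small = factorized_params(V, D, d)
    print(f"full: {full:,}  factorized: {small:,}  ratio: {small / full:.3f}")
```

The factorized form is smaller whenever the low dimension is well below the hidden size, because the vocabulary-sized matrix dominates the count.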

Geographic Adaptation of Pretrained Language Models
Valentin Hofmann | Goran Glavaš | Nikola Ljubešić | Janet B. Pierrehumbert | Hinrich Schütze
Transactions of the Association for Computational Linguistics, Volume 12

While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone. Here, we contribute to closing this gap by examining geolinguistic knowledge, i.e., knowledge about geographic variation in language. We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup. We geoadapt four PLMs, covering language groups from three geographic areas, and evaluate them on five different tasks: fine-tuned (i.e., supervised) geolocation prediction, zero-shot (i.e., unsupervised) geolocation prediction, fine-tuned language identification, zero-shot language identification, and zero-shot prediction of dialect features. Geoadaptation is very successful at injecting geolinguistic knowledge into the PLMs: The geoadapted PLMs consistently outperform PLMs adapted using only language modeling (by especially wide margins on zero-shot prediction tasks), and we obtain new state-of-the-art results on two benchmarks for geolocation prediction and language identification. Furthermore, we show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the PLMs.

Kardeş-NLU: Transfer to Low-Resource Languages with the Help of a High-Resource Cousin – A Benchmark and Evaluation for Turkic Languages
Lütfi Kerem Senel | Benedikt Ebing | Konul Baghirova | Hinrich Schuetze | Goran Glavaš
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Cross-lingual transfer (XLT) driven by massively multilingual language models (mmLMs) has been shown largely ineffective for low-resource (LR) target languages with little (or no) representation in mmLM’s pretraining, especially if they are linguistically distant from the high-resource (HR) source language. Much of the recent focus in XLT research has been dedicated to LR language families, i.e., families without any HR languages (e.g., families of African languages or indigenous languages of the Americas). In this work, in contrast, we investigate a configuration that is arguably of practical relevance for more of the world’s languages: XLT to LR languages that do have a close HR relative. To explore the extent to which a HR language can facilitate transfer to its LR relatives, we (1) introduce Kardeş-NLU, an evaluation benchmark with language understanding datasets in five LR Turkic languages: Azerbaijani, Kazakh, Kyrgyz, Uzbek, and Uyghur; and (2) investigate (a) intermediate training and (b) fine-tuning strategies that leverage Turkish in XLT to these target languages. Our experimental results show that both - integrating Turkish in intermediate training and in downstream fine-tuning - yield substantial improvements in XLT to LR Turkic languages. Finally, we benchmark cutting-edge instruction-tuned large language models on Kardeş-NLU, showing that their performance is highly task- and language-dependent.

ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks
Bolei Ma | Ercong Nie | Shuzhou Yuan | Helmut Schmid | Michael Färber | Frauke Kreuter | Hinrich Schuetze
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)

Prompt-based methods have been successfully applied to multilingual pretrained language models for zero-shot cross-lingual understanding. However, most previous studies primarily focused on sentence-level classification tasks, and only a few considered token-level labeling tasks such as Named Entity Recognition (NER) and Part-of-Speech (POS) tagging. In this paper, we propose Token-Level Prompt Decomposition (ToPro), which facilitates the prompt-based method for token-level sequence labeling tasks. The ToPro method decomposes an input sentence into single tokens and applies one prompt template to each token. Our experiments on multilingual NER and POS tagging datasets demonstrate that ToPro-based fine-tuning outperforms Vanilla fine-tuning and Prompt-Tuning in zero-shot cross-lingual transfer, especially for languages that are typologically different from the source language English. Our method also attains state-of-the-art performance when employed with the mT5 model. Besides, our exploratory study in multilingual large language models shows that ToPro performs much better than the current in-context learning method. Overall, the performance improvements show that ToPro could potentially serve as a novel and simple benchmarking method for sequence labeling tasks.

Constructions Are So Difficult That Even Large Language Models Get Them Right for the Wrong Reasons
Shijia Zhou | Leonie Weissweiler | Taiqi He | Hinrich Schütze | David R. Mortensen | Lori Levin
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

In this paper, we make a contribution that can be understood from two perspectives: from an NLP perspective, we introduce a small challenge dataset for NLI with large lexical overlap, which minimises the possibility of models discerning entailment solely based on token distinctions, and show that GPT-4 and Llama 2 fail it with strong bias. We then create further challenging sub-tasks in an effort to explain this failure. From a Computational Linguistics perspective, we identify a group of constructions with three classes of adjectives which cannot be distinguished by surface features. This enables us to probe for LLM’s understanding of these constructions in various ways, and we find that they fail in a variety of ways to distinguish between them, suggesting that they don’t adequately represent their meaning or capture the lexical properties of phrasal heads.

GlotScript: A Resource and Tool for Low Resource Writing System Identification
Amir Hossein Kargaran | François Yvon | Hinrich Schütze
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it returns its script distribution where scripts are identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript can help cleaning multilingual corpora such as mC4 and OSCAR. Second, we analyze the tokenization of a number of language models such as GPT-4 using GlotScript and provide insights on the coverage of low resource scripts and languages by each language model. We hope that GlotScript will become a useful resource for work on low resource languages in the NLP community. GlotScript-R and GlotScript-T are available at https://github.com/cisnlp/GlotScript.

MaiBaam: A Multi-Dialectal Bavarian Universal Dependency Treebank
Verena Blaschke | Barbara Kovačić | Siyao Peng | Hinrich Schütze | Barbara Plank
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Despite the success of the Universal Dependencies (UD) project exemplified by its impressive language breadth, there is still a lack in ‘within-language breadth’: most treebanks focus on standard languages. Even for German, the language with the most annotations in UD, so far no treebank exists for one of its language varieties spoken by over 10M people: Bavarian. To contribute to closing this gap, we present the first multi-dialect Bavarian treebank (MaiBaam) manually annotated with part-of-speech and syntactic dependency information in UD, covering multiple text genres (wiki, fiction, grammar examples, social, non-fiction). We highlight the morphosyntactic differences between the closely-related Bavarian and German and showcase the rich variability of speakers’ orthographies. Our corpus includes 15k tokens, covering dialects from all Bavarian-speaking areas spanning three countries. We provide baseline parsing and POS tagging results, which are lower than results obtained on German and vary substantially between different graph-based parsers. To support further research on Bavarian syntax, we make our dataset, language-specific guidelines and code publicly available.

SilverAlign: MT-Based Silver Data Algorithm for Evaluating Word Alignment
Abdullatif Koksal | Silvia Severini | Hinrich Schütze
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Word alignments are essential for a variety of NLP tasks. Therefore, choosing the best approaches for their creation is crucial. However, the scarce availability of gold evaluation data makes the choice difficult. We propose SilverAlign, a new method to automatically create silver data for the evaluation of word aligners by exploiting machine translation and minimal pairs. We show that performance on our silver data correlates well with gold benchmarks for 9 language pairs, making our approach a valid resource for evaluation of different languages and domains when gold data is not available. This addresses the important scenario of missing gold data alignments for low-resource languages.

The Universal Dependencies (UD) project has created an invaluable collection of treebanks with contributions in over 140 languages. However, the UD annotations do not tell the full story. Grammatical constructions that convey meaning through a particular combination of several morphosyntactic elements—for example, interrogative sentences with special markers and/or word orders—are not labeled holistically. We argue for (i) augmenting UD annotations with a ‘UCxn’ annotation layer for such meaning-bearing grammatical constructions, and (ii) approaching this in a typologically informed way so that morphosyntactic strategies can be compared across languages. As a case study, we consider five construction families in ten languages, identifying instances of each construction in UD treebanks through the use of morphosyntactic patterns. In addition to findings regarding these particular constructions, our study yields important insights on methodology for describing and identifying constructions in language-general and language-particular ways, and lays the foundation for future constructional enrichment of UD treebanks.

Verbing Weirds Language (Models): Evaluation of English Zero-Derivation in Five LLMs
David R. Mortensen | Valentina Izrailevitch | Yunze Xiao | Hinrich Schütze | Leonie Weissweiler
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Lexical-syntactic flexibility, in the form of conversion (or zero-derivation), is a hallmark of English morphology. In conversion, a word with one part of speech is placed in a non-prototypical context, where it is coerced to behave as if it had a different part of speech. However, while this process affects a large part of the English lexicon, little work has been done to establish the degree to which language models capture this type of generalization. This paper reports the first study on the behavior of large language models with reference to conversion. We design a task for testing lexical-syntactic flexibility, that is, the degree to which models can generalize over words in a construction with a non-prototypical part of speech. This task is situated within a natural language inference paradigm. We test the abilities of five language models: two proprietary models (GPT-3.5 and GPT-4) and three open source models (Mistral 7B, Falcon 40B, and Llama 2 70B). We find that GPT-4 performs best on the task, followed by GPT-3.5, but that the open source language models are also able to perform it and that the 7-billion parameter Mistral displays as little difference between its baseline performance on the natural language inference task and the non-prototypical syntactic category task as the massive GPT-4.

Sociocultural knowledge is needed for selection of shots in hate speech detection tasks
Antonis Maronikolakis | Abdullatif Köksal | Hinrich Schuetze
Proceedings of the Fourth Workshop on Language Technology for Equality, Diversity, Inclusion

We introduce HATELEXICON, a lexicon of slurs and targets of hate speech for Brazil, Germany, India and Kenya, to aid model development and interpretability. First, we demonstrate how HATELEXICON can be used to interpret model predictions, showing that models developed to classify extreme speech rely heavily on target group names. Further, we propose a culturally-informed method to aid shot selection for training in low-resource settings. In few-shot learning, shot selection is of paramount importance to model performance and we need to ensure we make the most of available data. We work with HASOC German and Hindi data for training and the Multilingual HateCheck (MHC) benchmark for evaluation. We show that selecting shots based on our lexicon leads to models performing better than models trained on shots sampled randomly. Thus, when given only a few training examples, using HATELEXICON to select shots containing more sociocultural information leads to better few-shot performance. With these two use-cases we show how our HATELEXICON can be used for more effective hate speech detection.

Rehearsal-Free Modular and Compositional Continual Learning for Language Models
Mingyang Wang | Heike Adel | Lukas Lange | Jannik Strötgen | Hinrich Schuetze
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)

Continual learning aims at incrementally acquiring new knowledge while not forgetting existing knowledge. To overcome catastrophic forgetting, methods are either rehearsal-based, i.e., store data examples from previous tasks for data replay, or isolate parameters dedicated to each task. However, rehearsal-based methods raise privacy and memory issues, and parameter-isolation continual learning does not consider interaction between tasks, thus hindering knowledge transfer. In this work, we propose MoCL, a rehearsal-free **Mo**dular and **C**ompositional Continual **L**earning framework which continually adds new modules to language models and composes them with existing modules. Experiments on various benchmarks show that MoCL outperforms state of the art and effectively facilitates knowledge transfer.

2023

The NLP community has mainly focused on scaling Large Language Models (LLMs) vertically, i.e., making them better for about 100 languages. We instead scale LLMs horizontally: we create, through continued pretraining, Glot500-m, an LLM that covers 511 predominantly low-resource languages. An important part of this effort is to collect and clean Glot500-c, a corpus that covers these 511 languages and allows us to train Glot500-m. We evaluate Glot500-m on five diverse tasks across these languages. We observe large improvements for both high-resource and low-resource languages compared to an XLM-R baseline. Our analysis shows that no single factor explains the quality of multilingual LLM representations. Rather, a combination of factors determines quality including corpus size, script, “help” from related languages and the total capacity of the model. Our work addresses an important goal of NLP research: we should not limit NLP to a small fraction of the world’s languages and instead strive to support as many languages as possible to bring the benefits of NLP technology to all languages and cultures. Code, data and models are available at https://github.com/cisnlp/Glot500.

PVGRU: Generating Diverse and Relevant Dialogue Responses via Pseudo-Variational Mechanism
Yongkang Liu | Shi Feng | Daling Wang | Yifei Zhang | Hinrich Schütze
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We investigate response generation for multi-turn dialogue in generative chatbots. Existing generative models based on RNNs (Recurrent Neural Networks) usually employ the last hidden state to summarize the history, which makes models unable to capture the subtle variability observed in different dialogues and cannot distinguish the differences between dialogues that are similar in composition. In this paper, we propose Pseudo-Variational Gated Recurrent Unit (PVGRU). The key novelty of PVGRU is a recurrent summarizing variable that aggregates the accumulated distribution variations of subsequences. We train PVGRU without relying on posterior knowledge, thus avoiding the training-inference inconsistency problem. PVGRU can perceive subtle semantic variability through summarizing variables that are optimized by two objectives we employ for training: distribution consistency and reconstruction. In addition, we build a Pseudo-Variational Hierarchical Dialogue (PVHD) model based on PVGRU. Experimental results demonstrate that PVGRU can broadly improve the diversity and relevance of responses on two benchmark datasets.

Languages differ in how they divide up the world into concepts and words; e.g., in contrast to English, Swahili has a single concept for ‘belly’ and ‘womb’. We investigate these differences in conceptualization across 1,335 languages by aligning concepts in a parallel corpus. To this end, we propose Conceptualizer, a method that creates a bipartite directed alignment graph between source language concepts and sets of target language strings. In a detailed linguistic analysis across all languages for one concept (‘bird’) and an evaluation on gold standard data for 32 Swadesh concepts, we show that Conceptualizer has good alignment accuracy. We demonstrate the potential of research on conceptualization in NLP with two experiments. (1) We define crosslingual stability of a concept as the degree to which it has 1-1 correspondences across languages, and show that concreteness predicts stability. (2) We represent each language by its conceptualization pattern for 83 concepts, and define a similarity measure on these representations. The resulting measure for the conceptual similarity between two languages is complementary to standard genealogical, typological, and surface similarity measures. For four out of six language families, we can assign languages to their correct family based on conceptual similarity with accuracies between 54% and 87%.

How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives
Xinpeng Wang | Leonie Weissweiler | Hinrich Schütze | Barbara Plank
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Recently, various intermediate layer distillation (ILD) objectives have been shown to improve compression of BERT models via Knowledge Distillation (KD). However, a comprehensive evaluation of the objectives in both task-specific and task-agnostic settings is lacking. To the best of our knowledge, this is the first work comprehensively evaluating distillation objectives in both settings. We show that attention transfer gives the best performance overall. We also study the impact of layer choice when initializing the student from the teacher layers, finding a significant impact on the performance in task-specific distillation. For vanilla KD and hidden states transfer, initialisation with lower layers of the teacher gives a considerable improvement over higher layers, especially on the task of QNLI (up to an absolute percentage change of 17.8 in accuracy). Attention transfer behaves consistently under different initialisation settings. We release our code as an efficient transformer-based model distillation framework for further studies.

On the Copying Problem of Unsupervised NMT: A Training Schedule with a Language Discriminator Loss
Yihong Liu | Alexandra Chronopoulou | Hinrich Schütze | Alexander Fraser
Proceedings of the 20th International Conference on Spoken Language Translation (IWSLT 2023)

Although unsupervised neural machine translation (UNMT) has achieved success in many language pairs, the copying problem, i.e., directly copying some parts of the input sentence as the translation, is common among distant language pairs, especially when low-resource languages are involved. We find this issue is closely related to an unexpected copying behavior during online back-translation (BT). In this work, we propose a simple but effective training schedule that incorporates a language discriminator loss. The loss imposes constraints on the intermediate translation so that the translation is in the desired language. By conducting extensive experiments on different language pairs, including similar and distant, high and low-resource languages, we find that our method alleviates the copying problem, thus improving the translation performance on low-resource languages.

Does Manipulating Tokenization Aid Cross-Lingual Transfer? A Study on POS Tagging for Non-Standardized Languages
Verena Blaschke | Hinrich Schütze | Barbara Plank
Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023)

One of the challenges with finetuning pretrained language models (PLMs) is that their tokenizer is optimized for the language(s) it was pretrained on, but brittle when it comes to previously unseen variations in the data. This can for instance be observed when finetuning PLMs on one language and evaluating them on data in a closely related language variety with no standardized orthography. Despite the high linguistic similarity, tokenization no longer corresponds to meaningful representations of the target data, leading to low performance in, e.g., part-of-speech tagging. In this work, we finetune PLMs on seven languages from three different families and analyze their zero-shot performance on closely related, non-standardized varieties. We consider different measures for the divergence in the tokenization of the source and target data, and the way they can be adjusted by manipulating the tokenization during the finetuning step. Overall, we find that the similarity between the percentage of words that get split into subwords in the source and target data (the split word ratio difference) is the strongest predictor for model performance on target data.
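The split word ratio difference named in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the greedy longest-match tokenizer, the vocabulary, the example words, and the use of an absolute difference are all assumptions made for the sketch.

```python
from typing import Callable, Iterable, List, Set


def make_toy_tokenizer(vocab: Set[str]) -> Callable[[str], List[str]]:
    """Hypothetical greedy longest-match subword tokenizer.
    Characters not covered by the vocabulary become single-character pieces."""
    def tokenize(word: str) -> List[str]:
        pieces, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in vocab:
                    pieces.append(word[i:j])
                    i = j
                    break
            else:  # no vocabulary entry matched at position i
                pieces.append(word[i])
                i += 1
        return pieces
    return tokenize


def split_word_ratio(words: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Fraction of words that the tokenizer splits into more than one subword."""
    words = list(words)
    if not words:
        return 0.0
    return sum(len(tokenize(w)) > 1 for w in words) / len(words)


def split_word_ratio_difference(source, target, tokenize) -> float:
    """Absolute difference between the source and target split word ratios."""
    return abs(split_word_ratio(source, tokenize) - split_word_ratio(target, tokenize))


if __name__ == "__main__":
    tok = make_toy_tokenizer({"das", "haus", "ist", "klein"})
    source = ["das", "haus", "ist", "klein"]  # standardized spelling
    target = ["des", "haus", "is", "kloa"]    # made-up non-standard spelling
    print(split_word_ratio_difference(source, target, tok))
```

With a real pretrained tokenizer, only the `tokenize` callable would need to be swapped; the ratio computation stays the same.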

GradSim: Gradient-Based Language Grouping for Effective Multilingual Training
Mingyang Wang | Heike Adel | Lukas Lange | Jannik Strötgen | Hinrich Schuetze
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Most languages of the world pose low-resource challenges to natural language processing models. With multilingual training, knowledge can be shared among languages. However, not all languages positively influence each other and it is an open research question how to select the most suitable set of languages for multilingual training and avoid negative interference among languages whose characteristics or data distributions are not compatible. In this paper, we propose GradSim, a language grouping method based on gradient similarity. Our experiments on three diverse multilingual benchmark datasets show that it leads to the largest performance gains compared to other similarity measures and it is better correlated with cross-lingual model performance. As a result, we set the new state of the art on AfriSenti, a benchmark dataset for sentiment analysis on low-resource African languages. In our extensive analysis, we further reveal that besides linguistic features, the topics of the datasets play an important role for language grouping and that lower layers of transformer models encode language-specific features while higher layers capture task-specific information.

Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (iii) investigate syntax or semantics and overlook other capabilities that lie at the heart of human language, like morphology. Here, we close these gaps by conducting the first rigorous analysis of the morphological capabilities of ChatGPT in four typologically varied languages (specifically, English, German, Tamil, and Turkish). We apply a version of Berko’s (1958) wug test to ChatGPT, using novel, uncontaminated datasets for the four examined languages. We find that ChatGPT massively underperforms purpose-built systems, particularly in English. Overall, our results—through the lens of morphology—cast a new light on the linguistic capabilities of ChatGPT, suggesting that claims of human-like language skills are premature and misleading.

While large language models (LLMs) are proficient at question-answering (QA), it is not always clear how (or even if) an answer follows from their latent “beliefs”. This lack of interpretability is a growing impediment to widespread use of LLMs. To address this, our goals are to make model beliefs and their inferential relationships explicit, and to resolve inconsistencies that may exist, so that answers are supported by interpretable chains of reasoning drawn from a consistent network of beliefs. Our approach, which we call REFLEX, is to add a **rational, self-reflecting layer** on top of the LLM. First, given a question, we construct a **belief graph** using a backward-chaining process to materialize relevant model beliefs (including beliefs about answer candidates) and their inferential relationships. Second, we identify and minimize contradictions in that graph using a formal constraint reasoner. We find that REFLEX significantly improves consistency (by 8%-11% absolute) without harming overall answer accuracy, resulting in answers supported by faithful chains of reasoning drawn from a more consistent belief system. This suggests a new style of system architecture in which an LLM extended with a rational layer can provide an interpretable window into system beliefs, add a systematic reasoning capability, and repair latent inconsistencies present in the LLM.

LLM-driven Instruction Following: Progresses and Concerns
Wenpeng Yin | Qinyuan Ye | Pengfei Liu | Xiang Ren | Hinrich Schütze
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts

The progress of natural language processing (NLP) is primarily driven by machine learning that optimizes a system on a large-scale set of task-specific labeled examples. This learning paradigm limits the ability of machines to have the same capabilities as humans in handling new tasks since humans can often solve unseen tasks with a couple of examples accompanied by task instructions. In addition, we may not have a chance to prepare task-specific examples of large-volume for new tasks because we cannot foresee what task needs to be addressed next and how complex to annotate for it. Therefore, task instructions act as a novel and promising resource for supervision. This tutorial targets researchers and practitioners who are interested in AI and ML technologies for NLP generalization in a low-shot scenario. In particular, we will present a diverse thread of instruction-driven NLP studies that try to answer the following questions: (i) What is task instruction? (ii) How is the process of creating datasets and evaluating systems conducted? (iii) How to encode task instructions? (iv) When and why do some instructions work better? (v) What concerns remain in LLM-driven instruction following? We will discuss several lines of frontier research that tackle those challenges and will conclude the tutorial by outlining directions for further investigation.

Since conventional knowledge embedding models cannot take full advantage of the abundant textual information, there have been extensive research efforts in enhancing knowledge embedding using texts. However, existing enhancement approaches cannot apply to temporal knowledge graphs (tKGs), which contain time-dependent event knowledge with complex temporal dynamics. Specifically, existing enhancement approaches often assume knowledge embedding is time-independent. In contrast, the entity embedding in tKG models usually evolves, which poses the challenge of aligning temporally relevant texts with entities. To this end, we propose to study enhancing temporal knowledge embedding with textual data in this paper. As an approach to this task, we propose Enhanced Temporal Knowledge Embeddings with Contextualized Language Representations (ECOLA), which takes the temporal aspect into account and injects textual information into temporal knowledge embedding. To evaluate ECOLA, we introduce three new datasets for training and evaluating ECOLA. Extensive experiments show that ECOLA significantly enhances temporal KG embedding models with up to 287% relative improvements regarding Hits@1 on the link prediction task. The code and models are publicly available on https://github.com/mayhugotong/ECOLA.

pdf abs
Cross-Lingual Retrieval Augmented Prompt for Low-Resource Languages
Ercong Nie | Sheng Liang | Helmut Schmid | Hinrich Schütze
Findings of the Association for Computational Linguistics: ACL 2023

Multilingual Pretrained Language Models (MPLMs) perform strongly in cross-lingual transfer. We propose Prompts Augmented by Retrieval Crosslingually (PARC) to improve zero-shot performance on low-resource languages (LRLs) by augmenting the context with prompts consisting of semantically similar sentences retrieved from a high-resource language (HRL). PARC improves zero-shot performance on three downstream tasks (sentiment classification, topic categorization, natural language inference) with multilingual parallel test sets across 10 LRLs covering 6 language families in unlabeled (+5.1%) and labeled settings (+16.3%). PARC also outperforms finetuning by 3.7%. We find a significant positive correlation between cross-lingual transfer performance on one side, and the similarity between high- and low-resource languages as well as the amount of low-resource pretraining data on the other side. A robustness analysis suggests that PARC has the potential to achieve even stronger performance with more powerful MPLMs.

pdf abs
MEAL: Stable and Active Learning for Few-Shot Prompting
Abdullatif Köksal | Timo Schick | Hinrich Schuetze
Findings of the Association for Computational Linguistics: EMNLP 2023

Few-shot classification has made great strides due to foundation models that, through priming and prompting, are highly effective few-shot learners. However, this approach has high variance both across different sets of few shots (*data selection*) and across different finetuning runs (*run variability*). This is problematic not only because it impedes the fair comparison of different approaches, but especially because it makes few-shot learning too unreliable for many real-world applications. To alleviate these issues, we make two contributions for more stable and effective few-shot learning: First, we propose novel ensembling methods and show that they substantially reduce *run variability*. Second, we introduce a new active learning (AL) criterion for *data selection* and present the first AL-based approach specifically tailored towards prompt-based learning. In our experiments, we show that our combined method, MEAL (**M**ultiprompt finetuning and prediction **E**nsembling with **A**ctive **L**earning), improves overall performance of prompt-based finetuning by 2.3 points on five diverse tasks. We publicly share our code and data splits in https://github.com/akoksal/MEAL.

pdf abs
GlotLID: Language Identification for Low-Resource Languages
Amir Kargaran | Ayyoob Imani | François Yvon | Hinrich Schuetze
Findings of the Association for Computational Linguistics: EMNLP 2023

Several recent papers have published good solutions for language identification (LID) for about 300 high-resource and medium-resource languages. However, there is no LID available that (i) covers a wide range of low-resource languages, (ii) is rigorously evaluated and reliable and (iii) efficient and easy to use. Here, we publish GlotLID-M, an LID model that satisfies the desiderata of wide coverage, reliability and efficiency. It identifies 1665 languages, a large increase in coverage compared to prior work. In our experiments, GlotLID-M outperforms four baselines (CLD3, FT176, OpenLID and NLLB) when balancing F1 and false positive rate (FPR). We analyze the unique challenges that low-resource LID poses: incorrect corpus metadata, leakage from high-resource languages, difficulty separating closely related languages, handling of macrolanguage vs varieties and in general noisy data. We hope that integrating GlotLID-M into dataset creation pipelines will improve quality and enhance accessibility of NLP technology for low-resource languages and cultures. GlotLID-M model, code, and list of data sources are available: https://github.com/cisnlp/GlotLID.

pdf abs
Crosslingual Transfer Learning for Low-Resource Languages Based on Multilingual Colexification Graphs
Yihong Liu | Haotian Ye | Leonie Weissweiler | Renhao Pei | Hinrich Schuetze
Findings of the Association for Computational Linguistics: EMNLP 2023

In comparative linguistics, colexification refers to the phenomenon of a lexical form conveying two or more distinct meanings. Existing work on colexification patterns relies on annotated word lists, limiting scalability and usefulness in NLP. In contrast, we identify colexification patterns of more than 2,000 concepts across 1,335 languages directly from an unannotated parallel corpus. We then propose simple and effective methods to build multilingual graphs from the colexification patterns: ColexNet and ColexNet+. ColexNet’s nodes are concepts and its edges are colexifications. In ColexNet+, concept nodes are additionally linked through intermediate nodes, each representing an ngram in one of 1,334 languages. We use ColexNet+ to train $\overrightarrow{\text{ColexNet+}}$, high-quality multilingual embeddings that are well-suited for transfer learning. In our experiments, we first show that ColexNet achieves high recall on CLICS, a dataset of crosslingual colexifications. We then evaluate $\overrightarrow{\text{ColexNet+}}$ on roundtrip translation, sentence retrieval and sentence classification and show that our embeddings surpass several transfer learning baselines. This demonstrates the benefits of using colexification as a source of information in multilingual NLP.

Pretrained language models (PLMs) are key components in NLP, but they contain strong social biases. Quantifying these biases is challenging because current methods focusing on fill-the-mask objectives are sensitive to slight changes in input. To address this, we propose a bias probing technique called LABDet, for evaluating social bias in PLMs with a robust and language-agnostic method. For nationality as a case study, we show that LABDet “surfaces” nationality bias by training a classifier on top of a frozen PLM on non-nationality sentiment detection. We find consistent patterns of nationality bias across monolingual PLMs in six languages that align with historical and political context. We also show for English BERT that bias surfaced by LABDet correlates well with bias in the pretraining data; thus, our work is one of the few studies that directly links pretraining data to PLM behavior. Finally, we verify LABDet’s reliability and applicability to different templates and languages through an extensive set of robustness checks. We publicly share our code and dataset in https://github.com/akoksal/LABDet.

pdf abs
Unleashing the Multilingual Encoder Potential: Boosting Zero-Shot Performance via Probability Calibration
Ercong Nie | Helmut Schmid | Hinrich Schuetze
Findings of the Association for Computational Linguistics: EMNLP 2023

Pretrained multilingual encoder models can directly perform zero-shot multilingual tasks or linguistic probing by reformulating the input examples into cloze-style prompts. This is accomplished by predicting the probabilities of the label words at the masked token position, without requiring any updates to the model parameters. However, the performance of this method is limited by the model’s bias toward predicting label words which frequently occurred during the pretraining. These words typically receive high probabilities. To address this issue, we combine the models with calibration techniques which modify the probabilities of label words predicted by the models. We first validate the effectiveness of a proposed simple calibration method together with other existing techniques on monolingual encoders in both zero- and few-shot scenarios. We subsequently employ these calibration techniques on multilingual encoders, resulting in substantial performance improvements across a wide range of tasks.
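The calibration idea in this abstract can be illustrated with a minimal sketch. It assumes one simple form of calibration, dividing each label word's predicted probability by a prior estimated for that word and renormalizing. The abstract does not specify the method used in the paper, so the function name `calibrate` and all numbers below are hypothetical.

```python
def calibrate(probs, priors):
    """Divide each label-word probability by its prior, then renormalize.

    Assumption: this is one simple calibration scheme chosen for illustration,
    not necessarily the method proposed in the paper.
    """
    scaled = {word: probs[word] / priors[word] for word in probs}
    total = sum(scaled.values())
    return {word: value / total for word, value in scaled.items()}


# Hypothetical probabilities of two label words at the masked position.
probs = {"good": 0.6, "bad": 0.4}
# Hypothetical priors: the model tends to predict "good" regardless of input.
priors = {"good": 0.8, "bad": 0.2}

calibrated = calibrate(probs, priors)
prediction = max(calibrated, key=calibrated.get)
```

In this toy case the uncalibrated prediction would be "good", while the calibrated scores favor "bad" because the prior for "good" is inflated.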

pdf bib abs
Improving Few-Shot Learning with Multilingual Transfer and Monte Carlo Training Set Selection
Antonis Maronikolakis | Paul O’Grady | Hinrich Schütze | Matti Lyra
Proceedings of the 2023 CLASP Conference on Learning with Small Data (LSD)

In industry settings, machine learning is an attractive tool to automatize processes. Unfortunately, annotated and high-quality data is expensive to source. This problem is exacerbated in settings spanning multiple markets and languages. Thus, developing solutions for multilingual tasks with little available data is challenging. Few-shot learning is a compelling approach when building solutions in multilingual and low-resource settings, since the method not only requires just a few training examples to achieve high performance, but is also a technique agnostic to language. Even though the technique can be applied to multilingual settings, optimizing performance is an open question. In our work we show that leveraging higher-resource, task-specific language data can boost overall performance and we propose a method to select training examples per their average performance in a Monte Carlo simulation, resulting in a training set more conducive to learning. We demonstrate the effectiveness of our methods in fashion text reviews moderation, classifying reviews as related or unrelated to the given product. We show that our methodology boosts performance in multilingual (English, French, German) settings, increasing F1 score and significantly decreasing false positives.
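Selecting training examples by their average performance in a Monte Carlo simulation could be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: the function `monte_carlo_select`, the `evaluate` callback, and the toy data are all hypothetical. Random subsets are drawn, each subset is scored, and every example is credited with the mean score of the subsets it appeared in.

```python
import random


def monte_carlo_select(examples, evaluate, subset_size, n_trials, k, seed=0):
    """Return the k examples with the highest average subset score."""
    rng = random.Random(seed)
    totals = {e: 0.0 for e in examples}
    counts = {e: 0 for e in examples}
    for _ in range(n_trials):
        subset = rng.sample(examples, subset_size)
        score = evaluate(subset)  # e.g. dev-set F1 of a model trained on subset
        for e in subset:
            totals[e] += score
            counts[e] += 1
    # Average score per example, ignoring examples that were never sampled.
    averages = {e: totals[e] / counts[e] for e in examples if counts[e] > 0}
    return sorted(averages, key=averages.get, reverse=True)[:k]


# Toy stand-in for training and evaluation: subsets score higher the more
# "helpful" examples they contain (hypothetical ground truth).
helpful = {"a", "b"}
examples = ["a", "b", "c", "d", "e", "f"]
selected = monte_carlo_select(
    examples,
    evaluate=lambda subset: sum(x in helpful for x in subset) / len(subset),
    subset_size=3,
    n_trials=500,
    k=2,
)
```

In a real setting `evaluate` would train a few-shot model on the subset and return a validation metric, which is far more expensive than the toy scoring function used here.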

pdf abs
A Survey of Corpora for Germanic Low-Resource Languages and Dialects
Verena Blaschke | Hinrich Schuetze | Barbara Plank
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Despite much progress in recent years, the vast majority of work in natural language processing (NLP) is on standard languages with many speakers. In this work, we instead focus on low-resource languages and in particular non-standardized low-resource languages. Even within branches of major language families, often considered well-researched, little is known about the extent and type of available resources and what the major NLP challenges are for these language varieties. The first step to address this situation is a systematic survey of available corpora (most importantly, annotated corpora, which are particularly valuable for NLP research). Focusing on Germanic low-resource language varieties, we provide such a survey in this paper. Except for geolocation (origin of speaker or document), we find that manually annotated linguistic resources are sparse and, if they exist, mostly cover morphosyntax. Despite this lack of resources, we observe that interest in this area is increasing: there is active development and a growing research community. To facilitate research, we make our overview of over 80 corpora publicly available.

pdf abs
Cross-Lingual Constituency Parsing for Middle High German: A Delexicalized Approach
Ercong Nie | Helmut Schmid | Hinrich Schütze
Proceedings of the Ancient Language Processing Workshop

Constituency parsing plays a fundamental role in advancing natural language processing (NLP) tasks. However, training an automatic syntactic analysis system for ancient languages solely relying on annotated parse data is a formidable task due to the inherent challenges in building treebanks for such languages. It demands extensive linguistic expertise, leading to a scarcity of available resources. To overcome this hurdle, cross-lingual transfer techniques which require minimal or even no annotated data for low-resource target languages offer a promising solution. In this study, we focus on building a constituency parser for Middle High German (MHG) under realistic conditions, where no annotated MHG treebank is available for training. In our approach, we leverage the linguistic continuity and structural similarity between MHG and Modern German (MG), along with the abundance of MG treebank resources. Specifically, by employing the delexicalization method, we train a constituency parser on MG parse datasets and perform cross-lingual transfer to MHG parsing. Our delexicalized constituency parser demonstrates remarkable performance on the MHG test set, achieving an F1-score of 67.3%. It outperforms the best zero-shot cross-lingual baseline by a margin of 28.6 percentage points. The encouraging results underscore the practicality and potential for automatic syntactic analysis in other ancient languages that face similar challenges as MHG.
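Delexicalization, as used above, can be illustrated with a minimal sketch: word forms are replaced by their part-of-speech tags, so a parser trained on one language sees the same input symbols in a related language. The exact scheme used in the paper is not given in the abstract, and the example sentences and tags below are illustrative assumptions.

```python
def delexicalize(tagged_sentence):
    """Replace each (word, pos_tag) pair by its POS tag only."""
    return [pos for _word, pos in tagged_sentence]


# Illustrative Modern German and Middle High German sentences with
# hypothetical POS tags.
modern_german = [("der", "DET"), ("Hund", "NOUN"), ("schläft", "VERB")]
middle_high_german = [("der", "DET"), ("hunt", "NOUN"), ("slâfet", "VERB")]

mg_input = delexicalize(modern_german)
mhg_input = delexicalize(middle_high_german)
```

Although the word forms differ, both sentences reduce to the same tag sequence, which is why a parser trained on delexicalized Modern German trees can be applied to Middle High German without annotated MHG data.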

pdf abs
Construction Grammar Provides Unique Insight into Neural Language Models
Leonie Weissweiler | Taiqi He | Naoki Otani | David R. Mortensen | Lori Levin | Hinrich Schütze
Proceedings of the First International Workshop on Construction Grammars and NLP (CxGs+NLP, GURT/SyntaxFest 2023)

Construction Grammar (CxG) has recently been used as the basis for probing studies that have investigated the performance of large pretrained language models (PLMs) with respect to the structure and meaning of constructions. In this position paper, we make suggestions for the continuation and augmentation of this line of research. We look at probing methodology that was not designed with CxG in mind, as well as probing methodology that was designed for specific constructions. We analyse selected previous work in detail, and provide our view of the most important challenges and research questions that this promising new field faces.

pdf abs
NLNDE at SemEval-2023 Task 12: Adaptive Pretraining and Source Language Selection for Low-Resource Multilingual Sentiment Analysis
Mingyang Wang | Heike Adel | Lukas Lange | Jannik Strötgen | Hinrich Schütze
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

This paper describes our system developed for the SemEval-2023 Task 12 “Sentiment Analysis for Low-resource African Languages using Twitter Dataset”. Sentiment analysis is one of the most widely studied applications in natural language processing. However, most prior work still focuses on a small number of high-resource languages. Building reliable sentiment analysis systems for low-resource languages remains challenging, due to the limited training data in this task. In this work, we propose to leverage language-adaptive and task-adaptive pretraining on African texts and study transfer learning with source language selection on top of an African language-centric pretrained language model. Our key findings are: (1) Adapting the pretrained model to the target language and task using a small yet relevant corpus improves performance remarkably by more than 10 F1 score points. (2) Selecting source languages with positive transfer gains during training can avoid harmful interference from dissimilar languages, leading to better results in multilingual and cross-lingual settings. In the shared task, our system wins 8 out of 15 tracks and, in particular, performs best in the multilingual evaluation.

pdf bib
Is Prompt-Based Finetuning Always Better than Vanilla Finetuning? Insights from Cross-Lingual Language Understanding
Bolei Ma | Ercong Nie | Helmut Schmid | Hinrich Schuetze
Proceedings of the 19th Conference on Natural Language Processing (KONVENS 2023)

pdf
Multilingual Word Embeddings for Low-Resource Languages using Anchors and a Chain of Related Languages
Viktor Hangya | Silvia Severini | Radoslav Ralev | Alexander Fraser | Hinrich Schütze
Proceedings of the 3rd Workshop on Multi-lingual Representation Learning (MRL)

2022

pdf bib abs
Separating Hate Speech and Offensive Language Classes via Adversarial Debiasing
Shuzhou Yuan | Antonis Maronikolakis | Hinrich Schütze
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)

Research to tackle hate speech plaguing online media has made strides in providing solutions, analyzing bias and curating data. A challenging problem is ambiguity between hate speech and offensive language, causing low performance both overall and specifically for the hate speech class. It can be argued that misclassifying actual hate speech content as merely offensive can lead to further harm against targeted groups. In our work, we mitigate this potentially harmful phenomenon by proposing an adversarial debiasing method to separate the two classes. We show that our method works for English, Arabic, German and Hindi, plus in a multilingual setting, improving performance over baselines.

pdf abs
True Few-Shot Learning with Prompts—A Real-World Perspective
Timo Schick | Hinrich Schütze
Transactions of the Association for Computational Linguistics, Volume 10

Prompt-based approaches excel at few-shot learning. However, Perez et al. (2021) recently cast doubt on their performance as they had difficulty getting good results in a “true” few-shot setting in which prompts and hyperparameters cannot be tuned on a dev set. In view of this, we conduct an extensive study of PET, a method that combines textual instructions with example-based finetuning. We show that, if correctly configured, PET performs strongly in true few-shot settings without a dev set. Crucial for this strong performance is a number of design choices, including PET’s ability to intelligently handle multiple prompts. We put our findings to a real-world test by running PET on RAFT, a benchmark of tasks taken from realistic NLP applications for which no labeled dev or test sets are available. PET achieves a new state of the art on RAFT and performs close to non-expert humans for 7 out of 11 tasks. These results demonstrate that prompt-based learners can successfully be applied in true few-shot settings and underpin our belief that learning from instructions will play an important role on the path towards human-like few-shot learning capabilities.

pdf abs
Hengam: An Adversarially Trained Transformer for Persian Temporal Tagging
Sajad Mirzababaei | Amir Hossein Kargaran | Hinrich Schütze | Ehsaneddin Asgari
Proceedings of the 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Many NLP main tasks benefit from an accurate understanding of temporal expressions, e.g., text summarization, question answering, and information retrieval. This paper introduces Hengam, an adversarially trained transformer for Persian temporal tagging outperforming state-of-the-art approaches on a diverse and manually created dataset. We create Hengam in the following concrete steps: (1) we develop HengamTagger, an extensible rule-based tool that can extract temporal expressions from a set of diverse language-specific patterns for any language of interest. (2) We apply HengamTagger to annotate temporal tags in a large and diverse Persian text collection (covering both formal and informal contexts) to be used as weakly labeled data. (3) We introduce an adversarially trained transformer model on HengamCorpus that can generalize over the HengamTagger’s rules. We create HengamGold, the first high-quality gold standard for Persian temporal tagging. Our trained adversarial HengamTransformer not only achieves the best performance in terms of the F1-score (a type F1-Score of 95.42 and a partial F1-Score of 91.60) but also successfully deals with language ambiguities and incorrect spellings. Our code, data, and models are publicly available at https://github.com/kargaranamir/Hengam.

pdf abs
Listening to Affected Communities to Define Extreme Speech: Dataset and Experiments
Antonis Maronikolakis | Axel Wisiorek | Leah Nann | Haris Jabbar | Sahana Udupa | Hinrich Schuetze
Findings of the Association for Computational Linguistics: ACL 2022

Building on current work on multilingual hate speech (e.g., Ousidhoum et al. (2019)) and hate speech reduction (e.g., Sap et al. (2020)), we present XTREMESPEECH, a new hate speech dataset containing 20,297 social media passages from Brazil, Germany, India and Kenya. The key novelty is that we directly involve the affected communities in collecting and annotating the data – as opposed to giving companies and governments control over defining and combatting hate speech. This inclusive approach results in datasets more representative of actually occurring online speech and is likely to facilitate the removal of the social media content that marginalized communities view as causing the most harm. Based on XTREMESPEECH, we establish novel tasks with accompanying baselines, provide evidence that cross-country training is generally not feasible due to cultural differences between countries and perform an interpretability analysis of BERT’s predictions.

pdf abs
Graph Neural Networks for Multiparallel Word Alignment
Ayyoob Imani | Lütfi Kerem Senel | Masoud Jalili Sabet | François Yvon | Hinrich Schuetze
Findings of the Association for Computational Linguistics: ACL 2022

After a period of decrease, interest in word alignments is increasing again for their usefulness in domains such as typological research, cross-lingual annotation projection and machine translation. Generally, alignment algorithms only use bitext and do not make use of the fact that many parallel corpora are multiparallel. Here, we compute high-quality word alignments between multiple language pairs by considering all language pairs together. First, we create a multiparallel word alignment graph, joining all bilingual word alignment pairs in one graph. Next, we use graph neural networks (GNNs) to exploit the graph structure. Our GNN approach (i) utilizes information about the meaning, position and language of the input words, (ii) incorporates information from multiple parallel sentences, (iii) adds and removes edges from the initial alignments, and (iv) yields a prediction model that can generalize beyond the training sentences. We show that community detection algorithms can provide valuable information for multiparallel word alignment. Our method outperforms previous work on three word alignment datasets and on a downstream task.

pdf abs
Modular and Parameter-Efficient Multimodal Fusion with Prompting
Sheng Liang | Mengjie Zhao | Hinrich Schuetze
Findings of the Association for Computational Linguistics: ACL 2022

Recent research has made impressive progress in large-scale multimodal pre-training. In the context of the rapid growth of model size, it is necessary to seek efficient and flexible methods other than finetuning. In this paper, we propose to use prompt vectors to align the modalities. Our method achieves comparable performance to several other multimodal fusion methods in low-resource settings. We further show that our method is modular and parameter-efficient for processing tasks involving two or more data modalities.

pdf abs
Modeling Ideological Salience and Framing in Polarized Online Groups with Graph Neural Networks and Structured Sparsity
Valentin Hofmann | Xiaowen Dong | Janet Pierrehumbert | Hinrich Schuetze
Findings of the Association for Computational Linguistics: NAACL 2022

The increasing polarization of online political discourse calls for computational tools that automatically detect and monitor ideological divides in social media. We introduce a minimally supervised method that leverages the network structure of online discussion forums, specifically Reddit, to detect polarized concepts. We model polarization along the dimensions of salience and framing, drawing upon insights from moral psychology. Our architecture combines graph neural networks with structured sparsity learning and results in representations for concepts and subreddits that capture temporal ideological dynamics such as right-wing and left-wing radicalization.

Vast efforts have been devoted to creating high-performance few-shot learners, i.e., large-scale pretrained language models (PLMs) that perform well with little downstream task training data. Training PLMs has incurred significant cost, but utilizing the few-shot learners is still challenging due to their enormous size. This work focuses on a crucial question: How to make effective use of these few-shot learners? We propose LMTurk, a novel approach that treats few-shot learners as crowdsourcing workers. The rationale is that crowdsourcing workers are in fact few-shot learners: They are shown a few illustrative examples to learn about a task and then start annotating. LMTurk employs few-shot learners built upon PLMs as workers. We show that the resulting annotations can be utilized to train models that solve the task well and are small enough to be deployable in practical scenarios. Active learning is integrated into LMTurk to reduce the number of queries made to PLMs, minimizing the computational cost of running PLM inference passes. Altogether, LMTurk is an important step towards making effective use of current PLMs.

pdf abs
An Information-Theoretic Approach and Dataset for Probing Gender Stereotypes in Multilingual Masked Language Models
Victor Steinborn | Philipp Dufter | Haris Jabbar | Hinrich Schuetze
Findings of the Association for Computational Linguistics: NAACL 2022

Bias research in NLP is a rapidly growing and developing field. Similar to CrowS-Pairs (Nangia et al., 2020), we assess gender bias in masked-language models (MLMs) by studying pairs of sentences with gender-swapped person references. Most bias research focuses on and often is specific to English. Using a novel methodology for creating sentence pairs that is applicable across languages, we create, based on CrowS-Pairs, a multilingual dataset for English, Finnish, German, Indonesian and Thai. Additionally, we propose S_JSD, a new bias measure based on Jensen–Shannon divergence, which we argue retains more information from the model output probabilities than other previously proposed bias measures for MLMs. Using multilingual MLMs, we find that S_JSD diagnoses the same systematic biased behavior for non-English that previous studies have found for monolingual English pre-trained MLMs. S_JSD outperforms the CrowS-Pairs measure, which struggles to find such biases for smaller non-English datasets.
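For readers unfamiliar with the Jensen–Shannon divergence underlying S_JSD, here is a minimal sketch of the standard divergence between two discrete distributions. It does not reproduce the S_JSD measure itself, whose definition is not given in the abstract. The function names and the example distributions are assumptions for illustration.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical output distributions of an MLM for a gender-swapped pair.
p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
divergence = js_divergence(p, q)
```

Unlike the KL divergence, this quantity is symmetric and, with base-2 logarithms, bounded between 0 (identical distributions) and 1 (disjoint support).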

pdf abs
Federated Continual Learning for Text Classification via Selective Inter-client Transfer
Yatin Chaudhary | Pranav Rai | Matthias Schubert | Hinrich Schütze | Pankaj Gupta
Findings of the Association for Computational Linguistics: EMNLP 2022

In this work, we combine two paradigms, Federated Learning (FL) and Continual Learning (CL), for the text classification task in the cloud-edge continuum. The objective of Federated Continual Learning (FCL) is to improve deep learning models over their lifetime at each client by (relevant and efficient) knowledge transfer without sharing data. Here, we address the challenge of minimizing inter-client interference during knowledge sharing, which arises from heterogeneous tasks across clients in the FCL setup. In doing so, we propose a novel framework, Federated Selective Inter-client Transfer (FedSeIT), which selectively combines model parameters of foreign clients. To further maximize knowledge transfer, we assess domain overlap and select informative tasks from the sequence of historical tasks at each foreign client while preserving privacy. Evaluating against the baselines, we show improved performance, an average gain of 12.4% in text classification over a sequence of tasks using five datasets from diverse domains. To the best of our knowledge, this is the first work that applies FCL to NLP.

pdf abs
Position Information in Transformers: An Overview
Philipp Dufter | Martin Schmitt | Hinrich Schütze
Computational Linguistics, Volume 48, Issue 3 - September 2022

Transformers are arguably the main workhorse in recent natural language processing research. By definition, a Transformer is invariant with respect to reordering of the input. However, language is inherently sequential and word order is essential to the semantics and syntax of an utterance. In this article, we provide an overview and theoretical comparison of existing methods to incorporate position information into Transformer models. The objectives of this survey are to (1) showcase that position information in Transformer is a vibrant and extensive research area; (2) enable the reader to compare existing methods by providing a unified notation and systematization of different approaches along important model dimensions; (3) indicate what characteristics of an application should be taken into account when selecting a position encoding; and (4) provide stimuli for future research.

pdf abs
Don’t Forget Cheap Training Signals Before Building Unsupervised Bilingual Word Embeddings
Silvia Severini | Viktor Hangya | Masoud Jalili Sabet | Alexander Fraser | Hinrich Schütze
Proceedings of the BUCC Workshop within LREC 2022

Bilingual Word Embeddings (BWEs) are one of the cornerstones of cross-lingual transfer of NLP models. They can be built using only monolingual corpora without supervision leading to numerous works focusing on unsupervised BWEs. However, most of the current approaches to build unsupervised BWEs do not compare their results with methods based on easy-to-access cross-lingual signals. In this paper, we argue that such signals should always be considered when developing unsupervised BWE methods. The two approaches we find most effective are: 1) using identical words as seed lexicons (which unsupervised approaches incorrectly assume are not available for orthographically distinct language pairs) and 2) combining such lexicons with pairs extracted by matching romanized versions of words with an edit distance threshold. We experiment on thirteen non-Latin languages (and English) and show that such cheap signals work well and that they outperform using more complex unsupervised methods on distant language pairs such as Chinese, Japanese, Kannada, Tamil, and Thai. In addition, they are even competitive with the use of high-quality lexicons in supervised approaches. Our results show that these training signals should not be neglected when building BWEs, even for distant languages.
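The second signal described above, matching romanized words under an edit distance threshold, could be sketched as follows. This is a rough illustration under assumptions: romanization is taken as already done, the helper names `edit_distance` and `match_pairs` are hypothetical, and the threshold and example words are chosen only for demonstration.

```python
def edit_distance(a, b):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def match_pairs(source_words, target_romanized, max_dist):
    """Pair source words with target words whose romanization is close."""
    pairs = []
    for src in source_words:
        for tgt, roman in target_romanized.items():
            if edit_distance(src.lower(), roman.lower()) <= max_dist:
                pairs.append((src, tgt))
    return pairs


# Hypothetical target vocabulary mapped to an assumed romanization.
target_romanized = {"лондон": "london", "компьютер": "kompyuter"}
pairs = match_pairs(["london", "computer"], target_romanized, max_dist=2)
```

The resulting pairs can serve as a seed lexicon; a tighter threshold gives fewer but cleaner pairs, while a looser one increases coverage at the cost of noise.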

pdf bib abs
Analyzing Hate Speech Data along Racial, Gender and Intersectional Axes
Antonis Maronikolakis | Philip Baader | Hinrich Schütze
Proceedings of the 4th Workshop on Gender Bias in Natural Language Processing (GeBNLP)

To tackle the rising phenomenon of hate speech, efforts have been made towards data curation and analysis. When it comes to analysis of bias, previous work has focused predominantly on race. In our work, we further investigate bias in hate speech datasets along racial, gender and intersectional axes. We identify strong bias against African American English (AAE), masculine and AAE+Masculine tweets, which are annotated as disproportionately more hateful and offensive than from other demographics. We provide evidence that BERT-based models propagate this bias and show that balancing the training data for these protected attributes can lead to fairer models with regards to gender, but not race.

pdf abs
Graph-Based Multilingual Label Propagation for Low-Resource Part-of-Speech Tagging
Ayyoob ImaniGooghari | Silvia Severini | Masoud Jalili Sabet | François Yvon | Hinrich Schütze
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Part-of-Speech (POS) tagging is an important component of the NLP pipeline, but many low-resource languages lack labeled data for training. An established method for training a POS tagger in such a scenario is to create a labeled training set by transferring from high-resource languages. In this paper, we propose a novel method for transferring labels from multiple high-resource source languages to low-resource target languages. We formalize POS tag projection as graph-based label propagation. Given translations of a sentence in multiple languages, we create a graph with words as nodes and alignment links as edges by aligning words for all language pairs. We then propagate node labels from source to target using a Graph Neural Network augmented with transformer layers. We show that our propagation creates training sets that allow us to train POS taggers for a diverse set of languages. When combined with enhanced contextualized embeddings, our method achieves a new state of the art for unsupervised POS tagging of low-resource languages.

pdf abs
The better your Syntax, the better your Semantics? Probing Pretrained Language Models for the English Comparative Correlative
Leonie Weissweiler | Valentin Hofmann | Abdullatif Köksal | Hinrich Schütze
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Construction Grammar (CxG) is a paradigm from cognitive linguistics emphasising the connection between syntax and semantics. Rather than rules that operate on lexical items, it posits constructions as the central building blocks of language, i.e., linguistic units of different granularity that combine syntax and semantics. As a first step towards assessing the compatibility of CxG with the syntactic and semantic knowledge demonstrated by state-of-the-art pretrained language models (PLMs), we present an investigation of their capability to classify and understand one of the most commonly studied constructions, the English comparative correlative (CC). We conduct experiments examining the classification accuracy of a syntactic probe on the one hand and the models’ behaviour in a semantic application task on the other, with BERT, RoBERTa, and DeBERTa as the example PLMs. Our results show that all three investigated PLMs are able to recognise the structure of the CC but fail to use its meaning. While human-like performance of PLMs on many NLP tasks has been alleged, this indicates that PLMs still suffer from substantial shortcomings in central domains of linguistic knowledge.

pdf abs
Towards a Broad Coverage Named Entity Resource: A Data-Efficient Approach for Many Diverse Languages
Silvia Severini | Ayyoob ImaniGooghari | Philipp Dufter | Hinrich Schütze
Proceedings of the Thirteenth Language Resources and Evaluation Conference

Parallel corpora are ideal for extracting a multilingual named entity (MNE) resource, i.e., a dataset of names translated into multiple languages. Prior work on extracting MNE datasets from parallel corpora required resources such as large monolingual corpora or word aligners that are unavailable or perform poorly for underresourced languages. We present CLC-BN, a new method for creating an MNE resource, and apply it to the Parallel Bible Corpus, a corpus of more than 1000 languages. CLC-BN learns a neural transliteration model from parallel-corpus statistics, without requiring any other bilingual resources, word aligners, or seed data. Experimental results show that CLC-BN clearly outperforms prior work. We release an MNE resource for 1340 languages and demonstrate its effectiveness in two downstream tasks: knowledge graph augmentation and bilingual lexicon induction.

pdf abs
Flow-Adapter Architecture for Unsupervised Machine Translation
Yihong Liu | Haris Jabbar | Hinrich Schuetze
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this work, we propose a flow-adapter architecture for unsupervised NMT. It leverages normalizing flows to explicitly model the distributions of sentence-level latent representations, which are subsequently used in conjunction with the attention mechanism for the translation task. The primary novelties of our model are: (a) capturing language-specific sentence representations separately for each language using normalizing flows and (b) using a simple transformation of these latent representations for translating from one language to another. This architecture allows for unsupervised training of each language independently. While there is prior work on latent variables for supervised MT, to the best of our knowledge, this is the first work that uses latent variables and normalizing flows for unsupervised MT. We obtain competitive results on several unsupervised MT benchmarks.

pdf abs
Differentiable Multi-Agent Actor-Critic for Multi-Step Radiology Report Summarization
Sanjeev Kumar Karn | Ning Liu | Hinrich Schuetze | Oladimeji Farri
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The IMPRESSIONS section of a radiology report about an imaging study is a summary of the radiologist’s reasoning and conclusions, and it also aids the referring physician in confirming or excluding certain diagnoses. A cascade of tasks is required to automatically generate an abstractive summary of the typical information-rich radiology report. These tasks include acquisition of salient content from the report and generation of a concise, easily consumable IMPRESSIONS section. Prior research on radiology report summarization has focused on single-step end-to-end models – which subsume the task of salient content acquisition. To fully explore the cascade structure and explainability of radiology report summarization, we introduce two innovations. First, we design a two-step approach: extractive summarization followed by abstractive summarization. Second, we additionally break down the extractive part into two independent tasks: extraction of salient (1) sentences and (2) keywords. Experiments on English radiology reports from two clinical sites show our novel approach leads to a more precise summary compared to single-step and to two-step-with-single-extractive-process baselines with an overall improvement in F1 score of 3-4%.

pdf abs
CaMEL: Case Marker Extraction without Labels
Leonie Weissweiler | Valentin Hofmann | Masoud Jalili Sabet | Hinrich Schuetze
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce CaMEL (Case Marker Extraction without Labels), a novel and challenging task in computational morphology that is especially relevant for low-resource languages. We propose a first model for CaMEL that uses a massively multilingual corpus to extract case markers in 83 languages based only on a noun phrase chunker and an alignment system. To evaluate CaMEL, we automatically construct a silver standard from UniMorph. The case markers extracted by our model can be used to detect and visualise similarities and differences between the case systems of different languages as well as to annotate fine-grained deep cases in languages in which they are not overtly marked.

pdf abs
An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers
Valentin Hofmann | Hinrich Schuetze | Janet Pierrehumbert
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We introduce FLOTA (Few Longest Token Approximation), a simple yet effective method to improve the tokenization of pretrained language models (PLMs). FLOTA uses the vocabulary of a standard tokenizer but tries to preserve the morphological structure of words during tokenization. We evaluate FLOTA on morphological gold segmentations as well as a text classification task, using BERT, GPT-2, and XLNet as example PLMs. FLOTA leads to performance gains, makes inference more efficient, and enhances the robustness of PLMs with respect to whitespace noise.
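
The idea of keeping a few longest vocabulary substrings can be illustrated with a small sketch. This is a simplified reading of the abstract, not the authors' implementation: the toy vocabulary, the function name `flota_tokenize`, and the parameter `k` are assumptions, continuation prefixes such as "##" are ignored, and input words are assumed to contain no hyphens.

```python
def flota_tokenize(word, vocab, k=3):
    """Greedy sketch: keep up to k longest vocabulary substrings of `word`."""
    found = {}  # start index -> token

    def search(text, budget):
        # Stop when the budget is used up or everything is already covered.
        if budget == 0 or not text.strip("-"):
            return
        # Try substrings from longest to shortest, leftmost first.
        for length in range(len(text), 0, -1):
            for start in range(len(text) - length + 1):
                piece = text[start:start + length]
                if "-" not in piece and piece in vocab:
                    found[start] = piece
                    # Mask the matched span so it cannot be reused.
                    masked = text[:start] + "-" * length + text[start + length:]
                    search(masked, budget - 1)
                    return

    search(word, k)
    # Return the tokens in their original left-to-right order.
    return [found[i] for i in sorted(found)]


toy_vocab = {"un", "desir", "able", "und", "es", "ira", "ble"}  # hypothetical vocabulary
tokens = flota_tokenize("undesirable", toy_vocab, k=3)
```

With this toy vocabulary, the longest match "desir" is found first, so the word is split along its morphological boundaries rather than left to right.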

pdf abs
CoDA21: Evaluating Language Understanding Capabilities of NLP Models With Context-Definition Alignment
Lütfi Kerem Senel | Timo Schick | Hinrich Schuetze
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Pretrained language models (PLMs) have achieved superhuman performance on many benchmarks, creating a need for harder tasks. We introduce CoDA21 (Context Definition Alignment), a challenging benchmark that measures natural language understanding (NLU) capabilities of PLMs: Given a definition and a context each for k words, but not the words themselves, the task is to align the k definitions with the k contexts. CoDA21 requires a deep understanding of contexts and definitions, including complex inference and world knowledge. We find that there is a large gap between human and PLM performance, suggesting that CoDA21 measures an aspect of NLU that is not sufficiently covered in existing benchmarks.
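
Aligning k definitions with k contexts can be viewed as an assignment problem. The sketch below assumes that a k-by-k matrix of match scores is already available (how a PLM produces these scores is not shown) and simply searches all permutations, which is only practical for small k. The name `best_alignment` and the example scores are hypothetical.

```python
from itertools import permutations


def best_alignment(scores):
    """Return the permutation p that maximizes sum(scores[i][p[i]]).

    scores[i][j] is an assumed match score between definition i and context j.
    """
    k = len(scores)
    return max(
        permutations(range(k)),
        key=lambda p: sum(scores[i][p[i]] for i in range(k)),
    )


example_scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.7, 0.2],
]
alignment = best_alignment(example_scores)  # definition i -> context alignment[i]
```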

2021

pdf bib abs
Identifying Automatically Generated Headlines using Transformers
Antonis Maronikolakis | Hinrich Schütze | Mark Stevenson
Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

False information spread via the internet and social media influences public opinion and user activity, while generative models enable fake content to be generated faster and more cheaply than had previously been possible. In the not so distant future, identifying fake content generated by deep learning models will play a key role in protecting users from misinformation. To this end, a dataset containing human and computer-generated headlines was created and a user study indicated that humans were only able to identify the fake headlines in 47.8% of the cases. However, the most accurate automatic approach, transformers, achieved an overall accuracy of 85.7%, indicating that content generated from language models can be filtered out accurately.

pdf bib abs
Modeling Graph Structure via Relative Position for Text Generation from Knowledge Graphs
Martin Schmitt | Leonardo F. R. Ribeiro | Philipp Dufter | Iryna Gurevych | Hinrich Schütze
Proceedings of the Fifteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-15)

We present Graformer, a novel Transformer-based encoder-decoder architecture for graph-to-text generation. With our novel graph self-attention, the encoding of a node relies on all nodes in the input graph - not only direct neighbors - facilitating the detection of global patterns. We represent the relation between two nodes as the length of the shortest path between them. Graformer learns to weight these node-node relations differently for different attention heads, thus virtually learning differently connected views of the input graph. We evaluate Graformer on two popular graph-to-text generation benchmarks, AGENDA and WebNLG, where it achieves strong performance while using many fewer parameters than other approaches.
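
The shortest-path relation between node pairs can be computed with breadth-first search. The following sketch treats the input as an unweighted graph given as an adjacency mapping; how the resulting lengths feed into attention, and any handling of edge direction, are not covered here. The function name and the toy graph are assumptions.

```python
from collections import deque


def shortest_path_lengths(adjacency):
    """All-pairs shortest path lengths in an unweighted graph via BFS.

    `adjacency` maps each node to an iterable of neighbours.
    Unreachable pairs are left out of the result.
    """
    dist = {}
    for source in adjacency:
        dist[source] = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in adjacency.get(node, ()):
                if nxt not in dist[source]:
                    dist[source][nxt] = dist[source][node] + 1
                    queue.append(nxt)
    return dist


toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}  # hypothetical graph
lengths = shortest_path_lengths(toy_graph)
```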

pdf abs
It’s Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners
Timo Schick | Hinrich Schütze
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

When scaled to hundreds of billions of parameters, pretrained language models such as GPT-3 (Brown et al., 2020) achieve remarkable few-shot performance. However, enormous amounts of compute are required for training and applying such big models, resulting in a large carbon footprint and making it difficult for researchers and practitioners to use them. We show that performance similar to GPT-3 can be obtained with language models that are much “greener” in that their parameter count is several orders of magnitude smaller. This is achieved by converting textual inputs into cloze questions that contain a task description, combined with gradient-based optimization; exploiting unlabeled data gives further improvements. We identify key factors required for successful natural language understanding with small language models.

pdf abs
Static Embeddings as Efficient Knowledge Bases?
Philipp Dufter | Nora Kassner | Hinrich Schütze
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Recent research investigates factual knowledge stored in large pretrained language models (PLMs). Instead of structural knowledge base (KB) queries, masked sentences such as “Paris is the capital of [MASK]” are used as probes. The good performance on this analysis task has been interpreted as PLMs becoming potential repositories of factual knowledge. In experiments across ten linguistically diverse languages, we study knowledge contained in static embeddings. We show that, when restricting the output space to a candidate set, simple nearest neighbor matching using static embeddings performs better than PLMs. E.g., static embeddings perform 1.6 percentage points better than BERT while using just 0.3% of the energy for training. One important factor in their good comparative performance is that static embeddings are standardly learned for a large vocabulary. In contrast, BERT exploits its more sophisticated, but expensive ability to compose meaningful representations from a much smaller subword vocabulary.
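
Nearest neighbor matching over a restricted candidate set can be sketched in a few lines. The example assumes that the query is represented by a single static vector (for instance, the vector of the subject entity) and that each candidate answer has its own vector; the vectors below are invented toy values, not real embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_candidate(query_vec, candidates):
    """Return the candidate name whose vector is most similar to the query."""
    return max(candidates, key=lambda name: cosine(query_vec, candidates[name]))


toy_candidates = {"France": [0.9, 0.2], "Japan": [0.1, 1.0]}  # hypothetical vectors
answer = nearest_candidate([1.0, 0.1], toy_candidates)
```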

pdf abs
Multi-source Neural Topic Modeling in Multi-view Embedding Spaces
Pankaj Gupta | Yatin Chaudhary | Hinrich Schütze
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Though word embeddings and topics are complementary representations, several past works have only used pretrained word embeddings in (neural) topic modeling to address data sparsity in short-text or small collections of documents. This work presents a novel neural topic modeling framework using multi-view embedding spaces: (1) pretrained topic-embeddings, and (2) pretrained word-embeddings (context-insensitive from GloVe and context-sensitive from BERT models) jointly from one or many sources to improve topic quality and better deal with polysemy. In doing so, we first build respective pools of pretrained topic (i.e., TopicPool) and word embeddings (i.e., WordPool). We then identify one or more relevant source domain(s) and transfer knowledge to guide meaningful learning in the sparse target domain. Within neural topic modeling, we quantify the quality of topics and document representations via generalization (perplexity), interpretability (topic coherence) and information retrieval (IR) using short-text, long-text, small and large document collections from news and medical domains. Introducing the multi-source multi-view embedding spaces, we have shown state-of-the-art neural topic modeling using 6 source (high-resource) and 5 target (low-resource) corpora.

Consistency of a model, that is, the invariance of its behavior under meaning-preserving alternations in its input, is a highly desirable property in natural language processing. In this paper we study the question: Are Pretrained Language Models (PLMs) consistent with respect to factual knowledge? To this end, we create ParaRel🤘, a high-quality resource of cloze-style query English paraphrases. It contains a total of 328 paraphrases for 38 relations. Using ParaRel🤘, we show that the consistency of all PLMs we experiment with is poor, though with high variance between relations. Our analysis of the representational spaces of PLMs suggests that they have a poor structure and are currently not suitable for representing knowledge robustly. Finally, we propose a method for improving model consistency and experimentally demonstrate its effectiveness.

During production of this paper, an error was introduced to the formula on the bottom of the right column of page 1020. In the last two terms of the formula, the n and m subscripts were swapped. The correct formula is: $\mathcal{L}_c = \sum_{n=1}^{k} \sum_{m=n+1}^{k} D_{KL}(Q_n^{r_i} \,\|\, Q_m^{r_i}) + D_{KL}(Q_m^{r_i} \,\|\, Q_n^{r_i})$. The paper has been updated.
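
The corrected formula sums a symmetric KL divergence over all unordered pairs of distributions. A minimal sketch of that computation follows, assuming each distribution is a plain list of probabilities over the same vocabulary and that no entry of the second argument is zero where the first is positive; the function names are hypothetical.

```python
import math


def kl(p, q):
    """KL divergence D_KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def consistency_loss(dists):
    """Sum of symmetric KL divergences over all pairs n < m."""
    k = len(dists)
    return sum(
        kl(dists[n], dists[m]) + kl(dists[m], dists[n])
        for n in range(k)
        for m in range(n + 1, k)
    )


same = consistency_loss([[0.5, 0.5], [0.5, 0.5]])
different = consistency_loss([[0.9, 0.1], [0.5, 0.5]])
```

The loss is zero when all distributions agree and grows as predictions for different paraphrases diverge.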

pdf abs
Self-Diagnosis and Self-Debiasing: A Proposal for Reducing Corpus-Based Bias in NLP
Timo Schick | Sahana Udupa | Hinrich Schütze
Transactions of the Association for Computational Linguistics, Volume 9

This paper contains prompts and model outputs that are offensive in nature. When trained on large, unfiltered crawls from the Internet, language models pick up and reproduce all kinds of undesirable biases that can be found in the data: They often generate racist, sexist, violent, or otherwise toxic language. As large models require millions of training examples to achieve good performance, it is difficult to completely prevent them from being exposed to such content. In this paper, we first demonstrate a surprising finding: Pretrained language models recognize, to a considerable degree, their undesirable biases and the toxicity of the content they produce. We refer to this capability as self-diagnosis. Based on this finding, we then propose a decoding algorithm that, given only a textual description of the undesired behavior, reduces the probability of a language model producing problematic text. We refer to this approach as self-debiasing. Self-debiasing does not rely on manually curated word lists, nor does it require any training data or changes to the model’s parameters. While we by no means eliminate the issue of language models generating biased text, we believe our approach to be an important step in this direction.

pdf abs
Exploiting Cloze-Questions for Few-Shot Text Classification and Natural Language Inference
Timo Schick | Hinrich Schütze
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Some NLP tasks can be solved in a fully unsupervised fashion by providing a pretrained language model with “task descriptions” in natural language (e.g., Radford et al., 2019). While this approach underperforms its supervised counterpart, we show in this work that the two ideas can be combined: We introduce Pattern-Exploiting Training (PET), a semi-supervised training procedure that reformulates input examples as cloze-style phrases to help language models understand a given task. These phrases are then used to assign soft labels to a large set of unlabeled examples. Finally, standard supervised training is performed on the resulting training set. For several tasks and languages, PET outperforms supervised training and strong semi-supervised approaches in low-resource settings by a large margin.
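
The step of assigning soft labels to unlabeled examples can be sketched as follows. The code assumes that each cloze pattern has already produced one score per label for an example (obtaining these scores from a language model is not shown), averages them uniformly, and applies a temperature-scaled softmax. The uniform weighting, the default `temperature`, and the function name are assumptions made for illustration.

```python
import math


def soft_labels(pattern_scores, temperature=2.0):
    """Combine per-pattern label scores into one soft label distribution.

    pattern_scores: list of score lists, one per pattern, each holding one
    score per label (assumed to come from a language model; not computed here).
    """
    n_labels = len(pattern_scores[0])
    # Uniformly average the scores that the patterns assign to each label.
    mean = [
        sum(scores[j] for scores in pattern_scores) / len(pattern_scores)
        for j in range(n_labels)
    ]
    # A temperature-scaled softmax turns the averaged scores into probabilities.
    exps = [math.exp(m / temperature) for m in mean]
    total = sum(exps)
    return [e / total for e in exps]


probs = soft_labels([[2.0, 0.0], [1.0, 1.0]], temperature=1.0)
```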

pdf abs
Does She Wink or Does She Nod? A Challenging Benchmark for Evaluating Word Understanding of Language Models
Lutfi Kerem Senel | Hinrich Schütze
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Recent progress in pretraining language models on large corpora has resulted in significant performance gains on many NLP tasks. These large models acquire linguistic knowledge during pretraining, which helps to improve performance on downstream tasks via fine-tuning. To assess what kind of knowledge is acquired, language models are commonly probed by querying them with ‘fill in the blank’ style cloze questions. Existing probing datasets mainly focus on knowledge about relations between words and entities. We introduce WDLMPro (Word Definitions Language Model Probing) to evaluate word understanding directly using dictionary definitions of words. In our experiments, three popular pretrained language models struggle to match words and their definitions. This indicates that they understand many words poorly and that our new probing task is a difficult challenge that could help guide research on LMs in the future.

pdf abs
Language Models for Lexical Inference in Context
Martin Schmitt | Hinrich Schütze
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Lexical inference in context (LIiC) is the task of recognizing textual entailment between two very similar sentences, i.e., sentences that only differ in one expression. It can therefore be seen as a variant of the natural language inference task that is focused on lexical semantics. We formulate and evaluate the first approaches based on pretrained language models (LMs) for this task: (i) a few-shot NLI classifier, (ii) a relation induction approach based on handcrafted patterns expressing the semantics of lexical inference, and (iii) a variant of (ii) with patterns that were automatically extracted from a corpus. All our approaches outperform the previous state of the art, showing the potential of pretrained LMs for LIiC. In an extensive analysis, we investigate factors of success and failure of our three approaches.

pdf abs
Multilingual LAMA: Investigating Knowledge in Multilingual Pretrained Language Models
Nora Kassner | Philipp Dufter | Hinrich Schütze
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Recently, it has been found that monolingual English language models can be used as knowledge bases. Instead of structural knowledge base queries, masked sentences such as “Paris is the capital of [MASK]” are used as probes. We translate the established benchmarks TREx and GoogleRE into 53 languages. Working with mBERT, we investigate three questions. (i) Can mBERT be used as a multilingual knowledge base? Most prior work only considers English. Extending research to multiple languages is important for diversity and accessibility. (ii) Is mBERT’s performance as knowledge base language-independent or does it vary from language to language? (iii) A multilingual model is trained on more text, e.g., mBERT is trained on 104 Wikipedias. Can mBERT leverage this for better performance? We find that using mBERT as a knowledge base yields varying performance across languages and pooling predictions across languages improves performance. Conversely, mBERT exhibits a language bias; e.g., when queried in Italian, it tends to predict Italy as the country of origin.

pdf abs
Wine is not v i n. On the Compatibility of Tokenizations across Languages
Antonis Maronikolakis | Philipp Dufter | Hinrich Schütze
Findings of the Association for Computational Linguistics: EMNLP 2021

The size of the vocabulary is a central design choice in large pretrained language models, with respect to both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are used. In this work, we investigate the compatibility of tokenizations for multilingual static and contextualized embedding spaces and propose a measure that reflects the compatibility of tokenizations across languages. Our goal is to prevent incompatible tokenizations, e.g., “wine” (word-level) in English vs. “v i n” (character-level) in French, which make it hard to learn good multilingual semantic representations. We show that our compatibility measure allows the system designer to create vocabularies across languages that are compatible – a desideratum that so far has been neglected in multilingual models.

pdf abs
Semi-Automated Labeling of Requirement Datasets for Relation Extraction
Jeremias Bohn | Jannik Fischbach | Martin Schmitt | Hinrich Schütze | Andreas Vogelsang
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)

Creating datasets manually by human annotators is a laborious task that can lead to biased and inhomogeneous labels. We propose a flexible, semi-automatic framework for labeling data for relation extraction. Furthermore, we provide a dataset of preprocessed sentences from the requirements engineering domain, including a set of automatically created as well as hand-crafted labels. In our case study, we compare the human and automatic labels and show that there is a substantial overlap between both annotations.

pdf abs
Few-Shot Text Generation with Natural Language Instructions
Timo Schick | Hinrich Schütze
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Providing pretrained language models with simple task descriptions in natural language enables them to solve some tasks in a fully unsupervised fashion. Moreover, when combined with regular learning from examples, this idea yields impressive few-shot results for a wide range of text classification tasks. It is also a promising direction to improve data efficiency in generative settings, but there are several challenges to using a combination of task descriptions and example-based learning for text generation. In particular, it is crucial to find task descriptions that are easy to understand for the pretrained model and to ensure that it actually makes good use of them; furthermore, effective measures against overfitting have to be implemented. In this paper, we show how these challenges can be tackled: We introduce GenPET, a method for text generation that is based on pattern-exploiting training, a recent approach for combining textual instructions with supervised learning that only works for classification tasks. On several summarization and headline generation datasets, GenPET gives consistent improvements over strong baselines in few-shot settings.

pdf abs
Generating Datasets with Pretrained Language Models
Timo Schick | Hinrich Schütze
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

To obtain high-quality sentence embeddings from pretrained language models (PLMs), they must either be augmented with additional pretraining objectives or finetuned on a large set of labeled text pairs. While the latter approach typically outperforms the former, it requires great human effort to generate suitable datasets of sufficient size. In this paper, we show how PLMs can be leveraged to obtain high-quality sentence embeddings without the need for labeled data, finetuning or modifications to the pretraining objective: We utilize the generative abilities of large and high-performing PLMs to generate entire datasets of labeled text pairs from scratch, which we then use for finetuning much smaller and more efficient models. Our fully unsupervised approach outperforms strong baselines on several semantic textual similarity datasets.

pdf abs
Continuous Entailment Patterns for Lexical Inference in Context
Martin Schmitt | Hinrich Schütze
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Combining a pretrained language model (PLM) with textual patterns has been shown to help in both zero- and few-shot settings. For zero-shot performance, it makes sense to design patterns that closely resemble the text seen during self-supervised pretraining because the model has never seen anything else. Supervised training allows for more flexibility. If we allow for tokens outside the PLM’s vocabulary, patterns can be adapted more flexibly to a PLM’s idiosyncrasies. Contrasting patterns where a “token” can be any continuous vector with those where a discrete choice between vocabulary elements has to be made, we call our method CONtinuous pAtterNs (CONAN). We evaluate CONAN on two established benchmarks for lexical inference in context (LIiC), a.k.a. predicate entailment, a challenging natural language understanding task with relatively small training data. In a direct comparison with discrete patterns, CONAN consistently leads to improved performance, setting a new state of the art. Our experiments give valuable insights on the kind of pattern that enhances a PLM’s performance on LIiC and raise important questions regarding our understanding of PLMs using text patterns.

pdf abs
Graph Algorithms for Multiparallel Word Alignment
Ayyoob ImaniGooghari | Masoud Jalili Sabet | Lutfi Kerem Senel | Philipp Dufter | François Yvon | Hinrich Schütze
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

With the advent of end-to-end deep learning approaches in machine translation, interest in word alignments initially decreased; however, they have again become a focus of research more recently. Alignments are useful for typological research, transferring formatting like markup to translated texts, and can be used in the decoding of machine translation systems. At the same time, massively multilingual processing is becoming an important NLP scenario, and pretrained language and machine translation models that are truly multilingual are proposed. However, most alignment algorithms rely on bitexts only and do not leverage the fact that many parallel corpora are multiparallel. In this work, we exploit the multiparallelity of corpora by representing an initial set of bilingual alignments as a graph and then predicting additional edges in the graph. We present two graph algorithms for edge prediction: one inspired by recommender systems and one based on network link prediction. Our experimental results show absolute improvements in F1 of up to 28% over the baseline bilingual word aligner in different datasets.
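
Representing bilingual alignments as a graph and predicting additional edges can be illustrated with a common-neighbours heuristic, a standard link-prediction baseline. This is not claimed to be either of the two algorithms in the paper. Nodes are assumed to be (language, position) pairs, and `min_common` is a hypothetical threshold.

```python
from itertools import combinations


def predict_edges(edges, min_common=1):
    """Propose new alignment edges between nodes sharing >= min_common neighbours.

    Nodes are (language, position) pairs; edges are pairs of nodes.
    Only nodes from different languages are linked.
    """
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    proposed = set()
    for u, v in combinations(sorted(neighbours), 2):
        # Skip same-language pairs and pairs that are already aligned.
        if u[0] == v[0] or v in neighbours[u]:
            continue
        if len(neighbours[u] & neighbours[v]) >= min_common:
            proposed.add((u, v))
    return proposed


# English word 0 and French word 1 are both aligned to German word 0.
initial = [(("en", 0), ("de", 0)), (("de", 0), ("fr", 1))]
new_edges = predict_edges(initial)
```

Here the English and French words are not directly aligned by the bilingual aligner, but their shared German neighbour suggests the missing edge.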

pdf abs
Discrete and Soft Prompting for Multilingual Models
Mengjie Zhao | Hinrich Schütze
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

It has been shown for English that discrete and soft prompting perform strongly in few-shot learning with pretrained language models (PLMs). In this paper, we show that discrete and soft prompting perform better than finetuning in multilingual cases: Crosslingual transfer and in-language training of multilingual natural language inference. For example, with 48 English training examples, finetuning obtains 33.74% accuracy in crosslingual transfer, barely surpassing the majority baseline (33.33%). In contrast, discrete and soft prompting outperform finetuning, achieving 36.43% and 38.79%. We also demonstrate good performance of prompting with training data in multiple languages other than English.

pdf abs
BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief
Nora Kassner | Oyvind Tafjord | Hinrich Schütze | Peter Clark
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Although pretrained language models (PTLMs) contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after specialized training. As a result, it can be hard to identify what the model actually “believes” about the world, making it susceptible to inconsistent behavior and simple errors. Our goal is to reduce these problems. Our approach is to embed a PTLM in a broader system that also includes an evolving, symbolic memory of beliefs – a BeliefBank – that records but then may modify the raw PTLM answers. We describe two mechanisms to improve belief consistency in the overall system. First, a reasoning component – a weighted MaxSAT solver – revises beliefs that significantly clash with others. Second, a feedback component issues future queries to the PTLM using known beliefs as context. We show that, in a controlled experimental setting, these two mechanisms result in more consistent beliefs in the overall system, improving both the accuracy and consistency of its answers over time. This is significant as it is a first step towards PTLM-based architectures with a systematic notion of belief, enabling them to construct a more coherent picture of the world, and improve over time without model retraining.
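
The revision of clashing beliefs with a weighted MaxSAT solver can be illustrated with a brute-force version for very small belief sets. The data layout below (raw answers with confidence weights, implication constraints with weights) and the example statements are assumptions for illustration; a real system would use a dedicated solver rather than enumerating all assignments.

```python
from itertools import product


def revise_beliefs(beliefs, constraints):
    """Brute-force weighted MaxSAT over a small belief set.

    beliefs: {statement: (answer, weight)}, raw model answers with confidences.
    constraints: list of (premise, conclusion, weight), read as premise -> conclusion;
    both statements must appear in `beliefs`.
    Returns the truth assignment with the highest total satisfied weight.
    """
    names = sorted(beliefs)
    best, best_score = None, float("-inf")
    for values in product([True, False], repeat=len(names)):
        assignment = dict(zip(names, values))
        # Reward agreement with the raw model answers.
        score = sum(w for s, (ans, w) in beliefs.items() if assignment[s] == ans)
        # Reward every satisfied implication.
        score += sum(
            w for p, c, w in constraints if not assignment[p] or assignment[c]
        )
        if score > best_score:
            best, best_score = assignment, score
    return best


raw = {"is a bird": (True, 0.9), "has feathers": (False, 0.6)}  # hypothetical answers
rules = [("is a bird", "has feathers", 1.0)]
revised = revise_beliefs(raw, rules)
```

In this toy case the weaker belief is flipped so that the implication holds, which mirrors the idea of revising beliefs that clash with others.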

pdf abs
BERT Cannot Align Characters
Antonis Maronikolakis | Philipp Dufter | Hinrich Schütze
Proceedings of the Second Workshop on Insights from Negative Results in NLP

In previous work, it has been shown that BERT can adequately align cross-lingual sentences on the word level. Here we investigate whether BERT can also operate as a char-level aligner. The languages examined are English, Fake English, German and Greek. We show that the closer two languages are, the better BERT can align them on the character level. BERT indeed works well in English to Fake English alignment, but this does not generalize to natural languages to the same extent. Nevertheless, the proximity of two languages does seem to be a factor. English is more related to German than to Greek and this is reflected in how well BERT aligns them; English to German is better than English to Greek. We examine multiple setups and show that the similarity matrices for natural languages show weaker relations the further apart two languages are.

pdf bib abs
Multidomain Pretrained Language Models for Green NLP
Antonis Maronikolakis | Hinrich Schütze
Proceedings of the Second Workshop on Domain Adaptation for NLP

When tackling a task in a given domain, it has been shown that adapting a model to the domain using raw text data before training on the supervised task improves performance versus solely training on the task. The downside is that a lot of domain data is required and if we want to tackle tasks in n domains, we require n models each adapted on domain data before task learning. Storing and using these models separately can be prohibitive for low-end devices. In this paper we show that domain adaptation can be generalised to cover multiple domains. Specifically, a single model can be trained across various domains at the same time with minimal drop in performance, even when we use less data and resources. Thus, instead of training multiple models, we can train a single multidomain model saving on computational resources and training time.

pdf abs
Few-Shot Learning of an Interleaved Text Summarization Model by Pretraining with Synthetic Data
Sanjeev Kumar Karn | Francine Chen | Yan-Ying Chen | Ulli Waltinger | Hinrich Schütze
Proceedings of the Second Workshop on Domain Adaptation for NLP

Interleaved texts, where posts belonging to different threads occur in a sequence, commonly occur in online chat posts, so that it can be time-consuming to quickly obtain an overview of the discussions. Existing systems first disentangle the posts by threads and then extract summaries from those threads. A major issue with such systems is error propagation from the disentanglement component. While an end-to-end trainable summarization system could obviate explicit disentanglement, such systems require a large amount of labeled data. To address this, we propose to pretrain an end-to-end trainable hierarchical encoder-decoder system using synthetic interleaved texts. We show that by fine-tuning on a real-world meeting dataset (AMI), such a system outperforms a traditional two-step system by 22%. We also compare against transformer models and observe that pretraining both the encoder and decoder with synthetic data outperforms the BertSumExtAbs transformer model, which pretrains only the encoder on a large dataset.

pdf abs
Investigating Pretrained Language Models for Graph-to-Text Generation
Leonardo F. R. Ribeiro | Martin Schmitt | Hinrich Schütze | Iryna Gurevych
Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI

Graph-to-text generation aims to generate fluent texts from graph-based data. In this paper, we investigate two recent pretrained language models (PLMs) and analyze the impact of different task-adaptive pretraining strategies for PLMs in graph-to-text generation. We present a study across three graph domains: meaning representations, Wikipedia knowledge graphs (KGs) and scientific KGs. We show that approaches based on PLMs BART and T5 achieve new state-of-the-art results and that task-adaptive pretraining strategies improve their performance even further. We report new state-of-the-art BLEU scores of 49.72 on AMR-LDC2017T10, 59.70 on WebNLG, and 25.66 on AGENDA datasets - a relative improvement of 31.8%, 4.5%, and 42.4%, respectively, with our models generating significantly more fluent texts than human references. In an extensive analysis, we identify possible reasons for the PLMs’ success on graph-to-text tasks. Our findings suggest that the PLMs benefit from similar facts seen during pretraining or fine-tuning, such that they perform well even when the input graph is reduced to a simple bag of node and edge labels.

pdf abs
Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words
Valentin Hofmann | Janet Pierrehumbert | Hinrich Schütze
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

How does the input segmentation of pretrained language models (PLMs) affect their interpretations of complex words? We present the first study investigating this question, taking BERT as the example PLM and focusing on its semantic representations of English derivatives. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which DelBERT (Derivation leveraging BERT), a model with derivational input segmentation, substantially outperforms BERT with WordPiece segmentation. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used.

pdf abs
A Closer Look at Few-Shot Crosslingual Transfer: The Choice of Shots Matters
Mengjie Zhao | Yi Zhu | Ehsan Shareghi | Ivan Vulić | Roi Reichart | Anna Korhonen | Hinrich Schütze
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Few-shot crosslingual transfer has been shown to outperform its zero-shot counterpart with pretrained encoders like multilingual BERT. Despite its growing popularity, little to no attention has been paid to standardizing and analyzing the design of few-shot experiments. In this work, we highlight a fundamental risk posed by this shortcoming, illustrating that the model exhibits a high degree of sensitivity to the selection of few shots. We conduct a large-scale experimental study on 40 sets of sampled few shots for six diverse NLP tasks across up to 40 languages. We provide an analysis of success and failure cases of few-shot transfer, which highlights the role of lexical features. Additionally, we show that a straightforward full model finetuning approach is quite effective for few-shot transfer, outperforming several state-of-the-art few-shot approaches. As a step towards standardizing few-shot crosslingual experimental designs, we make our sampled few shots publicly available.

pdf abs
Dynamic Contextualized Word Embeddings
Valentin Hofmann | Janet Pierrehumbert | Hinrich Schütze
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Static word embeddings that represent words by a single vector cannot capture the variability of word meaning in different linguistic and extralinguistic contexts. Building on prior work on contextualized and dynamic word embeddings, we introduce dynamic contextualized word embeddings that represent words as a function of both linguistic and extralinguistic context. Based on a pretrained language model (PLM), dynamic contextualized word embeddings model time and social space jointly, which makes them attractive for a range of NLP tasks involving semantic variability. We highlight potential application scenarios by means of qualitative and quantitative analyses on four English datasets.

pdf abs
ParCourE: A Parallel Corpus Explorer for a Massively Multilingual Corpus
Ayyoob ImaniGooghari | Masoud Jalili Sabet | Philipp Dufter | Michael Cysou | Hinrich Schütze
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations

With more than 7000 languages worldwide, multilingual natural language processing (NLP) is essential both from an academic and commercial perspective. Researching typological properties of languages is fundamental for progress in multilingual NLP. Examples include assessing language similarity for effective transfer learning, injecting inductive biases into machine learning models or creating resources such as dictionaries and inflection tables. We provide ParCourE, an online tool that allows users to browse a word-aligned parallel corpus, covering 1334 languages. We give evidence that this is useful for typological research. ParCourE can be set up for any parallel corpus and can thus be used for typological research on other corpora as well as for exploring their quality and properties.

2020

pdf abs
Masking as an Efficient Alternative to Finetuning for Pretrained Language Models
Mengjie Zhao | Tao Lin | Fei Mi | Martin Jaggi | Hinrich Schütze
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We present an efficient method of utilizing pretrained language models, where we learn selective binary masks for pretrained weights in lieu of modifying them through finetuning. Extensive evaluations of masking BERT, RoBERTa, and DistilBERT on eleven diverse NLP tasks show that our masking scheme yields performance comparable to finetuning, yet has a much smaller memory footprint when several tasks need to be inferred. Intrinsic evaluations show that representations computed by our binary masked language models encode information necessary for solving downstream tasks. Analyzing the loss landscape, we show that masking and finetuning produce models that reside in minima that can be connected by a line segment with nearly constant test accuracy. This confirms that masking can be utilized as an efficient alternative to finetuning.

pdf abs
DagoBERT: Generating Derivational Morphology with a Pretrained Language Model
Valentin Hofmann | Janet Pierrehumbert | Hinrich Schütze
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Can pretrained language models (PLMs) generate derivationally complex words? We present the first study investigating this question, taking BERT as the example PLM. We examine BERT’s derivational capabilities in different settings, ranging from using the unmodified pretrained model to full finetuning. Our best model, DagoBERT (Derivationally and generatively optimized BERT), clearly outperforms the previous state of the art in derivation generation (DG). Furthermore, our experiments show that the input segmentation crucially impacts BERT’s derivational knowledge, suggesting that the performance of PLMs could be further improved if a morphologically informed vocabulary of units were used.

pdf abs
Identifying Elements Essential for BERT’s Multilinguality
Philipp Dufter | Hinrich Schütze
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

It has been shown that multilingual BERT (mBERT) yields high quality multilingual representations and enables effective zero-shot transfer. This is surprising given that mBERT does not use any crosslingual signal during training. While recent literature has studied this phenomenon, the reasons for the multilinguality are still somewhat obscure. We aim to identify architectural properties of BERT and linguistic properties of languages that are necessary for BERT to become multilingual. To allow for fast experimentation we propose an efficient setup with small BERT models trained on a mix of synthetic and natural data. Overall, we identify four architectural and two linguistic elements that influence multilinguality. Based on our insights, we experiment with a multilingual pretraining setup that modifies the masking strategy using VecMap, i.e., unsupervised embedding alignment. Experiments on XNLI with three languages indicate that our findings transfer from our small setup to larger scale settings.

pdf abs
An Unsupervised Joint System for Text Generation from Knowledge Graphs and Semantic Parsing
Martin Schmitt | Sahand Sharifzadeh | Volker Tresp | Hinrich Schütze
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Knowledge graphs (KGs) can vary greatly from one domain to another. Therefore supervised approaches to both graph-to-text generation and text-to-graph knowledge extraction (semantic parsing) will always suffer from a shortage of domain-specific parallel graph-text data; at the same time, adapting a model trained on a different domain is often impossible due to little or no overlap in entities and relations. This situation calls for an approach that (1) does not need large amounts of annotated data and thus (2) does not need to rely on domain adaptation techniques to work well on different domains. To this end, we present the first approach to unsupervised text generation from KGs and show simultaneously how it can be used for unsupervised semantic parsing. We evaluate our approach on WebNLG v2.1 and a new benchmark leveraging scene graphs from Visual Genome. Our system outperforms strong baselines for both text<->graph conversion tasks without any manual adaptation from one dataset to the other. In additional experiments, we investigate the impact of using different unsupervised objectives.

pdf abs
EmbLexChange at SemEval-2020 Task 1: Unsupervised Embedding-based Detection of Lexical Semantic Changes
Ehsaneddin Asgari | Christoph Ringlstetter | Hinrich Schütze
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes EmbLexChange, a system introduced by the “Life-Language” team for SemEval-2020 Task 1, on unsupervised detection of lexical-semantic changes. EmbLexChange is defined as the divergence between the embedding-based profiles of word w (calculated with respect to a set of reference words) in the source and the target domains (source and target domains can be simply two time frames t_1 and t_2). The underlying assumption is that the lexical-semantic change of word w would affect its co-occurring words and subsequently alter the neighborhoods in the embedding spaces. We show that using a resampling framework for the selection of reference words (with conserved senses), we can more reliably detect lexical-semantic changes in English, German, Swedish, and Latin. EmbLexChange achieved second place in the binary detection of semantic changes in SemEval-2020.

pdf abs
Embedding Space Correlation as a Measure of Domain Similarity
Anne Beyer | Göran Kauermann | Hinrich Schütze
Proceedings of the Twelfth Language Resources and Evaluation Conference

Prior work has determined domain similarity using text-based features of a corpus. However, when using pre-trained word embeddings, the underlying text corpus might not be accessible anymore. Therefore, we propose the CCA measure, a new measure of domain similarity based directly on the dimension-wise correlations between corresponding embedding spaces. Our results suggest that an inherent notion of domain can be captured this way, as we are able to reproduce our findings for different domain comparisons for English, German, Spanish and Czech as well as in cross-lingual comparisons. We further find a threshold at which the CCA measure indicates that two corpora come from the same domain in a monolingual setting by applying permutation tests. By evaluating the usability of the CCA measure in a domain adaptation application, we also show that it can be used to determine which corpora are more similar to each other in a cross-domain sentiment detection task.

We propose ThaiLMCut, a semi-supervised approach for Thai word segmentation which utilizes a bi-directional character language model (LM) as a way to leverage useful linguistic knowledge from unlabeled data. After the language model is trained on substantial unlabeled corpora, the weights of its embedding and recurrent layers are transferred to a supervised word segmentation model which continues fine-tuning them on a word segmentation task. Our experimental results demonstrate that applying the LM always leads to a performance gain, especially when the amount of labeled data is small. In such cases, the F1 Score increased by up to 2.02%. Even on a big labeled dataset, a small improvement can still be obtained. The approach has also been shown to be very beneficial for out-of-domain settings with a gain in F1 Score of up to 3.13%. Finally, we show that ThaiLMCut can outperform other open source state-of-the-art models achieving an F1 Score of 98.78% on the standard benchmark, InterBEST2009.

pdf abs
LMU Bilingual Dictionary Induction System with Word Surface Similarity Scores for BUCC 2020
Silvia Severini | Viktor Hangya | Alexander Fraser | Hinrich Schütze
Proceedings of the 13th Workshop on Building and Using Comparable Corpora

The task of Bilingual Dictionary Induction (BDI) consists of generating translations for source language words, which is important in the framework of machine translation (MT). The aim of the BUCC 2020 shared task is to perform BDI on various language pairs using comparable corpora. In this paper, we present our approach to the task of English-German and English-Russian language pairs. Our system relies on Bilingual Word Embeddings (BWEs), which are often used for BDI when only a small seed lexicon is available, making them particularly effective in a low-resource setting. On the other hand, they perform well on high frequency words only. In order to improve the performance on rare words as well, we combine BWE based word similarity with word surface similarity methods, such as orthography. In addition to the often used top-n translation method, we experiment with a margin based approach aiming for a dynamic number of translations for each source word. We participate in both the open and closed tracks of the shared task and we show improved results of our method compared to simple vector similarity based approaches. Our system was ranked in the top-3 teams and achieved the best results for English-Russian.

pdf abs
E-BERT: Efficient-Yet-Effective Entity Embeddings for BERT
Nina Poerner | Ulli Waltinger | Hinrich Schütze
Findings of the Association for Computational Linguistics: EMNLP 2020

We present a novel way of injecting factual knowledge about entities into the pretrained BERT model (Devlin et al., 2019): We align Wikipedia2Vec entity vectors (Yamada et al., 2016) with BERT’s native wordpiece vector space and use the aligned entity vectors as if they were wordpiece vectors. The resulting entity-enhanced version of BERT (called E-BERT) is similar in spirit to ERNIE (Zhang et al., 2019) and KnowBert (Peters et al., 2019), but it requires no expensive further pre-training of the BERT encoder. We evaluate E-BERT on unsupervised question answering (QA), supervised relation classification (RC) and entity linking (EL). On all three tasks, E-BERT outperforms BERT and other baselines. We also show quantitatively that the original BERT model is overly reliant on the surface form of entity names (e.g., guessing that someone with an Italian-sounding name speaks Italian), and that E-BERT mitigates this problem.

pdf abs
Quantifying the Contextualization of Word Representations with Semantic Class Probing
Mengjie Zhao | Philipp Dufter | Yadollah Yaghoobzadeh | Hinrich Schütze
Findings of the Association for Computational Linguistics: EMNLP 2020

Pretrained language models achieve state-of-the-art results on many NLP tasks, but there are still many open questions about how and why they work so well. We investigate the contextualization of words in BERT. We quantify the amount of contextualization, i.e., how well words are interpreted in context, by studying the extent to which semantic classes of a word can be inferred from its contextualized embedding. Quantifying contextualization helps in understanding and utilizing pretrained language models. We show that the top layer representations support highly accurate inference of semantic classes; that the strongest contextualization effects occur in the lower layers; that local context is mostly sufficient for contextualizing words; and that top layer representations are more task-specific after finetuning while lower layer representations are more transferable. Finetuning uncovers task-related features, but pretrained knowledge about contextualization is still well preserved.

pdf abs
Inexpensive Domain Adaptation of Pretrained Language Models: Case Studies on Biomedical NER and Covid-19 QA
Nina Poerner | Ulli Waltinger | Hinrich Schütze
Findings of the Association for Computational Linguistics: EMNLP 2020

Domain adaptation of Pretrained Language Models (PTLMs) is typically achieved by unsupervised pretraining on target-domain text. While successful, this approach is expensive in terms of hardware, runtime and CO2 emissions. Here, we propose a cheaper alternative: We train Word2Vec on target-domain text and align the resulting word vectors with the wordpiece vectors of a general-domain PTLM. We evaluate on eight English biomedical Named Entity Recognition (NER) tasks and compare against the recently proposed BioBERT model. We cover over 60% of the BioBERT - BERT F1 delta, at 5% of BioBERT’s CO2 footprint and 2% of its cloud compute cost. We also show how to quickly adapt an existing general-domain Question Answering (QA) model to an emerging domain: the Covid-19 pandemic.

pdf abs
SimAlign: High Quality Word Alignments Without Parallel Training Data Using Static and Contextualized Embeddings
Masoud Jalili Sabet | Philipp Dufter | François Yvon | Hinrich Schütze
Findings of the Association for Computational Linguistics: EMNLP 2020

Word alignments are useful for tasks like statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings – both static and contextualized – for word alignment. Our multilingual embeddings are created from monolingual data only without relying on any parallel data or dictionaries. We find that alignments created from embeddings are superior for four and comparable for two language pairs compared to those produced by traditional statistical aligners – even with abundant parallel data; e.g., contextualized embeddings achieve a word alignment F1 for English-German that is 5 percentage points higher than eflomal, a high-quality statistical aligner, trained on 100k parallel sentences.

Prior research notes that BERT’s computational cost grows quadratically with sequence length, thus leading to longer training times, higher GPU memory constraints and carbon emissions. While recent work seeks to address these scalability issues at pre-training, these issues are also prominent in fine-tuning, especially for long sequence tasks like document classification. Our work thus focuses on optimizing the computational cost of fine-tuning for document classification. We achieve this by complementary learning of both topic and language models in a unified framework, named TopicBERT. This significantly reduces the number of self-attention operations – a main performance bottleneck. Consequently, our model achieves a 1.4x (40%) speedup with 40% reduction in CO2 emission while retaining 99.9% performance over 5 datasets.

pdf abs
BERT-kNN: Adding a kNN Search Component to Pretrained Language Models for Better QA
Nora Kassner | Hinrich Schütze
Findings of the Association for Computational Linguistics: EMNLP 2020

Khandelwal et al. (2020) use a k-nearest-neighbor (kNN) component to improve language model performance. We show that this idea is beneficial for open-domain question answering (QA). To improve the recall of facts encountered during training, we combine BERT (Devlin et al., 2019) with a traditional information retrieval step (IR) and a kNN search over a large datastore of an embedded text collection. Our contributions are as follows: i) BERT-kNN outperforms BERT on cloze-style QA by large margins without any further training. ii) We show that BERT often identifies the correct response category (e.g., US city), but only kNN recovers the factually correct answer (e.g., “Miami”). iii) Compared to BERT, BERT-kNN excels for rare facts. iv) BERT-kNN can easily handle facts not covered by BERT’s training set, e.g., recent events.

pdf abs
Are Pretrained Language Models Symbolic Reasoners over Knowledge?
Nora Kassner | Benno Krojer | Hinrich Schütze
Proceedings of the 24th Conference on Computational Natural Language Learning

How can pretrained language models (PLMs) learn factual knowledge from the training set? We investigate the two most important mechanisms: reasoning and memorization. Prior work has attempted to quantify the number of facts PLMs learn, but we present, using synthetic data, the first study that investigates the causal relation between facts present in training and facts learned by the PLM. For reasoning, we show that PLMs seem to learn to apply some symbolic reasoning rules correctly but struggle with others, including two-hop reasoning. Further analysis suggests that even the application of learned reasoning rules is flawed. For memorization, we identify schema conformity (facts systematically supported by other facts) and frequency as key factors for its success.

pdf abs
A Graph Auto-encoder Model of Derivational Morphology
Valentin Hofmann | Hinrich Schütze | Janet Pierrehumbert
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

There has been little work on modeling the morphological well-formedness (MWF) of derivatives, a problem judged to be complex and difficult in linguistics. We present a graph auto-encoder that learns embeddings capturing information about the compatibility of affixes and stems in derivation. The auto-encoder models MWF in English surprisingly well by combining syntactic and semantic information with associative information from the mental lexicon.

pdf abs
BERTRAM: Improved Word Embeddings Have Big Impact on Contextualized Model Performance
Timo Schick | Hinrich Schütze
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Pretraining deep language models has led to large performance gains in NLP. Despite this success, Schick and Schütze (2020) recently showed that these models struggle to understand rare words. For static word embeddings, this problem has been addressed by separately learning representations for rare words. In this work, we transfer this idea to pretrained language models: We introduce BERTRAM, a powerful architecture based on BERT that is capable of inferring high-quality embeddings for rare words that are suitable as input representations for deep language models. This is achieved by enabling the surface form and contexts of a word to interact with each other in a deep architecture. Integrating BERTRAM into BERT leads to large performance increases due to improved representations of rare and medium frequency words on both a rare word probing task and three downstream tasks.

pdf abs
Sentence Meta-Embeddings for Unsupervised Semantic Textual Similarity
Nina Poerner | Ulli Waltinger | Hinrich Schütze
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We address the task of unsupervised Semantic Textual Similarity (STS) by ensembling diverse pre-trained sentence encoders into sentence meta-embeddings. We apply, extend and evaluate different meta-embedding methods from the word embedding literature at the sentence level, including dimensionality reduction (Yin and Schütze, 2016), generalized Canonical Correlation Analysis (Rastogi et al., 2015) and cross-view auto-encoders (Bollegala and Bao, 2018). Our sentence meta-embeddings set a new unsupervised State of The Art (SoTA) on the STS Benchmark and on the STS12-STS16 datasets, with gains of between 3.7% and 6.4% Pearson’s r over single-source systems.

pdf abs
Predicting the Growth of Morphological Families from Social and Linguistic Factors
Valentin Hofmann | Janet Pierrehumbert | Hinrich Schütze
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We present the first study that examines the evolution of morphological families, i.e., sets of morphologically related words such as “trump”, “antitrumpism”, and “detrumpify”, in social media. We introduce the novel task of Morphological Family Expansion Prediction (MFEP) as predicting the increase in the size of a morphological family. We create a ten-year Reddit corpus as a benchmark for MFEP and evaluate a number of baselines on this benchmark. Our experiments demonstrate very good performance on MFEP.

pdf abs
Negated and Misprimed Probes for Pretrained Language Models: Birds Can Talk, But Cannot Fly
Nora Kassner | Hinrich Schütze
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Building on Petroni et al. (2019), we propose two new probing tasks analyzing factual knowledge stored in Pretrained Language Models (PLMs). (1) Negation. We find that PLMs do not distinguish between negated (“Birds cannot [MASK]”) and non-negated (“Birds can [MASK]”) cloze questions. (2) Mispriming. Inspired by priming methods in human psychology, we add “misprimes” to cloze questions (“Talk? Birds can [MASK]”). We find that PLMs are easily distracted by misprimes. These results suggest that PLMs still have a long way to go to adequately learn human-like factual knowledge.

pdf abs
Increasing Learning Efficiency of Self-Attention Networks through Direct Position Interactions, Learnable Temperature, and Convoluted Attention
Philipp Dufter | Martin Schmitt | Hinrich Schütze
Proceedings of the 28th International Conference on Computational Linguistics

Self-Attention Networks (SANs) are an integral part of successful neural architectures such as Transformer (Vaswani et al., 2017), and thus of pretrained language models such as BERT (Devlin et al., 2019) or GPT-3 (Brown et al., 2020). Training SANs on a task or pretraining them on language modeling requires large amounts of data and compute resources. We are searching for modifications to SANs that enable faster learning, i.e., higher accuracies after fewer update steps. We investigate three modifications to SANs: direct position interactions, learnable temperature, and convoluted attention. When evaluating them on part-of-speech tagging, we find that direct position interactions are an alternative to position embeddings, and convoluted attention has the potential to speed up the learning process.

pdf abs
Monolingual and Multilingual Reduction of Gender Bias in Contextualized Representations
Sheng Liang | Philipp Dufter | Hinrich Schütze
Proceedings of the 28th International Conference on Computational Linguistics

Pretrained language models (PLMs) learn stereotypes held by humans and reflected in text from their training corpora, including gender bias. When PLMs are used for downstream tasks such as picking candidates for a job, people’s lives can be negatively affected by these learned stereotypes. Prior work usually identifies a linear gender subspace and removes gender information by eliminating the subspace. Following this line of work, we propose to use DensRay, an analytical method for obtaining interpretable dense subspaces. We show that DensRay performs on par with prior approaches, but provide arguments that it is more robust and provide indications that it preserves language model performance better. By applying DensRay to attention heads and layers of BERT, we show that gender information is spread across all attention heads and most of the layers. We also show that DensRay can obtain gender bias scores on both token and sentence levels. Finally, we demonstrate that we can remove bias multilingually, e.g., from Chinese, using only English training data.

pdf abs
Automatically Identifying Words That Can Serve as Labels for Few-Shot Text Classification
Timo Schick | Helmut Schmid | Hinrich Schütze
Proceedings of the 28th International Conference on Computational Linguistics

A recent approach for few-shot text classification is to convert textual inputs to cloze questions that contain some form of task description, process them with a pretrained language model and map the predicted words to labels. Manually defining this mapping between words and labels requires both domain expertise and an understanding of the language model’s abilities. To mitigate this issue, we devise an approach that automatically finds such a mapping given small amounts of training data. For a number of tasks, the mapping found by our approach performs almost as well as hand-crafted label-to-word mappings.

pdf abs
Combining Word Embeddings with Bilingual Orthography Embeddings for Bilingual Dictionary Induction
Silvia Severini | Viktor Hangya | Alexander Fraser | Hinrich Schütze
Proceedings of the 28th International Conference on Computational Linguistics

Bilingual dictionary induction (BDI) is the task of accurately translating words to the target language. It is of great importance in many low-resource scenarios where cross-lingual training data is not available. To perform BDI, bilingual word embeddings (BWEs) are often used due to their low bilingual training signal requirements. They achieve high performance, but problematic cases still remain, such as the translation of rare words or named entities, which often need to be transliterated. In this paper, we enrich BWE-based BDI with transliteration information by using Bilingual Orthography Embeddings (BOEs). BOEs represent source and target language transliteration word pairs with similar vectors. A key problem in our BDI setup is to decide which information source – BWEs (or semantics) vs. BOEs (or orthography) – is more reliable for a particular word pair. We propose a novel classification-based BDI system that uses BWEs, BOEs and a number of other features to make this decision. We test our system on English-Russian BDI and show improved performance. In addition, we show the effectiveness of our BOEs by successfully using them for transliteration mining based on cosine similarity.

2019

In this paper, we report on the results of the TL;DR challenge, discussing an extensive manual evaluation of the expected properties of a good summary based on analyzing the comments provided by human annotators.

pdf abs
Analytical Methods for Interpretable Ultradense Word Embeddings
Philipp Dufter | Hinrich Schütze
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Word embeddings are useful for a wide variety of tasks, but they lack interpretability. By rotating word spaces, interpretable dimensions can be identified while preserving the information contained in the embeddings without any loss. In this work, we investigate three methods for making word spaces interpretable by rotation: Densifier (Rothe et al., 2016), linear SVMs and DensRay, a new method we propose. In contrast to Densifier, DensRay can be computed in closed form, is hyperparameter-free and thus more robust than Densifier. We evaluate the three methods on lexicon induction and set-based word analogy. In addition we provide qualitative insights as to how interpretable word spaces can be used for removing gender bias from embeddings.
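
The closed-form rotation described in the abstract above can be illustrated with a small sketch. It is not the authors' implementation. It assumes the objective is to find a direction along which difference vectors between words from opposite lexicon groups are large and those within a group are small, which reduces to an eigendecomposition of a symmetric matrix. The function name `densray_like_rotation`, the equal weighting of the two within-group terms, and the toy data are assumptions made for illustration.

```python
import numpy as np


def densray_like_rotation(pos, neg):
    """Return an orthogonal matrix whose first column separates pos from neg.

    pos, neg: arrays of shape (n_words, dim) holding embeddings of the two
    lexicon groups (hypothetical inputs for this sketch).
    """
    def scatter(a, b):
        # Average outer product of length-normalized difference vectors
        # over all pairs (one vector from a, one from b).
        diffs = (a[:, None, :] - b[None, :, :]).reshape(-1, a.shape[1])
        norms = np.linalg.norm(diffs, axis=1, keepdims=True)
        diffs = diffs / np.maximum(norms, 1e-12)  # zero differences stay zero
        return diffs.T @ diffs / len(diffs)

    between = scatter(pos, neg)
    within = 0.5 * (scatter(pos, pos) + scatter(neg, neg))  # assumed weighting
    matrix = between - within  # symmetric, so eigh applies

    eigvals, eigvecs = np.linalg.eigh(matrix)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    return eigvecs[:, order]


# Toy data: the two groups differ only along dimension 0.
rng = np.random.default_rng(0)
pos = rng.normal(size=(20, 5))
neg = rng.normal(size=(20, 5))
pos[:, 0] += 5.0
neg[:, 0] -= 5.0

rotation = densray_like_rotation(pos, neg)
rotated_pos = pos @ rotation  # first coordinate is the interpretable dimension
```

Because the solution is an eigendecomposition, the sketch needs no learning rate or iteration count, which matches the abstract's point that a closed-form method is hyperparameter-free.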

pdf abs
Multi-View Domain Adapted Sentence Embeddings for Low-Resource Unsupervised Duplicate Question Detection
Nina Poerner | Hinrich Schütze
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We address the problem of Duplicate Question Detection (DQD) in low-resource domain-specific Community Question Answering forums. Our multi-view framework MV-DASE combines an ensemble of sentence encoders via Generalized Canonical Correlation Analysis, using unlabeled data only. In our experiments, the ensemble includes generic and domain-specific averaged word embeddings, domain-finetuned BERT and the Universal Sentence Encoder. We evaluate MV-DASE on the CQADupStack corpus and on additional low-resource Stack Exchange forums. Combining the strengths of different encoders, we significantly outperform BM25, all single-view systems as well as a recent supervised domain-adversarial DQD method.

pdf abs
Neural Architectures for Fine-Grained Propaganda Detection in News
Pankaj Gupta | Khushbu Saxena | Usama Yaseen | Thomas Runkler | Hinrich Schütze
Proceedings of the Second Workshop on Natural Language Processing for Internet Freedom: Censorship, Disinformation, and Propaganda

This paper describes our system (MIC-CIS) details and results of participation in the fine grained propaganda detection shared task 2019. To address the tasks of sentence (SLC) and fragment level (FLC) propaganda detection, we explore different neural architectures (e.g., CNN, LSTM-CRF and BERT) and extract linguistic (e.g., part-of-speech, named entity, readability, sentiment, emotion, etc.), layout and topical features. Specifically, we have designed multi-granularity and multi-tasking neural architectures to jointly perform both the sentence and fragment level propaganda detection. Additionally, we investigate different ensemble schemes such as majority-voting, relax-voting, etc. to boost overall system performance. Compared to the other participating systems, our submissions are ranked 3rd and 4th in FLC and SLC tasks, respectively.
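
As a minimal illustration of the majority-voting ensemble scheme mentioned in the abstract above, the sketch below combines per-sentence labels from several classifiers. It is a generic example rather than the submitted system. The label names, the function name `majority_vote`, and the tie-breaking rule (first label encountered wins) are assumptions.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label sequences from several models by per-item majority.

    predictions: list of label lists, one per model, all of equal length.
    Ties are resolved in favor of the label seen first (an assumption).
    """
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical outputs of three sentence-level classifiers.
model_outputs = [
    ["propaganda", "non-propaganda", "non-propaganda"],
    ["propaganda", "propaganda", "non-propaganda"],
    ["non-propaganda", "propaganda", "non-propaganda"],
]
ensemble = majority_vote(model_outputs)
```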

pdf abs
Linguistically Informed Relation Extraction and Neural Architectures for Nested Named Entity Recognition in BioNLP-OST 2019
Pankaj Gupta | Usama Yaseen | Hinrich Schütze
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

Named Entity Recognition (NER) and Relation Extraction (RE) are essential tools in distilling knowledge from biomedical literature. This paper presents our findings from participating in BioNLP Shared Tasks 2019. We addressed Named Entity Recognition including nested entities extraction, Entity Normalization and Relation Extraction. Our proposed approach of Named Entities can be generalized to different languages and we have shown its effectiveness for English and Spanish text. We investigated linguistic features, hybrid loss including ranking and Conditional Random Fields (CRF), multi-task objective and token level ensembling strategy to improve NER. We employed dictionary based fuzzy and semantic search to perform Entity Normalization. Finally, our RE system employed Support Vector Machine (SVM) with linguistic features. Our NER submission (team:MIC-CIS) ranked first in BB-2019 norm+NER task with standard error rate (SER) of 0.7159 and showed competitive performance on PharmaCo NER task with F1-score of 0.8662. Our RE system ranked first in the SeeDev-binary Relation Extraction Task with F1-score of 0.3738.

pdf abs
BioNLP-OST 2019 RDoC Tasks: Multi-grain Neural Relevance Ranking Using Topics and Attention Based Query-Document-Sentence Interactions
Pankaj Gupta | Yatin Chaudhary | Hinrich Schütze
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

This paper presents our system details and results of participation in the RDoC Tasks of BioNLP-OST 2019. The Research Domain Criteria (RDoC) construct is a multi-dimensional and broad framework to describe mental health disorders by combining knowledge from genomics to behaviour. The non-availability of RDoC-labelled datasets and the tedious labelling process hinder the use of the RDoC framework to reach its full potential in the biomedical research community and healthcare industry. Therefore, Task-1 aims at retrieval and ranking of PubMed abstracts relevant to a given RDoC construct and Task-2 aims at extraction of the most relevant sentence from a given PubMed abstract. We investigate (1) an attention-based supervised neural topic model and SVM for retrieval and ranking of PubMed abstracts, and further utilize BM25 and other relevance measures for re-ranking, and (2) supervised and unsupervised sentence ranking models utilizing multi-view representations comprising query-aware attention-based sentence representation (QAR), bag-of-words (BoW) and TF-IDF. Our best systems achieved 1st rank and scored 0.86 mAP and 0.58 macro average accuracy in Task-1 and Task-2 respectively.

pdf abs
Attentive Mimicking: Better Word Embeddings by Attending to Informative Contexts
Timo Schick | Hinrich Schütze
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Learning high-quality embeddings for rare words is a hard problem because of sparse context information. Mimicking (Pinter et al., 2017) has been proposed as a solution: given embeddings learned by a standard algorithm, a model is first trained to reproduce embeddings of frequent words from their surface form and then used to compute embeddings for rare words. In this paper, we introduce attentive mimicking: the mimicking model is given access not only to a word’s surface form, but also to all available contexts and learns to attend to the most informative and reliable contexts for computing an embedding. In an evaluation on four tasks, we show that attentive mimicking outperforms previous work for both rare and medium-frequency words. Thus, compared to previous work, attentive mimicking improves embeddings for a much larger part of the vocabulary, including the medium-frequency range.

pdf abs
Neural Semi-Markov Conditional Random Fields for Robust Character-Based Part-of-Speech Tagging
Apostolos Kemos | Heike Adel | Hinrich Schütze
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Character-level models of tokens have been shown to be effective at dealing with within-token noise and out-of-vocabulary words. However, they often still rely on correct token boundaries. In this paper, we propose to eliminate the need for tokenizers with an end-to-end character-level semi-Markov conditional random field. It uses neural networks for its character and segment representations. We demonstrate its effectiveness in multilingual settings and when token boundaries are noisy: It matches state-of-the-art part-of-speech taggers for various languages and significantly outperforms them on a noisy English version of a benchmark dataset. Our code and the noisy dataset are publicly available at http://cistern.cis.lmu.de/semiCRF.

pdf abs
News Article Teaser Tweets and How to Generate Them
Sanjeev Kumar Karn | Mark Buckley | Ulli Waltinger | Hinrich Schütze
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

In this work, we define the task of teaser generation and provide an evaluation benchmark and baseline systems for the process of generating teasers. A teaser is a short reading suggestion for an article that is illustrative and includes curiosity-arousing elements to entice potential readers to read particular news items. Teasers are one of the main vehicles for transmitting news to social media users. We compile a novel dataset of teasers by systematically accumulating tweets and selecting those that conform to the teaser definition. We have compared a number of neural abstractive architectures on the task of teaser generation and the overall best performing system is See et al. seq2seq with pointer network.

pdf abs
Automatic Domain Adaptation Outperforms Manual Domain Adaptation for Predicting Financial Outcomes
Marina Sedinkina | Nikolas Breitkopf | Hinrich Schütze
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

In this paper, we automatically create sentiment dictionaries for predicting financial outcomes. We compare three approaches: (i) manual adaptation of the domain-general dictionary H4N, (ii) automatic adaptation of H4N and (iii) a combination consisting of first manual, then automatic adaptation. In our experiments, we demonstrate that the automatically adapted sentiment dictionary outperforms the previous state of the art in predicting the financial outcomes excess return and volatility. In particular, automatic adaptation performs better than manual adaptation. In our analysis, we find that annotation based on an expert’s a priori belief about a word’s meaning can be incorrect – annotation should be performed based on the word’s contexts in the target domain instead.

pdf abs
SherLIiC: A Typed Event-Focused Lexical Inference Benchmark for Evaluating Natural Language Inference
Martin Schmitt | Hinrich Schütze
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present SherLIiC, a testbed for lexical inference in context (LIiC), consisting of 3985 manually annotated inference rule candidates (InfCands), accompanied by (i) ~960k unlabeled InfCands, and (ii) ~190k typed textual relations between Freebase entities extracted from the large entity-linked corpus ClueWeb09. Each InfCand consists of one of these relations, expressed as a lemmatized dependency path, and two argument placeholders, each linked to one or more Freebase types. Due to our candidate selection process based on strong distributional evidence, SherLIiC is much harder than existing testbeds because distributional evidence is of little utility in the classification of InfCands. We also show that, due to its construction, many of SherLIiC’s correct InfCands are novel and missing from existing rule bases. We evaluate a large number of strong baselines on SherLIiC, ranging from semantic vector space models to state of the art neural models of natural language inference (NLI). We show that SherLIiC poses a tough challenge to existing NLI systems.

pdf abs
A Multilingual BPE Embedding Space for Universal Sentiment Lexicon Induction
Mengjie Zhao | Hinrich Schütze
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present a new method for sentiment lexicon induction that is designed to be applicable to the entire range of typological diversity of the world’s languages. We evaluate our method on Parallel Bible Corpus+ (PBC+), a parallel corpus of 1593 languages. The key idea is to use Byte Pair Encodings (BPEs) as basic units for multilingual embeddings. Through zero-shot transfer from English sentiment, we learn a seed lexicon for each language in the domain of PBC+. Through domain adaptation, we then generalize the domain-specific lexicon to a general one. We show – across typologically diverse languages in PBC+ – good quality of seed and general-domain sentiment lexicons by intrinsic and extrinsic and by automatic and human evaluation. We make freely available our code, seed sentiment lexicons for all 1593 languages and induced general-domain sentiment lexicons for 200 languages.

pdf abs
Probing for Semantic Classes: Diagnosing the Meaning Content of Word Embeddings
Yadollah Yaghoobzadeh | Katharina Kann | T. J. Hazen | Eneko Agirre | Hinrich Schütze
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Word embeddings typically represent different meanings of a word in a single conflated vector. Empirical analysis of embeddings of ambiguous words is currently limited by the small size of manually annotated resources and by the fact that word senses are treated as unrelated individual concepts. We present a large dataset based on manual Wikipedia annotations and word senses, where word senses from different words are related by semantic classes. This is the basis for novel diagnostic tests for an embedding’s content: we probe word embeddings for semantic classes and analyze the embedding space by classifying embeddings into semantic classes. Our main findings are: (i) Information about a sense is generally represented well in a single-vector embedding – if the sense is frequent. (ii) A classifier can accurately predict whether a word is single-sense or multi-sense, based only on its embedding. (iii) Although rare senses are not well represented in single-vector embeddings, this does not have negative impact on an NLP application whose performance depends on frequent senses.

2018

pdf abs
Multi-Multi-View Learning: Multilingual and Multi-Representation Entity Typing
Yadollah Yaghoobzadeh | Hinrich Schütze
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Accurate and complete knowledge bases (KBs) are paramount in NLP. We employ multiview learning for increasing the accuracy and coverage of entity type information in KBs. We rely on two metaviews: language and representation. For language, we consider high-resource and low-resource languages from Wikipedia. For representation, we consider representations based on the context distribution of the entity (i.e., on its embedding), on the entity’s name (i.e., on its surface form) and on its description in Wikipedia. The two metaviews language and representation can be freely combined: each pair of language and representation (e.g., German embedding, English description, Spanish name) is a distinct view. Our experiments on entity typing with fine-grained classes demonstrate the effectiveness of multiview learning. We release MVET, a large multiview (and, in particular, multilingual) entity typing dataset we created. Mono- and multilingual fine-grained entity typing systems can be evaluated on this dataset.

pdf abs
Neural Transductive Learning and Beyond: Morphological Generation in the Minimal-Resource Setting
Katharina Kann | Hinrich Schütze
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Neural state-of-the-art sequence-to-sequence (seq2seq) models often do not perform well for small training sets. We address paradigm completion, the morphological task of, given a partial paradigm, generating all missing forms. We propose two new methods for the minimal-resource setting: (i) Paradigm transduction: Since we assume only few paradigms available for training, neural seq2seq models are able to capture relationships between paradigm cells, but are tied to the idiosyncrasies of the training set. Paradigm transduction mitigates this problem by exploiting the input subset of inflected forms at test time. (ii) Source selection with high precision (SHIP): Multi-source models which learn to automatically select one or multiple sources to predict a target inflection do not perform well in the minimal-resource setting. SHIP is an alternative to identify a reliable source if training data is limited. On a 52-language benchmark dataset, we outperform the previous state of the art by up to 9.71% absolute accuracy.

bib
Traitement Automatique des Langues, Volume 59, Numéro 2 : Apprentissage profond pour le traitement automatique des langues [Deep Learning for natural language processing]
Alexandre Allauzen | Hinrich Schütze
Traitement Automatique des Langues, Volume 59, Numéro 2 : Apprentissage profond pour le traitement automatique des langues [Deep Learning for natural language processing]

pdf bib
Apprentissage profond pour le traitement automatique des langues [Deep Learning for Natural Language Processing]
Alexandre Allauzen | Hinrich Schütze
Traitement Automatique des Langues, Volume 59, Numéro 2 : Apprentissage profond pour le traitement automatique des langues [Deep Learning for natural language processing]

pdf abs
Joint Bootstrapping Machines for High Confidence Relation Extraction
Pankaj Gupta | Benjamin Roth | Hinrich Schütze
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Semi-supervised bootstrapping techniques for relationship extraction from text iteratively expand a set of initial seed instances. Due to the lack of labeled data, a key challenge in bootstrapping is semantic drift: if a false positive instance is added during an iteration, then all following iterations are contaminated. We introduce BREX, a new bootstrapping method that protects against such contamination by highly effective confidence assessment. This is achieved by using entity and template seeds jointly (as opposed to just one as in previous work), by expanding entities and templates in parallel and in a mutually constraining fashion in each iteration and by introducing higher-quality similarity measures for templates. Experimental results show that BREX achieves an F1 that is 0.13 (0.87 vs. 0.74) better than the state of the art for four relationships.

pdf abs
Fortification of Neural Morphological Segmentation Models for Polysynthetic Minimal-Resource Languages
Katharina Kann | Jesus Manuel Mager Hois | Ivan Vladimir Meza-Ruiz | Hinrich Schütze
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Morphological segmentation for polysynthetic languages is challenging, because a word may consist of many individual morphemes and training data can be extremely scarce. Since neural sequence-to-sequence (seq2seq) models define the state of the art for morphological segmentation in high-resource settings and for (mostly) European languages, we first show that they also obtain competitive performance for Mexican polysynthetic languages in minimal-resource settings. We then propose two novel multi-task training approaches (one with, one without need for external unlabeled resources) and two corresponding data augmentation methods, improving over the neural baseline for all languages. Finally, we explore cross-lingual transfer as a third way to fortify our neural model and show that we can train one single multi-lingual model for related languages while maintaining comparable or even improved performance, thus reducing the number of parameters by close to 75%. We provide our morphological segmentation datasets for Mexicanero, Nahuatl, Wixarika and Yorem Nokki for future research.

pdf abs
Deep Temporal-Recurrent-Replicated-Softmax for Topical Trends over Time
Pankaj Gupta | Subburam Rajaram | Hinrich Schütze | Bernt Andrassy
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Dynamic topic modeling facilitates the identification of topical trends over time in temporal collections of unstructured documents. We introduce a novel unsupervised neural dynamic topic model named Recurrent Neural Network-Replicated Softmax Model (RNN-RSM), where the discovered topics at each time step influence the topic discovery in the subsequent time steps. We account for the temporal ordering of documents by explicitly modeling a joint distribution of latent topical dependencies over time, using distributional estimators with temporal recurrent connections. Applying RNN-RSM to 19 years of articles on NLP research, we demonstrate that compared to state-of-the-art topic models, RNN-RSM shows better generalization, topic interpretation, evolution and trends. We also introduce a metric (named SPAN) to quantify the capability of a dynamic topic model to capture word evolution in topics over time.

pdf bib
Proceedings of the Second Workshop on Subword/Character LEvel Models
Manaal Faruqui | Hinrich Schütze | Isabel Trancoso | Yulia Tsvetkov | Yadollah Yaghoobzadeh
Proceedings of the Second Workshop on Subword/Character LEvel Models

pdf abs
Evaluating Word Embeddings in Multi-label Classification Using Fine-Grained Name Typing
Yadollah Yaghoobzadeh | Katharina Kann | Hinrich Schütze
Proceedings of the Third Workshop on Representation Learning for NLP

Embedding models typically associate each word with a single real-valued vector, representing its different properties. Evaluation methods, therefore, need to analyze the accuracy and completeness of these properties in embeddings. This requires fine-grained analysis of embedding subspaces. Multi-label classification is an appropriate way to do so. We propose a new evaluation method for word embeddings based on multi-label classification given a word embedding. The task we use is fine-grained name typing: given a large corpus, find all types that a name can refer to based on the name embedding. Given the scale of entities in knowledge bases, we can build datasets for this task that are complementary to the current embedding evaluation datasets in that they are very large, contain fine-grained classes, and allow the direct evaluation of embeddings without confounding factors like sentence context.

pdf bib abs
Replicated Siamese LSTM in Ticketing System for Similarity Learning and Retrieval in Asymmetric Texts
Pankaj Gupta | Bernt Andrassy | Hinrich Schütze
Proceedings of the Third Workshop on Semantic Deep Learning

The goal of our industrial ticketing system is to retrieve a relevant solution for an input query by matching it with historical tickets stored in a knowledge base. A query is comprised of subject and description, while a historical ticket consists of subject, description and solution. To retrieve a relevant solution, we use a textual similarity paradigm to learn similarity between the query and historical tickets. The task is challenging due to significant term mismatch in the query and ticket pairs of asymmetric lengths, where the subject is a short text but description and solution are multi-sentence texts. We present a novel Replicated Siamese LSTM model to learn similarity in asymmetric text pairs, which gives 22% and 7% gain (Accuracy@10) on the retrieval task over unsupervised and supervised baselines, respectively. We also show that the topic and distributed semantic features for short and long texts improved both similarity learning and retrieval.

pdf abs
LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to Pattern Transformation
Pankaj Gupta | Hinrich Schütze
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Recurrent neural networks (RNNs) are temporal networks and cumulative in nature that have shown promising results in various natural language processing tasks. Despite their success, it still remains a challenge to understand their hidden behavior. In this work, we analyze and interpret the cumulative nature of RNN via a proposed technique named as Layer-wIse-Semantic-Accumulation (LISA) for explaining decisions and detecting the most likely (i.e., saliency) patterns that the network relies on while decision making. We demonstrate (1) LISA: “How an RNN accumulates or builds semantics during its sequential processing for a given text example and expected response” (2) Example2pattern: “How the saliency patterns look like for each category in the data according to the network in decision making”. We analyse the sensitiveness of RNNs about different inputs to check the increase or decrease in prediction scores and further extract the saliency patterns learned by the network. We employ two relation classification datasets: SemEval 10 Task 8 and TAC KBP Slot Filling to explain RNN predictions via the LISA and example2pattern.

pdf abs
Interpretable Textual Neuron Representations for NLP
Nina Poerner | Benjamin Roth | Hinrich Schütze
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

Input optimization methods, such as Google Deep Dream, create interpretable representations of neurons for computer vision DNNs. We propose and evaluate ways of transferring this technology to NLP. Our results suggest that gradient ascent with a gumbel softmax layer produces n-gram representations that outperform naive corpus search in terms of target neuron activation. The representations highlight differences in syntax awareness between the language and visual models of the Imaginet architecture.

The TL;DR challenge fosters research in abstractive summarization of informal text, the largest and fastest-growing source of textual data on the web, which has been overlooked by summarization research so far. The challenge owes its name to the frequent practice of social media users to supplement long posts with a “TL;DR”—for “too long; didn’t read”—followed by a short summary as a courtesy to those who would otherwise reply with the exact same abbreviation to indicate they did not care to read a post for its apparent length. Posts featuring TL;DR summaries form an excellent ground truth for summarization, and by tapping into this resource for the first time, we have mined millions of training examples from social media, opening the door to all kinds of generative models.

pdf abs
Evaluating neural network explanation methods using hybrid documents and morphosyntactic agreement
Nina Poerner | Hinrich Schütze | Benjamin Roth
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

The behavior of deep neural networks (DNNs) is hard to understand. This makes it necessary to explore post hoc explanation methods. We conduct the first comprehensive evaluation of explanation methods for NLP. To this end, we design two novel evaluation paradigms that cover two important classes of NLP problems: small context and large context problems. Both paradigms require no manual annotation and are therefore broadly applicable. We also introduce LIMSSE, an explanation method inspired by LIME that is designed for NLP. We show empirically that LIMSSE, LRP and DeepLIFT are the most effective explanation methods and recommend them for explaining DNNs in NLP.

pdf abs
Two Methods for Domain Adaptation of Bilingual Tasks: Delightfully Simple and Broadly Applicable
Viktor Hangya | Fabienne Braune | Alexander Fraser | Hinrich Schütze
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Bilingual tasks, such as bilingual lexicon induction and cross-lingual classification, are crucial for overcoming data sparsity in the target language. Resources required for such tasks are often out-of-domain, thus domain adaptation is an important problem here. We make two contributions. First, we test a delightfully simple method for domain adaptation of bilingual word embeddings. We evaluate these embeddings on two bilingual tasks involving different domains: cross-lingual twitter sentiment classification and medical bilingual lexicon induction. Second, we tailor a broadly applicable semi-supervised classification method from computer vision to these tasks. We show that this method also helps in low-resource setups. Using both methods together we achieve large improvements over our baselines, by using only additional unlabeled data.

pdf abs
Embedding Learning Through Multilingual Concept Induction
Philipp Dufter | Mengjie Zhao | Martin Schmitt | Alexander Fraser | Hinrich Schütze
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a new method for estimating vector space representations of words: embedding learning by concept induction. We test this method on a highly parallel corpus and learn semantic representations of words in 1259 different languages in a single common space. An extensive experimental evaluation on crosslingual word similarity and sentiment analysis indicates that concept-based multilingual embedding learning performs better than previous approaches.

pdf abs
End-Task Oriented Textual Entailment via Deep Explorations of Inter-Sentence Interactions
Wenpeng Yin | Dan Roth | Hinrich Schütze
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This work deals with SciTail, a natural entailment challenge derived from a multi-choice question answering problem. The premises and hypotheses in SciTail were generated with no awareness of each other, and did not specifically aim at the entailment task. This makes it more challenging than other entailment data sets and more directly useful to the end-task – question answering. We propose DEISTE (deep explorations of inter-sentence interactions for textual entailment) for this entailment task. Given word-to-word interactions between the premise-hypothesis pair (P, H), DEISTE consists of: (i) a parameter-dynamic convolution to make important words in P and H play a dominant role in learnt representations; and (ii) a position-aware attentive convolution to encode the representation and position information of the aligned word pairs. Experiments show that DEISTE gets ≈5% improvement over prior state of the art and that the pretrained DEISTE on SciTail generalizes well on RTE-5.

pdf abs
Recurrent One-Hop Predictions for Reasoning over Knowledge Graphs
Wenpeng Yin | Yadollah Yaghoobzadeh | Hinrich Schütze
Proceedings of the 27th International Conference on Computational Linguistics

Large-scale knowledge graphs (KGs) such as Freebase are generally incomplete. Reasoning over multi-hop (mh) KG paths is thus an important capability that is needed for question answering or other NLP tasks that require knowledge about the world. mh-KG reasoning includes diverse scenarios, e.g., given a head entity and a relation path, predict the tail entity; or given two entities connected by some relation paths, predict the unknown relation between them. We present ROPs, recurrent one-hop predictors, that predict entities at each step of mh-KG paths by using recurrent neural networks and vector representations of entities and relations, with two benefits: (i) modeling mh-paths of arbitrary lengths while updating the entity and relation representations by the training signal at each step; (ii) handling different types of mh-KG reasoning in a unified framework. Our models achieve state-of-the-art results on two important multi-hop KG reasoning tasks: Knowledge Base Completion and Path Query Answering.

pdf abs
Joint Semantic Synthesis and Morphological Analysis of the Derived Word
Ryan Cotterell | Hinrich Schütze
Transactions of the Association for Computational Linguistics, Volume 6

Much like sentences are composed of words, words themselves are composed of smaller units. For example, the English word questionably can be analyzed as question+able+ly. However, this structural decomposition of the word does not directly give us a semantic representation of the word’s meaning. Since morphology obeys the principle of compositionality, the semantics of the word can be systematically derived from the meaning of its parts. In this work, we propose a novel probabilistic model of word formation that captures both the analysis of a word w into its constituent segments and the synthesis of the meaning of w from the meanings of those segments. Our model jointly learns to segment words into morphemes and compose distributional semantic vectors of those morphemes. We experiment with the model on English CELEX data and German DErivBase (Zeller et al., 2013) data. We show that jointly modeling semantics increases both segmentation accuracy and morpheme F1 by between 3% and 5%. Additionally, we investigate different models of vector composition, showing that recurrent neural networks yield an improvement over simple additive models. Finally, we study the degree to which the representations correspond to a linguist’s notion of morphological productivity.

pdf abs
Attentive Convolution: Equipping CNNs with RNN-style Attention Mechanisms
Wenpeng Yin | Hinrich Schütze
Transactions of the Association for Computational Linguistics, Volume 6

In NLP, convolutional neural networks (CNNs) have benefited less than recurrent neural networks (RNNs) from attention mechanisms. We hypothesize that this is because the attention in CNNs has been mainly implemented as attentive pooling (i.e., it is applied to pooling) rather than as attentive convolution (i.e., it is integrated into convolution). Convolution is the differentiator of CNNs in that it can powerfully model the higher-level representation of a word by taking into account its local fixed-size context in the input text tx. In this work, we propose an attentive convolution network, ATTCONV. It extends the context scope of the convolution operation, deriving higher-level features for a word not only from local context, but also from information extracted from nonlocal context by the attention mechanism commonly used in RNNs. This nonlocal context can come (i) from parts of the input text tx that are distant or (ii) from extra (i.e., external) contexts ty. Experiments on sentence modeling with zero-context (sentiment analysis), single-context (textual entailment) and multiple-context (claim verification) demonstrate the effectiveness of ATTCONV in sentence representation learning with the incorporation of context. In particular, attentive convolution outperforms attentive pooling and is a strong competitor to popular attentive RNNs.

2017

pdf abs
Past, Present, Future: A Computational Investigation of the Typology of Tense in 1000 Languages
Ehsaneddin Asgari | Hinrich Schütze
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We present SuperPivot, an analysis method for low-resource languages that occur in a superparallel corpus, i.e., in a corpus that contains an order of magnitude more languages than parallel corpora currently in use. We show that SuperPivot performs well for the crosslingual analysis of the linguistic phenomenon of tense. We produce analysis results for more than 1000 languages, conducting – to the best of our knowledge – the largest crosslingual computational study performed to date. We extend existing methodology for leveraging parallel corpora for typological analysis by overcoming a limiting assumption of earlier work: We only require that a linguistic feature is overtly marked in a few of thousands of languages as opposed to requiring that it be marked in all languages under investigation.

pdf abs
Global Normalization of Convolutional Neural Networks for Joint Entity and Relation Classification
Heike Adel | Hinrich Schütze
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We introduce globally normalized convolutional neural networks for joint entity classification and relation extraction. In particular, we propose a way to utilize a linear-chain conditional random field output layer for predicting entity types and relations between entities at the same time. Our experiments show that global normalization outperforms a locally normalized softmax layer on a benchmark dataset.

pdf abs
Statistical Models for Unsupervised, Semi-Supervised, and Supervised Transliteration Mining
Hassan Sajjad | Helmut Schmid | Alexander Fraser | Hinrich Schütze
Computational Linguistics, Volume 43, Issue 2 - June 2017

We present a generative model that efficiently mines transliteration pairs in a consistent fashion in three different settings: unsupervised, semi-supervised, and supervised transliteration mining. The model interpolates two sub-models, one for the generation of transliteration pairs and one for the generation of non-transliteration pairs (i.e., noise). The model is trained on noisy unlabeled data using the EM algorithm. During training the transliteration sub-model learns to generate transliteration pairs and the fixed non-transliteration model generates the noise pairs. After training, the unlabeled data is disambiguated based on the posterior probabilities of the two sub-models. We evaluate our transliteration mining system on data from a transliteration mining shared task and on parallel corpora. For three out of four language pairs, our system outperforms all semi-supervised and supervised systems that participated in the NEWS 2010 shared task. On word pairs extracted from parallel corpora with fewer than 2% transliteration pairs, our system achieves up to 86.7% F-measure with 77.9% precision and 97.8% recall.
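
The interpolation of a learned transliteration sub-model with a fixed noise sub-model, trained by EM and followed by posterior-based disambiguation, can be illustrated with a toy sketch. This is not the authors' system. It assumes, purely for illustration, that source and target words have equal length and align character by character, that the transliteration sub-model is a joint character-pair multinomial, and that the noise sub-model is a product of independent character unigrams. The function name `train_mining_em` and its parameters are hypothetical.

```python
from collections import Counter
from math import prod


def train_mining_em(pairs, iterations=20, noise_prior=0.5):
    """Toy EM for a two-component mixture over word pairs.

    Assumption (not from the abstract): each pair has equal-length strings
    that align one character to one character.
    Returns (posteriors of being a transliteration, learned noise weight).
    """
    # Fixed noise sub-model: independent source and target character unigrams.
    src_counts = Counter(c for s, _ in pairs for c in s)
    tgt_counts = Counter(c for _, t in pairs for c in t)
    src_total = sum(src_counts.values())
    tgt_total = sum(tgt_counts.values())

    def p_noise(s, t):
        return prod(
            (src_counts[a] / src_total) * (tgt_counts[b] / tgt_total)
            for a, b in zip(s, t)
        )

    # Learned transliteration sub-model: joint character-pair multinomial,
    # initialised uniformly over the character pairs seen in the data.
    pair_types = {(a, b) for s, t in pairs for a, b in zip(s, t)}
    joint = {cp: 1.0 / len(pair_types) for cp in pair_types}
    lam = noise_prior  # interpolation weight of the noise sub-model

    posteriors = []
    for _ in range(iterations):
        # E-step: posterior probability that each pair is a transliteration.
        posteriors = []
        for s, t in pairs:
            p_t = (1.0 - lam) * prod(joint[(a, b)] for a, b in zip(s, t))
            p_n = lam * p_noise(s, t)
            posteriors.append(p_t / (p_t + p_n))

        # M-step: re-estimate the transliteration sub-model from
        # posterior-weighted character-pair counts, and the mixture weight.
        weighted = Counter()
        for (s, t), gamma in zip(pairs, posteriors):
            for a, b in zip(s, t):
                weighted[(a, b)] += gamma
        total = sum(weighted.values()) or 1.0
        joint = {cp: weighted[cp] / total for cp in pair_types}
        lam = 1.0 - sum(posteriors) / len(posteriors)

    return posteriors, lam


# Hypothetical data: the first four pairs follow the mapping a->x, b->y,
# the last two do not.
toy_pairs = [
    ("aba", "xyx"), ("bab", "yxy"), ("abb", "xyy"), ("baa", "yxx"),
    ("aab", "yxx"), ("bba", "xxy"),
]
toy_posteriors, toy_lambda = train_mining_em(toy_pairs)
for pair, post in zip(toy_pairs, toy_posteriors):
    print(pair, round(post, 3))
```

In this sketch, pairs whose posterior exceeds a chosen threshold would be kept as mined transliterations; the threshold itself is a design choice that the abstract does not specify.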

pdf abs
AutoExtend: Combining Word Embeddings with Semantic Resources
Sascha Rothe | Hinrich Schütze
Computational Linguistics, Volume 43, Issue 3 - September 2017

We present AutoExtend, a system that combines word embeddings with semantic resources by learning embeddings for non-word objects like synsets and entities and learning word embeddings that incorporate the semantic information from the resource. The method is based on encoding and decoding the word embeddings and is flexible in that it can take any word embeddings as input and does not need an additional training corpus. The obtained embeddings live in the same vector space as the input word embeddings. A sparse tensor formalization guarantees efficiency and parallelizability. We use WordNet, GermaNet, and Freebase as semantic resources. AutoExtend achieves state-of-the-art performance on Word-in-Context Similarity and Word Sense Disambiguation tasks.

pdf abs
One-Shot Neural Cross-Lingual Transfer for Paradigm Completion
Katharina Kann | Ryan Cotterell | Hinrich Schütze
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a novel cross-lingual transfer method for paradigm completion, the task of mapping a lemma to its inflected forms, using a neural encoder-decoder model, the state of the art for the monolingual task. We use labeled data from a high-resource language to increase performance on a low-resource language. In experiments on 21 language pairs from four different language families, we obtain up to 58% higher accuracy than without transfer and show that even zero-shot and one-shot learning are possible. We further find that the degree of language relatedness strongly influences the ability to transfer morphological knowledge.

pdf bib
Proceedings of the First Workshop on Subword and Character Level Models in NLP
Manaal Faruqui | Hinrich Schuetze | Isabel Trancoso | Yadollah Yaghoobzadeh
Proceedings of the First Workshop on Subword and Character Level Models in NLP

pdf abs
Unlabeled Data for Morphological Generation With Character-Based Sequence-to-Sequence Models
Katharina Kann | Hinrich Schütze
Proceedings of the First Workshop on Subword and Character Level Models in NLP

We present a semi-supervised way of training a character-based encoder-decoder recurrent neural network for morphological reinflection—the task of generating one inflected wordform from another. This is achieved by using unlabeled tokens or random strings as training data for an autoencoding task, adapting a network for morphological reinflection, and performing multi-task training. We thus use limited labeled data more effectively, obtaining up to 9.92% improvement over state-of-the-art baselines for 8 different languages.

pdf abs
Exploring Different Dimensions of Attention for Uncertainty Detection
Heike Adel | Hinrich Schütze
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Neural networks with attention have proven effective for many natural language processing tasks. In this paper, we develop attention mechanisms for uncertainty detection. In particular, we generalize standardly used attention mechanisms by introducing external attention and sequence-preserving attention. These novel architectures differ from standard approaches in that they use external resources to compute attention weights and preserve sequence information. We compare them to other configurations along different dimensions of attention. Our novel architectures set the new state of the art on a Wikipedia benchmark dataset and perform similarly to the state-of-the-art model on a biomedical benchmark which uses a large set of linguistic features.

pdf abs
Neural Multi-Source Morphological Reinflection
Katharina Kann | Ryan Cotterell | Hinrich Schütze
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We explore the task of multi-source morphological reinflection, which generalizes the standard, single-source version. The input consists of (i) a target tag and (ii) multiple pairs of source form and source tag for a lemma. The motivation is that it is beneficial to have access to more than one source form since different source forms can provide complementary information, e.g., different stems. We further present a novel extension to the encoder-decoder recurrent neural architecture, consisting of multiple encoders, to better solve the task. We show that our new architecture outperforms single-source reinflection models and publish our dataset for multi-source morphological reinflection to facilitate future research.

pdf abs
Multi-level Representations for Fine-Grained Typing of Knowledge Base Entities
Yadollah Yaghoobzadeh | Hinrich Schütze
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Entities are essential elements of natural language. In this paper, we present methods for learning multi-level representations of entities on three complementary levels: character (character patterns in entity names extracted, e.g., by neural networks), word (embeddings of words in entity names) and entity (entity embeddings). We investigate state-of-the-art learning methods on each level and find large differences, e.g., for deep learning models, traditional ngram features and the subword model of fasttext (Bojanowski et al., 2016) on the character level; for word2vec (Mikolov et al., 2013) on the word level; and for the order-aware model wang2vec (Ling et al., 2015a) on the entity level. We confirm experimentally that each level of representation contributes complementary information and a joint representation of all three levels improves the existing embedding based baseline for fine-grained entity typing by a large margin. Additionally, we show that adding information from entity descriptions further improves multi-level representations of entities.

pdf abs
Task-Specific Attentive Pooling of Phrase Alignments Contributes to Sentence Matching
Wenpeng Yin | Hinrich Schütze
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

This work comparatively studies two typical sentence matching tasks: textual entailment (TE) and answer selection (AS), observing that weaker phrase alignments are more critical in TE, while stronger phrase alignments deserve more attention in AS. The key to reaching this observation lies in phrase detection, phrase representation, phrase alignment, and more importantly how to connect those aligned phrases of different matching degrees with the final classifier. Prior work (i) has limitations in phrase generation and representation, or (ii) conducts alignment at word and phrase levels by handcrafted features or (iii) utilizes a single framework of alignment without considering the characteristics of specific tasks, which limits the framework’s effectiveness across tasks. We propose an architecture based on the Gated Recurrent Unit that supports (i) representation learning of phrases of arbitrary granularity and (ii) task-specific attentive pooling of phrase alignments between two sentences. Experimental results on TE and AS match our observation and show the effectiveness of our approach.

pdf abs
Nonsymbolic Text Representation
Hinrich Schütze
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We introduce the first generic text representation model that is completely nonsymbolic, i.e., it does not require the availability of a segmentation or tokenization method that attempts to identify words or other symbolic units in text. This applies to training the parameters of the model on a training corpus as well as to applying it when computing the representation of a new text. We show that our model performs better than prior work on an information extraction and a text denoising task.

pdf abs
Noise Mitigation for Neural Entity Typing and Relation Extraction
Yadollah Yaghoobzadeh | Heike Adel | Hinrich Schütze
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

In this paper, we address two different types of noise in information extraction models: noise from distant supervision and noise from pipeline input features. Our target tasks are entity typing and relation extraction. For the first noise type, we introduce multi-instance multi-label learning algorithms using neural network models, and apply them to fine-grained entity typing for the first time. Our model outperforms the state-of-the-art supervised approach which uses global embeddings of entities. For the second noise type, we propose ways to improve the integration of noisy entity type predictions into relation extraction. Our experiments show that probabilistic predictions are more robust than discrete predictions and that joint training of the two tasks performs best.

pdf abs
End-to-End Trainable Attentive Decoder for Hierarchical Entity Classification
Sanjeev Karn | Ulli Waltinger | Hinrich Schütze
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We address fine-grained entity classification and propose a novel attention-based recurrent neural network (RNN) encoder-decoder that generates paths in the type hierarchy and can be trained end-to-end. We show that our model performs better on fine-grained entity classification than prior work that relies on flat or local classifiers that do not directly model hierarchical structure.

pdf bib
Training Data Augmentation for Low-Resource Morphological Inflection
Toms Bergmanis | Katharina Kann | Hinrich Schütze | Sharon Goldwater
Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection

pdf
The LMU System for the CoNLL-SIGMORPHON 2017 Shared Task on Universal Morphological Reinflection
Katharina Kann | Hinrich Schütze
Proceedings of the CoNLL SIGMORPHON 2017 Shared Task: Universal Morphological Reinflection

2016

pdf
Intrinsic Subspace Evaluation of Word Embedding Representations
Yadollah Yaghoobzadeh | Hinrich Schütze
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Learning Word Meta-Embeddings
Wenpeng Yin | Hinrich Schütze
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Morphological Smoothing and Extrapolation of Word Embeddings
Ryan Cotterell | Hinrich Schütze | Jason Eisner
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Word Embedding Calculus in Meaningful Ultradense Subspaces
Sascha Rothe | Hinrich Schütze
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
Single-Model Encoder-Decoder with Explicit Morphological Representation for Reinflection
Katharina Kann | Hinrich Schütze
Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf abs
Simple Question Answering by Attentive Convolutional Neural Network
Wenpeng Yin | Mo Yu | Bing Xiang | Bowen Zhou | Hinrich Schütze
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This work focuses on answering single-relation factoid questions over Freebase. Each question can acquire the answer from a single fact of form (subject, predicate, object) in Freebase. This task, simple question answering (SimpleQA), can be addressed via a two-step pipeline: entity linking and fact selection. In fact selection, we match the subject entity in a fact candidate with the entity mention in the question by a character-level convolutional neural network (char-CNN), and match the predicate in that fact with the question by a word-level CNN (word-CNN). This work makes two main contributions. (i) A simple and effective entity linker over Freebase is proposed. Our entity linker outperforms the state-of-the-art entity linker on the SimpleQA task. (ii) A novel attentive maxpooling is stacked over word-CNN, so that the predicate representation can be matched with the predicate-focused question representation more effectively. Experiments show that our system sets a new state of the art in this task.

pdf abs
Table Filling Multi-Task Recurrent Neural Network for Joint Entity and Relation Extraction
Pankaj Gupta | Hinrich Schütze | Bernt Andrassy
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

This paper proposes a novel context-aware joint entity and word-level relation extraction approach through semantic composition of words, introducing a Table Filling Multi-Task Recurrent Neural Network (TF-MTRNN) model that reduces the entity recognition and relation classification tasks to a table-filling problem and models their interdependencies. The proposed neural network architecture is capable of modeling multiple relation instances without knowing the corresponding relation arguments in a sentence. The experimental results show that a simple approach of piggybacking candidate entities to model the label dependencies from relations to entities improves performance. We present state-of-the-art results with improvements of 2.0% and 2.7% for entity recognition and relation classification, respectively, on the CoNLL04 dataset.

pdf abs
ABCNN: Attention-Based Convolutional Neural Network for Modeling Sentence Pairs
Wenpeng Yin | Hinrich Schütze | Bing Xiang | Bowen Zhou
Transactions of the Association for Computational Linguistics, Volume 4

How to model a pair of sentences is a critical issue in many NLP tasks such as answer selection (AS), paraphrase identification (PI) and textual entailment (TE). Most prior work (i) deals with one individual task by fine-tuning a specific system; (ii) models each sentence’s representation separately, rarely considering the impact of the other sentence; or (iii) relies fully on manually designed, task-specific linguistic features. This work presents a general Attention Based Convolutional Neural Network (ABCNN) for modeling a pair of sentences. We make three contributions. (i) The ABCNN can be applied to a wide variety of tasks that require modeling of sentence pairs. (ii) We propose three attention schemes that integrate mutual influence between sentences into CNNs; thus, the representation of each sentence takes into consideration its counterpart. These interdependent sentence pair representations are more powerful than isolated sentence representations. (iii) ABCNNs achieve state-of-the-art performance on AS, PI and TE tasks. We release code at: https://github.com/yinwenpeng/Answer_Selection.

pdf
Attention-Based Convolutional Neural Network for Machine Comprehension
Wenpeng Yin | Sebastian Ebert | Hinrich Schütze
Proceedings of the Workshop on Human-Computer Question Answering

pdf
MED: The LMU System for the SIGMORPHON 2016 Shared Task on Morphological Reinflection
Katharina Kann | Hinrich Schütze
Proceedings of the 14th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

pdf
Combining Recurrent and Convolutional Neural Networks for Relation Classification
Ngoc Thang Vu | Heike Adel | Pankaj Gupta | Hinrich Schütze
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
A Joint Model of Orthography and Morphological Segmentation
Ryan Cotterell | Tim Vieira | Hinrich Schütze
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
Ultradense Word Embeddings by Orthogonal Transformation
Sascha Rothe | Sebastian Ebert | Hinrich Schütze
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
Comparing Convolutional Neural Networks to Traditional Models for Slot Filling
Heike Adel | Benjamin Roth | Hinrich Schütze
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
LAMB: A Good Shepherd of Morphologically Rich Languages
Sebastian Ebert | Thomas Müller | Hinrich Schütze
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf
Neural Morphological Analysis: Encoding-Decoding Canonical Segments
Katharina Kann | Ryan Cotterell | Hinrich Schütze
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

pdf
Morphological Segmentation Inside-Out
Ryan Cotterell | Arun Kumar | Hinrich Schütze
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

2015

pdf
CIS-positive: A Combination of Convolutional Neural Networks and Support Vector Machines for Sentiment Analysis in Twitter
Sebastian Ebert | Ngoc Thang Vu | Hinrich Schütze
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)

pdf
A Linguistically Informed Convolutional Neural Network
Sebastian Ebert | Ngoc Thang Vu | Hinrich Schütze
Proceedings of the 6th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

pdf bib
The Operation Sequence Model—Combining N-Gram-Based and Phrase-Based Statistical Machine Translation
Nadir Durrani | Helmut Schmid | Alexander Fraser | Philipp Koehn | Hinrich Schütze
Computational Linguistics, Volume 41, Issue 2 - June 2015

pdf
Labeled Morphological Segmentation with Semi-Markov Models
Ryan Cotterell | Thomas Müller | Alexander Fraser | Hinrich Schütze
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

pdf
Multichannel Variable-Size Convolution for Sentence Classification
Wenpeng Yin | Hinrich Schütze
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

pdf
MultiGranCNN: An Architecture for General Matching of Text Chunks on Multiple Levels of Granularity
Wenpeng Yin | Hinrich Schütze
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf
AutoExtend: Extending Word Embeddings to Embeddings for Synsets and Lexemes
Sascha Rothe | Hinrich Schütze
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf
Learning Better Embeddings for Rare Words Using Distributional Representations
Irina Sergienya | Hinrich Schütze
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf
Corpus-level Fine-grained Entity Typing Using Contextual Information
Yadollah Yaghoobzadeh | Hinrich Schütze
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf
Online Updating of Word Representations for Part-of-Speech Tagging
Wenpeng Yin | Tobias Schnabel | Hinrich Schütze
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf
Joint Lemmatization and Morphological Tagging with Lemming
Thomas Müller | Ryan Cotterell | Alexander Fraser | Hinrich Schütze
Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing

pdf
Robust Morphological Tagging with Word Representations
Thomas Müller | Hinrich Schuetze
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
Convolutional Neural Network for Paraphrase Identification
Wenpeng Yin | Hinrich Schütze
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
Morphological Word-Embeddings
Ryan Cotterell | Hinrich Schütze
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
Discriminative Phrase Embedding for Paraphrase Identification
Wenpeng Yin | Hinrich Schütze
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2014

pdf
Dependency parsing with latent refinements of part-of-speech tags
Thomas Mueller | Richard Farkas | Alex Judea | Helmut Schmid | Hinrich Schuetze
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf
Fine-Grained Contextual Predictions for Hard Sentiment Words
Sebastian Ebert | Hinrich Schütze
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf
Using Mined Coreference Chains as a Resource for a Semantic Task
Heike Adel | Hinrich Schütze
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

pdf
Multi-Domain Sentiment Relevance Classification with Automatic Representation Learning
Christian Scheible | Hinrich Schütze
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers

pdf
Unsupervised Training Set Generation for Automatic Acquisition of Technical Terminology in Patents
Alex Judea | Hinrich Schütze | Soeren Bruegmann
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf
Picking the Amateur’s Mind - Predicting Chess Player Strength from Game Annotations
Christian Scheible | Hinrich Schütze
Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers

pdf bib abs
FLORS: Fast and Simple Domain Adaptation for Part-of-Speech Tagging
Tobias Schnabel | Hinrich Schütze
Transactions of the Association for Computational Linguistics, Volume 2

We present FLORS, a new part-of-speech tagger for domain adaptation. FLORS uses robust representations that work especially well for unknown words and for known words with unseen tags. FLORS is simpler and faster than previous domain adaptation methods, yet it has significantly better accuracy than several baselines.

pdf
CoSimRank: A Flexible & Efficient Graph-Theoretic Similarity Measure
Sascha Rothe | Hinrich Schütze
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Improving Citation Polarity Classification with Product Reviews
Charles Jochim | Hinrich Schütze
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf
An Exploration of Embeddings for Generalized Phrases
Wenpeng Yin | Hinrich Schütze
Proceedings of the ACL 2014 Student Research Workshop

2013

pdf bib
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Hinrich Schuetze | Pascale Fung | Massimo Poesio
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf
Sentiment Relevance
Christian Scheible | Hinrich Schütze
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

pdf bib
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Hinrich Schuetze | Pascale Fung | Massimo Poesio
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
The Operation Sequence Model: Integrating Translation and Reordering Operations in a Single Left-to-Right Model
Hinrich Schütze
Proceedings of Machine Translation Summit XIV: Plenaries

pdf
Knowledge Sources for Constituent Parsing of German, a Morphologically Rich and Less-Configurational Language
Alexander Fraser | Helmut Schmid | Richárd Farkas | Renjing Wang | Hinrich Schütze
Computational Linguistics, Volume 39, Issue 1 - March 2013

pdf
Towards Robust Cross-Domain Domain Adaptation for Part-of-Speech Tagging
Tobias Schnabel | Hinrich Schütze
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf
Multilingual Lexicon Bootstrapping - Improving a Lexicon Induction System Using a Parallel Corpus
Patrick Ziering | Lonneke van der Plas | Hinrich Schütze
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf
Bootstrapping Semantic Lexicons for Technical Domains
Patrick Ziering | Lonneke van der Plas | Hinrich Schütze
Proceedings of the Sixth International Joint Conference on Natural Language Processing

pdf
Efficient Higher-Order CRFs for Morphological Tagging
Thomas Mueller | Helmut Schmid | Hinrich Schütze
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf
The Topology of Semantic Knowledge
Jimmy Dubuisson | Jean-Pierre Eckmann | Christian Scheible | Hinrich Schütze
Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing

pdf
CodeX: Combining an SVM Classifier and Character N-gram Language Models for Sentiment Analysis on Twitter Text
Qi Han | Junfei Guo | Hinrich Schuetze
Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013)

2012

pdf
Automatic generation of short informative sentiment summaries
Andrea Glaser | Hinrich Schütze
Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics

pdf
Automatic Detection of Point of View Differences in Wikipedia
Khalid Al Khatib | Hinrich Schütze | Cathleen Kantner
Proceedings of COLING 2012

pdf
Towards a Generic and Flexible Citation Classifier Based on a Faceted Classification Scheme
Charles Jochim | Hinrich Schütze
Proceedings of COLING 2012

pdf
Classification of Inconsistent Sentiment Words using Syntactic Constructions
Wiltrud Kessler | Hinrich Schütze
Proceedings of COLING 2012: Posters

pdf abs
Bootstrapping Sentiment Labels For Unannotated Documents With Polarity PageRank
Christian Scheible | Hinrich Schütze
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present a novel graph-theoretic method for the initial annotation of high-confidence training data for bootstrapping sentiment classifiers. We estimate polarity using topic-specific PageRank. Sentiment information is propagated from an initial seed lexicon through a joint graph representation of words and documents. We report improved classification accuracies across multiple domains for the base models and the maximum entropy model bootstrapped from the PageRank annotation.
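The propagation idea in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the toy graph, the seed words, the damping value, and the choice to score polarity as the difference between a positive-seeded and a negative-seeded run are all assumptions made for illustration.

```python
def personalized_pagerank(edges, seeds, damping=0.85, iterations=50):
    """Power iteration in which teleport mass goes only to the seed nodes."""
    nodes = sorted({n for edge in edges for n in edge})
    neighbors = {n: [] for n in nodes}
    for a, b in edges:  # undirected word-document edges
        neighbors[a].append(b)
        neighbors[b].append(a)
    teleport = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(teleport)
    for _ in range(iterations):
        new = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            share = damping * score[n] / len(neighbors[n])
            for m in neighbors[n]:
                new[m] += share
        score = new
    return score

# Hypothetical joint graph: words linked to the documents they occur in.
edges = [
    ("good", "doc1"), ("great", "doc1"), ("movie", "doc1"),
    ("movie", "doc2"), ("bad", "doc2"), ("awful", "doc2"),
]
positive = personalized_pagerank(edges, seeds={"good"})  # assumed positive seed
negative = personalized_pagerank(edges, seeds={"bad"})   # assumed negative seed

# Assumed scoring rule: positive mass minus negative mass per document.
polarity = {d: positive[d] - negative[d] for d in ("doc1", "doc2")}
print(polarity)
```

Documents whose score is far from zero would be the natural candidates for high-confidence training labels in such a setup.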

pdf
A Comparative Investigation of Morphological Language Modeling for the Languages of the European Union
Thomas Mueller | Hinrich Schuetze | Helmut Schmid
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf
Active Learning for Coreference Resolution
Florian Laws | Florian Heimerl | Hinrich Schütze
Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2011

pdf
Half-Context Language Models
Hinrich Schütze | Michael Walsh
Computational Linguistics, Volume 37, Issue 4 - December 2011

pdf
Bootstrapping coreference resolution using word associations
Hamidreza Kobdani | Hinrich Schuetze | Michael Schiehlen | Hans Kamp
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf
Piggyback: Using Search Engines for Robust Cross-Domain Named Entity Recognition
Stefan Rüd | Massimiliano Ciaramita | Jens Müller | Hinrich Schütze
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf
Integrating history-length interpolation and classes in language modeling
Hinrich Schütze
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf
Improved Modeling of Out-Of-Vocabulary Words Using Morphological Classes
Thomas Mueller | Hinrich Schuetze
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

pdf
Supervised Coreference Resolution with SUCRE
Hamidreza Kobdani | Hinrich Schuetze
Proceedings of the Fifteenth Conference on Computational Natural Language Learning: Shared Task

pdf
A Cascaded Classification Approach to Semantic Head Recognition
Lukas Michelbacher | Alok Kothari | Martin Forst | Christina Lioma | Hinrich Schütze
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

pdf
Active Learning with Amazon Mechanical Turk
Florian Laws | Christian Scheible | Hinrich Schütze
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing

2010

pdf abs
Identification of Rare & Novel Senses Using Translations in a Parallel Corpus
Richard Schwarz | Hinrich Schütze | Fabienne Martin | Achim Stein
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The identification of rare and novel senses is a challenge in lexicography. In this paper, we present a new method for finding such senses using a word-aligned multilingual parallel corpus. We use the Europarl corpus and therein concentrate on French verbs. We represent each occurrence of a French verb as a high-dimensional term vector. The dimensions of such a vector are the possible translations of the verb according to the underlying word alignment. The dimensions are weighted by a weighting scheme to adjust to the significance of any particular translation. After collecting these vectors we apply forms of the K-means algorithm on the resulting vector space to produce clusters of distinct senses, so that standard uses produce large homogeneous clusters while rare and novel uses appear in small or heterogeneous clusters. We show in a qualitative and quantitative evaluation that the method can successfully find rare and novel senses.

pdf abs
BabyExp: Constructing a Huge Multimodal Resource to Acquire Commonsense Knowledge Like Children Do
Massimo Poesio | Marco Baroni | Oswald Lanz | Alessandro Lenci | Alexandros Potamianos | Hinrich Schütze | Sabine Schulte im Walde | Luca Surian
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

There is by now widespread agreement that the most realistic way to construct the large-scale commonsense knowledge repositories required by natural language and artificial intelligence applications is by letting machines learn such knowledge from large quantities of data, as humans do. A lot of attention has consequently been paid to the development of increasingly sophisticated machine learning algorithms for knowledge extraction. However, the nature of the input that humans are exposed to while learning commonsense knowledge has received much less attention. The BabyExp project is collecting very dense audio and video recordings of the first three years of a baby's life. The corpus constructed in this way will be transcribed with automated techniques and made available to the research community. Moreover, techniques to extract commonsense conceptual knowledge incrementally from these multimodal data are also being explored within the project. The current paper describes BabyExp in general, and presents pilot studies on the feasibility of the automated audio and video transcriptions.

pdf abs
Building a Cross-lingual Relatedness Thesaurus using a Graph Similarity Measure
Lukas Michelbacher | Florian Laws | Beate Dorow | Ulrich Heid | Hinrich Schütze
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

The Internet is an ever-growing source of information stored in documents in different languages. Hence, cross-lingual resources are needed for more and more NLP applications. This paper presents (i) a graph-based method for creating one such resource and (ii) a resource created using the method, a cross-lingual relatedness thesaurus. Given a word in one language, the thesaurus suggests words in a second language that are semantically related. The method requires two monolingual corpora and a basic dictionary. Our general approach is to build two monolingual word graphs, with nodes representing words and edges representing linguistic relations between words. A bilingual dictionary containing basic vocabulary provides seed translations relating nodes from both graphs. We then use an inter-graph node-similarity algorithm to discover related words. Evaluation with three human judges revealed that 49% of the English and 57% of the German words discovered by our method are semantically related to the target words. We publish two resources in conjunction with this paper: first, noun coordinations extracted from the German and English Wikipedias; second, the cross-lingual relatedness thesaurus, which can be used in experiments involving interactive cross-lingual query expansion.

pdf abs
Fine-Grained Geographical Relation Extraction from Wikipedia
Andre Blessing | Hinrich Schütze
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

In this paper, we present work on enhancing the basic data resource of a context-aware system. Electronic text offers a wealth of information about geospatial data and can be used to improve the completeness and accuracy of geospatial resources (e.g., gazetteers). First, we introduce a supervised approach to extracting geographical relations on a fine-grained level. Second, we present a novel way of using Wikipedia as a corpus based on self-annotation. A self-annotation is an automatically created high-quality annotation that can be used for training and evaluation. Wikipedia contains two different types of context: (i) unstructured text and (ii) structured data such as templates (e.g., infoboxes about cities), lists, and tables. We use the structured data to annotate the unstructured text. Finally, the extracted fine-grained relations are used to complete gazetteer data. Precision and recall scores of more than 97 percent confirm that a statistical IE pipeline can be used to improve the data quality of community-based resources.

pdf
Self-Annotation for fine-grained geospatial relation extraction
Andre Blessing | Hinrich Schütze
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

pdf
Sentiment Translation through Multi-Edge Graphs
Christian Scheible | Florian Laws | Lukas Michelbacher | Hinrich Schütze
Coling 2010: Posters

pdf
Bitext-Based Resolution of German Subject-Object Ambiguities
Florian Schwarck | Alexander Fraser | Hinrich Schütze
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

pdf
SUCRE: A Modular System for Coreference Resolution
Hamidreza Kobdani | Hinrich Schütze
Proceedings of the 5th International Workshop on Semantic Evaluation

2009

pdf
Word Alignment by Thresholded Two-Dimensional Normalization
Hamidreza Kobdani | Alexander Fraser | Hinrich Schütze
Proceedings of Machine Translation Summit XII: Posters

pdf
Rich Bitext Projection Features for Parse Reranking
Alexander Fraser | Renjing Wang | Hinrich Schütze
Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009)

pdf
Unsupervised Classification with Dependency Based Word Spaces
Klaus Rothenhäusler | Hinrich Schütze
Proceedings of the Workshop on Geometrical Models of Natural Language Semantics

pdf bib
On Proper Unit Selection in Active Learning: Co-Selection Effects for Named Entity Recognition
Katrin Tomanek | Florian Laws | Udo Hahn | Hinrich Schütze
Proceedings of the NAACL HLT 2009 Workshop on Active Learning for Natural Language Processing

2008

pdf abs
An Inverted Index for Storing and Retrieving Grammatical Dependencies
Michaela Atterer | Hinrich Schütze
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Web count statistics gathered from search engines have been widely used as a resource in a variety of NLP tasks. For some tasks, however, the information they exploit is not fine-grained enough. We propose an inverted index over grammatical relations as a fast and reliable resource to access more general and also more detailed frequency information. To build the index, we use a dependency parser to parse a large corpus. We extract binary dependency relations, such as he-subj-say (he is the subject of say) as index terms and construct the index using publicly available open-source indexing software. The unit we index over is the sentence. The index can be used to extract grammatical relations and frequency counts for these relations. The framework also provides the possibility to search for partial dependencies (say, the frequency of he occurring in subject position), words, strings and a combination of these. One possible application is the disambiguation of syntactic structures.
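A minimal sketch of the kind of index this abstract describes, assuming dependency triples have already been produced by a parser. The toy sentences and the function names `build_index` and `frequency` are hypothetical, and a plain in-memory dictionary stands in for the open-source indexing software the paper refers to.

```python
from collections import defaultdict

# Hypothetical parser output: one list of (dependent, relation, head) triples per sentence.
sentences = [
    [("he", "subj", "say"), ("it", "obj", "say")],
    [("he", "subj", "go")],
    [("she", "subj", "say")],
]

def build_index(parsed_sentences):
    """Map each full triple and each partial (dependent, relation) pair to sentence ids."""
    index = defaultdict(set)
    for sent_id, triples in enumerate(parsed_sentences):
        for dep, rel, head in triples:
            index[(dep, rel, head)].add(sent_id)  # full relation, e.g. he-subj-say
            index[(dep, rel, None)].add(sent_id)  # partial: "he" in subject position
    return index

def frequency(index, dep, rel, head=None):
    """Number of sentences containing the (possibly partial) dependency."""
    return len(index.get((dep, rel, head), set()))

index = build_index(sentences)
print(frequency(index, "he", "subj", "say"))  # full dependency he-subj-say
print(frequency(index, "he", "subj"))         # partial: "he" as any subject
```

Because the sentence is the indexed unit, the counts here are sentence frequencies; a relation occurring twice in one sentence is counted once.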

pdf abs
A Question Answering System for German. Experiments with Morphological Linguistic Resources
Florian Koehler | Hinrich Schuetze | Michaela Atterer
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

Question Answering systems are systems that enable the user to ask questions in natural language and to also receive an answer in natural language. Most existing systems, however, are constructed for the English language, and it is not clear to what extent these approaches are also applicable to other languages. A richer morphology, greater syntactic variability, and a smaller fraction of webpages available in the language are just some issues that complicate the construction of systems for German. In this paper, we present a modular Question Answering System for German which uses several morphological resources to increase recall. Nouns are converted into verbs, verbs into nouns, and the tenses of verbs are modified. We use a web search engine as a back end to allow for open-domain Question Answering. A POS-tagger is employed to identify answer candidates which are then filtered and tiled. The system is shown to achieve a higher recall than other systems for German.

pdf
Stopping Criteria for Active Learning of Named Entity Recognition
Florian Laws | Hinrich Schütze
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

pdf
A Graph-theoretic Model of Lexical Syntactic Acquisition
Hinrich Schütze | Michael Walsh
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing