Valentin Hofmann


2024

pdf
Geographic Adaptation of Pretrained Language Models
Valentin Hofmann | Goran Glavaš | Nikola Ljubešić | Janet B. Pierrehumbert | Hinrich Schütze
Transactions of the Association for Computational Linguistics, Volume 12

While pretrained language models (PLMs) have been shown to possess a plethora of linguistic knowledge, the existing body of research has largely neglected extralinguistic knowledge, which is generally difficult to obtain by pretraining on text alone. Here, we contribute to closing this gap by examining geolinguistic knowledge, i.e., knowledge about geographic variation in language. We introduce geoadaptation, an intermediate training step that couples language modeling with geolocation prediction in a multi-task learning setup. We geoadapt four PLMs, covering language groups from three geographic areas, and evaluate them on five different tasks: fine-tuned (i.e., supervised) geolocation prediction, zero-shot (i.e., unsupervised) geolocation prediction, fine-tuned language identification, zero-shot language identification, and zero-shot prediction of dialect features. Geoadaptation is very successful at injecting geolinguistic knowledge into the PLMs: The geoadapted PLMs consistently outperform PLMs adapted using only language modeling (by especially wide margins on zero-shot prediction tasks), and we obtain new state-of-the-art results on two benchmarks for geolocation prediction and language identification. Furthermore, we show that the effectiveness of geoadaptation stems from its ability to geographically retrofit the representation space of the PLMs.

pdf
Political Compass or Spinning Arrow? Towards More Meaningful Evaluations for Values and Opinions in Large Language Models
Paul Röttger | Valentin Hofmann | Valentina Pyatkin | Musashi Hinck | Hannah Kirk | Hinrich Schuetze | Dirk Hovy
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Much recent work seeks to evaluate values and opinions in large language models (LLMs) using multiple-choice surveys and questionnaires. Most of this work is motivated by concerns around real-world LLM applications. For example, politically-biased LLMs may subtly influence society when they are used by millions of people. Such real-world concerns, however, stand in stark contrast to the artificiality of current evaluations: real users do not typically ask LLMs survey questions. Motivated by this discrepancy, we challenge the prevailing *constrained* evaluation paradigm for values and opinions in LLMs and explore more realistic *unconstrained* evaluations. As a case study, we focus on the popular Political Compass Test (PCT). In a systematic review, we find that most prior work using the PCT *forces models to comply with the PCT’s multiple-choice format. We show that models give substantively different answers when not forced; that answers change depending on how models are forced; and that answers lack paraphrase robustness. Then, we demonstrate that models give different answers yet again in a more realistic open-ended answer setting. We distill these findings into recommendations and open challenges in evaluating values and opinions in LLMs.

pdf
Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research
Luca Soldaini | Rodney Kinney | Akshita Bhagia | Dustin Schwenk | David Atkinson | Russell Authur | Ben Bogin | Khyathi Chandu | Jennifer Dumas | Yanai Elazar | Valentin Hofmann | Ananya Jha | Sachin Kumar | Li Lucy | Xinxi Lyu | Nathan Lambert | Ian Magnusson | Jacob Morrison | Niklas Muennighoff | Aakanksha Naik | Crystal Nam | Matthew Peters | Abhilasha Ravichander | Kyle Richardson | Zejiang Shen | Emma Strubell | Nishant Subramani | Oyvind Tafjord | Evan Walsh | Luke Zettlemoyer | Noah Smith | Hannaneh Hajishirzi | Iz Beltagy | Dirk Groeneveld | Jesse Dodge | Kyle Lo
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations. To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We extensively document Dolma, including its design principles, details about its construction, and a summary of its contents. We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices. Finally, we open-source our data curation toolkit to enable reproduction of our work as well as support further research in large-scale data curation.

2023

pdf
Counting the Bugs in ChatGPT’s Wugs: A Multilingual Investigation into the Morphological Capabilities of a Large Language Model
Leonie Weissweiler | Valentin Hofmann | Anjali Kantharuban | Anna Cai | Ritam Dutt | Amey Hengle | Anubha Kabra | Atharva Kulkarni | Abhishek Vijayakumar | Haofei Yu | Hinrich Schuetze | Kemal Oflazer | David Mortensen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) have recently reached an impressive level of linguistic capability, prompting comparisons with human language skills. However, there have been relatively few systematic inquiries into the linguistic capabilities of the latest generation of LLMs, and those studies that do exist (i) ignore the remarkable ability of humans to generalize, (ii) focus only on English, and (iii) investigate syntax or semantics and overlook other capabilities that lie at the heart of human language, like morphology. Here, we close these gaps by conducting the first rigorous analysis of the morphological capabilities of ChatGPT in four typologically varied languages (specifically, English, German, Tamil, and Turkish). We apply a version of Berko’s (1958) wug test to ChatGPT, using novel, uncontaminated datasets for the four examined languages. We find that ChatGPT massively underperforms purpose-built systems, particularly in English. Overall, our results—through the lens of morphology—cast a new light on the linguistic capabilities of ChatGPT, suggesting that claims of human-like language skills are premature and misleading.

2022

pdf
CaMEL: Case Marker Extraction without Labels
Leonie Weissweiler | Valentin Hofmann | Masoud Jalili Sabet | Hinrich Schuetze
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce CaMEL (Case Marker Extraction without Labels), a novel and challenging task in computational morphology that is especially relevant for low-resource languages. We propose a first model for CaMEL that uses a massively multilingual corpus to extract case markers in 83 languages based only on a noun phrase chunker and an alignment system. To evaluate CaMEL, we automatically construct a silver standard from UniMorph. The case markers extracted by our model can be used to detect and visualise similarities and differences between the case systems of different languages as well as to annotate fine-grained deep cases in languages in which they are not overtly marked.

pdf
An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers
Valentin Hofmann | Hinrich Schuetze | Janet Pierrehumbert
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We introduce FLOTA (Few Longest Token Approximation), a simple yet effective method to improve the tokenization of pretrained language models (PLMs). FLOTA uses the vocabulary of a standard tokenizer but tries to preserve the morphological structure of words during tokenization. We evaluate FLOTA on morphological gold segmentations as well as a text classification task, using BERT, GPT-2, and XLNet as example PLMs. FLOTA leads to performance gains, makes inference more efficient, and enhances the robustness of PLMs with respect to whitespace noise.

pdf
The better your Syntax, the better your Semantics? Probing Pretrained Language Models for the English Comparative Correlative
Leonie Weissweiler | Valentin Hofmann | Abdullatif Köksal | Hinrich Schütze
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Construction Grammar (CxG) is a paradigm from cognitive linguistics emphasising the connection between syntax and semantics. Rather than rules that operate on lexical items, it posits constructions as the central building blocks of language, i.e., linguistic units of different granularity that combine syntax and semantics. As a first step towards assessing the compatibility of CxG with the syntactic and semantic knowledge demonstrated by state-of-the-art pretrained language models (PLMs), we present an investigation of their capability to classify and understand one of the most commonly studied constructions, the English comparative correlative (CC). We conduct experiments examining the classification accuracy of a syntactic probe on the one hand and the models’ behaviour in a semantic application task on the other, with BERT, RoBERTa, and DeBERTa as the example PLMs. Our results show that all three investigated PLMs are able to recognise the structure of the CC but fail to use its meaning. While human-like performance of PLMs on many NLP tasks has been alleged, this indicates that PLMs still suffer from substantial shortcomings in central domains of linguistic knowledge.

pdf
Modeling Ideological Salience and Framing in Polarized Online Groups with Graph Neural Networks and Structured Sparsity
Valentin Hofmann | Xiaowen Dong | Janet Pierrehumbert | Hinrich Schuetze
Findings of the Association for Computational Linguistics: NAACL 2022

The increasing polarization of online political discourse calls for computational tools that automatically detect and monitor ideological divides in social media. We introduce a minimally supervised method that leverages the network structure of online discussion forums, specifically Reddit, to detect polarized concepts. We model polarization along the dimensions of salience and framing, drawing upon insights from moral psychology. Our architecture combines graph neural networks with structured sparsity learning and results in representations for concepts and subreddits that capture temporal ideological dynamics such as right-wing and left-wing radicalization.

2021

pdf
Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words
Valentin Hofmann | Janet Pierrehumbert | Hinrich Schütze
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

How does the input segmentation of pretrained language models (PLMs) affect their interpretations of complex words? We present the first study investigating this question, taking BERT as the example PLM and focusing on its semantic representations of English derivatives. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which DelBERT (Derivation leveraging BERT), a model with derivational input segmentation, substantially outperforms BERT with WordPiece segmentation. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically-informed vocabulary of input tokens were used.

pdf
Dynamic Contextualized Word Embeddings
Valentin Hofmann | Janet Pierrehumbert | Hinrich Schütze
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Static word embeddings that represent words by a single vector cannot capture the variability of word meaning in different linguistic and extralinguistic contexts. Building on prior work on contextualized and dynamic word embeddings, we introduce dynamic contextualized word embeddings that represent words as a function of both linguistic and extralinguistic context. Based on a pretrained language model (PLM), dynamic contextualized word embeddings model time and social space jointly, which makes them attractive for a range of NLP tasks involving semantic variability. We highlight potential application scenarios by means of qualitative and quantitative analyses on four English datasets.

2020

pdf
A Graph Auto-encoder Model of Derivational Morphology
Valentin Hofmann | Hinrich Schütze | Janet Pierrehumbert
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

There has been little work on modeling the morphological well-formedness (MWF) of derivatives, a problem judged to be complex and difficult in linguistics. We present a graph auto-encoder that learns embeddings capturing information about the compatibility of affixes and stems in derivation. The auto-encoder models MWF in English surprisingly well by combining syntactic and semantic information with associative information from the mental lexicon.

pdf
Predicting the Growth of Morphological Families from Social and Linguistic Factors
Valentin Hofmann | Janet Pierrehumbert | Hinrich Schütze
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We present the first study that examines the evolution of morphological families, i.e., sets of morphologically related words such as “trump”, “antitrumpism”, and “detrumpify”, in social media. We introduce the novel task of Morphological Family Expansion Prediction (MFEP) as predicting the increase in the size of a morphological family. We create a ten-year Reddit corpus as a benchmark for MFEP and evaluate a number of baselines on this benchmark. Our experiments demonstrate very good performance on MFEP.

pdf
DagoBERT: Generating Derivational Morphology with a Pretrained Language Model
Valentin Hofmann | Janet Pierrehumbert | Hinrich Schütze
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Can pretrained language models (PLMs) generate derivationally complex words? We present the first study investigating this question, taking BERT as the example PLM. We examine BERT’s derivational capabilities in different settings, ranging from using the unmodified pretrained model to full finetuning. Our best model, DagoBERT (Derivationally and generatively optimized BERT), clearly outperforms the previous state of the art in derivation generation (DG). Furthermore, our experiments show that the input segmentation crucially impacts BERT’s derivational knowledge, suggesting that the performance of PLMs could be further improved if a morphologically informed vocabulary of units were used.