This is an internal, incomplete preview of a proposed change to the ACL Anthology.
For efficiency reasons, we generate only three BibTeX files per volume, and the preview may be incomplete in other ways, or contain mistakes.
Do not treat this content as an official publication.
The world’s languages exhibit certain so-called typological or implicational universals; for example, Subject-Object-Verb (SOV) languages typically use postpositions. Explaining the source of such biases is a key goal of linguistics.We study word-order universals through a computational simulation with language models (LMs).Our experiments show that typologically-typical word orders tend to have lower perplexity estimated by LMs with cognitively plausible biases: syntactic biases, specific parsing strategies, and memory limitations. This suggests that the interplay of cognitive biases and predictability (perplexity) can explain many aspects of word-order universals.It also showcases the advantage of cognitively-motivated LMs, typically employed in cognitive modeling, in the simulation of language universals.
Natural language exhibits various universal properties.But why do these universals exist?One explanation is that they arise from functional pressures to achieve efficient communication, a view which attributes cross-linguistic properties to domain-general cognitive abilities.This hypothesis has successfully addressed some syntactic universal properties such as compositionality and Greenbergian word order universals.However, more abstract syntactic universals have not been explored from the perspective of efficient communication.Among such universals, the most notable one is structure dependence, that is, grammar-internal operations crucially depend on hierarchical representations.This property has traditionally been taken to be central to natural language and to involve domain-specific knowledge irreducible to communicative efficiency. In this paper, we challenge the conventional view by investigating whether structure dependence realizes efficient communication, focusing on coordinate structures.We design three types of artificial languages: (i) one with a structure-dependent reduction operation, which is similar to natural language, (ii) one without any reduction operations, and (iii) one with a linear (rather than structure-dependent) reduction operation.We quantify the communicative efficiency of these languages.The results demonstrate that the language with the structure-dependent reduction operation is significantly more communicatively efficient than the counterfactual languages.This suggests that the existence of structure-dependent properties can be explained from the perspective of efficient communication.
What kinds of and how much data is necessary for language models to induce grammatical knowledge to judge sentence acceptability? Recent language models still have much room for improvement in their data efficiency compared to humans. This paper investigates whether language models efficiently use indirect data (indirect evidence), from which they infer sentence acceptability. In contrast, humans use indirect evidence efficiently, which is considered one of the inductive biases contributing to efficient language acquisition. To explore this question, we introduce the Wug InDirect Evidence Test (WIDET), a dataset consisting of training instances inserted into the pre-training data and evaluation instances. We inject synthetic instances with newly coined wug words into pretraining data and explore the model’s behavior on evaluation data that assesses grammatical acceptability regarding those words. We prepare the injected instances by varying their levels of indirectness and quantity. Our experiments surprisingly show that language models do not induce grammatical knowledge even after repeated exposure to instances with the same structure but differing only in lexical items from evaluation instances in certain language phenomena. Our findings suggest a potential direction for future research: developing models that use latent indirect evidence to induce grammatical knowledge.
Instruction tuning aligns the response of large language models (LLMs) with human preferences.Despite such efforts in human–LLM alignment, we find that instruction tuning does not always make LLMs human-like from a cognitive modeling perspective. More specifically, next-word probabilities estimated by instruction-tuned LLMs are often worse at simulating human reading behavior than those estimated by base LLMs.In addition, we explore prompting methodologies for simulating human reading behavior with LLMs. Our results show that prompts reflecting a particular linguistic hypothesis improve psychometric predictive power, but are still inferior to small base models.These findings highlight that recent advancements in LLMs, i.e., instruction tuning and prompting, do not offer better estimates than direct probability measurements from base LLMs in cognitive modeling. In other words, pure next-word probability remains a strong predictor for human reading behavior, even in the age of LLMs.
Syntactic Language Models (SLMs) can be trained efficiently to reach relatively high performance; however, they have trouble with inference efficiency due to the explicit generation of syntactic structures. In this paper, we propose a new method dubbed tree-planting: instead of explicitly generating syntactic structures, we “plant” trees into attention weights of unidirectional Transformer LMs to implicitly reflect syntactic structures of natural language. Specifically, unidirectional Transformer LMs trained with tree-planting will be called Tree-Planted Transformers (TPT), which inherit the training efficiency from SLMs without changing the inference efficiency of their underlying Transformer LMs. Targeted syntactic evaluations on the SyntaxGym benchmark demonstrated that TPTs, despite the lack of explicit generation of syntactic structures, significantly outperformed not only vanilla Transformer LMs but also various SLMs that generate hundreds of syntactic structures in parallel. This result suggests that TPTs can learn human-like syntactic knowledge as data-efficiently as SLMs while maintaining the modeling space of Transformer LMs unchanged.
The imitation of the children’s language acquisition process has been explored to make language models (LMs) more efficient.In particular, errors caused by children’s regularization (so-called overregularization, e.g., using wroted for the past tense of write) have been widely studied to reveal the mechanisms of language acquisition. Existing research has analyzed regularization in language acquisition only by modeling word inflection directly, which is unnatural in light of human language acquisition. In this paper, we hypothesize that language models that imitate the errors children make during language acquisition have a learning process more similar to humans. To verify this hypothesis, we analyzed the learning curve and error preferences of verb inflections in small-scale LMs using acceptability judgments. We analyze the differences in results by model architecture, data, and tokenization. Our model shows child-like U-shaped learning curves clearly for certain verbs, but the preferences for types of overgeneralization did not fully match the observations in children.
In Reinforcement Learning from Human Feedback (RLHF), explicit human feedback, such as rankings, is employed to align Natural Language Processing (NLP) models with human preferences. In contrast, the potential of implicit human feedback, encompassing cognitive processing signals like eye-tracking and brain activity, remains underexplored. These signals capture unconscious human responses but are often marred by noise and redundancy, complicating their application to specific tasks. To address this issue, we introduce the Cognitive Information Bottleneck (CIB), a method that extracts only the task-relevant information from cognitive processing signals. Grounded in the principles of the information bottleneck, CIB aims to learn representations that maximize the mutual information between the representations and targets while minimizing the mutual information between inputs and representations. By employing CIB to filter out redundant information from cognitive processing signals, our goal is to provide representations that are both minimal and sufficient. This approach enables more efficient fitting of models to inputs. Our results show that the proposed method outperforms existing methods in efficiently compressing various cognitive processing signals and significantly enhances performance on downstream tasks. Evaluated on public datasets, our model surpasses contemporary state-of-the-art models. Furthermore, by analyzing these compressed representations, we offer insights into how cognitive processing signals can be leveraged to improve performance.
Neural language models have exhibited outstanding performance in a range of downstream tasks. However, there is limited understanding regarding the extent to which these models internalize syntactic knowledge, so that various datasets have recently been constructed to facilitate syntactic evaluation of language models across languages. In this paper, we introduce JCoLA (Japanese Corpus of Linguistic Acceptability), which consists of 10,020 sentences annotated with binary acceptability judgments. Specifically, those sentences are manually extracted from linguistics textbooks, handbooks and journal articles, and split into in-domain data (86 %; relatively simple acceptability judgments extracted from textbooks and handbooks) and out-of-domain data (14 %; theoretically significant acceptability judgments extracted from journal articles), the latter of which is categorized by 12 linguistic phenomena. We then evaluate the syntactic knowledge of 9 different types of Japanese and multilingual language models on JCoLA. The results demonstrated that several models could surpass human performance for the in-domain data, while no models were able to exceed human performance for the out-of-domain data. Error analyses by linguistic phenomena further revealed that although neural language models are adept at handling local syntactic dependencies like argument structure, their performance wanes when confronted with long-distance syntactic dependencies like verbal agreement and NPI licensing.
For nearly the past forty years, there has been discussion regarding whether symbolic representations are involved in morphological inflection, a debate commonly known as the Past Tense Debate. The previous literature has extensively explored whether neural models, which do not use symbolic representations can process morphological inflection like humans. However, current research interest has shifted towards whether neural models can acquire morphological inflection like humans. In this paper, we trained neural models, the recurrent neural network (RNN) with attention and the transformer, and a symbolic model, the Minimal Generalization Learner (MGL), under a human-like learning environment. Evaluating the models from the perspective of language acquisition, we found that while the transformer and the MGL exhibited some human-like characteristics, the RNN with attention did not demonstrate human-like behavior across all the evaluation metrics considered in this study. Furthermore, none of the models accurately inflected verbs in the same manner as humans in terms of morphological inflection direction. These results suggest that these models fall short as cognitive models of morphological inflection.
In this paper, we propose a novel evaluation paradigm for Targeted Syntactic Evaluations, where we assess how well language models can recognize linguistic phenomena situated at different levels of the Chomsky hierarchy. Specifically, we create formal languages that abstract four syntactic phenomena in natural languages, each identified at a different level of the Chomsky hierarchy, and use these to evaluate the capabilities of language models: (1) (Adj)ˆn NP type, (2) NPˆn VPˆn type, (3) Nested Dependency type, and (4) Cross Serial Dependency type. We first train three different language models (LSTM, Transformer LM, and Stack-RNN) on language modeling tasks and then evaluate them using pairs of a positive and a negative sentence by investigating whether they can assign a higher probability to the positive sentence than the negative one. Our result demonstrated that all language models have the ability to capture the structural patterns of the (Adj)ˆn NP type formal language. However, LSTM and Transformer LM failed to capture NPˆn VPˆn type language and no architectures can recognize nested dependency and Cross Serial dependency correctly. Neural language models, especially Transformer LMs, have exhibited high performance across a multitude of downstream tasks, leading to the perception that they possess an understanding of natural languages. However, our findings suggest that these models may not necessarily comprehend the syntactic structures that underlie natural language phenomena such as dependency. Rather, it appears that they may extend grammatical rules equivalent to regular grammars to approximate the rules governing dependencies.
In this paper, we introduce JBLiMP (Japanese Benchmark of Linguistic Minimal Pairs), a novel dataset for targeted syntactic evaluations of language models in Japanese. JBLiMP consists of 331 minimal pairs, which are created based on acceptability judgments extracted from journal articles in theoretical linguistics. These minimal pairs are grouped into 11 categories, each covering a different linguistic phenomenon. JBLiMP is unique in that it successfully combines two important features independently observed in existing datasets: (i) coverage of complex linguistic phenomena (cf. CoLA) and (ii) presentation of sentences as minimal pairs (cf. BLiMP). In addition, JBLiMP is the first dataset for targeted syntactic evaluations of language models in Japanese, thus allowing the comparison of syntactic knowledge of language models across different languages. We then evaluate the syntactic knowledge of several language models on JBLiMP: GPT-2, LSTM, and n-gram language models. The results demonstrated that all the architectures achieved comparable overall accuracies around 75%. Error analyses by linguistic phenomenon further revealed that these language models successfully captured local dependencies like nominal structures, but not long-distance dependencies such as verbal agreement and binding.
In this paper, we explore how much syntactic supervision is “good enough” to make language models (LMs) more human-like. Specifically, we propose the new method called syntactic ablation, where syntactic LMs, namely Recurrent Neural Network Grammars (RNNGs), are gradually ablated from full syntactic supervision to zero syntactic supervision (≈ unidirectional LSTM) by preserving NP, VP, PP, SBAR nonterminal symbols and the combinations thereof. The 17 ablated grammars are then evaluated via targeted syntactic evaluation on the SyntaxGym benchmark. The results of our syntactic ablation demonstrated that (i) the RNNG with zero syntactic supervision underperformed the RNNGs with some syntactic supervision, (ii) the RNNG with full syntactic supervision underperformed the RNNGs with less syntactic supervision, and (iii) the RNNG with mild syntactic supervision achieved the best performance comparable to the state-of-the-art GPT-2-XL. Those results may suggest that the “good enough” approach to language processing seems to make LMs more human-like.
We present the second shared task on eye-tracking data prediction of the Cognitive Modeling and Computational Linguistics Workshop (CMCL). Differently from the previous edition, participating teams are asked to predict eye-tracking features from multiple languages, including a surprise language for which there were no available training data. Moreover, the task also included the prediction of standard deviations of feature values in order to account for individual differences between readers.A total of six teams registered to the task. For the first subtask on multilingual prediction, the winning team proposed a regression model based on lexical features, while for the second subtask on cross-lingual prediction, the winning team used a hybrid model based on a multilingual transformer embeddings as well as statistical features.
Language models (LMs) have been used in cognitive modeling as well as engineering studies—they compute information-theoretic complexity metrics that simulate humans’ cognitive load during reading.This study highlights a limitation of modern neural LMs as the model of choice for this purpose: there is a discrepancy between their context access capacities and that of humans.Our results showed that constraining the LMs’ context access improved their simulation of human reading behavior.We also showed that LM-human gaps in context access were associated with specific syntactic constructions; incorporating syntactic biases into LMs’ context access might enhance their cognitive plausibility.
In this paper, we propose a novel architecture called Composition Attention Grammars (CAGs) that recursively compose subtrees into a single vector representation with a composition function, and selectively attend to previous structural information with a self-attention mechanism. We investigate whether these components—the composition function and the self-attention mechanism—can both induce human-like syntactic generalization. Specifically, we train language models (LMs) with and without these two components with the model sizes carefully controlled, and evaluate their syntactic generalization performance against six test circuits on the SyntaxGym benchmark. The results demonstrated that the composition function and the self-attention mechanism both play an important role to make LMs more human-like, and closer inspection of linguistic phenomenon implied that the composition function allowed syntactic features, but not semantic features, to percolate into subtree representations.
In computational psycholinguistics, various language models have been evaluated against human reading behavior (e.g., eye movement) to build human-like computational models. However, most previous efforts have focused almost exclusively on English, despite the recent trend towards linguistic universal within the general community. In order to fill the gap, this paper investigates whether the established results in computational psycholinguistics can be generalized across languages. Specifically, we re-examine an established generalization —the lower perplexity a language model has, the more human-like the language model is— in Japanese with typologically different structures from English. Our experiments demonstrate that this established generalization exhibits a surprising lack of universality; namely, lower perplexity is not always human-like. Moreover, this discrepancy between English and Japanese is further explored from the perspective of (non-)uniform information density. Overall, our results suggest that a cross-lingual evaluation will be necessary to construct human-like computational models.
Eye-tracking data from reading represent an important resource for both linguistics and natural language processing. The ability to accurately model gaze features is crucial to advance our understanding of language processing. This paper describes the Shared Task on Eye-Tracking Data Prediction, jointly organized with the eleventh edition of the Work- shop on Cognitive Modeling and Computational Linguistics (CMCL 2021). The goal of the task is to predict 5 different token- level eye-tracking metrics of the Zurich Cognitive Language Processing Corpus (ZuCo). Eye-tracking data were recorded during natural reading of English sentences. In total, we received submissions from 13 registered teams, whose systems include boosting algorithms with handcrafted features, neural models leveraging transformer language models, or hybrid approaches. The winning system used a range of linguistic and psychometric features in a gradient boosting framework.
In computational linguistics, it has been shown that hierarchical structures make language models (LMs) more human-like. However, the previous literature has been agnostic about a parsing strategy of the hierarchical models. In this paper, we investigated whether hierarchical structures make LMs more human-like, and if so, which parsing strategy is most cognitively plausible. In order to address this question, we evaluated three LMs against human reading times in Japanese with head-final left-branching structures: Long Short-Term Memory (LSTM) as a sequential model and Recurrent Neural Network Grammars (RNNGs) with top-down and left-corner parsing strategies as hierarchical models. Our computational modeling demonstrated that left-corner RNNGs outperformed top-down RNNGs and LSTM, suggesting that hierarchical and left-corner architectures are more cognitively plausible than top-down or sequential architectures. In addition, the relationships between the cognitive plausibility and (i) perplexity, (ii) parsing, and (iii) beam size will also be discussed.
The past decade has witnessed the happy marriage between natural language processing (NLP) and the cognitive science of language. Moreover, given the historical relationship between biological and artificial neural networks, the advent of deep learning has re-sparked strong interests in the fusion of NLP and the neuroscience of language. Importantly, this inter-fertilization between NLP, on one hand, and the cognitive (neuro)science of language, on the other, has been driven by the language resources annotated with human language processing data. However, there remain several limitations with those language resources on annotations, genres, languages, etc. In this paper, we describe the design of a novel language resource called BCCWJ-EEG, the Balanced Corpus of Contemporary Written Japanese (BCCWJ) experimentally annotated with human electroencephalography (EEG). Specifically, after extensively reviewing the language resources currently available in the literature with special focus on eye-tracking and EEG, we summarize the details concerning (i) participants, (ii) stimuli, (iii) procedure, (iv) data preprocessing, (v) corpus evaluation, (vi) resource release, and (vii) compilation schedule. In addition, potential applications of BCCWJ-EEG to neuroscience and NLP will also be discussed.
Sentences are represented as hierarchical syntactic structures, which have been successfully modeled in sentence processing. In contrast, despite the theoretical agreement on hierarchical syntactic structures within words, words have been argued to be computationally less complex than sentences and implemented by finite-state models as linear strings of morphemes, and even the psychological reality of morphemes has been denied. In this paper, extending the computational models employed in sentence processing to morphological processing, we performed a computational simulation experiment where, given incremental surprisal as a linking hypothesis, five computational models with different representational assumptions were evaluated against human reaction times in visual lexical decision experiments available from the English Lexicon Project (ELP), a “shared task” in the morphological processing literature. The simulation experiment demonstrated that (i) “amorphous” models without morpheme units underperformed relative to “morphous” models, (ii) a computational model with hierarchical syntactic structures, Probabilistic Context-Free Grammar (PCFG), most accurately explained human reaction times, and (iii) this performance was achieved on top of surface frequency effects. These results strongly suggest that morphological processing tracks morphemes incrementally from left to right and parses them into hierarchical syntactic structures, contrary to “amorphous” and finite-state models of morphological processing.
Previous “wug” tests (Berko, 1958) on Japanese verbal inflection have demonstrated that Japanese speakers, both adults and children, cannot inflect novel present tense forms to “correct” past tense forms predicted by rules of existent verbs (de Chene, 1982; Vance, 1987, 1991; Klafehn, 2003, 2013), indicating that Japanese verbs are merely stored in the mental lexicon. However, the implicit assumption that present tense forms are bases for verbal inflection should not be blindly extended to morphologically rich languages like Japanese in which both present and past tense forms are morphologically complex without inherent direction (Albright, 2002). Interestingly, there are also independent observations in the acquisition literature to suggest that past tense forms may be bases for verbal inflection in Japanese (Klafehn, 2003; Murasugi et al., 2010; Hirose, 2017; Tatsumi et al., 2018). In this paper, we computationally simulate two directions of verbal inflection in Japanese, Present → Past and Past → Present, with the rule-based computational model called Minimal Generalization Learner (MGL; Albright and Hayes, 2003) and experimentally evaluate the model with the bidirectional “wug” test where humans inflect novel verbs in two opposite directions. We conclude that Japanese verbs can be computed online via some generalizations and those generalizations do depend on the direction of morphological inflection.