The prediction head is a crucial component of Transformer language models. Despite its direct impact on prediction, this component has often been overlooked in analyses of Transformers. In this study, we investigate the inner workings of the prediction head, specifically focusing on its bias parameters. Our experiments with BERT and GPT-2 models reveal that the biases in their word prediction heads play a significant role in the models’ ability to reflect word frequency in a corpus, aligning with the logit adjustment method commonly used in long-tailed learning. We also quantify the effect of controlling the biases in practical auto-regressive text generation scenarios; under a particular setting, more diverse text can be generated without compromising text quality.
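To make the logit-adjustment connection concrete, here is a minimal, self-contained sketch (not the paper’s code). The toy vocabulary, its frequencies, the hidden scores, and the scaling knob `tau` are all illustrative assumptions; setting `tau` to 0 removes the frequency-like bias, which is the kind of control that can trade frequency skew for diversity.

```python
# Minimal sketch (not the paper's code) of the logit-adjustment view of a
# prediction-head bias. Vocabulary, frequencies, and hidden scores are
# toy stand-ins for a trained model's quantities.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "zymurgy"]        # toy vocabulary
freq = np.array([0.50, 0.30, 0.19, 0.01])       # corpus unigram frequencies

# Logit adjustment adds tau * log(prior) to the logits; the abstract's
# finding is that learned head biases play a similar frequency-prior role.
bias = np.log(freq)
hidden_scores = rng.normal(size=len(vocab))     # stand-in for h @ W^T

def next_token_probs(tau: float) -> np.ndarray:
    """Softmax over logits with the bias scaled by tau
    (tau=1: as trained; tau=0: frequency prior removed)."""
    logits = hidden_scores + tau * bias
    e = np.exp(logits - logits.max())
    return e / e.sum()

print("tau=1:", np.round(next_token_probs(1.0), 3))   # frequency-skewed
print("tau=0:", np.round(next_token_probs(0.0), 3))   # flatter, more diverse
```

In an actual model, the analogous knob would be the learned bias vector of the word prediction head itself rather than a hand-built log-frequency term.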
The performance of sentence encoders can be significantly improved through the simple practice of fine-tuning with a contrastive loss. A natural question arises: what characteristics do models acquire during contrastive learning? This paper theoretically and experimentally shows that contrastively trained sentence encoders implicitly weight words according to information-theoretic quantities; that is, more informative words receive greater weight, while others receive less. The theory states that, in the lower bound of the optimal value of the contrastive learning objective, the norm of a word embedding reflects the information gain associated with the distribution of surrounding words. We also conduct comprehensive experiments using various models, multiple datasets, two methods of measuring the models’ implicit weighting (Integrated Gradients and SHAP), and two information-theoretic quantities (information gain and self-information). The results provide empirical evidence that contrastive fine-tuning emphasizes informative words.
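As a sketch of the measurement side of this claim (illustrative only, not the paper’s code), the snippet below correlates per-word embedding norms with self-information, -log p(w), estimated from unigram counts. The embedding table here is synthetic and deliberately constructed to mimic the reported trend; a real test would instead read the norms from a trained contrastive encoder.

```python
# Illustrative sketch: does the norm of a word embedding track
# self-information, -log p(w)? The embeddings below are synthetic and
# built to mimic the reported trend; real norms would come from a model.
import numpy as np
from collections import Counter
from scipy.stats import spearmanr

corpus = ("the cat sat on the mat . the dog saw the cat . "
          "a quokka startled the zoologist .").split()
counts = Counter(corpus)
total = sum(counts.values())
words = sorted(counts)

# Self-information of each word from its unigram probability.
self_info = np.array([-np.log(counts[w] / total) for w in words])

# Synthetic embedding table whose norms are set to 1 + self-information.
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(words), 8))
emb *= ((1.0 + self_info)[:, None]
        / np.linalg.norm(emb, axis=1, keepdims=True))

norms = np.linalg.norm(emb, axis=1)
rho, p = spearmanr(norms, self_info)
print(f"Spearman rho(norm, self-information) = {rho:.2f} (p = {p:.3g})")
```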
Large language models (LLMs) benefit from step-by-step reasoning instructions, e.g., chain-of-thought (CoT) prompting. Building on this, their ability to perform CoT-style reasoning robustly is of interest from a probing perspective. In this study, we inspect the step-by-step reasoning ability of LLMs with a focus on negation, a core linguistic phenomenon that is difficult to process. In particular, we introduce several controlled settings (e.g., reasoning about fictional entities) to evaluate the logical reasoning abilities of the models. We observed that dozens of modern LLMs were not robust to lexical negation (e.g., plausible→implausible) when performing CoT-style reasoning, and the results highlight unique limitations in each LLM family.
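A hedged sketch of how such a controlled probe could be assembled (illustrative, not the paper’s benchmark): paired prompts differ only in the lexical negation plausible→implausible and use fictional entities so the judgment cannot lean on memorized facts. `query_llm`, the entities, and the statements are all hypothetical stand-ins.

```python
# Illustrative probe for robustness to lexical negation (not the paper's
# benchmark). `query_llm` is a hypothetical stand-in for any LLM API.
from typing import Callable

def make_prompts(entity: str, statement: str) -> dict:
    base = (f"{entity} {statement}. Let's think step by step. "
            "Is this statement {label}? Answer yes or no.")
    return {
        "affirmative": base.format(label="plausible"),
        "negated": base.format(label="implausible"),  # lexical negation
    }

def is_consistent(query_llm: Callable[[str], str],
                  entity: str, statement: str) -> bool:
    """A robust reasoner should flip its yes/no answer when 'plausible'
    is lexically negated to 'implausible' for the same statement."""
    prompts = make_prompts(entity, statement)
    return (query_llm(prompts["affirmative"]).strip().lower()
            != query_llm(prompts["negated"]).strip().lower())

# Fictional entities keep the model from relying on memorized facts.
pairs = [("A glorp", "breathes underwater"), ("A wug", "eats sunlight")]

# A degenerate model that ignores negation exhibits the probed failure:
always_yes = lambda prompt: "yes"
print([is_consistent(always_yes, e, s) for e, s in pairs])  # [False, False]
```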
The Transformer architecture has become ubiquitous in the natural language processing field. To interpret Transformer-based models, their attention patterns have been extensively analyzed. However, the Transformer architecture is not composed solely of multi-head attention; other components can also contribute to Transformers’ strong performance. In this study, we extended the scope of Transformer analysis from solely the attention patterns to the whole attention block, i.e., multi-head attention, the residual connection, and layer normalization. Our analysis of Transformer-based masked language models shows that the token-to-token interaction performed via attention has less impact on the intermediate representations than previously assumed. These results provide new intuitive explanations of existing reports; for example, discarding the learned attention patterns tends not to adversely affect performance. The code for our experiments is publicly available.
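The block-level decomposition can be sketched with toy, single-head weights (all matrices below are random stand-ins, and layer normalization is omitted for brevity): the output for token i splits into a mixing term gathered from the other tokens and a preserving term carried by the token’s own transformed vector plus the residual connection.

```python
# Toy sketch of the mixing-vs-preserving decomposition inside one attention
# block (single head, random weights, layer normalization omitted).
import torch

torch.manual_seed(0)
L, d = 6, 16                                    # sequence length, hidden size
x = torch.randn(L, d)
Wq, Wk, Wv, Wo = (torch.randn(d, d) / d**0.5 for _ in range(4))

alpha = torch.softmax((x @ Wq) @ (x @ Wk).T / d**0.5, dim=-1)  # alpha_{ij}
f = x @ Wv @ Wo                                 # transformed inputs f(x_j)

# Mixing: contributions gathered from the *other* tokens (j != i).
mixing = (alpha - torch.diag(torch.diag(alpha))) @ f
# Preserving: the token's own contribution plus the residual connection.
preserving = torch.diag(alpha)[:, None] * f + x

ratio = mixing.norm(dim=-1) / preserving.norm(dim=-1)
print("per-token ||mixing|| / ||preserving||:", ratio)
# Small ratios illustrate how the residual stream can dominate the
# token-to-token interaction that attention maps alone suggest.
```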
Attention is a key component of Transformers, which have recently achieved considerable success in natural language processing. Hence, attention is being extensively studied to investigate various linguistic capabilities of Transformers, focusing on the parallels between attention weights and specific linguistic phenomena. This paper shows that attention weights are only one of the two factors that determine the output of attention and proposes a norm-based analysis that incorporates the second factor, the norm of the transformed input vectors. The findings of our norm-based analyses of BERT and a Transformer-based neural machine translation system include the following: (i) contrary to previous studies, BERT pays little attention to special tokens, and (ii) reasonable word alignments can be extracted from the attention mechanisms of the Transformer. These findings provide insights into the inner workings of Transformers.
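The proposed score is compact enough to state in code. The sketch below uses toy, single-head, random weights (not a trained model) to contrast the raw attention weight alpha_ij with the norm-based score ||alpha_ij f(x_j)||, where f(x_j) is the value-then-output transformation of the input vector.

```python
# Toy sketch of norm-based attention analysis: score attention by
# ||alpha_ij f(x_j)|| rather than by alpha_ij alone (random weights).
import torch

torch.manual_seed(0)
L, d = 5, 16
x = torch.randn(L, d)
Wq, Wk, Wv, Wo = (torch.randn(d, d) / d**0.5 for _ in range(4))

alpha = torch.softmax((x @ Wq) @ (x @ Wk).T / d**0.5, dim=-1)  # (L, L)
f = x @ Wv @ Wo                          # f(x_j): value + output transform

# Since alpha >= 0, ||alpha_ij f(x_j)|| = alpha_ij * ||f(x_j)||.
norm_score = alpha * f.norm(dim=-1)[None, :]

# A token can receive a large attention weight yet contribute little when
# ||f(x_j)|| is small, which is the mechanism behind the special-token
# finding above.
j = int(f.norm(dim=-1).argmin())
print(f"token {j}: mean alpha = {alpha[:, j].mean():.3f}, "
      f"mean norm-based score = {norm_score[:, j].mean():.3f}")
```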