2025
Derivational Probing: Unveiling the Layer-wise Derivation of Syntactic Structures in Neural Language Models
Taiga Someya | Ryo Yoshida | Hitomi Yanaka | Yohei Oseki
Proceedings of the 29th Conference on Computational Natural Language Learning
Recent work has demonstrated that neural language models encode syntactic structures in their internal *representations*, yet the *derivations* by which these structures are constructed across layers remain poorly understood. In this paper, we propose *Derivational Probing* to investigate how micro-syntactic structures (e.g., subject noun phrases) and macro-syntactic structures (e.g., the relationship between the root verbs and their direct dependents) are constructed as word embeddings propagate upward across layers. Our experiments on BERT reveal a clear bottom-up derivation: micro-syntactic structures emerge in lower layers and are gradually integrated into a coherent macro-syntactic structure in higher layers. Furthermore, a targeted evaluation on subject-verb number agreement shows that the timing of constructing macro-syntactic structures is critical for downstream performance, suggesting an optimal timing for integrating global syntactic information.
Bias Mitigation or Cultural Commonsense? Evaluating LLMs with a Japanese Dataset
Taisei Yamamoto | Ryoma Kumon | Danushka Bollegala | Hitomi Yanaka
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) exhibit social biases, prompting the development of various debiasing methods. However, debiasing methods may degrade the capabilities of LLMs. Previous research has evaluated the impact of bias mitigation primarily through tasks measuring general language understanding, which are often unrelated to social biases. In contrast, cultural commonsense is closely related to social biases, as both are rooted in social norms and values. The impact of bias mitigation on cultural commonsense in LLMs has not been well investigated. Considering this gap, we propose SOBACO (SOcial BiAs and Cultural cOmmonsense benchmark), a Japanese benchmark designed to evaluate social biases and cultural commonsense in LLMs in a unified format. We evaluate several LLMs on SOBACO to examine how debiasing methods affect cultural commonsense in LLMs. Our results reveal that the debiasing methods degrade the performance of the LLMs on the cultural commonsense task (up to 75% accuracy deterioration). These results highlight the importance of developing debiasing methods that consider the trade-off with cultural commonsense to improve fairness and utility of LLMs.
JBBQ: Japanese Bias Benchmark for Analyzing Social Biases in Large Language Models
Hitomi Yanaka | Namgi Han | Ryoma Kumon | Lu Jie | Masashi Takeshita | Ryo Sekizawa | Taisei Katô | Hiromi Arai
Proceedings of the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
With the development of large language models (LLMs), social biases in these LLMs have become a pressing issue. Although there are various benchmarks for social biases across languages, the extent to which Japanese LLMs exhibit social biases has not been fully investigated. In this study, we construct the Japanese Bias Benchmark dataset for Question Answering (JBBQ) based on the English bias benchmark BBQ, with analysis of social biases in Japanese LLMs. The results show that while current open Japanese LLMs with more parameters show improved accuracies on JBBQ, their bias scores increase. In addition, prompts with a warning about social biases and chain-of-thought prompting reduce the effect of biases in model outputs, but there is room for improvement in extracting the correct evidence from contexts in Japanese. Our dataset is available at https://github.com/ynklab/JBBQ_data.
Intersectional Bias in Japanese Large Language Models from a Contextualized Perspective
Hitomi Yanaka | Xinqi He | Lu Jie | Namgi Han | Sunjin Oh | Ryoma Kumon | Yuma Matsuoka | Kazuhiko Watabe | Yuko Itatsu
Proceedings of the 6th Workshop on Gender Bias in Natural Language Processing (GeBNLP)
A growing number of studies have examined the social bias of rapidly developed large language models (LLMs). Although most of these studies have focused on bias occurring in a single social attribute, research in social science has shown that social bias often occurs in the form of intersectionality: the constitutive and contextualized perspective on bias aroused by social attributes. In this study, we construct the Japanese benchmark inter-JBBQ, designed to evaluate intersectional bias in LLMs in the question-answering setting. Using inter-JBBQ to analyze GPT-4o and Swallow, we find that biased output varies according to its context even with the same combination of social attributes.
LLMs Struggle with NLI for Perfect Aspect: A Cross-Linguistic Study in Chinese and Japanese
Lu Jie | Du Jin | Hitomi Yanaka
Proceedings of the 16th International Conference on Computational Semantics
Unlike English, which uses distinct forms (e.g., had, has, will have) to mark the perfect aspect across tenses, Chinese and Japanese lack separate grammatical forms for tense within the perfect aspect, which complicates Natural Language Inference (NLI). Focusing on the perfect aspect in these languages, we construct a linguistically motivated, template-based NLI dataset (1,350 pairs per language). Experiments reveal that even advanced LLMs struggle with temporal inference, particularly in detecting subtle tense and reference-time shifts. These findings highlight model limitations and underscore the need for cross-linguistic evaluation in temporal semantics. Our dataset is available at https://github.com/Lujie2001/CrossNLI.
Can Large Language Models Robustly Perform Natural Language Inference for Japanese Comparatives?
Yosuke Mikami | Daiki Matsuoka | Hitomi Yanaka
Proceedings of the 16th International Conference on Computational Semantics
Large Language Models (LLMs) perform remarkably well in Natural Language Inference (NLI). However, NLI involving numerical and logical expressions remains challenging. Comparatives are a key linguistic phenomenon related to such inference, but the robustness of LLMs in handling them, especially in languages that are not dominant in the models’ training data, such as Japanese, has not been sufficiently explored. To address this gap, we construct a Japanese NLI dataset that focuses on comparatives and evaluate various LLMs in zero-shot and few-shot settings. Our results show that the performance of the models is sensitive to the prompt formats in the zero-shot setting and influenced by the gold labels in the few-shot examples. The LLMs also struggle to handle linguistic phenomena unique to Japanese. Furthermore, we observe that prompts containing logical semantic representations help the models predict the correct labels for inference problems that they struggle to solve even with few-shot examples.
Analyzing the Inner Workings of Transformers in Compositional Generalization
Ryoma Kumon | Hitomi Yanaka
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
The compositional generalization abilities of neural models have been sought after for human-like linguistic competence. The popular method to evaluate such abilities is to assess the models’ input-output behavior. However, that does not reveal the internal mechanisms, and the underlying competence of such models in compositional generalization remains unclear. To address this problem, we explore the inner workings of a Transformer model by finding an existing subnetwork that contributes to the generalization performance and by performing causal analyses on how the model utilizes syntactic features. We find that the model depends on syntactic features to output the correct answer, but that the subnetwork with much better generalization performance than the whole model relies on a non-compositional algorithm in addition to the syntactic features. We also show that the subnetwork improves its generalization performance relatively slowly during the training compared to the in-distribution one, and the non-compositional solution is acquired in the early stages of the training.
Implementing a Logical Inference System for Japanese Comparatives
Yosuke Mikami | Daiki Matsuoka | Hitomi Yanaka
Proceedings of the 5th Workshop on Natural Logic Meets Machine Learning (NALOMA)
Natural Language Inference (NLI) involving comparatives is challenging because it requires understanding quantities and comparative relations expressed by sentences. While some approaches leverage Large Language Models (LLMs), we focus on logic-based approaches grounded in compositional semantics, which are promising for robust handling of numerical and logical expressions. Previous studies along these lines have proposed logical inference systems for English comparatives. However, it has been pointed out that there are several morphological and semantic differences between Japanese and English comparatives. These differences make it difficult to apply such systems directly to Japanese comparatives. To address this gap, this study proposes ccg-jcomp, a logical inference system for Japanese comparatives based on compositional semantics. We evaluate the proposed system on a Japanese NLI dataset containing comparative expressions. We demonstrate the effectiveness of our system by comparing its accuracy with that of existing LLMs.
2024
Topic Modeling for Short Texts with Large Language Models
Tomoki Doi | Masaru Isonuma | Hitomi Yanaka
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
As conventional topic models rely on word co-occurrence to infer latent topics, topic modeling for short texts has been a long-standing challenge. Large Language Models (LLMs) can potentially overcome this challenge by contextually learning the meanings of words via pretraining. In this paper, we study two approaches to using LLMs for topic modeling: parallel prompting and sequential prompting. Input length limitations prevent LLMs from processing many texts at once. However, an arbitrary number of texts can be handled by LLMs by splitting the texts into smaller subsets and processing them in parallel or sequentially. Our experimental results demonstrate that our methods can identify more coherent topics than existing ones while maintaining the diversity of the induced topics. Furthermore, we found that the inferred topics cover the input texts to some extent, while hallucinated topics are hardly generated.
Homophone2Vec: Embedding Space Analysis for Empirical Evaluation of Phonological and Semantic Similarity
Sophie Wu | Anita Zheng | Joey Chuang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
This paper introduces a novel method for empirically evaluating the relationship between the phonological and semantic similarity of linguistic units using embedding spaces. Chinese character homophones are used as a proof-of-concept. We employ cosine similarity as a proxy for semantic similarity between characters, and compare relationships between phonologically-related characters and baseline characters (chosen as similar-frequency characters). We show there is a strongly statistically significant positive semantic relationship among different Chinese characters at varying levels of sound-sharing. We also perform some basic probing using t-SNE and UMAP visualizations, and indicate directions for future applications of this method.
Exploring Intra and Inter-language Consistency in Embeddings with ICA
Rongzhi Li | Takeru Matsuda | Hitomi Yanaka
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Word embeddings represent words as multidimensional real vectors, facilitating data analysis and processing, but are often challenging to interpret. Independent Component Analysis (ICA) creates clearer semantic axes by identifying independent key features. Previous research has shown ICA’s potential to reveal universal semantic axes across languages. However, it lacked verification of the consistency of independent components within and across languages. We investigated the consistency of semantic axes in two ways: both within a single language and across multiple languages. We first probed into intra-language consistency, focusing on the reproducibility of axes by performing ICA multiple times and clustering the outcomes. Then, we statistically examined inter-language consistency by verifying those axes’ correspondences using statistical tests. We newly applied statistical methods to establish a robust framework that ensures the reliability and universality of semantic axes.
Evaluating Structural Generalization in Neural Machine Translation
Ryoma Kumon | Daiki Matsuoka | Hitomi Yanaka
Findings of the Association for Computational Linguistics: ACL 2024
Compositional generalization refers to the ability to generalize to novel combinations of previously observed words and syntactic structures. Since it is regarded as a desired property of neural models, recent work has assessed compositional generalization in machine translation as well as semantic parsing. However, previous evaluations with machine translation have focused mostly on lexical generalization (i.e., generalization to unseen combinations of known words). Thus, it remains unclear to what extent models can translate sentences that require structural generalization (i.e., generalization to different sorts of syntactic structures). To address this question, we construct SGET, a machine translation dataset covering various types of compositional generalization with control of words and sentence structures. We evaluate neural machine translation models on SGET and show that they struggle more in structural generalization than in lexical generalization. We also find different performance trends in semantic parsing and machine translation, which indicates the importance of evaluations across various tasks.